Predicting Advertisements' Clickability

The dataset records whether or not a user clicked on an advertisement on a company website. I built a logistic regression model that predicts whether a user will click on an ad based on that user's features. The dataset has the following columns:

Daily Time Spent on Site
Age
Area Income: average income of the user's geographical area
Daily Internet Usage
Ad Topic Line
City
Male
Country
Timestamp
Clicked on Ad (0 or 1)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')

ad_data = pd.read_csv('advertising.csv')

ad_data.head(5)

ad_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Daily Time Spent on Site    1000 non-null float64
Age                         1000 non-null int64
Area Income                 1000 non-null float64
Daily Internet Usage        1000 non-null float64
Ad Topic Line               1000 non-null object
City                        1000 non-null object
Male                        1000 non-null int64
Country                     1000 non-null object
Timestamp                   1000 non-null object
Clicked on Ad               1000 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB

ad_data.describe()

Exploratory Data Analysis

Histogram of Age

# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(ad_data["Age"], bins = 30)


Jointplot for Area Income vs Age

sns.jointplot(x = "Age", y = "Area Income", data = ad_data, kind = 'scatter')


Jointplot (KDE) of Daily Time Spent on Site vs Age

sns.jointplot(x = "Age", y = "Daily Time Spent on Site", data = ad_data, kind = 'kde')


Jointplot of Daily Time Spent on Site vs Daily Internet Usage

sns.jointplot(x = "Daily Time Spent on Site", y = "Daily Internet Usage", data = ad_data)


Pairplot colored by the "Clicked on Ad" field (this takes a while to render)

sns.pairplot(ad_data, hue = "Clicked on Ad", palette = "bright")


Logistic Regression

Based on the charts above, I train the model on the following fields:

Daily Time Spent on Site
Age
Area Income
Daily Internet Usage
Male

ad_data.drop(['Timestamp', 'City', 'Country', 'Ad Topic Line'], axis = 1, inplace = True)

ad_data.head(2)
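
As a side note, the same feature set can be picked without modifying the frame in place. The sketch below is only an illustration: demo is a hypothetical two-row frame built from the first two rows of the head output (without the target column), and feature_cols is a name introduced here, not something the notebook defines.

```python
import pandas as pd

# Hypothetical two-row stand-in for ad_data (target column left out).
demo = pd.DataFrame({
    "Daily Time Spent on Site": [68.95, 80.23],
    "Age": [35, 31],
    "Area Income": [61833.90, 68441.85],
    "Daily Internet Usage": [256.09, 193.77],
    "Ad Topic Line": ["Cloned 5thgeneration orchestration",
                      "Monitored national standardization"],
    "City": ["Wrightburgh", "West Jodi"],
    "Male": [0, 1],
    "Country": ["Tunisia", "Nauru"],
    "Timestamp": ["2016-03-27 00:53:11", "2016-04-04 01:39:02"],
})

# Name the numeric features explicitly instead of dropping the text columns.
feature_cols = ["Daily Time Spent on Site", "Age", "Area Income",
                "Daily Internet Usage", "Male"]
X_selected = demo[feature_cols]

# Dropping the four text columns (without inplace) yields the same frame.
X_dropped = demo.drop(columns=["Ad Topic Line", "City", "Country", "Timestamp"])

print(X_selected.equals(X_dropped))
```

Selecting by name leaves the original frame untouched, so the text columns stay available if they are needed later.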

from sklearn.model_selection import train_test_split

# Features are every remaining column except the target
X = ad_data.drop('Clicked on Ad', axis = 1)
y = ad_data['Clicked on Ad']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
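
To make the split concrete, here is a minimal sketch that applies train_test_split with the same test_size and random_state to a synthetic frame. The values in demo are random and are not the advertising data; only the row count (1000) mirrors the real dataset. With test_size = 0.3, 300 rows go to the test set and 700 to the training set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ad_data: random values, same row count (1000).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(19, 62, size=1000),
    "Male": rng.integers(0, 2, size=1000),
    "Clicked on Ad": rng.integers(0, 2, size=1000),
})

X = demo.drop("Clicked on Ad", axis=1)
y = demo["Clicked on Ad"]

# Rows are shuffled before splitting; random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=101)

split_sizes = (len(X_tr), len(X_te))
print(split_sizes)
```

Features and labels are shuffled together, so each row of X_te keeps the same index as its label in y_te.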

from sklearn.linear_model import LogisticRegression

lg = LogisticRegression()

lg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
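
For intuition about what the fitted model computes, the sketch below reproduces the predicted probabilities by hand: a linear score passed through the sigmoid function. It runs on synthetic two-feature data (X_demo and y_demo are hypothetical, not the advertising dataset) and uses the default LogisticRegression settings of whichever scikit-learn version is installed, which may differ from the liblinear solver shown in the output above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (hypothetical, not the advertising dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] - X_demo[:, 1] + noise > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = intercept + coefficients . x for each row.
z = model.intercept_[0] + X_demo @ model.coef_[0]

# The sigmoid maps the score to the probability of class 1.
manual_proba = 1.0 / (1.0 + np.exp(-z))
sklearn_proba = model.predict_proba(X_demo)[:, 1]

# predict() picks class 1 when the score is positive (probability above 0.5).
manual_pred = (z > 0).astype(int)

print(np.allclose(manual_proba, sklearn_proba))
```

The sign of each coefficient in model.coef_ shows whether a larger value of that feature pushes the prediction toward class 1 or class 0.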

predictions = lg.predict(X_test)

from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

          0       0.91      0.95      0.93       157
          1       0.94      0.90      0.92       143

avg / total       0.92      0.92      0.92       300
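
The precision, recall and f1-score columns in the report are all derived from the confusion matrix. The sketch below works through that arithmetic for class 1 on ten hypothetical labels (y_true and y_pred are made up for illustration and are not taken from the test set above).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for ten users (1 = clicked); not from the dataset.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_1 = tp / (tp + fp)  # of the predicted clicks, the share that were real
recall_1 = tp / (tp + fn)     # of the real clicks, the share that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(precision_1, recall_1, round(f1_1, 2))
```

Here 3 of the 4 predicted clicks are real (precision 0.75) and 3 of the 5 real clicks are found (recall 0.6). The values agree with precision_score and recall_score from sklearn.metrics.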

Output of ad_data.head(5) (the Clicked on Ad column is not shown):

   Daily Time Spent on Site  Age  Area Income  Daily Internet Usage                          Ad Topic Line            City  Male     Country            Timestamp
0                     68.95   35     61833.90                256.09     Cloned 5thgeneration orchestration     Wrightburgh     0     Tunisia  2016-03-27 00:53:11
1                     80.23   31     68441.85                193.77     Monitored national standardization       West Jodi     1       Nauru  2016-04-04 01:39:02
2                     69.47   26     59785.94                236.50       Organic bottom-line service-desk        Davidton     0  San Marino  2016-03-13 20:35:42
3                     74.15   29     54806.18                245.89  Triple-buffered reciprocal time-frame  West Terrifurt     1       Italy  2016-01-10 02:31:19
4                     68.37   35     73889.99                225.58          Robust logistical utilization    South Manuel     0     Iceland  2016-06-03 03:36:18

Output of ad_data.describe():

       Daily Time Spent on Site          Age   Area Income  Daily Internet Usage         Male  Clicked on Ad
count               1000.000000  1000.000000   1000.000000           1000.000000  1000.000000     1000.00000
mean                  65.000200    36.009000  55000.000080            180.000100     0.481000        0.50000
std                   15.853615     8.785562  13414.634022             43.902339     0.499889        0.50025
min                   32.600000    19.000000  13996.500000            104.780000     0.000000        0.00000
25%                   51.360000    29.000000  47031.802500            138.830000     0.000000        0.00000
50%                   68.215000    35.000000  57012.300000            183.130000     0.000000        0.50000
75%                   78.547500    42.000000  65470.635000            218.792500     1.000000        1.00000
max                   91.430000    61.000000  79484.800000            269.960000     1.000000        1.00000