Predicting Advertisements' Clickability

The dataset records whether a user clicked on an advertisement on a company website. I built a logistic regression model that predicts whether a user will click on an ad from that user's features. The dataset has the following columns:

  • Daily Time Spent on Site
  • Age
  • Area Income: average income of the user's geographical area
  • Daily Internet Usage
  • Ad Topic Line
  • City
  • Male
  • Country
  • Timestamp
  • Clicked on Ad (0 or 1)
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
sns.set_style('whitegrid')
In [3]:
ad_data = pd.read_csv('advertising.csv')
In [4]:
ad_data.head(5)
Out[4]:
   Daily Time Spent on Site  Age  Area Income  Daily Internet Usage  Ad Topic Line                          City            Male  Country     Timestamp            Clicked on Ad
0  68.95                     35   61833.90     256.09                Cloned 5thgeneration orchestration     Wrightburgh     0     Tunisia     2016-03-27 00:53:11  0
1  80.23                     31   68441.85     193.77                Monitored national standardization     West Jodi       1     Nauru       2016-04-04 01:39:02  0
2  69.47                     26   59785.94     236.50                Organic bottom-line service-desk       Davidton        0     San Marino  2016-03-13 20:35:42  0
3  74.15                     29   54806.18     245.89                Triple-buffered reciprocal time-frame  West Terrifurt  1     Italy       2016-01-10 02:31:19  0
4  68.37                     35   73889.99     225.58                Robust logistical utilization          South Manuel    0     Iceland     2016-06-03 03:36:18  0
In [4]:
ad_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Daily Time Spent on Site    1000 non-null float64
Age                         1000 non-null int64
Area Income                 1000 non-null float64
Daily Internet Usage        1000 non-null float64
Ad Topic Line               1000 non-null object
City                        1000 non-null object
Male                        1000 non-null int64
Country                     1000 non-null object
Timestamp                   1000 non-null object
Clicked on Ad               1000 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB
In [5]:
ad_data.describe()
Out[5]:
       Daily Time Spent on Site  Age          Area Income   Daily Internet Usage  Male         Clicked on Ad
count  1000.000000               1000.000000  1000.000000   1000.000000           1000.000000  1000.00000
mean   65.000200                 36.009000    55000.000080  180.000100            0.481000     0.50000
std    15.853615                 8.785562     13414.634022  43.902339             0.499889     0.50025
min    32.600000                 19.000000    13996.500000  104.780000            0.000000     0.00000
25%    51.360000                 29.000000    47031.802500  138.830000            0.000000     0.00000
50%    68.215000                 35.000000    57012.300000  183.130000            0.000000     0.50000
75%    78.547500                 42.000000    65470.635000  218.792500            1.000000     1.00000
max    91.430000                 61.000000    79484.800000  269.960000            1.000000     1.00000
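
The mean of Clicked on Ad in the summary above is 0.50, which suggests the two classes are balanced. A minimal sketch of checking that directly follows; the small `toy` frame is hypothetical stand-in data, not rows from advertising.csv.

```python
import pandas as pd

# Hypothetical stand-in for ad_data; only the target column matters here.
toy = pd.DataFrame({"Clicked on Ad": [0, 1, 1, 0, 1, 0, 0, 1]})

# Share of rows in each class; roughly equal shares mean a balanced target.
class_share = toy["Clicked on Ad"].value_counts(normalize=True).sort_index()
print(class_share)
```

On the real frame the same call would be made on ad_data["Clicked on Ad"]. A balanced target is one reason plain accuracy, precision, and recall are all reasonable metrics later on.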

Exploratory Data Analysis

Histogram of Age

In [6]:
sns.distplot(ad_data["Age"], bins = 30,kde = False)
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x2352e9c9080>

Jointplot for Area Income vs Age

In [7]:
# Pass x and y as keywords; recent seaborn releases reject positional arguments here
sns.jointplot(x = "Age", y = "Area Income", data = ad_data, kind = 'scatter')
Out[7]:
<seaborn.axisgrid.JointGrid at 0x2352eb737f0>

Jointplot (KDE) of Daily Time Spent on Site vs Age

In [21]:
# Pass x and y as keywords; recent seaborn releases reject positional arguments here
sns.jointplot(x = "Age", y = "Daily Time Spent on Site", data = ad_data, kind = 'kde')
Out[21]:
<seaborn.axisgrid.JointGrid at 0x213259dfa90>

Jointplot of Daily Time Spent on Site vs Daily Internet Usage

In [8]:
# Pass x and y as keywords; recent seaborn releases reject positional arguments here
sns.jointplot(x = "Daily Time Spent on Site", y = "Daily Internet Usage", data = ad_data)
Out[8]:
<seaborn.axisgrid.JointGrid at 0x2352eacaa20>

Pairplot colored by the "Clicked on Ad" field (this takes a while to render)

In [9]:
sns.pairplot(ad_data, hue = "Clicked on Ad", palette = "bright")
Out[9]:
<seaborn.axisgrid.PairGrid at 0x2352ee4a4e0>

Logistic Regression

Based on the charts above, I train the model on the numeric fields listed below. The text and timestamp columns (Ad Topic Line, City, Country, Timestamp) are dropped.

  • Daily Time Spent on Site
  • Age
  • Area Income
  • Daily Internet Usage
  • Male
In [10]:
ad_data.drop(['Timestamp', 'City', 'Country', 'Ad Topic Line'], axis = 1, inplace = True)
In [11]:
ad_data.head(2)
Out[11]:
   Daily Time Spent on Site  Age  Area Income  Daily Internet Usage  Male  Clicked on Ad
0  68.95                     35   61833.90     256.09                0     0
1  80.23                     31   68441.85     193.77                1     0
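
The in-place drop above modifies ad_data permanently. As an alternative, here is a minimal sketch that selects the model columns by name and leaves the original frame untouched. The names `feature_cols`, `target_col`, and `toy` are assumptions for illustration; the two rows are copied from the head() output shown earlier, with only City kept as an example of a dropped column.

```python
import pandas as pd

# Columns used as model inputs, matching the list above.
feature_cols = [
    "Daily Time Spent on Site",
    "Age",
    "Area Income",
    "Daily Internet Usage",
    "Male",
]
target_col = "Clicked on Ad"

# Two rows taken from the head() output; "City" stands in for the text columns.
toy = pd.DataFrame({
    "Daily Time Spent on Site": [68.95, 80.23],
    "Age": [35, 31],
    "Area Income": [61833.90, 68441.85],
    "Daily Internet Usage": [256.09, 193.77],
    "City": ["Wrightburgh", "West Jodi"],
    "Male": [0, 1],
    "Clicked on Ad": [0, 0],
})

X = toy[feature_cols]   # feature matrix; the source frame is not modified
y = toy[target_col]     # target vector
print(X.shape, y.shape)
```

Selecting by name also makes the feature set explicit, so a column added to the CSV later does not slip into the model by accident.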
In [12]:
from sklearn.model_selection import train_test_split
In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    ad_data.drop('Clicked on Ad', axis = 1),  # features
    ad_data['Clicked on Ad'],                 # target
    test_size = 0.3,
    random_state = 101,
)
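
With 1000 rows and test_size = 0.3, the split holds out 300 rows for testing and keeps 700 for training. The sketch below shows this on synthetic data of the same size. It also passes `stratify`, which the notebook does not use; that option is shown here only as an assumption about how one might keep the class ratio identical in both parts. The `toy` frame and its values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in: one feature and a perfectly balanced 0/1 target.
toy = pd.DataFrame({
    "Age": rng.integers(19, 62, size=n),
    "Clicked on Ad": np.repeat([0, 1], n // 2),
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["Age"]],
    toy["Clicked on Ad"],
    test_size=0.3,
    random_state=101,
    stratify=toy["Clicked on Ad"],  # optional; not used in the notebook
)

print(len(X_tr), len(X_te))          # rows in train and test
print(y_te.value_counts().to_dict())  # class counts in the test part
```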
In [14]:
from sklearn.linear_model import LogisticRegression
In [15]:
lg = LogisticRegression()
In [16]:
lg.fit(X_train, y_train)
Out[16]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
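
The features above sit on very different scales (Area Income is in the tens of thousands, Male is 0 or 1), and the model uses an L2 penalty, so the raw coefficients are hard to compare. A hedged sketch of one common option follows: standardize the inputs in a pipeline, then read the coefficients. This is not what the notebook does; the data is synthetic, and the labeling rule is invented purely so the example has a signal to find.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400

# Synthetic features drawn from ranges similar to the describe() output.
toy_X = pd.DataFrame({
    "Daily Time Spent on Site": rng.uniform(32, 92, size=n),
    "Age": rng.integers(19, 62, size=n),
    "Area Income": rng.uniform(14000, 80000, size=n),
    "Daily Internet Usage": rng.uniform(104, 270, size=n),
    "Male": rng.integers(0, 2, size=n),
})

# Invented rule (illustration only): label 1 when noisy time on site is below 65.
noise = rng.normal(0, 5, size=n)
toy_y = ((toy_X["Daily Time Spent on Site"] + noise) < 65).astype(int)

# Standardize each column, then fit the same kind of model as above.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(toy_X, toy_y)

# One coefficient per standardized feature; sign gives direction of the effect.
coefs = pd.Series(
    model.named_steps["logisticregression"].coef_[0],
    index=toy_X.columns,
)
print(coefs.sort_values())
```

Because every input has unit variance after scaling, the coefficient magnitudes can be compared with each other, which is not true for the unscaled fit.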
In [17]:
predictions = lg.predict(X_test)
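
predict returns hard 0/1 labels. A fitted LogisticRegression also exposes predict_proba, which returns the estimated probability of each class. The sketch below, on made-up data, shows that for a binary model the labels correspond to thresholding the class-1 probability at 0.5. The names `clf`, `toy_X`, and `toy_y` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Made-up data: two features, label driven mostly by the first one.
toy_X = rng.normal(size=(200, 2))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(toy_X, toy_y)

sample = toy_X[:5]
proba = clf.predict_proba(sample)   # columns follow clf.classes_, here [0, 1]
labels = clf.predict(sample)        # hard labels

# Thresholding the class-1 probability at 0.5 reproduces the hard labels.
thresholded = (proba[:, 1] > 0.5).astype(int)
print(proba.round(3))
print(labels, thresholded)
```

Keeping the probabilities makes it possible to move the threshold away from 0.5 if false positives and false negatives carry different costs.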
In [18]:
from sklearn.metrics import classification_report
In [19]:
print(classification_report(y_test, predictions))
             precision    recall  f1-score   support

          0       0.91      0.95      0.93       157
          1       0.94      0.90      0.92       143

avg / total       0.92      0.92      0.92       300
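
The precision, recall, and f1-score columns in the report are all derived from the confusion matrix. The following sketch works through that arithmetic on ten hand-written labels (hypothetical values, not the notebook's predictions) and checks the result against scikit-learn's own metric functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-written labels for illustration only.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_1 = tp / (tp + fp)   # of the predicted clicks, how many were real
recall_1 = tp / (tp + fn)      # of the real clicks, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(tn, fp, fn, tp)
print(precision_1, recall_1, round(f1_1, 4))

# Cross-check against the library implementations.
assert abs(precision_1 - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall_1 - recall_score(y_true, y_pred)) < 1e-12
```

In the notebook, the same breakdown would come from confusion_matrix(y_test, predictions).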
