Fisher's Iris data set is a popular multivariate data set introduced by Ronald Fisher in 1936.
It consists of 50 samples from each of three Iris species: Iris Setosa, Iris Virginica, and Iris Versicolor.
It provides four features for each sample: the length and width of the sepals and petals.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
iris = sns.load_dataset('iris')
iris.info()
iris.head()
First, we check how the species are distributed by their sepal and petal dimensions.
# `height` replaces the `size` argument, which was renamed in seaborn 0.9
g = sns.FacetGrid(iris, hue="species", height=5)
g.map(plt.scatter, "sepal_length", "sepal_width")
g.add_legend()
g = sns.FacetGrid(iris, hue="species", height=5)
g.map(plt.scatter, "petal_length", "petal_width")
g.add_legend()
sns.violinplot(x = "species", y = "petal_length", data = iris)
It appears that Iris Setosa is the most distinct out of the three species.
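setosa = iris[iris['species'] == 'setosa']
# Recent seaborn versions expect x=/y= keywords, `fill` instead of `shade`, and `thresh` instead of `shade_lowest`
sns.kdeplot(x=setosa['sepal_width'], y=setosa['sepal_length'], cmap="viridis", fill=True, thresh=0.05)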
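The separation of Iris Setosa can also be checked numerically. The sketch below is only illustrative: it assumes the copy of the Iris data bundled with scikit-learn (`load_iris`) rather than the seaborn frame used in this notebook, and the `ranges` dictionary is a name introduced here for the example. It prints the petal length range of each species so the gap between Setosa and the other two can be read off directly.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the Iris data stands in for
# the seaborn DataFrame used elsewhere in this notebook.
data = load_iris()
petal_length = data.data[:, 2]  # column 2 holds petal length in cm

ranges = {}
for idx, name in enumerate(data.target_names):
    values = petal_length[data.target == idx]
    ranges[str(name)] = (float(values.min()), float(values.max()))
    print(f"{name}: {values.min():.1f} to {values.max():.1f} cm")
```

If the largest Setosa petal length is below the smallest value of the other two species, petal length alone is enough to tell Setosa apart.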
We use the parallel coordinates plot provided by pandas to visualize all four features of every sample at once.
# `pandas.tools.plotting` was removed from pandas; the function now lives in `pandas.plotting`
from pandas.plotting import parallel_coordinates
parallel_coordinates(iris, "species")
Since the dataset is small, we can afford an exhaustive grid search to tune a support vector classifier. Grid search fits the model on every combination of the listed parameter values and keeps the combination with the best cross-validation score.
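To make "every combination" concrete, here is a minimal sketch using only the standard library. It enumerates the candidates for the same `C` and `gamma` values used in the grid further down; the `candidates` list is a name introduced here for illustration and is not part of scikit-learn.

```python
from itertools import product

# Same values as the grid passed to GridSearchCV later in the notebook
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}

# Build one dict per combination of parameter values
names = list(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

print(len(candidates))  # 4 values of C x 4 values of gamma = 16
print(candidates[0])    # {'C': 0.1, 'gamma': 1}
```

Each of these 16 candidates is fitted once per cross-validation fold, so the total number of fits grows quickly as values are added to the grid.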
from sklearn.model_selection import train_test_split
X = iris.drop('species',axis=1)
y = iris['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)
grid.fit(X_train,y_train)
grid_predictions = grid.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,grid_predictions))
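After fitting, the search object exposes which combination was retained. The sketch below is self-contained and makes two assumptions that are not in the notebook: it loads the data with scikit-learn's `load_iris` instead of seaborn, and it fixes `random_state=0` so the split is repeatable. `best_params_` and `best_score_` are standard attributes of a fitted `GridSearchCV`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Assumptions: scikit-learn's bundled Iris data and a fixed random_state
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}
grid = GridSearchCV(SVC(), param_grid, refit=True)
grid.fit(X_train, y_train)

print(grid.best_params_)            # parameter combination that was retained
print(grid.best_score_)             # its mean cross-validation accuracy
print(grid.score(X_test, y_test))   # accuracy of the refitted model on held-out data
```

Because `refit=True`, the best combination is refitted on the whole training set, which is why `grid.predict` and `grid.score` can be called directly on the search object.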