In this post, we will show you how to implement PCA in Python.
The basic setup.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
We will use the TensorFlow package to load the Iris dataset.
CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']
> We have specified the data frame’s column names and its species values.
train_path = tf.keras.utils.get_file("iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file("iris_test.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")
train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)
train.head()
> Thanks to the TensorFlow utils package, we have easily loaded each dataset with a single URL. The training dataset, named “train”, will look like this.
0. Exploratory Data Analysis with simple data visualizations
Before diving deep into the field, there is a step we had better take: Exploratory Data Analysis. Even though the term sounds fancy, what it involves is pretty practical and close to our daily data-processing routine: draw graphs and try to understand the data distribution as it is. Here we will draw a pairplot using a function from the Seaborn library (https://seaborn.pydata.org/generated/seaborn.pairplot.html).
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.pairplot(train, hue='Species', height=2.6)  # older Seaborn versions call this parameter 'size' instead of 'height'
And the result is below.
You can try many different graphs using the Seaborn library.
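For instance, a correlation heatmap is another one-liner. This is just a sketch of one possible variation, not part of the original walkthrough; it assumes “train” still holds the four numeric feature columns plus “Species”.
# A quick heatmap of pairwise feature correlations (Species dropped since it is the label).
sns.heatmap(train.drop(columns='Species').corr(), annot=True, cmap='coolwarm')
plt.show()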
Getting back to our dataframe: our training dataset, named “train”, has 120 rows and 5 columns, and our test dataset, named “test”, has 30 rows and the same number of columns. (We can check this with “train.shape” and “test.shape”.)
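As a quick sanity check (the expected shapes follow from the counts above):
print(train.shape)  # (120, 5) -- 120 rows, 5 columns including 'Species'
print(test.shape)   # (30, 5)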
train_y = train.pop('Species')
test_y = test.pop('Species')
# The label column has now been removed from the features.
train.head()
> We have popped the “Species” column out of both the train and test datasets, since we will use this column as the label, in other words the ‘target’. We stored the target values in separate variables named “train_y” and “test_y”.
Now, let us make a Principal Component Analysis (PCA) function with NumPy.
We loaded the NumPy library at the very beginning with the line “import numpy as np”.
Let’s assume the training data is stored in a variable “x”. Our “x” looks like this:
To compute the covariance matrix, we will also use x’s transpose, which looks like this:
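The original post shows these two matrices as screenshots. As a minimal sketch, “x” can simply be the feature values of the training dataframe; how exactly the author created it is an assumption on our part.
x = train.values   # (120, 4) matrix of the four features, assuming 'Species' has already been popped
x_T = x.T          # (4, 120) transpose, used below when computing the covariance matrix
print(x.shape, x_T.shape)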
1. Normalize
Before computing the covariance matrix, we have to standardize our data, shifting and scaling each feature so that its mean becomes 0 and its variance becomes 1.
There are two different ways to do that: 1) using sklearn’s preprocessing library, or 2) using the Pandas library and computing the values in a more hands-on way. We will look into both.
First, the code using the preprocessing tools that sklearn offers.
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(x)
Second, let’s try Pandas.
x_scaled = (x-x.mean(axis = 0))/x.std(axis = 0)
And the resulting numerical values are the same. (One subtlety: Pandas’ “.std()” defaults to the sample standard deviation (ddof=1) while StandardScaler uses the population standard deviation (ddof=0), so if “x” is a DataFrame you would need “x.std(axis=0, ddof=0)” to match exactly; with a NumPy array the two agree out of the box.)
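A quick way to confirm the two approaches agree (a sketch, assuming “x” is the NumPy array defined earlier):
# Both standardizations should agree element-wise.
print(np.allclose(X_std, x_scaled))  # expected: True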
2. Eigendecomposition
As PCA is the outcome of an eigendecomposition of the covariance matrix, we will compute the covariance matrix first.
2.1 Covariance Matrix
Ingredients:
- 1) Mean value of the scaled data
- 2) The dot product between the transpose of (x_scaled - mean) and (x_scaled - mean)
- 3) Dividing the value by (n-1), where ‘n’ is the number of units (rows) in x_scaled, which can be obtained with the Python code “x_scaled.shape[0]”.
mean_vec = np.mean(x_scaled, axis = 0) # 1
dotP = (x_scaled - mean_vec).T.dot(x_scaled - mean_vec) # 2
n = x_scaled.shape[0] # 3
cov_mat = dotP / (n-1)
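As a sanity check (a sketch, not part of the original post), the hand-computed matrix should match NumPy’s built-in covariance function:
# np.cov with rowvar=False treats columns as variables and divides by (n-1) by default,
# so it should reproduce cov_mat exactly.
print(np.allclose(cov_mat, np.cov(x_scaled, rowvar=False)))  # expected: True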
2.2 Eigendecomposition
Ingredients:
- 1) Covariance matrix that we just computed
- 2) NumPy’s “linalg.eig()” function, which computes the eigenvalues and eigenvectors in a single line of code. (Check this document for details: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.eig.html)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
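To see what came out (a sketch), we can print the results and verify the defining relation: the covariance matrix times an eigenvector equals the eigenvalue times that eigenvector.
print(eig_vals)   # four eigenvalues of the 4x4 covariance matrix
print(eig_vecs)   # corresponding eigenvectors, stored as columns
# Check A v = lambda v for the first eigenvector.
print(np.allclose(cov_mat.dot(eig_vecs[:, 0]), eig_vals[0] * eig_vecs[:, 0]))  # expected: True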
2.3 Singular Value Decomposition
From computing the covariance matrix to performing the eigendecomposition, hopefully we have gained a deeper grasp of the algorithm inside PCA. However, unlike the two-stage method we have just walked through, there is a single-stage method using Singular Value Decomposition. This latter approach is used more widely when building PCA programs, since it is easier and faster to implement in code.
Ingredients:
- Scaled input matrix (in our case, “x_scaled”, which will be transposed before going into the function)
- numpy.linalg.svd(), a function offered by the NumPy library for Singular Value Decomposition, a.k.a. SVD. (Check this document for further details: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html)
u, s, v = np.linalg.svd(x_scaled.T)
u
And we can see that the columns of “u” match the eigenvectors we got from the eigendecomposition above (possibly up to a sign flip, since an eigenvector multiplied by -1 is still a valid eigenvector).
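The connection can also be checked numerically (a sketch): the singular values of the scaled data relate to the covariance eigenvalues by lambda = s² / (n-1).
# Singular values squared, divided by (n-1), equal the covariance eigenvalues
# (sorted in descending order, since SVD returns singular values that way).
print(np.allclose(sorted(eig_vals, reverse=True), s**2 / (n - 1)))  # expected: True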
3. Sorting out principal components
We will pair the computed eigenvalues and eigenvectors and sort them by the size of the eigenvalues, which indicate how much of the total variance each paired eigenvector accounts for.
Ingredients:
- 1) (eigenvalue, eigenvector) pairs; let’s name this list ‘eig_pairs’.
- 2) Sorting ‘eig_pairs’ by the eigenvalue of each pair, which can be accessed as the pair’s first element.
- 3) A total sum of the eigenvalues
- 4) A percentage calculation showing how much variance is explained by each eigenvector
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))] # 1
eig_pairs.sort(key=lambda x: x[0], reverse=True) # 2
total_sum = sum(eig_vals) # 3
variance_explained = [(i/total_sum)*100 for i in sorted(eig_vals, reverse = True)] # 4
And if we print out “variance_explained”, the result is the following.
So, basically, the most powerful eigenvector explains around 72% of the variance, and the second most powerful around 24%, which means roughly 96% of the variance is explained by just two eigenvectors.
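The cumulative share is also easy to read off directly (a sketch):
# Running total of the explained-variance percentages; the second entry should be around 96%.
print(np.cumsum(variance_explained))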
4. Projection Matrix
The projection matrix consists of the eigenvectors with the highest power to explain the variance in the data. So this time, we will use the two most powerful eigenvectors to build our projection matrix.
Ingredients:
- 1) The first eigenvector with the highest eigenvalue
- 2) The second-highest eigenvalue’s eigenvector
- 3) Concatenating the two eigenvectors column-wise, with the more powerful one coming first.
the_highest = eig_pairs[0][1] # 1
second_highest = eig_pairs[1][1] # 2
matrix_w = np.hstack((the_highest.reshape(4,1), second_highest.reshape(4,1))) # 3
And if we print out ‘matrix_w’, the following will come up.
5. Projecting the existing data onto the newly constructed feature space using the projection matrix
y = x_scaled.dot(matrix_w)
Simple. The shape of “matrix_w” is (4, 2) and the number of columns in “x_scaled” is 4, so the result “y” is the data projected onto the new two-dimensional feature space.
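To see the effect (a sketch, assuming “train_y” still holds the integer species codes from the CSV), we can scatter the two new components and color each point by its species:
# y has shape (120, 2): one row per flower, one column per principal component.
plt.scatter(y[:, 0], y[:, 1], c=train_y, cmap='viridis')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()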
Thank you for reading this long post, and I hope you have found something useful or insightful while scrolling all the way down to here.
See you soon with more interesting topics!