Hello! Starting this week, we will change our topic and format a bit. So far we have focused on introducing technical methods to help you become more familiar with TensorFlow 2.0. However, we realized that some theoretical background will keep some of you from getting lost in the deep ocean of code.
Simply following code written by others will not cause you much harm (rather, it makes our lives easier); however, it is very likely that you will get lost at some point.
Jack Sparrow always knows where he should be heading because he has his magic compass. He checks it from time to time so as not to get lost in the middle of nowhere.
Same here. Your theoretical understanding will guide you in the right direction whenever you face something new or get lost in your code.
We need some conviction that what we are doing will benefit us. Am I persuasive enough to get you to learn some statistical techniques? ‘Techniques’, not ‘Statistics’. So do not freak out and stay calm!
I. Introduction
PCA (Principal Component Analysis) can be used as a solution to the multicollinearity problem in linear regression analysis.
Then, what is the multicollinearity problem? When we construct a linear regression Y = Xβ + u (n observations, p explanatory variables), can anyone argue that the columns of their X matrix are not correlated? In fancier terminology, an X whose p variables are unrelated has rank p, i.e., full rank.
Let’s take an example. We want to analyze which factors influence a person’s wage (the famous Mincer equation). Common sense suggests education level, gender, political party, social status, working experience, etc.
Then, are education level and gender independent? Maybe not.
For example, universities in South Korea (where I live) were dominated by males even into the 80s. If a household could not afford tuition for every child, a very common arrangement was to send the first son to university while his female siblings went to the job market right after graduating high school, or even at a younger age. In this case, gender and education level are highly correlated.
Or, your family’s social status is highly related to your education level in any society.
In some sense, the multicollinearity problem is inevitable. So some people say that we should accept its existence and live with some loss of efficiency in the analysis.
However, if your data obviously and undeniably suffers from multicollinearity, one of your options is PCA. By construction, PCA creates a transformed independent variable matrix (T) that has no multicollinearity. If we keep all the columns of T, the information it contains is the same as what the original independent variable matrix X contains, even though it looks different. The format changes, but they are intrinsically identical.
Also, in the context of machine learning, PCA is used to reduce the dimensionality of the data (the X matrix). I will elaborate on this later, but it is known that we can keep just the first few columns of the principal-component-transformed matrix T without losing much information. In practice, we can discard most of the columns at very little cost. It feels almost like a miracle, doesn’t it?
II. Data Preparation
For simplicity, I will take an independent variable matrix X with 2 explanatory variables and 6 observations (p=2 and n=6). Our purpose is to see whether X1 and X2 are correlated and, if so, how strongly they are related.
Let’s check the points 1 to 6 in a 2-D diagram.
In the second diagram, we have shifted the axes so that (E[X1], E[X2]) is placed at (0,0).
Numerically, this is done by subtracting E[X1] and E[X2] from X1 and X2 respectively, so that each column of the converted X has an expectation of zero.
E[X1_converted] = E[X1 - E[X1]] = E[X1] - E[X1] = 0
From now on, for simplicity, the centered matrix will simply be denoted X, with columns X1 and X2 and entries x11, …, x62.
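To make the centering step concrete, here is a minimal NumPy sketch; the six observations below are hypothetical numbers for illustration, not the points from the diagram.

```python
import numpy as np

# Hypothetical data: 6 observations, 2 explanatory variables (n=6, p=2)
X_raw = np.array([[2.5, 2.4],
                  [0.5, 0.7],
                  [2.2, 2.9],
                  [1.9, 2.2],
                  [3.1, 3.0],
                  [2.3, 2.7]])

# Center each column: subtract E[X1] and E[X2] so that every column has mean zero
X = X_raw - X_raw.mean(axis=0)

print(X.mean(axis=0))  # both column means are (numerically) zero
```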
III. PCA1
1. Fitted line — ‘projection’
Now we will find the fitted line that best explains the relationship between X1 and X2.
Among the lines pivoting around (0,0), let’s say that the red line fits best. The question is: how do we find the best-fitting line?
Our familiar candidate would be simple linear regression.
In a simple linear regression, Y = Xβ + u, we find the β that minimizes the sum of squared residuals. FYI, the slope of the best-fitted line in simple linear regression is calculated as shown below. The basic concepts of linear regression and residual minimization rest on the idea that X causes Y.
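For reference, and in place of the figure from the original layout, this is the usual one-regressor OLS slope expression (assuming that is what the figure showed):

```latex
% OLS slope of the simple regression of Y on a single regressor X
\hat{\beta}
  = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
  = \frac{\widehat{\mathrm{Cov}}(X, Y)}{\widehat{\mathrm{Var}}(X)}
```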
However, in our context X1 and X2 are correlated, not causally related. So we slightly change our point of view and measure the squared distances between the fitted line and the points. By doing so, we fix neither of the variables and treat X1 and X2 even-handedly. What we are doing here is called projection: drawing lines perpendicular to the fitted line and measuring the distances.
Let’s look deeper and take point p2 as an example.
Our aim is to minimize the sum of the b²’s, the squared perpendicular distances from the points to the line.
Since a² (the squared distance between (0,0) and p2) is fixed and, by the Pythagorean theorem, a² = b² + c², minimizing b² is equivalent to maximizing c², the squared distance from (0,0) to the projection of p2 onto the line.
2. Maximizing c²
— finding the eigenvectors and eigenvalues of the covariance matrix.
Now our aim is to maximize the sum of all the c²’s.
To do that, let’s first look at what (X’X) looks like;
and this exactly corresponds to (n-1)*Σ.
So, Σ=(X’X)/(n-1).
cf.) The covariance matrix of X is denoted Σ, which is a p×p matrix.
Side note) It may seem trivial, but here we divide (X’X) by (n-1) to calculate the covariance matrix. This is because we have already ‘centered’ the data, so we have lost one degree of freedom. It is intuitive if we recall how the sample standard deviation is calculated.
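To spell out the algebra behind that statement (standard, and assuming X has already been centered as above):

```latex
% For centered X (every column has mean zero), entry (j,k) of X'X is
(X'X)_{jk} = \sum_{i=1}^{n} x_{ij}\, x_{ik} = (n-1)\,\widehat{\mathrm{Cov}}(X_j, X_k),
% so, collecting all entries into one matrix,
\Sigma = \frac{X'X}{n-1}.
```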
Keeping that in mind, we will write the projections in matrix form: XW.
Also, the sum of the c²’s is proportional to the variance of XW.
Now we will do a simple maximization of the variance of XW using a Lagrange multiplier.
First, we set up the constraint that the norm of the vector W is 1; in other words, we constrain W to be a unit vector. (This helps a lot in the conclusion!)
Then we construct a simple Lagrangian,
and then take the first-order conditions (F.O.C.).
Setting the F.O.C.s to zero then yields the following problem:
*Find all W’s and λ’s that satisfy ΣW = λW such that W is a unit vector.*
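Since the original equation images are not reproduced here, the following is a sketch of the derivation being described, using the fact that the sample variance of XW is W’ΣW:

```latex
% Sample variance of the projection: (XW)'(XW)/(n-1) = W'\Sigma W
% Objective: maximize it subject to W being a unit vector
\max_{W}\; W'\Sigma W \quad \text{s.t.} \quad W'W = 1

% Lagrangian
\mathcal{L}(W, \lambda) = W'\Sigma W - \lambda\,(W'W - 1)

% First-order condition with respect to W, set to zero
\frac{\partial \mathcal{L}}{\partial W} = 2\,\Sigma W - 2\,\lambda W = 0
\;\Longrightarrow\; \Sigma W = \lambda W
```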
WoW! 👏 👏 👏
Isn’t it familiar? We have seen this in linear algebra. The W’s are the eigenvectors of Σ and the λ’s are the eigenvalues. Among all the λ’s, we pick the largest one and its corresponding W. We will denote this largest λ as λ1 and the corresponding W from the first PCA as W1.
3. How to interpret W?
Now we have obtained the 2-by-1 vector W1:
- It functions as the slope of the first principal axis (PCA1), representing the correlation between X1 and X2.
- Or it can be viewed as the recipe ratio of X1 and X2 used when constructing the PCA-transformed matrix T.
- Also, W1 functions as the weights on X1 and X2.
- Here, K depends on how many PCAs you can or will construct.
Note that K≤P!
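Putting section III together, here is a minimal NumPy sketch, continuing with the hypothetical data from the earlier snippet, that builds Σ, takes its eigendecomposition, and forms T = XW:

```python
import numpy as np

# Hypothetical centered data from the earlier sketch (n=6, p=2)
X_raw = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
                  [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
X = X_raw - X_raw.mean(axis=0)

# Sample covariance matrix: Sigma = X'X / (n - 1)
n = X.shape[0]
Sigma = X.T @ X / (n - 1)

# Eigendecomposition of the symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # sort descending: lambda_1 >= lambda_2
lambdas, W = eigvals[order], eigvecs[:, order]

W1 = W[:, 0]   # unit eigenvector for the largest eigenvalue lambda_1
T = X @ W      # PCA-transformed matrix; its first column is T1 = X @ W1

print(lambdas)                 # lambda_1, lambda_2
print(np.round(W.T @ W, 6))    # ~ identity matrix: the eigenvectors are orthonormal
```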
IV. PCA2
1. Second PCA
For the second PCA, we will draw a line that is perpendicular to PCA1. Remember that we want to create a transformed matrix T that has no multicollinearity problem; in other words, a matrix with orthogonal columns.
Are you feeling frustrated that you have to read for another 10 minutes? Well, don’t be!
Because we already have the numbers calculated.
Our λ2 will be the second-largest eigenvalue of Σ. This is because the eigenvectors of Σ, a.k.a. the columns of W, are orthogonal to each other.
🤜 So the slope of the second, perpendicular line is given by the second column of W, i.e., W2. 🤛
V. Generalization and proofs.
1. Prove that W is an orthogonal matrix.
To prove the last statement, we will generalize a bit. It is not worth proving only in the 2-by-2 case. (We can do better than that!)
Here, the dimension of W depends on K, the number of PCAs performed. I mentioned that if we keep all the information from X, then K=P, and each of W1 to Wp has dimension P×1.
Now we have the necessary equations ready: Σ is symmetric, ΣWi = λiWi for each i, and every Wi is a unit vector.
Now let’s do some linear algebra.
It follows that the Wi’s are mutually orthogonal, so W is an orthogonal matrix; the transformed matrix T = XW has orthogonal columns for the same reason.
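In place of the original equation images, here is the standard argument the text is describing, stated for distinct eigenvalues:

```latex
% Two eigenpairs of the symmetric matrix \Sigma with distinct eigenvalues:
\Sigma W_i = \lambda_i W_i, \qquad \Sigma W_j = \lambda_j W_j, \qquad \lambda_i \neq \lambda_j

% Multiply the first equation by W_j' and use the symmetry of \Sigma:
\lambda_i\, W_j' W_i = W_j' \Sigma W_i = (\Sigma W_j)' W_i = \lambda_j\, W_j' W_i
\;\Longrightarrow\; (\lambda_i - \lambda_j)\, W_j' W_i = 0
\;\Longrightarrow\; W_j' W_i = 0

% Hence W'W = I (unit eigenvectors), and the columns of T = XW are orthogonal:
T'T = W'X'XW = (n-1)\, W'\Sigma W = (n-1)\,\Lambda \quad (\text{a diagonal matrix})
```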
2. Prove that sum of λi’s = Var(X)
First, Var(Ti)=λi.
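Again, in place of the original figures, here is a sketch of the two steps; “Var(X)” here means the total variance, i.e., the sum of the variances of the columns of X:

```latex
% Step 1: the variance of each principal component equals its eigenvalue
\mathrm{Var}(T_i) = \mathrm{Var}(X W_i) = W_i'\,\Sigma\, W_i = \lambda_i\, W_i' W_i = \lambda_i

% Step 2: the eigenvalues sum to the total variance of X (using W W' = I when K = P)
\sum_{i=1}^{p} \lambda_i = \mathrm{tr}(\Lambda) = \mathrm{tr}(W'\Sigma W)
  = \mathrm{tr}(\Sigma W W') = \mathrm{tr}(\Sigma) = \sum_{j=1}^{p} \mathrm{Var}(X_j)
```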
3. The first few (empirically 3) columns of T are enough
In other words, K=3 is often enough to capture the variability of X (n by P), i.e., the information that X contains.
In our simple P=2 example, we can see in the diagram that the total variance of T is pretty much explained by the first PCA alone.
Do you remember that the sum of all the c²’s from PCA1 is proportional to the variance of T1 = XW1?
So each λi and the squared distances between (0,0) and the corresponding projected points in the diagram can be regarded as measuring the same thing.
Intuitively, PCA1 is the lion🦁 which hunts a deer🦌 and the following PCAs are the hyenas which feed on the remains.
The circle of life.🌎
In the next post we will do some tutorials to show that the statements proven above hold in practice.
Also, from real data analysis we can see that just the first few PCAs explain most of the variance of T. In other words, although we reduce the dimension tremendously, we do not lose much information.
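As a rough preview of what such a check could look like, here is a small NumPy sketch on made-up data (a random matrix deliberately constructed to have low effective dimension, not a real dataset), computing each component’s share of the total variance, λi divided by the sum of all the λ’s:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 5 variables driven by only 2 underlying factors
Z = rng.normal(size=(200, 2))
X_raw = Z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

# Center, build the covariance matrix, and take its eigenvalues
X = X_raw - X_raw.mean(axis=0)
Sigma = X.T @ X / (X.shape[0] - 1)
lambdas = np.sort(np.linalg.eigvalsh(Sigma))[::-1]

# Share of total variance explained by each principal component
explained_ratio = lambdas / lambdas.sum()
print(np.round(explained_ratio, 3))
print(np.round(np.cumsum(explained_ratio), 3))  # cumulative share
```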
**Caution**
There is not much mathematical evidence that we can disregard most of the columns of X and still retain its information, nor can the magic number 3 actually be proved. But we can see that this is what really happens empirically.
Thank you for following such a long and “mathy” post. I hope it helped you. And stay tuned for the Python tutorial on PCA! 👍🏻👍🏻
Claps and comments will help us improve! 👏 👏