PCA — Principal Component Analysis Explained with Python Example

Gayathri siva
8 min read · Jul 19, 2022


A technique for reducing the dimensionality of datasets, increasing interpretability while minimizing information loss.

First, you have to know about dimensionality reduction. It is the process of reducing the number of variables/features in a dataset. It also reduces model complexity and the risk of overfitting. It has two subcategories:

  1. Feature Selection
  2. Feature Extraction

PCA comes under Feature Extraction.

PCA — Principal Component Analysis:

It is a dimensionality reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

It aims to find the directions of maximum variance in high-dimensional data and project the data onto a new subspace with equal or fewer dimensions than the original one.

Comparing the two principal components, we find that the data points are well spread out along the first principal component (PC1), whereas along PC2 they are much closer together, which makes observation and further calculation more difficult.

Therefore, we keep PC1 rather than PC2, since the data points are more spread out (the variance is higher) along it.

Important Properties:

  1. The number of principal components is always less than or equal to the number of attributes/features.
  2. Principal Components are orthogonal.
  3. The importance of a principal component decreases as its index increases: PC1 carries more information than PC2, and so on.

There are four steps involved in PCA.

Step 1: Standardization

First, standardize the data before performing PCA. The aim of this step is to standardize the range of the continuous initial variables so that each one of them contributes equally to the analysis.

More specifically, standardization is critical because PCA is quite sensitive to the variances of the initial variables.

If there are large differences between the range of initial variables, those variables with larger ranges will dominate over those with small ranges.

For example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1, which leads to biased results. Transforming the data to comparable scales prevents this problem.

Mathematically, this is done by subtracting the mean and dividing by the standard deviation for each value of each variable: z = (value − mean) / standard deviation.

Once the standardization is done, all the variables will be transformed to the same scale.
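As a quick illustration (my own sketch, not part of the original article), here is how the standardization step looks in Python; the data array is made up just to show the mechanics:

import numpy as np
from sklearn.preprocessing import StandardScaler

# a tiny made-up dataset: 4 samples, 2 variables with very different ranges
data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0],
                 [4.0, 800.0]])

# manual z-score standardization: subtract the mean, divide by the standard deviation
z_manual = (data - data.mean(axis=0)) / data.std(axis=0)

# the same transformation with scikit-learn
z_sklearn = StandardScaler().fit_transform(data)

print(np.allclose(z_manual, z_sklearn))   # True: both columns now have mean 0 and std 1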

Step 2: Covariance matrix computation

The covariance matrix is used to express how any two attributes in a multidimensional dataset vary together. It is a symmetric matrix with as many rows and columns as there are dimensions in the data.

For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:

Cov(x, x)  Cov(x, y)  Cov(x, z)
Cov(y, x)  Cov(y, y)  Cov(y, z)
Cov(z, x)  Cov(z, y)  Cov(z, z)

Since the covariance of a variable with itself is its variance (Cov(x, x) = Var(x)), the main diagonal (top left to bottom right) actually contains the variances of each initial variable.

Since the covariance is commutative (Cov(x, y) = Cov(y, x)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and lower triangular portions are equal.

The entries of the covariance matrix tell us how the variables are related to one another.

It’s actually the sign of the covariance that matters:

  • Positive covariance: the two variables increase or decrease together (they are directly related).
  • Negative covariance: when one variable increases, the other decreases (they are inversely related).
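To make this concrete, here is a small sketch (my own illustration; the array values are made up) showing that np.cov produces a symmetric matrix whose diagonal holds the variances:

import numpy as np

# made-up standardized data: 5 samples, 3 variables (x, y, z)
X = np.array([[ 0.5, -1.2,  0.3],
              [-0.7,  0.8, -0.1],
              [ 1.1, -0.4,  0.9],
              [-0.9,  1.0, -1.2],
              [ 0.0, -0.2,  0.1]])

cov = np.cov(X, rowvar=False)   # rows are observations, columns are variables

print(cov.shape)                # (3, 3): one row and column per variable
print(np.allclose(cov, cov.T))  # True: the matrix is symmetric
print(np.diag(cov))             # the diagonal holds the variances Var(x), Var(y), Var(z)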

Step 3: Find the Eigenvectors and Eigenvalues

Eigenvectors are non-zero vectors whose direction does not change when a matrix transformation is applied to them.

Eigenvalues are the scalars by which those eigenvectors are scaled; in PCA, each eigenvalue indicates how much variance lies along its eigenvector.

Eigenvectors and eigenvalues are the mathematical values that we need to compute from the covariance matrix in order to determine the principal components of the data.

First, let's understand what we mean by a principal component.

Principal components are new variables that are constructed as linear combinations (mixtures) of the initial variables. These new variables are uncorrelated, and most of the information is compressed into the first components.

For example, 10-dimensional data gives 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on. This allows you to reduce dimensionality without losing much information, by discarding the components with low information and treating the remaining components as your new variables.

How PCA constructs the principal components

There are as many principal components as there are variables in the data. They are constructed in such a manner that the first principal component accounts for the largest possible variance in the data set, and each subsequent component accounts for the largest remaining variance while being orthogonal to the previous ones.

This continues until a total of ‘p’ principal components have been calculated, equal to the original number of variables.

Eigenvectors and eigenvalues always come in pairs: every eigenvector has a corresponding eigenvalue. Their number is equal to the number of dimensions of the data.

For example, a 2-dimensional dataset has 2 variables, and therefore 2 eigenvectors with 2 corresponding eigenvalues.

Suppose we take a 2-dimensional dataset with 2 variables x and y, and compute the eigenvectors v1, v2 and eigenvalues λ1, λ2 of its covariance matrix.

If we rank the eigenvalues in descending order and get λ1 > λ2, then the first principal component (PC1) is v1 and the second (PC2) is v2.

Once we have the principal components, we compute the percentage of variance (information) accounted for by each component by dividing its eigenvalue by the sum of all eigenvalues. Applying this to the example above, suppose PC1 and PC2 turn out to carry 96% and 4% of the variance of the data, respectively.
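A minimal sketch of this step (my own illustration with made-up data, not the article's example): compute the covariance matrix of standardized data, take its eigenvalues and eigenvectors with NumPy, sort them by decreasing eigenvalue, and express each eigenvalue as a share of the total variance:

import numpy as np

# made-up 2-D data: two strongly correlated variables
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.9 * x + 0.1 * rng.normal(size=100)
X = np.column_stack([x, y])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # Step 1: standardize

cov = np.cov(X, rowvar=False)              # Step 2: covariance matrix

# Step 3: eigenvalues/eigenvectors of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reorder them in descending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# share of variance carried by each principal component
explained = eigenvalues / eigenvalues.sum()
print(explained)   # the first component carries almost all of the variance here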

Step 4: Feature Vector

The feature vector is simply a matrix whose columns are the eigenvectors of the components we decide to keep.

Computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance.

In this step, we choose whether to keep all of these components or discard those of lesser significance (those with low eigenvalues), and form with the remaining ones a matrix of vectors that we call the feature vector.

This is the first step towards dimensionality reduction: if we choose to keep only p eigenvectors (components) out of n, the final data set will have only p dimensions.

Example:

Continuing with the example from the previous step, we can either form a feature vector with both eigenvectors v1 and v2 as its columns, or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector with v1 only.

Discarding the eigenvector v2 reduces the dimensionality by 1 and consequently causes a loss of information in the final data set. But given that v2 carries only 4% of the information, the loss is not important, and we still keep the 96% of the information carried by v1.
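Continuing in the same spirit (again my own sketch with made-up data), the feature vector and the projection onto the kept components can be written as follows:

import numpy as np

# made-up data: 6 samples, 3 variables
X = np.array([[ 1.2, -0.8,  0.5],
              [-0.3,  0.6, -0.9],
              [ 0.7, -0.1,  0.2],
              [-1.1,  0.9, -0.4],
              [ 0.4, -0.6,  1.0],
              [-0.9,  0.0, -0.4]])
X = X - X.mean(axis=0)                     # center the data (Step 1 would standardize it)

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]      # sort components by decreasing eigenvalue
eigenvectors = eigenvectors[:, order]

k = 2                                      # number of components we decide to keep
feature_vector = eigenvectors[:, :k]       # feature vector: kept eigenvectors as columns

# project the data onto the kept principal components
X_reduced = X @ feature_vector
print(X_reduced.shape)                     # (6, 2): same samples, fewer dimensions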

PCA with Python

Now, let's understand Principal Component Analysis with Python. In this example, I have used the wine dataset from scikit-learn.

Importing the libraries and Dataset

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Separating the dataset into the features X and the target y for analysis

winedata = load_wine()
X, y = winedata['data'], winedata['target']
print(X.shape)
print(y.shape)

Splitting the dataset into training set and testing set

#splitting dataset into a training set and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)

Pre-processing the training and test sets: fit the StandardScaler on the training set and apply the same transformation to the test set.

# Scaling the data: fit the scaler on the training set only,
# then apply the same transformation to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Applying PCA to the standardized training set and checking how much variance each principal component explains.

# Fit PCA on the standardized training data
# (no n_components limit, so all 13 components are available for inspection)
principal = PCA()
principal.fit(X_train)
X_train_pca = principal.transform(X_train)

# check how much variance is explained by each principal component
print(principal.explained_variance_ratio_)

# number of components
n_pcs = principal.components_.shape[0]
print(n_pcs)
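As a side note (not part of the original article), once principal has been fitted as above, the cumulative explained variance is a handy way to decide how many components to keep; a minimal sketch:

import numpy as np

# cumulative share of variance captured by the first k principal components
cumulative = np.cumsum(principal.explained_variance_ratio_)
print(cumulative)

# e.g. keep the smallest number of components explaining at least 95% of the variance
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(k)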

Fitting logistic regression to the training set and predicting the test set results

#fitting logistic regression to the standardized training set
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
#predicting results on the standardized test set
y_pred = classifier.predict(X_test)
print("accuracy score:", accuracy_score(y_test, y_pred))

Here is the complete code:

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
winedata = load_wine()
X, y = winedata['data'], winedata['target']
print(X.shape)
print(y.shape)
#splitting dataset into a training set and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)
# Scaling the data: fit the scaler on the training set, then apply it to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit PCA on the standardized training data (all components kept for inspection)
principal = PCA()
principal.fit(X_train)
X_train_pca = principal.transform(X_train)


# check how much variance is explained by each principal component
print(principal.explained_variance_ratio_)

# number of components
n_pcs = principal.components_.shape[0]

print(n_pcs)
#fitting logistic regression to the standardized training set
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)

#predicting results on the standardized test set
y_pred = classifier.predict(X_test)
print("accuracy score:", accuracy_score(y_test, y_pred))
Output:

(178, 13)
(178,)
[0.36884109 0.19318394 0.10752862 0.07421996 0.06245904 0.04909
0.04117287 0.02495984 0.02308855 0.01864124 0.01731766 0.01252785
0.00696933]
13
accuracy score: 0.97222222

As we can see, the accuracy score came to 0.97222222, which is approximately 97%, a good result on the test set.
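Note that in the code above, the logistic regression is trained on the standardized features themselves; PCA is only used to inspect how the variance is distributed. If you want the classifier to actually use the reduced representation, a minimal sketch along these lines should work (keeping 5 components is my own choice here, and the resulting accuracy may differ from the score printed above):

# reduce both sets to the leading principal components before classifying
pca = PCA(n_components=5)
X_train_reduced = pca.fit_transform(X_train)   # X_train is already standardized above
X_test_reduced = pca.transform(X_test)

clf = LogisticRegression(random_state=0)
clf.fit(X_train_reduced, y_train)
print("accuracy with 5 components:", accuracy_score(y_test, clf.predict(X_test_reduced)))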
