PCA: Choose your performers differently!!!!

Ever wondered why conclusions are always derived from factors? What are the factors that generalize facts & results into principles? Tonnes of data have been hiding an untold story, or perhaps an upcoming blockbuster. Excel sheets, server logs, every click on an HTML page is pointing towards a new trend, a new era in the digital world.

I categorize COVID as a black hole witnessed by all of us: swallowing human lives with no clue on the underlying factors. 2020 & 2021 were, & still are, frightening to recall/revisit. Over the last 2 years, human victory against the virus has been the outcome of all the data reported, captured, processed & shared across the globe, with its correctly identified PCA heroes.

PCA stands for Principal Component Analysis. It is a very powerful statistical technique used in Data Science, specifically in Machine Learning, to derive theories/classifications from a small set of uncorrelated variables distilled out of correlated factors. COVID’s 1st wave, 2nd wave, 3rd wave, pandemic, epidemic, endemic conclusions all come from basic numbers: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

Let’s go deeper into this heroic technique.

PCA is a feature extraction technique that creates new variables retaining maximum information from the old variables while reducing noise to the minimum. Since we really have no clue about the future categorization, PCA falls under Unsupervised Learning. The purpose is to filter out all the noise & capture only the musical notes !!!

All the new independent features (orthogonal, i.e. perpendicular to each other) are linear combinations of the old features, hence the main focus is always on the coefficients of the old variables. The new features have the power to explain the variance of the original dataset, which means that instead of using all the old features, I now have a few super powerful features that can explain >90% of the variance.
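As a minimal sketch of that idea, here is how it might look with scikit-learn's PCA on the classic Iris data (the dataset, the 90% threshold and the printed attributes are just illustrative assumptions, not something from the original write-up):

```python
# Minimal sketch: keep only enough components to explain ~90% of the variance.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                      # 4 original (correlated) features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.90)              # keep components until 90% of variance is explained
X_new = pca.fit_transform(X_scaled)

print(X_new.shape)                        # fewer, uncorrelated features
print(pca.explained_variance_ratio_)      # share of variance each new feature explains
print(pca.components_)                    # coefficients of the old variables
```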

Steps involved in extracting PCA variables are:

1. Eliminate extras. Too many cooks will confuse the dish. This addresses the Curse of Dimensionality in Data Science: transforming data from a high-dimensional space into a low-dimensional space such that the newly obtained representation retains the majority of the information in the original data.
2. Scale the data. Normalize the data. With this technique all numbers fall on a scale of 0 to 1. Surprised? Normalization is worth an article of its own (my next topic !!!!) on the magic of scaling 0s, 10s, 100s, 1000s, Millions into the 0–1 range. This way the algorithm treats all variables with equal importance.
3. Generate the covariance matrix: the matrix that captures the relationship between every pair of variables. The value of covariance lies between -∞ and +∞.
4. Break the covariance matrix into magnitudes (Eigen values) & directions (Eigen Vectors). Magnitude translates into power of impact & direction shows a +ve or -ve trend.
5. The Eigen Vectors (directions, orthogonal to each other) define the Principal Components, and the Eigen values can be used to calculate the percentage of variation explained by each direction.
6. PC1 & PC2 are the newly created PCA components, which may further be used to create supervised categories on unsupervised data (see the from-scratch sketch right after this list).

Conclusion. PCA removes correlated features, hence confusion & noise. It improves machine learning algorithm performance by focusing only on the right factors. It reduces overfitting, which implies NO overclaims. But, as they say, nothing is perfect in this world. Though PCA gives independent variables, they are less interpretable. PCA also suffers information loss, just as good cells are lost while targeting cancerous cells during chemo.
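That information-loss trade-off can be seen by reconstructing the data from only a couple of components and checking what went missing; the snippet below is only an illustrative sketch on the same Iris data assumed earlier:

```python
# Reconstruct from 2 components and measure what the dropped components carried.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
X_back = pca.inverse_transform(X_reduced)           # map back to the original feature space

reconstruction_error = np.mean((X - X_back) ** 2)   # the information that was lost
print(pca.explained_variance_ratio_.sum(), reconstruction_error)
```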

Happy PCAing !!!!!
