:-) Thank you to the University of Kansas Hexacoral project at http://www.kgs.ku.edu/Hexacoral/ for the use of the beautiful animated GIF, which is a visualization of Principal Component Analysis. I copied it and I hope they don't mind. The animation shows how 12 clusters of data [the different colors] are reduced to 3 principal components [the 3 orthogonal vectors, i.e. perpendicular to each other]. To see the animation, click on it to view it in a new window.
In a world where so much information is available, we must learn to manage information overload. When we have several data feeds on the same subject, chances are that much of the information overlaps, i.e. is redundant. Overlap means that, mathematically speaking, if you drew circles representing the true information content of each data set, the circles would overlap. This is important when we are trying to use the data to model some aspect of the real world.
For example, suppose we are trying to analyze and build a model of some aspect of the U.S. economy, and we look at the dozens of Economic Indicators churned out regularly by the various government agencies, academic bodies and policy think-tanks as possible inputs into the model. If you arrange all the data as a matrix with a column for each variable and a row for each time period [i.e. each column is a time series], and you then calculate the correlation between every pair of variables, you will typically find a strong degree of correlation among many of the variables in the resulting correlation matrix. That means that much of the information content is redundant because it overlaps.
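To make the matrix layout concrete, here is a minimal sketch in Python with NumPy. The data is entirely synthetic: `make_synthetic_indicators` is a made-up helper that fakes a handful of "indicators" from one shared random-walk factor plus noise, so the numbers say nothing about the real U.S. economy. It only shows how a rows-are-time-periods, columns-are-variables matrix turns into a correlation matrix.

```python
import numpy as np

def make_synthetic_indicators(n_periods=120, n_indicators=6, noise=0.3, seed=0):
    """Return an (n_periods, n_indicators) matrix of made-up time series.

    Each column is the same hidden random-walk factor times its own loading,
    plus independent noise. This is illustrative data, not real indicators.
    """
    rng = np.random.default_rng(seed)
    common = np.cumsum(rng.normal(size=n_periods))        # shared hidden factor
    loadings = rng.uniform(0.5, 1.5, size=n_indicators)   # sensitivity of each column
    independent = noise * rng.normal(size=(n_periods, n_indicators))
    return common[:, None] * loadings[None, :] + independent

# Rows are time periods, columns are variables, as described in the text.
X = make_synthetic_indicators()

# rowvar=False tells NumPy that each column (not each row) is a variable.
corr = np.corrcoef(X, rowvar=False)

# Average of the off-diagonal entries: a rough measure of how much the columns overlap.
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
print(np.round(corr, 2))
print("mean off-diagonal correlation:", round(float(off_diag.mean()), 2))
```

Because every fake column leans on the same hidden factor, the off-diagonal correlations come out high, which is the kind of redundancy described above.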
We can reduce the information load without losing much information content if we can take away the area of overlap, leaving only the essential information. One of the techniques for doing this is Principal Components Analysis. "PCA is, at its essence, a rotation and scaling of a data set. The rotation is selected so that the axes are aligned with the directions of greatest variation in the data set. The scaling is selected so that distances along each axis are comparable in a statistical sense. Rotation and scaling are linear operations, so the PCA transformation maintains all linear relationships. It is designed to capture the variance in a dataset in terms of principal components. In effect, one is trying to reduce the dimensionality of the data to summarise the most important (i.e. defining) parts whilst simultaneously filtering out noise.
The rotation and scaling for PCA are given by the eigenvectors and eigenvalues of the covariance matrix. The covariance matrix contains the relationships (correlations) between the variables in the data set. One way to think about PCA is that it generates a set of directions, or vectors in the data space. The first vector shows you the direction of greatest variation in the data set; the second vector shows the next direction of greatest variation, and so on. The amount of variation represented by each subsequent vector decreases monotonically". Since I cannot explain PCA well without resorting to mathematical equations [which this blog is not capable of producing], I took this description from the University of Kansas, where PCA is being used in a study of Hexacorals [corals, sea anemones and their allies] to interface geospatial, taxonomic and environmental data on these creatures. PCA is well suited to such a study because the data comes from very diverse sources.
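For readers who prefer code to equations, the quoted description can be sketched in a few lines of Python with NumPy. This is only one textbook way of doing it, not the method the Kansas project used. The function name `pca_eig` and the toy data are my own assumptions for illustration. The eigenvectors of the covariance matrix give the rotation, and dividing by the square root of each eigenvalue gives the scaling.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigen-decomposition of the covariance matrix.

    X has one row per observation and one column per variable.
    Returns eigenvalues (largest first), matching eigenvectors (as columns),
    and the data expressed in the rotated coordinates.
    """
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)           # covariance between the columns
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort so the largest variance comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                    # the "rotation" step
    return eigvals, eigvecs, scores

# Toy data (an assumption): 200 observations of 3 variables driven by one hidden factor.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.2 * rng.normal(size=(200, 3))

eigvals, eigvecs, scores = pca_eig(X)
explained = eigvals / eigvals.sum()          # share of variance along each direction
rescaled = scores / np.sqrt(eigvals)         # the "scaling" step: unit variance per axis

print("share of variance per component:", np.round(explained, 3))
```

In this toy case the first direction carries almost all of the variance, so the three columns could be summarized by a single component with little loss.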
Here are some U.S. Economic Indicators in various categories.
Average Hourly Earnings
Employment Cost Index
Disposable Personal Income
Per Capita Income
Balance On Current Account
International Trade in Goods and Services
R&D Expenditures As Percentage of GDP
Net Oil Imports
Aggregate Money Supply
Interest Rates and Bond Yields
Producer Price Index
Gross Domestic Product
Crude Oil Prices
Housing Starts and Building Permits
Durable Goods Manufacturers' Shipments and Inventories
Advance Orders on Durable Goods
Monthly Retail Sales and Services
Intuitively we know that there is a lot of information overlap between the individual data sets as well as between the data categories. In a Complex Adaptive System like the economy, almost everything interacts with almost everything else, in a non-linear, non-sequential way, with feedback loops and exponential effects. Doing a Principal Components Analysis on this set of Economic Indicators should reduce it to a few principal components that together explain most of the variance in the data, say 99%.
It should be remembered, however, that such a diverse set of indicators covering employment, income, money, output, trade and production will inevitably come in widely different units and formats. This will tempt us to pre-process the data, scaling and normalizing it, for easier calculation of the PCA. However, pre-processing may take away some of the real information in the data sets even before they are fed into the PCA algorithm.
Another problem with PCA, especially for financial data, is that PCA linearizes everything, which is a bad thing in an essentially non-linear world. In effect, with PCA we are forcing the data into fixed areas of the multidimensional data space and losing some information in the process. This is why some hedge funds use Wavelets instead, although there we also have the problem of identifying the appropriate type of Wavelet to use. Lastly, although I am out of touch with the latest techniques for PCA, I believe there are generalized PCA algorithms which deal with the problem of non-linearity.
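Going back to the point about pre-processing, here is a small Python/NumPy sketch of how much the scaling choice can change the answer. The data is synthetic and deliberately extreme: four unrelated columns whose only difference is their units. The helper names `explained_variance_ratio` and `components_needed` are my own, and the 99% target is just the figure used above.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance carried by each principal component, largest first."""
    cov = np.cov(X, rowvar=False)             # np.cov centers the columns itself
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative round-off
    return eigvals / eigvals.sum()

def components_needed(ratio, target=0.99):
    """Smallest number of leading components whose cumulative share reaches target."""
    cumulative = np.cumsum(ratio)
    return min(int(np.searchsorted(cumulative, target)) + 1, len(ratio))

# Synthetic data (an assumption): four independent columns in wildly different units.
rng = np.random.default_rng(2)
scales = np.array([1000.0, 10.0, 1.0, 0.1])
X_raw = rng.normal(size=(300, 4)) * scales

# Standardize: subtract the mean and divide by the standard deviation of each column.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)

raw_ratio = explained_variance_ratio(X_raw)
std_ratio = explained_variance_ratio(X_std)

print("raw units:    components for 99% =", components_needed(raw_ratio))
print("standardized: components for 99% =", components_needed(std_ratio))
```

In raw units the column with the biggest numbers swamps everything, so PCA appears to need only one component. After standardizing, the four unrelated columns each count equally and all four are needed. Neither answer is "the" right one, which is why the decision to scale has to be made with care rather than for convenience.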
In Finance, PCA or any other dimension reduction and clustering method can be used to identify the extreme tails of the probability distribution in financial markets data, where the big money is to be made.