This post shows how to implement Principal Components Analysis (PCA) with the UTL_NLA package. It covers some of the uses of PCA for data reduction and visualization with a series of examples. It also provides details on how to build attribute maps and chromaticity diagrams, two powerful visualization techniques.
This is the second post in a series on how to do Linear Algebra in the Oracle Database using the UTL_NLA package. The first illustrated how to solve a system of linear equations. As promised in part 1, I am posting the code for a couple of procedures for reading and writing data from tables to arrays and vice-versa. Links to the data and code used in the examples are at the end of the post.
Principal Component Analysis
In data sets with many attributes groups of attributes often move together. This is usually due to the fact that more than one attribute may be measuring the same underlying force controlling the behavior of the system. In most systems there are only a few such driving forces. In these cases, it should be possible to eliminate the redundant information by creating a new set of attributes that extract the essential characteristics of the system. In effect we would like to replace a group of attributes with a single new one. PCA is a method that does exactly this. It creates a new set of attributes, the principal components, where each one of them is a linear combination of the original attributes. It is basically a special transformation of the data. It transforms numerical data to a new coordinate system where there is no redundant information recorded in the attributes.
It is easier to understand what PCA does by looking at a couple of pictures (see Technical Note 1 below for a more technical explanation). Let's assume that we have a table (data set) with two columns (X1 and X2) and that all the rows (observations) in this table falls within the gray ellipsis in Figure 1. The double-edged arrows in the picture indicate the range for the X1 and X2 attributes. A couple of things to notice in the picture:
- The data dispersion for X1 and X2 is about the same
- The distribution of the data falls along a diagonal pattern
The first point indicates that information from both attributes is required to distinguish the observations. The second point indicates that X1 and X2 are correlated. That is, when X1 is large X2 is also usually large, and when X1 is small X2 is also usually small.
What PCA does is to compute a new set of axes that can describe the data more efficiently. The idea is to rotate the axes so that the new axes (also called the principal components, i.e., PCs for short) are such that the variance of the data on each axis goes down from axis to axis. The first new axis is called the first principal component (PC1) and it is in the direction of the greatest variance in the data. Each new axis is constructed orthogonal to the previous ones and along the direction with the largest remaining variance. As a result the axes are labeled in order of decreasing variance. That is, if we label PC1 the axis for the first principal component then the variance for PC1 is larger than that for PC2 and so on.
Figure 2 shows the new axes PC1 and PC2 created by PCA for the example in Figure 1. As indicated by the sizes of the double-edged arrows, there is more variability (greater variance) on the PC1 axis than on the PC2 axis. Also the PC1 variance is bigger than that in the original attributes (X1 and X2). This is not surprising as PC1 is constructed to lie on the direction of the greatest variance in the data. The fact that PC2 has less variability means that it is less important for differentiating between observations. If there were more PCs, the data projection on these axes would have less and less variability. That is, they would capture less and less important aspects of the data. This is what allows compressing the data. By ignoring higher-order PCs we can great a new version of the data with fewer attributes than the original data.
Figure 3 shows the coordinates for a point in the original coordinate system (attributes X1 and X2) and in the PC coordinate system (PC1 and PC2). We can see that, for the PC coordinate system, the majority of points can be distinguished from each other by just looking at the value of PC1. This is not true for the original coordinate system. For example, there are many fewer points with coordinate b1 than with coordinate a1.
The following are the steps commonly taken when working with PCA:
- Center and/or scale the data. If an attribute spans different orders of magnitudes then first LOG transform the attribute
- Compute PCs
- Look at the variance explained (inspection of the Scree plot)
- Select the number of PCs to keep
The selected reduced set of PCs can then be used for different types of activities depending on the task at hand. Some common applications are: data compression, faster searches, data visualization, and finding groups and outliers. These steps will become clearer in the examples below.
Implementing PCA with UTL_NLA
The PCA functionality is implemented in the package DMT_PROJECTION by the following procedures and functions (see code at the end of the post):
The pca procedure has two signatures. The first works with arrays and it is the core procedure. The second signature wraps the first one and works with tables. The wrapper procedure uses a set of utilities to convert from table to array and vice-versa (see Technical Note 3). The wrapper procedure also estimates the number of PCs to keep and produces a query string that can be used to score new data. Technical Note 2 below describes the wrapper procedure in more details. All the examples below use the wrapper procedure.
The pca_eval procedure estimates the ideal number of PCs to keep (see discussion in Example 3 below). It also computes the information for a Scree plot and a Log-Likelihood plot. The latter is used to estimate the number of PCs based on the method described in [ZHU06].
The function mk_xfm_query generates a query string that transforms a source data set to make it more suitable to PCA. It supports three approaches: Centralize only (subtract the mean), scale only (divide by the standard deviation), and z-score (centralize and scale). The scripts for the examples below show how this function is used.
The mk_attr_map procedure creates 2D attribute maps for a set of attributes. The names of the columns with the two coordinates for the map are provided in the argument list. The result table contains the attribute maps for all the attributes listed in the argument list. The first coordinate labels the rows and the second coordinate labels the columns of this table. Example 2 showcases the use of this procedure for data visualization.
The function mk_pca_query generates a query string that can be used to compute the PC representation of data in a table, view, or query. To use this query it is only necessary to append a table name or subquery to the end of the returned string. This procedure is used by the pca wrapper procedure.
Example 1: Planets - Visualizing Major Features of the Data
The data for this example come from François Labelle's website on dimensionality reduction. The data contain the distance of each planet to the Sun, the planet's equatorial diameter, and the planet's density. The goal in this example is to find interesting features in the data set by visualizing it in 2D using PCA.
The following script transforms the data and finds the principal components using the pca wrapper procedure (see comments in the DMT_PROJECTION package code for a description of the arguments):
Because the data span a couple orders of magnitudes I applied the LOG transformation before centering the data using the mk_xfm_query function.
The planet_scr table (see Table 1) contains data for the Scree and Log-Likelihood plots. The Scree plot (Figure 4) charts all the eigenvalues in their decreasing order (from lower to higher PCs). I was curious so I checked out why it is called a Scree plot. The word scree comes from the Old Norse term for landslide. The term is used in this context because the plot resembles the side of a mountain with debris from the mountain lying at its base. The Log-Likelihood plot is covered in Example 3 below.
Table 1 shows that the first principal component (PC1) alone accounts for about 69% of the explained variance (PCT_VAR column). Adding PC2 brings it up to 98.8% (CUM_PCT_VAR column). As a result we can describe the data quite well by only considering the first two PCs.
Figure 5 shows the distribution of the planets based on their PC1 (horizontal) and PC2 (vertical) coordinates. These numbers (scores) are persisted in the planet_u table. The red arrows show the projection of the original attributes in the PC coordinate system. The arrows point in the direction of increasing values of the original attribute. The direction of the arrows can be obtained from the V matrix returned by the pca procedure in the planet_v table. Each row in this table gives the PC coordinates for one of the original attributes.
The picture reveals some interesting features about the data. Even someone not trained in Astronomy would easily notice that Pluto is quite different from the other planets. This should come as no surprise given the recent debate about Pluto being a planet or not. It is also quite evident that there are two groups of planets. The first group contains smaller dense planets. The second group contains larger less dense planets. The second group is also farther away from the sun than the first group.
Example 2: Countries - Visualizing Attribute Information with Attribute Maps
The previous example highlighted how principal component projections can be used to uncover outliers and groups in data sets. However, it also showed that is not very easy to relate these projections to the original attributes (the red arrows help but can also be confusing). An alternative approach is to create attribute maps in the lower dimensional projection space (2D or 3D). For a given data set, attribute maps can be compared to each other and allow to visually relate the distribution of different attributes. Again it is easier to explain the idea with an example.
The data for this example also come from François Labelle's website on dimensionality reduction. The data contain the population, average income, and area for twenty countries.
Figure 5 shows the projection of the LOG of the data in the first two PCs (PC1 is the horizontal axis). From the picture we see that countries with large populations are in the upper left corner. Countries with large areas and large average incomes are closer to the lower right corner. It is harder to get insight on the countries in the middle. Also this type of analyzes becomes more and more difficult as the number of attributes increases (more red arrows in the picture).
We can construct attribute maps with the following steps:
- Start with the PC1 and PC2 coordinates used to create a map like the one in Figure 5
- Bin (discretize) the PC1 and PC2 values into a small number of buckets.
- Aggregate (e.g., AVG, MAX, MIN, COUNT) the attribute of interest over all combinations of binned PC1 and PC2 values, even those not represented in the data so that we have a full matrix for plotting. The attribute being mapped can be one of those used in the PCA step (population, average income, area) or any other numeric attribute
- Organize the results in a tabular format suitable for plotting using a surface chart: rows index one of the binned PCs and columns the other
The DMT_PROJECTION.MK_ATTR_MAP procedure implements all these steps. This procedure relies on DMT_DATA_MINING_TRANSFORM package for binning the data. So you need to have the Oracle Data Mining option installed to use it. You can always comment out this procedure or replace the binning with your own if you do not have the Data Mining option.
Using DMT_PROJECTION.MK_ATTR_MAP procedure I computed attribute maps for Population, Average Income, and Area. Here is the code:
The code has two parts. First it does PCA on the data. Next it creates the maps. The maps are persisted in the COUNTRY_ATTR_MAP table. The bin coordinates for each country are returned in the COUNTRY_BVIEW view. The query used for creating the maps joins the original data with the PCA projections from the country_u table. To create a map for attributes not in the original data we just need to edit this query and replace the original table in the join with the new table containing the data we want to plot.
For this example I set the number of bins to 4 because the data set is small. The more rows in the data the more bins we can have. This usually creates smoother maps. The procedure supports either equi-width or quantile binning. If we use quantile binning the maps display an amplification effect similar to a Self-Organizing Map (SOM). Because we create more bins in areas with higher concentration of cases, the map "zooms in" these areas and spreads clumped regions out. This improves visualization of dense regions at a cost that the map now does not show distances accurately. A map created with quantile binning is not as useful for visual clustering as one created with equi-width binning or using the original PC projections as in Figure 5.
Figures 6-8 display the three maps using MS Excel's surface chart. The maps place the countries on a regular grid (PC1 vs. PC2, where PC1 is the horizontal axis) by binning the values of the coordinates PC1 and PC2 into four equi-width buckets. The colors provide a high to low landscape with peaks (purple) and valleys (blue). Each point in the grid can contain more than one country. I labeled a few countries per point for reference purpose. The value of the attribute at a grid node represents the average (AVG) value for the countries in that grid node. Values (and colors) between grid points are automatically created by interpolation by the surface chart routine in Excel.
The maps provide a visual summary of the distribution of each attribute. For example, Figure 7 shows that China and India have very large populations. As we move away from China (the population peak) population drops. Figure 8 shows high average income countries in the middle of the map and cutting across to the upper right corner.
It is also very easy to relate the attribute values in one map to the attribute values in the other maps. For example, the maps show that, in general, countries with larger average income (purple in Figure 8) have smaller population (yellow in Figure 7), the USA being an exception. Also large countries (purple and red in Figure 9) have large populations (purple and red in Figure 6). Australia is the exception in this case.
Those interested in creating plots like Figures 6-8 with Excel's surface chart should take a look at Jon Peltier's excellent tutorial: Surface and Contour Charts in Microsoft Excel. The output produced by DMT_PROJECTION.MK_ATTR_MAP is already in a format that can be used for plotting with Excel or most surface charting tools. In my case I just selected from the table p_res_tab in SQLDeveloper and then pasted the selection directly into Excel.
Example 3: Electric Utilities - Scree Plots and Number of PCs
An important question in PCA is how many PCs to retain. If we project a data set with PCA for visualization we would like to know that the few PCs we are visualizing accounts for a significant share of the variation in the data. Also for compression applications we would like to select the minimum number of PCs that achieve good compression. Again this will become clearer with an example.
The data set for this example can be found in [JOH02] and Richard Johnson's web site (here). The data set gives corporate data on 22 US public utilities. There are eight measurements on each utility as follows:
- X1: Fixed-charge covering ratio (income/debt)
- X2: Rate of return on capital
- X3: Cost per KW capacity in place
- X4: Annual Load Factor
- X5: Peak KWH demand growth from 1974 to 1975
- X6: Sales (KWH use per year)
- X7: Percent Nuclear
- X8: Total fuel costs (cents per KWH)
The task for this example is to form groups (clusters) of similar utilities. An application, suggested in this lecture at MIT, where clustering would be useful is a study to predict the cost impact of deregulation. To do the required analysis economists would need to build a detailed cost model for each utility. It would save a significant amount of time and effort if we could:
- Cluster similar utility firms
- Build detailed cost models for just a "typical" firm in each cluster
- Adjust the results from these models to estimate results for all the utility firms
Instead of applying a clustering algorithm directly to the data let's analyze the data visually and try to detect groups of similar utilities. First we use PCA to reduce the number of dimensions. Here is the code:
Next we need to determine if a small number of PCs (2 or 3) captures enough of the characteristics of the data that we can feel confident that our visual approach has merit. There are many heuristics that can be used to estimate the number of PCs to keep (see [ZHU06] for a review). A common approach is to find a cut-off point (a PC number) in the Scree plot where the change from PC to PC in the plot levels off. That is, the mountain ends and the debris start. Until not long ago, there were no principled low computational cost approaches to select the cut-off point on the Scree plot. A recent paper by Zhu and Ghodsi [ZHU06] changed that. The paper introduces the profile log-likelihood, a metric that can be used to estimate the number of PCs to keep. Basically one should keep all the PCs up to and including the one for which the Profile Log-Likelihood plot peaks.
The pca wrapper procedure implements the approach described in [ZHU06]. The estimated number of PCs is returned in the p_num_pc argument. The procedure also returns the information required to create a Scree plot and the Profile Log-Likelihood plot in the table specified by the p_scr_tab_name argument.
Figures 10 and 11 show, respectively, the Scree and the Log-Likelihood plots using the data persisted in the utility_scr table. The Profile Log-Likelihood plot indicates that we should keep the first three PCs (the plot peaks at PC3). This might not have been immediately obvious from the Scree plot itself. The cumulative percentage explained variance for the first three PCs is 67.5%. PC3 accounts for about 16.5% of the variance. By keeping only the first three PCs we capture a good deal of the variance in the data.
Let's start our analysis by looking at the projection of the data on the first two PCs. This is represented in Figure 12. From the picture we can see that the values in PC1 (horizontal axis) divide the data into three sections: A left one with the Green group (San Diego), a right one with the Blue group, and a middle section with the remaining of the utilities. Along the PC2 axis the middle section separates into three subsections, colored red, yellow, and black in the picture. Finally, the Black group seems to have two subgroups separated by a line in the picture. This visual partition of the data is somewhat arbitrary but it is can be quite effective.
What happens if we add the information from PC3? Do we still have the same partitions? Can it help to further separate the yellow group?
Figure 13 illustrates the projection of the data using the first three PCs (see Technical Note 4 for details on how to construct this plot). Each vertex of the triangle represents one of the first three PCs, the closer to the vertex the higher the value for the associated PC. The numbers have been scaled so that at a vertex the associated PC has a value of 1 and the other two PCs are zero. Also points at the line opposite to a vertex have a value of zero for the PC associated with the vertex. For example, San Diego has a value of almost 0 for PC1.
Most of the structures uncovered in Figure 12 are also present in Figure 13. The Blue and Red groups are preserved. San Diego (Green group) is still very different from the rest. One of the Black subgroups (Consolid, NewEngla, Hawaiian) was also kept intact. However, the other Black subgroup was merged with the Yellow group. We can also see that adding PC3 helped split the Yellow group into two subgroups. The partition of the Yellow utility firms takes place along a line that connects PC3 with the opposite side of the triangle.
Recapping, the task was to identify groups of similar utility firms. Using PCA we were able to visually identify 6 groups (Figure 13) with ease. Trying to uncover patterns by inspecting the original data directly would have been nearly impossible. The alternative would have been to use a clustering algorithm. The latter is a more sophisticated analysis requiring greater analytical skills from the user.
In a follow-up post I will show how to implement data interpolation using cubic splines. The implementation does not use the procedures in UTL_NLA, only the array types. However it is another application of Linear Algebra and it can be code against the UTL_NLA procedures.
Technical Note 1: The Math
Given the original data matrix X, PCA finds a rotation matrix P such that the new data Y=XP have a diagonal covariance matrix. The P matrix contains the coefficients used to create the linear combination of the original attributes that make up the new ones. Furthermore, P is a matrix with the eigenvectors of the covariance matrix of X in each column and ordered by descending eigenvalue size. We can achieve compression by truncating P. That is, we remove the columns of P that have the smallest eigenvalues. As a result Y is not only a rotated version of X but it is also smaller.
Let X be an m by n matrix where the mean of each column is zero. The covariance matrix of X is given by the n by n matrix C=X'X. The Singular Value Decomposition (SVD) of X is X=USV'.
The matrix U is m by n and has orthogonal columns, that is, U'U = I, where I is an n by n identity matrix. The matrix S is an n by n diagonal matrix. The matrix V is an n by n matrix with orthogonal columns, that is, V'V = I. Using this decomposition we can rewrite C as
From this expression we see that V is a matrix of eigenvectors of C and S^2 is the matrix of eigenvalues of C, the column of both matrices are also ordered by decreasing size of the associated eigenvalue. For example, let Vi be the
where Si^2 is the i-th element of the diagonal of S^2. It is also the eigenvalue associated with Vi. Setting P=V also solves the PCA problem, that is, V is a rotation matrix with all the desired properties. To illustrate the point further, given the covariance of Y as
If we replace P with V we make Y'Y a diagonal matrix:
So V is a rotation matrix that turns the covariance matrix of Y into a diagonal matrix.
The scores Y on the new coordinate system can also be obtained from the results of SVD. Y can be computed by multiplying each column of U by the respective diagonal element of S as follows
This is what is done in the step that scales U in the pca core procedure (see Technical Note 2 below).
Technical Note 2: The Magic in the Box - Inside the Core and the Wrapper PCA Procedures
The actual code for computing PCA (the core PCA procedure) is very small. This procedure is a good example of how the UTL_NLA package allows implementing complex calculations with a few lines of code. The code is as follows:
The above code uses Singular Value Decomposition (SVD) for computing PCA. It has three parts. First it does SVD calling a single UTL_NLA procedure (UTL_NLA.LAPACK_GESVD). Then it scales the U matrix to obtain the projections. Finally it scales and squares the S matrix to get the variances.
The pca wrapper procedure is a good example of how easy it is to write powerful code using the UTL_NLA package and the read and write procedures in DMT_MATRIX (see technical note 3 below). The full pca wrapper procedure is listed below:
The procedure has the following main steps:
- Read the data from a table into an array
- Perform PCA on the array
- Generate the scoring query
- Estimate the number of PCs and compute the scree and log-likelihood plots
- Persist the PCA results (eigenvalues, PCs, scores, and scree and log-likelihood plot information)
With the exception of the last one, each one of these steps is implemented with a single procedure call. The last step requires a procedure call for each persisted result. This is a total of three calls to the array2table procedure. In the code I use the scoring query for persisting the scores instead of using the U matrix. This way I am sure that I get projections for all the rows in the original table even for cases where the original table did not fit in memory.
Technical Note 3: Read and Write Procedures
DMT_MATRIX package contains the following procedures for reading and writing data from tables to arrays and vice-versa:
- table2array - Reads data from a table and stores them in a matrix represented by an array of type UTL_NLA_ARRAY_DBL.
- table2array2 - Reads data from three columns in a table (or source) and stores them in a matrix represented by an array of type UTL_NLA_ARRAY_DBL. The first column contains, the row index, the second column the column index, and the third the value. The row and column indexes do not have to be numeric. However, there should be only one instance of a given row-column index pair. This procedure is useful for reading data generated by the array2table2 procedure (see below). It is also useful for reading data with more than one thousand attributes or that it is naturally stored in transactional format.
- array2table - Writes data from a matrix into a table. The table's column names and the matrix row IDs are either specified in the argument list or automatically generated.
- array2table2 - Writes data from a matrix into a table with 3 or 4 columns. The first (optional) column contains an ID for the matrix. This allows storing many matrices in a single table. The ID for a matrix should be unique for each matrix stored in a table. The second column contains the row index. The third contains the column index. The fourth contains the values.
These procedures greatly simplify coding linear algebra operations using data in tables. The wrapper version of the PCA procedure uses these procedures. The code for the DMT_MATRIX package provides more information on the procedures and dmt_matrix_sample2.sql file has a few examples using them.
Technical Note 4: Chromaticity Diagram
The approach used for Figure 13 is based on chromaticity diagrams used for color spaces. These diagrams enable three-dimensional data to be compressed into a two-dimensional diagram. The caveat is that points in the original three-dimensional space that are scaled versions of each other are mapped to the same point in the two-dimensional diagram. For example the points (0.1, 0.2, 0.2) and (0.2, 0.4, 0.4) are mapped to the same point (0.78, 0.53) in the diagram. These points represent the same overall pattern. The difference between them is a matter of "strength or scale." In the case of colors the difference between the two points is one of luminance. The second point is a brighter version of the first one. In the case of utilities (see Example 3 above), the second utility has the same operational characteristics of the first but it operates at a larger scale.
The graphics in Figure 13 was created using Excel's 2D scatter plot. The data for the graphics was generated as follows:
- Compute the pi projections (i=1, 2, 3) of the points in the cube onto the plane PC1+PC2+PC3 = 1, where (PC1, PC2, PC3) are the normalized PC projections from the previous step:
- Compute the 2D scatter plot coordinates (T1, T2), where T1 and T2 are, respectively, the horizontal and the vertical distances from the triangle's PC1 corner:
This link points to the source code for the dmt_projection and dmt_matrix packages as well as scripts for the examples in the post. There are also links to a dump file containing the data for the examples and a script to create a schema and load the data.
[JOH02] Johnson, Richard Arnold, and Wichern, Dean W. (2002). Applied Multivariate Statistical Analysis, Prentice Hall, 5th edition.
[ZHU06] Zhu M., and Ghodsi A. (2006), "Automatic dimensionality selection from the scree plot via the use of profile likelihood." Comput Stat Data Anal, 51(2), 918 - 930.