Way Cool: Linear Algebra in the Oracle Database 1

New to the Oracle Database 10g Release 2 is a hidden gem, the UTL_NLA package. This not very well known package (you don't get many hits for it in Google) brings linear algebra functionality to the Oracle Database. It makes the Oracle Database an even better platform for scientific and advanced analytics programming. Now it is possible to write performant matrix code in the database easily and avoid moving data around. Here is a brief description from the Oracle® Database Data Warehousing Guide 10g Release 2 (10.2):

Linear algebra is a branch of mathematics with a wide range of practical applications. Many areas have tasks that can be expressed using linear algebra, and here are some examples from several fields: statistics (multiple linear regression and principle components analysis), data mining (clustering and classification), bioinformatics (analysis of microarray data), operations research (supply chain and other optimization problems), econometrics (analysis of consumer demand data), and finance (asset allocation problems). Various libraries for linear algebra are freely available for anyone to use. Oracle's UTL_NLA package exposes matrix PL/SQL data types and wrapper PL/SQL subprograms for two of the most popular and robust of these libraries, BLAS and LAPACK.

Linear algebra depends on matrix manipulation. Performing matrix manipulation in PL/SQL in the past required inventing a matrix representation based on PL/SQL's native data types and then writing matrix manipulation routines from scratch. This required substantial programming effort and the performance of the resulting implementation was limited. If developers chose to send data to external packages for processing rather than create their own routines, data transfer back and forth could be time consuming. Using the UTL_NLA package lets data stay within Oracle, removes the programming effort, and delivers a fast implementation.

BLAS and LAPACK are probably the predominant linear algebra libraries out there. These libraries are extensively used by a large number of scientific programs and specialized tools.

For developers, these libraries provide the building blocks for easily implementing a large number of advanced techniques. Take for example a toolbox like MATLAB, many of its capabilities are built on top of linear algebra primitives provided by packages like BLAS and LAPACK. Having these libraries in the database allow developers to write compact and ease to read code written using vector operations. Also, because these libraries have efficient and robust implementation of linear algebra operations, code using the UTL_NLA package inherent these qualities for free.

Besides scientific programming, Oracle's linear algebra support can also be used for business analysis. One example is multiple linear regression. The database ships with a multiple regression application built using the UTL_NLA package. This application is implemented in an object called OLS_Regression. Note that sample files for the OLS Regression object can be found in $ORACLE_HOME/plsql/demo. Take a look here for an example of how to use this functionality.

Another example of business analysis is solving a system of linear equations. In this post I give a couple of examples of how to solve a system of linear equations with UTL_NLA. In a follow up post I will show how to implement Principal Components Analysis (PCA) using the package.

Required Expertise, Limitations, and Usability

Before using the UTL_NLA package for development there are a couple of things to keep in mind. Oracle documentation states that developers using this package are expected to have a sound grasp of linear algebra in general and of the BLAS and LAPACK libraries in particular. I believe basic knowledge of linear algebra is enough if you are just trying to implement a well-known algorithm using this package. With respect to BLAS and LAPACK it is important to familiarize yourself with some basic concepts (e.g., matrix storage representation: column- or row-wise) and also to better understand some of the arguments the procedures in the package take. Besides the Oracle documentation some other useful references are:

The Lapack Users' Guide, the BLAS and the LAPACK chapters in the in CRC Handbook of Linear Algebra.

The UTL_NLA package currently only supports matrices with up to 1,000,000 elements. For example, if we think of a table as a matrix, where table columns map to matrix columns and table rows to matrix rows, then for a table with 100 columns we can store up to 10,000 rows in a single matrix. It is useful to think in these terms because the number of rows is usually larger than the number of columns in a table. Furthermore, for many applications, we can obtain good results working on a sample of the data. This can be easily done in the database.

In order to use UTL_NLA, matrices have to be represented as either UTL_NLA_ARRAY_DBL or UTL_NLA_ARRAY_FLT. These types are PL/SQL VARRAY. However, many applications require data that resides in tables. It would be a nice addition to the package to have the ability to create a matrix from a query and to persist matrices to tables as well. This would alleviate the need for developers to code their own data read and write routines and increase usability. In the next post, along with the PCA example, I will include a couple of procedures to help with these tasks.

Solving Systems of Linear Equations

Systems of linear equations (see here for a less technical description) are of great importance in mathematics and its applications to areas of physical sciences, economics, engineering and many more. The goal in solving a system of linear equations is to find a set of values for the unknowns that satisfies all the equations in the system. Let me illustrate this with a couple of examples.

Example 1: Trip Planning (adapted from AlgebraLAB)

You are planning a 7 day trip to the East Coast. You estimate that it will cost $300 per day in Boston and $675 per day in New York City. Your total budget for the 7 days is $2850. How many days should you spend in each location?

This problem can be mapped to the following system of linear equations:

$$\left\{ \begin{array}{rcrcr} x_1 & + & x_2 & = & 7 \\ 300x_1 & + & 675x_2 & = & 2850 \end{array} \right. $$

where x1 is the number of days in Boston and x2 is the number of days in New York City. x1 and x2 are the unknowns, the values we want to compute.The first equation requires that the total time be equal to 7 days. The second equation requires that the total amount spent in the trip to be equal to the available budget. In matrix form this system can be expressed as A*X = B where the matrices A, X, and B are as follows:

$$ A=\left( \begin{array}{cc} 1 & 1 \\ 300 & 675 \end{array} \right) ,\; X=\left( \begin{array}{c} x_1\\ x_2 \end{array} \right) ,\; B=\left( \begin{array}{c} 7\\ 2850 \end{array} \right) $$

We can use the LAPACK_GESV procedure in UTL_NLA to solve this problem. This procedure computes the solution to a real system of linear equations A*X=B where A is an n by n matrix and X and B are n by 1 matrices. Here is a code snippet that solves this problem:

The above code returns:

The arrays A and B store, respectively, matrix A and B in column order. It only takes a single procedure to solve the problem. On exit, if the argument info = 0 (success), the procedure overwrites the array B with the n by 1 solution matrix X. For this example, the solution is x1=5 and x2=2.

Example 2: BurgerRama Cartoon Dolls 

(originally in the October 1991 edition of Mathematics Teacher - adapted from here)

Joan King is marketing director for the BurgerRama restaurant chain. BurgerRama has decided to have a cartoon-character doll made to sell at a premium price at participating BurgerRama locations. The company can choose from several different versions of the doll that sell at different prices. King’s problem is to decide which selling price will best suit the needs of BurgerRama’s customers and store managers. King has data (Table 1) from previous similar promotions to help her make a decision.

 Table 1

Table 1

To solve this problem we need to proceed in two steps. First we need to estimate the supply and demand equations from the above data. Then we solve a system of linear equations to find the market equilibrium price where supply equals demand.

For the supply equation we want find a linear equation relating supply (S) to price (P) that fits the data in Table 1. Let write this equation as S = c1 P + c2, where c1 and c2 are the parameters we want to estimate. By replacing S and P in the supply equation with the values in Table 1 we obtain the following system of three linear equations:

$$\left\{ \begin{array}{rcrcr} c_1 & + & c_2 & = & 35 \\ 2c_1 & + & c_2 & = & 130 \\ 4c_1 & + & c_2 & = & 320 \end{array}\right. $$

In matrix form this system can be expressed as A*X = B where the matrices A, X, and B are as follows:

$$ A=\left( \begin{array}{cc} 1 & 1 \\ 2 & 1 \\ 4 & 1 \end{array} \right) ,\; X=\left( \begin{array}{c} c_1\\ c_2 \end{array} \right) ,\; B=\left( \begin{array}{c} 35 \\ 130 \\ 320 \end{array} \right) $$

This type of system, with more equations (3) than unknowns (c1, c2) is called an overdetermined system. Solving this problem is the same as solving a multivariate linear regression problem.

We can use the procedure LAPACK_GELS to solve this system. The LAPACK_GELS procedure solves overdetermined and underdetermined real linear systems. The relevant code snippet is:

The above code returns:

We have just implemented multivariate linear regression using a single PL/SQL procedure call! On entering the procedure the array B contains the values for the matrix B above. On exiting the procedure, the first two elements of B are the values for c1 and c2, the coefficients for the supply equation. The third element is the residual sum of squares for the solution, that is, the sum of the squared differences between the values predicted by the solution for S and the actual values in the data. Replacing the obtained values for c1 and c2 in the supply equation we obtain: S = 95 P - 60.

Setting up a similar problem for the demand data gives a system with the same A matrix and a different B matrix (B=(530,400,140)). Solving this system of equations (see code at the end of the post) gives the following demand equation: D = -130 P + 600.

In order to find the market equilibrium price (step 2) we create a new system of linear equations combining the supply and demand equations with a new equation requiring that supply equals demand. The system looks like this:

$$\left\{ \begin{array}{rcrcrcr} S & & & - & 95P & = & -60 \\ & & D & + & 135P & = & 600 \\ S & - & D & & & = & 0 \end{array}\right. $$

The first two equations are, respectively, the supply and demand equations we obtained in step 1 above. The third equation is the market equilibrium condition (S=D). The matrix representation for the system is as follows:

$$ A=\left( \begin{array}{crr} 1 & 1 & -95 \\ 0 & 1 & 130\\ 1 & -1 & 0 \end{array} \right) ,\; X=\left( \begin{array}{c} S \\ D \\ P \end{array} \right) ,\; B=\left( \begin{array}{r} -60 \\ 600 \\ 0 \end{array} \right) $$

The solution of this system of linear equations is S = D = 244 units and P = $3.20.

Here is the code for all the steps in this example:

The above code has only three procedure calls. It is even possible to solve steps 1a and 1b with a single call by combining the B matrices used in both steps into a single one. This would require only two procedure calls to solve Example 2.

In the next post in this series I will show how to do Principal Components Analysis (PCA) using the UTL_NLA package.

Posted on April 20, 2007 .