# Finding the Most Typical Record in a Group

I recently came across the following question: How can I find the most typical record in a group or cluster of records? For example, suppose we have a set of customer records, what is the customer that best typifies the group or cluster? The answer to this question can be used for characterizing groups of records of all types. For example, it can be used for characterizing multimedia collections (e.g., text documents or images).

A general strategy to answer this question is:
• Prepare the data (usually required for distance measure computations)
• Compute the centroid (see definition below) for the group of records
• Calculate a distance or proximity measure between the records and the centroid
• Select the record with the smallest distance to the centroid
Below I describe how to implement these steps using SQL. I give examples for cases where:
• Groups are created based on user-specified criteria
• Groups are automatically generated using clustering
Meet the Centroid
A centroid is one of the pieces of information used in clustering models to describe a cluster. It can be seen as a prototype, a typical value for the records in a cluster or group. For example, in a group of people we can ask what are the typical height, weight, and gender. A reasonable answer would be the average height, the average weight, and the most frequent gender in the group. Taken together this answer could be interpreted as a typical member of the group or the prototype of the group. It may be the case that nobody in the group matches the description though. This prototype is what we call the centroid. For numerical attributes, like weight, the centroid is the average (mean) value in the cluster. For categorical attribute, like gender, the centroid is the most frequent value, also known as the mode of the attribute in the cluster. Another way of describing the centroid is that it is the expected value for each attribute.

Because the centroid is an abstraction, there might not be any member of the group that actually looks like the centroid. Instead of using the centroid as a prototype or representative for the group, in many cases, it is preferable to use the closest record to the centroid. This is also called the medoid. In the example above, this can be seen as the most typical or average person in the group. To make the point more meaningful, suppose we have a set of documents. The centroid of a group of documents is not a document that we can read. If we use the centroid to describe what it is a typical document in the group we cannot readily grasp what are the types of documents in the group. We can find out what are the most frequent terms used in the documents in the group, but that is not as instructive as a representative text. If we use the medoid instead, it is very clear what a typical document looks like. The same idea can be applied to collections of images and groups of record (e.g., customer demographics).

Sample Data
For the discussion below, I will use a data set from Lewis and Taylor (1967) containing the sex, age (months), height (inches), and weights (pounds) of school children. A table (kids) with these data can be created using a script found here.

It is often necessary to prepare (transform) the data before computing a distance metric. Common transformations include:
• Normalization of numeric attributes
• "Explosion" of categorical attributes
• Missing value replacement
These can be easily implemented using SQL (see Technical Note I at the end for details) or the PL/SQL package DBMS_DATA_MINING_TRANSFORM. The queries below include data preparation. They have a subquery (kids_n) that normalizes the data in the kids table . The explosion of categorical attributes and missing value replacement treatment are also efficiently integrated in the computations (see Technical Note II).

Centroid Computation
We can use the SQL functions AVG and STATS_MODE to compute the centroid for a group of records. This is illustrated in the following query:
SELECT stats_mode(sex) sex, avg(age) age, avg(height) height,       avg(weight) weightFROM kids;
The query implements the centroid computation using AVG for numeric attributes and STATS_MODE for categorical ones. In this example the centroid is computed for the whole data. For computing the centroid for different groups of records we need to add a GROUP BY clause. For example, the following query computes the centroids for "young" (age <= 150 months) and "old" (age > 150 months) children:
SELECT age_group, stats_mode(sex) sex, avg(age) age,             avg(height) height, avg(weight) weightFROM (SELECT (CASE WHEN  age<=150 THEN 0 ELSE 1 END) age_group, a.*      FROM kids a)GROUP BY age_group;
When the categorical attributes are "exploded," instead of the mode, it is common to use the mean value of the exploded columns as an alternative representation for the centroid. This facilitates the computation of some distance metrics. I follow this approach in the queries below (see Technical Note II for details).

Distance to the Centroid Computation
There are many types of distance metrics. The Euclidean distance is probably the most used distance metric. It is defined as the square root of the sum of squared differences between the attribute values for two records. For our problem it becomes the sum of squared differences between the attribute values for a record in the group and those for the centroid of the group.

Here are the pieces of SQL that we need to add for each attribute based on the attribute type. To compute the distance between a row and a centroid we only need to add these pieces together. The computations provided below take advantage of some simplification tricks. These computations do not actually compute the full distance metric. They only compute the relevant pieces of the distance metric required for ranking how close a record is to the centroid. Factors that are common for all records are not included. The details are provided in Technical Note II.

For each numeric attributes add the following:
DECODE(r.num, NULL, STDDEV(r.num) OVER() *        (STDDEV(r.num) OVER() - 2 * c.num),        r.num * (r.num - 2 * c.num))
where num is the name of the attribute and r aliases the source with the rows and c the source with the centroid values.

For each categorical attribute add the following:
DECODE(r.cat, NULL, 0,        -2 * COUNT(*) OVER (PARTITION BY r.cat, r.rec_group)/       COUNT(*) OVER(PARTITION BY r.rec_group))
where cat is the name of the attribute, rec_group is a column with that identifies the different groups, and r aliases the source with the rows and c the source with the centroid values.

For example, for the age, weight, and sex attributes in the kids table we would compute the relevant piece of the Euclidean distance metric as:
DECODE(r.age, NULL,        STDDEV(r.age) OVER() *        (STDDEV(r.age) OVER() - 2 * c.age),        r.age * (r.age - 2 * c.age)) + DECODE(r.weight, NULL,        STDDEV(r.weight) OVER() *        (STDDEV(r.weight) OVER() - 2 * c.weight),        r.weight * (r.weight  - 2 * c.weight)) +DECODE(r.sex, NULL, 0,        -2 * COUNT(*) OVER (PARTITION BY r.sex, r.rec_group)/       COUNT(*) OVER(PARTITION BY r.rec_group))
The next two sections illustrate how to put it all together with two examples.

Example 1- Typical Records For User-Defined Partitions
In this example the goal is to find the most typical record for sets of records grouped based on user-defined criteria. More specifically, we want to find the most typical record for "young" (age <= 150 months) and "old" (age > 150 months) children in the kids table. This can be answered with the following query:
WITH    kids_n AS      (SELECT (CASE WHEN  age<=150 THEN 0 ELSE 1 END) rec_group,             cid, sex,             (age-AVG(age) OVER())/STDDEV(age) OVER () age,             (height-AVG(height) OVER())/STDDEV(height) OVER() height,
             (weight-AVG(weight) OVER())/STDDEV(weight) OVER() weight
      FROM kids),   ce AS      (SELECT rec_group,             STATS_MODE(sex) sex, AVG(age) age, AVG(height) height,              AVG(weight) weight      FROM kids_n      GROUP BY rec_group),   dis AS      (SELECT r.rec_group,              r.cid,             DECODE(r.age, NULL,                     STDDEV(r.age) OVER() *                     (STDDEV(r.age) OVER() - 2 * c.age),                     r.age * (r.age - 2 * c.age)) +              DECODE(r.weight, NULL,                     STDDEV(r.weight) OVER() *                     (STDDEV(r.weight) OVER() - 2 * c.weight),                     r.weight * (r.weight  - 2 * c.weight)) +             DECODE(r.height, NULL,                     STDDEV(r.height) OVER() *                     (STDDEV(r.height) OVER() - 2 * c.height),                     r.height * (r.height  - 2 * c.height)) +             DECODE(r.sex, NULL, 0, -2 *                     COUNT(*) OVER (PARTITION BY r.sex, r.rec_group)/                    COUNT(*) OVER(PARTITION BY r.rec_group)) dist       FROM   kids_n r, ce c       WHERE  r.rec_group = c.rec_group),   med AS      (SELECT *      FROM (SELECT d.*, min(dist) OVER(PARTITION BY rec_group) mdist            FROM dis d)      WHERE dist = mdist)SELECT m.rec_group, k.*FROM   med m, kids kWHERE  m.cid = k.cidORDER BY m.rec_group, m.dist;
I broke down the query into subqueries that map to the four steps of the approach I outlined above. The kids_n subquery prepares the data. It normalizes the numeric attributes using z-score. The ce subquery computes the centroid for each partition. In this example there is a partition for each unique rec_group value. The dis subquery computes the Euclidean distance from each record in a partition to the centroid of the partition. It constructs the distance computation using the building blocks described above. The med subquery computes the minimum distance to the centroid and selects, for each partition, the record with the smallest distance to the partition's centroid. The final query joins the results with the original data to obtain a description of the record without the normalization.

The above query returns the following results:
REC_GROUP      CID SEX    AGE     HEIGHT     WEIGHT                 --------- -------- --- ------ ---------- ----------         0       40   m    144       59.5         88           1       11   f    173       62.8      102.5
Compare these numbers with those for the group centroid computed on the non-normalized data (results for the explosion of the sex column are also shown below) :
SELECT rec_group,        STATS_MODE(sex) sex, AVG(f) f, AVG(m) m,         AVG(age) age, AVG(height) height,         AVG(weight) weight FROM (SELECT (CASE WHEN  age<=150 THEN 0 ELSE 1 END) rec_group,              DECODE(sex,'f',1,0) f, DECODE(sex,'m',1,0) m, a.*      FROM kids a)GROUP BY rec_group;REC_GROUP SEX      F      M      AGE     HEIGHT     WEIGHT                 --------- --- ------ ------ -------- ---------- ----------        0   m   0.35   0.65    144.8     58.875       88.9                           1   f  0.525  0.475  173.575     63.498    107.225
Example 2 - Typical Record For a Cluster
Let's assume that the data set from the previous example was clustered into 3 clusters using the kids_cl model (Technical Note III shows how to do it) and we want to identify the most typical record for each cluster. Can we use the clustering model to identify the closest record to a cluster centroid? Yes and no. We can use clustering scoring to partition the records into groups. However, we need to then apply the approach described above. Unfortunately, we cannot use cluster probability for computing the closest record. Those interested can read Technical Note IV below for the details.

The following query returns the most typical record for each cluster:
WITH    kids_n AS      (SELECT CLUSTER_ID(kids_cl USING a.*) rec_group, a.*      FROM (SELECT cid, sex,             (age-AVG(age) OVER())/STDDEV(age) OVER () age,             (height-AVG(height) OVER())/STDDEV(height) OVER() height,             (weight-AVG(weight) OVER())/STDDEV(weight) OVER() weight            FROM kids) a),   ce AS      (SELECT rec_group,             STATS_MODE(sex) sex, AVG(age) age, AVG(height) height,              AVG(weight) weight      FROM kids_n      GROUP BY rec_group),   dis AS      (SELECT r.rec_group,              r.cid,             DECODE(r.age, NULL,                     STDDEV(r.age) OVER() *                     (STDDEV(r.age) OVER() - 2 * c.age),                     r.age * (r.age - 2 * c.age)) +              DECODE(r.weight, NULL,                     STDDEV(r.weight) OVER() *                     (STDDEV(r.weight) OVER() - 2 * c.weight),                     r.weight * (r.weight  - 2 * c.weight)) +             DECODE(r.height, NULL,                     STDDEV(r.height) OVER() *                     (STDDEV(r.height) OVER() - 2 * c.height),                     r.height * (r.height  - 2 * c.height)) +             DECODE(r.sex, NULL, 0, -2 *                     COUNT(*) OVER (PARTITION BY r.sex, r.rec_group)/                    COUNT(*) OVER(PARTITION BY r.rec_group)) dist       FROM   kids_n r, ce c       WHERE  r.rec_group = c.rec_group),   med AS      (SELECT *      FROM (SELECT d.*, min(dist) OVER(PARTITION BY rec_group) mdist            FROM dis d)      WHERE dist = mdist)SELECT m.rec_group, k.*FROM   med m, kids kWHERE  m.cid = k.cidORDER BY m.rec_group, m.dist;
The query is basically the same as the one in the previous example. The text in red indicates the change to the query, namely, replaced the rec_group definition. In this example the rec_group column indicates the cluster ID for the record. This column is computed using the SQL data mining function CLUSTER_ID. The kids_n subquery has now two queries. The innermost query, as before, normalizes the data. The outer query adds the re_group column. This is necessary because the clustering model kids_cl was built with normalized data. In order to score this model we also need to feed it normalized data, which is provided by the inner query.

The above query returns
REC_GROUP      CID SEX    AGE     HEIGHT     WEIGHT                 --------- -------- --- ------ ---------- ---------- 3               60   m    151       59.3         87                     4               37   m    193       66.3        133                    5               10   f    169       62.3       99.5
From these results cids 60, 37, and 10 can be used as typical cases to represent clusters 3, 4, and 5 respectively. Also, cluster 3 contain younger children, cluster 4 somewhat older children, and cluster 5 older ones.

The approach described here is very general and flexible. It can be efficiently implemented in a single query. It can also be used with groups defined by the user or automatically generated using a clustering model. However, the distance metric computations need to be tailored to the data. In order to address this, in a follow up post I will provide the code for a procedure that wraps these steps in an easy to use interface.

Technical Note I - Data Preparation
It is usually a good idea to normalize numeric attributes in order to compute distance metrics. That is, try to make each numeric attribute to have the same range. This avoids a single attribute to dominate the distance metric. Suppose all but one of the numeric attributes in a data set have a range between 0 and 1. Let's say that the remaining attribute INCOME has a range between 10,000 and 100,000. In this example, the distance computation between two records will be completely dominated by the differences in values in the INCOME attribute. This is really an artifact of the scale the attribute is measured. One way to address this is to create a new attribute INCOME1 as INCOME/100,000. This is the idea behind normalization of numeric attributes.

A common normalization approach is called z-score normalization. In this case, we create new attributes by subtracting the mean from the original attributes and dividing by the standard deviation. The following query implements z-score normalization for the above kids table:
SELECT cid, sex,       (age-AVG(age) OVER())/STDDEV(age) OVER () age,       (height-AVG(height) OVER())/STDDEV(height) OVER() height,       (weight-AVG(weight) OVER())/STDDEV(weight) OVER() weightFROM kids;
Note that the categorical attribute SEX was not transformed.

Because distance metrics are often defined for numeric attributes only, it is common to transform categorical attributes into a numeric representation. Usually this is done using the "explosion" transformation. In this case, we transform the categorical attribute into a set of binary numeric attributes, one new attribute for each distinct value of the categorical column. For example, transforming the above sex attribute would create two new columns M and F. M is 1 when sex='m' and 0 otherwise. F is 1 when sex='f' and 0 otherwise. The following query implements the explosion transformation for the sex column:
SELECT a.*,       DECODE(sex,'m',1,0) M, DECODE(sex,'f',1,0) FFROM kids a
This approach is simple but labor intensive. It requires knowledge of the set of distinct values for each categorical column. The computations described in the examples above avoided this step by taking advantage of some simplification tricks (see Technical Note II).

Finally, the presence of missing values should also to be taken into account when computing distance metrics. A simple approach is to replace missing numeric attributes by the mean and categorical by the mode. However, this is not desirable for the problem of finding the closest record to the centroid of a group of records. As discussed above, the centroid is computed as the mean and the mode of the attributes. Replacing missing values with the mean and the mode would make records with missing values look closer to the centroid. In the extreme case, a record with only missing values would be exactly like centroid and would be selected as the typical record for the group. Not a very desirable result. An alternative approach to missing value treatment that works for our problem is as follows:
• Numeric attributes: replace missing values with the standard deviation for the group
• Categorical attributes: explode the attribute (missing values are mapped to 0)
The idea here is to make the contribution to the distance by a missing attribute to be the same as a typical contribution (the standard deviation) for the attribute. This approach assumes z-score normalization. If another normalization is used then, for numeric attributes, we should replace missing values with the mean plus the standard deviation.

The following query implements missing value treatment for the kids table:
SELECT r.cid,     DECODE(r.age, NULL, STDDEV(r.age) OVER(), r.age) age,     DECODE(r.height, NULL, STDDEV(r.height) OVER(), r.height) height,     DECODE(r.weight, NULL, STDDEV(r.weight) OVER(), weight) weight,     DECODE(r.sex,'m',1,0) M, DECODE(r.sex,'f',1,0) FFROM kids r;
The queries in the two examples in the post already implement normalization of numeric attributes, explosion of categorical attributes, and missing value replacement.

Technical Note II - Euclidean Distance Metric Computation

Equation 1 gives an alternative expression for the Euclidean distance (dE) between a row (r) and a centroid (c). Both the row and the centroid can be seen as vectors where each attribute is a component of the vector. Arbitrary unique identifiers (e.g., those generated from a sequence) are left out. Also categorical attributes are represented after applying the explosion transformation. For example, the following record
     CID SEX    AGE     HEIGHT     WEIGHT                 -------- --- ------ ---------- ----------      40   m    144       59.5         88
is represented as the vector (1, 0, 144, 59.5, 88). A couple of things to notice:
• The cid column was left out.
• The first two components of the vector are the exploded representation of sex='m'. A record with sex='f' would be represented by (0, 1, ...).
Equation 2 shows how to compute the dot product (r.c) of the vectors r and c. nn is the number of numeric attributes, nc is the number of categorical attributes and f(rj | c) is the value of the rj component of the centroid for the exploded version of the data. This way of writing the dot product r.c takes into account the explosion of categorical attributes without having to explicitly implement it in the query. Let's say that the centroid for a group has the following values (notice that sex is exploded into two columns, M and F):
     F      M      AGE     HEIGHT     WEIGHT                 ------ ------ -------- ---------- ----------  0.35   0.65    144.8     58.875       88.9
The dot product r.c for the cid=40 record above is:
144 * 144.8 + 59.5 * 58.875 + 88 * 88.9 + 0.65
Equation 3 shows how to compute the norm of the vector r representing a given row. Because the norm of the centroid (||c||) is constant when computing the Euclidean distance between a row and the centroid, we can ignore it in the distance computation. Equation 4 shows that.

Finally, using the results from equations 2 and 3 in equation 4, we obtain the approximation for the Euclidean distance in equation 5. The term nc from equation 3 was left out because it is constant for all rows.

For the kids table, according to equation 5, we would compute the relevant piece of the Euclidean distance metric as:
r.age * (r.age - 2 * c.age) + r.weight * (r.weight  - 2 * c.weight) +r.height * (r.height  - 2 * c.height) +-2 * COUNT(*) OVER (PARTITION BY r.sex, r.rec_group)/     COUNT(*) OVER(PARTITION BY r.rec_group))
Adding missing value treatment as described in Technical Note II the approximation becomes:
DECODE(r.age, NULL,        STDDEV(r.age) OVER() *        (STDDEV(r.age) OVER() - 2 * c.age),        r.age * (r.age - 2 * c.age)) + DECODE(r.weight, NULL,        STDDEV(r.weight) OVER() *       (STDDEV(r.weight) OVER() - 2 * c.weight),        r.weight * (r.weight  - 2 * c.weight)) +DECODE(r.height, NULL,        STDDEV(r.height) OVER() *        (STDDEV(r.height) OVER() - 2 * c.height),        r.height * (r.height  - 2 * c.height)) +DECODE(r.sex, NULL, 0,        -2 * COUNT(*) OVER (PARTITION BY r.sex, r.rec_group)/       COUNT(*) OVER(PARTITION BY r.rec_group))
Technical Note III - Clustering Model Creation
The steps for creating a cluster model (kids_cl), using the data mining PL/SQL API, to cluster the data in the kids table into three clusters are as follows:
1. Normalize the numeric attributes
2. Create a settings table specifying the number of clusters
3. Create the clustering model
1. Normalize the numeric attributes
BEGIN  -- Create a normalization table  dbms_data_mining_transform.create_norm_lin (    norm_table_name => 'cl_num');  -- Indicate which columns to normalize and the type of   -- normalization to use  dbms_data_mining_transform.insert_norm_lin_zscore (    norm_table_name => 'cl_num',    data_table_name => 'kids',    exclude_list    => dbms_data_mining_transform.column_list (                       'CID'),    round_num       => 0  );  -- Create the transformed view. Here is where the actual  -- normalization computation takes place  dbms_data_mining_transform.xform_norm_lin (    norm_table_name => 'cl_num',    data_table_name => 'kids',    xform_view_name => 'kids_norm');END;/
The above normalization step used z-score normalization (subtract the mean and divide by the standard deviation) and excluded the cid column as it is a unique identifier column. The kids_norm view implements the same computations as those in kids_n subquery described above.

2. Create settings table
CREATE TABLE cl_settings (  setting_name  VARCHAR2(30),  setting_value VARCHAR2(30));
3. Create the clustering model
BEGIN  -- Populate settings table  INSERT INTO cl_settings VALUES    (dbms_data_mining.clus_num_clusters, 3);  COMMIT;  -- Create model  DBMS_DATA_MINING.CREATE_MODEL(    model_name          => 'KIDS_CL',    mining_function     => DBMS_DATA_MINING.CLUSTERING,    data_table_name     => 'KIDS_NORM',    case_id_column_name => 'CID',    settings_table_name => 'cl_settings');END;/
Technical Note IV - Clustering Model Probabilities
For a given record, Oracle Data Mining clustering models return the probability of the record belonging to a given cluster. However, this probability cannot be used to identify the closest record to the cluster centroid. The reason is that a clustering model returns the probability of the cluster given the data record (P(cluster | data)). This probability is similar to what we obtain from a classification model. From this perspective, clustering is treated as a classification problem where first the class labels (clusters) are identified and then the records are classified into clusters from a pre-defined set of clusters.

In order to compute the closest record to the centroid we would need P(data | cluster). This probability indicates how likely a given row is to belong to a cluster. In this case, there is no requirement that the data row belongs to any known cluster. Oracle Data Mining clustering models, however, do not return this probability.

References

Lewis, T. and Taylor, L.R. (1967), Introduction to Experimental Ecology, New York: Academic Press, Inc.