# The SQL of Analytics 1 - Data Mining

What do we get when we combine the power of SQL with advanced analytics? The answer: a revolution. Over recent releases, the Oracle RDBMS has added support for advanced analytics to the SQL vocabulary. Because analytics is now part of that vocabulary, it can be used to answer all types of questions. Below I show examples of questions we can ask using this new vocabulary. At the end I also show how easy it is to create models and how to list the models available to a user.

This is Part 1 in a series on analytics and SQL. This post covers data mining. A future post will cover statistical functions.

The combination of simplified model creation (see below) with the flexibility of SQL is a powerful tool for developing intelligent applications. It moves analytics away from the exclusive domain of experts capable of navigating complex methodologies and into the hands of database and application developers. Developers can now use their creativity to combine and leverage data mining results in a myriad of ways. This changes the dynamics of how data mining is used in applications and opens the door to many more "intelligent" applications. That is why I believe this is revolutionary.

My goal with the following examples is to illustrate how easy it is to use data mining results to create interesting and powerful queries. These types of queries make it straightforward for developers to add data mining results to applications and reports. The examples progress from simple to more complex queries.

## Simple Prediction Query

Question:
Select all customers who have a high propensity to attrite (> 80% chance).

SQL Query:

    SELECT A.cust_name, A.contact_info
    FROM   customers A
    WHERE  PREDICTION_PROBABILITY(tree_model, 'attrite'
                                  USING A.*) > 0.8;
This question can be answered with a simple predictive SQL query. The PREDICTION_PROBABILITY function computes the probability that a customer will attrite using a customer attrition model called tree_model. This probability is used in the WHERE clause to select customers with a high propensity to attrite (probability greater than 80%). The model was built with the type of information present in the customers table. The USING clause accepts the * wildcard: it takes all the columns in customers and uses only those required by tree_model.

## Incorporating Other Relational Information

Question:
Select all customers who have a high propensity to attrite (> 80% chance) and have a customer value rating of more than 90.

SQL Query:

    SELECT A.cust_name, A.contact_info
    FROM   customers A
    WHERE  PREDICTION_PROBABILITY(tree_model, 'attrite'
                                  USING A.*) > 0.8
    AND    A.cust_value > 90;
It is also very easy to answer more complex questions by incorporating other relational information. In this example, customers are further filtered based on the information in the cust_value column.

## Multi-domain, Multi-database Analysis

Question:
Select all customers who have a high propensity to attrite (> 80% chance), have a customer value rating of more than 90, and have had a recent conversation with customer service regarding a Checking Plus account.

SQL Query:

    SELECT A.cust_name, A.contact_info
    FROM   customers A
    WHERE  PREDICTION_PROBABILITY(tree_model, 'attrite'
                                  USING A.*) > 0.8
    AND    A.cust_value > 90
    AND    A.cust_id IN
           (SELECT B.cust_id
            FROM   call_center@HQ_DB B
            WHERE  B.call_date BETWEEN '01-Jan-2005' AND '30-Jun-2005'
            AND    CONTAINS(B.notes, 'Checking Plus', 1) > 0);
This example builds on the previous query and shows how the many capabilities of the RDBMS can be combined to answer complex questions. The query combines the previous results with call center information from the call_center table, which resides in a remote database accessed through a database link (HQ_DB). The query also uses the Oracle Text CONTAINS operator to select customers who mentioned 'Checking Plus' in their conversations with customer service within a six-month window.

## Gaining Further Insight

Question:
Select all customers who have a high propensity to attrite (> 80% chance), have a customer value rating of more than 90, and have had a recent conversation with customer service regarding a Checking Plus account.
BTW, also tell me how you got this prediction.

SQL Query:

    SELECT A.cust_name, A.contact_info,
           PREDICTION_DETAILS(tree_model USING A.*)
    FROM   customers A
    WHERE  PREDICTION_PROBABILITY(tree_model, 'attrite'
                                  USING A.*) > 0.8
    AND    A.cust_value > 90
    AND    A.cust_id IN
           (SELECT B.cust_id
            FROM   call_center B
            WHERE  B.call_date BETWEEN '01-Jan-2005' AND '30-Jun-2005'
            AND    CONTAINS(B.notes, 'Checking Plus', 1) > 0);
This query adds an explanation, in the form of the rule used to make the decision, to the previous query. The function PREDICTION_DETAILS returns the node ID for the rule used in the decision. The actual rule can be obtained using the function DBMS_DATA_MINING.GET_MODEL_DETAILS_XML.
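For example, the content of the decision tree model, including its rules, can be retrieved as an XML document. This is a sketch; it assumes the tree_model name used in the queries above:

    -- Returns the decision tree model content as an XMLType document
    SELECT dbms_data_mining.get_model_details_xml('tree_model') AS model_xml
    FROM   dual;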

## Descriptive and Predictive SQL - Combining Multiple Models

Question:
For each segment based on our customer segmentation model, provide the number of customers in that segment as well as an average score for customer value and an average probability to attrite. Return this aggregate information ordered by the segments' overall propensity to attrite.

SQL Query:

    SELECT count(*) AS cnt,
           avg(PREDICTION_PROBABILITY(svmC_model, 'attrite'
                                      USING *)) AS avg_attrite,
           avg(cust_value_score) AS avg_cust_value
    FROM   customers
    GROUP BY CLUSTER_ID(clus_model USING *)
    ORDER BY avg_attrite DESC;
This example introduces another data mining SQL function (CLUSTER_ID). It shows how we can combine models of different types to answer a larger array of interesting questions.

## Function Composition (e.g., Missing Value Imputation)

Question:
Select the 10 customers who are most likely to attrite based solely on: age, gender, annual_income, and zipcode.

In addition, since annual_income is often missing, perform missing value imputation for the annual_income attribute using all of the customer demographics.

SQL Query:

    SELECT *
    FROM  (SELECT cust_name, cust_contact_info
           FROM   customers
           ORDER BY PREDICTION_PROBABILITY(
                        tree_model, 'attrite'
                        USING age, gender, zipcode,
                              NVL(annual_income,
                                  PREDICTION(svmR USING *)) AS annual_income) DESC)
    WHERE rownum < 11;
This example orders customers by their probability to attrite using an attrition model (tree_model). It also uses another model (svmR) for missing value imputation for the annual_income attribute.

## Auxiliary/Dependent Data Structures

Data mining models can also be used in auxiliary/dependent data structures, for example:

- Materialized views
- Functional indexes
This allows pre-computation of results for faster response time. The dependencies between a model and the dependent structures are also maintained. For example, if we drop a model used in a materialized view, the view is invalidated.
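As a sketch, a materialized view that pre-computes attrition probabilities might look like the following (the view name cust_attrite_mv is hypothetical; tree_model and customers are from the examples above):

    -- Pre-compute each customer's attrition probability
    CREATE MATERIALIZED VIEW cust_attrite_mv AS
    SELECT cust_id,
           PREDICTION_PROBABILITY(tree_model, 'attrite' USING *) AS attrite_prob
    FROM   customers;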

A functional index using a data mining model (svmR) can be created as follows:
    CREATE INDEX anninc_idx ON customers
      (PREDICTION(svmR USING *));
As new data are added to the customers table, the model is scored and the results are stored in the index. The index can then be used to speed up queries on customers that require the model's prediction. For example, suppose the model predicts customer value; then the following query orders customers by their value and could be used to select the top customers (order-by elimination):
    SELECT *
    FROM   customers
    ORDER BY PREDICTION(svmR USING *);
The index can also help in cases where the prediction is used in the WHERE clause (index-driven filtering). For example, the following query returns customers with value above 150000:
    SELECT cust_name
    FROM   customers
    WHERE  PREDICTION(svmR USING *) > 150000;
## Building Models
Oracle Data Mining has algorithms that provide powerful capabilities with very little work required from the user. It is possible to easily create models using these algorithms. As shown above, these models can then be used in arbitrarily complex queries, functional indexes, and materialized views. Two of these algorithms are: decision tree and support vector machine (SVM).

The decision tree algorithm can be used for generating profiles, making predictions, and constructing hierarchies. It is a fast algorithm that provides easy-to-interpret rules (more information here and here). We can create a decision tree model as follows:
    CREATE TABLE dt_settings (
      setting_name  VARCHAR2(30),
      setting_value VARCHAR2(30));

    BEGIN
      -- Populate settings table
      INSERT INTO dt_settings VALUES
        (dbms_data_mining.algo_name, dbms_data_mining.algo_decision_tree);
      COMMIT;

      DBMS_DATA_MINING.CREATE_MODEL(
        model_name          => 'sales_type_model',
        mining_function     => dbms_data_mining.classification,
        data_table_name     => 'sales_dataset',
        case_id_column_name => 'sales_id',
        target_column_name  => 'sales_type',
        settings_table_name => 'dt_settings');
    END;
The SVM algorithm provides support for regression, classification, and anomaly detection (more information here and here). We can create an SVM model for regression as follows:
    BEGIN
      DBMS_DATA_MINING.CREATE_MODEL(
        model_name          => 'sales_model',
        mining_function     => dbms_data_mining.regression,
        data_table_name     => 'sales_dataset',
        case_id_column_name => 'sales_id',
        target_column_name  => 'sales');
    END;
In general, it is useful to normalize the data before building an SVM model. This can be done easily with one of the functions in the DBMS_DATA_MINING_TRANSFORM package or through a query. An example of the latter can be found here.
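For illustration, a min-max normalization of the sales_dataset table using DBMS_DATA_MINING_TRANSFORM might look like the following sketch (the sales_norm table and sales_dataset_norm view names are hypothetical):

    BEGIN
      -- Create a table to hold the normalization parameters
      DBMS_DATA_MINING_TRANSFORM.CREATE_NORM_LIN('sales_norm');

      -- Compute min-max normalization terms for the numeric columns,
      -- excluding the case id and the target
      DBMS_DATA_MINING_TRANSFORM.INSERT_NORM_LIN_MINMAX(
        norm_table_name => 'sales_norm',
        data_table_name => 'sales_dataset',
        exclude_list    => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST('sales_id', 'sales'));

      -- Create a view that applies the normalization to the data
      DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN(
        norm_table_name => 'sales_norm',
        data_table_name => 'sales_dataset',
        xform_view_name => 'sales_dataset_norm');
    END;

The model can then be built against the normalized view instead of the original table.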

## Finding Available Models

The list of models available for scoring can be obtained from the DM_USER_MODELS view. This view can be filtered to provide a list of models that match a given criterion. The following query lists the demo models available in the DM_USER schema.
    SQL> SELECT name, function_name, algorithm_name FROM dm_user_models;

    NAME                  FUNCTION_NAME         ALGORITHM_NAME
    --------------------------------------------------------------------
    ABN_SH_CLAS_SAMPLE    CLASSIFICATION        ADAPTIVE_BAYES_NETWORK
    AI_SH_SAMPLE          ATTRIBUTE_IMPORTANCE  MINIMUM_DESCRIPTION_LENGTH
    AR_SH_SAMPLE          ASSOCIATION_RULES     APRIORI_ASSOCIATION_RULES
    DT_SH_CLAS_SAMPLE     CLASSIFICATION        DECISION_TREE
    KM_SH_CLUS_SAMPLE     CLUSTERING            KMEANS
    NB_SH_CLAS_SAMPLE     CLASSIFICATION        NAIVE_BAYES
    NMF_SH_SAMPLE         FEATURE_EXTRACTION    NONNEGATIVE_MATRIX_FACTOR
    OC_SH_CLUS_SAMPLE     CLUSTERING            O_CLUSTER
    SVMC_SH_CLAS_SAMPLE   CLASSIFICATION        SUPPORT_VECTOR_MACHINES
    SVMO_SH_CLAS_SAMPLE   CLASSIFICATION        SUPPORT_VECTOR_MACHINES
    SVMR_SH_REGR_SAMPLE   REGRESSION            SUPPORT_VECTOR_MACHINES
    T_SVM_CLAS_SAMPLE     CLASSIFICATION        SUPPORT_VECTOR_MACHINES