What Is Data Mining?

I have seen this question asked many times, and it has also started a popular thread in the Oracle Data Mining forum. In this post I'll discuss what data mining is and is not. I will also try to contrast data mining with other activities such as OLAP and statistics.

Data mining has been a buzzword for some time now. The term has been used and misused in many different contexts. Some definitions of data mining include:
"Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics, machine learning and pattern recognition" [1].

"The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [2].

"The science of extracting useful information from large data sets or databases" [3].
The above definitions highlight some key elements associated with the data mining activity:
  • Automatic discovery of patterns
  • Discovery of patterns that are not easy to detect (non-trivial)
  • Creation of actionable information
  • Focus on large data sets and databases
In summary, data mining is about finding patterns in data that are not easily spotted. Query and reporting has often been incorrectly called data mining. True data mining, however, implies building models, hence the notion of automatic discovery. Data mining models create abstractions (actionable information) of the data that can be used to answer questions that one could not ask of the data directly using just a query.
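To make the distinction between a query and a model concrete, here is a minimal sketch in Python. The customer records, the field meanings, and the one-rule threshold learner are all hypothetical and chosen only for illustration; real data mining algorithms are considerably more sophisticated. The query counts something already stored in the data, while the model learns a rule that can be applied to a customer who is not in the data.

```python
from collections import Counter

# Hypothetical toy data: (monthly_usage_hours, churned)
customers = [
    (2, True), (3, True), (5, True), (8, False),
    (12, False), (15, False), (4, True), (20, False),
]

# Query and reporting: answers a question asked of the data directly.
churn_counts = Counter(churned for _, churned in customers)


def fit_threshold(rows):
    """Learn a one-rule model: predict churn when usage < threshold.

    Tries the midpoint between each pair of adjacent distinct usage
    values and keeps the one with the fewest training errors.
    """
    values = sorted({usage for usage, _ in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((usage < t) != churned for usage, churned in rows)

    return min(candidates, key=errors)


# Model building: the threshold is discovered from the data, not typed in.
threshold = fit_threshold(customers)


def predict_churn(usage):
    """Apply the learned rule to a customer that is not in the data."""
    return usage < threshold


print(churn_counts[True], "customers churned")
print("learned threshold:", threshold)
print("would a customer with 6 usage hours churn?", predict_churn(6))
```

The count comes straight from the stored rows. The prediction for a 6-hour customer cannot be obtained that way, because no such row exists; it requires the abstraction the model provides.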

Data Mining and Statistics
There is a great deal of overlap between data mining and statistics. In fact, most of the techniques used in data mining can be placed in a statistical framework. However, data mining techniques are generally not counted among the traditional statistical techniques, which can give the impression that data mining and statistics are competing disciplines. Traditional statistical methods, in general, require a great deal of user interaction to validate the correctness of the model, and thus they are harder to automate. These methods also do not usually scale well to very large data sets. Data mining methods, on the other hand, are suitable for large data sets and can be more readily automated. In fact, data mining algorithms in many cases require large data sets to create quality models.

Data Mining and OLAP
On-Line Analytical Processing (OLAP) can be defined as fast analysis of shared multidimensional data [4]. OLAP and data mining are different but complementary activities. OLAP analysis may include time series analysis, cost allocations, currency translation, goal seeking, ad hoc multidimensional structural changes, non-procedural modeling, exception alerting, and data mining. However, most OLAP systems do not have inductive inference (or data mining) capabilities beyond support for time-series forecasting.

Multidimensional data is a key concept in OLAP. OLAP systems provide a multidimensional conceptual view of the data, including full support for hierarchies and multiple hierarchies. This view of the data is a natural way to analyze businesses and organizations. Data mining systems (and models), on the other hand, usually do not have a concept of dimensions and hierarchies.
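
As a rough illustration of dimensions and hierarchies, the sketch below rolls a sales measure up a geography hierarchy (city to country) and across a time dimension. The fact rows, the dimension names, and the roll_up helper are hypothetical; an actual OLAP engine would do this over a cube rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: (country, city, quarter, sales)
facts = [
    ("USA", "Boston", "Q1", 100),
    ("USA", "Boston", "Q2", 120),
    ("USA", "Austin", "Q1", 80),
    ("Canada", "Toronto", "Q1", 60),
    ("Canada", "Toronto", "Q2", 90),
]


def roll_up(rows, key):
    """Sum the sales measure over the dimension members selected by key."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# The same measure viewed at different levels of the geography hierarchy.
by_city = roll_up(facts, key=lambda r: (r[0], r[1]))
by_country = roll_up(facts, key=lambda r: r[0])

# Two dimensions at once: geography (country level) by time (quarter).
by_country_quarter = roll_up(facts, key=lambda r: (r[0], r[2]))

print(by_city)
print(by_country)
print(by_country_quarter)
```

Moving from by_city to by_country is a roll-up along the hierarchy; going the other way is a drill-down. A typical data mining model, by contrast, would see these rows as a flat table of attributes.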

Data mining and OLAP can be integrated in a number of ways. For example, data mining can be used to select the dimensions for a cube, create new attributes for a dimension, and create new measures for a cube. OLAP can be used to analyze data mining results at different levels of granularity.
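
One of these integrations, creating a new attribute for a dimension, can be sketched as follows. The customer data is made up, and the tiny one-dimensional two-cluster k-means is only a stand-in for whatever clustering algorithm a real system would use. The cluster label becomes a "segment" attribute, and spend is then aggregated by region and segment as an OLAP tool might do.

```python
from collections import defaultdict

# Hypothetical customers: (name, region, annual_spend)
customers = [
    ("ann", "East", 120), ("bob", "East", 150), ("cat", "West", 900),
    ("dan", "West", 1100), ("eve", "East", 1000), ("fay", "West", 100),
]


def two_means(values, iterations=10):
    """Tiny 1-D k-means with k=2, seeded at the min and max values.

    Assumes the input contains at least two distinct values.
    """
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return lo, hi


# Data mining step: discover two spending clusters.
lo, hi = two_means([spend for _, _, spend in customers])


def segment(spend):
    """New dimension attribute derived from the clustering result."""
    return "low" if abs(spend - lo) <= abs(spend - hi) else "high"


# OLAP-style step: aggregate the spend measure by (region, segment).
cube = defaultdict(int)
for _, region, spend in customers:
    cube[(region, segment(spend))] += spend

for cell, total in sorted(cube.items()):
    print(cell, total)
```

The segment attribute did not exist in the source data; it was produced by the mining step and then used like any other dimension member.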

In future posts I will describe how to mine OLAP data and to analyze data mining results using OLAP techniques.

Readings: Business intelligence, Data mining, Oracle analytics
Posted on January 9, 2006.