For a long time I have thought that we needed data mining books written for developers. Most data mining books are written for business or data analysts. Given that, it was a pleasant surprise to read Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran. The book provides a good discussion on data mining concepts anchored with interesting examples. It also provides guidance on when to use the techniques. I still think that there is room for improvement on how data mining and analytics should be presented to developers. However, the book is a great step forward in enabling developers to use analytics.
Toby's book has many Web 2.0 examples illustrated with data obtained from different web sites (e.g., blogs, Facebook, Zebo, etc). The examples are coded in Python or rely on freely available software for the data gathering and analytics. This got me thinking. If one wanted to illustrate how to create scalable and robust applications exploiting collective intelligence, would it not be interesting to replicate some of those examples utilizing the analytical and XML processing capabilities in the Oracle database? So I decided to do exactly that. This is the first post in a series showing how we can do the problems discussed in the book using technology in the Oracle database. Although most that is described in the book can be done using the Oracle database, this series will only showcase some of the examples in the book. To implement all of them would be like writing a new book.
Before we can start mining we need to collect data and store it in a database. There are many Web 2.0 websites with interesting data for mining. XML is the de facto data format returned by these websites. This post covers the basic steps on how to get this data by showing how to build an archive of entries from a list of RSS feeds. In later posts I will describe the mining.
First the good news, the Oracle RDBMS has a great set of features supporting XML and HTTP processing. Now the bad news, before writing this post I knew nothing about those features. So, after some digging around the web, I found lots of good information in a couple of articles. I have to say that I am really impressed by what can be done with the functionality in the database.
Below I describe some preparatory steps needed, how to query multiple RSS feeds at once, and finally how to automate the whole process to build a RSS feed archive.
Lucas Jellema has a nice series of articles describing how to build a RSS feed reader and archive (1, 2). I used them as a departing point. The comments in the first article pointed me in the right direction to move from a PL/SQL implementation, as described in the article, to the SQL version given below.
Here is a list of things that one need to do before running the code in Lucas' articles or the code below:
- Configure fine-grained access using ACL (access control lists) - this is needed for 11g
- Take care of proxy server - optional if you have a proxy server
- Configure the database character set
Configuring Fine-Grained Access
Oracle provides access to external network services through several PL/SQL APIs (UTL_TCP, UTL_SMTP, UTL_MAIL, UTL_HTTP and UTL_INADDR). Before Oracle 11g, access to these external services was based on whether a user was granted execute permissions on a specific package. Oracle 11g introduced fine-grained access to network services using access control lists (ACL). ACL allows to control which users access which network resources regardless of package grants.
Basically we need to do the following to be able to take advantage of the features in these packages:
- Create an access control list (ACL)
- Assign the ACL to the network
The assign_acl call above is very generous. It allows the DMUSER user to connect to any server (*) through port 80. For more details take a look at this nice article by Tim Hall.
If your machine sits behind a firewall you need to account for proxies by executing the following command in your session:
For more details see the documentation for the UTL_HTTP package (link).
Configuring Database Character Set
When retrieving XML from RSS feeds it is a good idea to use a database with a Unicode character set, for example AL32UTF8. This prevents getting an error when trying to persist the XML content to a table.
Querying Multiple RSS Feeds
First we create a table rss_feeds to hold the RSS feeds we want to track.
Next we insert a couple of RSS feeds in the table assigning a category to each feed, such as: news, technology, etc.
Now we can use the SQL statement below to query the RSS feeds and map the XML documents retrieved to a relational schema:
The above query has a number of interesting features. It uses the HTTPURITYPE function to extract the XML for each RSS feed in the rss_feeds table. The XMLTABLE statements then process the XML document returned from each feed to extract the desired pieces of information and convert the document into a relational source. The first XMLTABLE statement extracts the tile and the item elements from each XML document returned by p. Because each XML document contains multiple items, a second XMLTABLE statement is used to return the pieces of each item element extracted by the first XMLTABLE. This XMLTABLE statement also returns the category as XMLTYPE. This is needed because some sites have multiple category elements for a given item. The category elements are then extracted using the XMLQUERY function so that we have all these elements concatenated as a single string separated by commas. For alternative ways of doing the last bit take a look at this post by Mark Volkmann.
One problem with the above query is that if one of the RSS feed sites is down or is unreachable the whole query will fail. Given the nature of the web this is not an unlikely event if we are trying to collect feeds from many sites. To overcome this problem, we can implement a PL/SQL function that wraps the HTTPURITYPE function and validates the URL before invoking the HTTPURITYPE function. If the site for a URL does not respond, the function returns NULL. The code in the sections below follows this approach.
Building the RSS Feed Archive
To build the RSS feed archive we want to periodically run the above query and persist to a table only the entries that have not been stored already.
First let's create a table for holding the RSS feed entries:
Next we define a function to validate a RSS feed's URL and return the XML for the feed.
We can now use the following MERGE statement to merge only newer entries into the archive:
It is very impressive that a single SQL statement can retrieve entries from multiple RSS feeds, parse the XML, and merge the newer entries into a table for archival.
Finally let's setup a DBMS_SCHEDULER job to run this merge statement every fifteen minutes. First we create a PL/SQL procedure wrapping the above MERGE statement:
Besides the MERGE statement, the above procedure also includes some code for handling proxy servers as explained above.
Next we create the job that will execute this procedure every fifteen minutes:
For more details on DBMS_SCHEDULER see the Oracle documentation (link).
To check on status of the job we can use the following query:
To manually stop and/or drop the job we can use the following calls:
After a few days running the archive has accumulate quite a few entries:
In later posts I will show how to mine the archive.