December 03, 2007

Hierarchical Course Clusters from Course Profiles App

A couple of weeks ago, I got hold of a copy of the O'Reilly book Programming Collective Intelligence by Toby Segaran. For anyone wanting to get started with 'intelligent' techniques for clustering data, building simple recommendation systems and so on, this book is a joy to play with.

Filled with examples in Python (which is a language I've never really played with before... powerful, isn't it?! ;-) you can get 'sort of' working recommender systems up and running in minutes using only a few lines of code...

So - I had a bit of a tinker last night with our original Facebook Course Profiles app data set, collected from 700 or so anonymised users (the same data set I used to generate the treemap displays). The Collective Intelligence book has an example of building a recommendation system based on delicious links that uses 'users' and 'urls' as the source data... so it was pretty trivial to recast the example to use 'users' and 'course codes' as the input data and build a proof of concept course recommendation system on that basis.
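To give a flavour of the idea, here is a minimal sketch of a 'users' and 'course codes' recommender. It is not the book's code: it assumes the data is a dict mapping each user to a set of course codes, it swaps in a simple Jaccard overlap score as the similarity measure, and the function names and course codes are made up for illustration.

```python
def jaccard(a, b):
    """Overlap between two sets of course codes, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, user_courses, top_n=5):
    """Suggest courses declared by users whose course lists overlap with `user`'s.

    user_courses: dict mapping user id -> set of course codes (assumed layout).
    Returns a list of (course_code, score) pairs, best first.
    """
    mine = user_courses[user]
    scores = {}
    for other, theirs in user_courses.items():
        if other == user:
            continue
        sim = jaccard(mine, theirs)
        if sim <= 0:
            continue  # nothing in common, so this user tells us nothing
        # Every course the other user has that we lack gets weighted by similarity
        for code in theirs - mine:
            scores[code] = scores.get(code, 0.0) + sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hypothetical data, purely for illustration
demo = {
    "u1": {"M101", "T175"},
    "u2": {"M101", "T175", "T183"},
    "u3": {"A100", "B200"},
}
suggestions = recommend("u1", demo)
```

Here `u1` shares two courses with `u2` and none with `u3`, so the only suggestion is the one course `u2` has that `u1` does not.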

Liam has got another recommender system working in the dev version of the Facebook app already, so it'll be interesting to see if he used any of the heuristics I came up with for reducing the complete data set to a subset that was good for making recommendations. One strategy I used that looked okay at first glance was to rule out users who had declared fewer than five courses, and courses that occurred fewer than five times, from the recommender set.
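That filtering heuristic could be sketched roughly as follows. The dict-of-sets data layout and the function name are assumptions, as is the decision to count course occurrences over the complete data set before dropping any users (the order of the two filters could reasonably go the other way).

```python
from collections import Counter


def reduce_dataset(user_courses, min_courses=5, min_occurrences=5):
    """Drop sparse users and rare courses from the recommender set.

    user_courses: dict mapping user id -> set of course codes (assumed layout).
    """
    # How many users declared each course, counted over the full data set
    counts = Counter(code for courses in user_courses.values() for code in set(courses))
    popular = {code for code, n in counts.items() if n >= min_occurrences}

    reduced = {}
    for user, courses in user_courses.items():
        if len(courses) < min_courses:
            continue  # user declared too few courses to be useful
        kept = set(courses) & popular
        if kept:
            reduced[user] = kept
    return reduced


# Hypothetical data: six users share five courses, one user has only two,
# and course "Z" is declared by a single user
demo = {"u%d" % i: {"A", "B", "C", "D", "E"} for i in range(6)}
demo["u0"] = demo["u0"] | {"Z"}
demo["u6"] = {"A", "B"}
reduced = reduce_dataset(demo)
```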

The book also shows how Python can be used to generate simple visualisations of clustered data. The first clustering algorithm explained in the book tries to cluster similar blogs based on the similarity of the words used in them. Again, it was easy to map 'users' and 'course codes' on to the 'urls' and 'words' features used in the example, and then reuse the reduced recommender data set (that is, the set based on users who had declared at least five courses, each of which had also been declared at least five times) to see how courses were clustered by the Facebook app users.
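For anyone who hasn't seen hierarchical clustering before, here is a small sketch of the general approach: start with every course in its own cluster, then repeatedly merge the two closest clusters until only one is left. This is not the book's implementation. It assumes each course is described by the set of users who declared it, uses Jaccard distance with average linkage, and represents the result as nested tuples; all names and course codes are hypothetical.

```python
from itertools import combinations


def invert(user_courses):
    """Turn user -> courses into course -> set of users who declared it."""
    course_users = {}
    for user, courses in user_courses.items():
        for code in courses:
            course_users.setdefault(code, set()).add(user)
    return course_users


def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 1.0


def leaves(tree):
    """List the course codes under a cluster (a code string or a nested pair)."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def hcluster(course_users):
    """Agglomerative clustering; returns nested 2-tuples of course codes."""
    clusters = sorted(course_users)  # each course starts as its own cluster
    if not clusters:
        return None

    def dist(c1, c2):
        # Average linkage: mean distance over all cross-cluster course pairs
        pairs = [(x, y) for x in leaves(c1) for y in leaves(c2)]
        total = sum(jaccard_distance(course_users[x], course_users[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


# Hypothetical data: two groups of users with no courses in common
demo = {
    "u1": {"M101", "M102"},
    "u2": {"M101", "M102"},
    "u3": {"T100", "T200"},
    "u4": {"T100", "T200"},
}
tree = hcluster(invert(demo))
```

With this toy data the two maths-style codes end up in one branch and the two technology-style codes in the other. The brute-force search for the closest pair is slow on large inputs, but is fine for a data set of a few hundred users.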

Unfortunately, I haven't been able to get the Python JPG generating libraries working on my Mac yet (I'm not convinced everything is there...:-( but as the first data set I wanted to try to visualise was a hierarchical clustering, it struck me that I could use the Freemind viewer if I put the clustered hierarchy results into an appropriate XML form... here's what the result looks like...
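As a rough illustration of the XML step, the sketch below writes a nested cluster tree out as a Freemind-style map: a `map` element containing nested `node` elements, each carrying a `TEXT` attribute. The nested-tuple input format, the "cluster" label used for branch nodes and the version string are all assumptions, not a record of the file actually generated here.

```python
import xml.etree.ElementTree as ET


def add_nodes(parent, tree):
    """Recursively add a cluster tree (code string or tuple of subtrees) under parent."""
    if isinstance(tree, str):
        ET.SubElement(parent, "node", TEXT=tree)  # leaf: a single course code
        return
    branch = ET.SubElement(parent, "node", TEXT="cluster")  # assumed branch label
    for child in tree:
        add_nodes(branch, child)


def to_freemind(tree, version="0.8.0"):
    """Serialise the tree as a Freemind-style XML string (version is an assumption)."""
    root = ET.Element("map", version=version)
    add_nodes(root, tree)
    return ET.tostring(root, encoding="unicode")


# Hypothetical cluster tree with made-up course codes
xml_text = to_freemind((("M101", "M102"), "T100"))
```

Saving `xml_text` to a file with a `.mm` extension should give something the Freemind viewer can open, with each merge point shown as a collapsible branch.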

From the course codes, the clusters appear to be pretty sensible to me (bearing in mind the data set is quite small).

If you want to have a play, here's the Freemind file.

Next coffee break, I'll have a go with the K-means clustering and maybe look at the data in a Many Eyes network diagram...



Posted by ajh59 at December 3, 2007 01:02 PM