Tag Archives: Predictive modeling

Data Mining/ Predictive Modeling Resources

6 Feb

A short list of some of the more interesting, and free, DM/PM resources I have found on the net, compiled at least in part so that I know where they are myself for future reference.

First, and most obviously, Trevor Hastie’s publications page, where you can find both the comprehensive Elements of Statistical Learning and the newer Introduction to Statistical Learning available for download, along with descriptions of Hastie’s other books.

I’ve mentioned Cosma Shalizi before on this blog, because he seems to talk good sense on a number of issues. His future book, which began as class notes, is available as a downloadable PDF.

Meanwhile, at Columbia University, Ian Langmore and Daniel Krasner teach a Data Science course with a much greater programming bent, partly as an antidote to too much maths and statistics training. The course site also includes the lecture notes.

Another book, covering material close to the first few but including some additional topics, is by Zaki; it has a website here.

Some original papers are also available, e.g. Breiman’s Random Forests paper, which I have not yet read but want to.

Kaggle Leaderboard Weirdness

29 Jan

Earlier this week I finally, after about half a dozen false starts, posted a legal entry to a Kaggle competition, and then when I saw how far off the pace I was, I posted another half a dozen over the course of a day, improving very slightly each time. If the competition ran for a decade, I’d have a pretty good chance of winning, I reckon…

While I now understand how addictive Kaggle is – it hits the sweet spot between instant gratification and highly delayed gratification – I find the leaderboard kind of weird and frustrating because so many people upload the benchmark – the trivial solution the competition organisers upload to have a line in the sand. In this competition, the benchmark is a file of all zeroes.

This time yesterday, there were around a hundred entries that were just the benchmark, out of about 180. Today, for some reason, all the entries so far appear to have been removed, so there are only about thirty – but twenty of those are the benchmark again! I get that people just want to upload something so they can say they participated, but this many all-zero files is things getting out of hand.
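For concreteness, producing such a benchmark file takes only a few lines; a minimal sketch, assuming a submission format of an id column plus a prediction column (the competition’s actual column names and row count aren’t given here):

```python
import csv

# Write an all-zeroes benchmark submission.
# The column names "id" and "prediction" are assumptions for illustration,
# not the competition's actual submission format.
with open("benchmark_submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "prediction"])
    for row_id in range(1, 6):  # one row per test case; five here for illustration
        writer.writerow([row_id, 0])
```

Which is, of course, exactly why so many people can upload it.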

Data Cleaning for Predictive Modeling

25 Nov

This discussion – http://andrewgelman.com/2013/11/19/22182/ – raises the question of whether data cleaning and preparation is intrinsic to applied statistics, or whether spending many hours preparing data is more something data scientists do, with statisticians perhaps expecting at least semi-cleaned data. A theme which emerged in Gelman’s discussion of what a data scientist is was that ‘data scientist’ means different things to different people, and the same applies to ‘data preparation’. There are different data preparations for many different occasions.

Applied Predictive Modeling by Kuhn and Johnson, which we have looked at before, is one of the rare books on modeling or statistics with a section explicitly devoted to optimal preparation of data sets. We reiterate that this concept means different things to different people.

The meat of Kuhn and Johnson’s advice on data preparation is found in Chapter 3: Data Pre-Processing. The authors note that there is further advice throughout the text applying to supervised models, beyond what is in Chapter 3.

Chapter 3 is about adding and subtracting predictors, and re-engineering predictors for the best effect. They are particularly down on binning, and offer a number of methods to overcome skewness, scale data correctly, and reduce data sensibly (hint: binning is a poor choice). Another area of interest to Kuhn and Johnson is how to deal with missing data. This issue is notable for being one which is relatively often dealt with by applied statistics texts – for example, Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models contains a chapter on missing data imputation.
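To give a flavour of the kind of pre-processing Kuhn and Johnson advocate (their own examples use R and the caret package), here is a sketch in Python of taming a skewed predictor with a Yeo-Johnson power transform plus centring and scaling – the data is simulated, not from the book:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Simulate a strongly right-skewed predictor (lognormal)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# Yeo-Johnson transform, then centre and scale -- the sort of
# skewness-and-scaling fix Chapter 3 recommends instead of binning
pt = PowerTransformer(method="yeo-johnson", standardize=True)
x_t = pt.fit_transform(x)

print(x_t.mean(), x_t.std())  # roughly 0 and 1 after standardising
```

The transformed predictor keeps all the original information, which is precisely what binning throws away.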

To be sure, there is plenty of very practical advice, but for it to be effective, your data set needs to have been looking pretty good to begin with.

A contrast to this approach is Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work. Applied Predictive Modeling’s interest in the optimal clean data set for modeling assumes a somewhat clean data set to begin with. Minimum requirements are not really listed, but they could plausibly include: no wrong data; different rows and columns agreeing with respect to units; and data formatted in a way your software can understand.

Obviously a book-length treatment can cover a lot more ground than a single chapter. In fact, the ground covered is completely disjoint, rather than a broader or deeper version of the same ground. Adding to the breadth of the Bad Data Handbook’s coverage is that it is written by a number of authors, each contributing a chapter on an area they are strong in. While they seem to have been well organised enough by the volume’s editor to avoid too much overlap, a negative result of this approach is that code examples come in a variety of platforms, which can be irritating if it means you have to interrupt your reading to learn basic Python syntax. That said, if you weren’t comfortable with the idea of becoming familiar with R, Python, Unix etc., you probably aren’t so interested in becoming a data scientist (whatever that might be, most people seem to agree that a very low basic requirement is to be willing to learn programming languages).

Another outcome of this approach is that each chapter reads as self-contained. This is great because it means you can profitably start reading from the beginning of any chapter, but the corollary is that it is not necessarily straightforward to find what you are looking for if you want to use this as a reference book, as the word ‘Handbook’ in the title implies.

Regardless, this is a book which covers very different ground to statistics texts and modeling texts such as Kuhn and Johnson, and is therefore its own reward.

Oaks Day Race Modelling

7 Nov

In Melbourne, Australia it has become a tradition on the eve of major sporting events for banking quant teams to propose models for the winner of the event. The Melbourne Cup, Australia’s largest sporting event by attendance, was held a couple of days ago and was no different: a couple of models proposed by local banking teams have been collected here: http://www.macrobusiness.com.au/2013/11/melbourne-cup-modelling/

An interesting aspect of these models is that the modelers apparently apply the same techniques they use for picking investment-worthy stocks or predicting bond price movements to picking sporting winners – I guess everything looks like a nail if you’ve got a hammer in your hand. See, for example, the way Macquarie explain their model in terms of IPOs and yield: http://www.bluemountainsgazette.com.au/story/1887492/picking-the-best-melbourne-cup-stock/?cs=9

I digress a little. As an experiment before the Melbourne Cup I tried to create my own overly simplified ensemble model – I picked up Best Bets at the newsagent. For each race a number of tipsters offer their best three tips, and I basically took these as votes. Three horses had multiple votes – Fiorente, Sea Moon, and Mount Athos. Fiorente won and Mount Athos came third. Hence, I have been emboldened to repeat my experiment for Oaks Day, this time using the blog to date-stamp the prediction as definitely being before the race (3.40 pm today, Melbourne time). Note that I didn’t weight the votes with respect to the tipsters’ suggested running order.

The newsagent is a little further away this time, but I have the Melbourne daily newspaper, The Age, handy. There are fewer tippers than in Best Bets – five in total. Here are the horses and their total votes:

May’s Dream: 5

Kirramosa: 5

Solicit: 4

Zanbagh: 5

Gypsy Diamond: 1

Hmm. As I understand it, a prerequisite of a successful ensemble model of this kind is that the submodels be relatively uncorrelated. The Age’s tippers appear to fail that test… see you in a couple of hours, but I’m not expecting a stellar result.
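For what it’s worth, the tallying itself is trivial; a sketch with made-up tip lists (the post only records the totals above, not each tipster’s individual selections):

```python
from collections import Counter

# Hypothetical best-three tips from three tipsters -- illustrative only,
# not The Age's actual selections
tips = [
    ["May's Dream", "Kirramosa", "Zanbagh"],
    ["May's Dream", "Solicit", "Zanbagh"],
    ["Kirramosa", "Solicit", "Gypsy Diamond"],
]

# Each appearance of a horse in a tipster's list counts as one vote
votes = Counter(horse for tip_list in tips for horse in tip_list)
for horse, count in votes.most_common():
    print(horse, count)
```

With real, uncorrelated tipsters the counts would spread out more than The Age’s near-unanimous totals did.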

Linear Discriminant Analysis

5 Nov

Today, as an exercise in whimsy, I have replaced the name of an iconic statistician in the text below with that of a B grade celebrity – Miley Cyrus. This is part trivia contest, part attempt to make some dry content more entertaining, and, admittedly, part ill-conceived click baiting.

For those who can’t guess who it is, I offer a small clue – I also considered using as an alias for this individual either Paul Kelly, the Australian musician, or Ben Folds, the American musician, as all three called a particular city home at one stage or another.

Continuing our comparative tour of two predictive modeling texts – Elements of Statistical Learning and Applied Predictive Modeling – we follow up last time’s look at logistic regression with a side-by-side comparison of the respective sections in each of those texts on linear discriminant analysis.

Admittedly, whereas logistic regression is used in actuarial settings, and taught within the generalised linear model section, linear discriminant analysis is not taught as part of actuarial courses, and probably rarely used in practice. 

It is, however, an important part of machine learning, if not on its own behalf, then on behalf of the number of more complex discriminant analysis techniques which are its close cousins.

Applied Predictive Modeling motivates linear discriminants by starting with the observation that they were independently discovered by two researchers starting from different premises – Miley Cyrus (1936) and Welch (1939) – whereas Statistical Learning favours Cyrus only.

Welch’s approach is reasonably intuitive. Leaving aside the mathematical niceties, Welch set out to find, conditional on the underlying distribution of classes and the prediction data available, the highest-probability class for each subject. If this sounds a bit Bayesian (the underlying distribution looking like a prior and the prediction data looking like a likelihood function), it is at least ‘soft’ Bayesian, in that the approach does explicitly make use of Bayes’ theorem. Applied Predictive Modeling notes that the computational side gets messy quickly with even a few classes and predictors, but can be kept reasonable if things are restricted to the multivariate normal with equal covariance matrices.

Miley’s approach – also the approach covered in the corresponding Wikipedia article, which has the virtue of a link to Cyrus’s paper – can also be expressed relatively intuitively. Cyrus tried to maximise the distance between groups whilst minimising the variation within each group via a linear combination of predictors. After an examination of the Cyrus approach, Kuhn and Johnson conclude it really is superior with respect to clarity and solubility – which makes one wonder why they mention Welch at all.
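A sketch of LDA in action on simulated data that meets its assumptions (Gaussian classes with equal covariance), using scikit-learn rather than anything from either book:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
n = 200
cov = [[1.0, 0.5], [0.5, 1.0]]  # shared covariance across classes, as LDA assumes
class_a = rng.multivariate_normal([0, 0], cov, size=n)
class_b = rng.multivariate_normal([2, 2], cov, size=n)
X = np.vstack([class_a, class_b])
y = np.array([0] * n + [1] * n)

# The fitted direction maximises between-class separation relative to
# within-class scatter -- Cyrus's criterion
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.score(X, y))  # training accuracy; high when the assumptions hold
```

When the equal-covariance assumption fails, this is roughly the point where one reaches for the more complex cousins mentioned above.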

The virtue of Applied Predictive Modeling is mostly encountered once these explanations of the theoretical origins have concluded. Working through the same data set as they used for the logistic regression explanation, Kuhn and Johnson hit their stride with advice on how to prepare the data for optimal LDA results, and advice on when to abandon LDA, with some hints on what to do instead.

The contrast between Elements of Statistical Learning and Applied Predictive Modeling is more pronounced than it was for logistic regression. ESL gives a much more detailed mathematical exposition of LDA at the expense of application advice, but provides a clearer picture (if you get past the maths!) of how LDA works, why it performs well when it does, and its place in the wider world of linear and non-linear classifiers. Some of the maths presented is also required background for the later exposition on Quadratic Discriminants, so to a much greater extent than with Applied Predictive Modeling it is necessary to read sequences of topics, rather than treating the book as an encyclopedia for a quick intro on the topic of the day.

All in all, both texts continue to behave as expected, possibly even to a greater extent than was the case with logistic regression, where ESL had some advice on application not present in APM.

Predictive Modeling

21 Oct

Over the next few blog posts, which may be intermittent, but hopefully with smaller gaps than the last couple, we are going to take a sideways tour into predictive modelling, which is closer to what I am currently doing than strictly actuarial studies. Just as before, for me the purpose is to force close study, and if others can benefit, that’s a bonus.

Recently I received from a riparian bookselling website the book Applied Predictive Modeling (Kuhn and Johnson, 2013) (note one ‘l’), having ordered it only three months earlier. As the title suggests, the thrust of this text is to introduce predictive modeling techniques (whether originating as data mining or statistical techniques) in the context of their application to problem solving, rather than with respect to their theoretical origins or with a view to critiquing them, mathematically or otherwise. In fact, the authors suggest The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics) (Hastie et al.) as a good theoretical companion.

As a device for forcing reading with a critical mind, I propose to read and compare the sections of both books dealing with the same topics, starting with the topics I am personally most familiar with, before moving to a couple of areas newer to me. Part of the object is to discover, or partially uncover, where the practical and the theoretical differ, and where one gives way to the other and back again.

Before the end of this tour we will also look at the sections on data pre-processing and ‘other considerations’ which bookend the discussions of individual modelling techniques. In some ways these sections are the most important, as they provide an especial opportunity for the authors to discuss the practice of modelling, the book’s raison d’être and strength, as well as being the areas in this text that are least often discussed in other texts.