Archive | November, 2013

Data Cleaning for Predictive Modeling

25 Nov

This discussion raises the question of whether data cleaning and preparation is intrinsic to applied statistics, or whether spending many hours preparing data is more something data scientists do, with statisticians possibly expecting at least semi-cleaned data. A theme which emerged in Gelman’s discussion of what a data scientist is was that ‘data scientist’ means different things to different people, and the same applies to ‘data preparation’. There are different ‘data preparations’ for many different occasions.

Applied Predictive Modeling by Kuhn and Johnson, which we have looked at before, is one of the rare books on modeling or statistics which explicitly has a section devoted to optimal preparation of data sets. We reiterate that this concept means different things to different people.

The meat of Kuhn and Johnson’s advice on data preparation is found in Chapter 3: Data Pre-Processing. The authors note that further advice specific to supervised models appears throughout the text, in addition to the advice in Chapter 3.

Chapter 3 is about adding and subtracting predictors, and re-engineering predictors for the best effect. The authors are particularly down on binning, and offer a number of methods to overcome skewness, scale predictors correctly, and reduce the data sensibly (hint: binning is a poor choice). Another area of interest to Kuhn and Johnson is how to deal with missing data. This issue is notable for being one which applied statistics texts deal with relatively often – for example, Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models contains a chapter on missing data imputation.
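To make the skewness and scaling points a little more concrete, here is a minimal Python sketch. It is not taken from the book: the predictor values are made up for illustration, the function names are my own, and a plain log transform stands in for whichever transformation you would actually choose. It measures skewness before and after the transform, then centers and scales the result.

```python
import math
import statistics


def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def center_and_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical right-skewed predictor (illustrative values only).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]

# A log transform pulls in the long right tail; it needs strictly positive values.
logged = [math.log(v) for v in raw]

# After centering and scaling, the predictor has mean 0 and standard deviation 1.
scaled = center_and_scale(logged)

print("skewness before:", round(skewness(raw), 2))
print("skewness after log:", round(skewness(logged), 2))
```

The transform does not remove the skew entirely, it only reduces it, which is usually the realistic expectation.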

To be sure, there is plenty of very practical advice, but for it to be effective, your data set needs to have been looking pretty good to begin with.

A contrast to this approach is Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work. Applied Predictive Modeling’s interest in the optimal clean data set for modeling assumes a somewhat clean data set to begin with. Minimum requirements are not really listed, but they could plausibly include no wrong data, data in different rows and columns agreeing with respect to units, and data being formatted in a way that your software can understand.

Obviously a book-length treatment can cover a lot more ground than a single chapter. Actually, the ground covered is completely disjoint, rather than being broader or deeper ground that overlaps. Adding to the breadth of the Bad Data Handbook’s coverage is that the text is written by a number of authors, each contributing a chapter on an area they are strong in. While they seem to have been organised well enough by the volume’s editor to avoid too much overlap, a negative result of this approach is that code examples come in a variety of platforms, which can be irritating if it means you have to interrupt your reading to learn basic Python syntax. That said, if you aren’t comfortable with the idea of becoming familiar with R, Python, Unix etc., you probably aren’t so interested in becoming a data scientist (whatever that might be, most people seem to agree that a very low basic requirement is to be willing to learn programming languages).

Another outcome of this approach is that each chapter is self-contained. This is great because it means that you can profitably start reading from the beginning of any chapter, but the corollary is that it is not necessarily straightforward to find what you are looking for if you want to use this as a reference book, as the word ‘Handbook’ in the title implies.

Regardless, this is a book which covers very different ground to statistics texts and to modeling texts such as Kuhn and Johnson, and is therefore its own reward.

Empty Vessels

18 Nov

Influence is big at the moment, partly thanks to LinkedIn’s promotion of Influencers (who usually aren’t) – essentially people who write short career oriented inspirational stuff that is piped into your email inbox. Which is all good, but when a word is used incorrectly, or at best, loosely, its meaning is diluted, and when you want to use it for its original meaning, it doesn’t work as well.

To be clear, you are influential if a great number of people in your field act differently because of work you have done. Picasso was an influential artist because many prominent artists paint differently because of his example. Bob Dylan is influential because a great many singer songwriters changed their methods because of his example and/or found audiences which were created because Bob Dylan came first. While the idea of influence is slippery and subjective, in those cases, and others like them, we can make some progress towards objectivity using this definition.

A couple of weeks ago, Time magazine made itself a large and slow moving target by publishing a ‘Gods of Food’ cover story, featuring 3 males and no females on the cover, and 9 males in a list of 13 inside. See here for some discussion of associated brickbats:

Another one of the many Big Data/ Data Science/ Predictive Modeling bloggers has flung themselves onto the same hand grenade by suggesting a list of ‘Top 10 Most Influential Data Scientists’ which includes no women at all.

Note that the first comment is a plea from someone whose name looks female for the inclusion of women, with another comment from the same person that has been deleted. I like to think that that comment was deleted because it was a howl of outrage, too raw in its emotion and intemperate in its language to be let loose on the sheltered data science community. But I have no data to support this assertion, and will move on…

To me what is striking about the omission of women from this list is that the criteria were so loose that the omission would have been easy to avoid. After all, missing from the criteria is any sense that evidence of influence is required (in terms of people who call themselves ‘data scientists’, or are called that by others, doing the work differently due to the example of these ten guys). Which is not saying that these 10 guys aren’t influential in that sense, just that the list was created without checking whether they were influential or not.

While the omission is glaring and wants addressing, I’m not so upset about that part as about this being an example of how you can’t move around data science linked websites, blogs, fora, etc., as you might want to do to find datasets (which is what I was doing when I accidentally found this blog post), programming hints, etc., without encountering stuff that is dangerously close to spam. The rest of the Deep Data Mining blog, for example, appears to be crammed with advice on how to use different platforms, especially database platforms, to better advantage. Why not stick to that?

Doing More with R

17 Nov

I was reading the following post by a guy who really doesn’t like brackets, and as someone who usually programs in F#, he has introduced F#-like functions into R to avoid them:

I’m not so worried about how many brackets I have to endure in my life, but I would like R to do things that I can do in other platforms. For example, I would like R to do more of what I can do in SQL.

A lot of the data I work with is already in an SQL Server database, so it makes sense, seeing as SQL is fairly straightforward to use, to do a considerable amount of data processing there, before exporting a .csv to R.

More recently, I have been given .csv files to play with outside of the normal database, but with the job of joining them to tables within the database before analysing in R. My default has been to upload these new files into SQL Server, do all my joins, make a new .csv and load it into R for analysis. Which is still fine, but if the balance moves towards more tables provided as .csv files, and fewer as pre-existing tables in SQL Server, it’s time to figure out how to join the tables in R.

And there is a way to join the tables in R that is very simple, and that I found after a five second search on Stack Exchange (where I have never needed to post a question to find an answer, as someone has always already done that for me):

The conclusion from the answer: the ‘merge’ command will do the job, and the syntax is in many ways easier to follow than the SQL syntax!
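For anyone who wants to see what such a join actually does under the hood, here is a small Python sketch of an inner join on a shared key column. In R itself the call would be along the lines of merge(policies, claims, by = "policy_id"), which by default keeps only the keys found in both tables. The table names, column names and values below are hypothetical, and this is an illustration of the idea rather than how R implements it.

```python
def inner_join(left, right, key):
    """Join two lists of dict rows on a shared key, keeping only matching keys."""
    # Index the right-hand table by key so each left row can find its matches quickly.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for lrow in left:
        # A left row with no match on the right is dropped (inner join behaviour).
        for rrow in index.get(lrow[key], []):
            joined.append({**lrow, **rrow})
    return joined


# Hypothetical tables, e.g. one exported from the database and one supplied as a .csv.
policies = [
    {"policy_id": 1, "state": "VIC"},
    {"policy_id": 2, "state": "NSW"},
    {"policy_id": 3, "state": "QLD"},
]
claims = [
    {"policy_id": 1, "claim": 500.0},
    {"policy_id": 1, "claim": 120.0},
    {"policy_id": 3, "claim": 75.5},
]

joined = inner_join(policies, claims, "policy_id")
for row in joined:
    print(row)
```

Policy 2 has no claims, so it does not appear in the result, and policy 1 appears twice because it matches two claim rows. That is the same behaviour you would expect from an SQL INNER JOIN.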


Divergent Opinions

12 Nov

Big Data and predictive analytics are debated and discussed almost endlessly in the interwebs. One of the threads that runs through these discussions relates to how much maths and statistics one needs to know (although sometimes the question seems to be more like ‘how little can I get away with?’) to practice data science/predictive analytics, etc.

Actual maths and stats people come down on the side of ‘a little knowledge is a dangerous thing’, and hold that people should try to know as much as possible. See here:

But knowing enough statistics to be called a statistician could lead to being seen as out of touch with Big Data:

Particularly if contemporary, highly computer literate statisticians who are widely admired in their field admit in public they don’t know anything about Hadoop:

Maybe this guy has the answer – ignore statistical theory and training, learn the least amount of programming to start hacking, and just teach yourself with whatever data comes to hand:

Well, not exactly, but statistics is still kind of relegated to being something you ‘learn basics about’. I don’t think that posts 1, 2 and 3 can possibly be talking about the same discipline?

From my point of view, as someone who still pinches themselves that they get to do predictive modelling as a for real job, with only a Master’s degree in statistics, experience in business from before I did stats, and a really poor command of VB6 as my only qualifications (although I learned a lot of SQL very quickly when I started this job ‘cos otherwise I had nothing to analyse), I can only say that with respect to maths and statistics I wish I knew more, with respect to machine learning I wish I knew more, and with respect to hacking I wish I knew more.

How much is enough? All of it isn’t enough.

Oaks Day Race Modelling

7 Nov

In Melbourne, Australia it has become a tradition on the eve of major sporting events for banking quant teams to propose models for the winner of the event. The Melbourne Cup, Australia’s largest sporting event by attendance, was held a couple of days ago, and was no different: a couple of models proposed by local banking teams have been collected here:

An interesting aspect of these models is that the modelers apparently apply the same techniques they use for picking investment worthy stocks or predicting bond price movements to picking sporting winners – I guess everything looks like a nail if you’ve got a hammer in your hand. See, for example, the way Macquarie explain their model in terms of IPOs and yield:

I digress a little. As an experiment before the Melbourne Cup I tried to create my own overly simplified boosting model – I picked up Best Bets at the newsagent. For each race a number of tipsters offer their best three tips, and I basically took these as votes. Three horses had multiple votes – Fiorente, Sea Moon, and Mount Athos. Fiorente won and Mount Athos came third. Hence, I have been emboldened to repeat my experiment for Oaks Day, this time using the blog to date stamp the prediction as definitely being before the race (3.40 pm today, Melbourne time). Note that I didn’t weight the votes according to where each horse sat in a tipster’s running order.

The newsagent is a little further away, but I have the Melbourne daily newspaper, The Age, handy. It has fewer tippers than Best Bets, for a total of five. Here are the horses and their total votes:

May’s Dream: 5

Kirramosa: 5

Solicit: 4

Zanbagh: 5

Gypsy Diamond: 1

Hmm. As I understand it, a prerequisite of a successful ensemble model of this kind is that the submodels are not too strongly correlated. The Age’s tippers appear to fail that test… see you in a couple of hours, but I’m not expecting a stellar result.
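For what it is worth, the tally itself is trivial to script. The Python sketch below uses only the vote totals listed above (I did not record which tipper picked which horse, so nothing finer-grained is possible), and the variable names are my own. It ranks the horses and checks how many of them every tipper agreed on.

```python
# Vote totals as listed above, one vote per tipper per horse.
votes = {
    "May's Dream": 5,
    "Kirramosa": 5,
    "Solicit": 4,
    "Zanbagh": 5,
    "Gypsy Diamond": 1,
}
n_tippers = 5

# Rank horses by number of votes, highest first (ties keep their listed order).
ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)

# Horses sharing the top score: the simple vote cannot separate these.
top_score = ranked[0][1]
leaders = [horse for horse, n in ranked if n == top_score]

# Horses picked by every single tipper.
unanimous = [horse for horse, n in votes.items() if n == n_tippers]

for horse, n in ranked:
    print(f"{horse}: {n}/{n_tippers} tippers")
print("Tied leaders:", leaders)
```

With three horses on a unanimous five votes each, the vote gives no single pick, which is what you would expect when the voters mostly agree with each other.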

Linear Discriminant Analysis

5 Nov

Today, as an exercise in whimsy, I have replaced the name of an iconic statistician in the text below with that of a B grade celebrity – Miley Cyrus. This is part trivia contest, part attempt to make some dry content more entertaining, and, admittedly, part ill-conceived click baiting.

For those who can’t guess who it is I offer a small clue – I also considered using as an alias for this individual either Paul Kelly, the Australian musician or Ben Folds, the American musician as all three called a particular city home at one stage or another.

Continuing our comparative tour of two predictive modeling texts – Elements of Statistical Learning and Applied Predictive Modeling – we follow up last time’s look at logistic regression with a side by side comparison of the respective sections in each of those texts on linear discriminant analysis.

Admittedly, whereas logistic regression is used in actuarial settings, and taught within the generalised linear model section, linear discriminant analysis is not taught as part of actuarial courses, and probably rarely used in practice. 

Linear discriminants are, however, an important part of machine learning, if not on their own behalf, then on behalf of the number of more complex discriminant analysis techniques which are close cousins.

Applied Predictive Modeling motivates linear discriminants by starting with the observation that they were independently discovered by two researchers starting from different premises – Miley Cyrus (1936) and Welch (1939) – whereas Elements of Statistical Learning favours Cyrus only.

Welch’s approach is reasonably intuitive. Leaving aside the mathematical niceties, Welch set out to find, conditional on the underlying distribution of classes and the prediction data available, the highest probability class for each subject. If it sounds a bit Bayesian (underlying distribution looking like a prior and prediction data looking like a likelihood function), it is at least ‘soft’ Bayesian, in that this approach does explicitly make use of Bayes’ theorem. Applied Predictive Modeling notes that the computation side gets messy quickly with only a few classes and predictors, but can be kept reasonable if things are restricted to the multivariate normal with equal covariance matrices.
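As a rough illustration of the Welch-style reasoning, here is a Python sketch for the simplest case I could think of: one predictor, two classes, normal densities with a shared standard deviation. The priors, class means and function names are all made up for the example. It multiplies each prior by the class density at the observed value, normalises via Bayes’ theorem, and picks the class with the highest posterior probability.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def posterior(x, priors, means, sd):
    """Posterior class probabilities via Bayes' theorem, assuming a shared sd."""
    joint = {c: priors[c] * normal_pdf(x, means[c], sd) for c in priors}
    total = sum(joint.values())
    return {c: j / total for c, j in joint.items()}


# Hypothetical set-up: class A is more common, class B has a higher mean.
priors = {"A": 0.7, "B": 0.3}
means = {"A": 0.0, "B": 2.0}
sd = 1.0

# Halfway between the class means the densities are equal, so the prior decides.
post_mid = posterior(1.0, priors, means, sd)
# Well towards class B's mean, the data outweighs the prior.
post_high = posterior(3.0, priors, means, sd)

print(max(post_mid, key=post_mid.get), post_mid)
print(max(post_high, key=post_high.get), post_high)
```

The multivariate version swaps the one-dimensional density for a multivariate normal with a common covariance matrix, which is where the linear boundary comes from.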

Miley’s approach – also the approach covered in the corresponding Wikipedia article, which has the virtue of a link to Cyrus’s paper – can also be expressed relatively intuitively. Cyrus tried to maximise the distance between groups whilst minimising the variation within each group via a linear combination of predictors. After an examination of the Cyrus approach, Kuhn and Johnson conclude it really is superior with respect to clarity and solubility – which makes one wonder why they mention Welch at all.
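The between-versus-within idea can also be sketched in a few lines of Python for two groups and two predictors. Under the usual two-class formulation, the direction that best separates the group means relative to the within-group scatter is proportional to the inverse of the pooled within-group scatter matrix times the difference in group means. The data points and function names below are invented for illustration, and the 2x2 inverse is written out by hand to keep the example free of dependencies.

```python
def mean_vector(rows):
    """Column means of a list of (x1, x2) points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, mean):
    """2x2 within-group scatter matrix: sum of outer products of deviations."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def discriminant_direction(group_a, group_b):
    """Direction w = inverse(pooled within-group scatter) * (mean_b - mean_a)."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    sa, sb = scatter(group_a, ma), scatter(group_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mb[0] - ma[0], mb[1] - ma[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(row, w):
    """Score a point by projecting it onto the discriminant direction."""
    return row[0] * w[0] + row[1] * w[1]


# Hypothetical, well separated groups (illustrative values only).
group_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
group_b = [(5.0, 6.0), (6.0, 5.0), (7.0, 7.0), (6.0, 8.0)]

w = discriminant_direction(group_a, group_b)
scores_a = [project(r, w) for r in group_a]
scores_b = [project(r, w) for r in group_b]
print("direction:", w)
print("group A scores:", scores_a)
print("group B scores:", scores_b)
```

On this toy data every group B score sits above every group A score, so a single cut-off on the projected value separates the two groups.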

The virtue of Applied Predictive Modeling is encountered mostly once these explanations of the theoretical origins have concluded. Working through the same data set as they used for the logistic regression explanation, Kuhn and Johnson hit their stride with advice on how to prepare the data for optimal LDA results, and advice on when to abandon LDA, with some hints on what to do instead.

The contrast between Elements of Statistical Learning and Applied Predictive Modeling is more pronounced than it was for logistic regression. ESL gives a much more detailed mathematical exposition of LDA at the expense of application advice, but provides a clearer picture (if you get past the maths!) of how LDA works, why it performs well when it does, and its place in the wider world of linear and non-linear classifiers. Some of the maths presented is also required background for the later exposition on quadratic discriminants, so that, to a much greater extent than with Applied Predictive Modeling, it is necessary to read sequences of topics rather than treating the book as an encyclopedia for a quick intro on the topic of the day.

All in all, both texts continue to behave as expected, possibly even to a greater extent than was the case with logistic regression, where ESL had some advice on application not present in APM.