Tag Archives: statistics
## Sad News

17
Dec
## Divergent Opinions

12
Nov
## Logistic Regression: a Predictive Modelling View

25
Oct
## Predictive Modeling

21
Oct

Normal Deviate is no more! Vale and vive!

Hmmm, I have no cats, but my posting is haphazard! Hopefully, I can rise to Normal Deviate’s standard of targeted and thoughtful blogging as my blogging prowess matures.

Big Data and predictive analytics are debated and discussed almost endlessly in the interwebs. One of the threads that runs through these discussions is how much maths and statistics one needs to know (although sometimes the question seems to be more like ‘how little can I get away with?’) to practise data science, predictive analytics, etc.

Actual maths and stats people come down on the side of ‘a little knowledge is a dangerous thing’, and hold that people should try to know as much as possible. See here:

http://mathbabe.org/2013/04/04/k-nearest-neighbors-dangerously-simple/

But knowing enough statistics to be called a statistician could lead to being seen as out of touch with Big Data:

http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/

Particularly if contemporary, highly computer literate statisticians who are widely admired in their field admit in public they don’t know anything about Hadoop:

http://andrewgelman.com/2013/11/01/data-science/

Maybe this guy has the answer – ignore statistical theory and training, learn the least amount of programming needed to start hacking, and just teach yourself with whatever data comes to hand:

http://www.datasciencecentral.com/profiles/blogs/proposal-for-an-apprenticeship-in-data-science

Well, not exactly, but *statistics* is still kind of relegated to being something you ‘learn basics about’. I don’t think that posts 1, 2 and 3 can possibly be talking about the same discipline?

From my point of view, as someone who still pinches themselves that they get to do predictive modelling as a for-real job, with only a Master’s degree in statistics, experience in business from before I did stats, and a really poor command of VB6 as my only qualifications (although I learned a lot of SQL *very* quickly when I started this job ‘cos otherwise I had nothing to analyse), I can only say that with respect to maths and statistics I wish I knew more, with respect to machine learning I wish I knew more, and with respect to hacking I wish I knew more.

How much is enough? All of it isn’t enough.

As suggested in my last post but one, I am attempting a parallel reading of Kuhn and Johnson’s Applied Predictive Modeling and Hastie, Tibshirani and Friedman’s Elements of Statistical Learning. Kuhn and Johnson themselves give implicit support to this approach, recommending Hastie et al. as a more deeply mathematical companion to their applied text. Hence, this is something of an experiment to see how reading the two together ‘works’. There’s fairly obviously a large amount of material between the two texts – I will only attempt a reading of a small and hopefully representative selection of topics.

Kuhn and Johnson almost don’t seem to have their hearts in the discussion on logistic regression. Their conclusion is that there are better ways to model binary data, and logistic regression is just the warm up act for those. More properly, they prefer algorithms more suited to unsupervised learning, and want to introduce algorithms which require less human training.

For mine, this is a pity, as clients who claim they want the most accurate model may later turn out to really want the most explainable model. It may be that Kuhn and Johnson simply think that so much has been written on the topic, and enough of it sufficiently well, that there is nothing new to say. They recommend Regression Modelling Strategies (Harrell, 2001) as a text for learning more about logistic regression.

I don’t have this book, but from the course notes made available by the author (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/RmS/rms.pdf) it does look like a text with a great deal of useful advice for practitioners. According to the book’s website (which has a great many useful links but also a great many dead links), there is a second edition due any time now (intended for September 2013), so maybe hold off until it appears.

Hastie et al.’s approach is, as expected, more mathematical than the approach used by Kuhn and Johnson. They derive the equations for the conditional class probabilities (linear in the log-odds) in the multinomial case, and observe that the two-class model simplifies the equations considerably.
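As a rough reminder of what that derivation looks like, the model can be written as below. The notation ($G$ for the class label, $K$ classes, coefficients $\beta$) is an assumption chosen to resemble the usual textbook form, not a quotation from the book.

$$
\log\frac{\Pr(G=k \mid X=x)}{\Pr(G=K \mid X=x)} = \beta_{k0} + \beta_k^{T}x, \qquad k = 1,\dots,K-1
$$

$$
\Pr(G=k \mid X=x) = \frac{\exp(\beta_{k0} + \beta_k^{T}x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^{T}x)}, \qquad
\Pr(G=K \mid X=x) = \frac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^{T}x)}
$$

With $K = 2$ there is only one linear function left, and writing $p(x) = \Pr(G=1 \mid X=x)$ the whole model collapses to

$$
\log\frac{p(x)}{1-p(x)} = \beta_0 + \beta^{T}x .
$$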

In the Hastie et al. treatment, this derivation is necessary as they proceed to explain how to estimate the parameters using the Newton-Raphson algorithm – providing the nuts and bolts for someone who wants to program the algorithm for themselves. Kuhn and Johnson assume that you think more like engineers than maths researchers, and are happy for someone else to do the programming for you.
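For anyone who does want to program it themselves, here is a minimal sketch of Newton-Raphson for the simplest case: one predictor plus an intercept, in plain Python. It is not code from either book; the function name `fit_logistic_newton`, its parameters and the toy data in the usage note are all made up for illustration, and the 2x2 linear system is solved by hand purely to avoid needing a matrix library.

```python
import math


def fit_logistic_newton(x, y, max_iter=25, tol=1e-10):
    """Fit log-odds = b0 + b1 * x by Newton-Raphson; returns (b0, b1).

    Illustrative sketch only: assumes y holds 0/1 labels, the classes are
    not perfectly separated by x, and x is not constant.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Fitted probabilities under the current coefficients.
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector: gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix X'WX with weights w = p * (1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) * step = score.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + s0, b1 + s1
        if max(abs(s0), abs(s1)) < tol:
            break
    return b0, b1
```

A quick way to try it, on invented data where the classes overlap so that the maximum likelihood estimate exists:

```python
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
print(fit_logistic_newton(x, y))
```

With more than one predictor the same update applies, but the step needs a general linear solve, which is where letting someone else do the programming starts to look attractive.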

Not everyone may have been craving such a strong dose of matrix algebra and numerical optimisation. Still, if you need a refresher, or just extra advice, on how logistic regression models are traditionally interpreted, Hastie et al.’s example performs this function. Kuhn and Johnson add little in this area, instead briefly calculating their example’s ROC curve and AUC in order to compare it with other classifier models. Hastie et al. also provide high-level advice on how to choose variables when using a logistic regression model.

Possibly this was not an ideal starting place, as the best aspects of the logistic regression discussion in Kuhn and Johnson were more the pointers to other resources. We will see later how things pan out when looking at linear discriminant analysis.
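For readers who have not met AUC before, the sketch below shows one common way to compute it: as the proportion of (positive, negative) pairs in which the positive case gets the higher score, with ties counted as half. This is not the calculation from either book; the function name `auc` and the scores in the usage note are invented for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Illustrative sketch: labels are assumed to be 0/1, and every
    positive/negative pair is compared, so it is only suitable for
    small data sets.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # A pair counts 1 if the positive outranks the negative, 0.5 if tied.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

For example, `auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])` gives 0.75, because three of the four positive/negative pairs are ranked correctly. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is what makes it convenient for comparing classifiers of quite different kinds.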

Over the next few blog posts, which may be intermittent but will hopefully come with smaller gaps than the last couple, we are going to take a sideways tour into predictive modelling, which is closer to what I am currently doing than strictly actuarial studies. Just as before, for me the purpose is to force close study, and if others can benefit, that’s a bonus.

Recently I received from a riparian bookselling website the book Applied Predictive Modeling (Kuhn and Johnson, 2013) (note one ‘l’), having ordered it only three months earlier. As the title suggests, the thrust of this text is to introduce predictive modeling techniques (whether originating as data mining or statistical techniques) in the context of their application to problem solving, rather than with respect to their theoretical origins or with a view to critiquing them, mathematically or otherwise. In fact, the authors suggest The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Hastie et al.) as a good theoretical companion.

As a device for forcing reading with a critical mind, I propose to read and compare the sections of both books dealing with the same topics, starting with the topics I am personally most familiar with, before moving to a couple of areas newer to me. Part of the object is to discover, or at least partially uncover, where the practical and the theoretical differ, and where one gives way to the other and back again.

Before the end of this tour we will also look at the sections on data pre-processing and ‘other considerations’ which bookend the discussions of individual modelling techniques. In some ways these sections are the most important, as they provide an especial opportunity for the authors to discuss the practice of modelling, the book’s raison d’être and strength, as well as being the areas in this text that are least often discussed in other texts.