Archive | Data Science

The Epitome of Data Science

3 Dec

Christian Robert is a leading Bayesian statistician and, like many Bayesian statisticians, an avid blogger (really, frequentists don’t seem to blog as much; or maybe there are only Bayesian and ambivalent/agnostic statisticians these days).

Robert generously posts what he is doing with his classes on his blog (or ’Og, as he prefers). For a few years now, he has held a seminar series on classic papers (list found here). Last week, one of his students presented a paper not included on the list which in some ways symbolises the meaning of data science as the place where statistics meets computer science:

The paper is here:

And here is Robert’s write-up of his student’s seminar, with his own response to the paper:

The paper is simply a proposal for calculating some commonly used statistics on data too big to fit in memory, by chopping the data set into smaller pieces. Robert raises some mathematical concerns.
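To give a flavour of the chop-and-combine idea, here is a minimal sketch of my own (it is illustrative only, and is not the paper’s actual algorithm): each chunk is reduced to a few summary numbers, and those summaries are combined into the overall mean and variance without the full data ever being in memory at once.

```python
import numpy as np

def chunk_summaries(chunks):
    """Reduce each chunk to (count, sum, sum of squares)."""
    return [(len(c), float(np.sum(c)), float(np.sum(np.square(c))))
            for c in chunks]

def combine(summaries):
    """Combine per-chunk summaries into the overall mean and
    population variance of the full data set."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    return mean, total_sq / n - mean ** 2
```

For mean and variance the combination is exact (splitting an array with `np.array_split` and combining reproduces `np.mean` and `np.var` of the whole array), though the sum-of-squares formula used here is numerically naive, which hints at the kind of mathematical concerns such proposals attract.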

In some ways, though, the correctness of the approach is less interesting than the fact that academic statisticians are putting serious effort into dealing with the obstacles thrown up by data sets outgrowing computers’ ability to process them, which will hopefully give the discipline of statistics more of a Big Data voice. It is odd, though, that this sort of work brings us full circle to the pre-computing age, when finding workable approximations to allow statistics to be calculated by hand on data with a few hundred rows was a serious topic of interest. All of which makes re-reading the review of Quenouille’s Rapid Statistical Calculations (which I have never seen for sale anywhere) a slightly strange experience, given that the reviewer says computers have made that sort of thing irrelevant!


Data Cleaning for Predictive Modeling

25 Nov

This discussion – – turns on the question of whether data cleaning and preparation is intrinsic to applied statistics, or whether spending many hours preparing data is more something data scientists do, statisticians perhaps expecting at least semi-cleaned data. A theme which emerged in Gelman’s discussion of what a data scientist is was that ‘data scientist’ means different things to different people, and the same applies to ‘data preparation’. There are different ‘data preparations’ for different occasions.

Applied Predictive Modeling by Kuhn and Johnson, which we have looked at before, is one of the rare books on modeling or statistics with a section explicitly devoted to the optimal preparation of data sets. We reiterate that this concept means different things to different people.

The meat of Kuhn and Johnson’s advice on data preparation is found in Chapter 3: Data Pre-Processing. The authors note that additional advice applying to supervised models is scattered throughout the text, beyond what appears in Chapter 3.

Chapter 3 is about adding and subtracting predictors, and re-engineering predictors for best effect. The authors are particularly down on binning, and offer a number of methods to overcome skewness, get the scaling right, and reduce the data sensibly (hint: binning is a poor choice for this). Another area of interest to Kuhn and Johnson is how to deal with missing data. This issue is notable for being one that applied statistics texts do treat relatively often – for example, Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models contains a chapter on missing data imputation.
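To make that concrete, here is a toy sketch of my own (not code from the book) of the kind of pipeline Chapter 3 discusses for a single non-negative, right-skewed predictor: impute the missing values, transform away the skewness, then centre and scale.

```python
import numpy as np

def preprocess(x):
    """Toy pre-processing for one non-negative, right-skewed predictor:
    mean-impute missing values, log-transform to reduce skewness,
    then centre and scale to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    x = np.where(np.isnan(x), np.nanmean(x), x)  # naive mean imputation
    x = np.log1p(x)                              # tame right skewness (needs x >= 0)
    return (x - x.mean()) / x.std()              # centre and scale
```

Each step here stands in for a family of techniques (the book discusses more principled transformations and imputation methods); the point is only the shape of the workflow.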

To be sure, there is plenty of very practical advice, but to apply it effectively your data set needs to have been looking pretty good to begin with.

A contrast to this approach is the Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work. Applied Predictive Modeling’s interest in the optimal clean data set for modeling assumes a somewhat clean data set to begin with. Minimum requirements are not really listed, but they could plausibly include: no wrong data; rows and columns agreeing with each other on units; and data formatted in a way your software can understand.
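As a sketch of the ‘agreeing on units’ requirement (the unit labels and readings below are hypothetical, purely for illustration), the fix is usually to normalise every row to one unit before any modeling starts:

```python
def to_kg(value, unit):
    """Convert a weight reading to kilograms so every row agrees on units.
    The unit labels here are hypothetical examples."""
    factors = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}
    if unit not in factors:
        raise ValueError("unrecognised unit: " + unit)
    return value * factors[unit]

# Rows recorded in mixed units, normalised to a single unit:
readings = [(70.0, "kg"), (154.0, "lb"), (70000.0, "g")]
weights_kg = [to_kg(v, u) for v, u in readings]
```

Raising on an unrecognised unit, rather than guessing, is the safer design: silently passing through a mystery unit is exactly the kind of ‘wrong data’ problem the handbook is about.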

Obviously a book-length treatment can cover a lot more ground than a single chapter – and in fact the ground here is completely disjoint, rather than a broader or deeper version of the same territory. Adding to the breadth of the Bad Data Handbook’s coverage is that it is written by a number of authors, each contributing a chapter on an area they are strong in. While the volume’s editor seems to have organised them well enough to avoid too much overlap, one downside of this approach is that the code examples come on a variety of platforms, which can be irritating if it means interrupting your reading to learn basic Python syntax. That said, if you aren’t comfortable with the idea of becoming familiar with R, Python, Unix and so on, you probably aren’t that interested in becoming a data scientist (whatever that might be – most people seem to agree that a very low bar for entry is being willing to learn programming languages).

Another outcome of this approach is that each chapter is self-contained. This is great because you can profitably start reading from the beginning of any chapter, but the corollary is that it is not necessarily straightforward to find what you are looking for if you want to use this as a reference book, as the word ‘Handbook’ in the title implies you might.

Regardless, this is a book which covers very different ground to statistics texts and modeling texts such as Kuhn and Johnson, and is therefore its own reward.

Empty Vessels

18 Nov

Influence is big at the moment, partly thanks to LinkedIn’s promotion of Influencers (who usually aren’t) – essentially people who write short, career-oriented inspirational pieces that are piped into your email inbox. Which is all good, but when a word is used incorrectly, or at best loosely, its meaning is diluted, and when you want to use it for its original meaning, it doesn’t work as well.

To be clear, you are influential if a great number of people in your field act differently because of work you have done. Picasso was an influential artist because many prominent artists paint differently because of his example. Bob Dylan is influential because a great many singer-songwriters changed their methods because of his example and/or found audiences which existed because Bob Dylan came first. While the idea of influence is slippery and subjective, in those cases, and others like them, we can make some progress towards objectivity using this definition.

A couple of weeks ago, Time magazine made itself a large and slow-moving target by publishing a ‘Gods of Food’ cover story featuring 3 males and no females on the cover, and 9 males in a list of 13 inside. See here for some discussion of the associated brickbats:

Another one of the many Big Data/ Data Science/ Predictive Modeling bloggers has flung themselves onto the same hand grenade by suggesting a list of ‘Top 10 Most Influential Data Scientists’ which includes no women at all.

Note that the first comment is a plea, from someone whose name looks female, for the inclusion of women, with a second comment from the same person having been deleted. I like to think that that comment was deleted because it was a howl of outrage, too raw in its emotion and intemperate in its language to be let loose on the sheltered data science community. But I have no data to support this assertion, and will move on…

To me, what is striking about the omission of women from this list is that the criteria were so loose that it would have been easy to avoid. After all, missing from the criteria is any evidence of influence, in the sense of people who call themselves ‘data scientists’ (or are so called by others) doing the work differently because of the example of these ten guys. Which is not to say that these ten guys aren’t influential in that sense, just that the list was created without checking whether they were.

While the omission is glaring and wants addressing, what bothers me more is that this is an example of how you can’t move around data-science-linked websites, blogs, fora, etc. – as you might when looking for datasets (which is what I was doing when I accidentally found this blog post), programming hints, and so on – without encountering material that is dangerously close to spam. The rest of the Deep Data Mining blog, for example, appears to be crammed with advice on how to use different platforms, especially database platforms, to better advantage. Why not stick to that?

Divergent Opinions

12 Nov

Big Data and predictive analytics are debated and discussed almost endlessly on the interwebs. One of the threads running through these discussions is how much maths and statistics one needs to know (although sometimes the question seems to be more like ‘how little can I get away with?’) to practice data science/predictive analytics, etc.

Actual maths and stats people come down on the side of ‘a little knowledge is a dangerous thing’, holding that people should try to know as much as possible. See here:

But knowing enough statistics to be called a statistician could lead to being seen as out of touch with Big Data:

Particularly if contemporary, highly computer-literate statisticians who are widely admired in their field admit in public that they don’t know anything about Hadoop:

Maybe this guy has the answer – ignore statistical theory and training, learn the least amount of programming needed to start hacking, and just teach yourself with whatever data comes to hand:

Well, not exactly, but statistics is still somewhat relegated to being something you ‘learn basics about’. I don’t think that posts 1, 2 and 3 can possibly be talking about the same discipline.

From my point of view – as someone who still pinches themselves that they get to do predictive modelling as a for-real job, with only a Master’s degree in statistics, business experience from before I did stats, and a really poor command of VB6 as my qualifications (although I learned a lot of SQL very quickly when I started this job, ’cos otherwise I had nothing to analyse) – I can only say that with respect to maths and statistics I wish I knew more; with respect to machine learning, I wish I knew more; and with respect to hacking, I wish I knew more.

How much is enough? All of it isn’t enough.