Tag Archives: Data science

Selling Data Science

9 May

Creating sales documents and pitches that list out all the shiny new things that our data science application can do is very tempting. We worked hard on those features and everyone will appreciate them, right?

Well, not really. For one, it’s very likely your target audience doesn’t have the technical ability to understand the point of what you’re selling. After all, if they had your technical skills, they wouldn’t be thinking of hiring a data scientist, they’d just be doing it themselves.

The next problem is that you can’t trust that the customer realises how your solution helps them out of their present predicament. Moreover, it’s disrespectful to get them to do your job for you. Hence, you need to make sure your pitch joins the dots between what you intend to do for the customer and how it’s going to make their life easier.

In sales parlance this is known as ‘selling the benefits’ – that is, making it clear to the potential customer how buying your product will improve their lives – and has been encapsulated in the phrase ‘nobody wants to buy a bed – they want a good night’s sleep’. The rub is that in most data science scenarios the problem that corresponds to the potential benefit is a business problem – such as reducing inventory or decreasing the cost of sales – rather than a human problem, such as getting a good night’s sleep.

Therefore, being able to complete the journey from feature to benefit requires some knowledge of your customer’s business (whereas everyone knows the benefits of a good night’s sleep – and the horrors of not getting one – far fewer understand the fine points of mattress springing and bed construction) and the ability to explain the links. This last is crucial, as the benefits of your work are too important to allow your customer an opportunity to miss them.

What all this means in the end is that the approach of inspecting data sets in the hope of finding ‘insights’ will often fail, and may border on being dangerous. Instead you need to start with what your customer is trying to achieve and what problems they are facing, and only then see which of those problems correspond with data that can be used to build tools to overcome them.

Kaggle Leaderboard Weirdness

29 Jan

Earlier this week I finally, after about half a dozen false starts, posted a legal entry to a Kaggle competition, and then when I saw how far off the pace I was, I posted another half a dozen over the course of a day, improving very slightly each time. If the competition ran for a decade, I’d have a pretty good chance of winning, I reckon…

While I now understand how addictive Kaggle is – it hits the sweet spot between instant gratification and highly delayed gratification – I find the leaderboard kind of weird and frustrating because so many people upload the benchmark – the trivial solution the competition organisers upload to have a line in the sand. In this competition, the benchmark is a file of all zeroes.

This time yesterday, there were around a hundred entries that were just the benchmark, out of about 180. Today, for some reason, all the entries so far appear to have been removed, so there are only about thirty – but twenty of those are the benchmark again! I get that people just want to upload something so they can say they participated, but this many all-zero files is the thing getting out of hand.

Data Cleaning for Predictive Modeling

25 Nov

This discussion – http://andrewgelman.com/2013/11/19/22182/ – raises the question of whether data cleaning and preparation is intrinsic to applied statistics, or whether spending many hours preparing data is more something data scientists do, with statisticians possibly expecting at least semi-cleaned data. A theme which emerged in Gelman’s discussion of what a data scientist is was that ‘data scientist’ means different things to different people, and the same applies to ‘data preparation’. There are different ‘data preparations’ for many different occasions.

Applied Predictive Modeling by Kuhn and Johnson, which we have looked at before, is one of the rare books on modeling or statistics which explicitly has a section devoted to optimal preparation of data sets. We reiterate that this concept means different things to different people.

The meat of Kuhn and Johnson’s advice on data preparation is found in Chapter 3: Data Pre-Processing. The authors note that further advice specific to supervised models appears throughout the rest of the text, over and above what is in Chapter 3.

Chapter 3 is about adding and subtracting predictors, and re-engineering predictors for the best effect. They are particularly down on binning, and offer a number of methods to help overcome skewness, assist with correct scaling, and carry out sensible data reduction (hint: binning is a poor choice). Another area of interest to Kuhn and Johnson is how to deal with missing data. This issue is notable for being one which is relatively often dealt with by applied statistics texts – for example, Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models contains a chapter on missing data imputation.
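To make those three ideas concrete, here is a minimal sketch in Python using only the standard library. It is not code from Kuhn and Johnson (whose examples are in R), and the specific choices are my own assumptions for illustration: median imputation for missing values, a log(1 + x) transform for right skew, and centering and scaling to zero mean and unit standard deviation. The function names and the toy data are hypothetical.

```python
import math
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def log_transform(values):
    """Apply log(1 + x); this assumes every value is non-negative."""
    return [math.log1p(v) for v in values]


def center_and_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Toy predictor: two missing values and one large outlier causing right skew.
raw = [1.0, 2.0, None, 3.0, 4.0, 50.0, 2.0, None, 5.0]

filled = impute_median(raw)        # missing values become the observed median
logged = log_transform(filled)     # pulls in the long right tail
scaled = center_and_scale(logged)  # zero mean, unit standard deviation

print("skewness before log:", round(skewness(filled), 2))
print("skewness after log: ", round(skewness(logged), 2))
```

Note that every step here presumes the numbers are already trustworthy: the sketch has no way of knowing whether that 50.0 is a genuine observation or a data entry error.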

To be sure, there is plenty of very practical advice, but for it to be effective, your data set needs to be looking pretty good to begin with.

A contrast to this approach is Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work. Applied Predictive Modeling’s interest in the optimal clean data set for modeling assumes a somewhat clean data set to begin with. Minimum requirements are not really listed, but they could plausibly include no wrong data, data in different rows and columns agreeing with respect to units, and data being formatted in a way that your software can understand.

Obviously a book-length treatment can cover a lot more ground than a single chapter. In fact, the ground covered is completely disjoint, rather than overlapping and simply being broader or deeper. Adding to the breadth of the Bad Data Handbook’s coverage is that it is a text written by a number of authors, each contributing a chapter on an area they are strong in. While they seem to have been well organised enough by the volume’s editor to avoid too much overlap, a negative result of this approach is that code examples come in a variety of platforms, which can be irritating if it means you have to interrupt your reading to learn basic Python syntax. That said, if you aren’t comfortable with the idea of becoming familiar with R, Python, Unix etc., you probably aren’t so interested in becoming a data scientist (whatever that might be, most people seem to agree that a very low basic requirement is to be willing to learn programming languages).

Another outcome of this approach is that each chapter reads like a self-contained essay. This is great because it means that you can profitably start reading from the beginning of any chapter, but the corollary is that it is not necessarily straightforward to find what you are looking for if you want to use this as a reference book, as the use of the word ‘Handbook’ in the title implies.

Regardless, this is a book which covers very different ground to statistics texts and to modeling texts such as Kuhn and Johnson, and is therefore its own reward.