Archive | Data Science

Selling Data Science

9 May

Creating sales documents and pitches that list out all the shiny new things that our data science application can do is very tempting. We worked hard on those features and everyone will appreciate them, right?

Well, not really. For one, it’s very likely your target audience doesn’t have the technical ability to understand the point of what you’re selling. After all, if they had your technical skills, they wouldn’t be thinking of hiring a data scientist, they’d just be doing it themselves.

The next problem is that you can’t trust that the customer realises how your solution helps them out of their present predicament. Moreover, it’s disrespectful to get them to do your job for you. Hence, you need to make sure your pitch joins the dots between what you intend to do for the customer and how it’s going to make their life easier.

In sales parlance this is known as ‘selling the benefits’ – that is, making it clear to the potential customer how buying your product will improve their lives – and has been encapsulated in the phrase ‘nobody wants to buy a bed – they want a good night’s sleep’. The rub is that in most data science scenarios the problem that corresponds to the potential benefit is a business problem – such as reduced inventory or decreased cost of sales – rather than a human problem, such as getting a good night’s sleep.

Therefore, being able to complete the journey from feature to benefit requires some knowledge of your customer’s business (whereas everyone knows the benefits of a good night’s sleep – and the horrors of not getting one – far fewer understand the fine points of mattress springing and bed construction) and the ability to explain the links. This last is crucial, as the benefits of your work are too important to allow your customer an opportunity to miss them.

What all this means in the end is that the approach of inspecting data sets in the hope of finding ‘insights’ will often fail, and may border on being dangerous. Instead you need to start with what your customer is trying to achieve and the problems they are facing, before seeing which of those problems correspond to data that can be used to build tools to overcome them.


R Packages for Managers

22 Apr

Roger Peng, in his e-text ‘Mastering Data Science’, makes an off-hand comment to the effect that if you are going to do something twice in R, write a function, but if you’re going to do it three times, write a package (actually he’s self-plagiarising from his own book, Executive Data Science, which I don’t have).

When writing about functions and packages in R, Peng advances several of the usual arguments in favour of their use, such as avoiding rework, creating more readable code etc. In my opinion just listing off those standard reasons undersells the benefits of creating functions and packages, especially in a corporate environment.

A huge challenge in a corporate environment is to capture employee knowledge and experience, in an environment where a lack of time, and sometimes of people, breeds a culture of getting things out the door quickly without pausing for reflection or, crucially, documentation. Hence, if an employee goes out the door, their knowledge and experience goes with them. Asking people to write packages which collect the processes they applied during a particular project keeps a substantial part of that knowledge inside the organisation.

The other undersold virtue of writing functions and packages is that it is an antidote to R being used as a command line environment rather than a software environment. That is, it moves users away from inputting strings of R commands, effectively making themselves part of the program, towards writing something closer to conventional programs, though usually small ones.
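
As a minimal sketch of what that shift can look like (the function, data frame and column names here are all hypothetical), a step that would otherwise be retyped at the console is wrapped in a function, and base R’s package.skeleton() then scaffolds a package directory around it:

# A repeated ad hoc step, captured as a reusable, documentable function
summarise_claims <- function(claims, date_from, date_to) {
  in_range <- claims$date >= date_from & claims$date <= date_to
  aggregate(amount ~ product, data = claims[in_range, ], FUN = sum)
}

# Scaffold a package directory around it (DESCRIPTION, R/, man/ etc.)
package.skeleton(name = "claimstools", list = "summarise_claims")

From there it is a short step to filling in the documentation and sharing the package internally.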

In my own work, I see a particular opening for moving activity toward functions and packages, as we try to sell the same idea to three different potential customers, each involving a similar process of providing toy examples customised to the potential customer’s data set before doing similar work across each customer’s full data set. With three being Peng’s threshold for writing a package (and there will likely be some rework for individual customers, e.g. the same task performed on different date ranges), I seem to be squarely in the category that needs to write packages.

Lattice R Users Group Talk

14 Aug

Last night I gave a talk on the Lattice package of R, held together by the idea that Lattice is an expression of Bill Cleveland’s overall philosophy of visualisation. I don’t know that I put my argument very clearly, but I think the fact of having an argument made the talk a tiny bit less incoherent!

After the talk there was some discussion of a few things – the use of colour schemes for one, but also two questions I didn’t have answers to, although I made a couple of totally wrong guesses!

Question 1: Can you include a numeric scale on Lattice trivariate plots such as wireframe (and if so, how)?

Answer – Yes, there is a scales argument, which must be supplied as a list.

Hence, given you have your surface dataframe pre-prepared:

library(lattice)
wireframe(z ~ x * y, data = your.surface.data, scales = list(z = list(arrows = FALSE, distance = 1)))

Question 2: Can you use the print() function to arrange graphics objects produced from multiple R graphics packages?

So far as I can tell, the answer is a qualified ‘yes’, where the qualification is that you need to be working with a graphics package which produces a storeable graphics object – lattice obviously does, and it looks like ggplot2 does also. Another package I selected at random, vcd, does not, however.
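
For what it’s worth, here is a minimal sketch of the kind of arrangement I had in mind, putting a lattice object and a ggplot2 object on the same page via grid viewports (the data and layout are just placeholders):

library(lattice)
library(ggplot2)
library(grid)

p1 <- xyplot(mpg ~ wt, data = mtcars)                # lattice returns a trellis object
p2 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()    # ggplot2 returns a ggplot object

# Left half of the page for lattice, right half for ggplot2
print(p1, position = c(0, 0, 0.5, 1), more = TRUE)
print(p2, vp = viewport(x = 0.75, y = 0.5, width = 0.5, height = 1))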


Actual Biggish Data

29 Apr

My first reaction when I read that Kaggle was promoting a competition with its biggest ever data set (the Acquire Valued Shoppers Challenge) was ‘Oh no, that’s one I will definitely need to avoid,’ as I am still struggling to find the time to hone my Hadoop, Mahout et al. chops.

As luck would have it, though, I have since stumbled across some R packages that may mean I can at least make a semi-sensible entry, and which have therefore motivated me to download the data (8 GB compressed, which took five hours, so not completely trivial).

The difficulty is that, according to Kaggle, the uncompressed data is in excess of 22 GB, and R on my machine certainly balks at loading much smaller data frames into memory. Part of the answer is probably to sample from the data set out of SQL before attempting any analysis. Even then, I was going to have to work with a very small subset of the data to get anywhere with R on my machine.
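
As a rough sketch of what sampling out of SQL might look like, assuming the transactions have already been loaded into an SQLite database (the database file and table names are hypothetical):

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "transactions.db")
# Pull a random subset small enough for R to handle comfortably
sub <- dbGetQuery(con, "SELECT * FROM transactions ORDER BY RANDOM() LIMIT 100000")
dbDisconnect(con)

ORDER BY RANDOM() forces a full scan of the table, so it is slow, but it only needs to be run once to produce a working sample.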

The first discovery that looks likely to assist me here is the bigmemory package, which comes with sibling packages bigalgebra, biganalytics and bigtabulate. This suite of packages aims to improve the facility of working with very large matrices that still fit in memory. The paper below presents the features of the bigmemory packages.

The bigmemory vignette: bigmemory-vignette.pdf
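
A rough sketch of the basic bigmemory workflow, assuming a numeric-only extract of the data (the file name is hypothetical):

library(bigmemory)
library(biganalytics)

# Read a numeric-only CSV into a big.matrix rather than an ordinary data frame
x <- read.big.matrix("transactions_numeric.csv", header = TRUE, type = "double")
colmean(x)   # biganalytics provides summaries that work on big.matrix objects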

By contrast, the ff package aims to allow the user to work with data frames which are too large to fit in memory. Hence, even larger data sets than are tractable using bigmemory and its related packages on their own become tractable.

The following informal paper introduces the ff package: http://wsopuppenkiste.wiso.uni-goettingen.de/ff/ff_1.0/inst/doc/ff.pdf
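
And a similarly rough sketch of reading a large CSV into an ff data frame chunk by chunk (again, the file and column names are hypothetical):

library(ff)

# read.csv.ffdf reads the file in chunks and keeps the result on disk
tx <- read.csv.ffdf(file = "transactions.csv", header = TRUE, next.rows = 500000)
dim(tx)
mean(tx$purchaseamount[])   # hypothetical column; [] pulls it into RAM when needed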

Another quick look at the ff package is available via the presentation files of the awesomely acronymed Dallas R Users Group, whose members were lucky enough to have a presentation on the ff package in Jan ’13. The presentation slides and two related R files are available from this page:

Dallas R Users Group (Meetup group for R users in the Dallas and Ft. Worth metroplex)

Lastly, the following paper discusses when to use the different packages mentioned above, and better still, some ways to avoid needing to use them in the first place:

The paper: bigdata.pdf


Lattice Trellis and Other Pretty Patterns

16 Mar

My local R users group is sufficiently in need of speakers that it was suggested that I give a talk, which I am quite willing to do. I have started to prepare a talk about the Lattice package, its ‘parent package’ Trellis from S-Plus, and some of the research of Trellis’s creator, Bill Cleveland, and his continuing influence.

Admittedly, I’m not an expert on visualisation, so I will cite Fell In Love With Data to back up my assertion that Cleveland is a big name in this area. FILWD listed their selection of the greatest papers of all time, with one of Cleveland’s early visualisation papers at the top.

That paper explains Cleveland’s feeling that there was a gap in the market for intellectual rigour (for want of a better phrase) in the application of statistical graphics. Cleveland’s contribution was to assess the cognitive difficulty of interpreting some of the key genres of data display, confirm the findings experimentally, and apply those findings to common trends in data display (pie graphs?).

Overall, Cleveland’s work history is too large to fit in one web page – he has two. One is his Bell Labs page, which remains preserved although he has since moved on. Cleveland’s new home is Purdue University, and his web page there has another selection of papers. Of greatest interest to me in getting deeper into lattice, though, was the Trellis home page at Bell Labs, which includes the original Trellis manual – a manual that can be used unchanged with lattice in R to produce graphics.

A handy extra resource comes from Paul Murrell, the statistician who developed grid, the R package which underlies lattice: he has made some sample chapters of his book ‘R Graphics’ available, including, in particular, the chapter on lattice.

Discovering Michael Friendly’s history of data visualisation has been a pleasantly surprising side benefit of looking into the lattice package and related developments in data visualisation. Although there isn’t any detailed discussion of the specific topics I am researching, the timeline developed by Friendly showing peaks and troughs in data visualisation development (figure 1) gives some interesting context to Cleveland’s complaint of insufficient rigour in data visualisation. Ironically, Friendly posits the success of statistics via calculation as the leading cause of the lack of interest in visualisation throughout the first half of the twentieth century. Interestingly, Cleveland appears just after the rebirth of visualisation (1950-1975), rather than as part of it.

Kaggle Leaderboard Weirdness

29 Jan

Earlier this week I finally, after about half a dozen false starts, posted a legal entry to a Kaggle competition, and then when I saw how far off the pace I was, I posted another half a dozen over the course of a day, improving very slightly each time. If the competition ran for a decade, I’d have a pretty good chance of winning, I reckon…

While I now understand how addictive Kaggle is – it hits the sweet spot between instant gratification and highly delayed gratification – I find the leaderboard kind of weird and frustrating because so many people upload the benchmark – the trivial solution the competition organisers upload to have a line in the sand. In this competition, the benchmark is a file of all zeroes.

This time yesterday, there were around a hundred entries that were just the benchmark, out of about 180. Today, for some reason, all the entries so far appear to have been removed, so there are only about thirty – but twenty of those are the benchmark again! I get that people just want to upload something so they can say they participated, but this many all-zero files is just the thing getting out of hand.

2014: New Year’s Plans (Dreams?)

14 Jan

This is the first in a short series, and covers my R and computer programming pipe dreams for 2014. Another post will cover my maths and statistics pipe dreams, and who knows, I may find there are other dreams not covered at all.

To a certain extent, these pipe dreams begin to make concrete the drift away from actuarial studies that some of the more careful readers may have noticed. Since I left engineering and became effectively a predictive modeler, a lot of the impetus to complete actuarial studies has fallen away. To me, though, the two areas are certainly related, and in support of this claim I present exhibit A, my earlier post on ‘Data Mining in the Insurance Industry’, which effectively covers papers explaining how to achieve some of the goals of CT6 by different means.

My immediate plans, then, come in three buckets – learn more maths, learn more computer programming and learn more statistics (in which category I include statistical and machine learning). The aim of the first two is obviously to support the last aim, so the selection of topics will be somewhat influenced by this consideration.

In this post, I will just talk about computer programming, as I am rambling enough already without trying to cover three different areas of self-learning. I am taking my cues in this area from a couple of blog posts by Cosma Shalizi, where he puts the case for computer programming as a vital skill for statisticians, and gives some basic prompts on what this means in practice.

Shalizi’s first piece of advice is to take a real programming class, or, if you can’t do that, read a real programming book. He recommends Structure and Interpretation of Computer Programs, and seeing as it is available for free, I say ‘that will do just fine’.

SICP, as it seems to be popularly known, teaches programming via the functional programming language Scheme. I would like to learn a little about functional programming, but I would also like to learn a programming language which is more commonly used for data analysis. Hence, in addition to reading SICP, I want to read Think Python, which is also free, but which teaches the Python language (obviously).

Both of these books are listed, with many others, on the GitHub Free Programming Books page.

The Epitome of Data Science

3 Dec

Christian Robert is a leading Bayesian statistician, and, like many Bayesian statisticians, an avid blogger (really, frequentists don’t seem to blog as much – or maybe there are really only Bayesian and ambivalent/agnostic statisticians these days).

Christian generously posts what he is doing with his classes on his blog (or ’og, as he prefers). For a few years now, he has held a seminar series on classic papers (list found here: https://www.ceremade.dauphine.fr/~xian/M2classics.html). Last week, one of his students found a paper not included on the list which in some ways symbolises the meaning of data science as the place where statistics meets computer science:

The paper is here:

http://www.personal.psu.edu/users/j/x/jxz203/lin/Lin_pub/2013_ASMBI.pdf

And here is Christian’s write up of his student’s seminar with his own response to the paper

http://xianblog.wordpress.com/2013/11/29/reading-classics-3-2/

The paper is simply a proposal of how to calculate some commonly used statistics on data too big to fit in memory, using the approach of chopping the data set into smaller pieces. Christian raises some mathematical concerns.
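
This is not the paper’s method, but just to illustrate the chunk-and-combine idea in its simplest form, here is a sketch that computes a mean by reading a large CSV in blocks and keeping running totals (the file name and column position are hypothetical):

con <- file("big_file.csv", open = "r")
invisible(readLines(con, n = 1))        # skip the header row
total <- 0; n <- 0
repeat {
  lines <- readLines(con, n = 100000)   # one manageable chunk at a time
  if (length(lines) == 0) break
  block <- read.csv(textConnection(lines), header = FALSE)
  total <- total + sum(block[[3]])      # hypothetical: third column is numeric
  n <- n + nrow(block)
}
close(con)
total / n                               # matches mean() on the full column

Sums combine trivially across chunks; the harder question for the paper is how to do the same for statistics that are less well behaved.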

In some ways, though, the correctness of the approach is not as interesting as the fact that academic statisticians are putting serious effort into dealing with the obstacles thrown up by datasets being greater than computers’ ability to process them, which will hopefully lead to the discipline of statistics having more of a Big Data voice. It is weird, though, that by doing this sort of work, we have gone full circle to the pre-computing age, where finding workable approximations to allow calculation by hand of statistics on data with a few hundred rows was a serious topic of interest. All of which makes re-reading the review (http://www.tandfonline.com/doi/abs/10.1080/00207547308929950#.Up5q8MQW2Cl) of Quenouille’s Rapid Statistical Calculations (which I have never seen for sale anywhere) a slightly odd experience when the reviewer says that computers have made that sort of thing irrelevant!

Data Cleaning for Predictive Modeling

25 Nov

This discussion – http://andrewgelman.com/2013/11/19/22182/ – takes up the question of whether data cleaning and preparation is intrinsic to applied statistics, or whether spending many hours preparing data is more something data scientists do, with statisticians possibly expecting at least semi-cleaned data. A theme which emerged in Gelman’s discussion of what a data scientist is was that ‘data scientist’ means different things to different people, and the same applies to ‘data preparation’. There are different ‘data preparations’ for different occasions.

Applied Predictive Modeling by Kuhn and Johnson, which we have looked at before, is one of the rare books on modeling or statistics which explicitly has a section devoted to optimal preparation of data sets. We reiterate that this concept means different things to different people.

The meat of Kuhn and Johnson’s advice on data preparation is found in Chapter 3: Data Pre-Processing. The authors note that additional advice which applies to supervised models is scattered throughout the rest of the text.

Chapter 3 is about adding and subtracting predictors, and re-engineering predictors for the best effect. They are particularly down on binning, and present a number of methods to help overcome skewness, scale predictors correctly, and reduce the data sensibly (hint: binning is a poor choice). Another area of interest to Kuhn and Johnson is how to deal with missing data. This issue is notable for being one which is relatively often dealt with by applied statistics texts – for example, Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models contains a chapter on missing data imputation.
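
As a minimal sketch of what this kind of pre-processing looks like in practice, using Kuhn’s caret package (the all-numeric training data frame train_x is hypothetical):

library(caret)

# Estimate Box-Cox transforms, centring/scaling and a PCA reduction from the data
pp <- preProcess(train_x, method = c("BoxCox", "center", "scale", "pca"))
train_ready <- predict(pp, train_x)   # apply the same transformations to any data set

Adding "knnImpute" to the method vector is one route to filling in missing values.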

To be sure, there is plenty of very practical advice, but for it to be effective your data set needs to have been looking pretty good to begin with.

A contrast to this approach is the Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work. Applied Predictive Modeling’s interest in the optimal clean data set for modeling assumes a somewhat clean data set to begin with. Minimum requirements are not really listed, but they could plausibly include no wrong data, data in different rows and columns agreeing with respect to units, and data being formatted in a way that your software can understand.

Obviously a book-length treatment can cover a lot more ground than a single chapter. Actually, the ground covered here is almost completely disjoint from Kuhn and Johnson’s, rather than broader or deeper and overlapping it. Adding to the breadth of the Bad Data Handbook’s coverage is that it is written by a number of authors, each contributing a chapter on an area they are strong in. While they seem to have been well organised enough by the volume’s editor to avoid too much overlap, a negative result of this approach is that code examples come in a variety of platforms, which can be irritating if it means you have to interrupt your reading to learn basic Python syntax. That said, if you aren’t comfortable with the idea of becoming familiar with R, Python, Unix etc., you probably aren’t so interested in becoming a data scientist (whatever that might be, most people seem to agree that a very low basic requirement is a willingness to learn programming languages).

Another outcome of this approach is that each chapter reads as self-contained. This is great because it means you can profitably start reading from the beginning of any chapter, but the corollary is that it is not necessarily straightforward to find what you are looking for if you want to use this as a reference book, as the word ‘Handbook’ in the title implies.

Regardless, this is a book which covers very different ground to statistics texts and to modeling texts such as Kuhn and Johnson, and is therefore its own reward.

Empty Vessels

18 Nov

Influence is big at the moment, partly thanks to LinkedIn’s promotion of Influencers (who usually aren’t) – essentially people who write short, career-oriented inspirational stuff that is piped into your email inbox. Which is all good, but when a word is used incorrectly, or at best loosely, its meaning is diluted, and when you want to use it for its original meaning, it doesn’t work as well.

To be clear, you are influential if a great number of people in your field act differently because of work you have done. Picasso was an influential artist because many prominent artists paint differently because of his example. Bob Dylan is influential because a great many singer-songwriters changed their methods because of his example and/or found audiences which were created because Bob Dylan came first. While the idea of influence is slippery and subjective, in those cases, and others like them, we can make some progress towards objectivity using this definition.

A couple of weeks ago, Time magazine made itself a large and slow moving target by publishing a ‘Gods of Food’ cover story, featuring 3 males and no females on the cover, and 9 males in a list of 13 inside. See here for some discussion of associated brickbats:

http://www.huffingtonpost.com/2013/11/14/female-chefs-respond-time-gods-of-food_n_4273610.html

Another one of the many Big Data/ Data Science/ Predictive Modeling bloggers has flung themselves onto the same hand grenade by suggesting a list of ‘Top 10 Most Influential Data Scientists’ which includes no women at all.

http://www.deep-data-mining.com/2013/05/the-10-most-influential-people-in-data-analytics.html

Note that the first comment is a plea for the inclusion of women from someone whose name looks female, with another comment from the same person that has been deleted. I like to think that that comment was deleted because it was a howl of outrage, too raw in its emotion and intemperate in its language to be let loose on the sheltered data science community. But I have no data to support this assertion, and will move on…

To me what is striking about the omission of women from this list is that the criteria were so loose that it could easily have been avoided. After all, missing from the criteria is any requirement for evidence of influence – in the sense of people who call themselves ‘data scientists’, or are called that by others, doing the work differently because of the example of these ten guys. Which is not to say that these ten guys aren’t influential in that sense, just that the list was created without checking whether they were.

While the omission is glaring and wants addressing, I’m not so upset about that part as about this being an example of how you can’t move around data-science-linked websites, blogs, fora, etc. – as you might want to do to find datasets (which is what I was doing when I accidentally found this blog post), programming hints, and so on – without encountering stuff that is dangerously close to spam. The rest of the Deep Data Mining blog, for example, appears to be crammed with advice on how to use different platforms, especially database platforms, to better advantage. Why not stick to that?