Archive | Data Science

Selling Data Science

9 May

Creating sales documents and pitches that list out all the shiny new things that our data science application can do is very tempting. We worked hard on those features and everyone will appreciate them, right?

Well, not really. For one, it’s very likely your target audience doesn’t have the technical ability to understand the point of what you’re selling. After all, if they had your technical skills, they wouldn’t be thinking of hiring a data scientist, they’d just be doing it themselves.

The next problem is that you can’t trust that the customer realises how your solution helps them out of their present predicament. Moreover, it’s disrespectful to get them to do your job for you. Hence, you need to make sure your pitch joins the dots between what you intend to do for the customer and how it’s going to make their life easier.

In sales parlance this is known as ‘selling the benefits’ – that is, making it clear to the potential customer how buying your product will improve their lives. It has been encapsulated in the phrase ‘nobody wants to buy a bed – they want a good night’s sleep’. The rub is that in most data science scenarios the problem that corresponds to the potential benefit is a business problem – such as reducing inventory or decreasing cost of sales – rather than a human problem, such as getting a good night’s sleep.

Therefore, being able to complete the journey from feature to benefit requires some knowledge of your customer’s business (whereas everyone knows the benefits of a good night’s sleep – and the horrors of not getting one – far fewer understand the fine points of mattress springing and bed construction) and the ability to explain the links. This last is crucial, as the benefits of your work are too important to allow your customer an opportunity to miss them.

What all this means in the end is that the approach of inspecting data sets in the hope of finding ‘insights’ will often fail, and may border on being dangerous. Instead you need to start with what your customer is trying to achieve and what problems they are facing, before seeing which of those problems correspond with data that can be used to build tools to overcome them.

R Packages for Managers

22 Apr

Roger Peng, in his e-text, ‘Mastering Data Science’, makes an off-hand comment to the effect that if you are going to do something twice in R, write a function, but if you’re going to do it three times, write a package (actually he’s self-plagiarising from his own book, Executive Data Science, which I don’t have).

When writing about functions and packages in R, Peng advances several of the usual arguments in favour of their use, such as avoiding rework, creating more readable code etc. In my opinion just listing off those standard reasons undersells the benefits of creating functions and packages, especially in a corporate environment.

A huge challenge in a corporate environment is to capture employee knowledge and experience, in an environment where a lack of time, and sometimes of people, breeds a culture of getting things out the door quickly without pausing for reflection or, crucially, documentation. Hence, if an employee goes out the door, their knowledge and experience go with them. Asking people to write packages which collect the processes they applied during a particular project keeps a substantial part of that knowledge inside the organisation.

The other undersold virtue of writing functions and packages is that it is an antidote to R turning into a command line environment rather than a software environment. That is, it moves users away from inputting strings of R commands, effectively making themselves part of the program, to writing something closer to conventional programs, though usually small ones.

In my own work, I see a particular opening for moving activity toward functions and packages as we try to sell the same idea to three different potential customers, involving a similar process of providing customised (i.e. to the potential customer’s data set) toy examples before doing similar work across each customer’s data set. With three being Peng’s threshold where I need to create packages (and there will likely be some rework for individual customers, e.g. the same task performed on different date ranges), I seem to be squarely in the category that needs to write packages.
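As a minimal sketch of the kind of function that might end up in such a package, here is the "same task on different date ranges" idea written once and reused. The function name summarise_period and the column names date and amount are hypothetical, chosen only for illustration; they do not come from any real project.

```r
# Hypothetical helper: summarise one customer's data over a chosen date range.
# Assumes `data` has a Date column `date` and a numeric column `amount`.
summarise_period <- function(data, start, end) {
  start <- as.Date(start)
  end <- as.Date(end)
  stopifnot(start <= end)
  in_range <- data$date >= start & data$date <= end
  data.frame(
    start = start,
    end = end,
    n = sum(in_range),
    total = sum(data$amount[in_range])
  )
}

# Toy data standing in for one customer's data set.
toy <- data.frame(
  date = as.Date("2014-01-01") + 0:9,
  amount = 1:10
)

# The same call can be repeated with a different customer or date range.
res <- summarise_period(toy, "2014-01-03", "2014-01-05")
print(res)
```

Once two or three helpers like this exist, collecting them in a package with their documentation is what keeps the process inside the organisation rather than in one person's command history.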

Lattice R Users Group Talk

14 Aug

Last night I gave a talk on the Lattice package of R, held together by the idea that Lattice is an expression of Bill Cleveland’s overall philosophy of visualisation. I don’t know that I put my argument very clearly, but I think the fact of having an argument made the talk a tiny bit less incoherent!

After the talk there was some discussion of a few things – the use of colour schemes for one, but also two questions I didn’t have answers to, although I made a couple of totally wrong guesses!

Question 1: Can you include a numeric scale on Lattice trivariate functions (and how)?

Answer – Yes, there is a scales argument, which must be supplied as a list.

Hence, given you have your surface dataframe pre-prepared (called surface here, purely as a placeholder name):

wireframe(z ~ x * y, data = surface, scales = list(z = list(arrows = FALSE, distance = 1)))
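For anyone wanting to try this without a surface to hand, the following is a self-contained sketch. The toy data frame and its name surface are assumptions made up for the example. Here arrows = FALSE is applied to all three axes rather than only to z.

```r
library(lattice)

# Toy surface: z = sin(x) * cos(y) on a regular grid (illustrative data only).
surface <- expand.grid(x = seq(-3, 3, length.out = 30),
                       y = seq(-3, 3, length.out = 30))
surface$z <- sin(surface$x) * cos(surface$y)

# arrows = FALSE replaces the default axis arrows with numeric tick labels.
p <- wireframe(z ~ x * y, data = surface,
               scales = list(arrows = FALSE))
print(p)
```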

Question 2: Can you use the print() function to arrange graphics objects produced from multiple R graphics packages?

So far as I can tell, the answer is a qualified ‘yes’, where the qualification is that you need to be working with a graphics package which produces a storable graphics object – lattice obviously does, and it looks like ggplot2 does also. Another package I selected at random, vcd, does not, however.
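A rough sketch of how this might look with one lattice plot and one ggplot2 plot on the same page is below. It relies on the position and more arguments of lattice's print method and the vp argument of ggplot2's print method. The choice of the built-in mtcars data and the half-page layout are just assumptions for illustration, and the ggplot2 half is skipped if that package is not installed.

```r
library(lattice)
library(grid)

# Lattice returns a storable "trellis" object.
p_lattice <- xyplot(mpg ~ wt, data = mtcars)

# Draw it in the left half of the page; more = TRUE keeps the page open.
print(p_lattice, position = c(0, 0, 0.5, 1), more = TRUE)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  # ggplot2 also returns an object that can be stored and printed later.
  p_gg <- ggplot2::ggplot(mtcars, ggplot2::aes(wt, mpg)) +
    ggplot2::geom_point()
  # Draw it in the right half, inside a grid viewport, without a new page.
  print(p_gg, vp = viewport(x = 0.75, y = 0.5, width = 0.5, height = 1))
}
```

This works because both packages draw through grid, so each plot can be directed to its own region of the page.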



Actual Biggish Data

29 Apr

My first reaction when I read that Kaggle was promoting a competition with its biggest ever data set (Acquire Valued Shoppers Challenge) was ‘Oh no, that’s one I will definitely need to avoid,’ as I am still struggling to find the time to hone my Hadoop, Mahout et al. chops.

As luck would have it, though, I have since stumbled across some R packages that may mean I can at least make a semi-sensible entry, and which have therefore motivated me to download the data (8 GB compressed, which took five hours, so not completely trivial).

The difficulty is that according to Kaggle, the uncompressed data is in excess of 22 GB, and R on my machine certainly balks at loading much smaller dataframes into memory. Part of the answer is probably to sample from the data set out of SQL before attempting any analysis. Even then, I was going to have to work with a very small subset of the data to get anywhere with R on my machine.
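The sampling step does not have to go through SQL. As an alternative sketch (not the approach described above), the following base R function reads a CSV in chunks and keeps each row with a fixed probability, so the whole file never sits in memory at once. The function name sample_csv and its arguments are hypothetical, and it assumes a simple CSV with a header row and no line breaks inside quoted fields.

```r
# Hypothetical helper: keep roughly `prob` of the rows of a large CSV,
# reading `chunk_size` lines at a time.
sample_csv <- function(path, prob, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)
  kept <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    # Keep each line independently with probability `prob`.
    kept <- c(kept, lines[runif(length(lines)) < prob])
  }
  read.csv(text = c(header, kept), stringsAsFactors = FALSE)
}

# Toy file standing in for the real (much larger) download.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, x = rnorm(1000)), tmp, row.names = FALSE)

set.seed(1)
smp <- sample_csv(tmp, prob = 0.1, chunk_size = 100L)
nrow(smp)
```

The sample size is random rather than fixed, which is usually acceptable for exploratory work; a fixed-size sample would need a reservoir-style approach instead.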

The first discovery that looks likely to assist me here is the bigmemory package, which comes with sibling packages bigalgebra, biganalytics and bigtabulate. This suite of packages aims to improve the facility of working with very large matrices that still fit in memory. The paper below presents the features of the bigmemory packages.

By contrast, the ff package aims to allow the user to work with data frames which are too large to fit in memory. Hence, even larger sets than are tractable using bigmemory and the related packages on their own become tractable.

The following informal paper introduces the ff package:

Another quick look at the ff package is available via the presentation files of the awesomely acronymed Dallas R Users group, whose members were lucky enough to have a presentation on the ff package in Jan ’13. The presentation slides and two related R files are available from this page:

Dallas R Users Group


Lastly, the following paper discusses when to use the different packages mentioned above, and better still, some ways to avoid needing to use them in the first place:



Lattice Trellis and Other Pretty Patterns

16 Mar

My local R users group is sufficiently in need of speakers that it was suggested that I give a talk, which I am quite willing to do. I have started to prepare a talk about the Lattice package, its ‘parent package’ Trellis from S-Plus, and some of the research of Trellis’s creator, Bill Cleveland, and his continuing influence.

Admittedly, I’m not an expert on visualisation, so I will cite Fell In Love With Data to back up my assertion that Cleveland is a big name in this area. FILWD listed their selection of the greatest papers of all time, with one of Cleveland’s early visualisation papers at the top.

That paper explains Cleveland’s feeling that there was a gap in the market for intellectual rigour (for want of a better phrase) in the application of statistical graphics. Cleveland’s contribution was to assess the cognitive difficulty of interpreting some of the key genres of data display, confirm the findings experimentally, and apply those findings to common trends in data display (pie graphs?).

Overall, Cleveland’s work history is too large to fit in one web page – he has two. One is his Bell Labs page, which remains preserved although he has since moved on. Cleveland’s new home is Purdue University, and his web page there has another selection of papers. Of course, of greatest interest to me in getting deeper in lattice was the trellis home page at Bell Labs, which includes the original Trellis manual, which can be used without changes with lattice in R to produce graphics.

A handy extra resource comes from Paul Murrell, the statistician who developed grid, the R package which underlies lattice: he has made some sample chapters of his book ‘R Graphics’ available, including, in particular, his chapter on lattice.

Discovering Michael Friendly’s history of data visualisation has been a pleasantly surprising side benefit of looking into the lattice package, and related developments in data visualisation. Although there isn’t any detailed discussion of any of the specific topics I am researching, the timeline developed by Friendly showing peaks and troughs in data visualisation development (figure 1) gives some interesting context to Cleveland’s complaint of insufficient rigour in data visualisation. Ironically, Friendly posits the success of statistics via calculation as the leading cause of the lack of interest in visualisation throughout the first half of the twentieth century. Interestingly, Cleveland appears to come just after the rebirth of visualisation (1950-1975), rather than being part of it.

Kaggle Leaderboard Weirdness

29 Jan

Earlier this week I finally, after about half a dozen false starts, posted a legal entry to a Kaggle competition, and then when I saw how far off the pace I was, I posted another half a dozen over the course of a day, improving very slightly each time. If the competition ran for a decade, I’d have a pretty good chance of winning, I reckon…

While I now understand how addictive Kaggle is – it hits the sweet spot between instant gratification and highly delayed gratification – I find the leaderboard kind of weird and frustrating because so many people upload the benchmark – the trivial solution the competition organisers upload to have a line in the sand. In this competition, the benchmark is a file of all zeroes.

This time yesterday, there were around a hundred entries that were just the benchmark, out of about 180. Today, for some reason, all the entries so far appear to have been removed, so there are only about thirty – but twenty of those are the benchmark again! I get that people just want to upload something so they can say they participated, but this many all-zero files is the thing getting out of hand.

2014: New Year’s Plans (Dreams?)

14 Jan

This is the first in a short series, and covers my R and computer programming pipe dreams for 2014. Another post will cover my maths and statistics pipe dreams, and who knows, I may find there are other dreams not covered at all.

To a certain extent, these pipe dreams begin to make concrete the drift away from actuarial studies that some of the more careful readers may have noticed. Since I left engineering, and became effectively a predictive modeler, a lot of the impetus to complete actuarial studies has fallen away. To me, though, the two areas are certainly related, and I present exhibit A, my earlier post on ‘Data Mining in the Insurance Industry’, which effectively covers papers explaining how to do some of the goals of CT6 by different means, to support this claim.

My immediate plans, then, come in three buckets – learn more maths, learn more computer programming and learn more statistics (in which category I include statistical and machine learning). The aim of the first two is obviously to support the last aim, so the selection of topics will be somewhat influenced by this consideration.

In this post, I will just talk about computer programming, as I am rambling enough without trying to cover three different areas of self learning. I am taking my cues in this area from a couple of blog posts from Cosma Shalizi, where he puts the case for computer programming as a vital skill for statisticians, and gives some basic prompts on what this means in practice.

Shalizi’s first piece of advice is to take a real programming class, or, if you can’t do that, read a real programming book. He recommends Structure and Interpretation of Computer Programs, and seeing as it is available for free, I say ‘that will do just fine’.

SICP, as it seems to be popularly known, teaches programming via the functional programming language Scheme. I would like to learn a little about functional programming, but I would also like to learn a programming language which is more commonly used for data analysis. Hence in addition to reading SICP I want to read Think Python, which is also free, but which teaches the Python language (obviously).

Both of these books are listed, with many others, on the GitHub Free Programming Books page.