New Blog

10 Jan

I have created a new blog, which will be my focus from now on. It is called ‘Tilting at Data’ (in honour of the Ingenious Knight of La Mancha) and its main focus will be adding data science functionality to Haskell (most likely a quixotic activity, hence the reference). I will also try to season the mix with more general musings on data science, and some other data science programming activities in languages like Python. I have a number of texts where useful routines and functions are presented in R code – I figure translating some of them from R to Python could be a good way to understand both the routines and Python better, and someone may occasionally find the result useful. Ideally, I would start with some of the classification diagnostics from this book by Japkowicz and Shah.

Last year, I wrote a couple of general data science pieces on this blog – most likely I will migrate these across to the new blog, after a bit of a fitness review. In any case, check it out here!

Selling Data Science

9 May

Creating sales documents and pitches that list out all the shiny new things that our data science application can do is very tempting. We worked hard on those features and everyone will appreciate them, right?

Well, not really. For one, it’s very likely your target audience doesn’t have the technical ability to understand the point of what you’re selling. After all, if they had your technical skills, they wouldn’t be thinking of hiring a data scientist, they’d just be doing it themselves.

The next problem is that you can’t trust that the customer realises how your solution helps them out of their present predicament. Moreover, it’s disrespectful to get them to do your job for you. Hence, you need to make sure your pitch joins the dots between what you intend to do for the customer and how it’s going to make their life easier.

In sales parlance this is known as ‘selling the benefits’ – that is, making it clear to the potential customer how buying your product will improve their lives – and has been encapsulated in the phrase ‘nobody wants to buy a bed – they want a good night’s sleep’. The rub is that in most data science scenarios the problem that corresponds to the potential benefit is a business problem – such as reduced inventory or decreased cost of sales – rather than a human problem, such as getting a good night’s sleep.

Therefore, being able to complete the journey from feature to benefit requires some knowledge of your customer’s business (everyone knows the benefits of a good night’s sleep – and the horrors of not getting one – but far fewer understand the fine points of mattress springing and bed construction) and the ability to explain the links. This last is crucial, as the benefits of your work are too important to allow your customer an opportunity to miss them.

What all this means in the end is that the approach of inspecting data sets in the hope of finding ‘insights’ will often fail, and may border on being dangerous. Instead, you need to start with what your customer is trying to achieve and what problems they are facing, before seeing which problems correspond to data that can be used to build tools to overcome them.

Timothy, Paul and Data Science

7 May

Like any other atheist who regularly attends an evangelical church, I often find myself wondering how to apply the sermon to my life. A recent example which seemed a little easier than other occasions was a sermon from a guest preacher on succession planning.

Part of the point for this preacher is that he’s a kind of mentor for a number of churches, so he traipses around Australia advising other pastors how to do things better – and also sees them failing, often for predictable reasons. Hence, when he spoke about succession, he was talking from experience.

Of course, succession planning isn’t specifically about churches. The phrase is more commonly heard in corporate settings. His solution to the problem – in the end a call to spread the Gospel, not all that surprisingly from an evangelical preacher – initially seemed one that had no application to the corporate world, but after a little reflection actually seemed very applicable.

By the pastor’s logic, the gospel was effectively the knowledge needed to participate in his religion. So, by extension, succession planning was about the transfer of knowledge. In a way, this is not a revolutionary idea – of course succession planning is about the transfer of knowledge of a working environment, customers, skills to get a job done.

But the emphasis is so often on the leader of an organisation and (too often, in both senses) his immediate reports. Hence the emphasis is on the knowledge that they will bring into the company, the skills in running a business they learnt elsewhere that they will apply to your company. It’s like judging future converts by the abilities they bring to a church from outside – their ability to speak in public, or to be great fund raisers – rather than the pastor’s idea of succession planning through passing on the knowledge already inside the organisation.

The alternative is succession planning starting from the ground up – making succession planning begin with the people who make your product or provide your service, and the people who secure your customers. In a very real way they are your business. Certainly, in my own career in manufacturing, I’ve seen the results of failing to ensure knowledge is transferred from the people who make the product to others. In short, when they retire, there are delays and defects as people attempt to re-discover the skills.

The previous blog post was about one way that skills can be transferred – the knowledge of a process can be converted into a computer program, with sensible commenting and documentation. Not a solution in itself, not the only alternative, but an extra tool that can be employed. From this perspective, the sermon was another way of seeing the big-picture way that tool can be employed in a Data Science setting.

R Packages for Managers

22 Apr

Roger Peng, in his e-text ‘Mastering Data Science’, makes an off-hand comment to the effect that if you are going to do something twice in R, write a function, but if you’re going to do it three times, write a package (actually he’s self-plagiarising from his own book, Executive Data Science, which I don’t have).

When writing about functions and packages in R, Peng advances several of the usual arguments in favour of their use, such as avoiding rework, creating more readable code etc. In my opinion just listing off those standard reasons undersells the benefits of creating functions and packages, especially in a corporate environment.

A huge challenge in a corporate environment is to capture employee knowledge and experience, in an environment where lack of time and sometimes of people breeds a culture of getting things out the door quickly without pausing for reflection or, crucially, documentation. Hence, if an employee goes out the door, their knowledge and experience goes with them. Asking people to write packages which collect the processes they applied during a particular project keeps a substantial part of that knowledge inside the organisation.

The other undersold virtue of writing functions and packages is that it is an antidote to R turning into a command line environment rather than a software environment. That is, it moves users away from inputting strings of R commands, effectively making themselves part of the program, to writing something closer to conventional programs, though usually small ones.
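To make the command-line-versus-program distinction concrete, here is a minimal sketch of the kind of shift I have in mind; the function name, column name and numbers are invented for illustration, not taken from any real project.

```r
# Command-line style: the same steps retyped in every session
# dat <- dat[!is.na(dat$amount), ]
# dat$amount <- dat$amount / 100

# Function style: the process is captured once, named, commented,
# and ready to be collected into a package later
clean_amounts <- function(dat) {
  # Drop rows with missing amounts, then convert cents to dollars
  dat <- dat[!is.na(dat$amount), ]
  dat$amount <- dat$amount / 100
  dat
}

clean_amounts(data.frame(amount = c(100, NA, 250)))
```

The point isn’t the three lines saved – it’s that the process now has a name and a home, instead of living in someone’s console history.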

In my own work, I see a particular opening for moving toward functions and packages as we try to sell the same idea to three different potential customers, involving a similar process of providing toy examples customised to each potential customer’s data set before doing similar work across each customer’s full data set. With three being Peng’s threshold for writing a package (and there will likely be some rework for individual customers, e.g. the same task performed on different date ranges), I seem to be squarely in the category that needs to write packages.

Coding is not the answer (for every question)

23 Nov

There is a movement gathering steam at the moment with the aim of proliferating coding education, in itself a fine idea. Computers are everywhere, they are harder to detect than before, and people need to know when others are using computers to game them – understanding a little bit of computer science is somewhere between very helpful and essential for these things – Defence Against the Dark Arts for the contemporary developed world.

Somewhere out of this movement has emerged a second movement proclaiming that teaching coding will teach people to think – seemingly an insufficient number of people were thinking until programmers started banding together to enlighten us.

Yes, part of my objection is to the slightly condescending way these people relate to the rest of us rather than to their actual arguments, but I still have objections to the content of their argument as well. These mainly stem from the fact that they’re arguing for programming as a way of teaching thinking as though other ways of learning to think were not available. In point of fact, the notion of teaching thinking is at least as old as Socratic philosophy, and exists in a wide variety of forms, from Western and non-Western perspectives.

Sometimes coding proponents go as far as to suggest that coding is an ideal way to learn maths or logic. Maybe they have a point about logic – formal logic studies may be too esoteric for a lot of tastes.

On the other hand, to recommend programming as a way of learning maths is kind of odd. You can only learn maths by learning maths. Natural aptitude for maths is highly correlated with natural aptitude for programming – it’s hard to imagine those weak at maths will have an easy time in programming. The strong ones will learn whichever they spend time on – either way, time spent coding is just time away from maths.

This last is the crux of it – the proponents of coding in schools discuss the idea as though there are several hours per week of fallow time up for grabs. There are not. Something else has to go to make room for time spent coding. My personal guess is that most of the proponents of the coding-in-schools idea are thinking of something in the humanities rather than a science or maths subject (although at least one self-identified software developer commenting on another blog wanted to reduce arithmetic teaching in schools – as if our society wasn’t innumerate enough!). I’ve read a number of data scientist CVs – if I could change the education of that group of people, I’d be taking coding out and putting English lit in.

MOOCs and Mathematics Self-Learning

21 Oct

Over time, this blog has morphed from being about actuarial self-learning to being more about mathematics and statistics self-learning, reflecting my personal career peregrinations. I kind of hope that such readers as there are aren’t too turned off – to me it seems there is a lot of crossover. It also seems, at least from reading forums for actuarial learners (weak evidence, so feel free to provide your own counterpoint), that insufficient mathematics is at the root of a lot of the difficulties that actuarial students discover along the way.

I intend to soon write a blog piece about my attempts to learn linear algebra and group theory via the slightly indirect route of learning about symmetry groups in crystallography, but today I just want to make a quick observation – the MOOC revolution isn’t coming, at least not yet, at least for mathematics.

2013 appeared to be the year of the MOOC. There were new MOOCs springing up all the time, in an ever wider array of subjects. Today it seems like the revolution has stalled – looking at Class Central, the MOOC aggregator, there are only six entries under the heading ‘Mathematics and Statistics’ that have recently started or are starting soon, of which three are in languages other than English (there were also 17 courses in progress, of which 5 were non-English, and two were maths meta-courses a la ‘How to learn maths’). At one stage, MathBabe forecast that MOOC offerings in the maths subjects most associated with non-mathematicians – single and multivariable calculus, elementary linear algebra, probably also intro stats for non-statisticians – would put tertiary maths departments out of a job.

To me, unless there is growth in the courses offered – including at least a selection of the standard undergrad maths major subjects (so far, no English-language abstract algebra or number theory) – MOOCs, for better or worse, just aren’t going to take over the world of teaching, or even be a supplement for students beyond first year.

Self Learning Mathematics

9 Sep

The theme of this blog has always been self-learning – we started with self-learning actuarial studies, have dabbled in self-learning predictive modelling, and now we are looking at self-(re)learning mathematics, in order for a deeper push into predictive modelling and statistics.

I was reminded of the self-learning angle the other day when I stumbled across a blog which charts the adventures of a gentleman re-learning the Latin and Ancient Greek he had learnt up to the point he left tertiary education, now that his time in the workforce has ended.

What we have in common is that there is an element of dishonesty in calling this ‘self-learning’ – the blogger above left tertiary education with an enviable grasp of the languages, helped by his lecturers at uni and probably his high school teachers. He wasn’t going to stumble because the ablative case was too weird to understand, or become disheartened by deponent verbs.

In my case, I am re-learning some material that I have seen before and some other material that I haven’t seen before, and next year hope to take Linear Algebra, Abstract Algebra and Number Theory courses as non-award subjects to make sure that I have learned that material correctly.

Compared to the blogger, at least I have the advantage that where I have seen the material, it is only four or five years rather than 35 or 40 years since I worked with it. At the same time, half the motivation is to study some branches of mathematics that I think I should have studied before taking somewhat more advanced studies – Linear Algebra especially, which is obviously a foundation of statistics and spectral analysis.

My current foray into re-learning linear algebra is being supported by Serge Lang’s Introduction to Undergraduate Linear Algebra, which seems to have a terrible reputation among Amazon reviewers and commenters on places like math stack exchange. I think the reason is that the pace is fairly brisk. 

For my own part, I find the brevity a little bit refreshing, even when I am looking at stuff I have never seen before (or at least have no memory of seeing before!). The best part is the portability, which allows me to put it in a coat pocket and take it wherever I am going (something not true of the calculus text I used, by Anton, Bivens and Davis). Despite its brevity, it also seems to get to material which is advanced enough for my purposes – just short of the lecture notes for the course I plan to do next year, without the distracting ‘matrix operation’ notation and covering just about all of the same topics within the subject.

I also mentioned before that I had taken some more advanced studies in statistics and spectral analysis than my command of Linear Algebra ought to have allowed – it is certainly pleasurable to have various puzzles and obstacles of past studies resolved, although frustrating in the sense that I could have done better at the time with just a smidgen more Linear Algebra knowledge at my fingertips.

Lattice R Users Group Talk

14 Aug

Last night I gave a talk on the Lattice package of R, held together by the idea that Lattice is an expression of Bill Cleveland’s overall philosophy of visualisation. I don’t know that I put my argument very clearly, but I think the fact of having an argument made the talk a tiny bit less incoherent!

After the talk there was some discussion of a few things – the use of colour schemes for one, but also two questions I didn’t have answers to, although I made a couple of totally wrong guesses!

Question 1: Can you include a numeric scale on Lattice trivariate plots (and how)?

Answer – Yes, there is a scales argument, which must be given as a list.

Hence, given you have your surface dataframe pre-prepared:

wireframe(z ~ x * y, data = surface, scales = list(z = list(arrows = FALSE, distance = 1)))
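For a fully self-contained version, here is a sketch that builds a toy surface data frame first (the grid spacing and the Gaussian-bump function are just placeholders for whatever surface you have):

```r
library(lattice)

# Build a small surface data frame, standing in for the
# pre-prepared one mentioned above
surface <- expand.grid(x = seq(-2, 2, 0.1), y = seq(-2, 2, 0.1))
surface$z <- with(surface, exp(-(x^2 + y^2)))

# arrows = FALSE replaces the default direction arrows on the z axis
# with numeric tick labels; distance pushes the labels off the axis
wireframe(z ~ x * y, data = surface,
          scales = list(z = list(arrows = FALSE, distance = 1)))
```

Putting `arrows = FALSE` at the top level of the scales list (rather than inside `z = list(...)`) would label all three axes numerically instead of just z.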

Question 2: Can you use the print() function to arrange graphics objects produced from multiple R graphics packages?

So far as I can tell, the answer is a qualified ‘yes’, where the qualification is that you need to be working with a graphics package which produces a storeable graphics object – lattice obviously does, and it looks like ggplot2 does also. Another package I selected at random, vcd, does not, however.
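As a sketch of what I mean (assuming lattice and ggplot2 are installed; the data set and layout are arbitrary): print.trellis accepts a position argument, while a ggplot2 object can be placed inside a grid viewport.

```r
library(lattice)
library(ggplot2)
library(grid)

p_lattice <- xyplot(mpg ~ wt, data = mtcars)
p_ggplot  <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# print.trellis takes position = c(xmin, ymin, xmax, ymax) in [0, 1]
# units of the device; more = TRUE keeps the device open
print(p_lattice, position = c(0, 0.5, 1, 1), more = TRUE)

# ggplot2 objects are placed via a grid viewport instead
print(p_ggplot, vp = viewport(x = 0.5, y = 0.25,
                              width = 1, height = 0.5))
```

Both packages are built on grid, which is what makes this mixing possible; base-graphics packages like vcd’s older plotting functions don’t produce a storeable object to print.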



Actual Biggish Data

29 Apr

My first reaction when I read that Kaggle was promoting a competition with its biggest ever data set (the Acquire Valued Shoppers Challenge) was ‘Oh no, that’s one I will definitely need to avoid,’ as I am still struggling to find the time to hone my Hadoop, Mahout et al. chops.

As luck would have it, though, I have since stumbled across some R packages that may mean I can at least make a semi-sensible entry, and which have therefore motivated me to download the data (8 GB compressed, which took five hours, so not completely trivial).

The difficulty is that, according to Kaggle, the uncompressed data is in excess of 22 GB, and R on my machine certainly balks at loading much smaller dataframes into memory. Part of the answer is probably to sample from the data set in SQL before attempting any analysis. Even then, I was going to have to work with a very small subset of the data to get anywhere with R on my machine.

The first discovery that looks likely to assist me here is the bigmemory package, which comes with the sibling packages bigalgebra, biganalytics and bigtabulate. This suite of packages aims to improve the facility of working with very large matrices that still fit in memory. The paper below presents the features of the bigmemory packages.

Click to access bigmemory-vignette.pdf
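As a quick sketch of the sort of usage I have in mind (the file name and column assumptions here are placeholders, not the actual Kaggle data): read.big.matrix can create a file-backed matrix, so only the portions being worked on need occupy RAM.

```r
library(bigmemory)
library(biganalytics)

# File-backed big.matrix: the backing and descriptor files live on
# disk, so the full matrix never has to be resident in memory.
# "transactions.csv" is a hypothetical all-numeric file.
trans <- read.big.matrix("transactions.csv", header = TRUE,
                         type = "double",
                         backingfile = "trans.bin",
                         descriptorfile = "trans.desc")

# biganalytics supplies big.matrix-aware summaries
colmean(trans)
```

One caveat worth knowing before committing to this route: a big.matrix holds a single type, so genuinely mixed character/numeric data needs recoding first.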

By contrast, the ff package aims to allow the user to work with data frames which are too large to fit in memory. Hence, even larger sets than are tractable using bigmemory and the related packages on their own become tractable.

The following informal paper introduces the ff package:

Another quick look at the ff package is available via the presentation files of the awesomely acronymed Dallas R Users Group, whose members were lucky enough to have a presentation on the ff package in Jan ’13. The presentation slides and two related R files are available from this page:

Dallas R Users Group


Lastly, the following paper discusses when to use the different packages mentioned above, and better still, some ways to avoid needing to use them in the first place:

Click to access bigdata.pdf



Lattice Trellis and Other Pretty Patterns

16 Mar

My local R users group is sufficiently in need of speakers that it was suggested that I give a talk, which I am quite willing to do. I have started to prepare a talk about the Lattice package, its ‘parent package’ Trellis from S-Plus, and some of the research of Trellis’s creator, Bill Cleveland, and his continuing influence.

Admittedly, I’m not an expert on visualisation, so I will cite Fell In Love With Data to back up my assertion that Cleveland is a big name in this area. FILWD listed their selection of the greatest papers of all time, with one of Cleveland’s early visualisation papers at the top.

That paper explains Cleveland’s feeling that there was a gap in the market for intellectual rigour (for want of a better phrase) in the application of statistical graphics. Cleveland’s contribution was to assess the cognitive difficulty of interpreting some of the key genres of data display, confirm the findings experimentally, and apply those findings to common trends in data display (pie graphs?).

Overall, Cleveland’s work history is too large to fit on one web page – he has two. One is his Bell Labs page, which remains preserved although he has since moved on. Cleveland’s new home is Purdue University, and his web page there has another selection of papers. Of course, of greatest interest to me in getting deeper into lattice was the Trellis home page at Bell Labs, which includes the original Trellis manual, which can be used unchanged with lattice in R to produce graphics.

A handy extra resource comes from Paul Murrell, the statistician who developed grid, the R package which underlies lattice: he has made some sample chapters of his book ‘R Graphics’ available, including, in particular, the chapter on lattice.

Discovering Michael Friendly’s history of data visualisation has been a pleasantly surprising side benefit of looking into the lattice package and related developments in data visualisation. Although there isn’t any detailed discussion of any of the specific topics I am researching, the timeline developed by Friendly showing peaks and troughs in data visualisation development (figure 1) gives some interesting context to Cleveland’s complaint of insufficient rigour in data visualisation. Ironically, Friendly posits the success of statistics via calculation as the leading cause of the lack of interest in visualisation throughout the first half of the twentieth century. Interestingly, Cleveland appears just after the rebirth of visualisation (1950-1975), rather than as part of the rebirth.