Archive | Uncategorized

New Blog

10 Jan

I have created a new blog, which will be my focus from now on. It is called ‘Tilting at Data’ (in honour of the Ingenious Knight of La Mancha) and its main focus will be adding data science functionality to Haskell (most likely a quixotic activity, hence the reference). I will also try to season the mix with more general musings on data science, and with data science programming in other languages like Python. I have a number of texts where useful routines and functions are presented in R code, and I figure translating some of them from R to Python could be a good way to understand both the routines and Python better – and someone may occasionally find the result useful. Ideally, I would start with some of the classification diagnostics from this book by Japkowicz and Shah.
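As a first, minimal sketch of the kind of R-to-Python translation I have in mind (my own toy illustration – the function names and the choice of metrics here are not taken from Japkowicz and Shah), here is a plain-Python calculation of some basic classification diagnostics from raw label lists:

```python
from collections import Counter

def confusion_counts(actual, predicted, positive=1):
    """Tally the four cells of a binary confusion matrix."""
    cells = Counter(zip(actual, predicted))
    tp = cells[(positive, positive)]
    fn = sum(n for (a, p), n in cells.items() if a == positive and p != positive)
    fp = sum(n for (a, p), n in cells.items() if a != positive and p == positive)
    tn = sum(n for (a, p), n in cells.items() if a != positive and p != positive)
    return tp, fp, fn, tn

def precision_recall_f1(actual, predicted, positive=1):
    """Precision, recall and F1, with zero-division cases returned as 0.0."""
    tp, fp, fn, _ = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, `precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])` gives a precision of 1.0 and a recall of 2/3 – one true positive was missed, but nothing was falsely flagged.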

Last year, I wrote a couple of general data science pieces on this blog – most likely I will migrate these across to the new blog, after a bit of a fitness review. In any case, check it out here!

Timothy, Paul and Data Science

7 May

Like any other atheist who regularly attends an evangelical church, I often find myself wondering how to apply the sermon to my life. A recent example which seemed a little easier than other occasions was a sermon from a guest preacher on succession planning.

Part of the point for this preacher is that he’s a kind of mentor for a number of churches, so he traipses around Australia advising other pastors how to do things better – and also sees them failing, often for predictable reasons. Hence, when he spoke about succession, he was talking from experience.

Of course, succession planning isn’t specifically about churches. The phrase is more commonly heard in corporate settings. His solution to the problem – in the end a call to spread the Gospel, not all that surprising from an evangelical preacher – initially seemed to have no application to the corporate world, but after a little reflection turned out to be very applicable.

By the pastor’s logic, the gospel was effectively the knowledge needed to participate in his religion. So, by extension, succession planning was about the transfer of knowledge. In a way, this is not a revolutionary idea – of course succession planning is about the transfer of knowledge of a working environment, of customers, and of the skills needed to get a job done.

But the emphasis is so often on the leader of an organisation and (too often, in both senses) his immediate reports. Hence the emphasis is on the knowledge that they will bring into the company, the skills in running a business they learnt elsewhere that they will apply to your company. It’s like judging future converts by the abilities they bring to a church from outside – their ability to speak in public, or to be great fundraisers – rather than following the pastor’s idea of succession planning through passing on the knowledge the community already holds.

The alternative is succession planning starting from the ground up – making succession planning begin with the people who make your product or provide your service, and the people who secure your customers. In a very real way, they are your business. Certainly, in my own career in manufacturing, I’ve seen the results of failing to ensure knowledge is transferred from the people who make the product to others. In short, when they retire, there are delays and defects as people attempt to re-discover the skills.

The previous blog post was about one way that skills can be transferred – the knowledge of a process can be converted into a computer program, with sensible commenting and documentation. Not a solution in itself, and not the only alternative, but an extra tool that can be employed. From this perspective, the sermon was another way of seeing the big-picture way that tool can be employed in a data science setting.

Coding is not the answer (for every question)

23 Nov

There is a movement gathering steam at the moment with the aim of proliferating coding education, in itself a fine idea. Computers are everywhere, they are harder to detect than before, and people need to know when others are using computers to game them. Understanding a little bit of computer science is somewhere between very helpful and essential for these things – Defence Against the Dark Arts for the contemporary developed world.

Somewhere out of this movement has emerged a second movement proclaiming that teaching coding will teach people to think – seemingly an insufficient number of people were thinking until programmers started banding together to enlighten us.

Yes, part of my objection is to the slightly condescending way these people relate to the rest of us rather than to their actual arguments, but I still have objections to the content of their argument as well. These mainly stem from the fact that they’re arguing for programming as a way of teaching thinking as though other ways of learning to think were not available. In point of fact, the notion of teaching thinking is at least as old as Socratic philosophy, and exists in a wide variety of forms, from both Western and non-Western perspectives.

Sometimes coding proponents go as far as to suggest that coding is an ideal way to learn maths or logic. Maybe they have a point about logic – formal logic studies are maybe too esoteric for a lot of tastes.

On the other hand, to recommend programming as a way of learning maths is kind of odd. You can only learn maths by learning maths. Natural aptitude for maths is highly correlated with natural aptitude for programming – it’s hard to imagine those weak at maths will have an easy time in programming. The strong ones will learn whichever they spend time on – either way, time spent coding is just time away from maths.

This last is the crux of it – the proponents of coding in schools discuss the idea as though there were several hours per week of fallow time up for grabs. There are not. Something else has to go to make room for time spent coding. My personal guess is that most of the proponents of the coding-in-schools idea are thinking of something in the humanities rather than a science or maths subject (although at least one self-identified software developer commenting on another blog wanted to reduce arithmetic teaching in schools – as if our society wasn’t innumerate enough!). I’ve read a number of data scientist CVs – if I could change the education of that group of people, I’d be taking coding out and putting English lit in.

Lattice, Trellis and Other Pretty Patterns

16 Mar

My local R users group is sufficiently in need of speakers that it was suggested that I give a talk, which I am quite willing to do. I have started to prepare a talk about the Lattice package, its ‘parent package’ Trellis from S-Plus, and some of the research of Trellis’s creator, Bill Cleveland, and his continuing influence.

Admittedly, I’m not an expert on visualisation, so I will cite Fell In Love With Data to back up my assertion that Cleveland is a big name in this area. FILWD listed their selection of the greatest papers of all time, with one of Cleveland’s early visualisation papers at the top.

That paper explains Cleveland’s feeling that there was a gap in the market for intellectual rigour (for want of a better phrase) in the application of statistical graphics. Cleveland’s contribution was to assess the cognitive difficulty of interpreting some of the key genres of data display, confirm the findings experimentally, and apply those findings to common trends in data display (pie graphs?).

Overall, Cleveland’s work history is too large to fit on one web page – he has two. One is his Bell Labs page, which remains preserved although he has since moved on. Cleveland’s new home is Purdue University, and his web page there has another selection of papers. Of course, of greatest interest to me in getting deeper into lattice was the Trellis home page at Bell Labs, which includes the original Trellis manual – a manual that can be used without changes with lattice in R to produce graphics.

A handy extra resource comes from Paul Murrell, the statistician who developed grid (the R package which underlies lattice): he has made some sample chapters of his book ‘R Graphics’ available, including, in particular, the chapter on lattice.

Discovering Michael Friendly’s history of data visualisation has been a pleasantly surprising side benefit of looking into the lattice package and related developments in data visualisation. Although there isn’t any detailed discussion of the specific topics I am researching, the timeline developed by Friendly showing peaks and troughs in data visualisation development (figure 1) gives some interesting context to Cleveland’s complaint of insufficient rigour in data visualisation. Ironically, Friendly posits the success of statistics via calculation as the leading cause of the lack of interest in visualisation throughout the first half of the twentieth century. Interestingly, Cleveland appears to come just after the rebirth of visualisation (1950–1975), rather than being part of it.

Why is math research important?

14 Feb

I’ve been trying to post a comment on this article from MathBabe, with zero success. The comments seem to just disappear, so I am trying this as an alternative way to say my piece. This is what I wanted to say:

Why not start by establishing the value of research in general? Others have gone down this path; for example:


From there, the argument is over how important maths is to the health of the whole research community. The second paper lists benefits of a healthy research community, including ‘increasing stock of useful knowledge’ and ‘forming networks’. Arguably, a healthy maths research community is vital for these outcomes to occur across all research communities.

Another way of putting it is that the research community is a community of communities, and all the member communities suffer if one of their number is lessened in some way; maths is a place where many of the member communities meet, so if the maths research community is lessened, the effect will be especially great.


As I’ve already described, I’m worried about the oncoming MOOC revolution and its effect on math research. To say it plainly, I think there will be major cuts in professional math jobs starting very soon, and I’ve even started to discourage young people from their plans to become math professors.

I’d like to start up a conversation – with the public, but starting in the mathematical community – about mathematics research funding and why it’s important.

I’d like to argue for math research as a public good which deserves to be publicly funded. But although I’m sure that we need to make that case, the more I think about it the less sure I am how to make that case. I’d like your help.

So remember, we’re making the case that continuing math research is a good idea for our society, and we should put up some money towards it…


Data Mining/ Predictive Modeling Resources

6 Feb

A short list of some of the more interesting (and free) DM/PM resources I have found on the net, compiled at least in part so that I know where they are myself for future reference.

First, and perhaps most obviously, there are Trevor Hastie’s publications, where you can find both the comprehensive Elements of Statistical Learning and the newer Introduction to Statistical Learning available for download, along with descriptions of Hastie’s other books.

I’ve mentioned Cosma Shalizi before on this blog, because he seems to talk good sense on a number of issues. His forthcoming book, which began as class notes, is available as a downloadable PDF.

Meanwhile, at Columbia University, Ian Langmore and Daniel Krasner teach a data science course with a much greater programming bent, as a kind of antidote to too much maths and statistics training. The course site also includes the lecture notes.

Another book, covering material close to that of the first few but including some additional topics, is by Zaki; it has a website here.

Some original papers are also available, e.g. Breiman’s Random Forests paper, which I have not yet read, but want to.

New Year’s Plans Continued – Maths

2 Feb

Yes, I know it’s nearly February – I just write slowly (or, more to the point, disjointedly).

In my earlier post I discussed my ambitions for learning some computer science, in order to be a more effective data scientist and statistician. In particular, my aim is to follow Cosma Shalizi’s advice that statisticians should at least be aware of how to program like a computer programmer.

To become a better data scientist/statistician, maths is also an important element. The maths I am most lacking is probably algebra, in terms of both linear algebra and abstract algebra. From what I can see, most algorithms for data start in this area, also making use of probability theory. Whilst my knowledge of probability is also in need of renovation, my knowledge of algebra is much more dilapidated. Professor Shalizi has an area of his personal site devoted to maths he ought to learn – assuredly, if I had such a website, the corresponding area would be much larger.
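To make that concrete with a toy example of my own (not drawn from any of the texts or courses mentioned here): even fitting the humble least-squares line is linear algebra, namely solving the 2×2 normal equations by hand.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept via the 2x2 normal equations.

    For a design matrix with columns [1, x], solving (X^T X) beta = X^T y
    reduces to the closed-form expressions below - the simplest case of
    the linear algebra sitting underneath regression.
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx          # determinant of X^T X
    if det == 0:
        raise ValueError("all xs identical; slope is undefined")
    slope = (n * sxy - sx * sy) / det
    intercept = (sy - slope * sx) / n
    return intercept, slope
```

Fitting the exact line y = 1 + 2x, `fit_line([0, 1, 2, 3], [1, 3, 5, 7])` recovers an intercept of 1 and a slope of 2, as it should.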

Fortunately the internet is here to help.

With respect to linear algebra, we can start at Saylor’s open university:

Note that this features the winner of Saylor’s open textbook competition, Kuttler’s Elementary Linear Algebra (Elementary-Linear-Algebra-1-30-11-Kuttler-OTC.pdf), so it seems safe to assume this is one of the best of Saylor’s offerings.

Saylor also has Abstract Algebra I and Abstract Algebra II courses. It is in the Abstract Algebra II course that I found the following great video, which discusses the links between group theory and data mining, especially with respect to classification problems. From this video I discovered the existence of Persi Diaconis and his area of research, probability on groups, which unfortunately I am nowhere near understanding, due to deficiencies in almost all of the prerequisites from both the group theory and the probability perspectives.

A final course I am trying to follow, although the timing is not quite right, is Coursera’s Functional Analysis course. I have enjoyed the videos so far, and seem to mostly understand them. This area is also important for understanding probability on groups; hopefully I will be able to find the time to keep following along.

Historical Musings on R

31 Jan

I don’t usually reblog, but this seems like an interesting link.

Does what it says on the tin!

It’s actually almost a reblog itself, basically being a YouTube conversation between S originator John Chambers and celebrity statistician Trevor Hastie.

Data Mining in Insurance

17 Dec

I have been working on a project which uses data mining techniques to predict insurance outcomes. I have been putting off writing up the resources I found for a long time, even though part of the point of this blog was to diarise the things I find during exactly this sort of research. I think that originally I wanted to give relatively detailed summaries of these items, but I have begun to realise that I am in danger of never writing them up at all.

This first pamphlet is a good high-level summary of data mining techniques and how they can be applied to some general insurance problems. Handy if you need to explain the concepts to a non-technical person.

The paper below emphasises CART across a range of insurance contexts and, like the study below it, discusses hybridising CART and MARS techniques (although, admittedly, they are by the same authors).

Below is a comprehensive study of claim size prediction using a hybrid CART/MARS model. Interestingly, the hybridisation is achieved within a single model, rather than by creating separate models within an ensemble, for example by boosting. In fact, the authors don’t address the topic of boosting at all, though it is possibly the more obvious approach. This presentation is in fact a more detailed look at one of the examples from the paper above.

Click to access RichardBrookes.pdf
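Since boosting only gets a passing mention above, here is a toy sketch of the ensemble alternative (entirely my own illustration, not anything from the papers): regression stumps fitted one after another to the residuals of the running ensemble, in the spirit of gradient boosting, as opposed to hybridising everything inside a single model.

```python
def fit_stump(xs, ys):
    """Find the threshold split on x minimising squared error,
    predicting the mean of y on each side of the split."""
    best = None
    for t in sorted(set(xs))[1:]:                    # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x < t else rm

def boost(xs, ys, rounds=20, lr=0.5):
    """Fit stumps sequentially to the residuals of the running ensemble;
    the final model is the learning-rate-weighted sum of all the stumps."""
    stumps = []
    resid = list(ys)
    for _ in range(rounds):
        stump = fit_stump(xs, resid)
        stumps.append(stump)
        resid = [r - lr * stump(x) for x, r in zip(xs, resid)]
    return lambda x: sum(lr * s(x) for s in stumps)
```

On a step-function toy dataset such as xs = [0, 1, 2, 3], ys = [0, 0, 10, 10], twenty rounds of this drive the residuals down geometrically, so the ensemble recovers the step almost exactly – which is the point of the contrast: many weak models added together, rather than one cleverly hybridised model.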

This last is a more specific look at text mining in relation to one of the concerns of the CT6 exam – claim prediction – but obviously using techniques not currently set for examination in the actuarial exam system.

Sad News

17 Dec

Normal Deviate is no more! Vale and vive!

Hmmm, I have no cats, but my posting is haphazard! Hopefully I can rise to Normal Deviate’s standard of targeted and thoughtful blogging as my blogging prowess matures.