Why is math research important?

14 Feb

I’ve been trying to post a comment on this article from MathBabe, with zero success. The comments seem to just disappear, so I am trying this as an alternative way to say my piece. This is what I wanted to say:

Why not start by establishing the value of research in general? Others have gone down this path, for example:

http://www.alrc.gov.au/publications/11-publicly-funded-research-and-intellectual-property/public-funding-research

and

http://in3.dem.ist.utl.pt/master/stpolicy03/temas/tema6_1a.pdf

From there, the argument is over how important maths is to the health of the whole research community. The second paper lists the benefits of a healthy research community, including ‘increasing stock of useful knowledge’ and ‘forming networks’. Arguably a healthy maths research community is vital for these outcomes to occur across all research communities.

Another way of putting it is that the research community is a community of communities, and all the member communities suffer if one of their number is lessened in some way; maths is a place where many of the member communities meet, so if the maths research community is lessened, the effect will be especially great.

The original post, from mathbabe:

As I’ve already described, I’m worried about the oncoming MOOC revolution and its effect on math research. To say it plainly, I think there will be major cuts in professional math jobs starting very soon, and I’ve even started to discourage young people from their plans to become math professors.

I’d like to start up a conversation – with the public, but starting in the mathematical community – about mathematics research funding and why it’s important.

I’d like to argue for math research as a public good which deserves to be publicly funded. But although I’m sure that we need to make that case, the more I think about it the less sure I am how to make that case. I’d like your help.

So remember, we’re making the case that continuing math research is a good idea for our society, and we should put up some money towards it…



Data Mining/ Predictive Modeling Resources

6 Feb

A short list of some of the more interesting free DM/PM resources I have found on the net, compiled at least partly so that I know where to find them myself in future.

First, and perhaps most obviously, Trevor Hastie’s publications page, where you can find both the comprehensive Elements of Statistical Learning and the newer Introduction to Statistical Learning available for download, along with descriptions of Hastie’s other books.

I’ve mentioned Cosma Shalizi before on this blog, because he seems to talk good sense on a number of issues. His forthcoming book, which began as class notes, is available as a downloadable pdf.

Meanwhile, at Columbia University, Ian Langmore and Daniel Krasner teach a Data Science course with a much stronger programming bent, as a kind of antidote to too much maths and statistics training. The course site also includes the lecture notes.

Another book, covering material close to that of the first few but including some additional topics, is by Zaki; it has a website here.

Some original papers are also available, e.g. Breiman’s Random Forests paper, which I have not yet read, but want to.
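Having not read the paper yet, I won’t pretend to summarise it, but for the curious, here is a minimal sketch of random forests in action via scikit-learn’s off-the-shelf implementation; the dataset and parameter choices below are mine, not anything from Breiman’s paper.

```python
# A minimal random forest run using scikit-learn's implementation of Breiman's
# algorithm; the dataset and parameter choices are illustrative only, not from the paper.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 500 bootstrapped trees, each split considering a random subset of the features
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)

print("test accuracy:", rf.score(X_test, y_test))
```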

New Year’s Plans Continued – Maths

2 Feb

Yes, I know it’s nearly February; I just write slowly (or, more to the point, disjointedly).

In my earlier post I discussed my ambitions for learning some computer science, in order to be a more effective data scientist and statistician. In particular, my aim is to follow Cosma Shalizi’s advice that statisticians should at least be aware of how to program like a computer programmer.

Maths is also an important element of becoming a better data scientist/statistician. The maths I am most lacking is probably algebra, both linear algebra and abstract algebra. From what I can see, most algorithms for data start in this area, also making use of probability theory. Whilst my knowledge of probability is also in need of renovation, my knowledge of algebra is much more dilapidated. Professor Shalizi has an area of his personal site devoted to maths he ought to learn – assuredly, if I had such a website, the corresponding area would be much larger.
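To make the ‘algorithms for data start with linear algebra’ claim concrete, here is a tiny sketch, entirely my own toy illustration rather than anything from the resources below, of ordinary least squares written directly as a linear algebra problem.

```python
# Ordinary least squares written as pure linear algebra: solve (X'X) beta = X'y.
# The toy data is generated here just for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # design matrix with an intercept
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# The normal equations; np.linalg.lstsq is the numerically safer choice in practice
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to true_beta
```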

Fortunately the internet is here to help.

With respect to linear algebra, we can start at saylor.org’s open university:

http://www.saylor.org/courses/ma211/.

Note that this features the winner of Saylor’s open textbook competition,

(PDF) Elementary-Linear-Algebra-1-30-11-Kuttler-OTC.pdf

so it seems safe to assume this is one of the best of Saylor’s offerings.

Saylor also has Abstract Algebra I and Abstract Algebra II courses in modern and abstract algebra. It is in the Abstract Algebra II course that I found the following great video, which discusses the links between group theory and data mining, especially with respect to classification problems. From this video I discovered the existence of Persi Diaconis and his area of research in probability on groups, which unfortunately I am nowhere near understanding, due to deficiencies in almost all of the prerequisites, from both the group theory and the probability perspectives.

A final course I am trying to follow, although the timing is not quite right, is Coursera’s Functional Analysis course. I have enjoyed the videos so far, and seem to mostly understand them. This area is also important for understanding probability on groups; hopefully I will be able to find the time to keep following along.

Historical Musings on R

31 Jan

I don’t usually reblog, but this seems like an interesting link


http://blog.revolutionanalytics.com/2014/01/john-chambers-recounts-the-history-of-s-and-r.html

Does what it says on the tin!

It’s actually almost a reblog itself, basically being a YouTube conversation between S originator John Chambers and celebrity statistician Trevor Hastie.

Kaggle Leaderboard Weirdness

29 Jan

Earlier this week I finally, after about half a dozen false starts, posted a legal entry to a Kaggle competition, and then when I saw how far off the pace I was, I posted another half a dozen over the course of a day, improving very slightly each time. If the competition ran for a decade, I’d have a pretty good chance of winning, I reckon…

While I now understand how addictive Kaggle is – it hits the sweet spot between instant gratification and highly delayed gratification – I find the leaderboard kind of weird and frustrating, because so many people upload the benchmark – the trivial solution the competition organisers upload to provide a line in the sand. In this competition, the benchmark is a file of all zeroes.

This time yesterday, there were around a hundred entries that were just the benchmark, out of about 180. Today, for some reason, all the entries so far appear to have been removed, so there are only about thirty – but twenty of those are the benchmark again! I get that people just want to upload something so they can say they participated, but this many all-zero files is the thing getting out of hand.
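For what it’s worth, an all-zeroes benchmark really is as trivial as it sounds; something like the sketch below is all it takes to join the twenty-way tie (the file and column layout are made up, since every competition specifies its own submission format).

```python
# Build an "all zeroes" benchmark submission of the kind clogging the leaderboard.
# The file name and the assumption that the first column is an id are hypothetical;
# each competition specifies its own sample submission format.
import pandas as pd

sample = pd.read_csv("sampleSubmission.csv")   # template provided by the competition
sample[sample.columns[1:]] = 0                 # keep the id column, zero out every prediction
sample.to_csv("benchmark_submission.csv", index=False)
```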

2014: New Year’s Plans (Dreams?)

14 Jan

This is the first in a short series, and covers my R and computer programming pipe dreams for 2014. Another post will cover my maths and statistics pipe dreams, and who knows, I may find there are other dreams not covered at all.

To a certain extent, these pipe dreams begin to make concrete the drift away from actuarial studies that some of the more careful readers may have noticed. Since I left engineering and became effectively a predictive modeler, a lot of the impetus to complete actuarial studies has fallen away. To me, though, the two areas are certainly related; to support this claim I present exhibit A, my earlier post on ‘Data Mining in the Insurance Industry’, which covers papers explaining how to achieve some of the goals of CT6 by different means.

My immediate plans, then, come in three buckets – learn more maths, learn more computer programming and learn more statistics (in which category I include statistical and machine learning). The aim of the first two is obviously to support the last aim, so the selection of topics will be somewhat influenced by this consideration.

In this post, I will just talk about computer programming, as I am rambling enough without trying to cover three different areas of self-learning. I am taking my cues in this area from a couple of blog posts by Cosma Shalizi, where he puts the case for computer programming as a vital skill for statisticians, and gives some basic prompts on what this means in practice.

Shalizi’s first piece of advice is to take a real programming class, or, if you can’t do that, read a real programming book. He recommends Structure and Interpretation of Computer Programs, and seeing as it is available for free, I say ‘that will do just fine’.

SICP, as it seems to be popularly known, teaches programming via the functional programming language Scheme. I would like to learn a little about functional programming, but I would also like to learn a programming language which is more commonly used for data analysis. Hence, in addition to reading SICP, I want to read Think Python, which is also free, but which teaches the Python language (obviously).
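As a taste of the difference in flavour, here is the same tiny calculation written first in the imperative style I default to and then in the functional style SICP teaches; this is my own toy example, not something drawn from either book.

```python
# The same small calculation (sum of squares of the even numbers) two ways.
from functools import reduce

xs = [1, 2, 3, 4, 5, 6]

# Imperative style: accumulate in a loop, mutating 'total' as we go.
total = 0
for x in xs:
    if x % 2 == 0:
        total += x * x

# Functional style: compose filter, map and reduce with no mutation.
total_fp = reduce(lambda a, b: a + b,
                  map(lambda x: x * x, filter(lambda x: x % 2 == 0, xs)),
                  0)

assert total == total_fp == 56
```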

Both of these books are listed, along with many others, on the GitHub Free Programming Books page.

Data Mining in Insurance

17 Dec

I have been working on a project which uses data mining techniques to predict insurance outcomes. I have been putting off writing up the resources I found for a long time, even though part of the point of this blog was to diarise the stuff I find during exactly this sort of research. I think that originally I wanted to give relatively detailed summaries of these items, but I have begun to realise that I am in danger of never writing them up at all.

This first pamphlet is a good high-level summary of data mining techniques and how they can be applied to some general insurance problems. Handy if you need to explain concepts to a non-technical person.

http://www.casact.org/pubs/forum/03wforum/03wf001.pdf

The paper below emphasises CART across a range of insurance contexts and, like the presentation further below, discusses hybridising CART and MARS techniques (they are by the same authors).

http://docs.salford-systems.com/insurance4211.pdf

Below is a comprehensive study of claim size prediction using a hybrid CART/MARS model. Interestingly, the hybridisation is achieved within a single model, rather than by creating separate models within an ensemble, for example by boosting. The authors don’t address the topic of boosting at all, in fact, which is possibly the more obvious approach. This presentation is in fact a more detailed look at one of the examples from the paper above.

(PDF) RichardBrookes.pdf
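I haven’t tried to reproduce the hybrid CART/MARS model, but for a sense of what the CART half looks like in code, here is a minimal sketch of fitting a regression tree to simulated claim sizes; the data-generating process is entirely invented, since the papers above use real insurance data.

```python
# Fit a CART-style regression tree to simulated claim sizes.
# The data-generating process below is entirely invented for illustration;
# the papers above work with real (and proprietary) insurance data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 5000
age = rng.integers(18, 80, size=n)
vehicle_value = rng.gamma(shape=2.0, scale=10000.0, size=n)

# Pretend younger drivers and dearer vehicles produce larger claims
claim_size = rng.gamma(shape=2.0, scale=(80 - age) * 20 + 0.02 * vehicle_value)

X = np.column_stack([age, vehicle_value])
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=100)
tree.fit(X, claim_size)
print("training R^2:", tree.score(X, claim_size))
```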

This last is a more specific look at text mining in relation to a topic which is one of the concerns of the CT6 exam – claim prediction – but obviously using techniques not currently set for examination in the actuarial exam system.

http://actuaries.asn.au/Library/gipaper_kolyshkina0510.pdf

Sad News

17 Dec

Normal Deviate is no more! Vale and vive!

Hmmm, I have no cats, but my posting is haphazard! Hopefully, I can rise to Normal Deviate’s standard of targeted and thoughtful blogging as my blogging prowess matures.

The Epitome of Data Science

3 Dec

Christian Robert is a leading Bayesian statistician and, like many Bayesian statisticians, an avid blogger (really, frequentists don’t seem to blog as much; or maybe there are only Bayesian and ambivalent/agnostic statisticians these days).

Christian generously posts what he is doing with his classes on his blog (or ‘Og, as he prefers to call it). For a few years now, he has held a seminar series on classic papers (list found here: https://www.ceremade.dauphine.fr/~xian/M2classics.html). Last week, one of his students found a paper not included on the list which in some ways symbolises the meaning of data science as the place where statistics meets computer science:

The paper is here:

http://www.personal.psu.edu/users/j/x/jxz203/lin/Lin_pub/2013_ASMBI.pdf

And here is Christian’s write-up of his student’s seminar, together with his own response to the paper:

http://xianblog.wordpress.com/2013/11/29/reading-classics-3-2/

The paper is simply a proposal of how to calculate some commonly used statistics on data too big to fit in memory, using the approach of chopping the data set into smaller pieces. Christian raises some mathematical concerns.
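The split-and-combine idea is easy to sketch for statistics like the mean and variance, where per-chunk summaries recombine exactly; the sketch below is my own illustration of that simple case, not the paper’s algorithm, and Christian’s concerns are about more delicate quantities.

```python
# Compute the mean and variance of a column too big for memory by reading the
# file in chunks and combining per-chunk sufficient statistics (n, sum, sum of squares);
# for these two statistics the recombination is exact. 'huge_file.csv' and the
# column name 'x' are placeholders.
import pandas as pd

n, s, ss = 0, 0.0, 0.0
for chunk in pd.read_csv("huge_file.csv", usecols=["x"], chunksize=100_000):
    values = chunk["x"].to_numpy()
    n += values.size
    s += values.sum()
    ss += (values ** 2).sum()

mean = s / n
variance = ss / n - mean ** 2   # population variance; multiply by n/(n-1) for the sample version
print(mean, variance)
```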

In some ways, though, the correctness of the approach is not as interesting as the fact that academic statisticians are putting serious effort into dealing with the obstacles thrown up by datasets outgrowing computers’ ability to process them, which will hopefully lead to the discipline of statistics having more of a Big Data voice. It is strange that this sort of work brings us full circle back to the pre-computing age, when finding workable approximations to allow statistics to be calculated by hand on data with a few hundred rows was a serious topic of interest. All of which makes re-reading the review (http://www.tandfonline.com/doi/abs/10.1080/00207547308929950#.Up5q8MQW2Cl) of Quenouille’s Rapid Statistical Calculations (which I have never seen for sale anywhere) a slightly odd experience, given that the reviewer says computers have made that sort of thing irrelevant!

Data Cleaning for Predictive Modeling

25 Nov

This discussion – http://andrewgelman.com/2013/11/19/22182/ – raises the question of whether data cleaning and preparation are intrinsic to applied statistics, or whether spending many hours preparing data is more something data scientists do, with statisticians possibly expecting at least semi-cleaned data. A theme which emerged in Gelman’s discussion of what a data scientist is was that ‘data scientist’ means different things to different people, and the same applies to ‘data preparation’. There are different ‘data preparations’ for many different occasions.

Applied Predictive Modeling by Kuhn and Johnson, which we have looked at before, is one of the rare books on modeling or statistics which explicitly has a section devoted to optimal preparation of data sets. We reiterate that this concept means different things to different people.

The meat of Kuhn and Johnson’s advice on data preparation is found in Chapter 3: Data Pre-Processing. The authors note that additional advice applying to supervised models appears throughout the text, supplementing the advice in chapter 3.

Chapter 3 is about adding and subtracting predictors, and re-engineering predictors for the best effect. The authors are particularly down on binning, and offer a number of methods to help overcome skewness, get the scaling right, and reduce the data sensibly (hint: binning is a poor choice). Another area of interest to Kuhn and Johnson is how to deal with missing data. This issue is notable for being one which applied statistics texts deal with relatively often – for example, Gelman and Hill’s Data Analysis Using Regression and Multilevel/Hierarchical Models contains a chapter on missing data imputation.
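Kuhn and Johnson work in R (the book is built around Kuhn’s caret package), but a rough Python analogue of the chapter 3 recipe of imputing, transforming away skewness, centring and scaling, and then reducing dimension might look like the sketch below; the particular transformer choices are mine, not theirs.

```python
# A rough scikit-learn analogue of the chapter 3 workflow: impute missing values,
# transform away skewness, centre and scale, then reduce dimension.
# The transformer choices are illustrative, not the book's own (R/caret) code.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("deskew", PowerTransformer(method="yeo-johnson", standardize=True)),  # also centres/scales
    ("reduce", PCA(n_components=0.95)),  # keep enough components for 95% of the variance
])

# Toy right-skewed data with some missing values, purely for illustration
rng = np.random.default_rng(0)
X = rng.lognormal(size=(500, 10))
X[rng.random(X.shape) < 0.05] = np.nan

X_clean = preprocess.fit_transform(X)
print(X_clean.shape)
```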

To be sure, there is plenty of very practical advice, but for it to be effective, your data set needs to have been looking pretty good to begin with.

A contrast to this approach is the Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work. Applied Predictive Modeling’s interest in the optimal clean data set for modeling assumes a somewhat clean data set to begin with. Minimum requirements are not really listed, but they could plausibly include: no outright wrong data; rows and columns agreeing with each other with respect to units; and data formatted in a way that your software can understand.

Obviously a book-length treatment can cover a lot more ground than a single chapter. Actually, the ground covered is completely disjoint, rather than being broader or deeper and overlapping. Adding to the breadth of the Bad Data Handbook’s coverage is the fact that it is written by a number of authors, each contributing a chapter on an area they are strong in. While they seem to have been well organised enough by the volume’s editor to avoid too much overlap, a negative result of this approach is that code examples come in a variety of platforms, which can be irritating if it means you have to interrupt your reading to learn basic Python syntax. That said, if you aren’t comfortable with the idea of becoming familiar with R, Python, Unix and so on, you probably aren’t that interested in becoming a data scientist (whatever that might be – most people seem to agree that a very low basic requirement is being willing to learn programming languages).

Another outcome of this approach is that each chapter is self-contained. This is great because it means you can profitably start reading from the beginning of any chapter, but the corollary is that it is not necessarily straightforward to find what you are looking for if you want to use this as a reference book, as the word ‘Handbook’ in the title implies you might.

Regardless, this is a book which covers very different ground from statistics texts and modeling texts such as Kuhn and Johnson, and is therefore its own reward.