Actual Biggish Data

29 Apr

My first reaction when I read that Kaggle was promoting a competition with its biggest ever data set (the Acquire Valued Shoppers Challenge) was ‘Oh no, that’s one I will definitely need to avoid,’ as I am still struggling to find the time to hone my Hadoop, Mahout et al. chops.

As luck would have it, though, I have since stumbled across some R packages that may mean I can at least make a semi-sensible entry, and which have therefore motivated me to download the data (8 GB compressed, which took five hours, so not completely trivial).

The difficulty is that, according to Kaggle, the uncompressed data is in excess of 22 GB, and R on my machine certainly balks at loading much smaller data frames into memory. Part of the answer is probably to sample from the data set out of SQL before attempting any analysis. Even then, I will have to work with a very small subset of the data to get anywhere with R on my machine.
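To make the sampling idea concrete, here is a rough sketch of what pulling a random subset out of SQL might look like, assuming the transaction data has already been loaded into a SQLite database with a table called transactions (the file and table names are placeholders, not the actual Kaggle files):

```r
## Minimal sketch: draw roughly a 1% random sample from a SQLite table
## without ever loading the full data set into R.
## 'acquire.sqlite' and 'transactions' are hypothetical names.
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "acquire.sqlite")

## SQLite's random() returns a signed 64-bit integer per row, so keeping rows
## where abs(random()) % 100 = 0 retains about one row in a hundred.
sample_df <- dbGetQuery(con,
  "SELECT * FROM transactions WHERE abs(random()) % 100 = 0")

dbDisconnect(con)
```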

The first discovery that looks likely to assist me here is the bigmemory package, which comes with sibling packages bigalgebra, biganalytics and bigtabulate. This suite of packages aims to improve the facility of working with very large matrices that still fit in memory. The paper below presents the features of the bigmemory packages.

http://www.stat.yale.edu/~mjk56/temp/bigmemory-vignette.pdf
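As a taste of what that suite provides, the sketch below uses read.big.matrix to build a file-backed big.matrix and biganalytics to summarise it. The file here is invented for illustration, since a big.matrix holds a single atomic type, so only the numeric columns of the Kaggle data would go in:

```r
## Rough sketch of the bigmemory workflow described in the vignette.
## 'transactions_numeric.csv' is a hypothetical all-numeric extract.
library(bigmemory)
library(biganalytics)

x <- read.big.matrix("transactions_numeric.csv",
                     header         = TRUE,
                     type           = "double",
                     backingfile    = "transactions.bin",
                     descriptorfile = "transactions.desc")

## biganalytics supplies summaries that operate directly on the big.matrix
colmean(x)
```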

By contrast, the ff package aims to allow the user to work with data frames which are too large to fit in memory. Hence, even data sets larger than those tractable with bigmemory and its related packages become workable.

The following informal paper introduces the ff package: http://wsopuppenkiste.wiso.uni-goettingen.de/ff/ff_1.0/inst/doc/ff.pdf
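Again as a sketch only (file name and chunk size invented), the basic ff pattern is to read the CSV in chunks into an on-disk ffdf object, then subset what you need:

```r
## ff reads the file chunk by chunk into a disk-backed ffdf rather than an
## in-memory data frame; only the rows you index are pulled into RAM.
library(ff)

dat <- read.csv.ffdf(file = "transactions.csv",
                     header = TRUE,
                     next.rows = 500000)  # rows read per chunk

dim(dat)      # dimensions without loading the whole file
dat[1:5, ]    # materialises just these rows as an ordinary data frame
```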

Another quick look at the ff package is available via the presentation files of the awesomely acronymed Dallas R Users Group, whose members were lucky enough to have a presentation on the ff package in Jan ’13. The presentation slides and two related R files are available from this page:

Dallas R Users Group


Lastly, the following paper discusses when to use the different packages mentioned above, and better still, some ways to avoid needing to use them in the first place:

http://www.mattblackwell.org/files/papers/bigdata.pdf

 

 


Lattice Trellis and Other Pretty Patterns

16 Mar

My local R users group is sufficiently in need of speakers that it was suggested that I give a talk, which I am quite willing to do. I have started to prepare a talk about the Lattice package, its ‘parent package’ Trellis from S-Plus, and some of the research of Trellis’s creator, Bill Cleveland, and his continuing influence.

Admittedly, I’m not an expert on visualisation, so I will cite Fell In Love With Data to back up my assertion that Cleveland is a big name in this area. FILWD listed their selection of the greatest papers of all time, with one of Cleveland’s early visualisation papers at the top.

That paper explains Cleveland’s feeling that there was a gap in the market for intellectual rigour (for want of a better phrase) in the application of statistical graphics. Cleveland’s contribution was to assess the cognitive difficulty of interpreting some of the key genres of data display, confirm the findings experimentally, and apply those findings to common trends in data display (pie graphs?).

Overall, Cleveland’s work history is too large to fit in one web page – he has two. One is his Bell Labs page, which remains preserved although he has since moved on. Cleveland’s new home is Purdue University, and his web page there has another selection of papers. Of course, of greatest interest to me in getting deeper into lattice was the Trellis home page at Bell Labs, which includes the original Trellis manual; the manual can be used without changes with lattice in R to produce graphics.
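The sort of thing the manual covers, translated straight into R, looks like the following. This is just a minimal illustration of Trellis-style conditioning (one panel per group) using a built-in data set, not an example taken from the manual or from my talk:

```r
## Minimal lattice example: Trellis-style conditioning on a factor,
## one panel of mpg vs weight per cylinder count.
library(lattice)

xyplot(mpg ~ wt | factor(cyl), data = mtcars,
       layout = c(3, 1),
       xlab = "Weight (1000 lbs)",
       ylab = "Miles per gallon")
```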

A handy extra resource comes from Paul Murrell, the statistician who developed grid, the R package which underlies lattice: he has made some sample chapters of his book ‘R Graphics’ freely available, including, in particular, the chapter on lattice.

Discovering Michael Friendly’s history of data visualisation has been a pleasantly surprising side benefit of looking into the lattice package and related developments in data visualisation. Although there isn’t any detailed discussion of the specific topics I am researching, the timeline developed by Friendly showing peaks and troughs in data visualisation development (figure 1) gives some interesting context to Cleveland’s complaint of insufficient rigour in data visualisation. Ironically, Friendly posits the success of statistics via calculation as the leading cause of the lack of interest in visualisation throughout the first half of the twentieth century. Interestingly, Cleveland appears to come just after the rebirth of visualisation (1950-1975), rather than being part of it.

Why is math research important?

14 Feb

I’ve been trying to post a comment on this article from MathBabe, with zero success. The comments seem to just disappear, so I am trying this as an alternative way to say my piece. This is what I wanted to say:

Why not start by establishing the value of research in general? Others have gone down this path, for example:

http://www.alrc.gov.au/publications/11-publicly-funded-research-and-intellectual-property/public-funding-research

and

http://in3.dem.ist.utl.pt/master/stpolicy03/temas/tema6_1a.pdf

From there the argument is over how important maths is to the health of the whole research community. The second paper lists benefits of a healthy research community, including ‘increasing stock of useful knowledge’ and ‘forming networks’. Arguably a healthy maths research community is vital for these outcomes to occur across all research communities.

Another way of putting it is that the research community is a community of communities, and all the member communities suffer if one of their number is lessened in some way; maths is a place where many of the member communities meet, so if the maths research community is lessened, the effect will be especially great.

mathbabe

As I’ve already described, I’m worried about the oncoming MOOC revolution and its effect on math research. To say it plainly, I think there will be major cuts in professional math jobs starting very soon, and I’ve even started to discourage young people from their plans to become math professors.

I’d like to start up a conversation – with the public, but starting in the mathematical community – about mathematics research funding and why it’s important.

I’d like to argue for math research as a public good which deserves to be publicly funded. But although I’m sure that we need to make that case, the more I think about it the less sure I am how to make that case. I’d like your help.

So remember, we’re making the case that continuing math research is a good idea for our society, and we should put up some money towards it…


Data Mining/ Predictive Modeling Resources

6 Feb

A short list of some of the more interesting free DM/PM resources I have found on the net, compiled at least in part so that I know where they are myself for future reference.

First, and perhaps most obviously, Trevor Hastie’s publications page, where you can find both the comprehensive Elements of Statistical Learning and the newer Introduction to Statistical Learning available for download, along with descriptions of Hastie’s other books.

I’ve mentioned Cosma Shalizi before on this blog, because he seems to talk good sense on a number of issues. His future book, which began as class notes, is available as a downloadable PDF.

Meanwhile, at Columbia University, Ian Langmore and Daniel Krasner teach a Data Science course with a much greater programming bent, kind of as an antidote to too much maths and statistics training. The course site also includes the lecture notes.

Another book, covering material close to that of the first few but including some additional topics, is by Zaki, and has a website here.

Some original papers are also available, e.g. Breiman’s Random Forests paper, which I have not yet read, but want to.

New Year’s Plans Continued – Maths

2 Feb

Yes, I know it’s nearly February; I just write slowly (or, more to the point, disjointedly).

In my earlier post I discussed my ambitions for learning some computer science, in order to be a more effective data scientist and statistician. In particular, my aim is to follow Cosma Shalizi’s advice that statisticians should at least be aware of how to program like a computer programmer.

Maths is also an important element of becoming a better data scientist/statistician. The maths I am most lacking is probably algebra, in terms of both linear algebra and abstract algebra. From what I can see, most algorithms for data start in this area, also making use of probability theory. Whilst my knowledge of probability is also in need of renovation, my knowledge of algebra is much more dilapidated. Professor Shalizi has an area of his personal site devoted to maths he ought to learn – assuredly, if I had such a website, the corresponding area would be much larger.

Fortunately the internet is here to help.

With respect to linear algebra, we can start at saylor.org’s open university:

http://www.saylor.org/courses/ma211/.

Note that this features the winner of Saylor’s open textbook competition,

http://www.saylor.org/site/wp-content/uploads/2012/02/Elementary-Linear-Algebra-1-30-11-Kuttler-OTC.pdf

so it seems safe to assume this is one of the best of Saylor’s offerings.

Saylor also has Abstract Algebra I and Abstract Algebra II courses in modern and abstract algebra. It is in the Abstract Algebra II course that I found the following great video, which discusses the links between group theory and data mining, especially with respect to classification problems. From this video I discovered the existence of Persi Diaconis and his area of research in probability on groups, which unfortunately I am nowhere near understanding due to deficiencies in almost all of the pre-requisites, from both the group theory and the probability perspectives.

A final course I am trying to follow, although the timing is not quite right, is Coursera’s Functional Analysis course. I have enjoyed the videos so far, and seem to mostly understand it. This area is also important for understanding probability on groups; hopefully I will be able to find the time to keep following along.

Historical Musings on R

31 Jan

I don’t usually reblog, but this seems like an interesting link.

 

http://blog.revolutionanalytics.com/2014/01/john-chambers-recounts-the-history-of-s-and-r.html

Does what it says on the tin!

It’s actually almost a reblog itself, basically being a YouTube conversation between S originator John Chambers and celebrity statistician Trevor Hastie.

Kaggle Leaderboard Weirdness

29 Jan

Earlier this week I finally, after about half a dozen false starts, posted a legal entry to a Kaggle competition, and then when I saw how far off the pace I was, I posted another half a dozen over the course of a day, improving very slightly each time. If the competition ran for a decade, I’d have a pretty good chance of winning, I reckon…

While I now understand how addictive Kaggle is – it hits the sweet spot between instant gratification and highly delayed gratification – I find the leaderboard kind of weird and frustrating because so many people upload the benchmark – the trivial solution the competition organisers upload to have a line in the sand. In this competition, the benchmark is a file of all zeroes.
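For anyone wondering just how trivial that is, a benchmark-style submission amounts to something like the following (the file and column layout are hypothetical, not the competition’s actual format):

```r
## Purely illustrative: zero out the prediction column of a sample submission.
sub <- read.csv("sampleSubmission.csv")   # hypothetical sample file
sub[[2]] <- 0                             # assume column 2 holds the predictions
write.csv(sub, "all_zeroes_benchmark.csv", row.names = FALSE)
```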

This time yesterday, there were around a hundred entries that were just the benchmark, out of about 180. Today, for some reason, all the entries so far appear to have been removed, so there are only about thirty – but twenty of those are the benchmark again! I get that people just want to upload something so they can say they participated, but this many all-zero files seems like the idea getting out of hand.