My first reaction when I read that Kaggle was promoting a competition with its biggest-ever data set (the Acquire Valued Shoppers Challenge) was ‘Oh no, that’s one I will definitely need to avoid,’ as I am still struggling to find the time to hone my Hadoop, Mahout et al. chops.

As luck would have it, though, I have since stumbled across some R packages that may mean I can at least make a semi-sensible entry, and which have therefore motivated me to download the data (8 GB compressed, which took five hours, so not completely trivial).

The difficulty is that, according to Kaggle, the uncompressed data exceeds 22 GB, and R on my machine certainly balks at loading much smaller data frames into memory. Part of the answer is probably to sample from the data set in SQL before attempting any analysis. Even then, I will have to work with a very small subset of the data to get anywhere with R on my machine.
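To make the sampling-in-SQL idea concrete, here is a sketch using RSQLite. A toy in-memory database and a table named `transactions` stand in for the real data; for the Kaggle files you would first load the CSVs into an on-disk SQLite database and point `dbConnect()` at that file instead (all names here are hypothetical).

```r
library(DBI)
library(RSQLite)

# Toy stand-in for the real transactions table
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "transactions",
             data.frame(id = 1:10000, spend = runif(10000)))

# ORDER BY RANDOM() is SQLite-specific; the sampling happens inside the
# database, so only the 500 sampled rows ever reach an R data frame.
sample_df <- dbGetQuery(con,
  "SELECT * FROM transactions ORDER BY RANDOM() LIMIT 500")

nrow(sample_df)  # 500
dbDisconnect(con)
```

The point of the pattern is that the full table never has to exist as an R object: the database does the heavy lifting, and R only sees the sample.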

The first discovery that looks likely to assist me here is the bigmemory package, which comes with the sibling packages bigalgebra, biganalytics, and bigtabulate. This suite of packages makes it much easier to work with very large matrices that still fit in memory. The paper below presents the features of the bigmemory packages.

http://www.stat.yale.edu/~mjk56/temp/bigmemory-vignette.pdf
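As a rough sketch of the workflow the vignette describes: a `big.matrix` is allocated outside R's normal copy-on-assignment heap (optionally backed by a file on disk), and the biganalytics functions operate on it without making copies. Toy dimensions stand in for the real data here.

```r
library(bigmemory)
library(biganalytics)

# Allocate a million-row matrix outside R's usual memory management;
# add backingfile = "..." to make it file-backed and shareable.
x <- big.matrix(nrow = 1e6, ncol = 3, type = "double", init = 0,
                dimnames = list(NULL, c("a", "b", "c")))
x[, "a"] <- rnorm(1e6)

# colmean() from biganalytics works directly on the big.matrix,
# without first converting it to an ordinary R matrix.
colmean(x)
```

Ordinary matrix subscripting (`x[, "a"]`) still works, which keeps the learning curve shallow; the win is that assignments and summaries avoid the duplicate copies base R would make.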

By contrast, the ff package aims to let the user work with data frames that are too large to fit in memory. Hence, even larger data sets become tractable than bigmemory and its related packages can handle on their own.

The following informal paper introduces the ff package: http://wsopuppenkiste.wiso.uni-goettingen.de/ff/ff_1.0/inst/doc/ff.pdf
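A minimal sketch of the ff approach: `read.csv.ffdf()` reads a CSV in chunks into an `ffdf`, a data-frame-like object backed by files on disk rather than RAM. A small temporary CSV stands in for the 22 GB transactions file, and the chunk size shown is illustrative only.

```r
library(ff)

# Toy CSV standing in for the real data
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, spend = runif(1000)),
          csv, row.names = FALSE)

# next.rows sets how many rows are read per pass; for the real file,
# something like 1e6 keeps each chunk comfortably in memory.
big <- read.csv.ffdf(file = csv, header = TRUE, next.rows = 250)

nrow(big)          # 1000
mean(big$spend[])  # subscripting with [] pulls the ff vector into RAM
```

Because only one chunk is in memory at a time during the read, the resulting `ffdf` can be far larger than available RAM; you then pull columns or row ranges into ordinary R objects as needed.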

Another quick look at the ff package is available via the presentation files of the awesomely acronymed Dallas R Users Group (covering the Dallas and Fort Worth metroplex), whose members were lucky enough to have a presentation on the ff package in January 2013. The presentation slides and two related R files are available from the group’s Meetup page.

Lastly, the following paper discusses when to use the different packages mentioned above, and better still, some ways to avoid needing to use them in the first place:

http://www.mattblackwell.org/files/papers/bigdata.pdf
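One of the simplest avoidance tactics in that spirit, which works in base R: don't read the columns you don't need. Setting an entry of `colClasses` to `"NULL"` makes `read.csv()` skip that column entirely, so a wide file can yield a narrow data frame. A toy file stands in for the real one here.

```r
# Toy three-column CSV
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, junk = letters[1:5], spend = runif(5)),
          csv, row.names = FALSE)

# "NULL" (as a string) tells read.csv to drop the second column on read,
# so the unwanted data never occupies memory as part of the data frame.
slim <- read.csv(csv, colClasses = c("integer", "NULL", "numeric"))

names(slim)  # "id" "spend"
```

Declaring the classes of the kept columns also spares `read.csv()` its type-guessing pass, which speeds up large reads considerably.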

Tags: kaggle; big data; predictive modeling