Tag Archives: Predictive Analytics

Data Mining/ Predictive Modeling Resources

6 Feb

A short list of some of the more interesting free DM/PM resources I have found on the net, put together at least in part so that I know where they are myself for future reference.

First, and perhaps most obviously, Trevor Hastie’s publications, where you can find both the comprehensive Elements of Statistical Learning and the newer Introduction to Statistical Learning available for download, along with descriptions of Hastie’s other books.

I’ve mentioned Cosma Shalizi before on this blog, because he seems to talk good sense on a number of issues. His forthcoming book, which began as class notes, is available as a downloadable PDF.

Meanwhile, at Columbia University, Ian Langmore and Daniel Krasner teach a Data Science course with a much greater programming bent, as a kind of antidote to too much maths and statistics training. The course site also includes the lecture notes.

Another book, covering material closer to the first few but including some additional topics, is by Zaki, and has a website here.

Some original papers are also available, e.g. Breiman’s Random Forests paper, which I have not yet read, but want to.

Empty Vessels

18 Nov

Influence is big at the moment, partly thanks to LinkedIn’s promotion of Influencers (who usually aren’t) – essentially people who write short, career-oriented inspirational stuff that is piped into your email inbox. Which is all good, but when a word is used incorrectly or, at best, loosely, its meaning is diluted, and when you want to use it in its original sense, it doesn’t work as well.

To be clear, you are influential if a great number of people in your field act differently because of work you have done. Picasso was an influential artist because many prominent artists paint differently because of his example. Bob Dylan is influential because a great many singer-songwriters changed their methods because of his example and/or found audiences which were created because Bob Dylan came first. While the idea of influence is slippery and subjective, in those cases, and others like them, we can make some progress towards objectivity using this definition.

A couple of weeks ago, Time magazine made itself a large and slow-moving target by publishing a ‘Gods of Food’ cover story, featuring 3 males and no females on the cover, and 9 males in a list of 13 inside. See here for some discussion of the associated brickbats:

http://www.huffingtonpost.com/2013/11/14/female-chefs-respond-time-gods-of-food_n_4273610.html

Another one of the many Big Data/ Data Science/ Predictive Modeling bloggers has flung themselves onto the same hand grenade by suggesting a list of ‘Top 10 Most Influential Data Scientists’ which includes no women at all.

http://www.deep-data-mining.com/2013/05/the-10-most-influential-people-in-data-analytics.html

Note that the first comment is a plea from someone whose name looks female for the inclusion of women, with another comment from the same person that has been deleted. I like to think that that comment was deleted because it was a howl of outrage, too raw in its emotion and intemperate in its language to be let loose on the sheltered data science community. But I have no data to support this assertion, and will move on…

To me, what is striking about the omission of women from this list is that the criteria were so loose that it was easy to avoid. After all, missing from the criteria is any requirement for evidence of influence, in the sense of people who call themselves ‘data scientists’ (or are called that by others) doing their work differently because of the example of these ten guys. That is not to say that these ten guys aren’t influential in that sense, just that the list was created without checking whether they were or not.

While the omission is glaring and wants addressing, I’m not so much upset about that part as about what this is an example of: you can’t move around data science websites, blogs, fora, etc., as you might want to do to find datasets (which is what I was doing when I accidentally found this blog post), programming hints, and so on, without encountering stuff that is dangerously close to spam. The rest of the Deep Data Mining blog, for example, appears to be crammed with advice on how to use different platforms, especially database platforms, to better advantage. Why not stick to that?

Divergent Opinions

12 Nov

Big Data and predictive analytics are debated and discussed almost endlessly on the interwebs. One of the threads that runs through these discussions is how much maths and statistics one needs to know (although sometimes the question seems to be more like ‘how little can I get away with?’) to practice data science, predictive analytics, etc.

Actual maths and stats people come down on the side of ‘a little knowledge is a dangerous thing’, and say people should try to know as much as possible. See here:

http://mathbabe.org/2013/04/04/k-nearest-neighbors-dangerously-simple/

But knowing enough statistics to be called a statistician could lead to being seen as out of touch with Big Data:

http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/

Particularly if contemporary, highly computer literate statisticians who are widely admired in their field admit in public they don’t know anything about Hadoop:

http://andrewgelman.com/2013/11/01/data-science/

Maybe this guy has the answer – ignore statistical theory and training, learn the least amount of programming needed to start hacking, and just teach yourself with whatever data comes to hand:

http://www.datasciencecentral.com/profiles/blogs/proposal-for-an-apprenticeship-in-data-science

Well, not exactly, but statistics is still kind of relegated to being something you ‘learn basics about’. I don’t think that posts 1, 2 and 3 can possibly be talking about the same discipline?

From my point of view, as someone who still pinches themselves that they get to do predictive modelling as a for-real job, with only a Master’s degree in statistics, experience in business from before I did stats, and a really poor command of VB6 as my only qualifications (although I learned a lot of SQL very quickly when I started this job ‘cos otherwise I had nothing to analyse), I can only say that with respect to maths and statistics I wish I knew more, with respect to machine learning I wish I knew more, and with respect to hacking I wish I knew more.

How much is enough? All of it isn’t enough.