R Packages for Managers

22 Apr

Roger Peng, in his e-text ‘Mastering Data Science’, remarks in passing that if you are going to do something twice in R you should write a function, but if you’re going to do it three times you should write a package. (Actually he’s self-plagiarising from his book Executive Data Science, which I don’t have.)

When writing about functions and packages in R, Peng advances several of the usual arguments in favour of their use, such as avoiding rework and creating more readable code. In my opinion, just listing those standard reasons undersells the benefits of creating functions and packages, especially in a corporate environment.

A huge challenge in a corporate environment is to convert employee knowledge and experience into something the organisation retains, when a lack of time (and sometimes of people) breeds a culture of getting things out the door quickly without pausing for reflection or, crucially, documentation. As a result, if an employee goes out the door, their knowledge and experience go with them. Asking people to write packages which collect the processes they applied during a particular project keeps a substantial part of that knowledge inside the organisation.

The other undersold virtue of writing functions and packages is that it is an antidote to R turning into a command line environment rather than a software environment. That is, it moves users away from inputting strings of R commands, effectively making themselves part of the program, to writing something closer to conventional programs, though usually small ones.
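As a rough illustration of that shift, here is a minimal sketch in base R of a few console-style steps collected into one documented function. The function name `mean_by_group`, its arguments and the toy `sales` data are all invented for this example; the roxygen-style comments are the kind of documentation a package would carry.

```r
# Hypothetical example: steps that might otherwise be typed one at a time
# at the console, collected into a single documented function.

#' Mean of a numeric column by group
#'
#' @param data A data frame.
#' @param value Name of the numeric column to average (string).
#' @param group Name of the grouping column (string).
#' @return A data frame with one row per group and the mean of `value`.
mean_by_group <- function(data, value, group) {
  stopifnot(is.data.frame(data), value %in% names(data), group %in% names(data))
  out <- aggregate(data[[value]], by = list(data[[group]]), FUN = mean)
  names(out) <- c(group, paste0("mean_", value))
  out
}

# Toy data, invented for illustration only
sales <- data.frame(
  region = c("north", "north", "south", "south"),
  amount = c(10, 20, 30, 50)
)

result <- mean_by_group(sales, value = "amount", group = "region")
print(result)
```

The point is less the calculation than the fact that the inputs, the checks and the output are now written down in one place, rather than living in someone's console history.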

In my own work, I see a particular opening for moving activity toward functions and packages: we are trying to sell the same idea to three different potential customers. Each involves a similar process of providing toy examples customised to the potential customer’s data set, before doing similar work across each customer’s data set. With three being Peng’s threshold for creating packages (and there will likely be some rework for individual customers, e.g. the same task performed on different date ranges), I seem to be squarely in the category that needs to write them.
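The "same task, different date ranges" case might look something like the sketch below. Everything in it is an assumption made for illustration: the helper `filter_date_range`, the column names, the three toy data sets and the dates stand in for real customer data, which I am not showing here.

```r
# Hypothetical helper: keep rows whose date falls within [start, end].
filter_date_range <- function(data, start, end, date_col = "date") {
  stopifnot(is.data.frame(data), date_col %in% names(data))
  dates <- as.Date(data[[date_col]])
  data[dates >= as.Date(start) & dates <= as.Date(end), , drop = FALSE]
}

# Invented toy data standing in for three customers' data sets
make_toy <- function() {
  data.frame(
    date = as.Date("2015-01-01") + 0:9,
    value = 1:10
  )
}
customers <- list(a = make_toy(), b = make_toy(), c = make_toy())

# The same process for every customer, each with its own date range
starts <- c(a = "2015-01-01", b = "2015-01-03", c = "2015-01-08")
ends   <- c(a = "2015-01-05", b = "2015-01-04", c = "2015-01-10")

filtered <- Map(filter_date_range, customers, starts, ends)
rows_kept <- sapply(filtered, nrow)
print(rows_kept)
```

Once the per-customer differences are reduced to arguments like these, the shared function is a natural candidate for a package, and the customer-specific part shrinks to a few lines of configuration.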