I was recently asked to prepare a talk on logistic regression, which I thought I knew all about. When I sat down to get my thoughts in order, though, I soon realised my knowledge had some holes and that some significant revision would be required.

Needing to use a book I already owned, I turned to Gelman and Hill’s ‘Data Analysis using regression…’ as my main guide. As the prose and the pedagogy of this text are both first rate, having a mandate to read a chapter of it was no great hardship, and I was soon reminded of a few main points:

1. Logistic regression is possibly the most used tool to model systems where the response variable is binary. (I did actually know that already!)

2. The logit function maps a probability between 0 and 1 onto the whole real line, so it can be set equal to the sum of the products of explanatory variables and coefficients, which is unconstrained. Applying the inverse logit function to that sum brings it back between 0 and 1, so the probability of a success can be predicted for particular values of the explanatory variables.

3. We can extend the logistic model to ordered multinomial data by adding cutpoints (definitely hadn’t thought about this before). Win/loss can become at least win/draw/lose. I like to think this could be very handy for sports betting: it is possible to bet on a draw as the outcome, but I suspect that most people just don’t, potentially leading to an arbitrage opportunity for somebody who can precisely model the probability of a draw.
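
Points 2 and 3 are easier to see with some numbers, so here is a minimal sketch in Python. It is only an illustration under my own assumptions: the coefficients and cutpoints are made up rather than fitted to any data, the function names are mine, and the cutpoint parameterisation (the probability of exceeding category k is the inverse logit of the linear predictor minus the k-th cutpoint) is one common convention that I have not checked line by line against the book.

```python
import math


def inv_logit(x):
    """Map any real number to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Map a probability in (0, 1) to the real line (the log-odds)."""
    return math.log(p / (1.0 - p))


def ordered_probs(linear_predictor, cutpoints):
    """Probabilities of each ordered category, lowest category first.

    Assumes P(y > category k) = inv_logit(linear_predictor - cutpoint_k),
    so K cutpoints give K + 1 categories.
    """
    exceed = [inv_logit(linear_predictor - c) for c in sorted(cutpoints)]
    upper = [1.0] + exceed   # P(y > previous category), 1 for the lowest
    lower = exceed + [0.0]   # P(y > this category), 0 for the highest
    return [u - l for u, l in zip(upper, lower)]


# Hypothetical coefficients, chosen purely for illustration.
intercept, slope = -1.0, 0.5
x = 2.0
eta = intercept + slope * x   # the linear predictor, here 0.0

print(inv_logit(eta))         # 0.5: a linear predictor of 0 is an even chance

# Two hypothetical cutpoints turn win/lose into lose/draw/win.
lose, draw, win = ordered_probs(eta, cutpoints=[-0.5, 0.5])
print(lose, draw, win)        # three probabilities that sum to 1
```

Moving the two cutpoints further apart widens the band assigned to a draw, which is the quantity the betting idea would depend on.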

I was asked about significance tests but not about tests for goodness of fit, so I studied accordingly, although I had the nagging feeling that tests for goodness of fit ought to be a part of the picture, too.

The other nagging feeling I developed was that, although I had been taught logistic regression as an example of a generalised linear model, it was something more ancient than that relatively recent exposition of statistical knowledge (exposition in the sense that GLM theory explains why certain statistical tools work and how they are related via the exponential family of distributions).

Praise G-d for the internet!

After some digging, I uncovered this paper: http://www.tinbergen.nl/discussionpapers/02119.pdf

which is a history of logistic regression. It is there for anyone interested.

For me the highlights are the discovery that the logistic function itself (the inverse of the logit) was originally invented as some sort of rebuttal of Malthus: populations may grow exponentially at first, but must hit some sort of limit as natural resources fail to support increasing numbers. Hence the well known ‘S’ shape, with an asymptote for when it is hard for the population to find sufficient resources. The other interesting aspect is the acrimony which the introduction of the logit function in regression, originally used for bioassays, occasioned, or really the resistance from ‘probit’ proponents. Statistics and probability seem to be full of these disputes: obviously the Bayesian/frequentist controversy, or the arguments on the fundamentals of probability that swirled before Kolmogorov’s axioms, and to a lesser extent afterwards.
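
For anyone who wants to see the ‘S’ shape in numbers, here is a rough sketch of the standard logistic growth curve set against pure exponential growth. The starting population, growth rate and carrying capacity are invented for illustration, and the function names are my own; nothing here is taken from the paper.

```python
import math


def logistic_growth(t, n0, r, k):
    """Population at time t under logistic growth.

    n0 is the starting population, r the growth rate and k the carrying
    capacity (the asymptote). This is the closed-form solution of
    dN/dt = r * N * (1 - N / k).
    """
    return k / (1.0 + ((k - n0) / n0) * math.exp(-r * t))


def exponential_growth(t, n0, r):
    """Population at time t under unchecked, Malthus-style growth."""
    return n0 * math.exp(r * t)


# Hypothetical values, for illustration only.
n0, r, k = 10.0, 0.5, 1000.0

for t in (0, 1, 5, 10, 20, 40):
    print(t, round(exponential_growth(t, n0, r), 1),
          round(logistic_growth(t, n0, r, k), 1))
```

Early on the two columns are close together; later the exponential column keeps climbing while the logistic column flattens out just below the carrying capacity.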

Some may be wondering what any of this may have to do with CT6 and risk theory. Admittedly not much. But we will get to GLMs in a few more installments, and logistic regression seems to have been subsumed into GLMs now – and if we can’t find some sport in how these different tools came to be, life and statistical theory would be pretty dull.