Today, as an exercise in whimsy, I have replaced the name of an iconic statistician in the text below with that of a B grade celebrity – Miley Cyrus. This is part trivia contest, part attempt to make some dry content more entertaining, and, admittedly, part ill-conceived click baiting.

For those who can’t guess who it is I offer a small clue – I also considered using as an alias for this individual either Paul Kelly, the Australian musician or Ben Folds, the American musician as all three called a particular city home at one stage or another.

Continuing our comparative tour of two predictive modeling texts – Elements of Statistical Learning and Applied Predictive Modeling – we follow up last time’s look at logistic regression with a side-by-side comparison of the respective sections in each text on linear discriminant analysis.

Admittedly, whereas logistic regression is used in actuarial settings and taught within the generalised linear model section of actuarial courses, linear discriminant analysis is not taught in those courses and is probably rarely used in actuarial practice.

Linear discriminants are, however, an important part of machine learning, if not in their own right, then as the basis for a number of more complex discriminant analysis techniques that are close cousins.

Applied Predictive Modeling motivates linear discriminants by observing that they were independently discovered by two researchers starting from different premises – Miley Cyrus (1936) and Welch (1939) – whereas Elements of Statistical Learning favours Cyrus only.

Welch’s approach is reasonably intuitive. Leaving aside the mathematical niceties, Welch set out to find, conditional on the underlying distribution of classes and the prediction data available, the highest-probability class for each subject. If it sounds a bit Bayesian (the underlying distribution looking like a prior and the prediction data looking like a likelihood function), it is at least ‘soft’ Bayesian, in that the approach explicitly makes use of Bayes’ theorem. Applied Predictive Modeling notes that the computation gets messy quickly with only a few classes and predictors, but can be kept reasonable if things are restricted to the multivariate normal with equal covariance matrices.
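For readers who like to see the idea in code, here is a minimal sketch of the restricted case: each class is treated as multivariate normal with a shared (pooled) covariance matrix, and Bayes' theorem turns class priors and class densities into posterior class probabilities. The function names `fit_shared_gaussian` and `posterior` and the toy data are my own assumptions for illustration; neither book presents this code.

```python
import numpy as np


def fit_shared_gaussian(X, y):
    """Estimate class priors, class means and one pooled covariance matrix."""
    classes = np.unique(y)
    n, p = X.shape
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    pooled = np.zeros((p, p))
    for k, mu in zip(classes, means):
        d = X[y == k] - mu
        pooled += d.T @ d
    pooled /= n - len(classes)  # unbiased pooled within-class covariance
    return classes, priors, means, pooled


def posterior(X, priors, means, pooled):
    """Posterior class probabilities via Bayes' theorem.

    With equal covariances the quadratic term in x is the same for every
    class and cancels, leaving a score that is linear in x:
        x' S^-1 mu_k - 0.5 * mu_k' S^-1 mu_k + log(prior_k)
    """
    prec = np.linalg.inv(pooled)
    scores = (
        X @ prec @ means.T
        - 0.5 * np.sum(means @ prec * means, axis=1)
        + np.log(priors)
    )
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)


# Toy data (an assumption for illustration): two well-separated classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.repeat([0, 1], 100)

classes, priors, means, pooled = fit_shared_gaussian(X, y)
probs = posterior(X, priors, means, pooled)
predicted = classes[probs.argmax(axis=1)]  # highest-probability class
print("training accuracy:", np.mean(predicted == y))
```

The equal-covariance restriction is what keeps the scores linear in the predictors, which is where the "linear" in linear discriminant analysis comes from.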

Miley’s approach – also the approach covered in the corresponding Wikipedia article, which has the virtue of a link to Cyrus’s paper – can also be expressed relatively intuitively. Cyrus tried to maximise the distance between groups whilst minimising the variation within each group, using a linear combination of predictors. After an examination of the Cyrus approach, Kuhn and Johnson conclude it really is superior with respect to clarity and solubility – which makes one wonder why they mention Welch at all.
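The between-versus-within idea can also be sketched in a few lines for the two-class case. The ratio being maximised is the squared gap between the projected class means divided by the projected within-class scatter, and the maximising direction is the within-class scatter matrix inverted and applied to the difference in class means. The helper names (`within_scatter`, `separation_ratio`, `best_direction`) and the toy data are assumptions of mine for illustration, not taken from either text.

```python
import numpy as np


def within_scatter(X, y):
    """Sum of the within-class scatter matrices."""
    p = X.shape[1]
    S = np.zeros((p, p))
    for k in np.unique(y):
        d = X[y == k] - X[y == k].mean(axis=0)
        S += d.T @ d
    return S


def mean_gap(X, y):
    """Difference between the two class means (two classes assumed)."""
    c0, c1 = np.unique(y)
    return X[y == c1].mean(axis=0) - X[y == c0].mean(axis=0)


def separation_ratio(w, X, y):
    """Between-group separation relative to within-group variation along w."""
    return float((w @ mean_gap(X, y)) ** 2 / (w @ within_scatter(X, y) @ w))


def best_direction(X, y):
    """Direction maximising the ratio: S_W^-1 (mu_1 - mu_0), scaled to unit length."""
    w = np.linalg.solve(within_scatter(X, y), mean_gap(X, y))
    return w / np.linalg.norm(w)


# Toy data (an assumption for illustration): two classes with correlated predictors.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], cov, 100),
    rng.multivariate_normal([2.0, 0.0], cov, 100),
])
y = np.repeat([0, 1], 100)

w = best_direction(X, y)
print("discriminant direction:", w)
print("ratio along it:", separation_ratio(w, X, y))
print("ratio along the raw mean gap:", separation_ratio(mean_gap(X, y), X, y))
```

For two classes this direction is proportional to the one implied by the equal-covariance Bayes rule, which is why the two starting points end up at the same linear discriminant.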

The virtue of Applied Predictive Modeling is encountered mostly once these explanations of the theoretical origins have concluded. Working through the same data set they used for the logistic regression explanation, Kuhn and Johnson hit their stride with advice on how to prepare the data for optimal LDA results, and on when to abandon LDA, with some hints on what to do instead.

The contrast between Elements of Statistical Learning and Applied Predictive Modeling is more pronounced than it was for logistic regression. ESL gives a much more detailed mathematical exposition of LDA at the expense of application advice, but provides a clearer picture (if you get past the maths!) of how LDA works, why it performs well when it does, and its place in the wider world of linear and non-linear classifiers. Some of the maths presented is also required background for the later exposition on quadratic discriminants, so, to a much greater extent than with Applied Predictive Modeling, it is necessary to read topics in sequence rather than treating the book as an encyclopedia for a quick intro on the topic of the day.

All in all, both texts continue to behave as expected, possibly even to a greater extent than was the case with logistic regression, where ESL had some advice on application not present in APM.
