Statistical Learning books

An Introduction to Statistical Learning
with Applications in R

by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. Springer, 2013 (8th Corrected printing, 2017), 426 pages

Reviewed by T. Nelson

Statistical learning is not just for artificial intelligence. It's a way of analyzing data that recognizes the limitations of classical techniques that everybody has been using for two hundred years.

Those old techniques often give the wrong result when applied to real data (see here for an example). This book introduces some of the more interesting topics in modern statistical data analysis, with names like linear discriminant analysis, bootstrap, ridge regression and lasso, maximum likelihood, principal components analysis, and K-means clustering. Hastie and Tibshirani, along with Jerome Friedman, co-wrote The Elements of Statistical Learning, which gives a more in-depth treatment. (I don't recommend it, though, because it's too heavy on trendy topics like neural networks and unsupervised learning.)

Hey, stop dozing off, this is cool stuff! After all the hype about artificial intelligence dies down, we'll realize that statistics and pattern recognition have fused into one field. They're mostly the same already.

It used to be said in science that if you need statistics to see an effect, it's not real. These days, the opposite is often true: there is so much data that without statistics and data mining, you can't see anything. Datasets routinely exceed 30 or 40 GB. Even if you can download them without getting QoS'd, often they don't fit on your hard drive.

The authors say that no matrix algebra is needed, and that's true. The selling point of this book, though, is not the lightness of the presentation but the clarity and skill with which the topics are taught. It's well written, nicely printed, and it walks the reader through the topics by giving examples in R, a powerful and free language for statistical computing. I was disappointed, however, by the lack of references and the lack of coverage of issues like least absolute deviation and l₁ vs. l₂ methods.

The R exercises in the book actually work, if you type every line as shown. I recommend getting experience programming in R before starting this book, so you know what the commands are doing.

The authors say that curve-fitting is always a trade-off between flexibility (by which they mean adding more parameters, which eventually causes over-fitting) and mean squared error. And that's good statistics. But the real world poses stringent constraints: in science, for example, every parameter of a model has to correspond to something real. Scientists agonize about whether going from a line to a quadratic can be justified scientifically.
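
To make the trade-off concrete, here is a minimal sketch in plain Python (not from the book, whose examples are in R). It assumes made-up data: a straight line plus Gaussian noise, with an arbitrary seed and sample sizes. Polynomials of increasing degree are fitted by ordinary least squares, and the mean squared error is reported on the fitting data and on a separate held-out sample. The function names are hypothetical. The error on the fitting data can only go down as parameters are added; the held-out error is under no such obligation.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (X^T X) b = X^T y."""
    n = degree + 1
    # Augmented matrix [X^T X | X^T y], built from sums of powers of x.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (a[i][n] - tail) / a[i][i]
    return coeffs  # coeffs[k] multiplies x**k

def mse(coeffs, xs, ys):
    """Mean squared error of the fitted polynomial on (xs, ys)."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

def make_data(rng, n):
    """Assumed 'truth' for illustration only: y = 1 + 2x plus Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

rng = random.Random(0)                     # arbitrary seed, for repeatability
train_x, train_y = make_data(rng, 15)      # small sample used for fitting
test_x, test_y = make_data(rng, 200)       # held-out sample

results = {}
for degree in range(1, 6):
    coeffs = fit_polynomial(train_x, train_y, degree)
    results[degree] = (mse(coeffs, train_x, train_y),
                       mse(coeffs, test_x, test_y))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"held-out MSE = {results[degree][1]:.4f}")
```

Because each polynomial contains the lower-degree ones as special cases, the training error is non-increasing by construction. Whether the extra terms mean anything is a separate question that the numbers alone cannot answer.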

That's my gripe with this book. The idea that the goal is to fit a curve—any curve—to the data leads the authors to put such things as splines and segmented regressions on the same footing as linear models. Thanks to books like this, I have to do battle with colleagues who think it's a great idea to fit a cubic spline to enzyme kinetic data. It just ain't so. A good statistical fit isn't just the one that gives the lowest mean squared error; it's the one that gives you the most meaningful information. They're not always the same. And books like this just make my life a little harder.
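
As an illustration of what a "meaningful" fit looks like, here is a minimal sketch that assumes the enzyme data follow the Michaelis-Menten rate law, v = Vmax·S / (Km + S). That model is my choice of example, and the data below are synthetic, generated from made-up parameter values. For brevity the fit uses the Lineweaver-Burk linearization (regressing 1/v on 1/S); with real, noisy measurements a direct nonlinear least-squares fit is usually preferred, because the reciprocal transform distorts the errors. The function names are hypothetical.

```python
def michaelis_menten(s, vmax, km):
    """Reaction rate at substrate concentration s under the Michaelis-Menten law."""
    return vmax * s / (km + s)

def fit_lineweaver_burk(substrate, rate):
    """Estimate (vmax, km) by simple linear regression of 1/v on 1/s.

    1/v = (km / vmax) * (1/s) + 1/vmax,
    so slope = km / vmax and intercept = 1 / vmax.
    """
    xs = [1.0 / s for s in substrate]
    ys = [1.0 / v for v in rate]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    vmax = 1.0 / intercept
    km = slope * vmax
    return vmax, km

# Synthetic, noise-free data from made-up "true" values (illustration only).
true_vmax, true_km = 10.0, 2.0
substrate = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
rate = [michaelis_menten(s, true_vmax, true_km) for s in substrate]

vmax, km = fit_lineweaver_burk(substrate, rate)
print(f"Vmax = {vmax:.3f}, Km = {km:.3f}")
```

The fit returns two numbers, and each one means something: Vmax is the maximum reaction rate, and Km is the substrate concentration at which the rate is half of that. A cubic spline through the same points might track them just as closely, but none of its coefficients would correspond to anything about the enzyme.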

Nov 19, 2017; updated Nov 20, 2017