Can we predict what books ‘Try Books!’ will like?


We want to know if we can predict which books ‘Try Books!’ will like.  We can use the ratings given on various sites (LibraryThing, Amazon UK, Amazon US).  For some books the Amazon US site has information on the Fog Index, Words per Sentence and so on.  Finally, we can define our own indicators, reflecting for instance whether the author is male, whether the book seems to be directed at children and the like.

Prediction based on others’ ratings

There are 65 books for which I could decide whether the group had definitely liked the book, as against being divided or not liking it.  Using logistic regression in R (as helpfully explained in Baayen, below) it is not difficult to arrive at the following as the simplest model:

Model 1

z = -0.813 + 0.0544*AUK5 -0.283*AUK1.

Here, AUK5 is the number of 5-star ratings on UK Amazon for the particular book and AUK1 is the number of 1-star ratings on UK Amazon for it.

This gives a score (z) lying between -∞ and +∞, where a score greater than 0 implies the book is likely to be liked (code 1) and a score less than 0 implies it is not likely to be liked (code 0).

We can transform the score z into a probability p of the book being generally liked as follows:

p = 1/(1 + e^(-z)).
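As a quick check on this transformation, here is a minimal Python sketch (the analysis itself was done in R; the function name `logistic` is only an illustrative choice). It shows that z = 0 corresponds to p = 0.5, so 'z > 0' and 'p > 0.5' are the same decision rule.

```python
import math

def logistic(z):
    """Convert a logistic-regression score z into a probability 1/(1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# z = 0 sits exactly on the boundary between 'liked' and 'not liked'
print(logistic(0.0))  # 0.5

# Scores of equal size and opposite sign give probabilities that sum to 1
print(logistic(2.0) + logistic(-2.0))

# Positive scores map above 0.5, negative scores below it
for z in (-4.0, -1.0, 1.0, 4.0):
    print(z, round(logistic(z), 3), "liked" if logistic(z) > 0.5 else "not liked")
```

The symmetry around z = 0 is why the sign of the score alone is enough to read off the predicted class.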


In this model, the probability of liking increases with the number of 5-star scores on Amazon UK and decreases with the number of 1-star scores there, which all sounds very reasonable.  This model classifies 36 books correctly and 19 incorrectly.  The ‘worst’ result is for The Master and Margarita, which ‘should’ have been generally liked with a probability of 0.89, and wasn’t.

In search of greater predictive accuracy, we arrive at the following more complex model:

Model 2

z = -1.052 + 0.220*LT2 - 0.808*LT3 + 0.150*AUK5 - 0.703*AUK1 + 0.719*AUS4 - 0.590*AUS2

Here, AUK5 is the number of 5-star ratings on UK Amazon and AUK1 is the number of 1-star ratings on UK Amazon, as before, while LT2 is the number of two-star ratings on LibraryThing, AUS4 is the number of four-star ratings on US Amazon, and so on.

There are 11 books (out of 65) classified incorrectly:  Stump, Restless, The Electric Michelangelo, Skin Lane, Mister Pip, The Magic Toyshop, The Testament of Gideon Mack had probabilities <0.50 and were generally liked, while The Consolations of Philosophy, Therese Raquin, Stasiland, Complicity had probabilities >0.5 and weren’t.  The ‘worst’ case was Stump, which only had a probability of 0.09 of being liked.

Stump escaped classification

This model gives us 12 books with a probability 1.00 of being generally liked (If This Is a Man / The Truce, The Color Purple,  The Five People You Meet in Heaven, The Bell Jar, A Prayer for Owen Meany, After You’d Gone, The Kite Runner, Bad Science, The Curious Incident of the Dog in the Night-time, The Book Thief), and they all were.  Similarly, it gives six books with a probability of 0.00 (On Chesil Beach, The Reluctant Fundamentalist, Pride and Prejudice and Zombies, Suspicions of Mr Whicher, The Unconsoled), and none of them were.

The model above shows some signs of overfitting, but applying penalized maximum likelihood estimation didn’t result in any meaningful changes.  It is also counterintuitive, in that the coefficient of LT2 (the number of 2-star ratings on LibraryThing) is positive, so the model says that the more of these poor ratings a book has, the more likely the group is to like it.

For both of these models, there are some books where the difference is just down to individual judgment–we liked Stump and the raters of Amazon(x2) and LibraryThing didn’t (did they actually read it?)

We can also derive a more comprehensive model by including some ‘genre’ factors (is it fiction? is it fantasy? is anyone killed on-screen? does anyone have sex on-screen? is it a children’s book? is the author male? was it originally written in English?) to get the following:

Model 3

z = -3.869 + 4.681*FICT - 3.143*KILL + 0.440*LT2 - 0.0635*LT3 + 0.190*AUK5 - 1.114*AUK1 + 0.362*AUS4 - 1.264*AUS2.

So we see that a book is more likely to be liked if it is fiction and less likely if it involves killing (the other genre variables like SEX and MALE have no significant effect).  But LT2 still has the wrong sign!

Easy-to-read CART

Of the 65 books we have looked at so far, there are 25 for which US Amazon gives statistics for Fog Index,  Flesch Reading Ease, Flesch-Kincaid Grade Level, Percentage of Complex  Words, Syllables per Word,  Words per Sentence and Total Words.  These are all measures of text complexity, apart from the last, which measures length.

Just for a change, we can apply a classification and regression tree to these variables–again using R, and as set out in Baayen–to ‘predict’ our results, with results as follows:

Model 4: CART

What this says is that only Flesch Reading Ease is significant.  If it is less than 71 (the left branch) we predict ‘0’, and actually get ‘0’ 10 times and ‘1’ 6 times; the value of 0.375 at that leaf is the mean outcome, (0*10 + 1*6)/16.  If it is greater than 71 (the right branch) we predict ‘1’ and get ‘1’ 9 times.  So 19 books are classified correctly and 6 incorrectly.
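The arithmetic behind these counts can be checked with a small Python sketch. It does not refit the tree (the underlying readability data are not reproduced here); it only tallies the leaf counts quoted above. The assumption that the right branch contains no ‘0’ books is inferred from the totals (25 books, 19 correct), and the names `leaves`, `leaf_mean` and `n_correct_in` are illustrative.

```python
# Leaf counts as reported in the text (not recomputed from raw data).
# Left branch:  Flesch Reading Ease < 71, predicted class 0
# Right branch: Flesch Reading Ease > 71, predicted class 1
# Assumption: the right branch holds no books coded 0 (inferred from the totals).
leaves = {
    "left": {"predict": 0, "n0": 10, "n1": 6},
    "right": {"predict": 1, "n0": 0, "n1": 9},
}

def leaf_mean(leaf):
    """Mean outcome in a leaf, i.e. the proportion of books coded 1."""
    return leaf["n1"] / (leaf["n0"] + leaf["n1"])

def n_correct_in(leaf):
    """Number of books in the leaf whose actual code matches the prediction."""
    return leaf["n1"] if leaf["predict"] == 1 else leaf["n0"]

total = sum(leaf["n0"] + leaf["n1"] for leaf in leaves.values())
n_correct = sum(n_correct_in(leaf) for leaf in leaves.values())

print(leaf_mean(leaves["left"]))      # 0.375
print(n_correct, total - n_correct)   # 19 6
```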

The six books in this sample generally liked in spite of having a Flesch Reading Ease < 71 are:  The Bell Jar, A Confederacy of Dunces, Heart of Darkness, In Cold Blood, The Electric Michelangelo, A Prayer for Owen Meany. A Flesch score of  60.0–70.0 is interpreted as ‘easily understandable by 13- to 15-year-old students’,  so that’s just about where the group’s cut-off lies.

Practical applications

Model 1 seems the most practical if one is looking for guidance on a book to choose; Model 4 relies on data being supplied by US Amazon (which it usually isn’t), while Models 2 and 3 are a bit complicated.

At the last meeting, the group had to choose between The True Deceiver, Brooklyn, and Let the Great World Spin.  And it chose Brooklyn. Model 1 would have offered the following guidance:

TITLE                       AUK5   AUK1   model1z   model1p
The True Deceiver              8      0    -0.378     0.407
Brooklyn                      21      5    -1.086     0.252
Let the Great World Spin       4      1    -0.878     0.294
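These figures can be reproduced from the Model 1 coefficients. The following is a minimal Python sketch rather than the original R session; the function name `model1` and the dictionary layout are illustrative choices, while the coefficients and rating counts are those given above.

```python
import math

# Model 1 coefficients as stated earlier in the post
INTERCEPT, B_AUK5, B_AUK1 = -0.813, 0.0544, -0.283

def model1(auk5, auk1):
    """Return (z, p) for a book with the given Amazon UK 5-star and 1-star counts."""
    z = INTERCEPT + B_AUK5 * auk5 + B_AUK1 * auk1
    p = 1.0 / (1.0 + math.exp(-z))
    return z, p

# (AUK5, AUK1) for the three candidate books
books = {
    "The True Deceiver": (8, 0),
    "Brooklyn": (21, 5),
    "Let the Great World Spin": (4, 1),
}

for title, (auk5, auk1) in books.items():
    z, p = model1(auk5, auk1)
    print(f"{title:26s} z = {z:6.3f}  p = {p:.3f}")

# The book with the highest predicted probability of being liked
best = max(books, key=lambda title: model1(*books[title])[1])
print(best)
```

Note that all three probabilities are below 0.5, so Model 1 ranks the candidates without actually predicting that any of them would be generally liked.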

A vote for The True Deceiver, as it seems.  Well, I’ve now read that one and couldn’t see the point at all…

A very good book!


The text above has a Flesch Reading Ease score of 57.88, so it’s almost ‘easily understandable by 13- to 15-year-old students’.  The author has no connection with the R software (which is, after all, free) or with Harald Baayen’s book (jolly good value even at £19.99 from Blackwell’s–must be cheaper on Amazon).

And now there’s an amusing PCA to be seen here.

