An amusing PCA

March 31, 2010
Well, this is interesting, though you probably need to click on it to see very much.  Here we have a PCA plot of the data from the 25 books read by Try Books! for which I could find ‘Text Stats’ data on US Amazon.  In more detail, this is a plot of the second principal component against the first principal component.  The items in black are the books, while those in red are the variables.  The variables in upper case were defined by me, while those largely in lower case come from US Amazon.

G10 (measuring whether the group liked it) is very close to Flesch (a measure of simplicity).  Naturally enough, the measures of complexity Fog, Syllables (per word), Complex (percent of complex words), Flesch_K(incaid) and Words_S (words per sentence) point in the opposite direction.  So does Words_10k (number of words in units of 10,000), which is right on top of Fog.

The arrow ENG (saying whether the book was originally written in English) points away from KILL and SEX, while the arrow HT (representing whether HT liked it, whoever that might be) points along SEX, but not so far.

In fact G10 points directly away from Flesch_K(incaid), which is a measure of grade level (or reading age, in British terms).

Everybody has got what they deserved, and especially me, as Tsar Nikolai I remarked after the opening night of Gogol’s Government Inspector.

Unlike Brockley

March 30, 2010

Back of Guardian Guide 27 March - 2 April

I was surprised to see Brockley mentioned in an advert for iPhone apps on the back cover of the Guardian Guide.  Here’s the relevant part:

Kensington ~ Brockley!

The interesting point is that the parallelism between Brockley and Kensington surely means that people are expected to know that Brockley exists, and perhaps even where it is.

Brockley Palace?  Brockley Gore?  Imperialist College of Science and Technology, South Brockley?  A Far Cry from Brockley, by Muriel Spark?

The possibilities are endless…

Sputnick, 88-92 Lee High Road

March 30, 2010


That spelling does suggest a pronunciation of  ‘Spootnitsk’, but never mind–this certainly seems to be one of the more substantial Baltic-Russian shops around the place.  Inside, there is (and has always been) a nice, clean, tidy grocery, together with an ever-diminishing shelf of books in Russian (I’ve never yet been moved to buy one) and some Baltic newspapers.  OK, newspapers in languages belonging to the Baltic family (Latvian and Lithuanian) rather than Estonian (Finno-Ugric family).

The section on the right as you look at the photo used to sell vodka and DVDs; now they’ve turned it into a beauty salon (that apparently does no business at all).  How about a dermatological-venereological-cosmetological clinic?–there are some Chinese versions of that kind of thing further along the road.

Maybe not with that picture window…

Oakam, 92 High Street Lewisham

March 30, 2010

APR 76.9%!

This place caught my eye today–at first I couldn’t believe the APR of 79.6%, then I thought at least they were stating it openly.  In fact, there are some much higher APRs on their website.

'Think again!', depending on how desperate you are

I also couldn’t decide whether the notice announcing the languages spoken was a sweet attempt to be helpful or a sign of determination to exploit new arrivals.

Lietuvių ‘must be’ Lithuanian, since ‘Litva’ is Russian for Lithuania.  Why Slovak and not Czech?  Probably because the Czech Republic is more prosperous, so there are fewer migrants here–Lithuania is supposed to be the poorest state in the EU (in terms of people’s incomes) so that makes sense.

Cosmopolitan or what?

Can we predict what books ‘Try Books!’ will like?

March 27, 2010


We want to know if we can predict which books ‘Try Books!’ will like.  We can use the ratings given on various sites (LibraryThing, Amazon UK, Amazon US).  For some books the Amazon US site has information on the Fog Index, Words per Sentence and so on.  Finally, we can define our own indicators, reflecting for instance whether the author is male, whether the book seems to be directed at children and the like.

Prediction based on others’ ratings

There are 65 books for which I could decide whether the group had definitely liked the book, as against being divided or not liking it.  Using logistic regression in R (as helpfully explained in Baayen, below) it is not difficult to arrive at the following as the simplest model:

Model 1

z = -0.813 + 0.0544*AUK5 -0.283*AUK1.

Here, AUK5 is the number of 5-star ratings on UK Amazon for the particular book and AUK1 is the number of 1-star ratings on UK Amazon for it.

This gives a score (z) lying between +∞ and -∞, where a score >0 implies the book is likely to be liked (code 1) and a score less than 0 implies it is not likely to be liked (code 0).

We can transform the score z into a probability p of the book being generally liked as follows:

p = 1/(1 + e^(-z)).


In this model, the probability of liking increases with the number of 5-star scores on Amazon UK and decreases with the number of 1-star scores there, which all sounds very reasonable.  This model classifies 36 books correctly and 19 incorrectly.  The ‘worst’ result is for The Master and Margarita, which ‘should’ have been generally liked with a probability of 0.89, and wasn’t.

In search of greater predictive accuracy, we arrive at the following more complex model:

Model 2

z = -1.052 +0.220*LT2 -0.808*LT3 +0.150*AUK5 -0.703*AUK1 +0.719*AUS4 -0.590*AUS2

Here, AUK5 is the number of 5-star ratings on UK Amazon and AUK1 is the number of 1-star ratings on UK Amazon, as before, while LT2 is the number of two-star ratings on LibraryThing, AUS4 is the number of four-star ratings on US Amazon, and so on.

There are 11 books (out of 65) classified incorrectly:  Stump, Restless, The Electric Michelangelo, Skin Lane, Mister Pip, The Magic Toyshop, The Testament of Gideon Mack had probabilities <0.50 and were generally liked, while The Consolations of Philosophy, Therese Raquin, Stasiland, Complicity had probabilities >0.5 and weren’t.  The ‘worst’ case was Stump, which only had a probability of 0.09 of being liked.

Stump escaped classification

This model gives us 12 books with a probability 1.00 of being generally liked (If This Is a Man / The Truce, The Color Purple,  The Five People You Meet in Heaven, The Bell Jar, A Prayer for Owen Meany, After You’d Gone, The Kite Runner, Bad Science, The Curious Incident of the Dog in the Night-time, The Book Thief), and they all were.  Similarly, it gives six books with a probability of 0.00 (On Chesil Beach, The Reluctant Fundamentalist, Pride and Prejudice and Zombies, Suspicions of Mr Whicher, The Unconsoled), and none of them were.

The model above shows some signs of overfitting, but applying penalized maximum likelihood estimation didn’t result in any meaningful changes.  It is also counterintuitive, in that the coefficient of LT2 (number of 2-star ratings on LibraryThing) is positive, so the model says that the more of this type of bad ratings there are, the more likely the group is to like it.

For both of these models, there are some books where the difference is just down to individual judgment–we liked Stump and the raters of Amazon(x2) and LibraryThing didn’t (did they actually read it?)

We can also derive a more comprehensive model by including some ‘genre’ factors (is it fiction? is it fantasy? is anyone killed on-screen? does anyone have sex on-screen? is it a children’s book? is the author male? was it originally written in English?) to get the following:

Model 3

z = -3.869 + 4.681*FICT - 3.143*KILL + 0.440*LT2 - 0.0635*LT3 + 0.190*AUK5 - 1.114*AUK1 + 0.362*AUS4 - 1.264*AUS2.

So we see that a book is more likely to be liked if it is fiction and less likely if it involves killing (the other genre variables like SEX and MALE have no significant effect).  But LT2 still has the wrong sign!
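One way to read the genre coefficients in Model 3: in a logistic model, switching a yes/no indicator from 0 to 1 adds its coefficient to z, which multiplies the odds p/(1 - p) by e raised to that coefficient, with everything else held fixed. The sketch below assumes FICT and KILL are coded 0/1 (the post does not spell out the coding), and the function names are made up for illustration.

```python
import math

# Genre coefficients quoted for Model 3 in the post.
# Assumption: both indicators are coded 0 (no) / 1 (yes).
MODEL3_GENRE_COEFS = {"FICT": 4.681, "KILL": -3.143}


def odds_multiplier(coef):
    """Factor by which the odds p/(1-p) change when the indicator goes 0 -> 1."""
    return math.exp(coef)


def shifted_probability(p_before, coef):
    """Probability after adding `coef` to the score of a book that had p_before."""
    z_before = math.log(p_before / (1.0 - p_before))  # logit of p_before
    return 1.0 / (1.0 + math.exp(-(z_before + coef)))


if __name__ == "__main__":
    for name, coef in MODEL3_GENRE_COEFS.items():
        print(f"{name}: odds multiplied by {odds_multiplier(coef):.3g}")
    # A hypothetical book sitting at p = 0.5 before the FICT term is added:
    print(f"p after FICT: {shifted_probability(0.5, MODEL3_GENRE_COEFS['FICT']):.3f}")
```

So being fiction multiplies the odds of being liked by roughly a hundred, while on-screen killing cuts them to a few percent of what they were, which is a lot to hang on 65 books.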

Easy-to-read CART

Of the 65 books we have looked at so far, there are 25 for which US Amazon gives statistics for Fog Index,  Flesch Reading Ease, Flesch-Kincaid Grade Level, Percentage of Complex  Words, Syllables per Word,  Words per Sentence and Total Words.  These are all measures of text complexity, apart from the last, which measures length.
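For reference, the three headline measures can be computed from simple counts using the standard published formulas. Whether Amazon's Text Stats used exactly these formulas, and how it counted syllables and 'complex' words (conventionally words of three or more syllables), is an assumption here; the counts in the example are invented, not taken from any of the books.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog Index: approximate years of schooling needed."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    words, sentences, syllables, complex_words = 1000, 50, 1500, 100
    print("Flesch Reading Ease :", round(flesch_reading_ease(words, sentences, syllables), 2))
    print("Flesch-Kincaid Grade:", round(flesch_kincaid_grade(words, sentences, syllables), 2))
    print("Gunning Fog         :", round(gunning_fog(words, sentences, complex_words), 2))
```

Both Flesch measures are built from the same two ratios (words per sentence and syllables per word) with opposite signs, which fits with their pointing in opposite directions on a plot.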

Just for a change, we can apply a classification and regression tree to these variables–again using R, and as set out in Baayen–to ‘predict’ our results, with results as follows:

Model 4: CART

So what this says is that only Flesch Reading Ease is significant; if it is less than 71 (in the left branch) then we predict ‘0’ and actually get ‘0’ 10 times and ‘1’ 6 times (the proportion liked is (0*10 + 1*6)/16 = 0.375), while if it is >71 (in the right branch) we predict ‘1’ and get ‘1’ 9 times.  So 19 books are classified correctly and 6 incorrectly.
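The arithmetic for the two leaves can be checked with a few lines. This is only a tally of the counts quoted above, not a refit of the tree; the dictionary layout and function names are invented for the purpose.

```python
# Counts read off the two leaves as described in the post.
LEAVES = {
    "Flesch < 71": {"liked": 6, "not_liked": 10},
    "Flesch > 71": {"liked": 9, "not_liked": 0},
}


def leaf_summary(counts):
    """Return (n, proportion liked, predicted class, number classified correctly)."""
    n = counts["liked"] + counts["not_liked"]
    proportion_liked = counts["liked"] / n
    prediction = 1 if proportion_liked > 0.5 else 0  # majority vote in the leaf
    correct = counts["liked"] if prediction == 1 else counts["not_liked"]
    return n, proportion_liked, prediction, correct


def tree_accuracy(leaves):
    """Return (correctly classified, total) summed over all leaves."""
    total = 0
    correct = 0
    for counts in leaves.values():
        n, _, _, c = leaf_summary(counts)
        total += n
        correct += c
    return correct, total


if __name__ == "__main__":
    for name, counts in LEAVES.items():
        n, prop, pred, correct = leaf_summary(counts)
        print(f"{name}: n={n}, proportion liked={prop:.3f}, predict {pred}, correct={correct}")
    print("correct / total:", tree_accuracy(LEAVES))
```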

The six books in this sample generally liked in spite of having a Flesch Reading Ease < 71 are:  The Bell Jar, A Confederacy of Dunces, Heart of Darkness, In Cold Blood, The Electric Michelangelo, A Prayer for Owen Meany. A Flesch score of  60.0–70.0 is interpreted as ‘easily understandable by 13- to 15-year-old students’,  so that’s just about where the group’s cut-off lies.

Practical applications

Model 1 seems the most practical if one is looking for guidance on a book to choose; Model 4 relies on data being supplied by US Amazon (which it usually isn’t), while Models 2 and 3 are a bit complicated.

At the last meeting, the group had to choose between The True Deceiver, Brooklyn, and Let the Great World Spin.  And it chose Brooklyn. Model 1 would have offered the following guidance:

TITLE                     AUK5  AUK1  model1z  model1p
The True Deceiver            8     0   -0.378    0.407
Brooklyn                    21     5   -1.086    0.252
Let the Great World Spin     4     1   -0.878    0.294
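The figures in this table follow directly from the Model 1 equation and the logistic transform given earlier, and can be reproduced with a short sketch. The coefficients and rating counts are the ones quoted in the post; the function names are just for illustration.

```python
import math

# Coefficients of Model 1 as quoted in the post.
INTERCEPT = -0.813
COEF_AUK5 = 0.0544   # per 5-star rating on Amazon UK
COEF_AUK1 = -0.283   # per 1-star rating on Amazon UK

# (title, AUK5, AUK1) as given in the table.
CANDIDATES = [
    ("The True Deceiver", 8, 0),
    ("Brooklyn", 21, 5),
    ("Let the Great World Spin", 4, 1),
]


def model1_z(auk5, auk1):
    """Linear score z for Model 1."""
    return INTERCEPT + COEF_AUK5 * auk5 + COEF_AUK1 * auk1


def probability(z):
    """Logistic transform p = 1/(1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def score_table(candidates):
    """Return (title, z, p) for each candidate, rounded to three decimals."""
    rows = []
    for title, auk5, auk1 in candidates:
        z = model1_z(auk5, auk1)
        rows.append((title, round(z, 3), round(probability(z), 3)))
    return rows


if __name__ == "__main__":
    for title, z, p in score_table(CANDIDATES):
        print(f"{title:26s} z = {z:6.3f}  p = {p:.3f}")
```

Note that all three candidates come out with z < 0, so strictly speaking Model 1 predicts that none of them will be liked; the table only ranks them.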

A vote for The True Deceiver, as it seems.  Well, I’ve now read that one and couldn’t see the point at all…

A very good book!


The text above has a Flesch Reading Ease score of 57.88, so it’s almost ‘easily understandable by 13- to 15-year-old students’.  The author has no connection with the R software (which is, after all, free) or with Harald Baayen’s book (jolly good value even at £19.99 from Blackwell’s–must be cheaper on Amazon).

Leiden Summer School 2010

March 23, 2010

Yes, it really does look like that!

We have received an email as follows:

Leiden Summer School in Languages and Linguistics: 19 July – 30 July 2010

Dear Sir/Madam,

We are happy to announce the fifth edition of the Leiden Summer School in Languages and Linguistics which will be held from 19 July – 30 July 2010 at the Faculty of Humanities of Leiden University. The Summer School offers a number of courses on a wide range of subjects in the field of languages and linguistics. This year, the Summer School will consist of six programmes, including courses for beginners as well as for advanced students, taught by internationally renowned specialists:

Germanic Programme
Indo-European Programme
Indological Programme
Iranian Programme
Semitic Programme
Russian Programme

For more information and registration, visit: .

Yours sincerely,
Alexander Lubotsky (director)
Tina Janssen (organizer)

The Double (after Dostoevsky) White Bear Theatre 21 March

March 21, 2010


Picture from Theatre 6 Facebook page

I feel guilty about not liking this more, because in many ways it was very well done and the good ideas exhibited by the director and adaptor Kate McGregor deserve encouragement.  But there were just too many of these good ideas to fit into the time and space available–the rolling doorframes needed to be moved round lots of times and the live music played by members of what was clearly a very gifted cast was just too loud.  And the lamps raised and lowered to show who was at work, the stellazh of candles at the back, the frequent scene changes and rearrangement of props–it was all too much…

I wonder if Kate McGregor as director had really managed to extract the dramatic essence of the source novella–the idea is that the Petersburg clerk Golyadkin has been  behaving a bit oddly and suffered a bit of a setback in both love and career, when another Golyadkin appears and takes over his existence.  So is he mad or is there really a double at work?

The Petersburg point is important–the city has (has always had) the air of a giant theatrical set, indeed of an unconvincing attempt to overlay European order on primeval Russian chaos, and it’s also bloody foggy, so it’s quite easy to see things that aren’t there. Hence or otherwise, Gogol and Dostoevsky (and indeed Pushkin) set a particular type of grimly fantastic narrative there.

But there was no trace of this here–the action was all too present and real.  Kate McGregor’s production notes attempted to draw a parallel between the novella and Dostoevsky’s own fate when the radical group he belonged to as a young man was infiltrated by the organs of state security, but what struck me was that the ‘real’ Golyadkin of  Ben Galpin looked very like the young Nikolai Gogol, while Freddie Machin as the surrogate (or hallucination) had the look of Dostoevsky himself as a young man.

And they played their parts very well, as did the rest of the cast.  And those who played instruments also played very well.  But we just needed less–less in the text to start off with, and then less on the stage.

The White Guard (Mikhail Bulgakov) National Theatre 18 March

March 19, 2010


Picture from NT Facebook page

This was–as I understood it–the last preview night before the Press Night.  And rather than being Bulgakov’s own play (The Days of the Turbins) it was a new adaptation by Andrew Upton of his novel (The White Guard).  In search of ultimate cheapness, we were sitting in the second row from the front, and it was certainly a bit loud on occasions!  And the performances were clearly aimed at somewhere a bit further back, as well.

The play began in the Turbins’ apartment, which looked rather like the set for the National’s Philistines a couple of years ago, but painted in lighter colours (my companion felt she had also seen it in Three Sisters).  At the start I wasn’t really sure that I believed in the characters–I thought that Kevin Doyle as Count Talberg should at least have suggested someone who might be Deputy Minister of War at the beginning and then crumbled away, rather than being Basil Fawlty from the start.  There were many occasions where the director had half-understood Russian customs, which was more distracting than if he had ignored them entirely.  And if the actors find they can’t pronounce for instance Lyena as in Russian, then it’s better to go for Leyna (as in English) rather than Liyena, which sounds stupid in any language.

There followed a (tragi-)comic interlude in the Hetman’s headquarters when everyone ran away, having seized what valuables they could find.  And the interval.  I told my companion about the production I had once seen in St Petersburg, where the actor playing Nikolka Turbin (18 years old, according to Bulgakov’s dramatis personae) produced an uncannily accurate impersonation of the late Sid James, and she thought that was very funny.

After the interval we had some very loud external scenes of Aleksei Turbin perishing stupidly and the Petlyura band being bandits, after which it was back to the apartment.  I can’t help thinking that it might be better to stay in the Turbins’ apartment and have what happens elsewhere related in messenger speeches (or telephone calls, whatever).  After all, it is the contrast between the apartment and the world outside that is one of the main axes of the play, and the main character is Elena, who turns away from Talberg and the White Guard view of things to accept the new dispensation in the form of Shervinsky.

So, we got back to the apartment and by this stage the play was working well, with the actors getting near to evincing Russian-style lightning and unprovoked changes of mood and a luminous performance from Justine Mitchell as Elena.

La vie d’Irene Nemirovsky (Olivier Philipponnat & Patrick Lienhardt)

March 14, 2010


This is the French original of the work whose English translation was launched during Jewish Book Week.  As I recall, there it was revealed that they had found some new material for the English version during a recent visit to Russia, in particular from Tatiana, the grand-daughter of Irene Nemirovsky’s aunt-cum-surrogate sister Victoria.  In fact, they may even have found some more documentary material…

From this book, it seems as though Le vin de solitude is highly autobiographical, certainly with regard to Nemirovsky’s childhood and her relationship with her mother.  I’m not sure that we really get an idea of what she was like–that may be inevitable with a biography of a writer, where we already seem to know more about the subject than any biography could tell us–and some events in her life happen offstage, presumably in the absence of any reliable evidence.  For instance, one moment she’s studying Russian and comparative literature at the Sorbonne while going out having a good time with her friends, while the next she’s married to Michel Epstein and engaged in producing oeuvres alimentaires to make ends meet.  OK, so her father’s fortune had disappeared about the time of his death, so that explains something…

How she met Michel Epstein, what their marriage was like–we never really learn.  Similarly, while she was determined to love her daughters in the way that she had never been loved herself, it appears that she wanted to have them educated by governesses, so that they (like her) would not have any schoolfriends–an irony that surely deserves some comment or explanation.

I also didn’t get an idea of in what ways her books are like those of other French writers of her time, and in what way they differ from them.  There are odd cases where we learn about the same topics being treated by other writers, but nothing systematic.  The eternal undergraduate would be inclined to claim that the difference is that at the end of her freedom she was in the Burgundian countryside with no occupation other than writing Suite Francaise and no way of gaining control over her circumstances except by rising above them into objectivity.

We do learn a lot about how much she earned for what book when it was published by whom, and indeed the reason given for her never seeking to cross the line into Vichy France was that she depended on a mensualite from her publishers in Paris.  At the same time, her mother lasted out the war years in Nice with forged Latvian papers, which makes it sound as though survival was merely a matter of technique.

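Irene Nemirovsky est bien plus preoccupee de litterature que de sauver sa peau, mais il se pourrait que cela revienne au meme car: ‘Ce qui demeure: 1) notre humble vie quotidienne; 2) l’art; 3) Dieu.’  [Irene Nemirovsky is far more preoccupied with literature than with saving her skin, but it may be that this comes to the same thing, for: ‘What remains: 1) our humble daily life; 2) art; 3) God.’]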

Some time after Irene Nemirovsky had been…taken away…, Julie Dumot, who had agreed to look after her daughters, went to ask for help from their grandmother.  ‘I have no granddaughters’ came the answer through the closed door of her flat.  But they survived, and so–in the end–did what we know as Suite Francaise.

All that will remain of us is love.

Celebrating Irene Nemirovsky, Jewish Book Week 7 March

March 7, 2010

A rather large crowd was assembled in the Galleon Room of the Royal National Hotel to hear Olivier Philipponnat (biographer), Euan Cameron (translator of biography), Sandra Smith (translator of novels and, on this occasion, interpreter) and Denise Epstein (daughter) discuss Irene Nemirovsky.

The loudest applause was reserved for Denise Epstein when she said that while her mother had wanted to be French–something she had never managed–she herself had decided to be Jewish once she had the possibility of a choice.  Olivier Philipponnat pointed out that an anti-Semitic pamphlet of 1936 had named Nemirovsky as one of the important Jews Frenchmen should beware of, after which she was especially anxious to become French.

There was some discussion of Nemirovsky’s books failing to show ‘correct’ attitudes in being critical of the Jews in ‘David Golder’ and other works while not being critical of the Germans in ‘Suite Francaise’.  Philipponnat answered this by saying that she was writing novels, and portraying things as she saw them, while after the Fall of France she was in internal exile in Burgundy and had no way of knowing what was happening by way of repression of the Jews.

Philipponnat gave some play to the idea that Nemirovsky had always been French, even though born into a Russian-speaking Jewish family in the Ukraine–her closest attachment as a girl, for instance, had been to her French governess.

That leaves the question of whether the parallels with and contradictions of ‘War and Peace’ in ‘Suite Francaise’ supply both the missing elements of Russian-ness and moral commentary.  For instance, the contrast between the Rostovs loading carts with their goods and then Natasha Rostova shaming them into making room for wounded soldiers with the Péricands loading their cars with goods and then waiting for their expensive linen to come back from the laundry is well-known.  But does this form part of an articulated critique?  I don’t know.  In any case, the negative characters seem to me very French and the positive ones very Russian.

Denise Epstein explained the delay in the MS of ‘Suite Francaise’ reappearing by saying that first of all she’d been waiting to give the suitcase back to her mother when she returned after the war, then when it became clear she wasn’t going to return she thought they were private diaries and so not to be opened, then finally before sending the papers off to a literary-historical archive she decided she’d better work out what they were.  Olivier Philipponnat said that before the publication of ‘Suite Francaise’ in France it was only ‘David Golder’ that was in print there [possibly because it lent itself to an anti-Semitic reading?]

And what is more, the 1930 film of ‘David Golder’ will be shown at the Institut Francais tomorrow (8 March) and it will be serialised on R4 “Woman’s Hour” from 29 March.