Friday, May 21, 2004

Which languages convey the most information? Computing the entropy (i.e. information content) of English has been a classic task of information theory since Shannon's famous 1951 paper. A simple Google search reveals lots of hits estimating the information content of English text. But what about the entropy of French or German? Similar Google searches reveal no hits.

One way to estimate which language contains the most information is to use WinZip, whose compression is based on the Lempel-Ziv algorithm, which is guaranteed to converge to the entropy as the text gets long. Simply see which language compresses better, and conclude that the least compressible one has the most information content per symbol (the less information content per symbol a language has, the more redundant it is, and the better it compresses). The results will not be accurate because WinZip has a finite dictionary size. But it's a way to get a quick and dirty estimate.
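For anyone who wants to repeat this without WinZip, here is a minimal Python sketch of the same measurement. It assumes that zlib's DEFLATE (an LZ77-based method) is a fair stand-in for what WinZip does; the function names percent_saved and bits_per_char are made up for illustration, and the numbers it produces will not match WinZip's exactly.

```python
import zlib


def percent_saved(text: str, level: int = 9) -> float:
    """Percentage by which DEFLATE shrinks the UTF-8 bytes of `text`.

    Assumption: zlib stands in for WinZip here; results will differ slightly.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    packed = zlib.compress(raw, level)
    return 100.0 * (1.0 - len(packed) / len(raw))


def bits_per_char(text: str, level: int = 9) -> float:
    """Compressed bits per character: a crude estimate of entropy per symbol."""
    if not text:
        raise ValueError("text must be non-empty")
    packed = zlib.compress(text.encode("utf-8"), level)
    return 8.0 * len(packed) / len(text)
```

Reading each book into a string and calling percent_saved on it gives a figure comparable to the "compresses by" percentages. Very short inputs can come out negative because the compressed stream carries a few bytes of header overhead.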

I used five French books and five English books, all from the 19th century.


Le Tour du Monde en 80 Jours, by Jules Verne
Germinal, by Emile Zola
Les Miserables, by Victor Hugo
Eugenie Grandet, by Honore de Balzac
Cyrano de Bergerac, by Edmond Rostand


Pride and Prejudice, by Jane Austen
Bleak House, by Charles Dickens
Middlemarch, by George Eliot
Huckleberry Finn, by Mark Twain
Jane Eyre, by Charlotte Bronte

The results? English books compress by 61.0% while French books compress by 61.6% (which means they are statistically indistinguishable given the small sample size).
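One way to make "statistically indistinguishable" concrete is to compare the per-book percentages for the two languages with Welch's t statistic. This is only a sketch of one possible check, not the method used for the numbers above; the per-book figures are not listed in this post, so the function takes them as arguments, and the name welch_t is made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances.

    `a` and `b` are sequences of per-book compression percentages
    (hypothetical inputs; each needs at least two values).
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        raise ValueError("both samples have zero variance")
    return (mean(a) - mean(b)) / se
```

With five books per language, a value of welch_t(english, french) close to zero would be consistent with the two averages being indistinguishable, while a large magnitude would suggest a real difference.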

Makes sense given how much the two languages have in common.


At 4:37 PM, Anonymous said...

what's a quick and dirty estimate compared to the real thing? ;)

ok - charles dickens. winzip can't tell you how many pages of gratuitously long irrelevant descriptions are in his novels. winzip probably counts this as "information" ... thus demonstrating empirical evidence is lame :)

At 5:31 PM, alex said...

I wish I had a good answer to this but I don't :)

