Tuesday 15 August 2017

Meaning in information theory

Here's something I've been meaning to write about for a long time, addressing a common criticism of Shannon information.
According to many measures of information, as used in Shannon’s information theory and in algorithmic information theories (Kolmogorov, Solomonoff and Chaitin), the more random and unpredictable data is, the higher the information content. This is often presented as a reason for rejecting them as meaningful measures of semantic information.
Arguments go something like this.

Compare these two sets of data:
Data set #1
The rambler who, for old association or other reasons, should trace the forsaken coach-road running almost in a meridional line from Bristol to the south shore of England, would find himself during the latter half of his journey in the vicinity of some extensive woodlands, interspersed with apple-orchards.
Data set #2
ƒw€¾ë †eܯ⠡nÊ.,+´Àˆrœ                D[1]R¥­Æú9yŠB/öû˜ups>YS ­ Ù#Ô¤w/ÎÚÊø ªÄfª Æýð‘ëe© Y›C&$¾"˾Bµ"·
                       F8 ©t"60 <Ikˆ¢†I&Év*Úí‡a,àE l xùçkvV¯hó
                       ZÑi½ŠD°4‹¨D#n ¬T*g    
                       ZœÎr‡œ`™ó%î ù†ðË %P€óKIx­s©ÊÁz¯
8V79{¾à²ÈÇ'R|“‡üE­ û—€Ñ‚ŠB×ÉD         \{F$þݦýCÕŽ´Û2ø

They both contain the same number of characters (307, including spaces) and information theory would have us believe that the second contains more information than the first because it is more random. This is clearly rubbish.
QED: information theory sucks!
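Before unpicking this, it is worth seeing what the Shannon measure actually computes. Here is a sketch in Python (my choice of language, not anything from the original argument), using per-character entropy estimated from observed frequencies as a stand-in for the full Shannon measure, and `os.urandom` as a stand-in for data set #2:

```python
import math
import os
from collections import Counter

def char_entropy(s):
    """Per-character Shannon entropy (bits), estimated from observed frequencies."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

english = ("The rambler who, for old association or other reasons, should trace "
           "the forsaken coach-road running almost in a meridional line from "
           "Bristol to the south shore of England, would find himself during the "
           "latter half of his journey in the vicinity of some extensive "
           "woodlands, interspersed with apple-orchards.")

noise = os.urandom(307).decode("latin-1")  # 307 random bytes, like data set #2

print(char_entropy(english))  # lower: English text is redundant
print(char_entropy(noise))    # higher: random bytes use the full symbol range
```

The random data does indeed score higher on this measure, which is exactly what the criticism points at.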

The thing is that you, the reader, perceive data set #1 as containing information because of the levels above. It contains information, for you, because it is a legitimate sentence according to English grammar (the syntax) and because of its meaning in the context of the story (“The Woodlanders” by Thomas Hardy [1]). At the Shannon (or Kolmogorov, or whatever) level, none of this is known. The Shannon level doesn’t know that data set #2 is ‘just’ random whereas data set #1 carries meaning. Data set #2 *might* contain information from the layers above, and if it did, it would be able to contain more than data set #1.

And actually, I generated data set #2 by using a zipping tool (7-Zip) to compress the whole of “The Woodlanders”, opening the compressed zip file in a text editing programme (EditPad Lite) and copying 307 characters [2]. The uncompressed text file is 778 kbytes compared to 287 kbytes for the compressed file, giving a compression ratio of 37%. So in fact data set #2 contains a bigger fraction of the whole story than data set #1 does: 1/0.37 ≈ 2.7 times the amount of information.

The point is that compressing files reduces their size by removing redundancy, and the more a file is compressed, the more random it appears. This, indeed, is the principle behind algorithmic information theory: the information content of a file is given by the size of the smallest file that can represent its content. That smallest size is the theoretical limit to compression.
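Both halves of that claim are easy to demonstrate. The sketch below uses Python's zlib rather than 7-Zip (an assumption of convenience on my part): it compresses a deliberately redundant text and shows that the compressed bytes both take less space and score higher on a frequency-based entropy estimate, i.e. look more random:

```python
import math
import zlib
from collections import Counter

def byte_entropy(b):
    """Per-byte Shannon entropy (bits), estimated from observed frequencies."""
    counts = Counter(b)
    n = len(b)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# A highly redundant input: one sentence repeated 200 times.
text = ("The rambler who, for old association or other reasons, should trace "
        "the forsaken coach-road ") * 200
raw = text.encode("utf-8")
packed = zlib.compress(raw, level=9)

print(len(packed) / len(raw))  # compression ratio: far below 1 for redundant text
print(byte_entropy(raw))       # low: redundancy means predictable bytes
print(byte_entropy(packed))    # higher: the compressed stream looks more random
```

The same logic, run on the full Gutenberg text file, would reproduce the 37% ratio discussed above, give or take the differences between zlib's deflate and 7-Zip's algorithms.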

However, data set #2 might not have been created in the way I said it was. Consider now data set #3:

      ±­Ìšê줳ÓÙÙ“ãÝá´¬Âᄅ¥¬È¤ØÃߤ—Ž»Ü“”š™Ø—ÕК Î祐ßÔ£¶ºª‘ì½’¤â¾¦”Êœ›   
      ÉêÈ šÆ§Ì¯èÉß«Ž²Œ«Ç¡ÉÍ̬—¦ê½Á•²¾ÁÅªÃ²¸¡¢Í¡¬¿à±·ž•Ü©ÑÚçÑύ敔ÈÙÂÚ×›
      ŽÓåÁ䟟ÃÍ ÙÙÍáßâÉÙäÕë⻵Ã̎ߪÒç´±¼Ø׬驓Ϭ߬—

This one I created with the help of the RAND function in Excel. So it doesn’t contain information, does it? Well, generating random numbers in computers is notoriously difficult, because any deterministic algorithm is by definition not random. I think I read somewhere that the RAND function in Excel tries to do something clever with time intervals in the machine, thereby picking up some of the thermal randomness that affects the precise timing of the computer clock. But the point is: a) it doesn’t contain information for you and me at the moment [3], but b) it might nevertheless contain information that could, in principle, be extracted. If it does contain information, then information theory gives us an upper limit on how much it can contain.
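That determinism is easy to see directly. A sketch using Python's `random` module rather than Excel's RAND (again my substitution, not anything Excel-specific): once the seed is known, the whole “random” sequence is fixed.

```python
import random

# Two generators seeded identically: a deterministic algorithm
# produces exactly the same "random" sequence every time.
a = random.Random(42)
b = random.Random(42)

seq1 = [a.random() for _ in range(5)]
seq2 = [b.random() for _ in range(5)]

print(seq1 == seq2)  # True: nothing unpredictable once the seed is known
```

This is why such generators are called pseudo-random, and why true randomness has to come from somewhere outside the algorithm.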

As to the randomness question, we are again back to quantum mechanical randomness as the only true randomness of physics, always assuming that God does play dice, of course.

[1] Text file of The Woodlanders downloaded from the Gutenberg Project: https://www.gutenberg.org/files/482/482-h/482-h.htm

[2] It was a bit messier than that, because the zip file is not coded as a text file, so EditPad was trying to interpret arbitrary bytes as text, and I've fudged it a bit. If I had the time, I would have written a compression programme that gave its output as text. That would be possible but presumably wouldn't compress as much. The fudge doesn't change the argument, though.

[3] Note the theme of provisionality cropping up again!


Phil Goetz said...

This is all true, but there is a real problem with information measurements. It isn't when people are trying to measure information; it's when they're trying to measure complexity.

This has become a real problem in the arts, because around 1910 various theorists decided that art was best which was most "complex", without having any idea what "complex" might mean. In practice they have always treated it as synonymous with "random". So for instance I have found a review praising Mahler's 5th Symphony for being more "complex" than earlier Romantic music because it is so unpredictable--which it is, as it has little small-scale structure or dramatic sequencing, but mostly wanders aimlessly from key to key. In the 20th century this led directly to the music of Ferneyhough, 12-tone music, Finnegans Wake, and Jackson Pollock's paintings, which are all praised not for being enjoyable but because the theory which says complexity is good and randomness is complex dictates that we must enjoy them.

David Chapman said...

I read a bit about related ideas and the relationship between complexity, arousal and pleasure in the context of paintings some time back, drawing on work from the 1970s, and talked about it in a presentation that I gave in 2008. There is a link to the presentation from this old blog post. See slides #24 and #25. Somewhere in the writing of Donald Mackay (I think, anyway, though there is a history of people making up Donald Mackay quotes!) he says something to the effect that people mix up the measure of information with the information itself, and because of that confusion unfairly reject Shannon's measure as a measure of information. It is like rejecting kilograms as a measure of mass on the grounds that a kilogram of base metal is different from a kilogram of gold. I think, though, that the key is in the idea of levels, and that information is only meaningful at peer levels.