Tuesday, 15 August 2017

Meaning in information theory

Here's something I've been meaning to write about for a long time, addressing a common criticism of Shannon information.
 
According to many measures of information, as used in Shannon’s information theory and in algorithmic information theories (Kolmogorov, Solomonoff and Chaitin), the more random and unpredictable data is, the higher the information content. This is often presented as a reason for rejecting them as meaningful measures of semantic information.
Arguments go something like this.
Compare these two sets of data:
Data set #1
The rambler who, for old association or other reasons, should trace the forsaken coach-road running almost in a meridional line from Bristol to the south shore of England, would find himself during the latter half of his journey in the vicinity of some extensive woodlands, interspersed with apple-orchards.
Data set #2
ƒw€¾ë †eܯ⠡nÊ.,+´Àˆrœ                D[1]R¥­Æú9yŠB/öû˜ups>YS ­ Ù#Ô¤w/ÎÚÊø ªÄfª Æýð‘ëe© Y›C&$¾"˾Bµ"·
                       F8 ©t"60 <Ikˆ¢†I&Év*Úí‡a,àE l xùçkvV¯hó
                       ZÑi½ŠD°4‹¨D#n ¬T*g    
                       ZœÎr‡œ`™ó%î ù†ðË %P€óKIx­s©ÊÁz¯
                       ‰xtÝ\K®§Ivsmm*)¹W¯)öÖ(ÃS/`ŽM¨Äcs
                       ‹
8V79{¾à²ÈÇ'R|“‡üE­ û—€Ñ‚ŠB×ÉD         \{F$þݦýCÕŽ´Û2ø

They both contain the same number of characters (307, including spaces) and information theory would have us believe that the second contains more information than the first because it is more random. This is clearly rubbish.
QED: information theory sucks!
No.

The thing is that you, the reader, perceive data set #1 as containing information because of the levels above. It contains information, for you, because it is a legitimate sentence according to English grammar (the syntax) and because of its meaning in the context of the story (“The Woodlanders” by Thomas Hardy [1]). At the Shannon (or Kolmogorov, or whatever) level, none of this is known. The Shannon level doesn’t know that data set #2 is ‘just’ random whereas data set #1 contains meaning. Data set #2 *might* contain information from the layers above, and if it did, it would be able to contain more than data set #1.
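To see why the measure at the Shannon level ranks the random-looking set higher, here is a minimal sketch (my own, not from the post) that estimates first-order character entropy, a crude stand-in for the full Shannon measure; the sample strings are assumptions chosen for illustration:

```python
import math
import random
from collections import Counter

def entropy_per_char(s):
    """First-order Shannon entropy in bits per character,
    estimated from the string's own character frequencies."""
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

english = ("The rambler who, for old association or other reasons, "
           "should trace the forsaken coach-road running almost in a "
           "meridional line from Bristol to the south shore of England")

# A string of the same length drawn (pseudo-)uniformly from a wide
# range of character codes, standing in for data set #2.
random.seed(0)
scrambled = "".join(chr(random.randrange(32, 256)) for _ in range(len(english)))

print(f"English text:  {entropy_per_char(english):.2f} bits/char")
print(f"Random string: {entropy_per_char(scrambled):.2f} bits/char")
```

The English text concentrates its probability mass on a few frequent letters, so its per-character entropy comes out well below that of the near-uniform string.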

And actually, I generated data set #2 by using a zipping tool (7-Zip) to compress the whole of “The Woodlanders”, opening the compressed zip file in a text editing programme (EditPad Lite) and copying 307 characters [2]. The uncompressed text file is 778 kbytes compared to 287 kbytes for the compressed file, giving a compression ratio of 37%. So in fact data set #2 contains a bigger fraction of the whole story than data set #1: it contains 1/0.37 ≈ 2.7 times the amount of information in data set #1.
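The arithmetic above can be checked in a couple of lines (the file sizes are the ones quoted in the paragraph):

```python
# Checking the compression arithmetic: 778 kbytes of text became
# 287 kbytes of compressed data.
uncompressed_kb = 778
compressed_kb = 287

ratio = compressed_kb / uncompressed_kb
print(f"compression ratio: {ratio:.2f}")   # about 0.37

# Each compressed character therefore stands for roughly 1/ratio
# characters of the original story.
print(f"story characters per compressed character: {1 / ratio:.1f}")   # about 2.7
```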

The point is that compressing files reduces their size by removing redundancy, and the more a file is compressed, the more random it appears. This, indeed, is the principle behind algorithmic information theory: the information content of a file is given by the size of the smallest file that can represent the content. It is the theoretical limit to compression.

However, data set #2 might not have been created in the way I said it was. Consider now data set #3:

      ±­Ìšê줳ÓÙÙ“ãÝá´¬Âᄅ¥¬È¤ØÃߤ—Ž»Ü“”š™Ø—ÕК Î祐ßÔ£¶ºª‘ì½’¤â¾¦”Êœ›   
      ÉêÈ šÆ§Ì¯èÉß«Ž²Œ«Ç¡ÉÍ̬—¦ê½Á•²¾ÁÅªÃ²¸¡¢Í¡¬¿à±·ž•Ü©ÑÚçÑύ敔ÈÙÂÚ×›
      ŽÓåÁ䟟ÃÍ ÙÙÍáßâÉÙäÕë⻵Ã̎ߪÒç´±¼Ø׬驓Ϭ߬—
      ì×˙֩뵧áç¹âµá¯©Ó’çԚ㬮µë»³¼»×ÔéÚÓ¬îîÌ丣¥ÀÊ­
     ¨µÄ˺ߞɷ¿ÔÄï´¼Ûä¿ÀÙ£¸Øßç¼Ù¬ŒœÀÅá±åàæÙ˜«Ëבǚ•™™ÄͲÄÒЪ­£ŒÂÞ̯”Ú­
      𩢪ÏÔÂáÊæÑØðšá

This one I created with the help of the RAND function in Excel. So it doesn’t contain information, does it? Well, generating random numbers in computers is notoriously difficult, because any deterministic algorithm is by definition not random. I think I read somewhere that the RAND function in Excel tries to do something clever with time intervals in the machine, thereby picking up some of the thermal randomness that affects the precise timing of the computer clock. But the point is: a) it doesn’t contain information for me and you at the moment [3], but b) it might nevertheless contain information that could, in principle, be extracted. If it does contain information, then information theory tells us an upper limit on how much it can contain.
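On the determinism point: any pseudo-random generator run from a known seed is completely predictable, which is exactly why it cannot, by itself, be truly random. A quick illustration with Python's built-in generator (Excel's internals are beside the point here, and the seed value is arbitrary):

```python
import random

# Two generators started from the same seed produce identical
# "random" streams: the output is fully determined by the seed.
a = random.Random(12345)
b = random.Random(12345)

stream_a = [a.random() for _ in range(5)]
stream_b = [b.random() for _ in range(5)]

print(stream_a == stream_b)   # the streams are identical
```

To anyone who knows the seed and the algorithm, such a stream carries no surprise at all, however random it looks on the page.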

As to the randomness question, we are again back to quantum mechanical randomness as the only true randomness of physics, always assuming that God does play dice, of course.

[1] Text file of The Woodlanders downloaded from the Gutenberg Project: https://www.gutenberg.org/files/482/482-h/482-h.htm

[2] It was a bit messier than that, because the zip file is not coded as a text file, so EditPad was trying to interpret arbitrary bytes as text, and I had to fudge it a bit. If I had the time, I would have written a compression programme that gave its output as text. That would be possible but presumably wouldn't compress as much. The fudge doesn't change the argument, though.

[3] Note the theme of provisionality cropping up again!
