Tuesday, 15 August 2017

Meaning in information theory

Here's something I've been meaning to write about for a long time, addressing a common criticism of Shannon information.
 
According to many measures of information, as used in Shannon’s information theory and in algorithmic information theories (Kolmogorov, Solomonoff and Chaitin), the more random and unpredictable data is, the higher the information content. This is often presented as a reason for rejecting them as meaningful measures of semantic information.
Arguments go something like this.
Compare these two sets of data:
Data set #1
The rambler who, for old association or other reasons, should trace the forsaken coach-road running almost in a meridional line from Bristol to the south shore of England, would find himself during the latter half of his journey in the vicinity of some extensive woodlands, interspersed with apple-orchards.
Data set #2
ƒw€¾ë †eܯ⠡nÊ.,+´Àˆrœ                D[1]R¥­Æú9yŠB/öû˜ups>YS ­ Ù#Ô¤w/ÎÚÊø ªÄfª Æýð‘ëe© Y›C&$¾"˾Bµ"·
                       F8 ©t"60 <Ikˆ¢†I&Év*Úí‡a,àE l xùçkvV¯hó
                       ZÑi½ŠD°4‹¨D#n ¬T*g    
                       ZœÎr‡œ`™ó%î ù†ðË %P€óKIx­s©ÊÁz¯
                       ‰xtÝ\K®§Ivsmm*)¹W¯)öÖ(ÃS/`ŽM¨Äcs
                       ‹
8V79{¾à²ÈÇ'R|“‡üE­ û—€Ñ‚ŠB×ÉD         \{F$þݦýCÕŽ´Û2ø

They both contain the same number of characters (307, including spaces), and information theory would have us believe that the second contains more information than the first because it is more random. This is clearly rubbish.
QED: information theory sucks!
No.

The thing is that you, the reader, perceive data set #1 as containing information because of the levels above. It contains information, for you, because it is a legitimate sentence according to English grammar (the syntax) and because of its meaning in the context of the story (“The Woodlanders” by Thomas Hardy [1]). At the Shannon (or Kolmogorov, or whatever) level, none of this is known. The Shannon level doesn’t know that data set #2 is ‘just’ random whereas data set #1 carries meaning. Data set #2 *might* contain information from the layers above, and if it did, it would be able to contain more than data set #1.

And actually, I generated data set #2 by using a zipping tool (7-Zip) to compress the whole of “The Woodlanders”, opening the compressed zip file in a text editing programme (Editpad lite) and copying 307 characters [2]. The uncompressed text file is 778 kbytes compared with 287 kbytes for the compressed file, giving a compression ratio of 37%. So in fact data set #2 represents a bigger fraction of the whole story than data set #1: it contains 1/0.37 ≈ 2.7 times the amount of information in data set #1.
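If you want to reproduce the experiment, here is a minimal sketch in Python using the standard-library zlib module. The file name is an assumption, and zlib's DEFLATE is not exactly what 7-Zip used, so the precise ratio will differ a little:

```python
import zlib

# File name is an assumption: the plain-text file of "The Woodlanders",
# downloaded from the Gutenberg Project link in note [1].
with open("woodlanders.txt", "rb") as f:
    raw = f.read()

compressed = zlib.compress(raw, 9)

ratio = len(compressed) / len(raw)
print(f"Original:   {len(raw):,} bytes")
print(f"Compressed: {len(compressed):,} bytes")
print(f"Ratio:      {ratio:.0%}")

# A 307-character excerpt of the compressed stream therefore stands for
# roughly 307 / ratio characters of the original story.
print(f"307 compressed characters represent about {307 / ratio:.0f} original ones")
```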

The point is that compressing files reduces their size by removing redundancy, and the more a file is compressed, the more random it appears. This, indeed, is the principle behind algorithmic information theory: the information content of a file is given by the size of the smallest file that can represent the content. It is the theoretical limit to compression.
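You can see the 'compressed data looks random' effect numerically by estimating the zeroth-order Shannon entropy from byte frequencies. This is only a sketch of the idea, since it ignores correlations between bytes, which is exactly the redundancy the compressor removes:

```python
import math
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Zeroth-order Shannon entropy, in bits per byte."""
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

sample = ("The rambler who, for old association or other reasons, should "
          "trace the forsaken coach-road running almost in a meridional "
          "line from Bristol to the south shore of England").encode() * 20

print(byte_entropy(sample))                 # English text: roughly 4 bits/byte
print(byte_entropy(zlib.compress(sample)))  # compressed: much closer to the 8-bit maximum
```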

However, data set #2 might not have been created in the way I said it was. Consider now data set #3:

      ±­Ìšê줳ÓÙÙ“ãÝá´¬Âᄅ¥¬È¤ØÃߤ—Ž»Ü“”š™Ø—ÕК Î祐ßÔ£¶ºª‘ì½’¤â¾¦”Êœ›   
      ÉêÈ šÆ§Ì¯èÉß«Ž²Œ«Ç¡ÉÍ̬—¦ê½Á•²¾ÁÅªÃ²¸¡¢Í¡¬¿à±·ž•Ü©ÑÚçÑύ敔ÈÙÂÚ×›
      ŽÓåÁ䟟ÃÍ ÙÙÍáßâÉÙäÕë⻵Ã̎ߪÒç´±¼Ø׬驓Ϭ߬—
      ì×˙֩뵧áç¹âµá¯©Ó’çԚ㬮µë»³¼»×ÔéÚÓ¬îîÌ丣¥ÀÊ­
     ¨µÄ˺ߞɷ¿ÔÄï´¼Ûä¿ÀÙ£¸Øßç¼Ù¬ŒœÀÅá±åàæÙ˜«Ëבǚ•™™ÄͲÄÒЪ­£ŒÂÞ̯”Ú­
      𩢪ÏÔÂáÊæÑØðšá

This one I created with the help of the RAND function in Excel. So it doesn’t contain information, does it? Well, generating random numbers in computers is notoriously difficult, because any deterministic algorithm is by definition not random. I think I read somewhere that the RAND function in Excel tries to do something clever with time intervals in the machine, thereby picking up some of the thermal randomness that affects the precise timing of the computer clock. But the point is: (a) it doesn’t contain information for you and me at the moment [3], but (b) it might nevertheless contain information that could, in principle, be extracted. If it does contain information, then information theory tells us an upper limit on how much it can contain.
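The determinism is easy to demonstrate. Here is a rough Python analogue of the data set #3 experiment (a hypothetical stand-in, not how Excel's RAND actually works): seed the generator and you get exactly the same 'random' characters every time.

```python
import random

def pseudo_random_chars(n: int, seed: int) -> str:
    """n 'random' printable-ish characters from a deterministic PRNG."""
    rng = random.Random(seed)
    return "".join(chr(rng.randint(0x21, 0x2FF)) for _ in range(n))

# Same seed, same output, run after run: the sequence is fully
# deterministic, so by definition it is not truly random.
assert pseudo_random_chars(307, seed=42) == pseudo_random_chars(307, seed=42)
print(pseudo_random_chars(307, seed=42))
```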

As to the randomness question, we are again back to quantum mechanical randomness as the only true randomness of physics, always assuming that God does play dice, of course.

[1] Text file of The Woodlanders downloaded from the Gutenberg Project: https://www.gutenberg.org/files/482/482-h/482-h.htm

[2] It was a bit messier than that, because the zip file is not coded as a text file, so Editpad was trying to interpret arbitrary bytes as text, and I've fudged it a bit. If I had the time, I would have written a compression programme that gave its output as text. That would be possible, but presumably wouldn't compress as much. The fudge doesn't change the argument, though.
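For what it's worth, a cheap way to get compression with text output is to Base64-encode the compressed bytes; the encoding overhead is the reason it wouldn't compress as much. A minimal sketch, again assuming the Gutenberg text file:

```python
import base64
import zlib

# File name is an assumption: the Gutenberg text of the novel again.
with open("woodlanders.txt", "rb") as f:
    text = f.read()

compressed = zlib.compress(text, 9)
as_text = base64.b64encode(compressed).decode("ascii")  # safe to treat as text

# Base64 spends 4 output characters for every 3 input bytes, so the
# text-safe version is about a third bigger than the raw compressed bytes.
print(len(compressed), len(as_text))
```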

[3] Note the theme of provisionality cropping up again!

Sunday, 16 July 2017

Randomness is nothing. Nothingness is random.

To say that something is random is to say that it conveys no information. That is the true nothing.

We know from physics that a vacuum, in the sense of the complete absence of any matter or energy, doesn't exist, because the uncertainty principle allows particle-antiparticle pairs to appear spontaneously provided they vanish again quickly enough. This is what creates the 'vacuum energy', and it is measurable.

So you might argue that there's no such thing as nothing. But the presence or absence of matter or energy is not what is significant: what is important is information. In the data-information model, information cannot be extracted from completely random data. Or, to put it another way, completely random data is no data at all (is nothing).

However, randomness is relative. If I listen to someone speaking a Chinese language, the words are random to me because I cannot extract meaning from them, but of course they are not random to a Chinese speaker. So, nothingness is relative. Unless, that is, there exists randomness that is absolutely random. Is this the randomness of quantum mechanics? Is this 'God playing dice' (in the famous phrase of Albert Einstein)?


Monday, 22 May 2017

The line-up for DTMD 2017 at IS4SI. Narrative and Rhetoric: exploring meaning in a digitalised society

The line-up for DTMD 2017 at IS4SI in Gothenburg is now finalised. Here's a brief description.
DTMD 2017 is the sixth workshop on understanding the nature of information organised by the DTMD group from The Open University in Milton Keynes, UK. DTMD is abbreviated from ‘The Difference that Makes a Difference’, Gregory Bateson’s celebrated definition of information, and the workshops have all had an interdisciplinary approach to information and sought to encourage cross-discipline discussion.

DTMD 2017 takes as its theme ‘narrative’: exploring both the narratives of information and the language of information in the narratives of the digitalised society, in order to enhance understanding both of society and of information. The workshop is divided into two halves, with significant time allocated for in-depth discussion in each. The first half has a philosophical flavour, starting with Chapman asking “What can we say about information?”, followed by Fiorini’s “Predicative Competence in a Digitalised Society” and Jones’s “Narrative realities and optimal entropy”. After discussion of the first three presentations, the second half has a more applied/political focus. Ali’s “Decolonizing Information Narratives” and Sordi’s “The Algorithmic Narrator” both take a critical look at algorithms in society, and then the final paper of the workshop, Ramage’s “Meaning, selection & narrative: the information we see and the information we don’t”, explores the contested nature of information and narratives before the final period of discussion.

It takes place on Monday 12th June, with the schedule as follows:

10:30-11:00 What can we say about information? Agreeing a narrative (David Chapman)
11:00-11:30 Predicative Competence in a Digitalised Society (Rodolfo A. Fiorini)
11:30-12:00 Narrative realities and optimal entropy (Derek Jones)
[12:00-14:00 Lunch, then Deacon panel]
14:00-14:30 General discussion 1
14:30-15:00 Decolonizing Information Narratives (Syed Mustafa Ali)
[15:00-15:30 Tea]
15:30-16:00 The Algorithmic Narrator (Paolo Sordi)
16:00-16:30 Meaning, selection & narrative: the information we see and the information we don’t (Magnus Ramage)
16:30-17:00 General discussion 2

Monday, 8 May 2017

Perspectives on Information - now only £28

I've just spotted that you can now buy Perspectives on Information (Routledge, 2011) for as little as £28. (It originally came out in hardback at about £90!)

The book arose out of the very first of the DTMD workshops (not that we called it DTMD at the time). It was an internal workshop at the Open University held in 2007, so all of the authors of Perspectives were, at the time of the workshop, employed at the OU, though some have since moved on.



Wednesday, 26 April 2017

Data rising and information falling

I've been too busy to blog for a while, and I'm still too busy, but here's a quick one.
[Chart comparing the frequency of use of the words 'data' and 'information' over time]
QI, quite interesting, don't you think?  I did the comparison because I suspected that 'information' was going out of favour. I think people might be starting to be more careful in their use of 'information' and 'data'.

Monday, 27 February 2017

Submission deadline for DTMD 2017 extended to 1 April

The deadline for submitting papers to DTMD 2017, Information, Narrative and Rhetoric: Exploring Meaning in a Digitalised Society (and to all the conferences at IS4SI) has been extended to the first of April. See my previous post for details of DTMD 2017.

Monday, 13 February 2017

Studentships to study the nature of information at The Open University

The department where I work at the Open University in Milton Keynes, UK, is offering two full-time PhD studentships.  The details are here: http://www.open.ac.uk/about/employment/vacancies/phd-studentship-10211 

Note the deadline of 10th March.

Applications are invited to work in any of the fields of interest of the department, and that includes the study of information.  See the list of topics of interest here: http://www9.open.ac.uk/mct-cc/study/research-degrees/student-projects,

but note especially:


and:


Both of these would be based in the DTMD Research Group.

If you might be interested in studying for a PhD in either of these areas, or you have an idea for a related topic, please get in touch with me (david.chapman[at]open.ac.uk) as soon as possible.

And please pass this on!