Thursday, May 08, 2008

I use features

What I tried to say:
"Greek place"
What I ended up saying:
"Breek ... errr..."
What this means:
The initial consonant of word 2 lent its [place] feature to the initial consonant of word 1, BUT word 1 kept its [voicing] feature intact. Thus, /g/ -> /b/.
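As a toy illustration of that slip, here is a minimal sketch in Python. The place and voicing values for /p/, /b/, /k/ and /g/ are standard phonology, but the dictionary representation and the helper name `borrow_place` are assumptions made up for this example, not a claim about how the speech system actually stores features.

```python
# Toy feature bundles for four stops. The dict layout is an
# illustrative assumption, not a model of real speech planning.
SEGMENTS = {
    "p": {"place": "bilabial", "voiced": False},
    "b": {"place": "bilabial", "voiced": True},
    "k": {"place": "velar", "voiced": False},
    "g": {"place": "velar", "voiced": True},
}


def borrow_place(target, donor):
    """Return the segment that has the donor's place and the target's own voicing."""
    wanted = {
        "place": SEGMENTS[donor]["place"],
        "voiced": SEGMENTS[target]["voiced"],
    }
    for symbol, features in SEGMENTS.items():
        if features == wanted:
            return symbol
    raise ValueError("no segment with that feature bundle")


# 'Greek place': /g/ takes the place of /p/ but keeps its own voicing.
slip = borrow_place("g", "p")
print(slip)
```

If whole segments had been swapped instead of features, the slip would have come out as 'Preek', not 'Breek'.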
Looks like I do use something like a feature after all :) 
Moral: pay close attention to your slips of the tongue and other speech errors...
Chris K swaps features between the onset and the coda of a single syllable - 
"wis.dom" -> "wis.bon"
by swapping the onset [+alveolar] with the coda [+bilabial]; leaving the rest intact...
Medal and honors to her.

Sunday, May 04, 2008

The Mystery of Sixteen

What's with Sixteen and Facebook?

Background 1: Facebook recently released an app called Lexicon, which counts the occurrences of up to five words in Facebook wall posts and shows their relative usage over a period of several days.

Background 2: I happened to have done my Ph.D. with the awesome Jacques Mehler, who, along with one of his (other ;) star students Stan Dehaene, wrote a paper way back in 1992, entitled 'Cross-linguistic regularities in the frequency of number words' (Cognition, 43, 1-29). In this paper, they found that in a cross-linguistic comparison of written corpora, (a) the frequency of usage of a number was inversely proportional to its magnitude, and (b) there were some local increases in frequency, e.g. for 'hundred', for the multiples of 10 and for some numbers between 10 and 20.

Materials and methods: Naturally, it behooved me to see whether Facebook users followed this trend. You who know my methods would immediately guess that Matlab was made use of. It was: images were captured, axes drawn and relative frequencies estimated (Lexicon does not give absolute frequencies... in fact, there is no y-scale).

Here is one image output from Facebook. The estimated relative frequencies are plotted in a log-log scale in the inset graph.

As had been observed, there is a nice decrease in frequency going from 'one' to 'four'.

So far so good. But could Lexicon actually also pick up the subtle local increases that Stan & Jacques found?

So, I looked at the numbers between 11 and 18. Figure 2 (Numbers 10 to 19) shows the relation between the previous findings and the current ones. In the graph from the paper, each line is a different language (corpus). Ignoring the dotted lines, if you look at the trends for the languages (usefully highlighted by me) and compare them with the inset from Facebook (in blue at the top), you see two things: (a) overall, the shapes are remarkably similar, BUT (b) 'sixteen' is vastly over-represented. Hmm.

Bit of a puzzle, that. Has American English changed? Is it the difference in corpora? Theirs was the famous Francis and Kucera (1982), this was Facebook. Or is it because of the online, web nature of the Facebook 'corpus'?

Well, that could be checked quickly enough. A bunch of Google searches for 'eleven' through 'eighteen' produced (approximate) counts for these words on the web at large. I normalized the Facebook and the Google values separately by subtracting the minimum, adding 1 and dividing the whole by the maximum, which puts both sets on a common scale. The results are shown in the last figure. For the most part, there's a pretty amazing concordance between Facebook and Google. Except for 'sixteen'!
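For anyone who wants to redo the normalization, here is a minimal sketch in Python (the original was done in Matlab). It assumes that 'the maximum' means the maximum of the shifted values, so that the largest count maps to 1. The function name `normalize` and the counts below are placeholders, NOT the actual Facebook or Google numbers.

```python
def normalize(counts):
    """Subtract the minimum, add 1, then divide by the maximum of the shifted values.

    Assumption: the divisor is the maximum AFTER shifting, so the
    largest value in the result is exactly 1.
    """
    lowest = min(counts)
    shifted = [c - lowest + 1 for c in counts]
    top = max(shifted)
    return [s / top for s in shifted]


# Placeholder counts for illustration only (not real data).
example = [120, 45, 30, 60]
scaled = normalize(example)
print(scaled)
```

Applying the same function to each source separately is what lets the shapes of the two curves be compared even though the raw counts differ by orders of magnitude.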

Well, I must admit, I am rather puzzled. What's with sixteen? Ok, so it's the age for getting a driver's license in most states, but so what? And why is Facebook different from the rest of the web? Is there something sinister?


So, I did some more poking around. One hypothesis (suggested by Celeste) was that this had to do with the American cultural association of 16 with 'coming of age' plus the fact that Facebook is primarily a teenage phenomenon. I thought, instead, that this might have to do with the fact that Facebook is a social networking site. To test this, I looked at some stats off of myspace, using Google's handy site-specific searching, e.g. 'sixteen'. The results are shown in the next figure:

These data show that myspace, like Facebook, has a spike in the count of 'sixteen'; something not found in the original study or in the stats from the whole web (google in the graph).

So, this supports the idea that social networking sites have an unexplained spike in 'sixteen'.

To look a little more into this, I looked at the phrase 'sweet sixteen'. Now Lexicon allows you to specify phrases as well as individual words. So I did that. As a control, I had 'thirteen', which we know is pretty low (and not just in American English; the exact reasons for the cross-culturally low values are not clear, but they are discussed in the original paper). And I added in 'turned sixteen' and 'sixteen years' as other relevant phrases. The results from Lexicon in the next picture are clear: 'Sweet sixteen' shows up VERY frequently, while the other phrases don't even show up!

Once again, I used Matlab to estimate the frequency of the phrase 'sweet sixteen': it accounts for nearly a THIRD of all occurrences of 'sixteen'. In the final graph, I plot the occurrences of 'sixteen' on Facebook either with (blue) or without (red, dotted) the phrase 'sweet sixteen'.
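The 'without' curve is just the 'sixteen' value with the phrase's share taken out. A minimal sketch, assuming the share is a simple fraction of the relative frequency; the helper name `without_phrase` is made up for this example, and 1.0 is a placeholder for the full estimated frequency of 'sixteen', not a measured value.

```python
def without_phrase(word_freq, phrase_share):
    """Relative frequency of a word after removing the share due to one phrase."""
    if not 0 <= phrase_share <= 1:
        raise ValueError("phrase_share must be between 0 and 1")
    return word_freq * (1 - phrase_share)


# Placeholder: 1.0 stands for the full frequency of 'sixteen';
# 1/3 approximates the "nearly a third" share of 'sweet sixteen'.
adjusted = without_phrase(1.0, 1 / 3)
print(adjusted)
```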

Now suddenly the picture looks much more like the original and like Google!

So, the REAL question is, what's with social networking sites (ok, I haven't done this with Myspace yet) and 'sweet sixteen'???

