Brownian thought space

Cognitive science, mostly, but more a sometimes structured random walk about things.


Sunday, May 04, 2008

The Mystery of Sixteen

What's with Sixteen and Facebook?

Background 1: Facebook recently released an app called Lexicon, which counts occurrences of up to five words in Facebook wall posts and shows their relative usage over a period of several days.

Background 2: I happened to have done my Ph.D. with the awesome Jacques Mehler, who, along with one of his (other ;) star students, Stan Dehaene, wrote a paper way back in 1992 entitled 'Cross-linguistic regularities in the frequency of number words' (Cognition, 43, 1-29). In this paper, they found that in a cross-linguistic comparison of written corpora, (a) the frequency of usage of a number word was inversely proportional to its magnitude, and (b) there were some local increases in frequency, e.g. for 'hundred', the multiples of 10 and some numbers between 10 and 20.

Materials and methods: Naturally, it behoved me to look at how Facebook users fit this trend. You who know my methods would immediately guess that Matlab was made use of. It was: images were captured, axes drawn, and relative frequencies estimated (Lexicon does not give absolute frequencies; in fact, there is no y-scale).
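For readers who want to try something similar, here is a minimal sketch in Python (not the original Matlab, which no longer exists) of turning traced line positions into relative frequencies. It assumes you have read off by hand the pixel row of the plot's baseline and of each word's line; the function name `relative_heights` and all the numbers are hypothetical, for illustration only.

```python
def relative_heights(baseline_row, line_rows):
    """Convert pixel rows of traced lines into heights relative to the tallest line."""
    # Image rows count downward from the top, so the height of a line
    # above the baseline is baseline_row - row.
    heights = {word: baseline_row - row for word, row in line_rows.items()}
    tallest = max(heights.values())
    # Without a y-scale only ratios are meaningful, so scale by the tallest line.
    return {word: h / tallest for word, h in heights.items()}


# Hypothetical pixel rows (not measurements from the Lexicon images).
rows = {"one": 40, "two": 120, "three": 160, "four": 180}
rel = relative_heights(baseline_row=200, line_rows=rows)
print(rel)
```

Because Lexicon gives no y-scale, only the ratios between words can be recovered this way, never absolute counts.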


Here is one image output from Facebook. The estimated relative frequencies are plotted on a log-log scale in the inset graph.

As had been observed, there is a nice decrease in frequency going from 'one' to 'four'.

So far so good. But could Lexicon also pick up the subtle local increases that Stan & Jacques found?

So, I looked at the numbers between 11 and 18. Figure 2 (Numbers 10 to 19) shows the relation between the previous findings and the current ones. In the graph from the paper, each line is a different language (corpus). Ignoring the dotted lines, if you look at the trends for the languages (usefully highlighted by me) and compare them with the inset from Facebook (in blue at the top), you see two things: (a) overall, the shapes are remarkably similar, BUT (b) 'sixteen' is vastly over-represented. Hmm.

Bit of a puzzle, that. Has American English changed? Is it the difference in corpora? Theirs was the famous Francis and Kucera (1982); mine was Facebook. Or is it because of the online, web nature of the Facebook 'corpus'?

Well, that could be checked quickly enough. A bunch of Google searches for 'eleven' through 'eighteen' produced (approximate) counts for these words on the web at large. I normalized the Facebook and the Google values by subtracting the minimum, adding 1 and dividing the whole by the maximum. The results are shown in the last figure. For the most part, there's a pretty amazing concordance between Facebook and Google. Except for 'sixteen'!
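The normalization described above can be sketched in a few lines of Python. This is one reading of it, not the original code: I am assuming that "the maximum" means the maximum after the shift, so that the largest value maps to exactly 1. The function name `normalize` and the example counts are made up for illustration.

```python
def normalize(counts):
    """Subtract the minimum, add 1, then divide by the maximum of the shifted values."""
    lowest = min(counts)
    # Adding 1 keeps the smallest value above zero after the shift.
    shifted = [c - lowest + 1 for c in counts]
    # Assumption: divide by the maximum of the shifted values, so the peak is 1.
    top = max(shifted)
    return [s / top for s in shifted]


# Hypothetical counts, not the actual Facebook or Google numbers.
example = [50, 20, 10, 30]
normalized = normalize(example)
print(normalized)
```

Putting both sources on the same 0-to-1 range is what makes the two curves comparable, since the Facebook values are relative heights and the Google values are raw hit counts.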




Well, I must admit, I am rather puzzled. What's with sixteen? Ok, so it's the age for getting a driver's license in most states, but so what? And why is Facebook different from the rest of the web? Is there something sinister?



UPDATE:

So, I did some more poking around. One hypothesis (suggested by Celeste) was that this had to do with the American cultural association of 16 with 'coming of age', plus the fact that Facebook is primarily a teenage phenomenon. I thought, instead, that this might have to do with the fact that Facebook is a social networking site. To test this, I looked at some stats from MySpace, using Google's handy site-specific searching, e.g. 'sixteen site:myspace.com'. The results are shown in the next figure:



These data show that MySpace, like Facebook, has a spike in the count of 'sixteen', something not found in the original study or in the stats from the whole web ('google' in the graph).

So, this supports the idea that social networking sites have an unexplained spike in 'sixteen'.

To dig a little deeper, I looked at the phrase 'sweet sixteen', since Lexicon allows you to specify phrases as well as individual words. As a control, I had 'thirteen', which we know is pretty low (and not just in American English; the exact reasons for the cross-culturally low values are not clear, but they are discussed in the original paper). I also added 'turned sixteen' and 'sixteen years' as other relevant phrases. The results from Lexicon in the next picture are clear: 'sweet sixteen' shows up VERY frequently, while the other phrases don't even show up!


Once again, I used Matlab to estimate the frequency of the phrase 'sweet sixteen': it accounts for nearly a THIRD of all occurrences of 'sixteen'. In the final graph, I plot the occurrences of 'sixteen' on Facebook either with (blue) or without (red, dotted) the phrase 'sweet sixteen'.

Now suddenly the picture looks much more like the original and like Google!
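The 'without' curve is just the 'sixteen' value with the phrase's share taken out. A minimal Python sketch, assuming a share of exactly one third (the estimate was only "nearly a third") and using made-up relative frequencies; `without_phrase` is a hypothetical helper, not code from the original analysis:

```python
def without_phrase(word_freq, phrase_share):
    """Frequency of a word after removing the share that belongs to a phrase."""
    return word_freq * (1.0 - phrase_share)


# Hypothetical relative frequencies, not the values estimated from Lexicon.
freqs = {"fifteen": 0.6, "sixteen": 0.9, "seventeen": 0.5}

adjusted = dict(freqs)
# Assumption: 'sweet sixteen' makes up one third of all 'sixteen' occurrences.
adjusted["sixteen"] = without_phrase(freqs["sixteen"], 1 / 3)
print(adjusted)
```

Only the 'sixteen' point moves; the other number words are left as they were.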

So, the REAL question is, what's with social networking sites (ok, I haven't done this with MySpace yet) and 'sweet sixteen'???


4 Comments:

Anonymous said...

You know that could be a Science paper!
beautiful!
JR

October 27, 2008 5:53 PM  
Blogger Matt Dye said...

Slow Mo -- way too much time on your hands, bro.

December 05, 2008 6:15 AM  
Anonymous muna said...

How did you plot the occurrences using Matlab?
can you help me with the code?
Thanks! :)

May 01, 2011 2:16 PM  
Blogger mohinish said...

Muna - do you mean just the plotting? Or the estimation and plotting? Estimation was done by importing the image, plotting it and manually plotting overlaid lines that best matched the general trend of the lines.. I don't have the code anymore, but there was nothing fancy :)
-Mo

May 01, 2011 2:54 PM  
