Friday, November 13, 2009


I'm sorely vexed with 'teh'.
For those who think that frequency is teh everything, here is a conundrum inside an enigma: what explains the frequency of 'teh'?
The graph shows the log frequency of every possible 2-letter mixup of 'the'. Ok, ignore the last one.
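For anyone who wants to repeat the exercise with other words, here is a rough sketch of how the candidate misspellings could be generated. The function names `two_letter_swaps` and `all_rearrangements` are made up for illustration, and I'm assuming a "2-letter mixup" means swapping two letters of the word. The second function lists every rearrangement, in case a broader set is wanted.

```python
from itertools import combinations, permutations


def two_letter_swaps(word):
    """Return the distinct strings made by swapping exactly two letters of word."""
    variants = []
    for i, j in combinations(range(len(word)), 2):
        letters = list(word)
        letters[i], letters[j] = letters[j], letters[i]
        candidate = "".join(letters)
        # Skip swaps of identical letters and avoid duplicates.
        if candidate != word and candidate not in variants:
            variants.append(candidate)
    return variants


def all_rearrangements(word):
    """Return every distinct rearrangement of word except the word itself, sorted."""
    return sorted({"".join(p) for p in permutations(word)} - {word})


print(two_letter_swaps("the"))    # ['hte', 'eht', 'teh']
print(all_rearrangements("the"))  # ['eht', 'eth', 'het', 'hte', 'teh']
```

Each candidate can then be searched for and its hit count logged by hand.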
Note that 'teh' is not only the most common, it's the only one where the entire first page of results is about 'teh' as it relates to 'the'. For the last two, the first page hits had nothing to do with 'the'.
Now consider 'and'. The comparable graph shows decreasing frequency, as it does for 'the', but in this case none of the 2-letter mistakes has anything to do with 'and' - all the hits on page 1 are various abbreviations.
So why is 'teh', as a misspelling (deliberate or otherwise) of 'the' so common?
One possibility is the bigram frequencies: maybe TE or EH is way more frequent than TH or HE (well, probably not the latter).
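The bigram idea is easy enough to check against any text you have lying around. Below is a minimal sketch, assuming bigrams are counted within words only (not across spaces). The name `bigram_counts` and the sample sentence are invented for illustration; a real test would need a large corpus, not one line.

```python
import re
from collections import Counter


def bigram_counts(text):
    """Count letter bigrams inside each word of text, ignoring case."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        # Pair each letter with the one that follows it in the same word.
        for first, second in zip(word, word[1:]):
            counts[first + second] += 1
    return counts


# Hypothetical sample; swap in a real corpus for a meaningful answer.
sample = "the other teacher taught them that the weather is rather cold"
counts = bigram_counts(sample)
for bigram in ("th", "he", "te", "eh"):
    print(bigram, counts[bigram])
```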
The other possibility is motor planning and the layout of the qwerty keyboard - do people using other layouts like Dvorak or Colemak show the same pattern? Other language layouts (maybe with bilinguals)?
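One crude way to start on the keyboard question is to look at which hand types each letter. The sketch below assumes standard touch-typing hand assignments and the usual letter positions for QWERTY, Dvorak and Colemak. The `LEFT_HAND` table and `hand_pattern` function are my own illustrative names, only letters are handled, and hand alternation is of course just one small piece of motor planning.

```python
# Letters typed by the left hand under conventional touch typing.
# Any letter not listed for a layout is treated as right-hand.
LEFT_HAND = {
    "qwerty": set("qwertasdfgzxcvb"),
    "dvorak": set("pyaoeuiqjkx"),
    "colemak": set("qwfpgarstdzxcvb"),
}


def hand_pattern(word, layout):
    """Return a string of L/R marking which hand types each letter of word."""
    left = LEFT_HAND[layout]
    return "".join("L" if ch in left else "R" for ch in word.lower())


for layout in LEFT_HAND:
    print(layout, hand_pattern("the", layout), hand_pattern("teh", layout))
```

On QWERTY, 'the' comes out as left-right-left while 'teh' is left-left-right, so the typo amounts to the left hand firing its second keystroke before the right hand gets its turn. On Dvorak and Colemak the patterns are different, which is why typists on those layouts would be interesting to compare.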
At the very least, it seems that frequency ALONE can never be an answer for stuff you see around you ;)


