Do stupid people make more spelling mistakes? An analysis with R.

2015-10-20
5 min read
R

English orthography is a tricky topic. Most people learning English (and a great number of native speakers) are puzzled by a written language with an utter disregard for phonemic representation. It is therefore worrying that people intuitively assume that spelling mistakes are a sign of low intelligence or unprofessionalism. In this post we show how we used R to look into the relationship between IQ and spelling. We need two things:

  1. A list of commonly misspelled words
  2. Access to a database of psychometric data paired with text written by the individuals tested.

The former can easily be found on Wikipedia whilst the latter can be accessed by researchers on the fabulous Mypersonality Wiki.

Step 1: The list of misspelled words

The longest list of misspelled words I found was on Wikipedia with over 4000 commonly misspelled words.

This bit was very easy: we just copied the text into a text editor and replaced -> with ;. Reading it into R was simple and quick using the readr package.
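The same clean-up can also be done inside R rather than in a text editor. The following is only a sketch in base R (not the readr call used in this post), and the two input lines are made-up examples in the "wrong->right" style rather than actual rows from the Wikipedia list:

```r
# Two made-up lines in the "wrong->right" style (not taken from the real list).
raw <- c("definately->definitely",
         "alot->a lot, allot")

# Replace the "->" separator with ";", as was done by hand in the text editor.
cleaned <- sub("->", ";", raw, fixed = TRUE)

# Parse the ";"-separated text into a two-column data frame.
errors <- read.table(text = cleaned, sep = ";",
                     col.names = c("wrong", "right"),
                     stringsAsFactors = FALSE,
                     quote = "", comment.char = "", strip.white = TRUE)

# The misspellings as a plain character vector.
wrongwords <- errors$wrong
print(errors)
```

Disabling `quote` and `comment.char` keeps apostrophes and `#` characters in the word list from being interpreted as syntax.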

Step 2: Handling 22m Facebook status updates

Mypersonality has data downloadable in a text format. After much struggle, we realised that running R on a laptop was not going to suffice to handle 22 million rows of text. Instead, we used MySQL to read the text file, create indexes and do the relevant joins. Afterwards, we used RODBC to connect to the database and import the data into R.

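library(RODBC)
# Connect to the ODBC data source named "data"
con1 <- odbcConnect("data", uid = "root", pwd = "root")
# Pull the user id and status text; keep the text as character, not factor
data <- sqlQuery(con1, "SELECT id, text FROM status_iq;", stringsAsFactors = FALSE)
odbcClose(con1)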

This looks easy enough, but given that I was completely unfamiliar with ODBC and MySQL it took a lot longer than it should have. Nevertheless, at this point we ended up with 420k+ status updates in the following format:

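library(readr)
# Read only the first row to show the format (stored separately so `data` is not overwritten)
preview <- read_csv("iqfbstatus.csv", col_names = TRUE, n_max = 1)
print(preview)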

Step 3: Counting errors

We wanted to have a matrix with the different types of errors per status, in case we wanted to see whether some errors are more associated with a certain factor (low intelligence, personality, etc…) than others. First we load the misspelled words, make them a character vector and use this vector to loop through the data frame, counting the occurrence of each string in each Facebook status.

library(readr)    # read_csv2()
library(stringr)  # str_count()
# read_csv2() expects ";" as the field delimiter
errors <- read_csv2("sointer/fbiq/wrongwords2.csv", col_names = c("wrong", "right"))
wrongwords <- errors$wrong

for (y in seq_along(wrongwords)) {
  cat(y)
  # stringr uses ICU regular expressions, where \\b marks a word boundary
  data[[wrongwords[y]]] <- str_count(data$text, paste0("\\b", wrongwords[y], "\\b"))
  cat("...done! \n")
}
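To see why the pattern is anchored at word boundaries, here is a small base R sketch on made-up strings. The helper `count_matches` is a hypothetical name introduced only for this illustration; it is not part of stringr or of the code in this post.

```r
# Made-up example statuses.
texts <- c("teh cat", "flight to tehran", "teh teh")

# Hypothetical helper: number of regex matches per string.
count_matches <- function(pattern, x) {
  # gregexpr() locates every match, regmatches() extracts them,
  # and lengths() counts how many were found in each string.
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

with_boundaries    <- count_matches("\\bteh\\b", texts)  # whole words only
without_boundaries <- count_matches("teh", texts)        # any substring

print(with_boundaries)
print(without_boundaries)
```

Without the boundaries, "tehran" would be counted as an occurrence of "teh", which would inflate the error counts for short misspellings.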

Due to the sheer size of the resulting data I ended up splitting this task into smaller, more manageable chunks so R wouldn’t crash. After getting rid of the status updates, I wanted to check which are the most commonly misspelled words and what the distribution of errors is.
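The post does not show how the work was split up, so the following is just one possible way to do it, sketched in base R on made-up data. The chunk size, the helper `count_word`, and the choice to drop misspellings that never occur are all assumptions made for this illustration.

```r
# Made-up statuses and a tiny word list.
texts      <- c("didnt see teh memo", "alot of thier stuff", "nothing wrong here")
wrongwords <- c("didnt", "teh", "alot", "thier", "wierd")

# Hypothetical helper: whole-word occurrences of `word` in each string.
count_word <- function(word, x) {
  lengths(regmatches(x, gregexpr(paste0("\\b", word, "\\b"), x, perl = TRUE)))
}

chunk_size <- 2  # assumption: pick whatever fits comfortably in memory
chunks <- split(wrongwords, ceiling(seq_along(wrongwords) / chunk_size))

chunk_results <- lapply(chunks, function(words) {
  # One column per word, one row per status.
  counts <- vapply(words, count_word, integer(length(texts)), x = texts)
  # Drop misspellings that never occur to keep the result small.
  counts[, colSums(counts) > 0, drop = FALSE]
})

# Stitch the per-chunk matrices back together column-wise.
counts <- do.call(cbind, unname(chunk_results))
print(counts)
```

With the real data, each chunk's result could be written to disk before moving on to the next one, so that only one chunk has to be held in memory at a time.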

library(dplyr)  # arrange()

# Total count per misspelling (the first column is the id)
errorsByWord <- as.data.frame(colSums(data[,-1]))
colnames(errorsByWord)[1] <- "total_err"
errorsByWord$word <- rownames(errorsByWord)
errorsByWord <- arrange(errorsByWord, total_err)
summary(errorsByWord$total_err)
tail(errorsByWord, 15)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    2.00    8.19    4.00  724.00
##      total_err       word
## 1018        91 definately
## 1019        93     todays
## 1020        95      thier
## 1021       106      wierd
## 1022       111     untill
## 1023       116    momento
## 1024       142        teh
## 1025       191      wasnt
## 1026       200  everytime
## 1027       263       thru
## 1028       325   tommorow
## 1029       337       isnt
## 1030       485     doesnt
## 1031       542       alot
## 1032       724      didnt

The most common mistakes are errors I would classify as “petty” mistakes, such as leaving out apostrophes, letters or spaces. Many of these could easily be typos rather than an individual not knowing what the correct spelling is. “Definately” seems to be an exception to the rule, but honestly, who doesn’t misspell that one every now and then?

Step 4: Analysis

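library(dplyr)    # group_by(), summarise()
library(ggplot2)  # ggplot()

# Total number of errors in each status (the first column is the id)
data$total_error <- rowSums(data[,2:ncol(data)])
by_user <- group_by(data, id)
users <- summarise(by_user,
                   count = n(),
                   total_err = sum(total_error),
                   error_mean = sum(total_error)/n(),
                   # standard error of the mean, despite the name
                   error_mse  = sd(total_error)/sqrt(n()))

# `iq` is a data frame with one row per user (columns id and iq)
users <- merge(users, iq, by="id")
# Plot the per-status mean, which is what the discussion below refers to
plot <- ggplot(users, aes(x=iq)) +
  geom_point(aes(y=error_mean))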

The correlation between the mean error and IQ was -0.06, meaning that, overall, making more mistakes per Facebook status update was only very slightly associated with a lower IQ. A visual inspection of the graph shows three outliers with an error mean of over 0.75. These users typically have few status updates; restricting the analysis to individuals with 5 or more status updates yields a correlation of -0.1.
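The filtering step can be sketched as follows. The `users` table here is invented purely for illustration (the numbers are not from the myPersonality data), and `min_updates` is a hypothetical name for the cut-off:

```r
# Invented per-user summary: number of statuses, mean errors per status, IQ.
users <- data.frame(
  count      = c(1, 2, 12, 30, 8, 5, 3, 20),
  error_mean = c(1.00, 0.00, 0.10, 0.05, 0.20, 0.15, 0.80, 0.02),
  iq         = c(105, 98, 110, 125, 95, 102, 120, 118)
)

# Pearson correlation across all users.
cor_all <- cor(users$error_mean, users$iq)

# Keep only users with enough statuses for the mean to be reasonably stable.
min_updates <- 5
active <- users[users$count >= min_updates, ]
cor_active <- cor(active$error_mean, active$iq)

print(c(all_users = cor_all, active_users = cor_active))
```

A user with a single status containing a single error has an error mean of 1, so a handful of such users can dominate the picture; requiring a minimum number of statuses is a simple guard against that.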

Out of curiosity we checked which spelling mistakes are related to high IQ and which mistakes are related to low IQ. We would not attach too much statistical significance to these results, but they make for a pretty wordcloud.

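library(dplyr)
library(wordcloud)
library(RColorBrewer)
# Sum every error column per user (across() replaces the deprecated summarise_each())
matrix <- data %>% group_by(id) %>% summarise(across(everything(), sum))
matrix <- merge(matrix, iq, by="id")

# Column positions are specific to this data set: error counts, then IQ
x <- matrix[,2:1313]
y <- matrix[,1315]
# One correlation with IQ per misspelling
cordata <- data.frame(cor(x, y))
cordata$word <- rownames(cordata)
colnames(cordata)[1] <- "cor"
cordata <- arrange(cordata, cor)

pal <- brewer.pal(8, "Set2")
# Word size is the squared correlation; tail() = highest, head() = lowest correlations
wordcloud(tail(cordata,30)[,2],(tail(cordata,30)[,1])^2,random.order=FALSE,colors=pal)
wordcloud(head(cordata,30)[,2],(head(cordata,30)[,1])^2,random.order=FALSE,colors=pal)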
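As a small illustration of the correlation step, `cor()` applied to a matrix and a vector returns one coefficient per matrix column. The sketch below uses invented counts and IQ values, and it additionally drops columns with zero variance, for which a correlation is undefined; that filtering is an assumption of this example, not something the code above does.

```r
# Invented data: rows are users, columns are misspellings.
x <- cbind(alot  = c(3, 0, 1, 2),
           thru  = c(0, 1, 0, 2),
           wierd = c(0, 0, 0, 0))  # never used, so its variance is zero
y <- c(95, 120, 110, 100)          # invented IQ scores

# Keep only columns that vary; cor() would return NA for the others.
keep <- apply(x, 2, sd) > 0

# cor(matrix, vector) gives a one-column matrix, one row per misspelling.
r <- cor(x[, keep, drop = FALSE], y)[, 1]

cordata <- data.frame(word = names(r), cor = unname(r))
cordata <- cordata[order(cordata$cor), ]  # most negative first
print(cordata)
```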

High IQ
Spelling errors associated with high IQ, with the size of each word reflecting the relative size of its correlation. Ironically, the word “oximoron” is most representative of high IQ individuals, with a correlation of about 0.04. The correlations of the words shown range between 0.027 and 0.038.

Low IQ
The equivalent for low IQ. Here the correlations range from -0.042 for “firends” to -0.085 for “alot”.

Conclusion

In the end, this exercise served to demonstrate what most people intuitively believe: spelling matters, but not thaaaat much. It also taught me quite a few lessons about using R. I hope you found it useful or at least somewhat interesting.

Next up, we’ll check out what spelling mistakes have to do with personality. Any ideas on what else we should do? Get in touch.
