Do stupid people make more spelling mistakes? An analysis with R.
English orthography is a tricky topic. Most people learning English (and a great number of native speakers) are puzzled by a written language with an utter disregard for phonemic representation. It is therefore very worrying that individuals intuitively assume making orthographic mistakes is a sign of low intelligence or unprofessionalism. In this post we are going to show how we looked into the relationship between IQ and spelling using R. We are going to need two things:
- A list of commonly misspelled words
- Access to a database of psychometric data paired with text written by the individuals tested.
The former can easily be found on Wikipedia whilst the latter can be accessed by researchers on the fabulous Mypersonality Wiki.
Step 1: The list of misspelled words
The longest list of misspelled words I found was on Wikipedia with over 4000 commonly misspelled words.
This bit was very easy: we just copied the text into a text editor and replaced "->" with ";". Reading it into R was simple and quick using the readr package.
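If you would rather skip the text editor, the same conversion can be done in R itself. The snippet below is only a sketch in base R, not what we actually ran: the three `raw` lines are made-up samples in the "wrong->right" format, and the column names `wrong` and `right` simply mirror the ones used later in this post.

```r
# Made-up sample lines in the "wrong->right" format of the copied list
raw <- c("abandonned->abandoned", "alot->a lot", "definately->definitely")

# Split each line on the literal "->" separator
parts <- strsplit(raw, "->", fixed = TRUE)

# Build a two-column data frame: the misspelling and its correction
errors <- data.frame(
  wrong = vapply(parts, `[`, character(1), 1),
  right = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
print(errors)
```

With the real list you would fill `raw` with `readLines()` on the saved text file instead of typing the lines by hand.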
Step 2: Handling 22m Facebook status updates
Mypersonality has data downloadable in a text format. After much struggle, we realised that running R on a laptop was not going to suffice to handle 22 million rows of text. Instead, we used MySQL to read the text file, create indexes and do the relevant joins. Afterwards, we used RODBC to connect to the database and import the data into R.
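library(RODBC)
# Connect to the ODBC data source named "data"
con1 <- odbcConnect("data", uid = "root", pwd = "root")
# Pull the id and text of each status update from the status_iq table
data <- sqlQuery(con1, "SELECT id, text FROM status_iq;", stringsAsFactors = FALSE)
odbcClose(con1)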
This looks easy enough, but given that I was completely unfamiliar with ODBC and MySQL it took a lot longer than it should have. Nevertheless, at this point we ended up with 420k+ status updates in the following format:
library(readr)
data<-read_csv("iqfbstatus.csv",col_names = TRUE, n_max=1)
print(data)
Step 3: Counting errors
We wanted to have a matrix with the different types of errors per status, in case we wanted to see whether some errors are more associated with a certain factor (low intelligence, personality, etc…) than others. First we load the misspelled words, make them a character vector and use this vector to loop through the data frame, counting the occurrence of each string in each Facebook status.
library(stringr)
# Semicolon-separated list of misspellings and their corrections
errors <- read_csv2("sointer/fbiq/wrongwords2.csv", col_names = c("wrong", "right"))
wrongwords <- errors$wrong
for (y in seq_along(wrongwords)) {
  cat(y)
  # \\b is the word-boundary anchor in stringr's ICU regex; \\< and \\> are base R (TRE) syntax
  data[[wrongwords[y]]] <- str_count(data$text, paste0("\\b", wrongwords[y], "\\b"))
  cat("...done! \n")
}
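To make the counting step easier to follow, here is a minimal sketch of the same idea on toy data. It uses only base R so it runs without extra packages; the `statuses` data frame, its contents and the helper `count_word()` are hypothetical and only serve as illustration.

```r
# Toy data standing in for the real status updates (hypothetical values)
statuses <- data.frame(
  id   = c(1, 1, 2),
  text = c("i didnt know alot about it", "alot alot of work", "nothing wrong here"),
  stringsAsFactors = FALSE
)
wrongwords <- c("alot", "didnt", "teh")

# Count whole-word occurrences of `word` in each element of `text`
count_word <- function(word, text) {
  hits <- gregexpr(paste0("\\b", word, "\\b"), text, perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions only
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# One row per status, one column per misspelled word
counts <- vapply(wrongwords, count_word, integer(nrow(statuses)), text = statuses$text)
print(counts)
```

The result is a plain integer matrix, which takes less memory than adding thousands of columns to a data frame one at a time.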
Due to the sheer size of the resulting data, I ended up splitting this task into smaller, more manageable chunks so R wouldn’t crash. After dropping the status text column, I wanted to check which words are most commonly misspelled and what the distribution of errors looks like.
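library(dplyr)
# Total occurrences of each misspelling (the first column is the user id)
errorsByWord <- as.data.frame(colSums(data[, -1]))
colnames(errorsByWord)[1] <- "total_err"
errorsByWord$word <- rownames(errorsByWord)
# Sort ascending so that tail() shows the most frequent errors
errorsByWord <- arrange(errorsByWord, total_err)
summary(errorsByWord$total_err)
tail(errorsByWord, 15)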
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    2.00    8.19    4.00  724.00
##      total_err       word
## 1018        91 definately
## 1019        93     todays
## 1020        95      thier
## 1021       106      wierd
## 1022       111     untill
## 1023       116    momento
## 1024       142        teh
## 1025       191      wasnt
## 1026       200  everytime
## 1027       263       thru
## 1028       325   tommorow
## 1029       337       isnt
## 1030       485     doesnt
## 1031       542       alot
## 1032       724      didnt
The most common mistakes are errors I would classify as “petty” mistakes, such as leaving out apostrophes, letters or spaces. Many of these could easily be typos rather than an individual not knowing what the correct spelling is. “Definately” seems to be an exception to the rule, but honestly, who doesn’t misspell that one every now and then?
Step 4: Analysis
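library(dplyr)
library(ggplot2)
# Total number of spelling errors in each status update
data$total_error <- rowSums(data[, 2:ncol(data)])
# Aggregate per user; error_mse is the standard error of the mean
by_user <- group_by(data, id)
users <- summarise(by_user,
  count = n(),
  total_err = sum(total_error),
  error_mean = sum(total_error) / n(),
  error_mse = sd(total_error) / sqrt(n()))
# iq holds each user's IQ score, keyed by id
users <- merge(users, iq, by = "id")
plot <- ggplot(users, aes(x = iq)) +
  geom_point(aes(y = total_err))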
The correlation between the mean error and IQ was -0.06, meaning that overall, making more mistakes per Facebook status update was only very slightly associated with a lower IQ. A visual inspection of the graph shows three outliers with an error mean of over 0.75. These users typically have few status updates. Restricting the analysis to individuals with 5 status updates or more yields a correlation of -0.1.
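For reference, the two correlations can be computed along the following lines. This is a sketch on invented numbers, not our real data: the `users` data frame below is hypothetical and only reuses the column names `count`, `error_mean` and `iq` from the code above.

```r
# Hypothetical per-user summary (values invented for illustration)
users <- data.frame(
  count      = c(1, 2, 8, 12, 30, 6),
  error_mean = c(1.0, 0.8, 0.10, 0.05, 0.02, 0.12),
  iq         = c(95, 100, 105, 120, 125, 98)
)

# Pearson correlation across all users
cor_all <- cor(users$error_mean, users$iq)

# Same correlation, keeping only users with 5 status updates or more
active <- users[users$count >= 5, ]
cor_active <- cor(active$error_mean, active$iq)

print(c(all = cor_all, active = cor_active))
```

Filtering on `count` matters because a user with one or two updates can end up with a very high error mean from a single typo.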
Out of curiosity we checked which spelling mistakes are related to high IQ and which mistakes are related to low IQ. We would not attach too much statistical significance to these results, but they make for a pretty wordcloud.
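library(dplyr)
library(wordcloud)
library(RColorBrewer)
# Sum every error column per user (across() needs dplyr >= 1.0)
matrix <- data %>% group_by(id) %>% summarise(across(everything(), sum))
matrix <- merge(matrix, iq, by = "id")
# Column positions are specific to this dataset: error counts, then the IQ score
x <- matrix[, 2:1313]
y <- matrix[, 1315]
# Correlation of each misspelling's count with IQ
cordata <- data.frame(cor(x, y))
cordata$word <- rownames(cordata)
colnames(cordata)[1] <- "cor"
cordata <- arrange(cordata, cor)
pal <- brewer.pal(8, "Set2")
# Highest (tail) and lowest (head) correlations; squared so the word sizes are positive
wordcloud(tail(cordata, 30)[, 2], (tail(cordata, 30)[, 1])^2, random.order = FALSE, colors = pal)
wordcloud(head(cordata, 30)[, 2], (head(cordata, 30)[, 1])^2, random.order = FALSE, colors = pal)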
High-IQ spelling errors, with the size of each word reflecting the size of its correlation with IQ. Ironically, the word “oximoron” is most representative of high-IQ individuals, with a correlation of 0.04. All words shown have correlations between 0.027 and 0.038.
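The per-word correlations behind the wordclouds come from a single call to `cor()` with a table of counts and a vector of IQ scores. Here is a small base R sketch of that step on invented numbers: the counts in `x` and the scores in `y` are hypothetical.

```r
# Hypothetical per-user error counts for three misspellings
x <- data.frame(
  alot  = c(0, 2, 1, 3),
  teh   = c(1, 0, 0, 2),
  thier = c(3, 1, 2, 0)
)
# Hypothetical IQ scores for the same four users
y <- c(110, 95, 105, 90)

# cor(x, y) returns one row per column of x
cordata <- data.frame(cor = cor(x, y)[, 1])
cordata$word <- colnames(x)

# Most negative correlation first, most positive last
cordata <- cordata[order(cordata$cor), ]
print(cordata)
```

One thing to watch for on real data: a word whose count is the same for every user has zero variance, so `cor()` returns NA for it (with a warning) and it should be dropped before plotting.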
Conclusion
In the end, this exercise served to demonstrate what most people intuitively believe: spelling matters, but not thaaaat much. It also taught me quite a few lessons about using R. I hope you found it useful or at least somewhat interesting.
Next up, we’ll check out what spelling mistakes have to do with personality. Any ideas on what else we should do? Get in touch.