You are wasting your time with spreadsheets
This week I had an interesting conversation (as interesting as these conversations can get) about the use of spreadsheets in business. I am still a little bit surprised by how often they are used for things they are not really good for. Now, I don’t want to write about how there are better alternatives to spreadsheets. Instead, I’d like to illustrate how powerful non-spreadsheet tools are by describing a task I completed this week using R.
I was recently asked to help an NGO aggregate their contact database. It was no surprise they needed help: the database came from dozens of different sources and had thousands of partially duplicated entries. Entries were identifiable by name or by email address, the only columns that could serve as unique identifiers.
To give you a better idea, this is what it looked like:
source | email | firstName | lastName | details1 | details2
---|---|---|---|---|---
This | [email protected] | Bob | ANna | 1 | 5
That | [email protected] | Steve | NA | 2 | 4
Other | [email protected] | NA | Lacey | 3 | 3
Other | [email protected] | Steve | Lacey | 4 | 2
NA | [email protected] | Lacey | Steve | 5 | 1
Except that I had several more columns of details, and many of the emails and other values were invalid or missing. In short, it was a big hassle that would have taken days of manual work to sort out in a spreadsheet, which is what the client originally tried.
A more sophisticated consultant or banker type of spreadsheet user would have written an extensive for loop with dozens of conditions, with incessant F10 pressing and debugging along the way. That probably would have worked, but it would have been a great waste of time, with plenty of places for errors to creep in.
So how did I solve it in R? Almost all of the heavy lifting was done by a single function in the dplyr package. After some data preprocessing, the hardest part of the problem was solved by the following lines of code:
library(stringr) # loading package to deal with strings
library(dplyr)   # loading data wrangling package

# first I write a function to give the most common value in a vector:
# in this case, for example, the first name that occurs most often under a given email
mstcmn <- function(x) {
  x <- names(sort(table(as.character(x)), decreasing = TRUE)[1])
  if (is.null(x)) {
    return(NA)
  }
  return(x)
}
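# (a quick illustration, not from the original post:
#  mstcmn(c("Steve", NA, "Steve", "Lacey")) returns "Steve", since table() ignores the NA
#  and "Steve" is the most frequent remaining value)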
# then I write a function to aggregate all other unique strings
aggr <- function(x) {
  x <- toString(na.omit(x))                              # drop NAs and collapse into one comma-separated string
  x <- str_split_fixed(x, ",", (str_count(x, ",") + 1))  # split back into the individual values
  x <- str_trim(x)                                       # strip stray whitespace (base R has no trim(); stringr's str_trim() does this)
  x <- unique(as.list(x))                                # keep each value only once
  return(toString(unlist(x)))
}
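# (likewise not from the original post: aggr(c("That", NA, "Other", "Other")) returns
#  "That, Other", i.e. NAs dropped, duplicates removed, and the rest joined into one string)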
# and most of the magic happens here: the C++ code under dplyr's hood rapidly and efficiently aggregates the information needed
newdf<-olddf %>% group_by(email) %>% summarise(source = aggr(source),
first = mstcmn(first_name),
last = mstcmn(last_name),
details1 = aggr(details1),
details2 = aggr(details2))
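To make this concrete, here is a minimal toy example (not the NGO's data; the email address and values are made-up placeholders) showing how several rows for the same person collapse into one:

# a tiny made-up example (placeholder email, not real data)
toy <- data.frame(source     = c("That", "Other", NA),
                  email      = rep("steve.lacey@example.org", 3),
                  first_name = c("Steve", "Steve", "Lacey"),
                  last_name  = c(NA, "Lacey", "Lacey"),
                  details1   = c(2, 4, 5),
                  details2   = c(4, 2, 1),
                  stringsAsFactors = FALSE)

toy %>% group_by(email) %>% summarise(source = aggr(source),
                                      first = mstcmn(first_name),
                                      last = mstcmn(last_name),
                                      details1 = aggr(details1),
                                      details2 = aggr(details2))
# a single row remains: source "That, Other", first "Steve", last "Lacey",
# details1 "2, 4, 5", details2 "4, 2, 1"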
This was followed by a few lines of code to capitalise all relevant values and make things pretty. And voilà:
source | email | firstName | lastName | details1 | details2
---|---|---|---|---|---
This | [email protected] | Bob | Anna | 1 | 5
That, Other | [email protected] | Steve | Lacey | 2, 4, 5 | 4, 2, 1
Other | [email protected] | NA | NA | 3 | 3
I also recreated people’s names from their email addresses when the names were missing, since most professional emails are in the format [email protected].
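That step isn't shown in the snippet above, but a minimal sketch of the idea, assuming a firstname.lastname@organisation address pattern (the function name and the pattern are my own illustration, not the original code), could look like this:

# rough sketch: recover a first and last name from a firstname.lastname@... address
# (illustrative only; assumes the address actually follows that pattern)
name_from_email <- function(email) {
  local_part <- str_split_fixed(email, "@", 2)[, 1]  # part before the @
  parts <- str_split_fixed(local_part, "\\.", 2)     # split on the first dot
  data.frame(first = str_to_title(parts[, 1]),
             last  = str_to_title(parts[, 2]),
             stringsAsFactors = FALSE)
}

name_from_email("steve.lacey@example.org")
#   first  last
# 1 Steve Lacey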
Then I used the genderizeR package to add a column with people's titles, Mr or Ms, based on their first name.
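The original code for this step isn't shown either; a rough sketch with genderizeR might look like the following. findGivenNames() queries the genderize.io API, and I'm assuming its result contains name and gender columns; the mapping from gender to title is my own illustration:

library(genderizeR)

# look up the likely gender of each first name via the genderize.io API
# (assumption: the returned table has lower-case 'name' and 'gender' columns)
genders <- findGivenNames(na.omit(newdf$first))

# map the predicted gender onto a title (my own illustrative mapping, not the original code)
predicted <- genders$gender[match(tolower(newdf$first), genders$name)]
newdf$title <- ifelse(predicted == "male", "Mr", "Ms")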
Try doing that in a spreadsheet. I guess the main argument of this post is: don't use spreadsheets for more than they're meant for.