You are wasting your time with spreadsheets

This week I had an interesting conversation (as interesting as these conversations get) about the use of spreadsheets in business. I am still a little surprised by how often they are used for things they are simply not good at. Now, I don't want to write about how there are better alternatives to spreadsheets. Instead, I'd like to illustrate how powerful non-spreadsheet tools can be by describing a task I completed this week using R.

I was recently asked to help an NGO aggregate their contact database. It was no surprise they needed help: the data came from dozens of different sources and had thousands of partially duplicated entries. Entries were identifiable by name or by email, which were the only unique columns.

To give you a better idea, this is what it looked like:

source  email       firstName  lastName  details1  details2
This    wrw@cs.com  Bob        ANna      1         5
That    zam@pk.com  Steve      NA        2         4
Other   s2@qs.com   NA         Lacey     3         3
Other   zam@pk.com  Steve      Lacey     4         2
NA      zam@pk.com  Lacey      Steve     5         1

Except that the real data had several columns of details, and many emails and other values were invalid or missing. In short, it was a big hassle that would have taken days of manual work to sort out in a spreadsheet, which is what the client originally tried.

A more sophisticated consultant- or banker-type spreadsheet user would have written an extensive for loop with dozens of conditions, plus incessant F10-pressing and debugging. That probably would have worked, but it would have been a great waste of time, with plenty of places for errors to creep in.

So how did I solve it in R? Almost all of the heavy lifting was done by a single function from the dplyr package. After some data preprocessing, the hardest problem was solved by the following lines of code:

library(stringr)  # string handling
library(dplyr)    # data wrangling

# First, a function that returns the most common value in a vector:
# concretely, the first name that occurs most often under a given email
mstcmn <- function(x) {
  x <- names(sort(table(as.character(x)), decreasing = TRUE)[1])
  if (is.null(x)) {
    return(NA)  # nothing but missing values
  }
  return(x)
}
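
To see what it does in practice, here it is run on the first names filed under zam@pk.com in the sample above, plus the all-NA edge case:

mstcmn(c("Steve", "Steve", "Lacey"))  # "Steve", the most common value
mstcmn(c(NA, NA))                     # NA, nothing but missing values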

# Then, a function that collapses all other unique strings into one
# comma-separated value
aggr <- function(x) {
  x <- toString(na.omit(x))
  x <- str_split_fixed(x, ",", str_count(x, ",") + 1)
  x <- str_trim(x)  # note: base R has no trim(), stringr's str_trim() is needed here
  x <- unique(as.list(x))
  return(toString(unlist(x)))
}
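
Again, a quick illustration, this time on the details1 values filed under zam@pk.com:

aggr(c(2, 4, 5))       # "2, 4, 5"
aggr(c("a", NA, "a"))  # "a", duplicates and NAs are dropped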


# And most of the magic happens here: the C++ code under dplyr's hood
# rapidly and efficiently aggregates the information for each email
newdf <- olddf %>%
  group_by(email) %>%
  summarise(source   = aggr(source),
            first    = mstcmn(first_name),
            last     = mstcmn(last_name),
            details1 = aggr(details1),
            details2 = aggr(details2))

This was followed by a few lines of code to capitalise all the relevant values and make things pretty (a quick sketch of that step follows the result below). And voilà:

source       email       firstName  lastName  details1  details2
This         wrw@cs.com  Bob        Anna      1         5
That, Other  zam@pk.com  Steve      Lacey     2, 4, 5   4, 2, 1
Other        s2@qs.com   NA         NA        3         3
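
For the capitalisation step, a couple of lines with stringr's str_to_title() are enough. A minimal sketch, assuming the column names from the summarise() call above:

# Title-case the name columns, so e.g. "ANna" becomes "Anna"
newdf <- newdf %>% mutate(first = str_to_title(first),
                          last  = str_to_title(last))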

I also recreated people's names from their email addresses when the names were missing, as most professional emails follow the firstname.lastname@domain.com format.
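
That step boils down to splitting the part before the @ on the dot. A minimal sketch, reusing the newdf columns from above:

# Recover missing names from firstname.lastname@domain.com addresses
local <- str_split_fixed(newdf$email, "@", 2)[, 1]  # the part before the @
parts <- str_split_fixed(local, fixed("."), 2)      # split on the first dot
has_both <- parts[, 2] != ""                        # only trust x.y@... addresses
newdf$first <- ifelse(is.na(newdf$first) & has_both, str_to_title(parts[, 1]), newdf$first)
newdf$last  <- ifelse(is.na(newdf$last) & has_both, str_to_title(parts[, 2]), newdf$last)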

Then I used the genderizeR package to add a title column, Mr or Ms, based on each person's first name.
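
Roughly like this. A sketch assuming genderizeR's findGivenNames()/genderize() workflow, which queries the genderize.io API:

library(genderizeR)

# Look the first names up against genderize.io, then map gender to a title
# (assumes the first-name column is already cleaned up)
givenNames <- findGivenNames(newdf$first)
genders <- genderize(newdf$first, genderDB = givenNames)
newdf$title <- ifelse(genders$gender == "female", "Ms", "Mr")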

Try doing all of this in a spreadsheet. I suppose the main argument of this post boils down to: don't use spreadsheets for more than they are meant for.
