You are wasting your time with spreadsheets

This week I had an interesting conversation (as interesting as these conversations can get) about the use of spreadsheets in business. I am still a little bit surprised by how often they are used for things they are not really good for. Now, I don’t want to write about how there are better alternatives to spreadsheets. Instead, I’d like to illustrate how powerful non-spreadsheet tools are by describing a task I completed this week using R.

I was recently asked to help an NGO aggregate their contact database. It was not a surprise they needed help aggregating it: it came from dozens of different sources, and had thousands of partially duplicated entries. Entries were identifiable by name or by emails, which were the only unique columns.

To give you a better idea, this is what it looked like:

source email firstName lastName details1 details2
This Bob ANna 1 5
That Steve NA 2 4
Other NA Lacey 3 3
Other Steve Lacey 4 2
NA Lacey Steve 5 1

Except that I had several columns with details, many emails and other values were invalid, or missing. In short, it was a big hassle that would have taken days of manual work in spreadsheets to sort out, which was what the client originally tried.

A more sophisticated consultant/banker spreadsheet user would’ve written some extensive for loop with dozens of conditions and incessant F10 pressing and debugging. This probably would’ve worked but would have been a great waste of time, with a lot of places for errors to creep in.

So how did I solve it in R? Almost all of the heavy lifting was done by a single function in the dplyr package. After some data-preprocessing, the hardest problem was solved by the following lines of code:

library(stringr)#loading package to deal with strings
library(dplyr)#loading data wrangling package

#first I write a function to give the most common value in a vector
# a concrete example in this case, the first name that happens most commonly under a given email
  if (is.null(x)){

#then I write a function to aggregate all other unique strings

#and most of the magic happens here: the C++ code under dplyr's hood rapidly and efficienly aggregates the information needed 
newdf<-olddf %>% group_by(email) %>% summarise(source = aggr(source),
                                              first  = mstcmn(first_name),
                                              last  = mstcmn(last_name),
                                              details1 = aggr(details1),
                                              details2 = aggr(details2))

This was followed by a few lines of code to capitalise all relevant values, make things pretty. And voila:

source email firstName lastName details1 details2
This Bob Anna 1 5
That, Other Steve Lacey 2, 4, 5 4, 2, 1
Other NA NA 3 3

I also recreated people’s names based on their emails if their names were missing, as most professional emails in the format:

Then I used the genderizeR package to add a column with people's titles: Mr or Ms based on their name.

Try doing this on a spreadsheet. I guess the main argument of this post is: don't use spreadsheets for more than what they are for.

comments powered by Disqus