You are wasting your time with spreadsheets

2016-11-13
3 min read
R, Excel

This week I had an interesting conversation (as interesting as these conversations can get) about the use of spreadsheets in business. I am still a little bit surprised by how often they are used for things they are not really good for. Now, I don’t want to write about how there are better alternatives to spreadsheets. Instead, I’d like to illustrate how powerful non-spreadsheet tools are by describing a task I completed this week using R.

I was recently asked to help an NGO aggregate their contact database. It was no surprise they needed help aggregating it: it came from dozens of different sources and had thousands of partially duplicated entries. Entries were identifiable by name or by email address, which were the only unique columns.

To give you a better idea, this is what it looked like:

source   email                firstName   lastName   details1   details2
This     [email protected]    Bob         ANna       1          5
That     [email protected]    Steve       NA         2          4
Other    [email protected]    NA          Lacey      3          3
Other    [email protected]    Steve       Lacey      4          2
NA       [email protected]    Lacey       Steve      5          1

Except that the real table had several more columns of details, and many of the emails and other values were invalid or missing. In short, it was a big hassle that would have taken days of manual work to sort out in a spreadsheet, which is what the client had originally tried.
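To make the snippets below easy to follow, here is a toy version of that table as an R data frame. The email addresses are invented, and the column names match the ones used in the code further down rather than the display headers above:

# a toy stand-in for the real contact table: three of the five rows
# deliberately share one (made-up) email address
olddf <- data.frame(
  source     = c("This", "That", "Other", "Other", NA),
  email      = c("bob.anna@example.org", "steve.lacey@example.org",
                 "lacey@example.org", "steve.lacey@example.org",
                 "steve.lacey@example.org"),
  first_name = c("Bob", "Steve", NA, "Steve", "Lacey"),
  last_name  = c("ANna", NA, "Lacey", "Lacey", "Steve"),
  details1   = c(1, 2, 3, 4, 5),
  details2   = c(5, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)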

A more sophisticated consultant/banker type of spreadsheet user would've written an extensive for loop with dozens of conditions, plus incessant F10 pressing and debugging. It probably would've worked, but it would have been a great waste of time, with plenty of places for errors to creep in.

So how did I solve it in R? Almost all of the heavy lifting was done by a single function from the dplyr package. After some data preprocessing, the hardest problem was solved by the following lines of code:

library(stringr) # loading package to deal with strings
library(dplyr)   # loading data wrangling package

# first I write a function to give the most common value in a vector:
# a concrete example in this case, the first name that happens most commonly
# under a given email

mstcmn <- function(x) {
  x <- names(sort(table(as.character(x)), decreasing = TRUE)[1])
  if (is.null(x)) {
    return(NA)
  }
  return(x)
}
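For example, fed the three first names that end up grouped under Steve's email (Steve twice, plus the Lacey from the swapped entry), it returns the majority value:

mstcmn(c("Steve", "Steve", "Lacey")) # returns "Steve"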

# then I write a function to aggregate all other unique strings
aggr <- function(x) {
  x <- toString(na.omit(x))
  x <- str_split_fixed(x, ",", str_count(x, ",") + 1)
  x <- str_trim(x) # trim the stray whitespace left by toString()
  x <- unique(as.list(x))
  return(toString(unlist(x)))
}
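For the same email group, the sources and details collapse into comma-separated strings with NAs and duplicates dropped:

aggr(c("That", "Other", NA)) # returns "That, Other"
aggr(c(2, 4, 5))             # returns "2, 4, 5"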

# and most of the magic happens here: the C++ code under dplyr's hood rapidly
# and efficiently aggregates the information needed
newdf <- olddf %>%
  group_by(email) %>%
  summarise(source   = aggr(source),
            first    = mstcmn(first_name),
            last     = mstcmn(last_name),
            details1 = aggr(details1),
            details2 = aggr(details2))

This was followed by a few lines of code to capitalise all the relevant values and make things pretty. And voilà:

source        email                firstName   lastName   details1   details2
This          [email protected]    Bob         Anna       1          5
That, Other   [email protected]    Steve       Lacey      2, 4, 5    4, 2, 1
Other         [email protected]    NA          NA         3          3
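The capitalisation lines themselves aren't shown above; a minimal sketch of that step, assuming stringr's str_to_title() is a fair stand-in for what was actually used:

# tidy up the aggregated names; str_to_title() is an assumption, not the original code
newdf <- newdf %>%
  mutate(first = str_to_title(first),
         last  = str_to_title(last))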

I also recreated people's names from their email addresses when the names were missing, as most professional email addresses are in the format: [email protected].
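A rough sketch of that idea, assuming addresses of the form first.last@domain and only touching names that are still missing:

# split the part before the @ on the dot and use it to fill in missing names;
# first.last@domain is an assumed format, real addresses will be messier
local_part <- str_split_fixed(newdf$email, "@", 2)[, 1]
name_parts <- str_split_fixed(local_part, "\\.", 2)

newdf$first <- ifelse(is.na(newdf$first) & name_parts[, 1] != "",
                      str_to_title(name_parts[, 1]), newdf$first)
newdf$last  <- ifelse(is.na(newdf$last) & name_parts[, 2] != "",
                      str_to_title(name_parts[, 2]), newdf$last)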

Then I used the genderizeR package to add a column with people's titles (Mr or Ms) based on their first name.
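A rough sketch of that lookup, assuming genderizeR's findGivenNames() interface (the exact arguments and returned columns may differ between package versions):

library(genderizeR) # wraps the genderize.io API

# look up likely genders for the distinct first names, then map them to titles
gender_db <- findGivenNames(na.omit(unique(newdf$first)))
idx <- match(tolower(newdf$first), gender_db$name)
newdf$title <- ifelse(gender_db$gender[idx] == "male", "Mr", "Ms")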

Try doing this in a spreadsheet. I guess the main argument of this post is: don't use spreadsheets for more than what they are made for.