You are wasting your time with spreadsheets
This week I had an interesting conversation (as interesting as these conversations can get) about the use of spreadsheets in business. I am still a little bit surprised by how often they are used for things they are not really good for. Now, I don’t want to write about how there are better alternatives to spreadsheets. Instead, I’d like to illustrate how powerful non-spreadsheet tools are by describing a task I completed this week using R.
I was recently asked to help an NGO aggregate their contact database. It was no surprise they needed help: the database came from dozens of different sources and had thousands of partially duplicated entries. Entries were identifiable by name or by email address, the only columns that could serve as unique identifiers.
To give you a better idea, this is what it looked like:
source | email | firstName | lastName | details1 | details2
---|---|---|---|---|---
This | [email protected] | Bob | ANna | 1 | 5
That | [email protected] | Steve | NA | 2 | 4
Other | [email protected] | NA | Lacey | 3 | 3
Other | [email protected] | Steve | Lacey | 4 | 2
NA | [email protected] | Lacey | Steve | 5 | 1
Except that I had several more columns of details, and many of the emails and other values were invalid or missing. In short, it was a big hassle that would have taken days of manual work to sort out in a spreadsheet, which is what the client originally tried.
A more sophisticated consultant or banker type of spreadsheet user would have written an extensive for loop with dozens of conditions, with incessant F10 pressing and debugging along the way. That probably would have worked, but it would have been a great waste of time, with plenty of places for errors to creep in.
So how did I solve it in R? Almost all of the heavy lifting was done by a single function in the dplyr package. After some data preprocessing, the hardest part of the problem was solved by the following lines of code:
library(stringr) # loading package to deal with strings
library(dplyr)   # loading data wrangling package

# first I write a function to give the most common value in a vector:
# in this case, for example, the first name that occurs most often under a given email
mstcmn <- function(x) {
  x <- names(sort(table(as.character(x)), decreasing = TRUE)[1])
  if (is.null(x)) {
    return(NA)
  }
  return(x)
}
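# (a quick illustration, not from the original post:
#  mstcmn(c("Steve", NA, "Steve", "Lacey")) returns "Steve", since table() ignores the NA
#  and "Steve" is the most frequent remaining value)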
# then I write a function to aggregate all other unique strings
aggr <- function(x) {
  x <- toString(na.omit(x))                              # drop NAs and collapse into one comma-separated string
  x <- str_split_fixed(x, ",", (str_count(x, ",") + 1))  # split back into the individual values
  x <- str_trim(x)                                       # strip stray whitespace (base R has no trim(); stringr's str_trim() does this)
  x <- unique(as.list(x))                                # keep each value only once
  return(toString(unlist(x)))
}
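# (likewise not from the original post: aggr(c("That", NA, "Other", "Other")) returns
#  "That, Other", i.e. NAs dropped, duplicates removed, and the rest joined into one string)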
# and most of the magic happens here: the C++ code under dplyr's hood rapidly and efficiently aggregates the information needed
newdf<-olddf %>% group_by(email) %>% summarise(source = aggr(source),
first = mstcmn(first_name),
last = mstcmn(last_name),
details1 = aggr(details1),
details2 = aggr(details2))
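To make this concrete, here is a minimal toy example (not the NGO's data; the email address and values are made-up placeholders) showing how several rows for the same person collapse into one:

# a tiny made-up example (placeholder email, not real data)
toy <- data.frame(source     = c("That", "Other", NA),
                  email      = rep("steve.lacey@example.org", 3),
                  first_name = c("Steve", "Steve", "Lacey"),
                  last_name  = c(NA, "Lacey", "Lacey"),
                  details1   = c(2, 4, 5),
                  details2   = c(4, 2, 1),
                  stringsAsFactors = FALSE)

toy %>% group_by(email) %>% summarise(source = aggr(source),
                                      first = mstcmn(first_name),
                                      last = mstcmn(last_name),
                                      details1 = aggr(details1),
                                      details2 = aggr(details2))
# a single row remains: source "That, Other", first "Steve", last "Lacey",
# details1 "2, 4, 5", details2 "4, 2, 1"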
This was followed by a few lines of code to capitalise all relevant values and make things pretty. And voilà:
source | email | firstName | lastName | details1 | details2
---|---|---|---|---|---
This | [email protected] | Bob | Anna | 1 | 5
That, Other | [email protected] | Steve | Lacey | 2, 4, 5 | 4, 2, 1
Other | [email protected] | NA | NA | 3 | 3
I also recreated people’s names from their email addresses when the names were missing, since most professional emails are in the format [email protected].
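That step isn't shown in the snippet above, but a minimal sketch of the idea, assuming a firstname.lastname@organisation address pattern (the function name and the pattern are my own illustration, not the original code), could look like this:

# rough sketch: recover a first and last name from a firstname.lastname@... address
# (illustrative only; assumes the address actually follows that pattern)
name_from_email <- function(email) {
  local_part <- str_split_fixed(email, "@", 2)[, 1]  # part before the @
  parts <- str_split_fixed(local_part, "\\.", 2)     # split on the first dot
  data.frame(first = str_to_title(parts[, 1]),
             last  = str_to_title(parts[, 2]),
             stringsAsFactors = FALSE)
}

name_from_email("steve.lacey@example.org")
#   first  last
# 1 Steve Lacey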
Then I used the genderizeR package to add a column with people's titles, Mr or Ms, based on their first name.
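The original code for this step isn't shown either; a rough sketch with genderizeR might look like the following. findGivenNames() queries the genderize.io API, and I'm assuming its result contains name and gender columns; the mapping from gender to title is my own illustration:

library(genderizeR)

# look up the likely gender of each first name via the genderize.io API
# (assumption: the returned table has lower-case 'name' and 'gender' columns)
genders <- findGivenNames(na.omit(newdf$first))

# map the predicted gender onto a title (my own illustrative mapping, not the original code)
predicted <- genders$gender[match(tolower(newdf$first), genders$name)]
newdf$title <- ifelse(predicted == "male", "Mr", "Ms")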
Try doing that in a spreadsheet. I guess the main argument of this post is: don't use spreadsheets for more than they're meant for.