Time Series analysis for Facebook message volume

Introduction to the data set

The data is the amount of Facebook messages a user (me) sent and received between 2011 and 2016. The data was exported from Facebook, converted from HTML to JSON and subsequently read into R. For more details on how this was done, please consult here.

The original data source contains 413,101 timestamped messages between 2008 and 2016. The volume of messages between 2008 and 2011 is very low and highly atypical of the rest of the series, so that period was excluded from the sample. An example is shown below:

To create a suitable format for time series analysis, the padr package was used to fill in missing observations with zeros and to aggregate all timestamps into monthly measurements. Message volume is defined as the quantity of messages sent and received by the user.

rawTs <- df %>%
  mutate(date = lubridate::with_tz(date, "CET")) %>% # change the time zone
  thicken(interval = "month") %>%  # creates a column with date_month
  group_by(date_month) %>%         # groups the observations by month
  summarise(amount = n()) %>%      # counts the messages sent or received
  pad() %>%                        # inserts rows for intervals with no observations
  fill_by_value(amount) %>%        # fills the missing values with 0
  as_tbl_time(date_month) %>%      # a new object class to ease time filtering
  time_filter(2011 ~ 2016)         # filter to 2011-2016

Now the data is in a suitable format to create time series objects. A training set was created covering January 2011 to February 2015, with the final year (March 2015 to March 2016) held out as a test set. Both were adjusted to account for calendar effects.

totalTs <- ts(data = rawTs[, 2],
              start = c(2011, 1), # January 2011
              end = c(2016, 3),   # March 2016
              frequency = 12)     # months in a year

totalTs.adj <- totalTs / monthdays(totalTs) # adjust for the varying number of days per month
train.ts <- window(totalTs.adj, end = c(2015, 2))
test.ts <- window(totalTs.adj, start = c(2015, 3), end = c(2016, 3))

Exploration of data

It is evident before any exploration that we are limited by the relatively short duration of the time series at the monthly level and by the fact that the message volume starts at 0. The former suggests that the ability to forecast seasonal patterns may be limited, and the latter contradicts the stationarity assumption. It would be preferable to have several decades of measurements; however, neither the author nor Facebook were around several decades ago, so we will have to proceed with these limitations in mind.

To gain an understanding of the data set, exploratory plots of the training set were made. The most informative of these are displayed below. First we begin with a time series plot.
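A sketch of how such a plot can be produced with the forecast package (the axis labels are my own additions):

```r
library(forecast)
library(ggplot2)

# Time series plot of the training set: monthly message volume
autoplot(train.ts) +
  xlab("Year") +
  ylab("Messages per month")
```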

The time series plot shows an upward trend with a peak in 2012, followed by a slight downward trend and a recovery towards the mean at the end of the series. The variance does not seem to change significantly over time, although one could argue that there is less variance at the beginning and end of the series than in the middle.

In the plot above there are no obvious seasonal effects. There are, for instance, dips around the last weeks of 2012 and 2013, but the exact opposite is observed in 2011 and 2014.

As an aside, the absence of seasonality is interesting because it depends entirely on the unit of measurement. If we aggregate the same data hourly instead of monthly, clear patterns emerge that reflect the circadian rhythm.

The inverted V-shaped trend is more visible in the month plot. This confirms the earlier observation that the volume of messages increases and subsequently decreases.
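The month plot referred to above can be generated with forecast's ggmonthplot() (a sketch):

```r
library(forecast)

# Subseries plot: message volume for each calendar month across years
ggmonthplot(train.ts)
```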

Exploring possible transformations

Based on the exploration of the data, whereby the series was roughly horizontal, with relatively constant variance and no predictable patterns in the long term, I decided to check whether the weak stationarity property holds. I did this using the sample ACF plot, the Augmented Dickey-Fuller Test as well as the KPSS Test for Level Stationarity.
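A sketch of these checks, assuming the tseries package for the two unit-root tests:

```r
library(forecast)
library(tseries)

ggAcf(train.ts)      # sample ACF: slow decay would suggest non-stationarity
adf.test(train.ts)   # Augmented Dickey-Fuller: H0 = unit root (non-stationary)
kpss.test(train.ts)  # KPSS: H0 = level stationarity
```

Note that the two tests have opposite null hypotheses, so they complement each other: a stationary series should reject the ADF null but fail to reject the KPSS null.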

Modelling

Choosing a benchmark model

Two possible benchmark models are the naive forecast and the mean forecast. Unsurprisingly, both models have unacceptable residuals, with significant autocorrelations in the ACF chart and the Ljung-Box test rejecting the null hypothesis of no residual autocorrelation.

naiveModel <- naive(train.ts, h = 12)
meanModel <- meanf(train.ts, h = 12)

The graphical analyses of the residuals were omitted from the report; however, both models show substantial autocorrelation (confirmed by the Ljung-Box test), and the distributions of the residuals do not appear normal on visual inspection.
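The omitted checks can be reproduced with forecast's checkresiduals(), which combines the residual time plot, the ACF and the Ljung-Box test in one call:

```r
library(forecast)

checkresiduals(naiveModel)  # residual time plot, ACF and Ljung-Box test
checkresiduals(meanModel)
```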

Arima family

mArima1 <- auto.arima(train.ts)
fa1 <- forecast(mArima1, h = 12)
checkresiduals(mArima1)


Ljung-Box test

data:  Residuals from ARIMA(1,0,1) with zero mean
Q* = 19.786, df = 22, p-value = 0.5965

Model df: 2.   Total lags used: 24

Using the auto.arima() function, I fitted an ARIMA model. This yields an ARIMA(1,0,1) model: a first-order autoregressive component (\(\phi = 0.98\)) with no differencing, combined with a first-order moving average (\(\theta = -0.57\)). This model seems reasonable given the sample ACF and PACF plots.

I considered adding a constant to the model; however, that would lead to an overestimation of the long-term trend in message volume. The zero-mean specification makes sense in that \(\widehat{y}_{t+h|t}\) converges slowly towards the mean as \(h\) grows, so forecasts over the next year remain close to the level implied by the last few observations.
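To make this concrete: for a zero-mean ARMA(1,1), forecasts beyond one step follow the AR recursion

```latex
\[
\widehat{y}_{t+h|t} = \phi\,\widehat{y}_{t+h-1|t} = \phi^{\,h-1}\,\widehat{y}_{t+1|t},
\qquad h \ge 2,
\]
```

so with \(\phi = 0.98\) we have \(\phi^{11} \approx 0.80\): even twelve months ahead, the forecast retains roughly 80% of the one-step forecast's deviation from the mean.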

The residual autocorrelation seems acceptable, although the residuals visibly diverge from normality in QQ-plots. Nonetheless, given the amount of data we have, I will accept this model.

ETS Family

etsModel1 <- ets(train.ts)
checkresiduals(etsModel1)

Ljung-Box test

data:  Residuals from ETS(A,N,N)
Q* = 19.59, df = 22, p-value = 0.6087

Model df: 2.   Total lags used: 24
fe1 <- forecast(etsModel1,h = 12)

The model selected is \(ETS(A,N,N)\), i.e. simple exponential smoothing with additive errors. The smoothing parameter is \(\alpha = 0.42\). The N,N designation means there are no trend or seasonal components, and hence no damping parameter. Again, the residuals do not seem to be perfectly normally distributed, but this is not entirely surprising given the shortness of the time series. I will be content with the fact that they are not significantly non-normally distributed.
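In its reduced form, ETS(A,N,N) produces one-step forecasts as a weighted average of the latest observation and the previous forecast:

```latex
\[
\widehat{y}_{t+1|t} = \alpha\,y_t + (1 - \alpha)\,\widehat{y}_{t|t-1},
\]
```

so with \(\alpha = 0.42\) each forecast puts 42% of its weight on the most recent observation, with the remaining weight decaying geometrically over older ones.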

In order to select a model from these two I use time series cross validation following Hyndman’s example.

fets <- function(x, h) {
  forecast(ets(x), h = h)
}

farima <- function(x, h) {
  forecast(auto.arima(x), h = h)
}

e1 <- tsCV(train.ts, fets, h = 1)   # compute CV errors for ETS
e2 <- tsCV(train.ts, farima, h = 1) # compute CV errors for ARIMA

# Find the MSE of each model class
mean(e1^2, na.rm = TRUE)
## [1] 10710.59
mean(e2^2, na.rm = TRUE)
## [1] 14473.21

Given that the cross-validated MSE for the ETS model is lower, I selected it for forecasting.

Model Evaluation

In the plot above, the forecasts are compared to the test set. The ETS estimates, along with prediction intervals, are shown in green; the mean forecast, without intervals, is shown in blue.

Model evaluation for tested models

Set           Method            ME    RMSE    MAE     MPE    MAPE  MASE   ACF1  Theil's U
Training set  Naive           1.71  107.22  73.41  -10.92   38.10  0.49  -0.45         NA
Test set      Naive          87.17  101.36  91.99   37.90   43.63  0.61  -0.27       0.97
Training set  Mean            0.00  119.91  96.93  -86.07  113.14  0.64   0.58         NA
Test set      Mean          -16.27   54.22  41.29  -18.93   28.10  0.27  -0.27       0.46
Training set  ARIMA(1,0,1)   16.91   90.46  66.85   -2.89   35.41  0.44  -0.07         NA
Test set      ARIMA(1,0,1)   57.43   76.89  67.39   21.77   33.61  0.45  -0.31       0.79
Training set  ETS(A,N,N)      6.32   90.30  67.82  -10.81   37.65  0.45  -0.05         NA
Test set      ETS(A,N,N)     30.60   60.10  49.22    6.82   26.81  0.33  -0.27       0.63
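A table like the one above can be assembled with forecast::accuracy() applied to each model against the test set (a sketch):

```r
library(forecast)

# Collect accuracy measures for each model on the training and test sets
acc <- rbind(
  accuracy(naiveModel, test.ts),
  accuracy(meanModel,  test.ts),
  accuracy(fa1,        test.ts),  # ARIMA(1,0,1) forecasts
  accuracy(fe1,        test.ts)   # ETS(A,N,N) forecasts
)
round(acc, 2)
```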

As expected, the ETS(A,N,N) model outperforms the ARIMA(1,0,1) model on the test set. However, none of the models perform particularly well; in fact, this appears to be one of those cases where the mean baseline model performs best in terms of RMSE.

Packages Used

Arnold, Jeffrey B. 2017. Ggthemes: Extra Themes, Scales and Geoms for ’Ggplot2’. https://CRAN.R-project.org/package=ggthemes.

Aust, Frederik. 2016. Citr: ’RStudio’ Add-in to Insert Markdown Citations. https://CRAN.R-project.org/package=citr.

Francois, Romain. 2017. Bibtex: Bibtex Parser. https://CRAN.R-project.org/package=bibtex.

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. http://www.jstatsoft.org/v40/i03/.

Hyndman, Rob J. 2017. forecast: Forecasting Functions for Time Series and Linear Models. http://pkg.robjhyndman.com/forecast.

Hyndman, Rob J, and Yeasmin Khandakar. 2008. “Automatic Time Series Forecasting: The Forecast Package for R.” Journal of Statistical Software 26 (3): 1–22. http://www.jstatsoft.org/article/view/v027i03.

R Core Team. 2017a. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
