Wednesday, June 15, 2016

Essbase Meets R

Summary
Introduction
Workflow Options
Defining Time Series
Defining Sample Dimensionality
Workflow Option 1
Workflow Option 2
Results

Summary

Let’s say you are an experienced Essbase user. You can slice and dice your data and get invaluable insights from it, which you use for modeling, forecasting, or predicting metrics relevant to your business. The problem, though, is that oftentimes those insights are based on gut feelings or on outdated, oversimplified models, and do not incorporate even the most basic statistical techniques.
And as research in behavioral economics has shown over the past 40 years, gut feelings (or cognitive heuristics, as researchers call them) often result in bad decisions.
In this article I demonstrate a few options for integrating Essbase with R, the open source statistical software. R calculates forecasted values based on different statistical models, and either loads the calculated data back to Essbase or generates calc scripts that can be executed in Essbase. R also provides metrics that tell us how reliable/valuable our predictions are.


Introduction

First, what is R?


R is a programming language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. - Wiki


R has thousands of packages for different purposes, and here we are not even going to scratch the surface. We are going to use a few statistical models to predict the values of a variable based on its historical values and the values of other variables. We will make implicit assumptions about the distributions of those variables, which may or may not be correct. But this is not a paper on statistics; we are demonstrating a way to integrate two technologies - Essbase and R.
 
One of the most common ways to generate a baseline forecast in Essbase applications is a simple copy of the prior forecast (or actuals) into a new forecast version. Business analysts then make adjustments based on internal knowledge, gut feelings, experience, and other criteria.


In some cases we utilize more advanced models that take into account relationships between different variables, or that make predictions based on a time series. For example, we could populate the next forecast with a moving average of actual data from prior periods. Or we could use member formulae to calculate some accounts based on the values of other accounts, if we know the nature of the relationship between them.


But in many cases we don’t know the nature of the relationship, or it has changed since we last embedded the formulae in the outline. The parameter values are not derived from a statistical model, and we don’t know how reliable those models are - in other words, how statistically significant.


We maintain those models for years and have an almost religious belief in them. But again, in most cases we simply don’t bother to base our forecasts on statistical inference.
To be fair, in many cases our models are consciously kept simple. If we decide to use statistical models for predicting the next forecast cycle, for example, we need to be aware of the following issues:
  • Data in Essbase cubes is usually very sparse. The same metric can behave very differently in different areas of the cube. For example, a travel expense account will have a different distribution in sales cost centers than in IT cost centers. If we try to combine samples from sales and IT cost centers, our predictions will be inaccurate.
  • On the other hand, if we decide to calculate our models for each intersection of the cube, we will end up with very small samples, and again our predictions will be inaccurate. So we need to learn about our data first: whether samples from different intersections can be combined across dimensions, and up to which generation in the hierarchy. That knowledge comes from multiple experiments.
  • The behavior of some variables can be described with one model, such as ARIMA, while the behavior of other variables is better described by another model. Multiple models need to be compared in terms of accuracy.
  • Let's say we don’t know how our metrics behave across different dimensions, and whether we can use aggregated samples to predict individual intersections. What do we do then? Well, since we have historical data for the very intersection we are trying to predict, it may well be that historical actuals from that intersection are the best predictors of the future.


Considering all the issues above, the complexity of integrating different technologies, and the lack of expertise to run those experiments, it is no wonder that we have the current process of seeding forecasts.
 
And yet, here I show that by using the most basic models, and by spending very limited time choosing the right model, we can generate forecasts that are more accurate than those created by business analysts.
Does this mean we can replace a human analyst with a robot? Not yet. Analysts still have unique information about the future, and that information needs to be accounted for. At the same time, we can use the automatically generated forecast as a baseline, or a starting point.


Eventually this can shift analysts’ attention from seeding next-period forecasts to something much more important. Instead, they can study the behavior and distributions of the data across the organizational structure, generate multiple forecasts based on different models, validate assumptions about the data, and find the models that best fit actual results. They can find trends and relationships between metrics they were not aware of. They can acquire knowledge from constantly growing volumes of data.

Workflow Options



Depending on the phase, we can imagine a few scenarios for how an Essbase-R solution can be utilized:
  • Exploratory phase: an analyst studies the data, runs different models on specific metrics, and examines how the fit of the models changes when he/she zooms in and out of different dimensions.
  • The models are formalized and we need to rerun predictions for the next period.
  • We are still in the exploratory phase, but are running calculations on large sub-cubes.
  • The models are formalized, but need to be re-evaluated based on new/additional data sets.


Those scenarios require the ability to run jobs in an interactive/focused mode or in a batch mode - similar to how we create and run business rules on a limited data set with run-time prompts, or create calc scripts and run them on larger subcubes.


Regardless, the framework would have the following components:
  • Essbase forecast application. It will obviously be the target of the new forecast generated by R. But it may also be the starting point of the workflow, depending on the model. We’ll see examples later.
  • Relational repository with a star schema.
    • Why a relational repository? Because SQL is much more convenient and robust than writing your own program to manipulate data sources. And you may need different data sources for different models. For example, if we want to predict a specific account based on the values of other accounts, we need the dependent and independent variables (accounts) in columns. For models based on time series (discussed later) we need records ordered by years and periods.
    • Also, since our framework will generate thousands of queries, they need to be extremely efficient, so we’ll need to use the optimization techniques provided by the relational database vendor.
    • Why do we need metadata? As discussed before, the accounts we are trying to predict behave differently across our dimensions. We need the ability to zoom in and out within the hierarchy while calculating the model. In other words, we may need to combine samples from different nodes in the hierarchy when our metric behaves similarly, or separate the samples when it behaves differently. For example, we would have two separate samples for sales cost centers and IT cost centers when predicting the Travel expense account.
  • R scripts and R framework.
    • R scripts will run different statistical models and connect to the relevant data sources. They will either generate calc scripts for Essbase, which will calculate the forecast, or generate prediction data files that will be loaded into Essbase.
    • R scripts will also copy files between the R server and the Essbase server, and create MAXL scripts that automate the execution of calc scripts or load rules.




Defining Time Series

Let’s say we want to predict the “Meals And Entertainment” account for the next year, based on actual data from the prior 6 years, and we want to use an autoregressive OLS model. It is implemented in R as the ar.ols function. Whether that is a good model, and whether its assumptions about the data distribution are reasonable in the context of that particular account and our business, we’ll discuss later. For now let’s see what we need in order to calculate the model:
  • A data source that meets the requirements of the ar.ols model, meaning the data must be ordered as a time series (by year, month, week, etc.). Depending on the existing Essbase - ERP integration we could have 2 situations:
  1. We already have actuals data in some sort of data mart, and we load it into the Essbase cube. In that case we could use a fact table as a data source for our R model, but would need to perform some transformations first.
If we get monthly data from the ERP and load it into the fact table, it will probably have a single data column rather than separate columns for each period. In that case we cannot query the fact table directly as a time series, since it will have missing data points. Let’s say the table below is the result of the query:


Year  Period  Data
FY14  Jan       76
FY14  Feb        6
FY14  Mar       33
FY14  Apr       98
FY14  May       37
FY14  Jun       72
FY14  Aug       67
FY14  Sep       31
FY14  Oct       40
FY14  Nov       69
FY15  Jan       87
FY15  Feb       69
FY15  May       91
FY15  Oct       42
FY15  Nov       21
FY15  Dec       51


But we can’t use it as a time series, since some periods are missing:


Year  Period
FY14  Jul
FY14  Dec
FY15  Mar
FY15  Apr
FY15  Jun
FY15  Jul
FY15  Aug
FY15  Sep


If we wanted a time series usable for our model, we would need the following result:


Year  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
FY14   76    6   33   98   37   72        67   31   40   69
FY15   87   69             91                       42   21   51


Missing data should be interpreted as zeros in this case. But if data is missing for all periods of the intersection, we shouldn’t calculate the model at all. The transformation from a fact table to a time series is simple SQL, which we leave to the reader; an equivalent step in R is sketched below.
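For illustration, here is a minimal R sketch of that gap-filling step, assuming the query result is already in a data frame df with YEAR, MNUM and DATA columns (the object and column names are assumptions):

# df holds the query result: YEAR (e.g. "FY14"), MNUM (1-12), DATA
# Build a complete year/period grid and left-join the actuals onto it,
# so that periods without a record become explicit rows.
grid  <- expand.grid(YEAR = unique(df$YEAR), MNUM = 1:12)
ts_df <- merge(grid, df, by = c("YEAR", "MNUM"), all.x = TRUE)
ts_df <- ts_df[order(ts_df$YEAR, ts_df$MNUM), ]
# If the entire intersection is empty, skip the model altogether
if (all(is.na(ts_df$DATA))) {
  quit(save = "no")
}
# Otherwise interpret missing periods as zeros
ts_df$DATA[is.na(ts_df$DATA)] <- 0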


  2. Another situation is when we don’t have the data in a fact table. We could export data from Essbase using the TimePeriod dimension for columns. The following calc script is fairly typical for a data export into a relational repository.


SET FRMLRTDYNAMIC ON;
SET CACHE ALL;
SET CACHE HIGH;
SET CALCTASKDIMS 4;
SET CALCPARALLEL 3;

SET DATAEXPORTOPTIONS {
DataExportColFormat ON;
DataExportDynamicCalc OFF;
DataExportOverwriteFile ON;
};

FIX (Actual,Final,USD,
@relative(Entity,0),
@relative(AllCostCenters,0),
@relative(AllBUs,0),
@relative(AllPCCs,0),
@relative(IncomeStatement,0),
@relative(YearTotal,0),
HSP_inputvalue
)


DATAEXPORT "DSN" "HYPERIONPODBC" "T_EPMA_FACT_12MON" "epm_stage" "xxxxxx";

ENDFIX


As in the previous case, we need to perform a transformation to fill in missing data. The Essbase export inserts a record into the fact table if at least one month has non-missing data, which makes it easy to replace nulls with zeros (or to interpret them in R). But we will not know if an entire year is missing. Suppose our time series looks like this:


Year  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
FY14   76    6   33   98   37   72        67   31   40   69
FY15

If we did not perform the transformation, we would end up forecasting FY16 based on:


Year  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov
FY14   76    6   33   98   37   72        67   31   40   69


The model would generate non-zero predictions, but in reality this account was most likely closed, and we should not expect any data in FY16. So we need to perform a similar transformation to fill in the missing years.


Defining Sample Dimensionality

In the previous section we defined the data source. Now we need to query that data source based on the cube intersection we pass into the R script from Essbase. Before we go into the technical details of the interaction between Essbase and R, let’s think about how we can define the dimensionality of a sample.
When we pass a member (say, a member of the Cost Center dimension) from Essbase to R, we make assumptions about the distribution of our metric.
If we don’t pass any cost center, or pass just the top node of the hierarchy, we assume that the metric is distributed similarly across all cost centers (or those cost centers that contain data). This also implies that the metric should have a similar distribution across the TimePeriod/Year dimensions for each individual cost center.
This is rarely the case, though. Different cost centers or business units may have completely different purposes and ways of operating, and hence the distribution of the same account will differ across cost centers or other dimensions.
A reorganization will also create a situation where data starts rolling into a new parent from some point in time. If we ignore this fact, our prediction will be inaccurate at best.
Keep in mind that in this example we are trying to predict a metric for a particular level 0 intersection. When we base our prediction on samples from other cost centers that behave differently, we increase the variability of our prediction, and hence decrease its accuracy.
Why do we need to include samples from a member’s siblings at all, if we are predicting a specific member at level 0? The reason is that if we zoom in on a specific level 0 intersection, our sample will be too small. It will not cover the possible variability of the metric, and hence will make our prediction inaccurate again.
So we need to find a balance and calculate our model on the largest sample of level 0 members that have a similar distribution.
In the context of an Essbase cube, when we go down one level in the hierarchy and select one of a parent’s children instead of the parent, we reduce the sample of the model: we remove the other children, which may have a different distribution, from the sample.




The image above shows the hierarchy of cost centers. If you include the entire sample of All Cost Centers, the forecast will be inaccurate due to the heterogeneity of distributions. If you include only individual level 0 cost centers, the forecast may be inaccurate due to small sample sizes. The appropriate sample selection would be the children of All Cost Centers: IT, Sales, etc.


Workflow Option 1

In this workflow we pass members from Essbase into the R script. The R script generates a query against the relational repository that uses the member names to limit the sample. We calculate the autoregressive OLS model using this workflow.


Below is a calc script that generates the commands, each of which runs the R script with the relevant parameters.


VAR RCOUNTER=0;


FIX ("FY16", "Plan", "HSP_Inputvalue",  "AllPCCs","USD","AllBUs", "Ver2",@RELATIVE("AllEnti",0), @CHILDREN("VP_ROLL",0),@RELATIVE("602000",0))
"Feb"(
IF(  
"YearTotal"<>#MISSING  
)
RCOUNTER=RCOUNTER+1;
@JechoString(@LIST("filename","/home/oracle/windowsshare/Rcommands.txt",
@JgetDoubleQuote("C:\R\R-3.1.3\bin\x64\rscript.exe"),
"D:\EssbaseR\OLS.txt",
@NAME(@CURRMBR("Account")),
@NAME(@CURRMBR("CostCenter")),
@NAME(@CURRMBR("Entity")),
@JgetStringFromDouble(RCOUNTER,@_false,@_false),"052416OLS", "LS","0",
@NAME(@CURRMBR("BusinessUnit")),@NAME(@CURRMBR("PCC"))

));
ENDIF;
);
ENDFIX


As you can see, we write the commands into the /home/oracle/windowsshare/Rcommands.txt file, which is the second parameter of the @JechoString CDF. The commands look like this:


"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 650000 Sales E_123 515 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 650001 Sales E_123 516 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 650006 Sales E_123 517 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 651000 Sales E_123 518 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 651002 Sales E_123 519 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 810001 Sales E_123 520 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 811000 Sales E_123 521 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 812003 Sales E_123 522 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 700010 Sales E_123 523 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 722000 Sales E_123 524 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 890060 Sales E_123 525 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 841002 Sales E_123 526 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 600001 Sales E_321 527 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 600004 Sales E_321 528 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 600006 Sales E_321 529 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 600010 Sales E_321 530 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 600000 Sales E_321 531 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 600020 Sales E_321 532 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 600030 Sales E_321 533 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 600320 Sales E_321 534 053116_LS LS 0
"C:\R\R-3.1.3\bin\x64\rscript.exe" D:\EssbaseR\OLS.txt 600322 Sales E_321 535 053116_LS LS 0


Let’s review how those commands are constructed:
  • C:\R\R-3.1.3\bin\x64\rscript.exe is the executable that runs the R script
  • D:\EssbaseR\OLS.txt is the R script for the OLS model. It calculates parameters and generates formulae based on the provided parameters.
  • 650000 is the account for which we calculate the model
  • Sales is one of the top nodes of the cost center dimension
  • E_123 is a level 0 entity that limits the sample
  • 515 is a sequence number, used in the names of the generated calc scripts
  • 053116_LS is the folder into which logs and calc scripts are written
  • LS is the prefix of the generated calc scripts
  • 0 is an indicator of whether the model should contain a mean. A sketch of how the R script can consume these positional parameters is shown after this list.
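Inside OLS.txt these positional parameters can be picked up with commandArgs(). Below is a minimal sketch; curr_account, curr_cc and curr_entity match the gsub calls shown further below, while the remaining names are assumptions:

# Read the positional arguments passed on the generated command line
args         <- commandArgs(trailingOnly = TRUE)
curr_account <- args[1]          # e.g. 650000
curr_cc      <- args[2]          # e.g. Sales
curr_entity  <- args[3]          # e.g. E_123
seq_num      <- args[4]          # e.g. 515, used in generated calc script names
out_folder   <- args[5]          # e.g. 053116_LS
scr_prefix   <- args[6]          # e.g. LS
with_mean    <- args[7] == "1"   # 0/1 flag: include the mean in the model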


Each command calculates the model for the specified account, while the other members define the sample. The section of the R script below defines the SQL query that runs against the data source:


sql_str="SELECT
YEAR,
  MNUM,
 AVG(DATA) DATA
FROM T_EPMA_FACT
where 1=1
and (ENTITY) IN (
   Select (CHILD) FROM
             (SELECT PARENT, CHILD, CONNECT_BY_ISLEAF ISLEAF FROM V_ENTITY
             CONNECT BY PRIOR CHILD= PARENT
             START WITH CHILD = 'ENTITYTOKEN')
             WHERE ISLEAF = 1
   )
   and (COSTCENTER) IN (
   Select (CHILD) FROM
             (SELECT PARENT, CHILD, CONNECT_BY_ISLEAF ISLEAF FROM V_COSTCENTER
             CONNECT BY PRIOR CHILD= PARENT
             START WITH CHILD = 'CCTOKEN')
             WHERE ISLEAF = 1
   )
and ACCOUNT='ACCOUNTTOKEN'

and DATA is not null
GROUP BY YEAR,
  MNUM
ORDER BY YEAR,MNUM"

sql_str=gsub('ACCOUNTTOKEN',curr_account,sql_str)
sql_str=gsub('CCTOKEN',curr_cc,sql_str)
sql_str=gsub('ENTITYTOKEN',curr_entity,sql_str)


You can see that in order to filter Entity and CostCenter members we use dimension views and derive level 0 members with a hierarchical query. This is how we can go up and down the hierarchy in the Essbase calc script that generates the commands, and still use the same R script. For example, if we decided to “zoom in” on the cost center dimension, we could change @CHILDREN("VP_ROLL",0) to @CHILDREN("Sales",0), assuming Sales is a child of VP_ROLL.
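The tokenized query can then be executed with any of the standard R database interfaces. Here is a minimal sketch using RODBC; reusing the HYPERIONPODBC DSN from the export calc script is an assumption:

# Connect through ODBC and pull the sample into a data frame
library(RODBC)
conn   <- odbcConnect("HYPERIONPODBC")
sample <- sqlQuery(conn, sql_str, stringsAsFactors = FALSE)
odbcClose(conn)
# 'sample' now holds YEAR, MNUM and averaged DATA, ordered as a time series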


Once we have run the SQL query and retrieved the sample, we calculate the model. By that we mean calculating the coefficients of the equation that defines the relationship between the current-month value of the account of interest and its prior-period values. In the case of autoregressive OLS the equation is:


x[t] - m = a[1]*(x[t-1] - m) + … + a[p]*(x[t-p] - m) + e[t]


where a[1] … a[p] are the coefficients, x[t-1] … x[t-p] are the prior-period observations, and m is the mean. You can read more here about the implementation of this model in R.
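For reference, here is a minimal sketch of fitting this model on the queried sample (the object names are assumptions; ar.ols itself comes from the stats package, and the parameters match those listed in the Results section):

# Turn the queried sample into a monthly time series and fit an
# autoregressive model by ordinary least squares
y   <- ts(sample$DATA, frequency = 12)
fit <- ar.ols(y, order.max = 12, demean = TRUE, intercept = FALSE)
fit$ar       # a[1] ... a[p] - coefficients used to build the calc script
fit$x.mean   # m - the mean subtracted from the series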


If the model generates statistically significant coefficients from the available sample, the R script creates a calc script that reflects our equation:


/*Formulae generated by regression scan.*/
SET UPDATECALC OFF;
SET CLEARUPDATESTATUS OFF;
SET AGGMISSG ON;
FIX("FY16","OLS","Plan",HSP_InputValue,"USD",@RELATIVE("E_123",0), @RELATIVE("AllBUs",0),@RELATIVE("Sales",0), @RELATIVE("AllPCCs",0))
"NoAccount"(
IF(@ISMBR("NoAccount"))
IF(
NOT (@PRIOR( "650000"->"YearTotal",1, @RELATIVE("YEAR",0))==#MISSING )
)

IF(@ISMBR("Feb"))
"650000"=(4327.90426801802)+0.0147517780910044*("650000"->"Feb"-4327.90426801802)
+-0.00882540701437955*(@PRIOR("650000"->"Jan",1, @RELATIVE("YEAR",0))-4327.90426801802)
+-0.0203482721999207*(@PRIOR("650000"->"Dec",1, @RELATIVE("YEAR",0))-4327.90426801802)
+-0.0118165146109763*(@PRIOR("650000"->"Nov",1, @RELATIVE("YEAR",0))-4327.90426801802)
+-0.0198126740817153*(@PRIOR("650000"->"Oct",1, @RELATIVE("YEAR",0))-4327.90426801802)
+-0.024978120091583*(@PRIOR("650000"->"Sep",1, @RELATIVE("YEAR",0))-4327.90426801802)
+-0.0267901149013764*(@PRIOR("650000"->"Aug",1, @RELATIVE("YEAR",0))-4327.90426801802)
+-0.0452895103908632*(@PRIOR("650000"->"Jul",1, @RELATIVE("YEAR",0))-4327.90426801802)
+-0.0447117543028625*(@PRIOR("650000"->"Jun",1, @RELATIVE("YEAR",0))-4327.90426801802)
+-0.0459240716772066*(@PRIOR("650000"->"May",1, @RELATIVE("YEAR",0))-4327.90426801802)
+-0.0622774135775202*(@PRIOR("650000"->"Apr",1, @RELATIVE("YEAR",0))-4327.90426801802)
+0.738465706018489*(@PRIOR("650000"->"Mar",1, @RELATIVE("YEAR",0))-4327.90426801802) ;
...
ELSEIF(@ISMBR("Jan"))
"650000"=(4327.90426801802)+0.0147517780910044*("650000"->"Jan"-4327.90426801802)
+-0.00882540701437955*("650000"->"Dec"-4327.90426801802)
+-0.0203482721999207*("650000"->"Nov"-4327.90426801802)
+-0.0118165146109763*("650000"->"Oct"-4327.90426801802)
+-0.0198126740817153*("650000"->"Sep"-4327.90426801802)
+-0.024978120091583*("650000"->"Aug"-4327.90426801802)
+-0.0267901149013764*("650000"->"Jul"-4327.90426801802)
+-0.0452895103908632*("650000"->"Jun"-4327.90426801802)
+-0.0447117543028625*("650000"->"May"-4327.90426801802)
+-0.0459240716772066*("650000"->"Apr"-4327.90426801802)
+-0.0622774135775202*("650000"->"Mar"-4327.90426801802)
+0.738465706018489*("650000"->"Feb"-4327.90426801802) ;
ENDIF;
ELSE  
"650000"=#MISSING;
ENDIF;
ENDIF;);
ENDFIX


We will probably end up with hundreds of these calc scripts, so all that is left to do is copy them into the database folder and execute a MAXL script, which is also generated by the R script:
execute calculation "Myapp"."FinStmt"."R_prpLS";
execute calculation "Myapp"."FinStmt"."LS1";
execute calculation "Myapp"."FinStmt"."LS4";
execute calculation "Myapp"."FinStmt"."LS9";
execute calculation "Myapp"."FinStmt"."LS34";
...
execute calculation "Myapp"."FinStmt"."LS6742";
execute calculation "Myapp"."FinStmt"."LS6745";
execute calculation "Myapp"."FinStmt"."LS6746";


The first script, R_prpLS, populates the initial data set in the OLS version. It basically copies Actual->Final of the prior year into Plan->OLS of both the prior and the current year. If TimePeriod and Account are both dense dimensions, then by copying prior year actuals into the forecasted year we create the blocks. This also assumes that if an intersection had no data in any account for the entire previous year, it will not have data in the forecasted year either, and we do not need to calculate it.
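As a side note, the MAXL file above can be emitted from R with plain text functions. A minimal sketch, where the list of script names and the output path are assumptions:

# Write one 'execute calculation' statement per generated calc script
scripts <- c("R_prpLS", paste0("LS", c(1, 4, 9, 34)))   # illustrative subset
maxl    <- sprintf('execute calculation "Myapp"."FinStmt"."%s";', scripts)
writeLines(maxl, "/home/oracle/windowsshare/run_LS.msh")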

Workflow Option 2

The second workflow option is similar to the first, except that the R script does not generate Essbase calc scripts. Instead it generates predictions for a specified number of periods for each level 0 intersection within the scope. The R script appends the predictions to a data file, which is then loaded with a regular Essbase load rule.


This is how we calculate the ARIMA model. There are a few peculiarities to be aware of:
We need to generate predictions for level 0 combinations, hence it makes sense to calculate the ARIMA model on level 0 samples. This means the model has to be run hundreds of thousands of times - once for each level 0 intersection we are going to predict. Calculating all those predictions sequentially would take a rather long time, so we need to distribute the intersections to be predicted across multiple threads (see the sketch below).
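A minimal sketch of that per-intersection loop, assuming the forecast and parallel packages; get_series(), the intersections list and the output file name are assumptions, not part of the original framework:

library(forecast)
library(parallel)

predict_one <- function(intersection) {
  y   <- get_series(intersection)   # level 0 monthly time series for one intersection
  fit <- auto.arima(y)              # pick an ARIMA specification automatically
  fc  <- forecast(fit, h = 12)      # predict 12 periods ahead
  data.frame(intersection, period = 1:12, value = as.numeric(fc$mean))
}

# Distribute intersections across worker processes (mc.cores > 1 needs a
# Unix-like R server) and append the predictions to a data file
results <- mclapply(intersections, predict_one, mc.cores = 8)
write.table(do.call(rbind, results), "arima_predictions.txt",
            row.names = FALSE, col.names = FALSE, quote = FALSE, append = TRUE)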


Results

What we really want to do is compare the predictions generated by the different models in R with the forecast created by human analysts. To do so, we calculate the absolute values of the errors (deltas) in each intersection of the forecast.
The value itself is meaningless; it is just a measure of the fit to actuals.


In our experiment we ran the following models:
  1. R12 with mean: linear regression on the 12 prior months as variables, with forward selection of statistically significant prior periods. Observations are clustered by entity and by children of EVP_Roll. The mean/intercept is present in the model. Actuals from the prior year are used as a filler.
  2. R12 without mean: same as above, but without the intercept.
  3. OLS: ar.ols with order.max=12, demean=TRUE, intercept=FALSE. Autoregressive OLS with at most 12 prior periods. Observations are clustered by entity and by children of EVP_Roll. Averages are used to smooth across dimensions. Actuals from the prior year are used as a filler.
  4. ARIMA forecast of each intersection.


In addition, we calculate the errors of the following forecasts:
  1. Actuals copy: actuals from the prior year are copied into the plan of the next year.
  2. Plan: the manual legacy forecast.


Once we have results for each of the models above, we calculate the following combined models:
  1. Multi-model: the best of models 1-5 is applied to each intersection.
  2. Multi-model integrated with the manual forecast: the best of models 1-6 is applied to each intersection. Manual adjustments are taken into account in those intersections where they produce the best fit.


You can see that two of the automatic forecasts generate better results than the manual Plan forecast: the absolute sum of deltas of the multi-model is 30% smaller than that of Plan, and the “R12 no mean” model is 14% better.


Model                     ABS Sum of Deltas   Multi-model Count   With Manual Count   Improvement
Multimodel With Manual        4,569,559,357                                  34,169        42.74%
Multimodel                    5,551,939,656                                               30.43%
r12 no mean                   6,849,173,802              10,706               7,025        14.17%
Plan                          7,979,830,780
Plan as Actual copy           8,310,236,097           6,110,293           6,087,155
r12 with mean                 8,310,236,097                   -                   -
ARIMA                         8,772,708,070              92,988              88,966
OLS                          10,655,750,349              11,154               7,826


Knowing which intersections can be predicted by which model is key. You can see from the “Multi-model Count” column that ARIMA gives the best fit in 92,988 intersections in the multi-model version, whereas a simple copy of the prior year actuals gives the best fit in 6,110,293 intersections.
Selecting the best model for each intersection improves accuracy by 30.43%. Yet if we also incorporate the manual forecast and predict 34,169 intersections using the legacy methods, we can improve accuracy even further, to almost 43%. A sketch of this per-intersection selection is shown below.
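Conceptually, the per-intersection selection boils down to a few lines of R. A minimal sketch, assuming a data frame errors with one row per intersection and one column of absolute deltas per model (all names are illustrative):

# Pick, for each intersection, the model with the smallest absolute delta
models <- c("r12_mean", "r12_no_mean", "ols", "arima", "actual_copy", "plan")
best   <- models[apply(errors[, models], 1, which.min)]
table(best)   # counts per model, as in the "Multi-model Count" column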

Hence, these results do not suggest replacing business analysts with automatic forecasts. Rather, they show that implementing a variety of statistical models can improve the accuracy of the forecast and pinpoint the areas business analysts should focus on. They provide insight into which intersections benefit from manual forecasting (or from the existing methodology), and which can be left to automation.

3 comments:

  1. interesting topic. how does this compare with capabilities afforded by Oracle's acquisition of Decisioneering's Crystal ball? does it work similar to the 'R' integrations described above?

    1. I'm not an expert on Crystal Ball, so please take my answer with a grain of salt. CB is more of a spreadsheet tool with some basic statistical models that allow users to enter assumptions about the model interactively. Tools like R or Python statistical libraries support a much broader spectrum of models, including neural nets, deep learning, etc. They scale to huge datasets in distributed environments. With those tools you can build whatever application you like, so the purpose and audiences are quite different. In the posts I describe use cases where forecasting is automated and the blend of models is constantly updated depending on new data. But again, those are just examples of what you could do.
