Friday, January 1, 2016

OBIEE 12c Advanced Analytics Functions part 1. Introduction & Trendline

In OBIEE 12c there is a set of new Advanced Analytics functions. Antony Heljula wrote a great post (as usual) at PEAKindicators blog called Testing the new R functions on the OBIEE 12c SampleApp where he shows the basic 4 (Trendline, Clusters, Outliers and Regression). 
Oracle describes them in Users Guide Appendix C (starting C-58) covering them all. I will post here my experience with them, as part of my learning process. Usually I start with the example from the online help and check additional (documented and not documented) options, sometimes add my observation.

 I plan to do a 7 part daily series:
  1. OBIEE 12c Advanced Analytics Functions part 1. Introduction & Trendline (this one)
  2. OBIEE 12c Advanced Analytic part 2: BIN and WIDTH_BUCKET
  3. OBIEE 12c Advanced Analytic part 3: Forecast
  4. OBIEE 12c Advanced Analytic part 4: Cluster
  5. OBIEE 12c Advanced Analytic part 5: Outlier
  6. OBIEE 12c Advanced Analytic part 6: Regression
  7. OBIEE 12c Advanced Analytic part 7: EVALUATE_SCRIPT

I assume you already configured OBIEE to work with R (described here). If you use ORE, the R script identifies it and uses ORE specific scripts. I believe it should support larger amount of data this way.

There are 2 groups of functions, those done with R and internal ones:

Advance Analytics Internal Logical SQL Functions

  • Bin
  • Width_Bucket
  • Trendline

Advance Analytics External Logical SQL Functions

  • Forecast
  • Cluster
  • Outlier
  • Regr
  • Evaluate Script
This is consistent with what I see in the MWHOME\bi\bifoundation\advanced_analytics\RInstaller\OBIEERPackage folder, in the scripts of OBIEEAdvancedAnalytics_12.2.1.0 (no trendline, bin ...):


One last remark in the introduction. Most of the R packages statistical by nature. As a result they have a "use Random Seed" option. When doing so the result might not be always consistent (for example, slightly different clusters). Even more challenging, sometimes you can select various statistical algorithms, that return different result. While unthinkable in most traditional financial reports, this behavior is expected in data exploration cases. If this is new to you, well, welcome to data science.
(http://www.smbc-comics.com/?id=3485)


Trendline

I will start with Trendline in this post. While Trendlines are popular and rather simple, they are not always considered as the best statistical model (see VividCortex blog here).
Interesting enough, the Trendline is described as "Internal", what I understand as "not done by R". 

I start with the basic example and complicate a bit.

I'm using the default Sample Sales Subject Area, with date and revenue columns:



The Trendiline function in this case is: TRENDLINE("Base Facts"."Revenue", ("Time"."Calendar Date"), 'LINEAR','VALUE')
   
The meaning is quite simple: Trendline of Revenue over dates (I'll explain  'LINEAR','VALUE'  in a moment).

The result is:
If we limit the data to one year, we will have a nicer slope.


 Now let's see the complete Syntax from Oracle docs:

Syntax

TRENDLINE( <numeric_expr>, ( [<series>] ) BY ( [<partitionBy>] ),
<model_type>, <result_type>, [number_of_degrees] )




Where:
numeric_expr indicates the data to trend. This is usually a measure column. Note that this is the Y-axis. In our case, it was "Base Facts"."Revenue".
series indicates the X-axis. This is a list of <valueExp> <orderByDirection>, where <valueExp> is a dimension column and <orderByDirection> is ASC or DES. The default is ASC. Note that this cannot be an arbitrary combination of numeric columns. (I Didn't find the correct syntax to use the ASC/DES values.- Boris)
partitionBy indicates the control break for the trendline.
model_type indicates the type of model to use. Currently you can only specify LINEAR.
result_type indicates the type of output. You can specify VALUE or MODEL. VALUE returns the regression Y values given X in the fit. MODEL returns the parameters in a JSON format string.
number_of_degrees is used in polynomial models, only. This parameter is optional.




I added "Product Type" to the analysis, but not the trendline definition:


While it seems to work fine, it should be clear that the trendline definition is the same for all Product Types. 
If I want a specific Trendline for each Product Type, I should alter the Trendline function as well (by adding the  BY ("Products"."Product Type") to the formula) to:
TRENDLINE("Base Facts"."Revenue", ("Time"."Calendar Date") BY ("Products"."Product Type"), 'LINEAR','VALUE')


We can see the difference:




Changing the definition from Value to Model (TRENDLINE("Base Facts"."Revenue", ("Time"."Calendar Date"), 'LINEAR','MODEL')) returns:





We can the difference. It is the same Model for all Product Types ({"analyticmodel": {"identifier": "001", "name": "trendline", "type": "linear", "formula": "0.611644210520042*x+2230.67258977209"}}) for "Trendline without by" column, as expected.


As you can see, I use few Trendline columns in the same analysis.

But they must have the same X-axis. Or you will get:

[nQSError: 42072] There are multiple TRENDLINE functions in the query with different X-axis. 


This will happen if I try to have 2 Trendline functions such as:

TRENDLINE("Base Facts"."Revenue", ("Time"."Calendar Date"), 'LINEAR','Value')
 and
TRENDLINE("Base Facts"."Revenue", ("Time"."Per Name Year","Time"."Per Name Month") ,  'LINEAR','Value')


While in most of the cases the series in the function is date oriented, it can be just a Numeric Column (for example TRENDLINE("Base Facts"."Revenue", ("Products"."Product Number") ,  'LINEAR','Value'))


But it can't be text column. then we get an error:
[nQSError: 42058] The non-time dimension column in TRENDLINE X-axis is not numeric.

The Year and Month series were accepted in the following example (despite being text columns) because they are part of the date hierarchy:
TRENDLINE("Base Facts"."Revenue", ("Time"."Per Name Year","Time"."Per Name Month") ,  'LINEAR','Value')



A bunch of little, clean, trendline graphs (or trellis) might always seem interesting in an exacutive dashboard:




P.S.
Despite, what is said in documentation the ModelType value 'EXPONENTIAL' (instead of 'LINEAR') seems to return values, but only if you have something in the PartirionBy.


{"analyticmodel": {"identifier": "001", "name": "trendline", "type": "exponential", "formula": "1525.179067806*1.00056329831379^x"}}

 TRENDLINE("Base Facts"."Revenue", ("Time"."Calendar Date")  By ("Products"."Product Type"),  'exponential','Value')


1 comment: