Tuesday, January 12, 2016

OBIEE 12c Advanced Analytic part 5: Outlier

This is the fifth part of OBIEE 12c Advanced Analytic:

  1. OBIEE 12c Advanced Analytics Functions part 1. Introduction & Trendline
  2. OBIEE 12c Advanced Analytic part 2: BIN and WIDTH_BUCKET
  3. OBIEE 12c Advanced Analytic part 3: Forecast
  4. OBIEE 12c Advanced Analytic part 4: Cluster
  5. OBIEE 12c Advanced Analytic part 5: Outlier (this one)
  6. OBIEE 12c Advanced Analytic part 6: Regression
  7. OBIEE 12c Advanced Analytic part 7: EVALUATE_SCRIPT
 (for 3-7 you need Enabling R and the relevant Analytics functions on OBIEE 12c )




Outlier detection is sometimes described as Anomaly detection. 
The Outlier function is part of the Analytics functions family. In documentation it's described: The OUTLIER function classifies a record as an outlier based on one or more input expressions using K-Means or Hierarchical Clustering or Multi-Variate Outlier detection Algorithms.

Outlier, like Clustering is a type of unsupervised learning. In clustering, objects of the data set are grouped into clusters, in such a way that groups are very different from each other and the objects in the same group or cluster are very similar to each other. In many cases, the Outliers are those values that don't fit well any of the existing cluster. It is not surprising that 2 out of available Outlier algorithms are shared with Cluster.
(When we are at academic mood we can read more about Multi-Variate Outlier detection here, for example. The other 2 algorithms were referenced in the Clustering post.)


I'll start with the online help example:
In the Analysis we have "Products"."Product",  "Offices"."Company", "Base Facts"."Revenue" and "Base Facts"."Billed Quantity". Lets add an outlier function.
Simplified version:
OUTLIER(("Product","Company"), ("Quantity","Revenue"), 'isOutlier', 'algorithm=mvoutlier')
Full function code:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier')


I added "best Visualization" option, did some minor changes and got the graph Outlier based on MVOUTLIER algorithm:
Since MVOUTLIER is default, I get the same results with empty algorithms value (yet the empty parameter must exist):

OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','') 


Changing the last parameter to h-clustering (3 outlier points):
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','algorithm=h-clustering')

And k-means (2 outlier points):
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','algorithm=k-means')
 Return different results, as one could expect.


Attempt to add Cluster function parameters such as numClusters returned syntax error. So it's time to see the syntax:

Syntax

OUTLIER((dimension_expr1 , ... dimension_exprN), (expr1, .. exprN),
output_column_name, options, [runtime_binded_options]))])
Where:
dimension_expr indicates a list of dimensions.
expr represents a list of dimension attributes or measures to find outlier.
output_column_name indicates the output column name. Valid values are 'isOutlier' and 'distance'.
options indicates a string list of name/value pairs separated by a semi-colon (;). The value can include %1 ... %N, which can be specified using runtime_binded_options.
runtime_binded_options is an option comma separated list (,) of run-time binded columns and or literal expressions.



Option Name
Description
Values
algorithm
The algorithm to use.
mvoutlier (default)
h-clustering
k-means
useRandomSeed
This is TRUE by default. This is the value to use in PRODUCTION environment.
If set to FALSE, then we use set.seed(initialSeed) to ensure reproducibility. Used in QA/Debug environment.
TRUE (default)
FALSE
initialSeed
This value is used only when useRandomSeed is set to FALSE.
Integer (250 default)
topN
If isTopNAsPercentage is TRUE then return the topN % as outliers. Or else return the topN values as outliers.
Not relevant to mvoutlier? 
Double (default 3)

isTopNAsPercentage
This value is used with topN.
Not relevant to mvoutlier? 
TRUE
FALSE (default)
 

Here is another example of the MVOUTLIER algorithm, this time with distance output as well.
IsOutlier:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','')

distance:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'distance','')

As expected, those with the greatest distance are the Outliers:



Combining the Outlier results with Clustering might seem interesting:

K-means Clusters with outliers:
but it's actually meaningless, from what I see (in the R code as well) in both k-means and h-clustering algorithms, we create a single cluster and look for the distance there.For example the same results for distance calculations of outlier distance and cluster (non-normalized) distance for one cluster:

We can see the last 2 columns are identical. I used cluster distance function for 1 cluster:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'distanceFromCenter', 'algorithm=k-means;numClusters=1;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE;normalizedDist=FALSE')
Outlier Distance Function:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'distance','algorithm=k-means')




From my testing, it seems that the topN and isTopNAsPercentage options work only with h-clustering algorithm:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','algorithm=h-clustering;isTopNAsPercentage=false;topN=10')

the others are indifferent to the options. This is strange, k-means, should have been influenced as well.

One last thing, mismatching the dimensions columns in the analysis with those in the outlier function might return error:

Error Codes: OPR4ONWY:U9IM8TAC:U9IM8TAC:U9IM8TAC:OI2DL65P
State: HY000. Code: 10058. [NQODBC] [SQL_STATE: HY000] [nQSError: 10058] A general error has occurred. (HY000)
State: HY000. Code: 43113. [nQSError: 43113] Message returned from OBIS. (HY000)
State: HY000. Code: 43119. [nQSError: 43119] Query Failed: Error(s): Error in covMcd(x, alpha = quan): n == p+1 is too small sample size for MCD (HY000)

Or make the system wonder how serious you are ([nQSError: 43119] Query Failed: Error(s): Error in covMcd(x, alpha = quan): n <= p -- you can't be serious!)










P.S.


I'm far from being a expert on the issue... But 2 more approaches to outliers based on clusters:

1. Sometimes small clusters are outliers by themselves. In those cases it might be interesting to check the very small clusters as outliers.

2. In some cases finding points that seems to belong to clusters but are at the far edges of the cluster might be interesting. in such cases the following might help:

I'll return to the k-means Cluster example from the previous post.In an analysis with Company, Product, Revenue and Billed Quantity columns, i have a Cluster:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', 'algorithm=k-means;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')   
I'll add the same function but with distanceFromCenter option:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'distanceFromCenter', 'algorithm=k-means;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')



Now we can have outliers based on each cluster. Those values are by default normalized (0-100). Thanks to that we can select a constant number as a mark. I'll select the value of 65 distance from center, that marks outlier:

CASE WHEN (CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'distanceFromCenter', 'algorithm=k-means;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE') > 65) THEN 'Outlier' ELSE '' END

The same can be done using percentage or other, more appropriate, calculations.





 



No comments:

Post a Comment