This is the fifth part of OBIEE 12c Advanced Analytic:
Outlier detection is sometimes described as Anomaly detection.
The Outlier function is part of the Analytics functions family. In documentation it's described: The OUTLIER function classifies a record as an outlier based on one or more input expressions using KMeans or Hierarchical Clustering or MultiVariate Outlier detection Algorithms.
Outlier, like Clustering is a type of unsupervised learning. In clustering, objects of the data set are grouped into clusters, in such a way that groups are very different from each other and the objects in the same group or cluster are very similar to each other. In many cases, the Outliers are those values that don't fit well any of the existing cluster. It is not surprising that 2 out of available Outlier algorithms are shared with Cluster.
(When we are at academic mood we can read more about MultiVariate Outlier detection here, for example. The other 2 algorithms were referenced in the Clustering post.)
I'll start with the online help example:
In the Analysis we have "Products"."Product", "Offices"."Company", "Base Facts"."Revenue" and "Base Facts"."Billed Quantity". Lets add an outlier function.
Simplified version:
OUTLIER(("Product","Company"), ("Quantity","Revenue"), 'isOutlier', 'algorithm=mvoutlier')
Full function code:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier')
I added "best Visualization" option, did some minor changes and got the graph Outlier based on MVOUTLIER algorithm:
Since MVOUTLIER is default, I get the same results with empty algorithms value (yet the empty parameter must exist):
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','')
Changing the last parameter to hclustering (3 outlier points):
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','algorithm=hclustering')
And kmeans (2 outlier points):
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','algorithm=kmeans')
Return different results, as one could expect.
Attempt to add Cluster function parameters such as numClusters returned syntax error. So it's time to see the syntax:
Here is another example of the MVOUTLIER algorithm, this time with distance output as well.
IsOutlier:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','')
distance:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'distance','')
As expected, those with the greatest distance are the Outliers:
Combining the Outlier results with Clustering might seem interesting:
Kmeans Clusters with outliers:
but it's actually meaningless, from what I see (in the R code as well) in both kmeans and hclustering algorithms, we create a single cluster and look for the distance there.For example the same results for distance calculations of outlier distance and cluster (nonnormalized) distance for one cluster:
We can see the last 2 columns are identical. I used cluster distance function for 1 cluster:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'distanceFromCenter', 'algorithm=kmeans;numClusters=1;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE;normalizedDist=FALSE')
Outlier Distance Function:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'distance','algorithm=kmeans')
From my testing, it seems that the topN and isTopNAsPercentage options work only with hclustering algorithm:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','algorithm=hclustering;isTopNAsPercentage=false;topN=10')
the others are indifferent to the options. This is strange, kmeans, should have been influenced as well.
One last thing, mismatching the dimensions columns in the analysis with those in the outlier function might return error:
Error Codes: OPR4ONWY:U9IM8TAC:U9IM8TAC:U9IM8TAC:OI2DL65P
State: HY000. Code: 10058. [NQODBC] [SQL_STATE: HY000] [nQSError: 10058] A general error has occurred. (HY000)
State: HY000. Code: 43113. [nQSError: 43113] Message returned from OBIS. (HY000)
State: HY000. Code: 43119. [nQSError: 43119] Query Failed: Error(s): Error in covMcd(x, alpha = quan): n == p+1 is too small sample size for MCD (HY000)
Or make the system wonder how serious you are ([nQSError: 43119] Query Failed: Error(s): Error in covMcd(x, alpha = quan): n <= p  you can't be serious!):
P.S.
I'm far from being a expert on the issue... But 2 more approaches to outliers based on clusters:
1. Sometimes small clusters are outliers by themselves. In those cases it might be interesting to check the very small clusters as outliers.
2. In some cases finding points that seems to belong to clusters but are at the far edges of the cluster might be interesting. in such cases the following might help:
I'll return to the kmeans Cluster example from the previous post.In an analysis with Company, Product, Revenue and Billed Quantity columns, i have a Cluster:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', 'algorithm=kmeans;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')
I'll add the same function but with distanceFromCenter option:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'distanceFromCenter', 'algorithm=kmeans;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')
Now we can have outliers based on each cluster. Those values are by default normalized (0100). Thanks to that we can select a constant number as a mark. I'll select the value of 65 distance from center, that marks outlier:
CASE WHEN (CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'distanceFromCenter', 'algorithm=kmeans;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE') > 65) THEN 'Outlier' ELSE '' END
The same can be done using percentage or other, more appropriate, calculations.
 OBIEE 12c Advanced Analytics Functions part 1. Introduction & Trendline
 OBIEE 12c Advanced Analytic part 2: BIN and WIDTH_BUCKET
 OBIEE 12c Advanced Analytic part 3: Forecast
 OBIEE 12c Advanced Analytic part 4: Cluster
 OBIEE 12c Advanced Analytic part 5: Outlier (this one)
 OBIEE 12c Advanced Analytic part 6: Regression
 OBIEE 12c Advanced Analytic part 7: EVALUATE_SCRIPT
Outlier detection is sometimes described as Anomaly detection.
The Outlier function is part of the Analytics functions family. In documentation it's described: The OUTLIER function classifies a record as an outlier based on one or more input expressions using KMeans or Hierarchical Clustering or MultiVariate Outlier detection Algorithms.
Outlier, like Clustering is a type of unsupervised learning. In clustering, objects of the data set are grouped into clusters, in such a way that groups are very different from each other and the objects in the same group or cluster are very similar to each other. In many cases, the Outliers are those values that don't fit well any of the existing cluster. It is not surprising that 2 out of available Outlier algorithms are shared with Cluster.
(When we are at academic mood we can read more about MultiVariate Outlier detection here, for example. The other 2 algorithms were referenced in the Clustering post.)
I'll start with the online help example:
In the Analysis we have "Products"."Product", "Offices"."Company", "Base Facts"."Revenue" and "Base Facts"."Billed Quantity". Lets add an outlier function.
Simplified version:
OUTLIER(("Product","Company"), ("Quantity","Revenue"), 'isOutlier', 'algorithm=mvoutlier')
Full function code:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier', 'algorithm=mvoutlier')
I added "best Visualization" option, did some minor changes and got the graph Outlier based on MVOUTLIER algorithm:
Since MVOUTLIER is default, I get the same results with empty algorithms value (yet the empty parameter must exist):
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','')
Changing the last parameter to hclustering (3 outlier points):
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','algorithm=hclustering')
And kmeans (2 outlier points):
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','algorithm=kmeans')
Return different results, as one could expect.
Attempt to add Cluster function parameters such as numClusters returned syntax error. So it's time to see the syntax:
Syntax
OUTLIER((dimension_expr1 , ... dimension_exprN), (expr1, .. exprN), output_column_name, options, [runtime_binded_options]))])Where:
dimension_expr
indicates a list of dimensions.expr
represents a list of dimension attributes or measures to find outlier.output_column_name
indicates the output column name. Valid values are 'isOutlier' and 'distance'.options
indicates a string list of name/value pairs separated by a semicolon
(;). The value can include %1 ... %N, which can be specified using runtime_binded_options
.runtime_binded_options
is an option comma separated list (,) of runtime binded columns and or literal expressions.
Option Name

Description

Values

algorithm

The algorithm to use.

mvoutlier (default)
hclustering
kmeans

useRandomSeed

This is TRUE by default. This is
the value to use in PRODUCTION environment.
If set to FALSE, then we use
set.seed(initialSeed) to ensure reproducibility. Used in QA/Debug
environment.

TRUE (default)
FALSE

initialSeed

This value is used only when
useRandomSeed is set to FALSE.

Integer (250 default)

topN

If isTopNAsPercentage is TRUE then
return the topN % as outliers. Or else return the topN values as outliers.
Not relevant to mvoutlier? 
Double (default 3)

isTopNAsPercentage

This value is used with topN.
Not relevant to mvoutlier? 
TRUE
FALSE (default)

Here is another example of the MVOUTLIER algorithm, this time with distance output as well.
IsOutlier:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','')
distance:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'distance','')
As expected, those with the greatest distance are the Outliers:
Combining the Outlier results with Clustering might seem interesting:
Kmeans Clusters with outliers:
but it's actually meaningless, from what I see (in the R code as well) in both kmeans and hclustering algorithms, we create a single cluster and look for the distance there.For example the same results for distance calculations of outlier distance and cluster (nonnormalized) distance for one cluster:
We can see the last 2 columns are identical. I used cluster distance function for 1 cluster:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'distanceFromCenter', 'algorithm=kmeans;numClusters=1;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE;normalizedDist=FALSE')
Outlier Distance Function:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'distance','algorithm=kmeans')
From my testing, it seems that the topN and isTopNAsPercentage options work only with hclustering algorithm:
OUTLIER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"), 'isOutlier','algorithm=hclustering;isTopNAsPercentage=false;topN=10')
the others are indifferent to the options. This is strange, kmeans, should have been influenced as well.
One last thing, mismatching the dimensions columns in the analysis with those in the outlier function might return error:
Error Codes: OPR4ONWY:U9IM8TAC:U9IM8TAC:U9IM8TAC:OI2DL65P
State: HY000. Code: 10058. [NQODBC] [SQL_STATE: HY000] [nQSError: 10058] A general error has occurred. (HY000)
State: HY000. Code: 43113. [nQSError: 43113] Message returned from OBIS. (HY000)
State: HY000. Code: 43119. [nQSError: 43119] Query Failed: Error(s): Error in covMcd(x, alpha = quan): n == p+1 is too small sample size for MCD (HY000)
Or make the system wonder how serious you are ([nQSError: 43119] Query Failed: Error(s): Error in covMcd(x, alpha = quan): n <= p  you can't be serious!):
P.S.
I'm far from being a expert on the issue... But 2 more approaches to outliers based on clusters:
1. Sometimes small clusters are outliers by themselves. In those cases it might be interesting to check the very small clusters as outliers.
2. In some cases finding points that seems to belong to clusters but are at the far edges of the cluster might be interesting. in such cases the following might help:
I'll return to the kmeans Cluster example from the previous post.In an analysis with Company, Product, Revenue and Billed Quantity columns, i have a Cluster:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', 'algorithm=kmeans;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')
I'll add the same function but with distanceFromCenter option:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'distanceFromCenter', 'algorithm=kmeans;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')
Now we can have outliers based on each cluster. Those values are by default normalized (0100). Thanks to that we can select a constant number as a mark. I'll select the value of 65 distance from center, that marks outlier:
CASE WHEN (CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'distanceFromCenter', 'algorithm=kmeans;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE') > 65) THEN 'Outlier' ELSE '' END
The same can be done using percentage or other, more appropriate, calculations.
No comments:
Post a Comment