This is the fourth part of OBIEE 12c Advanced Analytic:
The Cluster function is part of the Analytics functions family. In documentation it's described: The CLUSTER function collects a set of records into groups based on one or more input expressions using KMeans or Hierarchical Clustering.
The popular demonstration of kmeans Cluster explained nicely in Wikipedia.
Clustering is a type of unsupervised learning. In clustering method, objects of the data set are grouped into clusters, in such a way that groups are very different from each other and the objects in the same group or cluster are very similar to each other.
Remember, similarity is often hard to define
(Pictures from https://propuppy.wordpress.com/2012/11/19/ownerdoglookalike/).
Our Clustering function can use 2 algorithms KMeans and Hierarchical Clustering. The KMeans is very good for cases when we can "draw a circle/ellipse" around every cluster, while the the hierarchical options tend to create a cluster within a cluster in a hierarchy.
KMeans tends to have better performance especially when the amount of data is growing.
Read more about in Data Clustering: Kmeans and Hierarchical Clustering by Piyush Rai or OBIEE specific Comparison Between KMean and Hierarchical Algorithm Using Query Redirection by Manpreet kaur and Usvir Kaur
I'll start with the online help example:
Analysis with Product, Company, Revenue and Billed Quantity columns.
The Cluster Function is (simplified)
CLUSTER((""Product", "Company"), ("Billed Quantity","Revenue"),'clusterName', 'algorithm=kmeans;numClusters=%1;maxIter=%2;useRandomSeed=FALSE;enablePartitioning=TRUE', 5, 10)
Another option to write the same:
CLUSTER((""Product", "Company"), ("Billed Quantity","Revenue"),'clusterName', 'algorithm=kmeans;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')
actual function:
CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"), ("Sales"."Facts"."Billed Quantity","Sales"."Facts"."Revenue"),'clusterName', 'algorithm=kmeans;numClusters=%1;maxIter=%2;useRandomSeed=FALSE;enablePartitioning=TRUE', 5, 10)
 OBIEE 12c Advanced Analytics Functions part 1. Introduction & Trendline
 OBIEE 12c Advanced Analytic part 2: BIN and WIDTH_BUCKET
 OBIEE 12c Advanced Analytic part 3: Forecast
 OBIEE 12c Advanced Analytic part 4: Cluster (this one)
 OBIEE 12c Advanced Analytic part 5: Outlier
 OBIEE 12c Advanced Analytic part 6: Regression
 OBIEE 12c Advanced Analytic part 7: EVALUATE_SCRIPT
The Cluster function is part of the Analytics functions family. In documentation it's described: The CLUSTER function collects a set of records into groups based on one or more input expressions using KMeans or Hierarchical Clustering.
The popular demonstration of kmeans Cluster explained nicely in Wikipedia.
Clustering is a type of unsupervised learning. In clustering method, objects of the data set are grouped into clusters, in such a way that groups are very different from each other and the objects in the same group or cluster are very similar to each other.
Remember, similarity is often hard to define
(Pictures from https://propuppy.wordpress.com/2012/11/19/ownerdoglookalike/).
Our Clustering function can use 2 algorithms KMeans and Hierarchical Clustering. The KMeans is very good for cases when we can "draw a circle/ellipse" around every cluster, while the the hierarchical options tend to create a cluster within a cluster in a hierarchy.
KMeans tends to have better performance especially when the amount of data is growing.
Read more about in Data Clustering: Kmeans and Hierarchical Clustering by Piyush Rai or OBIEE specific Comparison Between KMean and Hierarchical Algorithm Using Query Redirection by Manpreet kaur and Usvir Kaur
I'll start with the online help example:
Analysis with Product, Company, Revenue and Billed Quantity columns.
The Cluster Function is (simplified)
CLUSTER((""Product", "Company"), ("Billed Quantity","Revenue"),'clusterName', 'algorithm=kmeans;numClusters=%1;maxIter=%2;useRandomSeed=FALSE;enablePartitioning=TRUE', 5, 10)
Another option to write the same:
CLUSTER((""Product", "Company"), ("Billed Quantity","Revenue"),'clusterName', 'algorithm=kmeans;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')
actual function:
CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"), ("Sales"."Facts"."Billed Quantity","Sales"."Facts"."Revenue"),'clusterName', 'algorithm=kmeans;numClusters=%1;maxIter=%2;useRandomSeed=FALSE;enablePartitioning=TRUE', 5, 10)
A resulting graph, where revenue and Billed Quantity are the Axes and each point is combination of company and product (the clusters are color and shapes):
Here is the same but with hclustering instead of kmean algorithm.
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', 'algorithm=kmeans;')
returns:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', '')
returned results similar to the kmeans result with 5 clusters.
Time to see the formal Syntax:
Syntax
CLUSTER((dimension_expr1 , ... dimension_exprN), (expr1, .. exprN), output_column_name, options, [runtime_binded_options])Where:
dimension_expr
indicates a list of dimensions to be clustered (each point you see on my graphs is combination of dimension_expressions).expr
indicates a list of dimension attributes or measures to use to cluster dimension_expr
(those are the axis in the graphs).output_column_name
indicates the output column name for the cluster. Valid values are
'clusterId', 'clusterName', 'clusterDescription', 'clusterSize',
'distanceFromCenter', and 'centers'. (you can use few similar cluster functions with different Output_column_name in the same analysis)options
indicates a string list of name/value pairs separated by a semicolon
(;). You can include %1 ... %N, which can be specified using runtime_binded_options
.runtime_binded_options
indicates a comma separated list (,) of runtime binded columns or literal expressions.Output Columns The CLUSTER function output contains the following columns:
 clusterID – This column is the cluster number or ID.
 clusterName – This column is the name of the cluster. It is the same as clusterID.
 clusterDescription – The description can be added by the end user after the cluster dataset is persisted into DSS.
 clusterSize – This column is the number of elements in the current cluster.
 distanceFromCenter – This column indicates how far the current cluster element is from the center of the current cluster.
 centers – This column indicates the center of the current cluster.
Option Name

Description

Values

algorithm

The
algorithm to use for clustering.

kmean default?
hclustering

method

The
method within the algorithm.

For kmeans algorithm:
HartiganWong, Lloyd, Forgy (last 2 are same algorithm), MacQueen
Default is HartiganWong.
For hclustering algorithm:
ward.D, ward.D2, single, complete, average, mcquitty, median, or centroid.
Default is complete. 
numClusters

The
number of clusters. Every record is assigned to one of the clusters.

Integer. Is 5 the default?

attributeNames

The
attributes to consider for clustering.

arg1,arg2,arg3,arg4,arg5,arg6,arg7,arg8,arg9,arg10

maxIter

The
number of iterations to create clusters.

Integer. Default 20.

normalizedDist

Normalizes
the distance between 0 to 100 if set to TRUE.

TRUE
FALSE
Default is TRUE. 
useRandomSeed

Set
to TRUE by default. If set to TRUE, then is the value to use in PRODUCTION
environment.
If
set to FALSE, then use set.seed(initialSeed) to ensure reproducibility, used
in QA/Debug Environment.

True
False
Default is TRUE. 
initialSeed

This
value is used only when useRandomSeed is set to FALSE.

Integer
Default is 250. 
clusterNamePrefix

If
set, then this is the prefix for the cluster name.

Varchar
Default
is empty.

clusterNameSuffix

If
set, then this is the suffix for the cluster name.

Varchar
Default is empty. 
What have we learned from this?
First of all, the examples are BAD for production, it's better to work with useRandomSeed=TRUE
We can have a Cluster name prefix:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', 'algorithm=kmeans;numClusters=5;clusterNamePrefix=Cluster ')
When setting clusters based on some random selection, the algorithm tends to make few iteration to "recalibrate" itself. MaxIter parameter controls the max number of attempts to do the "recalibration".
Here are few additional output options, run on the same kmeans cluster:
I used:
 clusterName
 clusterSize
 centers
 distanceFromCenter
One last example: I wanted to see that using different methods in the same algorithm would cause variations in results. I used the Kmean cluster method as parameter from a prompt, with that parameter value as prefix, for readability. It's actually the same parameter.
When comparing methods we can't have random results, so we must have useRandomSeed=FALSE:
I used:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', 'algorithm=kmeans;method= %1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2', '@{P_Method}{HartiganWong}', '@{P_Method}{HartiganWong}')
(I had to enter the same runtime_binded_option twice, despite the fact it has the same value, we don't like reuse in this function).
As one might have expected, different methods return small variations in results (sorry about my drawing capabilities):
Removing columns from the criteria (but no the cluster function) would not cause any error, both for dimensions and attributes and will not change the number of rows returned (though removing the dimension might make result almost meaningless).
Few more points:
It should be crystal clear. the default of the system is useRandomSeed=TRUE. As a result the clusters order and even result might not be consistent, even on the same data.For example, the following function: CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', 'algorithm=kmeans;numClusters=5;maxIter=10')
First run:
Second run:
My small personal experience:
Selecting the correct number of Clusters is not always easy. Often we might change our selection based of the results we see. In cases they are more then two expressions (Revenue and Billed Quantity in our case), it might be educational to set clusters by pairs of them, just for the graphical output.
I will return to few other options of Cluster in the next post.
haii this information is very usefull in my company,https://blogs.perficient.com/oracle/2015/11/02/whyisoraclesobiee12cimportant/#comment91029
ReplyDelete