OBIEE in IL: OBIEE 12c Advanced Analytic part 4: Cluster

This is the fourth part of OBIEE 12c Advanced Analytic:

(for 3-7 you need Enabling R and the relevant Analytics functions on OBIEE 12c )

The Cluster function is part of the Analytics functions family. In documentation it's described: The CLUSTER function collects a set of records into groups based on one or more input expressions using K-Means or Hierarchical Clustering.

The popular demonstration of k-means Cluster explained nicely in Wikipedia.

Clustering is a type of unsupervised learning. In clustering method, objects of the data set are grouped into clusters, in such a way that groups are very different from each other and the objects in the same group or cluster are very similar to each other.
Remember, similarity is often hard to define
(Pictures from https://propuppy.wordpress.com/2012/11/19/owner-dog-look-alike/).

Our Clustering function can use 2 algorithms K-Means and Hierarchical Clustering. The K-Means is very good for cases when we can "draw a circle/ellipse" around every cluster, while the the hierarchical options tend to create a cluster within a cluster in a hierarchy.
K-Means tends to have better performance especially when the amount of data is growing.

Read more about in Data Clustering: K-means and Hierarchical Clustering by Piyush Rai or OBIEE specific Comparison Between K-Mean and Hierarchical Algorithm Using Query Redirection by Manpreet kaur and Usvir Kaur

I'll start with the online help example:
Analysis with Product, Company, Revenue and Billed Quantity columns.

The Cluster Function is (simplified)

CLUSTER((""Product", "Company"), ("Billed Quantity","Revenue"),'clusterName', 'algorithm=k-means;numClusters=%1;maxIter=%2;useRandomSeed=FALSE;enablePartitioning=TRUE', 5, 10)

Another option to write the same:
CLUSTER((""Product", "Company"), ("Billed Quantity","Revenue"),'clusterName', 'algorithm=k-means;numClusters=5;maxIter=10;useRandomSeed=FALSE;enablePartitioning=TRUE')

actual function:
CLUSTER(("Sales"."Products"."Product", "Sales"."Offices"."Company"), ("Sales"."Facts"."Billed Quantity","Sales"."Facts"."Revenue"),'clusterName', 'algorithm=k-means;numClusters=%1;maxIter=%2;useRandomSeed=FALSE;enablePartitioning=TRUE', 5, 10)

A resulting graph, where revenue and Billed Quantity are the Axes and each point is combination of company and product (the clusters are color and shapes):

Here is the same but with h-clustering instead of k-mean algorithm.

There seems to be some undocumented default of 5 clusters, so the following Cluster function:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', 'algorithm=k-means;')
returns:

And removing the algorithm part as well:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', '')
returned results similar to the k-means result with 5 clusters.

Time to see the formal Syntax:

Syntax

CLUSTER((dimension_expr1 , ... dimension_exprN), (expr1, .. exprN),
output_column_name, options, [runtime_binded_options])

Where:
dimension_expr indicates a list of dimensions to be clustered (each point you see on my graphs is combination of dimension_expressions).
expr indicates a list of dimension attributes or measures to use to cluster dimension_expr (those are the axis in the graphs).
output_column_name indicates the output column name for the cluster. Valid values are 'clusterId', 'clusterName', 'clusterDescription', 'clusterSize', 'distanceFromCenter', and 'centers'. (you can use few similar cluster functions with different Output_column_name in the same analysis)
options indicates a string list of name/value pairs separated by a semi-colon (;). You can include %1 ... %N, which can be specified using runtime_binded_options.
runtime_binded_options indicates a comma separated list (,) of run-time binded columns or literal expressions.

Output Columns The CLUSTER function output contains the following columns:

clusterID – This column is the cluster number or ID.
clusterName – This column is the name of the cluster. It is the same as clusterID.
clusterDescription – The description can be added by the end user after the cluster dataset is persisted into DSS.
clusterSize – This column is the number of elements in the current cluster.
distanceFromCenter – This column indicates how far the current cluster element is from the center of the current cluster.
centers – This column indicates the center of the current cluster.

More about parameters:

Option Name	Description	Values
algorithm	The algorithm to use for clustering.	k-mean default? h-clustering
method	The method within the algorithm.	For k-means algorithm: Hartigan-Wong, Lloyd, Forgy (last 2 are same algorithm), MacQueen Default is Hartigan-Wong. For h-clustering algorithm: ward.D, ward.D2, single, complete, average, mcquitty, median, or centroid. Default is complete.
numClusters	The number of clusters. Every record is assigned to one of the clusters.	Integer. Is 5 the default?
attributeNames	The attributes to consider for clustering.	arg1,arg2,arg3,arg4,arg5,arg6,arg7,arg8,arg9,arg10
maxIter	The number of iterations to create clusters.	Integer. Default 20.
normalizedDist	Normalizes the distance between 0 to 100 if set to TRUE.	TRUE FALSE Default is TRUE.
useRandomSeed	Set to TRUE by default. If set to TRUE, then is the value to use in PRODUCTION environment. If set to FALSE, then use set.seed(initialSeed) to ensure reproducibility, used in QA/Debug Environment.	True False Default is TRUE.
initialSeed	This value is used only when useRandomSeed is set to FALSE.	Integer Default is 250.
clusterNamePrefix	If set, then this is the prefix for the cluster name.	Varchar Default is empty.
clusterNameSuffix	If set, then this is the suffix for the cluster name.	Varchar Default is empty.

What have we learned from this?

First of all, the examples are BAD for production, it's better to work with useRandomSeed=TRUE

We can have a Cluster name prefix:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', 'algorithm=k-means;numClusters=5;clusterNamePrefix=Cluster ')

When setting clusters based on some random selection, the algorithm tends to make few iteration to "recalibrate" itself. MaxIter parameter controls the max number of attempts to do the "recalibration".

Here are few additional output options, run on the same k-means cluster:

I used:

clusterName
clusterSize
centers
distanceFromCenter

Please note the distanceFromCenter is normilized on scale 0-100 by default.

One last example: I wanted to see that using different methods in the same algorithm would cause variations in results. I used the K-mean cluster method as parameter from a prompt, with that parameter value as prefix, for readability. It's actually the same parameter.
When comparing methods we can't have random results, so we must have useRandomSeed=FALSE:
I used:
CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', 'algorithm=k-means;method= %1;numClusters=5;useRandomSeed=FALSE;clusterNamePrefix=%2', '@{P_Method}{Hartigan-Wong}', '@{P_Method}{Hartigan-Wong}')
(I had to enter the same runtime_binded_option twice, despite the fact it has the same value, we don't like reuse in this function).

As one might have expected, different methods return small variations in results (sorry about my drawing capabilities):

Removing columns from the criteria (but no the cluster function) would not cause any error, both for dimensions and attributes and will not change the number of rows returned (though removing the dimension might make result almost meaningless).

Few more points:

It should be crystal clear. the default of the system is useRandomSeed=TRUE. As a result the clusters order and even result might not be consistent, even on the same data.
For example, the following function: CLUSTER(("Products"."Product", "Offices"."Company"), ("Base Facts"."Billed Quantity","Base Facts"."Revenue"),'clusterName', 'algorithm=k-means;numClusters=5;maxIter=10')
First run:

Second run:

My small personal experience:
Selecting the correct number of Clusters is not always easy. Often we might change our selection based of the results we see. In cases they are more then two expressions (Revenue and Billed Quantity in our case), it might be educational to set clusters by pairs of them, just for the graphical output.

I will return to few other options of Cluster in the next post.

OBIEE in IL

Monday, January 11, 2016

OBIEE 12c Advanced Analytic part 4: Cluster

Syntax

Few more points:

1 comment: