Wednesday, December 28, 2016

Playing with Alcohol and Oracle DV Desktop 2 for some basic data science

I will cover here a few nice options for investigating and preparing data with Oracle Data Visualization Desktop 2.
In the spirit of the New Year, let's talk about alcohol.

Some time ago I saw a great DV demo about alcohol prices by Pavel Dubinin; my data set has some darker shades of the topic as well. For some time I have had a file of alcohol consumption for each country in my Big Data Virtual Machine. I believe the origin is Wikipedia. From HDFS it can be accessed with Hive / Impala / Spark... I decided to use Impala.
There are also a few data sets I downloaded from the World Health Organization GISAH section as Excel and CSV files.

The alcohol prices file has data about 2012 / 2015 prices in some of the countries; I don't remember where I got it from 😱.

So I have:
  • Country information (XLSX)
  • Alcohol dependency in countries (XLSX)
  • Average alcohol prices in USD (XLSX)
  • Percent of deaths by alcohol by country (CSV)
  • Alcohol Consumption by Country (Impala)

The first problem is country names. While countries like Israel or Greece have simple names, there are many ways to write the name of Iran (Islamic Republic of), or even the USA or GB. My Impala data might not fit the formal naming in the data from the World Health Organization. How can I check?

First I need the files in the Data Visualization Desktop.
Importing the CSV and Excel files is mostly trivial: create a new data source that is based on a file. After importing I can also define which columns are measures or attributes, the data type and, for measures, the aggregation:

What do I do with Impala?
I defined a new connection:
And created a new data source based on that connection:

I could import a table and select columns,

but I preferred to write my own select statement, and along the way control the data format:
select country,
       cast(total as double) as total_cons,
       cast(__beer as double) as beer_perc,
       cast(__other as double) as other_perc,
       cast(__spirits as double) as spirits_perc,
       cast(__wine as double) as wine_perc
from default.alcohol_consumption
where rand() < 1

You might notice I added "where rand()<1", just as a reminder to myself. When working with really big data we might want something like "where rand()<0.1" so we don't bring too large a data set into the desktop tool. (Actually you want to do something a bit trickier, since the Impala rand function has a static nature.)
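One possible form of that trickier sampling (a sketch, and an assumption rather than the author's exact solution): seeding rand() with the current epoch, so each run draws a different sample. The table and columns are the ones from the query above.

```sql
-- Hedged sketch: rand(seed) with a changing seed avoids the static
-- behaviour of a constant-seeded rand() in Impala.
select country, cast(total as double) as total_cons
from default.alcohol_consumption
where rand(unix_timestamp()) < 0.1
```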

Now I have my data sources in the Data Visualization Desktop and I can create my first data flow, one that compares country names:
I selected the Country data source
In the data flow I want only the Country name column, so I right-click on the Country data source in the flow and Add a step:

And pick the Select Columns:

The system automatically adds all the columns, so I select "Remove All", pick the "Country" column and press "Add Selected":

In "Preview" I can see the data:

Now I want to bring the Impala source and do the same:
Another right-click on the last step (Select Columns), 'Add Step' and selecting 'Add Data' does that:
Now I have 2 data flows I want to join. The trick is to select one and Ctrl+Right-Click on the other. Now we have the Join option:

Our case is simple (country = Country); in other cases we can select the columns and press the plus mark on the right to add one more pair.

To check the country names that don't fit, we want a full outer join. For non-technical people it's "All rows" vs. "Matching rows":

 Now I can add the filter (country is Null or the other country is Null):
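For readers who prefer SQL, the whole flow so far is roughly equivalent to this query (the table names who_countries and impala_countries are hypothetical stand-ins for the two sources):

```sql
-- Full outer join on the name columns; rows where either side is null
-- are the country names that exist in only one of the sources.
select w.Country as who_name,
       i.country as impala_name
from who_countries w
full outer join impala_countries i
  on i.country = w.Country
where w.Country is null
   or i.country is null
```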

Last part of the data flow is to save the data:

 I didn't bother saving the data flow. What I did next was to export the result to Excel and create a CASE statement. Something like this:

CASE Country
WHEN 'Bolivia (Plurinational State of)' THEN 'Bolivia'
WHEN 'Brunei Darussalam' THEN 'Brunei'
WHEN 'Democratic People''s Republic of Korea' THEN 'North Korea'
WHEN 'Iran (Islamic Republic of)' THEN 'Iran'
WHEN 'Micronesia (Federated States of)' THEN 'Federated States of Micronesia'
WHEN 'Lao People''s Democratic Republic' THEN 'Laos'
WHEN 'Syrian Arab Republic' THEN 'Syria'
WHEN 'United Kingdom of Great Britain and Northern Ireland' THEN 'United Kingdom'
WHEN 'The former Yugoslav republic of Macedonia' THEN 'Macedonia'
WHEN 'Venezuela (Bolivarian Republic of)' THEN 'Venezuela'
WHEN 'Congo' THEN 'Republic of the Congo'
WHEN 'Republic of Korea' THEN 'South Korea'
ELSE Country
END

I used it in the next part, where I combined all the data sources.
I could create a project and do the joins there, but I preferred to have a data flow that creates a reusable single data source of all the data.

The only new things I did in the second data flow were to create a column (based on the above CASE statement) and rename a few columns.


Let's talk about the last steps of this data flow:

After the join and selecting only the relevant columns, we have to define the aggregation function of the measures. We often use Sum. In our case there is no point in summing alcohol consumption across countries, so I mostly use Avg:

Now I have a new data source, named United Alcohol. Let's have fun.

Can you guess which region consumes the largest amount of alcohol? Europe, of course; when Eastern Europe is on your side you will always win this "game". Not surprisingly, the Eastern Mediterranean countries (that are mostly Muslim) consume very little alcohol. (The numbers are the amount of pure alcohol per person per year.)

If you want to know the most popular type of alcoholic drink around the world, we have the answer:

Looking at maps of total alcohol consumption, 2 types of alcohol prices, % of alcohol dependence and alcohol-attributable fractions of all-cause deaths (%) shows me a few interesting things:
1. The situation in Eastern Europe is scary.
2. Norway's alcohol prices are very high, but that didn't help with alcohol dependence, and only slightly with consumption, compared to other Scandinavian countries.

So let's see what's going on in Europe.

Here is a chart of total alcohol consumption vs. alcohol dependency in European countries, with a polynomial trend line.

When I check for outliers, I'm not really surprised:

Belarus, Russia and Hungary on one side, and Tajikistan, Turkey and Azerbaijan on the other. Another outlier model includes Israel as well (the regions are not geographical).

The last thing I will do with Europe is to set the size of the bubble to show the percent of deaths attributable to alcohol. The strange thing I see is that a high percent of deaths attributable to alcohol is not directly connected to consumption. It seems that geography is the main factor: if you are from an ex-USSR republic, you have a high chance of dying of something related to alcohol. Belarus is leading with 34.7%! Hungary, which is in the same cluster of alcoholism and consumption, reports 6.7%, far less than Ukraine, Latvia, Lithuania, Moldova, Estonia...

Let's see it on the map:

It's not directly related to Consumption of Alcohol:

The part that really surprised me was when I split the percent of deaths attributable to alcohol by gender. In the countries where the percentage is highest, more women's deaths than men's are attributable to alcohol!


I could continue on and on, but let's stop here. I wish you a happy new year, and "С новым годом" if you prefer Russian. Let's finish with this Merry Christmas video made with Oracle Data Visualization:

Friday, November 11, 2016

OBIEE - Selecting date range by selecting a column

Customers often work with specific date ranges: "last 3 days", "last week", "this month", "last 7 working days"... While it's easy to create a filter that does this and reuse it, they prefer an easier method.
Adding those options to the tool by default is problematic, since users always invent new range needs. We need something the local IT can manage without too much work for each new idea.
I covered how to do it with prompts in the post: "OBIEE Prompts - Last Year, Last Month and Custom Dates", but the request is to give easy access to date range selection in Answers (and VA) as well.
Sorry, it's going to be a long post.

My basic assumption is that you have a date dimension based on a dates table. I will cover the following options:
  • Reusing a filter.
  • Adding date range columns to the date table.
  • Creating a date_range table and joining it to the date dimension.
  • Having multiple data sources on the same physical table, with various "where" conditions in the content, as a date dimension with a specific column.

In all examples I will use 3 ranges: "last 3 days", "last month", "this year". Assuming today is 11-Nov-2016:
last 3 days is 10-Nov-2016, 9-Nov-2016, 8-Nov-2016.
last month is all the dates from 11-Oct-2016 till 10-Nov-2016.
this year is all the dates from 1-Jan-2016 till 11-Nov-2016.
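Just to pin those definitions down in Oracle SQL terms (date_d and datevalue are assumed names for the dates table and its date column):

```sql
-- last 3 days
select datevalue from date_d
where datevalue between trunc(sysdate)-3 and trunc(sysdate)-1;

-- last month
select datevalue from date_d
where datevalue between add_months(trunc(sysdate),-1) and trunc(sysdate)-1;

-- this year
select datevalue from date_d
where datevalue between trunc(sysdate,'YEAR') and trunc(sysdate);
```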

I'll describe the pros and cons of each method with the following criteria:
  • Ease of use for customer.
  • Requires RPD change for each additional date range.
  • Requires daily ETL.
  • Supports easy reverse selection and combinations (such as: last month and not last 3 days).
  • Works with VA as well or Answers only.
  • Has performance impact.
I use a very simple model, with a measure valued 1 on each day of the fact table.

Reusing a Filter

I can create a filter from within an analysis or from the menu: New / Filter.

I can save the filter for reuse

 or save the entire analysis.
To use the filter, the user has to select it from the filters list of the Catalog (left panel):

Some users find it easier to use the analysis itself as a filter; they believe they have better control this way:
 In a similar manner I can create the other date range filters / analyses:
Only the analysis-based filter lets us create a reverse selection:


This is the best option for IT, and OK but not so great for users.
  • Ease of use for User? OK but not great.
  • Requires RPD change for each additional date range?  NO.
  • Requires daily ETL? NO.
  • Supports easy reverse selection and combinations (such as: last month and not last 3 days)? Yes, with filter based on analysis only.
  • Works with VA? NO.
  • Has performance impact? NO.

Adding date range columns to date table

This option requires a change in the dates table. I'll add 3 columns (last_3_days, last_month, this_year). For each column I have to update the Boolean value daily (in my case Yes/No).
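The daily ETL could be as simple as a single update statement; this is only a sketch with assumed table and column names (date_d, datevalue):

```sql
-- Recompute the three Yes/No range columns once a day (Oracle SQL).
update date_d
set last_3_days = case when datevalue between trunc(sysdate)-3 and trunc(sysdate)-1
                       then 'Yes' else 'No' end,
    last_month  = case when datevalue between add_months(trunc(sysdate),-1) and trunc(sysdate)-1
                       then 'Yes' else 'No' end,
    this_year   = case when datevalue between trunc(sysdate,'YEAR') and trunc(sysdate)
                       then 'Yes' else 'No' end;
```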

This is the Subject area:

The user has a very simple task: filtering Yes/No on the column:

The reverse option is easy to create:

It works with VA as well:


This is the worst option for IT and great for users.
  • Ease of use for User? YES.
  • Requires RPD change for each additional date range?  YES.
  • Requires daily ETL? YES.
  • Supports easy reverse selection and combinations (such as: last month and not last 3 days)? YES.
  • Works with VA? YES.
  • Has performance impact? NO.

Creating a date_range table with a line for each range and joining it to the date dimension

The simple option here is to create a table with 3 columns (Range_name, From_date, Till_date) and join it to the date dimension. This sort of join has a bad impact on performance. The table also requires ETL to keep the date_range table updated. We can have a workaround for the ETL by creating a dynamic view (Oracle SQL code):
create or replace view date_range_line as
select 'Last 3 days' as Range, trunc(sysdate-3) as from_date, trunc(sysdate-1) as till_date from dual
union all
select 'Last month' as Range, trunc(add_months(sysdate,-1)) as from_date, trunc(sysdate-1) as till_date from dual
union all
select 'This Year' as Range, trunc(sysdate,'YEAR') as from_date, trunc(sysdate) as till_date from dual
This is easy to update for each change or new requirement in the future.
So I created it as a view, imported it into the repository and joined it to the date table.
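For example, supporting a new (hypothetical) "Last 7 days" request is just one more union all branch appended to the view, with no ETL and no RPD change:

```sql
-- Appended to the date_range_line view definition above.
union all
select 'Last 7 days' as Range, trunc(sysdate-7) as from_date, trunc(sysdate-1) as till_date from dual
```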


Now it's working:
Unfortunately, the reverse option returns wrong results (the last 3 days were not removed):

(Actually it can be done using the Minus option that is under the Union option in the criteria, but that's a bit advanced.)

For performance, let's look at the select statement from the log:
SAWITH0 AS (select sum(T909.MES) as c1
from
     DATE_RANGE_LINE T1138,
     DATE_D T895,
     TEST_FACT T909
where  ( T895.DATEKEY = T909.DATEKEY and T1138.RANGE = 'Last 3 days'
and T895.DATEVALUE between T1138.FROM_DATE and T1138.TILL_DATE ) )
select D1.c1 as c1, D1.c2 as c2 from ( select distinct 0 as c1,
     D1.c1 as c2
from
     SAWITH0 D1 ) D1 where rownum <= 65001
As expected, we have 3 tables joined instead of 2.


This is a good option for everyone, if you don't mind the performance problem and the wrong results of reverse or multiple range selections.
  • Ease of use for User? YES.
  • Requires RPD change for each additional date range?  NO.
  • Requires daily ETL? NO (for the view option).
  • Supports easy reverse selection and combinations (such as: last month and not last 3 days)? NO.
  • Works with VA? YES.
  • Has performance impact? YES, but it can be solved with a driving table, as described below in the update.

For this option I also considered creating a date_range table with a line for each date and range combination, and joining it to the date dimension as a driving table.
This is an ETL-intensive option where, for example, for last 3 days I have 3 rows of data: the range column is "last 3 days" and the date column has each of the 3 days that fit the range. For last month, up to 31 rows... We can see that dates such as '10-Nov-2016' appear both in 'last 3 days' and in 'last month':

This way the join with the date table is simple.
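Instead of ETL, the exploded table could also be generated on the fly; a sketch using an Oracle connect by row generator (table and column names are assumptions, chosen to match the log below):

```sql
-- One row per (range, date) combination for the last 3 days:
-- 'last 3 days' paired with each of yesterday, 2 and 3 days ago.
select 'last 3 days' as range, trunc(sysdate) - level as dates
from dual
connect by level <= 3
```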


To solve the performance problem we can use an OBIEE feature named "driving table". The idea of the driving table is described here and here. This way OBIEE runs a separate query for the date_range table and places the result into the date table filter. Since the date ranges are usually small, this works well (if you have a range covering a whole year, you should change the MAX_PARAMETERS_PER_DRIVE_JOIN parameter). Unfortunately, since the driving table does its own optimization, it didn't kick in during my tests.
OBIEE prefers to use a driving table only with "cross-database joins".
If you can't beat them, join them. To overcome the optimizer I duplicated my physical data source (now it's a cross-database join):

Then I used the date_range table from the second source (ORCL#1) and set this date_range as the driving table in the join in the business model:
Now the performance is fine!

This is from the log:

select T1287.RANGE as c1,
     T1287.DATES as c2
from
     DATE_RANGE T1287
where  ( T1287.RANGE = 'last 3 days' )
order by c2


select  /*+ no_merge */  sum(T909.MES) as c1,
     T909.DATEKEY as c2,
     T895.DATEVALUE as c3
from
     TEST_FACT T909,
     DATE_D T895
where  ( T895.DATEKEY = T909.DATEKEY and T895.DATEVALUE in (
:PARAM10, :PARAM11, :PARAM12, :PARAM13, :PARAM14, :PARAM15, :PARAM16, :PARAM17,
:PARAM100) )
group by T895.DATEVALUE, T909.DATEKEY
order by c3

Having multiple data sources on the same physical table, with various "where" conditions in the content, as a date dimension with a specific column

This is a nice option for performance and demands no ETL.
For each range I duplicate the date data source and add a column that is specific to this data source. Selecting that column, or filtering on it, enables the filter.

The example:

In the Business Model, in the date dimension table I duplicate the data source (it is still the same physical date_d table):

Rename the duplicated data source to "last 3 days" and add a where condition in the Content tab, very similar to the one in the first example ("Reusing a filter"):
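The Content-tab condition for this "last 3 days" source would be something along these lines (a sketch; the physical table and column names are assumptions):

```sql
-- Restricts the duplicated logical table source to the last 3 days.
DATE_D.DATEVALUE between trunc(sysdate)-3 and trunc(sysdate)-1
```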


Now I add a new logical column to the Date table and rename it.

Set the column source to be physically mapped from the "last 3 days" data source. It is critical that no other data source is mapped to this column.

I enter the data source and map this column:

We don't need a physical table here; I just write 'Last 3 Days' (or whatever I want) in the Expression. We do need to mark "Show unmapped columns".

The order of data sources is important. We want the default date source to be without constraints, so we place that data source first (by default OBIEE starts with the first data source and goes top-down). If you want extra caution, you can work with data source priority groups as well.
Oracle help: "To assign priority group numbers, rank your logical table sources in numeric order, with 0 being the highest-priority source. You can assign the same number to multiple sources. For example, you can have two logical table sources in priority group 0, two logical table sources in priority group 1, and so on. Often, only two priority groups are necessary (0 and 1)."

Here, in a similar way, I add the "Last Month" data source, add its condition in the Content tab, and add the "Last Month" column for that data source:

I map the column to a physical source in the "Last Month" data source only. In a similar way you can see that "Last 3 days" has no mapping in the "Last Month" data source.

After I complete the setup this is what I have:
When working with an Analysis, not using any of the range columns returns the usual data:

In our case all the dates:

Adding the "Last 3 Days" column activates the filtering. The same would happen if it's added hidden or used as a filter.

An attempt at the reverse option won't work.
In this case, when I add the filter "Last 3 Days is not equal to / is not in Last 3 Days", it returns an empty set, as expected (once I use the "last 3 days" data source, the "Last 3 Days" column value is always 'Last 3 Days').


  • Ease of use for User? YES.
  • Requires RPD change for each additional date range?  YES, but it's rather easy.
  • Requires daily ETL? NO.
  • Supports easy reverse selection and combinations (such as: last month and not last 3 days)? NO.
  • Works with VA? YES.
  • Has performance impact? NO.

Final Summary:

The easiest option for IT is the first, "Reusing a filter". It's also the worst for the user.
The best for the user seems to be the "Adding date range columns to date table" option; it's also the worst for IT.
You can consider the other options as well.