We needed to build a binary classification model. There is plenty of material on the topic on various Oracle and non-Oracle blogs, but I wanted to do a slightly deeper than usual investigation of creating such a model with SQL and PL/SQL. In this example we have a well-organized dataset of 100K loan credit-score records. This is a rather long post.
What to expect:
1. Creating a basic model, with scoring of results.
2. Comparing possible classification models.
3. Finding optimal Hyper-parameters.
4. Testing Partition on a model.
Creating a basic model, with scoring of results.
In Oracle DB 19 Classification models can use Naive Bayes, Neural Network, Decision Tree, Logistic Regression, Random Forest, Support Vector Machines, or Explicit Semantic Analysis. The default is Naive Bayes. See more in the documentation.
Classification is considered an instance of supervised learning, i.e., learning where a training set of correctly labeled observations is available. It assigns data elements to classes and is best used when the output takes finite, discrete values: given an input, the model predicts a class for it.
The original DB table is credit_scoring.
Check Target values:
select distinct credit_score_bin from credit_scoring;
Other Credit
Good Credit
We have two target values, so this is the specific case called binary classification. In OML4SQL, there is no special handling of the binary case.
First thing to do is split the data to train and test parts.
In Oracle we can split train/test data using ORA_HASH or SAMPLE (for additional reading see here
and here). I used the SAMPLE option to create the train_credit_scoring and test_credit_scoring tables with a 70/30 split. The SEED is used for consistency between test runs:
create table train_credit_scoring as
select * from (
select * from credit_scoring SAMPLE (70) SEED (7));
create table test_credit_scoring as
select * from (
select * from credit_scoring
MINUS
select * from train_credit_scoring);
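Outside the database, the same seeded split-then-MINUS logic can be sketched in a few lines of Python. This is only an illustration of the idea: Oracle's SAMPLE uses its own internal algorithm, so the exact rows chosen will differ, and the 100K ids below are a stand-in for the real customer ids.

```python
import random

def split_train_test(ids, train_fraction=0.7, seed=7):
    """Deterministically split ids into train/test, like SAMPLE (70) SEED (7)."""
    rng = random.Random(seed)  # the seed makes the split reproducible
    train = [i for i in ids if rng.random() < train_fraction]
    train_set = set(train)
    test = [i for i in ids if i not in train_set]  # the MINUS step
    return train, test

ids = list(range(100_000))  # stand-in for the 100K customer ids
train, test = split_train_test(ids)
```

Each id lands in exactly one of the two tables, and the train side holds roughly 70% of the rows (SAMPLE (70) is likewise approximate, not an exact 70%).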
The relevant algorithms are documented here; four of them existed in the 11g and 12c databases and three are new:
- ALGO_NAIVE_BAYES
- ALGO_NEURAL_NETWORK - new
- ALGO_DECISION_TREE
- ALGO_GENERALIZED_LINEAR_MODEL
- ALGO_RANDOM_FOREST - new
- ALGO_SUPPORT_VECTOR_MACHINES
- ALGO_EXPLICIT_SEMANTIC_ANALYS - new
The usual process from here is: create a settings table where we enter the model parameters, run the model-creation script, and then evaluate and use the model.
Let's start with a Decision Tree. For model settings documentation see here.
We will enter two settings: the chosen classification algorithm (the default is Naive Bayes) and a flag telling Oracle to auto-prepare the data. When you enable Automatic Data Preparation, the model uses heuristics to transform the build data according to the requirements of the algorithm. Instead of fully automatic data preparation, the user can request that the data be shifted and/or scaled with the PREP_SCALE* and PREP_SHIFT* settings. The default is prep_auto_on.
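Conceptually, the shift/scale preparation is just normalization: subtract a shift statistic and divide by a scale statistic. A minimal Python sketch of a min/range (min-max) variant follows; the exact statistics Oracle applies depend on which PREP_SHIFT*/PREP_SCALE* values you choose, so treat this only as the general idea:

```python
def shift_and_scale(values):
    """Shift by the minimum and scale by the range, mapping values into [0, 1]."""
    shift = min(values)
    scale = (max(values) - shift) or 1.0  # avoid division by zero on constant columns
    return [(v - shift) / scale for v in values]

normalized = shift_and_scale([10.0, 20.0, 40.0])
```

After this transformation every numeric column lives on a comparable scale, which is what algorithms such as Neural Network and SVM need.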
create table CR_SCR_set_DT (
setting_name varchar2(50),
setting_value VARCHAR2(50)
);
Begin
Delete from CR_SCR_set_DT;
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.algo_name, dbms_data_mining.algo_Decision_Tree);
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
Commit;
End;
/
We might want to drop the model if it exists from previous tests:
BEGIN
DBMS_DATA_MINING.DROP_MODEL(model_name => 'credit_DT');
EXCEPTION WHEN OTHERS THEN NULL; -- to prevent the error when no model
END;
/
And create the model; we must have a primary key / case-id column defined:
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'credit_DT',
mining_function => dbms_data_mining.classification,
data_table_name => 'train_credit_scoring',
case_id_column_name => 'CUSTOMER_ID',
target_column_name => 'credit_score_bin',
settings_table_name => 'CR_SCR_set_DT');
END;
/
On my laptop, for 70K rows, the task completed in 4.362 seconds.
Let's evaluate the model against the test table using a confusion matrix.
I'm looking for the Good Credit customers, so let's treat the customers predicted to be Good Credit as the positive class.
TP - True Positive: positive customers I predicted correctly.
FN - False Negative: customers I predicted as negative (Other Credit) who are actually Good Credit.
FP - False Positive: customers I predicted as positive who are actually Other Credit.
TN - True Negative: customers I predicted correctly as negative.
Calculations based on them:
Accuracy = (TP+TN)/(TP+FP+FN+TN) - the ratio of correctly predicted observations to the total observations.
Precision = TP/(TP+FP) - the ratio of correctly predicted positive observations to all predicted positive observations.
Recall (Sensitivity) = TP/(TP+FN) - the ratio of correctly predicted positive observations to all actual positives.
F1 Score = 2*(Recall * Precision)/(Recall + Precision) - the harmonic mean of Precision and Recall.
In different cases we might prefer one metric over another.
SELECT
TP, FN, FP, TN, TP/(TP+FP) PRECISION, TP/(FN + TP) RECALL,
2*TP/(FN+2*TP+FP) F1, (TP+TN)/(TP+FP+TN+FN) Model_Acuracy, FP/(FP+TN) False_Positive
FROM (
SELECT SUM(TP) TP, SUM(FP) FP, SUM(TN) TN, SUM(FN) FN
FROM (
SELECT CASE WHEN pred = 'Good Credit' AND target = 'Good Credit' THEN 1 ELSE 0 END AS TP,
CASE WHEN pred = 'Good Credit' AND target = 'Other Credit' THEN 1 ELSE 0 END AS FP,
CASE WHEN pred = 'Other Credit' AND target = 'Other Credit' THEN 1 ELSE 0 END AS TN,
CASE WHEN pred = 'Other Credit' AND target = 'Good Credit' THEN 1 ELSE 0 END AS FN
FROM (
SELECT CUSTOMER_ID, prediction(credit_DT using *) pred, credit_score_bin target
FROM test_credit_scoring) a ) b ) c;
The result is not bad:
TP FN FP TN PRECISION RECALL F1 MODEL_ACURACY FALSE_POSITIVE
---------- ---------- ---------- ----------- ---------------- ------------ ---------- --------------------------- --------------
6487 766 2674 20031 .7081104 .8943885 .7904228 .8851726 .1177714
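As a sanity check, the same metrics can be recomputed from the four counts with a few lines of Python; the numbers below are the ones returned by the confusion-matrix query above:

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    false_positive_rate = fp / (fp + tn)
    return precision, recall, f1, accuracy, false_positive_rate

# counts from the Decision Tree run above
precision, recall, f1, accuracy, fpr = classification_metrics(6487, 766, 2674, 20031)
```

The values agree with the SQL output to the precision shown.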
We can use the model with prediction and probability functions:
select
customer_id,
PREDICTION(credit_DT using *) pred,
credit_score_bin target,
PREDICTION_PROBABILITY(credit_DT using *) prob
FROM test_credit_scoring;
There are several data dictionary views built into the system to check information about your models (see here and here for better explanations).
ALL_MINING_MODELS - general information about each model.
ALL_MINING_MODEL_ATTRIBUTES - information about the columns of the training data that were used to build the model.
ALL_MINING_MODEL_SETTINGS - the settings of the mining model. (You can check possible setting values here instead of in the documentation.)
ALL_MINING_MODEL_VIEWS - describes the model-specific views. See examples below.
Oracle also creates Model Detail Views for our convenience, which describe the model.
Each view is named DM$V + One_Letter + Your_Model_Name. For example: DM$VGcredit_DT
Here are a few views relevant to our model:
DM$VOcredit_DT (the tree nodes)
DM$VPcredit_DT (tree nodes and attributes, including surrogate splits; a surrogate split is used when the value of the primary split attribute is missing)
DM$VTcredit_DT (the target distribution per node, with the target weight if one exists)
DM$VIcredit_DT (tree nodes and target support counts)
Comparing possible classification models.
As noted above, in Oracle DB 19 classification models can use Naive Bayes, Neural Network, Decision Tree, Logistic Regression, Random Forest, Support Vector Machines, or Explicit Semantic Analysis; the default is Naive Bayes. Here are their formal names:
- ALGO_NAIVE_BAYES
- ALGO_NEURAL_NETWORK
- ALGO_DECISION_TREE
- ALGO_GENERALIZED_LINEAR_MODEL
- ALGO_RANDOM_FOREST
- ALGO_SUPPORT_VECTOR_MACHINES
- ALGO_EXPLICIT_SEMANTIC_ANALYS
In the previous section we created a model based on a Decision Tree, using the following SQL code.
We updated the settings table:
Begin
Delete from CR_SCR_set_DT;
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.algo_name, dbms_data_mining.algo_Decision_Tree);
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
Commit;
End;
/
Created the Model:
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'credit_DT',
mining_function => dbms_data_mining.classification,
data_table_name => 'train_credit_scoring',
case_id_column_name => 'CUSTOMER_ID',
target_column_name => 'credit_score_bin',
settings_table_name => 'CR_SCR_set_DT');
END;
/
Evaluated the model:
SELECT
TP, FN, FP, TN, TP/(TP+FP) PRECISION, TP/(FN + TP) RECALL,
2*TP/(FN+2*TP+FP) F1, (TP+TN)/(TP+FP+TN+FN) Model_Acuracy, FP/(FP+TN)
False_Positive
FROM (
SELECT SUM(TP) TP, SUM(FP) FP, SUM(TN) TN, SUM(FN) FN
FROM (
SELECT CASE WHEN pred = 'Good Credit' AND target = 'Good Credit' THEN 1 ELSE 0 END AS TP,
CASE WHEN pred = 'Good Credit' AND target = 'Other Credit' THEN 1 ELSE 0 END AS FP,
CASE WHEN pred = 'Other Credit' AND target = 'Other Credit' THEN 1 ELSE 0 END AS TN,
CASE WHEN pred = 'Other Credit' AND target = 'Good Credit' THEN 1 ELSE 0 END AS FN
FROM (
SELECT CUSTOMER_ID, prediction(credit_DT using *) pred, credit_score_bin target
FROM test_credit_scoring) a ) b ) c;
If you want to test other algorithms, one option is to reuse the existing model; in that case all you need to change is the algorithm name in the settings table.
I prefer to create a new settings table for each algorithm and update the code accordingly: a new settings table and a new model name. For example, here is my code for Naive Bayes (note the changed algorithm, table, and model names):
create table CR_SCR_set_NB (
setting_name varchar2(50),
setting_value VARCHAR2(50)
);
Begin
Delete from CR_SCR_set_NB;
INSERT INTO CR_SCR_set_NB (setting_name, setting_value)
Values(dbms_data_mining.algo_name, dbms_data_mining.algo_Naive_Bayes);
INSERT INTO CR_SCR_set_NB (setting_name, setting_value)
Values(dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
Commit;
End;
/
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'credit_NB',
mining_function => dbms_data_mining.classification,
data_table_name => 'train_credit_scoring',
case_id_column_name => 'CUSTOMER_ID',
target_column_name => 'credit_score_bin',
settings_table_name => 'CR_SCR_set_NB');
END;
/
SELECT
TP, FN, FP, TN, TP/(TP+FP) PRECISION, TP/(FN + TP) RECALL,
2*TP/(FN+2*TP+FP) F1, (TP+TN)/(TP+FP+TN+FN) Model_Acuracy, FP/(FP+TN)
False_Positive
FROM (
SELECT SUM(TP) TP, SUM(FP) FP, SUM(TN) TN, SUM(FN) FN
FROM (
SELECT CASE WHEN pred = 'Good Credit' AND target = 'Good Credit' THEN 1 ELSE 0 END AS TP,
CASE WHEN pred = 'Good Credit' AND target = 'Other Credit' THEN 1 ELSE 0 END AS FP,
CASE WHEN pred = 'Other Credit' AND target = 'Other Credit' THEN 1 ELSE 0 END AS TN,
CASE WHEN pred = 'Other Credit' AND target = 'Good Credit' THEN 1 ELSE 0 END AS FN
FROM (
SELECT CUSTOMER_ID, prediction(credit_NB using *) pred, credit_score_bin target
FROM test_credit_scoring) a ) b ) c;
The NB results are quite good in our case:
TP FN FP TN PRECISION RECALL F1 MODEL_ACURACY FALSE_POSITIVE
--------------- ---------- ---------- ------- ------------------ ----------------- ------------- ----------------------------- -----------------------
7008 245 3154 19551 .689628026 .966220874 .804823428 .886541158 .138912134
If you are curious, here are the results of all 7 algorithms on this case:
Naturally, the Model Detail Views are different for every algorithm. We can find more info in the documentation here.
DBMS_DATA_MINING.CREATE_MODEL2 (update)
There is a newer version of CREATE_MODEL named CREATE_MODEL2, with slightly different parameters. Its main advantages are the ability to pass a settings list instead of a physical settings table, and the use of a data query instead of a fixed data table.
BEGIN
DBMS_DATA_MINING.DROP_MODEL(model_name => 'credit_DT');
EXCEPTION WHEN OTHERS THEN NULL; -- to prevent the error when no model
END;
/
DECLARE
v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
v_setlst('PREP_AUTO') := 'ON';
v_setlst('ALGO_NAME') := 'ALGO_DECISION_TREE';
DBMS_DATA_MINING.CREATE_MODEL2(
model_name => 'credit_DT',
mining_function => dbms_data_mining.classification,
data_query => 'select * from train_credit_scoring',
set_list => v_setlst,
case_id_column_name=> 'CUSTOMER_ID',
target_column_name => 'credit_score_bin'
);
END;
/
I wanted a standard script to find the optimal algorithm. For that purpose I will combine CREATE_MODEL2 with a FOR loop: in the loop I create a model with each algorithm and insert that algorithm's performance into a table. I named the table alg_results.
drop table alg_results;
create table alg_results (
algo varchar2(60),
TP Number(16,6), FN Number(16,6), FP Number(16,6), TN Number(16,6),
PRECISION Number(16,6), RECALL Number(16,6), F1 Number(16,6),
Model_Acuracy Number(16,6), False_Positive Number(16,6)
);
Next I wrap the previous code in a FOR loop, use a different algorithm each time, and insert the results into our alg_results table.
DECLARE
v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
delete from alg_results;
FOR I IN 1..7 LOOP
BEGIN
DBMS_DATA_MINING.DROP_MODEL(model_name => 'credit_tmp');
EXCEPTION WHEN OTHERS THEN NULL;
END;
BEGIN
v_setlst('PREP_AUTO') := 'ON';
v_setlst('ALGO_NAME') :=
case i when 1 then 'ALGO_DECISION_TREE'
       when 2 then 'ALGO_NAIVE_BAYES'
       when 3 then 'ALGO_NEURAL_NETWORK'
       when 4 then 'ALGO_GENERALIZED_LINEAR_MODEL'
       when 5 then 'ALGO_RANDOM_FOREST'
       when 6 then 'ALGO_SUPPORT_VECTOR_MACHINES'
       else 'ALGO_EXPLICIT_SEMANTIC_ANALYS' end;
DBMS_DATA_MINING.CREATE_MODEL2(
model_name => 'credit_tmp',
mining_function => dbms_data_mining.classification,
data_query => 'select * from train_credit_scoring',
set_list => v_setlst,
case_id_column_name=> 'CUSTOMER_ID',
target_column_name => 'credit_score_bin'
);
END;
BEGIN
insert into alg_results (ALGO, TP, FN, FP, TN, PRECISION, RECALL, F1, Model_Acuracy, False_Positive)
SELECT case i when 1 then 'ALGO_DECISION_TREE'
              when 2 then 'ALGO_NAIVE_BAYES'
              when 3 then 'ALGO_NEURAL_NETWORK'
              when 4 then 'ALGO_GENERALIZED_LINEAR_MODEL'
              when 5 then 'ALGO_RANDOM_FOREST'
              when 6 then 'ALGO_SUPPORT_VECTOR_MACHINES'
              else 'ALGO_EXPLICIT_SEMANTIC_ANALYS' end ALGO,
TP, FN, FP, TN, TP/(TP+FP) PRECISION, TP/(FN + TP) RECALL, 2*TP/(FN+2*TP+FP) F1, (TP+TN)/(TP+FP+TN+FN) Model_Acuracy, FP/(FP+TN) False_Positive
FROM (
SELECT SUM(TP) TP, SUM(FP) FP, SUM(TN) TN, SUM(FN) FN
FROM (
SELECT CASE WHEN pred = 'Good Credit' AND target = 'Good Credit' THEN 1 ELSE 0 END AS TP,
CASE WHEN pred = 'Good Credit' AND target = 'Other Credit' THEN 1 ELSE 0 END AS FP,
CASE WHEN pred = 'Other Credit' AND target = 'Other Credit' THEN 1 ELSE 0 END AS TN,
CASE WHEN pred = 'Other Credit' AND target = 'Good Credit' THEN 1 ELSE 0 END AS FN
FROM (
SELECT CUSTOMER_ID, prediction(credit_tmp using *) pred, credit_score_bin target
FROM test_credit_scoring) a ) b ) c;
commit;
END;
END LOOP;
end;
/
The entire process took less than 2 minutes on my laptop (109 seconds).
Now it is easy to compare the performance of each algorithm:
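The comparison itself is just an ORDER BY on alg_results; outside the database it reduces to taking the max over (algorithm, metric) pairs. Using the two F1 scores measured earlier in this post (the remaining algorithms would be appended the same way from alg_results):

```python
# F1 scores measured earlier in this post (Decision Tree and Naive Bayes)
results = [
    ("ALGO_DECISION_TREE", 0.7904228),
    ("ALGO_NAIVE_BAYES", 0.804823428),
]
# pick the algorithm with the highest F1 score
best_algo, best_f1 = max(results, key=lambda row: row[1])
```

Of these two, Naive Bayes wins on F1; which metric to rank by depends on the business case, as discussed above.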
Finding optimal Hyper-parameters
We can define various parameters in the settings table, with a different list for every algorithm (see here for details).
I will stay with the Decision Tree, since it is easier to understand.
Here is the list of its specific hyper-parameters:
I can always set specific values such as the maximum tree depth (TREE_TERM_MAX_DEPTH, default 7) and the minimum number of rows in a node (TREE_TERM_MINREC_NODE, default 10). For example, to try specific values:
Begin
Delete from CR_SCR_set_DT;
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.algo_name, dbms_data_mining.algo_Decision_Tree);
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.TREE_TERM_MAX_DEPTH, 10);
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.TREE_TERM_MINREC_NODE, 20);
Commit;
End;
/
I would like to find the optimal values of these two: TREE_TERM_MAX_DEPTH (values 5-15) and TREE_TERM_MINREC_NODE (values 10-50, in steps of 10).
I can run the four steps in a PL/SQL FOR loop: update the settings table, drop the previous model (assuming one exists), build a new model based on the latest settings, and insert the model-evaluation metrics into a results table.
First I drop and re-create the results table:
drop table DT_results;
create table DT_results (
Depth Number, Split NUMBER(16,6),
TP Number(16,6), FN Number(16,6), FP Number(16,6), TN Number(16,6),
PRECISION Number(16,6), RECALL Number(16,6), F1 Number(16,6),
Model_Acuracy Number(16,6), False_Positive Number(16,6)
);
Run 2 "FOR" loops:
declare J1 number(16,6); -- no reason for this format
BEGIN
Delete from DT_RESULTS;
FOR I IN 5..15 LOOP
FOR J IN 1..5 LOOP
J1:=J*10;
Begin
Delete from CR_SCR_set_DT;
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.algo_name, dbms_data_mining.algo_Decision_Tree);
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.TREE_TERM_MAX_DEPTH, I);
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.TREE_TERM_MINREC_NODE, J1);
Commit;
End;
BEGIN
DBMS_DATA_MINING.DROP_MODEL(model_name => 'credit_DT');
EXCEPTION WHEN OTHERS THEN NULL; -- to prevent the error when no model
END;
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'credit_DT',
mining_function => dbms_data_mining.classification,
data_table_name => 'train_credit_scoring',
case_id_column_name => 'CUSTOMER_ID',
target_column_name => 'credit_score_bin',
settings_table_name => 'CR_SCR_set_DT');
END;
BEGIN
insert into DT_results (Depth, Split, TP, FN, FP, TN, PRECISION, RECALL, F1, Model_Acuracy, False_Positive)
SELECT I DEPTH, J1 Split, TP, FN, FP, TN, TP/(TP+FP) PRECISION, TP/(FN + TP) RECALL, 2*TP/(FN+2*TP+FP) F1, (TP+TN)/(TP+FP+TN+FN) Model_Acuracy, FP/(FP+TN) False_Positive
FROM (
SELECT SUM(TP) TP, SUM(FP) FP, SUM(TN) TN, SUM(FN) FN
FROM (
SELECT CASE WHEN pred = 'Good Credit' AND target = 'Good Credit' THEN 1 ELSE 0 END AS TP,
CASE WHEN pred = 'Good Credit' AND target = 'Other Credit' THEN 1 ELSE 0 END AS FP,
CASE WHEN pred = 'Other Credit' AND target = 'Other Credit' THEN 1 ELSE 0 END AS TN,
CASE WHEN pred = 'Other Credit' AND target = 'Good Credit' THEN 1 ELSE 0 END AS FN
FROM (
SELECT CUSTOMER_ID, prediction(credit_DT using *) pred, credit_score_bin target
FROM test_credit_scoring) a ) b ) c;
commit;
END;
END LOOP;
END LOOP;
END;
/
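For reference, the two nested loops above walk an 11 x 5 grid of 55 (max depth, min rows per node) combinations, so 55 models are built and scored. The equivalent grid in Python:

```python
from itertools import product

# the same grid the PL/SQL loops walk: depth 5..15, min rows per node 10..50 step 10
grid = list(product(range(5, 16), range(10, 51, 10)))
```

At roughly 4-5 seconds per Decision Tree build on this dataset, the whole sweep stays in the minutes range.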
Now I can test the results:
Apparently, TREE_TERM_MAX_DEPTH works best with a shallow tree of depth 5-6, while changing TREE_TERM_MINREC_NODE does not influence the results.
Testing Partition on a model
The idea of partitioning is to build a sub-model for each combination of values of the partition columns. It was nicely covered here by Brendan Tierney, who also suggested running it in parallel here.
I decided to try partitioning on Loan_type and Marital_status, which should create 20 sub-models. The only change is one extra row in the settings table (the odms_partition_columns setting):
Begin
Delete from CR_SCR_set_DT;
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.algo_name, dbms_data_mining.algo_Decision_Tree);
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
Values(dbms_data_mining.TREE_TERM_MAX_DEPTH, 6);
INSERT INTO CR_SCR_set_DT (setting_name, setting_value)
VALUES(dbms_data_mining.odms_partition_columns, 'LOAN_TYPE, MARITAL_STATUS');
Commit;
End;
/
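Conceptually, a partitioned model behaves like a dictionary of sub-models keyed by the partition columns: training groups the rows by key, and scoring routes each row to the sub-model for its (loan_type, marital_status) pair. A minimal Python sketch of the grouping step, with hypothetical column values (the real dataset's categories may differ):

```python
def partition_rows(rows, key_cols):
    """Group training rows by partition key; one sub-model would be built per group."""
    partitions = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        partitions.setdefault(key, []).append(row)
    return partitions

# hypothetical mini training set for illustration
rows = [
    {"loan_type": "Auto", "marital_status": "Single"},
    {"loan_type": "Auto", "marital_status": "Married"},
    {"loan_type": "Home", "marital_status": "Single"},
    {"loan_type": "Auto", "marital_status": "Single"},
]
partitions = partition_rows(rows, ("loan_type", "marital_status"))
```

With 4 loan types and 5 marital statuses in the data, the same grouping yields the 20 sub-models mentioned above.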
Surprisingly, it did not improve the results in this case.
What Else can we do?
There are many more options I could try, for example specifying cost as described here.
A cost matrix is a mechanism for influencing the decision making of a model. A cost matrix can cause the model to minimize costly misclassifications. It can also cause the model to maximize beneficial accurate classifications. For example, if a model classifies a customer with poor credit as low risk, this error is costly. A cost matrix could bias the model to avoid this type of error. The cost matrix might also be used to bias the model in favor of the correct classification of customers who have the worst credit history.
ROC is a useful metric for evaluating how a model behaves with different probability thresholds. You can use ROC to help you find optimal costs for a given classifier given different usage scenarios. You can use this information to create cost matrices to influence the deployment of the model.
Like a confusion matrix, a cost matrix is an n-by-n matrix, where n is the number of classes. Both confusion matrices and cost matrices include each possible combination of actual and predicted results based on a given set of test data.
A confusion matrix is used to measure accuracy, the ratio of correct predictions to the total number of predictions. A cost matrix is used to specify the relative importance of accuracy for different predictions. In most business applications, it is important to consider costs in addition to accuracy when evaluating model quality.
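The decision rule a cost matrix induces can be sketched directly: instead of picking the most probable class, pick the class with the lowest expected cost. Below is a Python sketch with a hypothetical cost matrix (the 5:1 cost ratio is invented for illustration) in which predicting Good Credit for an actually risky customer is five times as costly as the reverse error:

```python
def min_cost_class(probs, cost):
    """probs: actual-class probabilities; cost[predicted][actual]: misclassification cost."""
    def expected_cost(predicted):
        return sum(probs[actual] * cost[predicted][actual] for actual in probs)
    return min(probs, key=expected_cost)

# hypothetical costs: a false "Good Credit" costs 5, a false "Other Credit" costs 1
cost = {
    "Good Credit":  {"Good Credit": 0, "Other Credit": 5},
    "Other Credit": {"Good Credit": 1, "Other Credit": 0},
}
probs = {"Good Credit": 0.75, "Other Credit": 0.25}
decision = min_cost_class(probs, cost)
```

Even though Good Credit is the more probable class (0.75), its expected cost (0.25 * 5 = 1.25) exceeds that of Other Credit (0.75 * 1 = 0.75), so the cost-aware decision flips, which is exactly the biasing effect described above.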
Oracle Autonomous Data Warehouse (ADW)
You can see the same example in the Oracle Autonomous Data Warehouse machine-learning notebook classification example. If you don't have access to ADW, see the workshop here.