There are a variety of situations, in which clients need to predict non-numerical factors. For instance, banks might be interested in predicting whether a loan will be default or non-default based on a variety of attributes of a borrower. Another example would be a mailing filter that predicts whether an email is a regular mail or a spam mail. Or, doctors might want to predict a specific disease based on a number of symptoms. In order to do this, we can apply a C5.0 decision tree, a very popular algorithm that was originally created by computer scientist J. Ross Quinlan. Of course, there are other machine learning techniques such as topological analysis or neural networks, which perform much better in many cases, but the C5.0 algorithm is very accurate and practical, easy to understand and applicable to a variety of problems.

In the following analysis I will give the C5.0 algorithm a try in order to predict credit default/non-default. In doing so, I will apply the algorithm to a dataset on default and non-default home equity loans provided by Credit Risk Analytics. The dataset includes the variables “default/non-default”, “credit amount”, “property”, “amount due on existing mortgage”, “job”, “years at present job”, “number of derogatory reports”, “number of delinquent credit lines”, “oldest credit line (in months)”, “number of credit inquiries”, “number of credit lines” and “debt-to-income ratio”.

First, I will give a theoretical overview of decision trees and the C5.0 algorithm more particular. Before I will analyze the data I will split the dataset into two parts, one that will train the model and one that will test the algorithm. As soon as I have trained and tested the model I will show some of the improvements of the C5.0 algorithm compared to the C4.5 algorithm.

### Theoretical background

The C5.0 algorithm is cited amongst the widely used decision tree algorithms in data mining and machine learning. Basically, decision trees are based on the concept of classification, which means that a number of records (observations) have a variety of attributes (x) that are assigned to a certain class (y). In this case y has two classes: “default” and “non-default”, of which the former is labelled by the value 0 in the dataset and the latter by the value 1.

In the inductive process, which is the stage a decision tree model is trained, attributes are dedicated to the class and its labels. In the deductive process, which is the stage the model will be tested on new observations, the labels of the class are assigned to the attributes. The following figure shows the process described above.

The C5.0 decision tree is dealing with classification problems. In order to predict the classes the algorithm consists of a variety techniques, many of which are part of other algorithms as well:

(1) Decision tree algorithms produce subsets of the data and split the observations into different groups with the same classes. The degree to which a group of observations consists of the same class is called purity. The more observations of a group have the same class, the higher the purity of the subset. The C5.0 algorithm uses the concept of *entropy* (E) to measure the purity of a subset, a term that comes from information theory and quantifies the randomness of the class labels within a data segment.

If there are only two classes, as it is in the following analysis (default and non-default), *E(S)* can range between 0 (high purity) and 1 (low purity). If there are more than two classes *n* entropy ranges from 0 to *log _{2} n *. The formula below calculates the purity of a data segment:

where *E(S)* means the purity of a specific data segment or the whole dataset, *c* represents the number of class labels, and *p _{i}* reflects the proportion of observation within a class. In the dataset used in this analysis there are 3,515 loans, of which 3,206 loans are non-default (91.21 %) and 309 default (8.79%). Using the formula above, the purity of the whole dataset can be calculated as:

The function of the formula above can be illustrated by the following curve. Note that the highest purity (1) occurs when *x* lies at 0.5., which means the data segment consists of two equally large groups of classes. If one of the attributes can split the data segment into two halfs separating the two classes, the entropy of each of the two segments would go down to 0:

In order to decide on which attribute the data segment should be split, the algorithm computes the change in entropy resulting from a split on each of the possible attributes, which is called as information gain (A). In other words, the change in entropy is the difference between the entropy of the split before and the entropy of the segment after the split. In mathematical terms this can be calculated by the following formula:

where *E(S)* describes the entropy of the segment before the split and *E(S|A)* reflects the entropy of the segments after the split. What complicates matters is that a split results in more than one segment, which is why we also need the so-called conditional information entropy. *E(S|A)*, therefore, describes the sum of all the entropies after the split weighted by the proportion of the observations falling into each of the segment. *E(S|A)* can be calculated by the following formula:

To explain the conditional information entropy, let’s go into the details. After we have calculated the overall entropy or the entropy of the split before, we take the first attribute – let’s say – Job (C1) with the 6 distinct values “manager” (A1), “office” (A2), “prof/exe” (A3), “sales” (A4), “self-employed” (A5) and “other” (A6) (v = 6). Then we aggregate all observations with the value A2 *(p’ _{1})*. In this case we have 601 office workers of a total number of 3,515 observations which will be split into one group that includes default cases

*(p*– in this case 38 observations – and one that consists of non-default observations

_{01})*(p*– in this case 563.

_{11})Now we calculate three proportions: the proportion of the observations with attribute A2 in the dataset (601/3,515), the proportion of the default cases in the number of observation with the attribute A2 (38/601) and the proportion of the non-default cases in the number of cases with the attribute A2 (563/601). We repeat the same procedure with A1, A3, A4, A5 and A6 and insert the proportions into the conditional information entropy formula. After that we calculate the information gain (see formula above), by which we can further calculate the information gain ratio. The information gain ratio can be calculated as follows:

These steps have to be repeated with all the other attributes and their values. Then, each of the information gain ratios are compared with each other and the attribute with the highest information gain ratio will be used as a node that splits the dataset. If C1 reflects the highest information gain ratio, the dataset will be split into six subsets based on the six different values. After that the same procedure will be repeated with the subsets as long as one of the three conditions are met: (1) all the observations in the subset have the same class; (2) there are no attributes left that can divide the subsets and (3) all observations have the same value of an attribute but still belong to different classes.

Considering the formula above, the higher the homogeneity of the segments after the split, the higher the information gain. If the split results in completely homogeneous data segments with each of them having only one class label, the information gain of the split equals the information gain of the split before. If the information gain equals 0, by contrast, the split did not result in a reduction of entropy.

(2) Another important feature of the C5.0 algorithm is that it does not only deal with categorical variables but also with numerical ones. The algorithm basically splits numerical variables into two segments based on a numerical threshold that results in the best information gain. In other words, the algorithm produces two categorical labels, one segment that is greater (>) than the threshold and one that is less (<) than the threshold.

(3) The formulas above are used by the C5.0 algorithm to create rules determining the decision tree. These rules are “if then” statements, each of which represents a node in the decision tree. For instance, if X_{1} > 45,000 then class = 1. If X_{1} < 45,000 then class = 0.

(4) As compared to C5.0’s predecessor C4.5 and other decision tree algorithms, the C5.0 algorithm allows to improve the results by a technique that is called boosting. Shapire and Freund, who have extensively written on the issue, use a funny metaphor to explain boosting technology. To boost the performance of a council with a high number of incompetent politicians, as their metaphor goes, a smart committee should be formed by grossly incompetent but carefully selected politicians.

To use a more technical language, the goal of boosting is to create an accurate rule based on a combination of rough or moderately accurate rules of thumb. In other words, the idea behind the boosting technology is to focus on rules that misclassify the most observations and to combine them in a way that are capable of producing more accurate rules. If you are interested in how to do this exactly read this paper here since a detailed description would go much beyond the scope of this analysis.

(5) In contrast to the predecessor algorithm C4.5 the C5.0 algorithm allows us to implement a cost matrix into the algorithm. In many cases a specific type of error has higher costs than others. In our case, for instance, predicting a default loan as non-default would have much higher costs for a bank than predicting a non-default loan as default. The costs of a certain misclassification represents the weight of some given result, which are then transformed into a factor that changes the results of the model.

### Model training

Before we can predict default and non-default loans of new clients, we will train the C5.0 algorithm in the following section. Most of the R codes to apply the algorithm were borrowed from Brett Lantz’s introductory book Machine Learning with R.

First of all, we need to define the working directory and to install and activate the main packages for our analysis in the R terminal.

```
getwd()
[1] "/home/chris"
setwd("/home/chris/R/Credit_Scoring")
getwd()
[1] "/home/chris/R/Credit_Scoring"
install.packages("C50")
library(C50)
install.packages("gmodels")
library(gmodels)
```

In the next step we read the dataset into R, which should be located in the defined working directory. Note that the following dataset is a .csv text file dataset called “hmeq.csv”, in which the columns are separated by a comma. Then we inspect the structure of the dataset and define the type of those variables which could not be automatically detected by R.

```
mydata str(mydata)
'data.frame': 5960 obs. of 13 variables:
$ Default : int 1 1 1 1 0 1 1 1 1 1 ...
$ Amount_Loan : int 1100 1300 1500 1500 1700 1700 1800 1800 2000 2000 ...
$ Amount_Mortgage : num 25860 70053 13500 NA 97800 ...
$ Property : num 39025 68400 16700 NA 112000 ...
$ Reason : Factor w/ 3 levels "","DebtCon","HomeImp": 3 3 3 1 3 3 3 3 3 3 ...
$ Job : Factor w/ 7 levels "","Mgr","Office",..: 4 4 4 1 3 4 4 4 4 6 ...
$ Job_Years : num 10.5 7 4 NA 3 9 5 11 3 16 ...
$ Derogatory_Reports : int 0 0 0 NA 0 0 3 0 0 0 ...
$ No_Delinquent : int 0 2 0 NA 0 0 2 0 2 0 ...
$ Age_Credit_Line : num 94.4 121.8 149.5 NA 93.3 ...
$ No_Recent_Inquiries: int 1 0 1 NA 0 1 1 0 1 0 ...
$ No_Credit_Lines : int 9 14 10 NA 14 8 17 8 12 13 ...
$ Debt_Income_Ration : num NA NA NA NA NA ...
mydata$Default = as.factor(mydata$Default)
mydata$Amount_Loan = as.numeric(mydata$Amount_Loan)
mydata$Derogatory_Reports = as.numeric(mydata$Derogatory_Reports)
mydata$No_Delinquent = as.numeric(mydata$No_Delinquent)
mydata$No_Recent_Inquiries = as.numeric(mydata$No_Recent_Inquiries)
mydata$No_Credit_Lines = as.numeric(mydata$No_Credit_Lines)
```

Although the original C5.0 algorithm performs very well with missing cases, the C5.0 package in R often stumbles over these cases, which is why we redefine the missing categorical variables and exclude those observations with missing numerical cases.

```
levels(mydata$Reason)[1] = "missing"
levels(mydata$Job)[1] = "missing"
mydata <- mydata[complete.cases(mydata), ]
```

Looking at the structure of the dataset we can see that the number of observations is reduced from 5,960 observations to 3,515 cases. In the next step we randomly split the dataset into a training sample and a testing sample. In this case I will use 80% of the observations (2,812 cases) as a training sample and 20% as a testing sample.

```
set.seed(1234)
train_sample <- sample(3515, 2812)
mydata_train <- mydata[train_sample, ]
mydata_test <- mydata[-train_sample, ]
```

After that we need to check whether our variable of interest (default/non-default) has the same or similar distribution in both samples. If yes, an important precondition to conduct our analysis is fulfilled.

```
prop.table(table(mydata_train$Default))
0 1
0.91536273 0.08463727
prop.table(table(mydata_test$Default))
0 1
0.8990043 0.1009957
```

Now we can start with creating the decision tree, which will be done by the C5.0 package in R. Note that the default/non-default is in the first column of the dataset. If it was in the 12th column, we would have to write [-12] in the code.

```
model <- C5.0(mydata_train[-1], mydata_train$Default)
model
Call:
C5.0.default(x = mydata_train[-1], y = mydata_train$Default)
Classification Tree
Number of samples: 2812
Number of predictors: 12
Tree size: 42
Non-standard options: attempt to group attributes
```

As can be seen in the output, there are 2,812 observations, 12 predictors and a tree size of 42. Now we will generate the decision tree with the following code.

```
summary(model)
```

The code above will generate the following output:

```
Call:
C5.0.default(x = mydata_train[-1], y = mydata_train$Default)
C5.0 [Release 2.07 GPL Edition] Tue Feb 13 12:20:32 2018
-------------------------------
Class specified by attribute `outcome'
Read 2812 cases (13 attributes) from undefined.data
Decision tree:
Debt_Income_Ratio > 43.69152:
:...Age_Credit_Line 236.9968:
: :...Debt_Income_Ratio 45.56984: 1 (5)
Debt_Income_Ratio 3:
:...No_Delinquent > 4: 1 (13)
: No_Delinquent 35: 1 (6)
: No_Credit_Lines 28: 0 (16)
: No_Credit_Lines <= 28:
: :...Job = missing: 0 (0)
: Job in {Office,ProfExe,Self}: 1 (5)
: Job in {Mgr,Other,Sales}:
: :...Derogatory_Reports 2:
: :...Debt_Income_Ratio 26.69616: 1 (4)
Derogatory_Reports 1:
:...Job in {missing,Sales}: 0 (0)
: Job = Self: 1 (3)
: Job in {Mgr,Office,Other,ProfExe}:
: :...Age_Credit_Line 90.44952:
: :...Age_Credit_Line 285.3861:
: :...Reason = missing: 0 (0)
: Reason = HomeImp: 1 (3)
: Reason = DebtCon:
: :...Amount_Mortgage 65303: 1 (2)
No_Delinquent 35: 1 (5)
: No_Credit_Lines <= 35:
: :...Age_Credit_Line 109.4294: 0 (26)
Job in {missing,Mgr,Office,Other,ProfExe,Self}:
:...No_Credit_Lines 26200: 0 (33)
: Amount_Loan 9:
:...No_Recent_Inquiries 3:
:...Property 72566:
:...No_Delinquent > 0:
:...Amount_Loan 23300: 0 (3)
No_Delinquent <= 0:
:...Amount_Mortgage 107766:
:...Debt_Income_Ratio 37.47941: 0 (9)
Evaluation on training data (2812 cases):
Decision Tree
----------------
Size Errors
39 94( 3.3%) <<
(a) (b) <-classified as
---- ----
2567 7 (a): class 0
87 151 (b): class 1
Attribute usage:
100.00% Debt_Income_Ratio
97.23% No_Delinquent
96.34% Derogatory_Reports
95.55% Job
92.53% No_Credit_Lines
82.65% No_Recent_Inquiries
8.78% Age_Credit_Line
6.08% Amount_Loan
6.01% Reason
4.20% Property
3.81% Amount_Mortgage
```

As can be seen in the output, there are 94 observations which could not be classified correctly. The attribute usage shows the importance of each of the attributes as a classifier. Furthermore, the output reveals that 7 of actual non-default loans were misclassified as default loans and 87 actual default loans were misclassified as non-default loans. Now this model will be tested on the testing sampling.

### Model testing

```
model_pred <- predict(model, mydata_test)
CrossTable(mydata_test$Default, model_pred,
+ prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
+ dnn = c('actual default', 'predicted default'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 703
| predicted default
actual default | 0 | 1 | Row Total |
---------------|-----------|-----------|-----------|
0 | 623 | 9 | 632 |
| 0.886 | 0.013 | |
---------------|-----------|-----------|-----------|
1 | 32 | 39 | 71 |
| 0.046 | 0.055 | |
---------------|-----------|-----------|-----------|
Column Total | 655 | 48 | 703 |
---------------|-----------|-----------|-----------|
```

As can be seen in the table of the output, the model predicted 623 of a total number of 632 actual non-default loans correctly (98.57 %) and predicted 9 actual non-default loans as default. However, the error rate with the default loans is much higher. 39 cases of a total number of 71 actual default loans were predicted correctly (54.92%), while 32 actual default loans were incorrectly predicted as non-default. Although the overall error rate lies at 5.83 %, which sounds very high, the costs of 32 incorrect predictions of a total number 71 actual default loans might not be acceptable for a bank. For this reason I will use the boosting technology to check whether it can improve the model. First, I boost the model on the training set before I test the boosted model on the test data .

```
model_boost <- C5.0(mydata_train[-1], mydata_train$Default, trials = 10)
model_boost
summary(model_boost)
...
(a) (b) <-classified as
---- ----
2574 (a): class 0
43 195 (b): class 1
```

The output of the `summary(model_boost)`

command shows the decision trees of the ten trials and summarizes the performance of the boosted model. As the output shows, the boosted model could reduce the misclassification on the training data compared to the original model. Whereas the original model misclassified 7 non-default loans as default loans and 87 default loans as non-default, the boosted model resulted in no misclassification of the actual non-default loans and reduced the misclassification of actual default loans from 87 cases to 43. Now we have to test whether the boosted model will also result in a better error rate on the testing data.

```
credit_boost_predict <- predict(model_boost, mydata_test)
CrossTable(mydata_test$Default, credit_boost_predict,
+ prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
+ dnn = c('actual default', 'predicted default'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 703
| predicted default
actual default | 0 | 1 | Row Total |
---------------|-----------|-----------|-----------|
0 | 632 | 0 | 632 |
| 0.899 | 0.000 | |
---------------|-----------|-----------|-----------|
1 | 34 | 37 | 71 |
| 0.048 | 0.053 | |
---------------|-----------|-----------|-----------|
Column Total | 666 | 37 | 703 |
---------------|-----------|-----------|-----------|
```

As compared to the original model, the boosted model performed better in predicting the actual non-default but performed worse than in the original model. Whereas in the original model 32 cases were misclassified, the boosted model misclassified 34 actual default loans as non-default. The reason for this could be that the training data is very noisy. Nevertheless, the C5.0 algorithm allows us to implement a cost matrix, which is included in the C5.0 R package as the `costs`

function. Let’s assume that the costs of predicting an actual default loan as non-default is four times higher than the costs of predicting an actual non-default as default loan. The correct predictions, of course, have zero costs.

```
matrix_dim <- list(c("0", "1"), c("0", "1"))
names(matrix_dim) <- c("predicted", "actual")
matrix_dim
$predicted
[1] "0" "1"
$actual
[1] "0" "1"
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
+ dimnames = matrix_dim)
error_cost
actual
predicted 0 1
0 0 4
1 1 0
credit_cost <- C5.0(mydata_train[-1], mydata_train$Default,
+ costs = error_cost)
credit_cost_predicted <- predict(credit_cost, mydata_test)
CrossTable(mydata_test$Default, credit_cost_predicted,
+ prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
+ dnn = c('actual default', 'predicted default'))
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 703
| predicted default
actual default | 0 | 1 | Row Total |
---------------|-----------|-----------|-----------|
0 | 618 | 14 | 632 |
| 0.851 | 0.048 | |
---------------|-----------|-----------|-----------|
1 | 29 | 42 | 71 |
| 0.041 | 0.060 | |
---------------|-----------|-----------|-----------|
Column Total | 647 | 76 | 703 |
---------------|-----------|-----------|-----------|
```

The costs in the cost matrix serve as weights in the model and show a different outcome compared to the original model and the boosted model. Although the error rate of actual non-default increased from 0 to 14 incorrect predictions, the cost sensitive model reduced the errors of the actual default loans by almost 4 percentage points (from 34 to 29 cases). Due to the higher costs of false predictions of actual default loans, this could be a trade off the bank might accept.

The prediction of default loans by using the C5.0 decision tree algorithm is one of the standard problems used in a variety of research papers. The algorithm is easier to understand than topological analysis, it can deal with both categorical and numeric attributes and it is very accurate and practical in predicting categorical unknowns. For more technical information, read the papers in the literature section below.

### References

Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) *Introduction to Algorithms*, 3rd edition, MIT Press, Massachusetts, London.

Pang, Su-lin; Gong, Ji-zhang (2009) * C5.0 Classification Algorithm and Application on Individual Credit Evaluation of Banks *, in: Systems Engineering – Theory & Practice, Vol. 29, No. 12, February, pp. 94-104.

Schapire, Robert E.; Freund, Yoav (2012) * Boosting: Foundations and Algorithms*, MIT Press, Massachusetts, London.

Tan, Pang-Ning; Steinbach, Michael; Karpatne, Anuj; Kumar, Vipin (2006) * Introduction to Data Mining *, Pearson, New York.

Wu, Xindong; Kumar, Vipin; Quinlan, J.Ross; Gosh, Joydeep; Yang, Qiang; Motoda, Hiroshi; McLachlan, Geoffrey J.; Ng, Angus; Liu, Bing; Yu, Philip S.; Zhou, Zhi-Hua; Steichbach, Michael; Hand, David J.; Steinberg, Dan (2008) * Top 10 algorithms in data mining *, in: Knowledge and Information Systems, Vol. 14, No. 1, pp. 1-37.