There are many situations in which clients need to predict a non-numerical outcome. For instance, a bank might want to predict whether a loan will default based on a variety of attributes of the borrower. Another example is an email filter that predicts whether a message is regular mail or spam. Similarly, doctors might want to predict a specific disease based on a number of symptoms. For tasks like these, we can apply a C5.0 decision tree, a very popular algorithm originally created by computer scientist J. Ross Quinlan. Of course, there are other machine learning techniques, such as topological analysis or neural networks, which perform better in many cases. Still, the C5.0 algorithm is accurate, practical, easy to understand and applicable to a wide range of problems.
In the following analysis, I will use the C5.0 algorithm to predict whether a loan defaults. To do so, I will apply the algorithm to a dataset of defaulted and non-defaulted home equity loans provided by Credit Risk Analytics. The dataset includes the variables “default/non-default”, “credit amount”, “property”, “amount due on existing mortgage”, “job”, “years at present job”, “number of derogatory reports”, “number of delinquent credit lines”, “oldest credit line (in months)”, “number of credit inquiries”, “number of credit lines” and “debt-to-income ratio”.
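To give a rough feel for how a tree of this family decides which variable to split on, here is a minimal Python sketch of the gain ratio, the information-based criterion used by Quinlan's C4.5, the predecessor of C5.0. It is not the C5.0 implementation itself. The handful of loan rows, the category values and the function names are made up purely for illustration and are not taken from the dataset above.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, attribute, target):
    """Information gain of splitting on `attribute`, divided by the split information."""
    # Group the target labels by the value of the candidate attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])

    total = len(rows)
    # Weighted entropy of the target that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy([row[target] for row in rows]) - remainder

    # Split information penalises attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Made-up toy rows (NOT real data), using two of the variables named above.
loans = [
    {"job": "office", "delinquent_lines": "0", "default": "no"},
    {"job": "office", "delinquent_lines": "1+", "default": "yes"},
    {"job": "sales", "delinquent_lines": "0", "default": "no"},
    {"job": "sales", "delinquent_lines": "1+", "default": "yes"},
    {"job": "office", "delinquent_lines": "0", "default": "no"},
    {"job": "sales", "delinquent_lines": "0", "default": "yes"},
]

for attribute in ("job", "delinquent_lines"):
    print(attribute, round(gain_ratio(loans, attribute, "default"), 3))
```

In this toy example, the number of delinquent credit lines separates defaults from non-defaults far better than the job category, so a tree built this way would split on it first.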