In 1965, Gordon Moore, then director of the Research and Development Laboratories at Fairchild Semiconductor and later a co-founder of Intel Corporation, wrote an article in the magazine Electronics about future developments in the semiconductor industry. He observed that the number of components on an integrated circuit (IC) had doubled every year since the production of the first planar transistor in 1959, and he predicted that this trend would continue for at least the next ten years. Over the following decades the number of transistors on an IC increased dramatically, which drove the rapid development of new product categories.
There are a variety of situations in which clients need to predict non-numerical outcomes. For instance, a bank might want to predict whether a loan will default based on a variety of attributes of the borrower. Another example is a mail filter that predicts whether an email is regular mail or spam. Or doctors might want to predict a specific disease based on a number of symptoms. For tasks like these, we can apply a C5.0 decision tree, a very popular algorithm developed by computer scientist J. Ross Quinlan. Of course, there are other machine learning techniques, such as topological analysis or neural networks, which can perform better in many cases, but the C5.0 algorithm is often accurate as well as practical, easy to understand and applicable to a variety of problems.
In the following analysis I will give the C5.0 algorithm a try and use it to predict whether a loan defaults. To do so, I will apply the algorithm to a dataset on default and non-default home equity loans provided by Credit Risk Analytics. The dataset includes the variables “default/non-default”, “credit amount”, “property”, “amount due on existing mortgage”, “job”, “years at present job”, “number of derogatory reports”, “number of delinquent credit lines”, “oldest credit line (in months)”, “number of credit inquiries”, “number of credit lines” and “debt-to-income ratio”.
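To give a feel for how a decision tree of this family chooses a split, here is a minimal Python sketch of information gain, the entropy-based measure that underlies the splitting criteria in Quinlan's trees. It is not the C5.0 implementation, which adds refinements such as gain ratio, pruning and boosting that are left out here. The records in TOY_LOANS are made up purely for illustration and are not taken from the Credit Risk Analytics dataset; the function names and the column layout (debt-to-income ratio, number of derogatory reports, label) are my own assumptions.

```python
from collections import Counter
from math import log2

# Made-up toy records for illustration only; NOT rows from the real dataset.
# Assumed layout: (debt_to_income_ratio, derogatory_reports, label)
TOY_LOANS = [
    (25.0, 0, "non-default"),
    (30.0, 0, "non-default"),
    (32.0, 1, "non-default"),
    (35.0, 0, "non-default"),
    (38.0, 0, "non-default"),
    (41.0, 2, "default"),
    (44.0, 0, "default"),
    (45.0, 3, "default"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(rows, feature_index, threshold):
    """Entropy reduction from splitting rows at feature <= threshold."""
    labels = [row[-1] for row in rows]
    left = [row[-1] for row in rows if row[feature_index] <= threshold]
    right = [row[-1] for row in rows if row[feature_index] > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty tells us nothing
    weighted = (len(left) / len(rows)) * entropy(left) \
        + (len(right) / len(rows)) * entropy(right)
    return entropy(labels) - weighted


def best_threshold(rows, feature_index):
    """Try midpoints between sorted distinct values; keep the best one."""
    values = sorted({row[feature_index] for row in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates,
               key=lambda t: information_gain(rows, feature_index, t))


if __name__ == "__main__":
    for index, name in [(0, "debt-to-income ratio"),
                        (1, "derogatory reports")]:
        t = best_threshold(TOY_LOANS, index)
        gain = information_gain(TOY_LOANS, index, t)
        print(f"{name}: best threshold {t}, information gain {gain:.3f}")
```

In this toy data the debt-to-income ratio separates the two classes perfectly, so a tree would split on it first. On real loan data no single variable is likely to be this clean, which is why the tree keeps splitting the resulting subsets on further variables.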