There are many situations in which clients need to predict non-numerical outcomes. For instance, a bank might want to predict whether a borrower will default on a loan based on a variety of the borrower's attributes. Another example is a mail filter that predicts whether an email is regular mail or spam. Likewise, doctors might want to predict a specific disease based on a number of symptoms. For classification problems like these, we can apply a C5.0 decision tree, a very popular algorithm that was originally created by computer scientist J. Ross Quinlan. Of course, there are other machine learning techniques, such as topological analysis or neural networks, which perform much better in many cases, but the C5.0 algorithm is accurate and practical, easy to understand, and applicable to a wide variety of problems.
In the following analysis I will give the C5.0 algorithm a try and use it to predict whether a loan defaults. To do so, I will apply the algorithm to a dataset on default and non-default home equity loans provided by Credit Risk Analytics. The dataset includes the variables “default/non-default”, “credit amount”, “property”, “amount due on existing mortgage”, “job”, “years at present job”, “number of derogatory reports”, “number of delinquent credit lines”, “oldest credit line (in months)”, “number of credit inquiries”, “number of credit lines” and “debt-to-income ratio”.
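To give a feel for how a decision tree of this family chooses where to split, here is a minimal sketch of the entropy-based information gain calculation that underlies such trees. It is not the C5.0 implementation itself (which adds refinements such as gain ratio, pruning and boosting), and the toy records and the field names `job`, `derogatory` and `default` are hypothetical stand-ins for the real dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    # Weighted average entropy of the subsets produced by the split
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy records, not taken from the actual loan dataset
loans = [
    {"job": "office", "derogatory": "no", "default": "no"},
    {"job": "office", "derogatory": "yes", "default": "yes"},
    {"job": "sales", "derogatory": "no", "default": "no"},
    {"job": "sales", "derogatory": "yes", "default": "yes"},
]

gain_derogatory = information_gain(loans, "derogatory", "default")
gain_job = information_gain(loans, "job", "default")
print("gain from derogatory:", gain_derogatory)
print("gain from job:", gain_job)
```

In this toy example the attribute that separates defaulters from non-defaulters perfectly receives the higher gain, so a tree would split on it first.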
With the growing amount of data and its consequences for scientific paradigms, quantitative research and statistical methodology will become more and more important, even for scientific discoveries in the social sciences and other disciplines. As can be seen in many high-profile academic journals, however, quantitative studies often reveal a lack of statistical knowledge and are prone to mistakes, many of which are based on false beliefs and misleading peer consensus.
In the following analysis I will address some of these common mistakes and false beliefs in basic statistical research. One of the issues discussed will be the selection of covariates in different regression models. It has become quite common in economics and the social sciences for researchers to include a large number of covariates in their regression models in order to “control for” other variables, a practice that often leads to multicollinearity. The following analysis will discuss the issues of “bad controls” and multicollinearity.
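As a rough illustration of why piling up correlated covariates is a problem, the sketch below computes the variance inflation factor (VIF) for the simplest case of two predictors, where it reduces to 1 / (1 - r²) with r being their Pearson correlation. The variable names and numbers are made up for illustration, and a real analysis with more than two covariates would regress each covariate on all the others instead.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def vif_two_predictors(x1, x2):
    """VIF of either predictor in a model with exactly two predictors."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical covariates: the first two move closely together
income = [30, 40, 50, 60, 70]
education_years = [10, 12, 13, 16, 17]
household_size = [3, 1, 4, 1, 5]

vif_correlated = vif_two_predictors(income, education_years)
vif_unrelated = vif_two_predictors(income, household_size)
print("VIF, income with education:", round(vif_correlated, 1))
print("VIF, income with household size:", round(vif_unrelated, 1))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a warning sign, although the threshold is a convention and not a formal test.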
The negative impact of terrorist attacks on tourism demand has frequently been the focus of attention in the academic literature and the media. Just recently, the media reported that the terrorist attacks conducted by the Islamic State have harmed the tourism industry in such places as Turkey, Egypt, Paris or London. However, the scholarly literature on the impact of terrorism on tourism, or the terrorism-tourism nexus, as I call it here, has mainly been dominated by qualitative analyses of specific case studies and anecdotal evidence.
This study, by contrast, aims to go beyond the scope of most academic studies by focusing on the general effects of terrorist attacks on tourism and by using quantitative methodology. More specifically, the main purpose of this study is to assess the causal effect of terrorist attacks on tourist arrivals, how strong this effect is, and whether or not terrorist attacks have a lagged effect on tourism demand one year after an attack has occurred.
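To make the idea of a lagged effect concrete, here is a minimal sketch of how a one-year lag of an attack count could be constructed in a country-year panel, so that arrivals in a given year can be related to attacks in the previous year. The record layout, the field names `country`, `year`, `attacks` and `arrivals`, and all numbers are hypothetical; this only shows the data preparation step, not the estimation used in the study.

```python
def add_lag(records, key, lag=1):
    """Attach the value of `key` from `lag` years earlier within each country.

    The lagged value is None when no earlier observation exists.
    """
    lookup = {(r["country"], r["year"]): r[key] for r in records}
    lagged = []
    for r in records:
        previous = lookup.get((r["country"], r["year"] - lag))
        lagged.append({**r, f"{key}_lag{lag}": previous})
    return lagged


# Hypothetical panel, not real tourism or terrorism data
panel = [
    {"country": "A", "year": 2000, "attacks": 0, "arrivals": 100},
    {"country": "A", "year": 2001, "attacks": 3, "arrivals": 95},
    {"country": "A", "year": 2002, "attacks": 0, "arrivals": 80},
    {"country": "B", "year": 2001, "attacks": 1, "arrivals": 50},
]

panel_with_lag = add_lag(panel, "attacks")
for row in panel_with_lag:
    print(row["country"], row["year"], row["attacks"], row["attacks_lag1"])
```

Rows whose lagged value is `None` (the first observed year of each country) would have to be dropped before estimating a model that includes the lagged term.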