In 1965, Gordon Moore, who was then the director of the Research and Development Laboratories at Fairchild Semiconductor and later became a co-founder of Intel Corporation, wrote a piece in the magazine Electronics about future developments in the semiconductor industry. He had observed that the number of components on an integrated circuit (IC) had doubled every year since the production of the first planar transistor in 1959, and he predicted that this trend would continue for at least the next ten years. Over the following decades the number of transistors on an IC indeed increased dramatically, driving the rapid development of entirely new product categories.
There are many situations in which clients need to predict non-numerical outcomes. For instance, a bank might want to predict whether a loan will default based on a variety of attributes of the borrower. Another example is a mail filter that classifies an email as regular mail or spam. Or doctors might want to diagnose a specific disease based on a number of symptoms. For such classification tasks we can apply a C5.0 decision tree, a very popular algorithm originally created by the computer scientist J. Ross Quinlan. There are, of course, other machine learning techniques, such as topological data analysis or neural networks, that perform better in many cases, but the C5.0 algorithm is accurate, practical, easy to interpret and applicable to a wide variety of problems.
In the following analysis I will put the C5.0 algorithm to the test by predicting credit default. I will apply the algorithm to a dataset of default and non-default home equity loans provided by Credit Risk Analytics. The dataset includes the variables “default/non-default”, “credit amount”, “property”, “amount due on existing mortgage”, “job”, “years at present job”, “number of derogatory reports”, “number of delinquent credit lines”, “oldest credit line (in months)”, “number of credit inquiries”, “number of credit lines” and “debt-to-income ratio”.
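At its core, C5.0 grows a tree by repeatedly splitting on the attribute with the highest information gain ratio (information gain normalised by split information). The following minimal sketch computes that criterion on a tiny, made-up borrower table; the attribute names and data are purely illustrative, and this is the split criterion only, not the C5.0 implementation itself:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain of splitting on `attr`, divided by the
    split information (the normalisation C4.5/C5.0 applies)."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    gain = entropy(labels) - sum(
        len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

# Toy borrower table: two categorical attributes, binary default label.
rows = [
    {"job": "office", "own_home": "yes"},
    {"job": "office", "own_home": "no"},
    {"job": "manual", "own_home": "yes"},
    {"job": "manual", "own_home": "no"},
]
labels = ["non-default", "non-default", "default", "default"]

# The attribute a C5.0-style splitter would pick first.
best = max(["job", "own_home"], key=lambda a: gain_ratio(rows, labels, a))
print(best)  # → job
```

In this toy table “job” separates the classes perfectly (gain ratio 1), while “own_home” carries no information (gain ratio 0), so the tree splits on “job” first.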
With the growing amount of data and its consequences for scientific paradigms, quantitative research and statistical methodology will become increasingly important for scientific discoveries, even in the social sciences and other disciplines. As many high-profile academic journals show, however, quantitative studies often betray a lack of statistical knowledge and are prone to mistakes, many of which rest on false beliefs and misleading peer consensus.
In the following analysis I will address some of these common mistakes and false beliefs in basic statistical research. One of the issues discussed is the selection of covariates in different regression models. It has become quite common in economics and the social sciences for researchers to include a host of covariates in their regression models in order to “control for” other variables, a practice that often introduces “bad controls” and multicollinearity. The following analysis will discuss both issues.
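To make the multicollinearity problem concrete, here is a minimal, self-contained sketch (simulated data and illustrative variable names, not the analysis itself) that diagnoses a nearly redundant covariate via its variance inflation factor, VIF = 1 / (1 − R²):

```python
import random

def simple_r2(x, y):
    """R-squared of regressing y on x (with intercept), closed form."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

random.seed(42)
x1 = [random.uniform(0, 1) for _ in range(200)]
# x2 is x1 plus tiny noise: an almost redundant "control" variable.
x2 = [a + random.gauss(0, 0.01) for a in x1]

r2 = simple_r2(x1, x2)       # how well x1 predicts x2
vif = 1.0 / (1.0 - r2)       # variance inflation factor of x2
print(round(r2, 4), round(vif, 1))
```

A common rule of thumb flags VIF values above 10; in this simulation the VIF is in the hundreds, meaning the standard errors of the coefficients on x1 and x2 are massively inflated and the individual estimates become unstable.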
CM is collaborating with an international team of data scientists to develop and implement the Advanced Data Systems Analysis (ADSA), an algorithm for ranking and clustering multi-dimensional (multi-sectoral) items. ADSA offers a wide range of advantages over tools currently prevalent on the market and can be delivered in a variety of forms, including consultancy and research papers, cloud-based software applications, or as an algorithm integrated into a company’s own data analysis department or automated industrial systems.
The areas of application include fields such as financial risk analysis, pricing optimization, product positioning, credit scoring, political risk analysis and emotion scoring. Moreover, automated systems such as manufacturing and production processes can be equipped with the ranking and clustering algorithm.
While statistics and surveys were long presented at face value and circulated uncritically in the media, this perception seems to have recently flipped into its complete opposite. Data and surveys have lost a great deal of credibility, and the quote “Never trust a statistic you haven’t falsified yourself” has already become the standard opinion in online forums everywhere.
Beyond the media, however, and especially in the academic literature, criticism of surveys goes back a long way. The American journalist Walter Lippmann, for instance, rejected the concept of public opinion as early as 1927 in his famous work The Phantom Public (2011), because he saw too great a discrepancy between the complexity of political and social realities and the simple-mindedness of the average citizen. Alongside several qualitative points of criticism, the academic literature also debates a series of methodological and practical challenges.
The negative impact of terrorist attacks on tourism demand has frequently been the focus of attention in the academic literature and the media. Just recently, the media reported that terrorist attacks conducted by the Islamic State have harmed the tourism industry in places such as Turkey, Egypt, Paris and London. The scholarly literature on the impact of terrorism on tourism, or the terrorism-tourism nexus as I call it here, has, however, mainly been dominated by qualitative analyses of specific case studies and anecdotal evidence.
This study, by contrast, aims to go beyond the scope of most academic studies. The analysis will focus on the general effects of terrorist attacks on tourism using quantitative methodology. More specifically, the main purpose of this study is to assess whether terrorist attacks have a causal effect on tourist arrivals, how strong this effect is, and whether terrorist attacks have a lagged effect on tourism demand one year after an attack has occurred.
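As a sketch of what estimating such a lagged effect can look like, the snippet below regresses (log) tourist arrivals on a contemporaneous and a one-year-lagged attack indicator. The data are simulated and the effect sizes illustrative; this is not the study’s actual model or dataset, only the mechanics of including a lag:

```python
import random

def det3(m):
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def ols3(X, y):
    """Solve the normal equations X'X b = X'y for 3 regressors (Cramer's rule)."""
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(3)]
    d = det3(xtx)
    coefs = []
    for j in range(3):
        m = [row[:] for row in xtx]
        for i in range(3):
            m[i][j] = xty[i]          # replace column j with X'y
        coefs.append(det3(m) / d)
    return coefs                      # [intercept, b_now, b_lag]

random.seed(1)
n = 300
attack = [1 if random.random() < 0.2 else 0 for _ in range(n)]
# Simulated log arrivals: baseline 10, a drop of 0.30 in the attack
# year and a further drop of 0.15 in the following year, plus noise.
y = [10.0 - 0.30 * attack[t] - 0.15 * (attack[t - 1] if t > 0 else 0)
     + random.gauss(0, 0.05) for t in range(n)]

# Design matrix: intercept, current attack, one-year lag (first year dropped).
X = [[1.0, attack[t], attack[t - 1]] for t in range(1, n)]
b = ols3(X, y[1:])
print([round(v, 2) for v in b])
```

With enough observations, the estimates recover both the contemporaneous and the lagged drop, which is exactly the distinction the study is after: whether tourism demand suffers only in the year of an attack or also in the year that follows.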