CM Data Research aims to conduct research based on the highest standards of scientific reasoning and academic ethics. To us, science means to constantly rethink well-established ideas, to compare different theories and to back them with multiple lines of evidence.
Even though we work with empirical data, it is highly important to ground our analysis in theoretical considerations. Generally speaking, we prefer theories that track empirical reality more closely than their rivals. In practice, CM has to test different machine learning models, for instance, and prefers those models whose predictions come closest to well-established theories and the empirical evidence.
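This model-selection practice can be illustrated with a minimal sketch. The data, the two candidate models and the error measure below are all hypothetical; the point is only the comparison step: both models are evaluated on held-out observations, and the one whose predictions lie closer to the empirical values is preferred.

```python
# Minimal sketch: compare two candidate models on held-out data and
# prefer the one whose predictions are closer to the observations.
# All numbers and models here are hypothetical illustrations.

def mse(predictions, actual):
    """Mean squared error between predictions and observations."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actual)) / len(actual)

# Hypothetical observations: y roughly doubles with x.
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [5, 6], [10.1, 11.9]

# Model A: always predict the training mean (a naive baseline).
mean_y = sum(train_y) / len(train_y)
model_a = lambda x: mean_y

# Model B: y = 2x, a simple theory-informed linear rule.
model_b = lambda x: 2 * x

error_a = mse([model_a(x) for x in test_x], test_y)
error_b = mse([model_b(x) for x in test_x], test_y)

# Prefer the model that comes closer to the empirical reality.
preferred = "B" if error_b < error_a else "A"
```

Here the theory-informed rule wins because its held-out error is far smaller than the baseline's, which is exactly the criterion described above.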
It is part of our philosophy to take a postpositivist point of view based on Popper’s principle of falsification and Kuhn’s theory of scientific revolutions. As the history of science indicates, a change in paradigms often occurs as methods of measurement improve. This is particularly important in the area of data science. As more and more data becomes available and as the accuracy of the data improves, existing theories might be adapted to new and more accurate data, or even new theoretical paradigms might emerge.
Empirical and theoretical research
The analysis conducted by CM Data Research has a strong focus on empirical research, especially quantitative data. Too often, intuition, beliefs or theoretical claims turn out to contradict empirical evidence. It is part of our mission to test theoretical claims on the basis of empirical reality that can be observed, measured and quantified.
Empirical research, however, has its limits, and it is important to know these boundaries. Scientific theory distinguishes between phenomenological laws and theoretical laws. The former describe the causal relationship between two or more factors in a concrete situation, which can be observed using empirical data. Theoretical laws, by contrast, basically represent a set of mathematical or logical equations that try to explain a phenomenological law. Whereas phenomenological laws might be literally true, theoretical laws remain a subject of debate.
Objectivity, reliability and validity
The analysis conducted by CM Data Research is characterized by the principles of objectivity, reliability and validity. The principle of objectivity reflects the idea that our data, methodology and the analysis of our results are free from biases. In practice, this means that our research has to be independent of the personal views of those conducting it and that the argumentation is not determined by cultural boundaries or political world views.
Reliability basically means reducing error in the measurement process, which is particularly important for large-scale data analysis. Moreover, we assess the quality of our measurements, conduct pre-testing procedures and guarantee full transparency. As a consequence, our models and tests are repeatable by other researchers and practitioners.
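One concrete ingredient of such repeatability is controlling the sources of randomness in an analysis. The sketch below is a hypothetical illustration, not a description of CM's actual tooling: by seeding a random generator explicitly, another researcher running the same step obtains exactly the same draws.

```python
# Sketch: reproducibility via a fixed random seed, so that other
# researchers can repeat a sampling step and obtain identical results.
# The seed value and sample size are hypothetical.
import random

def draw_sample(seed):
    # Use a dedicated seeded generator rather than global random state,
    # so the function is deterministic for a given seed.
    rng = random.Random(seed)
    return [rng.randint(1, 100) for _ in range(5)]

run_1 = draw_sample(42)
run_2 = draw_sample(42)
repeatable = run_1 == run_2  # identical draws on every run
```

Publishing the seed alongside the code is one simple way of making a stochastic analysis step fully transparent.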
Last but not least, our data research is supposed to follow the principle of validity, which means that the variables and tests we use should accurately reflect what they are supposed to measure. This requires tight operational definitions, sound theoretical constructs and premises, and proper judgement based on multiple lines of evidence and different scientific studies.
Deduction, induction and abduction
In the debate about induction, deduction and abduction, CM Data Research advocates a pragmatic point of view and considers these concepts complementary methods of reasoning rather than contradictory ones. In practice, all of these methods are part of our research. For instance, we use large-scale data to test whether specific observations and single cases can be generalized. At the same time, our predictive models and algorithms deduce predictions for future cases from general patterns of the past and present. And if these models do not work, we abduce new hypotheses and invent new models.
CM Data Research has developed a variety of data processing models according to the scientific standards currently prevalent in the scholarly literature. We basically start our research by planning the project. This includes the formulation of the problem, research questions and hypotheses, as well as a solid discussion of the academic literature and theoretical considerations. The second step involves data preparation, a time-consuming process that includes the accumulation, cleaning and transformation of quantitative data. The third step consists of the analysis of the data, which includes a general exploration of the data and the application of the analytical models. In the last step we test our models and put the results into their theoretical context.
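The four steps can be sketched schematically. The hypothesis, the raw records and the deliberately simple least-squares "model" below are hypothetical placeholders; the sketch only shows how planning, preparation, analysis and testing follow one another.

```python
# Schematic sketch of the four-step workflow, with hypothetical data.

# Step 1: planning -- state the problem and the hypothesis to be tested.
hypothesis = "monthly revenue grows linearly over time"

# Step 2: data preparation -- clean raw records (drop missing values).
raw_records = [(1, 100.0), (2, None), (3, 140.0), (4, 160.0), (5, 180.0)]
clean = [(t, y) for t, y in raw_records if y is not None]

# Step 3: analysis -- explore the data and fit a least-squares line.
n = len(clean)
mean_t = sum(t for t, _ in clean) / n
mean_y = sum(y for _, y in clean) / n
slope = (sum((t - mean_t) * (y - mean_y) for t, y in clean)
         / sum((t - mean_t) ** 2 for t, _ in clean))
intercept = mean_y - slope * mean_t

# Step 4: testing -- check the residuals against a tolerance before
# putting the result back into its theoretical context.
residuals = [abs(y - (intercept + slope * t)) for t, y in clean]
model_ok = max(residuals) < 5.0
```

In real projects each step is of course far richer, but the ordering — plan, prepare, analyze, test — is the same.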
Machine learning and prediction
Machine learning and predictions are fundamental elements of CM’s work. However, machine learning and predictions have their limits, and it is vital for us to understand them. Potential problems of predictions can be related to the data itself. If variables that are necessary to explain a certain outcome are missing, predictions will fail. A similar problem in machine learning can occur if old and new data differ in how they were measured.
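A toy example makes this measurement problem concrete. The scenario below is entirely hypothetical: a simple rule is learned from old data recorded in one unit, and then applied to new data recorded in a different unit, producing a badly wrong prediction.

```python
# Sketch: a rule fitted on old data fails when new data is measured
# differently (here, temperatures switch from Celsius to Fahrenheit).
# All numbers are hypothetical illustrations.

# Old data: temperature in degrees Celsius vs. sales, with sales = 10 * temp.
old_temps_c = [10, 20, 30]
old_sales = [100, 200, 300]

# "Learned" rule from old data: slope of a line through the origin.
slope = sum(s / t for t, s in zip(old_temps_c, old_sales)) / len(old_sales)

# New data arrives in Fahrenheit: the same physical 20 C is recorded as 68.
new_temp_f = 68
prediction = slope * new_temp_f  # 680, far from the true value of about 200

measurement_shift = abs(prediction - 200) > 100
```

Nothing about the model changed; only the measurement convention did, and the predictions became useless until the units are reconciled.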
Another main problem in predicting outcomes is the unknown unknowns, a term famously used by Donald Rumsfeld. For instance, it is easy to predict a 50:50 chance of heads or tails as the result of tossing a coin. But who says that a coin is even being tossed? In the real world there might be external factors that affect a certain outcome but cannot be anticipated.
Another problem is described by Bernhard Schölkopf, who argues that current machine learning systems perform better at predicting causes from effects than the other way round, an observation related to Bayes’ rule. In addition, Stephen Wolfram’s principle of computational equivalence and the halting problem suggest that a machine cannot predict the outcome of another machine’s computation unless it goes through an equivalent computation itself.
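Since Bayes’ rule is mentioned here, a short worked example may help. The numbers below are hypothetical: given the base rate of a cause and the likelihood of an observed effect under each scenario, the rule yields the probability of the cause given the effect.

```python
# Sketch of Bayes' rule: inferring a cause from an observed effect.
# Hypothetical numbers: 1% of items are defective (cause); a test
# flags 95% of defective items but also 10% of good ones (effect).

p_defect = 0.01
p_flag_given_defect = 0.95
p_flag_given_ok = 0.10

# Total probability of the effect (an item being flagged).
p_flag = (p_flag_given_defect * p_defect
          + p_flag_given_ok * (1 - p_defect))

# Bayes' rule: P(cause | effect) = P(effect | cause) * P(cause) / P(effect).
p_defect_given_flag = p_flag_given_defect * p_defect / p_flag
```

Because the defect is rare, a flagged item is still far more likely to be fine than defective; inverting from effect back to cause requires keeping the base rate in view.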