CM Data Research aims to conduct research based on the highest standards of scientific reasoning and academic ethics. To CM, science means constantly rethinking well-established ideas, comparing different theories and backing them with multiple lines of evidence and quantitative data.
It is part of CM’s philosophy to take a postpositivist point of view based on Popper’s principle of falsification and Kuhn’s theory of revolutionary paradigms. As the history of science indicates, a change in paradigms often occurs as methods of measurement improve. This is particularly important in the area of data science. As data become more and more complex in terms of volume and variety, theoretical ideas might have to be adapted to new and more accurate data, or new theoretical paradigms might even emerge.
Even though the nature of data science implies a strong focus on empirical research, it is highly important to back the analysis with theoretical considerations. Consider the following: a statistical analysis reveals a correlation between the number of storks and the human birth rate, because rural areas have not only more storks than urban areas but also a higher human birth rate. Without a theoretical analysis, one could conclude that increasing the number of storks in a certain area would result in a higher birth rate. From a theoretical point of view, however, this causal connection does not make sense, even though the correlation analysis seems to suggest it. If the goal is only prediction, this might not be a problem. For businesses, however, predictions should lead to concrete actions, and many concrete actions require that the theoretical explanation behind a phenomenon be clear.
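The stork example can be made concrete with a small simulation. The following is a minimal sketch, not CM's actual method: all numbers (stork counts, birth rates, sample size) are invented for illustration, and the only assumption built in is that living in a rural area raises both quantities while storks have no effect on births.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def simulate_regions(n=2000, seed=0):
    """Hypothetical regions: 'rural' drives both storks and births.

    Storks and births are drawn independently of each other once the
    region type is fixed, so storks have no causal effect on births.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        rural = rng.random() < 0.5
        storks = rng.gauss(20 if rural else 5, 2)   # invented stork counts
        births = rng.gauss(14 if rural else 9, 1)   # invented birth rates
        rows.append((rural, storks, births))
    return rows

rows = simulate_regions()

# Correlation over all regions: strongly positive.
overall = pearson([s for _, s, _ in rows], [b for _, _, b in rows])

# Correlation within each region type: close to zero.
rural_only = pearson([s for r, s, _ in rows if r], [b for r, _, b in rows if r])
urban_only = pearson([s for r, s, _ in rows if not r], [b for r, _, b in rows if not r])

print(f"overall: {overall:.2f}, rural: {rural_only:.2f}, urban: {urban_only:.2f}")
```

Once the region type is held fixed, the correlation largely disappears. That is the kind of check a theoretical consideration (rural areas differ from urban ones in many ways) would prompt, and that a purely correlational analysis would not.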
Empirical and theoretical research
The analysis conducted by CM Data Research has a strong focus on empirical research, especially on quantitative data. Too often, intuition, beliefs or theoretical claims turn out to contradict empirical evidence. It is part of CM’s mission to test theoretical claims on the basis of empirical reality that can be observed, measured and quantified.
Empirical research, however, has its limits, and it is important to know these boundaries. Scientific theory distinguishes between phenomenological laws and theoretical laws. The former describe the causal relationship between two or more factors in a concrete situation, which can be observed using empirical data. Theoretical laws, by contrast, are sets of mathematical or logical equations that explain a phenomenological law. Whereas the empirical reality might be literally true, theoretical laws are subject to debate.
Objectivity, reliability and validity
The analysis conducted by CM Data Research is characterized by the principles of objectivity, reliability and validity. The principle of objectivity reflects the idea that our data, methodology and the analysis of our results are free from biases and outside influences. Consider the following: if an image recognition algorithm is trained on a dataset that only includes black cats and white dogs, it will probably classify a white cat as a dog. The algorithm has chosen color as the main variable for predicting the outcome, even though in the real world there are not only black cats and white dogs. Within the context of data science, objectivity means that the collection of data was not influenced by outside biases and that the data objectively represent the real world.
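The cat-and-dog example can be sketched in a few lines. This is a toy illustration under stated assumptions, not a real image recognition system: each animal is reduced to two hypothetical binary features (`is_white`, `pointy_ears`), the training set is invented, and the "model" is a one-feature decision stump that simply picks the feature with the highest training accuracy.

```python
# Each row: ((is_white, pointy_ears), label). Invented training data that
# contains only black cats and white dogs. The ear feature is deliberately
# imperfect, as real shape features usually are.
training_rows = [
    ((0, 1), "cat"), ((0, 1), "cat"), ((0, 1), "cat"), ((0, 1), "cat"),
    ((0, 0), "cat"),
    ((1, 0), "dog"), ((1, 0), "dog"), ((1, 0), "dog"), ((1, 0), "dog"),
    ((1, 1), "dog"),
]
feature_names = ("is_white", "pointy_ears")

def fit_stump(rows):
    """Return (accuracy, feature_index, label_if_one, label_if_zero)
    for the single feature and polarity with the best training accuracy."""
    best = None
    for j in range(len(rows[0][0])):
        for label_if_one, label_if_zero in (("dog", "cat"), ("cat", "dog")):
            correct = sum(
                (label_if_one if features[j] else label_if_zero) == label
                for features, label in rows
            )
            accuracy = correct / len(rows)
            if best is None or accuracy > best[0]:
                best = (accuracy, j, label_if_one, label_if_zero)
    return best

def predict(stump, features):
    """Apply a fitted stump to one feature tuple."""
    _, j, label_if_one, label_if_zero = stump
    return label_if_one if features[j] else label_if_zero

stump = fit_stump(training_rows)
chosen_feature = feature_names[stump[1]]

# A white cat with pointy ears, a combination absent from the training data.
white_cat_prediction = predict(stump, (1, 1))

print(chosen_feature, white_cat_prediction)
```

Because color separates the biased training set perfectly, the stump selects it and labels the white cat a dog. The flaw lies in the data collection, not in the learning rule.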
Reliability essentially means that a scientific test is repeatable. Reliability can also mean that the test results are independent of the data collection method. Reliability is an important factor in interpreting changes or fluctuations in the results. If the test results fluctuate, we have to make sure that these changes are not due to different data collection methods.
Last but not least, data research is supposed to follow the principle of validity, which means that the variables and tests we use should accurately reflect what they are supposed to measure. This requires tight operational definitions, sound theoretical constructs and premises, and proper judgement based on numerous lines of evidence and different scientific studies.
Deduction, induction and abduction
With regard to the debate about induction, deduction and abduction, CM Data Research advocates a pragmatic point of view and considers these concepts complementary rather than contradictory methods of reasoning. In practice, all of these methods of reasoning are part of our research. For instance, CM uses large-scale data to test whether specific observations and single cases can be generalized. At the same time, the predictive models and algorithms apply general patterns from the past and present to future cases. And if these models do not work, new theories and models are devised.
CM Data Research deploys two fundamentally different approaches to data analysis, depending on the special needs and the specific circumstances. The first approach can be considered a conventional technique and is built on hypothesis testing. Here, CM starts its research by formulating the problem, research questions and hypotheses and by discussing the academic literature and theoretical considerations. The second step involves data collection and preparation based on the hypotheses built in the previous step. In the last step, the hypotheses are tested against the collected and prepared data.
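The final step of the hypothesis-based approach can be illustrated with a simple permutation test. This is only a sketch of one possible test, not a description of the tests CM actually uses: the two groups of measurements are invented, and the function name and parameters are hypothetical. The null hypothesis is that both groups come from the same distribution, so shuffling the group labels should produce differences as large as the observed one reasonably often.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the share of label shufflings whose absolute mean difference
    is at least as large as the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Invented measurements for two hypothetical groups.
control = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]
treatment = [5.9, 6.1, 6.0, 5.8, 6.2, 6.0, 5.9, 6.1]

p_value = permutation_p_value(control, treatment)
print(f"p-value: {p_value:.4f}")
```

A small p-value means the data are hard to reconcile with the null hypothesis. It does not by itself establish the theoretical explanation formulated in the first step.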
The second approach is more recent and emerged from data mining and big data analysis. Here, the starting point is not hypothesis building but rather the extraction of patterns from large datasets by means of an algorithm. Whereas the first approach is hypothesis-based, the second approach is hypothesis-generating: the hypotheses are the end product and not the starting point of the analysis.
Machine learning and prediction
Machine learning and predictions are fundamental elements of CM’s work. However, machine learning and predictions have their limits, and it is vital to understand them. Potential problems with predictions can be related to the data themselves. If variables that are necessary to explain a certain outcome are missing, the prediction cannot work. A similar problem in machine learning can occur if old and new data differ in the way they were measured.
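The measurement problem can be shown with a deliberately simple case. The sketch below assumes a hypothetical, noise-free relationship between temperature and sales and a model fitted on temperatures in degrees Celsius. The new data describe the same reality but report temperature in degrees Fahrenheit. All names and numbers are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of a fitted line."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

# Old data: temperature in degrees Celsius, hypothetical sales figures.
temps_c = [10.0, 15.0, 20.0, 25.0, 30.0]
sales = [2.0 * t + 5.0 for t in temps_c]
model = fit_line(temps_c, sales)

# New data: same underlying relationship, measured twice.
new_temps_c = [12.0, 18.0, 27.0]
new_sales = [2.0 * t + 5.0 for t in new_temps_c]
new_temps_f = [t * 9 / 5 + 32 for t in new_temps_c]  # unit silently changed

error_same_units = mean_abs_error(model, new_temps_c, new_sales)
error_changed_units = mean_abs_error(model, new_temps_f, new_sales)

print(f"same units: {error_same_units:.2f}, changed units: {error_changed_units:.2f}")
```

The model itself is unchanged and the underlying relationship still holds, yet the predictions become useless as soon as the input is measured differently.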
Another major problem, linked to the one described above, is that of the so-called “unknown unknowns”, a term famously popularized by Donald Rumsfeld. For instance, it is easy to predict a 50:50 chance of heads or tails when a coin is tossed. But who says that a coin is even being tossed? In the real world there might be external influences that affect a certain outcome but are not predictable because they were not represented in the training dataset.
Another problem is described by Bernhard Schölkopf, who argues that current machine learning systems perform better at predicting causes from effects than the other way round, an observation that is in line with Bayes’ rule. In addition, Stephen Wolfram’s principle of computational equivalence and the halting problem suggest that a machine cannot predict the outcomes produced by another machine unless it goes through the same computation.