With the growing amount of data and its consequences for scientific paradigms, quantitative research and statistical methodology will become more and more important for scientific discoveries, even in the social sciences and other disciplines. As can be seen in many high-profile academic journals, however, quantitative studies often reveal gaps in statistical knowledge and are prone to mistakes, many of which are based on false beliefs and misleading peer consensus.
In the following analysis I will address some of these common mistakes and false beliefs in basic statistical research. One of the issues discussed is the selection of covariates in regression models. It has become quite common in economics and the social sciences for researchers to include a large number of covariates in their regression models in order to “control for” other variables, a practice that often leads to multicollinearity. The following analysis will discuss this issue of “bad controls”, or multicollinearity.
To address these issues I will analyze factors that shape people’s satisfaction with their government based on a quantitative dataset provided by the European Social Survey, one of the largest European cross-sectional datasets of that kind with more than 25,000 survey participants.
The problem of multicollinearity in multiple regression analysis
Many statistical studies in economics, the social sciences, the life sciences or environmental science make fundamental mistakes when regressing dependent variables on independent ones. Researchers are used to including a large number of covariates in regression models in order to “control for” other independent variables.
One common mistake in doing this is to overlook multicollinearity, which describes a situation in which two or more independent variables correlate strongly with each other. (Correlation between an independent variable and the error term is a separate, though related, problem that is usually called endogeneity.) A study by biologist Michael Graham showed that only 11% of 294 statistical scientific papers on ecology published in five high-quality journals discuss the issue of multicollinearity, despite the fact that ecological data are often afflicted with it.
Let us assume a regression model that predicts a dependent variable (Y) from the weighted sum of independent variables (Xi) and a random error (ε):
Y = β0 + β1 X1 + β2 X2 + ε
where X1 is the main independent variable of interest and X2 is a second independent variable (covariate). The problem of multicollinearity arises if X1 and X2 correlate with each other, because X2 then distorts the estimate of the direct effect of X1 on Y that the regression coefficient β1 is supposed to represent.
Basically, there are three important implications of multicollinearity:
(1) Researchers frequently argue that multicollinearity is not a problem if we only care about predicting new observations and not about the coefficients. But this is not entirely correct. Models affected by multicollinearity can predict Y reliably only if the new observation lies within the scope of the model, which means that its predictor values should not exceed the highest or fall below the lowest values observed in the data used to fit the model.
(2) Another problem is that multicollinearity produces regression coefficients with high sampling variability: the estimates can differ greatly from one sample to the next. Thus, individual regression coefficients might be highly imprecise.
(3) The typical interpretation of a regression coefficient as the change in the expected value of the dependent variable for a one-unit change in an independent variable, while the other variables are held constant, cannot readily be applied to models whose independent variables correlate with each other.
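Implication (2) can be illustrated with a small simulation. The following is a minimal sketch on artificial data, not on the ESS dataset; the function name simulate_beta1, the sample size, the number of repetitions and the true coefficients of 0.5 are all assumptions chosen for illustration. It fits the same two-predictor model many times, once with uncorrelated predictors and once with predictors correlated at about 0.95, and compares how much the estimated coefficient of x1 varies across samples.

```r
# Simulate how correlation between two predictors inflates the sampling
# variability of a regression coefficient. All names are illustrative.
set.seed(42)

simulate_beta1 <- function(rho, n = 200, reps = 500) {
  replicate(reps, {
    x1 <- rnorm(n)
    # x2 is constructed so that cor(x1, x2) is approximately rho
    x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
    y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)
    unname(coef(lm(y ~ x1 + x2))["x1"])
  })
}

sd_uncorrelated <- sd(simulate_beta1(rho = 0))
sd_collinear    <- sd(simulate_beta1(rho = 0.95))

# Spread of the estimated coefficient of x1 across repeated samples
print(c(uncorrelated = sd_uncorrelated, collinear = sd_collinear))
```

In this setup the spread of the estimates should be several times larger in the collinear case, even though the true coefficient is the same in both.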
One solution to this problem would be to exclude X2 from the regression model in order to estimate the direct effect of X1 on Y. The problem with this, though, is that excluding the covariate could reduce the predictive power of the model, because X2 might make not only a shared contribution to Y but also a unique one.
The primary solution to this problem is to isolate the unique contribution of X2 to the dependent variable by calculating partial regression and correlation coefficients. This not only preserves the predictive power of the model but also gives us the unique effect of each independent variable, represented in its regression coefficient, while holding the other independent variable constant.
If the variables are standardized to a mean of zero and unit variance, the partial regression coefficient of X1 in a model with one covariate (X2) can be calculated by the formula:
β1 = (rY1 − rY2 r12) / (1 − r12²)
in which rY1 represents the correlation between X1 and Y, rY2 is the correlation between X2 and Y, and r12 is the correlation between X1 and X2. The partial regression coefficient provides information on the effect of one of the two independent variables on the dependent variable while holding the other one constant.
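The relationship between the three pairwise correlations and the standardized partial regression coefficient can be checked numerically. The sketch below assumes the standard two-predictor formula (rY1 − rY2 r12) / (1 − r12²); the helper names partial_beta and standardize and the simulated data are illustrative and not part of the original analysis.

```r
# Standardized partial regression coefficient of X1 in a two-predictor model,
# computed from the three pairwise correlations.
partial_beta <- function(r_y1, r_y2, r_12) {
  (r_y1 - r_y2 * r_12) / (1 - r_12^2)
}

# Rescale a vector to mean zero and unit variance
standardize <- function(v) (v - mean(v)) / sd(v)

# Simulated data, for illustration only
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
y  <- 0.6 * x1 + 0.3 * x2 + rnorm(n)

from_formula <- partial_beta(cor(y, x1), cor(y, x2), cor(x1, x2))

# The same quantity as the coefficient of x1 in a regression on standardized data
fit     <- lm(standardize(y) ~ standardize(x1) + standardize(x2))
from_lm <- unname(coef(fit)[2])

print(c(formula = from_formula, lm = from_lm))
```

Both numbers agree up to floating-point error, which shows that the formula is simply the ordinary least squares coefficient for standardized variables.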
In order to show the actual relationship between the covariate and the dependent variable, I will also compute the partial correlation coefficient between X2 and Y while holding X1 constant. The terms are the same as in the regression coefficient formula:
rY2·1 = (rY2 − rY1 r12) / √((1 − rY1²)(1 − r12²))
Even though statistical analysis deals with empirical data, a solid theoretical framework is one of the most important preconditions of quantitative research. This includes not only statistical and mathematical foundations such as probability theory or sampling theory, but also theories of the subject of interest. To investigate the reasons why respondents are satisfied with their government, the following analysis will be based on the theory of economic voting.
There is a widespread belief among spin-doctors and the media that elections can be won by touching the voters’ feelings and by communicating emotional messages. A closer look at empirical studies and socio-economic theories, however, reveals that the satisfaction of respondents with the government depends, at least to a considerable extent, on economics.
The conventional view within the scholarly literature argues that voting patterns can be explained by economic conditions, a hypothesis that is often referred to as economic voting. The first academic studies focusing on this issue emerged in the late 1970s. Edward Tufte (1978) and Douglas A. Hibbs (1977: 1467-1487) argued that the economy can be understood by looking at politics. The primary goal of governments and political parties, as their argument goes, is to implement economic policies that improve the economic conditions of their voters.
One of the most comprehensive studies on the relationship between economic conditions and voting patterns was conducted by Michael Lewis-Beck (1990) in his book Economics and Elections: The Major Western Democracies. As the graph below illustrates, the vote share of the British party in government can be predicted by the annual inflation rate, which indicates that British voters care about increasing prices and the devaluation of their property.
The overall correlation coefficient between the vote variable and the inflation rate in the five countries Great Britain, France, Germany, Italy and the United States lies at 0.64 (Lewis-Beck 1990: 11). Although the inflation rate seems to be an important variable determining voting patterns in all five countries, there are other macroeconomic indicators too, which are powerful predictors of voting patterns. As the following graph shows, in France the electorate tends to vote for left parties as the unemployment rate increases. A higher GDP, by contrast, correlates with a higher vote share of French right parties.
However, not all election outcomes can be explained by macroeconomic indicators. As a variety of scholars argue, voting choices are more complex and can be influenced by values and ideologies, campaign slogans, political scandals, natural disasters or personalities (Van der Brug et al. 2007: 3-5). Nevertheless, a large number of studies on this subject have established a widespread consensus that economic conditions determine election outcomes. Or, as Van der Brug et al. (2007: 2) put it, the “conventional wisdom holds more often than not”.
To discuss the issue of multicollinearity described above, I will analyze factors shaping people’s satisfaction with their government. The analysis is based on the most recent ESS dataset (2016), which surveyed 25,900 individuals from 13 different EU countries: Austria, Belgium, Czech Republic, Germany, Estonia, Finland, France, Great Britain, Ireland, the Netherlands, Poland, Sweden and Slovenia. The dataset covers a wide range of data such as biographical variables, economic and political factors, living standard, religious variables, satisfaction, trust in people and institutions, and human values.
To identify the independent variables I computed correlations between all metric and interval-scaled variables and conducted t-tests with some of the factor variables in order to compare group means. Note that the cor() call below uses R's default method and therefore reports Pearson rather than Spearman coefficients. The following correlation table includes the correlating pairs and excludes those variables which do not correlate with satisfaction with government.
cor_frame <- cbind(satisfied_gov, satisfied_economy, satisfied_life,
                   satisfied_healthcare, satisfied_educ, trust_parl,
                   happy, unemployment_country, ls_pensioners)
res <- cor(cor_frame, use="complete.obs")
round(res, 2)

                     satisfied_gov satisfied_economy satisfied_life
satisfied_gov                 1.00              0.61           0.27
satisfied_economy             0.61              1.00           0.39
satisfied_life                0.27              0.39           1.00
satisfied_healthcare          0.32              0.33           0.24
satisfied_educ                0.35              0.30           0.22
trust_parl                    0.59              0.46           0.25
happy                         0.21              0.30           0.71
unemployment_country         -0.12             -0.22          -0.09
ls_pensioners                 0.31              0.34           0.25

                     satisfied_healthcare satisfied_educ trust_parl
satisfied_gov                        0.32           0.35       0.59
satisfied_economy                    0.33           0.30       0.46
satisfied_life                       0.24           0.22       0.25
satisfied_healthcare                 1.00           0.39       0.33
satisfied_educ                       0.39           1.00       0.29
trust_parl                           0.33           0.29       1.00
happy                                0.21           0.21       0.20
unemployment_country                -0.07          -0.05      -0.18
ls_pensioners                        0.33           0.21       0.31

                     pol_interest left_right_scale happy
satisfied_gov               -0.06             0.10  0.21
satisfied_economy           -0.11             0.08  0.30
satisfied_life              -0.08             0.11  0.71
satisfied_healthcare        -0.03             0.05  0.21
satisfied_educ               0.04             0.09  0.21
trust_parl                  -0.21             0.05  0.20
happy                       -0.08             0.07  1.00
unemployment_country         0.14            -0.01 -0.04
ls_pensioners               -0.13             0.05  0.20

                     unemployment_country ls_pensioners educ_years
satisfied_gov                       -0.12          0.31       0.06
satisfied_economy                   -0.22          0.34       0.11
satisfied_life                      -0.09          0.25       0.11
satisfied_healthcare                -0.07          0.33      -0.01
satisfied_educ                      -0.05          0.21      -0.01
trust_parl                          -0.18          0.31       0.15
happy                               -0.04          0.20       0.11
unemployment_country                 1.00         -0.10      -0.18
ls_pensioners                       -0.10          1.00       0.09
As can be seen in the correlation table, “satisfied with government” correlates relatively strongly with “satisfied with economy” (0.61) and “trust in parliament” (0.59), modestly with “satisfied with education system” (0.35), “satisfied with healthcare system” (0.32), “living standard of pensioners” (0.31) and “satisfied with life” (0.27), and only weakly with overall “happiness” (0.21).
Also, I conducted a Welch two sample t-test comparing the means of the variable “satisfied with government” of different groups in order to check whether to include such variables as gender or party affiliation.
t.test(satisfied_gov ~ gender, data = mydata)

	Welch Two Sample t-test

data:  satisfied_gov by gender
t = 0.13703, df = 25078, p-value = 0.891
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.05309728  0.06107945
sample estimates:
mean in group 1 mean in group 2 
       4.472418        4.468427 
What we see in the test above is that there is no notable difference between men and women in their overall satisfaction with the government (the mean is 4.47 in both groups), so we cannot reject the null hypothesis. The same is true of the mean difference between those who are affiliated with a political party and those who are not. Note that a more thorough check of the test would simulate many more (for example 1,000) t-statistics, each computed from a pair of samples of 10 standard normal random numbers, and compare their distribution with the theoretical density.
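The simulation check mentioned in the note above could look roughly like the following sketch. It is not part of the original analysis: the seed, the object names and the choice of 18 degrees of freedom for the reference density are assumptions (Welch's test estimates the degrees of freedom separately for each pair of samples, so 18 is only an approximation).

```r
# Simulate 1,000 Welch t-tests, each comparing two samples of 10 standard
# normal random numbers, and compare the t-statistics with a t density.
set.seed(123)
n_sims <- 1000

sims <- replicate(n_sims, {
  tt <- t.test(rnorm(10), rnorm(10))
  c(t = unname(tt$statistic), p = tt$p.value)
})

t_stats     <- sims["t", ]
reject_rate <- mean(sims["p", ] < 0.05)

# Histogram of the simulated statistics with the theoretical density overlaid
hist(t_stats, breaks = 30, freq = FALSE,
     main = "Simulated t-statistics", xlab = "t")
curve(dt(x, df = 18), add = TRUE, lwd = 2)

# Share of tests rejecting a true null hypothesis at the 5% level
print(reject_rate)
```

Because both samples come from the same distribution, the rejection rate should be close to the nominal 5%, and the histogram should follow the overlaid curve.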
Although there is a variety of independent variables which correlate with the dependent variable, the following regression analysis will focus on “satisfied with economy” as the main independent variable and its effects on the dependent variable “satisfied with government” and the variable “trust in parliament” as the covariate because the purpose of this article is to demonstrate the problem of multicollinearity.
First of all, I will test a simple linear regression model that includes only the variable “satisfied with economy” and its effect on the dependent variable “satisfied with government”. As can be seen below, the regression coefficient lies at 0.63, which means that a one-point increase in satisfaction with the economy (on an interval scale from 0 = not satisfied to 10 = highly satisfied) is associated with an increase of 0.63 points in satisfaction with the government on the same scale. The null hypothesis can be rejected, the standard error is small (0.005), and the result is highly significant. More importantly, R2 lies at 37.7%, which means that the model explains 37.7% of the variance in satisfaction with the government.
reg1 = lm(satisfied_gov ~ satisfied_economy, data=mydata, weights=pweight)
summary(reg1)

Call:
lm(formula = satisfied_gov ~ satisfied_economy, data = mydata, weights = pweight)

Weighted Residuals:
     Min       1Q   Median       3Q      Max 
-12.2907  -0.7164   0.0857   0.8996  11.5190 

Coefficients:
                  Estimate Std. Error t value            Pr(>|t|)    
(Intercept)       1.080877   0.029346   36.83 <0.0000000000000002 ***
satisfied_economy 0.633603   0.005143  123.19 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.91 on 25094 degrees of freedom
  (804 observations deleted due to missingness)
Multiple R-squared:  0.3769,	Adjusted R-squared:  0.3768 
F-statistic: 1.518e+04 on 1 and 25094 DF,  p-value: < 0.00000000000000022
As the correlation table shows, there is also a correlation between the variable “trust in parliament” and “satisfaction with the government” with a coefficient of 0.59. Before combining the two predictors, I will therefore test another regression model, in which “trust in parliament” is regressed on “satisfied with economy”, in order to see how strongly the two independent variables are related.
reg2 = lm(trust_parl ~ satisfied_economy, data=mydata, weights=pweight)
summary(reg2)

Call:
lm(formula = trust_parl ~ satisfied_economy, data = mydata, weights = pweight)

Weighted Residuals:
     Min       1Q   Median       3Q      Max 
-11.5679  -0.8446   0.0123   1.0869  12.9110 

Coefficients:
                  Estimate Std. Error t value            Pr(>|t|)    
(Intercept)       2.038023   0.034305   59.41 <0.0000000000000002 ***
satisfied_economy 0.494271   0.006009   82.26 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.239 on 25133 degrees of freedom
  (765 observations deleted due to missingness)
Multiple R-squared:  0.2121,	Adjusted R-squared:  0.2121 
F-statistic:  6766 on 1 and 25133 DF,  p-value: < 0.00000000000000022
As can be seen in the regression table above, a one-point increase in satisfaction with the economy corresponds with an increase of 0.49 points in trust in parliament. Compared with Reg1, the coefficient of Reg2 is lower, the standard error is higher, the t-value is much lower, and R2 is 21%, which means that satisfaction with the economy explains a much smaller share of the variation in trust in parliament than of the variation in satisfaction with the government.
What many researchers would do in this case is to include both independent variables “satisfied with economy” and “trust in parliament” into one regression model and to test their impact on the dependent variable “satisfaction with government”. However, as can be seen in the correlation table, the variable “trust in parliament” depends, at least to a considerable extent, on the individual’s satisfaction with the economy. The correlation coefficient lies at 0.46.
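A common way to quantify how much a correlation between predictors inflates the sampling variance of their coefficients is the variance inflation factor (VIF). It is not used in the analysis here; the following is a minimal sketch that assumes the textbook formula for a model with exactly two predictors, VIF = 1 / (1 − r12²), and plugs in the rounded correlation of 0.46 from the correlation table. The function name vif_two_predictors is illustrative.

```r
# Variance inflation factor for a model with exactly two predictors:
# VIF = 1 / (1 - r12^2), where r12 is the correlation between the predictors.
vif_two_predictors <- function(r_12) 1 / (1 - r_12^2)

# Rounded correlation between "satisfied with economy" and
# "trust in parliament", taken from the correlation table
r_12 <- 0.46
vif  <- vif_two_predictors(r_12)

print(round(vif, 2))        # factor by which the coefficient variance is inflated
print(round(sqrt(vif), 2))  # factor by which the standard error is inflated
```

The factor describes only the loss of precision in the estimates; it says nothing about whether part of the effect of one predictor runs through the other.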
If we make the mistake described above and include two correlating independent variables, we will see that the coefficients change and no longer reflect the direct impact. Moreover, we cannot interpret a coefficient as the change in the value of Y while the other independent variable is held constant. Comparing the R2 of Reg1 and Reg3 below, one might think that the inclusion of the covariate “trust in parliament” contributes more than 12 percentage points to the explained variation.
reg3 = lm(satisfied_gov ~ satisfied_economy + trust_parl, data=mydata, weights=pweight)
summary(reg3)

Call:
lm(formula = satisfied_gov ~ satisfied_economy + trust_parl, data = mydata, weights = pweight)

Weighted Residuals:
     Min       1Q   Median       3Q      Max 
-11.3398  -0.6585   0.0451   0.7196  12.3162 

Coefficients:
                  Estimate Std. Error t value            Pr(>|t|)    
(Intercept)       0.311975   0.028228   11.05 <0.0000000000000002 ***
satisfied_economy 0.445058   0.005217   85.31 <0.0000000000000002 ***
trust_parl        0.378849   0.004856   78.02 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.714 on 24874 degrees of freedom
  (1023 observations deleted due to missingness)
Multiple R-squared:  0.4988,	Adjusted R-squared:  0.4987 
F-statistic: 1.238e+04 on 2 and 24874 DF,  p-value: < 0.00000000000000022
Now I will try to solve this problem by testing a partial least squares regression model, which will estimate the contribution of each of the variables to the dependent variable “satisfaction with government”. For this, I will use the package “plsdepot” created by Gaston Sanchez. The advantage of this package is that it computes several indicators very quickly. However, it is of course no problem to calculate the partial regression coefficients in R by using the formula above.
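library(plsdepot)
# Columns 17, 23 and 18 of mydata; judging by the names in the output below,
# these appear to be satisfied_economy, trust_parl and satisfied_gov
Reg_frame <- mydata[, c(17, 23, 18)]
# Caution: this recodes missing values as 0 (FALSE is coerced to 0)
# instead of dropping the affected rows
Reg_frame[is.na(Reg_frame)] <- FALSE
# Predictors in columns 1 and 2, response in column 3, two PLS components
pls1 = plsreg1(Reg_frame[, 1:2], Reg_frame[, 3, drop = FALSE], comps = 2)
pls1$reg.coefs

        Intercept satisfied_economy        trust_parl 
        0.4474060         0.4284556         0.3696757 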
As can be seen in the computation above, the partial regression coefficients provide more accurate results while holding the second variable constant. However, the coefficients do not tell us much about the relationship between the two independent variables. The problem here is that the variable “satisfied with economy” also affects the variable “trust in parliament” which, in turn, affects the dependent variable “satisfied with government”. So the regression coefficient does not reflect the indirect impact of the economy on satisfaction with the government through the variable trust. The computation of R2 below shows the contribution of each component of the model to the explained variation. Whereas the inclusion of the variable “trust in parliament” in Reg3 suggests that this variable contributes more than 12 percentage points, the second component of the partial least squares model (t2) adds only about 0.005 percentage points. Note that t1 and t2 are components built from both predictors, so this figure describes the second component rather than “trust in parliament” as a variable on its own.
pls1$R2

           t1            t2 
0.46246392467 0.00004780238 
To illustrate the problem once again, the graphic below shows the relationships and the correlation coefficients between each of the variables included in our analysis. The three variables correlate with each other, ranging from 0.46 between “trust in parliament” and “satisfied with economy” to 0.61 between “satisfied with economy” and “satisfied with government”. The problem is that the relationship between “trust in parliament” and “satisfied with government” does not seem to be independent, because “trust in parliament” correlates with “satisfied with economy”.
In order to deal with this problem I will calculate the partial correlation coefficient between “trust in parliament” (X2) and “satisfaction with government” (Y), holding “satisfied with economy” (X1) constant. For this, we need to insert the correlation coefficients from the table above into the formula below. This time I calculated the partial correlation coefficient by hand, but it is no problem to do this in R by using the same formula.
rY2·1 = (rY2 − rY1 r12) / √((1 − rY1²)(1 − r12²))
Using a partial correlation coefficient, the result shows that the coefficient decreases from 0.59 as an ordinary correlation to 0.3378 as a partial correlation coefficient, which means that much of the relationship between “trust in parliament” and “satisfied with government” can be explained by the variable “satisfied with economy”. However, a partial correlation coefficient of 0.3378 also means that some correlation between “trust in parliament” and “satisfied with government” remains, which is independent of “satisfaction with the economy”. At this point, we could continue the analysis by computing the other partial correlation coefficients as well.
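The same calculation can be sketched in R. The function below assumes the standard first-order partial correlation formula and uses the rounded values from the correlation table (0.59, 0.61 and 0.46); the function and variable names are illustrative. With these rounded inputs the function returns roughly 0.44 rather than the 0.3378 reported above, so the hand calculation presumably rested on different inputs or on a different variant of the formula. The second half checks the formula on simulated data against the correlation of regression residuals.

```r
# First-order partial correlation between a and b, controlling for c
partial_cor <- function(r_ab, r_ac, r_bc) {
  (r_ab - r_ac * r_bc) / sqrt((1 - r_ac^2) * (1 - r_bc^2))
}

# Rounded values from the correlation table
r_gov_trust  <- 0.59  # satisfied_gov vs. trust_parl
r_gov_econ   <- 0.61  # satisfied_gov vs. satisfied_economy
r_trust_econ <- 0.46  # trust_parl vs. satisfied_economy

p_table <- partial_cor(r_gov_trust, r_gov_econ, r_trust_econ)
print(round(p_table, 2))

# Check on simulated data (illustrative only): the partial correlation equals
# the correlation between the residuals of gov ~ econ and trust ~ econ
set.seed(7)
econ  <- rnorm(500)
trust <- 0.5 * econ + rnorm(500)
gov   <- 0.5 * econ + 0.3 * trust + rnorm(500)

from_formula <- partial_cor(cor(gov, trust), cor(gov, econ), cor(trust, econ))
from_resid   <- cor(resid(lm(gov ~ econ)), resid(lm(trust ~ econ)))
print(c(formula = from_formula, residuals = from_resid))
```

The residual-based check makes the interpretation concrete: the partial correlation is what remains of the relationship between two variables once the part explained by the third has been removed from both.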
Note that this analysis was just a demonstration of the problem of multicollinearity and of a potential way to deal with it. To make reliable assertions about the role of economics in shaping people’s satisfaction with the government, we would have to take further steps. Moreover, we would have to analyze more regression models and check the model fit by using a Q-Q plot and other techniques. Another problem related to the issue of multicollinearity is that one or more of the independent variables could correlate with the error term of the regression model. But this problem will be discussed in a future post.
HIBBS JR., Douglas A. (1977) Political Parties and Macroeconomic Policy, in: The American Political Science Review, Vol. 71, No. 4, December, pp. 1467-1487.
LEWIS-BECK, Michael L. (1990) Economics and Elections: The Major Western Democracies, The University of Michigan Press, Michigan.
TUFTE, Edward (1978) Political Control of the Economy, Princeton University Press, Princeton.
VAN DER BRUG, Wouter; VAN DER EIJK, Cees; FRANKLIN, Mark (2007) The Economy and the Vote: Economic Conditions and Elections in Fifteen Countries, Cambridge University Press, New York.