The Effects of Median Splits

The applet on this page illustrates, in a regression context, the negative effects of using median splits in data analysis. When the page is loaded, the applet displays a regression analysis of 20 data points randomly sampled from a bivariate normal distribution for which the population squared correlation is 0.25. The vertical red line marks the median. In a regression context, splitting the predictor variable at the median is equivalent to assigning all observations below the predictor median the same score on the predictor and all observations above the predictor median another score on the predictor. It is common to use 0, 1 (dummy codes) or -1, +1 (contrast codes) as these predictor scores. Equivalently, one can use the mean predictor value of the respective groups, as is done in the applet below.
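
The applet's starting display can be reproduced outside the browser. The sketch below, a minimal illustration assuming NumPy and SciPy are available (variable names are illustrative only), draws 20 observations from a bivariate normal with population squared correlation 0.25 and recodes the predictor by assigning each observation the mean of its side of the median.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Population correlation 0.5, so the population squared correlation is 0.25.
    rho = 0.5
    cov = [[1.0, rho], [rho, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=20).T

    # Median split: observations below the predictor median all receive the
    # lower group's mean predictor value; those above receive the upper group's mean.
    above = x > np.median(x)
    x_split = np.where(above, x[above].mean(), x[~above].mean())

    full = stats.linregress(x, y)
    split = stats.linregress(x_split, y)
    print("original predictor:  r^2 = %.3f, p = %.3f" % (full.rvalue**2, full.pvalue))
    print("split predictor:     r^2 = %.3f, p = %.3f" % (split.rvalue**2, split.pvalue))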

As the slider at the bottom of the graph is moved from left to right, the points in the graph slide toward the locations they would occupy if the predictor were split at the median. At the top of the graph are the regression equation, its r-square, and its t and p values. Note that as the slider moves toward the right, the regression line fluctuates minimally but r-square and t steadily decrease. These decreases are due almost entirely to the systematic reduction in predictor variance, displayed under the regression equation.
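
One way to mimic the slider numerically is to interpolate each predictor value between its original position and its group mean. In the hedged sketch below (same assumptions as above; the weight w plays the role of the slider), the predictor variance, r-square, and t all shrink as w moves from 0 to 1 while the slope changes comparatively little.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x, y = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=20).T

    above = x > np.median(x)
    x_split = np.where(above, x[above].mean(), x[~above].mean())

    # w = 0 is the original data; w = 1 is the full median split.
    for w in (0.0, 0.25, 0.5, 0.75, 1.0):
        xw = (1 - w) * x + w * x_split
        fit = stats.linregress(xw, y)
        t = fit.slope / fit.stderr
        print("w=%.2f  slope=%6.3f  var(x)=%5.3f  r^2=%.3f  t=%6.3f"
              % (w, fit.slope, xw.var(ddof=1), fit.rvalue**2, t))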

When the slider is at the far right, the test statistics reported are identical to those that would be obtained using a two-sample Student's t-test. This illustrates that doing a median split reduces statistical power, primarily due to the reduction in the inherent variability of the predictor variable.
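
That equivalence is easy to verify directly: regressing the outcome on the fully dichotomized predictor gives the same t and p values as a pooled-variance two-sample t-test comparing the observations above and below the median. A sketch under the same assumptions as above:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x, y = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=20).T

    above = x > np.median(x)
    x_split = np.where(above, x[above].mean(), x[~above].mean())

    fit = stats.linregress(x_split, y)
    t_reg = fit.slope / fit.stderr

    # Pooled-variance (Student's) t-test on the outcomes above vs. below the median.
    t_test = stats.ttest_ind(y[above], y[~above], equal_var=True)

    print("regression on split predictor: t = %.4f, p = %.4f" % (t_reg, fit.pvalue))
    print("two-sample t-test:             t = %.4f, p = %.4f"
          % (t_test.statistic, t_test.pvalue))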

Click on the graph to generate a new sample of 20 observations. Note that due to sampling variability, you may occasionally generate a sample for which doing the median split slightly increases the squared correlation. However, on average, performing the median split reduces the squared correlation to about 64% of what it otherwise would have been (see references below).
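
The 64% figure can be checked numerically. The sketch below (NumPy/SciPy assumed; the very large sample size is only there to approximate the population-level value) estimates how much of the squared correlation survives the median split, and should print a value near 0.64.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)

    # A very large sample approximates the population-level attenuation.
    x, y = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=200_000).T

    above = x > np.median(x)
    x_split = np.where(above, x[above].mean(), x[~above].mean())

    r2_full = stats.linregress(x, y).rvalue ** 2
    r2_split = stats.linregress(x_split, y).rvalue ** 2

    # Share of the squared correlation retained after the split; for a normally
    # distributed predictor split at its median this is about 2/pi, i.e. roughly 0.64.
    print("share of r^2 retained after the split: %.3f" % (r2_split / r2_full))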

It is interesting to observe the movement of points that lie near each other and near the median. Moving the slider from left to right exaggerates the differences between those observations. At the same time, extreme observations are grouped together with observations near the median as the slider moves from left to right. Exaggerating the differences between observations that were originally close together while at the same time minimizing the differences between observations that were originally very far apart cannot possibly be a useful strategy for data analysis.

For further consideration of this example and other negative consequences of dichotomizing continuous variables, see:

Irwin, J.R., & McClelland, G.H. (2003). Negative consequences of dichotomizing continuous predictor variables. Journal of Marketing Research, 40, 366-371.

See also

MacCallum, R.C., Zhang, S., Preacher, K.J., & Rucker, D.D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19-40.