Required for next week: March 5, 1996

Technical Note: multiple regression

In the cookie experiment, I did a simple linear regression of 'taste' on 'price'; that is, I found the least squares line for a plot of taste (on the $y$ axis) against price (on the $x$ axis). The line was not a very good fit to the data, although it did at least have a positive slope. The variable chosen for the $y$ axis is usually called the dependent variable: it is the variable assumed to be influenced by the so-called independent variable on the $x$ axis.

A very common generalization of simple linear regression is to the setting of several independent variables, all thought to potentially influence the dependent variable. An example mentioned in Tainted Truth (p. 61) is the use of multiple regression to analyze the association between coffee intake and, say, heart disease. The dependent variable would typically be the probability of heart disease (more precisely, the log of the odds of heart disease), and Crossen suggests several independent variables in addition to coffee consumption: for example, cigarette consumption, amount of exercise, and fat consumption.

The standard mathematical expression for a multiple regression equation is:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$

where $x_1, x_2, \dots, x_p$ are the independent variables, and $b_0, b_1, \dots, b_p$ are the coefficients of the regression model (with $b_0$ the intercept), all estimated using least squares. (In our cookie example, we had $p=1$, $x_1$ = price, and an estimated value of $b_1$ of 0.89.)
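
To make "estimated using least squares" concrete, here is a minimal sketch in plain Python that solves the least squares problem through the so-called normal equations. The function name `fit_least_squares` and the small data set are invented for illustration; they are not the cookie data. The data were built without any noise from an intercept of 1 and coefficients 2 and 3, so the fit should recover values very close to those.

```python
def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    # Design matrix: a leading 1 for the intercept b0, then x1..xp.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    n, k = len(X), len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented k x (k+1) matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))]
         for r in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        s = sum(A[r][c] * b[c] for c in range(r + 1, k))
        b[r] = (A[r][k] - s) / A[r][r]
    return b


# Hypothetical data (assumption): two independent variables per observation.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
# Dependent variable constructed exactly as y = 1 + 2*x1 + 3*x2.
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefficients = fit_least_squares(rows, y)
print(coefficients)  # intercept first, then one coefficient per variable
```

With real data the points would not lie exactly on the fitted surface, and in practice a statistics package would normally do this calculation; the sketch only shows what the package is solving.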

Each coefficient $b_i$ can be interpreted as a marginal rate: on average, $y$ increases (decreases) by $b_i$ for every unit increase (decrease) in $x_i$, when all other variables are held fixed. In our taste test, the average taste rating went up by 0.89 for every one-dollar increase in price per 100 grams. When several independent variables are used, the hope is that the coefficient for the variable of interest, coffee consumption, say, is an accurate measure of the effect of coffee consumption that is not contaminated by other factors, such as smoking, because they have already been 'controlled for' in the equation.
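
The idea of 'controlling for' another factor can be illustrated with a small sketch. The numbers are entirely made up and do not come from Crossen or from any study. A hypothetical "risk" score (a stand-in for the log odds) is constructed, with no noise, to rise by 0.5 per cup of coffee and by 2 per cigarette, and smokers in this invented data set also tend to drink more coffee. The sketch compares the coffee coefficient from a simple regression with the one from a two-variable regression, using the standard closed-form least squares formulas for two independent variables.

```python
def centered_sum(u, v):
    """Sum of (u - mean(u)) * (v - mean(v)) over all observations."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Hypothetical data (assumption): cups of coffee and cigarettes per day.
coffee = [0, 1, 2, 3, 4, 5]
smoking = [0, 2, 1, 4, 3, 6]
# Risk score built by construction as 0.5 * coffee + 2 * smoking.
risk = [0.5 * c + 2.0 * s for c, s in zip(coffee, smoking)]

s_cc = centered_sum(coffee, coffee)
s_ss = centered_sum(smoking, smoking)
s_cs = centered_sum(coffee, smoking)
s_cy = centered_sum(coffee, risk)
s_sy = centered_sum(smoking, risk)

# Simple regression of risk on coffee alone: smoking is ignored.
simple_slope = s_cy / s_cc

# Multiple regression of risk on coffee and smoking together.
det = s_cc * s_ss - s_cs ** 2
coffee_coef = (s_ss * s_cy - s_cs * s_sy) / det
smoking_coef = (s_cc * s_sy - s_cs * s_cy) / det

print("coffee coefficient, coffee only:", simple_slope)
print("coffee coefficient, smoking controlled for:", coffee_coef)
print("smoking coefficient:", smoking_coef)
```

In this constructed example the coffee-only slope comes out well above 0.5, because coffee is picking up part of the effect of smoking, while the two-variable fit returns the 0.5 and 2 that were built into the data. Real data are noisy, and the adjustment works only for factors that are actually included in the equation.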

Of course, if the data don't follow a line, at least on average, then linear regression doesn't make much sense. The same thing is true for multiple regression: if the equation doesn't fit the data, then the equation isn't telling you much. However, regression is a pretty reliable and simple technique in a lot of cases, and there are lots of data sets for which the model does fit reasonably well.

In the Globe and Mail this week