Linear Regression

Here I go over the motivations, properties, and interpretations of linear regression.

Linear regression is often the first technique taught to undergraduates in a machine learning course. This is a testament to its simplicity, historical significance, and the intuition it provides for designing statistical models. The shortcomings of linear regression also help to motivate the more sophisticated methods that are taught later. In this post I discuss the motivations, properties, and interpretations of linear regression, starting first with simple linear regression.

Predicting House Prices

Suppose we are given a dataset containing the sale price and square footage of a large collection of previously sold homes, and we are tasked with modeling the relationship between these two variables. That is, we’d like to be able to make a reasonably good guess at a home’s sale price as a function of its square footage.

| Sale Price ($) | Area (sq ft) |
| --- | --- |
| 221,900 | 1,180 |
| 180,000 | 770 |
| 604,000 | 1,960 |
| ⋮ | ⋮ |

Figure 1. Sale price and area of 21,613 homes from our dataset.

Where would you start? Beyond just predicting the average sale price, the simplest model would be to assume that the two variables are proportional. More specifically, we assume there is some scalar weight $\beta_1$ which describes how much the sale price is expected to increase for every additional square foot of area.

$$\text{sale price} \propto \text{area} \Longrightarrow \text{sale price} = \beta_1 \cdot \text{area}$$

If the model above were perfectly accurate, the homes in our dataset would fall on a straight line passing through the origin with slope $\beta_1$. Of course, in reality our data does not lie on a straight line, and the above quantities are not equal. However, we can fix our equation by adding a house-specific term $e_i$ for the part of the sale price not explained by our model. For succinctness I’ll transition to using $y$ to denote sale price, $x$ to denote area, and $i$ to denote the $i^{th}$ data point.

$$y = \text{sale price} \quad x = \text{area} \\[5pt] y_i \approx \beta_1 x_i \\[5pt] y_i = \beta_1 x_i + e_i$$

One obvious pitfall of this model is that our line is constrained to pass through the origin. It may be true that a home with 0 area is worth 0 dollars, but perhaps our dataset doesn’t contain any home prices or square footages below a certain point, or perhaps we just want more flexibility. We can remove this constraint by adding an additive factor $\beta_0$ to our model.

$$y_i = \beta_0 + \beta_1 x_i + e_i$$

How then should we go about choosing $\beta_0$ and $\beta_1$, the parameters of our model? Intuitively, the more accurate our model, the smaller our errors should be. We could use the sum of the absolute errors (SAE), $\sum_{i=1}^N |e_i|$, but in practice we usually use the sum of the squared errors (SSE) instead.

$$\begin{aligned} \text{SSE} &= \sum_{i=1}^N e_i^2 \\ &= \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 \end{aligned}$$

The reasons why the SSE is often preferred are non-trivial and beyond the scope of this post, but at least one of them will come up in my next one.
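To make the objective concrete, here is a minimal sketch in Python with NumPy that evaluates the SSE for a candidate $\beta_0$ and $\beta_1$ on the three example homes from Figure 1; the candidate values are arbitrary and only for illustration.

```python
import numpy as np

# The three example homes from Figure 1: area in square feet, price in dollars.
area = np.array([1180.0, 770.0, 1960.0])
price = np.array([221_900.0, 180_000.0, 604_000.0])

def sse(beta0, beta1, x, y):
    """Sum of squared errors for the line y = beta0 + beta1 * x."""
    residuals = y - (beta0 + beta1 * x)
    return float(np.sum(residuals ** 2))

# Arbitrary candidate parameters, purely for illustration.
print(sse(beta0=50_000.0, beta1=250.0, x=area, y=price))
```

The better the line fits these points, the smaller this number gets; fitting is just the search for the $\beta_0$ and $\beta_1$ that make it as small as possible.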

Okay, so we now have our optimization objective:

$$\argmin_{\beta_0, \beta_1} \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2$$

In words, we want to choose the $\beta_0$ and $\beta_1$ that minimize the sum of the squared differences between our data and our predictions. To find the forms of the optimal values we can start by taking the partial derivatives of the SSE with respect to each of our parameters (see Appendix A1).

$$\frac{\partial \text{SSE}}{\partial \beta_0} = \sum_{i = 1}^N -2(y_i - \beta_0 - \beta_1 x_i) \\[5pt] \frac{\partial \text{SSE}}{\partial \beta_1} = \sum_{i = 1}^N -2x_i(y_i - \beta_0 - \beta_1 x_i)$$

As per usual, we can set each derivative equal to zero and solve for the parameters.
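If you want to convince yourself these partial derivatives are right without redoing the algebra, a quick numerical sanity check works too. This is a sketch assuming the same toy data as above; it compares the analytic derivatives to central finite-difference approximations of the SSE.

```python
import numpy as np

x = np.array([1180.0, 770.0, 1960.0])
y = np.array([221_900.0, 180_000.0, 604_000.0])

def sse(b0, b1):
    return np.sum((y - (b0 + b1 * x)) ** 2)

b0, b1, h = 50_000.0, 250.0, 1e-3

# Analytic partial derivatives from the expressions above.
d_b0 = np.sum(-2 * (y - b0 - b1 * x))
d_b1 = np.sum(-2 * x * (y - b0 - b1 * x))

# Central finite-difference approximations of the same derivatives.
fd_b0 = (sse(b0 + h, b1) - sse(b0 - h, b1)) / (2 * h)
fd_b1 = (sse(b0, b1 + h) - sse(b0, b1 - h)) / (2 * h)

print(d_b0, fd_b0)  # each pair should agree closely
print(d_b1, fd_b1)
```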

Finding $\beta_0$

$$\begin{aligned} 0 &= \sum_{i=1}^N -2(y_i - \beta_0 - \beta_1 x_i) \\ &= \sum_{i=1}^N (y_i - \beta_0 - \beta_1 x_i) \\ &= \sum_{i=1}^N y_i - N\beta_0 - \beta_1 \sum_{i=1}^N x_i \\ N\beta_0 &= \sum_{i=1}^N y_i - \beta_1 \sum_{i=1}^N x_i \\ \beta_0 &= \bar{y} - \beta_1 \bar{x} \end{aligned}$$

where a bar over a variable ($\bar{x}$, $\bar{y}$) denotes its average. Now let’s plug our value for $\beta_0$ into $\frac{\partial\text{SSE}}{\partial\beta_1}$, set it equal to $0$, and solve.

Finding $\beta_1$

$$\begin{aligned} 0 &= \sum_{i=1}^N -2x_i(y_i - \beta_0 - \beta_1 x_i) \\ &= \sum_{i=1}^N -2x_i\bigl(y_i - (\bar{y} - \beta_1\bar{x}) - \beta_1 x_i\bigr) \\ &= \sum_{i=1}^N x_i y_i - x_i(\bar{y} - \beta_1\bar{x}) - \beta_1 x_i^2 \\ &= \sum_{i=1}^N x_i y_i - \bar{y}\sum_{i=1}^N x_i + \beta_1\bar{x}\sum_{i=1}^N x_i - \beta_1\sum_{i=1}^N x_i^2 \\ &= \sum_{i=1}^N x_i y_i - N\bar{x}\bar{y} + N\beta_1\bar{x}^2 - \beta_1\sum_{i=1}^N x_i^2 \\ \beta_1\biggl(\sum_{i=1}^N x_i^2 - N\bar{x}^2 \biggr) &= \sum_{i=1}^N x_i y_i - N\bar{x}\bar{y} \\ \beta_1 &= \frac{\sum_{i=1}^N x_i y_i - N\bar{x}\bar{y}}{\sum_{i=1}^N x_i^2 - N\bar{x}^2} \end{aligned}$$

This form for $\beta_1$ is completely valid and we could stop here. However, with some rearrangement we can get $\beta_1$ into a more familiar form. Let’s tackle the numerator first, using the fact that $\sum_{i=1}^N \bar{y} x_i = \sum_{i=1}^N \bar{x} y_i = \sum_{i=1}^N \bar{x}\bar{y} = N\bar{x}\bar{y}$.

$$\begin{aligned} \sum_{i=1}^N x_i y_i - N\bar{x}\bar{y} &= \sum_{i=1}^N x_i y_i - N\bar{x}\bar{y} - N\bar{x}\bar{y} + N\bar{x}\bar{y} \\ &= \sum_{i=1}^N \bigl( x_i y_i - \bar{y} x_i - \bar{x} y_i + \bar{x}\bar{y}\bigr) \\ &= \sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y}) \end{aligned}$$

And now the denominator…

$$\begin{aligned} \sum_{i=1}^N x_i^2 - N\bar{x}^2 &= \sum_{i=1}^N x_i^2 - N\bar{x}^2 - N\bar{x}^2 + N\bar{x}^2 \\ &= \sum_{i=1}^N \bigl( x_i^2 - 2\bar{x}x_i + \bar{x}^2 \bigr) \\ &= \sum_{i=1}^N (x_i - \bar{x})^2 \end{aligned}$$

Now we can rewrite $\beta_1$:

$$\beta_1 = \frac{\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^N (x_i - \bar{x})^2}$$

In this form, we can now clearly see that our parameter estimate for $\beta_1$ is the covariance between our independent and dependent variables divided by the variance of our independent variable.

$$\beta_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)}$$

If you aren’t familiar, the variance of a random variable is defined as the average squared distance from that random variable’s mean $\mu$. It is a measure of the spread of that variable, of how much it varies.

$$\text{Var}(x) = \frac{1}{N} \sum_{i=1}^N (x_i - \mu_x)^2$$

The covariance is a measure of the joint variability between two random variables. That is, it measures the degree to which two variables change together. The covariance has the following form.

$$\text{Cov}(x, y) = \frac{1}{N}\sum_{i=1}^N (x_i - \mu_x)(y_i - \mu_y)$$

Let’s zoom in on the covariance formula, starting with just a single element of the sum. If $x_i$ were below its mean and $y_i$ were below its mean, the negative signs would cancel and the $i^{th}$ term would make a positive contribution to the covariance. If instead $x_i$ were below its mean and $y_i$ were above its mean, that term would make a negative contribution to the covariance. The covariance as a whole averages these contributions, and so if small values of $x$ tend to co-occur with large values of $y$ and vice versa, the covariance will be negative, indicating a negative linear relationship between the two variables.
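Here is a small sketch computing both quantities with the $1/N$ formulas above (same toy data as before) and recovering the slope from their ratio. One caveat if you reach for NumPy’s built-ins: `np.var` defaults to the $1/N$ normalization while `np.cov` defaults to $1/(N-1)$, so make sure both use the same convention; in the ratio the normalization cancels either way.

```python
import numpy as np

x = np.array([1180.0, 770.0, 1960.0])
y = np.array([221_900.0, 180_000.0, 604_000.0])

# "Population" (1/N) variance and covariance, matching the formulas above.
var_x = np.mean((x - x.mean()) ** 2)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

beta1 = cov_xy / var_x  # same slope as the closed-form estimate above
print(beta1)
```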

Figure 2. Data with a range of covariances.

How does this inform our interpretation of $\beta_1$? Recall that $\beta_1$ parameterizes the slope of our linear model, that is, how much $y$ changes for every unit change in $x$. The covariance between $x$ and $y$ determines whether that change is downward or upward (the sign of the slope), while dividing by the variance of $x$ rescales this joint variability by the spread of $x$, turning it into a change in $y$ per unit change in $x$.
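Putting the closed-form estimates into code makes a nice end-to-end check. This is a minimal sketch using NumPy and the three example homes from earlier; as a cross-check, `np.polyfit` with `deg=1` fits a line by minimizing the same squared-error criterion, so it should recover essentially the same slope and intercept.

```python
import numpy as np

x = np.array([1180.0, 770.0, 1960.0])            # area
y = np.array([221_900.0, 180_000.0, 604_000.0])  # sale price

x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# np.polyfit(deg=1) also minimizes the sum of squared errors for a line,
# so it should agree with the closed-form estimates.
slope, intercept = np.polyfit(x, y, deg=1)
print(beta1, slope)
print(beta0, intercept)
```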

$$\text{sale price} = w_\text{area} \cdot \text{area} + w_\text{bedrooms} \cdot \text{bedrooms} + w_\text{bathrooms} \cdot \text{bathrooms} + \dots + \varepsilon$$

where $w_\text{feature}$ denotes the weight assigned to that particular feature and $\varepsilon$ is the unobserved contribution to the price that isn’t captured by our features, such as market randomness. We include $\varepsilon$ to ensure that the left-hand side and right-hand side are equal, and we refer to $\varepsilon$ as the “errors”. Importantly, we think of the weights $w$ in the model above as describing some actual, ground-truth process from which we obtain noisy observations (our data). Our goal then is to use those observations to find an approximation of $w$, which we denote $\hat{w}$, that closely matches the true weights. This process is called fitting. To start, let’s rewrite our model using mathematical notation. In doing so we will sacrifice specificity for generality and succinctness, as well as gain access to a toolbox of mathematical techniques.

First, we will simplify our model and only use one feature, the area (in square feet). Let’s denote the area and sale price of the $i^{th}$ home in our dataset as $x_i$ and $y_i$ respectively. Our model then looks like

$$y_i = x_i w + \varepsilon_i$$

Here $w$ is the slope of our line, or how much the sale price changes for an increase in the area of the home. If we add an offset term $w_0$ you might recognize this as the slope-intercept form of a line from algebra class.

$$y_i = x_i w + w_0 + \varepsilon_i$$

How then can we estimate our model parameters, $w$ and $w_0$? In truth we don’t know the true relationship, so the best we can do is estimate it from our data. In machine learning we conceptualize fitting (and learning more generally) as an optimization problem, and to solve an optimization problem we need some function to optimize. We want a function that measures how good a particular $w$ and $w_0$ are at fitting our data. In practice we usually formulate this function as measuring how bad our choice of parameters is, and we try to find the parameters which minimize this function (in my last post on maximum likelihood estimation we took the opposite approach and formulated the optimization as a maximization problem). There are many such functions, referred to as loss functions, that we could choose, and they will all yield slightly different results, but for linear regression the most common choice is the least squared error (LSE). If we have $N$ different data points, the LSE is defined as

$$\text{LSE} = \sum_{i=1}^N (y_i - x_i w - w_0)^2$$

In words, the LSE is the sum over our dataset of the squared differences between our model’s guess at $y_i$ and the actual sale price, given a particular choice of $w$ and $w_0$. The LSE can only equal $0$ when our estimates of all the sale prices in our dataset are perfectly correct. Our model is linear, so if this were the case it would mean that our data lies perfectly on a straight line. Obviously this is highly unrealistic, but if our data follows a linear trend, then optimizing the LSE can result in a decent predictive model.
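To get a feel for the LSE as a “badness” score, here is a small sketch (same toy homes as before, a few arbitrary candidate parameter settings) that evaluates the loss for each candidate; the fit we ultimately want is whichever setting drives this number lowest.

```python
import numpy as np

x = np.array([1180.0, 770.0, 1960.0])            # area (square feet)
y = np.array([221_900.0, 180_000.0, 604_000.0])  # sale price (dollars)

def lse(w, w0):
    """Least squared error for the model y = x * w + w0."""
    return float(np.sum((y - x * w - w0) ** 2))

# Score a few arbitrary candidate parameter settings; smaller is better.
for w, w0 in [(100.0, 0.0), (250.0, 0.0), (250.0, 50_000.0)]:
    print(w, w0, lse(w, w0))
```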

You might be asking yourself: why optimize the squared differences rather than the absolute differences, isn’t that more intuitive? Admittedly, optimizing the absolute differences is a more intuitive approach; however, it is actually less common than using the LSE. This is largely for practical reasons. The LSE has a closed-form solution, which we will derive shortly, whereas the least absolute differences approach requires iterative approximation to converge on an answer. In addition, because the LSE grows quadratically, we end up penalizing bad choices of parameters much more harshly than if we were to use absolute differences: an error of 10 adds 10 to the absolute criterion but 100 to the squared one.

Okay, with all that setup, let’s find our parameter values. We are trying to solve the following optimization problem.

$$\hat{w}_\text{LSE}, \hat{w}_0 = \argmin_{w, w_0} \sum_{i=1}^N (y_i - x_i w - w_0)^2$$

As is the case in most optimization problems, we will try to find the values of our parameters at which the derivative of our function is equal to $0$. Let’s start with finding our coefficient.

Finding $\hat{w}_\text{LSE}$

We begin by taking the derivative of the LSE with respect to $w$.

$$\begin{aligned} \frac{\partial \text{LSE}}{\partial w} &= \frac{\partial}{\partial w}\sum_{i=1}^N (y_i - x_i w - w_0)^2 \\ &= \sum_{i=1}^N 2(y_i - x_i w - w_0)(-x_i) \\ &= -2\sum_{i=1}^N \bigl( x_i y_i - x_i^2 w - x_i w_0 \bigr) \end{aligned}$$

Rather than writing our dataset as a table, we will write it as pairs of house features and sale prices. We’ll wrap our features into a vector, denoted in bold as $\mathbf{x}_i$, and we will denote our sale prices as $y_i$, where $i$ is an indexing variable used to denote the $i^{th}$ entry in our dataset. You will often hear $y_i$ referred to as a response variable. Using our particular features, one entry in our dataset now looks like this

$$\mathbf{x}_i = \begin{bmatrix} \text{area} \\ \text{bedrooms} \\ \text{bathrooms} \\ \vdots \end{bmatrix}_i \quad y_i = \text{sale price}$$

This is a good way to understand where our data comes from, but for most of the remainder of this article I’ll use a more general formulation.
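In code, packing one raw record into this representation might look like the sketch below; the field names and values here are illustrative only, not taken directly from the dataset.

```python
import numpy as np

# A raw record for one home; field names and values are illustrative only.
home = {"area": 1180.0, "bedrooms": 3.0, "bathrooms": 1.0, "sale_price": 221_900.0}

feature_names = ["area", "bedrooms", "bathrooms"]
x_i = np.array([home[name] for name in feature_names])  # feature vector x_i
y_i = home["sale_price"]                                # response variable y_i
print(x_i, y_i)
```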

We will denote the $D$ features of our $i^{th}$ datapoint as a vector $\mathbf{x}_i\in\mathbb{R}^D$, our $D$ weights as a vector $\mathbf{w}\in\mathbb{R}^D$, and our error terms and response variables as scalars $y_i, \varepsilon_i\in\mathbb{R}$.

$$\mathbf{x}_i = \begin{bmatrix} x_{i, 1} \\ \vdots \\ x_{i, d} \\ \vdots \\ x_{i, D} \end{bmatrix} \quad \mathbf{w} = \begin{bmatrix} w_1 \\ \vdots \\ w_d \\ \vdots \\ w_D \end{bmatrix}$$

In linear regression we model our response variables as a linear combination of the elements in the corresponding feature vector. For the $i^{th}$ data point we have

$$y_i = x_{i, 1}w_1 + \dots + x_{i, d}w_d + \dots + x_{i, D} w_D + \varepsilon_i$$

We can write this more succinctly using vector notation (A1).

$$y_i = \mathbf{x}_i^\top\mathbf{w} + \varepsilon_i$$

We can write out the model for our whole dataset if we collect our responses and errors into vectors and our features into a matrix.

$$\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\varepsilon}$$

Each row of $\mathbf{X}$ is now a different data point and the columns correspond to that data point’s features. If we have $N$ data points,

$$\mathbf{X}\in\mathbb{R}^{N\times D} \quad \mathbf{X} = \begin{bmatrix} \rule[.5ex]{2.5ex}{0.5pt} & \mathbf{x}_1^\top & \rule[.5ex]{2.5ex}{0.5pt} \\ & \vdots & \\ \rule[.5ex]{2.5ex}{0.5pt} & \mathbf{x}_i^\top & \rule[.5ex]{2.5ex}{0.5pt} \\ & \vdots & \\ \rule[.5ex]{2.5ex}{0.5pt} & \mathbf{x}_N^\top & \rule[.5ex]{2.5ex}{0.5pt} \end{bmatrix} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,d} & \cdots & x_{1,D} \\ \vdots & & \vdots & & \vdots \\ x_{i,1} & \cdots & x_{i,d} & \cdots & x_{i,D} \\ \vdots & & \vdots & & \vdots \\ x_{N,1} & \cdots & x_{N,d} & \cdots & x_{N,D} \end{bmatrix}$$
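To see the matrix form in action, here is a minimal sketch with synthetic data; the “true” weights, feature ranges, and noise scale are all made up for illustration. `np.linalg.lstsq` minimizes the same sum-of-squared-errors criterion we have been using, so the recovered weights should land close to the ones that generated the data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1_000, 3

w_true = np.array([300.0, 10_000.0, 5_000.0])  # made-up "ground truth" weights
X = rng.uniform(0.0, 10.0, size=(N, D))        # N x D design matrix
eps = rng.normal(0.0, 5.0, size=N)             # unobserved errors
y = X @ w_true + eps                           # y = Xw + eps

# Least-squares fit: the w that minimizes ||y - Xw||^2.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # should be close to w_true
```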

Appendix

A1

Both partial derivatives of the $\text{SSE}$ are obtained through applications of the chain rule.

$$\begin{aligned} \frac{\partial \text{SSE}}{\partial \beta_0} &= \frac{\partial}{\partial \beta_0}\sum_{i=1}^N (y_i - \beta_0 - \beta_1 x_i)^2 \\ &= \sum_{i=1}^N 2(y_i - \beta_0 - \beta_1 x_i)\cdot\frac{\partial}{\partial \beta_0}(y_i - \beta_0 - \beta_1 x_i) \\ &= \sum_{i=1}^N -2(y_i - \beta_0 - \beta_1 x_i) \\[10pt] \frac{\partial \text{SSE}}{\partial \beta_1} &= \frac{\partial}{\partial \beta_1}\sum_{i=1}^N (y_i - \beta_0 - \beta_1 x_i)^2 \\ &= \sum_{i=1}^N 2(y_i - \beta_0 - \beta_1 x_i)\cdot\frac{\partial}{\partial \beta_1}(y_i - \beta_0 - \beta_1 x_i) \\ &= \sum_{i=1}^N -2x_i(y_i - \beta_0 - \beta_1 x_i) \end{aligned}$$
Both partial derivatives of the SSE\text{SSE} are obtained through applications of the chain rule SSEβ0= \begin{aligned} \frac{\partial \text{SSE}}{\partial \beta_0} &= \end{aligned}