Linear Regression

Here I go over the motivations, properties, and interpretations of linear regression, specifically ordinary least squares regression.

Linear regression is often the first technique taught to undergraduates in a machine learning course. This is a testament to its simplicity, historical significance, and the intuition it provides for designing statistical models. In this post I derive and discuss ordinary least squares (OLS), a particular flavor of linear regression that is most commonly taught in an introductory ML course.

Predicting House Prices

Suppose we are given a dataset containing the details of house sales in the US. The dataset contains the square footage, number of bedrooms, number of bathrooms, the sale price, and other features of a large collection of previously sold houses.

| Sale Price ($) | Area (sq ft) | Bedrooms | Bathrooms |
|---|---|---|---|
| 221900 | 1180 | 3 | 1 |
| 180000 | 770 | 2 | 1 |
| 604000 | 1960 | 4 | 3 |

The challenge we face is to predict the final sale price using these features. That is, we want to model the relationship between the sale price and our other features such as area, bedrooms, bathrooms, etc. In linear regression, the critical assumption is that this relationship is linear, specifically a weighted sum

$$\text{sale price} = \text{area}\cdot w_\text{area} + \text{bedrooms}\cdot w_\text{bedrooms} + \text{baths}\cdot w_\text{baths} + \dots + \varepsilon$$

where $w_\text{feature}$ denotes the weight assigned to that particular feature and $\varepsilon$ is the unobserved contribution to the price that isn’t captured by our features, such as market randomness. We include $\varepsilon$ to ensure that the left-hand side and right-hand side are equal, and we refer to $\varepsilon$ as the “errors”.

Importantly, we think of the weights $w$ in the model above as describing some actual, ground-truth process from which we obtain noisy observations (our data). Our goal is to use those observations to find an approximation of $w$, denoted $\hat{w}$, that closely matches the true weights. This process is called fitting. To start, let’s rewrite our model using mathematical notation. In doing so we will sacrifice some specificity for generality and succinctness, and gain access to a toolbox of mathematical techniques.
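To make the weighted-sum model concrete, here is a minimal sketch in Python with hand-picked, purely illustrative weights (the numbers are assumptions for demonstration, not fitted values):

```python
# Hypothetical weights, chosen by hand purely for illustration; not fitted values.
weights = {"area": 280.0, "bedrooms": -5000.0, "bathrooms": 15000.0}

def predict_price(features):
    """Weighted sum of the features; whatever this misses is the error term."""
    return sum(weights[name] * value for name, value in features.items())

# First house from the table above: 1180 sq ft, 3 bedrooms, 1 bathroom.
print(predict_price({"area": 1180, "bedrooms": 3, "bathrooms": 1}))
```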

First, we will simplify our model and use only one feature, the area (in square feet). Let’s denote the area and sale price of the $i^{th}$ home in our dataset as $x_i$ and $y_i$ respectively. Our model then looks like

$$y_i = x_iw + \varepsilon_i$$

Here $w$ is the slope of our line, or how much the sale price changes for a unit increase in the area of the home. If we add an offset term $w_0$, you might recognize this as the slope-intercept form of a line from algebra class.

$$y_i = x_iw + w_0 + \varepsilon_i$$

How then can we estimate our model parameters, $w$ and $w_0$? In truth we don’t know the true relationship; we only have our noisy observations. In machine learning we conceptualize fitting (and learning more generally) as an optimization problem, and to solve an optimization problem we need some function to optimize. We want a function that measures how good a particular $w$ and $w_0$ are at fitting our data. In practice we usually formulate this function as measuring how bad our choice of parameters is, and we try to find the parameters which minimize this function (in my last post on maximum likelihood estimation we took the opposite approach and formulated the optimization as a maximization problem). There are many such functions, referred to as loss functions, that we could choose, and they will all yield slightly different results, but for linear regression the most common choice is the least squared error (LSE). If we have $N$ different data points, the LSE is defined as

$$\text{LSE} = \sum_{i=1}^N (y_i - x_iw - w_0)^2$$

In words, the LSE is the sum over our dataset of the squared differences between our model’s guess at $y_i$ and the actual sale price, given a particular choice of $w$ and $w_0$. The LSE can only equal $0$ when our estimates of all the sale prices in our dataset are perfectly correct. Our model is linear, so if this were the case, it would mean that our data lies perfectly on a straight line. Obviously this is highly unrealistic, but if our data follows a linear trend, then optimizing the LSE can result in a decent predictive model.
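As a quick sketch, here is the LSE computed on the three houses from the table above for an arbitrary, unfitted guess at the parameters (the particular values of $w$ and $w_0$ below are made up):

```python
import numpy as np

# Areas (sq ft) and sale prices from the small table above.
x = np.array([1180.0, 770.0, 1960.0])
y = np.array([221900.0, 180000.0, 604000.0])

def lse(w, w0):
    """Sum of squared differences between the predictions x*w + w0 and the true prices."""
    residuals = y - (x * w + w0)
    return np.sum(residuals ** 2)

# An arbitrary, unfitted guess at the parameters, purely for illustration.
print(lse(w=300.0, w0=10000.0))
```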

You might be asking yourself: why optimize the squared differences rather than the absolute differences? Isn’t that more intuitive? Admittedly, optimizing the absolute differences is a more intuitive approach; however, it is actually less common than using the LSE, largely for practical reasons. The LSE has a closed-form solution, which we will derive shortly, whereas least absolute differences requires iterative approximation to converge on an answer. In addition, because the LSE grows quadratically in the size of the errors, it penalizes bad choices of parameters much more harshly than absolute differences would.
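That last point is easy to see numerically: doubling a residual doubles its absolute penalty but quadruples its squared penalty. A trivial sketch:

```python
# How the two penalties grow as a single residual gets larger.
for residual in [1.0, 2.0, 4.0, 8.0]:
    print(f"residual={residual:4.1f}  absolute={abs(residual):5.1f}  squared={residual ** 2:5.1f}")
```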

Okay, with all that setup, let’s find our parameter values. We are trying to solve the following optimization problem.

$$\hat{w}_\text{LSE}, \hat{w}_0 = \argmin_{w, w_0} \sum_{i=1}^N (y_i - x_iw - w_0)^2$$

As is the case in most optimization problems, we will try to find the values of our parameters at which the derivative of our function is equal to $0$. Let’s start with finding our coefficient.
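Before doing any calculus, we can sanity-check the optimization framing numerically. The sketch below runs a crude grid search over candidate values of $w$ and $w_0$ on the toy data from earlier; it only illustrates the “pick the parameters that minimize the loss” idea and is no substitute for the closed-form solution (the grid ranges and resolution are arbitrary assumptions):

```python
import numpy as np

# Toy data from the table above: areas (sq ft) and sale prices.
x = np.array([1180.0, 770.0, 1960.0])
y = np.array([221900.0, 180000.0, 604000.0])

def lse(w, w0):
    return np.sum((y - (x * w + w0)) ** 2)

# Crude grid search over candidate parameters; ranges and resolution are arbitrary.
candidates = [(w, w0) for w in np.linspace(0, 600, 301)
                      for w0 in np.linspace(-200_000, 200_000, 201)]
best_w, best_w0 = min(candidates, key=lambda p: lse(*p))
print(best_w, best_w0)  # rough minimizers of the LSE on this tiny dataset
```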

Finding $\hat{w}_\text{LSE}$

We begin by taking the derivative of the LSE with respect to $w$.

$$\begin{aligned}
\frac{\partial \text{LSE}}{\partial w} &= \frac{\partial}{\partial w}\sum_{i=1}^N (y_i - x_iw - w_0)^2 \\
&= \sum_{i=1}^N 2(y_i - x_iw - w_0)(-x_i) \\
&= -2\sum_{i=1}^N \left( x_iy_i - x_i^2w - x_iw_0 \right)
\end{aligned}$$
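As a quick numerical check of this derivative, we can compare it against a central finite-difference approximation on the toy data (the parameter values below are arbitrary):

```python
import numpy as np

# Toy data from the table above.
x = np.array([1180.0, 770.0, 1960.0])
y = np.array([221900.0, 180000.0, 604000.0])

def lse(w, w0):
    return np.sum((y - (x * w + w0)) ** 2)

w, w0 = 300.0, 10000.0  # arbitrary, unfitted parameter values

# The derivative we just derived: -2 * sum_i x_i * (y_i - x_i*w - w0)
analytic = -2.0 * np.sum(x * (y - x * w - w0))

# Central finite-difference approximation of the same derivative.
eps = 1e-4
numeric = (lse(w + eps, w0) - lse(w - eps, w0)) / (2 * eps)

print(analytic, numeric)  # the two values should agree closely
```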

Rather than writing our dataset as a table, we will write it as pairs of house features and sale prices. We’ll wrap our features into a vector, denoted in bold as $\mathbf{x}_i$, and we will denote our sale prices as $y_i$, where $i$ is an indexing variable used to denote the $i^{th}$ entry in our dataset. You will often hear $y_i$ referred to as a response variable. Using our particular features, one entry in our dataset now looks like this

$$\mathbf{x}_i = \begin{bmatrix} \text{area} \\ \text{bedrooms} \\ \text{bathrooms} \\ \vdots \end{bmatrix}_i \quad y_i = \text{sale price}$$

This is a good way to understand where our data comes from, but for most of the remainder of this article I’ll use a more general formulation.

We will denote the $D$ features of our $i^{th}$ data point as a vector $\mathbf{x}_i\in\mathbb{R}^D$, our $D$ weights as a vector $\mathbf{w}\in\mathbb{R}^D$, and our error terms and response variables as scalars $y_i, \varepsilon_i\in\mathbb{R}$.

$$\mathbf{x}_i = \begin{bmatrix} x_{i, 1} \\ \vdots \\ x_{i, d} \\ \vdots \\ x_{i, D} \end{bmatrix} \quad \mathbf{w} = \begin{bmatrix} w_1 \\ \vdots \\ w_d \\ \vdots \\ w_D \end{bmatrix}$$

In linear regression we model our response variables as a linear combination of the elements in the corresponding feature vector. For the $i^{th}$ data point we have

$$y_i = x_{i, 1}w_1 + \dots + x_{i, d}w_d + \dots + x_{i, D} w_D + \varepsilon_i$$

We can write this more succinctly using vector notation (see A1 in the appendix).

$$y_i = \mathbf{x}_i^\top\mathbf{w} + \varepsilon_i$$

We can write out the model for our whole dataset if we collect our responses and errors into vectors and our features into a matrix.

$$\mathbf{y} = \mathbf{Xw} + \boldsymbol{\varepsilon}$$

Each row of $\mathbf{X}$ is a different data point and the columns correspond to that data point’s features. If we have $N$ data points,

$$\mathbf{X}\in\mathbb{R}^{N\times D}\quad
\mathbf{X} = \begin{bmatrix}
\rule[.5ex]{2.5ex}{0.5pt} & \mathbf{x}_1^\top & \rule[.5ex]{2.5ex}{0.5pt} \\
& \vdots & \\
\rule[.5ex]{2.5ex}{0.5pt} & \mathbf{x}_i^\top & \rule[.5ex]{2.5ex}{0.5pt} \\
& \vdots & \\
\rule[.5ex]{2.5ex}{0.5pt} & \mathbf{x}_N^\top & \rule[.5ex]{2.5ex}{0.5pt}
\end{bmatrix} = \begin{bmatrix}
x_{1,1} & \cdots & x_{1,d} & \cdots & x_{1,D} \\
\vdots & & \vdots & & \vdots \\
x_{i,1} & \cdots & x_{i,d} & \cdots & x_{i,D} \\
\vdots & & \vdots & & \vdots \\
x_{N,1} & \cdots & x_{N,d} & \cdots & x_{N,D}
\end{bmatrix}$$
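Here is a small numpy sketch of this matrix form, with made-up weights and noise, mainly to show the shapes involved (the solver call at the end uses numpy’s built-in least-squares routine, not anything we have derived here):

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 5, 3                               # 5 data points, 3 features
X = rng.normal(size=(N, D))               # design matrix: one row per data point
w_true = np.array([1.5, -2.0, 0.5])       # made-up "ground truth" weights
eps = rng.normal(scale=0.1, size=N)       # unobserved errors

y = X @ w_true + eps                      # y = Xw + eps for the whole dataset

# Recover an estimate of w with numpy's built-in least-squares solver.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)                              # should be close to w_true
```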

Appendix

A1
In vector notation, $\mathbf{x}^\top\mathbf{w}$ (read “x transpose w”) is used to denote a dot product.

$$\mathbf{x}, \mathbf{w} \in \mathbb{R}^D \qquad \mathbf{x}^\top\mathbf{w} = x_1w_1 + x_2w_2 + \dots + x_D w_D$$
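For example, a trivial numpy check that the dot-product notation and the explicit sum agree:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

# x^T w as a dot product, and the same quantity written as an explicit sum.
print(x @ w)          # 32.0
print(np.sum(x * w))  # 32.0
```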