Linear Regression

Here I go over the motivations, properties, and interpretations of linear regression, specifically ordinary least squares regression.

Linear regression is often the first technique taught to undergraduates in a machine learning course. This is a testament to its simplicity, historical significance, and the intuition it provides for designing statistical models. In this post I derive and discuss ordinary least squares (OLS), a particular flavor of linear regression that is most commonly taught in an introductory ML course.

Predicting House Prices

Suppose we are given a dataset containing the details of house sales in the US. The dataset contains the square footage, number of bedrooms, number of bathrooms, the sale price, and other features of a large collection of previously sold houses.

| Sale Price ($) | Area (sq ft) | Bedrooms | Bathrooms |
|---|---|---|---|
| 221900 | 1180 | 3 | 1 |
| 180000 | 770 | 2 | 1 |
| 604000 | 1960 | 4 | 3 |

The challenge we face is to predict the final sale price using these features. That is, we want to model the relationship between the sale price and our other features such as area, bedrooms, bathrooms, etc. In linear regression, the critical assumption is that this relationship is linear, specifically a weighted sum

$$\text{sale price} = \text{area}\cdot w_\text{area} + \text{bedrooms}\cdot w_\text{bedrooms} + \text{baths}\cdot w_\text{baths} + \dots + \varepsilon$$

where $w_\text{feature}$ denotes the weight assigned to that particular feature and $\varepsilon$ is the unobserved contribution to the price that isn’t captured by our features, such as market randomness. We include $\varepsilon$ to ensure that the left-hand side and right-hand side are equal, and we refer to $\varepsilon$ as the “errors”.

Importantly, we think of the weights $w$ in the model above as describing some actual, ground-truth process from which we obtain noisy observations (our data). Our goal is to use those observations to find an approximation of $w$, denoted $\hat{w}$, that closely matches the true weights. This process is called fitting. To start, let’s rewrite our model using mathematical notation. In doing so we will sacrifice some specificity for generality and succinctness, and gain access to a toolbox of mathematical techniques.
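To make the weighted-sum model concrete, here is a minimal sketch in Python with hand-picked, purely illustrative weights (the numbers are assumptions for demonstration, not fitted values):

```python
# Hypothetical weights, chosen by hand purely for illustration; not fitted values.
weights = {"area": 280.0, "bedrooms": -5000.0, "bathrooms": 15000.0}

def predict_price(features):
    """Weighted sum of the features; whatever this misses is the error term."""
    return sum(weights[name] * value for name, value in features.items())

# First house from the table above: 1180 sq ft, 3 bedrooms, 1 bathroom.
print(predict_price({"area": 1180, "bedrooms": 3, "bathrooms": 1}))
```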

First, we will simplify our model and use only one feature, the area (in square feet). Let’s denote the area and sale price of the $i^{th}$ home in our dataset as $x_i$ and $y_i$ respectively. Our model then looks like

$$y_i = x_iw + \varepsilon_i$$

Here $w$ is the slope of our line, or how much the sale price changes for a unit increase in the area of the home. If we add an offset term $w_0$, you might recognize this as the slope-intercept form of a line from algebra class.

$$y_i = x_iw + w_0 + \varepsilon_i$$

How then can we estimate our model parameters, $w$ and $w_0$? In truth we don’t know the true relationship; we only have our noisy observations. In machine learning we conceptualize fitting (and learning more generally) as an optimization problem, and to solve an optimization problem we need some function to optimize. We want a function that measures how good a particular $w$ and $w_0$ are at fitting our data. In practice we usually formulate this function as measuring how bad our choice of parameters is, and we try to find the parameters which minimize this function (in my last post on maximum likelihood estimation we took the opposite approach and formulated the optimization as a maximization problem). There are many such functions, referred to as loss functions, that we could choose, and they will all yield slightly different results, but for linear regression the most common choice is the least squared error (LSE). If we have $N$ different data points, the LSE is defined as

$$\text{LSE} = \sum_{i=1}^N (y_i - x_iw - w_0)^2$$

In words, the LSE is the sum over our dataset of the squared differences between our model’s guess at $y_i$ and the actual sale price, given a particular choice of $w$ and $w_0$. The LSE can only equal $0$ when our estimates of all the sale prices in our dataset are perfectly correct. Our model is linear, so if this were the case, it would mean that our data lies perfectly on a straight line. Obviously this is highly unrealistic, but if our data follows a linear trend, then optimizing the LSE can result in a decent predictive model.
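As a quick sketch, here is the LSE computed on the three houses from the table above for an arbitrary, unfitted guess at the parameters (the particular values of $w$ and $w_0$ below are made up):

```python
import numpy as np

# Areas (sq ft) and sale prices from the small table above.
x = np.array([1180.0, 770.0, 1960.0])
y = np.array([221900.0, 180000.0, 604000.0])

def lse(w, w0):
    """Sum of squared differences between the predictions x*w + w0 and the true prices."""
    residuals = y - (x * w + w0)
    return np.sum(residuals ** 2)

# An arbitrary, unfitted guess at the parameters, purely for illustration.
print(lse(w=300.0, w0=10000.0))
```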

You might be asking yourself: why optimize the squared differences rather than the absolute differences? Isn’t that more intuitive? Admittedly, optimizing the absolute differences is a more intuitive approach; however, it is actually less common than using the LSE, largely for practical reasons. The LSE has a closed-form solution, which we will derive shortly, whereas least absolute differences requires iterative approximation to converge on an answer. In addition, because the LSE grows quadratically in the size of the errors, it penalizes bad choices of parameters much more harshly than absolute differences would.
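That last point is easy to see numerically: doubling a residual doubles its absolute penalty but quadruples its squared penalty. A trivial sketch:

```python
# How the two penalties grow as a single residual gets larger.
for residual in [1.0, 2.0, 4.0, 8.0]:
    print(f"residual={residual:4.1f}  absolute={abs(residual):5.1f}  squared={residual ** 2:5.1f}")
```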

Okay, with all that setup, let’s find our parameter values. We are trying to solve the following optimization problem.

$$\hat{w}_\text{LSE}, \hat{w}_0 = \argmin_{w, w_0} \sum_{i=1}^N (y_i - x_iw - w_0)^2$$

As is the case in most optimization problems, we will try to find the values of our parameters at which the derivative of our function is equal to $0$. Let’s start with finding our coefficient.
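Before doing any calculus, we can sanity-check the optimization framing numerically. The sketch below runs a crude grid search over candidate values of $w$ and $w_0$ on the toy data from earlier; it only illustrates the “pick the parameters that minimize the loss” idea and is no substitute for the closed-form solution (the grid ranges and resolution are arbitrary assumptions):

```python
import numpy as np

# Toy data from the table above: areas (sq ft) and sale prices.
x = np.array([1180.0, 770.0, 1960.0])
y = np.array([221900.0, 180000.0, 604000.0])

def lse(w, w0):
    return np.sum((y - (x * w + w0)) ** 2)

# Crude grid search over candidate parameters; ranges and resolution are arbitrary.
candidates = [(w, w0) for w in np.linspace(0, 600, 301)
                      for w0 in np.linspace(-200_000, 200_000, 201)]
best_w, best_w0 = min(candidates, key=lambda p: lse(*p))
print(best_w, best_w0)  # rough minimizers of the LSE on this tiny dataset
```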

Finding $\hat{w}_\text{LSE}$

We begin by taking the derivative of the LSE with respect to $w$.

$$\begin{aligned}
\frac{\partial \text{LSE}}{\partial w} &= \frac{\partial}{\partial w}\sum_{i=1}^N (y_i - x_iw - w_0)^2 \\
&= \sum_{i=1}^N 2(y_i - x_iw - w_0)(-x_i) \\
&= -2\sum_{i=1}^N \left( x_iy_i - x_i^2w - x_iw_0 \right)
\end{aligned}$$
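As a quick numerical check of this derivative, we can compare it against a central finite-difference approximation on the toy data (the parameter values below are arbitrary):

```python
import numpy as np

# Toy data from the table above.
x = np.array([1180.0, 770.0, 1960.0])
y = np.array([221900.0, 180000.0, 604000.0])

def lse(w, w0):
    return np.sum((y - (x * w + w0)) ** 2)

w, w0 = 300.0, 10000.0  # arbitrary, unfitted parameter values

# The derivative we just derived: -2 * sum_i x_i * (y_i - x_i*w - w0)
analytic = -2.0 * np.sum(x * (y - x * w - w0))

# Central finite-difference approximation of the same derivative.
eps = 1e-4
numeric = (lse(w + eps, w0) - lse(w - eps, w0)) / (2 * eps)

print(analytic, numeric)  # the two values should agree closely
```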

Rather than writing our dataset as a table, we will write it as pairs of house features and sale prices. We’ll wrap our features into a vector, denoted in bold as $\mathbf{x}_i$, and we will denote our sale prices as $y_i$, where $i$ is an indexing variable used to denote the $i^{th}$ entry in our dataset. You will often hear $y_i$ referred to as a response variable. Using our particular features, one entry in our dataset now looks like this

$$\mathbf{x}_i = \begin{bmatrix} \text{area} \\ \text{bedrooms} \\ \text{bathrooms} \\ \vdots \end{bmatrix}_i \quad y_i = \text{sale price}$$

This is a good way to understand where our data comes from, but for most of the remainder of this article I’ll use a more general formulation.

We will denote the $D$ features of our $i^{th}$ data point as a vector $\mathbf{x}_i\in\mathbb{R}^D$, our $D$ weights as a vector $\mathbf{w}\in\mathbb{R}^D$, and our error terms and response variables as scalars $y_i, \varepsilon_i\in\mathbb{R}$.

$$\mathbf{x}_i = \begin{bmatrix} x_{i, 1} \\ \vdots \\ x_{i, d} \\ \vdots \\ x_{i, D} \end{bmatrix} \quad \mathbf{w} = \begin{bmatrix} w_1 \\ \vdots \\ w_d \\ \vdots \\ w_D \end{bmatrix}$$

In linear regression we model our response variables as a linear combination of the elements in the corresponding feature vector. For the $i^{th}$ data point we have

$$y_i = x_{i, 1}w_1 + \dots + x_{i, d}w_d + \dots + x_{i, D} w_D + \varepsilon_i$$

We can write this more succinctly using vector notation (see A1 in the appendix).

$$y_i = \mathbf{x}_i^\top\mathbf{w} + \varepsilon_i$$

We can write out the model for our whole dataset if we collect our responses and errors into vectors and our features into a matrix.

$$\mathbf{y} = \mathbf{Xw} + \boldsymbol{\varepsilon}$$

Each row of $\mathbf{X}$ is a different data point and the columns correspond to that data point’s features. If we have $N$ data points,

$$\mathbf{X}\in\mathbb{R}^{N\times D}\quad
\mathbf{X} = \begin{bmatrix}
\rule[.5ex]{2.5ex}{0.5pt} & \mathbf{x}_1^\top & \rule[.5ex]{2.5ex}{0.5pt} \\
& \vdots & \\
\rule[.5ex]{2.5ex}{0.5pt} & \mathbf{x}_i^\top & \rule[.5ex]{2.5ex}{0.5pt} \\
& \vdots & \\
\rule[.5ex]{2.5ex}{0.5pt} & \mathbf{x}_N^\top & \rule[.5ex]{2.5ex}{0.5pt}
\end{bmatrix} = \begin{bmatrix}
x_{1,1} & \cdots & x_{1,d} & \cdots & x_{1,D} \\
\vdots & & \vdots & & \vdots \\
x_{i,1} & \cdots & x_{i,d} & \cdots & x_{i,D} \\
\vdots & & \vdots & & \vdots \\
x_{N,1} & \cdots & x_{N,d} & \cdots & x_{N,D}
\end{bmatrix}$$
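Here is a small numpy sketch of this matrix form, with made-up weights and noise, mainly to show the shapes involved (the solver call at the end uses numpy’s built-in least-squares routine, not anything we have derived here):

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 5, 3                               # 5 data points, 3 features
X = rng.normal(size=(N, D))               # design matrix: one row per data point
w_true = np.array([1.5, -2.0, 0.5])       # made-up "ground truth" weights
eps = rng.normal(scale=0.1, size=N)       # unobserved errors

y = X @ w_true + eps                      # y = Xw + eps for the whole dataset

# Recover an estimate of w with numpy's built-in least-squares solver.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)                              # should be close to w_true
```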

Appendix

A1
In vector notation, $\mathbf{x}^\top\mathbf{w}$ (read “x transpose w”) is used to denote a dot product.

$$\mathbf{x}, \mathbf{w} \in \mathbb{R}^D \qquad \mathbf{x}^\top\mathbf{w} = x_1w_1 + x_2w_2 + \dots + x_D w_D$$
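For example, a trivial numpy check that the dot-product notation and the explicit sum agree:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

# x^T w as a dot product, and the same quantity written as an explicit sum.
print(x @ w)          # 32.0
print(np.sum(x * w))  # 32.0
```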