Regression II – A Little Theory

Now that we have introduced linear regression, we are going to think a little bit about the theory behind it. Suppose we have $n$ predictors, which we call $X_1, \dots, X_n$, and a response, which we call $Y$, and we would like to do some regression. To do this, we need some values for our predictors and response; let's say we have $m$ of these. We must remember to distinguish between the variables $X_1, \dots, X_n, Y$ and the values, which we usually call samples. The former we generally identify with the axes of an $(n+1)$-dimensional space, and the latter with points in that space.

So, we would like to find a hyperplane that comes as close as possible to each of our $m$ sample points. More specifically, if we consider our hyperplane as a function of $X_1, \dots, X_n$, then for each sample $i$, we would like to minimise the distance between the point on our hyperplane at $(X_{1,i}, \dots, X_{n,i})$ and the point $(X_{1,i}, \dots, X_{n,i}, Y_i)$.

This is all a bit hard to visualise, so let’s think about things algebraically. We can think of our hyperplane as the set of points defined by

$$\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$

for some scalars $\beta_0, \dots, \beta_n$. And for each sample $i$, we let

$$\beta_0 + \beta_1 X_{1,i} + \cdots + \beta_n X_{n,i} = \hat{Y}_i$$

Now, we need to pick a good algebraic notion of distance. Our best bet is the Euclidean distance: it lines up pretty well with our intuitive notion of geometric distance, and it has plenty of nice mathematical properties.

So for each $i$, we want to minimise:

$$\sqrt{(X_{1,i} - X_{1,i})^2 + \cdots + (X_{n,i} - X_{n,i})^2 + (Y_i - \hat{Y}_i)^2} = \sqrt{(Y_i - \hat{Y}_i)^2}$$

The square root is an increasing function, so, although it changes the value of the minimum of a function, it does not change its location, and we can leave it out. Also, we would like to minimise this value for all points simultaneously; since the terms are all non-negative, the simultaneous minimum is the minimum of the sum, so we take the sum over all $m$ samples, which gives us

$$\sum_{i=1}^{m} (Y_i - \hat{Y}_i)^2$$

And we can rewrite this using our definition of $\hat{Y}_i$ as

$$\sum_{i=1}^{m} \left( Y_i - \left( \beta_0 + \sum_{j=1}^{n} \beta_j X_{j,i} \right) \right)^2$$
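To make this concrete, here is a minimal sketch in Python (the sample data and the `rss` helper are made up for illustration) that evaluates this objective for a given choice of coefficients:

```python
import numpy as np

# Hypothetical sample data: m = 5 samples, n = 2 predictors.
# Column j of X_samples holds the values of predictor X_{j+1}.
X_samples = np.array([[1.0, 2.0],
                      [2.0, 1.0],
                      [3.0, 4.0],
                      [4.0, 3.0],
                      [5.0, 5.0]])
Y = np.array([3.1, 3.9, 7.2, 6.8, 10.1])

def rss(beta0, beta, X, Y):
    """Residual sum of squares: sum_i (Y_i - (beta0 + sum_j beta_j * X_{j,i}))^2."""
    Y_hat = beta0 + X @ beta
    return np.sum((Y - Y_hat) ** 2)

print(rss(0.0, np.array([1.0, 1.0]), X_samples, Y))  # ≈ 0.91 for this data
```

Different choices of the coefficients give different values of the objective; regression is the problem of finding the choice that makes it smallest.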

So, we have a formula that we would like to minimise, and the good news is that there is an analytic solution to this! We just follow the normal process for analytically finding the location of a minimum: we differentiate our objective function with respect to the $\beta_j$, set the result to zero, and solve for the $\beta_j$.

Rather than deal with various different indices, we are going to convert our objective function into matrix form. To do this, we borrow a standard trick: we add a new predictor $X_0$, which has constant value 1, to the set of predictors $X_1, \dots, X_n$. This gets rid of the floating $\beta_0$ term, so we can now write

$$\hat{Y}_i = \beta_0 + \sum_{j=1}^{n} \beta_j X_{j,i} = \sum_{j=0}^{n} \beta_j X_{j,i}$$

Then we can rewrite this as a matrix multiplication

$$\hat{Y} = X\beta$$

Here, the columns of $X$ represent predictors and the rows represent samples. The value we would like to minimise can then be rewritten in matrix form as

$$(X\beta - Y)^T (X\beta - Y)$$
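As a quick sanity check (on randomly generated data, purely for illustration), we can confirm that this matrix form agrees with the per-sample sum of squares, including the column-of-ones trick for $X_0$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3

# Prepend the constant predictor X_0 = 1 as the first column.
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
Y = rng.normal(size=m)
beta = rng.normal(size=n + 1)

residual = X @ beta - Y
matrix_form = residual @ residual            # (Xβ − Y)^T (Xβ − Y)
sum_form = np.sum((Y - X @ beta) ** 2)       # Σ_i (Y_i − Ŷ_i)^2

assert np.isclose(matrix_form, sum_form)
```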

Before we differentiate, we are going to rearrange a little. So, we multiply this out and get

$$(X\beta)^T X\beta - Y^T X\beta - (X\beta)^T Y + Y^T Y$$

We can simplify this. Firstly, note that the middle two terms are scalars, so we can freely transpose them without changing their value, which gives us

$$Y^T X\beta = (X\beta)^T Y$$

Also, we can distribute the transpose in the first term. Putting this together, we get

$$\beta^T X^T X \beta - 2 Y^T X \beta + Y^T Y$$

Now, we follow the normal rules for differentiation of vectors and matrices. We can think of the first term as

$$\beta^T A \beta \quad \text{where} \quad A = X^T X$$

The derivative of which is

$$\beta^T (A^T + A) = \beta^T \left( (X^T X)^T + X^T X \right) = \beta^T \left( X^T X + X^T X \right) = 2 \beta^T X^T X$$
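If you want to convince yourself of the identity $\frac{\partial}{\partial \beta}\, \beta^T A \beta = \beta^T (A^T + A)$, a finite-difference check works well (the random values here are arbitrary; since the function is quadratic, the central difference recovers the gradient essentially exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
beta = rng.normal(size=n)

# Analytic gradient of f(β) = β^T A β, as a (row) vector: β^T (A^T + A).
analytic = beta @ (A.T + A)

# Central finite differences along each coordinate direction e_k.
eps = 1e-6
numeric = np.array([
    ((beta + eps * e) @ A @ (beta + eps * e)
     - (beta - eps * e) @ A @ (beta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

assert np.allclose(analytic, numeric, atol=1e-6)
```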

The first step is a standard matrix-calculus identity. Then, when we differentiate the second term of our objective function, we are just differentiating a row vector times a column vector, so we get

$$-2 Y^T X$$

Finally, the third term is constant with respect to $\beta$, so its derivative is zero. Putting this all together and dividing through by 2, we get

$$\beta^T (X^T X) - Y^T X = 0$$

Rearranging and taking the transpose gives

$$\beta = (X^T X)^{-1} X^T Y$$

using the fact that the transpose of an inverse is the inverse of the transpose, and that $X^T X$ is symmetric.

So, the coefficients of our regression are determined exactly by this equation! All we have to do is plug our sample values into the X and Y matrices, and we have the parameters of the best fit hyperplane.
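A sketch of this in NumPy (on synthetic data, with made-up variable names). One practical note: rather than forming $(X^T X)^{-1}$ explicitly, it is more numerically stable to solve the system $(X^T X)\beta = X^T Y$ with a linear solver:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 3

# Synthetic data: X_0 = 1 column plus n random predictors, plus a little noise on Y.
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ true_beta + 0.1 * rng.normal(size=m)

# Normal equations: solve (X^T X) β = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

With only mild noise, `beta_hat` lands close to the coefficients the data was generated from.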

In real life, when we have a lot of predictors or a lot of samples, doing this matrix algebra can be quite intensive, and it may in fact be easier to solve the optimisation problem numerically. The good news is that our objective function is convex, so every local minimum is a global minimum, and simple iterative methods such as gradient descent will find one.
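As an illustration, here is a minimal gradient-descent sketch (the step size and iteration count are arbitrary choices for this toy data, not tuned recommendations) that recovers the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 2

# Synthetic data, again with the constant X_0 = 1 column prepended.
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
Y = X @ np.array([0.5, 1.5, -2.0]) + 0.05 * rng.normal(size=m)

beta = np.zeros(n + 1)
lr = 0.1
for _ in range(5000):
    # Gradient of the objective, scaled by 1/m for a stable step size;
    # scaling does not change the location of the minimum.
    grad = (2 / m) * X.T @ (X @ beta - Y)
    beta -= lr * grad

# The iterates converge to the normal-equations solution.
closed_form = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(beta, closed_form, atol=1e-4)
```

For very large problems one would reach for stochastic or batched variants of this loop, but the convexity argument above is what guarantees that such methods are not chasing a merely local optimum.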
