Regression II – A Little Theory

Now that we have introduced linear regression, we are going to think a little bit about the theory behind it. Suppose we have $n$ predictors, which we call $X_1, \dots, X_n$, and a response, which we call $Y$, and we would like to do some regression. To do this, we need some values for our predictors and response; let's say we have $m$ of these. We must remember to distinguish between the variables $X_1, \dots, X_n, Y$ and the values, which we usually call samples. The former we generally identify with the axes of an $(n+1)$-dimensional space, and the latter with points in that space.

So, we would like to find a hyperplane that comes as close as possible to each of our $m$ sample points. More specifically, if we consider our hyperplane as a function of $X_1, \dots, X_n$, then for each sample $i$, we would like to minimise the distance between the point on our hyperplane at $(X_{1,i}, \dots, X_{n,i})$ and the point $(X_{1,i}, \dots, X_{n,i}, Y_i)$.

This is all a bit hard to visualise, so let’s think about things algebraically. We can think of our hyperplane as the set of points defined by

$$\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n$$

for some scalars $\beta_0, \dots, \beta_n$. And for each sample $i$, we let

$$\beta_0 + \beta_1 X_{1,i} + \cdots + \beta_n X_{n,i} = \hat{Y}_i$$

Now, we need to pick a good algebraic notion of distance. Our best bet is the Euclidean distance: it lines up pretty well with our intuitive notion of geometric distance, and it has plenty of nice mathematical properties.

So for each $i$, we want to minimise:

$$\sqrt{(X_{1,i} - X_{1,i})^2 + \cdots + (X_{n,i} - X_{n,i})^2 + (Y_i - \hat{Y}_i)^2} = \sqrt{(Y_i - \hat{Y}_i)^2}$$

The square root is an increasing function, so, although it changes the value of the minimum of a function, it does not change its location, and we can leave it out. Also, we would like to minimise this value for all points simultaneously; since the terms are all non-negative, the simultaneous minimum is the minimum of the sum, so we take the sum over all $m$ samples, which gives us

$$\sum_{i=1}^{m} (Y_i - \hat{Y}_i)^2$$

And we can rewrite this using our definition of $\hat{Y}_i$ as

$$\sum_{i=1}^{m} \left( Y_i - \left( \beta_0 + \sum_{j=1}^{n} \beta_j X_{j,i} \right) \right)^2$$
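To make this concrete, here is a minimal sketch in Python (the sample data and the `rss` helper are made up for illustration) that evaluates this objective for a given choice of coefficients:

```python
import numpy as np

# Hypothetical sample data: m = 5 samples, n = 2 predictors.
# Column j of X_samples holds the values of predictor X_{j+1}.
X_samples = np.array([[1.0, 2.0],
                      [2.0, 1.0],
                      [3.0, 4.0],
                      [4.0, 3.0],
                      [5.0, 5.0]])
Y = np.array([3.1, 3.9, 7.2, 6.8, 10.1])

def rss(beta0, beta, X, Y):
    """Residual sum of squares: sum_i (Y_i - (beta0 + sum_j beta_j * X_{j,i}))^2."""
    Y_hat = beta0 + X @ beta
    return np.sum((Y - Y_hat) ** 2)

print(rss(0.0, np.array([1.0, 1.0]), X_samples, Y))  # ≈ 0.91 for this data
```

Different choices of the coefficients give different values of the objective; regression is the problem of finding the choice that makes it smallest.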

So, we have a formula that we would like to minimise, and the good news is that there is an analytic solution to this! We just follow the normal process for analytically finding the location of a minimum: we differentiate our objective function with respect to the $\beta_j$, set the result to zero, and solve for the $\beta_j$.

Rather than deal with various different indices, we are going to convert our objective function into matrix form. To do this, we borrow a standard trick: we add a new predictor $X_0$, which has constant value 1, to the set of predictors $X_1, \dots, X_n$. This gets rid of the floating $\beta_0$ term, so we can now write

$$\hat{Y}_i = \beta_0 + \sum_{j=1}^{n} \beta_j X_{j,i} = \sum_{j=0}^{n} \beta_j X_{j,i}$$

Then we can rewrite this as a matrix multiplication

$$\hat{Y} = X\beta$$

Here, the columns of $X$ represent predictors and the rows represent samples. The value we would like to minimise can then be rewritten in matrix form as

$$(X\beta - Y)^T (X\beta - Y)$$
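As a quick sanity check (on randomly generated data, purely for illustration), we can confirm that this matrix form agrees with the per-sample sum of squares, including the column-of-ones trick for $X_0$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3

# Prepend the constant predictor X_0 = 1 as the first column.
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
Y = rng.normal(size=m)
beta = rng.normal(size=n + 1)

residual = X @ beta - Y
matrix_form = residual @ residual            # (Xβ − Y)^T (Xβ − Y)
sum_form = np.sum((Y - X @ beta) ** 2)       # Σ_i (Y_i − Ŷ_i)^2

assert np.isclose(matrix_form, sum_form)
```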

Before we differentiate, we are going to rearrange a little. So, we multiply this out and get

$$(X\beta)^T X\beta - Y^T X\beta - (X\beta)^T Y + Y^T Y$$

We can simplify this. Firstly, note that the middle two terms are scalars, so we can freely transpose them without changing their value, which gives us

$$Y^T X\beta = (X\beta)^T Y$$

Also, we can distribute the transpose in the first term. Putting this together, we get

$$\beta^T X^T X \beta - 2 Y^T X \beta + Y^T Y$$

Now, we follow the normal rules for differentiation of vectors and matrices. We can think of the first term as

$$\beta^T A \beta \quad \text{where} \quad A = X^T X$$

The derivative of which is

$$\beta^T (A^T + A) = \beta^T \left( (X^T X)^T + X^T X \right) = \beta^T \left( X^T X + X^T X \right) = 2 \beta^T X^T X$$
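If you want to convince yourself of the identity $\frac{\partial}{\partial \beta}\, \beta^T A \beta = \beta^T (A^T + A)$, a finite-difference check works well (the random values here are arbitrary; since the function is quadratic, the central difference recovers the gradient essentially exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
beta = rng.normal(size=n)

# Analytic gradient of f(β) = β^T A β, as a (row) vector: β^T (A^T + A).
analytic = beta @ (A.T + A)

# Central finite differences along each coordinate direction e_k.
eps = 1e-6
numeric = np.array([
    ((beta + eps * e) @ A @ (beta + eps * e)
     - (beta - eps * e) @ A @ (beta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

assert np.allclose(analytic, numeric, atol=1e-6)
```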

The first step is a standard matrix-calculus identity. Then, when we differentiate the second term of our objective function, we are just differentiating a row vector times a column vector, so we get

$$-2 Y^T X$$

Finally, the third term is constant with respect to $\beta$, so its derivative is zero. Putting this all together and dividing through by 2, we get

$$\beta^T (X^T X) - Y^T X = 0$$

Rearranging and taking the transpose gives

$$\beta = (X^T X)^{-1} X^T Y$$

using the fact that the transpose of an inverse is the inverse of the transpose, and that $X^T X$ is symmetric.

So, the coefficients of our regression are determined exactly by this equation! All we have to do is plug our sample values into the X and Y matrices, and we have the parameters of the best fit hyperplane.
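A sketch of this in NumPy (on synthetic data, with made-up variable names). One practical note: rather than forming $(X^T X)^{-1}$ explicitly, it is more numerically stable to solve the system $(X^T X)\beta = X^T Y$ with a linear solver:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 3

# Synthetic data: X_0 = 1 column plus n random predictors, plus a little noise on Y.
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ true_beta + 0.1 * rng.normal(size=m)

# Normal equations: solve (X^T X) β = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

With only mild noise, `beta_hat` lands close to the coefficients the data was generated from.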

In real life, when we have a lot of predictors or a lot of samples, doing this matrix algebra can be quite intensive, and it may in fact be easier to solve the optimisation problem numerically. The good news is that our objective function is convex, so every local minimum is a global minimum, and simple iterative methods such as gradient descent will find one.
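As an illustration, here is a minimal gradient-descent sketch (the step size and iteration count are arbitrary choices for this toy data, not tuned recommendations) that recovers the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 2

# Synthetic data, again with the constant X_0 = 1 column prepended.
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
Y = X @ np.array([0.5, 1.5, -2.0]) + 0.05 * rng.normal(size=m)

beta = np.zeros(n + 1)
lr = 0.1
for _ in range(5000):
    # Gradient of the objective, scaled by 1/m for a stable step size;
    # scaling does not change the location of the minimum.
    grad = (2 / m) * X.T @ (X @ beta - Y)
    beta -= lr * grad

# The iterates converge to the normal-equations solution.
closed_form = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(beta, closed_form, atol=1e-4)
```

For very large problems one would reach for stochastic or batched variants of this loop, but the convexity argument above is what guarantees that such methods are not chasing a merely local optimum.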
