A blog about nothing

A collection of useless things that may turn out to be useful

Archive for the ‘Linear Regression’ Category

Using Matrix Multiplication to calculate the Intercept and the Slope

Posted by Diego on February 3, 2015

 

A different way to calculate the intercept and the slope of a regression function is to use matrix multiplication. Let’s try that with the dataset defined here. It’s a very simple dataset with one predictor (X) and one outcome (Y) where we know, from this post, that the intercept is -1.336847 and the slope is 2.065414.

 

The first step is to derive a matrix and a vector from our data (which we will call X and Y). The matrix, X, will contain a column full of 1s and a column for each of the predictors; in this example we only have one, so it will be a 20 x 2 matrix (20 is the number of observations we have). The vector, Y, will contain only the outcomes, so it will be a 20 x 1 “matrix”. Here is the R code to load them:

X = matrix(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), nrow=20, ncol=2)


Y = matrix(c(5,6,7,8,9,10,11,12,13,14,15,20,25,33,34,35,36,37,38,39), nrow=20, ncol=1)
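As a side note (mine, not from the original post), the same X can be built more compactly with cbind, which recycles the 1 across all 20 rows:

X = cbind(1, 1:20)   # identical to the matrix above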


X
[,1] [,2]
[1,]    1    1
[2,]    1    2
[3,]    1    3
[4,]    1    4
[5,]    1    5
[6,]    1    6
[7,]    1    7
[8,]    1    8
[9,]    1    9
[10,]    1   10
[11,]    1   11
[12,]    1   12
[13,]    1   13
[14,]    1   14
[15,]    1   15
[16,]    1   16
[17,]    1   17
[18,]    1   18
[19,]    1   19
[20,]    1   20

Y
[,1]
[1,]    5
[2,]    6
[3,]    7
[4,]    8
[5,]    9
[6,]   10
[7,]   11
[8,]   12
[9,]   13
[10,]   14
[11,]   15
[12,]   20
[13,]   25
[14,]   33
[15,]   34
[16,]   35
[17,]   36
[18,]   37
[19,]   38
[20,]   39

 

Once we have those two objects, all we have to do is apply the formula below, which, for the sake of simplicity, I did in 4 steps:

β = (XᵀX)⁻¹ Xᵀ Y

 

Multiply the transpose of X by X (the function “t” in R calculates the transpose of a matrix; some matrix algebra in R here):

A = t(X) %*% X


Calculate the inverse of A using the function “solve”:

B = solve(A)

 

Multiply the result by the transpose of X:

C = B %*% t(X)

 

Multiply the result by the vector Y:

D = C %*% Y

 

The result will be a matrix containing the intercept and all the slopes:

D
          [,1]
[1,] -1.336842
[2,]  2.065414
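As a sanity check (my addition, not in the original post), lm on the same data recovers the same coefficients; X[,2] is the predictor column:

lm(Y ~ X[,2])   # intercept -1.336842, slope 2.065414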

 

 

 

On a slightly more complex example, let’s use the mtcars R dataset to predict mpg from the number of cylinders, horsepower and weight. First of all, let’s run the regular lm function to see if we can match the values:

 

data(mtcars)  # mtcars is a built-in dataset; Library(mtcars) would fail

lm(mpg ~ cyl + hp + wt, mtcars)

 

Call:

lm(formula = mpg ~ cyl + hp + wt, data = mtcars)

 

Coefficients:

(Intercept)          cyl           hp           wt 

   38.75179     -0.94162     -0.01804     -3.16697

 

So the function we are looking for is

y = 38.75179 - 0.94162*cyl - 0.01804*hp - 3.16697*wt

 

#we are going to need the dplyr library to call the mutate function

library(dplyr)

#1)Create a copy of the mtcars dataset

aux = mtcars

#2)Extract just the columns we want:

aux = aux[,c("cyl","hp","wt")]

#3)Add a column with “one”

aux = mutate(aux, i=1)

#4)Convert into a matrix

X = data.matrix(aux[,c("i","cyl","hp","wt")])

#5) Get our Y column from mtcars, which contains the outcome values (mpg):

Y = data.matrix(mtcars[,c("mpg")])

 

#6) At last, we can call the above formula in one go:

(solve(t(X) %*% X) %*% t(X)) %*% Y

          [,1]

i   38.7517874

cyl -0.9416168

hp  -0.0180381

wt  -3.1669731

 

And we can check that the values are the ones expected.
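As a side note (an assumption of mine, not from the original post), explicitly inverting XᵀX is numerically less stable than solving the corresponding linear system; solve with two arguments does the latter and returns the same coefficients:

solve(t(X) %*% X, t(X) %*% Y)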


Posted in Data Science, Linear Regression, R, Statistics | Leave a Comment »

Regression Line and confidence interval

Posted by Diego on November 18, 2014

 

Here I expand a little on the concept of the regression line by adding statistical inference to the “foo” dataset that I defined here.

Just to recap, we can do fit <- lm(Y ~ X, foo) to get the regression line’s coefficients and calculate the predicted points:

 
> fit$coef
(Intercept)           X 
  -1.336842    2.065414 

 

X     Y     Predicted Point
1     5     0.7286
2     6     2.794
3     7     4.8594
4     8     6.9248
5     9     8.9902
6     10    11.0556
7     11    13.121
8     12    15.1864
9     13    17.2518
10    14    19.3172
11    15    21.3826
12    20    23.448
13    25    25.5134
14    33    27.5788
15    34    29.6442
16    35    31.7096
17    36    33.775
18    37    35.8404
19    38    37.9058
20    39    39.9712

 

 
> summary(fit)
 
Call:
lm(formula = Y ~ X, data = foo)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-6.3827 -2.3874  0.0519  2.4701  5.4211 
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.3368     1.6081  -0.831    0.417    
X             2.0654     0.1342  15.386 8.42e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 3.462 on 18 degrees of freedom
Multiple R-squared:  0.9293,   Adjusted R-squared:  0.9254 
F-statistic: 236.7 on 1 and 18 DF,  p-value: 8.42e-12
 
 
 

We can use the predict function to predict the value of Y for each X:

> X <- foo$X
> p<-predict(fit,data.frame(X), interval="confidence")
> p

 

(output: a 20 x 3 matrix with columns fit, lwr and upr, i.e. the fitted values and the lower and upper confidence bounds)

 

Note that when we use interval=”confidence”, the lower and upper endpoints are displayed and that the “fit” column matches the values we calculated using the regression function.

 

To finish, we can plot the data and the regression line:

> plot(foo$X, foo$Y, xlab="X", ylab="Y")
> abline(fit)

 

And add 2 lines to represent the lower and upper confidence interval:

> lines(X,p[,2],col="blue") #Lower confidence interval
> lines(X,p[,3],col="red")  #upper confidence interval
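As a side note (mine, not in the original post), predict also accepts interval="prediction", which gives bounds for individual new observations rather than for the mean response; those bands are wider:

> p2 <- predict(fit, data.frame(X), interval="prediction")
> lines(X, p2[,2], col="green") #Lower prediction bound
> lines(X, p2[,3], col="green") #Upper prediction bound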
 
(plot: the data, the regression line, and the blue/red confidence-bound lines)

 

Posted in Data Science, Linear Regression, R, Statistics | Leave a Comment »

T-Test for Regression Coefficients

Posted by Diego on November 12, 2014

 

Getting the coefficients:

 

·         Prepare data:

library(UsingR);
data(diamond)

y <- diamond$price;
x <- diamond$carat;
n <- length(y)

 

·         Slope and intercept:

beta1 <- cor(y, x) * sd(y) / sd(x)

beta0 <- mean(y) - beta1 * mean(x)

·         Calculate the error:

e <- y - beta0 - beta1 * x  # y - (beta1*x + beta0)


·         Calculate sigma (the residual standard error):

> sigma <- sqrt(sum(e^2) / (n-2))

> sigma

[1] 31.84052

·         Calculate the sum of squared deviations of X:
> ssx <- sum((x - mean(x))^2)
> ssx
[1] 0.1515667

·         Calculate the standard errors for beta0 and beta1:
> seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma
> seBeta1 <- sigma / sqrt(ssx)
> seBeta0
[1] 17.31886
> seBeta1
[1] 81.78588

·         T statistics for hypothesis testing:
> tBeta0 <- beta0 / seBeta0
> tBeta1 <- beta1 / seBeta1
> tBeta0
[1] -14.99094
> tBeta1
[1] 45.49715

·         P values (which must lie between 0 and 1):
> pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
> pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
> pBeta0
[1] 2.523271e-19
> pBeta1
[1] 6.75126e-40

 

 

Similarly:

> fit<-lm(y~x);
> summary(fit)$coefficients

 

             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -259.6259   17.31886 -14.99094 2.523271e-19
x           3721.0249   81.78588  45.49715 6.751260e-40

 

Confidence Interval:

> sumCoef<-summary(fit)$coefficients
> sumCoef[2,1]+c(-1,1)*qt(.975,df=fit$df)*sumCoef[2,2]
[1] 3556.398 3885.651
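The same pattern gives a confidence interval for the intercept (a sketch of mine; row 1 of the coefficient table holds the intercept estimate and its standard error):

> sumCoef[1,1]+c(-1,1)*qt(.975,df=fit$df)*sumCoef[1,2]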

 

With 95% confidence, we estimate that a 0.1 carat increase in diamond size results in a 355.6 to 388.6 increase in price in (Singapore) dollars.

In other words, the model predicts that a 0.16 carat diamond will cost 335.73. Based on the confidence interval, we can then say with 95% confidence that a 0.26 (+0.1) carat diamond will cost between 691.33 (335.73 + 355.6) and 724.33 (335.73 + 388.6). This can be “verified” by looking at the predicted value for a 0.26 carat diamond, which is 707.83.
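Both predicted values can be reproduced with predict (a sketch of mine; the model was fitted as lm(y ~ x), so the new data frame must use the column name x):

> predict(fit, newdata = data.frame(x = c(0.16, 0.26)))
       1        2 
335.7381 707.8406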

 

Plotting the prediction Intervals:

Plot the points and the regression line:

> plot(x,y,frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black",bg="lightblue",cex=2)
> abline(fit,lwd=2)
 

 

Calculate X points and the predicted Y points:
 
> xVals<-seq(min(x),max(x),by=.01)
> yVals<-beta0+beta1*xVals

 


 

Calculate the standard errors (se1 for the confidence band around the mean response, se2 for the prediction band around new observations):

> se1<-sigma*sqrt(1/n+(xVals-mean(x))^2/ssx)
> se2<-sigma*sqrt(1+1/n+(xVals-mean(x))^2/ssx)


 

> lines(xVals,yVals+2*se1)
> lines(xVals,yVals-2*se1)
> lines(xVals,yVals+2*se2)
> lines(xVals,yVals-2*se2)

 

 

(plot: the data and regression line with the narrower confidence band and the wider prediction band)

Posted in Data Science, Linear Regression, R, Statistics | Leave a Comment »

Residuals and R square

Posted by Diego on November 11, 2014

(Still working with the linear model l defined here)

 

Residuals are the distances between the actual points and the estimates given by the regression line.
Below we can see each X, Y point, the “predicted point” (the value of y obtained by substituting x into the y = 2.0654x - 1.3368 function) and the residual (actual point - predicted point):

X     Y     Predicted point   Residual
1     5     0.7286            4.2714
2     6     2.794             3.206
3     7     4.8594            2.1406
4     8     6.9248            1.0752
5     9     8.9902            0.0098
6     10    11.0556           -1.0556
7     11    13.121            -2.121
8     12    15.1864           -3.1864
9     13    17.2518           -4.2518
10    14    19.3172           -5.3172
11    15    21.3826           -6.3826
12    20    23.448            -3.448
13    25    25.5134           -0.5134
14    33    27.5788           5.4212
15    34    29.6442           4.3558
16    35    31.7096           3.2904
17    36    33.775            2.225
18    37    35.8404           1.1596
19    38    37.9058           0.0942
20    39    39.9712           -0.9712

 

 

The same values can be accessed using l$residuals:


 

The predicted points can be found using l$fitted.values.


FYI, l$model will return the data.

 

R² is the percentage of variation explained by the regression model.

It is calculated as 1 minus the ratio between the total squared error around the line and the total squared error around the mean of Y:

 

R² = 1 - Σ(Yi - Ŷi)² / Σ(Yi - Ȳ)²

 

 
 

 

X    Y    Predicted Point   Residual (error)   Squared Error (residual²)   Squared Error from Mean
1    5    0.7286            4.2714             18.24485796                 235.6225
2    6    2.794             3.206              10.278436                   205.9225
3    7    4.8594            2.1406             4.58216836                  178.2225
4    8    6.9248            1.0752             1.15605504                  152.5225
5    9    8.9902            0.0098             9.604E-05                   128.8225
6    10   11.0556           -1.0556            1.11429136                  107.1225
7    11   13.121            -2.121             4.498641                    87.4225
8    12   15.1864           -3.1864            10.15314496                 69.7225
9    13   17.2518           -4.2518            18.07780324                 54.0225
10   14   19.3172           -5.3172            28.27261584                 40.3225
11   15   21.3826           -6.3826            40.73758276                 28.6225
12   20   23.448            -3.448             11.888704                   0.1225
13   25   25.5134           -0.5134            0.26357956                  21.6225
14   33   27.5788           5.4212             29.38940944                 160.0225
15   34   29.6442           4.3558             18.97299364                 186.3225
16   35   31.7096           3.2904             10.82673216                 214.6225
17   36   33.775            2.225              4.950625                    244.9225
18   37   35.8404           1.1596             1.34467216                  277.2225
19   38   37.9058           0.0942             0.00887364                  311.5225
20   39   39.9712           -0.9712            0.94322944                  347.8225
                                               Sum: 215.7045116            Sum: 3052.55

MeanY: 20.35

Total squared error with the line: 215.70
Squared error from the meanY: 3052.55

Percentage of the total variation that is not explained by the line (215.7/3052.55): 0.070663711
7% of the total variation is not explained by the variation in X

R² (1 - 0.07): 0.929336289

 

Residual variation: Yi - Ŷi
Total variation: Yi - mean(Yi)

The term R² represents the percent of total variation explained by the model.

R² is equal to the correlation of (X, Y), squared:
R² = r²

 

The mean of the residuals is 0, and the residuals are not correlated with the predictor X.

 

 

Using R

1) Manually:

Calculate the mean:

> mu <- mean(foo$Y)
> mu
[1] 20.35

 

Calculate the squared error from the mean:

> sTot <- sum((foo$Y-mu)^2)
> sTot
[1] 3052.55

 

R provides the function deviance, which calculates the sum of the squared residuals: the differences between the actual values and the estimated values given by the line with the fitted parameters (slope and intercept). (You’d get the same result if you squared all the values in the “residual” column and summed them.)

 

> sRes <- deviance(l)
> sRes
[1] 215.7045
 
> 1-sRes/sTot
[1] 0.9293363
 
2) Predefined calculations:
> summary(l)$r.squared
[1] 0.9293363
 
OR
 
> cor(foo$Y, foo$X)^2
[1] 0.9293363
 

Posted in Data Science, Linear Regression, R, Statistics | Leave a Comment »

Linear Regression Function – using R

Posted by Diego on November 10, 2014

 

This is a continuation of this post, where I’ll do all the same calculations but using R.

First of all, I’ve loaded the dataset into a variable called foo:

 

foo <- read.csv("foo.csv")


 

Creating the regression line is very simple.
The first parameter is the “formula”, in the format dependent variable ~ independent variable (in this case we are predicting Y using X), and the second parameter is the data:

l<-lm(Y~X, foo)

 

Call:
lm(formula = Y ~ X, data = foo)

Coefficients:
(Intercept)            X  
     -1.337        2.065  

 

Or, if you call summary on it, you get more details:

Call:
lm(formula = Y ~ X, data = foo)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3827 -2.3874  0.0519  2.4701  5.4211 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.3368     1.6081  -0.831    0.417    
X             2.0654     0.1342  15.386 8.42e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.462 on 18 degrees of freedom
Multiple R-squared:  0.9293,   Adjusted R-squared:  0.9254 
F-statistic: 236.7 on 1 and 18 DF,  p-value: 8.42e-12

 

 

We can also build the formulas manually:

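(The original screenshot is gone; below is a minimal sketch of what building the coefficients manually looks like, using the slope and intercept formulas from the “Intercept, Slope and the Linear Regression Function” post.)

beta1 <- cor(foo$Y, foo$X) * sd(foo$Y) / sd(foo$X)   # 2.065414
beta0 <- mean(foo$Y) - beta1 * mean(foo$X)           # -1.336842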

 

And very easily plot the data and add the regression line:

> plot(foo)
> abline(l)
 
(plot: the foo data with the fitted regression line)

Posted in Data Science, Linear Regression, R, Statistics | Leave a Comment »

Using Linear Regression to predict Diamond Prices

Posted by Diego on November 10, 2014

 

This post is a follow-up on the “Statistical Regression Models Examples” lecture from the Coursera Data Science Specialization from Johns Hopkins University.
I wrote it as I was attempting to understand this particular lesson.

The idea is to use linear regression to predict diamond prices using the “diamond” dataset from the UsingR library, a data set on 48 diamond rings containing the price in Singapore dollars and the size of the diamond in carats.

 

library(UsingR)

plot(diamond$carat, diamond$price,
     xlab="Mass (carats)",
     ylab="Price (SIN $)",
     bg="lightblue",
     col="black", cex=1.1, pch=21, frame=FALSE)
abline(lm(price~carat, data=diamond), lwd=2)

 

(plot: price vs carat with the fitted regression line)

 

Using ggplot2 (note that the ggplot2 library must be loaded and the g object printed to render the plot):

> library(ggplot2)
> g = ggplot(diamond, aes(x = carat, y = price))
> g = g + geom_point(size = 6, colour = "black", alpha = 0.2)
> g = g + geom_point(size = 5, colour = "blue", alpha = 0.2)
> g = g + geom_smooth(method = "lm", colour = "black")
> g

(plot: the same scatter plot and fitted line rendered with ggplot2)
 
 

Create the linear regression:

fit<-lm(price~carat,data=diamond)

coef(fit)

(Intercept)       carat 
  -259.6259   3721.0249 

 

·         That means the linear regression function is:

o   y = 3721x - 259.63

·         In other words:

o   We estimate an expected 3721.02 (SIN) dollar increase in price for every carat increase in the mass of the diamond.

o   The intercept, -259.63, is the expected price of a 0 carat diamond.

 

Getting a more interpretable intercept:

 

In order to get a more meaningful result, we center the “carat” variable:

 

> fit2<-lm(price~I(carat-mean(carat)),data=diamond)
> coef(fit2)
 
           (Intercept) I(carat - mean(carat)) 
              500.0833              3721.0249 

 

That means that $500.08 is the expected price for the average-sized diamond in the data (0.2042 carats). We can verify that by substituting 0.2042 into the first function, which gives approximately 500 (up to rounding):

o   3721 * 0.2042 - 259.63 ≈ 500.2

 


 

Predicting the price of a diamond

In order to predict the value of a diamond, we can just substitute the value we want on the linear regression formula:


 

Or we can use R’s predict function:

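(The original screenshots are gone; below is a minimal sketch of both approaches. The carat values are illustrative choices of mine, not necessarily the ones in the lecture.)

> newx <- c(0.16, 0.27, 0.34)
> coef(fit)[1] + coef(fit)[2] * newx                 # substituting into the formula
> predict(fit, newdata = data.frame(carat = newx))   # same values via predict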

 

 

 

Posted in Data Science, Linear Regression, R, Statistics | 2 Comments »

Centring the random variables and Scaling and Normalizing the data

Posted by Diego on November 6, 2014

 

·         Centring the random variable:

o   Subtract the mean from each data point; the resulting dataset will have mean zero.

·         Scaling the data:

o   Divide each point by the standard deviation; the resulting dataset will have standard deviation 1.

 

 

         X          Xi - meanX    Xi / StdDevS
         10         -34           0.306528517
         15         -29           0.459792775
         25         -19           0.766321292
         35         -9            1.072849808
         33         -11           1.011544105
         87         43            2.666798095
         48         4             1.47133688
         99         55            3.034632315

Mean     44         0
StdDevS  32.62339                 1

 

 

 

 


 

Normalizing the data

 

Normalizing = centring + scaling

Normalized data has mean zero and standard deviation 1.
The normalized data is defined in units of standard deviation.
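In R (a side note of mine, not in the original post), the built-in scale function covers these operations; note that for scaling alone it is simpler to divide by sd directly, since scale(x, center = FALSE) divides by the root mean square instead:

x <- c(10, 15, 25, 35, 33, 87, 48, 99)
scale(x, scale = FALSE)   # centring only: x - mean(x)
x / sd(x)                 # scaling only
scale(x)                  # normalizing: (x - mean(x)) / sd(x)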

 

X     Normalization
10    -1.042196957     -> (10 - 44) / 32.62
15    -0.888932698
25    -0.582404182
35    -0.275875665
33    -0.337181368
87    1.318072622
48    0.122611407
99    1.685906842

Mean     44          0
StdDevS  32.62339    1

 



 

The slope stays the same if you centre the data, and it becomes the correlation if you normalize the data.
(using the same dataset defined here)
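A quick check in R (a sketch of mine, assuming the foo dataset from the earlier posts):

cx <- foo$X - mean(foo$X); cy <- foo$Y - mean(foo$Y)
coef(lm(cy ~ cx))[2]   # 2.065414: the slope is unchanged after centring
nx <- cx / sd(foo$X); ny <- cy / sd(foo$Y)
coef(lm(ny ~ nx))[2]   # 0.964021: the slope equals the correlation after normalizing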

 

 

Centre                Normalize
X       Y             X          Y
-9.5    -15.35        -1.60579   -1.21103
-8.5    -14.35        -1.43676   -1.13213
-7.5    -13.35        -1.26773   -1.05324
-6.5    -12.35        -1.0987    -0.97434
-5.5    -11.35        -0.92967   -0.89545
-4.5    -10.35        -0.76064   -0.81656
-3.5    -9.35         -0.59161   -0.73766
-2.5    -8.35         -0.42258   -0.65877
-1.5    -7.35         -0.25355   -0.57987
-0.5    -6.35         -0.08452   -0.50098
0.5     -5.35         0.084515   -0.42208
1.5     -0.35         0.253546   -0.02761
2.5     4.65          0.422577   0.366858
3.5     12.65         0.591608   0.998012
4.5     13.65         0.760639   1.076907
5.5     14.65         0.92967    1.155801
6.5     15.65         1.098701   1.234695
7.5     16.65         1.267731   1.313589
8.5     17.65         1.436762   1.392484
9.5     18.65         1.605793   1.471378

Mean    0.000    0.000        0.000      0.000
StdDev  5.91608  12.67519334  1          1
Correl  0.964021              0.964021
β1      2.065414              0.964021

 

 

 

Posted in Data Science, Linear Regression, Statistics | Leave a Comment »

Intercept, Slope and the Linear Regression Function

Posted by Diego on November 4, 2014

 

The Linear Regression function is a line that minimizes the squared distances (squared error) between the line and each point.
The error is the vertical distance between the point and the line.
To define the function, we need to find the intercept and the slope (β0 and β1).

 

Given a random dataset:

 

X     Y
1     5
2     6
3     7
4     8
5     9
6     10
7     11
8     12
9     13
10    14
11    15
12    20
13    25
14    33
15    34
16    35
17    36
18    37
19    38
20    39

With the following information:

         X         Y
Std Dev  5.91608   12.67519
Mean     10.5      20.35

Correl   0.964021

 

·         We are using the sample standard deviation.

·         The correlation was calculated in Excel using the CORREL function.

 

 

Which can be plotted like this:

(plot: scatter of the X, Y data)

 

 

The slope is defined by:

β1 = CORREL * (sdY / sdX)
   = 0.964021 * (12.67519 / 5.91608)
   = 2.065414

(how to calculate the correlation is explained here)

β0 = meanY - (β1 * meanX)
   = 20.35 - (2.065414 * 10.5)
   = 20.35 - 21.686847
   = -1.336847

 

Then the formula is:

Y = β1X + β0
y = 2.0654x - 1.3368

(plot: the data with the fitted line y = 2.0654x - 1.3368)

 

It is interesting to note that the function crosses the (meanX, meanY) point, in this case (10.5, 20.35).

(plot: the regression line passing through the (10.5, 20.35) mean point)

 

 

Another way to find the slope is to apply the formula below (this one is from Khan Academy: https://www.khanacademy.org/math/probability/regression/regression-correlation/v/proof-part-4-minimizing-squared-error-to-regression-line):

β1 = (mean(X) * mean(Y) - mean(X*Y)) / (mean(X)² - mean(X²))

 

Which in this example would be:

= ((10.5 * 20.35) - 282.35) / (110.25 - 143.5)
= -68.675 / -33.25
= 2.065414

Another interesting formula is:

Ŷi - mean(Y) = β1 * (Xi - mean(X))

Which means the predicted Yi minus the average of the Ys is equal to the slope times (Xi minus the mean of X).

Posted in Data Science, Linear Regression, Statistics | Leave a Comment »

Covariance and Correlation

Posted by Diego on November 4, 2014

This is an example of how to calculate the covariance and the correlation between two datasets.
The datasets are represented by X and Y, and the formulas are defined below.
The calculations are explained in the table, and each bold value is the result of the equivalent Excel function, used to verify the results.

 

X     Y     (Xi - Xmean) * (Yi - Ymean)
1     5     145.825
2     6     121.975
3     7     100.125
4     8     80.275
5     9     62.425
6     10    46.575
7     11    32.725
8     12    20.875
9     13    11.025
10    14    3.175
11    15    -2.675
12    20    -0.525
13    25    11.625
14    33    44.275
15    34    61.425
16    35    80.575
17    36    101.725
18    37    124.875
19    38    150.025
20    39    177.175

            Sum:         1373.5
            Sum/(n-1):   72.28947368
            Covar:       72.28947368

         X         Y
Std Dev  5.91608   12.67519
Mean     10.5      20.35

Covar / (StdDevX * StdDevY): 0.964020897
Correl:                      0.964020897

 

Cov(X, Y) = Σ (Xi - mean(X)) * (Yi - mean(Y)) / (n - 1)

 

Correl(X, Y) = Cov(X, Y) / (sdX * sdY)
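In R (a note of mine, not in the original post), the built-in functions give the same results, assuming the two columns are loaded as foo$X and foo$Y as in the other posts:

cov(foo$X, foo$Y)   # 72.28947
cor(foo$X, foo$Y)   # 0.9640209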

Posted in Data Science, Linear Regression, Statistics | Leave a Comment »