ANOVA and Regression

This tutorial uses the diamonds data from the previous pages. Run the following line of code if you haven’t already called it in.

sparkly<-read.csv("https://mooredata.weebly.com/uploads/1/6/0/9/16090314/diamonds.csv",strings=T) #loads data directly from website

ANOVA

One-way ANOVA (Analysis of Variance)

Suppose we want to know whether the mean price is different across the different diamond cuts. The aov() function in R will carry out an ANOVA. And what’s really nice is that we don’t have to specify which group like we did in the t.test(). The aov() function uses one column with the factors and one column with the quantitative response. To run the ANOVA, use aov(Y ~ X, data=datsetName)

fit<-aov(price ~ cut, data=sparkly)  #Model price by cut and call the fitted model 'fit'  
summary(fit)  #see the ANOVA output

##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## cut             4 1.104e+10 2.760e+09   175.7 <2e-16 ***
## Residuals   53935 8.474e+11 1.571e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Two-way ANOVA

You can add predictor terms by using the “+” sign between predictors in the aov() call. For instance, the following code would run the ANOVA with price as the response with cut, color, and their interaction

fit1 <- aov(price ~ cut + color + cut*color,data=sparkly)  #this isn't legal since the deign is not balanced
fit2 <- aov(price ~ cut*color,data=sparkly)  #this includes interaction and all lower order terms

Notice, we did not call the summary(fit1) or summary(fit2). Since the design is not balanced (there are not an equal number of observations at each combination), we need to use a special method to perfom the ANOVA. The most popular way to do this in R is to use the Anova() function in the ‘cars’ package, but this is not covered here.

We can demonstrate how to create an interaction plot with the diamonds data. To create an interaction plot, use interaction.plot(). interaction.plot() takes three arguments: x-axis predictor, separate lines predictor, and quantitative response. Run the following to see the difference.

interaction.plot(sparkly$cut,sparkly$color,sparkly$price,col=1:7)

interaction.plot(sparkly$color,sparkly$cut,sparkly$price,col=1:5)

To get a table of means, if the fitted model is called “fit”, use the print(model.tables()) call.

fit <- aov(price ~ cut*color,data=sparkly)
print(model.tables(fit,"means"),digits=3)

## Tables of means
## Grand mean
##        
## 3932.8 
## 
##  cut 
##     Fair Good Ideal Premium Very Good
##     4359 3929  3458    4584      3982
## rep 1610 4906 21551   13791     12082
## 
##  color 
##        D    E    F     G    H    I    J
##     3194 3089 3732  4014 4454 5080 5259
## rep 6775 9797 9542 11292 8304 5422 2808
## 
##  cut:color 
##            color
## cut         D    E    F    G    H    I    J   
##   Fair      4291 3682 3827 4239 5136 4685 4976
##   rep        163  224  312  314  303  175  119
##   Good      3405 3424 3496 4123 4276 5079 4574
##   rep        662  933  909  871  702  522  307
##   Ideal     2629 2598 3375 3721 3889 4452 4918
##   rep       2834 3903 3826 4884 3115 2093  896
##   Premium   3631 3539 4325 4501 5217 5946 6295
##   rep       1603 2337 2331 2924 2360 1428  808
##   Very Good 3470 3215 3779 3873 4535 5256 5104
##   rep       1513 2400 2164 2299 1824 1204  678

fit <- aov(price ~ cut + color,data=sparkly)
print(model.tables(fit,"means"),digits=3)

## Tables of means
## Grand mean
##        
## 3932.8 
## 
##  cut 
##     Fair Good Ideal Premium Very Good
##     4359 3929  3458    4584      3982
## rep 1610 4906 21551   13791     12082
## 
##  color 
##        D    E    F     G    H    I    J
##     3194 3089 3732  4014 4454 5080 5259
## rep 6775 9797 9542 11292 8304 5422 2808

Notice, the means correspond to the predictor terms in the fitted ANOVA model. Also notice that ‘rep’ (short for repetitions) is the number of observations at each combination and the means are not labeled. For example, the mean price for a dimaond with color D is $3,194.00 and there are 6,775 D color diamonds.

Assumption Checks for ANOVA

plot(predict(fit),resid(fit))
abline(h=0)

qqnorm(resid(fit))

Regression and Correlation

Fitting a regression model works much the same as the aov() function, except we will use the lm() function. Suppose we want to predict price from carat weight.

plot(sparkly$carat,sparkly$price)
fit <- lm(price ~ carat + I(carat^2), data = sparkly)  #quadratic term because of the curve,I() function allows calculations
plot(sparkly$carat,sparkly$price)
lines(sparkly$carat,fitted(fit),col="red")  #plot x's(carat weights) against y's (fitted/predicted values

We can do better than that. Let’s try log().

fit<-lm(log(price)~log(carat),data=sparkly)
plot(log(sparkly$carat),log(sparkly$price))
abline(fit,col="blue")

Assumption Checks for Regression

plot(predict(fit),resid(fit))
abline(h=0)

qqnorm(resid(fit))

Prediction Intervals for mean response

The following code produces a 90% prediction and the interval for the mean response for price for a carat weight of 1

predict(fit,newdata=data.frame(carat=1),interval="confidence",level=0.9)  #The answer will be on the ln() scale

##        fit      lwr      upr
## 1 8.448661 8.446416 8.450905

Prediction Intervals for a single Y

The following code produces a 90% prediction and the interval for a single Y for price for a carat weight of 1

predict(fit,newdata=data.frame(carat=1),interval="prediction",level=0.9)

##        fit      lwr      upr
## 1 8.448661 8.016612 8.880709

Other Diagnostics

library(car)  #if 'car' is not installed, do install.packages('car') first and then run library(car)

## Loading required package: carData

lev=hat(model.matrix(fit))
plot(lev)

hist(lev)

library(MASS)
studRes <- studres(fit)  #studentized residuals
hist(studRes)

summary(studRes)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -5.744654 -0.645387 -0.022518  0.000001  0.633404  5.095121

cooks <- cooks.distance(fit)  #Cook's D
hist(cooks)

summary(cooks)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 1.416e-06 6.745e-06 1.946e-05 2.178e-05 3.129e-03

Notes regarding multiple regression models in R

The following are ways to add interactions and functions of predictor variables:

  |        .            | include all single predictors
  |       +X            | include this variable                                                           
  |       -X            | delete this variable                                                             
  |        X:Z          | include the interaction between these variables                                  
  |        X*W          | include these variables and the interactions between them                        
  |        X | Z        | conditioning: include x given z                                                  
  | (W + X + Z )^3      | include these variables and all interactions up to three way                     
  |        I(X*Z)       | allows functions: include a new variable consisting of these variables multiplied   
  |        X - 1        | delete the intercept (regress through the origin)

There are often multiple ways to create your model in R. The following three are all equivalent:

Y ~ X + Z + W + X:Z + X:W + Z:W + X:Z:W
Y ~ X * Z * W
Y ~ (X + Z + W)^3

You can also use the “.” in combination with others. The following 2 lines are equivalent (if you only have W, X, Y, and Z in your dataset):

fit <- lm(Y ~ ., )
fit <- lm(Y ~ W + X + Z )

The next 2 are also equivalent (if you only have W, X, Y, and Z in your dataset):
fit <- lm(Y ~ .-W)
fit <- lm(Y ~ X + Z)

Tips

When running an ANOVA, if your categorical predictor is entered as a number, you must turn the quantitative input into a factor before running the ANOVA.
- If the ‘cut’ column had number 1 through 5 representing the 5 levels (fair, good, very good, ideal, and premium), use the command fit <- aov(price ~ factor(cut), data = sparkly)
- Only make a quantitative column a factor for regression if it is intended to be categorical.

YOUR CART