ANOVA and Regression
- Regression and Correlation
- <Previous | Next>
This tutorial uses the diamonds data from the previous pages. Run the following line of code if you haven’t already called it in.
sparkly<-read.csv("https://mooredata.weebly.com/uploads/1/6/0/9/16090314/diamonds.csv",strings=T) #loads data directly from website
One-way ANOVA (Analysis of Variance)
Suppose we want to know whether the mean price is different across the different diamond cuts. The aov() function in R will carry out an ANOVA. And what’s really nice is that we don’t have to specify which group like we did in the t.test(). The aov() function uses one column with the factors and one column with the quantitative response. To run the ANOVA, use aov(Y ~ X, data=datsetName)
fit<-aov(price ~ cut, data=sparkly) #Model price by cut and call the fitted model 'fit' summary(fit) #see the ANOVA output
## Df Sum Sq Mean Sq F value Pr(>F) ## cut 4 1.104e+10 2.760e+09 175.7 <2e-16 *** ## Residuals 53935 8.474e+11 1.571e+07 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can add predictor terms by using the “+” sign between predictors in the aov() call. For instance, the following code would run the ANOVA with price as the response with cut, color, and their interaction
fit1 <- aov(price ~ cut + color + cut*color,data=sparkly) #this isn't legal since the deign is not balanced fit2 <- aov(price ~ cut*color,data=sparkly) #this includes interaction and all lower order terms
Notice, we did not call the summary(fit1) or summary(fit2). Since the design is not balanced (there are not an equal number of observations at each combination), we need to use a special method to perfom the ANOVA. The most popular way to do this in R is to use the Anova() function in the ‘cars’ package, but this is not covered here.
We can demonstrate how to create an interaction plot with the diamonds data. To create an interaction plot, use interaction.plot(). interaction.plot() takes three arguments: x-axis predictor, separate lines predictor, and quantitative response. Run the following to see the difference.
To get a table of means, if the fitted model is called “fit”, use the print(model.tables()) call.
fit <- aov(price ~ cut*color,data=sparkly) print(model.tables(fit,"means"),digits=3)
## Tables of means ## Grand mean ## ## 3932.8 ## ## cut ## Fair Good Ideal Premium Very Good ## 4359 3929 3458 4584 3982 ## rep 1610 4906 21551 13791 12082 ## ## color ## D E F G H I J ## 3194 3089 3732 4014 4454 5080 5259 ## rep 6775 9797 9542 11292 8304 5422 2808 ## ## cut:color ## color ## cut D E F G H I J ## Fair 4291 3682 3827 4239 5136 4685 4976 ## rep 163 224 312 314 303 175 119 ## Good 3405 3424 3496 4123 4276 5079 4574 ## rep 662 933 909 871 702 522 307 ## Ideal 2629 2598 3375 3721 3889 4452 4918 ## rep 2834 3903 3826 4884 3115 2093 896 ## Premium 3631 3539 4325 4501 5217 5946 6295 ## rep 1603 2337 2331 2924 2360 1428 808 ## Very Good 3470 3215 3779 3873 4535 5256 5104 ## rep 1513 2400 2164 2299 1824 1204 678
fit <- aov(price ~ cut + color,data=sparkly) print(model.tables(fit,"means"),digits=3)
## Tables of means ## Grand mean ## ## 3932.8 ## ## cut ## Fair Good Ideal Premium Very Good ## 4359 3929 3458 4584 3982 ## rep 1610 4906 21551 13791 12082 ## ## color ## D E F G H I J ## 3194 3089 3732 4014 4454 5080 5259 ## rep 6775 9797 9542 11292 8304 5422 2808
Notice, the means correspond to the predictor terms in the fitted ANOVA model. Also notice that ‘rep’ (short for repetitions) is the number of observations at each combination and the means are not labeled. For example, the mean price for a dimaond with color D is $3,194.00 and there are 6,775 D color diamonds.
Assumption Checks for ANOVA
Regression and Correlation
Fitting a regression model works much the same as the aov() function, except we will use the lm() function. Suppose we want to predict price from carat weight.
plot(sparkly$carat,sparkly$price) fit <- lm(price ~ carat + I(carat^2), data = sparkly) #quadratic term because of the curve,I() function allows calculations plot(sparkly$carat,sparkly$price) lines(sparkly$carat,fitted(fit),col="red") #plot x's(carat weights) against y's (fitted/predicted values
We can do better than that. Let’s try log().
fit<-lm(log(price)~log(carat),data=sparkly) plot(log(sparkly$carat),log(sparkly$price)) abline(fit,col="blue")
Assumption Checks for Regression
Prediction Intervals for mean response
The following code produces a 90% prediction and the interval for the mean response for price for a carat weight of 1
predict(fit,newdata=data.frame(carat=1),interval="confidence",level=0.9) #The answer will be on the ln() scale
## fit lwr upr ## 1 8.448661 8.446416 8.450905
Prediction Intervals for a single Y
The following code produces a 90% prediction and the interval for a single Y for price for a carat weight of 1
## fit lwr upr ## 1 8.448661 8.016612 8.880709
library(car) #if 'car' is not installed, do install.packages('car') first and then run library(car)
## Loading required package: carData
library(MASS) studRes <- studres(fit) #studentized residuals hist(studRes)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -5.744654 -0.645387 -0.022518 0.000001 0.633404 5.095121
cooks <- cooks.distance(fit) #Cook's D hist(cooks)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.000e+00 1.416e-06 6.745e-06 1.946e-05 2.178e-05 3.129e-03
Notes regarding multiple regression models in R
The following are ways to add interactions and functions of predictor variables:
| . | include all single predictors | +X | include this variable | -X | delete this variable | X:Z | include the interaction between these variables | X*W | include these variables and the interactions between them | X | Z | conditioning: include x given z | (W + X + Z )^3 | include these variables and all interactions up to three way | I(X*Z) | allows functions: include a new variable consisting of these variables multiplied | X - 1 | delete the intercept (regress through the origin)
There are often multiple ways to create your model in R. The following three are all equivalent:
Y ~ X + Z + W + X:Z + X:W + Z:W + X:Z:W
Y ~ X * Z * W
Y ~ (X + Z + W)^3
You can also use the “.” in combination with others. The following 2 lines are equivalent (if you only have W, X, Y, and Z in your dataset):
fit <- lm(Y ~ ., )
fit <- lm(Y ~ W + X + Z )
The next 2 are also equivalent (if you only have W, X, Y, and Z in your dataset):
fit <- lm(Y ~ .-W)
fit <- lm(Y ~ X + Z)
- When running an ANOVA, if your categorical predictor is entered as a number, you must turn the quantitative input into a factor before running the ANOVA.
- If the ‘cut’ column had number 1 through 5 representing the 5 levels (fair, good, very good, ideal, and premium), use the command fit <- aov(price ~ factor(cut), data = sparkly)
- Only make a quantitative column a factor for regression if it is intended to be categorical.