This tutorial uses the diamonds data from the previous pages. Run the following line of code if you haven’t already loaded it.

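sparkly <- read.csv("https://mooredata.weebly.com/uploads/1/6/0/9/16090314/diamonds.csv", stringsAsFactors = TRUE)  #loads data directly from website; stringsAsFactors = TRUE keeps text columns such as 'cut' as factors (this was the default before R 4.0)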

Referring to Columns and Rows in a Data Frame

Most data files are organized so that the columns are variables and the rows are observations. R will usually treat the datasets we work with as a data frame. If a dataset is entirely numerical, it may instead be stored as a different type of object, a matrix, which affects how you can refer to a column (see below). You will often want to refer to a specific column, a specific row, or a specific row of a specific column in a data frame.
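
As a minimal sketch of the data frame versus matrix distinction, the toy objects below (toy_df and toy_mat are made-up names, not part of the diamonds data) suggest how position-based indexing works on both, while the $ operator works only on the data frame.

```r
# Hypothetical toy data, not the diamonds data
toy_df  <- data.frame(carat = c(0.2, 0.3, 0.4), price = c(300, 450, 700))
toy_mat <- as.matrix(toy_df)   # an all-numeric data frame converted to a matrix

class(toy_df)        # "data.frame"
is.matrix(toy_mat)   # TRUE

mean(toy_df[, 1])          # position-based indexing works on a data frame
mean(toy_mat[, 1])         # ...and on a matrix
mean(toy_mat[, "carat"])   # a matrix column can also be picked by name

# The $ operator raises an error on a matrix, so the error is caught here
dollar_result <- tryCatch(toy_mat$carat, error = function(e) "error")
dollar_result
```

If you are unsure which kind of object you have, class() or is.data.frame() will tell you before you choose how to index it.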

Referring to a Column

There are 3 good ways to refer to a column:

  1. dataset$columnName (This way only works for a data frame or list)
  2. dataset[ ,i] to refer to the i^th column (This works for both data frames and matrices)
  3. attach(dataset) and then refer directly to columnName (Works for data frame)

The following 3 lines of code are equivalent.

mean(sparkly$carat)
## [1] 0.7979397
mean(sparkly[,2])  #since 'carat' is the second column
## [1] 0.7979397
mean(sparkly[['carat']])  #double square brackets extract one element of a list (and a data frame is a list of equal-length column vectors)
## [1] 0.7979397

If you try mean(carat) you will get an error. However, attaching ‘sparkly’ will allow us to refer directly to the column name.

attach(sparkly)
mean(carat)
## [1] 0.7979397

Note that using attach() can be problematic if you have columns in separate datasets that share a name and you attach both. Use the detach() function before attaching another dataset. Also, attach() works on a copy of the dataset, so R “remembers” the previous definition of an attached column even after you change the data frame itself. Bottom line: attach() is handy but can be problematic, so beware.

detach(sparkly)
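
The “remembers” problem can be illustrated with a small sketch. The data frame toy below is hypothetical, and with() is shown only as one possible alternative from base R that avoids attaching at all.

```r
toy <- data.frame(carat = c(0.2, 0.3, 0.4))   # hypothetical toy data

attach(toy)
toy$carat <- toy$carat * 10   # change the data frame after attaching it

attached_mean <- mean(carat)       # uses the attached copy made before the change
current_mean  <- mean(toy$carat)   # uses the data frame as it is now
detach(toy)

# with() looks up carat inside toy for this one expression, no attach() needed
with_mean <- with(toy, mean(carat))

c(attached_mean, current_mean, with_mean)
```

The first value differs from the other two because the attached copy was never updated.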

Referring to a Specific Row of a Specific Column

To refer to a specific row of a specific column, there are several options. The following all refer to the 2nd observation for ‘cut’:

sparkly$cut[2]  #2nd element of the 'cut' column vector
## [1] Premium
## Levels: Fair Good Ideal Premium Very Good
sparkly[['cut']][2]  #2nd element of 'cut' 
## [1] Premium
## Levels: Fair Good Ideal Premium Very Good
sparkly[2,3]  #2nd row, 3rd column of the sparkly data frame
## [1] Premium
## Levels: Fair Good Ideal Premium Very Good
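
Referring to a whole row uses the same square-bracket notation with the column position left empty. The sketch below uses a hypothetical three-row data frame named toy rather than the diamonds data.

```r
# Hypothetical toy data
toy <- data.frame(carat = c(0.23, 0.21, 0.29),
                  cut   = c("Ideal", "Premium", "Good"),
                  price = c(326, 326, 334))

second_row <- toy[2, ]        # the whole 2nd row: a data frame with one row
first_two  <- toy[1:2, ]      # rows 1 and 2, all columns
by_name    <- toy[2, "cut"]   # 2nd row of the 'cut' column, chosen by name

second_row
by_name
```

Choosing a column by name rather than by position keeps the code working if the column order changes.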

The Basics of Subsetting

Suppose we want the average carat weight, but only for Premium diamonds. That is, we want the mean of the carat column, restricted to the rows where cut is Premium.

mean(sparkly$carat[sparkly$cut=="Premium"])  #mean of carat column such that the cut is Premium
## [1] 0.8919549
mean(sparkly[['carat']][sparkly$cut=="Premium"])
## [1] 0.8919549

We can also subset on multiple criteria.

mean(sparkly$price[sparkly$cut=="Premium" & sparkly$table > 60])  #average price for premium cut and table larger than 60
## [1] 4777.124
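
To see why this works, it may help to look at the condition on its own: it produces one TRUE or FALSE per row, and the square brackets keep only the TRUE rows. A minimal sketch on a hypothetical four-row data frame named toy:

```r
# Hypothetical toy data
toy <- data.frame(carat = c(0.5, 1.0, 1.5, 2.0),
                  cut   = c("Premium", "Ideal", "Premium", "Good"))

is_premium <- toy$cut == "Premium"   # one TRUE/FALSE per row
is_premium
sum(is_premium)                      # number of rows that match

premium_mean <- mean(toy$carat[is_premium])            # mean over the TRUE rows only
premium_big  <- toy$carat[is_premium & toy$carat > 1]  # & requires both conditions
either       <- toy$carat[is_premium | toy$carat > 1]  # | requires at least one
```

Storing the condition in a named vector such as is_premium also makes long subsetting expressions easier to read.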

Graphical Displays

R has extensive graphics capabilities, especially if you utilize newer packages like ggplot2. In this tutorial, you will learn some basic capabilities.

Histograms

To create a histogram, use the hist() function.

hist(sparkly$price,xlab="Price ($)",cex.lab=1.5,main="Distribution of Diamond Price",sub="n = 53,940 diamonds",cex.main=2)

A quick glance at the help file for the hist() function shows there are many customization options. The arguments used above were as follows:
xlab= gives a custom label to the x-axis (ylab= does the same for the y-axis)
main= and sub= give a custom main title and subtitle
cex.main= and cex.lab= scale the font size of the main title and the axis labels, for example to 1.5 or 2 times the default

Type ?hist into your console and glance at the help file to discover other options for customization, such as changing the x- and y-axis limits (xlim= and ylim=) or the bins (breaks=).
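
As a rough sketch of those other options, the call below sets the bin edges and axis limits explicitly. The prices vector is made up for illustration, and the particular limits are arbitrary choices.

```r
# Hypothetical prices, not the diamonds data
prices <- c(350, 500, 620, 800, 950, 1200, 1500, 2100, 3400, 5200, 7800, 9900)

h <- hist(prices,
          breaks = seq(0, 10000, by = 1000),   # bin edges every $1,000
          xlim = c(0, 10000),                  # x-axis limits
          ylim = c(0, 6),                      # y-axis limits
          xlab = "Price ($)",
          main = "Toy Price Distribution")

h$counts   # hist() also returns the bin counts it plotted
```

Saving the result of hist() is optional, but it gives you access to the breaks and counts if you want to check them.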

Boxplots

A boxplot is a nice way to visualize a quantitative variable across categories of some categorical variable.

par(mar=c(5,5,2,1))  #set the margin sizes. default from bottom, left, top, right are 5.1, 4.1, 4.1, 2.1
boxplot(sparkly$price~sparkly$cut,xlab="Cut",ylab="Price ($)",at=c(1,2,5,4,3),cex.lab=1.5)

Notice the ‘~’ sign in the boxplot command. This means ‘modeled by’ or ‘as a function of’.
Note that if you leave out the at= argument in the above, the boxes would be drawn in alphabetical order (Fair, Good, Ideal, Premium, Very Good). Since ‘cut’ is an ordinal variable, re-ordering is necessary. The at= argument gives the x-axis position of each box, listed in alphabetical order of the categories: at = c(1,2,5,4,3) says “draw Fair at position 1, Good at position 2, Ideal at position 5, Premium at position 4, and Very Good at position 3”, so the boxes appear from left to right as Fair, Good, Very Good, Premium, Ideal.
In addition, you can change the labels for the boxes using the names= argument. For example, to label 2 boxes Box 1 and Box 2, use names = c("Box 1", "Box 2") inside the boxplot() function.
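
Another way to control the order, sketched here as an alternative rather than as the approach used above, is to set the order of the factor levels once so that every later plot follows it. The data frame toy and its values are hypothetical.

```r
# Hypothetical toy data
toy <- data.frame(
  cut   = c("Fair", "Good", "Very Good", "Premium", "Ideal", "Fair", "Ideal"),
  price = c(4300, 3900, 4000, 4600, 3500, 4400, 3400)
)

# Store cut as a factor whose levels are in quality order, not alphabetical order
toy$cut <- factor(toy$cut, levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
levels(toy$cut)

# No at= is needed: boxes are drawn in the order of the factor levels
boxplot(price ~ cut, data = toy, xlab = "Cut", ylab = "Price ($)")
```

The data= argument lets the formula use bare column names instead of repeating the data frame name.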

Scatterplot

A scatterplot is a nice way to visualize the relationship between 2 quantitative variables. It seems natural to model price against carat weight. The following scatterplot does this and adds a third variable, cut, to the graph.

plot(sparkly$carat,sparkly$price,col=sparkly$cut,xlab="Carat Weight",ylab="Price ($)")
legend("bottomright",fill=1:5,legend=levels(sparkly$cut))

fill= specifies colors 1 through 5. You can also specify colors by name. To see a list of all 657 colors, type colors() into your console.
The legend() function adds a legend to a graph that has already been created, so you have to run the plot() command before running the legend() command.
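
The reason col= can take a categorical column is that a factor is stored as integer codes, one per level in alphabetical order, and each code picks a color from the current palette. A small sketch with hypothetical vectors (cut_toy, carat_toy, and price_toy are made-up names):

```r
# Hypothetical toy data
cut_toy   <- factor(c("Ideal", "Premium", "Good", "Ideal", "Fair"))
carat_toy <- c(0.3, 0.9, 0.5, 1.2, 0.7)
price_toy <- c(500, 3200, 1100, 6800, 1900)

levels(cut_toy)               # levels in alphabetical order
codes <- as.integer(cut_toy)  # the integer code behind each observation
codes
palette()[codes]              # the palette color each code selects

plot(carat_toy, price_toy, col = cut_toy, pch = 19,
     xlab = "Carat Weight", ylab = "Price ($)")
# seq_along(levels(...)) gives 1, 2, ... so the legend colors match the codes
legend("bottomright", fill = seq_along(levels(cut_toy)), legend = levels(cut_toy))
```

Because the legend colors and the point colors both come from the level order, they stay in agreement as long as both use the same factor.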

Line Graph

A line graph can show how a quantitative variable changes over time. We will make some data to work with: if sales1 and sales2 are two vectors of numbers that you want to show on the same line graph, use the plot() function to draw the first and the lines() function to add the second. For example:

sales1 <- c(8,5,14,13,25)
sales2 <- c(13,8,6,18,7)
plot(sales1,type="o",col ="red",xlab = "Month",ylab ="Sales ($1,000s)")
lines(sales2, type="o", col = "blue",lty=2)

Note that type="p" plots points, type="l" plots a line, and type="o" plots both. lty=2 creates a dashed line. Try other numbers to see the other line types. Adding xaxt="n" in the plot function leaves the x-axis tick labels blank. You can use the axis() function to create custom labels. The at=1:5 below tells R where to put your labels (position 1, then 2, then 3, etc…)

plot(sales1,type="o",col ="red",xlab = "Month",ylab ="Sales ($1,000s)",xaxt="n")
lines(sales2, type="o", col = "blue",lty=2)
axis(1,at=1:5,labels=c("Jan","Feb","Mar","Apr","May"))
legend(1,25,c("Quarter 1","Quarter 2"),lty=c(1,2),lwd=c(2.5,2.5),col=c("red","blue"))

The legend() function places the top-left corner of the legend at the given X and Y coordinates (here at (1,25)) and mimics the line types and colors used in the graph.
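
One thing to watch for: plot() chooses the y-axis limits from the first series only, so a second series with larger values would run off the graph. A hedged sketch of one way to handle this is to compute the limits from both series first. The vector sales2_alt is a hypothetical variant of sales2 with a larger final value, and the legend labels are placeholders.

```r
sales1     <- c(8, 5, 14, 13, 25)
sales2_alt <- c(13, 8, 6, 18, 30)   # hypothetical: last value exceeds max(sales1)

y_range <- range(c(sales1, sales2_alt))   # smallest and largest value across both series

plot(sales1, type = "o", col = "red", ylim = y_range,
     xlab = "Month", ylab = "Sales ($1,000s)")
lines(sales2_alt, type = "o", col = "blue", lty = 2)

# A keyword such as "topleft" positions the legend without choosing coordinates
legend("topleft", c("Series 1", "Series 2"), lty = c(1, 2), col = c("red", "blue"))
```

With the original sales1 and sales2 this is not needed, since the largest value is already in sales1.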

Tips
