Workshop on R

May 2018

Hosted by Virginia Education Science Training (VEST) Program at UVA

Syntax

  

It’s hard to know where to start when teaching a new programming language. This page is meant to give some background about R that hopefully
1. explains a little about how it is put together, and
2. puts it in context with other programming languages you might know.

That said, revisiting this page after working through the other modules might be useful.

R: language + environment

R is an implementation of the S language, which was developed at Bell Labs. As a GNU project, R is open source and free (as in freedom) to use and distribute. It can be installed and used on most major operating systems.

R is best thought of as an integrated language and environment that was designed with statistical computing and data analysis in mind. To that end, its structure represents a compromise between a code base optimized for mathematical procedures and one with high-level functionality that can be used interactively (unlike compiled code). In other words, it’s a great tool for working interactively with quantitative data.

R is probably best known for its graphing capabilities, but it has continued to grow in popularity among data scientists,[1] who are increasingly extending R’s functionality through user-contributed packages.[2] We will use a number of packages in this workshop.

Integrated development environment (IDE) for R

RStudio

RStudio does most everything R-related well and with little fuss, so it’s a great all-around program for using R. We will use it in this workshop.

Quick exercise

If you haven’t already, open up RStudio and poke around. First, try entering an expression in the console (like 1 + 1). Next, open the script associated with this module and run the first line.

Assignment

R thinks of things as objects. Objects are like boxes in which we can put things: data, functions, and even other objects.

Before discussing data types and structures, the first lesson in R is how to assign values to objects. In R (for quirky reasons), the primary means of assignment is the arrow, <-, which is a less than symbol, <, followed by a hyphen, -.

## assign value to object x using <-
x <- 1

## show
x
[1] 1

You can also assign using a single equals sign, =:

## assign value to object y using =
y = 'a'

## show
y
[1] "a"

Keep in mind, however, that since = sometimes has other meanings in R (it’s how functions set argument options), it may be clearer to use <-.
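
To make the two meanings of = concrete, here is a minimal sketch (the object name rounded is made up for illustration). Inside the parentheses, = sets the digits argument of round(); outside, <- stores the result.

```r
## = inside the call sets the argument 'digits'; <- stores the result
rounded <- round(3.14159, digits = 2)

## show (prints 3.14)
rounded
```

No object called digits is created by this call; the name only has meaning as an argument within round().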

Quick exercise

Using the arrow, assign the output of 1 + 1 to x. Next subtract 1 from x and reassign the result to x.

Data types and structures

Types

There are three primary data types in R that you will regularly use:
- logical
- numeric (integer & double)
- character

Logical

Logical values can be TRUE, FALSE, or NA (missing). They can be assigned to objects or returned by logical operators (e.g., ==, !=, <, >, etc.), which makes them useful for control flow in loops and functions.

NB In R, you can shorten TRUE to T and FALSE to F, but both the short and long versions must be capitalized. Because T and F are ordinary objects that can be overwritten, while TRUE and FALSE cannot, the long versions are the safer choice in scripts.

## assignment
x <- TRUE
x
[1] TRUE
## ! == NOT
!x
[1] FALSE
## check
is.logical(x)
[1] TRUE
## evaluate
1 + 1 == 2
[1] TRUE
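
Two further points about logicals may be worth a quick sketch: comparisons are applied to every element of a vector, and T is only a stand-in for TRUE that can be reassigned. The object names scores and passed are hypothetical.

```r
## comparisons work element by element and return a logical vector
scores <- c(55, 72, 90)
passed <- scores >= 60
passed                   ## FALSE TRUE TRUE

## T is only an object that starts out equal to TRUE; it can be overwritten
T <- 0
identical(T, TRUE)       ## FALSE

## remove our copy so that T refers to TRUE again
rm(T)
identical(T, TRUE)       ## TRUE
```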

Numeric: Integer and Double

Numeric values can be either integers or double-precision floating-point values (doubles, for short). R automatically converts between the two data types for you, so knowing the difference between the two isn’t really important for most analyses.

If you want to use an integer, place a capital L after the number like 1L. If a number is stored as an integer, some R output will place an L behind the digits to let you know that. Mostly, R defaults to using doubles, but if you see a number with an L behind it, know that it’s still a number.

## use 'L' after digit to store as integer
x <- 1L
is.integer(x)
[1] TRUE
## R stores as double by default
y <- 1
is.double(y)
[1] TRUE
## both are numeric
is.numeric(x)
[1] TRUE
is.numeric(y)
[1] TRUE
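
As a small illustration of the automatic conversion described above, typeof() reports how a value is stored. The object names a, b, and d are arbitrary.

```r
## integer + integer stays integer
a <- 1L + 1L
typeof(a)    ## "integer"

## integer + double is promoted to double
b <- 1L + 0.5
typeof(b)    ## "double"

## division returns a double, even with integer inputs
d <- 4L / 2L
typeof(d)    ## "double"
```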

Character

Character values are stored as strings, which means you need to place either single ' or double " quotes around them. Numeric values can also be stored as strings (sometimes useful if you must store leading zeroes), but they have to be converted back to numbers before you can perform numeric operations on them (like adding or subtracting) or use them in a statistical model.

## store a string using quotation marks
x <- 'The quick brown fox jumps over the lazy dog.'
x
[1] "The quick brown fox jumps over the lazy dog."
## store a number with leading zeros
x <- '00001'
x
[1] "00001"

Quick exercise

Try to add a string digit to a numeric value. What happens? Can you convert the string version on the fly so that the equation works? (HINT: in R, you can change a vector type using as.<type>(), where <type> is the name of what you want.)

Structures

Building on these data types, R relies on four primary data structures:[3]
- vector
- matrix
- list
- data frame

Vector

A vector in R is just an ordered collection of values of one of the data types discussed above. In fact, a single value is a vector of one. Vectors do not have dimensions (dim()), but do have length(), which is good to remember when inspecting your data or writing loops and functions.

You combine multiple values using the concatenate, c(), function. We will use c() a lot.

## create vector
x <- 1

## check
is.vector(x)
[1] TRUE
## add to vector (can do so recursively meaning old x can help make new x)
x <- c(x, 5, 8)
x
[1] 1 5 8
## no dim...
dim(x)
NULL
## ...but length
length(x)
[1] 3

You can access the elements of a vector using brackets, [], after the object name. If you think of each element in the vector as having an address, that is, a way to access it specifically, then its address is its position number in the vector. This position number is called its index, and in R, the index always starts with 1.

In our current vector, we have three items, 1, 5, and 8, which in turn have indices of 1, 2, and 3. To access 5 specifically, we can call it using the brackets and its index: x[2].

## get the second element
x[2]
[1] 5
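
Brackets can also take more than one index at a time. The sketch below uses a separate, made-up vector v so that x is left untouched.

```r
## a separate example vector
v <- c(10, 20, 30, 40)

## first and third elements
v[c(1, 3)]      ## 10 30

## second through fourth elements
v[2:4]          ## 20 30 40

## everything except the first element
v[-1]           ## 20 30 40

## only the elements that meet a condition
v[v > 25]       ## 30 40
```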

Quick exercise

Since you know how to access a specific element in a vector and how to assign new values, try to change the 3rd element of the x vector to 4.

All values in a vector must be of the same type. If you concatenate values of different data types, R will automatically coerce all of them to the most flexible type present (in order: logical, integer, double, character). We can check this with class().

## check class of x
class(x)
[1] "numeric"
## add character
x <- c(x, 'a')
x
[1] "1" "5" "8" "a"
## check class
class(x)
[1] "character"
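
R's coercion order runs from logical to integer to double to character. A short sketch of each step, using throwaway values:

```r
## logical + integer becomes integer
class(c(TRUE, 2L))     ## "integer"

## integer + double becomes double (reported as "numeric")
class(c(1L, 2.5))      ## "numeric"

## anything + character becomes character
class(c(2.5, 'a'))     ## "character"
class(c(TRUE, 'a'))    ## "character"
```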

Matrix

A matrix is a 2D arrangement of values. Unlike a plain vector, it has dimensions. Like vectors, all data elements must be of the same type.

## create 3 x 3 matrix that is the sequence of numbers between 1 and 9
x <- matrix(1:9, nrow = 3, ncol = 3)
x
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
## ...fill by row this time
y <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
y
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
## a matrix has dimension
dim(x)
[1] 3 3

Use nrow() and ncol() to get the number of rows and columns, respectively.

## # of rows
nrow(x)
[1] 3
## # of columns
ncol(x)
[1] 3

Like a vector, you can access parts of a matrix. Since it has two dimensions, use a comma in the bracket to separate row indices from column indices.

When using brackets with objects that have two dimensions, a good rule of thumb is to add your comma first: x[ , ]. Numbers or objects you put between the first bracket and the comma will affect the rows; numbers between the comma and the closing bracket will affect the columns.

If you don’t put anything in either of those spaces (a blank space doesn’t count), R will assume you want all rows or columns, depending on which side of the comma is blank.

## show the values in the first row
x[1, ]
[1] 1 4 7
## show the values in the third column
x[ ,3]
[1] 7 8 9
## this is the same as just calling x by itself
x[ , ]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
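
Each side of the comma can hold several indices. The sketch below rebuilds the same 3 x 3 matrix under the made-up name m; the drop = FALSE argument is not covered elsewhere on this page, but it is part of base R's bracket indexing.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

## first two rows, all columns (a 2 x 3 matrix)
top <- m[1:2, ]
dim(top)            ## 2 3

## rows 1 and 3 of the third column (a plain vector)
m[c(1, 3), 3]       ## 7 9

## a single row normally drops to a vector; drop = FALSE keeps it a matrix
row1 <- m[1, , drop = FALSE]
dim(row1)           ## 1 3
```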

Quick exercise

Return the middle value of the x matrix. Next assign the middle value the character value ‘a’. What happens to the rest of the values in the matrix?

List

Lists are catch-all objects that can hold an assortment of other objects of different data types. They can be flat, meaning that all values are at the same level, or nested, with lists holding other lists.

## create single-level list
x <- list(1, 'a', TRUE)

## show
x
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE
## check
is.list(x)
[1] TRUE
## create blank list
y <- list()

## combine both lists in a new list, creating nested list
z <- list(x, y)

## show
z
[[1]]
[[1]][[1]]
[1] 1

[[1]][[2]]
[1] "a"

[[1]][[3]]
[1] TRUE


[[2]]
list()

You access items in lists like you do vectors and matrices. You may, however, need to use double brackets, [[]], and multiple pairs, [[]][[]], to reach the item you need.

## the first item in list z is list x
z[[1]]
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE
## to get to 'a' in list x, need to add more brackets
z[[1]][[2]]
[1] "a"
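
The difference between single and double brackets is easy to miss, so here is a minimal sketch. The list info and its item names are hypothetical.

```r
info <- list(id = 7, tags = c('a', 'b'))

## single brackets return a smaller list
class(info[1])        ## "list"

## double brackets return the item itself
class(info[[1]])      ## "numeric"

## named items can also be reached by name
info[['tags']]        ## "a" "b"
info$tags[2]          ## "b"
```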

Data frame

Data frames are really just an organized collection of lists / vectors that are the same length. That quick description, however, belies the importance of data frames: you will use them all the time in your data work.

Most of the time, you will be reading in data frames, but you can also create them.

## create data frame where col_* are the column (variable) names
df <- data.frame(col_a = c(1,2,3),
                 col_b = c(4,5,6),
                 col_c = c(7,8,9))

## show
df
  col_a col_b col_c
1     1     4     7
2     2     5     8
3     3     6     9
## check
is.data.frame(df)
[1] TRUE

Like matrices, data frames have a dim() and the number of rows and columns can be recovered using nrow() and ncol(). The column names, which are needed when estimating models and making graphics, are accessed using names().

## get column names
names(df)
[1] "col_a" "col_b" "col_c"
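
A quick sketch of the dimension functions on a data frame, using a small made-up data frame called roster. Note that length() counts columns, which follows from a data frame being a collection of equal-length vectors.

```r
roster <- data.frame(id = 1:4,
                     grade = c('a', 'b', 'a', 'c'))

dim(roster)       ## 4 2
nrow(roster)      ## 4
ncol(roster)      ## 2
length(roster)    ## 2
```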

To access a column, you need to give R the data frame’s name followed by a $ and then the variable name.

## get col_a
df$col_a
[1] 1 2 3

You can also use the df[['<var name>']] construction, which comes in handy in loops and functions.

## get col_a (note the quotation marks this time)
df[['col_a']]
[1] 1 2 3
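
As one possible illustration of why the double-bracket form helps in loops, the sketch below computes each column's mean. The data frame scores and the object col_means are made up for this example.

```r
scores <- data.frame(math = c(90, 80, 70),
                     read = c(60, 75, 90))

## loop over the column names, using [[ ]] to pull each column
col_means <- numeric(0)
for (v in names(scores)) {
    col_means[v] <- mean(scores[[v]])
}

## show (math is 80, read is 75)
col_means
```

Inside the loop, scores$v would not work, since R would look for a column literally named v.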

Quick exercise

Create two or three equal length vectors. Next, combine to create a data frame. Finally, change one value in the data frame (HINT: think about how you changed vector and matrix values before).

Packages

User-submitted packages are a huge part of what makes R great. Most of your scripts will make use of one or more packages.

Installation

CRAN

As you’ve seen on the getting started page, packages can be installed from the official CRAN repository using:

install.packages('<package name>')

The default option installs all dependencies (other packages that the package you want may rely on to work properly). By default, R will check how you installed R and download the right operating system file type.
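
A common pattern, sketched here under assumptions, is to install a package only when it is missing. The stand-in package name below is tools, which ships with R, so the check succeeds and nothing is downloaded; swap in the package you actually need. requireNamespace() is base R but is not otherwise covered on this page.

```r
## stand-in package name (ships with R, so nothing is installed here)
pkg <- 'tools'

## install only if the package cannot be found
if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
}
```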

Quick exercise

Install the survey package, which we will use in a later module. Don’t forget to use single or double quotation marks around the package name.

GitHub

Recently, people have begun sharing the source code for their R packages on GitHub. If you want to download a package on GitHub, either because it isn’t hosted on CRAN or because you want the newest development version, you can use the devtools package to get it (you will need git on your system, too):

library(devtools)
install_github('<github handle>/<repo name>')

Loading package libraries

Package libraries[4] can be loaded in a number of ways, but the easiest is to write:

library('<library name>')

where '<library name>' is the name of the package/library. You will need to load these before you can use their functions in your scripts. Typically, they are placed at the top of the script file.
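
A minimal sketch of loading and then using a library, with the tools package (which ships with R) as a stand-in. The file name is hypothetical and no file needs to exist. The package::function form in the last line is an alternative not covered above: it calls a single function without loading the whole library.

```r
## load a library, then call one of its functions directly
library('tools')
file_ext('scores.csv')                     ## "csv"

## or call a function with package::function, without loading the library
tools::file_path_sans_ext('scores.csv')    ## "scores"
```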

Quick exercise

Load the tidyverse package, which you should have already installed. This will be a good test of the installation since we will use tidyverse libraries throughout the rest of the workshop.

Help

Even I don’t have every R function and nuance memorized. With all the user-written packages, it would be difficult to keep up if I tried! When stuck, there are a few ways to get help.

Help files

In the console, typing a function name immediately after a question mark will bring up that function’s help file:

## get help file for function
?sum

Two question marks will search the help files of all installed packages for the term:

## search installed help files for a term
??sum
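
The question-mark shortcuts have function equivalents: ? corresponds to help() and ?? to help.search(). A small sketch of the function form, plus args(), which is base R but not mentioned above:

```r
## same as ?sum
help('sum')

## show a function's arguments without opening the help file
args(matrix)
```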

Google it!

Google is a coder’s best friend. If you are having a problem, odds are that 1,000 other people have too, and at least one of them has been brave (or foolhardy!) enough to ask about it in a forum like Stack Overflow, Cross Validated, or the R-help mailing list. Google it!

Miscellaneous notes about R

Compared to other statistical languages

Like all computing languages, R has its own structure and quirks. The idiomatic R approach to data analysis can be especially challenging at first for those who come to R from other common statistical packages or scripting languages, like SPSS, Python, and Stata.[5]

I came to R after learning Stata first, which I think is common for many researchers trained in econometric methods. For me and others who’ve made the same Stata-to-R transition, I think the root of many problems is the fundamental difference between how Stata and R operate. Whereas Stata is more of a procedural language in which commands do things in an environment (your data), R is more object-oriented in that data and functions are stored in variables or objects and await instructions that pertain to them.[6]

As pointed out by my friend and colleague Richard Blissett, users can see this difference in the command/function names in each language. Stata commands tend to be verbs: summarize, tabulate, and regress; on the other hand, R functions are often nouns: summary, table, and lm (for linear model). And so, common problems in the Stata-to-R switch such as
- I ran a model and didn’t get any output…
- How do I create local/global macros in R?
- Which of these data objects is the actual data?
- Why isn’t R doing anything?

may be due to misunderstanding this difference.[7]
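
The first and last questions in that list usually come down to results being stored rather than printed. A minimal sketch, using the mtcars data set that ships with R (the model and the object name fit are purely illustrative, not taken from the workshop data):

```r
## fitting a model and storing it prints nothing...
fit <- lm(mpg ~ wt, data = mtcars)

## ...until you ask the stored object for what you want
summary(fit)
coef(fit)
```

Because fit is just another object, it can be passed to other functions later in the script, which covers much of what local and global macros are used for in Stata.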

Like learning a new spoken language, constantly translating between your native tongue and the new language will only get you so far. To that end, I encourage native-Stata users to try to approach R without Stata procedures in mind (easier said than done, I know). That said, this document that shows the same analysis done in Stata and R side-by-side may be useful in the initial transition.

Other options for running R

There are many other ways besides RStudio to run R. Below are just a few that, depending on your personal preferences and project needs, may be better or worse than RStudio.

Miscellanea

Notes

  1. The “data scientist” as a person/title, like “big data,” has probably become a little played out, but for lack of a better catch-all term, I think everyone knows what I mean.

  2. For a little more history on R, particularly its success as an open source project, see Fox (2009).

  3. R also supports arrays, which can take on more than two dimensions.

  4. For clarity, I’ll call them packages when talking about what is downloaded and libraries when discussing what is loaded into memory. Since the names are the same, it’s really a semantic difference.

  5. If you come to R knowing C/C++, Fortran, or Java, see Rcpp, rFortran, rJava for some cool interactivity.

  6. Stata has some object-oriented features and R some procedural programming behaviors, so the assigned labels aren’t perfect. They are mostly right, though.

  7. Full disclosure: all questions I asked when learning R.