Most of the data manipulation in this class will be done in R, a programming language often used in statistics.
RStudio is a free software program that makes using R much easier.
- As its name implies, it is a ‘studio’ or ‘editor’ (actually an integrated development environment, IDE) that makes R easier to work with
- IMPORTANT: RStudio is not R. Without R installed, RStudio has nothing to ‘work with’
- RStudio gives us clickable options and allows us to integrate pre-made software packages for statistical analysis
RStudio.Version() # checks version of RStudio on computer
$citation
To cite RStudio in publications use:
RStudio Team (2019). RStudio: Integrated Development for
R. RStudio, Inc., Boston, MA URL http://www.rstudio.com/.
A BibTeX entry for LaTeX users is
@Manual{,
title = {RStudio: Integrated Development Environment for R},
author = {{RStudio Team}},
organization = {RStudio, Inc.},
address = {Boston, MA},
year = {2019},
url = {http://www.rstudio.com/},
}
$mode
[1] "desktop"
$version
[1] ‘1.2.5033’
$release_name
[1] "Orange Blossom"
For a much more detailed intro to R, check out: https://cran.r-project.org/doc/manuals/R-intro.pdf
The rest of this lab will center on running through the basics of using R, using the RStudio interface, and getting familiar with the practice of programming and statistical analysis. It is a lot of content to cover, but working with the material hands-on will help! At the end of this tutorial, there will be an R file with the code used today, as well as an activity we will work through in lab to play around with some data and get familiar with the output in R.
The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or code, instructions in R because it is a common language that both the computer and we can understand. We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.
To begin, open up RStudio. You will usually see that the main window is divided into four panels. We will primarily be using the two panels on the left: the top one is the script, and the bottom one is the console.
There are two main ways of interacting with R: by writing code and running it from the script (Panel 1), or by typing it and executing it in the console (Panel 2).
The script is where you will write your code and documentation; think of it as a document where your instructions go. But unless you run these instructions, they are just text to the computer. You can run code from your script by selecting the command you want and using the Ctrl + Enter shortcut on Windows, or Cmd + Return on Macs.
In the console, you can type your command in and run it directly by pressing Enter.
1 + 2
## [1] 3
16 / 4
## [1] 4
This is what that execution and output looks like in RStudio as well.
However, to do more, we need to assign values to objects. To do so, we name an object (say, some_numbers), followed by the assignment operator <- and the value we want to give it:
some_numbers <- 5
So now, we can do a series of mathematical operations to some_numbers.
2.2 * some_numbers
## [1] 11
We can also change the value of an object by reassigning it a new value using the same assignment operator <-:
some_numbers <- 10 + 20
some_numbers
## [1] 30
We could easily multiply 2.2 by 5 ourselves, or add 10 to 20 by hand, but the true power of assigning values to objects shows when we have multiple values stored. Below is a vector of five numbers (more on creating vectors later). When we save them as one object, we can apply the same mathematical operation to all five of them at once, without doing more work.
some_numbers <- c(2,3,6,18,108)
some_numbers * 5
## [1] 10 15 30 90 540
Isn’t that easier than doing it individually to each number?
Syntax: Case-sensitivity
Note that R is case-sensitive, so be careful how you name objects!
some_numbers
## [1] 2 3 6 18 108
Some_numbers
## Error in eval(expr, envir, enclos): object 'Some_numbers' not found
Syntax: Writing and commenting in-line
You can also add in-line comments and sections when writing code, to give your code some structure and explanation. This makes it easier both for you to keep track of the progress you have made, and for a reader to follow your code and what the variables mean. We use # to indicate a comment: when R sees a #, it ignores the rest of the line. One convention for sectioning code is to use multiple #### to indicate headers and major sections, ### for sub-sections, and just one # for minor comments. For example, if one had two variables, x and x2, which referred to different things, it could be helpful to comment next to both variables during assignment to record what they mean to you.
x <- 5 # Maximum score for first homework
x2 <- 10 # Total score for first and second homework
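The sectioning convention described above might look something like the sketch below. The variable names (hw1_max, hw2_max, total_max) are made up for illustration and are not part of the lab materials.

```r
#### Homework scores ####

### Raw inputs
hw1_max <- 5   # Maximum score for first homework (hypothetical name)
hw2_max <- 5   # Maximum score for second homework (hypothetical name)

### Derived values
total_max <- hw1_max + hw2_max   # Total score for first and second homework
total_max
```

Everything after each # is ignored by R, so only the three assignments and the final line actually run.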
Panel 1 - Script
Panel 2 - Console
Panel 3 - Local Environment
x <- 3 # variables can be stored in local environment
y <- "Hello World"
data <- read.csv("germancities.csv") # so can datasets
View(data)
Syntax: How to clear the environment
rm(x) # things can be removed individually
rm(list=ls()) # or all at once
Panel 4 - Everything Else
Data Types/Elements/Primitives in R
There are different ‘types’ of information stored in R as data. These are some of the most common ones that we will encounter.
The most basic data types in R are numbers, characters, strings (words), and booleans.
whole.number <- 6
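real.number <- 3.141592
Characters are enclosed in quotation marks, for example "a". Without the quotation marks, R reads a as the name of an object. Below, the assignment to char.two yields an error because the a is not enclosed in quotations, and there is no object called a.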
char.one <- "a"
char.two <- a
## Error in eval(expr, envir, enclos): object 'a' not found
string.one <- "Hello"
string.two <- "ab12cd"
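Booleans take the value TRUE or FALSE. Binary data can be stored in a TRUE/FALSE form, or in a 0/1 form. For example, Present/Absent is as easily represented as True/False. R also accepts T and F as shorthand for TRUE and FALSE.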
TRUE
## [1] TRUE
FALSE
## [1] FALSE
T
## [1] TRUE
F
## [1] FALSE
4. Factor
Another common type is factor data. Factors are used to capture categorical data, where there is a pre-defined set of values. For example, take class standing in a college, where four categories are used to organize students: freshman, sophomore, junior, and senior. These are used regardless of how many years of study a student might have completed, and are often assigned based on the classes and course credits a student has completed. Another often used example for factors is gender: male, female, and other.
We use the function levels() to find out what the levels in a factor are. Notice that even when the levels are mixed with numbers, as.factor() interprets them as character objects.
factor.example <- as.factor(c(rep("Freshman",3), rep("Sophomore",4), rep("Junior",6), rep("Senior", 2)))
levels(factor.example)
## [1] "Freshman" "Junior" "Senior" "Sophomore"
factor.example[1]
## [1] Freshman
## Levels: Freshman Junior Senior Sophomore
factor.example2 <- as.factor(c(rep(1,3), rep(2,4), rep("Junior",6), rep("Senior", 2)))
factor.example2
## [1] 1 1 1 2 2 2 2 Junior Junior Junior
## [11] Junior Junior Junior Senior Senior
## Levels: 1 2 Junior Senior
levels(factor.example2)
## [1] "1" "2" "Junior" "Senior"
factor.example2[1]
## [1] 1
## Levels: 1 2 Junior Senior
Syntax: Checking Data Types
Knowing the different data types is important because some operations only work with specific data types. For example, though you can add multiple numeric objects together using the addition operator +, it is not possible to do the same with character objects, or between character and numeric objects.
5 + 5
## [1] 10
"five" + "five"
## Error in "five" + "five": non-numeric argument to binary operator
5 + "five"
## Error in 5 + "five": non-numeric argument to binary operator
Aside from getting an error message, there are a couple of tools available to you for determining and changing the data type of an object.
is.numeric(whole.number)
## [1] TRUE
is.numeric(string.one)
## [1] FALSE
str(whole.number)
## num 6
class(string.one)
## [1] "character"
Changing data types:
One example of a need to change data types is when combining data across types. For example, while factors look (and often behave) like character vectors, they are actually stored as integer codes by R, so you need to be careful when treating them as strings. Because the levels don’t have any inherent numerical value, factors cannot be combined directly with numerical objects; they have to be transformed into numeric types first. Note that as.numeric() on a factor returns the integer code of each level (levels are ordered alphabetically by default), not a meaningful quantity.
factor.example + whole.number
## Warning in Ops.factor(factor.example, whole.number): '+' not meaningful for
## factors
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
as.numeric(factor.example) + 6
## [1] 7 7 7 10 10 10 10 8 8 8 8 8 8 9 9
You can also change numeric types to characters. Remember how strings have the quotation marks "" around them?
as.character(whole.number)
## [1] "6"
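One situation where the level codes matter is a factor whose labels are themselves numbers. The sketch below uses a made-up factor (number_factor is a hypothetical name) to show the difference between converting the codes and converting the labels.

```r
# A hypothetical factor whose labels happen to be numbers
number_factor <- as.factor(c("10", "20", "20", "30"))

# as.numeric() alone returns the position of each level, not the label
codes <- as.numeric(number_factor)
codes    # 1 2 2 3

# Converting to character first recovers the numbers written in the labels
values <- as.numeric(as.character(number_factor))
values   # 10 20 20 30
```

Going through as.character() first is a common way to avoid silently working with the level codes by mistake.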
Let’s combine some of these basic data types into a type of data some of you might be familiar with - a hypothetical list of student grades.
library(DT) # provides datatable() for interactive tables
data <- read.csv("class-grades.csv")
datatable(data, rownames = FALSE, filter = "top", options = list(pageLength = 5, scrollX = TRUE))
class(data$Student.ID)
## [1] "integer"
class(data$First.Name)
## [1] "character"
class(data$Class)
## [1] "character"
class(data$Grades)
## [1] "integer"
Here we have, at first glance, two types of data: numeric and character.
However, thinking back to the basic data types we just covered, we might find it useful to structure Class as factor instead of character if we want to organize the data by the seniority of students. As the output above shows, R read Class in as character: since R 4.0.0, read.csv() no longer converts text columns to factors by default. To do the conversion ourselves, we can use the function as.factor().
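If the order of the levels matters, as it does for seniority, one option is factor() with an explicit levels argument instead of as.factor(), which sorts levels alphabetically. This is a minimal sketch on a made-up vector (class_standing is hypothetical, not the Class column from the CSV file).

```r
# Hypothetical class standings, standing in for a column like data$Class
class_standing <- c("Junior", "Freshman", "Senior", "Sophomore", "Freshman")

# Spell out the levels in order of seniority
class_factor <- factor(class_standing,
                       levels = c("Freshman", "Sophomore", "Junior", "Senior"))

levels(class_factor)   # "Freshman" "Sophomore" "Junior" "Senior"
table(class_factor)    # counts are reported in the level order given above
```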
Vectors and lists are a way to store multiple values or objects, which can be either numbers or characters.
- Vectors are created using c()
- You can also use c() to add more elements to an existing vector
weights <- c(10, 15, 20, 60, 65)
weights
## [1] 10 15 20 60 65
animals <- c("mouse", "cat", "dog")
animals
## [1] "mouse" "cat" "dog"
c(weights, animals)
## [1] "10" "15" "20" "60" "65" "mouse" "cat" "dog"
Lists are similar to vectors in that they are a series of values. However, the key difference is that while vectors must contain the same data type across all elements, lists can mix elements of different types.
- Lists are created using list()
- Note how the vector x below stores 2 as a character instead of a number. This is called type-casting (also known as coercion).
- When we apply class() to both x and y, class tells us what kind of data is in the vector, but sees the list as a different data type
x <- c('mike', 2, 'lucy')
x
## [1] "mike" "2" "lucy"
class(x)
## [1] "character"
y <- list('mike', 2, 'lucy')
y
## [[1]]
## [1] "mike"
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] "lucy"
class(y)
## [1] "list"
Since vectors and lists are essentially just a group of objects, we can inspect their contents, structure, and also interact with the objects within them as a group.
- Note the difference between adding a number to every value in a vector, and adding another number to a vector
length(weights)
## [1] 5
class(weights)
## [1] "numeric"
weights + 30
## [1] 40 45 50 90 95
weights2 <- c(weights, 30)
weights2
## [1] 10 15 20 60 65 30
In this case, weights is a vector of 5 values. We can extract and replace values at specific locations in the vector, using square brackets [] to select the position of the value we are interested in.
weights[1]
## [1] 10
weights[3] <- 5
weights
## [1] 10 15 5 60 65
Syntax: Indices in R
For those who have some programming experience: R indices start at 1. Many other programming languages, such as C++, Java, and Python, start from 0 instead.
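Square brackets can also select several positions at once. The sketch below uses a fresh vector (example_weights, a hypothetical name) so that it does not depend on the changes made to weights above.

```r
example_weights <- c(10, 15, 20, 60, 65)

first_and_third <- example_weights[c(1, 3)]   # positions 1 and 3: 10 20
middle <- example_weights[2:4]                # positions 2 to 4: 15 20 60
all_but_first <- example_weights[-1]          # drop position 1: 15 20 60 65
heavy <- example_weights[example_weights > 18] # keep values above 18: 20 60 65
```

A negative index drops that position rather than counting from the end, which differs from some other languages.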
Similar to vectors, matrices (singular: matrix) are a collection of elements arranged into a fixed number of rows and columns. The function matrix() creates a matrix. All the elements must have the same data type, and all the columns must be the same length. Most often, matrices are numerical (think of the matrix in the movie The Matrix!).
Below we have a 4x3 matrix, with four rows and three columns. Here we used 1:12 to create a sequence of numbers from 1 to 12. It is the same as using c(1,2,3,4,5,6,7,8,9,10,11,12).
1:12
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
matrix (1:12, nrow = 4, ncol = 3)
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
You can also create a matrix from a vector. Every column of a matrix has to have the same length, so note how weights, with 5 values, is not the right size for a 3-row matrix (R warns us and recycles the first value to fill the last cell), but weights2, with 6 values, is.
matrix(weights, byrow = TRUE, nrow = 3)
## Warning in matrix(weights, byrow = TRUE, nrow = 3): data length [5] is not a
## sub-multiple or multiple of the number of rows [3]
## [,1] [,2]
## [1,] 10 15
## [2,] 5 60
## [3,] 65 10
matrix(weights2, byrow = TRUE, nrow = 3)
## [,1] [,2]
## [1,] 10 15
## [2,] 20 60
## [3,] 65 30
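Values inside a matrix can be picked out with square brackets too, giving the row first and the column second. A small sketch, assuming the same 4x3 matrix built from 1:12 earlier (the name m is just for illustration):

```r
m <- matrix(1:12, nrow = 4, ncol = 3)

dim(m)                    # 4 rows, 3 columns
one_cell <- m[2, 3]       # row 2, column 3: 10
second_column <- m[, 2]   # all rows of column 2: 5 6 7 8
first_row <- m[1, ]       # all columns of row 1: 1 5 9
```

Leaving the row or the column blank means "take all of them".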
Dataframe: A dataframe is a general form of data that has columns and rows, like a table. Unlike a matrix, a dataframe usually holds multiple different types of information. You might see something similar in a class roster, a list of invitees to a birthday party, or a list of ingredients in a recipe; these are all forms of data we interact with every day.
If you have a matrix or a list you would like to convert into a dataframe, the associated function is data.frame(). However, more often than not we start by importing a dataset into R that we want to work with.
To access something inside a dataframe, $ is often used to refer to a column. Below we have a makeshift dataframe with two columns, age and name. my.data$age allows us to reference the age column.
my.data <- data.frame(age = c(35, 24, 18, 72), name = c("Oliver", "Meghan",
"Cole", "Violet"))
View(my.data)
my.data$age # the '$' lets R know you want to access something inside of my.data
## [1] 35 24 18 72
data <- read.csv("excel-employee.csv")
datatable(data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))
So in the dataset above, we already have at least three of the basic data types we covered earlier.
The dataset itself is a dataframe, the employee IDs can be numerical, and the names can be thought of as characters or strings. If there are only a few categories for the job title, it can be a factor.
Note that in this case the variables are structured as the columns of the data, but that does not always have to be the case; variables can also be in the rows of the data.
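To see which type each column of a dataframe holds, and to pull out particular rows, something like the following sketch can help. It rebuilds a small dataframe (people, a hypothetical name) so it does not rely on any CSV file.

```r
people <- data.frame(age = c(35, 24, 18, 72),
                     name = c("Oliver", "Meghan", "Cole", "Violet"))

str(people)                       # one line per column, with its type
n_rows <- nrow(people)            # 4
n_cols <- ncol(people)            # 2

over_30 <- people[people$age > 30, ]          # rows where age is above 30
names_over_30 <- people$name[people$age > 30] # "Oliver" "Violet"
```

The comma inside the square brackets matters: the condition before the comma selects rows, and the blank after it keeps every column.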
Now that we’ve covered some of the essential basics of R syntax, let us move to the basics of actually using your console to manipulate data.
One of the most common uses of R is arithmetic. Here is a list of the most common arithmetic operators.
- The order of operations is the same as the convention in math: parentheses, exponents, multiplication and division (from left to right), addition and subtraction (from left to right); also known as PEMDAS.
- When in doubt, use parentheses!
Operator | Description |
---|---|
+ | addition |
- | subtraction |
* | multiplication |
/ | division |
^ or ** | exponentiation |
x %% y | modulus (remainder of x divided by y): 5 %% 2 is 1 |
x %/% y | integer division: 5 %/% 2 is 2 |
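A few quick lines to try in the console, as a sketch of how these operators and the order of operations behave (the object names are only for illustration):

```r
no_parens <- 2 + 3 * 4       # multiplication first: 14
with_parens <- (2 + 3) * 4   # parentheses first: 20
power <- 2 ^ 3               # 8
remainder <- 5 %% 2          # 1
whole_part <- 5 %/% 2        # 2
negative_power <- -2 ^ 2     # -4, because ^ is applied before the minus sign
```

The last line is a common surprise; writing (-2) ^ 2 gives 4.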
Another common type of operator is the comparison operator, which returns TRUE or FALSE:
Operator | Description |
---|---|
< | Less than |
> | Greater than |
<= | Less than or equal to |
>= | Greater than or equal to |
== | Equal to each other |
!= | Not equal to each other |
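Like arithmetic, comparisons are applied to every value in a vector at once. A short sketch with made-up scores (scores is a hypothetical vector):

```r
scores <- c(55, 70, 85, 90)

passed <- scores >= 70          # FALSE TRUE TRUE TRUE
n_passed <- sum(scores >= 70)   # TRUE counts as 1, so this is 3
not_70 <- scores[scores != 70]  # 55 85 90
```

Because TRUE is treated as 1 and FALSE as 0, sum() on a comparison counts how many values meet the condition.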
For any of the functions used here, such as c(), list(), data.frame(), matrix(), and any future functions, you can type ? followed by the function name to open R's help page for that function.
As shown below, the help page gives you more detail on the function, as well as the options one can specify, such as the dimensions of a matrix for matrix().
?matrix
?list
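Two related base R tools are worth knowing about: help() does the same job as ?, and args() prints just the arguments a function accepts. A minimal sketch:

```r
help("matrix")   # opens the same help page as ?matrix
args(matrix)     # prints the arguments of matrix(), including nrow, ncol, and byrow
```

args() is handy when you only need a reminder of the argument names and do not want to read the whole help page.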
As you might have noticed in the above code chunks, in R we use the assignment operator <- to assign values to objects. There are other assignment operators (such as =), but they can be confusing, so stick to using <-.
Value assignment means setting a value to be stored in an object (also called a variable). This may seem inefficient, since we could do 1 + 2 instead of x <- 1, y <- 2, x + y. But it becomes increasingly difficult to manually type out the values of interest when the number of values we are working with gets bigger, as in a dataset.
x = 2 # Please avoid at all costs
x <- 4 # Better assignment operator
One of the main reasons why = can be confusing is that there is another operator, ==, which is the equal-to operator. It checks whether the value on the left is the same as the value on the right. For example, 1 == 2 asks whether 1 equals 2; mathematically it does not, so the result is FALSE. The opposite of the equal-to operator is the not-equal-to operator, !=.
x == 4
## [1] TRUE
x == 5
## [1] FALSE
x != 5
## [1] TRUE
This equal-to operator becomes very useful when we want to check if two values are the same. Take for instance those forms that ask you to type in your email twice: the form probably uses an equal-to operator to determine if you typed the same email both times.
email <- 'john.smith@duke.edu'
email2 <- 'james.smith@duke.edu'
email == email2
## [1] FALSE
x == 600 & y == 700 # & == 'and this is also true'
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## [1] FALSE FALSE FALSE
x == 1000 | y == 7 # | == 'this, or this, or both'
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## [1] NA FALSE NA
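The & and | operators are easier to read on plain numbers, where no type conversion is involved. A small sketch with made-up values (a and b are hypothetical objects):

```r
a <- 600
b <- 7

both_true <- a == 600 & b == 700    # second condition is FALSE, so FALSE
either_true <- a == 1000 | b == 7   # second condition is TRUE, so TRUE

# On vectors, the comparison is made position by position
pairwise <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)   # TRUE FALSE FALSE
```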
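- The c() operator for vectors means concatenate
- c() produces a vector of a single data type; if the inputs are mixed, they are converted to a common type
- To join strings together, use the paste() function
'a' + 'b' # Doesn't work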
## Error in "a" + "b": non-numeric argument to binary operator
string.one <- "Hello"
string.two <- "World"
c(string.one, string.two)
## [1] "Hello" "World"
c(1,2,3,4,5)
## [1] 1 2 3 4 5
paste(string.one, string.two, sep = " ") # Default sep input
## [1] "Hello World"
paste0(string.one, string.two) # Eliminates spacing
## [1] "HelloWorld"
paste(string.one, whole.number) # Typecasts numeric
## [1] "Hello 6"
Remember that in the Vectors and Lists section I mentioned that vectors have to be the same type? Look at what happens when we try to concatenate characters and numbers:
x <- c(1, "a", 2, "b") # but they do have to be the same type!
# notice that 1 and 2 are of type character, this is called 'type casting'
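One way to confirm the type casting is to check the class of the result. The sketch below uses a separate object (mixed, a hypothetical name) so that x is left untouched.

```r
mixed <- c(1, "a", 2, "b")

mixed_class <- class(mixed)   # "character"
mixed                         # "1" "a" "2" "b"

# Converting back to numbers only works for the entries that look like numbers;
# the others become NA (R normally warns about this, silenced here)
back_to_numbers <- suppressWarnings(as.numeric(mixed))   # 1 NA 2 NA
```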
Seq and Rep
In cases where there is some repetition or sequence in the values, such as an index to number off observations or a repeating pattern to group observations, we can use seq() and rep() to make the job easier. seq() generates numeric sequences, while rep() works on both numeric and character objects.
x <- c(1,2,3,4,5) # the c stands for concatenate
x <- seq(from = 1, to = 5, by = 1) # same as above
x <- 1:5 # same as above
x <- rep(1, times = 10)
x <- factor(LETTERS[1:4]); names(x) <- letters[1:4]
x
## a b c d
## A B C D
## Levels: A B C D
rep(x, 2)
## a b c d a b c d
## A B C D A B C D
## Levels: A B C D
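A few more variations that may be useful, shown as a sketch (the object names are only for illustration): seq() can take a step size or a desired length, and rep() can repeat the whole vector or each element.

```r
quarter_steps <- seq(from = 0, to = 1, by = 0.25)       # 0 0.25 0.5 0.75 1
five_points <- seq(from = 2, to = 10, length.out = 5)   # 2 4 6 8 10

whole_repeated <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
each_repeated <- rep(c("a", "b"), each = 2)     # "a" "a" "b" "b"
```

The each version is the one to reach for when labelling groups of consecutive observations.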
Finally, let’s consider what first steps you can take for analyzing a dataset or a series of values.
Take two hypothetical objects: x.vector, which is a vector of 100 numbers (don’t worry about the rnorm function used to create it, for now); and y.data, a 30x3 dataframe with some randomly dispersed NAs.
x.vector <- rnorm(100, 5, 1) # x is a vector of 100 numbers, dont worry about rnorm as a function for now
x.vector
## [1] 4.722845 2.756958 4.963318 5.173789 4.630642 5.269010 5.115868 6.846651
## [9] 4.174590 4.566602 5.523526 5.367035 3.216588 4.335152 4.502461 5.049381
## [17] 4.994670 5.129686 4.663295 5.654411 4.649022 6.717467 5.176222 4.378097
## [25] 3.229161 5.024140 4.891609 4.472859 6.777776 7.422602 4.216680 4.388373
## [33] 3.311464 5.153788 4.669442 5.749298 4.336615 4.922969 5.690669 5.484640
## [41] 6.368945 5.837094 4.895967 4.067469 4.058879 4.762727 4.573517 5.748911
## [49] 5.194847 2.444230 5.826110 4.462081 5.296993 6.606340 5.199908 4.737173
## [57] 5.846989 4.221250 5.222095 3.860765 5.128377 3.714430 5.337542 5.755695
## [65] 3.102078 6.725996 7.370974 4.375311 4.764400 5.218061 6.897340 4.902640
## [73] 6.756010 6.479825 3.694455 4.280277 5.792453 4.144253 5.817015 5.332845
## [81] 5.104136 6.571989 4.627619 4.789822 5.127112 5.224334 4.545933 5.039345
## [89] 3.948539 5.641053 4.476115 5.649160 5.498963 5.274503 5.348276 6.222451
## [97] 4.085869 6.228419 6.139003 6.735859
df <- data.frame(A = rep(1:3, times = 10), B = rep(4:6, times = 10), C = rep(7:9, times = 10))
y.data <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.80, 0.20), size = length(cc), replace = TRUE) ]))
datatable(y.data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )
We can look directly at the objects by calling them:
x.vector
## [1] 4.722845 2.756958 4.963318 5.173789 4.630642 5.269010 5.115868 6.846651
## [9] 4.174590 4.566602 5.523526 5.367035 3.216588 4.335152 4.502461 5.049381
## [17] 4.994670 5.129686 4.663295 5.654411 4.649022 6.717467 5.176222 4.378097
## [25] 3.229161 5.024140 4.891609 4.472859 6.777776 7.422602 4.216680 4.388373
## [33] 3.311464 5.153788 4.669442 5.749298 4.336615 4.922969 5.690669 5.484640
## [41] 6.368945 5.837094 4.895967 4.067469 4.058879 4.762727 4.573517 5.748911
## [49] 5.194847 2.444230 5.826110 4.462081 5.296993 6.606340 5.199908 4.737173
## [57] 5.846989 4.221250 5.222095 3.860765 5.128377 3.714430 5.337542 5.755695
## [65] 3.102078 6.725996 7.370974 4.375311 4.764400 5.218061 6.897340 4.902640
## [73] 6.756010 6.479825 3.694455 4.280277 5.792453 4.144253 5.817015 5.332845
## [81] 5.104136 6.571989 4.627619 4.789822 5.127112 5.224334 4.545933 5.039345
## [89] 3.948539 5.641053 4.476115 5.649160 5.498963 5.274503 5.348276 6.222451
## [97] 4.085869 6.228419 6.139003 6.735859
View(y.data)
- Calling objects directly (x.vector or View(y.data)) might not always be useful because you’ll just get a really long list of numbers and objects.
- View() opens up the dataframe in a new tab, and allows you to scroll through it
- Instead, we can check the length of x.vector using length(), as well as look at the first couple of values of y.data using head(), before deciding how to proceed.
- The column names can be seen in head(), in View(), and also by calling them directly using colnames()
length(x.vector)
## [1] 100
x.vector
## [1] 4.722845 2.756958 4.963318 5.173789 4.630642 5.269010 5.115868 6.846651
## [9] 4.174590 4.566602 5.523526 5.367035 3.216588 4.335152 4.502461 5.049381
## [17] 4.994670 5.129686 4.663295 5.654411 4.649022 6.717467 5.176222 4.378097
## [25] 3.229161 5.024140 4.891609 4.472859 6.777776 7.422602 4.216680 4.388373
## [33] 3.311464 5.153788 4.669442 5.749298 4.336615 4.922969 5.690669 5.484640
## [41] 6.368945 5.837094 4.895967 4.067469 4.058879 4.762727 4.573517 5.748911
## [49] 5.194847 2.444230 5.826110 4.462081 5.296993 6.606340 5.199908 4.737173
## [57] 5.846989 4.221250 5.222095 3.860765 5.128377 3.714430 5.337542 5.755695
## [65] 3.102078 6.725996 7.370974 4.375311 4.764400 5.218061 6.897340 4.902640
## [73] 6.756010 6.479825 3.694455 4.280277 5.792453 4.144253 5.817015 5.332845
## [81] 5.104136 6.571989 4.627619 4.789822 5.127112 5.224334 4.545933 5.039345
## [89] 3.948539 5.641053 4.476115 5.649160 5.498963 5.274503 5.348276 6.222451
## [97] 4.085869 6.228419 6.139003 6.735859
head(y.data)
## A B C
## 1 NA 4 7
## 2 NA NA 8
## 3 3 6 9
## 4 1 4 7
## 5 2 NA 8
## 6 3 NA 9
colnames(y.data)
## [1] "A" "B" "C"
We can also summarize (summary()) and tabulate (table()) the vector and dataframe variables respectively, to determine what the range of values is and whether there are NAs in the data.
- The summary tells us the min, max, and mean of each object of interest. You can also find these values directly using specific functions: min(), max(), mean().
- Additionally, you can find the standard deviation of x.vector using sd()
- Note that tabulating the vector is not informative here because all the values are different
- However, it is pretty useful for the dataframe, because there are repeated numbers in the dataframe
summary(x.vector)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.444 4.475 5.110 5.084 5.663 7.423
head(table(x.vector))
## x.vector
## 2.44422976235314 2.7569581590775 3.10207803311483 3.21658760958651
## 1 1 1 1
## 3.22916141094047 3.31146424567282
## 1 1
summary(y.data$A)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 1 2 2 3 3 7
summary(y.data$B)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.000 4.000 5.000 4.864 5.750 6.000 8
table(y.data$B)
##
## 4 5 6
## 9 7 6
table(y.data$C)
##
## 7 8 9
## 10 7 8
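One thing to watch for when calling mean(), min(), max(), or sd() directly on data with NAs: by default the result is NA. The sketch below uses a small made-up vector (values is hypothetical) instead of y.data, whose NAs are placed at random.

```r
values <- c(4, NA, 6, 5, NA)

plain_mean <- mean(values)                  # NA, because of the missing values
clean_mean <- mean(values, na.rm = TRUE)    # 5, NAs are dropped first
n_missing <- sum(is.na(values))             # 2 missing values
```

summary() reports the NAs separately, which is why it still shows a mean in the output above.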
So far we have been working with hypothetical values and objects, but next week we will be talking more about data, how to clean and use data, and what we can do with it!
For a much more detailed intro to R, check out: https://cran.r-project.org/doc/manuals/R-intro.pdf
The ? help operator is incredibly useful for checking the syntax of common R functions: what goes into the function? How can we use the function? Often there are examples at the end of the help page showing how the function is used.
And the best way to become more familiar with R is to play around with it! You can try using some of the datasets included with R, and try working with the data. mtcars is a great beginner dataset to start with.
data <- mtcars
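As a starting point, the same first steps covered above can be applied to mtcars. This is only a sketch of one possible first look; cars_data is a hypothetical name used here to avoid overwriting data.

```r
cars_data <- mtcars

dim(cars_data)            # 32 rows and 11 columns
head(cars_data, 3)        # first three cars
colnames(cars_data)       # names of the variables
summary(cars_data$mpg)    # range and mean of miles per gallon
cyl_counts <- table(cars_data$cyl)   # 11 four-cylinder, 7 six-cylinder, 14 eight-cylinder cars
cyl_counts
```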