All material should be available on Sakai (after class), and if you have more questions, feel free to reach out and ask for help. Trial and error really helps: work with your data, refer to your previous notes, and apply the concepts from class to your own data. That will really help you master the material faster.
A quick recap from last week: assigning and saving values, and data types. For example, a vector that mixes numbers and letters gets coerced to character (a minimal reconstruction of the original chunk, assuming a vector named x):
x <- c("1", "a", "2", "b")
x
## [1] "1" "a" "2" "b"
str(x)
## chr [1:4] "1" "a" "2" "b"
On the internet:
- https://stackoverflow.com: someone has probably asked your question before
Last week we opened data by using the point-and-click shortcut to load the data into R:
File > Import Dataset > From text (base)
(see image below if these instructions are unclear), and then, from the pop-up, locate your downloaded csv file. Today, I'll show you how to load and work with data in-line in the code.
By the end of lab, you should have…
Most of the time, we collect data through some external source. That means our data is stored in a specific file format elsewhere, and we want to load it into R in order to work with it there. Before loading the data we'll be using for this exercise, let's run through some of the common file types you'll encounter and the commands you will use to load these files into R.
Sometimes there is no choice in the format the data comes in; other times, data is available for download in several formats. Let us compare the CCES 2020 and CCES 2016 data as an example.
CCES 2020 Data on Harvard Dataverse: available in both .csv and .dta formats. Typically, researchers will pick the format that suits their statistical program best, but I will show you how to load both forms into R.
CCES 2016 Data on Harvard Dataverse: however, the CCES 2016 data only exists in .tab format.
If you aren't sure, you can try googling the format.
In summary, there are several ‘read’ functions that load data into R:
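For reference, here are a few you will see often (a sketch with placeholder file names; read.dta() and read.spss() are covered just below):
read.csv("mydata.csv")    # comma-separated files
read.delim("mydata.tab")  # tab-separated files
read.dta("mydata.dta")    # Stata files
read.spss("mydata.sav")   # SPSS files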
BUT, to use the read.spss() and read.dta() functions, we will need a package to read the data.
R packages are extensions to R; they usually contain code and data that can be installed by users. Think of Chrome extensions: they are applications written by developers to make your life easier, and some of the things Chrome extensions do are just shortcuts to bookmark pages or open external applications.
One of the first ways you will interact with packages is using a package to load data, as seen below. However, packages are also used to run regressions and plot graphs, as we will explore later.
In R, you use packages by first installing them and then loading them. I will demonstrate this using the foreign package, which allows you to read data stored by programs like SPSS and Stata.
install.packages("foreign")
where foreign is the name of your package. You then load it with
library(foreign)
Note that install.packages() requires the package name in "" quotation marks, while library() does not.
## Warning: package 'foreign' was built under R version 4.1.2
With that, let us load the two datasets we will be working with today. Go to the following websites and download the fatal encounters data in .csv format and the CCES data (I would recommend using the Stata format and then loading the .dta file in), and load them into R.
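For instance, the loading step might look like this (a sketch: the file names fatal.csv and cces2016.dta are hypothetical stand-ins for whatever your downloads are called, with the foreign package already loaded):
fatal <- read.csv("fatal.csv")     # fatal encounters data
cces16 <- read.dta("cces2016.dta") # CCES 2016 data, via the foreign package
# if read.dta() complains about the Stata file version, haven::read_dta() is an alternative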
Today, the research question we are interested in is: are attitudes about law enforcement affected by police violence against civilians?
Let's brainstorm:
- What is the independent variable?
- What is the dependent variable?
- What could be the unit of analysis?
[Collect answers from class]
[Break-out activity: half the class takes the independent variable, half the dependent]
How would you measure either of these? Take 10 mins, find a partner, and go over the codebook of the variables together.
A useful strategy when thinking about manipulating data is to identify what our absolutely ideal dataset would look like once we are finished working. I do this in nearly every project myself. What would it look like here? Remember, we are trying to determine whether attitudes about law enforcement are affected by police violence against civilians.
A few things we want from an ideal dataset:
- The data should be stored in a single dataframe
- The main identifier should be the unit of analysis
- It should only have the variables we are interested in, so as to keep a tidy and clean dataset. CCES16 has 563 (!) variables and fatalencounters has 22. We can get rid of all these extraneous variables to have a tidier/cleaner dataframe for analysis.
What are the variables we are interested in keeping in the dataframes? [[Breakout activity 2]]
The next step in learning how to create one unified dataset is to first select the variables we are interested in from each of the datasets (fatal and cces2016). This brings us to the next part of the class: subsetting.
One of the most common things you will do in R is subset data. Subsetting means keeping only part of some larger dataframe based on a set of criteria we give R. Put another way, subsetting is one way to manipulate a dataframe to include only those observations you need for analysis.
When subsetting there are generally two approaches: selecting by location (row/column position) with square brackets, and selecting by criteria with the subset() command. Let's start with brackets. Say we want rows 1 through 10 and columns 1 through 8 of cces16:
#rows 1:10, columns 1:8 only#
cces16[c(1:10), c(1:8)]
## V101 commonweight_vv commonweight_vv_post commonweight_vv_lgbt
## 1 222168628 1.3442142 1.0422667 1.1953457
## 2 273691199 1.1823518 1.2801033 1.2805647
## 3 284214415 0.2173962 NA 0.2489303
## 4 287557695 0.5323201 NA 0.9287502
## 5 290387662 1.2611058 1.0925191 1.1618944
## 6 290932100 0.5306634 0.5429082 0.5196270
## 7 292860642 1.4463003 1.3823826 1.3733937
## 8 295367942 1.6942828 NA NA
## 9 295717127 2.4004763 2.0036886 2.0924199
## 10 295859014 0.8881277 0.7797803 0.8284138
## commonweight commonweight_post tookpost lookupzip
## 1 0.9449668 0.7304500 Yes 03442
## 2 0.7672564 0.8928381 Yes 70118
## 3 0.0906038 0.0000000 No 63851
## 4 0.1425828 0.0000000 No 36703
## 5 1.2064976 1.0190072 Yes 80526
## 6 0.8877278 0.8722550 Yes 35816
## 7 1.1120297 1.0095162 Yes 75402
## 8 2.2875483 0.0000000 No 16833
## 9 2.4149777 1.8877565 Yes 30711
## 10 0.7974210 0.6824087 Yes 17522
Now we want to keep rows 1:10 and the 8th through 13th columns (variables) in our Fatal Encounters dataframe. We would do the following:
#rows 1:10, 8th through 13th column only#
fatal[c(1:10), c(8:13)]
## Imputation.probability date address
## 1 0.986065754 1/12/00 Lexington Avenue and East First Street
## 2 0.98614316 1/21/00 Upper Sumner Hill Road
## 3 Not imputed 1/23/00 2104-1962 MA-138
## 4 Not imputed 1/23/00 2104-1962 MA-138
## 5 Not imputed 1/28/00 270 Valley St,
## 6 Not imputed 1/28/00 Bayview Avenue and Arlington Avenue
## 7 0.992780017 2/11/00 528 Mill Rd
## 8 Not imputed 3/16/00 Mantua Grove Road and Forest Parkway
## 9 Not imputed 3/29/00 West Street and Pine Street
## 10 0.962815715 4/5/00 779-625 Flaghole Rd
## City State ZIP
## 1 Clifton NJ 7011
## 2 Sumner ME 4292
## 3 Raynham MA 2767
## 4 Raynham MA 2767
## 5 Providence RI 2909
## 6 Jersey City NJ 7305
## 7 Irvington NJ 7111
## 8 West Deptford NJ 8066
## 9 Abington MA 2351
## 10 Andover NH 3216
The second approach is the subset() command, which is the most common way of subsetting data. In particular, we use this command to keep some subset of the data based on a set of criteria, rather than the location (row/column) of the data. For example, we can use the subset command to keep only the CCES survey respondents in the state of North Carolina. We do so using the following syntax:
str(cces16$inputstate) # states are stored as labels (an ordered factor), not plain numbers
## Ord.factor w/ 76 levels "Alabama"<"Alaska"<..: 30 19 26 1 6 1 44 39 11 39 ...
## - attr(*, "comment")= chr "State"
NC <- subset(cces16, inputstate == "North Carolina")
## NC <- subset(cces16, inputstate == 37)
# subset if inputstate is 37 (North Carolina)
If we want to subset based on MULTIPLE conditions, we can do that easily using & (and) or | (or) in our subset command. For example, say we want to include only CCES survey respondents from North Carolina and Georgia; we use the OR (|) to subset based on those criteria.
ganc <- subset(cces16, inputstate == "North Carolina" | inputstate == "Georgia") # keep respondents whose inputstate is North Carolina OR Georgia
But what if we want to subset to only look at, say, individuals who are in a union (union) and self-report having voted in 2012 (CC16_316)? We need to consult our codebook and be sure to use the & (and) in the subset command.
unionvote <- subset(cces16, CC16_316 == 4 & union == 1)
Ok. Now we have some rough sense of how the subset command works in R. With that in mind, let’s return to the research question at hand, which is attempting to assess whether ratings of the police become less favorable after police killings of civilians.
Recall that we want the following variables from cces2016:
- Respondent ID (V101)
- Respondent Race (race)
- Respondent Zipcode (lookupzip)
- Respondent Grade for Local Police (CC16_427_b)
#this code keeps all rows and ONLY the column names I call here
police_response <- cces16[, c("CC16_427_b", "lookupzip")]
cces16merge <- cces16[, c("V101", "race", "lookupzip", "CC16_427_b", "inputstate")]
Now we need to do the same with the fatal encounters dataset, but we need two things: the number of police killings per zipcode, and over a certain period of time. Note in fatal.csv that the dataset starts in 2000 and ends in 2021, while the survey was fielded only in 2016 (what are the exact dates? Check pg 7 of the codebook).
How do we do this? We need two variables from fatal:
- Zipcode of killing (ZIP)
- Date of killing (date)
In addition, let’s only limit this to killings that occurred in the year before respondents took their survey. Because survey responses began being collected as early as September 28th of 2016, we will subset our data to only include police killings between September 27th of 2015 and September 27th of 2016.
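A first attempt might look like this (a reconstruction of the step; note that it compares against the raw date column):
femerge <- subset(fatal, date > "2015-09-27" & date < "2016-09-27")
nrow(femerge) # how many killings fall in that window?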
Now…look…there are 0 observations in femerge!! What happened???
The problem lies with the structure of the date variable in fatalencounters. Using the class() command, we can see what type of data the date variable is stored as.
summary(fatal$date)
## Length Class Mode
## 30021 character character
class(fatal$date)
## [1] "character"
So instead, we need to convert this to a date object. To do so, we'll need to use the as.Date() command. The syntax of as.Date() is as follows (a sketch with placeholder names):
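as.Date(variable, format = "%y/%m/%d") # placeholder: 'variable' is the character column to convert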
Here %y stands for year, %m for month, and %d for day, and the / marks represent the dividing characters. Looking at the date variable in fatalencounters, it seems the format is actually month/day/year, so we can specify that below.
fatal$dateformatted <- as.Date(fatal$date, format = "%m/%d/%y")
It is important to create a new variable rather than overwrite the old one, because it allows us to compare and check that the conversion to a date object worked. If we had overwritten the old variable, we couldn't validate this.
Let’s compare the results:
head(fatal$dateformatted, 10)
## [1] "2000-01-12" "2000-01-21" "2000-01-23" "2000-01-23" "2000-01-28"
## [6] "2000-01-28" "2000-02-11" "2000-03-16" "2000-03-29" "2000-04-05"
head(fatal$date, 10)
## [1] "1/12/00" "1/21/00" "1/23/00" "1/23/00" "1/28/00" "1/28/00" "2/11/00"
## [8] "3/16/00" "3/29/00" "4/5/00"
See! The conversion worked! Now we can run that same code from earlier to subset our data.
femerge <- subset(fatal, dateformatted > "2015-09-27" & dateformatted < "2016-09-27")
But things are a bit more complex here. Remember, we want to ultimately merge the fatalencounters dataset with the cces data, so that we can see if residing in a zipcode with police violence affects attitudes towards police. Right now this dataset has the unit of analysis of individual police killing of a civilian. If we ultimately merge this dataset with the cces data, we will end up duplicating many CCES responses, because in any zipcode with multiple police killings of civilians, we will end up merging EACH of them with EACH survey respondent. To put it in more concrete terms, if there are 8 police killings in the 27701 zipcode and 1 survey respondent in the 27701 zipcode, a simple merge will duplicate that 1 survey respondent 8 times in our dataframe.
So how should we aggregate the data? [class discussion] - list steps below
So how do we do this in R?
To do this, we will use the dplyr/tidyverse package.
#install.packages("dplyr") # run once if you have not installed dplyr yet
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.2
dplyr offers several functions to manipulate data; use ? to check what these functions do, e.g. ?select.
Applying this to our dataset, we want to group the fatal encounters data by zipcode and then count the number of killings in each zipcode.
The syntax of these aggregations is as follows:
newdf <- olddf %>% group_by(idvar) %>% summarise(obs = n(), newvar = sum(oldvar))
femergecollapsed <- femerge %>%
group_by(ZIP) %>%
summarise(numberofkillings = n())
head(femergecollapsed, 10)
## # A tibble: 10 × 2
## ZIP numberofkillings
## <int> <int>
## 1 1089 1
## 2 1420 1
## 3 1542 1
## 4 1841 1
## 5 1851 1
## 6 1904 1
## 7 1915 1
## 8 2121 1
## 9 2149 1
## 10 2151 1
Now we have two dataframes the way we want them. How do we join the information from the cces dataframe with the information from the fatalencounters dataframe? In R, we can use the merge() command.
In R you use the merge() function to combine data frames. This function tries to identify columns or rows that are common between the two different data frames using some common identifiers.
There are 4 different kinds of merges (inner, left outer, right outer, and full outer); the type of merge determines which rows are kept in the merged data.
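As a quick reference, the four kinds map onto merge()'s arguments like this (a sketch with placeholder dataframes x and y sharing an id column):
merge(x, y, by = "id")               # inner join: keep only ids found in both
merge(x, y, by = "id", all.x = TRUE) # left outer: keep every row of x
merge(x, y, by = "id", all.y = TRUE) # right outer: keep every row of y
merge(x, y, by = "id", all = TRUE)   # full outer: keep every row of both
Before choosing, it helps to inspect the identifiers in each dataframe. The output below was plausibly produced by commands along these lines (a reconstruction; the original chunk was not preserved):
summary(cces16merge$lookupzip)        # survey zipcodes: stored as character
summary(fatal$ZIP)                    # killings zipcodes: stored as numbers
length(unique(cces16merge$lookupzip)) # how many distinct zipcodes in each?
length(unique(fatal$ZIP))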
## Length Class Mode
## 64600 AsIs character
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1013 33157 60712 58468 85044 99921
## [1] 15724
## [1] 10806
In this case we need to do a left outer join, because there are more zipcodes in the survey data (cces16) than the police killings data (fatal). That means we want to keep all the data on our survey respondents ratings of the police, but only need to keep the data from fatal encounters that merges with our survey responses. That’s because we don’t need the data on every police killing of a civilian, we just want to compare how survey respondents living in zipcodes with a police killing rate their local police compared to respondents living in a zipcode without a recent police killing.
The syntax of a merge is as follows:
merge(cces, fatalencounters, by.x = c("lookupzip"), by.y = c("ZIP"), all.x = TRUE)
The key to merging is to merge on some common identifying information across datasets. In this case that is zipcode! So in the by.x and by.y commands, we input the zipcode variable from both datasets.
mergeddf <- merge(cces16merge, femergecollapsed, by.x = c("lookupzip"), by.y = c("ZIP"), all.x = TRUE)
head(mergeddf, 10)
## lookupzip V101 race CC16_427_b inputstate
## 1 01001 303171628 White D - Below Average Massachusetts
## 2 01001 303338685 White <NA> Massachusetts
## 3 01001 303195238 White A - Excellent Massachusetts
## 4 01001 303517334 White C - Average Massachusetts
## 5 01001 303566841 Hispanic <NA> Massachusetts
## 6 01001 304195469 White B - Above Average Massachusetts
## 7 01002 301420185 White C - Average Massachusetts
## 8 01002 302478815 White A - Excellent Massachusetts
## 9 01002 302521827 White B - Above Average Massachusetts
## 10 01002 303215426 White A - Excellent Massachusetts
## numberofkillings
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## 7 NA
## 8 NA
## 9 NA
## 10 NA
Success!!! Let's check out the data now using table(), which will tabulate rows and aggregate them for us. While we're there, let's use is.na() or summary() to find out whether we have missing data.
table(mergeddf$numberofkillings)
##
## 1 2 3 4
## 6228 1033 215 60
summary(mergeddf$numberofkillings)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 1.00 1.00 1.22 1.00 4.00 57064
sum(is.na(mergeddf$numberofkillings))
## [1] 57064
Whoa! We have 57,064 missing values? Why?
The reason is that our fatalencounters dataset records every single police killing of a civilian that occurred (ostensibly). But it does NOT present data on the number of killings in each zipcode in the US. If we really believe fatalencounters, then we should treat all the missing/NA values as 0s, meaning no police killings of civilians in the last 12 months.
To do so, we will use the square brackets [ ] (indexing) and the is.na() command:
mergeddf$numberofkillings[is.na(mergeddf$numberofkillings)] <- 0
mergeddf2 <- mergeddf %>%
mutate(killings = ifelse(is.na(numberofkillings) == TRUE, 0, numberofkillings)) ## another way to recode numberofkillings NA's into 0, into the new variable killings
summary(mergeddf$numberofkillings)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1421 0.0000 4.0000
sum(is.na(mergeddf$numberofkillings))
## [1] 0
In addition, we also might want to compare respondents living in zipcodes with no police killings to respondents living in zipcodes with any number (1 or more). This requires us to use the same square brackets [ ] (indexing) to create a new variable. The syntax to do this is:
mergeddf$binarykilling[mergeddf$numberofkillings == 0] <- 0
mergeddf$binarykilling[mergeddf$numberofkillings > 0] <- 1
Awesome! But we have one final problem. Look at the dependent variable, the grade for local police. Its class is not numeric, so we can't run any statistical analysis on it (and it becomes more difficult to plot). Let's convert it to numeric. It is a relatively simple fix.
head(mergeddf$CC16_427_b)
## [1] D - Below Average <NA> A - Excellent C - Average
## [5] <NA> B - Above Average
## 7 Levels: A - Excellent < B - Above Average < ... < Not Asked
table(mergeddf$CC16_427_b)
##
## A - Excellent B - Above Average C - Average D - Below Average
## 7772 18170 21320 3881
## F - Poor Skipped Not Asked
## 1627 0 0
class(mergeddf$CC16_427_b)
## [1] "ordered" "factor"
mergeddf$policegrade<-as.numeric(mergeddf$CC16_427_b)
table(mergeddf$policegrade)
##
## 1 2 3 4 5
## 7772 18170 21320 3881 1627
Note that after the conversion, 1 corresponds to "A - Excellent" and 5 to "F - Poor", so higher values of policegrade mean worse grades for the police. Though we'll cover this more in Labs four and five, considering we did all this work, let's use ggplot and take a look!
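For instance, a quick plot comparing average police grades across zipcodes with and without a recent killing (a minimal sketch, assuming the ggplot2 package is installed; the variable choices here are just one option):
library(ggplot2)
ggplot(subset(mergeddf, !is.na(policegrade)),
       aes(x = factor(binarykilling), y = policegrade)) +
  stat_summary(fun = mean, geom = "bar") +   # bar height = mean police grade per group
  labs(x = "Police killing in zipcode, past year (0 = no, 1 = yes)",
       y = "Average police grade (1 = A/Excellent, 5 = F/Poor)")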
To summarize, we learned about the following functions/commands in R:
read.csv
read_dta
summary()
names()
head()
subset()
[r, c]
class()
merge()
group_by()
summarise()
table()
is.na()
as.numeric()
This is obviously a lot to remember and a lot to make sense of in a single lecture. To that end, this video and rmd should serve as a resource for you to use in the future when you are working on your final project. Other key resources for you to use, which I draw on heavily for this guide: