Thanks to Christine Albano, who pulled this module together!

Load script for module #2.2

  1. Click here to download the script! Save the script to a convenient folder on your laptop.

  2. Load your script in RStudio.

In this tutorial we introduce the ‘Tidyverse’ set of packages, which were designed to facilitate everyday data management and visualization tasks in R using a consistent data structure and syntax. Much of the content in this tutorial is based on the online resource R for Data Science by Garrett Grolemund and Hadley Wickham.

The tidyverse includes many packages. For now, we will focus on the first five listed below, which are commonly used for data wrangling:

  • dplyr, for data manipulation

  • tidyr, for data tidying

  • readr, for data import

  • tibble, for tibbles, a re-imagining of data frames

  • magrittr, for implementing the ‘pipe’ operator

  • ggplot2, for data visualisation (the subject of the next module)

  • purrr, for functional programming

  • stringr, for strings

  • forcats, for factors

Learning goals

  • Use Tidyverse grammar (e.g., piping) and data structures to perform basic data manipulation tasks

  • Use tidyr and dplyr data transformation ‘verbs’ to wrangle data

  • Work with and parse dates using lubridate

# install.packages("tidyverse")
library(tidyverse)

Using the pipe operator %>% in R

The ‘tidyverse’ set of packages takes advantage of the pipe operator %>%, which provides a clean and intuitive way to structure code and perform sequential operations in R.

Key advantages include:

  • code is more readable and intuitive: it reads left to right, rather than inside out as is the case for nested function calls

  • multiple operations can be performed without creating a bunch of intermediate (temporary) datasets

This operator comes from the magrittr package, which is installed along with the tidyverse. The RStudio keyboard shortcut for the pipe operator is Ctrl-Shift-M (Cmd-Shift-M on a Mac; ‘M’ is for Magritte). When reading code out loud, use ‘then’ for the pipe. For example, the command here:


x %>% log() %>% round(digits=2)

can be interpreted as follows:

Take “x”, THEN take its natural logarithm, THEN round the resulting value to 2 decimal places

The structure is simple. Start with the object you want to manipulate, and apply actions (e.g., functions) to that object in the order in which you want to apply them.

Here is a quick example.

####
####  Using the pipe operator %>% (ctrl-shift-m)
####

# start with a simple example
x <- 3

# calculate the log of x
log(x) # form f(x) is equivalent to
## [1] 1.098612
x %>% log() # form x %>% f
## [1] 1.098612
# example of multiple steps in pipe
round(log(x), digits=2) # form g(f(x)) is equivalent to
## [1] 1.1
x %>% log() %>% round(digits=2) # form x %>% f %>%  g
## [1] 1.1
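
The advantages listed above can be seen side by side in a minimal sketch. The vector precip_in and its values are made up for illustration and are not part of the tutorial's dataset.

```r
library(magrittr)  # provides %>%; the pipe is also available after library(tidyverse)

# hypothetical example values (assumption, not from the tutorial data)
precip_in <- c(0.16, 0.04, 0.07, 0.22)

# 1. nested calls: must be read from the inside out
nested <- round(mean(sqrt(precip_in)), digits = 2)

# 2. intermediate objects: readable, but leaves temporary objects behind
step1 <- sqrt(precip_in)
step2 <- mean(step1)
stepwise <- round(step2, digits = 2)

# 3. piped: read left to right as "take precip_in, THEN sqrt, THEN mean, THEN round"
piped <- precip_in %>% sqrt() %>% mean() %>% round(digits = 2)

# all three approaches give the same answer
identical(nested, piped)
identical(stepwise, piped)
```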

Importing data into R using readr

Here, we import meteorological data from the Hungry Horse and Polson Kerr dams in Montana. The readr package imports data as a special type of dataframe called a tibble. Tibbles have several advantages over ‘regular’ dataframes, including nicely formatted printing, more useful defaults, and early flagging of potential issues in your dataset, which can save you a lot of time later! readr also tends to import files faster than base R functions such as read.csv. That said, tibbles and regular dataframes are readily converted from one type to the other if the need arises.
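
As a small sketch of the conversion mentioned above, the following uses as_tibble() from the tibble package and base R's as.data.frame(). The toy dataframe and its values are made up for illustration.

```r
library(tibble)

# a regular dataframe with made-up values (assumption, not the tutorial data)
df <- data.frame(station = c("HH", "PK"), tmax_f = c(26, 27))

tb  <- as_tibble(df)        # dataframe -> tibble
df2 <- as.data.frame(tb)    # tibble -> dataframe

class(tb)       # a tibble carries extra classes on top of "data.frame"
class(df2)      # back to a plain data.frame
is_tibble(tb)
is_tibble(df2)
```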

Download an example dataset

The next few examples use meteorological data from Hungry Horse (HH) and Polson Kerr (PK) dams. You can download this data set by clicking here.

Import as a tibble

####
####  Import data as a Tibble dataframe and take a quick glance
####

# import meteorological data from Hungry Horse (HH) and Polson Kerr (PK) dams as tibble dataframe using readr 
clim_data <- read_csv("MTMetStations.csv")

# display tibble - note the nice formatting and variable info; only the first 10 rows are printed, unlike a dataframe imported with read.csv
clim_data
## # A tibble: 1,734 x 7
##    Date      PK.TMaxF PK.TMinF PK.PrcpIN HH.TMaxF HH.TMinF HH.PrcpIN
##    <chr>        <dbl>    <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
##  1 1/1/2013        27       19      0          26       22      0   
##  2 1/2/2013        27       22      0          27       23      0   
##  3 1/3/2013        23       14      0          25       21      0   
##  4 1/4/2013        25       19      0          22       13      0   
##  5 1/5/2013        29       20      0          24       13      0   
##  6 1/6/2013        36       25      0          30       20      0.1 
##  7 1/7/2013        35       30      0.16       35       22      0.23
##  8 1/8/2013        39       32      0.04       36       25      0.32
##  9 1/9/2013        44       31      0.07       39       30      0.1 
## 10 1/10/2013       39       27      0.22       46       32      0.1 
## # ... with 1,724 more rows
# display the last few lines of the data frame
tail(clim_data)
## # A tibble: 6 x 7
##   Date      PK.TMaxF PK.TMinF PK.PrcpIN HH.TMaxF HH.TMinF HH.PrcpIN
##   <chr>        <dbl>    <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 9/25/2017       59       47     0.001       56       39         0
## 2 9/26/2017       66       47     0           58       43         0
## 3 9/27/2017       66       41     0           68       43         0
## 4 9/28/2017       67       39     0           66       43         0
## 5 9/29/2017       70       40     0           71       38         0
## 6 9/30/2017       62       47     0.06        71       38         0

Making data tidy using tidyr

All of the tidyverse packages are designed to work with data in ‘tidy’ format.

This means:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

This is what it looks like:
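
As a rough illustration of the three rules, here is a sketch that builds the same made-up information twice with tribble(): once untidy and once tidy. The column names and values are hypothetical.

```r
library(tibble)

# untidy: the values of the 'year' variable are stored in the column names
untidy <- tribble(
  ~station, ~`2016`, ~`2017`,
  "HH",          40,      25,
  "PK",          18,      12
)

# tidy: one column per variable (station, year, precip_in),
# one row per observation (one station in one year), one value per cell
tidy <- tribble(
  ~station, ~year, ~precip_in,
  "HH",      2016,         40,
  "HH",      2017,         25,
  "PK",      2016,         18,
  "PK",      2017,         12
)

untidy
tidy
```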

The tidyr package has several functions to help you get your data into this format.

Use “pivot_longer()” to gather the names of multiple columns into a single ‘names’ column and their values into a single ‘values’ column.

Use “pivot_wider()” to do the reverse: distribute the entries of a ‘names’ column into separate columns, filled with the corresponding entries from a ‘values’ column.

For a nice intro to ‘pivot_wider’ and ‘pivot_longer’, see this link

Use “separate()” to split a single column into two (or more) columns. “unite()” does the opposite.
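
A minimal sketch of separate() and unite() on a made-up two-row tibble follows. The object names toy, parts, and rejoined are hypothetical, and the separator is given explicitly here for clarity.

```r
library(tidyr)
library(tibble)

# made-up values for illustration
toy <- tibble(id = c("PK.TMaxF", "HH.PrcpIN"), value = c(27, 0.1))

# split 'id' at the literal dot into two new columns
parts <- toy %>%
  separate(col = id, into = c("Station", "climvar"), sep = "\\.")

# glue the two columns back together with a dot
rejoined <- parts %>%
  unite(col = "id", Station, climvar, sep = ".")

parts
identical(rejoined$id, toy$id)
```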

####
####  Use Tidyr verbs to make data 'tidy'
####

# look at clim_data -- is it in tidy format? What do we need to do to get it there?
head(clim_data)
## # A tibble: 6 x 7
##   Date     PK.TMaxF PK.TMinF PK.PrcpIN HH.TMaxF HH.TMinF HH.PrcpIN
##   <chr>       <dbl>    <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 1/1/2013       27       19         0       26       22       0  
## 2 1/2/2013       27       22         0       27       23       0  
## 3 1/3/2013       23       14         0       25       21       0  
## 4 1/4/2013       25       19         0       22       13       0  
## 5 1/5/2013       29       20         0       24       13       0  
## 6 1/6/2013       36       25         0       30       20       0.1
# gather the column names into a new column called 'climvar_station', and all of the numeric precip and temp values into a column called 'value'.
# cols = !Date selects every column except Date, so the Date column is not gathered.
clim_vars_longer <- clim_data %>% pivot_longer( 
                           cols = !Date,
                           names_to = "climvar_station",
                           values_to = "value"
                    )

clim_vars_longer
## # A tibble: 10,404 x 3
##    Date     climvar_station value
##    <chr>    <chr>           <dbl>
##  1 1/1/2013 PK.TMaxF           27
##  2 1/1/2013 PK.TMinF           19
##  3 1/1/2013 PK.PrcpIN           0
##  4 1/1/2013 HH.TMaxF           26
##  5 1/1/2013 HH.TMinF           22
##  6 1/1/2013 HH.PrcpIN           0
##  7 1/2/2013 PK.TMaxF           27
##  8 1/2/2013 PK.TMinF           22
##  9 1/2/2013 PK.PrcpIN           0
## 10 1/2/2013 HH.TMaxF           27
## # ... with 10,394 more rows
# separate the climvar_station column into two separate columns that identify the climate variable and the station
clim_vars_separate <- clim_vars_longer %>% separate(
                           col = climvar_station, 
                           into = c("Station","climvar")
                      )

clim_vars_separate
## # A tibble: 10,404 x 4
##    Date     Station climvar value
##    <chr>    <chr>   <chr>   <dbl>
##  1 1/1/2013 PK      TMaxF      27
##  2 1/1/2013 PK      TMinF      19
##  3 1/1/2013 PK      PrcpIN      0
##  4 1/1/2013 HH      TMaxF      26
##  5 1/1/2013 HH      TMinF      22
##  6 1/1/2013 HH      PrcpIN      0
##  7 1/2/2013 PK      TMaxF      27
##  8 1/2/2013 PK      TMinF      22
##  9 1/2/2013 PK      PrcpIN      0
## 10 1/2/2013 HH      TMaxF      27
## # ... with 10,394 more rows
# pivot_wider distributes the climvar column into separate columns, filled with the data values from the 'value' column
tidy_clim_data <- clim_vars_separate %>% pivot_wider( 
                        names_from = climvar, 
                        values_from = value
                  )

tidy_clim_data
## # A tibble: 3,468 x 5
##    Date     Station TMaxF TMinF PrcpIN
##    <chr>    <chr>   <dbl> <dbl>  <dbl>
##  1 1/1/2013 PK         27    19      0
##  2 1/1/2013 HH         26    22      0
##  3 1/2/2013 PK         27    22      0
##  4 1/2/2013 HH         27    23      0
##  5 1/3/2013 PK         23    14      0
##  6 1/3/2013 HH         25    21      0
##  7 1/4/2013 PK         25    19      0
##  8 1/4/2013 HH         22    13      0
##  9 1/5/2013 PK         29    20      0
## 10 1/5/2013 HH         24    13      0
## # ... with 3,458 more rows
# repeat above as single pipe series without creation of intermediate datasets
  
tidy_clim_data <- clim_data %>% 
  pivot_longer(cols = !Date,
               names_to = "climvar_station",
               values_to = "value") %>% 
  separate(col = climvar_station, 
           into = c("Station","climvar")) %>% 
  pivot_wider(names_from = climvar, 
              values_from = value)
  
tidy_clim_data
## # A tibble: 3,468 x 5
##    Date     Station TMaxF TMinF PrcpIN
##    <chr>    <chr>   <dbl> <dbl>  <dbl>
##  1 1/1/2013 PK         27    19      0
##  2 1/1/2013 HH         26    22      0
##  3 1/2/2013 PK         27    22      0
##  4 1/2/2013 HH         27    23      0
##  5 1/3/2013 PK         23    14      0
##  6 1/3/2013 HH         25    21      0
##  7 1/4/2013 PK         25    19      0
##  8 1/4/2013 HH         22    13      0
##  9 1/5/2013 PK         29    20      0
## 10 1/5/2013 HH         24    13      0
## # ... with 3,458 more rows
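
As an aside, the three steps above can sometimes be collapsed into a single pivot_longer() call. The sketch below assumes column names of the form Station.variable and uses a made-up two-row stand-in for clim_data. The special name ".value" tells pivot_longer() to turn that part of each column name into its own output column.

```r
library(tidyr)
library(tibble)

# made-up stand-in for clim_data (assumption: same naming pattern, fewer columns)
toy <- tibble(
  Date     = c("1/1/2013", "1/2/2013"),
  PK.TMaxF = c(27, 27), PK.TMinF = c(19, 22),
  HH.TMaxF = c(26, 27), HH.TMinF = c(22, 23)
)

tidy_toy <- toy %>%
  pivot_longer(
    cols      = !Date,
    names_to  = c("Station", ".value"),  # first part -> Station, second part -> column names
    names_sep = "\\."                    # split the old column names at the literal dot
  )

tidy_toy
```

Whether this is clearer than the longer, separate, wider sequence is a matter of taste; the step-by-step version makes each reshaping operation visible.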

Using dplyr – the data wrangling workhorse

dplyr is by far one of the most useful packages in all of the tidyverse as it allows you to quickly and easily summarize and manipulate your data once you have it in tidy form. The basic dplyr verbs include:

  • grouping data with group_by()

  • filtering rows with filter()

  • creating new variables with mutate()

  • summarizing with summarize()

  • selecting columns with select()

  • sorting rows by column values with arrange()
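
The tutorial code in this section focuses on group_by(), summarize(), and mutate(). As a minimal sketch of the remaining verbs, here are filter(), select(), and arrange() on a made-up tibble; the values and the name warm_days are hypothetical.

```r
library(dplyr)

# made-up values for illustration
toy <- tibble(
  Station = c("PK", "HH", "PK", "HH"),
  TMaxF   = c(27, 26, 44, 39),
  TMinF   = c(19, 22, 31, 30),
  PrcpIN  = c(0, 0, 0.07, 0.1)
)

warm_days <- toy %>%
  filter(TMaxF > 30) %>%        # keep rows where the max temperature exceeds 30 F
  select(Station, TMaxF) %>%    # keep only these two columns
  arrange(desc(TMaxF))          # sort rows from warmest to coolest

warm_days
```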

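You can add the variants _all (affects all variables), _at (affects selected variables), and _if (affects variables that meet a criterion) to most of these verbs, e.g., summarize_if() or mutate_at(). In dplyr 1.0.0 and later these variants still work but are superseded by across().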

####
####  Use dplyr verbs to wrangle data
####

# example of simple data selection and summary using group_by, summarize, and mutate verbs

# take tidy_clim_data, then
# group data by station, then 
# calculate summaries and put in columns with names mean.precip.in, mean.TMax.F, and mean.TMin.F, then 
# transform to metric and put in new columns mean.precip.mm, mean.TMax.C, and mean.TMin.C

station_mean1 <- tidy_clim_data %>%
  group_by(Station) %>% 
  summarize(
    mean.precip.in = mean(PrcpIN, na.rm=TRUE),
    mean.TMax.F = mean(TMaxF, na.rm=TRUE),
    mean.TMin.F = mean(TMinF, na.rm=TRUE)) %>%
  mutate(
    mean.precip.mm = mean.precip.in * 25.4,
    mean.TMax.C = (mean.TMax.F - 32) * 5 / 9,
    mean.TMin.C = (mean.TMin.F - 32) * 5 / 9
  )
  
station_mean1
## # A tibble: 2 x 7
##   Station mean.precip.in mean.TMax.F mean.TMin.F mean.precip.mm mean.TMax.C
##   <chr>            <dbl>       <dbl>       <dbl>          <dbl>       <dbl>
## 1 HH              0.0972        56.0        36.3           2.47        13.3
## 2 PK              0.0442        57.8        36.5           1.12        14.3
## # ... with 1 more variable: mean.TMin.C <dbl>
# using variants

# take tidy_clim_data, then
# group data by station, then 
# calculate summary (mean of all non-NA values) for numeric data only, then 
# transform temp data (.) from F to C, then
# transform precip data (.) from in to mm


station_mean2 <- tidy_clim_data %>%
  group_by(Station) %>% 
  summarize_if(is.numeric, mean, na.rm=TRUE) %>%
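  # funs() is deprecated in current dplyr; a named list of formulas does the same job
  mutate_at(vars(TMaxF, TMinF), list(C = ~ (. - 32) * 5 / 9)) %>% 
  mutate_at(vars(PrcpIN), list(Prcp.mm = ~ . * 25.4))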

station_mean2
## # A tibble: 2 x 7
##   Station PrcpIN TMaxF TMinF TMaxF_C TMinF_C Prcp.mm
##   <chr>    <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>
## 1 HH      0.0972  56.0  36.3    13.3    2.36    2.47
## 2 PK      0.0442  57.8  36.5    14.3    2.48    1.12
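
For readers on dplyr 1.0.0 or later, the same kind of summary can be sketched with across(), which supersedes the _if and _at variants. The tibble below is a made-up stand-in for tidy_clim_data, and the name station_mean_across is hypothetical.

```r
library(dplyr)

# made-up stand-in for tidy_clim_data (assumption)
toy <- tibble(
  Station = c("HH", "HH", "PK", "PK"),
  TMaxF   = c(26, 30, 27, 35),
  TMinF   = c(22, 20, 19, 25),
  PrcpIN  = c(0, 0.1, 0, 0.16)
)

station_mean_across <- toy %>%
  group_by(Station) %>%
  # mean of every numeric column, ignoring missing values
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) %>%
  mutate(
    # convert both temperature columns from F to C, appending "_C" to the names
    across(c(TMaxF, TMinF), ~ (.x - 32) * 5 / 9, .names = "{.col}_C"),
    # convert precipitation from inches to mm
    Prcp.mm = PrcpIN * 25.4
  )

station_mean_across
```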

Working with Dates with lubridate - Date from character string

Dates can be tricky to work with, and we often want to summarize our data by year, season, month, or even day of the year. The lubridate package (also part of the tidyverse) is very useful for creating and parsing dates for this purpose. Date/time data often come as strings in a variety of orders and formats. lubridate works out the exact format automatically once you specify the order of the components. To do this, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:

#### 
####  Using lubridate to format and create date data types
####

library(lubridate)

date_string <- "2017-01-31"

# convert the date string into date format: identify the order in which year, month, and day appear in your dates,
# then arrange "y", "m", and "d" in the same order. That gives you the name of the lubridate function that will parse your date

date_dtformat <- ymd(date_string)

# note the different formats of the date_string and date_dtformat objects in the environment window.

# a variety of other formats/orders can also be accommodated. Note how each of these is reformatted to "2017-01-31". A time zone can be specified using tz=

mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"
ymd(20170131)
## [1] "2017-01-31"
ymd(20170131, tz = "UTC")
## [1] "2017-01-31 UTC"
# can also make a date from components. this is useful if you have columns for year, month, day in a dataframe
year<-2017
month<-1
day<-31
make_date(year, month, day)
## [1] "2017-01-31"
# times can be included as well. Note that unless otherwise specified, lubridate assumes UTC time

ymd_hms("2017-01-31 20:11:59")
## [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
## [1] "2017-01-31 08:01:00 UTC"
# we can also have R tell us the current time or date

now()
## [1] "2020-09-04 14:22:01 PDT"
now(tz = "UTC")
## [1] "2020-09-04 21:22:01 UTC"
today()
## [1] "2020-09-04"
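
The parsing functions also work on whole vectors of strings, and an entry that cannot be parsed becomes NA. The sketch below uses made-up strings; quiet = TRUE is used only to suppress the warning that lubridate would otherwise give about the failed entry.

```r
library(lubridate)

# made-up strings; the last one is not a valid month/day/year date
raw_dates <- c("1/31/2017", "12/5/2016", "13/45/2017")

parsed <- mdy(raw_dates, quiet = TRUE)

parsed
is.na(parsed)   # TRUE marks entries that failed to parse
```

Checking for NA values after parsing is a quick way to catch dates that were entered in an unexpected order.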

Working with Dates with lubridate - Parsing Dates

Once dates are in date format, we can easily pull out their individual components like this:

####
####  Parsing dates with lubridate
####

datetime <- ymd_hms("2016-07-08 12:34:56")

# year
year(datetime)
## [1] 2016
# month as numeric
month(datetime)
## [1] 7
# month as name
month(datetime, label = TRUE)
## [1] Jul
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
# day of month
mday(datetime)
## [1] 8
# day of year (julian day)
yday(datetime)
## [1] 190
# day of week
wday(datetime)
## [1] 6
wday(datetime, label = TRUE, abbr = FALSE)
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
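
Since summarizing by season is one of the stated motivations for parsing dates, here is a minimal sketch that derives a season label from month(). The three-month groupings (December to February as winter, and so on) are an assumption; adjust them to whatever definition suits your data.

```r
library(lubridate)
library(dplyr)

# made-up dates for illustration
dates <- ymd(c("2016-01-15", "2016-04-15", "2016-07-08", "2016-10-31"))

# assumed season definition based on the numeric month
season <- case_when(
  month(dates) %in% c(12, 1, 2) ~ "winter",
  month(dates) %in% 3:5         ~ "spring",
  month(dates) %in% 6:8         ~ "summer",
  TRUE                          ~ "fall"
)

season
```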

Working with Dates - Parsing Dates Example

We can use lubridate and dplyr to make new columns from a date for year, month, day of month, and day of year

First we convert the character string date into date format. Here we name the new column “Date”, so R will replace the character strings with date-formatted dates.

#### 
####  Using lubridate with dataframes and dplyr verbs
####

# going back to our tidy_clim_data dataset we see that the date column is formatted as character, not date
head(tidy_clim_data)
## # A tibble: 6 x 5
##   Date     Station PrcpIN TMaxF TMinF
##   <chr>    <chr>    <dbl> <dbl> <dbl>
## 1 1/1/2013 HH           0    26    22
## 2 1/1/2013 PK           0    27    19
## 3 1/1/2014 HH           0    38    27
## 4 1/1/2014 PK           0    35    24
## 5 1/1/2015 HH           0     7     0
## 6 1/1/2015 PK           0     7    -1
# change format of date column
tidy_clim_data <- tidy_clim_data %>% 
  mutate(Date = mdy(Date))
tidy_clim_data
## # A tibble: 3,468 x 5
##    Date       Station PrcpIN TMaxF TMinF
##    <date>     <chr>    <dbl> <dbl> <dbl>
##  1 2013-01-01 HH        0       26    22
##  2 2013-01-01 PK        0       27    19
##  3 2014-01-01 HH        0       38    27
##  4 2014-01-01 PK        0       35    24
##  5 2015-01-01 HH        0        7     0
##  6 2015-01-01 PK        0        7    -1
##  7 2016-01-01 HH        0       22    15
##  8 2016-01-01 PK        0       22    17
##  9 2017-01-01 HH        0.16    30    16
## 10 2017-01-01 PK        0.34    22     7
## # ... with 3,458 more rows

Now we can use mutate to create individual columns for the date components.

# parse date into year, month, day, and day of year columns
tidy_clim_data <- tidy_clim_data %>% mutate(
  Year = year(Date),
  Month = month(Date),
  Day = mday(Date),
  Yday = yday(Date))

tidy_clim_data
## # A tibble: 3,468 x 9
##    Date       Station PrcpIN TMaxF TMinF  Year Month   Day  Yday
##    <date>     <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
##  1 2013-01-01 HH        0       26    22  2013     1     1     1
##  2 2013-01-01 PK        0       27    19  2013     1     1     1
##  3 2014-01-01 HH        0       38    27  2014     1     1     1
##  4 2014-01-01 PK        0       35    24  2014     1     1     1
##  5 2015-01-01 HH        0        7     0  2015     1     1     1
##  6 2015-01-01 PK        0        7    -1  2015     1     1     1
##  7 2016-01-01 HH        0       22    15  2016     1     1     1
##  8 2016-01-01 PK        0       22    17  2016     1     1     1
##  9 2017-01-01 HH        0.16    30    16  2017     1     1     1
## 10 2017-01-01 PK        0.34    22     7  2017     1     1     1
## # ... with 3,458 more rows
# calculate total annual precipitation by station and year
annual_sum_precip_by_station <- tidy_clim_data %>%
  group_by(Station, Year) %>%
  summarise(PrecipSum = sum(PrcpIN))

annual_sum_precip_by_station 
## # A tibble: 10 x 3
## # Groups:   Station [2]
##    Station  Year PrecipSum
##    <chr>   <dbl>     <dbl>
##  1 HH       2013      29.8
##  2 HH       2014      40.8
##  3 HH       2015      25.7
##  4 HH       2016      47.0
##  5 HH       2017      25.1
##  6 PK       2013      14.7
##  7 PK       2014      19.9
##  8 PK       2015      12.6
##  9 PK       2016      17.7
## 10 PK       2017      11.8
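
Two details of grouped sums are worth a quick sketch: sum() returns NA for any group that contains a missing value unless na.rm = TRUE is set, and the result of summarise() after grouping by two variables is still grouped by the first one (the "Groups: Station [2]" line in the output above). The tibble below is made up, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# made-up values with one missing precipitation record
toy <- tibble(
  Station = c("HH", "HH", "PK", "PK"),
  Year    = c(2016, 2016, 2016, 2016),
  PrcpIN  = c(0.1, NA, 0.2, 0.3)
)

toy_sums <- toy %>%
  group_by(Station, Year) %>%
  summarise(
    PrecipSum      = sum(PrcpIN),                # NA if any value in the group is missing
    PrecipSum_noNA = sum(PrcpIN, na.rm = TRUE),  # ignores missing values
    .groups = "drop"                             # return an ungrouped tibble
  )

toy_sums
```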

Challenge Exercises

Exercise 1 - make data tidy

  • Create a tibble dataframe using the code below and make it tidy. Do you need pivot_wider() or pivot_longer()? What are the variables? What are the observations?
pregnancy <- tribble(         
    ~pregnant, ~male, ~female,          
    "yes",     NA,    10,           
    "no",      20,    12        
)

Exercise 2 - data summary with dplyr

  • Use the dplyr verbs and the tidy_clim_data dataset you created above to calculate monthly average Tmin and Tmax for each station.

Exercise 3 - make and parse a date with lubridate

  • Make a date object out of a character string of your birthday and find the day of year it occurs on.
