Tidying Examples

library(tidyverse)

## ── Attaching packages ───────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.0

## ── Conflicts ──────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Are temperatures rising?

temps = read_csv('../data/bristol_temps.csv')
knitr::kable(head(temps))

year	jan	feb	mar	apr	may	jun	jul	aug	sep	oct	nov	dec
1900	5.3	3.6	4.3	9.3	11.2	15.0	18.9	16.5	15.2	11.0	8.0	7.8
1901	4.1	2.8	5.3	9.6	12.7	15.0	18.4	16.3	14.7	10.4	5.1	4.2
1902	5.6	2.2	7.6	8.2	10.1	14.5	16.2	15.9	14.1	10.6	7.6	5.1
1903	5.1	7.4	8.1	7.7	12.3	13.6	16.4	15.4	14.3	11.7	7.5	4.5
1904	5.0	4.5	5.1	9.7	11.8	14.5	18.1	16.2	13.4	10.3	6.3	5.3
1905	4.1	5.9	7.6	8.4	12.0	15.4	18.5	15.8	13.2	8.1	5.4	5.2

This data is messy. The values are temperatures in a month, in a year. So, we should have a column for month and a column for year. Yet, we have months on the columns and years on the rows. That means the data is not tidy. Tidy data might look something like this:

year	month	temperature
1900	jan	5.3
1900	feb	3.6
1900	mar	4.3
1900	apr	9.3
1900	may	11.2
1900	jun	15.0
1900	jul	18.9
1900	aug	16.5
1900	sep	15.2
1900	oct	11.0
1900	nov	8.0
1900	dec	7.8

But, who cares? I can use the rowMeans function to get the average temperature in each year. That answers my question:

plot(temps$year, rowMeans(temps[ , -1]))

But… if we want something that’s not the mean? We have rowSums, but no rowSds, rowMedians, basically anything else. However, if we had tidy data, we would not have to rely on special functions like rowMeans. So, let’s do something better.

The data currently has a variable (month) in its columns. Focusing on the values as we discussed above, there are only three variables: year, month, and temperature. If the data were tidy, then these would be the only columns.

To make data longer, we use the pivot_longer function:

temps %>% pivot_longer(jan:dec)

## # A tibble: 1,428 x 3
##     year name  value
##    <dbl> <chr> <dbl>
##  1  1900 jan     5.3
##  2  1900 feb     3.6
##  3  1900 mar     4.3
##  4  1900 apr     9.3
##  5  1900 may    11.2
##  6  1900 jun    15  
##  7  1900 jul    18.9
##  8  1900 aug    16.5
##  9  1900 sep    15.2
## 10  1900 oct    11  
## # … with 1,418 more rows

This is almost our tidy example from above, except the column names are wrong. We can re-name the columns inside of the pivot_longer call:

long_temps = temps %>% pivot_longer(jan:dec, names_to='month', values_to='temp')
knitr::kable(head(long_temps))

year	month	temp
1900	jan	5.3
1900	feb	3.6
1900	mar	4.3
1900	apr	9.3
1900	may	11.2
1900	jun	15.0

Together, this gives us what we need to replicate the rowMeans solution in a tidy fashion. This would involve computing the average temperature in each year in this long data. That is, we need to group the data by a specific column, and then summarize the values within each group.

We can do that with group_by and summarize. group_by is an operation that collects together sets of rows based on a common value. Then, we typically use the summarize() function (or summarise) to transform the values in each group according to some rule. Here, we use the standard R mean function to get the mean of the temperatures in each group. This gives us a similar result to the rowMeans above.

long_temps %>% group_by(year) %>% summarize(mean_temp = mean(temp))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 119 x 2
##     year mean_temp
##    <dbl>     <dbl>
##  1  1900     10.5 
##  2  1901      9.88
##  3  1902      9.81
##  4  1903     10.3 
##  5  1904     10.0 
##  6  1905      9.97
##  7  1906     10.4 
##  8  1907      9.88
##  9  1908     10.5 
## 10  1909      9.53
## # … with 109 more rows

We can plot this in the same way, or use ggplot2 to make a plot that’s a little more pretty:

long_temps %>%
          group_by(year) %>% # make the groups for each year
          summarize(temp = mean(temp)) %>% # get the average temperature
      ggplot(aes(x=year, y=temp)) + # plot the result
        geom_point() + 
        geom_line() + 
        geom_smooth()

## `summarise()` ungrouping output (override with `.groups` argument)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Alternatively, with the tidy data already, we can plot the values by year in quite a few ways. Just as a bunch of dots in each year:

ggplot(long_temps, aes(x=year, y=temp)) + geom_point() + geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

or as a set of boxplots in each year:

ggplot(long_temps, aes(x=year, group=year, y=temp)) + geom_boxplot()

What’s the climate curve for Bristol?

A climate curve shows the range of temperatures a place experiences every month. It’s kind of a standard summary of the climate of a place, and basically every Wikipedia page on a place a version of one. For this, we would need to get a summary of the typical temperature per month and plot that.

There are a few ways this could get done using the tidy data, but doing this with the standard data would be very verbose.

First, we could use the group option in ggplot:

ggplot(long_temps, aes(x=month, y=temp, group=year)) + geom_line()

But that doesn’t look quite right: it’s sorting the months in alphabetical order, not calendar order. We can force this order, though, by making the months an ordered factor, which is like a categorical variable with a set ordering. This is very simple to do, and requires us to first get the order of the factors. I’ll do this by taking them from our original un-tidy dataframe:

month_order = colnames(select(temps, -year))

and the using ordered to force the months into an ordered categorical variable:

long_temps['month'] = ordered(long_temps$month, levels=month_order)

Finally we can make our plot:

long_temps %>% ggplot(aes(x=month, y=temp, group=year)) +
              geom_line(alpha=.1) + 
              geom_smooth(color='orangered', se=F, aes(group=NA))

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

What’s the relationship beteen Phosphate and Salinity?

This data contains information on sample sites in the ocean. It provides measures of nutrients, as well as information on the prevalence of species of phytoplankton.

species = read_csv('../data/species.csv')
knitr::kable(head(species))

Year	Date	Latitude	Longitude	Depth	A. robusta	A. unicornis	C. divergens	E. huxleyi	G. ericsonii	NiCompounds	Phosphate	Silicate	Salinity	Temperature	Chlorophyll
2003	2003-05-22	-29.5	-26.9	13	0.00000	60.75334	0.000	1761.8469	1397.3269	0.080900	0.0991000	1.08	35.804	21.330	0.03
2003	2003-05-22	-29.5	-26.9	116	353.35689	117.78563	0.000	353.3569	1060.0707	2.402083	0.3227000	1.32	35.740	17.220	0.11
2003	2003-05-24	-21.6	-25.0	22	71.27584	0.00000	0.000	997.8617	142.5517	0.012900	0.2010156	1.07	37.290	26.618	0.03
2003	2003-05-24	-21.6	-25.0	34	0.00000	112.10762	0.000	1681.6143	224.2152	0.013100	0.2034769	1.08	37.291	26.620	0.03
2003	2003-05-24	-21.6	-25.0	60	0.00000	102.45902	102.459	1434.4262	2356.5574	0.016300	0.2082746	1.10	37.290	26.624	0.03
2003	2003-05-24	-21.6	-25.0	140	186.56716	0.00000	0.000	149.2537	298.5075	0.025800	0.3755803	1.23	36.608	21.342	0.11

For this specific question, the data is actually tidy! the values for, say, the Salinity or Phosphate column represent the salinity of a specific sample site in the ocean. If we want to ask questions about the relationship between those variables for each of those observations, then the data is tidy enough:

species %>% ggplot(aes(x=Phosphate, y=Salinity)) + geom_point()

What’s the species distribution?

However, an enterprising student will notice: the species/genera prevalence columns are not tidy! Each Genus is denoted by the first letter in the species name. So, the values in the column called E. Huxleyi actually reflects the prevalence (in organisms per volume) of the organism E. Huxleyi. Further, that column name actually mixes the genera & species name together! So, asking questions about the relationship between species and genera, or the co-occurence of genera/species has to tidy this data a bit.

To split the data up again, we can use pivot_longer again in the same way we did before:

species_long = species %>% pivot_longer(`A. robusta`:`S. florida`, names_to=c("Genus", "Species"), names_sep='. ', values_to='prevalence')

But, note now that we’ve duplicated tons of measurements of the other rows! This automatically demonstrates why it’s sometimes not economical to input data in a tidy format… it often involves a lot of repetition! Now, though, we can look at the distributions of species/genera in our data, since each value of the prevalence column is a single observation of the prevalence of a single organism.

species_long %>%
  ggplot(aes(x=Genus, y=prevalence, color=Species)) + geom_boxplot() + 
  scale_y_continuous(trans='log10') + annotation_logticks(sides='l') + ylab("Prevalence (log10)")

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 664 rows containing non-finite values (stat_boxplot).

All the functions you may want:

pivot_wider(): take rows and convert them into columns
pivot_longer(): take columns and convert them into rows
group_by(group_name) %>% summarize(group_summary = ..., group_summary2=...): split the dataframe into groups based on the unique values of group_name & summarize those groups using one (or more) functions
- count()/n(): count the number of observations in the group
- first(variable): get the first value of variable in the group
- last(variable): get the last value of variable in the group
- n_distinct(variable): get the number of unique values of variable in the group. Gives the number of unqiue rows if variable is omitted.
filter(): subset the data by rows according to some condition(s)
- between(variable, lower, higher): is TRUE for rows where the value of variable is at least as large as lower and at most as large as higher.
- distinct(variable): is TRUE for rows when the value for that row not yet been seen (going from the first row to the last row)
select(): subset the data by columns according to some condition(s)
- contains(variable,value): is TRUE for rows where variable is a string that contains value. See ??tidy_select for more.
mutate(new_variable = do_something(old_variable)): add a new_variable to the data that is computed using an old_variable of the dataframe
- transmute(new_variable = do_something(old_variable)): same as mutate, but only new_variable is returned
- rename(new_name = old_name): rename the old_name column to be called new_name
separate(variable, sep="_")/unite(variable1, variable2, sep="_"): split (or join) a column into new columns based on a separator, sep
- separate_rows(variable, sep=':'): split a column into rows, where each new row is a copy of the original, but now its value of variable has been split according to sep
arrange(variable): sort the dataframe based on increasing values of variable (i.e. sort is ascending)
- arrange(desc(variable)) sort the dataframe based on decreasing values of variable
joins: merge dataframes based on set theory
- left/right/inner/full_join()
- semi/anti_join
expand_grid(a, b, c): make a new dataframe out of all the combinations of values in a, b, and c
- seq(start, stop, step): make a sequence from the start to the stop with a step size of step
- rep(vector, number): repeat the vector