Sculpting Data for Fun and Profit

The Answers…

library(tidyverse) # for tidy tools and reshaping methods
library(sf) # for working with spatial data
library(ggplot2) # for plotting
weca = st_read('../data/weca.gpkg', quiet=T)
knitr::kable(head(st_drop_geometry(weca))[,c(4,5,6,7)])
lsoa_name price_dec_1995 price_mar_1996 price_jun_1996
Bath and North East Somerset 007A 49000 54000 85500
Bath and North East Somerset 007B 68250 68950 67000
Bath and North East Somerset 007C 72500 73750 66500
Bath and North East Somerset 010E 111250 110000 103500
Bath and North East Somerset 010B 70000 89500 89500
Bath and North East Somerset 010A 62500 60500 64000

1. Taking a first look at the data

1.1

The following dataset of temperatures from the Filton weather station has a different tidiness problem. Which problem is it, and how can you tell?

library(readr)
temps = read_csv("../data/bristol_yearly_temps.csv")
temps
year stat degrees_celsius
2016 maximum 18.7
2016 median 10.6
2016 minimum 6.1
2017 maximum 18.4
2017 median 12.4
2017 minimum 5.5
2018 maximum 21.1
2018 median 11.5
2018 minimum 4.4

This temperature data is too long. The stat column mixes together the names of several different statistics, which means the degrees_celsius column holds several different kinds of temperature. This is Wickham’s second problem: multiple variables are stored in a single column; the maximum, median, and minimum temperatures, which should each be their own variable, are all stacked in degrees_celsius.
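A quick check (using the temps table loaded above) makes the diagnosis visible: counting the values of stat shows that each statistic’s name appears once per year, i.e. variable names are being stored as data values.

temps %>% count(stat) # each statistic name appears once per year in the stat column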

1.2

How would you tidy the data & compute the range of temperatures for each year?

To resolve this issue, we need to spread, or pivot_wider, the data to split the stat column into three variables.

temps_wide = pivot_wider(temps, 
            id_cols=year, # this uniquely identifies each row
            names_from=stat, # this contains the "names" of our variables
            values_from=degrees_celsius # this contains the values of our variables
            ) 
temps_wide %>% mutate(range = maximum - minimum)
year maximum median minimum range
2013 20.9 10.55 4.4 16.5
2014 20.0 12.80 6.5 13.5
2015 17.3 12.05 5.3 12.0
2016 18.7 10.60 6.1 12.6
2017 18.4 12.40 5.5 12.9
2018 21.1 11.50 4.4 16.7
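As an aside, the numeric answer alone could also be obtained straight from the long data with a group_by + summarize, since the largest and smallest values of degrees_celsius in each year are, by construction, the maximum and minimum statistics. This skips the tidying step, though, and leaves no wide table of maximum/median/minimum for later use.

temps %>% 
  group_by(year) %>% # one group per year
  summarize(range = max(degrees_celsius) - min(degrees_celsius)) # largest stat minus smallest stat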

An alternative approach to computing the temperature range filters the temps data instead. It gives the right result, but it is more complicated because the “pivot” happens manually: you pull out the maximum and minimum values one at a time using the filter and select verbs. This only works because you know in advance which statistics the range needs, and there are only two of them. If the spread were more complicated, or involved more variables, this approach would entail a lot of copying & pasting.

maxes = temps %>% filter(stat == 'maximum') %>% # get the maximum column
                     select(-stat) %>% # drop the stat column, just keep year & value
                     rename(maximum = degrees_celsius) # rename the degrees to maximum
mins = temps %>% filter(stat=='minimum') %>% 
                 select(-stat) %>% # drop the stat column, as before
                 rename(minimum = degrees_celsius) # rename the degrees to minimum
join = merge(maxes, mins, by='year') # join the minima and maxima by years
join %>% mutate(range = maximum - minimum) # compute the range
year maximum minimum range
114 2013 20.9 4.4 16.5
115 2014 20.0 6.5 13.5
116 2015 17.3 5.3 12.0
117 2016 18.7 6.1 12.6
118 2017 18.4 5.5 12.9
119 2018 21.1 4.4 16.7
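For reference, the same join can also be written with dplyr’s inner_join, which keeps the result as a tibble rather than a base data.frame:

inner_join(maxes, mins, by='year') %>% # equivalent to merge(maxes, mins, by='year')
  mutate(range = maximum - minimum) # compute the range as before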

1.3

The weca data has one of these issues. Which one does it have, and how can you tell?

The weca data is too wide. It mixes information about the quarter/year in with the information about price. This means that the data has Wickham’s first problem: column headers are values, not variable names. In theory, lsoa, la, price, quarter, and year are distinct variables that each encode a distinct bit of information about the observation. Further, this means that the “tidy” version of this data contains a single row for the price of a single LSOA in a single quarter in a single year. The data needs to be “melted” to become tidy, which will be completed in the next step.
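A quick look at the column names of the weca object loaded above confirms this: the names themselves store quarter and year values.

head(grep('^price_', names(weca), value=TRUE)) # names like price_dec_1995 encode quarter & year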

2. Tidying Data

2.1

It will be helpful to have tidy data for some analyses in the remainder of the exercise. Use a pivot to transform the data so that each row measures the price of an LSOA in a specific month & year.

wecatidy = pivot_longer(weca, 
              price_dec_1995:price_dec_2018, # all of the quarterly price columns
              names_to=c(NA, 'quarter', 'year'), # drop the "price" prefix, keep quarter & year
              names_sep='_', # split the column names on underscores
              values_to='price' # the cell values become a single price column
              )
la_code la_name lsoa_code lsoa_name lsoa11cd geom quarter year price
E06000022 Bath and North East Somerset E01014370 Bath and North East Somerset 007A E01014370 MULTIPOLYGON (((-2.357691 5… dec 1995 49000
E06000022 Bath and North East Somerset E01014370 Bath and North East Somerset 007A E01014370 MULTIPOLYGON (((-2.357691 5… mar 1996 54000
E06000022 Bath and North East Somerset E01014370 Bath and North East Somerset 007A E01014370 MULTIPOLYGON (((-2.357691 5… jun 1996 85500
E06000022 Bath and North East Somerset E01014370 Bath and North East Somerset 007A E01014370 MULTIPOLYGON (((-2.357691 5… sep 1996 122000
E06000022 Bath and North East Somerset E01014370 Bath and North East Somerset 007A E01014370 MULTIPOLYGON (((-2.357691 5… dec 1996 129000
E06000022 Bath and North East Somerset E01014370 Bath and North East Somerset 007A E01014370 MULTIPOLYGON (((-2.357691 5… mar 1997 129000
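One caveat: the quarter and year columns produced by names_to are character vectors, not numbers. The year comparisons later in the exercise (such as year >= 2008) still behave correctly because every year has four digits, but you could convert to integers if you prefer working with numeric years:

# the pieces split out of the column names are stored as text
class(wecatidy$year)
# convert on the fly if a numeric year is preferred:
# wecatidy %>% mutate(year = as.integer(year))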

2.2

Using your new tidy data, use ggplot2 to make a boxplot where the horizontal axis is years, the vertical axis is price, and boxes are colored by local authority.

ggplot(wecatidy, aes(x=year, y=price, color=la_name)) + geom_boxplot() + theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))
## Warning: Removed 581 rows containing non-finite values (stat_boxplot).
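The warning reflects LSOA/quarter combinations with no recorded price, which geom_boxplot drops. Counting the missing prices in the tidy table should match the number of rows reported as removed:

sum(is.na(wecatidy$price)) # missing prices are the "non-finite values" ggplot warns about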

2.3

Using tidy verbs such as group_by, select, and mutate (plus merge for joins), can you:

  • compute the average price for each year? This can be done using a group by + summarize
wecatidy %>% group_by(year) %>% summarize(mean_price = mean(price, na.rm=T))
year mean_price
1995 58154.10
1996 58988.64
1997 62572.58
1998 69444.76
1999 76872.93
2000 91882.90
  • compute the average price in each local authority each year? This can be done using a group by + summarize. Note that there are now two grouping levels to capture: the year and the local authority. summarize computes the summary within each year/la_name combination and then drops (“peels off”) the last grouping variable, so after group_by(year, la_name) the result is still grouped by year (see the sketch after the table below):
wecatidy %>% group_by(year, la_name) %>% 
  summarize(mean_price = mean(price, na.rm=T))
year la_name mean_price
1995 Bath and North East Somerset 70375.12
1995 Bristol, City of 51259.92
1995 South Gloucestershire 60340.01
1996 Bath and North East Somerset 72169.71
1996 Bristol, City of 51352.70
1996 South Gloucestershire 61347.14
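To see that peeling behaviour directly, inspect the grouping of the summary table: the last grouping variable (la_name) has been dropped, while year remains, ready for any follow-up per-year summary.

wecatidy %>% group_by(year, la_name) %>% 
  summarize(mean_price = mean(price, na.rm=T)) %>% 
  group_vars() # returns "year": la_name has been peeled off by summarize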
  • Compute the median price for each LSOA since 2008? This is slightly more complex, requiring a filter operation, and the recognition that you can group_by(lsoa_code), just like any other grouping variable.
wecatidy %>% filter(year >= 2008) %>% 
  group_by(lsoa_code) %>% summarize(price = median(price))
lsoa_code price
E01014370 254375.0
E01014371 251662.5
E01014372 280000.0
E01014373 224250.0
E01014374 317500.0
E01014375 399499.0
  • Challenge: make a map of the percentage change in price from 2008 until now in each LSOA? This requires the hint information, plus everything we’ve done before.
library(rcartocolor)
price_changes = wecatidy %>% filter(year >= 2008) %>% # keep only 2008 onwards
                             group_by(lsoa_code, year) %>% 
                             summarize(price = median(price, na.rm=T)) %>% # median price per LSOA per year
                             arrange(year) %>% # sort by year so first() and last() are the earliest and latest years
                             summarize(pct_change = (last(price) - first(price))/first(price)*100) # % change per LSOA
price_changes = merge(weca[c('lsoa_code', 'geom')], price_changes, by='lsoa_code')

The price change table now looks like this:

lsoa_code pct_change geometry
E01014370 26.378667 MULTIPOLYGON (((-2.357691 5…
E01014371 86.639470 MULTIPOLYGON (((-2.351819 5…
E01014372 33.333333 MULTIPOLYGON (((-2.356748 5…
E01014373 25.000000 MULTIPOLYGON (((-2.320094 5…
E01014374 34.724409 MULTIPOLYGON (((-2.307303 5…
E01014375 2.054054 MULTIPOLYGON (((-2.342037 5…

And we can make a map using either a continuous color scale:

price_changes %>%
ggplot(., aes(fill=pct_change)) + 
  geom_sf(lwd=.1) + 
  scale_fill_carto_c('% Change in Price', palette='Sunset', direction=-1)

or a discrete map classification, which emphasizes the visual differences between areas:

price_changes %>% mutate(map_class = cut(pct_change, 
                                          quantile(pct_change, probs=seq(0,1,.2)), # quintile breaks
                                          include.lowest=TRUE)) %>% # keep the smallest value in the lowest class
ggplot(., aes(fill=map_class)) + 
  geom_sf(lwd=.1) + 
  scale_fill_carto_d('% Change in Price', palette='Sunset', direction=-1)