Sculpting Data for Fun and Profit

The Answers…

library(tidyverse) # for tidy tools and reshaping methods
library(sf) # for working with spatial data
library(ggplot2) # for plotting
weca = st_read('../data/weca.gpkg', quiet=T)
knitr::kable(head(st_drop_geometry(weca))[,c(4,5,6,7)])
lsoa_name price_dec_1995 price_mar_1996 price_jun_1996
Bath and North East Somerset 007A 49000 54000 85500
Bath and North East Somerset 007B 68250 68950 67000
Bath and North East Somerset 007C 72500 73750 66500
Bath and North East Somerset 010E 111250 110000 103500
Bath and North East Somerset 010B 70000 89500 89500
Bath and North East Somerset 010A 62500 60500 64000

1. Taking a first look at the data

1.1

The following dataset of temperatures from the Filton weather station has a different tidiness problem. Which problem is it, and how can you tell?

library(readr)
temps = read_csv("../data/bristol_yearly_temps.csv")
temps
year stat degrees_celsius
2016 maximum 18.7
2016 median 10.6
2016 minimum 6.1
2017 maximum 18.4
2017 median 12.4
2017 minimum 5.5
2018 maximum 21.1
2018 median 11.5
2018 minimum 4.4

This temperature data is too long. The stat column mixes together the names of several different statistics, which means the degrees_celsius column holds several different kinds of temperature. This is Wickham’s second problem: multiple variables are stored in a single column; the maximum, median, and minimum temperatures, which should each be their own variable, are all stacked in degrees_celsius.
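A quick check (using the temps table loaded above) makes the diagnosis visible: counting the values of stat shows that each statistic’s name appears once per year, i.e. variable names are being stored as data values.

temps %>% count(stat) # each statistic name appears once per year in the stat column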

1.2

How would you tidy the data & compute the range of temperatures for each year?

To resolve this issue, we need to spread, or pivot_wider, the data to split the stat column into three variables.

temps_wide = pivot_wider(temps, 
            id_cols=year, # this uniquely identifies each row
            names_from=stat, # this contains the "names" of our variables
            values_from=degrees_celsius # this contains the values of our variables
            ) 
temps_wide %>% mutate(range = maximum - minimum)
year maximum median minimum range
2013 20.9 10.55 4.4 16.5
2014 20.0 12.80 6.5 13.5
2015 17.3 12.05 5.3 12.0
2016 18.7 10.60 6.1 12.6
2017 18.4 12.40 5.5 12.9
2018 21.1 11.50 4.4 16.7
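As an aside, the numeric answer alone could also be obtained straight from the long data with a group_by + summarize, since the largest and smallest values of degrees_celsius in each year are, by construction, the maximum and minimum statistics. This skips the tidying step, though, and leaves no wide table of maximum/median/minimum for later use.

temps %>% 
  group_by(year) %>% # one group per year
  summarize(range = max(degrees_celsius) - min(degrees_celsius)) # largest stat minus smallest stat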

An alternative approach to computing the temperature range filters the temps data instead. It gives the right result, but it is more complicated because the “pivot” happens manually: you pull out the maximum and minimum values one at a time using the filter and select verbs. This only works because you know in advance which statistics the range needs, and there are only two of them. If the spread were more complicated, or involved more variables, this approach would entail a lot of copying & pasting.

maxes = temps %>% filter(stat == 'maximum') %>% # get the maximum column
                     select(-stat) %>% # drop the stat column, just keep year & value
                     rename(maximum = degrees_celsius) # rename the degrees to maximum
mins = temps %>% filter(stat=='minimum') %>% 
                 select(-stat) %>% # drop the stat column, as before
                 rename(minimum = degrees_celsius) # rename the degrees to minimum
join = merge(maxes, mins, by='year') # join the minima and maxima by years
join %>% mutate(range = maximum - minimum) # compute the range
year maximum minimum range
114 2013 20.9 4.4 16.5
115 2014 20.0 6.5 13.5
116 2015 17.3 5.3 12.0
117 2016 18.7 6.1 12.6
118 2017 18.4 5.5 12.9
119 2018 21.1 4.4 16.7
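For reference, the same join can also be written with dplyr’s inner_join, which keeps the result as a tibble rather than a base data.frame:

inner_join(maxes, mins, by='year') %>% # equivalent to merge(maxes, mins, by='year')
  mutate(range = maximum - minimum) # compute the range as before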

1.3

The weca data has one of these issues. Which one does it have, and how can you tell?

The weca data is too wide. It mixes information about the quarter/year in with the information about price. This means that the data has Wickham’s first problem: column headers are values, not variable names. In theory, lsoa, la, price, quarter, and year are distinct variables that each encode a distinct bit of information about the observation. Further, this means that the “tidy” version of this data contains a single row for the price of a single LSOA in a single quarter in a single year. The data needs to be “melted” to become tidy, which will be completed in the next step.
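A quick look at the column names of the weca object loaded above confirms this: the names themselves store quarter and year values.

head(grep('^price_', names(weca), value=TRUE)) # names like price_dec_1995 encode quarter & year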

2. Tidying Data

2.1

It will be helpful to have tidy data for some analyses in the remainder of the exercise. Use a pivot to transform the data so that each row measures the price of an LSOA in a specific month & year.

wecatidy = pivot_longer(weca, 
              price_dec_1995:price_dec_2018, # all of the quarterly price columns
              names_to=c(NA, 'quarter', 'year'), # drop the "price" prefix, keep quarter & year
              names_sep='_', # split the column names on underscores
              values_to='price' # the cell values become a single price column
              )
la_code la_name lsoa_code lsoa_name lsoa11cd geom quarter year price
E06000022 Bath and North East Somerset E01014370 Bath and North East Somerset 007A E01014370 MULTIPOLYGON (((-2.357691 5… dec 1995 49000
E06000022 Bath and North East Somerset E01014370 Bath and North East Somerset 007A E01014370 MULTIPOLYGON (((-2.357691 5… mar 1996 54000
E06000022 Bath and North East Somerset E01014370 Bath and North East Somerset 007A E01014370 MULTIPOLYGON (((-2.357691 5… jun 1996 85500
E06000022 Bath and North East Somerset E01014370 Bath and North East Somerset 007A E01014370 MULTIPOLYGON (((-2.357691 5… sep 1996 122000
E06000022 Bath and North East Somerset E01014370 Bath and North East Somerset 007A E01014370 MULTIPOLYGON (((-2.357691 5… dec 1996 129000
E06000022 Bath and North East Somerset E01014370 Bath and North East Somerset 007A E01014370 MULTIPOLYGON (((-2.357691 5… mar 1997 129000
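One caveat: the quarter and year columns produced by names_to are character vectors, not numbers. The year comparisons later in the exercise (such as year >= 2008) still behave correctly because every year has four digits, but you could convert to integers if you prefer working with numeric years:

# the pieces split out of the column names are stored as text
class(wecatidy$year)
# convert on the fly if a numeric year is preferred:
# wecatidy %>% mutate(year = as.integer(year))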

2.2

Using your new tidy data, use ggplot2 to make a boxplot where the horizontal axis is years, the vertical axis is price, and boxes are colored by local authority.

ggplot(wecatidy, aes(x=year, y=price, color=la_name)) + geom_boxplot() + theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1))
## Warning: Removed 581 rows containing non-finite values (stat_boxplot).
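The warning reflects LSOA/quarter combinations with no recorded price, which geom_boxplot drops. Counting the missing prices in the tidy table should match the number of rows reported as removed:

sum(is.na(wecatidy$price)) # missing prices are the "non-finite values" ggplot warns about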

2.3

Using tidy verbs such as group_by, select, and mutate (plus merge for joins), can you:

  • compute the average price for each year? This can be done using a group by + summarize
wecatidy %>% group_by(year) %>% summarize(mean_price = mean(price, na.rm=T))
year mean_price
1995 58154.10
1996 58988.64
1997 62572.58
1998 69444.76
1999 76872.93
2000 91882.90
  • compute the average price in each local authority each year? This can be done using a group by + summarize. Note that there are now two grouping levels to capture: the year and the local authority. summarize computes the summary within each year/la_name combination and then drops (“peels off”) the last grouping variable, so after group_by(year, la_name) the result is still grouped by year (see the sketch after the table below):
wecatidy %>% group_by(year, la_name) %>% 
  summarize(mean_price = mean(price, na.rm=T))
year la_name mean_price
1995 Bath and North East Somerset 70375.12
1995 Bristol, City of 51259.92
1995 South Gloucestershire 60340.01
1996 Bath and North East Somerset 72169.71
1996 Bristol, City of 51352.70
1996 South Gloucestershire 61347.14
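To see that peeling behaviour directly, inspect the grouping of the summary table: the last grouping variable (la_name) has been dropped, while year remains, ready for any follow-up per-year summary.

wecatidy %>% group_by(year, la_name) %>% 
  summarize(mean_price = mean(price, na.rm=T)) %>% 
  group_vars() # returns "year": la_name has been peeled off by summarize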
  • Compute the median price for each LSOA since 2008? This is slightly more complex, requiring a filter operation, and the recognition that you can group_by(lsoa_code), just like any other grouping variable.
wecatidy %>% filter(year >= 2008) %>% 
  group_by(lsoa_code) %>% summarize(price = median(price))
lsoa_code price
E01014370 254375.0
E01014371 251662.5
E01014372 280000.0
E01014373 224250.0
E01014374 317500.0
E01014375 399499.0
  • Challenge: make a map of the percentage change in price from 2008 until now in each LSOA? This requires the hint information, plus everything we’ve done before.
library(rcartocolor)
price_changes = wecatidy %>% filter(year >= 2008) %>% # keep only 2008 onwards
                             group_by(lsoa_code, year) %>% 
                             summarize(price = median(price, na.rm=T)) %>% # median price per LSOA per year
                             arrange(year) %>% # sort by year so first() and last() are the earliest and latest years
                             summarize(pct_change = (last(price) - first(price))/first(price)*100) # % change per LSOA
price_changes = merge(weca[c('lsoa_code', 'geom')], price_changes, by='lsoa_code')

The price change table now looks like this:

lsoa_code pct_change geometry
E01014370 26.378667 MULTIPOLYGON (((-2.357691 5…
E01014371 86.639470 MULTIPOLYGON (((-2.351819 5…
E01014372 33.333333 MULTIPOLYGON (((-2.356748 5…
E01014373 25.000000 MULTIPOLYGON (((-2.320094 5…
E01014374 34.724409 MULTIPOLYGON (((-2.307303 5…
E01014375 2.054054 MULTIPOLYGON (((-2.342037 5…

And we can make a map using either a continuous color scale:

price_changes %>%
ggplot(., aes(fill=pct_change)) + 
  geom_sf(lwd=.1) + 
  scale_fill_carto_c('% Change in Price', palette='Sunset', direction=-1)

or a discrete map classification, which emphasizes the visual differences between areas:

price_changes %>% mutate(map_class = cut(pct_change, 
                                          quantile(pct_change, probs=seq(0,1,.2)), # quintile breaks
                                          include.lowest=TRUE)) %>% # keep the smallest value in the lowest class
ggplot(., aes(fill=map_class)) + 
  geom_sf(lwd=.1) + 
  scale_fill_carto_d('% Change in Price', palette='Sunset', direction=-1)