Sculpting Data for Fun and Profit

Rent & housing price is a fundamental part of the economics of society. It is a major driver of both urbanization, de-urbanization, and internal social shifts in habitation within cities. Further, the housing market, its structure, and its behavior is the single most important driver of the business cycle, meaning that, fundamentally, the negotiations between landlords & tenants, sellers & buyers drives both the speed, duration, and the depth of recession and serves as a clear and substantial driver of income inequality in cities. At this point, you’ve probably heard of “gentrification”, and that’s driven by house prices, too. Related to this concept, one dataset we will use often in this course is the quarterly house prices since 1996 in the West of England Combined Authority.

Answers to these problems are discussed here. I’d encourage you to attempt a complete set of answers first.

The data

I provide the dataset, weca.gpkg, a “geopackage” containing the median home sale price in each LSOA in each quarter (i.e. every March, June, September, & December) starting in December of 1995 across the West of England Combined Authority. Just like most spatial data, you can read it using sf::st_read.11 You may need to change this code in order for it to run successfully on your own computer. Notably, make sure you know where your data is, and where your code is running. Make sure to set your working directory, and put the data in a place where you can get to it from that working directory.

lsoa_name	price_dec_1995	price_mar_1996	price_jun_1996
Bath and North East Somerset 007A	49000	54000	85500
Bath and North East Somerset 007B	68250	68950	67000
Bath and North East Somerset 007C	72500	73750	66500
Bath and North East Somerset 010E	111250	110000	103500
Bath and North East Somerset 010B	70000	89500	89500
Bath and North East Somerset 010A	62500	60500	64000

Viewing the columns of this dataframe yourself, you’ll see that the identifiers for each lsoa are lsoa_code and lsoa_name. The local authority for those LSOAs are recorded in la_code and la_name. Then, all of the rest of the columns are in the form price_<month>_<year>, and records the median price (in £) of all houses sold in that LSOA in that time period.22 Strictly speaking, the <month> refers to a whole fiscal quarter, and uses the month name because that is much more informative than simply the quarter number. Above, I’ve shown a cleaned-up view using just the lsoa_name and a few price columns, to give you a sense of what the data looks like.

1. Taking a first look at the data

According to Wickham (2014), there are five common problems with “messy” datasets that generally can be resolved by “tidying” the data.33 If you consult Section 3 of the Wickham (2014) paper on tidy data, you’ll find the common problems and discussion. These problems are: “column headers are values, not variable names” ; “multiple variables are stored in one column”; “variables stored in both rows and columns”; “multiple types are in one table”; “one type in multiple tables.”

1.1

The following dataset of temperatures from the Filton weather station has a different tidiness problem. Which problem is it, and how can you tell?

library(readr)
temps = read_csv("../data/bristol_yearly_temps.csv")
temps

year	stat	degrees_celsius
2016	maximum	18.7
2016	median	10.6
2016	minimum	6.1
2017	maximum	18.4
2017	median	12.4
2017	minimum	5.5
2018	maximum	21.1
2018	median	11.5
2018	minimum	4.4

1.2

How would you tidy the temps data in order to easily compute the range of temperatures in each year? Show me the code, and briefly explain why the data is now tidy.

1.3

The weca data also has one of these Wickham (2014) issues. Which one does it have, and how can you tell?

2 Tidying Data

It will be helpful to have tidy data for some analyses in the remainder of the exercise. This next section will involve doing this.

2.1

Use a pivot to transform the data to make each row measure the price of an LSOA in a specific month & year.44 Don’t forget about the optional arguments, such as names_to and names_sep, which allow you to split column names that contain “extra” information. Alternatively, you can use tidyr::separate after the pivot. Show me the code, and briefly explain why the data is now tidy.

2.2

Using your new tidy data, use ggplot2 to make a boxplot where the horizontal axis is years, the vertical axis is price, and boxes are colored by local authority.55 The ggplot2 library has some helpful guidance online. Pay critical attention to what the options are, and what the question asks you to do.

2.3

Using tidy verbs such as group_by, select, mutate, and merge, can you:

compute the average price for each year?
compute the average price in each local authority each year?
compute the median price for each LSOA since 2008?
Challenge: make a map of the percentage change in price from 2008 til now in each LSOA?66 The first and last functions get the values of the first group and the last group in a group_by. And, percent change in variable \(x\) measured at time \(t\) is computed by \(\frac{x_{t+1} - x_t}{x_t}\).