Sculpting Data for Fun and Profit

Rent & housing price is a fundamental part of the economics of society. It is a major driver of both urbanization, de-urbanization, and internal social shifts in habitation within cities. Further, the housing market, its structure, and its behavior is the single most important driver of the business cycle, meaning that, fundamentally, the negotiations between landlords & tenants, sellers & buyers drives both the speed, duration, and the depth of recession and serves as a clear and substantial driver of income inequality in cities. At this point, you’ve probably heard of “gentrification”, and that’s driven by house prices, too. Related to this concept, one dataset we will use often in this course is the quarterly house prices since 1996 in the West of England Combined Authority.

And, we’ll use the following three packages:

library(tidyverse) # for tidy tools and reshaping methods
library(sf) # for working with spatial data
library(ggplot2) # for plotting

Answers to these problems are discussed here. I’d encourage you to attempt a complete set of answers first.

The data

I provide the dataset, weca.gpkg, a “geopackage” containing the median home sale price in each LSOA in each quarter (i.e. every March, June, September, & December) starting in December of 1995 across the West of England Combined Authority. Just like most spatial data, you can read it using sf::st_read.1 You may need to change this code in order for it to run successfully on your own computer. Notably, make sure you know where your data is, and where your code is running. Make sure to set your working directory, and put the data in a place where you can get to it from that working directory.

lsoa_name price_dec_1995 price_mar_1996 price_jun_1996
Bath and North East Somerset 007A 49000 54000 85500
Bath and North East Somerset 007B 68250 68950 67000
Bath and North East Somerset 007C 72500 73750 66500
Bath and North East Somerset 010E 111250 110000 103500
Bath and North East Somerset 010B 70000 89500 89500
Bath and North East Somerset 010A 62500 60500 64000

Viewing the columns of this dataframe yourself, you’ll see that the identifiers for each lsoa are lsoa_code and lsoa_name. The local authority for those LSOAs are recorded in la_code and la_name. Then, all of the rest of the columns are in the form price_<month>_<year>, and records the median price (in £) of all houses sold in that LSOA in that time period.2 Strictly speaking, the <month> refers to a whole fiscal quarter, and uses the month name because that is much more informative than simply the quarter number. Above, I’ve shown a cleaned-up view using just the lsoa_name and a few price columns, to give you a sense of what the data looks like.

1. Taking a first look at the data

According to Wickham (2014), there are five common problems with “messy” datasets that generally can be resolved by “tidying” the data.3 If you consult Section 3 of the Wickham (2014) paper on tidy data, you’ll find the common problems and discussion. These problems are: “column headers are values, not variable names” ; “multiple variables are stored in one column”; “variables stored in both rows and columns”; “multiple types are in one table”; “one type in multiple tables.”

1.1

The following dataset of temperatures from the Filton weather station has a different tidiness problem. Which problem is it, and how can you tell?

library(readr)
temps = read_csv("../data/bristol_yearly_temps.csv")
temps
year stat degrees_celsius
2016 maximum 18.7
2016 median 10.6
2016 minimum 6.1
2017 maximum 18.4
2017 median 12.4
2017 minimum 5.5
2018 maximum 21.1
2018 median 11.5
2018 minimum 4.4

1.2

How would you tidy the temps data in order to easily compute the range of temperatures in each year? Show me the code, and briefly explain why the data is now tidy.

1.3

The weca data also has one of these Wickham (2014) issues. Which one does it have, and how can you tell?

2 Tidying Data

It will be helpful to have tidy data for some analyses in the remainder of the exercise. This next section will involve doing this.

2.1

Use a pivot to transform the data to make each row measure the price of an LSOA in a specific month & year.4 Don’t forget about the optional arguments, such as names_to and names_sep, which allow you to split column names that contain “extra” information. Alternatively, you can use tidyr::separate after the pivot. Show me the code, and briefly explain why the data is now tidy.

2.2

Using your new tidy data, use ggplot2 to make a boxplot where the horizontal axis is years, the vertical axis is price, and boxes are colored by local authority.5 The ggplot2 library has some helpful guidance online. Pay critical attention to what the options are, and what the question asks you to do.

2.3

Using tidy verbs such as group_by, select, mutate, and merge, can you: