What do you mean?

Today, we will use the actual sales prices of property listings in the WECA area. The best answers will provide both useful visualizations and textual discussion of the correct answer. Code is also useful. In most cases, the answer can be provided in a line of code, a plot, and two or three sentences.

1. Doing Data Science Mentally

Let’s get you thinking about conditional probabilities. These can all be computed easily with filter() and group_by(). To do this, read in the sales.csv data. If a listing is:

1.1

from 1999, what’s your best guess as to the price of that house?

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
sales = read_csv("../data/sales.csv")
## Rows: 350891 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): id, month, day_of_month, property_type, town, district, county, ppd...
## dbl (2): price, year
## lgl (2): is_newbuild, is_freehold
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
sales %>% filter(year == 1999) %>% summarize(mean(price))
## # A tibble: 1 x 1
##   `mean(price)`
##           <dbl>
## 1        90080.

1.2

about a detached house in 1999, what’s your best guess as to the price of that house?

sales %>% filter(year == 1999 & property_type == 'detached') %>% summarize(mean(price))
## # A tibble: 1 x 1
##   `mean(price)`
##           <dbl>
## 1       141262.

2. A Group-Based Model

Fit a linear model that predicts listing price over the years, treating “year” as a group variable. Be careful! In this model, are years numbers or categories? Interpreting this model:

model1 = lm(price ~ as.factor(year)-1, data=sales)

2.1

In this model, what does the estimate for each “year” represent?

This is either the mean of each group, or the contrast between that year and the reference category, which is the first year by default.
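
As a quick check (just a sketch, not required for the answer), you can compare the model’s coefficients against the raw group means; with the -1 specification they should match:

# with the -1 specification, each coefficient is just that year's mean price
head(coef(model1), 3)
# compare with the raw means of price within each year
sales %>% group_by(year) %>% summarize(mean_price = mean(price)) %>% head(3)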

2.2

Which years are not “statistically significant?” Why might this be the case?

If you fit the model using the “contrast” specification, then some of the early years will not be statistically significant. This is because they measure the change relative to 1995; the actual prediction for, say, 1996 is (Intercept) + the 1996 coefficient in this case. The model I fit above (with the -1 term) estimates each group separately, like in the pub example from last week.
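
For reference, a sketch of that contrast specification: drop the -1 and R absorbs 1995 into the intercept, so each year’s coefficient becomes the difference from 1995.

# default ("contrast") specification: coefficients are changes relative to 1995
model1_contrast = lm(price ~ as.factor(year), data=sales)
# the 1999 prediction is then (Intercept) + the 1999 coefficient
head(coef(model1_contrast), 3)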

2.3

What’s the model’s prediction of listing price of a detached house in 1999? How does this compare to your guess from section 1?

Since our model doesn’t “know” about property type, we can ignore it in the scenario:

scenario = data.frame(year=1999)
predict(model1, scenario)
##        1 
## 90080.24

You can see that this is the same estimate as in Section 1; you should expect this, because (as we saw last week) a group-based model just uses the mean within each group as its prediction when there aren’t any other variables in play.

2.4 (Challenge)

Is the estimated relationship between price and year in this model linear? No! You can see this very quickly by just plotting the model’s predictions for each year, as below. They almost exactly match the predictions from geom_smooth(), which (as the message below shows) is a GAM, a flexible nonparametric smoother, and thus very nonlinear!

So, why do we call this a “linear model”? Well… it’s the linear combination

\[ y_i = \text{is\_1995}_i \cdot \beta_1 + \text{is\_1996}_i \cdot \beta_2 + \dots + \text{is\_2017}_i \cdot \beta_{23} + e_i \]

Those is_1995 effects are grouping variables (sometimes called “one-hot variables”) that are 1 when an observation meets the condition and 0 otherwise. R automatically builds them for any variable that’s a “factor” (a.k.a. “category”). This is the same logic we’ve discussed before for spline models: nonlinear models that split the data into separate “regions” and fit a regression within each region. Here, our “separate” regression happens within each year!

scenario = data.frame(year=seq(1995, 2017, 1))
scenario %>% 
  mutate(prediction = predict(model1, .)) %>%
  ggplot(aes(x=year, y=prediction)) + 
  geom_point() + 
  geom_smooth(data=sales, aes(x=year, y=price), se=F)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
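
Those is_1995-style columns aren’t something you have to build yourself; R’s model.matrix() constructs them from the formula. A small illustrative peek (the example rows here are just the first listing from each of three years):

# grab one listing from each of the first three years
example_rows = sales %>% 
  filter(year %in% c(1995, 1996, 1997)) %>% 
  group_by(year) %>% 
  slice(1) %>% 
  ungroup()
# the design matrix: one 0/1 column per year, no intercept because of the -1
model.matrix(~ as.factor(year) - 1, data = example_rows)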

2.5

Are there any issues with the model that you can identify? There are three things I would keep in mind here. First, recall that you can use the residuals(model) function to extract your model’s prediction errors, and you can also get predictions using predict(model). Second, you may find it useful to create columns containing model predictions or errors. Finally, remember the typical issues that models can have, such as heteroskedasticity in residuals, nonlinearity in response, or bias when predictions are grouped in a way the model ignores.

Tons! The model is biased: its prediction errors vary systematically across most of the other categorical variables. For example, by property type:

sales %>% 
  mutate(residual = residuals(model1)) %>% 
  group_by(property_type) %>% 
  summarize(mean(residual))
## # A tibble: 4 x 2
##   property_type `mean(residual)`
##   <chr>                    <dbl>
## 1 detached                97605.
## 2 flat                   -34589.
## 3 semidetached             -201.
## 4 terraced               -18128.

But it’s not really biased across years, since we’ve already included those as group variables:

sales %>% 
  mutate(residual = residuals(model1)) %>% 
  group_by(year) %>% 
  summarize(mean(residual))
## # A tibble: 23 x 2
##     year `mean(residual)`
##    <dbl>            <dbl>
##  1  1995         1.38e-12
##  2  1996         2.11e-12
##  3  1997        -4.27e-13
##  4  1998        -7.00e-14
##  5  1999         7.18e-13
##  6  2000         5.05e-12
##  7  2001         1.45e-12
##  8  2002        -4.24e-12
##  9  2003         1.36e-11
## 10  2004        -4.26e-13
## # ... with 13 more rows

You can also see it’s heteroskedastic using similar methods:

sales %>% 
  mutate(residual = residuals(model1)) %>% 
  group_by(year) %>% 
  summarize(sd(residual))
## # A tibble: 23 x 2
##     year `sd(residual)`
##    <dbl>          <dbl>
##  1  1995         51741.
##  2  1996         42123.
##  3  1997         46776.
##  4  1998         54421.
##  5  1999         57615.
##  6  2000         80572.
##  7  2001         82635.
##  8  2002         94216.
##  9  2003         97064.
## 10  2004        111873.
## # ... with 13 more rows

The standard deviation of the residuals doubles within the first 10 years!
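
One way to see this spreading variance at a glance (a sketch; any plot of residuals by year will do):

sales %>% 
  mutate(residual = residuals(model1)) %>% 
  ggplot(aes(x = factor(year), y = residual)) + 
  # the boxes widen noticeably as the years go on
  geom_boxplot() + 
  labs(x = "year", y = "residual")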

2.6

Would you use this model to predict next year’s prices? Why or why not? In fact, you can’t; the model only understands years as different groups, and does not know, for instance, that years follow one another. This is because you’ve treated them as a factor, rather than as a numeric variable.
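
You can verify this for yourself; asking the group-based model about a year it has never seen produces an error about a new factor level rather than a prediction (wrapped in try() here so it doesn’t halt the script):

# 2018 is not one of the factor levels model1 was fit on, so this errors
try(predict(model1, data.frame(year = 2018)))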

3. A Linear Model

Fit a linear model that predicts the log listing price over the years, treating “year” as a continuous variable and including the property type as a group variable. Be careful! In this model, are years numbers or categories?

model2 = lm(log(price) ~ year + property_type, data=sales)

3.1

In this model, what does the estimate for “year” represent? Remember, changing the units of \(y\) or \(X\) will change your interpretation of the slope in a model.

This represents the typical change in log(price) for each year that passes. Thinking hard about how logs work, you might also know that if you model log(y) ~ X, then your beta values represent (approximately) the percentage change in y for a unit change in X. So the slope here actually represents the estimated annual percentage change in house prices!
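
As a quick sanity check using the year coefficient reported below (0.0718), the implied annual increase is:

# a slope of 0.0718 on log(price) implies roughly a 7.4% increase per year
(exp(0.0718) - 1) * 100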

3.2

Why is the intercept negative? Can you adjust X or y to make this more reasonable? Remember, the interpretation of the intercept is our guess for \(y\) when \(X=0\) (\(\mathbf{E}[y|X=0]\)). As long as you don’t change the units of \(X\), you can shift it around arbitrarily.

This is because the intercept is our prediction of the (log) house price in year 0… almost 2000 years ago. Since the line can only go straight in this model, that prediction is completely unrealistic. To adjust the “starting” year, we can shift the \(X\) values over. If we have \(X\), then we can build a new variable \(Z = X - k\), where \(k\) is the year we want the intercept to describe. For example, we can plug in \(k=1995\) and fit the following model on \(Z = X - 1995\). This model’s intercept then describes the predicted log house price when \(Z=0\), which corresponds to \(X=1995\):

model2 = lm(log(price) ~ year_shifted + property_type, data=sales %>% mutate(year_shifted = year - 1995))
model2
## 
## Call:
## lm(formula = log(price) ~ year_shifted + property_type, data = sales %>% 
##     mutate(year_shifted = year - 1995))
## 
## Coefficients:
##               (Intercept)               year_shifted  
##                   11.5832                     0.0718  
##         property_typeflat  property_typesemidetached  
##                   -0.6344                    -0.4507  
##     property_typeterraced  
##                   -0.5797

You can see that, since the scales of y and X remained the same, the slope also remains the same. This kind of shifting is very common, since we aim to make the quantities we estimate as useful as possible. You’ll also note that the property_type coefficients remain the same, because we’re only changing the value of the intercept; the contrasts with the intercept stay fixed. Fitting the model with separate intercepts by group would shift all of the group intercepts together:

lm(log(price) ~ year + property_type - 1, data=sales)
## 
## Call:
## lm(formula = log(price) ~ year + property_type - 1, data = sales)
## 
## Coefficients:
##                      year      property_typedetached  
##                    0.0718                  -131.6490  
##         property_typeflat  property_typesemidetached  
##                 -132.2834                  -132.0998  
##     property_typeterraced  
##                 -132.2287
lm(log(price) ~ year_shifted + property_type - 1, data=sales %>% mutate(year_shifted = year - 1995))
## 
## Call:
## lm(formula = log(price) ~ year_shifted + property_type - 1, data = sales %>% 
##     mutate(year_shifted = year - 1995))
## 
## Coefficients:
##              year_shifted      property_typedetached  
##                    0.0718                    11.5832  
##         property_typeflat  property_typesemidetached  
##                   10.9488                    11.1325  
##     property_typeterraced  
##                   11.0035

You can see for yourself that this represents four separate lines, each with different intercepts but the same slope:

# get the unique property types
property_types = unique(sales$property_type)
# get the range of years
years = min(sales$year):max(sales$year)
# make a scenario asking for predictions for each property type in each year
scenarios = expand_grid(property_type=property_types, year=years)

scenarios %>% 
  # shift the year, since the model is fit on year_shifted
  mutate(year_shifted = year - 1995) %>% 
  # make a prediction for the scenarios
  mutate(prediction=predict(model2, .)) %>%
  # start a plot, which will use the scenario data
  ggplot() +
  # add the house price data
  geom_jitter(data=sales, aes(x=year, y=log(price), color=property_type), alpha=.1) + 
  # and plot the lines
  geom_line(aes(x=year, y=prediction, color=property_type))

3.3

In which year is this model’s prediction most biased? least biased?

The two biggest underpredictions occur in these years:

year_errors = sales %>% 
  mutate(residual = residuals(model2)) %>% 
  group_by(year) %>% 
  summarize(av_error = mean(residual)) %>% 
  arrange(av_error)

year_errors %>% tail(2)
## # A tibble: 2 x 2
##    year av_error
##   <dbl>    <dbl>
## 1  2007    0.255
## 2  2004    0.255

And the two biggest overpredictions here:

year_errors %>% head(2)
## # A tibble: 2 x 2
##    year av_error
##   <dbl>    <dbl>
## 1  1995   -0.267
## 2  1996   -0.228

For the least biased years, you need the smallest absolute errors:

year_errors %>% mutate(abs_error = abs(av_error)) %>% arrange(abs_error) %>% head(2)
## # A tibble: 2 x 3
##    year  av_error abs_error
##   <dbl>     <dbl>     <dbl>
## 1  2009 -0.000222  0.000222
## 2  2001  0.0169    0.0169

While still small, these are nowhere near the effectively-zero averages (on the order of 1e-12) we saw from the group-based model.

3.4

What’s the model’s prediction of the listing price of a detached house in 1999? How does this compare to your guesses from Sections 1 & 2? Remember that this model is trained using the log of price, not price. To get back the “price” from a “log of price” variable, you need to use exp(); the exp() “un-does” the log(). For example, log(750000) = 13.5278285, and exp(13.5278285) = 750000.

This can be done in the same manner as before, but now we need to take care of our transformations… Build a scenario:

scenario = data.frame(year = 1999, property_type="detached")

adjust for any transformations you did to \(X\):

scenario = scenario %>% mutate(year_shifted = year - 1995)

and then predict:

pred_logprice = predict(model2, scenario)
pred_logprice
##        1 
## 11.87035

Since we have transformed y, we need to un-transform the prediction to get back the original units:

exp(pred_logprice)
##        1 
## 142964.9

Given information about the detached house, our prediction is much higher than the year-only guess from Section 2, and is much closer to the \(141262\) we got in Section 1. The difference here is that our \(X\) has changed: this model knows about property type, and treats year as a continuous variable rather than a categorical one. \(\mathbf{E}[y|X]\) changes because the nature of our \(X\) changes!

3.5

Would you use this model to predict next year’s prices? Why or why not?

You can, because the model treats years not as a group, but as a continuous variable. This means it’s possible to think about a “next” year.
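
As a minimal sketch (assuming the linear trend simply continues), build a scenario for a year beyond the data, here 2018, remembering the year shift and the exp() back-transformation:

scenario_2018 = data.frame(year_shifted = 2018 - 1995, property_type = "detached")
# back-transform from log(price) to price
exp(predict(model2, scenario_2018))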

3.6 (Challenge)

Train the same model, but use only data from before 1999. What’s this model’s prediction of the listing price in 1999? How does this compare to the other three predictions?

This model represents the kind of guess we actually have to make when the stakes are high. Our past guesses are all in-sample predictions, in that we’re predicting things the model has already seen. Here, though, we artificially restrict the data so that the model knows nothing about 1999, and we predict based only on what we knew before 1999. This concept of out-of-sample accuracy is extremely important for data science because of its focus on prediction, so we’ll cover ways to estimate it later in the course.

model3 = lm(log(price) ~ year + property_type, data=sales %>% filter(year < 1999))
pred_logprice_pre1999 = predict(model3, data.frame(year=1999, property_type='detached'))
exp(pred_logprice_pre1999)
##        1 
## 126005.6