What do you mean?

Today, we will use the actual sales prices of property listings in the WECA area. The best answers will provide both useful visualizations and textual discussion of the correct answer. Code is also useful. In most cases, the answer can be provided in a line of code, a plot, and two or three sentences.

1. Doing Data Science Mentally

Let’s get you thinking about conditional probabilities. (These can all be done easily with filter() and group_by().) To do this, read in the sales.csv data. If a listing is:

1.1

from 1999, what’s your best guess as to the price of that house?

1.2

about a detached house in 1999, what’s your best guess as to the price of that house?
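As a rough sketch of the filter() and group_by() approach mentioned above, the snippet below computes conditional means on a tiny made-up table. The column names `year`, `type`, and `price` are assumptions; the real sales.csv may name them differently, and the numbers here are invented purely for illustration.

```r
library(dplyr)

# Toy stand-in for sales.csv. The column names (year, type, price) and all
# values are assumptions for illustration only.
sales <- tibble(
  year  = c(1998, 1999, 1999, 1999, 2000),
  type  = c("detached", "detached", "terraced", "detached", "terraced"),
  price = c(90000, 100000, 60000, 120000, 65000)
)
# With the real data you would instead start from: sales <- read.csv("sales.csv")

# Conditioning on year only: summarise the prices of listings from 1999.
guess_year <- sales %>%
  filter(year == 1999) %>%
  summarise(mean_price = mean(price), median_price = median(price))

# Conditioning on year and type together: one row per (year, type) cell.
guess_year_type <- sales %>%
  group_by(year, type) %>%
  summarise(mean_price = mean(price), n = n(), .groups = "drop")

print(guess_year)
print(guess_year_type)
```

Whether the mean or the median is the better "best guess" depends on how skewed the prices are, which is worth checking with a plot.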

1.3

Is the relationship between house price and year linear? Is it “homoskedastic”? That is, does its variance stay the same over time?

2. One Linear Model

Fit a linear model that predicts listing price over the years, treating “year” as a group variable. (Be careful! In this model, are years numbers or categories?) Interpreting this model:

2.1

In this model, what does the estimate for each “year” represent?

2.2

Which years are not “statistically significant?” Why might this be the case?

2.3

What’s the model’s prediction of the listing price of a detached house in 1999? How does this compare to your guess from section 1?

2.4 (Challenge)

Is the estimated relationship between price and year in this model linear?

2.5

Are there any issues with the model that you can identify? There are three things I would keep in mind here. First, recall that you can use the residuals(model) function to extract your model prediction errors and you can also get predictions using predict(model). Second, you may find it useful to create columns containing model predictions or errors. Finally, remember the typical issues that models can have, such as heteroskedasticity in residuals, nonlinearity in response, or bias when predictions are grouped in a way the model ignores.
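One possible way to follow these hints is sketched below on simulated data. The data frame, its column names (`year`, `price`), and the simulated price process are all assumptions standing in for sales.csv; only predict(), residuals(), and lm() come from the text above.

```r
# Simulated stand-in for the real listings; column names and the price
# process are assumptions made only so the example runs on its own.
set.seed(1)
sales <- data.frame(year = rep(1995:2004, each = 30))
sales$price <- 50000 * exp(0.08 * (sales$year - 1995)) *
  exp(rnorm(nrow(sales), sd = 0.3))

# A model that treats year as a category (one mean per year).
model <- lm(price ~ factor(year), data = sales)

# Store predictions and errors as columns so they can be grouped and plotted.
sales$predicted <- predict(model)
sales$error     <- residuals(model)

# Mean error per year checks for bias within groups;
# the standard deviation per year checks whether the spread changes over time.
mean_error_by_year <- tapply(sales$error, sales$year, mean)
sd_error_by_year   <- tapply(sales$error, sales$year, sd)

print(round(mean_error_by_year, 6))
print(round(sd_error_by_year))
```

Grouping the same `error` column by a variable the model ignores (property type, for instance, if the data has one) is a quick way to look for the third kind of issue.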

2.6

Can you use this model to predict next year’s prices? Why or why not?

3. Another Linear Model

Fit a linear model that predicts the log listing price over the years, treating “year” as a continuous variable. (Be careful! In this model, are years numbers or categories?)
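To make the "numbers or categories" distinction concrete, here is a small sketch on simulated data that fits both kinds of model side by side. The column names and the simulated prices are assumptions; with the real data, check how `year` was read in before fitting.

```r
# Simulated stand-in data; column names (year, price) are assumptions.
set.seed(2)
sales <- data.frame(year = rep(1995:2004, each = 20))
sales$price <- exp(11 + 0.07 * (sales$year - 1995) +
                     rnorm(nrow(sales), sd = 0.25))

# Check how the column is stored before modelling.
print(class(sales$year))

# Year as a category: factor() gives every year its own coefficient.
by_group <- lm(price ~ factor(year), data = sales)

# Year as a number: a single slope describes the trend in log price.
by_trend <- lm(log(price) ~ year, data = sales)

n_group_coefs <- length(coef(by_group))  # intercept plus one term per non-baseline year
n_trend_coefs <- length(coef(by_trend))  # intercept plus one slope
print(c(group = n_group_coefs, trend = n_trend_coefs))
```

Counting coefficients is a quick sanity check that the model treated year the way you intended.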

3.1

In this model, what does the estimate for “year” represent? Remember, changing the units of \(y\) or \(X\) will change your interpretation of the slope in a model.

3.2

Why is the intercept negative? Can you adjust \(X\) or \(y\) to make this more reasonable? Remember, the interpretation of the intercept is our guess for \(y\) when \(X=0\) (\(\mathbf{E}[y|X=0]\)). As long as you don’t change the units of \(X\), you can shift it around arbitrarily.
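The effect of shifting \(X\) can be sketched on simulated data as follows. The column names, the simulated log prices, and the choice of 1995 as the new zero point are all assumptions made for illustration; any reference year would work the same way.

```r
# Simulated stand-in data; column names and values are assumptions.
set.seed(3)
sales <- data.frame(year = rep(1995:2004, each = 20))
sales$log_price <- 11 + 0.07 * (sales$year - 1995) +
  rnorm(nrow(sales), sd = 0.25)

# Raw year: the intercept is the fitted log price at year 0,
# far outside the range of the data.
raw <- lm(log_price ~ year, data = sales)

# Shifted year: same units, but zero now means 1995 (an arbitrary choice).
sales$years_since_1995 <- sales$year - 1995
shifted <- lm(log_price ~ years_since_1995, data = sales)

print(coef(raw))
print(coef(shifted))
```

Only the intercept moves between the two fits; the slope and the fitted values are unchanged, because the units of \(X\) are the same.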

3.3

In which year is this model’s prediction most biased? Least biased?

3.4

What’s the model’s prediction of the listing price of a detached house in 1999? How does this compare to your guesses from Sections 1 & 2? Remember that this model is trained using the log of price, not price. To get back the “price” from a “log of price” variable, you need to use exp(); the exp() “un-does” the log(). For example, log(750000) ≈ 13.5278285, and exp(13.5278285) ≈ 750000.
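A minimal sketch of the back-transformation, again on simulated data: the column names and the simulated prices are assumptions, while log(), exp(), lm(), and predict() are the standard R functions.

```r
# exp() undoes log(): this round trip returns the original number.
round_trip <- exp(log(750000))
print(round_trip)

# Simulated stand-in data; column names (year, price) are assumptions.
set.seed(4)
sales <- data.frame(year = rep(1995:2004, each = 20))
sales$price <- exp(11 + 0.07 * (sales$year - 1995) +
                     rnorm(nrow(sales), sd = 0.25))

log_model <- lm(log(price) ~ year, data = sales)

# predict() returns a value on the log scale, because that is what was modelled.
log_pred <- predict(log_model, newdata = data.frame(year = 1999))

# Convert back to the price scale before comparing with earlier guesses.
price_pred <- exp(log_pred)
print(c(log_scale = unname(log_pred), price_scale = unname(price_pred)))
```

Comparing a log-scale prediction directly against a price is an easy mistake to make, so it helps to name the two quantities differently.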

3.5

Would you use this model to predict next year’s prices? Why or why not?

3.6 (Challenge)

Train the same model, but use only data from before 1999. What’s this model’s prediction of the listing price in 1999? How does this compare to the other three predictions?
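One way to set this up is sketched below on simulated data. The column names and simulated prices are assumptions; the only steps taken from the question are restricting the training data to years before 1999 and predicting for 1999.

```r
# Simulated stand-in data; column names (year, price) are assumptions.
set.seed(5)
sales <- data.frame(year = rep(1995:2004, each = 20))
sales$price <- exp(11 + 0.07 * (sales$year - 1995) +
                     rnorm(nrow(sales), sd = 0.25))

# Keep only listings from before 1999 for training.
training <- subset(sales, year < 1999)

# Same specification as before, fitted on the restricted data.
early_model <- lm(log(price) ~ year, data = training)

# 1999 is now outside the training range, so this is an extrapolation.
pred_1999 <- exp(predict(early_model, newdata = data.frame(year = 1999)))
print(unname(pred_1999))
```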