Levi John Wolf (levi.john.wolf[at]bristol.ac.uk)

Office hours: 2-5 PM Mondays (calendly.com/ljwolf) or, of course, by request.

- Full Schedule
- Blackboard Forum (requires login)
- Midterm Assignment (data, answers)
- Topic Survey: Regression Trees win!

Lectures are 5PM Monday Local Time, delivered *online*.

Labs are 9AM Tuesday morning local time, delivered *in person*.

Geographic data science is an important emerging set of practices and skills that have become useful in a wide variety of environmental and social sciences. This module introduces students to core concepts in the arrangement and analysis of data. Beyond linear modelling, this module offers students an “instrumental” knowledge of various high-level methods in data science, but also offers a “deeper” route to understanding the more fundamental concepts and theory behind many of the estimators used in day-to-day data science.

The purpose of this module is twofold. Its immediate aims are to ensure that students are provided a working introduction to common concepts and concerns that practicing geographic data scientists face. It will include some practical programming and data cleaning skills, but is mainly oriented towards statistical analysis. **This is not a programming course**, but it *requires* some basic programming at the outset to prepare for analysis. Instead, **this course is focused on analysis**, and successful students will need to be able to conduct an analysis from start to finish.

This course is based on a solid understanding of multivariate regression. If you would like to refresh your memory/understanding of linear regression, please consider the review reading listed below in the reading section of this document.

A short diagnostic quiz to check your background knowledge is here. You can take it as many times as you like. Your responses are anonymous, and will not be connected to your grade in any way.

If you intend to use your own computer for the unit, make sure you have installed:

- RStudio
- R version 4.0 or higher (those on Windows may find `installr::updateR` useful)
- The `tidyverse` package
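As a minimal sketch of how you might check this setup on your own computer, the snippet below tests the running R version and whether `tidyverse` is installed. The variable names `r_ok` and `has_tidyverse` are only illustrative, and the snippet deliberately prints a hint rather than installing anything for you.

```r
# Minimal setup check; run in the R console or in RStudio.

# TRUE if the running R is version 4.0 or higher.
r_ok <- getRversion() >= "4.0.0"

# TRUE if the tidyverse package is installed (checked without attaching it).
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (!r_ok) {
  message("R ", as.character(getRversion()), " is older than 4.0; please update R.")
}
if (!has_tidyverse) {
  message("tidyverse is missing; run install.packages(\"tidyverse\").")
}
```

If neither message appears, your installation should meet the requirements listed above.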

The course will be structured in four blocks:

- **Tidy**: Learn some theory behind why some data is “easy” to work with, and how to leverage this theory to do better analysis. (2 weeks)
- **Visualization**: Learn about color, structure, and presentation of scientific diagrams. (2 weeks)
- **Regression**: (Re)learn regression as a “supervised” learning problem, focusing on making good guesses given your information. (2+3 weeks)
- **Student Choice**: Vote on topics, such as “Clustering,” “Data Reduction,” “NetCDF”, or “Smoothing Regression” (1 week)

Final marks are based on a midterm exam and a final. The midterm is 40% of your overall mark, and the final is 60%. We will provide answers on the “interim” workbooks as the course progresses. There will be one “consolidation” review before the final. For each assessment, answer keys will be posted after the due date, and the answers will be walked through in class. The midterm will cover the first two topics. The final will be *cumulative*, meaning that you’ll be expected to know how to tidy and visualize by then.

In addition to the timetabled lectures, there may be pre-recorded videos to help explain or discuss specific components of the reading. All lectures will be delivered live online.

The labs are intended as time for *peer teaching and learning*, so fostering a sense of community is critical for the module.

Data for assessments will be uploaded to Blackboard and linked in the schedule at the bottom of this syllabus. The data required for the course is uploaded here, as well as on Blackboard.

Readings are listed in the schedule. Please attempt the reading each week before the timetabled lecture. In some weeks, there may also be a short recorded lecture to clarify the reading. Readings for the module will be drawn primarily from three sources.

- R4DS, `R` for data science, by Garrett Grolemund & Hadley Wickham. This source is free to all and publicly available.
- FDV, Fundamentals of data visualization, by Claus Wilke. It will generally be useful in this module as a reference for plotting and as an example of first-class visualization style. The book itself has no code, but you can refer to the R Markdowns used to build the book on the GitHub page by clicking the Rmd file.
- ISL, Introduction to Statistical Learning, by Gareth James et al. It is available as a PDF from the author’s website. This book is the best simple introduction to data science concepts out there.
- SR is an introduction to theory-driven statistical modeling. It’s the single best book (I think) for learning how cutting-edge contemporary statisticians think about doing statistical analyses. It is slightly too advanced for this course, but simple excerpts may be recommended from time to time.

Often, ISL and SR contain very different developments of the same material. Broadly speaking, this arises from the fact that ISL is written from a “machine learning” perspective and SR is written from a “statistical” perspective. After the schedule, I discuss where “alternative” readings can be used to understand or cover the topic from a different perspective. You *do not* have to read both sources.

For reference, other good books to review and consolidate your programming and computation knowledge include:

- GR, Geocomputation in `R`, by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow. This is free to all and publicly available.
- AR, Advanced R, by Hadley Wickham. This is free to all and publicly available. It will not generally be useful in this module, but is good to know about if needed.
- ARM, Data Analysis using Regression and Multilevel/Hierarchical Models, by Andrew Gelman and Jennifer Hill. This source is not free/public, but is available on Blackboard through the university library.

Lectures are held synchronously on Zoom at 5PM Mondays local time.

One lab practical is held each week on Tuesday at 9AM local time.

I appreciate that this does not leave much time for consolidating your knowledge from lecture. So, *do the reading before the lecture*, and be proactive in scheduling appointments in my Monday Afternoon Office Hours.

Don’t ask, just book.

For all materials I have written, if you change the `.html` at the end of the URL to `.Rmd`, you can download the original R Markdown for the assignment. For example, the first comprehension material is available at `https://ljwolf.org/teaching/gds/t1.html`, and the R Markdown used to build that material is `https://ljwolf.org/teaching/gds/t1.Rmd`.
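The same substitution can be done in R itself. This is only a sketch: `html_to_rmd` is a hypothetical helper name, not something provided with the course materials, and the download step is left commented out because it needs internet access.

```r
# Hypothetical helper (not part of the course materials): swap a trailing
# ".html" for ".Rmd" in a materials URL.
html_to_rmd <- function(url) {
  sub("\\.html$", ".Rmd", url)
}

html_url <- "https://ljwolf.org/teaching/gds/t1.html"
rmd_url <- html_to_rmd(html_url)
print(rmd_url)

# To save a local copy, uncomment the next line (needs internet access):
# download.file(rmd_url, destfile = basename(rmd_url))
```

Anchoring the pattern with `$` means only the file extension at the very end of the URL is changed.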

Block | Week starting | Topic | Reading | Materials |
---|---|---|---|---|
Tidy | 27 September | The normal form for data | R4DS 12.1-2, Paper | T1 |
Tidy | 4 October | A vocabulary for data shaping | R4DS 5, 12.3-4 | T2 |
Viz | 18 October | On the Grammar of Graphics | FDV 1-4 | V1 |
Viz | 25 October | A taxonomy of plots | FDV 5,9,12,14 | V2 |
Reg I | 1 November | Theory of Statistical Learning | ISL 2.1; SR 1.1-2 | R1.1 |
Reg I | 8 November | Regression as a supervised learning task | ISL 3.1-2 | R1.2 |
Reg II | 15 November | Consolidation week | ISL 3.3-5 | MA R1.2A |
Reg II | 22 November | Moving beyond the normal task | ISL 4.1-3 | R2.2 |
Reg II | 29 November | Justifying your conclusions | ISL 5.1 | R2.3 |
Topic | 6 December | Student Choice! | ISL 8.1-2 | Trees |
Close | 13 December | Review and Consolidation | | Mock Final |

*NOTE: abbreviations used in the table are covered in the reading section of this document.*

SR’s chapter on linear regression covers similar material to ISL, but focuses on the statistical perspective. This means the two are very different: whereas ISL provides a more “classical” presentation of regression for applied settings, SR focuses on explaining the conceptual basis for regression, working from the basic *distributional theory* of regression up to regression itself. SR’s chapter 5 is, again, similar to ISL 3.3-3.5 but with much greater philosophical and conceptual depth. Equivalents of ISL 4.1-3 exist in SR 9.2, but the treatment may again be more statistical than desired. SR 6 is, again, an analogue of ISL 12, but the two approach the material from *very* different perspectives.