Geographic Data Science

Purpose

Geographic data science is an emerging set of practices and skills that has become useful across a wide variety of environmental and social sciences. This module introduces students to core concepts in the arrangement and analysis of data. Beyond linear modelling, it offers students an “instrumental” knowledge of various high-level methods in data science, and also a “deeper” route to understanding the fundamental concepts and theory behind many of the estimators used in day-to-day data science.

The module’s immediate aim is to give students a working introduction to the common concepts and concerns that practicing geographic data scientists face. It includes some practical programming and data cleaning skills, but is mainly oriented towards statistical analysis. This is not a programming course, although it requires some basic programming at the outset to prepare for analysis. The focus is on analysis itself, and successful students will need to be able to conduct an analysis from start to finish.

This course assumes a solid understanding of multivariate regression. If you would like to refresh your understanding of linear regression, please consider the review reading listed in the reading section of this document.

Getting Started

A short diagnostic quiz to check your background knowledge is here. You can take it as many times as you like. Your responses are anonymous, and will not be connected to your grade in any way.

If you intend to use your own computer for the unit, make sure you have installed:

RStudio
R version 4.0 or higher (those on Windows may find installr::updateR useful)
The tidyverse package
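
As a rough sketch, the requirements above can be checked from the R console. It uses only base R. The variable names are illustrative, and the install line is left commented out so that nothing is installed unless you choose to run it.

```r
# Check the installed R version against the 4.0 minimum listed above.
r_ok <- getRversion() >= "4.0.0"

# Check whether the tidyverse package is installed, without attaching it.
tidyverse_ok <- requireNamespace("tidyverse", quietly = TRUE)

if (!r_ok) {
  message("R ", as.character(getRversion()), " is older than 4.0; please update R.")
}
if (!tidyverse_ok) {
  message("tidyverse is not installed; run install.packages(\"tidyverse\").")
}

# Uncomment to install the tidyverse from CRAN:
# install.packages("tidyverse")
```

If both checks pass silently, your own computer should be ready for the unit.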

Mark Structure

The course will be structured in four blocks:

Tidy: Learn some theory behind why some data is “easy” to work with, and how to leverage this theory to do better analysis. (2 weeks)
Visualization: Learn about color, structure, and presentation of scientific diagrams. (2 weeks)
Regression: (Re)learn regression as a “supervised” learning problem, focusing on making good guesses given your information. (2+3 weeks)
Student Choice: Vote on topics, such as “Clustering,” “Data Reduction,” “NetCDF”, or “Smoothing Regression” (1 week)

Final marks are based on a midterm exam and a final. The midterm is worth 40% of your overall mark and the final 60%. The midterm will cover the first two topics. The final will be cumulative, meaning that you’ll be expected to know how to tidy and visualize by then. We will provide answers to the “interim” workbooks as the course progresses, and there will be one “consolidation” review before the final. For each assessment, answer keys will be posted after the due date, and the answers will be walked through in class.
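
As an illustration of the weighting, the overall mark is a weighted average of the two assessments. The marks below are made-up values, not real results, and the variable names are purely illustrative.

```r
# Weights taken from the mark structure above: midterm 40%, final 60%.
weights <- c(midterm = 0.4, final = 0.6)

# Hypothetical marks out of 100, for illustration only.
marks <- c(midterm = 62, final = 70)

# Overall mark as the weighted sum of the two assessments.
overall <- sum(weights * marks[names(weights)])
overall
# 66.8
```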

In addition to the timetabled lectures, there may be pre-recorded videos to help explain or discuss specific components of the reading. All lectures will be delivered live online.

The labs are intended as time for peer teaching and learning, so fostering a sense of community is critical for the module.

Materials

Data for assessments will be uploaded to Blackboard and linked in the schedule at the bottom of this syllabus. The data required for the course is uploaded here, as well as on Blackboard.

Reading

Readings are listed in the schedule. Please attempt the reading each week before the timetabled lecture. In some weeks, there may also be a short recorded lecture to clarify the reading. Readings for the module will be drawn primarily from four sources.

R4DS, R for Data Science, by Garrett Grolemund & Hadley Wickham. This source is free to all and publicly available.
FDV, Fundamentals of Data Visualization, by Claus Wilke. It will generally be useful in this module as a reference for plotting and as an example of first-class visualization style. The book itself has no code, but you can refer to the R Markdowns used to build the book on the GitHub page by clicking the Rmd file.
ISL, An Introduction to Statistical Learning, by Gareth James et al. It is available as a PDF from the authors’ website. This book is the best simple introduction to data science concepts out there.
SR, Statistical Rethinking, by Richard McElreath, is an introduction to theory-driven statistical modeling. It’s the single best book (I think) for learning how cutting-edge contemporary statisticians think about doing statistical analyses. It is slightly too advanced for this course, but simple excerpts may be recommended from time to time.

Often, ISL and SR contain very different developments of the same material. Broadly speaking, this is because ISL is written from a “machine learning” perspective and SR from a “statistical” perspective. After the schedule, I discuss where “alternative” readings can be used to understand or cover a topic from a different perspective. You do not have to read both sources.

For reference, other good books to review and consolidate your programming and computation knowledge include:

GR, Geocomputation with R, by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow. This is free to all and publicly available.
AR, Advanced R, by Hadley Wickham. This is free to all and publicly available. It will not generally be useful in this module, but is good to know about if needed.
ARM, Data Analysis using Regression and Multilevel/Hierarchical Models, by Andrew Gelman and Jennifer Hill. This source is not free/public, but is available on Blackboard through the university library.

Schedule

Lectures are held synchronously on Zoom at 5PM Mondays local time.

One lab practical is held each week on Tuesday at 9AM local time.

I appreciate that this does not leave much time for consolidating your knowledge from the lecture. So, do the reading before the lecture, and be proactive in scheduling appointments in my Monday afternoon office hours.

Don’t ask, just book.

For all materials I have written, if you change the .html at the end of the URL to .Rmd, you can download the original R Markdown for the assignment. For example, the first comprehension material is available at https://ljwolf.org/teaching/gds/t1.html, and the R Markdown used to build that material is https://ljwolf.org/teaching/gds/t1.Rmd.
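
The extension swap can also be done in R, as in the minimal sketch below. The helper name rmd_url is hypothetical (not part of any package), and the download line is left commented out so that nothing is fetched unless you run it.

```r
# Hypothetical helper: swap a trailing ".html" for ".Rmd" in a materials URL.
rmd_url <- function(html_url) {
  sub("\\.html$", ".Rmd", html_url)
}

html_url <- "https://ljwolf.org/teaching/gds/t1.html"
rmd_url(html_url)
# "https://ljwolf.org/teaching/gds/t1.Rmd"

# Uncomment to save the R Markdown source into the working directory:
# download.file(rmd_url(html_url), destfile = basename(rmd_url(html_url)))
```

Anchoring the pattern with $ means that only a trailing .html is replaced, so other parts of the URL are left alone.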

Block	Week Starting	Topic	Reading	Materials
Tidy	27 September	The normal form for data	R4DS 12.1-2, Paper	T1
Tidy	4 October	A vocabulary for data shaping	R4DS 5, 12.3-4	T2
Viz	18 October	On the Grammar of Graphics	FDV 1-4	V1
Viz	25 October	A taxonomy of plots	FDV 5,9,12,14	V2
Reg I	1 November	Theory of Statistical Learning	ISL 2.1; SR 1.1-2	R1.1
Reg I	8 November	Regression as a supervised learning task	ISL 3.1-2	R1.2
Reg II	15 November	Consolidation week	ISL 3.3-5	MA R1.2A
Reg II	22 November	Moving beyond the normal task	ISL 4.1-3	R2.2
Reg II	29 November	Justifying your conclusions	ISL 5.1	R2.3
Topic	6 December	Student Choice!	ISL 8.1-2	Trees
Close	13 December	Review and Consolidation		Mock Final

NOTE: abbreviations used in the table are covered in the reading section of this document.

Alternative Readings

SR’s chapter on linear regression covers similar material to ISL, but from a statistical perspective. The two treatments are therefore very different: whereas ISL provides a more “classical” presentation of regression for applied settings, SR explains the conceptual basis for regression, working up from basic distributional theory to regression itself. SR 5 is similar to ISL 3.3-3.5, but with much greater philosophical and conceptual depth. Equivalents of ISL 4.1-3 exist in SR 9.2, though the treatment may again be more statistical than desired. SR 6 is an analogue of ISL 12, but the two approach the material from very different perspectives.