# Geographic Data Science

#### Updated: 2021-11-26

Levi John Wolf (levi.john.wolf[at]bristol.ac.uk)

Office hours: 2-5 PM Mondays (calendly.com/ljwolf) or, of course, by request.

## Quick Info:

Lectures are 5PM Monday Local Time, delivered online.

Labs are 9AM Tuesday morning local time, delivered in person.

## Purpose

Geographic data science is an emerging set of practices and skills that has become useful across a wide variety of environmental and social sciences. This module introduces students to the core concepts in the arrangement and analysis of data. Beyond linear modelling, it offers students an “instrumental” knowledge of various high-level methods in data science, as well as a “deeper” route to understanding the fundamental concepts and theory behind many of the estimators used in day-to-day data science.

The purpose of this module is twofold. Its immediate aims are to ensure that students receive a working introduction to the common concepts and concerns that practicing geographic data scientists face. It will include some practical programming and data cleaning skills, but it is mainly oriented towards statistical analysis. This is not a programming course, although some basic programming is required at the outset to prepare for analysis. The focus is on analysis, and successful students will need to be able to conduct an analysis from start to finish.

This course is based on a solid understanding of multivariate regression. If you would like to refresh your memory/understanding of linear regression, please consider the review reading listed below in the reading section of this document.

## Getting Started

A short diagnostic quiz to check your background knowledge is here. You can take it as many times as you like. Your responses are anonymous, and will not be connected to your grade in any way.

If you intend to use your own computer for the unit, make sure you have installed:

• RStudio
• R version 4.0 or higher (those on Windows may find installr::updateR useful)
• The tidyverse package
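
One way to check a personal machine against the list above is a short R session. This is a minimal sketch rather than an official setup script: the helper name `check_setup` is hypothetical, it cannot check for RStudio itself, and it only reports what is missing instead of installing anything.

```r
# Hypothetical helper: reports whether this machine meets the unit's requirements.
check_setup <- function(min_r = "4.0.0", packages = c("tidyverse")) {
  # getRversion() returns the running R version as a comparable object
  r_ok <- getRversion() >= min_r
  # requireNamespace() returns FALSE (rather than erroring) when a package is absent
  installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
  list(r_ok = r_ok, missing_packages = packages[!installed])
}

status <- check_setup()
if (!status$r_ok) {
  message("Please update R to version 4.0 or higher.")
}
if (length(status$missing_packages) > 0) {
  message(
    "Missing packages: ",
    paste(status$missing_packages, collapse = ", "),
    ". Install them with install.packages()."
  )
}
```

If both checks pass silently, the R side of the setup should be ready for the first lab.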

## Mark Structure

The course will be structured in four blocks:

• Tidy: Learn some theory behind why some data is “easy” to work with, and how to leverage this theory to do better analysis. (2 weeks)
• Visualization: Learn about color, structure, and presentation of scientific diagrams. (2 weeks)
• Regression: (Re)learn regression as a “supervised” learning problem, focusing on making good guesses given your information. (2+3 weeks)
• Student Choice: Vote on topics, such as “Clustering,” “Data Reduction,” “NetCDF”, or “Smoothing Regression” (1 week)

Final marks are based on a midterm exam and a final. The midterm is worth 40% of your overall mark, and the final 60%. The midterm will cover the first two topics. The final will be cumulative, meaning that you’ll be expected to know how to tidy and visualize by then. We will provide answers to the “interim” workbooks as the course progresses, and there will be one “consolidation” review before the final. For each assessment, answer keys will be posted after the due date, and the answers will be walked through in class.
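
The weighting can be written out as a small calculation. The sketch below assumes both assessments are marked on the same 0-100 scale and ignores any rounding or other university rules; the function name `overall_mark` and the example marks are hypothetical.

```r
# Weights taken from this syllabus: midterm 40%, final 60%.
overall_mark <- function(midterm, final, w_midterm = 0.4, w_final = 0.6) {
  # Guard against weights that do not sum to one
  stopifnot(isTRUE(all.equal(w_midterm + w_final, 1)))
  w_midterm * midterm + w_final * final
}

# Hypothetical marks, purely for illustration
overall_mark(midterm = 60, final = 70)
```

With these illustrative marks, the overall mark is 0.4 × 60 + 0.6 × 70 = 66, which shows that the final carries more of the overall mark than the midterm.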

In addition to the timetabled lectures, there may be pre-recorded videos to help explain or discuss specific components of the reading. All lectures will be delivered live online.

The labs are intended as time for peer teaching and learning, so fostering a sense of community is critical for the module.

## Materials

Data for assessments will be uploaded to Blackboard and linked from the schedule at the bottom of this syllabus. The data required for the course is uploaded here, as well as on Blackboard.

Readings are listed in the schedule. Please attempt the reading each week before the timetabled lecture. In some weeks, there may also be a short recorded lecture to clarify the reading. Readings for the module will be drawn primarily from four sources.

• R4DS, R for Data Science, by Garrett Grolemund & Hadley Wickham. This source is free to all and publicly available.
• FDV, Fundamentals of Data Visualization, by Claus Wilke. This will generally be useful in this module as a reference for plotting and as an example of first-class visualization style. The book itself has no code, but you can refer to the R Markdown files used to build the book on its GitHub page by clicking the relevant Rmd file.
• ISL, Introduction to Statistical Learning, by Gareth James et al. It is available as a PDF from the authors’ website. This book is the best simple introduction to data science concepts out there.
• SR is an introduction to theory-driven statistical modeling. It’s the single best book (I think) for learning how cutting-edge contemporary statisticians think about doing statistical analyses. It is slightly too advanced for this course, but simple excerpts may be recommended from time to time.

Often, ISL and SR develop the same material in very different ways. Broadly speaking, this is because ISL is written from a “machine learning” perspective and SR is written from a “statistical” perspective. After the schedule, I discuss where “alternative” readings can be used to understand or cover a topic from a different perspective. You do not have to read both sources.

For reference, other good books to review and consolidate your programming and computation knowledge include:

• GR, Geocomputation in R, by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow. This is free to all and publicly available.
• AR, Advanced R, by Hadley Wickham. This is free to all and publicly available. It will not generally be useful in this module, but it is good to know about if needed.
• ARM, Data Analysis using Regression and Multilevel/Hierarchical Models, by Andrew Gelman and Jennifer Hill. This source is not free/public, but is available on Blackboard through the university library.

## Schedule

Lectures are held synchronously on Zoom at 5PM Mondays local time.

One lab practical is held each week on Tuesday at 9AM local time.

I appreciate that this does not leave much time for consolidating your knowledge from the lecture. So, do the reading before the lecture, and be proactive in scheduling appointments during my Monday afternoon office hours.

For all materials I have written, if you change the .html at the end of the URL to .Rmd, you can download the original R Markdown for the assignment. For example, the first comprehension material is available at https://ljwolf.org/teaching/gds/t1.html, and the R Markdown used to build that material is https://ljwolf.org/teaching/gds/t1.Rmd.
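
The same substitution can be done programmatically. The snippet below is a minimal sketch: the helper name `rmd_url` is hypothetical, and it assumes the URL ends in `.html` exactly as in the example above.

```r
# Hypothetical helper: swap a trailing ".html" for ".Rmd" in a materials URL.
rmd_url <- function(html_url) {
  # "\\.html$" matches a literal ".html" only at the end of the string
  sub("\\.html$", ".Rmd", html_url)
}

t1_source <- rmd_url("https://ljwolf.org/teaching/gds/t1.html")
print(t1_source)

# To save a local copy (requires an internet connection), uncomment:
# download.file(t1_source, destfile = "t1.Rmd")
```

URLs that do not end in `.html` are returned unchanged, so the helper is safe to apply to a whole vector of links.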

| Block | Date | Topic | Reading | Materials |
|---|---|---|---|---|
| Tidy | 27 September | The normal form for data | R4DS 12.1-2, Paper | T1 |
| Tidy | 4 October | A vocabulary for data shaping | R4DS 5, 12.3-4 | T2 |
| Viz | 18 October | On the Grammar of Graphics | FDV 1-4 | V1 |
| Viz | 25 October | A taxonomy of plots | FDV 5, 9, 12, 14 | V2 |
| Reg I | 1 November | Theory of Statistical Learning | ISL 2.1; SR 1.1-2 | R1.1 |
| Reg I | 8 November | Regression as a supervised learning task | ISL 3.1-2 | R1.2 |
| Reg II | 15 November | Consolidation week | ISL 3.3-5 | MA R1.2A |
| Reg II | 22 November | Moving beyond the normal task | ISL 4.1-3 | R2.2 |
| Reg II | 29 November | Justifying your conclusions | ISL 5.1 | R2.3 |
| Topic | 6 December | Student Choice! | ISL 8.1-2 | Trees |
| Close | 13 December | Review and Consolidation | | Mock Final |

NOTE: abbreviations used in the table are covered in the reading section of this document.