In the topic selection quiz, we had 17 voters (out of an electorate of 26 people), for a turnout of 65.4 percent! The responses from Microsoft Forms are shown below:
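For the record, here’s that turnout arithmetic as a quick check:

round(17 / 26 * 100, 1)
## [1] 65.4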

library(readxl)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(knitr)
responses <- read_excel("../content/data/topic-survey.xlsx", sheet='Form1')
responses %>% head(1) %>% kable()
| ID | Start time | Completion time | Email | Rank the following options below in terms of which topic you’d like to study for the final section of the course, either by clicking and dragging the option to its desired place, or by clicking/ta… |
|---|---|---|---|---|
| 1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | Splines and friends [ISL Ch. 7] Splines (and other “Generalized Additive Models”) are an extension of typical regression models that allow for nonlinearity. They’re what you see by default any time you use “geom_smooth()” in ggplot2 (which is a kernel regression called LOWESS). They’re most useful when trends are nonlinear. They see loads of applications in biogeography, epidemiology, public health, and conservation. ;Unsupervised Learning by Data Reduction: reducing redundant variables [ISL Ch. 12.2] Oftentimes, data that we analyze is redundant, meaning that two (or more) variables often measure basically the same thing. Data Reduction is a practice involving a variety of approaches (such as principal components analysis) that can “compress” many different variables into a few “composite” variables to simplify analysis. This is very often done when building indexes or other hybrid measures of an underlying phenomenon that is not directly measured. Classic examples of this are in the SoVI social vulnerability index used in assessing hazard risk or the Index of Multiple Deprivation in the UK. As with all unsupervised learning techniques, there is no “right answer,” only various ways to assess quality. ;Unsupervised Learning with Clustering: building “types” of data [ISL Ch. 12.4] Clustering involves building up “types” of observations as a kind of statistical shorthand to describe your data. This is often used in demographic research and ecology, and is one of the most widely used methods for data exploration. Clustering methods generally are useful when you want to learn about structure in your data, and understand what observations are similar to one another. As an “unsupervised technique,” there generally is no “right answer,” but there are metrics that tell you which answers are better than others. ;Multilevel modelling: learning from context [SR 12] Multilevel models are a technique from statistics that sees increasing adoption across social and environmental science. The principle behind multilevel models is similar to that in other data science approaches: nest simpler models within one another. Multilevel models allow you to specify how parts of your model may themselves be outcomes from another process. For example, your income may depend on your age, your job, and your seniority. Your seniority also depends on your age. A multilevel model will recognize this, and can simultaneously estimate the relationship between age and seniority, and use that to “correctly” assign how much information about your earnings comes from age directly, versus what information about age “leaks in” through seniority. ;Regression trees and forests [ISL Ch. 8] Regression trees are “rule-based” predictors of an outcome. That is, they learn the associations between an outcome and the input data with a “decision tree” (see, for instance, one predicting survival on the Titanic: https://en.wikipedia.org/wiki/Decision_tree_learning#/media/File:Decision_Tree.jpg). With enough variables, we can build a forest of decision trees that can be useful in predicting new data, and also which gives us an indication of which variables are “useful” in predicting an outcome. These see wide use across social and environmental sciences. ;None of the above If you rank this option, tell me what you want in the free response below. Otherwise, leave this last! ; |

This data is not tidy! In order for me to run an election, I need to make it tidy.

The first step, I think, is to make the column names easier to work with. So, let’s shorten them:

colnames(responses) <- c('id', 'start', 'finish', 'email', 'ranking')

Next, Microsoft Forms stitches together the text of the ranked options using a “;” separator. So, I can split the rankings apart using separate()!

responses_ranked <- responses %>% 
  separate(ranking, sep=';', into=c('1','2','3','4','5','6')) 
## Warning: Expected 6 pieces. Additional pieces discarded in 17 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17].
responses_ranked %>% head(2) %>% kable()
| id | start | finish | email | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | Splines and friends [ISL Ch. 7] … | Unsupervised Learning by Data Reduction [ISL Ch. 12.2] … | Unsupervised Learning with Clustering [ISL Ch. 12.4] … | Multilevel modelling [SR 12] … | Regression trees and forests [ISL Ch. 8] … | None of the above … |
| 2 | 2021-11-02 13:26:01 | 2021-11-02 13:27:12 | anonymous | Unsupervised Learning by Data Reduction [ISL Ch. 12.2] … | Unsupervised Learning with Clustering [ISL Ch. 12.4] … | Splines and friends [ISL Ch. 7] … | Regression trees and forests [ISL Ch. 8] … | Multilevel modelling [SR 12] … | None of the above … |
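An aside on the warning above: each ranking string ends with a trailing “;”, so separate() finds an empty seventh piece and throws it away. Since discarding it is exactly what we want, we could say so explicitly with tidyr’s extra argument, which also silences the warning. A minimal sketch:

responses_ranked <- responses %>% 
  separate(ranking, sep=';', into=c('1','2','3','4','5','6'), extra='drop')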

You can see that the ranked responses are now spread across columns: each column name records a rank, and each value contains the topic given that rank. This means there is no explicit “topic” variable! We need to pivot the data longer to capture this in a tidy format. I suggest a dataset where “topic” and “rank” are separate columns:

responses_long <- responses_ranked %>% 
  pivot_longer('1':'6', names_to='rank', values_to='topic')
responses_long %>% head(3) %>% kable()
| id | start | finish | email | rank | topic |
|---|---|---|---|---|---|
| 1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 1 | Splines and friends [ISL Ch. 7] … |
| 1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 2 | Unsupervised Learning by Data Reduction: reducing redundant variables [ISL Ch. 12.2] … |
| 1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 3 | Unsupervised Learning with Clustering: building “types” of data [ISL Ch. 12.4] … |

Finally, I want to cut off all the extra text that just served to explain the topic to you. To do this, I’ll use separate() again to discard everything after the first square bracket, which I used to indicate the readings for the topics. (The “None of the above” option contains no bracket, which is why the warning below reports missing pieces in 17 rows; those values pass through intact.)

responses_final <- responses_long %>% 
  separate(topic, into=c('topic', NA), sep='\\[')
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 17 rows [6, 12,
## 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96, 102].
responses_final %>% head(6) %>% kable()
| id | start | finish | email | rank | topic |
|---|---|---|---|---|---|
| 1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 1 | Splines and friends |
| 1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 2 | Unsupervised Learning by Data Reduction: reducing redundant variables |
| 1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 3 | Unsupervised Learning with Clustering: building “types” of data |
| 1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 4 | Multilevel modelling: learning from context |
| 1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 5 | Regression trees and forests |
| 1 | 2021-11-02 13:24:02 | 2021-11-02 13:25:04 | anonymous | 6 | None of the above If you rank this option, tell me what you want in the free response below. Otherwise, leave this last! |
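Two optional cleanups for this step. The warning above fires because the “None of the above” option contains no square bracket; passing fill='right' tells separate() to expect short values and fill them silently. Also, splitting at “[” leaves a trailing space on every topic name, which str_trim() can remove. A sketch of the same step with both tweaks (responses_trimmed is a hypothetical alternative; I keep the original responses_final below):

responses_trimmed <- responses_long %>% 
  separate(topic, into=c('topic', NA), sep='\\[', fill='right') %>%
  mutate(topic = str_trim(topic)) # drop the space left before each '['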

Understanding the Preferences

With this, we can do some interesting analytics. First, we can just make a simple crosstab of the responses:

responses_final %>% 
  xtabs(~ topic + rank, data=.) %>%
  kable()
| topic | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Multilevel modelling: learning from context | 6 | 3 | 1 | 3 | 4 | 0 |
| None of the above … | 0 | 0 | 0 | 0 | 0 | 17 |
| Regression trees and forests | 6 | 4 | 4 | 2 | 1 | 0 |
| Splines and friends | 2 | 4 | 3 | 4 | 4 | 0 |
| Unsupervised Learning by Data Reduction: reducing redundant variables | 1 | 2 | 6 | 3 | 5 | 0 |
| Unsupervised Learning with Clustering: building “types” of data | 2 | 4 | 3 | 5 | 3 | 0 |
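If you prefer pictures to crosstabs, the same counts can also be drawn as a heatmap. This sketch is not part of the original analysis; it uses str_trunc() to shorten the topic names for the axis labels:

responses_final %>% 
  mutate(topic = str_trunc(topic, width=40)) %>% # shorten labels for plotting
  count(topic, rank) %>% 
  ggplot(aes(x=rank, y=topic, fill=n)) +
  geom_tile() +
  geom_text(aes(label=n), colour='white') +
  labs(x='rank', y=NULL, fill='votes')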

From this, we can tell a few things:

  1. No one ranked the “choose your own” topic above last place! Reasonable, as it’s always hard to win as a write-in. But, I might’ve picked a write-in anyway if it were a good enough topic!
  2. Multilevel Modelling and Regression trees/forests are tied for the most first choices. Let’s focus on this by computing the number of first choices for each option:
responses_final %>% 
  group_by(id) %>% # for each person, make a sub-dataframe
  arrange(rank) %>% # sort by rank, so each person's top choice comes first
  summarize(choice = first(topic)) %>% # grab the top-ranked topic in each sub-df
  group_by(choice) %>% # now, group by the first choices
  summarize(n_first_choice = n()) %>% # and count the number of first choices
  arrange(n_first_choice) # show the result, fewest first
## # A tibble: 5 x 2
##   choice                                                          n_first_choice
##   <chr>                                                                    <int>
## 1 "Unsupervised Learning by Data Reduction: reducing redundant v…              1
## 2 "Splines and friends "                                                       2
## 3 "Unsupervised Learning with Clustering: building \"types\" of …              2
## 4 "Multilevel modelling: learning from context "                               6
## 5 "Regression trees and forests "                                              6

To resolve this tie, let’s use two methods: Instant Runoff Voting and Borda Count. Let’s hope they agree, since there’s no guarantee!

Instant Runoff Voting

To run an IRV, we eliminate the topic with the fewest first-choice votes and promote those folks’ next choices. Here, the fewest first choices went to the Unsupervised Learning by Data Reduction topic. So, we remove this option and re-compute the top-ranked choice for each person:

responses_final %>% 
  # remove the Unsupervised Learning by data reduction topic
  filter(!str_detect(topic, '^Unsupervised Learning by')) %>%
  # same steps as before
  group_by(id) %>% 
  arrange(rank) %>% 
  # now this may select topics of rank 1 
  # (for folks whose first choice is still in the running) 
  # or 2 (for the one person who wanted unsupervised learning most)
  summarize(choice = first(topic)) %>%
  group_by(choice) %>% 
  summarize(n_first_choice = n()) %>%
  arrange(n_first_choice) %>% 
  kable()
| choice | n_first_choice |
|---|---|
| Splines and friends | 2 |
| Unsupervised Learning with Clustering: building “types” of data | 3 |
| Multilevel modelling: learning from context | 6 |
| Regression trees and forests | 6 |

We’re still tied at the top, so we need to eliminate the lowest-ranked option again: the Splines topic:

responses_final %>% 
  # remove the Unsupervised Learning by data reduction topic
  filter(!str_detect(topic, '^Unsupervised Learning by')) %>%
  # remove the splines topic
  filter(!str_detect(topic, '^Splines')) %>%
  # same steps as before
  group_by(id) %>% 
  arrange(rank) %>% 
  # now this may select topics of rank 1 
  # (for folks whose first choice is still in the running) 
  # or 2 (for the one person who wanted unsupervised learning most)
  # or possibly three (if the UL person also ranked splines 2nd)
  summarize(choice = first(topic)) %>%
  group_by(choice) %>% 
  summarize(n_first_choice = n()) %>%
  arrange(n_first_choice) %>% 
  kable()
| choice | n_first_choice |
|---|---|
| Unsupervised Learning with Clustering: building “types” of data | 4 |
| Multilevel modelling: learning from context | 6 |
| Regression trees and forests | 7 |

So, Regression trees and forests wins in an instant runoff 🥳!
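The two elimination rounds above were done by hand. For completeness, here is a minimal sketch (not from the original analysis) of a function that loops the same logic, eliminating the weakest topic each round. Note that it runs a full IRV, continuing until one topic holds an outright majority of ballots, which is a stricter stopping rule than the tie-break we needed above:

irv_winner <- function(ballots) {
  remaining <- unique(ballots$topic)
  repeat {
    tally <- ballots %>%
      filter(topic %in% remaining) %>%
      group_by(id) %>%
      arrange(rank, .by_group=TRUE) %>%
      summarize(choice = first(topic)) %>% # each voter's top surviving choice
      count(choice, sort=TRUE)
    # a strict majority of ballots wins outright
    if (max(tally$n) > sum(tally$n) / 2) return(tally$choice[1])
    # otherwise, eliminate the topic with the fewest first choices
    # (ties at the bottom are broken arbitrarily here)
    remaining <- setdiff(remaining, tally$choice[which.min(tally$n)])
  }
}
# usage: drop the write-in option first, then run the election
irv_winner(responses_final %>% filter(!str_detect(topic, '^None')))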

Borda Count

If you don’t like this model of election, let’s look at the Borda count. Here, we give each option “points” based on its rank: a voter’s first choice gets 6 points, their second gets 5, and so on down to their last choice, which gets only one point. This is very easy to compute using a tidy recipe:

responses_final %>%
  # compute the "score" for each choice:
  mutate(score = 7-as.numeric(rank)) %>%
  # group by the topic
  group_by(topic) %>% 
  # get the topic's total score:
  summarize(overall_score = sum(score)) %>%
  # show me the winners!
  arrange(desc(overall_score)) %>%
  kable()
| topic | overall_score |
|---|---|
| Regression trees and forests | 80 |
| Multilevel modelling: learning from context | 72 |
| Unsupervised Learning with Clustering: building “types” of data | 65 |
| Splines and friends | 64 |
| Unsupervised Learning by Data Reduction: reducing redundant variables | 59 |
| None of the above … | 17 |

So, Regression trees and forests wins here, too 🎉!
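As a final sanity check, each ballot hands out 6 + 5 + 4 + 3 + 2 + 1 = 21 points, so our 17 ballots should distribute 21 × 17 = 357 points in total. The overall_score column above indeed sums to 357:

sum(6:1) * 17
## [1] 357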