This assessment will test your knowledge & abilities to work with data in R. Complete what you can; incomplete answers will still be able to achieve partial credit, and non-working code or descriptions of what you think you should do will also receive partial credit. Also, if a question has a (c) next to it, it’s considered a “Challenge” question! So, you can still score OK if you leave it blank. Submit the HTML or PDF of the “knitted” document for maximum points.

Data for the assessment is available alongside this document on the course blackboard, or here.

Section 1: Concepts

The data sets shown in the tabs below are included with your midterm assessment. Each of the data sets may have one (or more) “tidy” issues, and thus may violate the three rules of tidy data we’ve discussed before in the course:

  1. Each variable must have its own column.
  2. Each sample must have its own row.
  3. Every value must have its own cell.

Using this information,1 describe the following data sets in terms of their “tidyness.” Are the data sets tidy? If so, why? If not, why not? Note that I’m asking for a short paragraph; no code is needed to answer these fully.

Sometimes, I also provide a rationale for why the data is displayed this way. This* does not mean *the data is tidy!

Note: click the tabs to cycle through each of the data sets.

Movies

This data set describes the finances of movies in terms of the movie budget (budget), the domestic (US) gross box office revenue (domgross) and the international box office revenue (intgross) from 1970 to 2013.2

movies <- read_csv('./midterm-movies.csv')
movies %>% head(12) %>% kable()
year movie_name finance dollars
1970 Beyond the Valley of the Dolls budget 1000000
1970 Beyond the Valley of the Dolls domgross 9000000
1970 Beyond the Valley of the Dolls intgross 9000000
1971 Escape from the Planet of the Apes budget 2500000
1971 Escape from the Planet of the Apes domgross 12300000
1971 Escape from the Planet of the Apes intgross 12300000
1971 Shaft budget 53012938
1971 Shaft domgross 70327868
1971 Shaft intgross 107190108
1971 Straw Dogs budget 25000000
1971 Straw Dogs domgross 10324441
1971 Straw Dogs intgross 11253821

Aussie Birds

This dataset counts the numbers of bird species recorded in urban or rural parts of bioregions across Australia. The survey was conducted from 2014 to 2015. Shown below are the first 10 columns of six random rows from the dataframe, as there are many more bird species in Australia. This data is formatted this way in order to make the selection of specific species easy.

birds <- read_csv('./midterm-birds.csv')
birds %>% drop_na() %>% arrange(bioregions) %>% sample_n(6) %>% select(1:10) %>% kable()
survey_year urban_rural bioregions Bassian Thrush Chestnut-breasted Mannikin Wild Duck Willie Wagtail Regent Bowerbird Rufous Fantail Spiny-cheeked Honeyeater
2014 Urban South Eastern Highlands 0 0 0 2 0 0 0
2015 Rural Brigalow Belt South 0 0 0 2 0 0 0
2015 Rural Flinders Lofty Block 0 0 0 2 0 0 0
2015 Urban Victorian Midlands 0 0 0 0 0 0 0
2015 Urban South East Coastal Plain 0 0 0 2 0 0 0
2015 Urban Sydney Basin 0 0 3 3 0 2 0

Songs

This dataset describes 32,833 songs on Spotify in terms of statistics that Spotify record about the song.3 Below, I only a show a few columns from the dataset, but a few columns are explained.4 This data is formatted directly from the Spotify API, and so is structured for ease of use in a database.

spotify <- read_csv('./midterm-songs.csv')
spotify %>% select(track_name, track_artist, 
                   track_popularity, 
                   danceability, loudness, duration_ms) %>% 
  sample_n(10, weight=track_popularity) %>% 
  kable()
track_name track_artist track_popularity danceability loudness duration_ms
Stellar Jamila Woods 49 0.684 -9.751 122346
Peeping Tom - Original Mix Jamie Berry 53 0.720 -4.713 210000
Transmission - 2010 Remaster Joy Division 48 0.497 -8.088 215148
Tungba Ajebutter22 54 0.749 -5.850 188421
Bounce Back Big Sean 76 0.780 -5.628 222360
Holy Diver Dio 35 0.469 -9.913 350013
Hit and Run Midnght 43 0.775 -10.914 172203
You should be sad Halsey 86 0.591 -6.350 205473
HUMBLE. Kendrick Lamar 84 0.908 -6.638 177000
Stay High - Habits Remix Tove Lo 69 0.736 -7.036 258387

French Trains

These are the number of trains that have run between any two stations for services operated by the National Society of French Railways, the state-owned rail company in France, from 2015 to 2018. Below, I just show the first ten originating stations (where the trains leave from) and the first four destination stations (where the trains arrive). This kind of data is often called “spatial interaction data,” and is used to measure the “interaction” between different spatial units. It is often presented in this format for usability: readers can scan across a row to quickly compare the number of trains that destinations receive from a specific origin.

trains <- read_csv('./midterm-trains.csv')
trains %>% 
  select(1:5) %>% 
  slice(1:10) %>% 
  kable()
departure_station POITIERS QUIMPER RENNES ST PIERRE DES CORPS
PARIS MONTPARNASSE 31632 16242 38850 24930
TOURS 0 0 0 0
LYON PART DIEU 0 0 5814 0
PARIS EST 0 0 0 0
NANCY 0 0 0 0
STRASBOURG 0 0 0 0
NANTES 0 0 0 0
DUNKERQUE 0 0 0 0
MARSEILLE ST CHARLES 0 0 0 0
BORDEAUX ST JEAN 0 0 0 0

EU Energy

The following table records the percentage energy supplied by different forms of energy across countries in the EU, as well as the “EU-28” and “EA-19” groups of European member nations. This kind of wide-but-short display format is often useful to fit tables like this alongside text in a document.

energy <- read_csv('./midterm-energy.csv')
energy %>% select(1:12) %>% kable()
energy_type EU_28 EA_19 EE CY MT PL NL EL IE IT LV
Conventional thermal 45.9 43.6 93.9 91.4 90.9 90.2 83.9 68.6 68.3 66.0 61.0
Nuclear 25.5 27.0 0.0 0.0 0.0 0.0 2.9 0.0 0.0 0.0 0.0
Hydro 11.8 11.9 0.2 0.0 0.0 1.5 0.1 11.4 3.2 17.6 37.2
Wind 12.2 12.2 6.0 4.6 0.0 8.1 10.9 12.4 28.6 6.2 1.8
Solar 4.0 4.7 0.0 4.0 0.0 0.2 2.2 7.5 0.0 8.2 0.0
Geothermal & others 0.4 0.6 0.0 0.0 9.1 0.0 0.0 0.0 0.0 2.0 0.0
Total 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0

MarioKart

This dataset describes MarioKart 64 “World Record” races. There are typically two “categories” in MarioKart racing: the “Single Lap” category—how fast a racer can complete a single lap—and “Three Lap,” which measures the time to complete a typical three-lap race. Along with the times (in seconds), the date for the most recent world record is recorded. This is often the format in which tables like this are viewed.

mk_records <- read_csv('./midterm-races.csv')
mk_records %>% kable()
track single_lap_record single_lap_record_date three_lap_record three_lap_record_date
Banshee Boardwalk 40.78 2020-06-11 124.09 2021-01-15
Bowser’s Castle 43.15 2021-02-02 132.00 2021-02-02
Choco Mountain 38.02 2020-07-17 115.93 2020-07-14
D.K.’s Jungle Parkway 42.04 2020-10-26 131.62 2019-12-31
Frappe Snowland 38.27 2019-12-25 119.95 2017-12-23
Kalimari Desert 38.96 2020-11-07 123.94 2018-04-20
Koopa Troopa Beach 30.78 2020-10-18 95.25 2020-07-13
Luigi Raceway 37.58 2021-01-10 117.77 2020-11-06
Mario Raceway 27.62 2021-01-26 87.51 2020-08-13
Moo Moo Farm 27.80 2018-12-29 85.93 2020-02-18
Rainbow Road 116.35 2020-09-04 351.87 2020-09-28
Royal Raceway 55.50 2020-06-11 171.25 2020-09-07
Sherbet Land 37.72 2021-02-19 115.15 2021-01-26
Toad’s Turnpike 58.69 2020-09-20 177.80 2020-09-28
Wario Stadium 85.82 2021-01-26 260.01 2019-10-11
Yoshi Valley 31.25 2018-01-18 102.13 2021-01-26

Eco Risk

The following are Paini et al.’s estimates of the potential cost (in millions of USD) of invasive species to the ecosystem of each country. This is a direct digitization of table S2 in the Supplemental Material, so the formatting of the table is decided by concerns of printing and typesetting.5

risk <- read_csv('./midterm-risk.csv', name_repair='minimal')
risk %>% kable()
rank country damage rank country damage rank country damage
1 China $117,290 43 Denmark $1,417 85 Burundi $397.9
2 USA $70,381 44 Nepal $1,411 86 Lithuania $392.4
3 Brazil $33,760 45 Sudan $1,373 87 Moldova $387.5
4 India $33,065 46 Portugal $1,365 88 Armenia $336.0
5 Japan $23,490 47 Belgium $1,351 89 Malaysia $333.0
6 Korea Republic of $14,349 48 Kazakhstan $1,344 90 Bosnia and Herzegovina $327.4
7 Turkey $13,267 49 Czech Republic $1,336 91 Kyrgyzstan $302.0
8 Argentina $13,204 50 Austria $1,304 92 Georgia (Republic) $301.5
9 France $12,532 51 Iraq $1,234 93 Tajikistan $297.1
10 Mexico $11,277 52 Kenya $1,230 94 Ireland $277.6
11 Iran $11,276 53 Mozambique $1,218 95 Lebanon $276.8
12 Nigeria $10,251 54 Cambodia $1,121 96 Nicaragua $264.1
13 Indonesia $9,550 55 Ghana $1,114 97 Rwanda $255.1
14 Thailand $8,066 56 Bulgaria $1,112 98 Mauritius $227.6
15 Australia $7,815 57 Madagascar $1,074 99 Macedonia $218.4
16 Vietnam $7,490 58 Malawi $1,071 100 Congo (Republic of) $212.8
17 Ukraine $6,953 59 Paraguay $1,012 101 Slovenia $202.0
18 Egypt $6,737 60 Guinea $977.5 102 Niger $197.3
19 Canada $6,694 61 Tunisia $949.2 103 Latvia $187.3
20 Pakistan $6,630 62 Ecuador $934.7 104 Panama $161.2
21 Germany $6,481 63 Switzerland $924.1 105 Togo $153.2
22 Bangladesh $5,623 64 Dominican Republic $873.0 106 Jordan $116.4
23 Spain $5,576 65 Jamaica $871.7 107 Guinea-Bissau $114.3
24 Russian Federation $5,084 66 Sri Lanka $829.1 108 Cyprus $108.5
25 Philippines $4,839 67 Yemen $806.0 109 Estonia $102.1
26 Greece $4,342 68 Saudi Arabia NA 110 Fiji NA NA
27 United Kingdom $4,005 69 Honduras $794.3 111 Mongolia $64.7
28 South Africa $3,922 70 Croatia $755.6 112 Luxembourg $64.7
29 Romania $3,524 71 Azerbaijan $730.7 113 Belize $42.2
30 Algeria $2,862 72 New Zealand $639.7 114 Cape Verde $40.8
31 Morocco $2,531 73 Albania $637.2 115 Gambia $37.6
32 Colombia $2,476 74 Finland $600.7 116 Suriname $36.5
33 Poland $2,449 75 Slovakia $573.4 117 Trinidad and Tobago $28.8
34 Ethiopia $2,312 76 Burkina Faso $557.7 118 Vanuatu $23.5
35 Venezuela $2,167 77 Costa Rica $556.8 119 Barbados $20.3
36 Chile $2,095 78 Sweden $546.9 120 Equatorial Guinea NA NA
37 Netherlands $1,981 79 Israel $518.4 121 Malta $14.4
38 Hungary $1,979 80 Uruguay $509.1 122 Qatar $5.0
39 Belarus $1,777 81 Laos $508.2 123 Iceland $4.8
40 Peru $1,580 82 Mali $504.7 124 Singapore $0.7
41 Cameroon $1,574 83 El Salvador $475.1 NA NA NA
42 Italy $1,447 84 Norway $419.4 NA NA NA

Section 2: Cleaning

For each of the data sets in Section 1, can you create a tidy version of the data set?

Movies

hint: this may need to take rows and turn them into columns!

movies %>%

Aussie Birds

hint: this may need to take columns and turn them into rows!

birds %>% 

Songs

hint: pay attention to the three rules; sometimes, you get lucky!

songs %>% 

French Trains

hint: think very carefully about what the variables are here!

trains %>%

EU Energy

hint: sometimes, we must pivot one direction before we pivot another!

energy %>% 

MarioKart

hint: sometimes, you may need to pivot twice in the same direction!

mk_records %>%

Eco Risk

hint: this data frame is uniquely messy! Try splitting it into parts that all look tidy, and then bringing them back together.

risk %>%

Section 3: Analysis

In this section, I’ll ask some specific questions about two of the data sets: Movies & Songs.

Movies

3.1

Which ten movies lost the most money domestically? Are these the same movies that lost the most money overall?

3.2

What is the average budget for a movie in each year?

3.3

Which movie had the largest gap between domestic and overseas box office performance?

3.4

Make a visualization that shows how the budget for movies has increased over time. Discuss how you designed the plot in order to emphasize this message.

3.5

Make a visualization that shows how the typical profit movies make has generally not changed over time, but that a few outliers do make increasingly more money. Discuss how you designed the plot in order to emphasize this message.

3.6(c)

You’re a data scientist working for a movie studio. Your executive is considering whether to take a risk on making a “new” movie, or whether it’d be a safer bet to make a sequel to an existing movie. So, she asks:

Do sequels make more profit per dollar spent than non-sequels?

Can you answer her question?6

Songs

3.7

What’s your best guess about the length of a track with higher than 75% popularity? How about your best guess for the popularity of a track between 2 and 3 minutes long? Which “piece” of information (popularity > 75% or duration between 2 & 3 minutes) gives you more useful information, and why do you think that?7

3.8 (c)

What is the typical “energy” of each of the playlist genres? How about the typical “valence,” meaning “happiness,” of the genres?

3.9 (c)

Make four plots8 to visualize the relationship between danceability and a few variables:

  • tempo
  • energy
  • valence
  • playlist_genre

Make sure to take into account whether the relationship is linear and address overplotting if necessary. Given these plots, what kinds of songs tend to be danceable?

3.10 (c)

Let’s assume that the difference between a band’s median track popularity and its maximum track popularity represents that band’s one-hit-wonderness. If this is the case, what are the Top 10 “one-hit-wonder” bands in the dataset? Given the results, does this comport with your understanding of what a “one hit wonder” is?


  1. You may also find the original Tidy Data paper useful in describing the different commonly-encountered issues in data formatting.↩︎

  2. Remember: I’m just using the knitr::kable() function to print the table all pretty-like in the RMarkdown.↩︎

  3. This is scraped using the spotifyr package.↩︎

  4. danceability measures how suitable a track is for dancing, varies from 0 (not danceable) to 1 (danceable). energy is a measure of the “intensity” of the song, varies from 0 (very relaxing, Shoegaze) to 1 (very energetic, Death Metal). loudness is the average decibel value of the track, varies from -60 to zero. speechiness gives the level of “talkiness” in a track, varies from 0 (no spoken words) to 1 (all spoken words), but tracks over .66 are probably nearly all spoken word, and tracks below .33 are probably songs. acousticness is the same as speechiness, but measuring whether instruments are not amplified; liveness but for whether the track has a live audience. valence tells you whether a track is “happy,” with higher scores indicate happier music. Finally, tempo records the average beats per minute for the track and duration_ms provides the duration of the song in milliseconds. ↩︎

  5. Note that I’m using the name_repair option of read_csv() in order to get exactly the column names that are in midterm-risk.csv. Without this option, readr::read_csv will append extra values to the column names in order to ensure they are each unique. Try this by removing the name_repair argument, or setting it equal to "unique" instead of "minimal".↩︎

  6. Note that you can use the str_detect() function in the tidyverse’s stringr package to give TRUE when a movie name contains 2 or a II, and FALSE otherwise. Also, it’s ok if you accidentally pick up movies that have III or IIII or H20 in their name using this strategy; we’re just making an educated guess.↩︎

  7. You may find it helpful to make a plot!↩︎

  8. It’s OK if they’re totally separate plots! That is, I don’t expect you to use facet_grid()↩︎