This assessment will test your knowledge of, and ability to work with, data in R. Complete what you can: incomplete answers can still earn partial credit, and non-working code or descriptions of what you think you should do will also receive partial credit. Also, if a question has a (c) next to it, it’s considered a “Challenge” question, so you can still score OK if you leave it blank. Submit the HTML or PDF of the “knitted” document for maximum points.

Data for the assessment is available alongside this document on the course blackboard, or here.

Section 1: Concepts

The data sets shown in the tabs below are included with your midterm assessment. Each of the data sets may have one (or more) “tidy” issues, and thus may violate the three rules of tidy data we’ve discussed before in the course:

  1. Each variable must have its own column.
  2. Each sample must have its own row.
  3. Every value must have its own cell.

Using this information,1 describe the following data sets in terms of their “tidiness.” Are the data sets tidy? If so, why? If not, why not? Note that I’m asking for a short paragraph; no code is needed to answer these fully.

Sometimes, I also provide a rationale for why the data is displayed this way. This *does not mean* the data is tidy!

Note: click the tabs to cycle through each of the data sets.

Movies

This data set describes the finances of movies in terms of the movie budget (budget), the domestic (US) gross box office revenue (domgross) and the international box office revenue (intgross) from 1970 to 2013.2

movies <- read_csv('./midterm-movies.csv')
movies %>% head(12) %>% kable()
year movie_name finance dollars
1970 Beyond the Valley of the Dolls budget 1000000
1970 Beyond the Valley of the Dolls domgross 9000000
1970 Beyond the Valley of the Dolls intgross 9000000
1971 Escape from the Planet of the Apes budget 2500000
1971 Escape from the Planet of the Apes domgross 12300000
1971 Escape from the Planet of the Apes intgross 12300000
1971 Shaft budget 53012938
1971 Shaft domgross 70327868
1971 Shaft intgross 107190108
1971 Straw Dogs budget 25000000
1971 Straw Dogs domgross 10324441
1971 Straw Dogs intgross 11253821

This dataset is not tidy. The column “finance” contains the names of different variables, which each need to be turned into their own column. The “dollars” column does contain only dollar amounts, but values in different rows come from the same sample: the movie. Your target should be a dataframe where each row is a movie, with columns for the year, the budget, the domestic gross, and the international gross.

Aussie Birds

This dataset records the count of each bird species observed in urban or rural parts of bioregions across Australia. The survey was conducted from 2014 to 2015. Shown below are the first 10 columns of six random rows from the dataframe, as there are many more bird species in Australia. This data is formatted this way in order to make the selection of specific species easy.

birds <- read_csv('./midterm-birds.csv')
birds %>% drop_na() %>% arrange(bioregions) %>% sample_n(6) %>% select(1:10) %>% kable()
survey_year urban_rural bioregions Bassian Thrush Chestnut-breasted Mannikin Wild Duck Willie Wagtail Regent Bowerbird Rufous Fantail Spiny-cheeked Honeyeater
2015 Rural Victorian Midlands 0 0 0 6 0 1 0
2015 Rural Flinders Lofty Block 0 0 0 2 0 0 0
2015 Rural South Eastern Queensland 0 1 0 8 1 2 0
2015 Rural South East Coastal Plain 1 0 0 2 0 0 0
2015 Urban Brigalow Belt South 0 0 0 1 0 0 0
2014 Urban South Eastern Highlands 0 0 0 2 0 0 0

This dataset is not tidy. The “species” variable is implicit, spread across the columns. In order to tidy it, your final dataframe should have columns for year, urban/rural classification, bioregion, species, and count.

Songs

This dataset describes 32,833 songs on Spotify in terms of statistics that Spotify records about each song.3 Below, I show only a few columns from the dataset, but several of them are explained in the footnotes.4 This data is formatted directly from the Spotify API, and so is structured for ease of use in a database.

spotify <- read_csv('./midterm-songs.csv')
spotify %>% select(track_name, track_artist, 
                   track_popularity, 
                   danceability, loudness, duration_ms) %>% 
  sample_n(10, weight=track_popularity) %>% 
  kable()
track_name track_artist track_popularity danceability loudness duration_ms
My Best Life (feat. Mike Waters) - Club Mix KSHMR 66 0.603 -3.693 198533
We Should Tell the Truth metr 44 0.808 -12.306 267292
Nothing in Return Monsune 52 0.672 -5.747 232200
Are You Gonna Be My Girl - UK Acoustic Version Jet 52 0.672 -5.287 237173
What About Your Friends (Album Radio Edit) [W/Rap] TLC 27 0.752 -4.454 244560
More Than a Feeling Boston 78 0.377 -8.039 285133
Takeaway The Chainsmokers 85 0.528 -8.144 209880
Stay The Night - Featuring Hayley Williams Of Paramore Zedd 61 0.596 -3.109 217347
Killing My Love - Radio Edit THIS IS ELLE 18 0.206 -5.168 194047
That’s Tuff (feat. Quavo) Rich The Kid 75 0.903 -4.313 154000

This dataset is tidy! Each row denotes a song on Spotify. The only reason you might think it’s not tidy is that the track name can contain information on the remix and collaborator(s) on the song. This is true, but where one variable stops and another starts is not always so clear. If the track title includes “(feat. Jay-Z)”, then you can use feature engineering to build a new variable that tries to extract this information, as sketched below. However, the “feat.” or “Remix” info is not an intentionally-coded value in each cell, so it’s less reasonable to suggest that all of track_name contains two (or more) variables.
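For instance, here is a hedged sketch of that feature-engineering idea. The patterns are my assumptions about how featured artists and remixes are labelled in track_name, not part of Spotify’s data dictionary:

spotify_flagged <- spotify %>%
  mutate(
    # TRUE when the title advertises a featured artist, e.g. "(feat. Khalid)"
    has_feature = str_detect(track_name, fixed("feat.")),
    # TRUE when the title advertises a remix, in any capitalization
    is_remix = str_detect(track_name, regex("remix", ignore_case = TRUE))
  )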

French Trains

These are the number of trains that have run between any two stations for services operated by the National Society of French Railways, the state-owned rail company in France, from 2015 to 2018. Below, I just show the first ten originating stations (where the trains leave from) and the first four destination stations (where the trains arrive). This kind of data is often called “spatial interaction data,” and is used to measure the “interaction” between different spatial units. It is often presented in this format for usability: readers can scan across a row to quickly compare the number of trains that destinations receive from a specific origin.

trains <- read_csv('./midterm-trains.csv')
trains %>% 
  select(1:5) %>% 
  slice(1:10) %>% 
  kable()
departure_station POITIERS QUIMPER RENNES ST PIERRE DES CORPS
PARIS MONTPARNASSE 31632 16242 38850 24930
TOURS 0 0 0 0
LYON PART DIEU 0 0 5814 0
PARIS EST 0 0 0 0
NANCY 0 0 0 0
STRASBOURG 0 0 0 0
NANTES 0 0 0 0
DUNKERQUE 0 0 0 0
MARSEILLE ST CHARLES 0 0 0 0
BORDEAUX ST JEAN 0 0 0 0

This dataset is not tidy! The “destination” variable is spread implicitly over the columns. That is, the data is wider than it should be to be tidy. The tidy dataset would have source, destination, and count columns.

EU Energy

The following table records the percentage energy supplied by different forms of energy across countries in the EU, as well as the “EU-28” and “EA-19” groups of European member nations. This kind of wide-but-short display format is often useful to fit tables like this alongside text in a document.

energy <- read_csv('./midterm-energy.csv')
energy %>% select(1:12) %>% kable()
energy_type EU_28 EA_19 EE CY MT PL NL EL IE IT LV
Conventional thermal 45.9 43.6 93.9 91.4 90.9 90.2 83.9 68.6 68.3 66.0 61.0
Nuclear 25.5 27.0 0.0 0.0 0.0 0.0 2.9 0.0 0.0 0.0 0.0
Hydro 11.8 11.9 0.2 0.0 0.0 1.5 0.1 11.4 3.2 17.6 37.2
Wind 12.2 12.2 6.0 4.6 0.0 8.1 10.9 12.4 28.6 6.2 1.8
Solar 4.0 4.7 0.0 4.0 0.0 0.2 2.2 7.5 0.0 8.2 0.0
Geothermal & others 0.4 0.6 0.0 0.0 9.1 0.0 0.0 0.0 0.0 2.0 0.0
Total 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0

This dataset is not tidy! Its transpose, however, is! One “measurement” or “sample” should correspond to a row. So, generally speaking, we’d want one set of observations to all hang logically together, recording all the information we have about a unit of analysis. So, the clearest “tidy” version of this data is long, with one column for “energy type,” one column for “country,” and one column of values (the percent supplied). The EU/EA groupings of countries can be handled either by redefining what the “country” variable means (it could be “energy_market” or “supply_zone”), by separating those values into their own table (sketched in Section 2), or by using them to indicate the aggregation into which each country falls. The “Total” energy type, though, needs to be removed from this representation. I wouldn’t expect the latter of the three grouping options, but either of the first two is reasonable.

One common interpretation of the tidy version of this data is just the transpose of the current dataframe. I can understand this, but it’s not as clear; the “energy type” variable is split across columns in this case.

MarioKart

This dataset describes MarioKart 64 “World Record” races. There are typically two “categories” in MarioKart racing: the “Single Lap” category—how fast a racer can complete a single lap—and “Three Lap,” which measures the time to complete a typical three-lap race. Along with the times (in seconds), the date for the most recent world record is recorded. This is often the format in which tables like this are viewed.

mk_records <- read_csv('./midterm-races.csv')
mk_records %>% kable()
track single_lap_record single_lap_record_date three_lap_record three_lap_record_date
Banshee Boardwalk 40.78 2020-06-11 124.09 2021-01-15
Bowser’s Castle 43.15 2021-02-02 132.00 2021-02-02
Choco Mountain 38.02 2020-07-17 115.93 2020-07-14
D.K.’s Jungle Parkway 42.04 2020-10-26 131.62 2019-12-31
Frappe Snowland 38.27 2019-12-25 119.95 2017-12-23
Kalimari Desert 38.96 2020-11-07 123.94 2018-04-20
Koopa Troopa Beach 30.78 2020-10-18 95.25 2020-07-13
Luigi Raceway 37.58 2021-01-10 117.77 2020-11-06
Mario Raceway 27.62 2021-01-26 87.51 2020-08-13
Moo Moo Farm 27.80 2018-12-29 85.93 2020-02-18
Rainbow Road 116.35 2020-09-04 351.87 2020-09-28
Royal Raceway 55.50 2020-06-11 171.25 2020-09-07
Sherbet Land 37.72 2021-02-19 115.15 2021-01-26
Toad’s Turnpike 58.69 2020-09-20 177.80 2020-09-28
Wario Stadium 85.82 2021-01-26 260.01 2019-10-11
Yoshi Valley 31.25 2018-01-18 102.13 2021-01-26

This data is not tidy! The variable “race type” is spread over the columns, mixed into the names of each of the relevant measurements. This will be somewhat more challenging to tidy! The target dataframe should have one world record for each row, with track, event, time, and date as columns.

Eco Risk

The following are Paini et al.’s estimates of the potential cost (in millions of USD) of invasive species to the ecosystem of each country. This is a direct digitization of table S2 in the Supplemental Material, so the formatting of the table is decided by concerns of printing and typesetting.5

risk <- read_csv('./midterm-risk.csv', name_repair='minimal')
risk %>% kable()
rank country damage rank country damage rank country damage
1 China $117,290 43 Denmark $1,417 85 Burundi $397.9
2 USA $70,381 44 Nepal $1,411 86 Lithuania $392.4
3 Brazil $33,760 45 Sudan $1,373 87 Moldova $387.5
4 India $33,065 46 Portugal $1,365 88 Armenia $336.0
5 Japan $23,490 47 Belgium $1,351 89 Malaysia $333.0
6 Korea Republic of $14,349 48 Kazakhstan $1,344 90 Bosnia and Herzegovina $327.4
7 Turkey $13,267 49 Czech Republic $1,336 91 Kyrgyzstan $302.0
8 Argentina $13,204 50 Austria $1,304 92 Georgia (Republic) $301.5
9 France $12,532 51 Iraq $1,234 93 Tajikistan $297.1
10 Mexico $11,277 52 Kenya $1,230 94 Ireland $277.6
11 Iran $11,276 53 Mozambique $1,218 95 Lebanon $276.8
12 Nigeria $10,251 54 Cambodia $1,121 96 Nicaragua $264.1
13 Indonesia $9,550 55 Ghana $1,114 97 Rwanda $255.1
14 Thailand $8,066 56 Bulgaria $1,112 98 Mauritius $227.6
15 Australia $7,815 57 Madagascar $1,074 99 Macedonia $218.4
16 Vietnam $7,490 58 Malawi $1,071 100 Congo (Republic of) $212.8
17 Ukraine $6,953 59 Paraguay $1,012 101 Slovenia $202.0
18 Egypt $6,737 60 Guinea $977.5 102 Niger $197.3
19 Canada $6,694 61 Tunisia $949.2 103 Latvia $187.3
20 Pakistan $6,630 62 Ecuador $934.7 104 Panama $161.2
21 Germany $6,481 63 Switzerland $924.1 105 Togo $153.2
22 Bangladesh $5,623 64 Dominican Republic $873.0 106 Jordan $116.4
23 Spain $5,576 65 Jamaica $871.7 107 Guinea-Bissau $114.3
24 Russian Federation $5,084 66 Sri Lanka $829.1 108 Cyprus $108.5
25 Philippines $4,839 67 Yemen $806.0 109 Estonia $102.1
26 Greece $4,342 68 Saudi Arabia NA 110 Fiji NA,NA
27 United Kingdom $4,005 69 Honduras $794.3 111 Mongolia $64.7
28 South Africa $3,922 70 Croatia $755.6 112 Luxembourg $64.7
29 Romania $3,524 71 Azerbaijan $730.7 113 Belize $42.2
30 Algeria $2,862 72 New Zealand $639.7 114 Cape Verde $40.8
31 Morocco $2,531 73 Albania $637.2 115 Gambia $37.6
32 Colombia $2,476 74 Finland $600.7 116 Suriname $36.5
33 Poland $2,449 75 Slovakia $573.4 117 Trinidad and Tobago $28.8
34 Ethiopia $2,312 76 Burkina Faso $557.7 118 Vanuatu $23.5
35 Venezuela $2,167 77 Costa Rica $556.8 119 Barbados $20.3
36 Chile $2,095 78 Sweden $546.9 120 Equatorial Guinea NA,NA
37 Netherlands $1,981 79 Israel $518.4 121 Malta $14.4
38 Hungary $1,979 80 Uruguay $509.1 122 Qatar $5.0
39 Belarus $1,777 81 Laos $508.2 123 Iceland $4.8
40 Peru $1,580 82 Mali $504.7 124 Singapore $0.7
41 Cameroon $1,574 83 El Salvador $475.1 NA NA NA
42 Italy $1,447 84 Norway $419.4 NA NA NA

This data is not tidy! It should only contain three columns in the “cleaned” version. The most reasonable way to clean this data may also not involve pivots…

Section 2: Cleaning

For each of the data sets in Section 1, can you create a tidy version of the data set?

Movies

hint: this may need to take rows and turn them into columns!

This one is a very straightforward pivot_wider():

movies_tidy <- movies %>% 
  pivot_wider(id_cols=c(year, movie_name), 
              names_from=finance, 
              values_from=dollars)
movies_tidy %>% head(6) %>% kable()
year movie_name budget domgross intgross
1970 Beyond the Valley of the Dolls 1000000 9000000 9000000
1971 Escape from the Planet of the Apes 2500000 12300000 12300000
1971 Shaft 53012938 70327868 107190108
1971 Straw Dogs 25000000 10324441 11253821
1971 The French Connection 2200000 41158757 41158757
1971 Willy Wonka & the Chocolate Factory 3000000 4000000 4000000

Aussie Birds

hint: this may need to take columns and turn them into rows!

This one is a very straightforward pivot_longer(), where the bird columns are converted into rows (a variant with clearer column names follows the output below):

birds_tidy <- birds %>% pivot_longer(`Bassian Thrush`:`Grey Fantail`)
birds_tidy %>% head(6) %>% kable()
survey_year urban_rural bioregions name value
2014 Urban South Eastern Queensland Bassian Thrush 0
2014 Urban South Eastern Queensland Chestnut-breasted Mannikin 2
2014 Urban South Eastern Queensland Wild Duck 1
2014 Urban South Eastern Queensland Willie Wagtail 12
2014 Urban South Eastern Queensland Regent Bowerbird 0
2014 Urban South Eastern Queensland Rufous Fantail 3
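To match the target column names described in Section 1 (species and count), you can name the new columns explicitly; this is the same pivot, just with clearer labels:

birds_tidy <- birds %>% 
  pivot_longer(`Bassian Thrush`:`Grey Fantail`,
               names_to = "species", values_to = "count")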

Songs

hint: pay attention to the three rules; sometimes, you get lucky!

This one is already tidy:

spotify <- spotify
spotify %>% head(6) %>% kable()
track_id track_name track_artist track_popularity track_album_id track_album_name track_album_release_date playlist_name playlist_id playlist_genre playlist_subgenre danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms
6f807x0ima9a1j3VPbc7VN I Don’t Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx I Don’t Care (with Justin Bieber) [Loud Luxury Remix] 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.748 0.916 6 -2.634 1 0.0583 0.1020 0.00e+00 0.0653 0.518 122.036 194754
0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix Maroon 5 67 63rPSO264uRjW1X5E6cWv6 Memories (Dillon Francis Remix) 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.726 0.815 11 -4.969 1 0.0373 0.0724 4.21e-03 0.3570 0.693 99.972 162600
1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4 All the Time (Don Diablo Remix) 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.675 0.931 1 -3.432 0 0.0742 0.0794 2.33e-05 0.1100 0.613 124.008 176616
75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6 Call You Mine - The Remixes 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.718 0.930 7 -3.778 1 0.1020 0.0287 9.40e-06 0.2040 0.277 121.956 169093
1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ Someone You Loved (Future Humans Remix) 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.650 0.833 1 -4.672 1 0.0359 0.0803 0.00e+00 0.0833 0.725 123.976 189052
7fvUMiyapMsRRxr07cU8Ef Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran 67 2yiy9cd2QktrNvWC2EUi0k Beautiful People (feat. Khalid) [Jack Wins Remix] 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.675 0.919 8 -5.385 1 0.1270 0.0799 0.00e+00 0.1430 0.585 124.982 163049

French Trains

hint: think very carefully about what the variables are here!

This one again is a pivot_longer(), but you need to make sure that only the departure station is kept as an identifier (and so is the only column that gets duplicated down the rows):

trains_tidy <- trains %>% pivot_longer(-departure_station, 
                        names_to='destination_station', 
                        values_to='n_trains')
trains_tidy %>% head(6) %>% kable()
departure_station destination_station n_trains
PARIS MONTPARNASSE POITIERS 31632
PARIS MONTPARNASSE QUIMPER 16242
PARIS MONTPARNASSE RENNES 38850
PARIS MONTPARNASSE ST PIERRE DES CORPS 24930
PARIS MONTPARNASSE PARIS MONTPARNASSE 0
PARIS MONTPARNASSE NANCY 0

EU Energy

hint: sometimes, we must pivot one direction before we pivot another!

This one is a pivot_longer(), with a filter() to drop the “Total” rows so that energy_type only contains true values of the variable. Alternatively, you could pivot_wider() after this, as discussed above, but that is not quite tidy: the energy_type variable would be spread over columns.

energy_tidy <- energy %>% 
  pivot_longer(EU_28:GE, names_to="market", values_to="percent") %>% 
  filter(energy_type != "Total")
energy_tidy %>% head(6) %>% kable()
energy_type market percent
Conventional thermal EU_28 45.9
Conventional thermal EA_19 43.6
Conventional thermal EE 93.9
Conventional thermal CY 91.4
Conventional thermal MT 90.9
Conventional thermal PL 90.2
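Following the note in Section 1, one hedged way to handle the EU_28/EA_19 aggregates is to split them into their own table, so that the remaining rows are all individual countries (the market codes are taken from the table above):

eu_aggregates <- energy_tidy %>% 
  filter(market %in% c("EU_28", "EA_19"))
energy_countries <- energy_tidy %>% 
  filter(!(market %in% c("EU_28", "EA_19")))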

MarioKart

hint: sometimes, you may need to pivot twice in the same direction!

This one can be done by splitting and re-combining the data. It uses two pivots: you can either pivot twice on the original data, or split the dataset in two and pivot each part (a one-step variant using pivot_longer()’s “.value” sentinel is sketched after the code below). This kind of “split and pivot” takes more steps, but each step is easier to understand. As a rule, split the data into bits that are themselves tidy, and then join them back together if that’s useful.

# clean only times
times <- mk_records %>%
  select(track, ends_with("record")) %>%
  pivot_longer(ends_with("record"), 
               names_to='race_type', 
               values_to='duration') %>%
  separate(race_type, c('event', NA, NA))

# clean only dates
dates <- mk_records %>%
  select(track, ends_with('date')) %>% 
  pivot_longer(ends_with("date"), 
               names_to='race_type', 
               values_to='datetime') %>% 
  separate(race_type, c('event', NA, NA, NA))

# merge everything back together, could use the merge or the *_join functions from the reading.
mk_tidy <- inner_join(times, dates, by=c("track", "event"))

mk_tidy %>% head(6) %>% kable()
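As referenced above, here is a hedged one-step variant: pivot_longer() can split the column names itself, with the special ".value" sentinel sending the times and the dates to separate columns. The names_pattern regex is an assumption based on the column names shown above:

mk_tidy2 <- mk_records %>%
  pivot_longer(-track,
               names_to = c("event", ".value"),
               names_pattern = "(single|three)_lap_(record.*)") %>%
  # rename the ".value" columns to match mk_tidy above
  rename(duration = record, datetime = record_date)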

Eco Risk

hint: this data frame is uniquely messy! Try splitting it into parts that all look tidy, and then bringing them back together.

This one basically must be split and recombined. In theory, one could use the “melt and cast” strategy from the Tidy Data paper to first make the data as long as possible, then pivot it based on the columns you want. The pivot_longer() function is less powerful in this way, because it does not convert data types by default. So, the easiest direct approach is the split & recombine, as follows:

risk_tidy <- rbind(risk[1:3], #first set of columns
                   risk[4:6], #second set of columns
                   risk[7:9]  #third set of columns
                   )
risk_tidy %>% head(6) %>% kable()

For the melt & cast strategy from the very first reading, we leverage the fact that the “melt” function forces everything into the same type (if it can) so that it can all be stacked on top of one another. This usually means everything is converted into character data, and we have to manually convert it back at the end:

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
risk %>% 
  melt(id.vars = 'country') %>% 
  pivot_wider(id_cols=country, 
              names_from=variable, 
              values_from=value) %>%
  mutate(rank=as.numeric(rank)) %>%
  head(7) %>% kable()
country rank damage
China 1 $117,290
USA 2 $70,381
Brazil 3 $33,760
India 4 $33,065
Japan 5 $23,490
Korea Republic of 6 $14,349
Turkey 7 $13,267
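If you would rather stay within the tidyverse, here is a hedged sketch: make the repeated column names unique first, then use pivot_longer() with ".value" to stack the three column triplets. The renaming scheme via set_names() is my assumption about how to disambiguate the duplicated names:

risk_tidy2 <- risk %>%
  # rename to rank_1, country_1, damage_1, rank_2, ... so names are unique
  set_names(paste(names(risk), rep(1:3, each = 3), sep = "_")) %>%
  # ".value" keeps rank/country/damage as columns; "block" records the triplet
  pivot_longer(everything(),
               names_to = c(".value", "block"),
               names_sep = "_") %>%
  select(-block) %>%
  arrange(rank)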

Section 3: Analysis

In this section, I’ll ask some specific questions about two of the data sets: Movies & Songs.

Movies

3.1

Which ten movies lost the most money domestically? Are these the same movies that lost the most money overall?

There are a few ways you can do this. The very simple way is to compute the profit, sort on the profit, and then show the two separate dataframes:

movies_tidy %>% 
  ### do the domestic profits
  # compute the domestic profit
  mutate(domestic_profit = domgross - budget) %>%
  # sort the data by domestic profit
  arrange(domestic_profit) %>%
  # grab the lowest 10 profits, which will be the
  # first ten rows:
  head(10) %>%
  kable()
year movie_name budget domgross intgross domestic_profit
2012 John Carter 2.75e+08 73058679 282778100 -201941321
2013 47 Ronin 2.25e+08 38362475 145803842 -186637525
2013 The Lone Ranger 2.75e+08 89289910 259989910 -185710090
2012 Battleship 2.09e+08 65235400 304142720 -143764600
2007 The Golden Compass 2.05e+08 70107728 372234864 -134892272
2013 Jack the Giant Slayer 1.95e+08 65187603 197387603 -129812397
2011 Mars Needs Moms 1.50e+08 21392758 39549758 -128607242
2010 Prince of Persia: The Sands of Time 2.00e+08 90759676 335059676 -109240324
2011 Hugo 1.80e+08 73864507 185033215 -106135493
2005 Stealth 1.38e+08 32116746 76416746 -105883254

You can do this again to get a second dataframe, this time sorted on overall profit:

movies_tidy %>% 
  ### do the total profits
  # compute the total profit
  mutate(profit = intgross - budget) %>%
  # sort the data by domestic profit
  arrange(profit) %>%
  # grab the lowest 10 profits, which will be the
  # first ten rows:
  head(10) %>%
  kable()
year movie_name budget domgross intgross profit
2011 Mars Needs Moms 1.50e+08 21392758 39549758 -110450242
2002 The Adventures of Pluto Nash 1.00e+08 4411102 7094995 -92905005
2013 47 Ronin 2.25e+08 38362475 145803842 -79196158
2001 Monkeybone 7.50e+07 5409517 5409517 -69590483
1999 The 13th Warrior 1.25e+08 32698899 61698899 -63301101
1997 The Postman 8.00e+07 17650704 17650704 -62349296
2005 Stealth 1.38e+08 32116746 76416746 -61583254
2001 Osmosis Jones 7.00e+07 13596911 13596911 -56403089
2003 Timeline 8.00e+07 19480739 26703184 -53296816
2013 R.I.P.D. 1.30e+08 33618855 79019947 -50980053

This is nice, but it makes it hard to see the exact relationship between the top 10 films in either category; we just see the 10 biggest loss-makers for one kind of profit, and then the 10 for the other. We cannot say, for instance, where the 10 domestic losers fall in the overall profit table. Interestingly, the only place we “slice” the data is at the head() call. So, we can string these two computations together, and just do the filtering at the end:

movies_tidy %>% 
  ### do the domestic profits
  # compute the domestic profit
  mutate(domestic_profit = domgross - budget) %>%
  # sort the data by domestic profit
  arrange(domestic_profit) %>%
  # assign the "row number" to each of these movies, so that
  # the movie with the smallest domestic profit gets "domestic_rank" of 1
  # and the next-smallest has a "domestic_rank" of 2
  mutate(domestic_rank = row_number()) %>%
  ### do the foreign profits
  # compute the overall profit
  mutate(profit = intgross - budget) %>%
  # sort the data by this profit
  arrange(profit) %>%
  # again, assign the row number of the sorted data to a rank variable
  mutate(rank = row_number()) %>% 
  # select out the columns we want to see
  select(year, movie_name, domestic_rank, domestic_profit, rank, profit) %>%
  ### keep only the top 10 movies either domestic or foreign:
  filter(domestic_rank <= 10 | rank <= 10) %>%
  # sort by domestic rank, just because:
  arrange(domestic_rank) %>%
  kable()
year movie_name domestic_rank domestic_profit rank profit
2012 John Carter 1 -201941321 510 7778100
2013 47 Ronin 2 -186637525 3 -79196158
2013 The Lone Ranger 3 -185710090 97 -15010090
2012 Battleship 4 -143764600 1210 95142720
2007 The Golden Compass 5 -134892272 1432 167234864
2013 Jack the Giant Slayer 6 -129812397 394 2387603
2011 Mars Needs Moms 7 -128607242 1 -110450242
2010 Prince of Persia: The Sands of Time 8 -109240324 1348 135059676
2011 Hugo 9 -106135493 450 5033215
2005 Stealth 10 -105883254 7 -61583254
2013 R.I.P.D. 15 -96381145 10 -50980053
2002 The Adventures of Pluto Nash 16 -95588898 2 -92905005
1999 The 13th Warrior 17 -92301101 5 -63301101
2001 Monkeybone 33 -69590483 4 -69590483
1997 The Postman 46 -62349296 6 -62349296
2003 Timeline 53 -60519261 9 -53296816
2001 Osmosis Jones 59 -56403089 8 -56403089

Now, we see that the movies that lost the most money domestically can have wildly different profits internationally. For example, John Carter lost the most money domestically, but actually made a profit internationally! Same with Prince of Persia and Hugo. But some movies did lose a lot of money both internationally and domestically, such as 47 Ronin or Mars Needs Moms.

3.2

What is the average budget for a movie in each year?

movies_tidy %>% 
  group_by(year) %>% 
  summarize(avg_budget = mean(budget, na.rm=T)) %>%
  tail(10) %>%
  kable()
year avg_budget
2004 49043124
2005 45754250
2006 41924722
2007 47113533
2008 49651485
2009 50151613
2010 51915116
2011 50357137
2012 64458120
2013 72097980

3.3

Which movie had the largest gap between domestic and overseas box office performance?

Depending on how you solve this problem, you may need to drop the NA values. You can use the drop_na() function.

movies_tidy %>%
  mutate(gap = abs(domgross - intgross)) %>% 
  drop_na(gap) %>%
  arrange(gap) %>% 
  select(year, movie_name, gap) %>%
  tail(10) %>% kable()
year movie_name gap
2009 Ice Age: Dawn of the Dinosaurs 690394990
2012 The Hobbit: An Unexpected Journey 711700000
2003 The Lord of the Rings: The Return of the King 763562762
2011 Transformers: Dark of the Moon 771403533
2011 Pirates of the Caribbean: On Stranger Tides 802600000
2013 Iron Man 3 803700000
2012 Skyfall 804333804
2011 Harry Potter and the Deathly Hallows: Part 2 947100000
1997 Titanic 1527000000
2009 Avatar 2023411357

3.4

Make a visualization that shows how the budget for movies has increased over time. Discuss how you designed the plot in order to emphasize this message.

You may think that something like the following plot illustrates this well:

ggplot(movies_tidy, aes(x=year, y=budget/1000000)) + 
  geom_jitter(alpha=.25) +
  geom_smooth(se=F, color='orangered') + 
  xlab("Year Movie was Made") + 
  ylab("Budget (Millions USD)")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

But, visually, half of this plot is empty! There are a few ways you can address this. First, you could keep the same plot, but focus the y-axis on the range where most budgets (and the average) fall:

ggplot(movies_tidy, aes(x=year, y=budget/1000000)) + 
  geom_jitter(alpha=.1) +
  geom_smooth(se=F, color='orangered') + 
  xlab("Year Movie was Made") + 
  ylab("Budget (Millions USD)") + 
  ylim(0,100)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 191 rows containing non-finite values (stat_smooth).
## Warning: Removed 204 rows containing missing values (geom_point).

Or, you could change the scale:

ggplot(movies_tidy, aes(x=year, y=budget/1000000)) + 
  geom_jitter(alpha=.1) +
  geom_smooth(se=F, color='orangered') + 
  xlab("Year Movie was Made") + 
  ylab("Budget (Millions USD)") + 
  scale_y_log10(limits=c(.1, 200))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 35 rows containing non-finite values (stat_smooth).
## Warning: Removed 44 rows containing missing values (geom_point).

3.5

Make a visualization that shows how the typical profit movies make has generally not changed over time, but that a few outliers do make increasingly more money. Discuss how you designed the plot in order to emphasize this message.

A very simple plot to do this might consider the following:

movies_tidy %>% 
  mutate(profit=intgross - budget) %>%
  ggplot(aes(x=year, y=profit)) +
  geom_jitter(alpha=.1) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).

A more sophisticated perspective might aim to strategically define the outliers, and then label them separately:

outlier_earners <- movies_tidy %>%
  # get the profit
  mutate(profit = intgross - budget) %>%
  # drop any movies that have NA profits
  drop_na(profit) %>%
  ### within each year
  group_by(year) %>%
  # arrange the movies by profit
  arrange(desc(profit)) %>%
  # then compute the gap between each movie and the next-lower earner
  mutate(profit_gap_to_next_movie = profit - lead(profit)) %>%
  # sort on this value
  arrange(profit_gap_to_next_movie) %>% 
  # and drop any NAs from this
  drop_na(profit_gap_to_next_movie) %>%
  # take the 20 movies that have the largest gap between their
  # earnings and the next highest earnings:
  tail(20)
outlier_earners %>% kable()
year movie_name budget domgross intgross profit profit_gap_to_next_movie
2003 The Lord of the Rings: The Return of the King 9.40e+07 377845905 1141408667 1047408667 234943111
1973 The Exorcist 1.20e+07 204868002 402735134 390735134 236618807
1994 The Lion King 7.93e+07 422780140 952880140 873580140 249179615
1999 Star Wars: Episode I - The Phantom Menace 1.15e+08 474544677 1007044677 892044677 259238385
1972 The Godfather 7.00e+06 134966411 268500000 261500000 261098198
2011 Harry Potter and the Deathly Hallows: Part 2 1.25e+08 381011219 1328111219 1203111219 274317143
1977 Close Encounters of the Third Kind 2.00e+07 166000000 337700000 317700000 283448575
1985 Back to the Future 1.90e+07 210609762 383874862 364874862 286285161
1994 Forrest Gump 5.50e+07 329694499 679400525 624400525 290780396
1981 Raiders of the Lost Ark 2.00e+07 248159971 389925971 369925971 296955634
1975 Jaws 1.20e+07 260000000 470700000 458700000 320023583
2001 The Lord of the Rings: The Fellowship of the Ring 1.09e+08 315544750 887217688 778217688 333459969
1996 Independence Day 7.50e+07 306169255 817400878 742400878 334500878
1983 Star Wars: Episode VI - Return of the Jedi 3.25e+07 309205079 572700000 540200000 345736426
1980 Star Wars: Episode V - The Empire Strikes Back 2.30e+07 290271960 534171960 511171960 431218421
1977 Star Wars 1.10e+07 460998007 797900000 786900000 469200000
1993 Jurassic Park 6.30e+07 395708305 1035626872 972626872 556340869
1982 E.T.: The Extra-Terrestrial 1.05e+07 435110554 792965326 782465326 620265326
1997 Titanic 2.00e+08 658672302 2185672302 1985672302 1273985623
2009 Avatar 4.25e+08 760507625 2783918982 2358918982 1561950287

You can see that, clearly, these are the big blockbuster movies that earn a ton of money. So, we can plot the outliers and non-outliers as separate datasets on the same figure to highlight where the outliers sit:

non_outliers = movies_tidy %>% 
  # This ensures that movies that are in the outliers are 
  # dropped from the "non-outliers"
  # breaking it down, 
  # ! means "not", 
  # movie_name %in% outlier_earners$movie_name is True when 
  # the movie_name is contained within the set of outlier_earners
  # movie names.
  filter(!(movie_name %in% outlier_earners$movie_name)) %>%
  # Then, we'll need the profit variable to do the plotting
  mutate(profit=intgross - budget)

ggplot() + 
  # non-outlier points
  geom_jitter(data=non_outliers, 
             aes(x=year, y=profit), color='black', alpha=.1) +
  # non-outlier profit trend
  geom_smooth(data=non_outliers,
            aes(x=year, y=profit), se=F) +
  # outliers
  geom_point(data=outlier_earners, 
             aes(x=year, y=profit), color='red') + 
  # trend among outliers
  geom_smooth(data=outlier_earners, 
              aes(x=year, y=profit), color='red', se=F)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 11 rows containing missing values (geom_point).

3.6(c)

You’re a data scientist working for a movie studio. Your executive is considering whether to take a risk on making a “new” movie, or whether it’d be a safer bet to make a sequel to an existing movie. So, she asks:

Do sequels make more profit per dollar spent than non-sequels?

Can you answer her question?6

So, let’s first look at the movies that have a “2” or “II” in their title:

is_sequel = str_detect(movies_tidy$movie_name, "2|II")
movies_tidy %>% 
  mutate(is_sequel = is_sequel) %>%
  filter(is_sequel) %>% kable()
year movie_name budget domgross intgross is_sequel
1974 The Godfather: Part II 13000000 57300000 57300000 TRUE
1981 Friday the 13th Part 2 1250000 21722776 21722776 TRUE
1981 Halloween II 2500000 25533818 25533818 TRUE
1983 Superman III 39000000 59950623 59950623 TRUE
1984 Star Trek III: The Search for Spock 18000000 76471046 87000000 TRUE
1987 Evil Dead II 3500000 5923044 5923044 TRUE
1989 Back to the Future Part II 40000000 118450002 332000000 TRUE
1990 Back to the Future Part III 40000000 87666629 243700000 TRUE
1990 Child's Play 2 13000000 26904572 34166572 TRUE
1990 Die Hard 2 70000000 117323878 239814025 TRUE
1991 Teenage Mutant Ninja Turtles II: The Secret of the Ooze 25000000 78656813 78656813 TRUE
1991 Terminator 2: Judgement Day 100000000 204859496 516816151 TRUE
1991 The Naked Gun 2 1/2: The Smell of Fear 23000000 86930411 86930411 TRUE
1992 Home Alone 2: Lost in New York 20000000 173585516 358994850 TRUE
1998 Halloween H20: 20 Years Later 17000000 55041738 55041738 TRUE
1999 Toy Story 2 90000000 245852179 511358276 TRUE
2000 28 Days 43000000 37035515 62063972 TRUE
2000 Pokemon: The Movie 2000 30000000 43746923 133946923 TRUE
2002 28 Days Later… 8000000 45064915 82955633 TRUE
2002 Blade II 54000000 81676888 154338601 TRUE
2002 Star Wars: Episode II - Attack of the Clones 115000000 310676740 656695615 TRUE
2002 The Santa Clause 2 65000000 139225854 172825854 TRUE
2003 21 grams 20000000 16248701 59667625 TRUE
2003 Cradle 2 the Grave 25000000 34657731 56434942 TRUE
2003 Jeepers Creepers II 25000000 35623801 35623801 TRUE
2003 Legally Blonde 2: Red, White & Blonde 25000000 90639088 125339088 TRUE
2003 X2 (X-Men 2) 125000000 214949694 407711549 TRUE
2004 2046 12000000 1442338 19202856 TRUE
2004 The Princess Diaries 2: Royal Engagement 45000000 95149435 122071435 TRUE
2005 Miss Congeniality 2: Armed & Fabulous 60000000 48478006 101382396 TRUE
2005 Star Wars: Episode III - Revenge of the Sith 115000000 380270577 848998877 TRUE
2006 Clerks II 5000000 24148068 25894473 TRUE
2006 The Grudge 2 20000000 39143839 70743839 TRUE
2007 28 Weeks Later 15000000 28638916 64232714 TRUE
2007 Hostel: Part II 7500000 17544812 33606409 TRUE
2007 White Noise 2: The Light 10000000 NA 8243567 TRUE
2008 The Sisterhood of the Traveling Pants 2 27000000 44089964 44270131 TRUE
2009 2012 200000000 166112167 788408539 TRUE
2009 The Boondock Saints II: All Saints Day 8000000 10273187 10273187 TRUE
2009 The Taking of Pelham 123 110000000 65452312 152364370 TRUE
2010 127 Hours 18000000 18335230 60735230 TRUE
2010 Iron Man 2 170000000 312433331 623561331 TRUE
2010 Paranormal Activity 2 3000000 84752907 177512032 TRUE
2010 Sex and the city 2 95000000 95347692 294680778 TRUE
2011 Cars 2 200000000 191450875 560155383 TRUE
2012 Men in Black III 215000000 179020854 624821154 TRUE
2012 The Twilight Saga: Breaking Dawn - Part 2 136200000 292324737 832660037 TRUE
2013 2 Guns 61000000 75612460 132493015 TRUE
2013 42 40000000 95020213 95020213 TRUE
2013 Cloudy with a Chance of Meatballs 2 78000000 119640264 271725448 TRUE
2013 Kick-Ass 2 28000000 28795985 60839197 TRUE
2013 Red 2 84000000 53262560 137162560 TRUE
2013 The Smurfs 2 110000000 71017784 348545841 TRUE

Looks pretty good! So, let’s look at the “return”, or profit per dollar, of these two sets of movies:

movies_tidy %>% 
  mutate(is_sequel=is_sequel) %>% 
  mutate(profit = intgross - budget) %>% 
  mutate(return_on_investment = profit/budget) %>%
  group_by(is_sequel) %>% 
  summarise(median_return = median(return_on_investment, na.rm=T),
            ) %>% 
  kable()
is_sequel median_return
FALSE 1.622181
TRUE 2.261692
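As a quick arithmetic check on the comparison below, using the two medians just printed:

(2.261692 - 1.622181) / 1.622181  # ~0.39, i.e. roughly 39% higher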

The median sequel has roughly 39% higher returns than the median non-sequel. But! There is a very long tail:

movies_tidy %>% 
  mutate(is_sequel=is_sequel) %>% 
  mutate(profit = intgross - budget) %>% 
  mutate(return_on_investment = profit/budget) %>%
  ggplot(aes(x=return_on_investment, group=is_sequel, color=is_sequel)) +
  geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).

You can see that the non-sequels have some amazing returns on investment (400 here means that the profit is 400 times the budget!), but this long tail is very sparse compared to the middle of the data. Let’s focus in on the medians, which are somewhere between 1 and 2:

movies_tidy %>% 
  mutate(is_sequel=is_sequel) %>% 
  mutate(profit = intgross - budget) %>% 
  mutate(return_on_investment = profit/budget) %>%
  ggplot(aes(x=return_on_investment, group=is_sequel, color=is_sequel)) +
  geom_boxplot() + 
  xlim(-1,4)
## Warning: Removed 464 rows containing non-finite values (stat_boxplot).

So, we see that sequels tend to have slightly higher returns on investment on average, but that the long tail of non-sequels can yield some seriously high returns, if you get lucky.

Songs

3.7

What’s your best guess about the length of a track with higher than 75% popularity? How about your best guess for the popularity of a track between 2 and 3 minutes long? Which “piece” of information (popularity > 75% or duration between 2 & 3 minutes) gives you more useful information, and why do you think that?7

ggplot(spotify, aes(x=duration_ms/1000/60, y=track_popularity)) +
  geom_hex() + 
  xlab("Duration (minutes)") + 
  ylab("Track Popularity") + 
  scale_fill_viridis_c(option='plasma')

Just describing what we see, most of the popular songs are somewhere between 2.5 and 3 minutes. You can also see that songs of about this length (between 2.5 and 3 minutes) can have basically any popularity… it seems like there are a lot of 3-minute tracks that have zero popularity! Thus, knowing a song is popular tells you much more about its duration than knowing a song is 3 minutes long tells you about its popularity.

Doing this in a table, rather than graphically:

spotify %>% 
  mutate(duration_mins = duration_ms/1000/60) %>%
  filter(track_popularity > 75) %>%
  summarize(mean_duration = mean(duration_mins),
            sd_duration=sd(duration_mins)) %>% kable()
mean_duration sd_duration
3.554781 0.7395706
spotify %>% 
  mutate(duration_mins = duration_ms/1000/60) %>%
  filter((2 <= duration_mins)&(duration_mins <= 3)) %>%
  summarize(mean_popularity = mean(track_popularity),
            sd_popularity=sd(track_popularity)) %>% kable()
mean_popularity sd_popularity
45.48043 23.05022

You can see that the standard deviation of the track popularity is about half of the average popularity of songs between 2 and 3 minutes, while the standard deviation of the track duration is only about a fifth of the mean. This suggests that there is much more variability around the mean popularity for songs between 2 and 3 minutes than there is in the duration of songs at 75% popularity or above!
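That comparison can be made explicit as a coefficient of variation (the standard deviation divided by the mean), computed here directly from the two summary tables above:

c(duration_cv   = 0.7395706 / 3.554781,   # ~0.21: about a fifth
  popularity_cv = 23.05022 / 45.48043)    # ~0.51: about a half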

3.8 (c)

What is the typical “energy” of each of the playlist genres? How about the typical “valence,” meaning “happiness,” of the genres?

spotify %>% 
  group_by(playlist_genre) %>% 
  summarize(valence = median(valence), energy = median(energy)) %>%
  kable()
playlist_genre valence energy
edm 0.370 0.830
latin 0.628 0.729
pop 0.500 0.727
r&b 0.542 0.596
rap 0.517 0.665
rock 0.531 0.775

3.9 (c)

Make four plots8 to visualize the relationship between danceability and a few variables:

  • tempo
  • energy
  • valence
  • playlist_genre

Make sure to take into account whether the relationship is linear and address overplotting if necessary. Given these plots, what kinds of songs tend to be danceable?

library(ggpubr)

tempo <- ggplot(spotify, 
                aes(x=tempo, y=danceability)) +
  geom_point(alpha=.1) + 
  geom_smooth(aes(color=playlist_genre), se=F)
energy <- ggplot(spotify, aes(x=energy, y=danceability)) + 
  geom_point(alpha=.1) + 
  geom_smooth(aes(color=playlist_genre), se=F)
valence <- ggplot(spotify, aes(x=valence, y=danceability)) + 
  geom_point(alpha=.1) + 
  geom_smooth(aes(color=playlist_genre), se=F)
playlist_genre <- ggplot(spotify, 
                         aes(group=playlist_genre, 
                             y=danceability, 
                             color=playlist_genre)) + 
  geom_boxplot() 

ggarrange(tempo, energy, valence, playlist_genre, ncol=2, nrow=2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Interpreting this: (1a) most songs are danceable if they are around 100 to 120 beats per minute; as songs become too fast or too slow, they become less danceable; (1b) but latin songs have a “second peak” (slightly below 200 BPM) where songs get a little more danceable again; (2) songs of moderate energy tend to be danceable, but latin songs get more danceable as energy rises, whereas low-energy rap songs tend to be more danceable; (3) as songs get happier, they consistently get more danceable, regardless of genre; (4) rock songs are clearly less danceable in general, and the most danceable genre is rap, closely followed by latin.

3.10 (c)

Let’s assume that the difference between a band’s median track popularity and its maximum track popularity represents that band’s one-hit-wonderness. If this is the case, what are the Top 10 “one-hit-wonder” bands in the dataset? Given the results, does this comport with your understanding of what a “one hit wonder” is?

one_hit_wonders = spotify %>% 
  group_by(track_artist) %>% 
  summarize(one_hit_wonderness = max(track_popularity) - median(track_popularity)) %>%
  arrange(desc(one_hit_wonderness))
one_hit_wonders %>% head(10) %>% kable()
track_artist one_hit_wonderness
Lynyrd Skynyrd 81
Duki 76
50 Cent 75
Childish Gambino 75
KISS 74
YG 74
Sam Feldt 73
John Newman 72
Sheck Wes 72
Thin Lizzy 71

Sure; Lynyrd Skynyrd makes sense given the perennial popularity of “Sweet Home Alabama,” but you may not know who, say, Duki or YG are. And, alternatively, KISS and Thin Lizzy are certainly not one-hit wonders; they have a ton of very well-rated songs. So, one thing this measure of one-hit-wonderness may be missing is a measure of how many albums the artist has, or how many popular songs the artist has, as sketched below.
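A hedged sketch of that refinement: also count each artist’s distinct tracks, so that prolific artists like KISS can be screened out (track_id is the unique song identifier shown earlier; the cutoff you’d apply to n_tracks is a judgment call):

one_hit_wonders_v2 <- spotify %>% 
  group_by(track_artist) %>% 
  summarize(one_hit_wonderness = max(track_popularity) - median(track_popularity),
            n_tracks = n_distinct(track_id)) %>%
  arrange(desc(one_hit_wonderness))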


  1. You may also find the original Tidy Data paper useful in describing the different commonly-encountered issues in data formatting.

  2. Remember: I’m just using the knitr::kable() function to print the table all pretty-like in the RMarkdown.

  3. This is scraped using the spotifyr package.

  4. danceability measures how suitable a track is for dancing; it varies from 0 (not danceable) to 1 (danceable). energy is a measure of the “intensity” of the song, varying from 0 (very relaxing, Shoegaze) to 1 (very energetic, Death Metal). loudness is the average decibel value of the track, varying from -60 to zero. speechiness gives the level of “talkiness” in a track, varying from 0 (no spoken words) to 1 (all spoken words); tracks over .66 are probably nearly all spoken word, and tracks below .33 are probably songs. acousticness is like speechiness, but measures whether the instruments are not amplified; liveness is similar, but measures whether the track was performed in front of a live audience. valence tells you whether a track is “happy,” with higher scores indicating happier music. Finally, tempo records the average beats per minute for the track and duration_ms provides the duration of the song in milliseconds.

  5. Note that I’m using the name_repair option of read_csv() in order to get exactly the column names that are in midterm-risk.csv. Without this option, readr::read_csv will append extra values to the column names in order to ensure they are each unique. Try this by removing the name_repair argument, or setting it equal to "unique" instead of "minimal".

  6. Note that you can use the str_detect() function in the tidyverse’s stringr package to give TRUE when a movie name contains 2 or a II, and FALSE otherwise. Also, it’s ok if you accidentally pick up movies that have III or IIII or H20 in their name using this strategy; we’re just making an educated guess.

  7. You may find it helpful to make a plot!

  8. It’s OK if they’re totally separate plots! That is, I don’t expect you to use facet_grid().