This assessment will test your knowledge & abilities to work with data in R. Complete what you can; incomplete answers will still be able to achieve partial credit, and non-working code or descriptions of what you think you should do will also receive partial credit. Also, if a question has a (c) next to it, it’s considered a “Challenge” question! So, you can still score OK if you leave it blank. Submit the HTML or PDF of the “knitted” document for maximum points.

Data for the assessment is available alongside this document on the course blackboard, or here.

Section 1: Concepts

The data sets shown in the tabs below are included with your midterm assessment. Each of the data sets may have one (or more) “tidy” issues, and thus may violate the three rules of tidy data we’ve discussed before in the course:

Each variable must have its own column.
Each sample must have its own row.
Every value must have its own cell.

Using this information,¹ describe the following data sets in terms of their “tidyness.” Are the data sets tidy? If so, why? If not, why not? Note that I’m asking for a short paragraph; no code is needed to answer these fully.

Sometimes, I also provide a rationale for why the data is displayed this way. This* does not mean *the data is tidy!

Note: click the tabs to cycle through each of the data sets.

Movies

This data set describes the finances of movies in terms of the movie budget (budget), the domestic (US) gross box office revenue (domgross) and the international box office revenue (intgross) from 1970 to 2013.²

movies <- read_csv('./midterm-movies.csv')
movies %>% head(12) %>% kable()

year	movie_name	finance	dollars
1970	Beyond the Valley of the Dolls	budget	1000000
1970	Beyond the Valley of the Dolls	domgross	9000000
1970	Beyond the Valley of the Dolls	intgross	9000000
1971	Escape from the Planet of the Apes	budget	2500000
1971	Escape from the Planet of the Apes	domgross	12300000
1971	Escape from the Planet of the Apes	intgross	12300000
1971	Shaft	budget	53012938
1971	Shaft	domgross	70327868
1971	Shaft	intgross	107190108
1971	Straw Dogs	budget	25000000
1971	Straw Dogs	domgross	10324441
1971	Straw Dogs	intgross	11253821

Aussie Birds

This dataset counts the numbers of bird species recorded in urban or rural parts of bioregions across Australia. The survey was conducted from 2014 to 2015. Shown below are the first 10 columns of six random rows from the dataframe, as there are many more bird species in Australia. This data is formatted this way in order to make the selection of specific species easy.

birds <- read_csv('./midterm-birds.csv')
birds %>% drop_na() %>% arrange(bioregions) %>% sample_n(6) %>% select(1:10) %>% kable()

survey_year	urban_rural	bioregions	Wild Duck	Willie Wagtail	Rufous Fantail
2014	Urban	South Eastern Highlands	0	2	0
2015	Rural	Brigalow Belt South	0	2	0
2015	Rural	Flinders Lofty Block	0	2	0
2015	Urban	Victorian Midlands	0	0	0
2015	Urban	South East Coastal Plain	0	2	0
2015	Urban	Sydney Basin	3	3	2

Songs

This dataset describes 32,833 songs on Spotify in terms of statistics that Spotify record about the song.³ Below, I only a show a few columns from the dataset, but a few columns are explained.⁴ This data is formatted directly from the Spotify API, and so is structured for ease of use in a database.

spotify <- read_csv('./midterm-songs.csv')
spotify %>% select(track_name, track_artist, 
                   track_popularity, 
                   danceability, loudness, duration_ms) %>% 
  sample_n(10, weight=track_popularity) %>% 
  kable()

track_name	track_artist	track_popularity	danceability	loudness	duration_ms
Stellar	Jamila Woods	49	0.684	-9.751	122346
Peeping Tom - Original Mix	Jamie Berry	53	0.720	-4.713	210000
Transmission - 2010 Remaster	Joy Division	48	0.497	-8.088	215148
Tungba	Ajebutter22	54	0.749	-5.850	188421
Bounce Back	Big Sean	76	0.780	-5.628	222360
Holy Diver	Dio	35	0.469	-9.913	350013
Hit and Run	Midnght	43	0.775	-10.914	172203
You should be sad	Halsey	86	0.591	-6.350	205473
HUMBLE.	Kendrick Lamar	84	0.908	-6.638	177000
Stay High - Habits Remix	Tove Lo	69	0.736	-7.036	258387

French Trains

These are the number of trains that have run between any two stations for services operated by the National Society of French Railways, the state-owned rail company in France, from 2015 to 2018. Below, I just show the first ten originating stations (where the trains leave from) and the first four destination stations (where the trains arrive). This kind of data is often called “spatial interaction data,” and is used to measure the “interaction” between different spatial units. It is often presented in this format for usability: readers can scan across a row to quickly compare the number of trains that destinations receive from a specific origin.

trains <- read_csv('./midterm-trains.csv')
trains %>% 
  select(1:5) %>% 
  slice(1:10) %>% 
  kable()

departure_station	POITIERS	QUIMPER	RENNES	ST PIERRE DES CORPS
PARIS MONTPARNASSE	31632	16242	38850	24930
TOURS	0	0	0	0
LYON PART DIEU	0	0	5814	0
PARIS EST	0	0	0	0
NANCY	0	0	0	0
STRASBOURG	0	0	0	0
NANTES	0	0	0	0
DUNKERQUE	0	0	0	0
MARSEILLE ST CHARLES	0	0	0	0
BORDEAUX ST JEAN	0	0	0	0

EU Energy

The following table records the percentage energy supplied by different forms of energy across countries in the EU, as well as the “EU-28” and “EA-19” groups of European member nations. This kind of wide-but-short display format is often useful to fit tables like this alongside text in a document.

energy <- read_csv('./midterm-energy.csv')
energy %>% select(1:12) %>% kable()

energy_type	EU_28	EA_19	EE	CY	MT	PL	NL	EL	IE	IT	LV
Conventional thermal	45.9	43.6	93.9	91.4	90.9	90.2	83.9	68.6	68.3	66.0	61.0
Nuclear	25.5	27.0	0.0	0.0	0.0	0.0	2.9	0.0	0.0	0.0	0.0
Hydro	11.8	11.9	0.2	0.0	0.0	1.5	0.1	11.4	3.2	17.6	37.2
Wind	12.2	12.2	6.0	4.6	0.0	8.1	10.9	12.4	28.6	6.2	1.8
Solar	4.0	4.7	0.0	4.0	0.0	0.2	2.2	7.5	0.0	8.2	0.0
Geothermal & others	0.4	0.6	0.0	0.0	9.1	0.0	0.0	0.0	0.0	2.0	0.0
Total	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0

MarioKart

This dataset describes MarioKart 64 “World Record” races. There are typically two “categories” in MarioKart racing: the “Single Lap” category—how fast a racer can complete a single lap—and “Three Lap,” which measures the time to complete a typical three-lap race. Along with the times (in seconds), the date for the most recent world record is recorded. This is often the format in which tables like this are viewed.

mk_records <- read_csv('./midterm-races.csv')
mk_records %>% kable()

track	single_lap_record	single_lap_record_date	three_lap_record	three_lap_record_date
Banshee Boardwalk	40.78	2020-06-11	124.09	2021-01-15
Bowser’s Castle	43.15	2021-02-02	132.00	2021-02-02
Choco Mountain	38.02	2020-07-17	115.93	2020-07-14
D.K.’s Jungle Parkway	42.04	2020-10-26	131.62	2019-12-31
Frappe Snowland	38.27	2019-12-25	119.95	2017-12-23
Kalimari Desert	38.96	2020-11-07	123.94	2018-04-20
Koopa Troopa Beach	30.78	2020-10-18	95.25	2020-07-13
Luigi Raceway	37.58	2021-01-10	117.77	2020-11-06
Mario Raceway	27.62	2021-01-26	87.51	2020-08-13
Moo Moo Farm	27.80	2018-12-29	85.93	2020-02-18
Rainbow Road	116.35	2020-09-04	351.87	2020-09-28
Royal Raceway	55.50	2020-06-11	171.25	2020-09-07
Sherbet Land	37.72	2021-02-19	115.15	2021-01-26
Toad’s Turnpike	58.69	2020-09-20	177.80	2020-09-28
Wario Stadium	85.82	2021-01-26	260.01	2019-10-11
Yoshi Valley	31.25	2018-01-18	102.13	2021-01-26

Eco Risk

The following are Paini et al.’s estimates of the potential cost (in millions of USD) of invasive species to the ecosystem of each country. This is a direct digitization of table S2 in the Supplemental Material, so the formatting of the table is decided by concerns of printing and typesetting.⁵

risk <- read_csv('./midterm-risk.csv', name_repair='minimal')
risk %>% kable()

rank	country	damage	rank	country	damage	rank	country	damage
1	China	$117,290	43	Denmark	$1,417	85	Burundi	$397.9
2	USA	$70,381	44	Nepal	$1,411	86	Lithuania	$392.4
3	Brazil	$33,760	45	Sudan	$1,373	87	Moldova	$387.5
4	India	$33,065	46	Portugal	$1,365	88	Armenia	$336.0
5	Japan	$23,490	47	Belgium	$1,351	89	Malaysia	$333.0
6	Korea Republic of	$14,349	48	Kazakhstan	$1,344	90	Bosnia and Herzegovina	$327.4
7	Turkey	$13,267	49	Czech Republic	$1,336	91	Kyrgyzstan	$302.0
8	Argentina	$13,204	50	Austria	$1,304	92	Georgia (Republic)	$301.5
9	France	$12,532	51	Iraq	$1,234	93	Tajikistan	$297.1
10	Mexico	$11,277	52	Kenya	$1,230	94	Ireland	$277.6
11	Iran	$11,276	53	Mozambique	$1,218	95	Lebanon	$276.8
12	Nigeria	$10,251	54	Cambodia	$1,121	96	Nicaragua	$264.1
13	Indonesia	$9,550	55	Ghana	$1,114	97	Rwanda	$255.1
14	Thailand	$8,066	56	Bulgaria	$1,112	98	Mauritius	$227.6
15	Australia	$7,815	57	Madagascar	$1,074	99	Macedonia	$218.4
16	Vietnam	$7,490	58	Malawi	$1,071	100	Congo (Republic of)	$212.8
17	Ukraine	$6,953	59	Paraguay	$1,012	101	Slovenia	$202.0
18	Egypt	$6,737	60	Guinea	$977.5	102	Niger	$197.3
19	Canada	$6,694	61	Tunisia	$949.2	103	Latvia	$187.3
20	Pakistan	$6,630	62	Ecuador	$934.7	104	Panama	$161.2
21	Germany	$6,481	63	Switzerland	$924.1	105	Togo	$153.2
22	Bangladesh	$5,623	64	Dominican Republic	$873.0	106	Jordan	$116.4
23	Spain	$5,576	65	Jamaica	$871.7	107	Guinea-Bissau	$114.3
24	Russian Federation	$5,084	66	Sri Lanka	$829.1	108	Cyprus	$108.5
25	Philippines	$4,839	67	Yemen	$806.0	109	Estonia	$102.1
26	Greece	$4,342	68 Saudi Arabia	NA	110	Fiji	NA	NA
27	United Kingdom	$4,005	69	Honduras	$794.3	111	Mongolia	$64.7
28	South Africa	$3,922	70	Croatia	$755.6	112	Luxembourg	$64.7
29	Romania	$3,524	71	Azerbaijan	$730.7	113	Belize	$42.2
30	Algeria	$2,862	72	New Zealand	$639.7	114	Cape Verde	$40.8
31	Morocco	$2,531	73	Albania	$637.2	115	Gambia	$37.6
32	Colombia	$2,476	74	Finland	$600.7	116	Suriname	$36.5
33	Poland	$2,449	75	Slovakia	$573.4	117	Trinidad and Tobago	$28.8
34	Ethiopia	$2,312	76	Burkina Faso	$557.7	118	Vanuatu	$23.5
35	Venezuela	$2,167	77	Costa Rica	$556.8	119	Barbados	$20.3
36	Chile	$2,095	78	Sweden	$546.9	120 Equatorial Guinea	NA	NA
37	Netherlands	$1,981	79	Israel	$518.4	121	Malta	$14.4
38	Hungary	$1,979	80	Uruguay	$509.1	122	Qatar	$5.0
39	Belarus	$1,777	81	Laos	$508.2	123	Iceland	$4.8
40	Peru	$1,580	82	Mali	$504.7	124	Singapore	$0.7
41	Cameroon	$1,574	83	El Salvador	$475.1	NA	NA	NA
42	Italy	$1,447	84	Norway	$419.4	NA	NA	NA

Section 2: Cleaning

For each of the data sets in Section 1, can you create a tidy version of the data set?

Movies

hint: this may need to take rows and turn them into columns!

movies %>%

Aussie Birds

hint: this may need to take columns and turn them into rows!

birds %>%

Songs

hint: pay attention to the three rules; sometimes, you get lucky!

songs %>%

French Trains

hint: think very carefully about what the variables are here!

trains %>%

EU Energy

hint: sometimes, we must pivot one direction before we pivot another!

energy %>%

MarioKart

hint: sometimes, you may need to pivot twice in the same direction!

mk_records %>%

Eco Risk

hint: this data frame is uniquely messy! Try splitting it into parts that all look tidy, and then bringing them back together.

risk %>%

Section 3: Analysis

In this section, I’ll ask some specific questions about two of the data sets: Movies & Songs.

Movies

3.1

Which ten movies lost the most money domestically? Are these the same movies that lost the most money overall?

3.2

What is the average budget for a movie in each year?

3.3

Which movie had the largest gap between domestic and overseas box office performance?

3.4

Make a visualization that shows how the budget for movies has increased over time. Discuss how you designed the plot in order to emphasize this message.

3.5

Make a visualization that shows how the typical profit movies make has generally not changed over time, but that a few outliers do make increasingly more money. Discuss how you designed the plot in order to emphasize this message.

3.6(c)

You’re a data scientist working for a movie studio. Your executive is considering whether to take a risk on making a “new” movie, or whether it’d be a safer bet to make a sequel to an existing movie. So, she asks:

Do sequels make more profit per dollar spent than non-sequels?

Can you answer her question?⁶

Songs

3.7

What’s your best guess about the length of a track with higher than 75% popularity? How about your best guess for the popularity of a track between 2 and 3 minutes long? Which “piece” of information (popularity > 75% or duration between 2 & 3 minutes) gives you more useful information, and why do you think that?⁷

3.8 (c)

What is the typical “energy” of each of the playlist genres? How about the typical “valence,” meaning “happiness,” of the genres?

3.9 (c)

Make four plots⁸ to visualize the relationship between danceability and a few variables:

tempo
energy
valence
playlist_genre

Make sure to take into account whether the relationship is linear and address overplotting if necessary. Given these plots, what kinds of songs tend to be danceable?

3.10 (c)

Let’s assume that the difference between a band’s median track popularity and its maximum track popularity represents that band’s one-hit-wonderness. If this is the case, what are the Top 10 “one-hit-wonder” bands in the dataset? Given the results, does this comport with your understanding of what a “one hit wonder” is?

You may also find the original Tidy Data paper useful in describing the different commonly-encountered issues in data formatting.↩︎
Remember: I’m just using the knitr::kable() function to print the table all pretty-like in the RMarkdown.↩︎
This is scraped using the spotifyr package.↩︎
danceability measures how suitable a track is for dancing, varies from 0 (not danceable) to 1 (danceable). energy is a measure of the “intensity” of the song, varies from 0 (very relaxing, Shoegaze) to 1 (very energetic, Death Metal). loudness is the average decibel value of the track, varies from -60 to zero. speechiness gives the level of “talkiness” in a track, varies from 0 (no spoken words) to 1 (all spoken words), but tracks over .66 are probably nearly all spoken word, and tracks below .33 are probably songs. acousticness is the same as speechiness, but measuring whether instruments are not amplified; liveness but for whether the track has a live audience. valence tells you whether a track is “happy,” with higher scores indicate happier music. Finally, tempo records the average beats per minute for the track and duration_ms provides the duration of the song in milliseconds. ↩︎
Note that I’m using the name_repair option of read_csv() in order to get exactly the column names that are in midterm-risk.csv. Without this option, readr::read_csv will append extra values to the column names in order to ensure they are each unique. Try this by removing the name_repair argument, or setting it equal to "unique" instead of "minimal".↩︎
Note that you can use the str_detect() function in the tidyverse’s stringr package to give TRUE when a movie name contains 2 or a II, and FALSE otherwise. Also, it’s ok if you accidentally pick up movies that have III or IIII or H20 in their name using this strategy; we’re just making an educated guess.↩︎
You may find it helpful to make a plot!↩︎
It’s OK if they’re totally separate plots! That is, I don’t expect you to use facet_grid()↩︎

Midterm Asessment

Due: Wednesday, 3rd November, at 2PM London

Section 1: Concepts

Movies

Aussie Birds

Songs

French Trains

EU Energy

MarioKart

Eco Risk

Section 2: Cleaning

Movies

Aussie Birds

Songs

French Trains

EU Energy

MarioKart

Eco Risk

Section 3: Analysis

Movies

3.1

3.2

3.3

3.4

3.5

3.6(c)

Songs

3.7

3.8 (c)

3.9 (c)

3.10 (c)