This assessment will test your knowledge of, and ability to work with, data in R. Complete what you can: incomplete answers can still earn partial credit, and non-working code or descriptions of what you think you should do will also receive partial credit. Also, if a question has a (c) next to it, it’s considered a “Challenge” question, so you can still score OK if you leave it blank. Submit the HTML or PDF of the “knitted” document for maximum points.

Data for the assessment is available alongside this document on the course blackboard, or here.

Section 1: Concepts

The data sets shown in the tabs below are included with your midterm assessment. Each of the data sets may have one (or more) “tidy” issues, and thus may violate the three rules of tidy data we’ve discussed before in the course:

  1. Each variable must have its own column.
  2. Each sample must have its own row.
  3. Every value must have its own cell.

Using this information,1 describe the following data sets in terms of their “tidiness.” Are the data sets tidy? If so, why? If not, why not? Note that I’m asking for a short paragraph; no code is needed to answer these fully.

Sometimes, I also provide a rationale for why the data is displayed this way. This *does not mean* the data is tidy!

Note: click the tabs to cycle through each of the data sets.

Movies

This data set describes the finances of movies in terms of the movie budget (budget), the domestic (US) gross box office revenue (domgross) and the international box office revenue (intgross) from 1970 to 2013.2

movies <- read_csv('./midterm-movies.csv')
movies %>% head(12) %>% kable()
year movie_name finance dollars
1970 Beyond the Valley of the Dolls budget 1000000
1970 Beyond the Valley of the Dolls domgross 9000000
1970 Beyond the Valley of the Dolls intgross 9000000
1971 Escape from the Planet of the Apes budget 2500000
1971 Escape from the Planet of the Apes domgross 12300000
1971 Escape from the Planet of the Apes intgross 12300000
1971 Shaft budget 53012938
1971 Shaft domgross 70327868
1971 Shaft intgross 107190108
1971 Straw Dogs budget 25000000
1971 Straw Dogs domgross 10324441
1971 Straw Dogs intgross 11253821

This dataset is not tidy. The column “finance” contains the names of different variables, which each need to be turned into their own column. The “dollars” column does contain only dollar amounts, but values in different rows come from the same sample: the movie. Your target should be a dataframe where each row is a movie, with columns for the year, the budget, the domestic gross, and the international gross.

Aussie Birds

This dataset records the count of each bird species observed in urban or rural parts of bioregions across Australia. The survey was conducted from 2014 to 2015. Shown below are the first 10 columns of six random rows from the dataframe, as there are many more bird species in Australia. This data is formatted this way in order to make the selection of specific species easy.

birds <- read_csv('./midterm-birds.csv')
birds %>% drop_na() %>% arrange(bioregions) %>% sample_n(6) %>% select(1:10) %>% kable()
survey_year urban_rural bioregions Bassian Thrush Chestnut-breasted Mannikin Wild Duck Willie Wagtail Regent Bowerbird Rufous Fantail Spiny-cheeked Honeyeater
2015 Rural Victorian Midlands 0 0 0 6 0 1 0
2015 Rural Flinders Lofty Block 0 0 0 2 0 0 0
2015 Rural South Eastern Queensland 0 1 0 8 1 2 0
2015 Rural South East Coastal Plain 1 0 0 2 0 0 0
2015 Urban Brigalow Belt South 0 0 0 1 0 0 0
2014 Urban South Eastern Highlands 0 0 0 2 0 0 0

This dataset is not tidy. The “species” variable is implicit, spread across the columns. In order to tidy it, your final dataframe should have columns for year, urban/rural classification, bioregion, species, and count.

Songs

This dataset describes 32,833 songs on Spotify in terms of statistics that Spotify records about each song.3 Below, I show only a few columns from the dataset, but several of them are explained in the footnotes.4 This data is formatted directly from the Spotify API, and so is structured for ease of use in a database.

spotify <- read_csv('./midterm-songs.csv')
spotify %>% select(track_name, track_artist, 
                   track_popularity, 
                   danceability, loudness, duration_ms) %>% 
  sample_n(10, weight=track_popularity) %>% 
  kable()
track_name track_artist track_popularity danceability loudness duration_ms
My Best Life (feat. Mike Waters) - Club Mix KSHMR 66 0.603 -3.693 198533
We Should Tell the Truth metr 44 0.808 -12.306 267292
Nothing in Return Monsune 52 0.672 -5.747 232200
Are You Gonna Be My Girl - UK Acoustic Version Jet 52 0.672 -5.287 237173
What About Your Friends (Album Radio Edit) [W/Rap] TLC 27 0.752 -4.454 244560
More Than a Feeling Boston 78 0.377 -8.039 285133
Takeaway The Chainsmokers 85 0.528 -8.144 209880
Stay The Night - Featuring Hayley Williams Of Paramore Zedd 61 0.596 -3.109 217347
Killing My Love - Radio Edit THIS IS ELLE 18 0.206 -5.168 194047
That’s Tuff (feat. Quavo) Rich The Kid 75 0.903 -4.313 154000

This dataset is tidy! Each row denotes a song on Spotify. The only reason you might think it’s not tidy is that the track name can contain information on the remix and collaborator(s) on the song. This is true, but where one variable stops and another starts is not always so clear. If the track title includes “(feat. Jay-Z)”, then you can use feature engineering to build a new variable that tries to extract this information, as sketched below. However, the “feat.” or “Remix” info is not an intentionally-coded value in each cell, so it’s less reasonable to suggest that all of track_name contains two (or more) variables.
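For instance, here is a hedged sketch of that feature-engineering idea. The patterns are my assumptions about how featured artists and remixes are labelled in track_name, not part of Spotify’s data dictionary:

spotify_flagged <- spotify %>%
  mutate(
    # TRUE when the title advertises a featured artist, e.g. "(feat. Khalid)"
    has_feature = str_detect(track_name, fixed("feat.")),
    # TRUE when the title advertises a remix, in any capitalization
    is_remix = str_detect(track_name, regex("remix", ignore_case = TRUE))
  )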

French Trains

These are the number of trains that have run between any two stations for services operated by the National Society of French Railways, the state-owned rail company in France, from 2015 to 2018. Below, I just show the first ten originating stations (where the trains leave from) and the first four destination stations (where the trains arrive). This kind of data is often called “spatial interaction data,” and is used to measure the “interaction” between different spatial units. It is often presented in this format for usability: readers can scan across a row to quickly compare the number of trains that destinations receive from a specific origin.

trains <- read_csv('./midterm-trains.csv')
trains %>% 
  select(1:5) %>% 
  slice(1:10) %>% 
  kable()
departure_station POITIERS QUIMPER RENNES ST PIERRE DES CORPS
PARIS MONTPARNASSE 31632 16242 38850 24930
TOURS 0 0 0 0
LYON PART DIEU 0 0 5814 0
PARIS EST 0 0 0 0
NANCY 0 0 0 0
STRASBOURG 0 0 0 0
NANTES 0 0 0 0
DUNKERQUE 0 0 0 0
MARSEILLE ST CHARLES 0 0 0 0
BORDEAUX ST JEAN 0 0 0 0

This dataset is not tidy! The “destination” variable is spread implicitly over the columns. That is, the data is wider than it should be to be tidy. The tidy dataset would have source, destination, and count columns.

EU Energy

The following table records the percentage energy supplied by different forms of energy across countries in the EU, as well as the “EU-28” and “EA-19” groups of European member nations. This kind of wide-but-short display format is often useful to fit tables like this alongside text in a document.

energy <- read_csv('./midterm-energy.csv')
energy %>% select(1:12) %>% kable()
energy_type EU_28 EA_19 EE CY MT PL NL EL IE IT LV
Conventional thermal 45.9 43.6 93.9 91.4 90.9 90.2 83.9 68.6 68.3 66.0 61.0
Nuclear 25.5 27.0 0.0 0.0 0.0 0.0 2.9 0.0 0.0 0.0 0.0
Hydro 11.8 11.9 0.2 0.0 0.0 1.5 0.1 11.4 3.2 17.6 37.2
Wind 12.2 12.2 6.0 4.6 0.0 8.1 10.9 12.4 28.6 6.2 1.8
Solar 4.0 4.7 0.0 4.0 0.0 0.2 2.2 7.5 0.0 8.2 0.0
Geothermal & others 0.4 0.6 0.0 0.0 9.1 0.0 0.0 0.0 0.0 2.0 0.0
Total 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0

This dataset is not tidy! Its transpose, however, is! One “measurement” or “sample” should correspond to a row. So, generally speaking, we’d want one set of observations to all hang logically together, recording all the information we have about a unit of analysis. So, the clearest “tidy” version of this data is long, with one column for “energy type,” one column for “country,” and one column of values (the percent supplied). The EU/EA groupings of countries can be handled either by redefining what the “country” variable means (it could be “energy_market” or “supply_zone”), by separating those values into their own table (sketched in Section 2), or by using them to indicate the aggregation into which each country falls. The “Total” energy type, though, needs to be removed from this representation. I wouldn’t expect the latter of the three grouping options, but either of the first two is reasonable.

One common interpretation of the tidy version of this data is just the transpose of the current dataframe. I can understand this, but it’s not as clear; the “energy type” variable is split across columns in this case.

MarioKart

This dataset describes MarioKart 64 “World Record” races. There are typically two “categories” in MarioKart racing: the “Single Lap” category—how fast a racer can complete a single lap—and “Three Lap,” which measures the time to complete a typical three-lap race. Along with the times (in seconds), the date for the most recent world record is recorded. This is often the format in which tables like this are viewed.

mk_records <- read_csv('./midterm-races.csv')
mk_records %>% kable()
track single_lap_record single_lap_record_date three_lap_record three_lap_record_date
Banshee Boardwalk 40.78 2020-06-11 124.09 2021-01-15
Bowser’s Castle 43.15 2021-02-02 132.00 2021-02-02
Choco Mountain 38.02 2020-07-17 115.93 2020-07-14
D.K.’s Jungle Parkway 42.04 2020-10-26 131.62 2019-12-31
Frappe Snowland 38.27 2019-12-25 119.95 2017-12-23
Kalimari Desert 38.96 2020-11-07 123.94 2018-04-20
Koopa Troopa Beach 30.78 2020-10-18 95.25 2020-07-13
Luigi Raceway 37.58 2021-01-10 117.77 2020-11-06
Mario Raceway 27.62 2021-01-26 87.51 2020-08-13
Moo Moo Farm 27.80 2018-12-29 85.93 2020-02-18
Rainbow Road 116.35 2020-09-04 351.87 2020-09-28
Royal Raceway 55.50 2020-06-11 171.25 2020-09-07
Sherbet Land 37.72 2021-02-19 115.15 2021-01-26
Toad’s Turnpike 58.69 2020-09-20 177.80 2020-09-28
Wario Stadium 85.82 2021-01-26 260.01 2019-10-11
Yoshi Valley 31.25 2018-01-18 102.13 2021-01-26

This data is not tidy! The variable “race type” is spread over the columns, mixed into the names of each of the relevant measurements. This will be somewhat more challenging to tidy! The target dataframe should have one world record for each row, with track, event, time, and date as columns.

Eco Risk

The following are Paini et al.’s estimates of the potential cost (in millions of USD) of invasive species to the ecosystem of each country. This is a direct digitization of table S2 in the Supplemental Material, so the formatting of the table is decided by concerns of printing and typesetting.5

risk <- read_csv('./midterm-risk.csv', name_repair='minimal')
risk %>% kable()
rank country damage rank country damage rank country damage
1 China $117,290 43 Denmark $1,417 85 Burundi $397.9
2 USA $70,381 44 Nepal $1,411 86 Lithuania $392.4
3 Brazil $33,760 45 Sudan $1,373 87 Moldova $387.5
4 India $33,065 46 Portugal $1,365 88 Armenia $336.0
5 Japan $23,490 47 Belgium $1,351 89 Malaysia $333.0
6 Korea Republic of $14,349 48 Kazakhstan $1,344 90 Bosnia and Herzegovina $327.4
7 Turkey $13,267 49 Czech Republic $1,336 91 Kyrgyzstan $302.0
8 Argentina $13,204 50 Austria $1,304 92 Georgia (Republic) $301.5
9 France $12,532 51 Iraq $1,234 93 Tajikistan $297.1
10 Mexico $11,277 52 Kenya $1,230 94 Ireland $277.6
11 Iran $11,276 53 Mozambique $1,218 95 Lebanon $276.8
12 Nigeria $10,251 54 Cambodia $1,121 96 Nicaragua $264.1
13 Indonesia $9,550 55 Ghana $1,114 97 Rwanda $255.1
14 Thailand $8,066 56 Bulgaria $1,112 98 Mauritius $227.6
15 Australia $7,815 57 Madagascar $1,074 99 Macedonia $218.4
16 Vietnam $7,490 58 Malawi $1,071 100 Congo (Republic of) $212.8
17 Ukraine $6,953 59 Paraguay $1,012 101 Slovenia $202.0
18 Egypt $6,737 60 Guinea $977.5 102 Niger $197.3
19 Canada $6,694 61 Tunisia $949.2 103 Latvia $187.3
20 Pakistan $6,630 62 Ecuador $934.7 104 Panama $161.2
21 Germany $6,481 63 Switzerland $924.1 105 Togo $153.2
22 Bangladesh $5,623 64 Dominican Republic $873.0 106 Jordan $116.4
23 Spain $5,576 65 Jamaica $871.7 107 Guinea-Bissau $114.3
24 Russian Federation $5,084 66 Sri Lanka $829.1 108 Cyprus $108.5
25 Philippines $4,839 67 Yemen $806.0 109 Estonia $102.1
26 Greece $4,342 68 Saudi Arabia NA 110 Fiji NA,NA
27 United Kingdom $4,005 69 Honduras $794.3 111 Mongolia $64.7
28 South Africa $3,922 70 Croatia $755.6 112 Luxembourg $64.7
29 Romania $3,524 71 Azerbaijan $730.7 113 Belize $42.2
30 Algeria $2,862 72 New Zealand $639.7 114 Cape Verde $40.8
31 Morocco $2,531 73 Albania $637.2 115 Gambia $37.6
32 Colombia $2,476 74 Finland $600.7 116 Suriname $36.5
33 Poland $2,449 75 Slovakia $573.4 117 Trinidad and Tobago $28.8
34 Ethiopia $2,312 76 Burkina Faso $557.7 118 Vanuatu $23.5
35 Venezuela $2,167 77 Costa Rica $556.8 119 Barbados $20.3
36 Chile $2,095 78 Sweden $546.9 120 Equatorial Guinea NA,NA
37 Netherlands $1,981 79 Israel $518.4 121 Malta $14.4
38 Hungary $1,979 80 Uruguay $509.1 122 Qatar $5.0
39 Belarus $1,777 81 Laos $508.2 123 Iceland $4.8
40 Peru $1,580 82 Mali $504.7 124 Singapore $0.7
41 Cameroon $1,574 83 El Salvador $475.1 NA NA NA
42 Italy $1,447 84 Norway $419.4 NA NA NA

This data is not tidy! It should only contain three columns in the “cleaned” version. The most reasonable way to clean this data may also not involve pivots…

Section 2: Cleaning

For each of the data sets in Section 1, can you create a tidy version of the data set?

Movies

hint: this may need to take rows and turn them into columns!

This one is a very straightforward pivot_wider():

movies_tidy <- movies %>% 
  pivot_wider(id_cols=c(year, movie_name), 
              names_from=finance, 
              values_from=dollars)
movies_tidy %>% head(6) %>% kable()
year movie_name budget domgross intgross
1970 Beyond the Valley of the Dolls 1000000 9000000 9000000
1971 Escape from the Planet of the Apes 2500000 12300000 12300000
1971 Shaft 53012938 70327868 107190108
1971 Straw Dogs 25000000 10324441 11253821
1971 The French Connection 2200000 41158757 41158757
1971 Willy Wonka & the Chocolate Factory 3000000 4000000 4000000

Aussie Birds

hint: this may need to take columns and turn them into rows!

This one is a very straightforward pivot_longer(), where the bird columns are converted into rows (a variant with clearer column names follows the output below):

birds_tidy <- birds %>% pivot_longer(`Bassian Thrush`:`Grey Fantail`)
birds_tidy %>% head(6) %>% kable()
survey_year urban_rural bioregions name value
2014 Urban South Eastern Queensland Bassian Thrush 0
2014 Urban South Eastern Queensland Chestnut-breasted Mannikin 2
2014 Urban South Eastern Queensland Wild Duck 1
2014 Urban South Eastern Queensland Willie Wagtail 12
2014 Urban South Eastern Queensland Regent Bowerbird 0
2014 Urban South Eastern Queensland Rufous Fantail 3
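To match the target column names described in Section 1 (species and count), you can name the new columns explicitly; this is the same pivot, just with clearer labels:

birds_tidy <- birds %>% 
  pivot_longer(`Bassian Thrush`:`Grey Fantail`,
               names_to = "species", values_to = "count")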

Songs

hint: pay attention to the three rules; sometimes, you get lucky!

This one is already tidy:

spotify <- spotify
spotify %>% head(6) %>% kable()
track_id track_name track_artist track_popularity track_album_id track_album_name track_album_release_date playlist_name playlist_id playlist_genre playlist_subgenre danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms
6f807x0ima9a1j3VPbc7VN I Don’t Care (with Justin Bieber) - Loud Luxury Remix Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx I Don’t Care (with Justin Bieber) [Loud Luxury Remix] 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.748 0.916 6 -2.634 1 0.0583 0.1020 0.00e+00 0.0653 0.518 122.036 194754
0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix Maroon 5 67 63rPSO264uRjW1X5E6cWv6 Memories (Dillon Francis Remix) 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.726 0.815 11 -4.969 1 0.0373 0.0724 4.21e-03 0.3570 0.693 99.972 162600
1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4 All the Time (Don Diablo Remix) 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.675 0.931 1 -3.432 0 0.0742 0.0794 2.33e-05 0.1100 0.613 124.008 176616
75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6 Call You Mine - The Remixes 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.718 0.930 7 -3.778 1 0.1020 0.0287 9.40e-06 0.2040 0.277 121.956 169093
1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ Someone You Loved (Future Humans Remix) 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.650 0.833 1 -4.672 1 0.0359 0.0803 0.00e+00 0.0833 0.725 123.976 189052
7fvUMiyapMsRRxr07cU8Ef Beautiful People (feat. Khalid) - Jack Wins Remix Ed Sheeran 67 2yiy9cd2QktrNvWC2EUi0k Beautiful People (feat. Khalid) [Jack Wins Remix] 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop dance pop 0.675 0.919 8 -5.385 1 0.1270 0.0799 0.00e+00 0.1430 0.585 124.982 163049

French Trains

hint: think very carefully about what the variables are here!

This one again is a pivot_longer(), but you need to make sure that only the departure station is kept as an identifier (and so is the only column that gets duplicated down the rows):

trains_tidy <- trains %>% pivot_longer(-departure_station, 
                        names_to='destination_station', 
                        values_to='n_trains')
trains_tidy %>% head(6) %>% kable()
departure_station destination_station n_trains
PARIS MONTPARNASSE POITIERS 31632
PARIS MONTPARNASSE QUIMPER 16242
PARIS MONTPARNASSE RENNES 38850
PARIS MONTPARNASSE ST PIERRE DES CORPS 24930
PARIS MONTPARNASSE PARIS MONTPARNASSE 0
PARIS MONTPARNASSE NANCY 0

EU Energy

hint: sometimes, we must pivot one direction before we pivot another!

This one is a pivot_longer(), with a filter() to drop the “Total” rows so that energy_type only contains true values of the variable. Alternatively, you could pivot_wider() after this, as discussed above, but that is not quite tidy: the energy_type variable would be spread over columns.

energy_tidy <- energy %>% 
  pivot_longer(EU_28:GE, names_to="market", values_to="percent") %>% 
  filter(energy_type != "Total")
energy_tidy %>% head(6) %>% kable()
energy_type market percent
Conventional thermal EU_28 45.9
Conventional thermal EA_19 43.6
Conventional thermal EE 93.9
Conventional thermal CY 91.4
Conventional thermal MT 90.9
Conventional thermal PL 90.2
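Following the note in Section 1, one hedged way to handle the EU_28/EA_19 aggregates is to split them into their own table, so that the remaining rows are all individual countries (the market codes are taken from the table above):

eu_aggregates <- energy_tidy %>% 
  filter(market %in% c("EU_28", "EA_19"))
energy_countries <- energy_tidy %>% 
  filter(!(market %in% c("EU_28", "EA_19")))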

MarioKart

hint: sometimes, you may need to pivot twice in the same direction!

This one can be done by splitting and re-combining the data. It uses two pivots: you can either pivot twice on the original data, or split the dataset in two and pivot each part (a one-step variant using pivot_longer()’s “.value” sentinel is sketched after the code below). This kind of “split and pivot” takes more steps, but each step is easier to understand. As a rule, split the data into bits that are themselves tidy, and then join them back together if that’s useful.

# clean only times
times <- mk_records %>%
  select(track, ends_with("record")) %>%
  pivot_longer(ends_with("record"), 
               names_to='race_type', 
               values_to='duration') %>%
  separate(race_type, c('event', NA, NA))

# clean only dates
dates <- mk_records %>%
  select(track, ends_with('date')) %>% 
  pivot_longer(ends_with("date"), 
               names_to='race_type', 
               values_to='datetime') %>% 
  separate(race_type, c('event', NA, NA, NA))

# merge everything back together, could use the merge or the *_join functions from the reading.
mk_tidy <- inner_join(times, dates, by=c("track", "event"))

mk_tidy %>% head(6) %>% kable()
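As referenced above, here is a hedged one-step variant: pivot_longer() can split the column names itself, with the special ".value" sentinel sending the times and the dates to separate columns. The names_pattern regex is an assumption based on the column names shown above:

mk_tidy2 <- mk_records %>%
  pivot_longer(-track,
               names_to = c("event", ".value"),
               names_pattern = "(single|three)_lap_(record.*)") %>%
  # rename the ".value" columns to match mk_tidy above
  rename(duration = record, datetime = record_date)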

Eco Risk

hint: this data frame is uniquely messy! Try splitting it into parts that all look tidy, and then bringing them back together.

This one basically must be split and recombined. In theory, one could use the “melt and cast” strategy from the Tidy Data paper to first make the data as long as possible, then pivot it based on the columns you want. The pivot_longer() function is less powerful in this way, because it does not convert data types by default. So, the easiest direct approach is the split & recombine, as follows:

risk_tidy <- rbind(risk[1:3], #first set of columns
                   risk[4:6], #second set of columns
                   risk[7:9]  #third set of columns
                   )
risk_tidy %>% head(6) %>% kable()

For the melt & cast strategy from the very first reading, we leverage the fact that the “melt” function forces everything into the same type (if it can) so that it can all be stacked on top of one another. This usually means everything is converted into character data, and we have to manually convert it back at the end:

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
risk %>% 
  melt(id.vars = 'country') %>% 
  pivot_wider(id_cols=country, 
              names_from=variable, 
              values_from=value) %>%
  mutate(rank=as.numeric(rank)) %>%
  head(7) %>% kable()
country rank damage
China 1 $117,290
USA 2 $70,381
Brazil 3 $33,760
India 4 $33,065
Japan 5 $23,490
Korea Republic of 6 $14,349
Turkey 7 $13,267
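If you would rather stay within the tidyverse, here is a hedged sketch: make the repeated column names unique first, then use pivot_longer() with ".value" to stack the three column triplets. The renaming scheme via set_names() is my assumption about how to disambiguate the duplicated names:

risk_tidy2 <- risk %>%
  # rename to rank_1, country_1, damage_1, rank_2, ... so names are unique
  set_names(paste(names(risk), rep(1:3, each = 3), sep = "_")) %>%
  # ".value" keeps rank/country/damage as columns; "block" records the triplet
  pivot_longer(everything(),
               names_to = c(".value", "block"),
               names_sep = "_") %>%
  select(-block) %>%
  arrange(rank)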

Section 3: Analysis

In this section, I’ll ask some specific questions about two of the data sets: Movies & Songs.

Movies

3.1

Which ten movies lost the most money domestically? Are these the same movies that lost the most money overall?

There are a few ways you can do this. The very simple way is to compute the profit, sort on the profit, and then show the two separate dataframes:

movies_tidy %>% 
  ### do the domestic profits
  # compute the domestic profit
  mutate(domestic_profit = domgross - budget) %>%
  # sort the data by domestic profit
  arrange(domestic_profit) %>%
  # grab the lowest 10 profits, which will be the
  # first ten rows:
  head(10) %>%
  kable()
year movie_name budget domgross intgross domestic_profit
2012 John Carter 2.75e+08 73058679 282778100 -201941321
2013 47 Ronin 2.25e+08 38362475 145803842 -186637525
2013 The Lone Ranger 2.75e+08 89289910 259989910 -185710090
2012 Battleship 2.09e+08 65235400 304142720 -143764600
2007 The Golden Compass 2.05e+08 70107728 372234864 -134892272
2013 Jack the Giant Slayer 1.95e+08 65187603 197387603 -129812397
2011 Mars Needs Moms 1.50e+08 21392758 39549758 -128607242
2010 Prince of Persia: The Sands of Time 2.00e+08 90759676 335059676 -109240324
2011 Hugo 1.80e+08 73864507 185033215 -106135493
2005 Stealth 1.38e+08 32116746 76416746 -105883254

You can do this again to get a second dataframe, this time sorted on overall profit:

movies_tidy %>% 
  ### do the total profits
  # compute the total profit
  mutate(profit = intgross - budget) %>%
  # sort the data by domestic profit
  arrange(profit) %>%
  # grab the lowest 10 profits, which will be the
  # first ten rows:
  head(10) %>%
  kable()
year movie_name budget domgross intgross profit
2011 Mars Needs Moms 1.50e+08 21392758 39549758 -110450242
2002 The Adventures of Pluto Nash 1.00e+08 4411102 7094995 -92905005
2013 47 Ronin 2.25e+08 38362475 145803842 -79196158
2001 Monkeybone 7.50e+07 5409517 5409517 -69590483
1999 The 13th Warrior 1.25e+08 32698899 61698899 -63301101
1997 The Postman 8.00e+07 17650704 17650704 -62349296
2005 Stealth 1.38e+08 32116746 76416746 -61583254
2001 Osmosis Jones 7.00e+07 13596911 13596911 -56403089
2003 Timeline 8.00e+07 19480739 26703184 -53296816
2013 R.I.P.D. 1.30e+08 33618855 79019947 -50980053

This is nice, but it makes it hard to see the exact relationship between the top 10 films in either category; we just see the 10 biggest loss-makers for one kind of profit, and then the 10 for the other. We cannot say, for instance, where the 10 domestic losers fall in the overall profit table. Interestingly, the only place we “slice” the data is at the head() call. So, we can string these two computations together, and just do the filtering at the end:

movies_tidy %>% 
  ### do the domestic profits
  # compute the domestic profit
  mutate(domestic_profit = domgross - budget) %>%
  # sort the data by domestic profit
  arrange(domestic_profit) %>%
  # assign the "row number" to each of these movies, so that
  # the movie with the smallest domestic profit gets "domestic_rank" of 1
  # and the next-smallest has a "domestic_rank" of 2
  mutate(domestic_rank = row_number()) %>%
  ### do the foreign profits
  # compute the overall profit
  mutate(profit = intgross - budget) %>%
  # sort the data by this profit
  arrange(profit) %>%
  # again, assign the row number of the sorted data to a rank variable
  mutate(rank = row_number()) %>% 
  # select out the columns we want to see
  select(year, movie_name, domestic_rank, domestic_profit, rank, profit) %>%
  ### keep only the top 10 movies either domestic or foreign:
  filter(domestic_rank <= 10 | rank <= 10) %>%
  # sort by domestic rank, just because:
  arrange(domestic_rank) %>%
  kable()
year movie_name domestic_rank domestic_profit rank profit
2012 John Carter 1 -201941321 510 7778100
2013 47 Ronin 2 -186637525 3 -79196158
2013 The Lone Ranger 3 -185710090 97 -15010090
2012 Battleship 4 -143764600 1210 95142720
2007 The Golden Compass 5 -134892272 1432 167234864
2013 Jack the Giant Slayer 6 -129812397 394 2387603
2011 Mars Needs Moms 7 -128607242 1 -110450242
2010 Prince of Persia: The Sands of Time 8 -109240324 1348 135059676
2011 Hugo 9 -106135493 450 5033215
2005 Stealth 10 -105883254 7 -61583254
2013 R.I.P.D. 15 -96381145 10 -50980053
2002 The Adventures of Pluto Nash 16 -95588898 2 -92905005
1999 The 13th Warrior 17 -92301101 5 -63301101
2001 Monkeybone 33 -69590483 4 -69590483
1997 The Postman 46 -62349296 6 -62349296
2003 Timeline 53 -60519261 9 -53296816
2001 Osmosis Jones 59 -56403089 8 -56403089

Now, we see that the movies that lost the most money domestically can have wildly different profits internationally. For example, John Carter lost the most money domestically, but actually made a profit internationally! Same with Prince of Persia and Hugo. But some movies did lose a lot of money both internationally and domestically, such as 47 Ronin or Mars Needs Moms.

3.2

What is the average budget for a movie in each year?

movies_tidy %>% 
  group_by(year) %>% 
  summarize(avg_budget = mean(budget, na.rm=T)) %>%
  tail(10) %>%
  kable()
year avg_budget
2004 49043124
2005 45754250
2006 41924722
2007 47113533
2008 49651485
2009 50151613
2010 51915116
2011 50357137
2012 64458120
2013 72097980

3.3

Which movie had the largest gap between domestic and overseas box office performance?

Depending on how you solve this problem, you may need to drop the NA values. You can use the drop_na() function.

movies_tidy %>%
  mutate(gap = abs(domgross - intgross)) %>% 
  drop_na(gap) %>%
  arrange(gap) %>% 
  select(year, movie_name, gap) %>%
  tail(10) %>% kable()
year movie_name gap
2009 Ice Age: Dawn of the Dinosaurs 690394990
2012 The Hobbit: An Unexpected Journey 711700000
2003 The Lord of the Rings: The Return of the King 763562762
2011 Transformers: Dark of the Moon 771403533
2011 Pirates of the Caribbean: On Stranger Tides 802600000
2013 Iron Man 3 803700000
2012 Skyfall 804333804
2011 Harry Potter and the Deathly Hallows: Part 2 947100000
1997 Titanic 1527000000
2009 Avatar 2023411357

3.4

Make a visualization that shows how the budget for movies has increased over time. Discuss how you designed the plot in order to emphasize this message.

You may think that something like the following plot illustrates this well:

ggplot(movies_tidy, aes(x=year, y=budget/1000000)) + 
  geom_jitter(alpha=.25) +
  geom_smooth(se=F, color='orangered') + 
  xlab("Year Movie was Made") + 
  ylab("Budget (Millions USD)")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

But, visually, half of this plot is empty! There are a few ways you can address this. First, you could keep the same plot, but focus the y-axis on the range where most budgets (and the average) fall:

ggplot(movies_tidy, aes(x=year, y=budget/1000000)) + 
  geom_jitter(alpha=.1) +
  geom_smooth(se=F, color='orangered') + 
  xlab("Year Movie was Made") + 
  ylab("Budget (Millions USD)") + 
  ylim(0,100)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 191 rows containing non-finite values (stat_smooth).
## Warning: Removed 204 rows containing missing values (geom_point).

Or, you could change the scale:

ggplot(movies_tidy, aes(x=year, y=budget/1000000)) + 
  geom_jitter(alpha=.1) +
  geom_smooth(se=F, color='orangered') + 
  xlab("Year Movie was Made") + 
  ylab("Budget (Millions USD)") + 
  scale_y_log10(limits=c(.1, 200))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 35 rows containing non-finite values (stat_smooth).
## Warning: Removed 44 rows containing missing values (geom_point).

3.5

Make a visualization that shows how the typical profit movies make has generally not changed over time, but that a few outliers do make increasingly more money. Discuss how you designed the plot in order to emphasize this message.

A very simple plot to do this might consider the following:

movies_tidy %>% 
  mutate(profit=intgross - budget) %>%
  ggplot(aes(x=year, y=profit)) +
  geom_jitter(alpha=.1) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).

A more sophisticated perspective might aim to strategically define the outliers, and then label them separately:

outlier_earners <- movies_tidy %>%
  # get the profit
  mutate(profit = intgross - budget) %>%
  # drop any movies that have NA profits
  drop_na(profit) %>%
  ### within each year
  group_by(year) %>%
  # arrange the movies by profit
  arrange(desc(profit)) %>%
  # then compute the gap between each movie and the next-lower earner
  mutate(profit_gap_to_next_movie = profit - lead(profit)) %>%
  # sort on this value
  arrange(profit_gap_to_next_movie) %>% 
  # and drop any NAs from this
  drop_na(profit_gap_to_next_movie) %>%
  # take the 20 movies that have the largest gap between their
  # earnings and the next highest earnings:
  tail(20)
outlier_earners %>% kable()
year movie_name budget domgross intgross profit profit_gap_to_next_movie
2003 The Lord of the Rings: The Return of the King 9.40e+07 377845905 1141408667 1047408667 234943111
1973 The Exorcist 1.20e+07 204868002 402735134 390735134 236618807
1994 The Lion King 7.93e+07 422780140 952880140 873580140 249179615
1999 Star Wars: Episode I - The Phantom Menace 1.15e+08 474544677 1007044677 892044677 259238385
1972 The Godfather 7.00e+06 134966411 268500000 261500000 261098198
2011 Harry Potter and the Deathly Hallows: Part 2 1.25e+08 381011219 1328111219 1203111219 274317143
1977 Close Encounters of the Third Kind 2.00e+07 166000000 337700000 317700000 283448575
1985 Back to the Future 1.90e+07 210609762 383874862 364874862 286285161
1994 Forrest Gump 5.50e+07 329694499 679400525 624400525 290780396
1981 Raiders of the Lost Ark 2.00e+07 248159971 389925971 369925971 296955634
1975 Jaws 1.20e+07 260000000 470700000 458700000 320023583
2001 The Lord of the Rings: The Fellowship of the Ring 1.09e+08 315544750 887217688 778217688 333459969
1996 Independence Day 7.50e+07 306169255 817400878 742400878 334500878
1983 Star Wars: Episode VI - Return of the Jedi 3.25e+07 309205079 572700000 540200000 345736426
1980 Star Wars: Episode V - The Empire Strikes Back 2.30e+07 290271960 534171960 511171960 431218421
1977 Star Wars 1.10e+07 460998007 797900000 786900000 469200000
1993 Jurassic Park 6.30e+07 395708305 1035626872 972626872 556340869
1982 E.T.: The Extra-Terrestrial 1.05e+07 435110554 792965326 782465326 620265326
1997 Titanic 2.00e+08 658672302 2185672302 1985672302 1273985623
2009 Avatar 4.25e+08 760507625 2783918982 2358918982 1561950287

You can see that, clearly, these are the big blockbuster movies that earn a ton of money. So, we can plot the outliers and non-outliers as separate datasets on the same figure to highlight where the outliers sit:

non_outliers = movies_tidy %>% 
  # This ensures that movies that are in the outliers are 
  # dropped from the "non-outliers"
  # breaking it down, 
  # ! means "not", 
  # movie_name %in% outlier_earners$movie_name is True when 
  # the movie_name is contained within the set of outlier_earners
  # movie names.
  filter(!(movie_name %in% outlier_earners$movie_name)) %>%
  # Then, we'll need the profit variable to do the plotting
  mutate(profit=intgross - budget)

ggplot() + 
  # non-outlier points
  geom_jitter(data=non_outliers, 
             aes(x=year, y=profit), color='black', alpha=.1) +
  # non-outlier profit trend
  geom_smooth(data=non_outliers,
            aes(x=year, y=profit), se=F) +
  # outliers
  geom_point(data=outlier_earners, 
             aes(x=year, y=profit), color='red') + 
  # trend among outliers
  geom_smooth(data=outlier_earners, 
              aes(x=year, y=profit), color='red', se=F)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 11 rows containing missing values (geom_point).

3.6(c)

You’re a data scientist working for a movie studio. Your executive is considering whether to take a risk on making a “new” movie, or whether it’d be a safer bet to make a sequel to an existing movie. So, she asks:

Do sequels make more profit per dollar spent than non-sequels?

Can you answer her question?6

So, let’s first look at the movies that have a “2” or “II” in their title:

is_sequel = str_detect(movies_tidy$movie_name, "2|II")
movies_tidy %>% 
  mutate(is_sequel = is_sequel) %>%
  filter(is_sequel) %>% kable()
year movie_name budget domgross intgross is_sequel
1974 The Godfather: Part II 13000000 57300000 57300000 TRUE
1981 Friday the 13th Part 2 1250000 21722776 21722776 TRUE
1981 Halloween II 2500000 25533818 25533818 TRUE
1983 Superman III 39000000 59950623 59950623 TRUE
1984 Star Trek III: The Search for Spock 18000000 76471046 87000000 TRUE
1987 Evil Dead II 3500000 5923044 5923044 TRUE
1989 Back to the Future Part II 40000000 118450002 332000000 TRUE
1990 Back to the Future Part III 40000000 87666629 243700000 TRUE
1990 Child's Play 2 13000000 26904572 34166572 TRUE
1990 Die Hard 2 70000000 117323878 239814025 TRUE
1991 Teenage Mutant Ninja Turtles II: The Secret of the Ooze 25000000 78656813 78656813 TRUE
1991 Terminator 2: Judgement Day 100000000 204859496 516816151 TRUE
1991 The Naked Gun 2 1/2: The Smell of Fear 23000000 86930411 86930411 TRUE
1992 Home Alone 2: Lost in New York 20000000 173585516 358994850 TRUE
1998 Halloween H20: 20 Years Later 17000000 55041738 55041738 TRUE
1999 Toy Story 2 90000000 245852179 511358276 TRUE
2000 28 Days 43000000 37035515 62063972 TRUE
2000 Pokemon: The Movie 2000 30000000 43746923 133946923 TRUE
2002 28 Days Later… 8000000 45064915 82955633 TRUE
2002 Blade II 54000000 81676888 154338601 TRUE
2002 Star Wars: Episode II - Attack of the Clones 115000000 310676740 656695615 TRUE
2002 The Santa Clause 2 65000000 139225854 172825854 TRUE
2003 21 grams 20000000 16248701 59667625 TRUE
2003 Cradle 2 the Grave 25000000 34657731 56434942 TRUE
2003 Jeepers Creepers II 25000000 35623801 35623801 TRUE
2003 Legally Blonde 2: Red, White & Blonde 25000000 90639088 125339088 TRUE
2003 X2 (X-Men 2) 125000000 214949694 407711549 TRUE
2004 2046 12000000 1442338 19202856 TRUE
2004 The Princess Diaries 2: Royal Engagement 45000000 95149435 122071435 TRUE
2005 Miss Congeniality 2: Armed & Fabulous 60000000 48478006 101382396 TRUE
2005 Star Wars: Episode III - Revenge of the Sith 115000000 380270577 848998877 TRUE
2006 Clerks II 5000000 24148068 25894473 TRUE
2006 The Grudge 2 20000000 39143839 70743839 TRUE
2007 28 Weeks Later 15000000 28638916 64232714 TRUE
2007 Hostel: Part II 7500000 17544812 33606409 TRUE
2007 White Noise 2: The Light 10000000 NA 8243567 TRUE
2008 The Sisterhood of the Traveling Pants 2 27000000 44089964 44270131 TRUE
2009 2012 200000000 166112167 788408539 TRUE
2009 The Boondock Saints II: All Saints Day 8000000 10273187 10273187 TRUE
2009 The Taking of Pelham 123 110000000 65452312 152364370 TRUE
2010 127 Hours 18000000 18335230 60735230 TRUE
2010 Iron Man 2 170000000 312433331 623561331 TRUE
2010 Paranormal Activity 2 3000000 84752907 177512032 TRUE
2010 Sex and the city 2 95000000 95347692 294680778 TRUE
2011 Cars 2 200000000 191450875 560155383 TRUE
2012 Men in Black III 215000000 179020854 624821154 TRUE
2012 The Twilight Saga: Breaking Dawn - Part 2 136200000 292324737 832660037 TRUE
2013 2 Guns 61000000 75612460 132493015 TRUE
2013 42 40000000 95020213 95020213 TRUE
2013 Cloudy with a Chance of Meatballs 2 78000000 119640264 271725448 TRUE
2013 Kick-Ass 2 28000000 28795985 60839197 TRUE
2013 Red 2 84000000 53262560 137162560 TRUE
2013 The Smurfs 2 110000000 71017784 348545841 TRUE

Looks pretty good! So, let’s look at the “return”, or profit per dollar, of these two sets of movies:

movies_tidy %>% 
  mutate(is_sequel=is_sequel) %>% 
  mutate(profit = intgross - budget) %>% 
  mutate(return_on_investment = profit/budget) %>%
  group_by(is_sequel) %>% 
  summarise(median_return = median(return_on_investment, na.rm=T),
            ) %>% 
  kable()
is_sequel median_return
FALSE 1.622181
TRUE 2.261692
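As a quick arithmetic check on the comparison below, using the two medians just printed:

(2.261692 - 1.622181) / 1.622181  # ~0.39, i.e. roughly 39% higher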

The median sequel has roughly 39% higher returns than the median non-sequel. But! There is a very long tail:

movies_tidy %>% 
  mutate(is_sequel=is_sequel) %>% 
  mutate(profit = intgross - budget) %>% 
  mutate(return_on_investment = profit/budget) %>%
  ggplot(aes(x=return_on_investment, group=is_sequel, color=is_sequel)) +
  geom_boxplot()
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).

You can see that the non-sequels have some amazing returns on investment (400 here means that the profit is 400 times the budget!), but this long tail is very sparse compared to the middle of the data. Let’s focus in on the medians, which are somewhere between 1 and 2:

movies_tidy %>% 
  mutate(is_sequel=is_sequel) %>% 
  mutate(profit = intgross - budget) %>% 
  mutate(return_on_investment = profit/budget) %>%
  ggplot(aes(x=return_on_investment, group=is_sequel, color=is_sequel)) +
  geom_boxplot() + 
  xlim(-1,4)
## Warning: Removed 464 rows containing non-finite values (stat_boxplot).

So, we see that sequels tend to have slightly higher returns on investment on average, but that the long tail of non-sequels can yield some seriously high returns, if you get lucky.

Songs

3.7

What’s your best guess about the length of a track with higher than 75% popularity? How about your best guess for the popularity of a track between 2 and 3 minutes long? Which “piece” of information (popularity > 75% or duration between 2 & 3 minutes) gives you more useful information, and why do you think that?7

ggplot(spotify, aes(x=duration_ms/1000/60, y=track_popularity)) +
  geom_hex() + 
  xlab("Duration (minutes)") + 
  ylab("Track Popularity") + 
  scale_fill_viridis_c(option='plasma')

Just describing what we see, most of the popular songs are somewhere between 2.5 and 3 minutes. You can also see that songs of about this length (between 2.5 and 3 minutes) can have basically any popularity… it seems like there are a lot of 3-minute tracks that have zero popularity! Thus, knowing a song is popular tells you much more about its duration than knowing a song is 3 minutes long tells you about its popularity.

Doing this in a table, rather than graphically:

spotify %>% 
  mutate(duration_mins = duration_ms/1000/60) %>%
  filter(track_popularity > 75) %>%
  summarize(mean_duration = mean(duration_mins),
            sd_duration=sd(duration_mins)) %>% kable()
mean_duration sd_duration
3.554781 0.7395706
spotify %>% 
  mutate(duration_mins = duration_ms/1000/60) %>%
  filter((2 <= duration_mins)&(duration_mins <= 3)) %>%
  summarize(mean_popularity = mean(track_popularity),
            sd_popularity=sd(track_popularity)) %>% kable()
mean_popularity sd_popularity
45.48043 23.05022

You can see that the standard deviation of the track popularity is about half of the average popularity of songs between 2 and 3 minutes, while the standard deviation of the track duration is only about a fifth of the mean. This suggests that there is much more variability around the mean popularity for songs between 2 and 3 minutes than there is in the duration of songs at 75% popularity or above!
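That comparison can be made explicit as a coefficient of variation (the standard deviation divided by the mean), computed here directly from the two summary tables above:

c(duration_cv   = 0.7395706 / 3.554781,   # ~0.21: about a fifth
  popularity_cv = 23.05022 / 45.48043)    # ~0.51: about a half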

3.8 (c)

What is the typical “energy” of each of the playlist genres? How about the typical “valence,” meaning “happiness,” of the genres?

spotify %>% 
  group_by(playlist_genre) %>% 
  summarize(valence = median(valence), energy = median(energy)) %>%
  kable()
playlist_genre valence energy
edm 0.370 0.830
latin 0.628 0.729
pop 0.500 0.727
r&b 0.542 0.596
rap 0.517 0.665
rock 0.531 0.775

3.9 (c)

Make four plots8 to visualize the relationship between danceability and a few variables:

  • tempo
  • energy
  • valence
  • playlist_genre

Make sure to take into account whether the relationship is linear and address overplotting if necessary. Given these plots, what kinds of songs tend to be danceable?

library(ggpubr)

tempo <- ggplot(spotify, 
                aes(x=tempo, y=danceability)) +
  geom_point(alpha=.1) + 
  geom_smooth(aes(color=playlist_genre), se=F)
energy <- ggplot(spotify, aes(x=energy, y=danceability)) + 
  geom_point(alpha=.1) + 
  geom_smooth(aes(color=playlist_genre), se=F)
valence <- ggplot(spotify, aes(x=valence, y=danceability)) + 
  geom_point(alpha=.1) + 
  geom_smooth(aes(color=playlist_genre), se=F)
playlist_genre <- ggplot(spotify, 
                         aes(group=playlist_genre, 
                             y=danceability, 
                             color=playlist_genre)) + 
  geom_boxplot() 

ggarrange(tempo, energy, valence, playlist_genre, ncol=2, nrow=2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Interpreting this: (1a) most songs are danceable if they are around 100 to 120 beats per minute; as songs become too fast or too slow, they become less danceable; (1b) but latin songs have a “second peak” (slightly below 200 BPM) where songs get a little more danceable again; (2) songs of moderate energy tend to be danceable, but latin songs get more danceable as energy rises, whereas low-energy rap songs tend to be more danceable; (3) as songs get happier, they consistently get more danceable, regardless of genre; (4) rock songs are clearly less danceable in general, and the most danceable genre is rap, closely followed by latin.

3.10 (c)

Let’s assume that the difference between a band’s median track popularity and its maximum track popularity represents that band’s one-hit-wonderness. If this is the case, what are the Top 10 “one-hit-wonder” bands in the dataset? Given the results, does this comport with your understanding of what a “one hit wonder” is?

one_hit_wonders = spotify %>% 
  group_by(track_artist) %>% 
  summarize(one_hit_wonderness = max(track_popularity) - median(track_popularity)) %>%
  arrange(desc(one_hit_wonderness))
one_hit_wonders %>% head(10) %>% kable()
track_artist one_hit_wonderness
Lynyrd Skynyrd 81
Duki 76
50 Cent 75
Childish Gambino 75
KISS 74
YG 74
Sam Feldt 73
John Newman 72
Sheck Wes 72
Thin Lizzy 71

Sure; Lynyrd Skynyrd makes sense given the perennial popularity of “Sweet Home Alabama,” but you may not know who, say, Duki or YG are. And, alternatively, KISS and Thin Lizzy are certainly not one-hit wonders; they have a ton of very well-rated songs. So, one thing this measure of one-hit-wonderness may be missing is a measure of how many albums the artist has, or how many popular songs the artist has, as sketched below.
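A hedged sketch of that refinement: also count each artist’s distinct tracks, so that prolific artists like KISS can be screened out (track_id is the unique song identifier shown earlier; the cutoff you’d apply to n_tracks is a judgment call):

one_hit_wonders_v2 <- spotify %>% 
  group_by(track_artist) %>% 
  summarize(one_hit_wonderness = max(track_popularity) - median(track_popularity),
            n_tracks = n_distinct(track_id)) %>%
  arrange(desc(one_hit_wonderness))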


  1. You may also find the original Tidy Data paper useful in describing the different commonly-encountered issues in data formatting.

  2. Remember: I’m just using the knitr::kable() function to print the table all pretty-like in the RMarkdown.

  3. This is scraped using the spotifyr package.

  4. danceability measures how suitable a track is for dancing; it varies from 0 (not danceable) to 1 (danceable). energy is a measure of the “intensity” of the song, varying from 0 (very relaxing, Shoegaze) to 1 (very energetic, Death Metal). loudness is the average decibel value of the track, varying from -60 to zero. speechiness gives the level of “talkiness” in a track, varying from 0 (no spoken words) to 1 (all spoken words); tracks over .66 are probably nearly all spoken word, and tracks below .33 are probably songs. acousticness is like speechiness, but measures whether the instruments are not amplified; liveness is similar, but measures whether the track was performed in front of a live audience. valence tells you whether a track is “happy,” with higher scores indicating happier music. Finally, tempo records the average beats per minute for the track and duration_ms provides the duration of the song in milliseconds.

  5. Note that I’m using the name_repair option of read_csv() in order to get exactly the column names that are in midterm-risk.csv. Without this option, readr::read_csv will append extra values to the column names in order to ensure they are each unique. Try this by removing the name_repair argument, or setting it equal to "unique" instead of "minimal".

  6. Note that you can use the str_detect() function in the tidyverse’s stringr package to give TRUE when a movie name contains 2 or a II, and FALSE otherwise. Also, it’s ok if you accidentally pick up movies that have III or IIII or H20 in their name using this strategy; we’re just making an educated guess.

  7. You may find it helpful to make a plot!

  8. It’s OK if they’re totally separate plots! That is, I don’t expect you to use facet_grid().