10 Assignments

Below you will find:

  • A description of how code will be turned in and evaluated.

  • Four assignments that you will complete during the semester.

  • A description of the Project.

10.1 Turning in code

  1. All submitted files must be .r files that are annotated and include descriptive header information. Annotation and descriptive header information are reviewed in the pages of this course’s website.

  2. For assignments that require you to submit multiple files, all of the files must run error-free and accomplish the tasks for credit to be earned.

  3. You will email me your .r files.

10.2 Assignment 01 - Working in R and RStudio

10.2.1 Assignment 01.01

  • Make a new script in the ‘code’ directory called ‘init.proj.r’.

  • The code will populate the working directory with the following sub-directories: “R”, “data”, “doc”, “figs”, and “output”.

  • The goal of the script is to automatically create the directory structure outlined in (https://nicercode.github.io/blog/2013-04-05-projects/).
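As a rough illustration of the technique (not a model answer), directories can be created from a character vector with dir.exists() and dir.create(). The helper name make_dirs, the folder demo.proj, and the use of tempdir() are assumptions made only for this sketch.

```r
# Hypothetical helper: create each sub-directory in `subdirs` under `root`.
make_dirs <- function(root, subdirs) {
  for (d in subdirs) {
    path <- file.path(root, d)
    # dir.create() warns if the directory already exists, so check first
    if (!dir.exists(path)) {
      dir.create(path, recursive = TRUE)
    }
  }
  # Report (invisibly) which of the requested directories now exist
  invisible(dir.exists(file.path(root, subdirs)))
}

# Usage on a throwaway location so no real project folder is touched
demo_root <- file.path(tempdir(), "demo.proj")
created <- make_dirs(demo_root, c("data", "figs"))
print(list.dirs(demo_root, full.names = FALSE))
```

Checking dir.exists() first makes the script safe to run more than once.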

10.2.2 Assignment 01.02

  • Make a new script called ‘import.and.write.r’

  • This script will do the following:

  1. Write the entire mtcars data as a csv file to the ‘data’ directory (mtcars is in the package ‘datasets’)

  2. Import the .csv file you wrote, selecting only the first 10 rows.

  3. From this imported and reduced data frame, make a new data frame that includes a column that is the hp per cylinder (divide hp by cyl)

  4. Name this column “hp.per.cyl”.

  5. Print the data frame to the screen

  6. Save the data frame to the ‘data’ directory as mtcars.red.RData (it will be saved as file type .RData).
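The write, import, derive, and save steps can be sketched on a different built-in data set. This minimal example uses airquality and a temporary folder as a stand-in for the ‘data’ directory; the object names, file names, and the column temp.per.wind are assumptions for illustration only.

```r
# Stand-in for the project's 'data' directory (assumption for this sketch)
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Write a built-in data frame to a csv file
csv_path <- file.path(out_dir, "airquality.csv")
write.csv(airquality, csv_path, row.names = FALSE)

# nrows limits how many data rows are read back in
aq_small <- read.csv(csv_path, nrows = 5)

# Add a derived column by dividing one column by another
aq_small$temp.per.wind <- aq_small$Temp / aq_small$Wind
print(aq_small)

# save() writes the named object to an .RData file; load() would restore it
rdata_path <- file.path(out_dir, "aq.small.RData")
save(aq_small, file = rdata_path)
```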

10.2.3 Assignment 01.03

  • Make a new script called ‘index.practice.r’

  1. Given x <- c("ww", "ee", "ff", "uu", "kk"), use indexing to return:

     1. "ee", "ff"
     2. "ee"
     3. "ff"

  2. Create a <- c(2, 4, 6, 8) and b <- c(TRUE, FALSE, TRUE, FALSE). Write code to determine the maximum and minimum values of a[b].

  3. Write an R expression that will return the sum of 10 for the vector x <- c(2, 1, 4, 2, 1, NA)

  4. Write an R expression that will return the sum of 11 for the vector x <- c(2, 1, 4, 2, 1, NA)

  5. Consider the data frame s <- data.frame(first = as.factor(c("x", "y", "a", "b", "x", "z")), second = c(2, 4, 6, 8, 10, 12)). Write two different R statements that will return the output 2, 4, 10 by using the variable first as an index vector.

  6. Write two different R statements that will return the positions of 3 and 7 in the vector x <- c(1, 3, 6, 7, 3, 7, 8, 9, 3, 7, 2).
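For orientation, the main indexing styles these exercises draw on can be sketched on an unrelated vector. The vector v and the result names are made up for this example.

```r
v <- c(10, 20, 30, 20, NA)

by_position <- v[2:3]                  # elements at positions 2 and 3
drop_first  <- v[-1]                   # negative index drops a position
by_logical  <- v[c(TRUE, FALSE, TRUE, FALSE, FALSE)]  # keeps TRUE positions
pos_of_20   <- which(v == 20)          # positions where the test is TRUE; NA is skipped
not_missing <- v[!is.na(v)]            # logical test used directly as an index

print(by_position)
print(pos_of_20)
print(not_missing)
```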

10.2.4 Assignment 01.04

  • Make a new script called ‘logical.operators.practice.r’

Using the mtcars data:

  1. Use logical operators to output only those rows of data where column mpg is between 15 and 20 (excluding 15 and 20).

  2. Use logical operators to output only those rows of data where column cyl is equal to 6 and column am is not 0.

  3. Output only those rows of data where columns vs and am both have the value 1.

  4. Output only those rows of data where at least one of vs or am has the value 1.
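A minimal sketch of row filtering with logical operators, shown on the built-in airquality data rather than mtcars. The thresholds and object names are arbitrary choices for this example.

```r
# & keeps rows where both conditions are TRUE
hot_and_calm <- airquality[airquality$Temp > 90 & airquality$Wind < 8, ]

# | keeps rows where at least one condition is TRUE
may_or_sept <- airquality[airquality$Month == 5 | airquality$Month == 9, ]

# != keeps rows where the values differ; subset() evaluates the
# condition inside the data frame, so the column name can be used bare
not_may <- subset(airquality, Month != 5)

print(hot_and_calm)
print(nrow(may_or_sept))
print(nrow(not_may))
```

Note the trailing comma inside the square brackets: the logical vector selects rows, and the empty second position keeps every column.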

10.2.5 Assignment 01.05

  • Make a new script called ‘loops.practice.r’

Using the mtcars data:

  1. Write a for loop that iterates over the first seven values of the first numeric column in mtcars and prints the cube of each number using print().

  2. Write a for loop that iterates over the first seven values of the first three numeric columns in mtcars and prints values using print(). Save these values inside the loop in an object called first.vals.

  3. Write a for loop that iterates over the column names of the “iris” dataset. The loop will print each column name and the number of characters in the column name in parentheses. Example output for one iteration of the loop is: “Sepal.Length (12)”. Use the following functions: print(), paste() and nchar().

  4. Write a while loop that prints out standard normal random numbers (use rnorm()) but stops (breaks) if you get a number bigger than 1.

  5. Use a while loop to investigate the number of terms required before the product (1)(2)(3)(4)…(n) reaches a value greater than 10 million.
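The two loop forms can be sketched on unrelated toy problems. The values and object names below are assumptions chosen only to show the pattern: a for loop that stores results in a pre-allocated vector, and a while loop that counts iterations until a threshold is crossed.

```r
# for loop: iterate by position and store each result
vals <- c(1.5, 2, 3)
squares <- numeric(length(vals))       # pre-allocate the result vector
for (i in seq_along(vals)) {
  squares[i] <- vals[i]^2
  print(squares[i])
}

# while loop: how many terms of 1 + 2 + 3 + ... are needed to exceed 100?
total <- 0
n <- 0
while (total <= 100) {
  n <- n + 1
  total <- total + n
}
print(paste0("terms needed: ", n, " (sum = ", total, ")"))
```

Using seq_along() rather than 1:length(vals) avoids a surprise when the vector is empty.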

10.3 Assignment 02

  • Make a new script, ‘tidy.fundamentals.r’

  • Using the nycflights13::flights data, and using only pipes for all exercises:

  1. Find all flights that had an arrival delay of two hours or less, flew to HOU, SFO, LAX, MSY, or ATL, and were not operated by American. From the flights that satisfy these criteria, print a table that reports, for each destination, the number of flights, the minimum, median, and maximum delay (in hours), and the percent of these flights that were operated by Delta. Sort the summary table in alphabetical order of the destination airport.

  2. Report the six airlines that have the fastest flights (highest speed) and list these with the airline, destination, and name of airline.

  3. What is the average delay for the seven airlines that are alphabetically the last? What is the difference in average delay of these airlines vs. those seven that are alphabetically the first?

  4. From the tables in R4DS12, compute the rate for table2, and table4a + table4b. You will need to perform four operations:

     1. Extract the number of TB cases per country per year.
     2. Extract the matching population per country per year.
     3. Divide cases by population, and multiply by 10000.
     4. Store back in the appropriate place.

  5. Using the Lahman package, merge Lahman::Batting and Lahman::Fielding by each common variable, starting with player ID. Following the merge, for the years 1990 to 2016, report the mean batting and fielding statistics for the five members of the San Francisco Giants that had the maximum number of home runs (HR). How do these compare, using a similar extraction, to the LA Dodgers?

  6. Using the Lahman package, find the average batting statistics only for pitchers that had at least 10 wins in any season of their career.
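The filter, group, summarise, and sort pattern that the flights exercises call for can be sketched on mtcars. This assumes the dplyr package is installed; the thresholds, grouping column, and summary names are arbitrary and are not taken from the assignment.

```r
library(dplyr)

summary_tbl <- mtcars %>%
  # comma-separated conditions in filter() are combined with AND
  filter(mpg > 15, cyl %in% c(4, 6)) %>%
  group_by(gear) %>%
  summarise(
    n_cars     = n(),                   # rows per group
    min_hp     = min(hp),
    median_hp  = median(hp),
    max_hp     = max(hp),
    pct_manual = 100 * mean(am == 1),   # percent of rows meeting a condition
    .groups = "drop"
  ) %>%
  arrange(gear)                         # sort the summary table

print(summary_tbl)
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is a compact way to compute a percentage inside summarise().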

10.4 Assignment 03

  1. Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

  2. What does str_wrap() do? When might you want to use it?

  3. What does str_trim() do? What’s the opposite of str_trim()?

  4. Write a function that turns (e.g.) a vector c(“a”, “b”, “c”) into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

  5. Using the mpg data, create one plot on the fuel economy data with customised title, subtitle, caption, x, y, and colour labels.

  6. The geom_smooth() is somewhat misleading because the hwy for large engines is skewed upwards due to the inclusion of lightweight sports cars with big engines. Use your modelling tools to fit and display a better model. Hint: you will need to mutate a new vector that is derived from a model. Then use ggplot2 functionality to plot the model on top of the data.
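As a starting point for the string exercises, the basic behaviour of str_length() and str_sub() can be sketched as follows. This assumes the stringr package is installed; the example string is arbitrary.

```r
library(stringr)

s <- "tidyverse"

n_chars <- str_length(s)        # number of characters in the string
first4  <- str_sub(s, 1, 4)     # characters 1 through 4
last5   <- str_sub(s, -5, -1)   # negative positions count from the end

print(n_chars)
print(first4)
print(last5)
```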

10.5 Assignment 04

10.6 Project

10.6.1 Project Requirement 01:

The focus of the project is not to review the actual scientific implications (statistical inference and hypothesis testing). Instead, the focus of the project will be data exploration, using R4DS approaches.

You will upload a .Rmd file that I will knit to an .html file.

This document will include your name and the title of your project.

You will write approximately 800 words allocated across the following sections:

1.) Introduction, three sentences maximum.

2.) Data Description, four sentences maximum. The data must contain more than 2,000 records and more than 10 variables. The data must be publicly available on the web. As part of your data description, your .Rmd file will obtain data directly from the web, load the data, and then you will write code to display it (raw or summarized, your choice) to help with your explanation of the structure and scope of the data. You will have a minimum of three tables or figures to describe the scope of the data; these must be generated by code that wrangles your imported data.

3.) Describe the deliverables (at least three) that you will accomplish (one sentence per deliverable). These will likely be testable hypotheses or concise objectives that you will evaluate by wrangling data, visualization, and analysis.

4.) The computational methods section will use the remainder of your text. You will describe the general and then specific approaches you will use to evaluate the data. Describe in detail how you will manipulate the data with code that you will write (i.e. describe the computational methods in detail). You may want to provide code to describe your approach (this will not count in your word limit).

5.) Present the project idea to the class - 20 minute presentation using only material you have provided to me in the .Rmd file.

The document will be titled “Name.Proj.Prop.Rmd”, where “Name” is your last name. You are very welcome to meet with me in person or by phone to discuss the project scope and objectives prior to submitting the .Rmd file.
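The size requirement in the Data Description section can be checked in code before you commit to a data set. The sketch below is one possible approach: the helper name meets_size_requirement, the placeholder URL in the comment, and the synthetic data frame are all assumptions for illustration.

```r
# Hypothetical helper: TRUE when `df` has more than `min_rows` records
# and more than `min_cols` variables
meets_size_requirement <- function(df, min_rows = 2000, min_cols = 10) {
  nrow(df) > min_rows && ncol(df) > min_cols
}

# In a proposal the data would be read from the web, for example
#   dat <- read.csv("https://example.org/your-data.csv")   # placeholder URL
# A synthetic data frame stands in here so the sketch runs offline.
dat <- as.data.frame(matrix(rnorm(2500 * 12), nrow = 2500, ncol = 12))

print(dim(dat))                          # records and variables
print(meets_size_requirement(dat))       # 2,500 records and 12 variables
print(meets_size_requirement(mtcars))    # 32 records and 11 variables
str(dat[, 1:3])                          # compact look at the first three columns
```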

Rubric:

  • Did you follow the requirements for length? 2 of 20 points.

  • Do data correspond to the dimension requirements? 2 of 20 points.

  • Did you follow Tufte’s philosophy in your data representation? 4 of 20 points.

  • Are the deliverables sufficiently interesting and complex to require substantial analysis? 6 of 20 points.

  • Do the computational methods include code that does the following (4 of 20 points):

  1. Covers all (five facets) of tidyverse fundamentals, maximizing the use of pipes?
  2. Includes loops to showcase your command of this method?
  3. Exhibits R4DS workflow?
  4. Exhibits advanced and elegant graphic display?
  5. Uses advanced statistical and analytical methods?

10.6.2 Project Requirement 02:

You will prepare a .Rmd file of your project. I will knit this to a .html file that you will present to the class.

The grade will be assigned based on the level of complexity of the analysis and the amount of original coding; the rubric above will also be used.

Full credit will be given to projects that showcase your expertise in all aspects of the techniques discussed in this class, and I will follow the rubric below. As above, only projects that contain every aspect of import, tidy, transform, visualize, model, and communicate can be given full credit. The code you provide in your project will be extensive and will document your mastery of the tidyverse and data visualization, and will communicate original and interesting analysis. My expectation is that the coding will be extensive and demonstrate the philosophy of reproducible workflow that we have studied. The .Rmd file must knit to .html on my machine.

Rubric:

  • Did you follow Tufte’s philosophy in your data representation? 5 of 20 points.

  • Are the deliverables sufficiently interesting and complex to require substantial analysis? 5 of 20 points.

  • Do the computational methods include code that does the following (5 of 20 points):

  1. Covers all (five facets) of tidyverse fundamentals, maximizing the use of pipes?
  2. Includes loops to showcase your command of this method?
  3. Exhibits R4DS workflow?
  • Is the project complex and appropriate to the expectations of graduate level instruction (5 of 20 points)?