Assignment6

Abigail Griffin

2023-02-03

Daily Assignment 6

Data manipulations using the dplyr package - use the dplyr and tidyverse packages.

1: Examine the structure of the iris data set. How many observations and variables are in the data set?

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(dplyr) # dplyr is already attached by library(tidyverse); loading it again is harmless and makes the dependency explicit
str(iris) # there are 150 observations (rows) and 5 variables (columns)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
glimpse(iris) # shows a glimpse of the data - makes it possible to see every column in df
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
data(iris)

2: Create a new data frame iris1 that contains only the species virginica and versicolor with sepal lengths longer than 6 cm and sepal widths greater than 2.5 cm. How many observations and variables are in the data set?

iris1<-filter(iris, Species == "virginica" | Species == "versicolor", Sepal.Length>6, Sepal.Width>2.5) # keep rows where Species is "virginica" OR ( | ) "versicolor", AND Sepal.Length > 6, AND Sepal.Width > 2.5
# I had to correct my code here. I originally had:
#   iris1<-filter(iris, Species == c("virginica","versicolor"), Sepal.Length>6, Sepal.Width>2.5)
# This was incorrect because == compares element by element and recycles the shorter vector, so each row was compared with only one of the two names (alternating down the rows). Matching rows that lined up with the other name were dropped, which excluded about half the values.
# Testing each species separately and joining the tests with | returns the correct df. Alternatively, I could have used the %in% operator, which tests each value for membership in a set.
glimpse(iris1)
## Rows: 56
## Columns: 5
## $ Sepal.Length <dbl> 7.0, 6.4, 6.9, 6.5, 6.3, 6.6, 6.1, 6.7, 6.1, 6.1, 6.4, 6.…
## $ Sepal.Width  <dbl> 3.2, 3.2, 3.1, 2.8, 3.3, 2.9, 2.9, 3.1, 2.8, 2.8, 2.9, 3.…
## $ Petal.Length <dbl> 4.7, 4.5, 4.9, 4.6, 4.7, 4.6, 4.7, 4.4, 4.0, 4.7, 4.3, 4.…
## $ Petal.Width  <dbl> 1.4, 1.5, 1.5, 1.5, 1.6, 1.3, 1.4, 1.4, 1.3, 1.2, 1.3, 1.…
## $ Species      <fct> versicolor, versicolor, versicolor, versicolor, versicolo…
# There are 56 observations and 5 variables.
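
The difference between == and %in% described above can be checked directly. This is a minimal sketch that assumes only the built-in iris data and dplyr; the object names recycled and member are made up for the illustration.

```r
library(dplyr)

# == recycles c("virginica", "versicolor") down the 150 rows, so each row is
# compared with only one of the two names (odd rows with "virginica",
# even rows with "versicolor")
recycled <- filter(iris, Species == c("virginica", "versicolor"))

# %in% asks, for every row, whether Species is a member of the set
member <- filter(iris, Species %in% c("virginica", "versicolor"))

nrow(recycled) # 50: half of the virginica/versicolor rows are lost
nrow(member)   # 100: all virginica and versicolor rows are kept
```

Because 150 is a multiple of 2, R recycles the shorter vector without a warning, which is why the mistake is easy to miss.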

3: Now, create an iris2 data frame from iris1 that contains only the columns for Species, Sepal.Length, and Sepal.Width. How many observations and variables are in the data set?

iris2<-select(iris1, Species, Sepal.Length, Sepal.Width) # select only these columns from iris1, assign this new df to iris2
glimpse(iris2)
## Rows: 56
## Columns: 3
## $ Species      <fct> versicolor, versicolor, versicolor, versicolor, versicolo…
## $ Sepal.Length <dbl> 7.0, 6.4, 6.9, 6.5, 6.3, 6.6, 6.1, 6.7, 6.1, 6.1, 6.4, 6.…
## $ Sepal.Width  <dbl> 3.2, 3.2, 3.1, 2.8, 3.3, 2.9, 2.9, 3.1, 2.8, 2.8, 2.9, 3.…
# There are 56 observations and 3 variables

4: Create an iris3 data frame from iris2 that orders the observations from largest to smallest sepal length. Show the first 6 rows of this data set.

iris3<-arrange(iris2, desc(Sepal.Length)) # sort iris2 by Sepal.Length in descending order and assign the result to iris3
head(iris3) # print the first 6 rows of iris3
##     Species Sepal.Length Sepal.Width
## 1 virginica          7.9         3.8
## 2 virginica          7.7         3.8
## 3 virginica          7.7         2.6
## 4 virginica          7.7         2.8
## 5 virginica          7.7         3.0
## 6 virginica          7.6         3.0
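
Several rows above share a sepal length of 7.7, and arrange() leaves such ties in their existing order unless a second sort expression is supplied. The sketch below shows one possible tie-breaker, assuming a descending sort on Sepal.Width is wanted; the name iris3_ties is hypothetical and is not part of the assignment.

```r
library(dplyr)

# Rebuild iris2 from the built-in iris data so this block runs on its own
iris2 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width)

# Sort by Sepal.Length (largest first), then break ties by Sepal.Width
iris3_ties <- arrange(iris2, desc(Sepal.Length), desc(Sepal.Width))

head(iris3_ties)
```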

5: Create an iris4 data frame from iris3 that creates a column with a sepal area (length * width) value for each observation. How many observations and variables are in the data set?

iris4<-mutate(iris3, Sepal.Area=Sepal.Length*Sepal.Width) # add a Sepal.Area column (Sepal.Length * Sepal.Width) to iris3 and assign the result to iris4
glimpse(iris4)
## Rows: 56
## Columns: 4
## $ Species      <fct> virginica, virginica, virginica, virginica, virginica, vi…
## $ Sepal.Length <dbl> 7.9, 7.7, 7.7, 7.7, 7.7, 7.6, 7.4, 7.3, 7.2, 7.2, 7.2, 7.…
## $ Sepal.Width  <dbl> 3.8, 3.8, 2.6, 2.8, 3.0, 3.0, 2.8, 2.9, 3.6, 3.2, 3.0, 3.…
## $ Sepal.Area   <dbl> 30.02, 29.26, 20.02, 21.56, 23.10, 22.80, 20.72, 21.17, 2…
# There are 56 observations and 4 variables

6: Create the variable irisTab that shows the average sepal length, the average sepal width, and the sample size of the entire iris4 data frame and print irisTab.

irisTab<-summarize(iris4, meanSepal.Length=mean(Sepal.Length, na.rm=TRUE), meanSepal.Width=mean(Sepal.Width, na.rm=TRUE), SampleSize=n())
print(irisTab)
##   meanSepal.Length meanSepal.Width SampleSize
## 1         6.698214        3.041071         56

7: Finally, create iris5 that calculates the average sepal length, the average sepal width, and the sample size for each species in the iris4 data frame and print iris5.

.<-group_by(iris4, Species) # group iris4 by Species. Rather than giving the grouped data frame its own name as the answer key did, I stored it in a temporary object called . and passed that to summarize() on the next line, so that iris5 holds the summary. The output is the same either way.
iris5<-summarize(., meanSepal.Length=mean(Sepal.Length, na.rm=TRUE), meanSepal.Width=mean(Sepal.Width, na.rm=TRUE), SampleSize=n())
print(iris5)
## # A tibble: 2 × 4
##   Species    meanSepal.Length meanSepal.Width SampleSize
##   <fct>                 <dbl>           <dbl>      <int>
## 1 versicolor             6.48            2.99         17
## 2 virginica              6.79            3.06         39

8: In these exercises, you have successively modified different versions of the data frame: iris1, iris2, iris3, iris4, and iris5. At each stage, the output data frame from one operation serves as the input for the next. A more efficient way to do this is to use the pipe operator %>%, which originates in the magrittr package and is made available by dplyr and tidyr. Rework all of your previous statements (except for irisTab) into an extended piping operation that uses iris as the input and generates irisFinal as the output.

irisFinal<-iris %>%
  filter(Species=="virginica" | Species=="versicolor", Sepal.Length>6, Sepal.Width>2.5) %>% # answer key used %in% operator 
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  group_by(Species) %>%
  summarize(meanSepal.Length=mean(Sepal.Length, na.rm=TRUE), meanSepal.Width=mean(Sepal.Width, na.rm=TRUE), SampleSize=n())
  

print(irisFinal)
## # A tibble: 2 × 4
##   Species    meanSepal.Length meanSepal.Width SampleSize
##   <fct>                 <dbl>           <dbl>      <int>
## 1 versicolor             6.48            2.99         17
## 2 virginica              6.79            3.06         39
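
For comparison, the same pipeline can be written with the %in% operator mentioned in the filter step. This is only a sketch of that alternative, not the answer key itself; irisFinalIn is a hypothetical name chosen so it does not overwrite irisFinal.

```r
library(dplyr)

# irisFinalIn is a hypothetical name used only for this sketch
irisFinalIn <- iris %>%
  # %in% keeps rows whose Species is a member of the set
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  group_by(Species) %>%
  summarize(meanSepal.Length = mean(Sepal.Length),
            meanSepal.Width = mean(Sepal.Width),
            SampleSize = n())

print(irisFinalIn)
```

The arrange() and mutate() steps do not change the summary, since means and counts do not depend on row order or on the extra Sepal.Area column; they are kept here only to mirror the earlier exercises.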