Daily Assignment 6
Data manipulations using the dplyr package - use the dplyr and
tidyverse packages.
1: Examine the structure of the iris data set. How many observations
and variables are in the data set?
library(tidyverse) # attach the core tidyverse packages; this call produces the attach message below
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr) # load dplyr explicitly (the tidyverse attach message above already lists it) so that filter(), select(), etc. are available
str(iris) # there are 150 observations (rows) and 5 variables (columns)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
glimpse(iris) # shows a glimpse of the data - makes it possible to see every column in df
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
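As a quick cross-check on the counts reported by str() and glimpse(), base R's dim(), nrow(), and ncol() return the dimensions directly. This is only a minimal sketch using the built-in iris data; it is not part of the original assignment code.

```r
# iris ships with base R (datasets package), so no extra packages are needed
dim(iris)   # number of rows, then number of columns
nrow(iris)  # number of observations
ncol(iris)  # number of variables
```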
2: Create a new data frame iris1 that contains only the species
virginica and versicolor with sepal lengths greater than 6 cm and sepal
widths greater than 2.5 cm. How many observations and variables are in
the data set?
iris1<-filter(iris, Species == "virginica" | Species == "versicolor", Sepal.Length>6, Sepal.Width>2.5)
# Keep rows where Species is "virginica" OR ( | ) "versicolor", AND Sepal.Length is >6, AND Sepal.Width is >2.5.
# I had to correct my code here. I originally had:
# iris1<-filter(iris, Species == c("virginica","versicolor"), Sepal.Length>6, Sepal.Width>2.5)
# This was incorrect and excluded about half of the matching rows. When == compares vectors of different lengths, R recycles the shorter one, so each row was compared against only one of the two species names (alternating by row).
# Writing the two Species == tests separately and joining them with | returns the correct df. Alternatively, I could have used the %in% operator, which tests whether each value is a member of the vector.
glimpse(iris1)
## Rows: 56
## Columns: 5
## $ Sepal.Length <dbl> 7.0, 6.4, 6.9, 6.5, 6.3, 6.6, 6.1, 6.7, 6.1, 6.1, 6.4, 6.…
## $ Sepal.Width <dbl> 3.2, 3.2, 3.1, 2.8, 3.3, 2.9, 2.9, 3.1, 2.8, 2.8, 2.9, 3.…
## $ Petal.Length <dbl> 4.7, 4.5, 4.9, 4.6, 4.7, 4.6, 4.7, 4.4, 4.0, 4.7, 4.3, 4.…
## $ Petal.Width <dbl> 1.4, 1.5, 1.5, 1.5, 1.6, 1.3, 1.4, 1.4, 1.3, 1.2, 1.3, 1.…
## $ Species <fct> versicolor, versicolor, versicolor, versicolor, versicolo…
# There are 56 observations and 5 variables.
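The difference between == and %in% described in the comments above can be seen by counting rows. The sketch below is illustrative only; the object names recycled, matched, and iris1_alt are hypothetical and do not appear in the assignment.

```r
library(dplyr)

# == recycles the length-2 vector down the 150 rows, so odd rows are compared
# with "virginica" and even rows with "versicolor"; only half of the rows for
# those two species survive
recycled <- filter(iris, Species == c("virginica", "versicolor"))

# %in% tests each Species value for membership in the vector, so every
# virginica and versicolor row is kept
matched <- filter(iris, Species %in% c("virginica", "versicolor"))

nrow(recycled)  # 50
nrow(matched)   # 100

# The same filter as iris1, written with %in% instead of two == tests and |
iris1_alt <- filter(iris,
                    Species %in% c("virginica", "versicolor"),
                    Sepal.Length > 6,
                    Sepal.Width > 2.5)
nrow(iris1_alt)  # 56, matching the glimpse(iris1) output
```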
3: Now, create an iris2 data frame from iris1 that contains only the
columns for Species, Sepal.Length, and Sepal.Width. How many
observations and variables are in the data set?
iris2<-select(iris1, Species, Sepal.Length, Sepal.Width) # select only these columns from iris1, assign this new df to iris2
glimpse(iris2)
## Rows: 56
## Columns: 3
## $ Species <fct> versicolor, versicolor, versicolor, versicolor, versicolo…
## $ Sepal.Length <dbl> 7.0, 6.4, 6.9, 6.5, 6.3, 6.6, 6.1, 6.7, 6.1, 6.1, 6.4, 6.…
## $ Sepal.Width <dbl> 3.2, 3.2, 3.1, 2.8, 3.3, 2.9, 2.9, 3.1, 2.8, 2.8, 2.9, 3.…
# There are 56 observations and 3 variables
4: Create an iris3 data frame from iris2 that orders the
observations from largest to smallest sepal length. Show the first 6
rows of this data set.
iris3<-arrange(iris2, desc(Sepal.Length)) # arrange iris2 by descending sepal length and assign this df the name iris3 (arrange() has no by= argument, so the sort expression is passed directly)
head(iris3) # print the first 6 rows of iris3
## Species Sepal.Length Sepal.Width
## 1 virginica 7.9 3.8
## 2 virginica 7.7 3.8
## 3 virginica 7.7 2.6
## 4 virginica 7.7 2.8
## 5 virginica 7.7 3.0
## 6 virginica 7.6 3.0
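Four of the rows printed above are tied at a sepal length of 7.7, and arrange() was given only one sort key. If a deterministic order among ties is wanted, a second key can be added. The sketch below is an assumption that goes beyond what the assignment asks for; it rebuilds iris2 so that it runs on its own, and the name iris3_ties is hypothetical.

```r
library(dplyr)

# Rebuild iris2 from the built-in data so this block is self-contained
iris2 <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width)

# Sort by sepal length first, then break ties by sepal width,
# both from largest to smallest
iris3_ties <- arrange(iris2, desc(Sepal.Length), desc(Sepal.Width))
head(iris3_ties)
```

With the second key, the four 7.7 rows appear in order of decreasing sepal width instead of in their original row order.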
5: Create an iris4 data frame from iris3 that creates a column with
a sepal area (length * width) value for each observation. How many
observations and variables are in the data set?
iris4<-mutate(iris3, Sepal.Area=Sepal.Length*Sepal.Width) # mutate iris3 by adding a Sepal.Area column equal to sepal length times sepal width
glimpse(iris4)
## Rows: 56
## Columns: 4
## $ Species <fct> virginica, virginica, virginica, virginica, virginica, vi…
## $ Sepal.Length <dbl> 7.9, 7.7, 7.7, 7.7, 7.7, 7.6, 7.4, 7.3, 7.2, 7.2, 7.2, 7.…
## $ Sepal.Width <dbl> 3.8, 3.8, 2.6, 2.8, 3.0, 3.0, 2.8, 2.9, 3.6, 3.2, 3.0, 3.…
## $ Sepal.Area <dbl> 30.02, 29.26, 20.02, 21.56, 23.10, 22.80, 20.72, 21.17, 2…
# There are 56 observations and 4 variables
6: Create the variable irisTab that shows the average sepal length,
the average sepal width, and the sample size of the entire iris4 data
frame and print irisTab.
irisTab<-summarize(iris4, meanSepal.Length=mean(Sepal.Length, na.rm=TRUE), meanSepal.Width=mean(Sepal.Width, na.rm=TRUE), SampleSize=n())
print(irisTab)
## meanSepal.Length meanSepal.Width SampleSize
## 1 6.698214 3.041071 56
7: Finally, create iris5 that calculates the average sepal length,
the average sepal width, and the sample size for each species in the
iris4 data frame and print iris5.
.<-group_by(iris4, Species) # group iris4 by Species. I stored the grouped data in "." instead of giving it a separate name as the answer key did, and then ran summarize on it. I changed this so that iris5 holds (and prints) the summary itself; the output is the same as the answer key's.
iris5<-summarize(., meanSepal.Length=mean(Sepal.Length, na.rm=TRUE), meanSepal.Width=mean(Sepal.Width, na.rm=TRUE), SampleSize=n())
print(iris5)
## # A tibble: 2 × 4
## Species meanSepal.Length meanSepal.Width SampleSize
## <fct> <dbl> <dbl> <int>
## 1 versicolor 6.48 2.99 17
## 2 virginica 6.79 3.06 39
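Steps 2 through 7 can also be chained with the pipe operator (%>%), which avoids naming each intermediate data frame. This is a sketch of an equivalent workflow, not the code used above; the name iris5_piped is hypothetical, and %in% stands in for the two == tests joined by |.

```r
library(dplyr)

iris5_piped <- iris %>%
  # step 2: keep the two species with sepal length > 6 and sepal width > 2.5
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6, Sepal.Width > 2.5) %>%
  # step 3: keep only the three columns of interest
  select(Species, Sepal.Length, Sepal.Width) %>%
  # step 4: order from largest to smallest sepal length
  arrange(desc(Sepal.Length)) %>%
  # step 5: add the sepal area column
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  # step 7: summarize within each species
  group_by(Species) %>%
  summarize(meanSepal.Length = mean(Sepal.Length, na.rm = TRUE),
            meanSepal.Width = mean(Sepal.Width, na.rm = TRUE),
            SampleSize = n())

print(iris5_piped)
```

The printed tibble should have one row per species with the same sample sizes as iris5 (17 versicolor, 39 virginica).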