Entering the tidyverse
January 23 2023
AGG
- tidyverse: a collection of packages that share a
philosophy, a grammar (how the code is structured), and data
structures. Often more intuitive than base R
- operators: symbols that tell R to perform different
operations (between variables, functions, etc.)
- e.g. arithmetic operators: +, -, *, /, ^ (the ~ is not arithmetic; it is R's formula operator)
- assignment operators: <- (how we create all variables)
- logical operators: ! (not), & (and), | (or)
- relational operators: == (equals), != (does not equal), >, <, >= (greater than or equal to), <= (less than or equal to)
- today we will learn about miscellaneous operators: %>% (forward pipe operator), %in% (matching operator)
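A minimal sketch of the operator families listed above, using base R only. The values x and y are hypothetical and chosen just for illustration.

```r
# Base R only; x and y are hypothetical example values
x <- 7
y <- 2

# arithmetic operators
total <- x + y         # 9
power <- x ^ y         # 49

# relational operators return TRUE or FALSE
is_equal   <- x == y   # FALSE
is_unequal <- x != y   # TRUE
at_least   <- x >= y   # TRUE

# logical operators combine or negate logical values
both    <- (x > 5) & (y > 5)   # FALSE: only the first condition holds
either  <- (x > 5) | (y > 5)   # TRUE: at least one condition holds
negated <- !(x > 5)            # FALSE

# %in% asks whether each element on the left appears on the right
found <- c("a", "z") %in% c("a", "b", "c")   # TRUE FALSE
```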
Tidyverse
- Today we will be installing the tidyverse packages. You only need to
install packages once. First run install.packages("tidyverse"), then
library(tidyverse)
- library(tidyverse) loads the packages
- dplyr: a new(er) package that provides a set of tools for
manipulating data sets.
- specifically written to be fast
- individual functions that correspond to common operations
- , (comma) = and: all of the conditions have to be met for a row to be kept
- & (ampersand) = and: same as the comma, all of the conditions have to be met
- | (vertical bar) = or: at least one of the conditions has to be met
- 5 core verbs:
- filter()
- arrange()
- select()
- group_by() and summarize()
- mutate()
- Built in data set
library(tidyverse)
library(dplyr)
data(starwars)
glimpse(starwars)
class(starwars)
- Tibble: a modern take on data frames. Keeps the great aspects of data frames and drops the frustrating ones (e.g. changing variable names or types)
- If we want to clean up data:
- Check for NA’s
anyNA(starwars) # are there any NAs? returns a single logical; see also is.na() and complete.cases()
starwarsClean<-starwars[complete.cases(starwars[, 1:10]),] # keeps only the rows with no NAs in the first 10 columns; leaving the slot after the last comma empty returns all columns
# anyNA(starwarsClean[,1:10]) # check: we have removed all NAs in the first 10 columns
- use filter(): picks/subsets observations (ROWS) by their values
filter(starwarsClean, gender=="masculine" & height < 180) # & can be replaced with a comma; both mean "and", so both conditions have to be met for a row to be kept
filter(starwarsClean, gender=="masculine" , height < 180 , height > 100) # multiple conditions for the same variable
filter(starwarsClean, gender=="masculine" | gender=="feminine") # gender is masculine or feminine
- now working with the %in% operator, a matching operator. Similar to == but you can compare vectors of different lengths (== compares element by element, so it is not suited to that)
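A minimal sketch of how %in% behaves, using base R only. The vector a is the one used in these notes; b is a hypothetical second vector introduced here for illustration.

```r
# Base R only; a matches the notes, b is a hypothetical second vector
a <- LETTERS[1:10]          # "A" through "J"
b <- c("C", "J", "Z")

# %in% returns one logical per element of the LEFT-hand vector
b_in_a <- b %in% a          # TRUE TRUE FALSE (length 3, like b)
a_in_b <- a %in% b          # length 10, like a; TRUE at "C" and "J"

# subsetting with the result keeps only the matching elements
matches <- a[a %in% b]      # "C" "J"

# == compares position by position instead, so it answers a different question
same_position <- c("A", "B", "C") == c("C", "B", "A")   # FALSE TRUE FALSE
```

The length of the result always follows the left-hand vector, which is why a %in% b and b %in% a give different outputs.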
a<-LETTERS[1:10]
length(a) # how many elements are in this vector
# output of %in% depends on the first vector: one TRUE/FALSE per element of the left-hand vector
# a %in% b
# b %in% a
- use %in% to subset
eyes<-filter(starwars, eye_color %in% c("blue", "brown")) # subsetting characters with these eye colors
view(eyes) # opens up table in another tab
eyes2<-filter(starwars, eye_color == "blue" | eye_color == "brown")
view(eyes2)
arrange(): reorders rows
sw<-arrange(starwarsClean, height) # arranges the whole data frame by the variable you specify
- can also use helper function desc()
arrange(starwarsClean, desc(height)) # arrange in descending order
- put in additional conditions
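A minimal sketch of arranging by more than one column, and of where missing values end up. The tibble toy and its columns are hypothetical, made up for illustration rather than taken from starwars.

```r
library(dplyr)

# Hypothetical toy data, not from starwars
toy <- tibble(
  name   = c("a", "b", "c", "d"),
  group  = c("x", "y", "x", "y"),
  height = c(150, NA, 180, 165)
)

# additional conditions: sort by group first, then by descending height within each group
by_two <- arrange(toy, group, desc(height))
by_two$name                      # "c" "a" "d" "b"

# missing values are sorted to the end in ascending and in descending order
asc_names  <- arrange(toy, height)$name         # "a" "d" "c" "b"
desc_names <- arrange(toy, desc(height))$name   # "c" "d" "a" "b"
```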
tail(sw) # arrange() always sorts missing values to the end
select(): chooses variables (COLUMNS) by their names
select(starwarsClean, 1:10)
select(starwarsClean) # with no columns named, returns a tibble with zero columns
select(starwarsClean, name:species) # all columns from name through species
select(starwarsClean, -(films:starships))
starwarsClean[,1:11] # base R equivalent of the line above
everything(): rearranges columns. A helper function that is useful if you have a couple of variables that you want to move to the beginning
select(starwarsClean, name, gender, species, everything()) # reorders the columns, but the result is not saved unless you assign it
contains(): another helper function
select(starwarsClean, contains("color")) # others include: ends_with(), starts_with(), num_range()
select() can also rename columns
select(starwarsClean, haircolor=hair_color) # returns only the renamed column
rename(starwarsClean, haircolor=hair_color) # returns the whole data set
mutate(): creates new variables using functions of existing variables
- let’s create a new column that is height divided by mass
mutate(starwarsClean, ratio=height/mass)
starwars_lbs<-mutate(starwarsClean, mass_lbs=mass*2.2)
starwars_lbs<-select(starwars_lbs, 1:3, mass_lbs, everything()) # everything() is everything else. brought it to the front using select
glimpse(starwars_lbs)
transmute(): like mutate(), but returns only the columns you create or name
transmute(starwarsClean, mass_lbs=mass*2.2) # only returns mutated columns
transmute(starwarsClean, mass, mass_lbs=mass*2.2, height) # also keeps mass and height because they are named
group_by() and summarize()
summarize(starwarsClean, meanHeight=mean(height)) # returns NA if height contains any NAs; use na.rm=TRUE to drop them
summarize(starwarsClean, meanHeight=mean(height), TotalNumber=n())
- use group_by() for maximum usefulness
starwarsGenders<-group_by(starwars, gender)
head(starwarsGenders) # lets you view the first 6 rows; head(starwarsGenders, 10) shows the first 10, and tail(starwarsGenders) shows the last 6
summarize(starwarsGenders, meanHeight=mean(height, na.rm=TRUE), TotalNumber=n()) # TotalNumber=n() counts how many rows belong to each group
Piping %>%
- piping is used to emphasize a sequence of actions
- allows you to pass an intermediate result onto the next function (uses output of one function as input of the next)
- avoid if you need to manipulate more than one object/variable at a time, or if an intermediate result is meaningful enough to deserve its own name
- formatting: there should always be a space before the %>%, and it should be followed by a new line
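A minimal sketch of what "uses output of one function as input of the next" means in practice, comparing nested calls with the piped version. The tibble toy and its columns are hypothetical and only serve the illustration.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  group  = c("x", "x", "y", "y"),
  height = c(150, 170, 180, NA)
)

# nested: has to be read from the inside out
nested <- summarize(group_by(toy, group), meanHeight = mean(height, na.rm = TRUE))

# piped: read top to bottom; each result becomes the first argument of the next function
piped <- toy %>%
  group_by(group) %>%
  summarize(meanHeight = mean(height, na.rm = TRUE))

piped$meanHeight   # 160 180, the same values as nested$meanHeight
```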
starwarsClean %>% # this data frame acts as the input of the next function
  group_by(gender) %>%
  summarize(meanHeight=mean(height, na.rm = TRUE), TotalNumber=n()) # na.rm=TRUE removes NAs
# much cleaner with piping!
case_when(): is useful for multiple if/ifelse statements
starwarsClean %>%
  mutate(sp=case_when(species=="Human"~"Human", TRUE~"Non-Human"))
# if species is "Human", put "Human"; the tilde (~) separates each condition from the value to put in the sp column
# TRUE is the catch-all: every row that did not match an earlier condition gets "Non-Human"
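The example above has a single condition plus a catch-all. A minimal sketch with several conditions follows; the tibble toy, the size labels, and the cut-offs 100 and 180 are all hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data and hypothetical cut-offs (100 and 180)
toy <- tibble(
  name   = c("a", "b", "c", "d"),
  height = c(66, 172, 202, NA)
)

sized <- toy %>%
  mutate(size = case_when(
    is.na(height) ~ "unknown",   # checked first, so NAs are labelled explicitly
    height < 100  ~ "short",
    height < 180  ~ "medium",    # only reached when height is at least 100
    TRUE          ~ "tall"       # catch-all for everything left over
  ))

sized$size   # "short" "medium" "tall" "unknown"
```

Conditions are checked in order and the first one that is TRUE wins, so the more specific conditions go first.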