Simple Data Analysis and More Complex Control Structures
January 30 2023
AGG
Loading in data from last week.
dryadData<-read.table(file="data/veysey-babbitt_data_amphibians.csv", header=TRUE, sep = ",")Set up libraries
library(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggthemes)Analyses
- For experimental designs, you usually have independent/explanatory variable (x-axis) vs. dependent/response variable (y-axis)
- Continuous v. discrete variables affect this
- Continuous variables: range of numbers (e.g. height, weight, temperature)
- Discrete/Categorical variables: categories (e.g. species, treatments, site)
Basic linear regression analysis (2 continuous variables)
- Is there a relationship between the mean pool hydroperiod and the number of breeding frogs caught?
glimpse(dryadData)## Rows: 132
## Columns: 13
## $ wetland <int> 7, 7, 7, 7, 7, 7, 19, 19, 19, 19, 19, 19, 20, 20, 2…
## $ year <int> 2004, 2005, 2006, 2007, 2008, 2009, 2004, 2005, 200…
## $ treatment <chr> "30m", "30m", "30m", "30m", "30m", "30m", "30m", "3…
## $ species <chr> "RASY", "RASY", "RASY", "RASY", "RASY", "RASY", "RA…
## $ count.total.adults <int> 91, 47, 128, 60, 108, NA, 74, 75, 162, 120, 14, 44,…
## $ No.males <int> 61, 15, 58, 46, 89, NA, 58, 53, 130, 63, 8, 24, 113…
## $ No.females <int> 30, 32, 70, 14, 19, NA, 16, 22, 32, 57, 6, 20, 33, …
## $ No.recap <int> NA, 3, 18, 10, 6, NA, NA, 5, 29, 13, 0, 0, NA, 13, …
## $ No.newcap <int> 82, 44, 110, 50, 102, NA, 70, 70, 133, 107, 14, 44,…
## $ mean.hydro <dbl> 101.00, 101.00, 101.00, 101.00, 101.00, 101.00, 107…
## $ sd.hydro <dbl> 34.58, 34.58, 34.58, 34.58, 34.58, 34.58, 38.01, 38…
## $ dv.cutornot.yrs <int> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, …
## $ dv.30m.yrs <int> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, …
regModel<-lm(count.total.adults ~ mean.hydro, data=dryadData) # note y~x
summary(regModel) # use summary instead of print to get more pertinent stat data##
## Call:
## lm(formula = count.total.adults ~ mean.hydro, data = dryadData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.30 -47.30 -13.88 31.11 492.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9898 16.5591 0.181 0.857
## mean.hydro 0.5955 0.1224 4.867 3.27e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.67 on 128 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1496
## F-statistic: 23.69 on 1 and 128 DF, p-value: 3.27e-06
hist(regModel$residuals)summary(regModel)$"r.squared"## [1] 0.1561683
summary(regModel)[["r.squared"]]## [1] 0.1561683
# View(summary(regModel)) # View must be uppercase, comment out View because RMarkdown does not like
regPlot<-ggplot(data=dryadData, aes(x=mean.hydro, y=count.total.adults+1)) + # adding one = log transform?
geom_point(size=2) +
stat_smooth(method="lm", se=0.99) # sets standard error bars, default is 96
regPlot + theme_few()## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Basic ANOVA
- Was there a statistically significant difference in the number of
adults among wetlands? Remember
y~x
ANOModel<-aov(count.total.adults~wetland, data=dryadData)
summary(ANOModel)## Df Sum Sq Mean Sq F value Pr(>F)
## wetland 1 17356 17356 3.28 0.0725 .
## Residuals 128 677235 5291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
- Use
factor()to tell ggplot these are different groups - otherwise just treats it as different numbers
ANOModel<-aov(count.total.adults~factor(wetland), data=dryadData)dryadData %>%
group_by(wetland) %>%
summarize(avgHydro = mean(count.total.adults, na.rm=T), N=n())## # A tibble: 11 × 3
## wetland avgHydro N
## <int> <dbl> <int>
## 1 7 70.5 12
## 2 19 56.4 12
## 3 20 113. 12
## 4 25 133. 12
## 5 30 70.5 12
## 6 39 144. 12
## 7 55 42.3 12
## 8 59 32.5 12
## 9 124 118. 12
## 10 129 9.92 12
## 11 141 69.9 12
- Going to use a box plot with an ANOVA - go hand in hand with visualizing these results
dryadData$wetland## [1] 7 7 7 7 7 7 19 19 19 19 19 19 20 20 20 20 20 20
## [19] 25 25 25 25 25 25 30 30 30 30 30 30 39 39 39 39 39 39
## [37] 55 55 55 55 55 55 59 59 59 59 59 59 124 124 124 124 124 124
## [55] 129 129 129 129 129 129 141 141 141 141 141 141 7 7 7 7 7 7
## [73] 19 19 19 19 19 19 20 20 20 20 20 20 25 25 25 25 25 25
## [91] 30 30 30 30 30 30 39 39 39 39 39 39 55 55 55 55 55 55
## [109] 59 59 59 59 59 59 124 124 124 124 124 124 129 129 129 129 129 129
## [127] 141 141 141 141 141 141
# class(dryadData$wetland) = integer. Need to override using factor
dryadData$wetland <- factor(dryadData$wetland)
class(dryadData$wetland)## [1] "factor"
ANOplot<- ggplot(data=dryadData, mapping=aes(x=wetland, y=count.total.adults)) +
geom_boxplot()
ANOplot## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
- If we wanted to see what it looked like between species?
ANOplot2<- ggplot(data=dryadData, mapping=aes(x=wetland, y=count.total.adults, fill=species)) + # mapping fills in the aesthetics, allows you to give aesthetics arguments
geom_boxplot() +
scale_fill_grey() # use gear icon -> Chunk Output in Console to see plot in Console
ANOplot2## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
- This code saves pdf to repo folder:
ggsave(file="SpeciesBoxplots.pdf", plot=ANOplot2, device="pdf")
Logistic Regression
- Simulate data set
# xVar<-sort(rgamma())- gamma probabilities: continuous variables that are always positive and have a skewed distribution
xVar <- sort(rgamma(n=200,shape=5,scale=5))
yVar <- sample(rep(c(1,0),each=100),prob=seq_len(200)) # prob=seq_len gives it equal probability of being sampled bc n=200, same as if we did 1:200
logRegData <- data.frame(xVar,yVar) Logistic Regression Analysis
logRegModel<-glm(yVar~xVar, data=logRegData, family=binomial(link=logit)) # generic linear model, very important function if you're stats-oriented # latter part - how we specify we want a logistic regression
summary(logRegModel)$coefficients## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.9161445 0.49504974 -5.890609 3.847749e-09
## xVar 0.1128415 0.01845451 6.114577 9.681350e-10
logRegPlot<-ggplot(data=logRegData, aes(x=xVar, y=yVar)) +
geom_point() +
stat_smooth(method=glm, method.args=list(family=binomial))
logRegPlot # binomial is what tells R it is a logistic regression because we have 0 or 1## `geom_smooth()` using formula = 'y ~ x'
Contingency table (chi-squared) Analysis
- Are there differences in the counts of males and females between species?
countData<-dryadData %>%
group_by(species) %>%
summarize(Males=sum(No.males, na.rm=T), Females = sum(No.females, na.rm=T)) %>%
select(-species) %>%
as.matrix()
countData # gets data into a table## Males Females
## [1,] 2090 1513
## [2,] 4095 2422
row.names(countData) = c("SS", "WF") # change row names
countData## Males Females
## SS 2090 1513
## WF 4095 2422
- Chi-squared
- Get residuals
testResults<-chisq.test(countData)
testResults$residuals ##??????## Males Females
## SS -2.387410 2.993122
## WF 1.775151 -2.225526
testResults$expected ##????????## Males Females
## SS 2202.031 1400.969
## WF 3982.969 2534.031
- Mosaic plot (base R) ??? base R
mosaicplot(x=countData, col=c("goldenrod", "grey"), shade=FALSE)- Bar plot
countDataLong <- countData %>%
as_tibble() %>% # transform into tibble, doesn't like that it's a matrix
mutate(Species = c(rownames(countData))) %>% # give me species data ???
pivot_longer(cols=Males:Females, # take both columns and make them longer - put them in a row
names_to = "Sex", # name of new column which will contain Males/Females
values_to = "Count") # name of new column which contain the values for M/F- Plot bar graph
ggplot(data=countDataLong, mapping=aes(x=Species, y=Count, fill=Sex)) +
geom_bar(stat="identity", position="dodge") + #plots bars next to each other
scale_fill_manual(values=c("darkslategrey", "midnightblue"))Control Structures
ifandifelsestatementsifstatementif (condition) {expression 1}
if it’s THIS do THISif (condition) {expression 1} else {expression 2}
if it’s THIS, do THIS, but if it’s something else, do THATif (condition1) {expression 1} elseif (condition2) {expression 2} else
- else must appear on the same line as the expression
- use it for single values
z<-signif(runif(1), digits=2) # give me one random uniform value rounded to second signif value
z>0.5 # this turns it into a logical## [1] FALSE
# use {} or not
if (z>0.8) {cat(z,"is a large number", "\n")} else # {cat} is concatinate fn # if z>0.8, return that z is a large number and put a line break ("/n")
if (z<0.2) {cat(z, "is a small number", "\n")} else
{cat(z, "is a number of typical size", "\n")
cat("z^2=", z^2, "\n")
y<-TRUE} # if its anywhere in between, say it's a number of typical size, then square it # y<-TRUE - can make up expression: if something's true do this## 0.11 is a small number
- ifelse to fill vectors
- ifelse(condition, yes, no)
- e.g. insect population where females lay 10.2 eggs on average, follows Poisson distribution (discrete probability distribution showing the likely number of times an event will occur). 35% parasitism where no eggs are laid. Let’s make a distribution for 1000 individuals
tester<-runif(1000)
eggs<-ifelse(tester>0.35, rpois(n=1000, lambda = 10.2),0) # if tester is > .35, make a distribution, if not assign a 0
hist(eggs) # make a histogram of eggs data- e.g. create a vector of pvalues from a simulation and we want to create a vector to highlight the significant ones for plotting
pVals<-runif(1000)
z<-ifelse(pVals<=0.025, "lowerTail", "nonSig") # if lower than/eq to this value say "lowerTail", if not say "nonSig" # using if else
z[pVals >= 0.975] <- "upperTail" # if gr than/eq to assign "upperTail" # use subsetting
table(z) # give tables w # of lowerTail/nonSig/upperTail values## z
## lowerTail nonSig upperTail
## 24 950 26
For Loops
forloops: workhorse function for doing repetitive tasks. Universal in all computer languages. Controversial in R because:- often not necessary (we have vectorization in R)
- very slow with certain operations (binding rows/columns together)
- family of apply functions (pretty much do the same as for loops)
- Anatomy of
forloopfor(variable in sequence){starts loopbody of for loopend of loop}
varis a counter variable that holds the current value of the loop (i, j, k)- the sequence is an integer vector that defines the start/end of the loop
for (i in 1:5){
cat("stuck in a loop", i, "\n")
cat(3+2, "\n") # has nothing to do with which iteration you're in, will always produce the same value (5) in the loop
cat(3+i, "\n") # i is whatever iteration/number it is in the sequence # in other words cat(3+loop #, line break)
}## stuck in a loop 1
## 5
## 4
## stuck in a loop 2
## 5
## 5
## stuck in a loop 3
## 5
## 6
## stuck in a loop 4
## 5
## 7
## stuck in a loop 5
## 5
## 8
print(i) # will print the last number## [1] 5
- Use a counter variable that maps to the position of each element. This is a single for loop.
my_dogs<-c("chow", "akita", "malamute", "husky", "samoyed")
for(i in 1:length(my_dogs)){
cat("i=", i, "my_dogs[i]=", my_dogs[i], "\n") # i value will be changing - in first iteration it'll be chow, then akita, malamute, etc # think of i like iteration number # end with line break bc good coding practice
}## i= 1 my_dogs[i]= chow
## i= 2 my_dogs[i]= akita
## i= 3 my_dogs[i]= malamute
## i= 4 my_dogs[i]= husky
## i= 5 my_dogs[i]= samoyed
- Can also create double for loops
m<-matrix(round(runif(20), digits=2), nrow=5)
m## [,1] [,2] [,3] [,4]
## [1,] 0.87 0.43 0.78 0.69
## [2,] 0.41 0.44 0.08 0.16
## [3,] 0.82 0.43 0.95 0.35
## [4,] 0.66 0.33 0.72 0.01
## [5,] 0.85 0.58 0.88 0.55
- Use a for loop to change these values - if we were just doing loop over ROWS (not double loop)
for(i in 1:nrow(m)){
m[i,] <- m[i,] + i # all columns, loop through one row at a time
} # this goes to first row, adds 1 to it, then to second row/iteration and adds 2 to it.
m## [,1] [,2] [,3] [,4]
## [1,] 1.87 1.43 1.78 1.69
## [2,] 2.41 2.44 2.08 2.16
## [3,] 3.82 3.43 3.95 3.35
## [4,] 4.66 4.33 4.72 4.01
## [5,] 5.85 5.58 5.88 5.55
- Can do the same thing for the COLUMNS - loop over
columns
- Note: it’s good practice to use
ifor columns andjfor rows
- Note: it’s good practice to use
m<-matrix(round(runif(20), digits=2), nrow=5)
for(j in 1:ncol(m)){
m[,j] <- m[,j] + j
}
m## [,1] [,2] [,3] [,4]
## [1,] 1.89 2.74 3.31 4.84
## [2,] 1.26 2.99 3.34 4.92
## [3,] 1.72 2.88 3.76 4.81
## [4,] 1.91 2.16 3.59 4.20
## [5,] 1.17 2.90 3.73 4.37
- Loop over rows and columns
m<-matrix(round(runif(20), digits=2), nrow=5)
for(i in 1:nrow(m)){
for(j in 1:ncol(m)){
m[i,j] <- m[i,j] + i + j # adds i and j to each cell, depends on location of cell
} # end of column j loop
} # end of column i loop
print(m)## [,1] [,2] [,3] [,4]
## [1,] 2.31 3.12 4.66 5.63
## [2,] 3.67 4.87 5.55 6.56
## [3,] 4.01 5.69 6.59 7.63
## [4,] 5.57 6.65 7.70 8.95
## [5,] 6.76 7.74 8.44 9.51