Simple Data Analysis and More Complex Control Structures 01-30-2023

Abigail Griffin

2023-01-31

Simple Data Analysis and More Complex Control Structures

January 30 2023
AGG

Loading in data from last week.

dryadData<-read.table(file="data/veysey-babbitt_data_amphibians.csv", header=TRUE, sep = ",")

Set up libraries

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggthemes)

Analyses

  • For experimental designs, you usually have independent/explanatory variable (x-axis) vs. dependent/response variable (y-axis)
  • Continuous v. discrete variables affect this
    • Continuous variables: range of numbers (e.g. height, weight, temperature)
    • Discrete/Categorical variables: categories (e.g. species, treatments, site)

Basic linear regression analysis (2 continuous variables)

  • Is there a relationship between the mean pool hydroperiod and the number of breeding frogs caught?
glimpse(dryadData)
## Rows: 132
## Columns: 13
## $ wetland            <int> 7, 7, 7, 7, 7, 7, 19, 19, 19, 19, 19, 19, 20, 20, 2…
## $ year               <int> 2004, 2005, 2006, 2007, 2008, 2009, 2004, 2005, 200…
## $ treatment          <chr> "30m", "30m", "30m", "30m", "30m", "30m", "30m", "3…
## $ species            <chr> "RASY", "RASY", "RASY", "RASY", "RASY", "RASY", "RA…
## $ count.total.adults <int> 91, 47, 128, 60, 108, NA, 74, 75, 162, 120, 14, 44,…
## $ No.males           <int> 61, 15, 58, 46, 89, NA, 58, 53, 130, 63, 8, 24, 113…
## $ No.females         <int> 30, 32, 70, 14, 19, NA, 16, 22, 32, 57, 6, 20, 33, …
## $ No.recap           <int> NA, 3, 18, 10, 6, NA, NA, 5, 29, 13, 0, 0, NA, 13, …
## $ No.newcap          <int> 82, 44, 110, 50, 102, NA, 70, 70, 133, 107, 14, 44,…
## $ mean.hydro         <dbl> 101.00, 101.00, 101.00, 101.00, 101.00, 101.00, 107…
## $ sd.hydro           <dbl> 34.58, 34.58, 34.58, 34.58, 34.58, 34.58, 38.01, 38…
## $ dv.cutornot.yrs    <int> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, …
## $ dv.30m.yrs         <int> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, …
regModel<-lm(count.total.adults ~ mean.hydro, data=dryadData) # note y~x
summary(regModel) # use summary instead of print to get more pertinent stat data
## 
## Call:
## lm(formula = count.total.adults ~ mean.hydro, data = dryadData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.30  -47.30  -13.88   31.11  492.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9898    16.5591   0.181    0.857    
## mean.hydro    0.5955     0.1224   4.867 3.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.67 on 128 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1496 
## F-statistic: 23.69 on 1 and 128 DF,  p-value: 3.27e-06
hist(regModel$residuals)

summary(regModel)$"r.squared"
## [1] 0.1561683
summary(regModel)[["r.squared"]]
## [1] 0.1561683
# View(summary(regModel)) # View must be uppercase, comment out View because RMarkdown does not like

regPlot<-ggplot(data=dryadData, aes(x=mean.hydro, y=count.total.adults+1)) + # adding one = log transform?
  geom_point(size=2) +
  stat_smooth(method="lm", se=0.99) # sets standard error bars, default is 96
regPlot + theme_few()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Basic ANOVA

  • Was there a statistically significant difference in the number of adults among wetlands? Remember y~x
ANOModel<-aov(count.total.adults~wetland, data=dryadData)
summary(ANOModel)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## wetland       1  17356   17356    3.28 0.0725 .
## Residuals   128 677235    5291                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
  • Use factor() to tell ggplot these are different groups - otherwise just treats it as different numbers
ANOModel<-aov(count.total.adults~factor(wetland), data=dryadData)
dryadData %>%
  group_by(wetland) %>%
  summarize(avgHydro = mean(count.total.adults, na.rm=T), N=n())
## # A tibble: 11 × 3
##    wetland avgHydro     N
##      <int>    <dbl> <int>
##  1       7    70.5     12
##  2      19    56.4     12
##  3      20   113.      12
##  4      25   133.      12
##  5      30    70.5     12
##  6      39   144.      12
##  7      55    42.3     12
##  8      59    32.5     12
##  9     124   118.      12
## 10     129     9.92    12
## 11     141    69.9     12
  • Going to use a box plot with an ANOVA - go hand in hand with visualizing these results
dryadData$wetland
##   [1]   7   7   7   7   7   7  19  19  19  19  19  19  20  20  20  20  20  20
##  [19]  25  25  25  25  25  25  30  30  30  30  30  30  39  39  39  39  39  39
##  [37]  55  55  55  55  55  55  59  59  59  59  59  59 124 124 124 124 124 124
##  [55] 129 129 129 129 129 129 141 141 141 141 141 141   7   7   7   7   7   7
##  [73]  19  19  19  19  19  19  20  20  20  20  20  20  25  25  25  25  25  25
##  [91]  30  30  30  30  30  30  39  39  39  39  39  39  55  55  55  55  55  55
## [109]  59  59  59  59  59  59 124 124 124 124 124 124 129 129 129 129 129 129
## [127] 141 141 141 141 141 141
# class(dryadData$wetland) = integer. Need to override using factor
dryadData$wetland <- factor(dryadData$wetland)
class(dryadData$wetland)
## [1] "factor"
ANOplot<- ggplot(data=dryadData, mapping=aes(x=wetland, y=count.total.adults)) +
  geom_boxplot()
ANOplot
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

  • If we wanted to see what it looked like between species?
ANOplot2<- ggplot(data=dryadData, mapping=aes(x=wetland, y=count.total.adults, fill=species)) + # mapping fills in the aesthetics, allows you to give aesthetics arguments 
  geom_boxplot() +
  scale_fill_grey() # use gear icon -> Chunk Output in Console to see plot in Console
ANOplot2
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

  • This code saves pdf to repo folder: ggsave(file="SpeciesBoxplots.pdf", plot=ANOplot2, device="pdf")

Logistic Regression

  1. Simulate data set
# xVar<-sort(rgamma())
  • gamma probabilities: continuous variables that are always positive and have a skewed distribution
xVar <- sort(rgamma(n=200,shape=5,scale=5))
yVar <- sample(rep(c(1,0),each=100),prob=seq_len(200)) # prob=seq_len gives it equal probability of being sampled bc n=200, same as if we did 1:200
logRegData <- data.frame(xVar,yVar) 

Logistic Regression Analysis

logRegModel<-glm(yVar~xVar, data=logRegData, family=binomial(link=logit)) # generic linear model, very important function if you're stats-oriented # latter part - how we specify we want a logistic regression
summary(logRegModel)$coefficients
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -2.9161445 0.49504974 -5.890609 3.847749e-09
## xVar         0.1128415 0.01845451  6.114577 9.681350e-10
logRegPlot<-ggplot(data=logRegData, aes(x=xVar, y=yVar)) +
  geom_point() +
  stat_smooth(method=glm, method.args=list(family=binomial))
logRegPlot # binomial is what tells R it is a logistic regression because we have 0 or 1
## `geom_smooth()` using formula = 'y ~ x'

Contingency table (chi-squared) Analysis

  • Are there differences in the counts of males and females between species?
countData<-dryadData %>%
  group_by(species) %>%
  summarize(Males=sum(No.males, na.rm=T), Females = sum(No.females, na.rm=T)) %>%
  select(-species) %>%
  as.matrix()
countData # gets data into a table
##      Males Females
## [1,]  2090    1513
## [2,]  4095    2422
row.names(countData) = c("SS", "WF") # change row names
countData
##    Males Females
## SS  2090    1513
## WF  4095    2422
  • Chi-squared
    • Get residuals
testResults<-chisq.test(countData)
testResults$residuals ##??????
##        Males   Females
## SS -2.387410  2.993122
## WF  1.775151 -2.225526
testResults$expected ##????????
##       Males  Females
## SS 2202.031 1400.969
## WF 3982.969 2534.031
  • Mosaic plot (base R) ??? base R
mosaicplot(x=countData, col=c("goldenrod", "grey"), shade=FALSE)

  • Bar plot
countDataLong <- countData %>%
  as_tibble() %>%  # transform into tibble, doesn't like that it's a matrix
  mutate(Species = c(rownames(countData))) %>% # give me species data ???
  pivot_longer(cols=Males:Females, # take both columns and make them longer - put them in a row
  names_to = "Sex", # name of new column which will contain Males/Females
  values_to = "Count") # name of new column which contain the values for M/F
  • Plot bar graph
ggplot(data=countDataLong, mapping=aes(x=Species, y=Count, fill=Sex)) +
  geom_bar(stat="identity", position="dodge") + #plots bars next to each other
  scale_fill_manual(values=c("darkslategrey", "midnightblue"))

Control Structures

  • if and ifelse statements
    • if statement if (condition) {expression 1}
      if it’s THIS do THIS
    • if (condition) {expression 1} else {expression 2}
      if it’s THIS, do THIS, but if it’s something else, do THAT
    • if (condition1) {expression 1} else
    • if (condition2) {expression 2} else
  • else must appear on the same line as the expression
  • use it for single values
z<-signif(runif(1), digits=2) # give me one random uniform value rounded to second signif value
z>0.5 # this turns it into a logical
## [1] FALSE
# use {} or not
if (z>0.8) {cat(z,"is a large number", "\n")} else # {cat} is concatinate fn # if z>0.8, return that z is a large number and put a line break ("/n")
  if (z<0.2) {cat(z, "is a small number", "\n")} else
  {cat(z, "is a number of typical size", "\n")
    cat("z^2=", z^2, "\n")
    y<-TRUE} # if its anywhere in between, say it's a number of typical size, then square it # y<-TRUE - can make up expression: if something's true do this
## 0.11 is a small number
  • ifelse to fill vectors
  • ifelse(condition, yes, no)
  • e.g. insect population where females lay 10.2 eggs on average, follows Poisson distribution (discrete probability distribution showing the likely number of times an event will occur). 35% parasitism where no eggs are laid. Let’s make a distribution for 1000 individuals
tester<-runif(1000)
eggs<-ifelse(tester>0.35, rpois(n=1000, lambda = 10.2),0) # if tester is > .35, make a distribution, if not assign a 0
hist(eggs) # make a histogram of eggs data

  • e.g. create a vector of pvalues from a simulation and we want to create a vector to highlight the significant ones for plotting
pVals<-runif(1000)
z<-ifelse(pVals<=0.025, "lowerTail", "nonSig") # if lower than/eq to this value say "lowerTail", if not say "nonSig" # using if else
z[pVals >= 0.975] <- "upperTail" # if gr than/eq to assign "upperTail" # use subsetting
table(z) # give tables w # of lowerTail/nonSig/upperTail values
## z
## lowerTail    nonSig upperTail 
##        24       950        26

For Loops

  • for loops: workhorse function for doing repetitive tasks. Universal in all computer languages. Controversial in R because:
    • often not necessary (we have vectorization in R)
    • very slow with certain operations (binding rows/columns together)
    • family of apply functions (pretty much do the same as for loops)
  • Anatomy of for loop
    • for(variable in sequence){starts loop body of for loop end of loop}
  • var is a counter variable that holds the current value of the loop (i, j, k)
    • the sequence is an integer vector that defines the start/end of the loop
for (i in 1:5){
  cat("stuck in a loop", i, "\n")
  cat(3+2, "\n") # has nothing to do with which iteration you're in, will always produce the same value (5) in the loop
  cat(3+i, "\n") # i is whatever iteration/number it is in the sequence # in other words cat(3+loop #, line break)
}
## stuck in a loop 1 
## 5 
## 4 
## stuck in a loop 2 
## 5 
## 5 
## stuck in a loop 3 
## 5 
## 6 
## stuck in a loop 4 
## 5 
## 7 
## stuck in a loop 5 
## 5 
## 8
print(i) # will print the last number
## [1] 5
  • Use a counter variable that maps to the position of each element. This is a single for loop.
my_dogs<-c("chow", "akita", "malamute", "husky", "samoyed")
for(i in 1:length(my_dogs)){
  cat("i=", i, "my_dogs[i]=", my_dogs[i], "\n") # i value will be changing - in first iteration it'll be chow, then akita, malamute, etc # think of i like iteration number # end with line break bc good coding practice
}
## i= 1 my_dogs[i]= chow 
## i= 2 my_dogs[i]= akita 
## i= 3 my_dogs[i]= malamute 
## i= 4 my_dogs[i]= husky 
## i= 5 my_dogs[i]= samoyed
  • Can also create double for loops
m<-matrix(round(runif(20), digits=2), nrow=5)
m
##      [,1] [,2] [,3] [,4]
## [1,] 0.87 0.43 0.78 0.69
## [2,] 0.41 0.44 0.08 0.16
## [3,] 0.82 0.43 0.95 0.35
## [4,] 0.66 0.33 0.72 0.01
## [5,] 0.85 0.58 0.88 0.55
  • Use a for loop to change these values - if we were just doing loop over ROWS (not double loop)
for(i in 1:nrow(m)){
  m[i,] <- m[i,] + i # all columns, loop through one row at a time
} # this goes to first row, adds 1 to it, then to second row/iteration and adds 2 to it.
m
##      [,1] [,2] [,3] [,4]
## [1,] 1.87 1.43 1.78 1.69
## [2,] 2.41 2.44 2.08 2.16
## [3,] 3.82 3.43 3.95 3.35
## [4,] 4.66 4.33 4.72 4.01
## [5,] 5.85 5.58 5.88 5.55
  • Can do the same thing for the COLUMNS - loop over columns
    • Note: it’s good practice to use i for columns and j for rows
m<-matrix(round(runif(20), digits=2), nrow=5)
for(j in 1:ncol(m)){
  m[,j] <- m[,j] + j
}
m
##      [,1] [,2] [,3] [,4]
## [1,] 1.89 2.74 3.31 4.84
## [2,] 1.26 2.99 3.34 4.92
## [3,] 1.72 2.88 3.76 4.81
## [4,] 1.91 2.16 3.59 4.20
## [5,] 1.17 2.90 3.73 4.37
  • Loop over rows and columns
m<-matrix(round(runif(20), digits=2), nrow=5)
for(i in 1:nrow(m)){
  for(j in 1:ncol(m)){
    m[i,j] <- m[i,j] + i + j # adds i and j to each cell, depends on location of cell
  } # end of column j loop
} # end of column i loop
print(m)
##      [,1] [,2] [,3] [,4]
## [1,] 2.31 3.12 4.66 5.63
## [2,] 3.67 4.87 5.55 6.56
## [3,] 4.01 5.69 6.59 7.63
## [4,] 5.57 6.65 7.70 8.95
## [5,] 6.76 7.74 8.44 9.51