Simple Data Analysis and More Complex Control Structures 01-30-2023

Simple Data Analysis and More Complex Control Structures

January 30 2023
AGG

Loading in data from last week.

dryadData<-read.table(file="data/veysey-babbitt_data_amphibians.csv", header=TRUE, sep = ",")

Set up libraries

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggthemes)

Analyses

For experimental designs, you usually have independent/explanatory variable (x-axis) vs. dependent/response variable (y-axis)
Continuous v. discrete variables affect this
- Continuous variables: range of numbers (e.g. height, weight, temperature)
- Discrete/Categorical variables: categories (e.g. species, treatments, site)

Basic linear regression analysis (2 continuous variables)

Is there a relationship between the mean pool hydroperiod and the number of breeding frogs caught?

glimpse(dryadData)

## Rows: 132
## Columns: 13
## $ wetland            <int> 7, 7, 7, 7, 7, 7, 19, 19, 19, 19, 19, 19, 20, 20, 2…
## $ year               <int> 2004, 2005, 2006, 2007, 2008, 2009, 2004, 2005, 200…
## $ treatment          <chr> "30m", "30m", "30m", "30m", "30m", "30m", "30m", "3…
## $ species            <chr> "RASY", "RASY", "RASY", "RASY", "RASY", "RASY", "RA…
## $ count.total.adults <int> 91, 47, 128, 60, 108, NA, 74, 75, 162, 120, 14, 44,…
## $ No.males           <int> 61, 15, 58, 46, 89, NA, 58, 53, 130, 63, 8, 24, 113…
## $ No.females         <int> 30, 32, 70, 14, 19, NA, 16, 22, 32, 57, 6, 20, 33, …
## $ No.recap           <int> NA, 3, 18, 10, 6, NA, NA, 5, 29, 13, 0, 0, NA, 13, …
## $ No.newcap          <int> 82, 44, 110, 50, 102, NA, 70, 70, 133, 107, 14, 44,…
## $ mean.hydro         <dbl> 101.00, 101.00, 101.00, 101.00, 101.00, 101.00, 107…
## $ sd.hydro           <dbl> 34.58, 34.58, 34.58, 34.58, 34.58, 34.58, 38.01, 38…
## $ dv.cutornot.yrs    <int> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, …
## $ dv.30m.yrs         <int> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, …

regModel<-lm(count.total.adults ~ mean.hydro, data=dryadData) # note y~x
summary(regModel) # use summary instead of print to get more pertinent stat data

## 
## Call:
## lm(formula = count.total.adults ~ mean.hydro, data = dryadData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110.30  -47.30  -13.88   31.11  492.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9898    16.5591   0.181    0.857    
## mean.hydro    0.5955     0.1224   4.867 3.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.67 on 128 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1496 
## F-statistic: 23.69 on 1 and 128 DF,  p-value: 3.27e-06

hist(regModel$residuals)

summary(regModel)$"r.squared"

## [1] 0.1561683

summary(regModel)[["r.squared"]]

## [1] 0.1561683

# View(summary(regModel)) # View must be uppercase, comment out View because RMarkdown does not like

regPlot<-ggplot(data=dryadData, aes(x=mean.hydro, y=count.total.adults+1)) + # adding one = log transform?
  geom_point(size=2) +
  stat_smooth(method="lm", se=0.99) # sets standard error bars, default is 96
regPlot + theme_few()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 2 rows containing missing values (`geom_point()`).

Basic ANOVA

Was there a statistically significant difference in the number of adults among wetlands? Remember y~x

ANOModel<-aov(count.total.adults~wetland, data=dryadData)
summary(ANOModel)

##              Df Sum Sq Mean Sq F value Pr(>F)  
## wetland       1  17356   17356    3.28 0.0725 .
## Residuals   128 677235    5291                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness

Use factor() to tell ggplot these are different groups - otherwise just treats it as different numbers

ANOModel<-aov(count.total.adults~factor(wetland), data=dryadData)

dryadData %>%
  group_by(wetland) %>%
  summarize(avgHydro = mean(count.total.adults, na.rm=T), N=n())

## # A tibble: 11 × 3
##    wetland avgHydro     N
##      <int>    <dbl> <int>
##  1       7    70.5     12
##  2      19    56.4     12
##  3      20   113.      12
##  4      25   133.      12
##  5      30    70.5     12
##  6      39   144.      12
##  7      55    42.3     12
##  8      59    32.5     12
##  9     124   118.      12
## 10     129     9.92    12
## 11     141    69.9     12

Going to use a box plot with an ANOVA - go hand in hand with visualizing these results

dryadData$wetland

##   [1]   7   7   7   7   7   7  19  19  19  19  19  19  20  20  20  20  20  20
##  [19]  25  25  25  25  25  25  30  30  30  30  30  30  39  39  39  39  39  39
##  [37]  55  55  55  55  55  55  59  59  59  59  59  59 124 124 124 124 124 124
##  [55] 129 129 129 129 129 129 141 141 141 141 141 141   7   7   7   7   7   7
##  [73]  19  19  19  19  19  19  20  20  20  20  20  20  25  25  25  25  25  25
##  [91]  30  30  30  30  30  30  39  39  39  39  39  39  55  55  55  55  55  55
## [109]  59  59  59  59  59  59 124 124 124 124 124 124 129 129 129 129 129 129
## [127] 141 141 141 141 141 141

# class(dryadData$wetland) = integer. Need to override using factor
dryadData$wetland <- factor(dryadData$wetland)
class(dryadData$wetland)

## [1] "factor"

ANOplot<- ggplot(data=dryadData, mapping=aes(x=wetland, y=count.total.adults)) +
  geom_boxplot()
ANOplot

## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

If we wanted to see what it looked like between species?

ANOplot2<- ggplot(data=dryadData, mapping=aes(x=wetland, y=count.total.adults, fill=species)) + # mapping fills in the aesthetics, allows you to give aesthetics arguments 
  geom_boxplot() +
  scale_fill_grey() # use gear icon -> Chunk Output in Console to see plot in Console
ANOplot2

## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

This code saves pdf to repo folder: ggsave(file="SpeciesBoxplots.pdf", plot=ANOplot2, device="pdf")

Logistic Regression

Simulate data set

# xVar<-sort(rgamma())

gamma probabilities: continuous variables that are always positive and have a skewed distribution

xVar <- sort(rgamma(n=200,shape=5,scale=5))
yVar <- sample(rep(c(1,0),each=100),prob=seq_len(200)) # prob=seq_len gives it equal probability of being sampled bc n=200, same as if we did 1:200
logRegData <- data.frame(xVar,yVar)

Logistic Regression Analysis

logRegModel<-glm(yVar~xVar, data=logRegData, family=binomial(link=logit)) # generic linear model, very important function if you're stats-oriented # latter part - how we specify we want a logistic regression
summary(logRegModel)$coefficients

##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -2.9161445 0.49504974 -5.890609 3.847749e-09
## xVar         0.1128415 0.01845451  6.114577 9.681350e-10

logRegPlot<-ggplot(data=logRegData, aes(x=xVar, y=yVar)) +
  geom_point() +
  stat_smooth(method=glm, method.args=list(family=binomial))
logRegPlot # binomial is what tells R it is a logistic regression because we have 0 or 1

## `geom_smooth()` using formula = 'y ~ x'

Contingency table (chi-squared) Analysis

Are there differences in the counts of males and females between species?

countData<-dryadData %>%
  group_by(species) %>%
  summarize(Males=sum(No.males, na.rm=T), Females = sum(No.females, na.rm=T)) %>%
  select(-species) %>%
  as.matrix()
countData # gets data into a table

##      Males Females
## [1,]  2090    1513
## [2,]  4095    2422

row.names(countData) = c("SS", "WF") # change row names
countData

##    Males Females
## SS  2090    1513
## WF  4095    2422

Chi-squared
- Get residuals

testResults<-chisq.test(countData)
testResults$residuals ##??????

##        Males   Females
## SS -2.387410  2.993122
## WF  1.775151 -2.225526

testResults$expected ##????????

##       Males  Females
## SS 2202.031 1400.969
## WF 3982.969 2534.031

Mosaic plot (base R) ??? base R

mosaicplot(x=countData, col=c("goldenrod", "grey"), shade=FALSE)

Bar plot

countDataLong <- countData %>%
  as_tibble() %>%  # transform into tibble, doesn't like that it's a matrix
  mutate(Species = c(rownames(countData))) %>% # give me species data ???
  pivot_longer(cols=Males:Females, # take both columns and make them longer - put them in a row
  names_to = "Sex", # name of new column which will contain Males/Females
  values_to = "Count") # name of new column which contain the values for M/F

Plot bar graph

ggplot(data=countDataLong, mapping=aes(x=Species, y=Count, fill=Sex)) +
  geom_bar(stat="identity", position="dodge") + #plots bars next to each other
  scale_fill_manual(values=c("darkslategrey", "midnightblue"))

Control Structures

if and ifelse statements
- if statement if (condition) {expression 1}
  if it’s THIS do THIS
- if (condition) {expression 1} else {expression 2}
  if it’s THIS, do THIS, but if it’s something else, do THAT
- if (condition1) {expression 1} else
- if (condition2) {expression 2} else
else must appear on the same line as the expression
use it for single values

z<-signif(runif(1), digits=2) # give me one random uniform value rounded to second signif value
z>0.5 # this turns it into a logical

## [1] FALSE

# use {} or not
if (z>0.8) {cat(z,"is a large number", "\n")} else # {cat} is concatinate fn # if z>0.8, return that z is a large number and put a line break ("/n")
  if (z<0.2) {cat(z, "is a small number", "\n")} else
  {cat(z, "is a number of typical size", "\n")
    cat("z^2=", z^2, "\n")
    y<-TRUE} # if its anywhere in between, say it's a number of typical size, then square it # y<-TRUE - can make up expression: if something's true do this

## 0.11 is a small number

ifelse to fill vectors
ifelse(condition, yes, no)
e.g. insect population where females lay 10.2 eggs on average, follows Poisson distribution (discrete probability distribution showing the likely number of times an event will occur). 35% parasitism where no eggs are laid. Let’s make a distribution for 1000 individuals

tester<-runif(1000)
eggs<-ifelse(tester>0.35, rpois(n=1000, lambda = 10.2),0) # if tester is > .35, make a distribution, if not assign a 0
hist(eggs) # make a histogram of eggs data

e.g. create a vector of pvalues from a simulation and we want to create a vector to highlight the significant ones for plotting

pVals<-runif(1000)
z<-ifelse(pVals<=0.025, "lowerTail", "nonSig") # if lower than/eq to this value say "lowerTail", if not say "nonSig" # using if else
z[pVals >= 0.975] <- "upperTail" # if gr than/eq to assign "upperTail" # use subsetting
table(z) # give tables w # of lowerTail/nonSig/upperTail values

## z
## lowerTail    nonSig upperTail 
##        24       950        26

For Loops

for loops: workhorse function for doing repetitive tasks. Universal in all computer languages. Controversial in R because:
- often not necessary (we have vectorization in R)
- very slow with certain operations (binding rows/columns together)
- family of apply functions (pretty much do the same as for loops)
Anatomy of for loop
- for(variable in sequence){starts loop body of for loop end of loop}
var is a counter variable that holds the current value of the loop (i, j, k)
- the sequence is an integer vector that defines the start/end of the loop

for (i in 1:5){
  cat("stuck in a loop", i, "\n")
  cat(3+2, "\n") # has nothing to do with which iteration you're in, will always produce the same value (5) in the loop
  cat(3+i, "\n") # i is whatever iteration/number it is in the sequence # in other words cat(3+loop #, line break)
}

## stuck in a loop 1 
## 5 
## 4 
## stuck in a loop 2 
## 5 
## 5 
## stuck in a loop 3 
## 5 
## 6 
## stuck in a loop 4 
## 5 
## 7 
## stuck in a loop 5 
## 5 
## 8

print(i) # will print the last number

## [1] 5

Use a counter variable that maps to the position of each element. This is a single for loop.

my_dogs<-c("chow", "akita", "malamute", "husky", "samoyed")
for(i in 1:length(my_dogs)){
  cat("i=", i, "my_dogs[i]=", my_dogs[i], "\n") # i value will be changing - in first iteration it'll be chow, then akita, malamute, etc # think of i like iteration number # end with line break bc good coding practice
}

## i= 1 my_dogs[i]= chow 
## i= 2 my_dogs[i]= akita 
## i= 3 my_dogs[i]= malamute 
## i= 4 my_dogs[i]= husky 
## i= 5 my_dogs[i]= samoyed

Can also create double for loops

m<-matrix(round(runif(20), digits=2), nrow=5)
m

##      [,1] [,2] [,3] [,4]
## [1,] 0.87 0.43 0.78 0.69
## [2,] 0.41 0.44 0.08 0.16
## [3,] 0.82 0.43 0.95 0.35
## [4,] 0.66 0.33 0.72 0.01
## [5,] 0.85 0.58 0.88 0.55

Use a for loop to change these values - if we were just doing loop over ROWS (not double loop)

for(i in 1:nrow(m)){
  m[i,] <- m[i,] + i # all columns, loop through one row at a time
} # this goes to first row, adds 1 to it, then to second row/iteration and adds 2 to it.
m

##      [,1] [,2] [,3] [,4]
## [1,] 1.87 1.43 1.78 1.69
## [2,] 2.41 2.44 2.08 2.16
## [3,] 3.82 3.43 3.95 3.35
## [4,] 4.66 4.33 4.72 4.01
## [5,] 5.85 5.58 5.88 5.55

Can do the same thing for the COLUMNS - loop over columns
- Note: it’s good practice to use i for columns and j for rows

m<-matrix(round(runif(20), digits=2), nrow=5)
for(j in 1:ncol(m)){
  m[,j] <- m[,j] + j
}
m

##      [,1] [,2] [,3] [,4]
## [1,] 1.89 2.74 3.31 4.84
## [2,] 1.26 2.99 3.34 4.92
## [3,] 1.72 2.88 3.76 4.81
## [4,] 1.91 2.16 3.59 4.20
## [5,] 1.17 2.90 3.73 4.37

Loop over rows and columns

m<-matrix(round(runif(20), digits=2), nrow=5)
for(i in 1:nrow(m)){
  for(j in 1:ncol(m)){
    m[i,j] <- m[i,j] + i + j # adds i and j to each cell, depends on location of cell
  } # end of column j loop
} # end of column i loop
print(m)

##      [,1] [,2] [,3] [,4]
## [1,] 2.31 3.12 4.66 5.63
## [2,] 3.67 4.87 5.55 6.56
## [3,] 4.01 5.69 6.59 7.63
## [4,] 5.57 6.65 7.70 8.95
## [5,] 6.76 7.74 8.44 9.51

Simple Data Analysis and More Complex Control Structures 01-30-2023

Abigail Griffin

2023-01-31

Simple Data Analysis and More Complex Control Structures

Analyses

Basic linear regression analysis (2 continuous variables)

Basic ANOVA

Logistic Regression

Logistic Regression Analysis

Contingency table (chi-squared) Analysis

Control Structures

For Loops