library(lme4)
## Loading required package: Matrix
Setting a seed is important whenever randomness is involved: it makes your results reproducible.
set.seed(886) # I rolled a d10 three times to set seed
Before learning about simulating data, you need to understand distributions. They are the basis for generating data, and thinking about which ones describe your data is itself very informative. The code below produces some distributions that are common in ecology.
The normal distribution is probably the most important distribution. Values closer to the mean are more likely, but any value is theoretically possible. It is characterized by two moments: the mean and the variance. The mean describes where the distribution is centered, and the variance describes the spread around that center.
This generates 10,000 values centered around 0, with a standard deviation of 1.
normal_data <- rnorm(n = 10000, mean = 0, sd = 1)
plot(density(normal_data), main = "Normal Distribution")
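As a quick sanity check (my addition, not part of the original walkthrough), the sample moments should land near the values we asked for. Note that rnorm() is parameterized by the standard deviation, not the variance.
mean(normal_data) # should be close to 0
sd(normal_data) # should be close to 1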
pnorm() returns the area of the distribution that is less than q. Below, with the mean set to zero, the area of the distribution that is less than 0 is 50% (0.5).
pnorm(q = 0, mean = 0, sd = 1)
## [1] 0.5
qnorm() is the inverse of pnorm(): it takes a probability and returns the value that falls at that quantile. Below, we find the value at the 0.5 mark of a normal distribution with mean 0 and sd 1.
qnorm(p = 0.5, mean = 0, sd = 1)
## [1] 0
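Since qnorm() inverts pnorm(), feeding one into the other should return the original value; a quick illustrative check:
pnorm(q = qnorm(p = 0.975, mean = 0, sd = 1), mean = 0, sd = 1) # returns 0.975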
The binomial distribution is used when only 2 outcomes are possible. In typical hackneyed tradition, I will talk about coin flipping; a more applied example would look at something like survival. The binomial distribution describes the number of “successes” in a given number of trials.
Generates 10,000 results of 100 tosses with a fair coin. So you perform an experiment where you flip a coin 100 times… and you do that 10,000 times.
binom_data <- rbinom(n = 10000, size = 100, prob = 0.5)
hist(binom_data, main = "Binomial Distribution")
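For a binomial, the mean is size * prob and the variance is size * prob * (1 - prob); a quick check on the simulated draws (my addition):
mean(binom_data) # close to 100 * 0.5 = 50
var(binom_data) # close to 100 * 0.5 * 0.5 = 25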
The Bernoulli distribution is a special case of the binomial distribution with a single trial. Each draw is a 1 with the given probability and a 0 otherwise.
Generates 10,000 single draws from a Bernoulli distribution (which is just rbinom() with size = 1).
bern_data <- rbinom(n = 10000, size = 1, prob = 0.5)
hist(bern_data, main = "Bernoulli Distribution")
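Because each Bernoulli draw is a 0 or a 1, the sample mean is just the proportion of 1’s, so it should sit near prob (a quick check, added here):
mean(bern_data) # close to 0.5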
The Poisson distribution only allows non-negative integers, which makes it the usual choice for count data. It has a single parameter, lambda, which is both the mean and the variance of the distribution.
Generates 10,000 samples from a Poisson distribution with a lambda of 2.
pois_data <- rpois(n = 10000, lambda = 2)
hist(pois_data, main = "Poisson Distribution")
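A defining property of the Poisson is that the mean and the variance both equal lambda, which we can verify on the simulated counts (my addition):
mean(pois_data) # close to 2
var(pois_data) # also close to 2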
The gamma distribution also only allows positive values, but unlike the Poisson distribution, the values are continuous. The gamma distribution has two parameters, shape and rate. These can be derived from the mean and variance using the formulas below (this is called moment matching).
Generates 10,000 samples from a gamma distribution with a mean and sd of 1.
mean_gamma <- 1
sd_gamma <- 1
gam_shape <- (mean_gamma^2)/(sd_gamma^2) # shape = mean^2 / variance
gam_rte <- (mean_gamma)/(sd_gamma^2) # rate = mean / variance
gamma_data <- rgamma(n = 10000, shape = gam_shape, rate = gam_rte)
plot(density(gamma_data), main = "Gamma Distribution")
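To confirm the moment matching worked, the sample mean and sd should both land near the values we plugged into the formulas (a quick check, added for illustration):
mean(gamma_data) # close to mean_gamma = 1
sd(gamma_data) # close to sd_gamma = 1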
The t-test is used to compare differences between 2 groups. Let’s simulate some t-test data and confirm that the t.test() function works! We will look at the differences in height between residents of Hobbiton and residents of Bree.
First we simulate the heights of Hobbiton residents. Let’s say we sample 100 hobbits.
hobbiton_height <- rnorm(n = 100, mean = 42, sd = 6)
Next we simulate the heights of Bree residents. Notice that Bree residents are taller.
bree_height <- rnorm(n = 100, mean = 66, sd = 7)
Now we can test for this, and we recover the signal we simulated: the mean height of Bree residents is about 66, and Hobbiton residents are about 24 units shorter than Bree residents.
t_test <- t.test(hobbiton_height, bree_height)
t_test
##
## Welch Two Sample t-test
##
## data: hobbiton_height and bree_height
## t = -26.281, df = 198, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -24.92414 -21.44481
## sample estimates:
## mean of x mean of y
## 42.78076 65.96524
Personally, I use glm() for most basic tests. You get the same answer doing that.
t_test_data <- data.frame(city = c(rep("Hobbiton", 100), rep("Bree", 100)),
height = c(hobbiton_height, bree_height))
mod1 <- glm(height ~ city, data = t_test_data)
summary(mod1)
##
## Call:
## glm(formula = height ~ city, data = t_test_data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.9652 0.6238 105.75 <2e-16 ***
## cityHobbiton -23.1845 0.8822 -26.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 38.9116)
##
## Null deviance: 34580.5 on 199 degrees of freedom
## Residual deviance: 7704.5 on 198 degrees of freedom
## AIC: 1303.8
##
## Number of Fisher Scoring iterations: 2
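To see why the two approaches agree (an illustrative check, not part of the original): with a gaussian glm, the intercept is the mean of the reference group (Bree, which comes first alphabetically) and the cityHobbiton coefficient is the difference in group means.
mean(bree_height) # matches the intercept (~65.97)
mean(hobbiton_height) - mean(bree_height) # matches cityHobbiton (~-23.18)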
A regression is used to examine the relationship between a continuous response variable and at least one continuous predictor variable. In this example we will simulate data about the whininess of one Frodo Baggins. Specifically, we will look at how this variable changes with respect to days_since_rivendell and lembas_supply.
First we create the predictor variables:
# Days since Rivendell. Start at day 1 and end at the destruction of the One Ring (day 91).
days_since_rivendell <- 1:91
# Determine amount of food. The trip is broken into 3 sections: Rivendell to Lorien (23 days),
# Lorien (31 days), and Lorien to Mordor (37 days). I simulate food so that food decreases linearly
# after leaving a rest area.
# First leg: Days 1 - 23. Here 100 is full supply and 0 is no food. The fellowship drops about 2% every day.
food_supply_leg1 <- 100 + 1:23 * -2 + rnorm(23, 0, 2)
plot(1:23, food_supply_leg1)
# Second leg: Days 55-91. The group drops about 2.5% every day, maybe because Frodo and Sam are not as good at
# managing food on their own.
food_supply_leg2 <- 100 + 1:37 * -2.5 + rnorm(37, 0, 2)
plot(1:37, food_supply_leg2)
# Full lembas time series: decreasing from 100, full for a month, and decreasing again.
lembas_supply <- c(food_supply_leg1, rep(100, 31), food_supply_leg2)
plot(days_since_rivendell, lembas_supply, type = "l")
Now we simulate the variable we care about: Frodo’s whininess. I say that there is a positive effect of days since Rivendell, perhaps due to the ring’s creator becoming more powerful. The loss of food has a larger effect on Frodo: a negative relationship with lembas_supply.
frodo_whininess <- 10 + 1 * scale(days_since_rivendell) - 2.5*scale(lembas_supply) + rnorm(91, 0, 2)
# Look at the time series
plot(days_since_rivendell, frodo_whininess)
# Merge into a dataframe
regress_data <- data.frame(frodo_whininess = frodo_whininess,
days_since_rivendell = days_since_rivendell,
lembas_supply = lembas_supply)
Now we run a model, and we once again recover the simulated effects.
# Model
mod2 <- glm(frodo_whininess ~ scale(days_since_rivendell) + scale(lembas_supply), data = regress_data)
# Does a pretty good job of returning the answer!
summary(mod2)
##
## Call:
## glm(formula = frodo_whininess ~ scale(days_since_rivendell) +
## scale(lembas_supply), data = regress_data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.0310 0.2110 47.549 < 2e-16 ***
## scale(days_since_rivendell) 1.3971 0.2736 5.107 1.88e-06 ***
## scale(lembas_supply) -2.4400 0.2736 -8.919 6.03e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 4.049841)
##
## Null deviance: 1455.38 on 90 degrees of freedom
## Residual deviance: 356.39 on 88 degrees of freedom
## AIC: 390.48
##
## Number of Fisher Scoring iterations: 2
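One extra check worth making (my addition): confidence intervals on the estimates should bracket the values we simulated (1 for days_since_rivendell and -2.5 for lembas_supply).
confint(mod2) # the simulated values should fall inside these intervals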
In this next section we will consider how job performance is a product of experience among the monarchs of Middle-earth (in the kingdoms of men; none of that elf nonsense). There are reasons to suspect a mixed model may be applicable here. We will model this as a random intercept model: essentially, we are saying that the effect of experience is the same across kingdoms, but the intercepts (which in this case represent baseline performance) might differ.
First, we generate data for each kingdom separately. You will notice that each kingdom has a different number of kings, who reign for different amounts of time. You will also notice that the intercepts differ between kingdoms, but the effect of reign length is the same across all of them.
Simulate Kingdom of Arnor
reign_arnor <- rpois(8, 150)
performance_arnor <- 50 + reign_arnor*0.5 + rnorm(8, 0, 3)
plot(reign_arnor, performance_arnor)
Simulate Kingdom of Gondor
reign_gondor <- rpois(30, 150)
performance_gondor <- 50 + reign_gondor*0.5 + rnorm(30, 0, 3)
plot(reign_gondor, performance_gondor)
Simulate Kingdom of Rohan
reign_rohan <- rpois(17, 40)
performance_rohan <- 20 + reign_rohan*0.5 + rnorm(17, 0, 3)
plot(reign_rohan, performance_rohan)
Now that we have simulated these data separately, let’s merge them into one dataframe and run a model. We once again recover our simulated values: the effect of experience is the same everywhere, but there are baseline differences between the kingdoms derived from Numenor (Arnor and Gondor) and Rohan.
mixed_data <- data.frame(kingdom = c(rep("Arnor", 8), rep("Gondor", 30), rep("Rohan", 17)),
reign_length = c(reign_arnor, reign_gondor, reign_rohan),
job_performance = c(performance_arnor, performance_gondor, performance_rohan))
mod3 <- lmer(job_performance ~ reign_length + (1|kingdom), data = mixed_data)
summary(mod3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: job_performance ~ reign_length + (1 | kingdom)
## Data: mixed_data
##
## REML criterion at convergence: 284.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.29548 -0.70341 -0.07935 0.59184 2.35001
##
## Random effects:
## Groups Name Variance Std.Dev.
## kingdom (Intercept) 254.329 15.948
## Residual 7.836 2.799
## Number of obs: 55, groups: kingdom, 3
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 38.53403 10.03480 3.84
## reign_length 0.50775 0.03541 14.34
##
## Correlation of Fixed Effects:
## (Intr)
## reign_lngth -0.395
coef(mod3)
## $kingdom
## (Intercept) reign_length
## Arnor 46.74702 0.5077492
## Gondor 48.51875 0.5077492
## Rohan 20.33633 0.5077492
##
## attr(,"class")
## [1] "coef.mer"
Okay, this last example is for binomial data, which if you recall consists of 1’s and 0’s. Say you are a Dark Lord who has recently broken a siege and is seeking a hidden elven city. You can build a model of elven cities to get a better idea of where to look.
Let’s simulate some predictor variables
# These first 3 are created from a normal distribution
temperature <- rnorm(n = 100, mean = 70, sd = 5)
precip <- rnorm(n = 100, mean = 30, sd = 8)
forest_cover <- rnorm(n = 100, mean = 60, sd = 12)
# This variable is a simple presence/absence variable. We assign a 0.2 probability to 1,
# which is to say that eagles are present 20% of the time.
eagle_pres <- rbinom(n = 100, size = 1, prob = 0.2)
Recall that a binomial regression transforms the linear equation onto a scale that forces all predicted values to fall between 0 and 1; the transformation that does this is the inverse logit. We simulated 100 values for each of our predictors, and here I create a vector of probabilities called p. Essentially, this vector takes the information from each of our predictors and turns it into the probability that an elven settlement is located there.
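Base R provides the inverse logit as plogis(), which matches writing 1/(1 + exp(-x)) by hand; a quick illustrative aside:
plogis(-1) # ~0.27, the baseline probability at the intercept used below
1/(1 + exp(-(-1))) # the same value, written out manually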
Here we are stating that the strongest predictor of a settlement is the presence of eagles. Precip and temperature are useless.
Notice that I scale my variables. This makes the mean of each variable 0 and the sd 1. When I simulated the variables above, I simulated them on different scales, but to compare their relative strengths they need to be on the same scale. Scaling puts them on the scale of standard deviations: an effect size of 1 tells us to expect a change of 1 in the response (on the link scale, for a binomial model) for every change of 1 standard deviation in the predictor. And since scaling makes each mean 0, the intercept of the model now tells us the expected value of the response when all predictors are held at their means.
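A quick demonstration (my addition) that scale() does what is claimed here:
scaled_temp <- as.numeric(scale(temperature))
mean(scaled_temp) # essentially 0
sd(scaled_temp) # exactly 1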
p <- 1/(1 + exp(-(-1 + 0.005 * scale(temperature) + 0.01 * scale(precip) + 2.5 * eagle_pres + 0.5 * scale(forest_cover))))
# Here we create our actual data. We simulate binomial data, giving this the probabilities we just calculated.
elven_cities <- rbinom(n = 100, size = 1, prob = p)
Now we can merge the data together and run the model. Once again, we get the answer we want!
# Now we can merge these all into one data frame.
binom_data <- data.frame(elven_city = elven_cities,
temperature = temperature,
precip = precip,
forest_cover = forest_cover,
eagle_pres = eagle_pres)
# Let's run a binomial regression! Here we recover our answer!
mod4.1 <- glm(elven_city ~ scale(temperature) + scale(precip) + scale(forest_cover) + eagle_pres, data = binom_data, family = "binomial")
summary(mod4.1)
##
## Call:
## glm(formula = elven_city ~ scale(temperature) + scale(precip) +
## scale(forest_cover) + eagle_pres, family = "binomial", data = binom_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9476 0.2798 -3.387 0.000708 ***
## scale(temperature) -0.4283 0.2726 -1.571 0.116074
## scale(precip) 0.1190 0.2435 0.489 0.624981
## scale(forest_cover) 0.4319 0.2521 1.714 0.086603 .
## eagle_pres 2.5462 0.5888 4.324 1.53e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 137.63 on 99 degrees of freedom
## Residual deviance: 103.47 on 95 degrees of freedom
## AIC: 113.47
##
## Number of Fisher Scoring iterations: 4
Simulation is the ideal way to perform a power analysis, in which we ask how many samples we need to collect to answer a question (before we collect the data). While there are packages to do this, doing it by hand lets you run the simulation exactly the way your study is structured (mixed effects, different error distributions, etc.). Let’s run one using the previous example.
This is a big block of code, but it’s actually pretty simple. All we are doing is running the previous eagles example, but now generating the data with different sample sizes and seeing how well we get the answer we simulated.
# Start an empty dataset
out <- data.frame(n = NA, coef = NA)
# Sample sizes to test: from 10 to 1000 in increments of 10
sample_size <- seq(from = 10, to = 1000, by = 10)
# Generates data using different sample sizes
for (i in sample_size) {
# Same as before, but now n varies.
temperature <- rnorm(n = i, mean = 70, sd = 5)
precip <- rnorm(n = i, mean = 30, sd = 8)
forest_cover <- rnorm(n = i, mean = 60, sd = 12)
eagle_pres <- rbinom(n = i, size = 1, prob = 0.2)
p <- 1/(1 + exp(-(-1 + 0.005 * scale(temperature) + 0.01 * scale(precip) + 2.5 * eagle_pres + 0.5 * scale(forest_cover))))
elven_cities <- rbinom(n = i, size = 1, prob = p)
binom_data <- data.frame(elven_city = elven_cities,
temperature = temperature,
precip = precip,
forest_cover = forest_cover,
eagle_pres = eagle_pres)
mod4.2 <- glm(elven_city ~ scale(temperature) + scale(precip) + scale(forest_cover) + eagle_pres, data = binom_data, family = "binomial")
# Save the estimated eagle_pres coefficient (its simulated value is 2.5)
temp_out <- data.frame(n = i, coef = coef(mod4.2)[[5]])
out <- rbind(out, temp_out)
}
out <- out[-1,] # drop the initial NA row
We see that ideally we would want a sample size of at least around 75-100.
plot(out$n, out$coef, type = "l", xlab = "Sample size", ylab = "Estimated eagle_pres coefficient")
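Since we know the true coefficient is 2.5, a useful variant (sketched here as an addition) is to plot the absolute error of the estimate, which should shrink toward 0 as n grows:
plot(out$n, abs(out$coef - 2.5), type = "l",
     xlab = "Sample size", ylab = "Absolute error in eagle_pres estimate")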