Before jumping into generalized linear models (GLMs), we will recap important foundations of general linear models, such as simple and multiple regression.
2.1 Data Demonstration
In this data demo, we will first review loading R packages and data, and then review key aspects of simple and multiple linear regression.
The data for this chapter is toy data about academic performance after obtaining a Ph.D., including salary information. It is composed of both continuous and categorical variables.
In R, packages must be installed before they can be used. Once installed, they can be loaded into the session with the library() command.
Note
One way of installing new R packages is to use the install.packages() command.
## Load packages
library(tidyverse)
library(easystats)
library(ggeffects)

## Read in data from file
yearspubs <- read_csv("yearspubs.csv")
yearspubs # print
# A tibble: 62 × 5
   years_since_phd n_pubs sex    n_cites salary
             <dbl>  <dbl> <chr>    <dbl>  <dbl>
 1               3     18 female      50  21876
 2               6      3 female      26  54511
 3               3      2 female      50  53425
 4               8     17 male        34  61863
 5               9     11 female      41  52926
 6               6      6 male        37  47034
 7              16     38 male        48  66432
 8              10     48 male        56  61100
 9               2      9 male        19  41934
10               5     22 male        29  97454
# ℹ 52 more rows
Note
read_csv() comes from the tidyverse package. As used here, by default it assumes that the file yearspubs.csv is in the same directory as the R analysis file.
Note
Each package we load provides functionality:
easystats: For making table outputs from the models. easystats website
tidyverse: Provides a suite of easier-to-use analysis commands. tidyverse website
ggeffects: Provides plotting and visualizations of the models. ggeffects website
Here’s a table of our variables:
| Variable | Description | Values | Measurement |
|---|---|---|---|
| years_since_phd | Years since obtaining Ph.D. | Integer | Scale |
| n_cites | Number of Academic Citations | Integer | Scale |
| salary | Current Salary Figures | Integer | Scale |
| n_pubs | Number of Academic Publications | Integer | Scale |
| sex | Male or Female | Characters | Nominal |
2.3 Prepare Data / Exploratory Data Analysis
It is always a good first step to plot data prior to fitting any model.
First, let’s consider the distributions, as histograms, of years_since_phd and salary using the ggplot() function¹. You can learn more about how to use ggplot() here.
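The histogram code is not shown here; a minimal sketch with ggplot(), assuming the default geom_histogram() binning, could look like this:

## Histogram of years_since_phd
ggplot(yearspubs, aes(x = years_since_phd)) +
  geom_histogram()

## Histogram of salary
ggplot(yearspubs, aes(x = salary)) +
  geom_histogram()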
In contrast to years_since_phd, notice that salary is more normally distributed.
Now, we consider the relationship between years_since_phd and salary using a scatter plot. Notice that, for the most part, we can see a linear relationship such that salary tends to increase with the number of additional years after obtaining a Ph.D.
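A minimal sketch of that scatter plot (again assuming ggplot() defaults):

## Scatter plot of salary against years_since_phd
ggplot(yearspubs, aes(x = years_since_phd, y = salary)) +
  geom_point()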
Now, we will fit a simple linear regression with salary as the outcome variable and years_since_phd as the predictor variable using the lm() function, and consider the results. We will then print a summary table of the model results using the model_parameters() function from the easystats package, which shows the model’s estimated coefficient values.
lm() syntax
When using the lm() function, we are telling R to fit a model of the form:
Formula: outcome = intercept + predictor
lm(): outcome ~ 1 + predictor
Therefore, for lm() syntax, salary is the outcome variable, and years_since_phd is the predictor variable. The intercept is denoted by the “1”.
## Fit linear model and print parameters
fit <- lm(formula = salary ~ years_since_phd, data = yearspubs)
model_parameters(fit)
## Parameter       | Coefficient |      SE |               95% CI | t(60) |      p
## -------------------------------------------------------------------------------
## (Intercept)     |    45625.47 | 2473.12 | [40678.49, 50572.46] | 18.45 | < .001
## years since phd |     1400.94 |  308.87 | [  783.12,  2018.77] |  4.54 | < .001
##
## Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
## using a Wald t-distribution approximation.
By default, this prints as plain console output, as shown above. However, for the rest of the modules, we will be using a prettier markdown format with print_md():
model_parameters(fit) |> print_md()
| Parameter | Coefficient | SE | 95% CI | t(60) | p |
|---|---|---|---|---|---|
| (Intercept) | 45625.47 | 2473.12 | (40678.49, 50572.46) | 18.45 | < .001 |
| years since phd | 1400.94 | 308.87 | (783.12, 2018.77) | 4.54 | < .001 |
This simple regression line is defined by the formula:
\[ \hat{y}_i = \beta_0 + \beta_1 x_i \]
3.1 Parameter Table
From the results output, we can assess each parameter:
The intercept (\(\beta_0\)) is the estimated salary for a professor with zero years since PhD: $45625.47, 95% CI: [$40678.49, $50572.46]. This is significantly different from zero (p < 0.001).
The coefficient for years_since_phd (\(\beta_1\)) is $1400.94, 95% CI: [$783.12, $2018.77]. Therefore, each additional year since PhD was associated with an increase in salary of $783.12 to $2018.77. This zero-order effect (since it is the only predictor variable) is significantly different from zero (p < 0.001) at $1400.94.
95% CI
The frequentist interpretation of a confidence interval (CI) is quite wordy, but it needs to be understood:
“If we repeated the experiment over and over again (with different samples of the same size) and computed the 95% CI in each sample, then 95% of those intervals would contain the population parameter value.”
Alternative, reasonable interpretations can be as follows:
“We can be 95% confident that the 95% CI contains the population mean.”
This first alternative isn’t our favorite as it implies a view of probability as belief (which isn’t very frequentist).
“The 95% CI contains highly reasonable estimates of the population mean.”
3.2 Effect Sizes
By standardizing the data and then refitting the model, we can calculate the effect size of years_since_phd.
## Calculate effect sizes by standardizing
standardize_parameters(fit) |> print_md()
# Standardization method: refit
| Parameter | Std. Coef. | 95% CI |
|---|---|---|
| (Intercept) | 2.30e-16 | [-0.22, 0.22] |
| years_since_phd | 0.51 | [0.28, 0.73] |
Using Cohen’s d guidelines, d = 0.51 is a medium-sized effect.
We can also use this standardized model to make the following statement:
For every one SD increase in years_since_phd, we would expect a 0.51 SD increase in salary, \(\beta_1\) = 0.51, 95% CI: [0.28, 0.73], p < 0.001.
3.3 Model Performance
One purpose of model fitting is to demonstrate accurate predictions – given a new X, can we reasonably predict a new Y?
Three popular metrics of model fit performance in service of prediction are adjusted \(R^2\), AIC, and BIC.
## Calculate model performance indices
model_performance(fit) |> print_md()
Indices of model performance

| AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma |
|---|---|---|---|---|---|---|
| 1325.90 | 1326.31 | 1332.28 | 0.26 | 0.24 | 10151.69 | 10319.50 |
When comparing models, a higher adjusted \(R^2\) indicates a better fit, and a lower AIC/BIC indicates a better fit. Here, we aren’t comparing two competing models, but it is easy to obtain these metrics with the packages we loaded.
3.4 Plotting Model Predictions
Lastly, the model predictions are easy to plot with predict_response(). The grey bands show our uncertainty along the best-fit line, and you can incorporate the raw data into the plot as well with show_data = TRUE. These plots visualize the parameter estimates we assessed earlier in the parameter tables.
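The plotting code itself is not shown; a minimal sketch, assuming a recent ggeffects version that provides predict_response() and the show_data argument, could be:

## Plot predicted salary across years_since_phd, with the raw data overlaid
predict_response(fit, terms = "years_since_phd") |>
  plot(show_data = TRUE)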
Data points may overlap. Use the `jitter` argument to add some amount of
random variation to the location of data points and avoid overplotting.
4 Multiple Regression Examples
Multiple regression allows us to control for the effects of multiple predictors on the outcome. Statistically “controlling” for a predictor means that the shared variance between predictors is accounted for/removed so that the unique contribution of any one predictor variable to the outcome variable is estimated. We will divide our recap across different multiple regression scenarios.
4.1 Two Continuous Predictors
First, we will fit a multiple linear regression with salary as the outcome variable and years_since_phd and n_pubs as the predictor variables.
Let’s proceed as before when we introduce a new variable and plot the distribution of n_pubs as a histogram.
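A minimal sketch of that histogram (using geom_histogram() defaults, which triggers the message below):

## Histogram of n_pubs
ggplot(yearspubs, aes(x = n_pubs)) +
  geom_histogram()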
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here’s how we can use the lm() function to fit a multiple regression with salary as the outcome variable and years_since_phd and n_pubs as the predictor variables.
## Fit linear model and print parameters
fit2 <- lm(formula = salary ~ years_since_phd + n_pubs, data = yearspubs)
model_parameters(fit2) |> print_md()
| Parameter | Coefficient | SE | 95% CI | t(59) | p |
|---|---|---|---|---|---|
| (Intercept) | 45113.71 | 2498.17 | (40114.89, 50112.53) | 18.06 | < .001 |
| years since phd | 1075.21 | 406.75 | (261.30, 1889.11) | 2.64 | 0.010 |
| n pubs | 151.18 | 123.52 | (-95.98, 398.34) | 1.22 | 0.226 |
This multiple regression is defined by the formula:
\[ \hat{y}_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} \]
Multiple regression gives us the ability to control for the relationships of multiple predictors with the outcome variable. As mentioned above, statistically “controlling” for a predictor means that the shared variance between predictors is accounted for/removed so that the unique effect of any one predictor variable on the outcome variable is estimated. We can then ask the following questions:
What is the relationship between \(x_1\) and \(y\) if we hold \(x_2\) at a constant value? e.g., What is the salary difference between prof A (woman, 10 years) and prof B (woman, 11 years)?
If we already know \(x_2\), how much does learning about \(x_1\) change our prediction of \(y\)? e.g., how much does learning that prof C is male change our prediction of their salary if we already knew they have 15 years of seniority?
Remember, as partial effects, we are finding the unique effect of a predictor while controlling for the other predictors explicitly included in the model. Importantly, we cannot say anything about controlling for variables that are not in the model.
4.1.1 Parameter Table Interpretation
From the parameter table above, we can make the following statements:
The intercept is the estimated salary for a professor with zero years since PhD and zero publications: $45113.71, 95% CI: [40114.89,50112.53].
The partial effect of n_pubs, $151.18, was not significantly different from zero. When controlling for years_since_phd, each additional publication was associated with a change of −$95.98 to $398.34.
The partial effect of years_since_phd was significantly different from zero at $1075.21. Thus, when controlling for n_pubs, each additional year since PhD was associated with an additional $261.30 to $1889.11.
Partial Effects vs. Zero-Order Effects
When we use multiple regression, the slope (\(\beta\)) parameters are called “partial” (i.e., unique) effects since the model is controlling for the influence of the other predictors on the outcome variable.
In contrast, with simple linear regression that has only one predictor, that predictor’s effect is called a “zero-order” (i.e., total) effect.
The decision to focus on partial effects or zero-order effects depends on the specific aims of the research question because they answer different questions.
4.2 One Dummy Coded Predictor
In the case of categorical predictor(s), we need to re-code them so that each of the categories/groups they represent can be mathematically represented as a code variable. All coding systems turn a categorical variable with g groups into g − 1 code variables.
So, for example, if we have 6 groups, we need 5 code variables.
The easiest way to turn a categorical predictor into the code variables is to use the factor() function (see the sketch below):
It will use the first level (in this case, “female”) as the reference group.
When the model is fit, the slope will be named by pasting together the variable name and the non-reference level (e.g., sexmale).
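The recoding step itself is not shown here; a minimal sketch, assuming the sex column contains the values "female" and "male", could be:

## Recode sex as a factor, with "female" (listed first) as the reference group
yearspubs$sex <- factor(yearspubs$sex, levels = c("female", "male"))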
Note
When we use functions like factor() we have explicitly written out the input arguments, such as levels, in order to show exactly what information the function is processing.
The $
As a reminder, the “$” symbol extracts one of the columns in the dataframe.
In order to interpret the results, it’s important to understand how the code variables are integrated into the model’s formula:
\[ \hat{y}_i = \beta_0 + \beta_1 C_{1i} \]
The intercept \(\beta_0\) is the value of \(\hat{y}_i\) when \(C_1 = 0\). In other words, it is the reference group’s average value.
The slope \(\beta_1\) is the change in \(\hat{y}_i\) when \(C_1\) increases by 1. In other words, it is the group difference in average value, and adding it to the intercept gives the non-reference group’s average outcome value.
Note that if you have more than two categories per variable (here, we only have two groups/categories), you will use the exact same approach but have more than one slope.
Now, we can fit and interpret the model:
fit3 <- lm(formula = salary ~ sex, data = yearspubs)
model_parameters(fit3) |> print_md()
| Parameter | Coefficient | SE | 95% CI | t(60) | p |
|---|---|---|---|---|---|
| (Intercept) | 51501.85 | 2214.97 | (47071.25, 55932.46) | 23.25 | < .001 |
| sex (male) | 6441.78 | 2948.02 | (544.86, 12338.69) | 2.19 | 0.033 |
Therefore:
The average salary of female academics is $51,502.
Male academics make $6,442 more on average, so their average salary is $51,502 + $6,442 = $57,944.
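As a quick check, the same arithmetic can be done from the fitted coefficients (a sketch that assumes the second coefficient is the sex slope):

## Recover the two group means from the dummy-coded model
b <- coef(fit3)
b[1]         # female mean (intercept)
b[1] + b[2]  # male mean (intercept + sex slope)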
4.2.1 Estimating Means and Plot
We will use the estimate_means() function to accomplish the same thing we did above to obtain average values.
gmeans1 <- estimate_means(fit3, by = c("sex"))
gmeans1 |> print_md()
Estimated Marginal Means

| sex | Mean | SE | 95% CI |
|---|---|---|---|
| female | 51501.85 | 2214.97 | (47071.25, 55932.46) |
| male | 57943.63 | 1945.43 | (54052.18, 61835.07) |

Marginal means estimated at sex
Lastly, we can visualize like so:
plot(gmeans1)
4.2.2 Contrast
To see if this apparent difference of average salary between male and female academics is statistically significant, we can perform a contrast.
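The contrast code is not shown here; a minimal sketch, assuming the estimate_contrasts() function from easystats (modelbased) is available, could be:

## Contrast the male and female group means
estimate_contrasts(fit3, contrast = "sex") |> print_md()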
Marginal contrasts estimated at sex
p-value adjustment method: Holm (1979)
Therefore, there is a statistically significant difference between male and female academics, p < 0.05.
4.3 One Continuous and One Dummy Coded Predictor
Now, we will fit a multiple linear regression with salary as the outcome variable and years_since_phd and sex as the predictor variables.
## Fit linear model and print parameters
fit4 <- lm(formula = salary ~ years_since_phd + sex, data = yearspubs)
model_parameters(fit4) |> print_md()
| Parameter | Coefficient | SE | 95% CI | t(59) | p |
|---|---|---|---|---|---|
| (Intercept) | 43989.01 | 2667.12 | (38652.12, 49325.91) | 16.49 | < .001 |
| years since phd | 1300.30 | 312.36 | (675.28, 1925.32) | 4.16 | < .001 |
| sex (male) | 4109.49 | 2673.09 | (-1239.35, 9458.34) | 1.54 | 0.130 |
4.3.1 Estimating Means
Recall that prior to fitting the multiple regression, we changed sex into a categorical variable using the factor() function. This is an example of dummy coding. Therefore, once we fit the multiple regression we were able to separate response values based on different factor levels.
In the case of sex, we have a binary categorical variable, so we can obtain average values of salary for each sex category while also accounting for (i.e. controlling for) the (partial) effect of years_since_phd.
We will use the estimate_means() function to accomplish this.
gmeans <- estimate_means(fit4, by = "sex")
gmeans |> print_md()
Estimated Marginal Means

| sex | Mean | SE | 95% CI |
|---|---|---|---|
| female | 52818.46 | 1989.11 | (48838.27, 56798.66) |
| male | 56927.96 | 1742.00 | (53442.23, 60413.69) |

Marginal means estimated at sex
We can also create a plot to visualize the means.
plot(gmeans)
We can now more easily see that, averaged over all years_since_phd values, male academics earn a higher salary on average than female academics. However, as the plot shows, there is variability within each factor level.
To see if this apparent difference of average salary between male and female academics is statistically significant, we can perform a contrast.
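As before, the contrast code is not shown; a sketch assuming estimate_contrasts() is available:

## Contrast the marginal means for sex, controlling for years_since_phd
estimate_contrasts(fit4, contrast = "sex") |> print_md()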
Marginal contrasts estimated at sex
p-value adjustment method: Holm (1979)
Therefore, when controlling for years_since_phd, there is not a statistically significant difference in average salary between male and female academics, p = 0.130.
4.4 Interactions
Interactions ask us to consider the extent to which the effect of one predictor depends on the value of another predictor. It is up to the investigator, guided by theory and previous research, to consider whether including interaction terms is relevant for the research question.
Again, we include interaction terms like so:
\[ \hat{y}_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{1i} x_{2i} \]
- \(\beta_3\) is the slope of the interaction term (the product of two or more predictors)
We can now fit our new model with the interaction term:
## Fit linear model and print parameters
fit5 <- lm(formula = salary ~ years_since_phd + sex + years_since_phd:sex, data = yearspubs)
model_parameters(fit5) |> print_md()
| Parameter | Coefficient | SE | 95% CI | t(58) | p |
|---|---|---|---|---|---|
| (Intercept) | 45072.28 | 4462.64 | (36139.34, 54005.22) | 10.10 | < .001 |
| years since phd | 1112.81 | 692.27 | (-272.92, 2498.54) | 1.61 | 0.113 |
| sex (male) | 2656.21 | 5486.15 | (-8325.51, 13637.92) | 0.48 | 0.630 |
| years since phd × sex (male) | 236.36 | 777.28 | (-1319.53, 1792.25) | 0.30 | 0.762 |
In this example, the years_since_phd by sex interaction term was not significant (p = 0.762).
The * symbol for interactions
In the code block above, we specified the interaction term by writing out its explicit form. There is a short-form way in R to write out all the interaction terms by using the asterisk (*) symbol. This is especially helpful when there are more than two predictor variables, where higher-order interaction terms are possible (you would not want to accidentally forget an interaction term!).
Using the example above:
Long Form Formula: salary ~ years_since_phd + sex + years_since_phd:sex
Short Form Formula: salary ~ years_since_phd*sex
Remember, both forms are equivalent.
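As a quick sanity check of this equivalence, both calls below should produce identical coefficient estimates (a sketch using the same data as above; fit5_long and fit5_short are hypothetical names):

## Long form: main effects plus an explicit interaction term
fit5_long <- lm(salary ~ years_since_phd + sex + years_since_phd:sex, data = yearspubs)
## Short form: * expands to the main effects and their interaction
fit5_short <- lm(salary ~ years_since_phd * sex, data = yearspubs)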
We can visualize the interaction by making a marginal effects, or “spotlight”, plot.
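The plotting code is not shown here; a minimal sketch, assuming predict_response() from ggeffects with the second term used for grouping, could be:

## Spotlight plot: predicted salary by years_since_phd, separately for each sex
predict_response(fit5, terms = c("years_since_phd", "sex")) |>
  plot()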
Note that in this multiple regression with an interaction term, each level of sex has a different slope. We can estimate these slopes, or the “simple” effect of years_since_phd on salary for each category of sex, with estimate_slopes(). The term “simple” comes from adding the interaction term to the model.
estimate_slopes(fit5, trend = "years_since_phd", by = "sex") |> print_md()
Estimated Marginal Effects

| sex | Coefficient | SE | 95% CI | t(58) | p |
|---|---|---|---|---|---|
| female | 1112.81 | 692.27 | (-272.92, 2498.54) | 1.61 | 0.113 |
| male | 1349.17 | 353.44 | (641.69, 2056.65) | 3.82 | < .001 |

Marginal effects estimated for years_since_phd
In this model with an interaction, the simple slope of years_since_phd was significantly different from zero for men at $1349.17, but not for women (p = 0.113). Therefore, it is possible to have a non-significant interaction term and some, but not all, significant simple slopes.
5 Assumptions
The six assumptions of general linear models can be divided into two general areas. The key is that GLMs are useful when assumptions of general linear models no longer hold or are unreasonable.
Assumptions about the Formula
Correct Functional Form
Perfectly Measured Predictors
No Collinearity/Multicollinearity
Assumptions about the Residuals
Constant Error Variance
Independence of Residuals
Normality of Residuals
When we utilize GLMs, we are trying to do better than assume that each of these six assumptions of general linear models holds. One way to check some of these assumptions visually is to use the check_model() function. As an example, we will apply these checks to the multiple regression model without the interaction term.
check_model(fit2)
As you can see, this is an easy way to check the assumptions. For this model, the assumptions appear to hold reasonably well. But for the GLM models in future modules, these assumptions will no longer be reasonable.
6 Conclusion
In this recap, we fit simple and multiple regression models with interaction terms. This is a great starting point for learning about GLMs. In future modules, we will carry our understanding and interpretation of general linear models into GLMs and learn when GLMs are most appropriate, especially when assumptions do not hold.
¹ ggplot() is also included in the tidyverse package.