Multiple Linear Regression with R – a practical example

In this post I want to show you how to implement a Multiple Linear Regression (often also, if somewhat loosely, called Multivariate Linear Regression).

As I only exemplify how to implement it, you will need to look elsewhere to understand the underlying concepts. You can find explanations of the workings of linear regression e.g. in Achen (1982), Bortz (2005), Crawley (2005), Faraway (2002), Field, Miles & Field (2012) (my personal favorite!), or Wilcox (2009). You can find other, in part more elaborate, examples of how to implement regressions in R in Baayen (2008), Crawley (2007), or Gries (2009).
Just some words before we start:

In contrast to a Simple Linear Regression, which tests the correlation between one predictor variable and a dependent variable, a Multiple Linear Regression tests the correlations between two or more predictors and a dependent variable. In addition, Multiple Linear Regression models can test whether there are interactions between predictors. There cannot be interactions in a Simple Linear Regression because there is only one predictor.
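To make the distinction concrete, here is what these model types look like in R's formula notation (a sketch using the variable names of the data set we will load below):

# R formula notation (sketch; variable names as in the data set loaded below):
# money ~ attraction           # simple linear regression: one predictor
# money ~ status + attraction  # multiple linear regression: two main effects
# money ~ status * attraction  # main effects plus their interaction
# (status * attraction is shorthand for status + attraction + status:attraction)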

With respect to model diagnostics, please have a look at the post on Simple Linear Regression, because what I wrote there about e.g. the interpretation of the diagnostic plots also applies to Multiple Regression.
I will, nonetheless, discuss some issues which are specific to Multiple Regression (e.g. multicollinearity), as they do not apply to Simple Linear Regression.
I am often asked about the minimal sample size required for regression analysis, and I typically recommend 25 cases per group. I only give this value because I am annoyed by the question: the required sample size actually depends on the size of the effect you want to detect and on the number of predictors in your model. If you have many predictors and want to detect even small effects, your minimum sample size would be something like 600 cases. Field, Miles & Field (2012:273-274), however, provide rules of thumb for minimal sample sizes in regression analyses (k = number of predictors in the model; categorical predictors with more than 2 levels should be transformed into dummy variables):

  • If you are only interested in the overall fit of the model (in my experience this is almost never the case), your sample size should be at least 50 + 8k (see the sketch below).
  • If you are interested in individual predictors, then your minimal sample size is 104 + k.
  • If you are interested in both, choose the higher value.

You will see below that I wrote a function which checks if the minimal sample size is sufficient.
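To illustrate the arithmetic behind these rules of thumb, here is a minimal sketch (assuming k = 3, i.e. two main effects plus their interaction, as in the model fitted below):

# rules of thumb for minimal sample sizes (Field, Miles & Field 2012:273-274)
k <- 3                    # number of predictors (two main effects + interaction)
50 + 8 * k                # overall fit only: 74 cases
104 + k                   # individual predictors: 107 cases
max(50 + 8 * k, 104 + k)  # both: take the higher value, i.e. 107 cases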

Regarding model-fitting procedures, I will only touch on them here: I will use stepwise step-down model selection based on the AIC (Akaike Information Criterion). There are several methods one can use to fit a model (forced entry, stepwise, hierarchical), and even within these approaches there are subroutines; a discussion of these different procedures is beyond the scope of this post. I will, however, write another post dedicated exclusively to model-fitting procedures.
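To give you an idea of what these procedures look like in code, here is a sketch of the three direction settings of R's step() function (m0.mlr and m1.mlr refer to the baseline and saturated models defined below):

# step(m1.mlr, direction = "backward")                          # step-down: start full, drop terms
# step(m0.mlr, scope = formula(m1.mlr), direction = "forward")  # step-up: start empty, add terms
# step(m1.mlr, direction = "both")                              # step-down, but dropped terms may re-enter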

We are going to examine whether the money guys spend on presents depends on the attraction to and the relationship status of the girls that the presents are given to.

So here we go….

###########################################################################
### Multiple Linear Regression
###########################################################################
### --- Prepare data
# Remove all objects from the current workspace
rm(list=ls(all=T))
 
# Install packages we need or which may be useful
# (to activate just delete the #)
#install.packages("rms")
#install.packages("glmulti")
#install.packages("lmtest")
#install.packages("MASS")
#install.packages("QuantPsyc")
#install.packages("car")
 
#install.packages("ggplot2")
 
# Initiate the packages
#library(rms)
#library(glmulti)
#library(lmtest)
#library(MASS)
library(car)
library(QuantPsyc)
library(boot)
library(ggplot2)
source("C:\\R/multiplot_ggplot2.R")
source("C:\\R/mlr.summary.R")
 
###########################################################################
### Load and manipulate data
########################################################################
 
# set options
options("scipen" = 100, "digits" = 4)
 
###########################################################################
# read in existing data set
mydata <- read.table("C:\\MyProjects\\MultipleLinearRegression/mlrdata.txt", sep = "\t", header = T)
 
# provide an overview of the data
head(mydata); str(mydata); summary(mydata)
 
#        status    attraction money
#1 Relationship NotInterested 86.33
#2 Relationship NotInterested 45.58
#3 Relationship NotInterested 68.43
#4 Relationship NotInterested 52.93
#5 Relationship NotInterested 61.86
#6 Relationship NotInterested 48.47
#'data.frame':   100 obs. of  3 variables:
# $ status    : Factor w/ 2 levels "Relationship",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ attraction: Factor w/ 2 levels "Interested","NotInterested": 2 2 2 2 2 2 2 2 2 2 ...
# $ money     : num  86.3 45.6 68.4 52.9 61.9 ...
#          status           attraction     money       
# Relationship:50   Interested   :50   Min.   :  0.93  
# Single      :50   NotInterested:50   1st Qu.: 49.84  
#                                      Median : 81.73  
#                                      Mean   : 88.38  
#                                      3rd Qu.:121.57  
#                                      Max.   :200.99  
 
###########################################################################
p1 <- ggplot(mydata, aes(status, money)) + 
  geom_boxplot(notch = T, aes(fill = factor(status))) +
  scale_fill_brewer() +
#  facet_wrap(~ attraction, nrow = 1) +
  theme_bw() + # white background (comment out for the default grey)
  labs(x = "") +
  labs(y = "Money spent on present (€)") +
  coord_cartesian(ylim = c(0, 250)) +
  guides(fill = FALSE) +
  ggtitle("Status") 
 
p2 <- ggplot(mydata, aes(attraction, money)) + 
  geom_boxplot(notch = T, aes(fill = factor(attraction))) +
  scale_fill_brewer() +
#  facet_wrap(~ attraction, nrow = 1) +
  theme_bw() + # white background (comment out for the default grey)
  labs(x = "") +
  labs(y = "Money spent on present (€)") +
  coord_cartesian(ylim = c(0, 250)) +
  guides(fill = FALSE) +
    ggtitle("Attraction")
 
p3 <- ggplot(mydata, aes(x = money)) + 
  geom_histogram(aes(y=..density..),
      binwidth = 10,
      colour = "black", fill = "white") +
  geom_density(alpha=.2, fill = "#FF6666")  # Overlay with transparent density plot
 
 
p4 <- ggplot(mydata, aes(status, money)) + 
  geom_boxplot(notch = F, aes(fill = factor(status))) +
  scale_fill_brewer(palette="Paired") +
  facet_wrap(~ attraction, nrow = 1) +
  theme_bw() + # white background (comment out for the default grey)
  labs(x = "") +
  labs(y = "Money spent on present (€)") +
  coord_cartesian(ylim = c(0, 250)) +
  guides(fill = FALSE) 
 
 
# Plot the plots
multiplot(p1, p3, p2, p4, cols = 2)
###########################################################################

[Figure mlr1: boxplots of money by status and by attraction, a histogram of money with density overlay, and boxplots of money by status faceted by attraction]

 
 
###########################################################################
# generate initial minimal regression model including
# only the intercept (overall mean) as predictor
m0.mlr = lm(money ~ 1, data = mydata) # baseline model
m0.glm = glm(money ~ 1, family = gaussian, data = mydata) # baseline model
 
# inspect results
summary(m0.mlr)
 
#Call:
#lm(formula = money ~ 1, data = mydata)
#Residuals:
#   Min     1Q Median     3Q    Max 
#-87.45 -38.54  -6.65  33.20 112.61 
#Coefficients:
#            Estimate Std. Error t value            Pr(>|t|)    
#(Intercept)    88.38       4.86    18.2 <0.0000000000000002 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 48.6 on 99 degrees of freedom
 
# inspect results
summary(m0.glm)
 
#Call:
#glm(formula = money ~ 1, family = gaussian, data = mydata)
#Deviance Residuals: 
#   Min      1Q  Median      3Q     Max  
#-87.45  -38.54   -6.65   33.20  112.61  
#
#Coefficients:
#            Estimate Std. Error t value            Pr(>|t|)    
#(Intercept)    88.38       4.86    18.2 <0.0000000000000002 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#(Dispersion parameter for gaussian family taken to be 2359)
#    Null deviance: 233562  on 99  degrees of freedom
#Residual deviance: 233562  on 99  degrees of freedom
#AIC: 1063
#Number of Fisher Scoring iterations: 2
 
###########################################################################
# create saturated model
m1.mlr = lm(money ~ (status + attraction)^2, data = mydata)
m1.glm = glm(money ~ status * attraction, family = gaussian, data = mydata)
 
###########################################################################
### --- Fit the model to find the "best" model,
### --- i.e. the minimal adequate model
### --- We will use a step-wise step down procedure
# automated modelfitting
# a) criterion: AIC (the smaller the better the model fit)
step(m1.mlr, direction = "both")
#step(m1.glm, direction = "both")
 
#Start:  AIC=592.5
#money ~ (status + attraction)^2
#
#                    Df Sum of Sq   RSS AIC
#<none>                           34558 593
#- status:attraction  1     24947 59505 645
#
#Call:
#lm(formula = money ~ (status + attraction)^2, data = mydata)
#
#Coefficients:
#                         (Intercept)                          statusSingle  
#                                99.2                                  57.7  
#             attractionNotInterested  statusSingle:attractionNotInterested  
#                               -47.7                                 -63.2  
 
# the step function shows that the model with all main effects and 
# interactions, i.e. the model with the formula
# "money ~ status * attraction" has the lowest AIC
 
# generate new model
m2.mlr = lm(money ~ (status + attraction)^2, data = mydata)
m2.glm = glm(money ~ (status + attraction)^2, family = gaussian, data = mydata)
 
# summary
#summary(m2.glm)
summary(m2.mlr)
 
#Call:
#lm(formula = money ~ (status + attraction)^2, data = mydata)
#Residuals:
#   Min     1Q Median     3Q    Max 
#-45.08 -14.26   0.46  11.93  44.14 
#Coefficients:
#                                     Estimate Std. Error t value             Pr(>|t|)    
#(Intercept)                             99.15       3.79   26.13 < 0.0000000000000002 ***
#statusSingle                            57.69       5.37   10.75 < 0.0000000000000002 ***
#attractionNotInterested                -47.66       5.37   -8.88    0.000000000000038 ***
#statusSingle:attractionNotInterested   -63.18       7.59   -8.32    0.000000000000581 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 19 on 96 degrees of freedom
#Multiple R-squared:  0.852,     Adjusted R-squared:  0.847 
#F-statistic:  184 on 3 and 96 DF,  p-value: <0.0000000000000002
 
 
##########################################################################

Ok, so what does all this mean?

The first object reported by the summary function is the Call, i.e. your formula. Next, the summary contains an overview of the residuals, i.e. a summary of the distribution of the differences between the values predicted by the model and the observed values.
Next comes the most interesting part: a table with the parameters for our predictors (strictly speaking, these are the parameters for our fixed effects – we will look at the difference between fixed and random effects in another post). We will examine that table in more detail later on.

After that table come the model statistics. The model statistics provide information about the model as a whole, whereas the coefficients table provides information about each effect individually.
The “Multiple R2” value tells us how much variance is explained by our model. A value of 0 would tell us that our model is useless and does not explain any variance. A value of 1 would tell us that our model predicts the observed values perfectly. If you multiply the Multiple R2 value by 100, you get the percentage of variance explained by the model. In our case, the value of 0.852 means that our model accounts for 85.2 percent of the overall variance.
Some researchers have proposed that a model should have a Multiple R2 value of at least .05, but this is not necessarily true. The Multiple R2 value only tells us how much variance is explained by our model, and we may well be interested in a set of predictors which have only a small effect on the dependent variable. So the model can tell us something even if the Multiple R2 value is small. The important thing is that the model is significant overall, which means that it performs better than predicting the outcome from the overall mean alone. But let’s continue…
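If you want to verify what the Multiple R2 value represents, you can compute it by hand from the model (a sketch; run after fitting m2.mlr above):

rss <- sum(resid(m2.mlr)^2)                        # residual sum of squares
tss <- sum((mydata$money - mean(mydata$money))^2)  # total sum of squares
1 - rss/tss                                        # = 0.852, the Multiple R-squared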

The “Adjusted R2” value is similar to the Multiple R2 value, but it takes the number of predictors in our model into consideration. There is, however, a more interesting aspect to the Adjusted R2 value: it tells us how well our model generalizes from our sample to the overall population. If there is little difference between the Multiple R2 value and the Adjusted R2 value, our model can readily be generalized and applied to the population at large. If there is a substantial difference between the two, then our model is rather unstable, and we should not generalize our results beyond our sample. So what does this mean exactly? The difference between the Multiple R2 value and the Adjusted R2 value means that if the model were derived from the population rather than a sample, it would account for approximately 0.5 percent (85.2 − 84.7) less variance in the outcome.
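The Adjusted R2 value, too, can be computed by hand (a sketch; n and k refer to the model above):

r2 <- summary(m2.mlr)$r.squared
n <- nrow(mydata)                   # 100 cases at this point
k <- 3                              # two main effects plus their interaction
1 - (1 - r2) * (n - 1)/(n - k - 1)  # = 0.847, the Adjusted R-squared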

All main effects and the interaction between status and attraction affect money significantly. But what does this mean and how should you interpret it?
First of all: if there is a significant interaction, you should not interpret the main effects which are involved in that interaction. This is because, in a model containing an interaction, a main-effect coefficient does not represent an overall effect: it reflects the relation between the dependent variable and that predictor when the other predictor is at its reference level (i.e. has a value of 0).
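Because our model contains both predictors and their interaction, the coefficients translate directly into the four cell means (a sketch based on the coefficient table above; the tapply() call is an easy way to verify this):

# predicted cell means implied by the coefficients of m2.mlr (sketch):
# Relationship & Interested:    99.15 (the intercept)
# Single & Interested:          99.15 + 57.69                 = 156.84
# Relationship & NotInterested: 99.15 - 47.66                 =  51.49
# Single & NotInterested:       99.15 + 57.69 - 47.66 - 63.18 =  46.00
# in this fully factorial model these equal the observed group means:
tapply(mydata$money, list(mydata$status, mydata$attraction), mean)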

 
 
 
# extract the confidence intervals for the coefficients
#confint(m2.glm)
confint(m2.mlr)
 
#                                      2.5 % 97.5 %
#(Intercept)                           91.62 106.69
#statusSingle                          47.04  68.34
#attractionNotInterested              -58.31 -37.01
#statusSingle:attractionNotInterested -78.24 -48.11
 
# compare the baseline (intercept-only) model to the minimal adequate model
#anova(m0.glm, m2.glm)
anova(m0.mlr, m2.mlr)
 
#Analysis of Variance Table
#Model 1: money ~ 1
#Model 2: money ~ (status + attraction)^2
#  Res.Df    RSS Df Sum of Sq   F              Pr(>F)    
#1     99 233562                                         
#2     96  34558  3    199005 184 <0.0000000000000002 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
# Type III tests (note: the second model supplies the error term here)
Anova(m0.mlr, m2.mlr, type = "III")
 
#Anova Table (Type III tests)
#Response: money
#            Sum Sq Df F value              Pr(>F)    
#(Intercept) 781016  1     331 <0.0000000000000002 ***
#Residuals    34558 96                                
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
 
##########################################################################
### --- model diagnostics
##########################################################################
# create diagnostic plots
# check for cases to remove
par(mfrow = c(3, 2))
plot(m2.mlr)
qqPlot(m2.mlr, main="QQ Plot")
# generate Cook's D plot
# D values > 4/(n-k-1) can be a problem
cutoff <- 4/((nrow(mydata)-length(m2.mlr$coefficients)-2))
plot(m2.mlr, which=4, cook.levels = cutoff)
par(mfrow = c(1, 1))
 
##########################################################################

[Figure mlr2: diagnostic plots for m2.mlr – residual plots, QQ plot, and Cook's distance]

##########################################################################
##########################################################################
# we remove cases which are too influential,
# but before we do that, we extract the case-wise diagnostics
# and add them to the data set
infl <- influence.measures(m2.mlr)
 
# add influence statistics to data
mydata <- data.frame(mydata, infl[[1]], infl[[2]])
head(mydata)
 
#        status    attraction money                  dfb.1_                 dfb.sttS dfb.atNI  dfb.sS.N
#1 Relationship NotInterested 86.33  0.00000000000000236805 -0.000000000000001067658  0.27414 -0.193850
#2 Relationship NotInterested 45.58  0.00000000000000022805 -0.000000000000000396634 -0.04569  0.032306
#3 Relationship NotInterested 68.43 -0.00000000000000091891  0.000000000000001134168  0.13140 -0.092911
#4 Relationship NotInterested 52.93 -0.00000000000000016896  0.000000000000000269812  0.01111 -0.007854
#5 Relationship NotInterested 61.86  0.00000000000000011789 -0.000000000000000001814  0.08021 -0.056718
#6 Relationship NotInterested 48.47 -0.00000000000000004889  0.000000000000000015629 -0.02334  0.016507
#     dffit  cov.r     cook.d  hat dfb.1_.1 dfb.sttS.1 dfb.atNI.1 dfb.sS.N.1 dffit.1 cov.r.1 cook.d.1 hat.1
#1  0.38770 0.9358 0.03658407 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#2 -0.06461 1.0817 0.00105355 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#3  0.18582 1.0491 0.00864788 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#4  0.01571 1.0860 0.00006233 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#5  0.11344 1.0722 0.00324023 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#6 -0.03301 1.0850 0.00027528 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
 
# remove influential data points
# flag cases for removal if any influence measure marks them as influential
remove <- apply(infl$is.inf, 1, function(x) ifelse(any(x), "remove", "keep"))
 
# add remove to mydata  
mydata <- data.frame(mydata, remove)
 
mydata <- mydata[mydata$remove == "keep", ] 
head(mydata)
 
#        status    attraction money                  dfb.1_                 dfb.sttS dfb.atNI  dfb.sS.N
#1 Relationship NotInterested 86.33  0.00000000000000236805 -0.000000000000001067658  0.27414 -0.193850
#2 Relationship NotInterested 45.58  0.00000000000000022805 -0.000000000000000396634 -0.04569  0.032306
#3 Relationship NotInterested 68.43 -0.00000000000000091891  0.000000000000001134168  0.13140 -0.092911
#4 Relationship NotInterested 52.93 -0.00000000000000016896  0.000000000000000269812  0.01111 -0.007854
#5 Relationship NotInterested 61.86  0.00000000000000011789 -0.000000000000000001814  0.08021 -0.056718
#6 Relationship NotInterested 48.47 -0.00000000000000004889  0.000000000000000015629 -0.02334  0.016507
#     dffit  cov.r     cook.d  hat dfb.1_.1 dfb.sttS.1 dfb.atNI.1 dfb.sS.N.1 dffit.1 cov.r.1 cook.d.1 hat.1
#1  0.38770 0.9358 0.03658407 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#2 -0.06461 1.0817 0.00105355 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#3  0.18582 1.0491 0.00864788 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#4  0.01571 1.0860 0.00006233 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#5  0.11344 1.0722 0.00324023 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#6 -0.03301 1.0850 0.00027528 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#  remove
#1   keep
#2   keep
#3   keep
#4   keep
#5   keep
#6   keep
 
nrow(mydata)
 
#[1] 98
 
###########################################################################
###########################################################################
# re-generate the baseline model including
# only the intercept (overall mean) on the cleaned data
m0.mlr = lm(money ~ 1, data = mydata) # baseline model
m0.glm = glm(money ~ 1, family = gaussian, data = mydata) # baseline model
 
# create saturated model
m1.mlr = lm(money ~ (status + attraction)^2, data = mydata)
m1.glm = glm(money ~ status * attraction, family = gaussian, data = mydata)
 
###########################################################################
### --- Fit the model to find the "best" model,
### --- i.e. the minimal adequate model
### --- We will use a step-wise step down procedure
# automated modelfitting
# a) criterion: AIC (the smaller the better the model fit)
#step(m1.glm, direction = "both")
step(m1.mlr, direction = "both")
 
#Start:  AIC=570.3
#money ~ (status + attraction)^2
#
#                    Df Sum of Sq   RSS AIC
#<none>                           30411 570
#- status:attraction  1     21647 52058 621
#
#Call:
#lm(formula = money ~ (status + attraction)^2, data = mydata)
#
#Coefficients:
#                         (Intercept)                          statusSingle  
#                                99.2                                  55.9  
#             attractionNotInterested  statusSingle:attractionNotInterested  
#                               -47.7                                 -59.5  
 
# the step function shows that the model with all main effects and 
# interactions, i.e. the model with the formula
# "money ~ status * attraction" has the lowest AIC
 
# generate new model
m2.mlr = lm(money ~ (status + attraction)^2, data = mydata)
m2.glm = glm(money ~ (status + attraction)^2, family = gaussian, data = mydata)
 
# summary
summary(m2.mlr)
 
#Call:
#lm(formula = money ~ (status + attraction)^2, data = mydata)
#Residuals:
#   Min     1Q Median     3Q    Max 
#-35.76 -13.51  -0.99  10.60  38.77 
#Coefficients:
#                                     Estimate Std. Error t value             Pr(>|t|)    
#(Intercept)                             99.15       3.60   27.56 < 0.0000000000000002 ***
#statusSingle                            55.85       5.14   10.87 < 0.0000000000000002 ***
#attractionNotInterested                -47.66       5.09   -9.37    0.000000000000004 ***
#statusSingle:attractionNotInterested   -59.46       7.27   -8.18    0.000000000001338 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 18 on 94 degrees of freedom
#Multiple R-squared:  0.857,     Adjusted R-squared:  0.853 
#F-statistic:  188 on 3 and 94 DF,  p-value: <0.0000000000000002
 
# summary glm
summary(m2.glm)
 
#Call:
#glm(formula = money ~ (status + attraction)^2, family = gaussian, 
#    data = mydata)
#Deviance Residuals: 
#   Min      1Q  Median      3Q     Max  
#-35.76  -13.51   -0.99   10.60   38.77  
#Coefficients:
#                                     Estimate Std. Error t value             Pr(>|t|)    
#(Intercept)                             99.15       3.60   27.56 < 0.0000000000000002 ***
#statusSingle                            55.85       5.14   10.87 < 0.0000000000000002 ***
#attractionNotInterested                -47.66       5.09   -9.37    0.000000000000004 ***
#statusSingle:attractionNotInterested   -59.46       7.27   -8.18    0.000000000001338 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#(Dispersion parameter for gaussian family taken to be 323.5)
#    Null deviance: 213227  on 97  degrees of freedom
#Residual deviance:  30411  on 94  degrees of freedom
#AIC: 850.4
#Number of Fisher Scoring iterations: 2
 
##########################################################################

An interaction is a situation in which the relation between a predictor and the dependent variable is affected by another variable. In our scenario, guys will spend even more money if the girl they are interested in is single. So the relation between “money” and “attraction” is affected by a third variable, “status”: there is an interaction between “attraction” and “status”.
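A quick way to visualize this interaction is base R's interaction.plot(), which plots the mean of money for each combination of attraction and status (a sketch; the faceted boxplots in the first figure above show the same pattern):

with(mydata, interaction.plot(attraction, status, money,
  ylab = "mean money spent on present (€)"))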

 
##########################################################################
# extract the confidence intervals for the coefficients
#confint(m2.glm)
confint(m2.mlr)
 
#                                      2.5 % 97.5 %
#(Intercept)                           92.01 106.30
#statusSingle                          45.65  66.06
#attractionNotInterested              -57.76 -37.56
#statusSingle:attractionNotInterested -73.89 -45.03
 
# compare the baseline (intercept-only) model to the minimal adequate model
#anova(m0.glm, m2.glm)
anova(m0.mlr, m2.mlr)
 
#Analysis of Variance Table
#Model 1: money ~ 1
#Model 2: money ~ (status + attraction)^2
#  Res.Df    RSS Df Sum of Sq   F              Pr(>F)    
#1     97 213227                                         
#2     94  30411  3    182816 188 <0.0000000000000002 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
# Type III tests (note: the second model supplies the error term here)
Anova(m0.mlr, m2.mlr, type = "III")
 
#Anova Table (Type III tests)
#Response: money
#            Sum Sq Df F value              Pr(>F)    
#(Intercept) 760953  1     346 <0.0000000000000002 ***
#Residuals    30411 94                                
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
##########################################################################
### --- model diagnostics
##########################################################################
# create diagnostic plots
# check for cases to remove
par(mfrow = c(3, 2))
plot(m2.mlr)
qqPlot(m2.mlr, main="QQ Plot")
# generate Cook's D plot
cutoff <- 4/((nrow(mydata)-length(m2.mlr$coefficients)-2))
plot(m2.mlr, which=4, cook.levels = cutoff)
par(mfrow = c(1, 1))
##########################################################################

[Figure mlr3: diagnostic plots for the refitted m2.mlr after outlier removal]

 
##########################################################################
##########################################################################
# add additional model diagnostics to the data
 
mydata$residuals <- resid(m2.mlr)
mydata$standardized.residuals <- rstandard(m2.mlr)
mydata$studentized.residuals <- rstudent(m2.mlr)
mydata$cooks.distance <- cooks.distance(m2.mlr)
mydata$dffit <- dffits(m2.mlr)
mydata$leverage <- hatvalues(m2.mlr)
mydata$covariance.ratios <- covratio(m2.mlr)
mydata$fitted <- m2.mlr$fitted.values
 
##########################################################################
# create 3 diagnostic residual plots in 1 window
 
histogram <- ggplot(mydata, aes(studentized.residuals)) +
  theme(legend.position = "none") +
  geom_histogram(aes(y=..density..),
    binwidth = 1,
    colour="black",
    fill="white") +
  labs(x = "Studentized Residual", y = "Density")
p2 <- histogram + stat_function(fun = dnorm, args = list(mean = mean(mydata$studentized.residuals, na.rm = TRUE), sd = sd(mydata$studentized.residuals, na.rm = TRUE)), colour = "red", size = 1)
 
 
scatter <- ggplot(mydata, aes(fitted, studentized.residuals))
p4 <- scatter + geom_point() + geom_smooth(method = "lm", colour = "Red")+ labs(x = "Fitted Values", y = "Studentized Residual")
 
 
qqplot.resid <- qplot(sample = mydata$studentized.residuals, stat = "qq") + labs(x = "Theoretical Values", y = "Observed Values")
p6 <- qqplot.resid
 
multiplot(p2, p4, p6, cols=3)
 
##########################################################################

[Figure mlr4: histogram of studentized residuals with normal curve, studentized residuals vs. fitted values, and QQ plot]

 
##########################################################################
##########################################################################
# continue diagnostics
 
# 1: standardized residuals with an absolute value greater than 3.29 should be removed (cf. Field, Miles & Field 2012:269)
# 2: if more than 1% of our sample cases have absolute standardized residuals > 2.58, the level of error within our model is unacceptable (cf. Field, Miles & Field 2012:269)
# 3: if more than 5% of cases have absolute standardized residuals > 1.96, our model fit is unacceptable (cf. Field, Miles & Field 2012:269)
# 4: cases with Cook's distance > 1 should be removed (cf. Field, Miles & Field 2012:269)
# 5: leverage cut-off point: 3(k + 1)/n (cf. Field, Miles & Field 2012:270)
 
# 1: optimal = 0 (all cases listed should be removed)
which(abs(mydata$standardized.residuals) > 3.29)
 
#integer(0)
 
# 2: the percentage of cases with |standardized residual| > 2.58 should not exceed 1
stdres_258 <- as.vector(sapply(mydata$standardized.residuals, function(x) {
  ifelse(abs(x) > 2.58, 1, 0)  }  ))
(sum(stdres_258) / length(stdres_258)) * 100
 
#[1] 0
 
# 3: the percentage of cases with |standardized residual| > 1.96 should not exceed 5
stdres_196 <- as.vector(sapply(mydata$standardized.residuals, function(x) {
  ifelse(abs(x) > 1.96, 1, 0)  }  ))
(sum(stdres_196) / length(stdres_196)) * 100
 
#[1] 6.122
 
# 4: optimal = 0 (all cases listed must be removed)
which(mydata$cooks.distance > 1)
 
#integer(0)
 
# 5: optimal = 0 (cases listed here are suspicious, especially if their Cook's distance is also close to 1)
which(mydata$leverage >= (3*mean(mydata$leverage)))
 
#integer(0)
 
# checking autocorrelation:
# perform Durbin-Watson test  (optimal: large p-value)
dwt(m2.mlr)
 
#lag Autocorrelation D-W Statistic p-value
#   1        -0.01433         1.968   0.652
# Alternative hypothesis: rho != 0
 
# checking multicollinearity:
# extract variance inflation factors (VIF); values greater than 10 should/must be excluded (cf. Myers 1990)
vif(m2.mlr)
 
#           status        attraction status:attraction 
#             2.00              1.96              2.96
 
# checking multicollinearity: tolerance is 1/VIF
# values smaller than .1 should/must be excluded;
# Menard (1995) states that values smaller than .2 are already problematic
1/vif(m2.mlr)
 
#           status        attraction status:attraction 
#           0.5000            0.5102            0.3378
 
# mean VIF: should not be substantially greater than 1 (Bowerman & O'Connell 1990)
mean(vif(m2.mlr))
 
#[1] 2.307
 
# checking sample size (Green 1991)
# if you are interested in the overall model: 50 + 8k (k = number of predictors)
# if you are interested in individual predictors: 104 + k
# if you are interested in both: take the higher value!
smplesz <- function(x) {
  k <- nrow(summary(x)$coefficients) - 1  # number of predictors (without intercept)
  ifelse(length(x$fitted) < (104 + k),
    return(
      paste("Sample too small: please increase your sample by ",
      104 + k - length(x$fitted),
      " data points", collapse = "")),
    return("Sample size sufficient")) }
smplesz(m2.mlr)
 
#[1] "Sample too small: please increase your sample by  9  data points"
 
# checking the expected R of random data (Field, Miles & Field 2012:274),
# i.e. how strong a spurious correlation could be expected to be, given the sample size;
# the value ranges between 0 and 1: the closer to 0 the better!
# if the value is above .1, there is cause for concern (i.e. you need a bigger sample)
 
expR <- function(x) {
  k <- nrow(summary(x)$coefficients) - 1  # number of predictors (without intercept)
  n <- length(x$fitted)                   # number of cases in the model
  ifelse((k/(n - 1)) > .05,
    return(paste("A random sample is expected to cause a correlation of the size",
      round(k/(n - 1), 4),
      "between the predictors and the predicted", collapse = "")),
    return(paste("Based on the sample size expect a false positive correlation of",
      round(k/(n - 1), 4),
      "between the predictors and the predicted", collapse = "")))}
expR(m2.mlr)
 
#[1] "Based on the sample size expect a false positive correlation of 0.0309 between the predictors and the predicted"
 
###############################################################
### Function for a neat output table for Multiple Linear Regression Models
###############################################################
# arguments: x = lm model, a = the corresponding glm model
 
mlr.summary(m2.mlr, m2.glm) # inspect the results
 
#                                     Estimate  VIF Std. Error t value Pr(>|t|) Significance
#(Intercept)                             99.15             3.6   27.56        0  p < .001***
#statusSingle                            55.85    2       5.14   10.87        0  p < .001***
#attractionNotInterested                -47.66 1.96       5.09   -9.37        0  p < .001***
#statusSingle:attractionNotInterested   -59.46 2.96       7.27   -8.18        0  p < .001***
#Model statistics                                                                      Value
#Number of cases in model                                                                 98
#Residual standard error on 94 DF                                                      17.99
#Multiple R-squared                                                                    0.857
#Adjusted R-squared                                                                    0.853
#AIC                                                                                   850.4
#F-statistic (3, 94)                                                                  188.36
#Model p-value                                                                0  p < .001***
 
###########################################################################
### --- THE END
###########################################################################

You would report these results in a table like the one below.

[Table mlr5: results table for the multiple linear regression model]

If you have to report the regression model in prose, it would read something like this
(if there are significant interactions, you should not interpret the main effects involved in these interactions! I only do so here because I want to show what a full model report could look like):
A multiple linear regression was fitted to the data in a step-wise step-down procedure using the AIC (Akaike's Information Criterion) to arrive at the final minimal adequate model. During model diagnostics, 2 data points were identified as outliers and consequently removed from the data. Further visual inspection of the diagnostic plots and data-driven methods did not suggest further problems. The final multiple regression model was based on 98 data points and was highly significant (Multiple R2: .857, F-statistic (3, 94): 188.36, AIC: 850.4, p < .001***). The minimal adequate model contained both status and attraction as significant main effects. The status of the money receivers correlates significantly with the amount of money they receive (SE: 5.14, t: 10.87, p < .001***): if the person receiving the present is single, 55.85 Euro more are spent on her than on a person who is in a relationship, who receives 99.15 Euro (given that the giver is interested in the receiver). Attraction also correlates significantly with the money one receives (SE: 5.09, t: -9.37, p < .001***): if the person giving the money is not attracted to the person receiving it, he spends 47.66 Euro less than he does if he is interested in that person (given that the receiver is in a relationship). Finally, the model identified a significant interaction between status and attraction (SE: 7.27, t: -8.18, p < .001***): if the receiver is single but the giver is not attracted to her, the giver spends an additional 59.46 Euro less than the two main effects alone would predict (cf. Figure 1 above).
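If you also want to report standardized coefficients (β), the QuantPsyc package that we loaded at the beginning provides lm.beta(); this is only a sketch, and standardized βs should be interpreted with care when the predictors are dummy-coded factors:

lm.beta(m2.mlr)  # standardized regression coefficients for the final model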

References

    Achen, Christopher H. 1982. Interpreting and Using Regression. London & New Delhi: Sage Publications.
    Baayen, Harald R. 2008. Analyzing Linguistic Data – A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.
    Bortz, Jürgen. 2005. Statistik für Human- und Sozialwissenschaftler. 6th edn. Heidelberg: Springer.
    Crawley, Michael J. 2005. Statistics – An Introduction Using R. Chichester, West Sussex: John Wiley & Sons.
    Crawley, Michael J. 2007. The R Book. Chichester, West Sussex: John Wiley & Sons.
    Faraway, Julian J. 2002. Practical Regression and Anova using R. 3rd edn.
    Field, Andy, Jeremy Miles & Zoe Field. 2012. Discovering Statistics Using R. 2nd edn. London, Thousand Oaks, CA, New Delhi & Singapore: Sage Publications.
    Gries, Stefan Th. 2009. Statistics for Linguists with R: A Practical Introduction. Berlin & New York: Mouton de Gruyter.
    Wilcox, Rand R. 2009. Basic Statistics – Understanding Conventional Methods and Modern Insights. Oxford: Oxford University Press.
