Multiple Linear Regression with R – a practical example

In this post I want to show you how to implement a Multiple Linear Regression (sometimes loosely called a Multivariate Linear Regression, although that term more properly refers to models with more than one dependent variable).

As I only exemplify how to implement it, you need to look elsewhere to understand the underlying concepts. You can find explanations on the workings of linear regression e.g. in Achen (1982), Bortz (2005), Crawley (2005), Faraway (2002), Field, Miles & Field (2012) (my personal favorite!), or Wilcox (2009). You can find other, partly more elaborate exemplifications of how to implement regressions in R in Baayen (2008), Crawley (2007), or Gries (2009).
Just some words before we start:

In contrast to Simple Linear Regressions, which test the correlation between a single predictor variable and a dependent variable, Multiple Linear Regressions test the correlations between several predictors and a dependent variable. In addition, Multiple Linear Regression models can test whether there are interactions between predictors. There cannot be interactions in Simple Linear Regressions because there is only one predictor.
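As a toy illustration of the difference, compare the two model formulas below (a minimal sketch with simulated data; the names df, x1, x2 and y are made up and have nothing to do with the data set used later in this post):

```r
# a minimal sketch with simulated data; df, x1, x2, y are hypothetical
set.seed(42)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
df$y <- 2 + df$x1 + rnorm(50)

m.simple   <- lm(y ~ x1, data = df)       # one predictor: no interaction possible
m.multiple <- lm(y ~ x1 * x2, data = df)  # expands to x1 + x2 + x1:x2

length(coef(m.simple))    # 2 coefficients: intercept + slope
length(coef(m.multiple))  # 4 coefficients: intercept, x1, x2, x1:x2
```

The `*` operator in an R formula is shorthand for "both main effects plus their interaction".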

With respect to model diagnostics, please have a look at the post on Simple Linear Regression, because what I wrote there on e.g. the interpretation of the diagnostic plots also applies to Multiple Regression.
I will, nonetheless, discuss some issues which are specific to Multiple Regression (e.g. multicollinearity), as they are not applicable to Simple Linear Regression.
I am often asked about the minimal sample size required for regression analysis, and I typically recommend 25 cases per group. I only provide this value because I am annoyed by the question: the required sample size actually depends on the size of the effect you want to detect and on the number of predictors in your model. If you have many predictors and want to detect even small effects, your minimum sample size would be something like 600 cases. Field, Miles & Field (2012:273-274), however, provide rules of thumb for minimal sample sizes in regression analyses (k = number of predictors in the model; categorical predictors with more than 2 levels should be transformed into dummy variables):

• If you are only interested in the overall fit of the model (in my experience this is almost never the case), your sample size should be at least 50 + 8k.
• If you are interested in individual predictors, then your minimal sample size is 104 + k.
• If you are interested in both, choose the higher value.
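These rules of thumb are easy to wrap in a small helper function (a sketch; the function name min.n is my own, and the formulas are the ones from Green 1991, as cited in Field, Miles & Field 2012:273-274):

```r
# rule-of-thumb minimal sample sizes; k = number of (dummy-coded) predictors
min.n <- function(k) {
  c(overall.fit = 50 + 8 * k,          # overall fit of the model
    individual.predictors = 104 + k)   # tests of individual predictors
}
min.n(3)  # for 3 predictors: overall fit 74, individual predictors 107
```

If you care about both, take the larger of the two values.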

You will see below that I wrote a function which checks if the minimal sample size is sufficient.

Regarding model fitting procedures, I will only touch on them here: I will use stepwise step-down model selection based on the AIC (Akaike Information Criterion). There are several methods one can use to fit a model (forced entry, stepwise, hierarchical), and even within these approaches there are subroutines; a discussion of these different procedures is beyond the scope of this post. I will, however, write another post which is dedicated exclusively to model fitting procedures.

We are going to examine whether the money guys spend on presents depends on the attraction and the relationship status of the girls the presents are given to.

So here we go….

###########################################################################
### Multiple Linear Regression
###########################################################################
### --- Prepare data
# Remove all objects from the current workspace
rm(list = ls(all = TRUE))

# Install packages we need or which may be useful
# (to activate, just delete the #)
#install.packages("rms")
#install.packages("glmulti")
#install.packages("lmtest")
#install.packages("MASS")
#install.packages("QuantPsyc")
#install.packages("car")
#install.packages("ggplot2")

# Initiate the packages
#library(rms)
#library(glmulti)
#library(lmtest)
#library(MASS)
library(car)
library(QuantPsyc)
library(boot)
library(ggplot2)
source("C:\\R/multiplot_ggplot2.R")
source("C:\\R/mlr.summary.R")

###########################################################################
### Load and manipulate data
###########################################################################

# set options
options("scipen" = 100, "digits" = 4)

###########################################################################
# read in existing data set
mydata <- read.table("C:\\MyProjects\\MultipleLinearRegression/mlrdata.txt",
  sep = "\t", header = TRUE)

# provide an overview of the data
head(mydata); str(mydata); summary(mydata)

#        status    attraction money
#1 Relationship NotInterested 86.33
#2 Relationship NotInterested 45.58
#3 Relationship NotInterested 68.43
#4 Relationship NotInterested 52.93
#5 Relationship NotInterested 61.86
#6 Relationship NotInterested 48.47
#'data.frame': 100 obs. of 3 variables:
# $ status    : Factor w/ 2 levels "Relationship",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ attraction: Factor w/ 2 levels "Interested","NotInterested": 2 2 2 2 2 2 2 2 2 2 ...
# $ money     : num 86.3 45.6 68.4 52.9 61.9 ...
#          status     attraction         money
# Relationship:50  Interested   :50  Min.   :  0.93
# Single      :50  NotInterested:50  1st Qu.: 49.84
#                                    Median : 81.73
#                                    Mean   : 88.38
#                                    3rd Qu.:121.57
#                                    Max.   :200.99

###########################################################################
p1 <- ggplot(mydata, aes(status, money)) +
  geom_boxplot(notch = TRUE, aes(fill = factor(status))) +
  scale_fill_brewer() +
#  facet_wrap(~ attraction, nrow = 1) +
  theme_bw() + # background white (inactive: default grey)
  labs(x = "") +
  labs(y = "Money spent on present (€)") +
  coord_cartesian(ylim = c(0, 250)) +
  guides(fill = FALSE) +
  ggtitle("Status")

p2 <- ggplot(mydata, aes(attraction, money)) +
  geom_boxplot(notch = TRUE, aes(fill = factor(attraction))) +
  scale_fill_brewer() +
#  facet_wrap(~ attraction, nrow = 1) +
  theme_bw() + # background white (inactive: default grey)
  labs(x = "") +
  labs(y = "Money spent on present (€)") +
  coord_cartesian(ylim = c(0, 250)) +
  guides(fill = FALSE) +
  ggtitle("Attraction")

p3 <- ggplot(mydata, aes(x = money)) +
  geom_histogram(aes(y = ..density..), binwidth = 10,
    colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") # overlay with transparent density plot

p4 <- ggplot(mydata, aes(status, money)) +
  geom_boxplot(notch = FALSE, aes(fill = factor(status))) +
  scale_fill_brewer(palette = "Paired") +
  facet_wrap(~ attraction, nrow = 1) +
  theme_bw() + # background white (inactive: default grey)
  labs(x = "") +
  labs(y = "Money spent on present (€)") +
  coord_cartesian(ylim = c(0, 250)) +
  guides(fill = FALSE)

# Plot the plots
multiplot(p1, p3, p2, p4, cols = 2)
###########################################################################

###########################################################################
# generate initial minimal regression model including
# only the intercept (overall mean) as predictor
m0.mlr = lm(money ~ 1, data = mydata)                      # baseline model
m0.glm = glm(money ~ 1, family = gaussian, data = mydata)  # baseline model

# inspect results
summary(m0.mlr)

#Call:
#lm(formula = money ~ 1, data = mydata)
#Residuals:
#   Min     1Q Median     3Q    Max
#-87.45 -38.54  -6.65  33.20 112.61
#Coefficients:
#            Estimate Std. Error t value            Pr(>|t|)
#(Intercept)    88.38       4.86    18.2 <0.0000000000000002 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 48.6 on 99 degrees of freedom

# inspect results
summary(m0.glm)

#Call:
#glm(formula = money ~ 1, family = gaussian, data = mydata)
#Deviance Residuals:
#   Min     1Q Median     3Q    Max
#-87.45 -38.54  -6.65  33.20 112.61
#
#Coefficients:
#            Estimate Std. Error t value            Pr(>|t|)
#(Intercept)    88.38       4.86    18.2 <0.0000000000000002 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#(Dispersion parameter for gaussian family taken to be 2359)
#    Null deviance: 233562 on 99 degrees of freedom
#Residual deviance: 233562 on 99 degrees of freedom
#AIC: 1063
#Number of Fisher Scoring iterations: 2

###########################################################################
# create saturated model
m1.mlr = lm(money ~ (status + attraction)^2, data = mydata)
m1.glm = glm(money ~ status * attraction, family = gaussian, data = mydata)

###########################################################################
### --- Fit the model to find the "best" model,
### --- i.e. the minimal adequate model
### --- We will use a step-wise step-down procedure
# automated model fitting
# a) criterion: AIC (the smaller, the better the model fit)
step(m1.mlr, direction = "both")
#step(m1.glm, direction = "both")

#Start: AIC=592.5
#money ~ (status + attraction)^2
#
#                    Df Sum of Sq   RSS AIC
#<none>                           34558 593
#- status:attraction  1     24947 59505 645
#
#Call:
#lm(formula = money ~ (status + attraction)^2, data = mydata)
#
#Coefficients:
#            (Intercept)             statusSingle
#                   99.2                     57.7
#attractionNotInterested  statusSingle:attractionNotInterested
#                  -47.7                    -63.2

# the step function shows that the model with all main effects and
# interactions, i.e. the model with the formula
# "money ~ status * attraction", has the lowest AIC

# generate new model
m2.mlr = lm(money ~ (status + attraction)^2, data = mydata)
m2.glm = glm(money ~ (status + attraction)^2, family = gaussian, data = mydata)

# summary
#summary(m2.glm)
summary(m2.mlr)

#Call:
#lm(formula = money ~ (status + attraction)^2, data = mydata)
#Residuals:
#   Min     1Q Median     3Q    Max
#-45.08 -14.26   0.46  11.93  44.14
#Coefficients:
#                                     Estimate Std. Error t value             Pr(>|t|)
#(Intercept)                             99.15       3.79   26.13 < 0.0000000000000002 ***
#statusSingle                            57.69       5.37   10.75 < 0.0000000000000002 ***
#attractionNotInterested                -47.66       5.37   -8.88    0.000000000000038 ***
#statusSingle:attractionNotInterested   -63.18       7.59   -8.32    0.000000000000581 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 19 on 96 degrees of freedom
#Multiple R-squared: 0.852, Adjusted R-squared: 0.847
#F-statistic: 184 on 3 and 96 DF, p-value: <0.0000000000000002

##########################################################################

Ok, so what does all this mean?

The first object reported by the summary function is the Call, i.e. your model formula. Next, the summary contains an overview of the residuals, i.e. a summary of the distribution of the differences between the values predicted by the model and the observed values.
Next comes the most interesting part: a table with the parameters for our predictors (in fact, these are the parameters for our fixed effects – we will look at the difference between fixed and random effects in another post). We will examine this table in more detail later on.

After that table come our model statistics. The model statistics provide information about the model as a whole, in contrast to the coefficients table, which describes each effect individually.
The “Multiple R2” value tells us how much variance is explained by our model. A value of 0 would tell us that our model is useless and does not explain any variance. A value of 1 would tell us that our model predicts the observed values perfectly. If you multiply the Multiple R2 value by 100, you get the percentage of variance explained by the model. In our case, the value of 0.852 means that our model accounts for 85.2 percent of the overall variance.
Some researchers have proposed that a model should have a Multiple R2 value of at least .05, but this is not necessarily true. The Multiple R2 value only tells us how much variance is explained by our model, and maybe we are interested in a set of predictors which have only a small effect on the dependent variable. So the model can tell us something even if the Multiple R2 value is small. The important thing is that the model is significant overall, which means that our model performs better than predicting the outcome from the overall mean alone. But let’s continue…

The “Adjusted R2” value is similar to the Multiple R2 value, but it takes the number of predictors in our model into account. There is, however, a more interesting aspect to the Adjusted R2 value: it tells us how well our model generalizes from our sample to the overall population. If there is little difference between the Multiple R2 value and the Adjusted R2 value, our model can readily be generalized and applied to the population at large. If there is a substantial difference between the two, then our model is rather unstable and we should not generalize our results based on our sample. So what does this mean exactly? The difference between the Multiple R2 value and the Adjusted R2 value means that if the model were derived from the population rather than a sample, it would account for approximately .5 percent (85.2 − 84.7) less variance in the outcome.
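The relationship between the two values can be checked by hand; the numbers below are the ones from the model summary above (n = 100 cases, k = 3 estimated parameters besides the intercept):

```r
# adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
r2 <- 0.852  # Multiple R-squared from the summary
n  <- 100    # number of cases
k  <- 3      # parameters besides the intercept (2 main effects + interaction)
adj.r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj.r2, 3)  # 0.847, the Adjusted R-squared reported in the summary
```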

All main effects and the interaction between status and attraction affect money significantly. But what does this mean and how should you interpret this?
First of all: if there is a significant interaction, you should not interpret the main effects involved in that interaction. And even if the interactions are not significant, you should be aware that, as long as the interaction is in the model, each main-effect coefficient is in fact a simple effect (!): it reflects the relationship between the dependent variable and that predictor when the other predictor has a value of 0, i.e. is at its reference level.
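To see where that "value of 0" comes from, here is a small sketch of the treatment (dummy) coding that lm() applies to factors by default (the two toy factors just mirror the levels in our data):

```r
# treatment (dummy) coding: the alphabetically first level is the
# reference level and is coded 0
status     <- factor(c("Relationship", "Single"))
attraction <- factor(c("Interested", "NotInterested"))
contrasts(status)      # Relationship = 0, Single = 1
contrasts(attraction)  # Interested = 0, NotInterested = 1
# because "Interested" is coded 0, the coefficient of statusSingle
# estimates the status effect for interested girls only; the interaction
# term then adjusts this effect for the NotInterested girls
```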

##########################################################################
# extract the confidence intervals for the coefficients
#confint(m2.glm)
confint(m2.mlr)

#                                      2.5 % 97.5 %
#(Intercept)                           91.62 106.69
#statusSingle                          47.04  68.34
#attractionNotInterested              -58.31 -37.01
#statusSingle:attractionNotInterested -78.24 -48.11

# compare the baseline model to the minimal adequate model
#anova(m0.glm, m2.glm)
anova(m0.mlr, m2.mlr)

#Analysis of Variance Table
#Model 1: money ~ 1
#Model 2: money ~ (status + attraction)^2
#  Res.Df    RSS Df Sum of Sq   F              Pr(>F)
#1     99 233562
#2     96  34558  3    199005 184 <0.0000000000000002 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Anova(m0.mlr, m2.mlr, type = "III")

#Anova Table (Type III tests)
#Response: money
#            Sum Sq Df F value              Pr(>F)
#(Intercept) 781016  1     331 <0.0000000000000002 ***
#Residuals    34558 96
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

##########################################################################
### --- model diagnostics
##########################################################################
# create diagnostic plots
# check for cases to remove
par(mfrow = c(3, 2))
plot(m2.mlr)
qqPlot(m2.mlr, main = "QQ Plot")
# generate Cook's D plot
# D values > 4/(n-k-1) can be a problem
cutoff <- 4/((nrow(mydata)-length(m2.mlr$coefficients)-2))
plot(m2.mlr, which = 4, cook.levels = cutoff)
par(mfrow = c(1, 1))

##########################################################################
##########################################################################
# we delete cases which are too influential
# but before we do that, we add some informational columns to our data set
# extract case-wise diagnostics and add them to the data set
infl <- influence.measures(m2.mlr)

# add influence statistics (numeric measures and TRUE/FALSE flags) to data
mydata <- data.frame(mydata, infl$infmat, infl$is.inf)
head(mydata)

#        status    attraction money                  dfb.1_                 dfb.sttS dfb.atNI  dfb.sS.N
#1 Relationship NotInterested 86.33  0.00000000000000236805 -0.000000000000001067658  0.27414 -0.193850
#2 Relationship NotInterested 45.58  0.00000000000000022805 -0.000000000000000396634 -0.04569  0.032306
#3 Relationship NotInterested 68.43 -0.00000000000000091891  0.000000000000001134168  0.13140 -0.092911
#4 Relationship NotInterested 52.93 -0.00000000000000016896  0.000000000000000269812  0.01111 -0.007854
#5 Relationship NotInterested 61.86  0.00000000000000011789 -0.000000000000000001814  0.08021 -0.056718
#6 Relationship NotInterested 48.47 -0.00000000000000004889  0.000000000000000015629 -0.02334  0.016507
#     dffit  cov.r     cook.d  hat dfb.1_.1 dfb.sttS.1 dfb.atNI.1 dfb.sS.N.1 dffit.1 cov.r.1 cook.d.1 hat.1
#1  0.38770 0.9358 0.03658407 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#2 -0.06461 1.0817 0.00105355 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#3  0.18582 1.0491 0.00864788 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#4  0.01571 1.0860 0.00006233 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#5  0.11344 1.0722 0.00324023 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE
#6 -0.03301 1.0850 0.00027528 0.04    FALSE      FALSE      FALSE      FALSE   FALSE   FALSE    FALSE FALSE

# flag influential data points: a case is removed if any of its
# influence flags is TRUE
remove <- apply(infl$is.inf, 1, function(x) {
  ifelse(any(x == TRUE), "remove", "keep") } )

# add remove to mydata
mydata <- data.frame(mydata, remove)

mydata <- mydata[mydata$remove == "keep", ]
head(mydata)

# [output: the same six rows and diagnostic columns as above, plus the new
#  column "remove", which is "keep" for all six cases]

nrow(mydata)

# 98

###########################################################################
###########################################################################
# re-generate the models on the cleaned data
# baseline model including only the intercept (overall mean) as predictor
m0.mlr = lm(money ~ 1, data = mydata)                      # baseline model
m0.glm = glm(money ~ 1, family = gaussian, data = mydata)  # baseline model

# create saturated model
m1.mlr = lm(money ~ (status + attraction)^2, data = mydata)
m1.glm = glm(money ~ status * attraction, family = gaussian, data = mydata)

###########################################################################
### --- Fit the model to find the "best" model,
### --- i.e. the minimal adequate model
### --- We will use a step-wise step-down procedure
# automated model fitting
# a) criterion: AIC (the smaller, the better the model fit)
#step(m1.glm, direction = "both")
step(m1.mlr, direction = "both")

#Start: AIC=570.3
#money ~ (status + attraction)^2
#
#                    Df Sum of Sq   RSS AIC
#<none>                           30411 570
#- status:attraction  1     21647 52058 621
#
#Call:
#lm(formula = money ~ (status + attraction)^2, data = mydata)
#
#Coefficients:
#            (Intercept)             statusSingle
#                   99.2                     55.9
#attractionNotInterested  statusSingle:attractionNotInterested
#                  -47.7                    -59.5

# the step function shows that the model with all main effects and
# interactions, i.e. the model with the formula
# "money ~ status * attraction", has the lowest AIC

# generate new model
m2.mlr = lm(money ~ (status + attraction)^2, data = mydata)
m2.glm = glm(money ~ (status + attraction)^2, family = gaussian, data = mydata)

# summary
summary(m2.mlr)

#Call:
#lm(formula = money ~ (status + attraction)^2, data = mydata)
#Residuals:
#   Min     1Q Median     3Q    Max
#-35.76 -13.51  -0.99  10.60  38.77
#Coefficients:
#                                     Estimate Std. Error t value             Pr(>|t|)
#(Intercept)                             99.15       3.60   27.56 < 0.0000000000000002 ***
#statusSingle                            55.85       5.14   10.87 < 0.0000000000000002 ***
#attractionNotInterested                -47.66       5.09   -9.37    0.000000000000004 ***
#statusSingle:attractionNotInterested   -59.46       7.27   -8.18    0.000000000001338 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 18 on 94 degrees of freedom
#Multiple R-squared: 0.857, Adjusted R-squared: 0.853
#F-statistic: 188 on 3 and 94 DF, p-value: <0.0000000000000002

# summary glm
summary(m2.glm)

#Call:
#glm(formula = money ~ (status + attraction)^2, family = gaussian,
#    data = mydata)
#Deviance Residuals:
#   Min     1Q Median     3Q    Max
#-35.76 -13.51  -0.99  10.60  38.77
#Coefficients:
#                                     Estimate Std. Error t value             Pr(>|t|)
#(Intercept)                             99.15       3.60   27.56 < 0.0000000000000002 ***
#statusSingle                            55.85       5.14   10.87 < 0.0000000000000002 ***
#attractionNotInterested                -47.66       5.09   -9.37    0.000000000000004 ***
#statusSingle:attractionNotInterested   -59.46       7.27   -8.18    0.000000000001338 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#(Dispersion parameter for gaussian family taken to be 323.5)
#    Null deviance: 213227 on 97 degrees of freedom
#Residual deviance:  30411 on 94 degrees of freedom
#AIC: 850.4
#Number of Fisher Scoring iterations: 2

##########################################################################

An interaction is a situation in which a correlation between a predictor and the dependent variable is affected by another variable. In our scenario guys will spend even more money if the girl they are interested in is single. So the correlation between “money” and “attraction” is affected by another variable: “status”. In this case there is an interaction between “attraction” and “status”.
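This can be checked numerically (a sketch; the four cell means below are implied by the treatment-coded coefficients of the first fitted model, i.e. intercept 99.15, statusSingle 57.69, attractionNotInterested −47.66, interaction −63.18):

```r
# cell means implied by the coefficients of the first fitted model
rel.int <- 99.15                          # Relationship, Interested
sng.int <- 99.15 + 57.69                  # Single, Interested
rel.not <- 99.15 - 47.66                  # Relationship, NotInterested
sng.not <- 99.15 + 57.69 - 47.66 - 63.18  # Single, NotInterested

# effect of being single at each level of attraction
sng.int - rel.int                          # 57.69: guys spend much more
sng.not - rel.not                          # -5.49: hardly any difference
# the difference between these two simple effects is exactly the
# interaction coefficient
(sng.not - rel.not) - (sng.int - rel.int)  # -63.18
```

In other words, the interaction coefficient quantifies by how much the status effect changes between interested and uninterested girls.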

##########################################################################
# extract the confidence intervals for the coefficients
#confint(m2.glm)
confint(m2.mlr)

#                                      2.5 % 97.5 %
#(Intercept)                           92.01 106.30
#statusSingle                          45.65  66.06
#attractionNotInterested              -57.76 -37.56
#statusSingle:attractionNotInterested -73.89 -45.03

# compare the baseline model to the minimal adequate model
#anova(m0.glm, m2.glm)
anova(m0.mlr, m2.mlr)

#Analysis of Variance Table
#Model 1: money ~ 1
#Model 2: money ~ (status + attraction)^2
#  Res.Df    RSS Df Sum of Sq   F              Pr(>F)
#1     97 213227
#2     94  30411  3    182816 188 <0.0000000000000002 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Anova(m0.mlr, m2.mlr, type = "III")

#Anova Table (Type III tests)
#Response: money
#            Sum Sq Df F value              Pr(>F)
#(Intercept) 760953  1     346 <0.0000000000000002 ***
#Residuals    30411 94
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

##########################################################################
### --- model diagnostics
##########################################################################
# create diagnostic plots
# check for cases to remove
par(mfrow = c(3, 2))
plot(m2.mlr)
qqPlot(m2.mlr, main = "QQ Plot")
# generate Cook's D plot
cutoff <- 4/((nrow(mydata)-length(m2.mlr$coefficients)-2))
plot(m2.mlr, which = 4, cook.levels = cutoff)
par(mfrow = c(1, 1))
##########################################################################

##########################################################################
##########################################################################
# add additional model diagnostics to the data
mydata$residuals <- resid(m2.mlr)
mydata$standardized.residuals <- rstandard(m2.mlr)
mydata$studentized.residuals <- rstudent(m2.mlr)
mydata$cooks.distance <- cooks.distance(m2.mlr)
mydata$dffit <- dffits(m2.mlr)
mydata$leverage <- hatvalues(m2.mlr)
mydata$covariance.ratios <- covratio(m2.mlr)
mydata$fitted <- m2.mlr$fitted.values

##########################################################################
# create 3 diagnostic residual plots in 1 window
histogram <- ggplot(mydata, aes(studentized.residuals)) +
  theme(legend.position = "none") +
  geom_histogram(aes(y = ..density..), binwidth = 1,
    colour = "black", fill = "white") +
  labs(x = "Studentized Residual", y = "Density")
p2 <- histogram +
  stat_function(fun = dnorm,
    args = list(mean = mean(mydata$studentized.residuals, na.rm = TRUE),
                sd = sd(mydata$studentized.residuals, na.rm = TRUE)),
    colour = "red", size = 1)

scatter <- ggplot(mydata, aes(fitted, studentized.residuals))
p4 <- scatter +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red") +
  labs(x = "Fitted Values", y = "Studentized Residual")

p6 <- qplot(sample = mydata$studentized.residuals, stat = "qq") +
  labs(x = "Theoretical Values", y = "Observed Values")

multiplot(p2, p4, p6, cols = 3)
##########################################################################

##########################################################################
##########################################################################
# continue diagnostics
# 1: standardized residuals with an absolute value greater than 3.29
#    should be removed (cf. Field, Miles & Field 2012:269)
# 2: if more than 1% of our sample cases have standardized residuals > 2.58,
#    the level of error within our model is unacceptable
#    (cf. Field, Miles & Field 2012:269)
# 3: if more than 5% of cases have standardized residuals > 1.96,
#    our model fit is unacceptable (cf. Field, Miles & Field 2012:269)
# 4: cases with Cook's distance > 1 should be removed
#    (cf. Field, Miles & Field 2012:269)
# 5: leverage cut-off point: (3(k + 1)/n) (cf. Field, Miles & Field 2012:270)

# 1: optimal = 0 (all cases listed must be removed)
which(abs(mydata$standardized.residuals) > 3.29)

#integer(0)

# 2: should not exceed 1 (percent)
stdres_258 <- as.vector(sapply(mydata$standardized.residuals, function(x) {
  ifelse(abs(x) > 2.58, 1, 0) } ))
(sum(stdres_258) / length(stdres_258)) * 100

# 0

# 3: should not exceed 5 (percent)
stdres_196 <- as.vector(sapply(mydata$standardized.residuals, function(x) {
  ifelse(abs(x) > 1.96, 1, 0) } ))
(sum(stdres_196) / length(stdres_196)) * 100

# 6.122

# 4: optimal = 0 (all cases listed must be removed)
which(mydata$cooks.distance > 1)

#integer(0)

# 5: optimal = 0 (all cases listed must be removed IF Cook's distance
#    is close to 1)
which(mydata$leverage >= (3*mean(mydata$leverage)))

#integer(0)

# checking autocorrelation:
# perform Durbin-Watson test (optimal: large p-value)
dwt(m2.mlr)

# lag Autocorrelation D-W Statistic p-value
#   1       -0.01433         1.968   0.652
# Alternative hypothesis: rho != 0

# checking multicollinearity:
# extract variance inflation factors (VIF)
# (values greater than 10 should/must be excluded; cf. Myers 1990)
vif(m2.mlr)

#           status        attraction status:attraction
#             2.00              1.96              2.96

# checking multicollinearity: tolerance is 1/VIF
# values smaller than .01 should/must be excluded
# Menard 1995 states that values smaller than .2 are problematic
1/vif(m2.mlr)

#           status        attraction status:attraction
#           0.5000            0.5102            0.3378

# mean VIF: should not be substantially greater than 1
# (Bowerman & O'Connell 1990)
mean(vif(m2.mlr))

# 2.307

# checking sample size (Green 1991)
# if you are interested in the overall model: 50 + 8k (k = number of predictors)
# if you are interested in individual predictors: 104 + k
# if you are interested in both: take the higher value!
smplesz <- function(x) {
  ifelse((length(x$fitted) < (104 + nrow(summary(x)$coefficients)-1)) == TRUE,
    return(paste("Sample too small: please increase your sample by ",
      104 + nrow(summary(x)$coefficients)-1 - length(x$fitted),
      " data points", collapse = "")),
    return("Sample size sufficient")) }
smplesz(m2.mlr)

# "Sample too small: please increase your sample by 9 data points"

# checking the expected R of random data (Field, Miles & Field 2012:274)
# i.e. how strong a false-positive effect could be, given the sample size
# the value ranges between 0 and 1: the closer to 0, the better!
# if the value is above .1, there is cause for concern
# (i.e. you need a bigger sample)
expR <- function(x) {
  ifelse(((nrow(summary(x)$coefficients)-1)/(length(x$fitted)-1)) > .05,
    return(paste("A random sample is expected to cause a correlation of the size",
      ((nrow(summary(x)$coefficients)-1)/(length(x$fitted)-1)),
      "between the predictors and the predicted", collapse = "")),
    return(paste("Based on the sample size expect a false positive correlation of",
      round(((nrow(summary(x)$coefficients)-1)/(length(x$fitted)-1)), 4),
      "between the predictors and the predicted", collapse = ""))) }
expR(m2.mlr)

# "Based on the sample size expect a false positive correlation of 0.0309
#  between the predictors and the predicted"

###############################################################
### Function for a neat output table for Multiple Linear Regression models
###############################################################
# x = lm model, a = glm model
mlr.summary(m2.mlr, m2.glm) # inspect the results

#                                     Estimate  VIF Std. Error t value Pr(>|t|) Significance
#(Intercept)                             99.15        3.60      27.56         0  p < .001***
#statusSingle                            55.85 2.00   5.14      10.87         0  p < .001***
#attractionNotInterested                -47.66 1.96   5.09      -9.37         0  p < .001***
#statusSingle:attractionNotInterested   -59.46 2.96   7.27      -8.18         0  p < .001***
#Model statistics                  Value
#Number of cases in model             98
#Residual standard error on 94 DF  17.99
#Multiple R-squared                0.857
#Adjusted R-squared                0.853
#AIC                               850.4
#F-statistic (3, 94)              188.36
#Model p-value                         0  p < .001***

###########################################################################
### --- THE END
###########################################################################

You would report these results in a table like the one below. If you have to report the regression model in prose, it would read something like this:
(If there are significant interactions, you should not interpret the main effects involved in these interactions! I only do so here because I want to show what a model report could look like.)