Dataset: birth

The dataset birth records 189 birth weights from Massachusetts, USA, and some additional variables. The variables are,

Var Description
LOW Equals YES if birth weight is below 2500g and NO otherwise.
AGE Age of the mother in years.
LWT Weight in pounds of mother.
SMK YES if mother smokes and NO otherwise.
BWT Birth weight in g (RESPONSE).

The data appears in fronter as birth.rdata. Download the data into your STAT340 course folder and load the data set in RStudio.

Overview of data

  • Take a look at the top 10 rows of the data using the head() function
head(birth, n = 10)
   LOW AGE LWT SMK  BWT
1  YES  28 120 YES  709
2  YES  29 130  NO 1021
3  YES  34 187 YES 1135
4  YES  25 105  NO 1330
5  YES  25  85  NO 1474
6  YES  27 150  NO 1588
7  YES  23  97  NO 1588
8  YES  24 128  NO 1701
9  YES  24 132  NO 1729
10 YES  21 165 YES 1790
  • Use the summary() function to get a short summary of the variables.
summary(birth)
  LOW           AGE            LWT       SMK           BWT      
 NO :130   Min.   :14.0   Min.   : 80   NO :115   Min.   : 709  
 YES: 59   1st Qu.:19.0   1st Qu.:110   YES: 74   1st Qu.:2414  
           Median :23.0   Median :121             Median :2977  
           Mean   :23.2   Mean   :130             Mean   :2945  
           3rd Qu.:26.0   3rd Qu.:140             3rd Qu.:3475  
           Max.   :45.0   Max.   :250             Max.   :4990  
  • What is the proportion of smoking mothers in the data set?

The proportion of smoking mother is 0.39

  • What is the average age of a mother giving birth?

The average age of mother giving birth is 23.2 years.

  • What is the average birth weight of children from non-smoking and smoking mothers?
tapply(birth$BWT, INDEX = birth$SMK, FUN = mean)
  NO  YES 
3055 2773 

The function returns the mean birth weight for children of non-smoking and smoking mothers.

  • What is the standard deviation of birth weight of children from non-smoking and smoking mothers?
tapply(birth$BWT, INDEX = birth$SMK, FUN = sd)
 NO YES 
752 660 

The sd() function computes the sample standard deviation of a vector of observations.

Linear Regression

Run a simple linear regression model with BWT as response and LWT as predictor, like this,

birth1 <- lm(BWT ~ LWT, data = birth)
summary(birth1)

Call:
lm(formula = BWT ~ LWT, data = birth)

Residuals:
    Min      1Q  Median      3Q     Max 
-2192.2  -503.6    -3.9   508.3  2075.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2369.67     228.43   10.37   <2e-16 ***
LWT             4.43       1.71    2.59     0.01 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

s: 718 on 187 degrees of freedom
Multiple R-squared: 0.0345,
Adjusted R-squared: 0.0294 
F-statistic: 6.69 on 1 and 187 DF,  p-value: 0.0105 

Here, the regression model is,

\[\text{BWT} = \beta_0 + \beta_1 \text{LWT} + \epsilon\]

where \(\epsilon \sim N(0, \sigma^2)\)

Test the significance of LWT on BWT with a 5% test level and at a 1% level

What is the hypothesis you are testing?

The hypothesis for testing the significance of LWT on BWT is, \[ H_0: \beta_1 = 0 \text{ vs } H_1: \beta_1 \ne 0\]

  • What is the conclusion?

The \(p\)-value corresponding to \(\beta_1\) is less than 0.05 but greater than 0.01. So, at 5% test level, LWT is significant while at 1% test level, it is not significant.

  • Find the R-squared. Do you think the model fits the data well?

The r-squared (\(R^2\)) for the model is 0.03, this shows that only 3% of variation present in birth weight (BWT) is explained by weight of mother (LWD). Here, the model fits the data poorely.

Scatter Plot

  • Make a scatter plot of LWT vs BWT

The scatter plot of LWT and BWT is,

plot(x = birth$LWT, y = birth$BWT)
abline(birth1)

  • Make a comment to the plot in light of the output of your analyses.
  1. The intercept for the regression line is 2369.67 and the slope is 4.43.
  2. The data-points are scattered around the regression line where BWT vary most
  3. Since the data-points are scattered much, the model could only explain small variation present in BWT with LWT.

Confidence Intervals

  • Find 95% confidence intervals for the regression coefficients of the birth1 model
confint(birth1)
              2.5 %  97.5 %
(Intercept) 1919.04 2820.30
LWT            1.05    7.81
  • Also find 99% confidence intervals
confint(birth1, level = 0.99)
                0.5 %  99.5 %
(Intercept) 1775.2097 2964.13
LWT           -0.0287    8.89
  • Comment on the intervals
  1. It is 95% certain that the interval (1.05, 7.809) covers the true \(\beta_1\). Similary, it is 99% certain that the interval (-0.029, 8.887) covers the true \(\beta_1\).
  2. The 99% confidence is larger than 95% confidence. In other words, being more certain about the true value needs larger confidence interval.
  3. Moreover, the 95% does not include zero while 99% interval includes zero. This is equivalent with the result that \(\beta_1\) cofficient is significant at a 5% test level, but not significant at a 1% test level.

Regression with categories

Here we will fit a separate regression for smoking and non-smoking groups. You can identify the observation numbers of the smokers by:

smokeYes <- which(birth$SMK == "YES")
Fit the same model as birth1, but separate models for non-smokers and smokers, and call the models birth2 and birth3. (Hint: select observations by the subset argument in the lm-function using the smokeYes variable.)
birth2 <- lm(BWT ~ LWT, data = birth, subset = -smokeYes)
birth3 <- lm(BWT ~ LWT, data = birth, subset = smokeYes)
Interpreate these models
  • Make a scatter plot of LWT vs BWT and add two fitted lines form the model fitted above.
plot(x = birth$LWT, y = birth$BWT)
abline(birth2, col = "red")
abline(birth3, col = "blue")

  • Comment on the plot

Fitted lines for both non-smokers and smokers seems very similar, but it is difficult to tell whether they are significantly different. We will later se how we can model both mother-groups simultaneously and be able to test this difference.

  • Is LWT significant at a 5% level on BWT for the smokers?
summary(birth3)

Call:
lm(formula = BWT ~ LWT, data = birth, subset = smokeYes)

Residuals:
    Min      1Q  Median      3Q     Max 
-2040.3  -416.3    33.9   472.2  1488.7 

Coefficients:
            Estimate Std. Error t value       Pr(>|t|)    
(Intercept)  2395.37     301.47    7.95 0.000000000019 ***
LWT             2.95       2.28    1.30            0.2    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

s: 657 on 72 degrees of freedom
Multiple R-squared: 0.0228,
Adjusted R-squared: 0.00921 
F-statistic: 1.68 on 1 and 72 DF,  p-value: 0.199 

The hypothesis for testing the significance of LWT is, \[H_0: \beta_1 = 0 \text{ vs } H_1: \beta_1 \ne 0\] From the summary of model birth3 above, p-value corresponding to LWT is higher than 0.05 and we fail to reject \(H_0\), which suggests that LWT is not significant for smokers group. In other words, LWT does not have any linear relationship with BWT at 95% confidence level for smokers group.

Assume a model with both LWT and AGE as predictors for BWT using all observations.

  • Write up the model and the model assumptions.

\[\text{BWT} = \beta_0 + \beta_1 \text{LWT} + \beta_2 \text{AGE} + \epsilon\]

Assumptions:

The error term \(\epsilon\) follows \(N(0, \sigma^2) \; iid\), i.e error terms are independently normally distributed with mean 0 and constant variance \(\sigma^2\).

  • What is the interpretation of the regression coefficients?
  1. \(\beta_1\) gives the expected amount of change in BWT for unit change in LWT when AGE is held constant, i.e. if LWT increases by 1 pound, BWT will increase by \(\beta_1\) grams for people of the same AGE.
  2. \(\beta_2\) gives the expected amount of change in BWT (in grams) if AGE increase by 1 year and LWT is held constant.

Fit the model in RStudio, call it birth4 and comment on the results.

birth4 <- lm(BWT ~ LWT + AGE, data = birth)
summary(birth4)

Call:
lm(formula = BWT ~ LWT + AGE, data = birth)

Residuals:
    Min      1Q  Median      3Q     Max 
-2232.8  -500.5    32.1   520.3  1899.3 

Coefficients:
            Estimate Std. Error t value        Pr(>|t|)    
(Intercept)  2215.76     299.24     7.4 0.0000000000044 ***
LWT             4.18       1.74     2.4           0.018 *  
AGE             8.02      10.06     0.8           0.426    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

s: 719 on 186 degrees of freedom
Multiple R-squared: 0.0378,
Adjusted R-squared: 0.0275 
F-statistic: 3.65 on 2 and 186 DF,  p-value: 0.0278 

The summary output shows that LWT is significant at 5% level of significance but not at 1%. AGE has very high p-value and thus is not significant, i.e. there is not any linear relationship of AGE with BWT. The explained variation is still very low with an \(R^2=0.038\).

Optional:

Look at the presentation file Regression.Rmd from lecture 2 and produce for the birth4-model a similar 3D-plot as on page 15. You may need to install the R-packages: rgl, nlme, mgcv and car first. Use the figure to get an understanding of the effects of LWT and AGE on BWT.

A 3D plot
library(scatterplot3d)
with(birth, {
  # Start Plot
  plot3d <- scatterplot3d(LWT, AGE, BWT, type = "p", highlight.3d = TRUE, 
                          mar = c(3, 3, 2, 3), pch = 19, cex.symbols = 0.5, 
                          main = "Residuals and fitted plane for model: birth4",
                          angle = 45, box = FALSE)
  # Add fitted plane for model birth4
  plot3d$plane3d(birth4, col = "grey50", lty.box = "solid", polygon_args = list(bg = "lightgrey"))
  # True Values
  true <- plot3d$xyz.convert(LWT, AGE, BWT)
  # Predicted Values
  fitted <- plot3d$xyz.convert(LWT, AGE, fitted(birth4))
  # Is the residuals negative?
  neg_res <- 1 + (resid(birth4) > 0)
  # Add segment for the residuals
  segments(true$x, true$y, fitted$x, fitted$y, col = c("blue", "red")[neg_res])
})

An interactive 3D plot
library(car)
scatter3d(BWT ~ LWT + AGE, data = birth, axis.ticks = TRUE, revolutions = 1)

For grouped: Smoking vs Non-Smoking:

car::scatter3d(BWT ~ LWT + AGE, data = birth, axis.ticks = TRUE, 
               revolutions = 1, groups = birth$SMK)
Interpretation
  • What is the interpretation of the estimated regression coefficients for LWT and AGE in this model?

From the summary output of birth4 model, the \(\beta\) coefficient for LWT is 4.179 and AGE is 8.021. This shows that, if weight of mother (LWT) increases by 1 pound, the birth weight (BWT) is estimated to increase by 4.179 grams if AGE is held constant. Similary, if the age of a mother (AGE) increases by 1 year, the birth weight (BWT) is estimated to increase by 8.021 grams, if LWT is held constant. The regression coefficients are therefore equal to the slopes of the gridlines of the surface in the figure.