Framingham CHD logistic regression

Modelling the 10-year risk of coronary heart disease

Author

Leonardo Cerliani

Published

July 23, 2023

Introduction

The Framingham dataset (also available on Kaggle) comes from an ongoing study that has been running since 1948. It aims to identify demographic, lifestyle and medical factors associated with the 10-year risk of developing heart disease.

Here we use logistic regression to model the 10-year risk of coronary heart disease (CHD). The dataset consists of:

  • 4238 individuals, 15% of whom are at risk of CHD

  • 43% males and 57% females, aged between 32 and 70 years (median age = 49)

  • 15 recorded variables, including education, smoking habits, blood pressure measurements and related conditions, blood indices (e.g. glucose), and medication use.

Main takeaways

Presentation slides

  • The final model correctly identifies 83% of the people who are actually at risk of CHD within 10 years (sensitivity) when the threshold for binary classification is set at 0.1

  • This threshold (0.1) leads to many false positives; however, given the potentially life-threatening implications of false negatives - and the relatively simple continuous monitoring of patients with a positive prediction - the choice clearly falls on maximising sensitivity.

  • The model is simple, containing only easily retrievable variables: sex, age, systolic blood pressure, glucose and smoking habits

  • Use of blood pressure medication does not appear to decrease the odds of CHD

  • The amount of explained variance is moderate (17%). Other unrecorded variables such as alcohol consumption, stress and wealth might improve model fit and performance.

Column description

  • Sex: male or female (Nominal)
  • Age: age of the patient (Continuous - although the recorded ages are truncated to whole numbers, age is conceptually continuous)
  • Current Smoker: whether or not the patient is a current smoker (Nominal)
  • Cigs Per Day: the average number of cigarettes the person smoked per day (can be considered continuous, since any number of cigarettes is possible, even half a cigarette)

Medical (history)

  • BP Meds: whether or not the patient was on blood pressure medication (Nominal)
  • Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
  • Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
  • Diabetes: whether or not the patient had diabetes (Nominal)
  • Tot Chol: total cholesterol level (Continuous)
  • Sys BP: systolic blood pressure (Continuous)
  • Dia BP: diastolic blood pressure (Continuous)
  • BMI: Body Mass Index (Continuous)
  • Heart Rate: heart rate (Continuous - although in fact discrete, heart rate is treated as continuous in medical research because of the large number of possible values)
  • Glucose: glucose level (Continuous)

Target variable (desired outcome): 10-year risk of coronary heart disease (CHD) (binary: “1” means “Yes”, “0” means “No”)

Abbreviations:

  • EDA: Exploratory Data Analysis
  • EV: Explanatory Variable (i.e. predictor)

Load libraries and data

Code
library(tidyverse)
library(janitor)
library(emmeans)
library(dlookr)
library(flextable)
library(jtools)
library(GGally)
library(vcd)
library(ggstatsplot)
library(gridExtra)
library(caret)
library(kableExtra)
library(plotly)
library(highcharter)
library(pROC)
   

select <- dplyr::select

df <- read_csv("Graded_task_Heart_disease_data.csv") %>% 
 tibble %>% 
 janitor::clean_names()

# change appropriate variables to factor
df <- df %>% mutate_at(
 c("male","education","current_smoker","bp_meds",
   "prevalent_stroke","prevalent_hyp",
   "diabetes","ten_year_chd"), factor
)

# df %>% glimpse()

# dlookr::diagnose(df) %>% flextable

Split train/test

To estimate the performance of the model on out-of-sample data, we split the whole dataset into a train set (60% - 2543 data points) and a test set (40% - 1695 data points).

Code
set.seed(124)
N = nrow(df)
split_index <- sample(c("train","test"), N, replace = T, prob = c(0.6,0.4))

train <- df[which(split_index == "train"),]  # train set
test <- df[which(split_index == "test"),]  # test set

Outliers check

We will carry out the check for outliers - and subsequently evaluate potential imputation of NAs - only on the train data, and blindly apply the same decisions to the test set. In a real-world situation the test set can grow continuously, therefore all decisions must be based on the train data only.

The evaluation of potential NA imputation will be carried out later, just before modelling, since imputing first would bias the EDA.

Observations All the numerical values are within plausible ranges, except for blood pressure. Specifically, the systolic blood pressure sys_bp reaches very high values, with about 4% of the participants having a value > 180 mmHg (a level requiring immediate medical care).

However, in published papers on the Framingham study these values are accepted as plausible. We will therefore only exclude values above 250 mmHg.

Code
# train %>% select_if(is.numeric) %>% summary()

# boxplot(glucose ~ diabetes, data = train, main = "glucose ~ diabetes")

boxplot(sys_bp ~ bp_meds, data = train, main = "systolic blood pressure ~ bp_meds")

Code
boxplot(sys_bp ~ prevalent_hyp, data = train, main = "systolic blood pressure ~ hypertension")

Code
# percent of people with sys_bp > 180
pct_high_sys_bp <- round((train$sys_bp > 180) %>% sum() / nrow(train) * 100,2)

# xtabs(~ prevalent_hyp + bp_meds, data = train)

train <- train %>% filter(sys_bp <= 250)
test <- test %>% filter(sys_bp <= 250)

EDA of numerical variables

Initially, we carry out t-tests to discover which EVs show significant differences (after correction for multiple comparisons) between people with and without risk of CHD.

Then we will also examine the potential correlations between variables. If the latter are substantial, we will evaluate the possibility of discarding some EVs due to collinearity.

Two-independent-samples t-tests

(an exercise in fitting many models using purrr::map)

Code
library(tidyr)

ttests <- train %>%
   # select("ten_year_chd", "age", "glucose") %>%
   
   select(-c("education","male","current_smoker","bp_meds",
           "prevalent_stroke","prevalent_hyp", "diabetes")) %>% 
   na.omit() %>%
   
   # pivot longer to prepare the nesting
   pivot_longer(
      cols = -ten_year_chd, 
      names_to = "variable", values_to = "value"
   ) %>% 
   group_by(variable) %>% 
   group_nest() %>%
   
   # fit a model to each nested df
   mutate(
    ttmod = map(data, ~ lm(value ~ ten_year_chd, data = .x))
   ) %>% 

   # get the stats
   mutate(
      tidy = map(ttmod, broom::tidy)
   ) %>% 
   unnest(tidy) %>% 
   filter(term == "ten_year_chd1") %>% 
   
   # correct for multiple comparisons
   mutate(adj_pval = p.adjust(p.value, method = "fdr")) %>% 
   # filter(adj_pval <= 0.01) %>% 
   relocate(adj_pval, .after=variable) %>% 
   arrange(adj_pval)


ttests %>%
  select(variable, statistic, adj_pval) %>%
  rename(`t value` = statistic) %>%
  flextable() %>%
  set_formatter(adj_pval = function(x) sprintf("%.2e", x))

variable        t value       adj_pval
age             12.0815563    1.01e-31
sys_bp          11.9732840    1.73e-31
dia_bp           8.2215602    8.85e-16
glucose          6.5653995    1.28e-10
bmi              5.6308260    3.22e-08
tot_chol         4.0326391    7.59e-05
cigs_per_day     2.1141771    3.96e-02
heart_rate       0.8512159    3.95e-01


Age

Code
ggbetweenstats(
   data = train,
   x = ten_year_chd,
   y = age
)

Systolic blood pressure

Code
ggbetweenstats(
   data = train,
   x = ten_year_chd,
   y = sys_bp
)

Diastolic blood pressure

Code
# 
ggbetweenstats(
   data = train,
   x = ten_year_chd,
   y = dia_bp
)

Glucose

Code
# 
ggbetweenstats(
   data = train,
   x = ten_year_chd,
   y = glucose
)

BMI

Code
# 
ggbetweenstats(
   data = train,
   x = ten_year_chd,
   y = bmi
)

Cholesterol level

Code
# 
ggbetweenstats(
   data = train,
   x = ten_year_chd,
   y = tot_chol
)

Correlation matrix

Exploring the correlation matrix of the numerical variables to detect potential collinearity

Code
train %>% 
 # select(-c("education","male","current_smoker","bp_meds",
 #           "prevalent_stroke","prevalent_hyp", "diabetes")) %>%
 select(ten_year_chd, age, sys_bp, dia_bp, bmi, glucose, tot_chol) %>% 
 na.omit() %>% 
 # mutate_at(vars(-ten_year_chd), ~ log(. + 1)) %>% 
 GGally::ggpairs(
    aes(color = ten_year_chd),
    upper = list(continuous = wrap("cor", size = 3)),
    lower = list(continuous = wrap("points", alpha = 0.7, size = 1)),
    diag = list(continuous = wrap("densityDiag", alpha = 0.5))
 )

Observations - numerical EVs

  • age appears to be the most discriminative variable for 10-year CHD prediction. Other EVs which differ significantly (after FDR correction) between people with and without CHD risk are bmi, sys_bp, dia_bp, glucose, tot_chol.

  • systolic (sys_bp) and diastolic (dia_bp) blood pressure are correlated with each other, as expected. Since CHD is especially related to hypertension, we will only use systolic blood pressure.

  • bmi and age are also correlated with blood pressure; however, they appear to be important variables for CHD and their correlation with blood pressure is mild, so we will keep them.

  • NB: the number of cigarettes per day (cigs_per_day) will be explored as a categorical variable (see below) with levels for tens of cigarettes per day

EDA of categorical variables

Code
# train %>% select_if(is.factor) %>% summary

As for the numerical variables, we first assess the dependence between risk of CHD and each categorical variable, and then explore the mutual relationships among all categorical variables to assess the presence of collinearity.

Chi-square tests

Sex (male)

Code
library(ggstatsplot)

ggbarstats(
   data = train %>% select(ten_year_chd, male),
   x = male,
   y = ten_year_chd,
   label = "both"
)

Hypertension

Code
ggbarstats(
   data = train %>% select(ten_year_chd, prevalent_hyp),
   x = prevalent_hyp,
   y = ten_year_chd,
   label = "both"
)

Cigarettes per day

The values are too sparse to treat this as a continuous variable. We will therefore group the number of cigarettes per day into tens and treat it as a categorical variable.

A first visual exploration shows mild differences between the two risk groups in the proportion of people smoking different amounts of cigarettes per day.

When carrying out a chi-square test of independence, we are warned that the results might be incorrect. This appears to be due to the fact that there are too few observations in the categories for 50 and 60 cigarettes per day. When these levels are removed, a significant dependence (p = 0.0066) emerges between risk group and number of cigarettes per day.

Code
# Suppress summarise info
options(dplyr.summarise.inform = FALSE)

train %>% 
   select(ten_year_chd, cigs_per_day) %>% 
   na.omit() %>% 
   mutate(cigs_factor = round(cigs_per_day/10,0)) %>% 
   group_by(ten_year_chd, cigs_factor) %>% 
   summarise(count = n(), .groups = "drop") %>% 
   group_by(ten_year_chd) %>%
   mutate(prop = count / sum(count)) %>% 
   select(ten_year_chd, cigs_factor, prop) %>%
   ggplot(aes(x = factor(cigs_factor), y = prop, fill = ten_year_chd)) +
   geom_bar(stat = "identity", position = "dodge") + 
   labs(x = "Tens of Cigarettes per Day", y = "Proportion in each group", fill = "CHD Status")

Code
train %>% 
   select(ten_year_chd, cigs_per_day) %>% 
   mutate(tens_cigarettes = round(cigs_per_day/10, 0) ) %>% 
   filter(tens_cigarettes < 5) %>% 
   select(-cigs_per_day) %>% 
   na.omit() %>% 
   xtabs(~ ten_year_chd + tens_cigarettes, data = .) %>% 
   chisq.test() %>% broom::tidy() %>% kable(format = "html") %>%
   add_header_above(c("Chi-Square CHD Risk ~ Tens_of_cigarettes" = 4)) %>%
   kable_styling()
Chi-Square CHD Risk ~ Tens_of_cigarettes
statistic p.value parameter method
14.22453 0.0066119 4 Pearson's Chi-squared test

Current smoker

Code
ggbarstats(
   data = train %>% select(ten_year_chd, current_smoker),
   x = current_smoker,
   y = ten_year_chd,
   label = "both"
)

Blood pressure medication

Code
ggbarstats(
   data = train %>% select(ten_year_chd, bp_meds),
   x = bp_meds,
   y = ten_year_chd,
   label = "both"
)

Education

Code
ggbarstats(
   data = train %>% select(ten_year_chd, education),
   x = education,
   y = ten_year_chd,
   label = "both"
)

Association matrix (categorical)

Graphically display the association between binary categorical variables, including ten_year_chd to assess the presence of correlated predictors

Code
library(vcd)

train_cat <- train %>% select_if(is.factor) %>% select(-c("ten_year_chd","education"))

pairs(
   table(train_cat), 
   diag_panel = pairs_diagonal_mosaic(offset_varnames=-2.5),    #move down variable names
   upper_panel_args = list(shade = TRUE),                       #set shade colors
   lower_panel_args = list(shade = TRUE)
)

Observations - categorical EVs

  • The 10 year risk of CHD is more prevalent in males than in females
  • Not surprisingly, hypertension and the use of blood pressure medication are also associated with risk. We will examine the latter as well, although only ~3% of the sample takes these medications
  • The number of cigarettes smoked per day (in groups of 10) also increases the risk of CHD. Still, interestingly, more than 50% of the people at risk do not smoke. It would be interesting to have data on the air quality where they live, or on whether their partner is a smoker.
  • It is also interesting that the relationship between the number of cigarettes per day and risk is not linear: there are more people at risk among those who smoke 20-30 or 40-50 cigs per day, while among those who smoke 10-20 or 30-40 cigs per day the proportion of people with and without risk is almost 50%. This odd pattern can probably be explained - at least in part - by the fact that smokers tend to underestimate the number of cigarettes they smoke.
  • By contrast, being a current smoker per se does not appear to be a risk factor
  • There is an interaction between being male, having hypertension and being a current smoker. We will examine whether these interactions can improve the prediction, even though currently smoking per se apparently does not.

Choice of predictors

Numerical variables: Age (age), systolic blood pressure (sys_bp), body-mass index (bmi), blood glucose level (glucose) and blood cholesterol level (tot_chol) appear to be involved in predicting the risk of CHD. As noted above, cigs_per_day was treated as a categorical variable with levels at 0 and multiples of 10.

Categorical variables: Sex (male), hypertension (prevalent_hyp), blood pressure medication (bp_meds), education (education) and number of cigarettes per day (cigs_per_day) appear to be involved in predicting the risk of CHD.

There are widespread associations between numerical and categorical variables. This is not surprising for two reasons:

  • some variables are proxies for one another (e.g. sys_bp, prevalent_hyp, bp_meds)
  • some variables are linked by known relationships: e.g. bmi increases with age, and both are likely to lead to increased blood pressure (sys_bp), which in turn can lead to hypertension and to the corresponding medication

Despite this, we will enter all the selected variables in the model and assess the presence of collinearity using VIF.
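
To make the collinearity check explicit, a minimal sketch is shown below; it assumes the car package is available (it is not loaded above), while in the models that follow the same diagnostics are obtained via jtools::summ(..., vifs = TRUE).

Code
# Hypothetical sketch of a VIF check on the selected EVs (assumes the `car`
# package; for factor EVs such as education, car::vif reports generalized VIFs)
fit_check <- glm(
   ten_year_chd ~ male + prevalent_hyp + bp_meds + education +
      age + sys_bp + bmi + glucose + tot_chol + cigs_per_day,
   family = "binomial", data = train
)
car::vif(fit_check)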

age

Code
library(gridExtra)

boxplot_tt <- function(response, predictor) {
   tt <- t.test(train[[response]] ~ train[[predictor]], data =train)
   
   res <- paste0("t = ", round(tt$statistic,2), " p = ", round(tt$p.value,4))
   
   p <- train %>% na.omit() %>% 
      ggplot(aes(x = .data[[predictor]], y = .data[[response]])) +
      geom_boxplot() +
      labs(title = paste0(predictor," vs ", response),
           subtitle = res)
   
   return(p)
   
}

p1 <- boxplot_tt("age", "male")
p2 <- boxplot_tt("age", "prevalent_hyp")
p3 <- boxplot_tt("age", "bp_meds")

grid.arrange(p1, p2, p3, nrow = 1)

bmi

Code
p1 <- boxplot_tt("bmi", "male")
p2 <- boxplot_tt("bmi", "prevalent_hyp")
p3 <- boxplot_tt("bmi", "bp_meds")

grid.arrange(p1, p2, p3, nrow = 1)

systolic blood pressure (sys_bp)

Code
p1 <- boxplot_tt("sys_bp", "male")
p2 <- boxplot_tt("sys_bp", "prevalent_hyp")
p3 <- boxplot_tt("sys_bp", "bp_meds")

grid.arrange(p1, p2, p3, nrow = 1)

glucose level

Code
p1 <- boxplot_tt("glucose", "male")
p2 <- boxplot_tt("glucose", "prevalent_hyp")
p3 <- boxplot_tt("glucose", "bp_meds")

grid.arrange(p1, p2, p3, nrow = 1)

cholesterol level

Code
p1 <- boxplot_tt("tot_chol", "male")
p2 <- boxplot_tt("tot_chol", "prevalent_hyp")
p3 <- boxplot_tt("tot_chol", "bp_meds")

grid.arrange(p1, p2, p3, nrow = 1)

NA Assessment

As a last step before fitting the models, we need to take care of NAs in these 6 variables. As mentioned in the beginning, we do this here in order not to bias the EDA.

In case of imputation, we will apply the same procedure blindly to the test set.

bmi is significantly different for males and females, and only 12 values are missing. We will impute the NAs with the median within each sex, separately for the risk and no-risk groups.

glucose and tot_chol differ especially between people with and without hypertension. We will impute the NAs with the median within each hypertension status, separately for the risk and no-risk groups.

bp_meds is a very important variable and not easy to impute. The table below shows that people for whom blood pressure (bp) medication was not recorded have median bp values compatible with people who have either high or normal blood pressure and do not take medication, which could justify the imputation. However, since this variable is important and related to drug use, we prefer not to take the risk: we will discard the people for whom it is not known whether they take medication.

cigs_per_day can hardly be imputed correctly, therefore we will remove the data points with NA in this variable (less than 1%).

bmi

Code
# bmi ~ male
ggbetweenstats(
   data = train,
   x = male,
   y = bmi
)

glucose

Code
# glucose ~ hypertension
ggbetweenstats(
   data = train,
   x = prevalent_hyp,
   y = glucose
)

cholesterol

Code
# cholesterol ~ hypertension
ggbetweenstats(
   data = train,
   x = prevalent_hyp,
   y = tot_chol
)

bp_meds

Code
# bp_meds ~ prevalent_hyp
df %>% 
  select(sys_bp, prevalent_hyp, bp_meds) %>%
  group_by(bp_meds, prevalent_hyp) %>% 
  summarise(
    median_sys_bp = median(sys_bp),
    count_prevalent_hyp = n()
  ) 
# A tibble: 5 × 4
# Groups:   bp_meds [3]
  bp_meds prevalent_hyp median_sys_bp count_prevalent_hyp
  <fct>   <fct>                 <dbl>               <int>
1 0       0                      122                 2891
2 0       1                      150.                1170
3 1       1                      165                  124
4 <NA>    0                      121                   31
5 <NA>    1                      154.                  22

NA imputation on train and test set

Code
# ------------------------- train --------------------------------------------
train <- train %>%
   # impute the bmi as the median of males/females
   group_by(ten_year_chd, male) %>%
   mutate(bmi = ifelse(is.na(bmi), median(bmi, na.rm = T),bmi)) %>%
   ungroup() %>%
   
   # impute glucose and cholesterol level as the median of prevalent_hyp
   group_by(ten_year_chd, prevalent_hyp) %>%
   mutate(glucose = ifelse(is.na(glucose), median(glucose, na.rm=T), glucose)) %>%
   mutate(tot_chol = ifelse(is.na(tot_chol), median(tot_chol, na.rm=T), tot_chol)) %>%
   ungroup() %>% 
   
   # remove people where bp_meds is unknown
   filter(!is.na(bp_meds)) %>% 
   
   # remove people with no information about education level
   filter(!is.na(education)) %>% 
   
   # # remove the mean from numerical variables
   # mutate_if(is.numeric, ~ . - mean(., na.rm = TRUE)) %>% 
   
   # remove residual NAs
   na.omit()
   
   
   

train %>% 
   select(
      ten_year_chd, age, sys_bp, bmi, glucose, tot_chol, 
      male, prevalent_hyp, bp_meds, education
   ) %>%
   diagnose %>%
   flextable

variables       types     missing_count   missing_percent   unique_count   unique_rate
ten_year_chd    factor    0               0                 2              0.0008093889
age             numeric   0               0                 39             0.0157830838
sys_bp          numeric   0               0                 213            0.0861999191
bmi             numeric   0               0                 1,099          0.4447592068
glucose         numeric   0               0                 121            0.0489680291
tot_chol        numeric   0               0                 227            0.0918656414
male            factor    0               0                 2              0.0008093889
prevalent_hyp   factor    0               0                 2              0.0008093889
bp_meds         factor    0               0                 2              0.0008093889
education       factor    0               0                 4              0.0016187778

Code
# ------------------------- test --------------------------------------------
test <- test %>%
   # impute the bmi as the median of males/females
   group_by(ten_year_chd, male) %>%
   mutate(bmi = ifelse(is.na(bmi), median(bmi, na.rm = T),bmi)) %>%
   ungroup() %>%
   
   # impute glucose and cholesterol level as the median of prevalent_hyp
   group_by(ten_year_chd, prevalent_hyp) %>%
   mutate(glucose = ifelse(is.na(glucose), median(glucose, na.rm=T), glucose)) %>%
   mutate(tot_chol = ifelse(is.na(tot_chol), median(tot_chol, na.rm=T), tot_chol)) %>%
   ungroup() %>% 
   
   # remove people where bp_meds is unknown
   filter(!is.na(bp_meds)) %>% 
   
   # remove people with no information about education level
   filter(!is.na(education)) %>% 
   
   # # remove the mean from numerical variables
   # mutate_if(is.numeric, ~ . - mean(., na.rm = TRUE)) %>% 
   
   # remove residual NAs
   na.omit()




test %>% 
   select(
      ten_year_chd, age, sys_bp, bmi, glucose, tot_chol, 
      male, prevalent_hyp, bp_meds, education
   ) %>%
   diagnose %>%
   flextable

variables       types     missing_count   missing_percent   unique_count   unique_rate
ten_year_chd    factor    0               0                 2              0.001265823
age             numeric   0               0                 37             0.023417722
sys_bp          numeric   0               0                 201            0.127215190
bmi             numeric   0               0                 859            0.543670886
glucose         numeric   0               0                 102            0.064556962
tot_chol        numeric   0               0                 213            0.134810127
male            factor    0               0                 2              0.001265823
prevalent_hyp   factor    0               0                 2              0.001265823
bp_meds         factor    0               0                 2              0.001265823
education       factor    0               0                 4              0.002531646

Analytic strategy

It appears that there are two interesting questions:

  1. How well is the risk predicted by generic factors alone (such as sex, age and blood pressure)?

  2. How much can we improve the prediction by including variables more directly related to hypertension, such as diagnosed hypertension or the use of blood pressure medication?

Since we allowed some correlated variables to enter the model, we will pay close attention to the VIF (variance inflation factor).

After the “best” model is determined by manual stepwise regression, we will compare it to an automatic stepwise regression carried out using the AIC criterion alone.

Logistic regression

Procedure for stepwise modelling

We start by fitting the “maximal” model, which includes all the provided EVs, to obtain a baseline estimate of the AUC and of the impact of the correlations between variables detected during the EDA - for the latter, we check the VIF.

Then we fit the full model, which includes all the variables selected based on the EDA. This shows how well the hints gathered in the EDA are actually reflected in the significance of the coefficients, specifically in comparison with the “maximal” model.

We then proceed to eliminate EVs whose coefficients are not significant or have a very small effect (in terms of odds ratio). At each step we evaluate the AIC and AUC to check whether removing EVs increased or decreased the performance.

Finally we will examine the confusion matrix and adjust the threshold for binary prediction in order to have maximum Sensitivity (more on this later).

Format of the summary tables

The coefficients are reported in two forms (at the expense of verbosity):

  • original, non-standardized and non-exponentiated (using summary()): to allow calculating, via exp(), the odds of 10-year CHD given the actual EV values of a specific person.

  • exponentiated (using jtools::summ()): to report the odds ratios with their confidence intervals and to calculate the VIFs.

Results

The AIC was similar across all examined models (between 1884 and 1890), therefore this metric had little weight in the choice of the final model. Similarly, all the models explained at most 17% of the total variance in the data, which is not a substantial amount.

The VIF was not critical, even in the maximal model, therefore the impact of correlated predictors was minimal or negligible.

The maximal model highlighted the contribution of gender, age, systolic blood pressure, and to a lesser extent the prevalence of hypertension and education. Importantly, the p values for some of these variables were higher than in simpler models, most likely due to collinearity with other variables. The AUC of this model was 0.703.

Our full model highlighted the same variables. In addition, reducing the number of EVs increased the significance of most of the selected EVs (especially age and systolic blood pressure). The AUC of this model was 0.705.

We then removed the EVs with small or no effect from the full model, specifically blood pressure medication, bmi, cholesterol and education. This led to an AUC of 0.711, with all remaining variables significant.

Further removing the EV specifying whether the patient had hypertension led to a simpler model without substantially changing the AUC (0.713).

Remarkably, this model contains only gender, age, systolic blood pressure and blood glucose level, in addition to the number of cigarettes smoked per day - that is, two common demographic variables and two physiological measurements which can easily be obtained with a blood pressure measurement and a simple blood test.

The increase in the odds of developing CHD within 10 years is as follows for each EV (the sketch after this list reproduces the calculations):

  • sex: males have \(exp(0.449) \approx 57\%\) greater odds of CHD risk than females
  • cigarettes per day: \(exp(0.1786) \approx 19\%\) greater odds for every additional 10 cigarettes smoked per day
  • age: the odds of CHD increase by \(exp(0.0658) \approx 6.8\%\) for every additional year
  • systolic blood pressure: \(exp(0.0186) \approx 1.9\%\) greater odds for each additional mmHg
  • glucose: \(exp(0.0075) \approx 0.7\%\) greater odds for each additional mg/dL
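
These percentages follow directly from the (non-standardized) coefficients reported in the final model summary further below; a minimal sketch of the calculation, using those coefficient values:

Code
# Odds-ratio multipliers implied by the final model coefficients
exp(0.449590)        # male:         ~1.57  (~57% greater odds)
exp(0.017858 * 10)   # cigs_per_day: ~1.20  per additional 10 cigarettes/day
exp(0.065845)        # age:          ~1.07  per additional year
exp(0.018684)        # sys_bp:       ~1.02  per additional mmHg
exp(0.007547)        # glucose:      ~1.008 per additional mg/dL

# Equivalently, from a fitted model object (hypothetically named `fit`):
# exp(coef(fit))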

The intercept can be used to compute the probability of a female being at risk when all the other predictors are set to 0. Since the predictors include age, glucose and blood pressure, which are obviously never zero, this interpretation of the intercept is not really helpful in our case. For the sake of calculation: \(p = \frac{e^{\beta_0}}{1 + e^{\beta_0}} = \frac{e^{-8.643}}{1+e^{-8.643}} \approx 0.000176\)
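
The same value can be obtained with the inverse logit in one line; a quick check:

Code
# Probability implied by the intercept alone (inverse logit of -8.643)
plogis(-8.643)   # ~0.000176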

At this point it is very interesting to note that:

  1. Use of blood pressure medication does not reduce the 10-year risk of CHD
  2. Having hypertension has only a small effect on the predicted 10-year risk of CHD

This is a very interesting result from a practical perspective, since it means that generic demographic and physiological variables are much more predictive of the 10-year risk of CHD than a diagnosis of hypertension, and that taking blood pressure medication - presumably to normalise blood pressure in people with hypertension - does not reduce the risk of CHD.

Our final model therefore includes:

  • sex
  • age
  • systolic blood pressure
  • glucose
  • cigarettes per day

Afterwards, we evaluated the performance of our model on the test data. Using the default threshold of 0.5 for binary prediction leads to a poor sensitivity, in that only ~7% of the people with a real 10-year CHD risk (\(\frac{16}{16+219}\)) are predicted as being at risk.

Since in our case the aim is to maximise the detection of people who are truly at risk - even at the expense of inflating the number of false positives - we lowered the threshold for binary prediction to 0.1. This allowed us to reach a sensitivity of about 83%.
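
A minimal sketch of how the lower threshold is applied, assuming phat holds the predicted probabilities of the final model on the test set (as returned by do_logistic_regression, defined below):

Code
# Recompute the confusion matrix at the 0.1 threshold
# (assumes `phat` contains the test-set probabilities of the final model)
predicted_01 <- ifelse(phat > 0.1, 1, 0)
caret::confusionMatrix(
   as.factor(predicted_01),
   as.factor(test$ten_year_chd),
   positive = "1"
)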

The automatic stepwise regression (stepAIC) identifies the same variables as our final model, plus a small contribution of education level and prevalence of hypertension (significant only before correction). However, this model has a smaller AUC (0.703) than our final model and is also more complex, therefore we retained our final model.
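
The automatic selection was run along these lines (a sketch, assuming the MASS package; fit_max is a hypothetical name for the maximal model fitted on the train set):

Code
# Automatic stepwise selection by AIC (assumes the MASS package)
fit_max <- glm(ten_year_chd ~ ., family = "binomial", data = train)
fit_step <- MASS::stepAIC(fit_max, direction = "both", trace = FALSE)
summary(fit_step)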

Finally, we hypothesized that the imbalance between Sensitivity and Specificity could be due to the high proportion of people not at risk in our dataset. Therefore, as a last test, we sampled an equal number of people with and without risk.

In this balanced-sample model the AUC increased to 0.738. Using a threshold of 0.3 for the binary decision, the model correctly predicts 91.4% of the people with actual CHD risk, at the cost of a false positive rate of about 70%. Adopting a stricter threshold of 0.4, the model correctly predicts 81.3% of the people with actual CHD risk and about 50% of the people with no risk of CHD.
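
A minimal sketch of the balanced resampling on the train set (hypothetical object names; the no-risk class is downsampled to the size of the at-risk class):

Code
# Downsample the majority (no-risk) class so both classes have the same size
set.seed(124)
n_risk <- sum(train$ten_year_chd == 1)

train_balanced <- train %>%
   group_by(ten_year_chd) %>%
   slice_sample(n = n_risk) %>%
   ungroup()

table(train_balanced$ten_year_chd)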

A final remark This model captures only a fraction of the total variance in the data (~15%). It is odd that other variables widely thought to be associated with cardiovascular disease were not recorded, such as stress level (e.g. hours of work per day/week), alcohol consumption, or air quality.


do_logistic_regression: a function that fits the logistic regression, prints the model summaries and the confusion matrix, and returns the predicted probabilities on the test set.

plot_ROC: a function that generates an interactive ROC curve (with its AUC), allowing inspection of sensitivity and specificity at different thresholds for the binary decision.

Code
do_logistic_regression <- function(EVs, train_data = train, test_data = test, thresh = 0.5) {
   
   # Assemble the formula from the vector of EV names
   formula_string <- paste(EVs, collapse = " + ")
   formula_obj <- as.formula(paste("ten_year_chd ~", formula_string))
   
   fit <- glm(formula_obj, family = "binomial", data = train_data)
   
   # Standard R summary (original, non-exponentiated coefficients)
   summary(fit) %>% print()
   
   # jtools summary (exponentiated coefficients, confidence intervals, VIF)
   jtools::summ(fit, confint = T, vifs = T, scale = F, exp = T, digits = 2) %>% print()
   
   # Estimated probability values from the logistic model
   phat <- predict(fit, newdata = test_data, type = "response")
   
   # Confusion matrix and derived quantities - using caret::confusionMatrix
   sprintf("\n Using a threshold for prediction = %.2f \n\n", thresh) %>% cat()
   
   predicted <- ifelse(phat > thresh, 1, 0)
   
   caret::confusionMatrix(
      as.factor(predicted),                # predicted class
      as.factor(test_data$ten_year_chd),   # actual class
      positive = "1"                       # the positive class in our case is "1"
   ) %>% print()
   
   return(phat)
}


plot_ROC <- function(test_data = test, phat) {
 # Define the probability thresholds you want to use
   pthresh <- seq(0, 1, 0.1)
   
   # Create the ROC curve
   roc_curve <- roc(test_data$ten_year_chd, phat, plot = F, print.auc=T)
   
   # Get the coordinates (sensitivity and specificity) for the specified thresholds
   coordinates <- pROC::coords(roc_curve, pthresh)

    # Using highcharter
    h <- coordinates %>% 
    hchart(
        type = "line",
        hcaes(x = specificity, y = sensitivity, threshold = threshold)
    ) %>% 
    hc_xAxis(title = list(text = "Specificity"), reversed = TRUE, min = 0, max = 1) %>%
    hc_yAxis(title = list(text = "Sensitivity"), min = 0, max = 1) %>% 
    hc_add_series(
        data = coordinates,
        type = "scatter",
        hcaes(x = specificity, y = sensitivity, threshold = threshold),
        marker = list(radius = 6, fillColor = "lightblue", lineWidth = 2),
        tooltip = list(
            headerFormat = "",
            pointFormat = "<b>Threshold: {point.threshold:.2f}</b><br>
                           Sensitivity: {point.y:.2f}<br>
                           Specificity: {point.x:.2f}")
    ) %>% 
     hc_chart(aspectRatio = 1) %>%   # Set the equal aspect ratio
     hc_title(text = paste("ROC curve - AUC =", round(roc_curve$auc,3)))
    
    return(h)
}

Manual stepwise

Maximal model

Maximal model, including all the variables in the dataset. We use this only as a benchmark for the models informed by the EDA.

As expected, the highest VIF values are observed for sys_bp and dia_bp.

Code
EVs <- train %>% select(-ten_year_chd) %>% colnames()
phat <- do_logistic_regression(EVs, thresh = 0.5)

Call:
glm(formula = formula_obj, family = "binomial", data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8415  -0.5828  -0.4175  -0.2897   2.8226  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.4497344  0.8561698  -8.701  < 2e-16 ***
male1              0.3800302  0.1329243   2.859  0.00425 ** 
age                0.0608765  0.0083022   7.333 2.26e-13 ***
education2        -0.3243521  0.1517706  -2.137  0.03259 *  
education3        -0.3327557  0.1860997  -1.788  0.07377 .  
education4        -0.0502180  0.1969542  -0.255  0.79874    
current_smoker1    0.1717930  0.1922979   0.893  0.37166    
cigs_per_day       0.0149836  0.0078998   1.897  0.05787 .  
bp_meds1           0.3755894  0.2932894   1.281  0.20033    
prevalent_stroke1  0.8294602  0.6158164   1.347  0.17800    
prevalent_hyp1     0.3669952  0.1660561   2.210  0.02710 *  
diabetes1          0.3126526  0.3897172   0.802  0.42241    
tot_chol           0.0004310  0.0014095   0.306  0.75980    
sys_bp             0.0121563  0.0046474   2.616  0.00890 ** 
dia_bp             0.0004736  0.0078561   0.060  0.95193    
bmi                0.0141265  0.0151335   0.933  0.35058    
heart_rate        -0.0068913  0.0050928  -1.353  0.17601    
glucose            0.0063839  0.0028683   2.226  0.02604 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2114.4  on 2470  degrees of freedom
Residual deviance: 1854.4  on 2453  degrees of freedom
AIC: 1890.4

Number of Fisher Scoring iterations: 5

MODEL INFO:
Observations: 2471
Dependent Variable: ten_year_chd
Type: Generalized linear model
  Family: binomial 
  Link function: logit 

MODEL FIT:
χ²(17) = 259.91, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.17
Pseudo-R² (McFadden) = 0.12
AIC = 1890.45, BIC = 1995.07 

Standard errors: MLE
-------------------------------------------------------------------------
                          exp(Est.)   2.5%   97.5%   z val.      p    VIF
----------------------- ----------- ------ ------- -------- ------ ------
(Intercept)                    0.00   0.00    0.00    -8.70   0.00       
male1                          1.46   1.13    1.90     2.86   0.00   1.25
age                            1.06   1.05    1.08     7.33   0.00   1.32
education2                     0.72   0.54    0.97    -2.14   0.03   1.13
education3                     0.72   0.50    1.03    -1.79   0.07   1.13
education4                     0.95   0.65    1.40    -0.25   0.80   1.13
current_smoker1                1.19   0.81    1.73     0.89   0.37   2.61
cigs_per_day                   1.02   1.00    1.03     1.90   0.06   2.74
bp_meds1                       1.46   0.82    2.59     1.28   0.20   1.11
prevalent_stroke1              2.29   0.69    7.66     1.35   0.18   1.04
prevalent_hyp1                 1.44   1.04    2.00     2.21   0.03   1.94
diabetes1                      1.37   0.64    2.93     0.80   0.42   1.88
tot_chol                       1.00   1.00    1.00     0.31   0.76   1.07
sys_bp                         1.01   1.00    1.02     2.62   0.01   3.64
dia_bp                         1.00   0.99    1.02     0.06   0.95   2.92
bmi                            1.01   0.98    1.04     0.93   0.35   1.23
heart_rate                     0.99   0.98    1.00    -1.35   0.18   1.10
glucose                        1.01   1.00    1.01     2.23   0.03   1.89
-------------------------------------------------------------------------

 Using a threshold for prediction = 0.50 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1329  213
         1   16   22
                                          
               Accuracy : 0.8551          
                 95% CI : (0.8367, 0.8721)
    No Information Rate : 0.8513          
    P-Value [Acc > NIR] : 0.3513          
                                          
                  Kappa : 0.1249          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.09362         
            Specificity : 0.98810         
         Pos Pred Value : 0.57895         
         Neg Pred Value : 0.86187         
             Prevalence : 0.14873         
         Detection Rate : 0.01392         
   Detection Prevalence : 0.02405         
      Balanced Accuracy : 0.54086         
                                          
       'Positive' Class : 1               
                                          
Code
plot_ROC(test, phat)
Setting levels: control = 0, case = 1
Setting direction: controls < cases

Full with selected variables

This is the “full” model with all the variables chosen based on the EDA.

No problems with collinearity.

Code
EVs  <- c(
   "male", "prevalent_hyp", "bp_meds", "education",
   "age", "sys_bp", "bmi", "glucose", "tot_chol", "cigs_per_day"
)

phat <- do_logistic_regression(EVs, thresh = 0.5)

Call:
glm(formula = formula_obj, family = "binomial", data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9398  -0.5882  -0.4186  -0.2876   2.8221  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -7.8924235  0.7224553 -10.924  < 2e-16 ***
male1           0.3996399  0.1306947   3.058 0.002230 ** 
prevalent_hyp1  0.3564960  0.1628251   2.189 0.028565 *  
bp_meds1        0.4381388  0.2883448   1.519 0.128638    
education2     -0.3319065  0.1512202  -2.195 0.028174 *  
education3     -0.3409724  0.1858949  -1.834 0.066621 .  
education4     -0.0373354  0.1959238  -0.191 0.848870    
age             0.0613141  0.0080512   7.616 2.63e-14 ***
sys_bp          0.0118211  0.0034909   3.386 0.000709 ***
bmi             0.0137041  0.0146029   0.938 0.348011    
glucose         0.0075754  0.0021169   3.579 0.000345 ***
tot_chol        0.0002287  0.0014069   0.163 0.870891    
cigs_per_day    0.0191629  0.0052321   3.663 0.000250 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2114.4  on 2470  degrees of freedom
Residual deviance: 1859.5  on 2458  degrees of freedom
AIC: 1885.5

Number of Fisher Scoring iterations: 5

MODEL INFO:
Observations: 2471
Dependent Variable: ten_year_chd
Type: Generalized linear model
  Family: binomial 
  Link function: logit 

MODEL FIT:
χ²(12) = 254.88, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.17
Pseudo-R² (McFadden) = 0.12
AIC = 1885.48, BIC = 1961.04 

Standard errors: MLE
----------------------------------------------------------------------
                       exp(Est.)   2.5%   97.5%   z val.      p    VIF
-------------------- ----------- ------ ------- -------- ------ ------
(Intercept)                 0.00   0.00    0.00   -10.92   0.00       
male1                       1.49   1.15    1.93     3.06   0.00   1.21
prevalent_hyp1              1.43   1.04    1.97     2.19   0.03   1.87
bp_meds1                    1.55   0.88    2.73     1.52   0.13   1.09
education2                  0.72   0.53    0.97    -2.19   0.03   1.11
education3                  0.71   0.49    1.02    -1.83   0.07   1.11
education4                  0.96   0.66    1.41    -0.19   0.85   1.11
age                         1.06   1.05    1.08     7.62   0.00   1.25
sys_bp                      1.01   1.00    1.02     3.39   0.00   2.06
bmi                         1.01   0.99    1.04     0.94   0.35   1.16
glucose                     1.01   1.00    1.01     3.58   0.00   1.02
tot_chol                    1.00   1.00    1.00     0.16   0.87   1.06
cigs_per_day                1.02   1.01    1.03     3.66   0.00   1.22
----------------------------------------------------------------------

 Using a threshold for prediction = 0.50 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1332  219
         1   13   16
                                          
               Accuracy : 0.8532          
                 95% CI : (0.8347, 0.8703)
    No Information Rate : 0.8513          
    P-Value [Acc > NIR] : 0.433           
                                          
                  Kappa : 0.0915          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.06809         
            Specificity : 0.99033         
         Pos Pred Value : 0.55172         
         Neg Pred Value : 0.85880         
             Prevalence : 0.14873         
         Detection Rate : 0.01013         
   Detection Prevalence : 0.01835         
      Balanced Accuracy : 0.52921         
                                          
       'Positive' Class : 1               
                                          
Code
plot_ROC(test, phat)
Setting levels: control = 0, case = 1
Setting direction: controls < cases

Remove no-effect EVs

Remove education, bp_meds, bmi, tot_chol since they have no effect on the model.

The AIC slightly decreases with respect to the full model.

Code
# Remove education, bp_meds, bmi, tot_chol since they have
# no effect on the model
EVs  <- c(
   "male", "prevalent_hyp",
   "age", "sys_bp", "glucose", "cigs_per_day"
)

phat <- do_logistic_regression(EVs, thresh = 0.5)

Call:
glm(formula = formula_obj, family = "binomial", data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1088  -0.5889  -0.4241  -0.2951   2.7635  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -8.027822   0.562910 -14.261  < 2e-16 ***
male1           0.434934   0.127776   3.404 0.000664 ***
prevalent_hyp1  0.387263   0.160552   2.412 0.015862 *  
age             0.064891   0.007775   8.346  < 2e-16 ***
sys_bp          0.013408   0.003393   3.952 7.76e-05 ***
glucose         0.007572   0.002106   3.595 0.000324 ***
cigs_per_day    0.018188   0.005197   3.500 0.000465 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2114.4  on 2470  degrees of freedom
Residual deviance: 1870.0  on 2464  degrees of freedom
AIC: 1884

Number of Fisher Scoring iterations: 5

MODEL INFO:
Observations: 2471
Dependent Variable: ten_year_chd
Type: Generalized linear model
  Family: binomial 
  Link function: logit 

MODEL FIT:
χ²(6) = 244.31, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.16
Pseudo-R² (McFadden) = 0.12
AIC = 1884.05, BIC = 1924.73 

Standard errors: MLE
----------------------------------------------------------------------
                       exp(Est.)   2.5%   97.5%   z val.      p    VIF
-------------------- ----------- ------ ------- -------- ------ ------
(Intercept)                 0.00   0.00    0.00   -14.26   0.00       
male1                       1.54   1.20    1.98     3.40   0.00   1.16
prevalent_hyp1              1.47   1.08    2.02     2.41   0.02   1.83
age                         1.07   1.05    1.08     8.35   0.00   1.17
sys_bp                      1.01   1.01    1.02     3.95   0.00   1.95
glucose                     1.01   1.00    1.01     3.60   0.00   1.01
cigs_per_day                1.02   1.01    1.03     3.50   0.00   1.21
----------------------------------------------------------------------

 Using a threshold for prediction = 0.50 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1335  219
         1   10   16
                                          
               Accuracy : 0.8551          
                 95% CI : (0.8367, 0.8721)
    No Information Rate : 0.8513          
    P-Value [Acc > NIR] : 0.3513          
                                          
                  Kappa : 0.0958          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.06809         
            Specificity : 0.99257         
         Pos Pred Value : 0.61538         
         Neg Pred Value : 0.85907         
             Prevalence : 0.14873         
         Detection Rate : 0.01013         
   Detection Prevalence : 0.01646         
      Balanced Accuracy : 0.53033         
                                          
       'Positive' Class : 1               
                                          
Code
plot_ROC(test, phat)
Setting levels: control = 0, case = 1
Setting direction: controls < cases

Remove prevalent_hyp

Remove prevalent_hyp since it is only marginally significant

Note that the AIC increases only marginally (less than 0.2%), while the model becomes much simpler, requiring only five easily obtainable variables: two common demographic variables (age and sex), two measurements that can be gathered through a blood pressure reading and a simple blood test (sys_bp and glucose), and the number of cigarettes smoked per day.

Code
# Remove prevalent_hyp since it is only marginally significant
EVs  <- c(
   "male", "age", "sys_bp", "glucose", "cigs_per_day"
)

phat <- do_logistic_regression(EVs, thresh = 0.5)

Call:
glm(formula = formula_obj, family = "binomial", data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1053  -0.5899  -0.4330  -0.2982   2.8162  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -8.643579   0.504565 -17.131  < 2e-16 ***
male1         0.449590   0.127385   3.529 0.000417 ***
age           0.065845   0.007745   8.502  < 2e-16 ***
sys_bp        0.018684   0.002597   7.194 6.29e-13 ***
glucose       0.007547   0.002112   3.573 0.000353 ***
cigs_per_day  0.017858   0.005184   3.445 0.000571 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2114.4  on 2470  degrees of freedom
Residual deviance: 1875.8  on 2465  degrees of freedom
AIC: 1887.8

Number of Fisher Scoring iterations: 5

MODEL INFO:
Observations: 2471
Dependent Variable: ten_year_chd
Type: Generalized linear model
  Family: binomial 
  Link function: logit 

MODEL FIT:
χ²(5) = 238.54, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.16
Pseudo-R² (McFadden) = 0.11
AIC = 1887.82, BIC = 1922.69 

Standard errors: MLE
--------------------------------------------------------------------
                     exp(Est.)   2.5%   97.5%   z val.      p    VIF
------------------ ----------- ------ ------- -------- ------ ------
(Intercept)               0.00   0.00    0.00   -17.13   0.00       
male1                     1.57   1.22    2.01     3.53   0.00   1.16
age                       1.07   1.05    1.08     8.50   0.00   1.17
sys_bp                    1.02   1.01    1.02     7.19   0.00   1.14
glucose                   1.01   1.00    1.01     3.57   0.00   1.01
cigs_per_day              1.02   1.01    1.03     3.44   0.00   1.20
--------------------------------------------------------------------

 Using a threshold for prediction = 0.50 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1337  219
         1    8   16
                                          
               Accuracy : 0.8563          
                 95% CI : (0.8381, 0.8733)
    No Information Rate : 0.8513          
    P-Value [Acc > NIR] : 0.3             
                                          
                  Kappa : 0.0987          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.06809         
            Specificity : 0.99405         
         Pos Pred Value : 0.66667         
         Neg Pred Value : 0.85925         
             Prevalence : 0.14873         
         Detection Rate : 0.01013         
   Detection Prevalence : 0.01519         
      Balanced Accuracy : 0.53107         
                                          
       'Positive' Class : 1               
                                          
Code
plot_ROC(test, phat)
Setting levels: control = 0, case = 1
Setting direction: controls < cases

Assess categorical interactions

The number of cigarettes per day is a relevant factor, although being a current smoker is not. Since there is an interaction between being male, current smoker and having hypertension, it is worth exploring whether including these interactions would improve the performance of the model.

However, the results show that the coefficients associated with male:current_smoker and current_smoker:prevalent_hyp are not significant - especially after correction for multiple comparisons. They are also highly collinear with cigs_per_day - which becomes non-significant - and do not improve the model performance.

For all these reasons, these interactions will not be included in the final model.

Code
# Try to include `male:current_smoker` and `current_smoker:prevalent_hyp`
EVs  <- c(
   "male", "age", "sys_bp", "glucose", "male:current_smoker", "current_smoker:prevalent_hyp"
)

phat <- do_logistic_regression(EVs, thresh = 0.5)

Call:
glm(formula = formula_obj, family = "binomial", data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0959  -0.5871  -0.4262  -0.2940   2.7761  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -8.071834   0.575043 -14.037  < 2e-16 ***
male1                           0.411137   0.177391   2.318 0.020466 *  
age                             0.064531   0.007790   8.284  < 2e-16 ***
sys_bp                          0.013450   0.003398   3.959 7.54e-05 ***
glucose                         0.007486   0.002101   3.563 0.000367 ***
male0:current_smoker1           0.386439   0.213094   1.813 0.069760 .  
male1:current_smoker1           0.560270   0.206397   2.715 0.006637 ** 
current_smoker0:prevalent_hyp1  0.472103   0.200758   2.352 0.018693 *  
current_smoker1:prevalent_hyp1  0.287916   0.201998   1.425 0.154058    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2114.4  on 2470  degrees of freedom
Residual deviance: 1871.3  on 2462  degrees of freedom
AIC: 1889.3

Number of Fisher Scoring iterations: 5

MODEL INFO:
Observations: 2471
Dependent Variable: ten_year_chd
Type: Generalized linear model
  Family: binomial 
  Link function: logit 

MODEL FIT:
χ²(8) = 243.03, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.16
Pseudo-R² (McFadden) = 0.11
AIC = 1889.32, BIC = 1941.64 

Standard errors: MLE
-------------------------------------------------------------------------------
                                       exp(Est.)   2.5%   97.5%   z val.      p
------------------------------------ ----------- ------ ------- -------- ------
(Intercept)                                 0.00   0.00    0.00   -14.04   0.00
male1                                       1.51   1.07    2.14     2.32   0.02
age                                         1.07   1.05    1.08     8.28   0.00
sys_bp                                      1.01   1.01    1.02     3.96   0.00
glucose                                     1.01   1.00    1.01     3.56   0.00
male0:current_smoker1                       1.47   0.97    2.23     1.81   0.07
male1:current_smoker1                       1.75   1.17    2.62     2.71   0.01
current_smoker0:prevalent_hyp1              1.60   1.08    2.38     2.35   0.02
current_smoker1:prevalent_hyp1              1.33   0.90    1.98     1.43   0.15
-------------------------------------------------------------------------------
 
-------------------------------------------
                                        VIF
------------------------------------ ------
(Intercept)                                
male1                                  2.24
age                                    1.18
sys_bp                                 1.96
glucose                                1.01
male0:current_smoker1                  4.18
male1:current_smoker1                  4.18
current_smoker0:prevalent_hyp1         3.57
current_smoker1:prevalent_hyp1         3.57
-------------------------------------------

 Using a threshold for prediction = 0.50 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1332  221
         1   13   14
                                          
               Accuracy : 0.8519          
                 95% CI : (0.8334, 0.8691)
    No Information Rate : 0.8513          
    P-Value [Acc > NIR] : 0.4892          
                                          
                  Kappa : 0.0786          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.059574        
            Specificity : 0.990335        
         Pos Pred Value : 0.518519        
         Neg Pred Value : 0.857695        
             Prevalence : 0.148734        
         Detection Rate : 0.008861        
   Detection Prevalence : 0.017089        
      Balanced Accuracy : 0.524955        
                                          
       'Positive' Class : 1               
                                          
Code
plot_ROC(test, phat)
Setting levels: control = 0, case = 1
Setting direction: controls < cases

Sigmoid for age

We also tried modelling age as a sigmoid, under the assumption that the odds of 10-year CHD risk are overall smaller for people under 40 and increase at a steeper rate afterwards; however, the performance of the model did not change.

Code
sigmoid_function <- function(x, a, b, c, d) {
  # Sigmoid function formula: a + (b - a) / (1 + exp(-c * (x - d)))
  return(a + (b - a) / (1 + exp(-c * (x - d))))
}

# Define the parameters of the sigmoid function to control the curve
a_linear <- 0       # lower asymptote (the curve starts near 0)
b_linear <- 1       # upper asymptote (1 represents 100%)
c_steepness <- 0.15 # steepness of the sigmoid curve
d_transition <- 50  # age at which the transition occurs (here 50)


train_sigmoid_age <- train %>% 
 mutate(age_sigmoid = sigmoid_function(age, a_linear, b_linear, c_steepness, d_transition))

test_sigmoid_age <- test %>% 
 mutate(age_sigmoid = sigmoid_function(age, a_linear, b_linear, c_steepness, d_transition))


plot(train_sigmoid_age$age, train_sigmoid_age$age_sigmoid)

Code
# The age_sigmoid variable increases slowly before age 50, most steeply around 50, and flattens towards 1 afterwards.



EVs  <- c(
   "male", "age_sigmoid", "sys_bp", "glucose", "cigs_per_day"
)

phat <- do_logistic_regression(
 train_data = train_sigmoid_age,
 test_data = test_sigmoid_age,
 EVs, thresh = 0.1
)

Call:
glm(formula = formula_obj, family = "binomial", data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0739  -0.5929  -0.4279  -0.2942   2.7784  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -6.433353   0.401903 -16.007  < 2e-16 ***
male1         0.449367   0.127351   3.529 0.000418 ***
age_sigmoid   2.197337   0.258184   8.511  < 2e-16 ***
sys_bp        0.018622   0.002598   7.167 7.64e-13 ***
glucose       0.007502   0.002109   3.556 0.000376 ***
cigs_per_day  0.017897   0.005184   3.452 0.000556 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2114.4  on 2470  degrees of freedom
Residual deviance: 1874.9  on 2465  degrees of freedom
AIC: 1886.9

Number of Fisher Scoring iterations: 5

MODEL INFO:
Observations: 2471
Dependent Variable: ten_year_chd
Type: Generalized linear model
  Family: binomial 
  Link function: logit 

MODEL FIT:
χ²(5) = 239.49, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.16
Pseudo-R² (McFadden) = 0.11
AIC = 1886.87, BIC = 1921.74 

Standard errors: MLE
--------------------------------------------------------------------
                     exp(Est.)   2.5%   97.5%   z val.      p    VIF
------------------ ----------- ------ ------- -------- ------ ------
(Intercept)               0.00   0.00    0.00   -16.01   0.00       
male1                     1.57   1.22    2.01     3.53   0.00   1.16
age_sigmoid               9.00   5.43   14.93     8.51   0.00   1.17
sys_bp                    1.02   1.01    1.02     7.17   0.00   1.14
glucose                   1.01   1.00    1.01     3.56   0.00   1.01
cigs_per_day              1.02   1.01    1.03     3.45   0.00   1.20
--------------------------------------------------------------------

 Using a threshold for prediction = 0.10 

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 584  39
         1 761 196
                                          
               Accuracy : 0.4937          
                 95% CI : (0.4687, 0.5186)
    No Information Rate : 0.8513          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1183          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8340          
            Specificity : 0.4342          
         Pos Pred Value : 0.2048          
         Neg Pred Value : 0.9374          
             Prevalence : 0.1487          
         Detection Rate : 0.1241          
   Detection Prevalence : 0.6057          
      Balanced Accuracy : 0.6341          
                                          
       'Positive' Class : 1               
                                          
Code
plot_ROC(test_sigmoid_age, phat)
Setting levels: control = 0, case = 1
Setting direction: controls < cases

Final model using thr = 0.5

We decided to include the following variables in the final model:

  • male
  • age
  • sys_bp
  • glucose
  • cigs_per_day

At this point we evaluate the performance of the model on the test set in terms of sensitivity (TP / (TP + FN)), using the default threshold of 0.5 for the binary decision.

Our aim is to maximise the number of people at risk who are predicted to be so by the model.
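
As a reference for how this metric is computed, the sketch below derives the sensitivity by hand from the predicted probabilities. It assumes that phat contains the test-set probabilities returned by do_logistic_regression() and that test$ten_year_chd holds the observed outcomes; it is not run here.

Code
# Sketch: sensitivity computed by hand at a given threshold.
# Assumes `phat` (test-set probabilities) and `test$ten_year_chd` are available.
thresh <- 0.5
pred_class <- ifelse(phat > thresh, 1, 0)

TP <- sum(test$ten_year_chd == 1 & pred_class == 1)
FN <- sum(test$ten_year_chd == 1 & pred_class == 0)

TP / (TP + FN)  # sensitivity: proportion of at-risk people correctly identified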

Code
# The main aim is to use the model to identify the highest amount of true positives
# and minimize false negatives

EVs  <- c(
   "male", "age", "sys_bp", "glucose", "cigs_per_day"
)

phat <- do_logistic_regression(EVs, thresh = 0.5)

Call:
glm(formula = formula_obj, family = "binomial", data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1053  -0.5899  -0.4330  -0.2982   2.8162  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -8.643579   0.504565 -17.131  < 2e-16 ***
male1         0.449590   0.127385   3.529 0.000417 ***
age           0.065845   0.007745   8.502  < 2e-16 ***
sys_bp        0.018684   0.002597   7.194 6.29e-13 ***
glucose       0.007547   0.002112   3.573 0.000353 ***
cigs_per_day  0.017858   0.005184   3.445 0.000571 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2114.4  on 2470  degrees of freedom
Residual deviance: 1875.8  on 2465  degrees of freedom
AIC: 1887.8

Number of Fisher Scoring iterations: 5

MODEL INFO:
Observations: 2471
Dependent Variable: ten_year_chd
Type: Generalized linear model
  Family: binomial 
  Link function: logit 

MODEL FIT:
χ²(5) = 238.54, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.16
Pseudo-R² (McFadden) = 0.11
AIC = 1887.82, BIC = 1922.69 

Standard errors: MLE
--------------------------------------------------------------------
                     exp(Est.)   2.5%   97.5%   z val.      p    VIF
------------------ ----------- ------ ------- -------- ------ ------
(Intercept)               0.00   0.00    0.00   -17.13   0.00       
male1                     1.57   1.22    2.01     3.53   0.00   1.16
age                       1.07   1.05    1.08     8.50   0.00   1.17
sys_bp                    1.02   1.01    1.02     7.19   0.00   1.14
glucose                   1.01   1.00    1.01     3.57   0.00   1.01
cigs_per_day              1.02   1.01    1.03     3.44   0.00   1.20
--------------------------------------------------------------------

 Using a threshold for prediction = 0.50 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1337  219
         1    8   16
                                          
               Accuracy : 0.8563          
                 95% CI : (0.8381, 0.8733)
    No Information Rate : 0.8513          
    P-Value [Acc > NIR] : 0.3             
                                          
                  Kappa : 0.0987          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.06809         
            Specificity : 0.99405         
         Pos Pred Value : 0.66667         
         Neg Pred Value : 0.85925         
             Prevalence : 0.14873         
         Detection Rate : 0.01013         
   Detection Prevalence : 0.01519         
      Balanced Accuracy : 0.53107         
                                          
       'Positive' Class : 1               
                                          
Code
plot_ROC(test, phat)
Setting levels: control = 0, case = 1
Setting direction: controls < cases

Final model using thr = 0.1

Keeping the default threshold of 0.5 for the binary prediction identifies too few true positives. For this reason, we decided to maximise the sensitivity (TP / (TP + FN)), even at the expense of assigning a high risk of CHD to some people who are not actually at risk.

The rationale is that in this case the decision is not about whether to perform open-heart surgery, but whether to start monitoring people who might be at high risk. Moreover, according to the model, accurate monitoring requires only inexpensive procedures: blood tests and blood pressure measurements.

For this reason, it is of paramount importance to include all the people who might be at risk, even if some of them will turn out not to be.

The threshold for risk detection is accordingly lowered to 0.1.

As a consequence, more than half of the people who are not at risk will be flagged as at risk (specificity ≈ 0.44). In exchange, the sensitivity rises to about 84%.
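
To make this trade-off explicit, sensitivity and specificity can be inspected over a range of candidate thresholds with pROC (already loaded above). A minimal sketch, assuming phat contains the test-set probabilities of the previous model:

Code
# Sketch: sensitivity and specificity at several candidate thresholds
# (assumes `phat` and `test$ten_year_chd` from the previous fit are available).
roc_obj <- pROC::roc(test$ten_year_chd, phat)

pROC::coords(
   roc_obj,
   x = c(0.5, 0.3, 0.2, 0.1),   # candidate thresholds for the binary decision
   input = "threshold",
   ret = c("threshold", "sensitivity", "specificity")
)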

NB: we also replace cigs_per_day with tens_cigs, the number of cigarettes expressed in tens, so that the estimated odds ratio refers to an increase of ten cigarettes per day rather than one.

Code
# The main aim is to use the model to identify the highest amount of true positives
# and minimize false negatives

# EVs  <- c(
#    "male", "age", "sys_bp", "glucose", "cigs_per_day"
# )
# 
# phat <- do_logistic_regression(EVs, thresh = 0.1)


train_cig_factor <- train %>% 
 mutate(tens_cigs = round(cigs_per_day/10,0))

test_cig_factor <- test %>% 
 mutate(tens_cigs = round(cigs_per_day/10,0))


EVs  <- c(
   "male", "age", "sys_bp", "glucose", "tens_cigs"
)

phat <- do_logistic_regression(
 EVs, 
 train_data = train_cig_factor, 
 test_data = test_cig_factor, 
 thresh = 0.1
)

Call:
glm(formula = formula_obj, family = "binomial", data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1007  -0.5886  -0.4341  -0.2981   2.8165  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.636592   0.503982 -17.137  < 2e-16 ***
male1        0.448757   0.127401   3.522 0.000428 ***
age          0.065813   0.007742   8.501  < 2e-16 ***
sys_bp       0.018675   0.002597   7.191 6.45e-13 ***
glucose      0.007510   0.002108   3.562 0.000368 ***
tens_cigs    0.176640   0.051338   3.441 0.000580 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2114.4  on 2470  degrees of freedom
Residual deviance: 1875.8  on 2465  degrees of freedom
AIC: 1887.8

Number of Fisher Scoring iterations: 5

MODEL INFO:
Observations: 2471
Dependent Variable: ten_year_chd
Type: Generalized linear model
  Family: binomial 
  Link function: logit 

MODEL FIT:
χ²(5) = 238.54, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.16
Pseudo-R² (McFadden) = 0.11
AIC = 1887.81, BIC = 1922.69 

Standard errors: MLE
-------------------------------------------------------------------
                    exp(Est.)   2.5%   97.5%   z val.      p    VIF
----------------- ----------- ------ ------- -------- ------ ------
(Intercept)              0.00   0.00    0.00   -17.14   0.00       
male1                    1.57   1.22    2.01     3.52   0.00   1.16
age                      1.07   1.05    1.08     8.50   0.00   1.17
sys_bp                   1.02   1.01    1.02     7.19   0.00   1.14
glucose                  1.01   1.00    1.01     3.56   0.00   1.01
tens_cigs                1.19   1.08    1.32     3.44   0.00   1.20
-------------------------------------------------------------------

 Using a threshold for prediction = 0.10 

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 588  38
         1 757 197
                                          
               Accuracy : 0.4968          
                 95% CI : (0.4719, 0.5218)
    No Information Rate : 0.8513          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1218          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8383          
            Specificity : 0.4372          
         Pos Pred Value : 0.2065          
         Neg Pred Value : 0.9393          
             Prevalence : 0.1487          
         Detection Rate : 0.1247          
   Detection Prevalence : 0.6038          
      Balanced Accuracy : 0.6377          
                                          
       'Positive' Class : 1               
                                          
Code
plot_ROC(test, phat)
Setting levels: control = 0, case = 1
Setting direction: controls < cases

Code to manually calculate the confusion matrix and the metrics derived from it. Values are not shown; this was just an exercise.

Code
# # Binary prediction from the estimated probability values 
# thresh <- 0.1
# predicted <- ifelse(phat > thresh, 1, 0)
# 
# actual_predicted <- tibble(
#    actual = test$ten_year_chd, 
#    predicted = ifelse(phat > thresh, 1, 0) 
# )
# 
# cm <- actual_predicted %>%
#   summarise(
#     TP = sum(actual == 1 & predicted == 1),
#     TN = sum(actual == 0 & predicted == 0),
#     FP = sum(actual == 0 & predicted == 1),
#     FN = sum(actual == 1 & predicted == 0)
#   )
# 
# cm %>% print
# 
# TP <- cm$TP
# TN <- cm$TN
# FP <- cm$FP
# FN <- cm$FN
# 
# 
# accuracy <- (TP + TN) / (TP+TN+FP+FN)
# print(paste0("Accuracy = ", round(accuracy,4)))
# 
# sensitivity <- TP / (TP + FN)
# print(paste0("Sensitivity / Recall / TPR = ", round(sensitivity,4)))
# 
# specificity <- TN / (TN + FP)
# print(paste0("Specificity / TNR = ", round(specificity,4)))
# 
# PPV <- TP / (TP+FP)
# print(paste0("Positive Predictive Value = ", round(PPV,2)))
# 
# NPV <- TN / (TN+FN)
# print(paste0("Negative Predictive Value = ", round(NPV,2)))

Automatic stepwise

This stepwise regression leads to a much more complex model. It is important to note that the order of the predictors and the direction of the search can change the final estimated “best” model (selected on the basis of AIC); a small illustration is sketched below.

More importantly, cigs_per_day is identified as a relevant predictor, even though our EDA repeatedly suggested otherwise. The reasons for this remain to be explored; for the time being, we include it in our manually built model (see above).
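
To illustrate this path-dependence, the sketch below runs a forward selection starting from the null model; its result can be compared with the backward elimination used in the next chunk. The object names (null_m, upper_scope, step_forward) are ours, and the predictor pool simply lists variables used elsewhere in this analysis.

Code
# Sketch: stepwise selection is path-dependent, so a forward search may end up
# with a different model than the backward elimination used below.
# Assumes `train` is available; object names are chosen here for illustration.
library(MASS)

null_m <- glm(ten_year_chd ~ 1, family = "binomial", data = train)

upper_scope <- ten_year_chd ~ male + age + education + cigs_per_day +
   prevalent_stroke + prevalent_hyp + bp_meds + sys_bp + bmi + glucose + tot_chol

step_forward <- stepAIC(
   null_m, direction = "forward",
   scope = list(lower = ~ 1, upper = upper_scope), trace = 0
)

formula(step_forward)  # compare with the backward-selected formula below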

Code
library(MASS)

# Candidate full model restricted to a subset of predictors
# (immediately overwritten by the full model below)
fullModel = glm(
   ten_year_chd ~ male + prevalent_hyp + bp_meds + age + sys_bp + glucose + tot_chol,
   family = "binomial", data = train
)

# Full model with all predictors: the starting point of the backward elimination
fullModel = glm(ten_year_chd ~ ., family = "binomial", data = train)

# Null (intercept-only) model: the lower bound of the search scope
nullModel = glm(ten_year_chd ~ 1, family = "binomial", data = train)

step_fit <- stepAIC(
   fullModel, direction = 'backward', 
   scope = list(upper = fullModel, lower = nullModel), trace = 0
)

step_fit %>% summary()

Call:
glm(formula = ten_year_chd ~ male + age + education + cigs_per_day + 
    prevalent_stroke + prevalent_hyp + sys_bp + glucose, family = "binomial", 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0023  -0.5868  -0.4197  -0.2915   2.7940  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -7.655170   0.587371 -13.033  < 2e-16 ***
male1              0.387895   0.129655   2.992 0.002774 ** 
age                0.060750   0.007983   7.610 2.74e-14 ***
education2        -0.335970   0.150240  -2.236 0.025337 *  
education3        -0.352595   0.184679  -1.909 0.056232 .  
education4        -0.038528   0.195294  -0.197 0.843606    
cigs_per_day       0.019113   0.005227   3.657 0.000256 ***
prevalent_stroke1  0.969730   0.601408   1.612 0.106868    
prevalent_hyp1     0.372800   0.161896   2.303 0.021295 *  
sys_bp             0.013282   0.003412   3.893 9.92e-05 ***
glucose            0.007787   0.002106   3.698 0.000217 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2114.4  on 2470  degrees of freedom
Residual deviance: 1860.2  on 2460  degrees of freedom
AIC: 1882.2

Number of Fisher Scoring iterations: 5
Code
step_fit %>% jtools::summ(confint = T, vifs = T, digits = 2) %>% print()
MODEL INFO:
Observations: 2471
Dependent Variable: ten_year_chd
Type: Generalized linear model
  Family: binomial 
  Link function: logit 

MODEL FIT:
χ²(10) = 254.11, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.17
Pseudo-R² (McFadden) = 0.12
AIC = 1882.24, BIC = 1946.18 

Standard errors: MLE
----------------------------------------------------------------------
                           Est.    2.5%   97.5%   z val.      p    VIF
----------------------- ------- ------- ------- -------- ------ ------
(Intercept)               -7.66   -8.81   -6.50   -13.03   0.00       
male1                      0.39    0.13    0.64     2.99   0.00   1.19
age                        0.06    0.05    0.08     7.61   0.00   1.23
education2                -0.34   -0.63   -0.04    -2.24   0.03   1.09
education3                -0.35   -0.71    0.01    -1.91   0.06   1.09
education4                -0.04   -0.42    0.34    -0.20   0.84   1.09
cigs_per_day               0.02    0.01    0.03     3.66   0.00   1.22
prevalent_stroke1          0.97   -0.21    2.15     1.61   0.11   1.01
prevalent_hyp1             0.37    0.06    0.69     2.30   0.02   1.85
sys_bp                     0.01    0.01    0.02     3.89   0.00   1.97
glucose                    0.01    0.00    0.01     3.70   0.00   1.02
----------------------------------------------------------------------
Code
# Estimated probability values from the logistic model
phat_step_fit = predict(step_fit, newdata = test, type = "response")

Evaluate the predictive performance of the stepwise model

Code
# Binary prediction from the estimated probability values 
thresh <- 0.1
predicted <- ifelse(phat_step_fit > thresh, 1, 0)

actual_predicted <- tibble(
   actual = test$ten_year_chd, 
   predicted = ifelse(phat_step_fit > thresh, 1, 0) 
)

caret::confusionMatrix(
   as.factor(predicted), # predicted
   as.factor(test$ten_year_chd), # actual 
   positive = "1"  # the positive class in our case is "1"
)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 608  38
         1 737 197
                                          
               Accuracy : 0.5095          
                 95% CI : (0.4845, 0.5344)
    No Information Rate : 0.8513          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1304          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8383          
            Specificity : 0.4520          
         Pos Pred Value : 0.2109          
         Neg Pred Value : 0.9412          
             Prevalence : 0.1487          
         Detection Rate : 0.1247          
   Detection Prevalence : 0.5911          
      Balanced Accuracy : 0.6452          
                                          
       'Positive' Class : 1               
                                          
Code
plot_ROC(test, phat_step_fit)
Setting levels: control = 0, case = 1
Setting direction: controls < cases

Samples balanced for CHD risk

Our sample is highly unbalanced. We select an equal number of people with and without CHD risk in both the train and the test set, to assess whether this improves model performance.

Code
set.seed(9999)

min_train <- min(
   nrow(train[train$ten_year_chd == 1,]),
   nrow(train[train$ten_year_chd == 0,])   
)

balanced_train <- train %>% 
   group_by(ten_year_chd) %>% 
   slice_sample(n = min_train)


min_test <- min(
   nrow(test[test$ten_year_chd == 1,]),
   nrow(test[test$ten_year_chd == 0,])   
)

balanced_test <- test %>% 
   group_by(ten_year_chd) %>% 
   slice_sample(n = min_test)


# Full model with the selected variables
EVs  <- c(
   "male", "prevalent_hyp", "bp_meds", "education",
   "age", "sys_bp", "bmi", "glucose", "tot_chol"
)


# Remove education, bp_meds, bmi, tot_chol since they have
# no effect on the model
EVs  <- c(
   "male", "prevalent_hyp",
   "age", "sys_bp", "glucose"
)


# Remove prevalent_hyp since it is only marginally significant
EVs  <- c(
   "male", "age", "sys_bp", "glucose"
)


# The stepwise regression (see above) identified cigs_per_day as an important factor,
# so we add it to the model here.
EVs  <- c(
   "male", "age", "sys_bp", "glucose", "cigs_per_day"
)


phat <- do_logistic_regression(
   EVs, 
   train_data = balanced_train, test_data = balanced_test,
   thresh = 0.4
)

Call:
glm(formula = formula_obj, family = "binomial", data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1032  -1.0034  -0.1717   1.0161   2.0910  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -7.238967   0.720868 -10.042  < 2e-16 ***
male1         0.319629   0.169510   1.886  0.05935 .  
age           0.074519   0.010682   6.976 3.04e-12 ***
sys_bp        0.015708   0.003688   4.259 2.06e-05 ***
glucose       0.010113   0.003923   2.578  0.00994 ** 
cigs_per_day  0.030757   0.007706   3.991 6.57e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1048.04  on 755  degrees of freedom
Residual deviance:  910.19  on 750  degrees of freedom
AIC: 922.19

Number of Fisher Scoring iterations: 4

MODEL INFO:
Observations: 756
Dependent Variable: ten_year_chd
Type: Generalized linear model
  Family: binomial 
  Link function: logit 

MODEL FIT:
χ²(5) = 137.85, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.22
Pseudo-R² (McFadden) = 0.13
AIC = 922.19, BIC = 949.96 

Standard errors: MLE
--------------------------------------------------------------------
                     exp(Est.)   2.5%   97.5%   z val.      p    VIF
------------------ ----------- ------ ------- -------- ------ ------
(Intercept)               0.00   0.00    0.00   -10.04   0.00       
male1                     1.38   0.99    1.92     1.89   0.06   1.12
age                       1.08   1.06    1.10     6.98   0.00   1.18
sys_bp                    1.02   1.01    1.02     4.26   0.00   1.11
glucose                   1.01   1.00    1.02     2.58   0.01   1.01
cigs_per_day              1.03   1.02    1.05     3.99   0.00   1.20
--------------------------------------------------------------------

 Using a threshold for prediction = 0.40 

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 115  44
         1 120 191
                                          
               Accuracy : 0.6511          
                 95% CI : (0.6061, 0.6941)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : 2.818e-11       
                                          
                  Kappa : 0.3021          
                                          
 Mcnemar's Test P-Value : 4.727e-09       
                                          
            Sensitivity : 0.8128          
            Specificity : 0.4894          
         Pos Pred Value : 0.6141          
         Neg Pred Value : 0.7233          
             Prevalence : 0.5000          
         Detection Rate : 0.4064          
   Detection Prevalence : 0.6617          
      Balanced Accuracy : 0.6511          
                                          
       'Positive' Class : 1               
                                          
Code
plot_ROC(balanced_test, phat)
Setting levels: control = 0, case = 1
Setting direction: controls < cases

CHD Prediction app

Use the following model - estimated on the whole dataset - to predict the 10-year risk of CHD according to the given input.

\[ \text{logOdds}_{\text{10yr CHD risk}} = -8.671 + 0.51 \cdot male + 0.061 \cdot age + 0.0173 \cdot sys\_bp + 0.00758 \cdot glucose + 0.0204 \cdot cigs\_per\_day \]

\[ p = \frac{e^{\text{logOdds}}}{1 + e^{\text{logOdds}}} \]
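
As a small worked example, the two formulas can be evaluated directly in R. The coefficients are the ones displayed above; the function name and the example input values are only illustrative.

Code
# Sketch: evaluate the displayed formula (coefficients as shown above).
# `male` is 1 for males, 0 for females; the other inputs are on their original scales.
predict_chd_risk <- function(male, age, sys_bp, glucose, cigs_per_day) {
   log_odds <- -8.671 + 0.51 * male + 0.061 * age +
      0.0173 * sys_bp + 0.00758 * glucose + 0.0204 * cigs_per_day
   exp(log_odds) / (1 + exp(log_odds))   # p = e^logOdds / (1 + e^logOdds)
}

# Example: a 55-year-old male smoking 10 cigarettes/day, sys_bp = 140, glucose = 80
predict_chd_risk(male = 1, age = 55, sys_bp = 140, glucose = 80, cigs_per_day = 10)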

NB: for illustrative purposes only. Read and accept the following Disclaimer first.