Display machine information:
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.29 R6_2.5.1 jsonlite_1.7.2 magrittr_2.0.1
## [5] evaluate_0.14 stringi_1.7.6 rlang_1.0.1 cli_3.1.0
## [9] rstudioapi_0.13 jquerylib_0.1.4 bslib_0.3.1 rmarkdown_2.11
## [13] tools_3.6.0 stringr_1.4.0 xfun_0.29 yaml_2.2.1
## [17] fastmap_1.1.0 compiler_3.6.0 htmltools_0.5.2 knitr_1.37
## [21] sass_0.4.0
Load the tidyverse frontend and the imputation libraries:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(miceRanger))
Through the Shiny app developed in HW3, we observe abundant missing values in the MIMIC-IV ICU cohort we created. In this question, we use multiple imputation to obtain a data set without missing values.
Read the following tutorials on the R package miceRanger for imputation: https://github.com/farrellday/miceRanger, https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html.
A more thorough treatment of practical imputation strategies is the book Flexible Imputation of Missing Data by Stef van Buuren.
Solution: I read the tutorials on miceRanger and parts of the book by Stef van Buuren.
Solution:
These terms describe the missing data mechanism: MCAR (missing completely at random) means the missingness does not depend on the data at all; MAR (missing at random) means the missingness depends only on observed values; and MNAR (missing not at random) means the missingness depends on the unobserved values themselves.
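As a toy illustration (hypothetical simulated data, not the MIMIC cohort), the three mechanisms can be mimicked by letting the probability that y is missing depend on nothing, on the observed x, or on y itself:
# simulate a toy data set in which y depends on x
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rnorm(n, mean = x)
# MCAR: missingness is independent of the data
y_mcar <- ifelse(runif(n) < 0.2, NA, y)
# MAR: missingness depends only on the observed x
y_mar <- ifelse(runif(n) < plogis(x), NA, y)
# MNAR: missingness depends on the unobserved y itself
y_mnar <- ifelse(runif(n) < plogis(y), NA, y)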
Solution:
While single imputation fills in each missing entry with one imputed value that is then treated as the truth, multiple imputation repeats the imputation procedure to create several completed data sets and integrates the resulting estimates, thereby accounting for the uncertainty due to the missing values. MICE, one of the most widely used methods, starts with random draws from the observed data and then imputes the missing values several times (e.g., 5 or 10 times) by regressing each variable with missing values on the remaining variables, under the assumption that the missing data are MAR.
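A minimal sketch of the chained-equations idea for an all-numeric data frame (illustrative only: miceRanger replaces the linear regressions below with random forests and draws imputations with proper randomness):
# one MICE-style cycle: initialize, then iterate regressions over variables
mice_sketch <- function(df, n_iter = 5) {
  miss <- lapply(df, function(col) which(is.na(col)))
  # initialize each missing entry with a random draw from the observed values
  for (v in names(df)) {
    df[[v]][miss[[v]]] <- sample(na.omit(df[[v]]), length(miss[[v]]),
                                 replace = TRUE)
  }
  for (i in seq_len(n_iter)) {
    for (v in names(df)) {
      if (length(miss[[v]]) == 0) next
      obs <- setdiff(seq_len(nrow(df)), miss[[v]])
      # regress v on the remaining variables using rows where v was observed
      fit <- lm(reformulate(setdiff(names(df), v), response = v),
                data = df[obs, ])
      # refresh the imputed entries of v from the fitted regression
      df[[v]][miss[[v]]] <- predict(fit, newdata = df[miss[[v]], ])
    }
  }
  df
}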
Discard variables with substantial missingness (say, more than 5000 NAs), and replace apparent data entry errors by NAs.
Solution: I deleted the variables with more than 5000 NAs, which were deathtime, edregtime, edouttime, and dod. For the remaining numerical variables, values outside the range [Q1 - 1.5 IQR, Q3 + 1.5 IQR] were replaced with NAs as outliers.
# import ICU cohort data created in HW3
icu_cohort <- read_rds("../hw3/mimiciv_shiny/icu_cohort.rds")
# drop patients aged 18, who were kept in HW3 per its instructions
icu_cohort <- icu_cohort %>%
filter(age_hadm > 18)
# check variables with more than 5000 missing values
icu_cohort %>%
select_if(colSums(is.na(icu_cohort)) > 5000) %>%
colnames()
## [1] "deathtime" "edregtime" "edouttime" "dod"
# keep only variables with at most 5000 missing values
icu_cohort_discard <- icu_cohort %>%
select_if(colSums(is.na(icu_cohort)) <= 5000) %>%
print(width = Inf)
## # A tibble: 53,065 × 39
## subject_id intime hadm_id stay_id first_careunit
## <int> <dttm> <int> <int> <chr>
## 1 12776735 2200-07-12 00:33:00 20817525 34547665 Neuro Stepdown
## 2 16256226 2150-12-20 16:09:08 20013290 39289362 Neuro Stepdown
## 3 12974563 2138-11-13 23:30:01 29618057 32563675 Neuro Stepdown
## 4 14609218 2174-06-28 21:13:00 20606189 34947848 Neuro Stepdown
## 5 12687112 2162-05-31 18:08:45 26132667 37445058 Neuro Stepdown
## 6 18190935 2115-11-12 23:23:32 21137829 34338402 Neuro Stepdown
## 7 10404210 2161-03-03 18:30:00 22880512 30254828 Neuro Stepdown
## 8 12552973 2135-09-21 18:12:03 21320643 30261549 Neuro Stepdown
## 9 14303051 2175-11-20 01:02:00 26559537 30401486 Neuro Stepdown
## 10 12347959 2140-07-11 17:54:14 22779073 30503984 Neuro Stepdown
## last_careunit outtime los admittime
## <chr> <dttm> <dbl> <dttm>
## 1 Neuro Stepdown 2200-07-13 16:44:40 1.67 2200-07-11 22:46:00
## 2 Neuro Stepdown 2150-12-21 14:58:40 0.951 2150-12-20 03:00:00
## 3 Neuro Stepdown 2138-11-15 16:25:19 1.71 2138-11-13 01:07:00
## 4 Neuro Stepdown 2174-07-05 17:01:32 6.83 2174-06-28 20:40:00
## 5 Neuro Stepdown 2162-06-04 10:16:13 3.67 2162-05-31 15:36:00
## 6 Neuro Stepdown 2115-11-15 18:06:26 2.78 2115-11-02 18:38:00
## 7 Neuro Stepdown 2161-03-05 18:10:31 1.99 2161-03-03 17:45:00
## 8 Neuro Stepdown 2135-09-23 12:56:46 1.78 2135-09-21 02:12:00
## 9 Neuro Stepdown 2175-11-21 16:06:17 1.63 2175-11-20 00:08:00
## 10 Neuro Stepdown 2140-07-19 13:01:05 7.80 2140-07-11 17:52:00
## dischtime admission_type admission_location
## <dttm> <chr> <chr>
## 1 2200-07-19 12:00:00 OBSERVATION ADMIT EMERGENCY ROOM
## 2 2150-12-21 14:50:00 SURGICAL SAME DAY ADMISSION PHYSICIAN REFERRAL
## 3 2138-11-15 15:53:00 OBSERVATION ADMIT EMERGENCY ROOM
## 4 2174-07-05 16:45:00 OBSERVATION ADMIT EMERGENCY ROOM
## 5 2162-06-04 10:16:00 OBSERVATION ADMIT PHYSICIAN REFERRAL
## 6 2115-11-26 14:30:00 OBSERVATION ADMIT TRANSFER FROM HOSPITAL
## 7 2161-03-05 18:10:00 OBSERVATION ADMIT EMERGENCY ROOM
## 8 2135-09-23 12:56:00 SURGICAL SAME DAY ADMISSION PHYSICIAN REFERRAL
## 9 2175-11-21 15:30:00 OBSERVATION ADMIT EMERGENCY ROOM
## 10 2140-07-19 13:00:00 URGENT TRANSFER FROM HOSPITAL
## discharge_location insurance language marital_status
## <chr> <chr> <chr> <chr>
## 1 SKILLED NURSING FACILITY Medicare ENGLISH MARRIED
## 2 HOME Other ENGLISH SINGLE
## 3 HOME HEALTH CARE Other ENGLISH WIDOWED
## 4 REHAB Other ENGLISH SINGLE
## 5 HOME Other ENGLISH SINGLE
## 6 CHRONIC/LONG TERM ACUTE CARE Other ENGLISH DIVORCED
## 7 HOME Other ENGLISH MARRIED
## 8 HOME Other ENGLISH MARRIED
## 9 HOME Other ENGLISH SINGLE
## 10 REHAB Other ENGLISH MARRIED
## ethnicity hospital_expire_flag gender anchor_age anchor_year
## <chr> <int> <chr> <int> <int>
## 1 OTHER 0 M 72 2192
## 2 OTHER 0 F 49 2150
## 3 WHITE 0 F 72 2138
## 4 WHITE 0 F 69 2174
## 5 BLACK/AFRICAN AMERICAN 0 M 57 2156
## 6 WHITE 0 F 57 2115
## 7 HISPANIC/LATINO 0 F 46 2161
## 8 WHITE 0 F 48 2131
## 9 WHITE 0 F 24 2175
## 10 WHITE 0 F 49 2140
## anchor_year_group age_hadm lab50960 lab51221 lab50971 lab50983 lab51301
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2008 - 2010 80 1.5 30.6 3.3 130 7.8
## 2 2017 - 2019 49 1.8 30.5 4 141 3.5
## 3 2014 - 2016 72 1.9 41.6 4.4 139 10.6
## 4 2017 - 2019 69 1.7 36 4.6 134 12.1
## 5 2011 - 2013 63 1.4 30 5 143 5.4
## 6 2017 - 2019 57 1.6 31.5 4.3 138 9.2
## 7 2017 - 2019 46 2.2 34.1 4.1 138 3.6
## 8 2011 - 2013 52 1.9 37.1 4 145 14.7
## 9 2014 - 2016 24 2 34.7 4.3 139 7.1
## 10 2017 - 2019 49 1.8 39.6 3.8 137 20.1
## lab50882 lab50893 lab50902 lab50912 lab50931 chart223761 chart220179
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22 8.2 97 1 113 102. 98
## 2 21 8.1 108 0.5 99 98.4 111
## 3 16 8.9 104 1 78 98.7 183
## 4 23 9.2 98 0.5 106 98.3 162
## 5 19 8.4 105 8.2 127 98 165
## 6 26 9.2 98 0.4 135 98.6 132
## 7 26 9 104 0.8 99 98.4 114
## 8 22 8.3 106 0.8 157 98.4 144
## 9 24 9.2 105 0.8 85 98.6 120
## 10 20 9.6 101 0.7 98 99.5 140
## chart220181 chart220210 chart220045 thirty_day_mort
## <dbl> <dbl> <dbl> <lgl>
## 1 86 16 78 FALSE
## 2 72 5 91 FALSE
## 3 107 16 67 FALSE
## 4 99 12 74 FALSE
## 5 95 26 95 FALSE
## 6 97 23 96 FALSE
## 7 79 18 89 FALSE
## 8 99 23 71 FALSE
## 9 85 17 67 FALSE
## 10 117 12 72 FALSE
## # … with 53,055 more rows
# define a function that replaces outliers with `NA`s using the 1.5 IQR rule
outlier_to_na <- function(x) {
qnt <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
H <- 1.5 * IQR(x, na.rm = TRUE)
x[x < (qnt[1] - H)] <- NA
x[x > (qnt[2] + H)] <- NA
x
}
# replace outliers with `NA`s for numeric variables
icu_cohort_replace <- icu_cohort_discard %>%
mutate_if(is.numeric, outlier_to_na)
summary(icu_cohort_replace)
## subject_id intime hadm_id
## Min. :10000032 Min. :2110-01-11 10:16:06 Min. :20000094
## 1st Qu.:12483898 1st Qu.:2132-11-05 21:29:00 1st Qu.:22478093
## Median :14991424 Median :2152-09-29 21:13:55 Median :24947644
## Mean :14989837 Mean :2152-10-28 03:05:41 Mean :24975852
## 3rd Qu.:17495811 3rd Qu.:2172-11-04 17:18:32 3rd Qu.:27477277
## Max. :19999987 Max. :2211-05-01 06:59:19 Max. :29999828
##
## stay_id first_careunit last_careunit
## Min. :30000153 Length:53065 Length:53065
## 1st Qu.:32467111 Class :character Class :character
## Median :34987214 Mode :character Mode :character
## Mean :34984050
## 3rd Qu.:37487113
## Max. :39999810
##
## outtime los admittime
## Min. :2110-01-12 17:17:47 Min. :0.001 Min. :2110-01-11 10:14:00
## 1st Qu.:2132-11-07 16:56:01 1st Qu.:1.035 1st Qu.:2132-11-03 21:46:00
## Median :2152-10-03 19:16:01 Median :1.696 Median :2152-09-28 20:40:00
## Mean :2152-10-31 11:49:07 Mean :2.141 Mean :2152-10-27 02:08:14
## 3rd Qu.:2172-11-06 10:29:59 3rd Qu.:2.868 3rd Qu.:2172-11-04 12:57:00
## Max. :2211-05-10 22:51:06 Max. :7.227 Max. :2211-05-01 06:57:00
## NA's :5265
## dischtime admission_type admission_location
## Min. :2110-01-15 17:31:00 Length:53065 Length:53065
## 1st Qu.:2132-11-14 16:40:00 Class :character Class :character
## Median :2152-10-07 17:00:00 Mode :character Mode :character
## Mean :2152-11-05 12:22:14
## 3rd Qu.:2172-11-10 14:55:00
## Max. :2211-05-12 17:54:00
##
## discharge_location insurance language marital_status
## Length:53065 Length:53065 Length:53065 Length:53065
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## ethnicity hospital_expire_flag gender anchor_age
## Length:53065 Min. :0 Length:53065 Min. :18.00
## Class :character 1st Qu.:0 Class :character 1st Qu.:53.00
## Mode :character Median :0 Mode :character Median :65.00
## Mean :0 Mean :63.59
## 3rd Qu.:0 3rd Qu.:77.00
## Max. :0 Max. :91.00
## NA's :5514
## anchor_year anchor_year_group age_hadm lab50960
## Min. :2109 Length:53065 Min. : 19.00 Min. :1.000
## 1st Qu.:2131 Class :character 1st Qu.: 54.00 1st Qu.:1.700
## Median :2151 Mode :character Median : 66.00 Median :1.900
## Mean :2151 Mean : 64.54 Mean :1.957
## 3rd Qu.:2171 3rd Qu.: 78.00 3rd Qu.:2.200
## Max. :2207 Max. :102.00 Max. :2.900
## NA's :4541
## lab51221 lab50971 lab50983 lab51301 lab50882
## Min. :14.40 Min. :2.60 Min. :129.0 Min. : 0.00 Min. :15.00
## 1st Qu.:28.10 1st Qu.:3.70 1st Qu.:136.0 1st Qu.: 7.50 1st Qu.:21.00
## Median :32.60 Median :4.10 Median :139.0 Median :10.30 Median :23.00
## Mean :32.69 Mean :4.11 Mean :138.7 Mean :10.97 Mean :23.06
## 3rd Qu.:37.20 3rd Qu.:4.50 3rd Qu.:141.0 3rd Qu.:13.90 3rd Qu.:25.00
## Max. :51.00 Max. :5.60 Max. :148.0 Max. :24.80 Max. :31.00
## NA's :2261 NA's :3669 NA's :4107 NA's :4149 NA's :4954
## lab50893 lab50902 lab50912 lab50931
## Min. : 6.600 Min. : 91.0 Min. :0.000 Min. : 24.0
## 1st Qu.: 7.900 1st Qu.:101.0 1st Qu.:0.700 1st Qu.:102.0
## Median : 8.400 Median :105.0 Median :0.900 Median :121.0
## Mean : 8.358 Mean :104.8 Mean :0.958 Mean :127.3
## 3rd Qu.: 8.800 3rd Qu.:108.0 3rd Qu.:1.100 3rd Qu.:147.0
## Max. :10.100 Max. :118.0 Max. :2.200 Max. :235.0
## NA's :6062 NA's :3559 NA's :6736 NA's :5444
## chart223761 chart220179 chart220181 chart220210
## Min. : 96.00 Min. : 57.0 Min. : 36.0 Min. : 5.00
## 1st Qu.: 97.70 1st Qu.:106.0 1st Qu.: 70.0 1st Qu.:15.00
## Median : 98.10 Median :122.0 Median : 80.0 Median :18.00
## Mean : 98.15 Mean :122.9 Mean : 81.5 Mean :18.23
## 3rd Qu.: 98.60 3rd Qu.:139.0 3rd Qu.: 92.0 3rd Qu.:21.00
## Max. :100.30 Max. :188.0 Max. :127.0 Max. :32.00
## NA's :4879 NA's :1262 NA's :1685 NA's :1685
## chart220045 thirty_day_mort
## Min. : 37.00 Mode :logical
## 1st Qu.: 74.00 FALSE:47630
## Median : 85.00 TRUE :5435
## Mean : 86.49
## 3rd Qu.: 98.00
## Max. :136.00
## NA's :986
Impute missing values using miceRanger (request \(m=3\) data sets). This step is computationally intensive. Make sure to save the imputation results as a file. Hint: setting max.depth=10 in the miceRanger function may cut some computing time.
Solution: I kept only the variables of interest and used miceRanger to create three imputed data sets, saved to icu_cohort_imputed.rds.
# choose variables of interest
demo_var <- c("gender", "age_hadm", "marital_status", "ethnicity",
"thirty_day_mort")
lab_item <- c(50912, 50971, 50983, 50902, 50882,
51221, 51301, 50931, 50960, 50893)
lab_item <- paste("lab", lab_item, sep = "")
vital_item <- c(220045, 220181, 220179, 223761, 220210)
vital_item <- paste("chart", vital_item, sep = "")
# merge all variables
all_vars <- c(demo_var, lab_item, vital_item)
# keep only the variables we plan to use
icu_cohort_mice <- icu_cohort_replace %>%
select(all_of(all_vars))
summary(icu_cohort_mice)
## gender age_hadm marital_status ethnicity
## Length:53065 Min. : 19.00 Length:53065 Length:53065
## Class :character 1st Qu.: 54.00 Class :character Class :character
## Mode :character Median : 66.00 Mode :character Mode :character
## Mean : 64.54
## 3rd Qu.: 78.00
## Max. :102.00
##
## thirty_day_mort lab50912 lab50971 lab50983 lab50902
## Mode :logical Min. :0.000 Min. :2.60 Min. :129.0 Min. : 91.0
## FALSE:47630 1st Qu.:0.700 1st Qu.:3.70 1st Qu.:136.0 1st Qu.:101.0
## TRUE :5435 Median :0.900 Median :4.10 Median :139.0 Median :105.0
## Mean :0.958 Mean :4.11 Mean :138.7 Mean :104.8
## 3rd Qu.:1.100 3rd Qu.:4.50 3rd Qu.:141.0 3rd Qu.:108.0
## Max. :2.200 Max. :5.60 Max. :148.0 Max. :118.0
## NA's :6736 NA's :3669 NA's :4107 NA's :3559
## lab50882 lab51221 lab51301 lab50931
## Min. :15.00 Min. :14.40 Min. : 0.00 Min. : 24.0
## 1st Qu.:21.00 1st Qu.:28.10 1st Qu.: 7.50 1st Qu.:102.0
## Median :23.00 Median :32.60 Median :10.30 Median :121.0
## Mean :23.06 Mean :32.69 Mean :10.97 Mean :127.3
## 3rd Qu.:25.00 3rd Qu.:37.20 3rd Qu.:13.90 3rd Qu.:147.0
## Max. :31.00 Max. :51.00 Max. :24.80 Max. :235.0
## NA's :4954 NA's :2261 NA's :4149 NA's :5444
## lab50960 lab50893 chart220045 chart220181
## Min. :1.000 Min. : 6.600 Min. : 37.00 Min. : 36.0
## 1st Qu.:1.700 1st Qu.: 7.900 1st Qu.: 74.00 1st Qu.: 70.0
## Median :1.900 Median : 8.400 Median : 85.00 Median : 80.0
## Mean :1.957 Mean : 8.358 Mean : 86.49 Mean : 81.5
## 3rd Qu.:2.200 3rd Qu.: 8.800 3rd Qu.: 98.00 3rd Qu.: 92.0
## Max. :2.900 Max. :10.100 Max. :136.00 Max. :127.0
## NA's :4541 NA's :6062 NA's :986 NA's :1685
## chart220179 chart223761 chart220210
## Min. : 57.0 Min. : 96.00 Min. : 5.00
## 1st Qu.:106.0 1st Qu.: 97.70 1st Qu.:15.00
## Median :122.0 Median : 98.10 Median :18.00
## Mean :122.9 Mean : 98.15 Mean :18.23
## 3rd Qu.:139.0 3rd Qu.: 98.60 3rd Qu.:21.00
## Max. :188.0 Max. :100.30 Max. :32.00
## NA's :1262 NA's :4879 NA's :1685
if (file.exists("icu_cohort_imputed.rds")) {
icu_cohort_imputed <- read_rds("icu_cohort_imputed.rds")
} else {
parTime <- system.time(
icu_cohort_imputed <- miceRanger(
icu_cohort_mice,
m = 3,
max.depth = 10,
returnModels = FALSE,
verbose = TRUE
)
)
icu_cohort_imputed %>%
write_rds("icu_cohort_imputed.rds")
}
Solution: I output some imputation diagnostic plots to check the validity of the imputed distributions.
Distribution of Imputed Values:
These plots compare the imputed distributions (black lines) to the original distribution (red line) for each variable. Most variables are matched reasonably well. A few, such as 50983 (sodium), 220045 (heart rate), and 220181 (mean non-invasive blood pressure), do not match well, which suggests these data may be MNAR, i.e., the missingness depends on the unobserved value itself:
plotDistributions(icu_cohort_imputed, vars = 'allNumeric')
Model OOB Error: We can examine the out-of-bag R-squared of the regression models by iteration. Apart from the last three vitals, the models appear to converge over the iterations.
plotModelError(icu_cohort_imputed,vars = 'allNumeric')
Variable Importance: The plot below shows which variables were used to impute each variable. Overall, each lab measurement was important for imputing the other lab measurements, and likewise for the vitals. In addition, gender and age at admission were important for 50912 (creatinine).
plotVarImportance(icu_cohort_imputed)
# plotCorrelations(icu_cohort_imputed, vars = 'allNumeric')
# plotVarConvergence(icu_cohort_imputed, vars = 'allNumeric')
# plotImputationVariance(icu_cohort_imputed, ncol = 2, widths = c(5,3))
Solution:
The correct multiple imputation strategy is to analyze all \(m\) imputed data sets and pool the \(m\) parameter estimates given by miceRanger, accounting for the uncertainty (the variation) among the imputed values: the pooled variance combines the within-imputation and the between-imputation variance. The pooled estimates are unbiased and have valid statistical properties, unlike estimates from a single imputed data set or from an average of the multiple imputed data sets.
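For a single scalar parameter, the pooling looks as follows under Rubin's rules (a sketch; pool_rubin is a hypothetical helper, not a miceRanger function, and the commented usage assumes the completed data sets obtained from completeData() below):
# Rubin's rules: pooled estimate is the mean of the m estimates; the
# total variance is within-imputation + (1 + 1/m) * between-imputation
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  total_var <- mean(variances) + (1 + 1 / m) * var(estimates)
  c(estimate = mean(estimates), se = sqrt(total_var))
}
# e.g., pooling one glm coefficient across the three completed data sets:
# fits <- lapply(completeData(icu_cohort_imputed), function(d)
#   glm(thirty_day_mort ~ ., data = d, family = binomial))
# pool_rubin(sapply(fits, function(f) coef(f)["age_hadm"]),
#            sapply(fits, function(f) vcov(f)["age_hadm", "age_hadm"]))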
# complete imputed data sets
icu_cohort_complete <- completeData(icu_cohort_imputed)
# choose one of the imputed data sets
icu_cohort_choice <- icu_cohort_complete[[1]]
Data preparation:
# convert non-numeric variables to factors and scale numeric variables
icu_cohort_factor <- icu_cohort_choice %>%
mutate_if(is.character, as.factor) %>%
mutate_if(is.logical, as.factor) %>%
mutate_if(is.numeric, scale)
Develop at least two analytic approaches for predicting the 30-day mortality of patients admitted to ICU using demographic information (gender, age, marital status, ethnicity), first lab measurements during ICU stay, and first vital measurements during ICU stay. For example, you can use (1) logistic regression (glm()
function in base R or keras), (2) logistic regression with lasso penalty (glmnet or keras package), (3) random forest (randomForest package), or (4) neural network (keras package).
Solution:
I used the caret package to split the data into an 80% training set and a 20% test set; the caTools package could also be used.
library(caret)
set.seed(3456)
index <- createDataPartition(icu_cohort_factor$thirty_day_mort,
p = .8, list = FALSE)
training_set <- icu_cohort_factor[index, ]
test_set <- icu_cohort_factor[-index, ]
# library(caTools)
# split <- sample.split(icu_cohort_factor$thirty_day_mort, SplitRatio = 0.8)
# training_set <- subset(icu_cohort_factor, split == TRUE)
# test_set <- subset(icu_cohort_factor, split == FALSE)
Solution:
(1) Logistic regression, using glm().
# fit the model
logstc_model <- glm(thirty_day_mort ~ ., training_set, family = binomial)
summary(logstc_model)
##
## Call:
## glm(formula = thirty_day_mort ~ ., family = binomial, data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1545 -0.4615 -0.3247 -0.2273 3.5136
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.777130 0.445742 -6.230 4.65e-10 ***
## genderM -0.177084 0.037756 -4.690 2.73e-06 ***
## age_hadm 0.462844 0.022221 20.829 < 2e-16 ***
## marital_statusMARRIED -0.013586 0.068560 -0.198 0.8429
## marital_statusSINGLE 0.019426 0.072740 0.267 0.7894
## marital_statusWIDOWED 0.012841 0.077400 0.166 0.8682
## ethnicityASIAN 0.324664 0.453196 0.716 0.4738
## ethnicityBLACK/AFRICAN AMERICAN 0.058568 0.446048 0.131 0.8955
## ethnicityHISPANIC/LATINO 0.111799 0.455235 0.246 0.8060
## ethnicityOTHER 0.289986 0.449356 0.645 0.5187
## ethnicityUNABLE TO OBTAIN 1.166157 0.458838 2.542 0.0110 *
## ethnicityUNKNOWN 1.120795 0.443595 2.527 0.0115 *
## ethnicityWHITE 0.174367 0.442253 0.394 0.6934
## lab50912 0.336277 0.017251 19.493 < 2e-16 ***
## lab50971 0.001927 0.017504 0.110 0.9124
## lab50983 0.282916 0.022668 12.481 < 2e-16 ***
## lab50902 -0.453783 0.024530 -18.499 < 2e-16 ***
## lab50882 -0.380214 0.019694 -19.306 < 2e-16 ***
## lab51221 -0.023447 0.018573 -1.262 0.2068
## lab51301 0.121767 0.016924 7.195 6.25e-13 ***
## lab50931 0.074879 0.016548 4.525 6.04e-06 ***
## lab50960 0.044650 0.017751 2.515 0.0119 *
## lab50893 -0.170969 0.019541 -8.749 < 2e-16 ***
## chart220045 0.213008 0.018404 11.574 < 2e-16 ***
## chart220181 0.019236 0.026263 0.732 0.4639
## chart220179 -0.179341 0.025768 -6.960 3.41e-12 ***
## chart223761 -0.121512 0.017662 -6.880 5.99e-12 ***
## chart220210 0.248065 0.017843 13.903 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28050 on 42451 degrees of freedom
## Residual deviance: 23784 on 42424 degrees of freedom
## AIC: 23840
##
## Number of Fisher Scoring iterations: 6
# predict 30-day mortality using the model
prob_logstc <- predict(logstc_model, test_set, type = "response")
# translate probabilities to predictions by threshold = 0.5
predict_logstc <- ifelse(prob_logstc > 0.5, TRUE, FALSE)
# make confusion matrix
table(observed = test_set$thirty_day_mort, predicted = predict_logstc)
## predicted
## observed FALSE TRUE
## FALSE 9455 71
## TRUE 996 91
# compute accuracy
cat('Test accuracy', mean(predict_logstc == test_set$thirty_day_mort))
## Test accuracy 0.8994629
# confusionMatrix(as.factor(predict_logstc),
# as.factor(test_set$thirty_day_mort))
Prediction accuracy on the test set was about 90%, and it changed little when the threshold was raised further.
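As a quick check of that claim, a small sketch reusing prob_logstc and test_set from above to compute accuracy over a grid of thresholds:
# test accuracy of the logistic model at several classification thresholds
sapply(c(0.3, 0.4, 0.5, 0.6, 0.7), function(t)
  mean((prob_logstc > t) == test_set$thirty_day_mort))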
(2) Logistic regression with lasso penalty (glmnet package)
library(glmnet)
# convert x to matrix and choose y
x_train <- model.matrix(thirty_day_mort ~ ., training_set)
x_test <- model.matrix(thirty_day_mort ~ ., test_set)
y_train <- training_set$thirty_day_mort
y_test <- test_set$thirty_day_mort
# find optimal value of lambda (alpha=1 => lasso)
cv.out <- cv.glmnet(x_train, y_train, alpha = 1,
family = "binomial",
type.measure = "auc")
cv.out
##
## Call: cv.glmnet(x = x_train, y = y_train, type.measure = "auc", alpha = 1, family = "binomial")
##
## Measure: AUC
##
## Lambda Index Measure SE Nonzero
## min 0.000772 48 0.7776 0.003609 23
## 1se 0.004960 28 0.7741 0.004015 15
# plot(cv.out)
# lambda minimizing cross-validated error
lambda_min <- cv.out$lambda.min
# largest lambda within one standard error of the minimum (1-SE rule)
lambda_1se <- cv.out$lambda.1se
# inspect the coefficients at lambda_1se
coef(cv.out, s = lambda_1se)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -2.51768007
## (Intercept) .
## genderM -0.06460353
## age_hadm 0.38888464
## marital_statusMARRIED .
## marital_statusSINGLE .
## marital_statusWIDOWED .
## ethnicityASIAN .
## ethnicityBLACK/AFRICAN AMERICAN .
## ethnicityHISPANIC/LATINO .
## ethnicityOTHER .
## ethnicityUNABLE TO OBTAIN 0.51133395
## ethnicityUNKNOWN 0.80753765
## ethnicityWHITE .
## lab50912 0.33511706
## lab50971 .
## lab50983 0.06940426
## lab50902 -0.21777511
## lab50882 -0.28241995
## lab51221 .
## lab51301 0.09231811
## lab50931 0.03624130
## lab50960 .
## lab50893 -0.08044644
## chart220045 0.16976916
## chart220181 .
## chart220179 -0.11358277
## chart223761 -0.05585606
## chart220210 0.22863656
# predict 30-day mortality using the model
prob_lasso <- predict(cv.out, newx = x_test, s = lambda_1se, type = "response")
# translate probabilities to predictions
predict_lasso <- ifelse(prob_lasso > 0.5, TRUE, FALSE)
# confusion matrix of predictions versus 30-day mortality in the test set
table(observed = y_test, predicted = predict_lasso)
## predicted
## observed FALSE TRUE
## FALSE 9499 27
## TRUE 1047 40
# compute accuracy
cat('Test accuracy', mean(predict_lasso == test_set$thirty_day_mort))
## Test accuracy 0.8988034
Accuracy by prediction on the test set was also about 90%.
(3) Random forest (randomForest package)
library(randomForest)
# fit the model
model_rf <- randomForest(thirty_day_mort ~ ., data = training_set, ntree = 300)
# predict 30-day mortality using the model
prob_rf <- as.data.frame(predict(model_rf, test_set, type = "prob"))
predict_rf <- predict(model_rf, test_set)
# make confusion matrix
table(observed = test_set$thirty_day_mort, predicted = predict_rf)
## predicted
## observed FALSE TRUE
## FALSE 9497 29
## TRUE 1019 68
# compute accuracy
cat('Test accuracy', mean(predict_rf == test_set$thirty_day_mort))
## Test accuracy 0.9012532
Again, accuracy was around 90%, and it remained essentially unchanged when the number of trees was increased.
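For instance, the claim about the number of trees can be checked by refitting with more trees (a sketch; ntree = 600 is an arbitrary choice, and the refit takes correspondingly longer):
# refit the random forest with twice as many trees and compare test accuracy
model_rf_more <- randomForest(thirty_day_mort ~ ., data = training_set,
                              ntree = 600)
mean(predict(model_rf_more, test_set) == test_set$thirty_day_mort)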
(4) Neural network (keras package)
# install_keras(tensorflow=2.6)
# install_keras()
# virtualenv_create("r-reticulate")
library(keras)
library(tensorflow)
# data preparation for neural network
x_train_array <- array_reshape(x_train, c(nrow(x_train), 28))
x_test_array <- array_reshape(x_test, c(nrow(x_test), 28))
y_train_num <- as.numeric(y_train) - 1
y_test_num <- as.numeric(y_test) - 1
# encode y as binary class matrix
y_train_num <- to_categorical(y_train_num, 2)
## Loaded Tensorflow version 2.6.2
y_test_num <- to_categorical(y_test_num, 2)
# define the model
model_nn <- keras_model_sequential()
model_nn %>%
layer_dense(units = 20, activation = 'relu', input_shape = c(28)) %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 10, activation = 'relu') %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 2, activation = 'softmax')
# layer_dense(input_shape = c(28), units = 2, activation = 'softmax')
summary(model_nn)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_2 (Dense) (None, 20) 580
## ________________________________________________________________________________
## dropout_1 (Dropout) (None, 20) 0
## ________________________________________________________________________________
## dense_1 (Dense) (None, 10) 210
## ________________________________________________________________________________
## dropout (Dropout) (None, 10) 0
## ________________________________________________________________________________
## dense (Dense) (None, 2) 22
## ================================================================================
## Total params: 812
## Trainable params: 812
## Non-trainable params: 0
## ________________________________________________________________________________
# compile the model
model_nn %>% compile(
loss = 'binary_crossentropy',
optimizer = optimizer_rmsprop(),
metrics = c('accuracy')
)
# train and validate
history <- model_nn %>%
fit(
x_train_array,
y_train_num,
epochs = 200,
batch_size = 10000,
validation_split = 0.2
)
# evaluate the model and predict 30-day mortality on the test set
predict_neural <- evaluate(model_nn, x_test_array, y_test_num) %>%
print(width = Inf)
## loss accuracy
## 0.2717495 0.8995572
prob_neural <- predict(model_nn, x_test_array)
The neural network approach also achieved roughly 90% accuracy across the few sequential architectures I tried.
Conclusion:
The ROC curves for the four approaches are shown below. On this data set, the random forest and neural network methods indicate somewhat better predictive performance than the first two: the closer a curve lies to the top-left corner, the better the model. However, the data set may be too simple for these two flexible methods to show their real power. Executing the full multiple imputation strategy, or splitting the data three ways to add a validation set, might further improve accuracy.
# create ROC curve
library(ROCR)
pred1 <- prediction(prob_logstc, y_test)
pred2 <- prediction(prob_lasso, y_test)
pred3 <- prediction(prob_rf[2], y_test)
pred4 <- prediction(prob_neural[, 2], y_test)
perf1 <- performance(pred1, 'tpr', 'fpr')
perf2 <- performance(pred2, 'tpr', 'fpr')
perf3 <- performance(pred3, 'tpr', 'fpr')
perf4 <- performance(pred4, 'tpr', 'fpr')
plot(perf1, col = 4)
plot(perf2, add = TRUE, col = 2)
plot(perf3, add = TRUE, col = 3)
plot(perf4, add = TRUE, col = 5)
title("ROC curves for the four approaches")
abline(a = 0, b = 1, lty = 2)
text(0.18, 0.4, 'Logistic', col = 4)
text(0.4, 0.55, 'Logistic with lasso penalty', col = 2)
text(0.35, 0.95, 'Random forest', col = 3)
text(0.2, 0.8, 'Neural network', col = 5)
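To put numbers on this visual comparison, the area under each ROC curve can be computed from the same ROCR prediction objects (a sketch):
# AUC for each of the four approaches
sapply(list(logistic = pred1, lasso = pred2, rf = pred3, nn = pred4),
       function(p) performance(p, "auc")@y.values[[1]])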