Display machine information:
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.29 R6_2.5.1 jsonlite_1.7.2 magrittr_2.0.1
## [5] evaluate_0.14 stringi_1.7.6 rlang_1.0.1 cli_3.1.0
## [9] rstudioapi_0.13 jquerylib_0.1.4 bslib_0.3.1 rmarkdown_2.11
## [13] tools_3.6.0 stringr_1.4.0 xfun_0.29 yaml_2.2.1
## [17] fastmap_1.1.0 compiler_3.6.0 htmltools_0.5.2 knitr_1.37
## [21] sass_0.4.0
Load the tidyverse frontend and the imputation libraries:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(miceRanger))
Through the Shiny app developed in HW3, we observe abundant missing values in the MIMIC-IV ICU cohort we created. In this question, we use multiple imputation to obtain a data set without missing values.
Read the following tutorials on the R package miceRanger for imputation: https://github.com/farrellday/miceRanger, https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html.
A more thorough treatment of practical imputation strategies is the book Flexible Imputation of Missing Data by Stef van Buuren.
Solution: I read the tutorials on miceRanger and parts of the book by Stef van Buuren.
Solution:
These terms describe the missing data mechanism: MCAR (missing completely at random) means the missingness does not depend on the data at all; MAR (missing at random) means the missingness depends only on observed values; and MNAR (missing not at random) means the missingness depends on the unobserved values themselves.
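As a toy illustration (hypothetical simulated data, not the MIMIC cohort), the three mechanisms can be mimicked by letting the probability that y is missing depend on nothing, on the observed x, or on y itself:
# simulate a toy data set in which y depends on x
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rnorm(n, mean = x)
# MCAR: missingness is independent of the data
y_mcar <- ifelse(runif(n) < 0.2, NA, y)
# MAR: missingness depends only on the observed x
y_mar <- ifelse(runif(n) < plogis(x), NA, y)
# MNAR: missingness depends on the unobserved y itself
y_mnar <- ifelse(runif(n) < plogis(y), NA, y)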
Solution:
While single imputation fills in each missing entry with one imputed value that is then treated as the truth, multiple imputation repeats the imputation procedure to create several completed data sets and integrates the resulting estimates, thereby accounting for the uncertainty due to the missing values. MICE, one of the most widely used methods, starts with random draws from the observed data and then imputes the missing values several times (e.g., 5 or 10 times) by regressing each variable with missing values on the remaining variables, under the assumption that the missing data are MAR.
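A minimal sketch of the chained-equations idea for an all-numeric data frame (illustrative only: miceRanger replaces the linear regressions below with random forests and draws imputations with proper randomness):
# one MICE-style cycle: initialize, then iterate regressions over variables
mice_sketch <- function(df, n_iter = 5) {
  miss <- lapply(df, function(col) which(is.na(col)))
  # initialize each missing entry with a random draw from the observed values
  for (v in names(df)) {
    df[[v]][miss[[v]]] <- sample(na.omit(df[[v]]), length(miss[[v]]),
                                 replace = TRUE)
  }
  for (i in seq_len(n_iter)) {
    for (v in names(df)) {
      if (length(miss[[v]]) == 0) next
      obs <- setdiff(seq_len(nrow(df)), miss[[v]])
      # regress v on the remaining variables using rows where v was observed
      fit <- lm(reformulate(setdiff(names(df), v), response = v),
                data = df[obs, ])
      # refresh the imputed entries of v from the fitted regression
      df[[v]][miss[[v]]] <- predict(fit, newdata = df[miss[[v]], ])
    }
  }
  df
}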
Discard variables with substantial missingness (say, more than 5000 NAs), and replace apparent data entry errors by NAs.
Solution: I deleted the variables with more than 5000 NAs, which were deathtime, edregtime, edouttime, and dod. For the remaining numerical variables, values outside the range [Q1 - 1.5 IQR, Q3 + 1.5 IQR] were replaced with NAs as outliers.
# import ICU cohort data created in HW3
icu_cohort <- read_rds("../hw3/mimiciv_shiny/icu_cohort.rds")
# drop patients aged 18, who were kept in HW3 per its instructions
icu_cohort <- icu_cohort %>%
filter(age_hadm > 18)
# check variables with more than 5000 missing values
icu_cohort %>%
select_if(colSums(is.na(icu_cohort)) > 5000) %>%
colnames()
## [1] "deathtime" "edregtime" "edouttime" "dod"
# keep only variables with at most 5000 missing values
icu_cohort_discard <- icu_cohort %>%
select_if(colSums(is.na(icu_cohort)) <= 5000) %>%
print(width = Inf)
## # A tibble: 53,065 × 39
## subject_id intime hadm_id stay_id first_careunit
## <int> <dttm> <int> <int> <chr>
## 1 12776735 2200-07-12 00:33:00 20817525 34547665 Neuro Stepdown
## 2 16256226 2150-12-20 16:09:08 20013290 39289362 Neuro Stepdown
## 3 12974563 2138-11-13 23:30:01 29618057 32563675 Neuro Stepdown
## 4 14609218 2174-06-28 21:13:00 20606189 34947848 Neuro Stepdown
## 5 12687112 2162-05-31 18:08:45 26132667 37445058 Neuro Stepdown
## 6 18190935 2115-11-12 23:23:32 21137829 34338402 Neuro Stepdown
## 7 10404210 2161-03-03 18:30:00 22880512 30254828 Neuro Stepdown
## 8 12552973 2135-09-21 18:12:03 21320643 30261549 Neuro Stepdown
## 9 14303051 2175-11-20 01:02:00 26559537 30401486 Neuro Stepdown
## 10 12347959 2140-07-11 17:54:14 22779073 30503984 Neuro Stepdown
## last_careunit outtime los admittime
## <chr> <dttm> <dbl> <dttm>
## 1 Neuro Stepdown 2200-07-13 16:44:40 1.67 2200-07-11 22:46:00
## 2 Neuro Stepdown 2150-12-21 14:58:40 0.951 2150-12-20 03:00:00
## 3 Neuro Stepdown 2138-11-15 16:25:19 1.71 2138-11-13 01:07:00
## 4 Neuro Stepdown 2174-07-05 17:01:32 6.83 2174-06-28 20:40:00
## 5 Neuro Stepdown 2162-06-04 10:16:13 3.67 2162-05-31 15:36:00
## 6 Neuro Stepdown 2115-11-15 18:06:26 2.78 2115-11-02 18:38:00
## 7 Neuro Stepdown 2161-03-05 18:10:31 1.99 2161-03-03 17:45:00
## 8 Neuro Stepdown 2135-09-23 12:56:46 1.78 2135-09-21 02:12:00
## 9 Neuro Stepdown 2175-11-21 16:06:17 1.63 2175-11-20 00:08:00
## 10 Neuro Stepdown 2140-07-19 13:01:05 7.80 2140-07-11 17:52:00
## dischtime admission_type admission_location
## <dttm> <chr> <chr>
## 1 2200-07-19 12:00:00 OBSERVATION ADMIT EMERGENCY ROOM
## 2 2150-12-21 14:50:00 SURGICAL SAME DAY ADMISSION PHYSICIAN REFERRAL
## 3 2138-11-15 15:53:00 OBSERVATION ADMIT EMERGENCY ROOM
## 4 2174-07-05 16:45:00 OBSERVATION ADMIT EMERGENCY ROOM
## 5 2162-06-04 10:16:00 OBSERVATION ADMIT PHYSICIAN REFERRAL
## 6 2115-11-26 14:30:00 OBSERVATION ADMIT TRANSFER FROM HOSPITAL
## 7 2161-03-05 18:10:00 OBSERVATION ADMIT EMERGENCY ROOM
## 8 2135-09-23 12:56:00 SURGICAL SAME DAY ADMISSION PHYSICIAN REFERRAL
## 9 2175-11-21 15:30:00 OBSERVATION ADMIT EMERGENCY ROOM
## 10 2140-07-19 13:00:00 URGENT TRANSFER FROM HOSPITAL
## discharge_location insurance language marital_status
## <chr> <chr> <chr> <chr>
## 1 SKILLED NURSING FACILITY Medicare ENGLISH MARRIED
## 2 HOME Other ENGLISH SINGLE
## 3 HOME HEALTH CARE Other ENGLISH WIDOWED
## 4 REHAB Other ENGLISH SINGLE
## 5 HOME Other ENGLISH SINGLE
## 6 CHRONIC/LONG TERM ACUTE CARE Other ENGLISH DIVORCED
## 7 HOME Other ENGLISH MARRIED
## 8 HOME Other ENGLISH MARRIED
## 9 HOME Other ENGLISH SINGLE
## 10 REHAB Other ENGLISH MARRIED
## ethnicity hospital_expire_flag gender anchor_age anchor_year
## <chr> <int> <chr> <int> <int>
## 1 OTHER 0 M 72 2192
## 2 OTHER 0 F 49 2150
## 3 WHITE 0 F 72 2138
## 4 WHITE 0 F 69 2174
## 5 BLACK/AFRICAN AMERICAN 0 M 57 2156
## 6 WHITE 0 F 57 2115
## 7 HISPANIC/LATINO 0 F 46 2161
## 8 WHITE 0 F 48 2131
## 9 WHITE 0 F 24 2175
## 10 WHITE 0 F 49 2140
## anchor_year_group age_hadm lab50960 lab51221 lab50971 lab50983 lab51301
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2008 - 2010 80 1.5 30.6 3.3 130 7.8
## 2 2017 - 2019 49 1.8 30.5 4 141 3.5
## 3 2014 - 2016 72 1.9 41.6 4.4 139 10.6
## 4 2017 - 2019 69 1.7 36 4.6 134 12.1
## 5 2011 - 2013 63 1.4 30 5 143 5.4
## 6 2017 - 2019 57 1.6 31.5 4.3 138 9.2
## 7 2017 - 2019 46 2.2 34.1 4.1 138 3.6
## 8 2011 - 2013 52 1.9 37.1 4 145 14.7
## 9 2014 - 2016 24 2 34.7 4.3 139 7.1
## 10 2017 - 2019 49 1.8 39.6 3.8 137 20.1
## lab50882 lab50893 lab50902 lab50912 lab50931 chart223761 chart220179
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22 8.2 97 1 113 102. 98
## 2 21 8.1 108 0.5 99 98.4 111
## 3 16 8.9 104 1 78 98.7 183
## 4 23 9.2 98 0.5 106 98.3 162
## 5 19 8.4 105 8.2 127 98 165
## 6 26 9.2 98 0.4 135 98.6 132
## 7 26 9 104 0.8 99 98.4 114
## 8 22 8.3 106 0.8 157 98.4 144
## 9 24 9.2 105 0.8 85 98.6 120
## 10 20 9.6 101 0.7 98 99.5 140
## chart220181 chart220210 chart220045 thirty_day_mort
## <dbl> <dbl> <dbl> <lgl>
## 1 86 16 78 FALSE
## 2 72 5 91 FALSE
## 3 107 16 67 FALSE
## 4 99 12 74 FALSE
## 5 95 26 95 FALSE
## 6 97 23 96 FALSE
## 7 79 18 89 FALSE
## 8 99 23 71 FALSE
## 9 85 17 67 FALSE
## 10 117 12 72 FALSE
## # … with 53,055 more rows
# define a function that replaces outliers with `NA`s using the 1.5 IQR rule
outlier_to_na <- function(x) {
qnt <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
H <- 1.5 * IQR(x, na.rm = TRUE)
x[x < (qnt[1] - H)] <- NA
x[x > (qnt[2] + H)] <- NA
x
}
# replace outliers with `NA`s for numeric variables
icu_cohort_replace <- icu_cohort_discard %>%
mutate_if(is.numeric, outlier_to_na)
summary(icu_cohort_replace)
## subject_id intime hadm_id
## Min. :10000032 Min. :2110-01-11 10:16:06 Min. :20000094
## 1st Qu.:12483898 1st Qu.:2132-11-05 21:29:00 1st Qu.:22478093
## Median :14991424 Median :2152-09-29 21:13:55 Median :24947644
## Mean :14989837 Mean :2152-10-28 03:05:41 Mean :24975852
## 3rd Qu.:17495811 3rd Qu.:2172-11-04 17:18:32 3rd Qu.:27477277
## Max. :19999987 Max. :2211-05-01 06:59:19 Max. :29999828
##
## stay_id first_careunit last_careunit
## Min. :30000153 Length:53065 Length:53065
## 1st Qu.:32467111 Class :character Class :character
## Median :34987214 Mode :character Mode :character
## Mean :34984050
## 3rd Qu.:37487113
## Max. :39999810
##
## outtime los admittime
## Min. :2110-01-12 17:17:47 Min. :0.001 Min. :2110-01-11 10:14:00
## 1st Qu.:2132-11-07 16:56:01 1st Qu.:1.035 1st Qu.:2132-11-03 21:46:00
## Median :2152-10-03 19:16:01 Median :1.696 Median :2152-09-28 20:40:00
## Mean :2152-10-31 11:49:07 Mean :2.141 Mean :2152-10-27 02:08:14
## 3rd Qu.:2172-11-06 10:29:59 3rd Qu.:2.868 3rd Qu.:2172-11-04 12:57:00
## Max. :2211-05-10 22:51:06 Max. :7.227 Max. :2211-05-01 06:57:00
## NA's :5265
## dischtime admission_type admission_location
## Min. :2110-01-15 17:31:00 Length:53065 Length:53065
## 1st Qu.:2132-11-14 16:40:00 Class :character Class :character
## Median :2152-10-07 17:00:00 Mode :character Mode :character
## Mean :2152-11-05 12:22:14
## 3rd Qu.:2172-11-10 14:55:00
## Max. :2211-05-12 17:54:00
##
## discharge_location insurance language marital_status
## Length:53065 Length:53065 Length:53065 Length:53065
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## ethnicity hospital_expire_flag gender anchor_age
## Length:53065 Min. :0 Length:53065 Min. :18.00
## Class :character 1st Qu.:0 Class :character 1st Qu.:53.00
## Mode :character Median :0 Mode :character Median :65.00
## Mean :0 Mean :63.59
## 3rd Qu.:0 3rd Qu.:77.00
## Max. :0 Max. :91.00
## NA's :5514
## anchor_year anchor_year_group age_hadm lab50960
## Min. :2109 Length:53065 Min. : 19.00 Min. :1.000
## 1st Qu.:2131 Class :character 1st Qu.: 54.00 1st Qu.:1.700
## Median :2151 Mode :character Median : 66.00 Median :1.900
## Mean :2151 Mean : 64.54 Mean :1.957
## 3rd Qu.:2171 3rd Qu.: 78.00 3rd Qu.:2.200
## Max. :2207 Max. :102.00 Max. :2.900
## NA's :4541
## lab51221 lab50971 lab50983 lab51301 lab50882
## Min. :14.40 Min. :2.60 Min. :129.0 Min. : 0.00 Min. :15.00
## 1st Qu.:28.10 1st Qu.:3.70 1st Qu.:136.0 1st Qu.: 7.50 1st Qu.:21.00
## Median :32.60 Median :4.10 Median :139.0 Median :10.30 Median :23.00
## Mean :32.69 Mean :4.11 Mean :138.7 Mean :10.97 Mean :23.06
## 3rd Qu.:37.20 3rd Qu.:4.50 3rd Qu.:141.0 3rd Qu.:13.90 3rd Qu.:25.00
## Max. :51.00 Max. :5.60 Max. :148.0 Max. :24.80 Max. :31.00
## NA's :2261 NA's :3669 NA's :4107 NA's :4149 NA's :4954
## lab50893 lab50902 lab50912 lab50931
## Min. : 6.600 Min. : 91.0 Min. :0.000 Min. : 24.0
## 1st Qu.: 7.900 1st Qu.:101.0 1st Qu.:0.700 1st Qu.:102.0
## Median : 8.400 Median :105.0 Median :0.900 Median :121.0
## Mean : 8.358 Mean :104.8 Mean :0.958 Mean :127.3
## 3rd Qu.: 8.800 3rd Qu.:108.0 3rd Qu.:1.100 3rd Qu.:147.0
## Max. :10.100 Max. :118.0 Max. :2.200 Max. :235.0
## NA's :6062 NA's :3559 NA's :6736 NA's :5444
## chart223761 chart220179 chart220181 chart220210
## Min. : 96.00 Min. : 57.0 Min. : 36.0 Min. : 5.00
## 1st Qu.: 97.70 1st Qu.:106.0 1st Qu.: 70.0 1st Qu.:15.00
## Median : 98.10 Median :122.0 Median : 80.0 Median :18.00
## Mean : 98.15 Mean :122.9 Mean : 81.5 Mean :18.23
## 3rd Qu.: 98.60 3rd Qu.:139.0 3rd Qu.: 92.0 3rd Qu.:21.00
## Max. :100.30 Max. :188.0 Max. :127.0 Max. :32.00
## NA's :4879 NA's :1262 NA's :1685 NA's :1685
## chart220045 thirty_day_mort
## Min. : 37.00 Mode :logical
## 1st Qu.: 74.00 FALSE:47630
## Median : 85.00 TRUE :5435
## Mean : 86.49
## 3rd Qu.: 98.00
## Max. :136.00
## NA's :986
Impute missing values using miceRanger (request \(m=3\) data sets). This step is computationally intensive. Make sure to save the imputation results as a file. Hint: setting max.depth=10 in the miceRanger function may cut some computing time.
Solution: I kept only the variables of interest and used miceRanger to create three imputed data sets, saved to icu_cohort_imputed.rds.
# choose variables of interest
demo_var <- c("gender", "age_hadm", "marital_status", "ethnicity",
"thirty_day_mort")
lab_item <- c(50912, 50971, 50983, 50902, 50882,
51221, 51301, 50931, 50960, 50893)
lab_item <- paste("lab", lab_item, sep = "")
vital_item <- c(220045, 220181, 220179, 223761, 220210)
vital_item <- paste("chart", vital_item, sep = "")
# merge all variables
all_vars <- c(demo_var, lab_item, vital_item)
# keep only the variables we plan to use
icu_cohort_mice <- icu_cohort_replace %>%
select(all_of(all_vars))
summary(icu_cohort_mice)
## gender age_hadm marital_status ethnicity
## Length:53065 Min. : 19.00 Length:53065 Length:53065
## Class :character 1st Qu.: 54.00 Class :character Class :character
## Mode :character Median : 66.00 Mode :character Mode :character
## Mean : 64.54
## 3rd Qu.: 78.00
## Max. :102.00
##
## thirty_day_mort lab50912 lab50971 lab50983 lab50902
## Mode :logical Min. :0.000 Min. :2.60 Min. :129.0 Min. : 91.0
## FALSE:47630 1st Qu.:0.700 1st Qu.:3.70 1st Qu.:136.0 1st Qu.:101.0
## TRUE :5435 Median :0.900 Median :4.10 Median :139.0 Median :105.0
## Mean :0.958 Mean :4.11 Mean :138.7 Mean :104.8
## 3rd Qu.:1.100 3rd Qu.:4.50 3rd Qu.:141.0 3rd Qu.:108.0
## Max. :2.200 Max. :5.60 Max. :148.0 Max. :118.0
## NA's :6736 NA's :3669 NA's :4107 NA's :3559
## lab50882 lab51221 lab51301 lab50931
## Min. :15.00 Min. :14.40 Min. : 0.00 Min. : 24.0
## 1st Qu.:21.00 1st Qu.:28.10 1st Qu.: 7.50 1st Qu.:102.0
## Median :23.00 Median :32.60 Median :10.30 Median :121.0
## Mean :23.06 Mean :32.69 Mean :10.97 Mean :127.3
## 3rd Qu.:25.00 3rd Qu.:37.20 3rd Qu.:13.90 3rd Qu.:147.0
## Max. :31.00 Max. :51.00 Max. :24.80 Max. :235.0
## NA's :4954 NA's :2261 NA's :4149 NA's :5444
## lab50960 lab50893 chart220045 chart220181
## Min. :1.000 Min. : 6.600 Min. : 37.00 Min. : 36.0
## 1st Qu.:1.700 1st Qu.: 7.900 1st Qu.: 74.00 1st Qu.: 70.0
## Median :1.900 Median : 8.400 Median : 85.00 Median : 80.0
## Mean :1.957 Mean : 8.358 Mean : 86.49 Mean : 81.5
## 3rd Qu.:2.200 3rd Qu.: 8.800 3rd Qu.: 98.00 3rd Qu.: 92.0
## Max. :2.900 Max. :10.100 Max. :136.00 Max. :127.0
## NA's :4541 NA's :6062 NA's :986 NA's :1685
## chart220179 chart223761 chart220210
## Min. : 57.0 Min. : 96.00 Min. : 5.00
## 1st Qu.:106.0 1st Qu.: 97.70 1st Qu.:15.00
## Median :122.0 Median : 98.10 Median :18.00
## Mean :122.9 Mean : 98.15 Mean :18.23
## 3rd Qu.:139.0 3rd Qu.: 98.60 3rd Qu.:21.00
## Max. :188.0 Max. :100.30 Max. :32.00
## NA's :1262 NA's :4879 NA's :1685
if (file.exists("icu_cohort_imputed.rds")) {
icu_cohort_imputed <- read_rds("icu_cohort_imputed.rds")
} else {
parTime <- system.time(
icu_cohort_imputed <- miceRanger(
icu_cohort_mice,
m = 3,
max.depth = 10,
returnModels = FALSE,
verbose = TRUE
)
)
icu_cohort_imputed %>%
write_rds("icu_cohort_imputed.rds")
}
Solution: I output some imputation diagnostic plots to check the validity of the imputed distributions.
Distribution of Imputed Values:
These plots compare the imputed distributions (black lines) to the original distribution (red line) for each variable. Most variables are matched reasonably well. A few, such as 50983 (sodium), 220045 (heart rate), and 220181 (mean non-invasive blood pressure), do not match well, which suggests these data may be MNAR, i.e., the missingness depends on the unobserved value itself:
plotDistributions(icu_cohort_imputed, vars = 'allNumeric')
Model OOB Error: We can examine the out-of-bag R-squared of the regression models by iteration. Apart from the last three vitals, the models appear to converge over the iterations.
plotModelError(icu_cohort_imputed,vars = 'allNumeric')
Variable Importance: The plot below shows which variables were used to impute each variable. Overall, each lab measurement was important for imputing the other lab measurements, and likewise for the vitals. In addition, gender and age at admission were important for 50912 (creatinine).
plotVarImportance(icu_cohort_imputed)
# plotCorrelations(icu_cohort_imputed, vars = 'allNumeric')
# plotVarConvergence(icu_cohort_imputed, vars = 'allNumeric')
# plotImputationVariance(icu_cohort_imputed, ncol = 2, widths = c(5,3))
Solution:
The correct multiple imputation strategy is to analyze all \(m\) imputed data sets and pool the \(m\) parameter estimates given by miceRanger, accounting for the uncertainty (the variation) among the imputed values: the pooled variance combines the within-imputation and the between-imputation variance. The pooled estimates are unbiased and have valid statistical properties, unlike estimates from a single imputed data set or from an average of the multiple imputed data sets.
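For a single scalar parameter, the pooling looks as follows under Rubin's rules (a sketch; pool_rubin is a hypothetical helper, not a miceRanger function, and the commented usage assumes the completed data sets obtained from completeData() below):
# Rubin's rules: pooled estimate is the mean of the m estimates; the
# total variance is within-imputation + (1 + 1/m) * between-imputation
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  total_var <- mean(variances) + (1 + 1 / m) * var(estimates)
  c(estimate = mean(estimates), se = sqrt(total_var))
}
# e.g., pooling one glm coefficient across the three completed data sets:
# fits <- lapply(completeData(icu_cohort_imputed), function(d)
#   glm(thirty_day_mort ~ ., data = d, family = binomial))
# pool_rubin(sapply(fits, function(f) coef(f)["age_hadm"]),
#            sapply(fits, function(f) vcov(f)["age_hadm", "age_hadm"]))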
# complete imputed data sets
icu_cohort_complete <- completeData(icu_cohort_imputed)
# choose one of the imputed data sets
icu_cohort_choice <- icu_cohort_complete[[1]]
Data preparation:
# convert non-numeric variables to factors and scale numeric variables
icu_cohort_factor <- icu_cohort_choice %>%
mutate_if(is.character, as.factor) %>%
mutate_if(is.logical, as.factor) %>%
mutate_if(is.numeric, scale)
Develop at least two analytic approaches for predicting the 30-day mortality of patients admitted to ICU using demographic information (gender, age, marital status, ethnicity), first lab measurements during ICU stay, and first vital measurements during ICU stay. For example, you can use (1) logistic regression (glm()
function in base R or keras), (2) logistic regression with lasso penalty (glmnet or keras package), (3) random forest (randomForest package), or (4) neural network (keras package).
Solution:
I used the caret package to split the data into an 80% training set and a 20% test set; the caTools package could also be used.
library(caret)
set.seed(3456)
index <- createDataPartition(icu_cohort_factor$thirty_day_mort,
p = .8, list = FALSE)
training_set <- icu_cohort_factor[index, ]
test_set <- icu_cohort_factor[-index, ]
# library(caTools)
# split <- sample.split(icu_cohort_factor$thirty_day_mort, SplitRatio = 0.8)
# training_set <- subset(icu_cohort_factor, split == TRUE)
# test_set <- subset(icu_cohort_factor, split == FALSE)
Solution:
(1) Logistic regression, using glm().
# fit the model
logstc_model <- glm(thirty_day_mort ~ ., training_set, family = binomial)
summary(logstc_model)
##
## Call:
## glm(formula = thirty_day_mort ~ ., family = binomial, data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1545 -0.4615 -0.3247 -0.2273 3.5136
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.777130 0.445742 -6.230 4.65e-10 ***
## genderM -0.177084 0.037756 -4.690 2.73e-06 ***
## age_hadm 0.462844 0.022221 20.829 < 2e-16 ***
## marital_statusMARRIED -0.013586 0.068560 -0.198 0.8429
## marital_statusSINGLE 0.019426 0.072740 0.267 0.7894
## marital_statusWIDOWED 0.012841 0.077400 0.166 0.8682
## ethnicityASIAN 0.324664 0.453196 0.716 0.4738
## ethnicityBLACK/AFRICAN AMERICAN 0.058568 0.446048 0.131 0.8955
## ethnicityHISPANIC/LATINO 0.111799 0.455235 0.246 0.8060
## ethnicityOTHER 0.289986 0.449356 0.645 0.5187
## ethnicityUNABLE TO OBTAIN 1.166157 0.458838 2.542 0.0110 *
## ethnicityUNKNOWN 1.120795 0.443595 2.527 0.0115 *
## ethnicityWHITE 0.174367 0.442253 0.394 0.6934
## lab50912 0.336277 0.017251 19.493 < 2e-16 ***
## lab50971 0.001927 0.017504 0.110 0.9124
## lab50983 0.282916 0.022668 12.481 < 2e-16 ***
## lab50902 -0.453783 0.024530 -18.499 < 2e-16 ***
## lab50882 -0.380214 0.019694 -19.306 < 2e-16 ***
## lab51221 -0.023447 0.018573 -1.262 0.2068
## lab51301 0.121767 0.016924 7.195 6.25e-13 ***
## lab50931 0.074879 0.016548 4.525 6.04e-06 ***
## lab50960 0.044650 0.017751 2.515 0.0119 *
## lab50893 -0.170969 0.019541 -8.749 < 2e-16 ***
## chart220045 0.213008 0.018404 11.574 < 2e-16 ***
## chart220181 0.019236 0.026263 0.732 0.4639
## chart220179 -0.179341 0.025768 -6.960 3.41e-12 ***
## chart223761 -0.121512 0.017662 -6.880 5.99e-12 ***
## chart220210 0.248065 0.017843 13.903 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28050 on 42451 degrees of freedom
## Residual deviance: 23784 on 42424 degrees of freedom
## AIC: 23840
##
## Number of Fisher Scoring iterations: 6
# predict 30-day mortality using the model
prob_logstc <- predict(logstc_model, test_set, type = "response")
# translate probabilities to predictions by threshold = 0.5
predict_logstc <- ifelse(prob_logstc > 0.5, TRUE, FALSE)
# make confusion matrix
table(observed = test_set$thirty_day_mort, predicted = predict_logstc)
## predicted
## observed FALSE TRUE
## FALSE 9455 71
## TRUE 996 91
# compute accuracy
cat('Test accuracy', mean(predict_logstc == test_set$thirty_day_mort))
## Test accuracy 0.8994629
# confusionMatrix(as.factor(predict_logstc),
# as.factor(test_set$thirty_day_mort))
Prediction accuracy on the test set was about 90%, and it changed little when the threshold was raised further.
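As a quick check of that claim, a small sketch reusing prob_logstc and test_set from above to compute accuracy over a grid of thresholds:
# test accuracy of the logistic model at several classification thresholds
sapply(c(0.3, 0.4, 0.5, 0.6, 0.7), function(t)
  mean((prob_logstc > t) == test_set$thirty_day_mort))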
(2) Logistic regression with lasso penalty (glmnet package)
library(glmnet)
# convert x to matrix and choose y
x_train <- model.matrix(thirty_day_mort ~ ., training_set)
x_test <- model.matrix(thirty_day_mort ~ ., test_set)
y_train <- training_set$thirty_day_mort
y_test <- test_set$thirty_day_mort
# find optimal value of lambda (alpha=1 => lasso)
cv.out <- cv.glmnet(x_train, y_train, alpha = 1,
family = "binomial",
type.measure = "auc")
cv.out
##
## Call: cv.glmnet(x = x_train, y = y_train, type.measure = "auc", alpha = 1, family = "binomial")
##
## Measure: AUC
##
## Lambda Index Measure SE Nonzero
## min 0.000772 48 0.7776 0.003609 23
## 1se 0.004960 28 0.7741 0.004015 15
# plot(cv.out)
# lambda minimizing cross-validated error
lambda_min <- cv.out$lambda.min
# largest lambda within one standard error of the minimum (1-SE rule)
lambda_1se <- cv.out$lambda.1se
# inspect the coefficients at lambda_1se
coef(cv.out, s = lambda_1se)
## 29 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -2.51768007
## (Intercept) .
## genderM -0.06460353
## age_hadm 0.38888464
## marital_statusMARRIED .
## marital_statusSINGLE .
## marital_statusWIDOWED .
## ethnicityASIAN .
## ethnicityBLACK/AFRICAN AMERICAN .
## ethnicityHISPANIC/LATINO .
## ethnicityOTHER .
## ethnicityUNABLE TO OBTAIN 0.51133395
## ethnicityUNKNOWN 0.80753765
## ethnicityWHITE .
## lab50912 0.33511706
## lab50971 .
## lab50983 0.06940426
## lab50902 -0.21777511
## lab50882 -0.28241995
## lab51221 .
## lab51301 0.09231811
## lab50931 0.03624130
## lab50960 .
## lab50893 -0.08044644
## chart220045 0.16976916
## chart220181 .
## chart220179 -0.11358277
## chart223761 -0.05585606
## chart220210 0.22863656
# predict 30-day mortality using the model
prob_lasso <- predict(cv.out, newx = x_test, s = lambda_1se, type = "response")
# translate probabilities to predictions
predict_lasso <- ifelse(prob_lasso > 0.5, TRUE, FALSE)
# confusion matrix of predictions versus 30-day mortality in the test set
table(observed = y_test, predicted = predict_lasso)
## predicted
## observed FALSE TRUE
## FALSE 9499 27
## TRUE 1047 40
# compute accuracy
cat('Test accuracy', mean(predict_lasso == test_set$thirty_day_mort))
## Test accuracy 0.8988034
Accuracy by prediction on the test set was also about 90%.
(3) Random forest (randomForest package)
library(randomForest)
# fit the model
model_rf <- randomForest(thirty_day_mort ~ ., data = training_set, ntree = 300)
# predict 30-day mortality using the model
prob_rf <- as.data.frame(predict(model_rf, test_set, type = "prob"))
predict_rf <- predict(model_rf, test_set)
# make confusion matrix
table(observed = test_set$thirty_day_mort, predicted = predict_rf)
## predicted
## observed FALSE TRUE
## FALSE 9497 29
## TRUE 1019 68
# compute accuracy
cat('Test accuracy', mean(predict_rf == test_set$thirty_day_mort))
## Test accuracy 0.9012532
Again, accuracy was around 90%, and it remained essentially unchanged when the number of trees was increased.
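For instance, the claim about the number of trees can be checked by refitting with more trees (a sketch; ntree = 600 is an arbitrary choice, and the refit takes correspondingly longer):
# refit the random forest with twice as many trees and compare test accuracy
model_rf_more <- randomForest(thirty_day_mort ~ ., data = training_set,
                              ntree = 600)
mean(predict(model_rf_more, test_set) == test_set$thirty_day_mort)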
(4) Neural network (keras package)
# install_keras(tensorflow=2.6)
# install_keras()
# virtualenv_create("r-reticulate")
library(keras)
library(tensorflow)
# data preparation for neural network
x_train_array <- array_reshape(x_train, c(nrow(x_train), 28))
x_test_array <- array_reshape(x_test, c(nrow(x_test), 28))
y_train_num <- as.numeric(y_train) - 1
y_test_num <- as.numeric(y_test) - 1
# encode y as binary class matrix
y_train_num <- to_categorical(y_train_num, 2)
## Loaded Tensorflow version 2.6.2
y_test_num <- to_categorical(y_test_num, 2)
# define the model
model_nn <- keras_model_sequential()
model_nn %>%
layer_dense(units = 20, activation = 'relu', input_shape = c(28)) %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 10, activation = 'relu') %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 2, activation = 'softmax')
# layer_dense(input_shape = c(28), units = 2, activation = 'softmax')
summary(model_nn)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_2 (Dense) (None, 20) 580
## ________________________________________________________________________________
## dropout_1 (Dropout) (None, 20) 0
## ________________________________________________________________________________
## dense_1 (Dense) (None, 10) 210
## ________________________________________________________________________________
## dropout (Dropout) (None, 10) 0
## ________________________________________________________________________________
## dense (Dense) (None, 2) 22
## ================================================================================
## Total params: 812
## Trainable params: 812
## Non-trainable params: 0
## ________________________________________________________________________________
# compile the model
model_nn %>% compile(
loss = 'binary_crossentropy',
optimizer = optimizer_rmsprop(),
metrics = c('accuracy')
)
# train and validate
history <- model_nn %>%
fit(
x_train_array,
y_train_num,
epochs = 200,
batch_size = 10000,
validation_split = 0.2
)
# evaluate the model and predict 30-day mortality on the test set
predict_neural <- evaluate(model_nn, x_test_array, y_test_num) %>%
print(width = Inf)
## loss accuracy
## 0.2717495 0.8995572
prob_neural <- predict(model_nn, x_test_array)
The neural network approach also achieved roughly 90% accuracy across the few sequential architectures I tried.
Conclusion:
The ROC curves for the four approaches are shown below. On this data set, the random forest and neural network methods indicate somewhat better predictive performance than the first two: the closer a curve lies to the top-left corner, the better the model. However, the data set may be too simple for these two flexible methods to show their real power. Executing the full multiple imputation strategy, or splitting the data three ways to add a validation set, might further improve accuracy.
# create ROC curve
library(ROCR)
pred1 <- prediction(prob_logstc, y_test)
pred2 <- prediction(prob_lasso, y_test)
pred3 <- prediction(prob_rf[2], y_test)
pred4 <- prediction(prob_neural[, 2], y_test)
perf1 <- performance(pred1, 'tpr', 'fpr')
perf2 <- performance(pred2, 'tpr', 'fpr')
perf3 <- performance(pred3, 'tpr', 'fpr')
perf4 <- performance(pred4, 'tpr', 'fpr')
plot(perf1, col = 4)
plot(perf2, add = TRUE, col = 2)
plot(perf3, add = TRUE, col = 3)
plot(perf4, add = TRUE, col = 5)
title("ROC curves for the four approaches")
abline(a = 0, b = 1, lty = 2)
text(0.18, 0.4, 'Logistic', col = 4)
text(0.4, 0.55, 'Logistic with lasso penalty', col = 2)
text(0.35, 0.95, 'Random forest', col = 3)
text(0.2, 0.8, 'Neural network', col = 5)
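To put numbers on this visual comparison, the area under each ROC curve can be computed from the same ROCR prediction objects (a sketch):
# AUC for each of the four approaches
sapply(list(logistic = pred1, lasso = pred2, rf = pred3, nn = pred4),
       function(p) performance(p, "auc")@y.values[[1]])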