Display machine information for reproducibility:

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.29   R6_2.5.1        jsonlite_1.7.2  magrittr_2.0.1 
##  [5] evaluate_0.14   rlang_0.4.12    stringi_1.7.6   jquerylib_0.1.4
##  [9] bslib_0.3.1     rmarkdown_2.11  tools_3.6.0     stringr_1.4.0  
## [13] xfun_0.29       yaml_2.2.1      fastmap_1.1.0   compiler_3.6.0 
## [17] htmltools_0.5.2 knitr_1.37      sass_0.4.0

Q1. Git/GitHub

  1. Apply for the Student Developer Pack at GitHub using your UCLA email. You’ll get a GitHub Pro account for free (unlimited public and private repositories).
    Done.

  2. Create a private repository biostat-203b-2022-winter and add Hua-Zhou and maschepps as your collaborators with write permission.
    Done.

  3. Top directories of the repository should be hw1, hw2, … Maintain two branches, main and develop. The develop branch will be your main playground, the place where you develop solutions (code) to homework problems and write up reports. The main branch will be your presentation area. Submit your homework files (R markdown file Rmd, html file converted from R markdown, all code and extra data sets to reproduce results) in the main branch.
    Done and understood.

  4. After each homework due date, the teaching assistant and instructor will check out your main branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after the deadline, penalty points will be deducted for late submission.
    Done.

  5. After this course, you can make this repository public and use it to demonstrate your skill sets on the job market.
    Great!

Q2. Data ethics training

This exercise (and later ones in this course) uses the MIMIC-IV data, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. (Hint: The CITI training takes a couple of hours and the PhysioNet credentialing takes a couple of days; do not leave it to the last minute.)

Solution:

Q3. Linux Shell Commands

  1. The /mnt/mimiciv/1.0 folder on the teaching server contains data sets from MIMIC-IV. Refer to the documentation https://mimic.mit.edu/docs/iv/ for details of the data files.

    ls -l /mnt/mimiciv/1.0
    ## total 24
    ## drwxr-xr-x. 2 root root 4096 Jan  4 21:48 core
    ## drwxr-xr-x. 2 root root 4096 Jan  4 21:51 hosp
    ## drwxr-xr-x. 2 root root 4096 Jan  4 21:52 icu
    ## -rw-r--r--. 1 root root  797 Jan  4 21:48 index.html
    ## -rw-r--r--. 1 root root 2518 Jan  4 21:54 LICENSE.txt
    ## -rw-r--r--. 1 root root 2459 Jan  4 21:48 SHA256SUMS.txt

    Please do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. Doing so creates unnecessarily large files on storage and is not a big-data-friendly practice. Just read from the data folder /mnt/mimiciv/1.0 directly in the following exercises.
    Sure.

    Use Bash commands to answer the following questions.

  2. Display the contents in the folders core, hosp, icu.
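    Solution: The three ls calls below list each folder in turn. The output blocks appear in the same order (core, hosp, icu), each starting with a "total" line.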

    ls -l /mnt/mimiciv/1.0/core/
    ls -l /mnt/mimiciv/1.0/hosp/
    ls -l /mnt/mimiciv/1.0/icu/
    ## total 71472
    ## -rw-r--r--. 1 root root 17208966 Jan  4 21:48 admissions.csv.gz
    ## -rw-r--r--. 1 root root      606 Jan  4 21:48 index.html
    ## -rw-r--r--. 1 root root  2955582 Jan  4 21:48 patients.csv.gz
    ## -rw-r--r--. 1 root root 53014503 Jan  4 21:48 transfers.csv.gz
    ## total 4454012
    ## -rw-r--r--. 1 root root     430049 Jan  4 21:51 d_hcpcs.csv.gz
    ## -rw-r--r--. 1 root root   29531152 Jan  4 21:48 diagnoses_icd.csv.gz
    ## -rw-r--r--. 1 root root     863239 Jan  4 21:48 d_icd_diagnoses.csv.gz
    ## -rw-r--r--. 1 root root     579998 Jan  4 21:48 d_icd_procedures.csv.gz
    ## -rw-r--r--. 1 root root      14898 Jan  4 21:48 d_labitems.csv.gz
    ## -rw-r--r--. 1 root root   11684062 Jan  4 21:51 drgcodes.csv.gz
    ## -rw-r--r--. 1 root root  515763427 Jan  4 21:51 emar.csv.gz
    ## -rw-r--r--. 1 root root  476252563 Jan  4 21:48 emar_detail.csv.gz
    ## -rw-r--r--. 1 root root    2098831 Jan  4 21:51 hcpcsevents.csv.gz
    ## -rw-r--r--. 1 root root       2325 Jan  4 21:48 index.html
    ## -rw-r--r--. 1 root root 2091865786 Jan  4 21:50 labevents.csv.gz
    ## -rw-r--r--. 1 root root   99133381 Jan  4 21:48 microbiologyevents.csv.gz
    ## -rw-r--r--. 1 root root  422874088 Jan  4 21:48 pharmacy.csv.gz
    ## -rw-r--r--. 1 root root  501381155 Jan  4 21:51 poe.csv.gz
    ## -rw-r--r--. 1 root root   24020923 Jan  4 21:48 poe_detail.csv.gz
    ## -rw-r--r--. 1 root root  367041717 Jan  4 21:49 prescriptions.csv.gz
    ## -rw-r--r--. 1 root root    7750325 Jan  4 21:48 procedures_icd.csv.gz
    ## -rw-r--r--. 1 root root    9565293 Jan  4 21:48 services.csv.gz
    ## total 2741332
    ## -rw-r--r--. 1 root root 2350783547 Jan  4 21:54 chartevents.csv.gz
    ## -rw-r--r--. 1 root root   43296273 Jan  4 21:52 datetimeevents.csv.gz
    ## -rw-r--r--. 1 root root      55917 Jan  4 21:52 d_items.csv.gz
    ## -rw-r--r--. 1 root root    2848628 Jan  4 21:52 icustays.csv.gz
    ## -rw-r--r--. 1 root root       1103 Jan  4 21:52 index.html
    ## -rw-r--r--. 1 root root  352443512 Jan  4 21:52 inputevents.csv.gz
    ## -rw-r--r--. 1 root root   37095672 Jan  4 21:52 outputevents.csv.gz
    ## -rw-r--r--. 1 root root   20567368 Jan  4 21:52 procedureevents.csv.gz

    Why are these data files distributed as .csv.gz files instead of .csv (comma separated values) files?
    Solution: A .csv.gz file is a gzip-compressed .csv file. Compression reduces the file size, so the same data take less storage space and transfer more quickly over the internet than the uncompressed .csv files would.
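Comment: The size saving is easy to check on a toy file. The sketch below is only an illustration: the file toy.csv and its contents are made up, and the ratio it prints says nothing about the ratio for the real MIMIC-IV files.

```shell
# Toy data only: build a repetitive CSV, compress a copy, compare byte counts.
tmpdir=$(mktemp -d)
csv="$tmpdir/toy.csv"

echo "subject_id,admission_type" > "$csv"
# 5000 made-up rows such as 10000001,ELECTIVE
seq 10000001 10005000 | sed 's/$/,ELECTIVE/' >> "$csv"

gzip -c "$csv" > "$csv.gz"   # -c writes to stdout, so the original file is kept

raw_bytes=$(wc -c < "$csv")
gz_bytes=$(wc -c < "$csv.gz")
echo "raw: $raw_bytes bytes, gzipped: $gz_bytes bytes"

rm -r "$tmpdir"              # clean up the toy files
```

Repetitive tabular text like this compresses well, which is the usual situation for large CSV exports.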

    Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.
    Done.

  3. Briefly describe what bash commands zcat, zless, zmore, and zgrep do.

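    Solution: All four commands operate on a compressed file without first decompressing it to disk.
    zcat, zless, zmore, and zgrep are the counterparts of cat, less, more, and grep: zcat prints the decompressed contents to standard output, zless and zmore page through them, and zgrep searches them for lines matching a pattern.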
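Comment: A small sketch of zcat and zgrep on a made-up file (toy.csv.gz and its rows are assumptions for illustration, not MIMIC-IV data). zless and zmore are interactive pagers, so they appear only in a comment.

```shell
tmpdir=$(mktemp -d)
gz="$tmpdir/toy.csv.gz"

# Create a tiny gzipped CSV directly from a pipe.
printf 'subject_id,insurance\n1,Medicare\n2,Other\n3,Medicare\n' | gzip > "$gz"

zcat "$gz" | head -n 2                    # like cat: stream the decompressed text
n_medicare=$(zgrep -c 'Medicare' "$gz")   # like grep -c: count matching lines
echo "Medicare rows: $n_medicare"

# zless "$gz" and zmore "$gz" would page through the same text interactively.

files=$(ls "$tmpdir")                     # only the .gz file exists on disk
echo "files on disk: $files"

rm -r "$tmpdir"
```

None of these commands writes a decompressed copy next to the .gz file, which is why they suit a shared read-only data folder.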

  4. What’s the output of the following bash script?

    for datafile in /mnt/mimiciv/1.0/core/*.gz
      do
        ls -l $datafile
      done

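    Solution: The loop runs ls -l on each compressed (.gz) file in /mnt/mimiciv/1.0/core/, so the output is a long-format listing of those files, one line per file, each shown with its full path.
    Comment: The following command gives almost the same output; the only difference is that it shows bare file names rather than full paths. The pattern is written as '\.gz$' because an unescaped dot in a grep pattern matches any character.

    ls -l /mnt/mimiciv/1.0/core/ | grep '\.gz$'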
    ## -rw-r--r--. 1 root root 17208966 Jan  4 21:48 admissions.csv.gz
    ## -rw-r--r--. 1 root root  2955582 Jan  4 21:48 patients.csv.gz
    ## -rw-r--r--. 1 root root 53014503 Jan  4 21:48 transfers.csv.gz
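Comment: A quick check, on made-up file names, of how an unescaped dot in a grep pattern differs from an escaped and anchored one. The name notes_ogz.txt is hypothetical and chosen only because it contains "gz" preceded by another character.

```shell
# Three hypothetical file names, one per line.
names=$(printf '%s\n' admissions.csv.gz index.html notes_ogz.txt)

# '.gz' means "any character followed by gz", so it also matches "ogz".
loose=$(printf '%s\n' "$names" | grep -c '.gz')

# '\.gz$' means "a literal dot, then gz, at the end of the line".
strict=$(printf '%s\n' "$names" | grep -c '\.gz$')

echo "loose pattern matches:  $loose"
echo "strict pattern matches: $strict"
```

On the core folder both patterns select the same three files, so the distinction only matters when other names happen to contain "gz".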

    Display the number of lines in each data file using a similar loop.
    Solution: I used echo to print each file name, then counted its lines with zcat and wc -l. Each count includes the header line.

    for datafile in /mnt/mimiciv/1.0/core/*.gz
      do
        echo "$datafile"
        zcat "$datafile" | wc -l
      done
    ## /mnt/mimiciv/1.0/core/admissions.csv.gz
    ## 523741
    ## /mnt/mimiciv/1.0/core/patients.csv.gz
    ## 382279
    ## /mnt/mimiciv/1.0/core/transfers.csv.gz
    ## 2189536

  5. Display the first few lines of admissions.csv.gz.
    Solution:

    zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | head -4
    ## subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,ethnicity,edregtime,edouttime,hospital_expire_flag
    ## 14679932,21038362,2139-09-26 14:16:00,2139-09-28 11:30:00,,ELECTIVE,,HOME,Other,ENGLISH,SINGLE,UNKNOWN,,,0
    ## 15585972,24941086,2123-10-07 23:56:00,2123-10-12 11:22:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0
    ## 11989120,21965160,2147-01-14 09:00:00,2147-01-17 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0

    How many rows are in this data file?
    Solution: The following code skips the header line with tail -n +2 and counts the remaining lines.

    zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | tail -n +2 | wc -l
    ## 523740

    How many unique patients (identified by subject_id) are in this data file?
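    Solution: I offer two answers. The first sorts the data rows and keeps one row per distinct 8-character prefix, which assumes every subject_id is exactly 8 characters wide, as in the rows shown above. The second extracts the first comma-separated field with awk and deduplicates it.

    zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
      tail -n +2 | sort | uniq -w 8 | wc -l
    zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
      awk -F "," 'NR > 1 {print $1 | "sort | uniq"}' | wc -l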
    ## 256878
    ## 256878
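Comment: A third possibility, sketched on a made-up file, is to cut out the first field and deduplicate it with sort -u. The toy IDs below deliberately have different widths to show where a fixed-width comparison (uniq -w, a GNU extension) can overcount. The file and its values are assumptions for illustration only.

```shell
tmpdir=$(mktemp -d)
gz="$tmpdir/toy.csv.gz"

# Toy rows: subject 101 appears twice; IDs are shorter than 8 characters.
printf 'subject_id,hadm_id\n101,1\n102,2\n101,3\n7,4\n' | gzip > "$gz"

# Field-based: skip the header, keep column 1, count distinct values.
n_unique=$(zcat "$gz" | tail -n +2 | cut -d, -f1 | sort -u | wc -l)

# Width-based: comparing the first 8 characters also compares part of hadm_id
# here, so the two rows for subject 101 are counted as different.
n_fixed=$(zcat "$gz" | tail -n +2 | sort | uniq -w 8 | wc -l)

echo "field-based count: $n_unique"
echo "width-based count: $n_fixed"

rm -r "$tmpdir"
```

The two approaches agree whenever every ID has the same width as the comparison window; the field-based version does not depend on that.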
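
  6. What are the possible values taken by each of the variables admission_type, admission_location, insurance, and ethnicity? Also report the count for each unique value of these variables.
    Solution: Here is the code with the resulting counts. Each command prints the column name from the header, then the count of every distinct value. The unlabeled entry under admission_location counts the rows where that field is empty.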

    zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
      awk -F "," 'NR == 1 {print $6}; NR > 1 {print $6 | "sort | uniq -c"}'
    ## admission_type
    ##    7254 AMBULATORY OBSERVATION
    ##   21581 DIRECT EMER.
    ##   19991 DIRECT OBSERVATION
    ##   72072 ELECTIVE
    ##  100445 EU OBSERVATION
    ##  157896 EW EMER.
    ##   55497 OBSERVATION ADMIT
    ##   41074 SURGICAL SAME DAY ADMISSION
    ##   47930 URGENT
    zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
      awk -F "," 'NR == 1 {print $7}; NR > 1 {print $7 | "sort | uniq -c"}'
    ## admission_location
    ##   60435 
    ##     191 AMBULATORY SURGERY TRANSFER
    ##   10670 CLINIC REFERRAL
    ##  245744 EMERGENCY ROOM
    ##     379 INFORMATION NOT AVAILABLE
    ##    4467 INTERNAL TRANSFER TO OR FROM PSYCH
    ##    6067 PACU
    ##  127494 PHYSICIAN REFERRAL
    ##    8449 PROCEDURE SITE
    ##   39121 TRANSFER FROM HOSPITAL
    ##    4063 TRANSFER FROM SKILLED NURSING FACILITY
    ##   16660 WALK-IN/SELF REFERRAL
    zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
      awk -F "," 'NR == 1 {print $9}; NR > 1 {print $9 | "sort | uniq -c"}'
    ## insurance
    ##   50850 Medicaid
    ##  171360 Medicare
    ##  301530 Other
    zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
      awk -F "," 'NR == 1 {print $12}; NR > 1 {print $12 | "sort | uniq -c"}'
    ## ethnicity
    ##    1535 AMERICAN INDIAN/ALASKA NATIVE
    ##   24506 ASIAN
    ##   80293 BLACK/AFRICAN AMERICAN
    ##   29823 HISPANIC/LATINO
    ##   26813 OTHER
    ##    3740 UNABLE TO OBTAIN
    ##   19400 UNKNOWN
    ##  337630 WHITE
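Comment: The four commands differ only in the column number, so they could be folded into one helper that looks the column up by name. This is a minimal sketch: count_values is a hypothetical helper, the toy file is made up, and splitting on every comma assumes no field contains a quoted comma.

```shell
# count_values FILE COLUMN: print "count value" for each distinct value.
count_values() {
  zcat "$1" | awk -F ',' -v col="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    idx     { counts[$idx]++ }           # only runs if the column was found
    END     { for (v in counts) print counts[v], v }
  ' | sort -k2                           # order rows by value, not by count
}

tmpdir=$(mktemp -d)
gz="$tmpdir/toy.csv.gz"
printf 'subject_id,insurance,ethnicity\n1,Medicare,WHITE\n2,Other,ASIAN\n3,Medicare,WHITE\n' \
  | gzip > "$gz"

for col in insurance ethnicity; do
  echo "$col"
  count_values "$gz" "$col"
done

rm -r "$tmpdir"
```

Looking the column up in the header avoids hard-coding positions such as $6 or $12, at the cost of decompressing the file once per column.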

Q5. More fun with Linux