Biostat 203B Homework 1

Display machine information for reproducibility:

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.29   R6_2.5.1        jsonlite_1.7.2  magrittr_2.0.1 
##  [5] evaluate_0.14   rlang_0.4.12    stringi_1.7.6   jquerylib_0.1.4
##  [9] bslib_0.3.1     rmarkdown_2.11  tools_3.6.0     stringr_1.4.0  
## [13] xfun_0.29       yaml_2.2.1      fastmap_1.1.0   compiler_3.6.0 
## [17] htmltools_0.5.2 knitr_1.37      sass_0.4.0

Q1. Git/GitHub

Apply for the Student Developer Pack at GitHub using your UCLA email. You’ll get GitHub Pro account for free (unlimited public and private repositories).
Done.
Create a private repository biostat-203b-2022-winter and add Hua-Zhou and maschepps as your collaborators with write permission.
Done.
Top directories of the repository should be hw1, hw2, … Maintain two branches main and develop. The develop branch will be your main playground, the place where you develop solution (code) to homework problems and write up report. The main branch will be your presentation area. Submit your homework files (R markdown file Rmd, html file converted from R markdown, all code and extra data sets to reproduce results) in main branch.
Done and understood.
After each homework due date, teaching assistant and instructor will check out your main branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after deadline, penalty points will be deducted for late submission.
Done.
After this course, you can make this repository public and use it to demonstrate your skill sets on job market.
Great!

Q2. Data ethics training

This exercise (and later in this course) uses the MIMIC-IV data, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. (Hint: The CITI training takes a couple hours and the PhysioNet credentialing takes a couple days; do not leave it to the last minute.)

Solution:

My completion report
https://www.citiprogram.org/verify/?kf7337c50-ab63-4a0b-b730-f94965ea93e0-46526237
My completion certificate
https://www.citiprogram.org/verify/?w5ed2f46d-5069-4c33-ad06-a8e12eeb07c0-46526237

Q3. Linux Shell Commands

The /mnt/mimiciv/1.0 folder on teaching server contains data sets from MIMIC-IV. Refer to the documentation https://mimic.mit.edu/docs/iv/ for details of data files.
```
ls -l /mnt/mimiciv/1.0
```
```
## total 24
## drwxr-xr-x. 2 root root 4096 Jan  4 21:48 core
## drwxr-xr-x. 2 root root 4096 Jan  4 21:51 hosp
## drwxr-xr-x. 2 root root 4096 Jan  4 21:52 icu
## -rw-r--r--. 1 root root  797 Jan  4 21:48 index.html
## -rw-r--r--. 1 root root 2518 Jan  4 21:54 LICENSE.txt
## -rw-r--r--. 1 root root 2459 Jan  4 21:48 SHA256SUMS.txt
```
Please, do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. These create unnecessary big files on storage and are not big data friendly practices. Just read from the data folder /mnt/mimiciv/1.0 directly in following exercises.
Sure.

Use Bash commands to answer following questions.

Display the contents in the folders core, hosp, icu.
Solution: Here is the code to display them.

ls -l /mnt/mimiciv/1.0/core/
ls -l /mnt/mimiciv/1.0/hosp/
ls -l /mnt/mimiciv/1.0/icu/

## total 71472
## -rw-r--r--. 1 root root 17208966 Jan  4 21:48 admissions.csv.gz
## -rw-r--r--. 1 root root      606 Jan  4 21:48 index.html
## -rw-r--r--. 1 root root  2955582 Jan  4 21:48 patients.csv.gz
## -rw-r--r--. 1 root root 53014503 Jan  4 21:48 transfers.csv.gz
## total 4454012
## -rw-r--r--. 1 root root     430049 Jan  4 21:51 d_hcpcs.csv.gz
## -rw-r--r--. 1 root root   29531152 Jan  4 21:48 diagnoses_icd.csv.gz
## -rw-r--r--. 1 root root     863239 Jan  4 21:48 d_icd_diagnoses.csv.gz
## -rw-r--r--. 1 root root     579998 Jan  4 21:48 d_icd_procedures.csv.gz
## -rw-r--r--. 1 root root      14898 Jan  4 21:48 d_labitems.csv.gz
## -rw-r--r--. 1 root root   11684062 Jan  4 21:51 drgcodes.csv.gz
## -rw-r--r--. 1 root root  515763427 Jan  4 21:51 emar.csv.gz
## -rw-r--r--. 1 root root  476252563 Jan  4 21:48 emar_detail.csv.gz
## -rw-r--r--. 1 root root    2098831 Jan  4 21:51 hcpcsevents.csv.gz
## -rw-r--r--. 1 root root       2325 Jan  4 21:48 index.html
## -rw-r--r--. 1 root root 2091865786 Jan  4 21:50 labevents.csv.gz
## -rw-r--r--. 1 root root   99133381 Jan  4 21:48 microbiologyevents.csv.gz
## -rw-r--r--. 1 root root  422874088 Jan  4 21:48 pharmacy.csv.gz
## -rw-r--r--. 1 root root  501381155 Jan  4 21:51 poe.csv.gz
## -rw-r--r--. 1 root root   24020923 Jan  4 21:48 poe_detail.csv.gz
## -rw-r--r--. 1 root root  367041717 Jan  4 21:49 prescriptions.csv.gz
## -rw-r--r--. 1 root root    7750325 Jan  4 21:48 procedures_icd.csv.gz
## -rw-r--r--. 1 root root    9565293 Jan  4 21:48 services.csv.gz
## total 2741332
## -rw-r--r--. 1 root root 2350783547 Jan  4 21:54 chartevents.csv.gz
## -rw-r--r--. 1 root root   43296273 Jan  4 21:52 datetimeevents.csv.gz
## -rw-r--r--. 1 root root      55917 Jan  4 21:52 d_items.csv.gz
## -rw-r--r--. 1 root root    2848628 Jan  4 21:52 icustays.csv.gz
## -rw-r--r--. 1 root root       1103 Jan  4 21:52 index.html
## -rw-r--r--. 1 root root  352443512 Jan  4 21:52 inputevents.csv.gz
## -rw-r--r--. 1 root root   37095672 Jan  4 21:52 outputevents.csv.gz
## -rw-r--r--. 1 root root   20567368 Jan  4 21:52 procedureevents.csv.gz

Why are these data files distributed as .csv.gz files instead of .csv (comma separated values) files?
Solution: .csv.gz files are gzip-compressed .csv files. Compression substantially reduces the size of large text files, so they take up less storage space and transfer faster over the network than the uncompressed .csv files.
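
The size reduction can be checked on a small scale. The sketch below builds a synthetic, repetitive CSV (hypothetical data, not MIMIC-IV) in a temporary directory, compresses a copy with gzip, and compares the byte counts. The file name and row contents are assumptions made for illustration.

```bash
# Build a synthetic CSV with repetitive rows (hypothetical data, not MIMIC-IV).
tmpdir=$(mktemp -d)
csv="$tmpdir/sample.csv"
echo "subject_id,admission_type,insurance" > "$csv"
seq 1 5000 | awk '{print $1 ",ELECTIVE,Other"}' >> "$csv"

# Compress to a separate file and keep the original for comparison.
gzip -c "$csv" > "$csv.gz"

# Compare the sizes in bytes.
raw_bytes=$(wc -c < "$csv")
gz_bytes=$(wc -c < "$csv.gz")
echo "raw: $raw_bytes bytes, gzip: $gz_bytes bytes"

# Remove the temporary files.
rm -rf "$tmpdir"
```

Real clinical tables are less repetitive than this toy file, so the ratio will differ, but the direction of the effect is the same.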

Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.
Done.

Briefly describe what bash commands zcat, zless, zmore, and zgrep do.

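Solution: All four commands operate on gzip-compressed files without writing a decompressed copy to disk.
zcat, zless, zmore, and zgrep are the counterparts of cat, less, more, and grep, respectively: zcat prints the decompressed content to standard output, zless and zmore page through it, and zgrep searches it for lines matching a pattern.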
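
A minimal sketch of the non-interactive pair, zcat and zgrep, on a tiny compressed file created just for this illustration (the file name demo.csv.gz and its contents are hypothetical). zless and zmore are interactive pagers, so they appear only as comments.

```bash
# Create a small gzip-compressed CSV in a temporary directory.
tmpdir=$(mktemp -d)
printf 'id,type\n1,ELECTIVE\n2,URGENT\n3,ELECTIVE\n' | gzip > "$tmpdir/demo.csv.gz"

# zcat: stream the decompressed content to stdout (like cat).
first_line=$(zcat "$tmpdir/demo.csv.gz" | head -n 1)

# zgrep: search the compressed file for a pattern (like grep); -c counts matching lines.
n_elective=$(zgrep -c 'ELECTIVE' "$tmpdir/demo.csv.gz")

echo "$first_line"
echo "$n_elective"

# zless and zmore open an interactive pager, so they are only noted here:
#   zless "$tmpdir/demo.csv.gz"
#   zmore "$tmpdir/demo.csv.gz"

# The directory still holds only the compressed file.
ls "$tmpdir"
rm -rf "$tmpdir"
```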

What’s the output of following bash script?

for datafile in /mnt/mimiciv/1.0/core/*.gz
  do
    ls -l $datafile
  done

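Solution: The loop runs ls -l on each .gz file in /mnt/mimiciv/1.0/core/, so the output is a long-format listing of the compressed files, one per line, with full paths.
Comment: The following command gives almost the same output; the only difference is that it prints bare file names instead of full paths. The pattern escapes the dot and anchors it to the end of the line so that only names ending in .gz match.

ls -l /mnt/mimiciv/1.0/core/ | grep '\.gz$'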

## -rw-r--r--. 1 root root 17208966 Jan  4 21:48 admissions.csv.gz
## -rw-r--r--. 1 root root  2955582 Jan  4 21:48 patients.csv.gz
## -rw-r--r--. 1 root root 53014503 Jan  4 21:48 transfers.csv.gz

Display the number of lines in each data file using a similar loop.
Solution: I used echo to print each file name and then counted its lines as follows.

for datafile in /mnt/mimiciv/1.0/core/*.gz
  do
    echo "$datafile"
    zcat "$datafile" | wc -l
  done

## /mnt/mimiciv/1.0/core/admissions.csv.gz
## 523741
## /mnt/mimiciv/1.0/core/patients.csv.gz
## 382279
## /mnt/mimiciv/1.0/core/transfers.csv.gz
## 2189536

Display the first few lines of admissions.csv.gz.
Solution:

zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | head -4

## subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,ethnicity,edregtime,edouttime,hospital_expire_flag
## 14679932,21038362,2139-09-26 14:16:00,2139-09-28 11:30:00,,ELECTIVE,,HOME,Other,ENGLISH,SINGLE,UNKNOWN,,,0
## 15585972,24941086,2123-10-07 23:56:00,2123-10-12 11:22:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0
## 11989120,21965160,2147-01-14 09:00:00,2147-01-17 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0

How many rows are in this data file?
Solution: The following command skips the header line and counts the remaining rows.

zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | tail -n +2 | wc -l

## 523740

How many unique patients (identified by subject_id) are in this data file?
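Solution: Here are two approaches. The first sorts the rows and keeps one line per distinct 8-character prefix, which relies on every subject_id being eight digits long (the -f 0 option skips no fields and has no effect). The second extracts the first field explicitly and removes duplicates.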

zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
  tail -n +2 | sort | uniq -f 0 -w 8 | wc -l
zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
  awk -F "," 'NR > 1 {print $1 | "sort | uniq"}' | wc -l

## 256878
## 256878
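
A third variant, sketched here on a few hypothetical rows rather than the real file, isolates the first column with cut before deduplicating, so it does not depend on the ids having a fixed width. With the real data, the same pipeline would follow zcat in place of printf.

```bash
# Hypothetical sample in the same layout as admissions.csv (header plus rows).
sample='subject_id,hadm_id,admission_type
101,9001,ELECTIVE
102,9002,URGENT
101,9003,URGENT
1001,9004,ELECTIVE'

# Skip the header, keep only the first column, and count distinct values.
n_unique=$(printf '%s\n' "$sample" | tail -n +2 | cut -d, -f1 | sort -u | wc -l)
echo "$n_unique"
```

Here subject_id 101 appears twice, so three distinct patients are counted.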

What are the possible values taken by each of the variable admission_type, admission_location, insurance, and ethnicity? Also report the count for each unique value of these variables.
Solution: Here is the code for each variable, followed by its output.

zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
  awk -F "," 'NR == 1 {print $6}; NR > 1 {print $6 | "sort | uniq -c"}'

## admission_type
##    7254 AMBULATORY OBSERVATION
##   21581 DIRECT EMER.
##   19991 DIRECT OBSERVATION
##   72072 ELECTIVE
##  100445 EU OBSERVATION
##  157896 EW EMER.
##   55497 OBSERVATION ADMIT
##   41074 SURGICAL SAME DAY ADMISSION
##   47930 URGENT

zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
  awk -F "," 'NR == 1 {print $7}; NR > 1 {print $7 | "sort | uniq -c"}'

## admission_location
##   60435 
##     191 AMBULATORY SURGERY TRANSFER
##   10670 CLINIC REFERRAL
##  245744 EMERGENCY ROOM
##     379 INFORMATION NOT AVAILABLE
##    4467 INTERNAL TRANSFER TO OR FROM PSYCH
##    6067 PACU
##  127494 PHYSICIAN REFERRAL
##    8449 PROCEDURE SITE
##   39121 TRANSFER FROM HOSPITAL
##    4063 TRANSFER FROM SKILLED NURSING FACILITY
##   16660 WALK-IN/SELF REFERRAL

zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
  awk -F "," 'NR == 1 {print $9}; NR > 1 {print $9 | "sort | uniq -c"}'

## insurance
##   50850 Medicaid
##  171360 Medicare
##  301530 Other

zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
  awk -F "," 'NR == 1 {print $12}; NR > 1 {print $12 | "sort | uniq -c"}'

## ethnicity
##    1535 AMERICAN INDIAN/ALASKA NATIVE
##   24506 ASIAN
##   80293 BLACK/AFRICAN AMERICAN
##   29823 HISPANIC/LATINO
##   26813 OTHER
##    3740 UNABLE TO OBTAIN
##   19400 UNKNOWN
##  337630 WHITE
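
The four commands above differ only in the column number. As a sketch, the lookup can be done by column name instead. The helper count_values below is hypothetical (it is not part of the assignment), and it assumes a simple CSV with no quoted fields containing commas. It is shown on a small invented sample; with the real data, zcat would feed it instead of printf.

```bash
# count_values NAME: read a CSV on stdin and print "count value" for column NAME.
# Assumes plain comma-separated fields (no quoted commas).
count_values() {
  awk -F, -v col="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    c       { counts[$c]++ }
    END     { for (v in counts) print counts[v], v }
  ' | sort -k2
}

# Hypothetical sample with the same kind of columns as admissions.csv.
sample='subject_id,admission_type,insurance
1,ELECTIVE,Other
2,URGENT,Medicare
3,ELECTIVE,Other'

# Tabulate two columns by name.
for col in admission_type insurance; do
  echo "$col"
  printf '%s\n' "$sample" | count_values "$col"
done
```

Looking up the column by its header avoids hard-coding positions such as $6 or $12, which would silently break if the column order changed.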

Q4. Who’s popular in Pride and Prejudice

You and your friend have just finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from http://www.gutenberg.org/cache/epub/42671/pg42671.txt and save to your local folder.
```
wget -nc http://www.gutenberg.org/cache/epub/42671/pg42671.txt
```
Done.

Explain what wget -nc does.
Solution: wget is a non-interactive command-line tool that downloads files over HTTP, HTTPS, and FTP; it can fetch a single file or recursively retrieve a whole site or directory hierarchy. The -nc (--no-clobber) option tells wget not to download a file again if a file with the same name already exists in the local directory.

Complete the following loop to tabulate the number of times each of the four characters is mentioned using Linux commands.
Solution: With the -o option, grep prints each match on its own line, so piping to wc -l counts every occurrence of a name rather than the number of lines that contain it. Adding the -i option, which ignores the difference between uppercase and lowercase letters, yields one additional match each for Lydia and Darcy, as follows.
```
# wget -nc http://www.gutenberg.org/cache/epub/42671/pg42671.txt
for char in Elizabeth Jane Lydia Darcy
do
  echo "$char:"
  grep -o "$char" pg42671.txt | wc -l
  grep -o -i "$char" pg42671.txt | wc -l
done
```
```
## Elizabeth:
## 634
## 634
## Jane:
## 294
## 294
## Lydia:
## 170
## 171
## Darcy:
## 417
## 418
```
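The difference between counting lines and counting occurrences can be seen on a three-line toy text (invented for this illustration, not taken from the novel):

```bash
# Toy text: "Darcy" appears twice on the first line and once in capitals on the last.
text='Darcy met Elizabeth. Darcy bowed.
Elizabeth smiled.
DARCY left.'

# grep -c counts matching lines.
lines=$(printf '%s\n' "$text" | grep -c 'Darcy')

# grep -o prints one match per line, so wc -l counts occurrences.
occurrences=$(printf '%s\n' "$text" | grep -o 'Darcy' | wc -l)

# Adding -i also matches DARCY.
nocase=$(printf '%s\n' "$text" | grep -o -i 'Darcy' | wc -l)

echo "lines: $lines, occurrences: $occurrences, case-insensitive: $nocase"
```

Only one line contains the exact spelling Darcy, but that line holds two occurrences, and the capitalized DARCY adds a third match under -i.
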
What’s the difference between the following two commands?
```
echo 'hello, world' > test1.txt
```
and
```
echo 'hello, world' >> test2.txt
```
Solution: > redirects output to a file in overwrite (truncate) mode, while >> redirects it in append mode. The first command therefore leaves test1.txt with a single line, hello, world, no matter how many times it is run, whereas each run of the second command adds one more line to test2.txt.
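A quick check of this behavior, run three times in a temporary directory so that no existing files are touched (the directory is an assumption of this sketch):

```bash
# Run both redirections three times and compare the resulting line counts.
tmpdir=$(mktemp -d)
for run in 1 2 3; do
  echo 'hello, world' > "$tmpdir/test1.txt"    # overwrite: file is truncated each time
  echo 'hello, world' >> "$tmpdir/test2.txt"   # append: one more line each time
done

n1=$(wc -l < "$tmpdir/test1.txt")
n2=$(wc -l < "$tmpdir/test2.txt")
echo "test1.txt: $n1 line(s), test2.txt: $n2 line(s)"

rm -rf "$tmpdir"
```
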
Using your favorite text editor (e.g., vi), type the following and save the file as middle.sh:
```
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
```
Solution: I used vi and put the file into Git.

Using chmod make the file executable by the owner, and run
```
chmod 700 ./middle.sh
./middle.sh pg42671.txt 20 5
```
```
## 
## Author: Jane Austen
## 
## Editor: R. W. (Robert William) Chapman
## 
```
Solution: I added the chmod command shown above and then ran the script.

Explain the output.
Solution: The output is lines 16 through 20 of pg42671.txt: head -n 20 keeps the first 20 lines, and tail -n 5 keeps the last 5 of those.
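
The line arithmetic is easy to verify when the input is just the numbers 1 to 30, since each line then shows its own line number. The sed command is included only as an equivalent for comparison; it is not part of middle.sh.

```bash
# head keeps lines 1-20; tail then keeps the last 5 of those, i.e. lines 16-20.
middle=$(seq 1 30 | head -n 20 | tail -n 5)
echo "$middle"

# Equivalent selection with an explicit line range.
alt=$(seq 1 30 | sed -n '16,20p')
```

In general, head -n END | tail -n NUM selects lines END - NUM + 1 through END, provided the input has at least END lines.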

Explain the meaning of "$1", "$2", and "$3" in this shell script.
Solution: The following code gives the same output:
```
head -20 pg42671.txt | tail -5
```
which implies that "$1"=pg42671.txt, "$2"=20, and "$3"=5. These are positional parameters: they hold the first, second, and third command-line arguments passed to ./middle.sh.
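
To make the mapping explicit, the sketch below writes a throwaway script (the name show_args.sh is hypothetical) that simply echoes its positional parameters, then calls it with the same three arguments used for middle.sh.

```bash
# Write a small script that reports its positional parameters.
tmpdir=$(mktemp -d)
cat > "$tmpdir/show_args.sh" <<'EOF'
#!/bin/sh
# $# is the number of arguments; $1, $2, $3 are the first three.
echo "count: $#"
echo "first: $1"
echo "second: $2"
echo "third: $3"
EOF

# Run it with the same arguments that were passed to middle.sh.
out=$(sh "$tmpdir/show_args.sh" pg42671.txt 20 5)
echo "$out"

rm -rf "$tmpdir"
```

The quoted heredoc delimiter ('EOF') keeps $1, $2, and $3 from being expanded while the script file is being written, so they are resolved only when the script runs.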

Why do we need the first line of the shell script?
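Solution: The first line (the shebang) tells the operating system which interpreter to use, here /bin/sh, when the file is executed directly as ./middle.sh. Without it, the interpreter depends on the default of the calling shell (bash, csh, tcsh, zsh, etc.).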

Q5. More fun with Linux

Try following commands in Bash and interpret the results: cal, cal 2021, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.

cal
cal 2021
cal 9 1752
date

##     January 2022    
## Su Mo Tu We Th Fr Sa
##                    1
##  2  3  4  5  6  7  8
##  9 10 11 12 13 14 15
## 16 17 18 19 20 21 22
## 23 24 25 26 27 28 29
## 30 31
##                                2021                               
## 
##        January               February                 March       
## Su Mo Tu We Th Fr Sa   Su Mo Tu We Th Fr Sa   Su Mo Tu We Th Fr Sa
##                 1  2       1  2  3  4  5  6       1  2  3  4  5  6
##  3  4  5  6  7  8  9    7  8  9 10 11 12 13    7  8  9 10 11 12 13
## 10 11 12 13 14 15 16   14 15 16 17 18 19 20   14 15 16 17 18 19 20
## 17 18 19 20 21 22 23   21 22 23 24 25 26 27   21 22 23 24 25 26 27
## 24 25 26 27 28 29 30   28                     28 29 30 31
## 31
##         April                   May                   June        
## Su Mo Tu We Th Fr Sa   Su Mo Tu We Th Fr Sa   Su Mo Tu We Th Fr Sa
##              1  2  3                      1          1  2  3  4  5
##  4  5  6  7  8  9 10    2  3  4  5  6  7  8    6  7  8  9 10 11 12
## 11 12 13 14 15 16 17    9 10 11 12 13 14 15   13 14 15 16 17 18 19
## 18 19 20 21 22 23 24   16 17 18 19 20 21 22   20 21 22 23 24 25 26
## 25 26 27 28 29 30      23 24 25 26 27 28 29   27 28 29 30
##                        30 31
##         July                  August                September     
## Su Mo Tu We Th Fr Sa   Su Mo Tu We Th Fr Sa   Su Mo Tu We Th Fr Sa
##              1  2  3    1  2  3  4  5  6  7             1  2  3  4
##  4  5  6  7  8  9 10    8  9 10 11 12 13 14    5  6  7  8  9 10 11
## 11 12 13 14 15 16 17   15 16 17 18 19 20 21   12 13 14 15 16 17 18
## 18 19 20 21 22 23 24   22 23 24 25 26 27 28   19 20 21 22 23 24 25
## 25 26 27 28 29 30 31   29 30 31               26 27 28 29 30
## 
##        October               November               December      
## Su Mo Tu We Th Fr Sa   Su Mo Tu We Th Fr Sa   Su Mo Tu We Th Fr Sa
##                 1  2       1  2  3  4  5  6             1  2  3  4
##  3  4  5  6  7  8  9    7  8  9 10 11 12 13    5  6  7  8  9 10 11
## 10 11 12 13 14 15 16   14 15 16 17 18 19 20   12 13 14 15 16 17 18
## 17 18 19 20 21 22 23   21 22 23 24 25 26 27   19 20 21 22 23 24 25
## 24 25 26 27 28 29 30   28 29 30               26 27 28 29 30 31
## 31
## 
##    September 1752   
## Su Mo Tu We Th Fr Sa
##        1  2 14 15 16
## 17 18 19 20 21 22 23
## 24 25 26 27 28 29 30
## 
## 
## 
## Fri Jan 21 05:47:33 UTC 2022