Display machine information for reproducibility:
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.29 R6_2.5.1 jsonlite_1.7.2 magrittr_2.0.1
## [5] evaluate_0.14 rlang_0.4.12 stringi_1.7.6 jquerylib_0.1.4
## [9] bslib_0.3.1 rmarkdown_2.11 tools_3.6.0 stringr_1.4.0
## [13] xfun_0.29 yaml_2.2.1 fastmap_1.1.0 compiler_3.6.0
## [17] htmltools_0.5.2 knitr_1.37 sass_0.4.0
Apply for the Student Developer Pack at GitHub using your UCLA email. You’ll get a GitHub Pro account for free (unlimited public and private repositories).
Done.
Create a private repository biostat-203b-2022-winter
and add Hua-Zhou
and maschepps
as your collaborators with write permission.
Done.
Top directories of the repository should be hw1
, hw2
, … Maintain two branches main
and develop
. The develop
branch will be your main playground, the place where you develop solution (code) to homework problems and write up report. The main
branch will be your presentation area. Submit your homework files (R markdown file Rmd
, html
file converted from R markdown, all code and extra data sets to reproduce results) in main
branch.
Done and understood.
After each homework due date, teaching assistant and instructor will check out your main branch for grading. Tag each of your homework submissions with tag names hw1
, hw2
, … Tagging time will be used as your submission time. That means if you tag your hw1
submission after deadline, penalty points will be deducted for late submission.
Done.
After this course, you can make this repository public and use it to demonstrate your skill sets on job market.
Great!
This exercise (and later in this course) uses the MIMIC-IV data, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research
course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. (Hint: The CITI training takes a couple hours and the PhysioNet credentialing takes a couple days; do not leave it to the last minute.)
Solution:
My completion report
https://www.citiprogram.org/verify/?kf7337c50-ab63-4a0b-b730-f94965ea93e0-46526237
My completion certificate
https://www.citiprogram.org/verify/?w5ed2f46d-5069-4c33-ad06-a8e12eeb07c0-46526237
The /mnt/mimiciv/1.0
folder on teaching server contains data sets from MIMIC-IV. Refer to the documentation https://mimic.mit.edu/docs/iv/ for details of data files.
ls -l /mnt/mimiciv/1.0
## total 24
## drwxr-xr-x. 2 root root 4096 Jan 4 21:48 core
## drwxr-xr-x. 2 root root 4096 Jan 4 21:51 hosp
## drwxr-xr-x. 2 root root 4096 Jan 4 21:52 icu
## -rw-r--r--. 1 root root 797 Jan 4 21:48 index.html
## -rw-r--r--. 1 root root 2518 Jan 4 21:54 LICENSE.txt
## -rw-r--r--. 1 root root 2459 Jan 4 21:48 SHA256SUMS.txt
Please, do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. These create unnecessary big files on storage and are not big data friendly practices. Just read from the data folder /mnt/mimiciv/1.0 directly in following exercises.
Sure.
Use Bash commands to answer following questions.
Display the contents in the folders core
, hosp
, icu
.
Solution: Here is the code to display them.
ls -l /mnt/mimiciv/1.0/core/
ls -l /mnt/mimiciv/1.0/hosp/
ls -l /mnt/mimiciv/1.0/icu/
## total 71472
## -rw-r--r--. 1 root root 17208966 Jan 4 21:48 admissions.csv.gz
## -rw-r--r--. 1 root root 606 Jan 4 21:48 index.html
## -rw-r--r--. 1 root root 2955582 Jan 4 21:48 patients.csv.gz
## -rw-r--r--. 1 root root 53014503 Jan 4 21:48 transfers.csv.gz
## total 4454012
## -rw-r--r--. 1 root root 430049 Jan 4 21:51 d_hcpcs.csv.gz
## -rw-r--r--. 1 root root 29531152 Jan 4 21:48 diagnoses_icd.csv.gz
## -rw-r--r--. 1 root root 863239 Jan 4 21:48 d_icd_diagnoses.csv.gz
## -rw-r--r--. 1 root root 579998 Jan 4 21:48 d_icd_procedures.csv.gz
## -rw-r--r--. 1 root root 14898 Jan 4 21:48 d_labitems.csv.gz
## -rw-r--r--. 1 root root 11684062 Jan 4 21:51 drgcodes.csv.gz
## -rw-r--r--. 1 root root 515763427 Jan 4 21:51 emar.csv.gz
## -rw-r--r--. 1 root root 476252563 Jan 4 21:48 emar_detail.csv.gz
## -rw-r--r--. 1 root root 2098831 Jan 4 21:51 hcpcsevents.csv.gz
## -rw-r--r--. 1 root root 2325 Jan 4 21:48 index.html
## -rw-r--r--. 1 root root 2091865786 Jan 4 21:50 labevents.csv.gz
## -rw-r--r--. 1 root root 99133381 Jan 4 21:48 microbiologyevents.csv.gz
## -rw-r--r--. 1 root root 422874088 Jan 4 21:48 pharmacy.csv.gz
## -rw-r--r--. 1 root root 501381155 Jan 4 21:51 poe.csv.gz
## -rw-r--r--. 1 root root 24020923 Jan 4 21:48 poe_detail.csv.gz
## -rw-r--r--. 1 root root 367041717 Jan 4 21:49 prescriptions.csv.gz
## -rw-r--r--. 1 root root 7750325 Jan 4 21:48 procedures_icd.csv.gz
## -rw-r--r--. 1 root root 9565293 Jan 4 21:48 services.csv.gz
## total 2741332
## -rw-r--r--. 1 root root 2350783547 Jan 4 21:54 chartevents.csv.gz
## -rw-r--r--. 1 root root 43296273 Jan 4 21:52 datetimeevents.csv.gz
## -rw-r--r--. 1 root root 55917 Jan 4 21:52 d_items.csv.gz
## -rw-r--r--. 1 root root 2848628 Jan 4 21:52 icustays.csv.gz
## -rw-r--r--. 1 root root 1103 Jan 4 21:52 index.html
## -rw-r--r--. 1 root root 352443512 Jan 4 21:52 inputevents.csv.gz
## -rw-r--r--. 1 root root 37095672 Jan 4 21:52 outputevents.csv.gz
## -rw-r--r--. 1 root root 20567368 Jan 4 21:52 procedureevents.csv.gz
Why are these data files distributed as .csv.gz
files instead of .csv
(comma separated values) files?
Solution: .csv.gz files are gzip-compressed .csv files. Compression reduces the file size, so the data take less storage space and transfer faster over the internet than the uncompressed .csv files would.
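As a rough illustration of the size reduction, the sketch below builds a small, repetitive CSV file and compares its size before and after gzip compression. The file name and contents are made up for the example and are not MIMIC-IV data; the actual ratio depends on how repetitive the data are.

```shell
# Build a small, repetitive CSV in a temporary directory (made-up sample data).
tmpdir=$(mktemp -d)
csv="$tmpdir/sample.csv"
echo "subject_id,insurance" > "$csv"
for i in $(seq 1 2000); do
  echo "$i,Medicare" >> "$csv"
done

# Compress to a separate file so the original stays available for comparison.
gzip -c "$csv" > "$csv.gz"

# Compare sizes in bytes.
raw_bytes=$(wc -c < "$csv")
gz_bytes=$(wc -c < "$csv.gz")
echo "raw: $raw_bytes bytes, gzip: $gz_bytes bytes"
```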
Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.
Done.
Briefly describe what bash commands zcat
, zless
, zmore
, and zgrep
do.
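Solution: All four commands work on gzip-compressed files without writing a decompressed copy to disk. zcat, zless, zmore, and zgrep are the compressed-file equivalents of cat, less, more, and grep, respectively: zcat prints the decompressed contents, zless and zmore page through them, and zgrep searches them for a pattern.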
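A minimal sketch of zcat and zgrep on a tiny made-up file (not MIMIC-IV data), assuming the GNU gzip utilities found on most Linux systems. zless and zmore are interactive pagers, so they are left out here.

```shell
# Create a tiny gzip-compressed CSV (made-up rows).
tmpdir=$(mktemp -d)
printf 'subject_id,insurance\n1,Medicare\n2,Other\n3,Medicare\n' | gzip > "$tmpdir/demo.csv.gz"

# zcat writes the decompressed text to standard output; the .gz file is unchanged.
zcat "$tmpdir/demo.csv.gz" | head -n 2

# zgrep searches the decompressed text; -c counts the matching lines.
medicare_lines=$(zgrep -c 'Medicare' "$tmpdir/demo.csv.gz")
echo "$medicare_lines"
```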
What’s the output of following bash script?
for datafile in /mnt/mimiciv/1.0/core/*.gz
do
ls -l $datafile
done
Solution: The output is a list of all the compressed (.gz
) files in /mnt/mimiciv/1.0/core/
in a long format.
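Comment: By the way, the following code gives almost the same output; the only difference is that it shows file names rather than full paths. The pattern '\.gz$' matches a literal dot followed by gz at the end of the line, whereas an unescaped '.gz' would match any character followed by gz.
ls -l /mnt/mimiciv/1.0/core/ | grep '\.gz$'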
## -rw-r--r--. 1 root root 17208966 Jan 4 21:48 admissions.csv.gz
## -rw-r--r--. 1 root root 2955582 Jan 4 21:48 patients.csv.gz
## -rw-r--r--. 1 root root 53014503 Jan 4 21:48 transfers.csv.gz
Display the number of lines in each data file using a similar loop.
Solution: I used echo
to clarify the file name and counted the number of lines as follows.
for datafile in /mnt/mimiciv/1.0/core/*.gz
do
echo $datafile
zcat $datafile | wc -l
done
## /mnt/mimiciv/1.0/core/admissions.csv.gz
## 523741
## /mnt/mimiciv/1.0/core/patients.csv.gz
## 382279
## /mnt/mimiciv/1.0/core/transfers.csv.gz
## 2189536
Display the first few lines of admissions.csv.gz
.
Solution:
zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | head -4
## subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admission_location,discharge_location,insurance,language,marital_status,ethnicity,edregtime,edouttime,hospital_expire_flag
## 14679932,21038362,2139-09-26 14:16:00,2139-09-28 11:30:00,,ELECTIVE,,HOME,Other,ENGLISH,SINGLE,UNKNOWN,,,0
## 15585972,24941086,2123-10-07 23:56:00,2123-10-12 11:22:00,,ELECTIVE,,HOME,Other,ENGLISH,,WHITE,,,0
## 11989120,21965160,2147-01-14 09:00:00,2147-01-17 14:25:00,,ELECTIVE,,HOME,Other,ENGLISH,,UNKNOWN,,,0
How many rows are in this data file?
Solution: The following code ignores the 1st line and gives the answer.
zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | tail -n +2 | wc -l
## 523740
How many unique patients (identified by subject_id
) are in this data file?
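Solution: Here are two approaches. The first sorts the rows and keeps one row per distinct 8-character prefix (the width of subject_id in this file); the second extracts the first field with awk and then removes duplicates.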
zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
tail -n +2 | sort | uniq -f 0 -w 8 | wc -l
zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
awk -F "," 'NR > 1 {print $1 | "sort | uniq"}' | wc -l
## 256878
## 256878
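A third variant, sketched here on a made-up four-row file rather than the real admissions table, extracts the first column with cut and deduplicates with sort -u. It does not depend on the width of subject_id, but it assumes that no field before the one of interest contains a quoted comma.

```shell
# Made-up sample with the same leading columns as admissions.csv.gz.
tmpdir=$(mktemp -d)
cat > "$tmpdir/adm.csv" <<'EOF'
subject_id,hadm_id,insurance
10000001,20000001,Medicaid
10000001,20000002,Medicaid
10000002,20000003,Medicare
10000003,20000004,Other
EOF
gzip "$tmpdir/adm.csv"

# Skip the header, keep column 1, and count the distinct values.
n_unique=$(zcat "$tmpdir/adm.csv.gz" | tail -n +2 | cut -d, -f1 | sort -u | wc -l)
echo "$n_unique"
```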
What are the possible values taken by each of the variable admission_type
, admission_location
, insurance
, and ethnicity
? Also report the count for each unique value of these variables.
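Solution: Here are the commands and their output. The first line of admission_location shows a count with no label; it is the number of rows where that field is empty.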
zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
awk -F "," 'NR == 1 {print $6}; NR > 1 {print $6 | "sort | uniq -c"}'
## admission_type
## 7254 AMBULATORY OBSERVATION
## 21581 DIRECT EMER.
## 19991 DIRECT OBSERVATION
## 72072 ELECTIVE
## 100445 EU OBSERVATION
## 157896 EW EMER.
## 55497 OBSERVATION ADMIT
## 41074 SURGICAL SAME DAY ADMISSION
## 47930 URGENT
zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
awk -F "," 'NR == 1 {print $7}; NR > 1 {print $7 | "sort | uniq -c"}'
## admission_location
## 60435
## 191 AMBULATORY SURGERY TRANSFER
## 10670 CLINIC REFERRAL
## 245744 EMERGENCY ROOM
## 379 INFORMATION NOT AVAILABLE
## 4467 INTERNAL TRANSFER TO OR FROM PSYCH
## 6067 PACU
## 127494 PHYSICIAN REFERRAL
## 8449 PROCEDURE SITE
## 39121 TRANSFER FROM HOSPITAL
## 4063 TRANSFER FROM SKILLED NURSING FACILITY
## 16660 WALK-IN/SELF REFERRAL
zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
awk -F "," 'NR == 1 {print $9}; NR > 1 {print $9 | "sort | uniq -c"}'
## insurance
## 50850 Medicaid
## 171360 Medicare
## 301530 Other
zcat /mnt/mimiciv/1.0/core/admissions.csv.gz | \
awk -F "," 'NR == 1 {print $12}; NR > 1 {print $12 | "sort | uniq -c"}'
## ethnicity
## 1535 AMERICAN INDIAN/ALASKA NATIVE
## 24506 ASIAN
## 80293 BLACK/AFRICAN AMERICAN
## 29823 HISPANIC/LATINO
## 26813 OTHER
## 3740 UNABLE TO OBTAIN
## 19400 UNKNOWN
## 337630 WHITE
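The four commands above differ only in the column number, so they could be folded into one loop. The sketch below shows the idea on a made-up three-column file; the helper name tabulate and the sample rows are hypothetical, and splitting on commas assumes no quoted fields contain commas.

```shell
# Made-up sample data (not MIMIC-IV).
tmpdir=$(mktemp -d)
cat > "$tmpdir/adm.csv" <<'EOF'
subject_id,admission_type,insurance
1,ELECTIVE,Other
2,URGENT,Medicare
3,ELECTIVE,Medicare
4,ELECTIVE,Other
EOF
gzip "$tmpdir/adm.csv"

# tabulate FILE COLUMN: count each distinct value in a 1-based column,
# skipping the header and listing the most frequent values first.
tabulate() {
  zcat "$1" | tail -n +2 | cut -d, -f"$2" | sort | uniq -c | sort -rn
}

for col in 2 3; do
  # Print the column name from the header, then the counts.
  zcat "$tmpdir/adm.csv.gz" | head -n 1 | cut -d, -f"$col"
  tabulate "$tmpdir/adm.csv.gz" "$col"
done

# Pull out a single count, e.g. for a quick check.
elective=$(tabulate "$tmpdir/adm.csv.gz" 2 | awk '$2 == "ELECTIVE" {print $1}')
```

For the real file, the loop would run over columns 6, 7, 9, and 12.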
You and your friend have just finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from http://www.gutenberg.org/cache/epub/42671/pg42671.txt and save to your local folder.
wget -nc http://www.gutenberg.org/cache/epub/42671/pg42671.txt
Done.
Explain what wget -nc
does.
Solution: wget is a non-interactive command-line tool that downloads files over HTTP, HTTPS, and FTP. The -nc (--no-clobber) option skips the download if a file with the same name already exists in the target directory, so rerunning the command does not fetch the file again.
Complete the following loop to tabulate the number of times each of the four characters is mentioned using Linux commands.
Solution: With the -o option, grep prints each match on its own line, so piping to wc -l counts every occurrence of a name rather than only the number of lines that contain it. Adding the -i option to ignore case gives one more match each for Lydia and Darcy, as shown below.
# wget -nc http://www.gutenberg.org/cache/epub/42671/pg42671.txt
for char in Elizabeth Jane Lydia Darcy
do
echo $char:
cat pg42671.txt | grep -o $char | wc -l
cat pg42671.txt | grep -o -i $char | wc -l
done
## Elizabeth:
## 634
## 634
## Jane:
## 294
## 294
## Lydia:
## 170
## 171
## Darcy:
## 417
## 418
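To make the difference between counting lines and counting occurrences concrete, here is a small sketch on three made-up sentences (not the novel's text).

```shell
# Three made-up lines: one with two mentions, one in upper case, one with none.
tmpfile=$(mktemp)
printf 'Darcy met Elizabeth; Darcy bowed.\nDARCY left.\nNobody spoke.\n' > "$tmpfile"

# -c counts lines that contain at least one case-sensitive match.
lines_with_match=$(grep -c 'Darcy' "$tmpfile")
# -o prints every match on its own line, so wc -l counts occurrences.
occurrences=$(grep -o 'Darcy' "$tmpfile" | wc -l)
# -i additionally matches DARCY.
occurrences_ci=$(grep -o -i 'Darcy' "$tmpfile" | wc -l)

echo "lines: $lines_with_match, matches: $occurrences, case-insensitive: $occurrences_ci"
```

Here the line count is 1, the occurrence count is 2, and the case-insensitive occurrence count is 3.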
What’s the difference between the following two commands?
echo 'hello, world' > test1.txt
and
echo 'hello, world' >> test2.txt
Solution: > opens the file in overwrite mode, while >> opens it in append mode. In other words, test1.txt contains exactly one line of hello, world no matter how many times the first command is run, while test2.txt gains one more line each time the second command is run.
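A short sketch that runs both redirections three times in a temporary directory and then counts the lines in each file:

```shell
tmpdir=$(mktemp -d)
for i in 1 2 3; do
  echo 'hello, world' > "$tmpdir/test1.txt"    # truncates the file, then writes
  echo 'hello, world' >> "$tmpdir/test2.txt"   # adds to the end of the file
done

overwrite_lines=$(wc -l < "$tmpdir/test1.txt")
append_lines=$(wc -l < "$tmpdir/test2.txt")
echo "test1.txt: $overwrite_lines line(s), test2.txt: $append_lines line(s)"
```

After three runs, test1.txt still has one line and test2.txt has three.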
Using your favorite text editor (e.g., vi
), type the following and save the file as middle.sh
:
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
Solution: I used vi
and put the file into Git.
Using chmod
make the file executable by the owner, and run
chmod 700 ./middle.sh
./middle.sh pg42671.txt 20 5
##
## Author: Jane Austen
##
## Editor: R. W. (Robert William) Chapman
##
Solution: I added a command as above and ran it.
Explain the output.
Solution: The output is lines 16 through 20 of pg42671.txt: head -n 20 keeps the first 20 lines, and tail -n 5 keeps the last 5 of those.
Explain the meaning of "$1"
, "$2"
, and "$3"
in this shell script.
Solution: The following code gives the same output:
head -n 20 pg42671.txt | tail -n 5
which implies that "$1" = pg42671.txt, "$2" = 20, and "$3" = 5. These are positional parameters: the first, second, and third arguments that follow ./middle.sh on the command line. The double quotes keep an argument that contains spaces as a single word.
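The positional parameters are easier to see on a file whose line numbers are its contents. The sketch below recreates the script in a temporary directory and runs it on a file holding the numbers 1 to 30; the temporary paths and numbers.txt are made up for the illustration.

```shell
tmpdir=$(mktemp -d)
seq 1 30 > "$tmpdir/numbers.txt"   # line k contains the number k

# Recreate middle.sh; the quoted 'EOF' keeps "$1", "$2", "$3" literal.
cat > "$tmpdir/middle.sh" <<'EOF'
#!/bin/sh
# Usage: sh middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
EOF

# $1 = numbers.txt, $2 = 20, $3 = 5; join the output lines with spaces.
result=$(sh "$tmpdir/middle.sh" "$tmpdir/numbers.txt" 20 5 | paste -sd ' ' -)
echo "$result"
```

The result is 16 17 18 19 20, that is, the last 5 of the first 20 lines.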
Why do we need the first line of the shell script?
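Solution: The first line (the shebang) tells the operating system which interpreter to use when the script is run as an executable; here it is /bin/sh. Without it, the script would be run by whatever shell the caller happens to use.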
Try following commands in Bash and interpret the results: cal
, cal 2021
, cal 9 1752
(anything unusual?), date
, hostname
, arch
, uname -a
, uptime
, who am i
, who
, w
, id
, last | head
, echo {con,pre}{sent,fer}{s,ed}
, time sleep 5
, history | tail
.
cal
cal 2021
cal 9 1752
date
## January 2022
## Su Mo Tu We Th Fr Sa
## 1
## 2 3 4 5 6 7 8
## 9 10 11 12 13 14 15
## 16 17 18 19 20 21 22
## 23 24 25 26 27 28 29
## 30 31
## 2021
##
## January February March
## Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
## 1 2 1 2 3 4 5 6 1 2 3 4 5 6
## 3 4 5 6 7 8 9 7 8 9 10 11 12 13 7 8 9 10 11 12 13
## 10 11 12 13 14 15 16 14 15 16 17 18 19 20 14 15 16 17 18 19 20
## 17 18 19 20 21 22 23 21 22 23 24 25 26 27 21 22 23 24 25 26 27
## 24 25 26 27 28 29 30 28 28 29 30 31
## 31
## April May June
## Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
## 1 2 3 1 1 2 3 4 5
## 4 5 6 7 8 9 10 2 3 4 5 6 7 8 6 7 8 9 10 11 12
## 11 12 13 14 15 16 17 9 10 11 12 13 14 15 13 14 15 16 17 18 19
## 18 19 20 21 22 23 24 16 17 18 19 20 21 22 20 21 22 23 24 25 26
## 25 26 27 28 29 30 23 24 25 26 27 28 29 27 28 29 30
## 30 31
## July August September
## Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
## 1 2 3 1 2 3 4 5 6 7 1 2 3 4
## 4 5 6 7 8 9 10 8 9 10 11 12 13 14 5 6 7 8 9 10 11
## 11 12 13 14 15 16 17 15 16 17 18 19 20 21 12 13 14 15 16 17 18
## 18 19 20 21 22 23 24 22 23 24 25 26 27 28 19 20 21 22 23 24 25
## 25 26 27 28 29 30 31 29 30 31 26 27 28 29 30
##
## October November December
## Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
## 1 2 1 2 3 4 5 6 1 2 3 4
## 3 4 5 6 7 8 9 7 8 9 10 11 12 13 5 6 7 8 9 10 11
## 10 11 12 13 14 15 16 14 15 16 17 18 19 20 12 13 14 15 16 17 18
## 17 18 19 20 21 22 23 21 22 23 24 25 26 27 19 20 21 22 23 24 25
## 24 25 26 27 28 29 30 28 29 30 26 27 28 29 30 31
## 31
##
## September 1752
## Su Mo Tu We Th Fr Sa
## 1 2 14 15 16
## 17 18 19 20 21 22 23
## 24 25 26 27 28 29 30
##
##
##
## Fri Jan 21 05:47:33 UTC 2022
Solution:
cal displays the calendar for the current month; you can also specify a year (e.g., cal 2021) or a month and year (e.g., cal 9 1752).
cal 9 1752 is unusual: because the British Empire switched from the Julian calendar to the Gregorian calendar in September 1752, the dates September 3 through 13 were skipped, so the calendar jumps from the 2nd to the 14th.
date displays the current date and time in the system time zone, which is UTC on the teaching server.
hostname
arch
uname -a
uptime
## biostat-203b-teaching-server
## x86_64
## Linux biostat-203b-teaching-server 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
## 05:47:33 up 18 days, 5:21, 5 users, load average: 1.26, 0.75, 0.53
hostname displays the name of the machine you are currently logged in to.
arch displays the hardware architecture of the machine.
uname -a displays all available system information, including the kernel name, hostname, kernel release and version, and architecture.
uptime displays the current time, how long the system has been running, the number of logged-in users, and the load averages over the last 1, 5, and 15 minutes.
who am i
# Output on my local terminal below
## tomokiokuno ttys000 Jan 10 15:27
who
## tomokiokuno0528 pts/1 2022-01-11 01:13 (42.166.94.34.bc.googleusercontent.com)
## tomokiokuno0528 pts/3 2022-01-10 23:43 (42.166.94.34.bc.googleusercontent.com)
## maschepps pts/6 2022-01-17 19:33 (74.213.228.243)
## capj245 pts/13 2022-01-21 03:54 (s-164-67-232-61.resnet.ucla.edu)
## tokramm pts/18 2022-01-21 05:07 (76.82.34.63)
w
## 05:47:33 up 18 days, 5:21, 5 users, load average: 1.26, 0.75, 0.53
## USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
## tomokiok pts/1 42.166.94.34.bc. 11Jan22 10days 0.01s 0.01s -bash
## tomokiok pts/3 42.166.94.34.bc. 10Jan22 10days 1.14s 0.96s ssh -i /home/tomokiokuno0528/.ssh/id_rsa tomokiokuno0528@server.ucla-biostat-203b.com
## maschepp pts/6 74.213.228.243 Mon19 3days 0.03s 0.03s -bash
## capj245 pts/13 s-164-67-232-61. 03:54 1:52m 0.01s 0.01s -bash
## tokramm pts/18 76.82.34.63 05:07 39:33 0.01s 0.01s -bash
id
## uid=1025(tomokiokuno0528) gid=1027(tomokiokuno0528) groups=1027(tomokiokuno0528) context=system_u:system_r:initrc_t:s0
last | head
## tokramm pts/18 76.82.34.63 Fri Jan 21 05:07 still logged in
## tokramm pts/18 76.82.34.63 Fri Jan 21 04:58 - 05:02 (00:04)
## blei001 pts/18 cpe-76-171-0-146 Fri Jan 21 04:56 - 04:56 (00:00)
## tokramm pts/18 76.82.34.63 Fri Jan 21 04:53 - 04:54 (00:00)
## tokramm pts/18 76.82.34.63 Fri Jan 21 04:31 - 04:53 (00:21)
## capj245 pts/13 s-164-67-232-61. Fri Jan 21 03:54 still logged in
## capj245 pts/13 s-164-67-232-61. Fri Jan 21 03:45 - 03:49 (00:04)
## capj245 pts/13 s-164-67-232-61. Fri Jan 21 03:36 - 03:38 (00:01)
## yuyuanli pts/30 149.142.103.176 Fri Jan 21 01:13 - 01:14 (00:00)
## huazhou pts/7 108-64-58-205.li Fri Jan 21 01:03 - 01:52 (00:48)
who am i displays information about the current terminal session, such as username, terminal, and login time. However, it prints nothing on the teaching server, probably because the command was run from the R Markdown document rather than from an interactive terminal.
who displays the users who are currently logged in.
w displays who is logged in and what each user is running, along with idle time and CPU time.
id displays the user ID, primary group ID, and group memberships of the current user.
last | head displays the ten most recent login records, newest first.
echo {con,pre}{sent,fer}{s,ed}
## consents consented confers confered presents presented prefers prefered
time sleep 5
##
## real 0m5.001s
## user 0m0.001s
## sys 0m0.000s
echo {con,pre}{sent,fer}{s,ed} uses brace expansion to print all \(2^3=8\) combinations of the three two-element lists.
sleep 5 pauses for five seconds, and time measures the execution time of a command. Hence, time sleep 5 confirms that the pause lasted about five seconds.
history | tail
# Outputs on my terminal below
## 503 arch
## 504 uname -a
## 505 uptime
## 506 who am i
## 507 who
## 508 w
## 509 id
## 510 last | head
## 511 time sleep 5
## 512 history | tail
history | tail displays the ten most recent commands, from oldest to newest.