HW6#
Write a .do file that performs the tasks described below. Your .do file must be called
hw2.lastname.firstname.do
. Make sure your script will run on our machines by handling filepath ambiguity in your script. Do not submit your log files as part of the assignment.
#
Use the dataset pra_hist.dta
and hosp.dta
to perform the required tasks.
global repo https://github.com/jhustata/basic/raw/main/
Context : You are conducting a study that examines the regional variation in the distribution of panel-
reactive antibody (PRA). You recruited 73 patients (px_id
= 1, …., 73 ) from 10 hospitals (hosp_id
=1, …,10)
in 3 regions (region
=A, B, C, … ), and measured PRA 3 times: visit 1, visit 2, and visit 3. You hear that the
organization that funds your research plans to extend the funding for several more visits (visit 4, visit 5,
…, visit N). Since you do not know how many more visits there will be, you decide to write a .do file that
can work regardless of how many visits the dataset has.
Codebook
Variable |
Description |
Values/Range |
---|---|---|
pra_hist.dta |
||
|
Hospital ID |
Integers: 1 – 10 |
|
Patient ID |
Integers: 1 – 94 |
|
Visit ID |
Integers: 1 – N |
|
PRA value at the visit |
Integers: 0 – 100 |
hosp.dta |
||
|
Hospital ID |
Integers: 1 – 10 |
|
Region |
Alphabets |
Note: to uniquely identify a patient you’d have to specify both hospital ID and patient ID. In other words, patients in different hospitals may have the same Patient ID
i) Load pra_hist.dta
. Print a table as shown below, which displays the number of
patients with a non-missing PRA value at each visit. N
and XX
should be replaced with the
correct values from the dataset. (Hint: how do you write a forvalue loop for all values of
visit_id
?)
use "${repo}pra_hist", clear
Question 1.i)
Visit Count
1 XX
2 XX
⋮
[omitted, but your .do file should display all variables]
⋮
N XX
ii) Create a new variable peak_pra
, which contains the maximum value of PRA within each
participant. Print the median (IQR) of peak_pra
across the patients as shown below. XX.X
should be replaced with the correct values from the dataset and formatted with one digit after
the decimal point (e.g., 12.0 ). (Hint: each patient must be counted only once when calculating
the median and IQR.)
Question 1.ii): The median (IQR) of peak_pra is xx.x (xx.x-xx.x).
iii) Another dataset provided to you, hosp.dta
, has information on which region each
hospital is located in. Use the merge
command to merge the current dataset in memory with hosp.dta
without
altering the number of observations in memory (see week 3 Section 3.3 for syntax or week 4 Section 4.2 for notes during a lecture session). Use the command list
to list the ID of the
patient with the highest peak_pra value for each region as shown below.
use "${repo}hosp", clear
XX
should be replaced with the correct values from the dataset. If there are ties (i.e., multiple
patients with the highest value), print all tied patients. If region C has ties (while A and B does
not), the table will look like below.
+------------------------------------+
| region px_id peak_pra |
|------------------------------------|
| A XXXX XXXX |
|------------------------------------|
| B XXXX XXXX |
|------------------------------------|
| C XXXX XXXX |
| C XXXX XXXX |
+------------------------------------+
Hint: your list
command should look like this
list region px_id peak_pra [insert your code here], sepby(region) noobs