Lab 5#
Please use this lab as an opportunity to review the course material and prepare yourself for the homework questions. A sample response to the lab questions will be made available on Friday May 2, 2024. While you are not expected to practice all these examples, you’re also reminded that practice makes perfect!
Tasks:
Walk through as many examples as you can to appreciate the efficiency loops bring to programming.
Make a list of the macros that are defined throughout this exercise. How many do you come up with? Compare this with your classmates. Post your “exhaustive” list of macros (include the total count) in GitHub Discussions and challenge others to identify a longer list.
Consider how some of these macros might be used when writing a flexible program (e.g., syntax varlist). Part II of this lab focuses on flexible programs, all of which employ macro “names” to capture user “values” and inputs (varlists, numlists, varnames, filepaths, etc.).
Finally, write a script that imports NHANES DEMO.XPT (1999-2000) and iteratively appends NHANES DEMO.XPT from the next two survey cycles, 2001-2002 and 2003-2004. Visit the website to see the naming convention for the various years (e.g., NHANES 2001-2002).
set timeout1 1000
import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DEMO.XPT", clear
Convert the above script into a flexible program that can import as many survey cycles as the user wishes. Review the naming conventions for the DEMO.XPT files for each cycle.
capture program drop nhanes_demo
program define nhanes_demo
//code
end
Part I: Loops (30 min)#
1. Settings#
Let’s start by reading in datasets we may need for demos
🌕
There’s a bug in the if 1 code block below. Let’s leverage this opportunity to practice “debugging.”
We’ll use di as err "The bug in the code is below this line" to pinpoint the error:
Insert this line at the top of your script.
Then at the bottom.
Next, in the middle.
Continue adjusting the position of this line to triangulate until you find the specific section causing the error.
This method systematically helps you narrow down the exact point of failure in your Stata code by using error messages strategically. It’s an effective way to debug, especially in scripts where the error might not be immediately apparent.
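The marker technique can be sketched on a toy script. This is only an illustration: it assumes Stata's built-in auto dataset (not the lab data) and a deliberately wrong variable name, not_a_variable, invented for the example. Output written with di as err should still appear inside a qui block, which is what makes the marker visible here.

```stata
qui {
    // built-in example dataset, used here only for illustration
    sysuse auto, clear
    summarize price
    // marker: if this message prints before the error, the bug is below it
    di as err "The bug in the code is below this line"
    // deliberate error: this variable does not exist in auto
    summarize not_a_variable
    summarize mpg
}
```

If the error appears without the marker having printed, the bug sits above the marker, so move the marker up and rerun.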
qui {
/*
1. Kidney Transplant Recipient Data (from GitHub.com)
- `transplants.dta`
- `donors.dta`
- `donors_recipients.dta`
2. NHANES 1999-2000 Demographics Data (from CDC.gov)
- `DEMO.XPT`
*/
if 1 { //Activated
cls
noi di "How many datasets are you going to use?" _request(N)
forvalues i=1/$N {
noi di "What is dataset `i'?" _request(data`i')
}
global repo "https://github.com/jhustata/basic/raw/main/"
global nhanes "https://wwwn.cdc.gov/Nchs/Nhanes/
}
}
2. Kinds of loops#
2.1 foreach v of varlist {#
qui {
/*
1. Kidney Transplant Recipient Data (from GitHub.com)
- `transplants.dta`
- `donors.dta`
- `donors_recipients.dta`
2. NHANES 1999-2000 Demographics Data (from CDC.gov)
- `DEMO.XPT`
*/
if 0 { //Deactivated
cls
noi di "How many datasets are you going to use?" _request(N)
forvalues i=1/$N {
noi di "What is dataset `i'?" _request(data`i')
}
global repo "https://github.com/jhustata/basic/raw/main/"
global nhanes "https://wwwn.cdc.gov/Nchs/Nhanes/
}
if $N { //Import Data
use $repo$data1, clear
ds
//varlist: string of variable names
foreach v of varlist `r(varlist)' {
noi di "`v'"
}
}
}
🌕
The r(varlist) can be replaced with a user-defined varlist in a customized program:
capture program drop myvarlist
program define myvarlist
syntax varlist
foreach v of varlist `varlist' {
noi di "`v'"
}
end
Now the user has flexibility:
myvarlist age gender race abo bmi
c(ALPHA)#
What if you wish to loop over a list of strings that aren’t variables?
cls
//di c(ALPHA)
foreach v in `c(ALPHA)' {
di "`v'"
}
tokenize#
You may be working within a forvalues i=1/N loop and want to loop over some other “list” that is a string:
tokenize "`c(ALPHA)'"
forval i = 1/26 {
di "``i''"
}
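Because tokenize stores each word of a string in a numbered macro, it can also pair the i-th item of one list with the i-th item of another. A minimal sketch, assuming the cycle folders and file names that appear in the import lines of this lab; the local names base, cycles, and i are chosen for illustration, and the code only displays the resulting paths rather than importing anything:

```stata
// pair two lists by position: survey cycles and their file names
local base "https://wwwn.cdc.gov/Nchs/Nhanes/"
local cycles 1999-2000 2001-2002 2003-2004
// after tokenize, positional macros 1, 2, 3 hold DEMO, DEMO_B, DEMO_C
tokenize "DEMO DEMO_B DEMO_C"
local i = 1
foreach c of local cycles {
    // the nested quotes resolve i first, then the positional macro it names
    di "`base'`c'/``i''.XPT"
    local i = `i' + 1
}
```

Run the block in one go from the do-file editor; locals defined in one run are not available in the next.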
🥇
set timeout1 1000
//see too much repetition? then innovate with loops!!
import sasxport5 "${nhanes}1999-2000/DEMO.XPT", clear
import sasxport5 "${nhanes}2001-2002/DEMO_B.XPT", clear
import sasxport5 "${nhanes}2003-2004/DEMO_C.XPT", clear
Can you write a script that imports NHANES DEMO.XPT (1999-2000) and iteratively appends NHANES DEMO.XPT from the next two survey cycles, 2001-2002 and 2003-2004? Visit the website to see the naming convention for the various years (e.g., NHANES 2001-2002).
varlist#
What if you wish to loop over non-variable strings?
cls
local varlist "Egypt Portugal Swaziland Ireland"
foreach v in `varlist' {
di "`v'"
}
2.2 foreach n of numlist {#
qui {
//earlier code
if $N {
// earlier code
levelsof dx, local(dxcat)
//numlist: list of numbers
foreach n of numlist `dxcat' {
noi di `n'
}
}
}
variable lab#
qui {
//earlier code
//levelsof is a "numlist"
levelsof dx, local(dxcat)
local varlab: var lab dx
//later code
}
🔥
Variable type determines the parameters you report in Table 1:

| Variable Type | Statistic |
|---|---|
| Continuous (Units) | Median (IQR) |
| Binary (One is enough) | % |
| Multicategory (Each reported) | Variable label |
| Specific or collapsed | % |
dx is a collapsed version of extended_dgn. But for the sake of practice, let’s further collapse dx:
tab dx
recode dx (1/4=1 "Prevalent Overall")(5/8=2 "Common in subgroups")(9=3 "Miscellaneous"),gen(dx_cat)
tab dx_cat
h ds
ds, has(type string)
levelsof extended_dgn
return list
ds, not(type string)
ds, has(type int)
ds, has(varl "*TX*")
ds, has(varl *TX* *transplant*)
ds, has(format %t*)
ds, has(format *f)
Here’s a simple script that classifies each variable:
qui {
cls
ds, not(type string) //otherwise, extended diagnosis is continuous!
global threshold 9 //for multicat vs. continuous
foreach v of varlist `r(varlist)' {
levelsof `v', local(numlevels)
if r(r) == 2 {
noi di "`v' binary"
}
else if inrange(`r(r)', 3, $threshold) {
noi di "`v' multicat"
}
else {
noi di "`v' continuous"
}
}
}
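Once a variable has been classified, the matching Table 1 statistic can be computed from stored results. A rough sketch, assuming Stata's built-in auto dataset as a stand-in for the transplant data; price, foreign, and rep78 play the roles of a continuous, a binary, and a multicategory variable:

```stata
sysuse auto, clear
// continuous: median (IQR) from the detailed summary
qui sum price, detail
di "price, median (IQR): " r(p50) " (" r(p25) "-" r(p75) ")"
// binary 0/1 variable: the mean times 100 is the percent coded 1
qui sum foreign
di "foreign, %: " %4.1f 100*r(mean)
// multicategory: tabulate reports the percent in each level
tab rep78
```

In a flexible program, the same r() results would be captured in macros and written out instead of being displayed.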
value lab#
qui {
//earlier code
//earlier code
local vallab: value lab dx
foreach n of numlist `dxcat' {
//code
}
}
label value lab#
Let’s get familiar with variations on the foreach command:
forvalues#
Here we are dealing with a sequence of numbers:
forvalues i=1/9 {
di `i'
}
foreach#
In this scenario the numbers are arbitrarily arranged:
foreach n of numlist 1 2 3 7 9 {
di `n'
}
numlist#
You can create a macro whose value is the numlist:
local numlist "1 2 3 7 9"
foreach n of numlist `numlist' {
di `n'
}
qui {
//earlier code
foreach n of numlist `dxcat' {
local dxvarlab: lab `vallab' `n'
//later code
}
}
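The three label lookups used so far (variable label, value label name, and the text attached to each value) fit together as follows. This is a sketch on Stata's built-in auto dataset rather than the transplant data; the local names varlab, vallab, levels, and catlab are chosen for illustration:

```stata
sysuse auto, clear
// descriptive label attached to the variable
local varlab: variable label foreign
// name of the value label attached to the variable
local vallab: value label foreign
// numlist of observed values
levelsof foreign, local(levels)
foreach n of numlist `levels' {
    // text that the value label assigns to value n
    local catlab: label `vallab' `n'
    di "`varlab' | `n' = `catlab'"
}
```

The same pattern applies to dx: the variable label heads the Table 1 row block and each category label heads one row.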
tokenize#
if 2 { //Int 1-26
egen lastname = seq(), f(1) t(26)
tostring lastname, replace
tokenize "`c(ALPHA)'"
}
if 3 { //Tokenize
forval i = 1/26 {
replace lastname = "``i''" if lastname == "`i'"
}
}
putexcel#
Output a varlist, a numlist, and some other list into .xlsx:
clear
putexcel set lab6, replace
use $repo$data1, clear
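qui ds
//store the varlist now: r() results are overwritten by later commands (e.g. sum)
local vars `r(varlist)'
//nested loops
tokenize "`c(ALPHA)'"
forvalues i = 1/2 {
local row=2
foreach v of varlist `vars' {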
if `i' == 1 {
qui putexcel ``i''`row' = "`v'"
local row=`row'+ 1
}
if `i' == 2 {
qui sum `v'
local mean: di %3.2f r(mean)
qui putexcel ``i''`row' = "`mean'"
local row=`row' + 1
}
}
}
ls
Review the .xlsx file you’ve just created.
Part II (30 min)#
We discussed how you can define your own “program”. It’s an awesome tool that allows us to automate a specific task. If you think a specific part of your code will be used multiple times, you might as well put that into a program. In this second half of lab, we will practice customizing our programs.
use "${repo}${data1}", clear
use "${repo}${data2}", clear
use "${repo}${data3}", clear
Start Stata, open your do-file editor, and consider using conditional code blocks (if 2, for instance) as you answer each question in this lab. That way each block has some autonomy and you can “silence” it (if 0) while you run other code blocks. Review Part I for instances where macros replace the 0 and 2 in the if 0 blocks.
Write a program called mymean. This program will take a varlist as user input, calculate the mean of each variable, and display the values.
Modify your program mymean so that when an if argument is supplied, mymean includes only the observations that meet the condition specified by the if argument. In other words, if the user types mymean height if age>65, the program mymean will calculate the mean only among patients older than 65.
Further modify your program mymean to include the option sd. When the option sd is supplied, mymean will display the standard deviation along with the mean. This version of mymean should still be able to accommodate the if argument.
Further modify your program mymean to include the option digits(), with a number in the parentheses. When the option digits() is supplied, mymean will round the mean (and the standard deviation, if applicable) in units of digits(). If digits() is NOT supplied, round in units of 0.001. (Hint: use the Stata function round().)
Did you make if, sd, and digits() optional arguments? That is, your program should run whether or not these arguments are supplied. To do so, simply surround each argument with brackets. For example, [sd].
I’d like to draw your attention to the merge command. It’s hard to write a question around merge, but it’s a really important command in practice. For instance, we used it in week 4. Reference these notes when necessary in the future. You should also ask your TA and colleagues about the append command. Task 4 (see above) requires you to use it.
We want to study whether death (died==1) is associated with several predictor variables: bmi, prev_ki, age, peak_pra, or gender. Run a logistic regression between died and each of the predictor variables using a foreach loop. At each run, save the name and the regression coefficient of the predictor variable into an external Stata dataset file named output.dta.
You have all your commands in your do-file, right? Run your do-file from the beginning and make sure it does exactly the same thing.
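The bracket convention for optional arguments can be sketched with a separate toy program. Everything here is hypothetical and is not the expected mymean answer: the program name mycount, its verbose option, and the use of Stata's built-in auto dataset are assumptions made for illustration.

```stata
capture program drop mycount
program define mycount
    // if and the verbose option are optional because they sit inside brackets
    syntax varlist [if] [, verbose]
    // touse is 1 for observations that satisfy the if condition (all, when none is given)
    marksample touse, novarlist
    foreach v of varlist `varlist' {
        qui count if `touse' & !missing(`v')
        di "`v': " r(N) " nonmissing observations"
    }
    // the macro named after the option is empty unless the option was typed
    if "`verbose'" != "" {
        di "condition supplied: `if'"
    }
end

sysuse auto, clear
mycount price rep78
mycount price rep78 if foreign == 1, verbose
```

Both calls should run: the first omits every optional piece, the second supplies both. An option that takes a value, such as digits(), would be declared inside the same brackets.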