

You’ll find that the content of this chapter & lab 6 complement each other.

This is a twoway plot:



  • import .csv files -> save .dta files

  • logfile to capture salient bits of process

  • dofile that appends all the saved .dta files from step #1

  • logfile that documents the above process

  • twowayplot script that produces the above figure
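The import, save, append, and log steps listed above can be sketched as follows. This is a minimal sketch, not the class solution: the file names (toy1.csv, import_append.log, combined.dta) are hypothetical, and the toy .csv files are manufactured from Stata's built-in auto.dta only so that the sketch runs anywhere.

```stata
// Sketch only: all file names here are hypothetical
capture log close
log using import_append.log, replace text

// Step 0: manufacture three toy .csv files from built-in data
forvalues i = 1/3 {
    sysuse auto, clear
    keep in 1/10
    export delimited using "toy`i'.csv", replace
}

// Step 1: import each .csv file -> save a .dta file
forvalues i = 1/3 {
    import delimited "toy`i'.csv", clear
    save "toy`i'.dta", replace
}

// Step 2: append all the saved .dta files into one dataset
clear
forvalues i = 1/3 {
    append using "toy`i'.dta"
}
save combined.dta, replace
di "combined N=`c(N)'"

log close
```

Everything is written to the present working directory, so run it in a scratch folder.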


  • open - the entire process, not merely the finessed product, is published

  • public - in the 21st century this means online

  • github - gh-pages can freely host your content

  • reproducible - the entire world has access to your .do files on github

  • classbook - you guys have access to all the stuff that this classbook is made of

  • et tu? on a scale of 0-10 how do you score on openness? who has access to your dofiles?


Let’s explore some syntax for twoway plots. Run one line of code at a time. Alternatively, you may type help twoway to explore the options outlined below.

use transplants, clear
gen int yr = year(transplant_date) 
gen byte n=1
rename gender female
rename don_ecd ecd
collapse (sum) n ecd female, by(yr) 
gen int scd = n-ecd
gen int male=n-female
save tx_yr, replace
graph twoway line n yr
graph twoway connected n yr
graph twoway area n yr
graph twoway bar n yr
graph twoway scatter ecd scd
graph twoway function y=x^2+2
twoway function y=x^2+2, range(1 10)
twoway function y=x^2+2, range(yr)
graph twoway line ecd scd yr
graph twoway line n ecd scd yr
graph twoway area ecd scd yr //where is the ecd area?
graph twoway area scd ecd yr //order matters!
graph twoway bar scd ecd yr
twoway line n yr || connected male female yr
twoway line n yr ///
    || connected male female yr
regress n yr
twoway line n yr ///
    || function y=_b[_cons]+_b[yr]*x, range(yr)
twoway line female yr /// 
    || line male yr ///
    || line scd yr ///
    || line ecd yr ///
    || line n yr
twoway line n yr, yscale(range(0))
tw li n yr, yscale(range(0 700))
tw li n yr, xscale(range(2050))
tw li female yr, yscale(log)
tw li female yr, xscale(reverse) //rarely a good idea
tw li ecd yr, xscale(off) yscale(off)
tw li ecd yr, xscale(off) ///
    yscale(log range(1) reverse)
tw li n yr, yscale(range(0)) ylabel(#4)
tw li n yr, ///
    yscale(range(0)) ylabel(minmax)
tw li n yr, ylabel(0(100)600) ///
    xlabel(2005 2007 "policy change" 2010(5)2020)
tw li n yr, xtick(2005(1)2020) ///
    yscale(range(0)) ylabel(0(100)600)
tw li n yr, xtitle("Calendar year") ///
    ytitle("DDKT") ylabel(0(100)600)
tw li n yr, ///
    yscale(range(0)) ylabel(0(100)600)
tw li n yr, ///
    title("Transplants per year")
twoway line n yr, xline(2007) ///
    text(450 2007 "Policy change")
twoway line n yr, yline(350)
twoway line n yr, ylabel(0(100)600) ///
    text(600 2017 "Local peak in 2017")
use transplants, clear //reload patient-level data; tx_yr has one row per year
graph twoway scatter peak_pra age
tw sc peak_pra age, jitter(2)
tw sc bmi age if gender==0, mcolor(orange) ///
    || sc bmi age if gender==1, mcolor(black) //orioles colors
tw sc bmi age if gender==0, msymbol(D) ///
    || sc bmi age if gender==1, msymbol(+)
tw sc bmi age if gender==0, msize(small) ///
    || sc bmi age if gender==1, msize(large)

Which of these is not a twoway graph? Does the area under the curve represent anything meaningful?

Crudely, the AUC might be viewed as rectangular: height is 100 individuals x width is 100 years (i.e., ages) = 10,000

Does 10,000 correspond to any of the output? Perhaps to c(N)?
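One way to probe that question is to check the arithmetic on simulated data. The sketch below does not use NHANES: it assumes 10,000 made-up individuals with ages drawn uniformly from 0 to 99, and the seed is arbitrary. It shows that the counts per age, summed across ages, recover the number of observations that were in memory before the collapse.

```stata
// Simulated data, not NHANES: 10,000 people with ages 0-99
clear
set seed 340600
set obs 10000
gen age = floor(100*runiform())
local N0 = c(N)                  // observations before collapsing
gen number = 1
collapse (sum) number, by(age)   // one row per age, number = count at that age
quietly summarize number
di "ages: `c(N)', people: `r(sum)', original N: `N0'"
assert r(sum) == `N0'
```

After the collapse, c(N) counts ages rather than people, which is why the original value is stored in a local macro first.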

Below is the script that produced them, but you’ll have to do some debugging before it works. There’s no free lunch today!

I’d like to invoke the metaphor of gene activation: whether a gene is expressed depends on upstream signals, just as whether a code-block runs depends on if macro {, else if macro {, and else {. We have Stata code-blocks rather than genetic code, but the metaphor is apt. What happens upstream in one code-block may affect the expression of a downstream code-block; the process simply unfolds in a dofile rather than in a cell.

You ought to emerge from this class thinking of Stata programming as a series of if macro { conditional statements. And your teaching team will look out for these in your .do files!
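Here is a minimal sketch of the upstream/downstream idea, assuming Stata's built-in auto.dta as a stand-in for your own data. The first block is expressed only when memory is empty, and whether the second block is expressed depends on what the first one did.

```stata
clear                       // upstream: memory starts empty, so c(N) is 0
if c(N) == 0 {
    sysuse auto             // expressed only when no data are loaded
}
if c(N) > 0 {
    // downstream: expressed because the block above loaded data
    collapse (mean) price, by(foreign)
}
else {
    di "no data, so this block is expressed instead"
}
di "N=`c(N)'"
```

Comment out the sysuse line and rerun: the else block fires instead, with no other change to the script.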

qui {
 if c(N) { //clear data before running script
        1. adopted from wk1 of this class
  2. import demographics data from nhanes
 if c(N)<1 { //settings,logfile,macros
  capture log close 
  log using session0.log, replace 
  global url
  global datafile DEMO.XPT 
 if c(N)<2 { //import datafile
  import sasxport5 "${url}${datafile}", clear
  replace ridageyr=.
  noi di "N=`c(N)'"
 if c(N)>3 {
     g number=1
      sum ridageyr
   assert c(type) == "float"
      collapse (sum) number,by(ridageyr)
 local N=c(N)-1
 if `N' { //no output if c(N)=0
  noi di "N=`c(N)'"
  local ages=c(N)
  line number ridageyr, connect(stairstep) /*
      */text(500 40 "Vars: `c(k)', Obs: `c(N)'") /*
   */yti("") /*
  graph save agedist1.gph,replace 
  twoway area number ridageyr, connect(none) /*
      */text(500 40 "Vars: `c(k)', Obs: `c(N)'") /*
   */yti("") /*
  graph save agedist2.gph,replace 
 if `N' {
  noi di "N=`c(N)'"
  hist ridageyr, freq bins(`ages') /*
      */text(500 40 "Vars: `c(k)', Obs: `c(N)'") /*
   */yti("") /*
  graph save agedist3.gph,replace 
  graph combine agedist1.gph /*
              */agedist2.gph /*
     */agedist3.gph /*
         */, row(1) /*
      */  l1ti("N",orientation(horizontal)) /*
      */  b1ti("Age, y")
  graph export agedist.png,replace 
  noi di c(scheme)
  noi di c(version)
 else {
  noi di "N=`N' (i.e., code-block is not expressed)"
  • Sampling large datasets

  • Approach to workflow

  • May cut hours off efforts

Here’s an example:

  • exploratory analyses on sample

  • build .do file on sample, iterate

  • submit final job to full dataset
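The workflow above can be sketched with a single switch. This is an assumption-laden sketch: auto.dta stands in for a large dataset, the local macro testing is a hypothetical name, and the 10% sampling fraction and the seed are arbitrary.

```stata
// Sketch: auto.dta stands in for a large dataset
local testing 1            // hypothetical switch: 1 = sample, 0 = full dataset
sysuse auto, clear
if `testing' {
    set seed 2024          // makes the random sample reproducible
    sample 10              // keep a 10% random sample
}
di "running on N=`c(N)' observations"
regress price mpg weight
```

Once the .do file behaves on the sample, set testing to 0 and submit the same file against the full dataset.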

Let’s recall an extra credit challenge from the first day of class:

See also

Bonus points: Use the tokenize command to append the DEMO.XPT files for all continuous NHANES: 1999-2018 into one file. Your .do file should include only one import sasxport5 statement. Search this book for the import sasxport5 command. Up to 1.5 bonus points
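As a hint rather than a solution, here is a sketch of what tokenize does. The file stems other than DEMO are assumptions made for illustration, and the sketch only displays file names; it imports nothing and builds no URL.

```stata
// Stems beyond DEMO are assumed for illustration; nothing is imported
local stems "DEMO DEMO_B DEMO_C"
tokenize "`stems'"         // fills the positional macros `1', `2', `3'
local i = 1
while "``i''" != "" {      // ``i'' resolves `i' first, then that position
    di "file `i': ``i''.XPT"
    local ++i
}
```

The loop stops at the first empty positional macro, so it adapts to however many stems you list.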

We now wish to link the dataset created above to mortality outcomes to perform survival analysis. See chapter 2: r(mean) and specifically the if 6 { code-block, which was exclusively dedicated to survival analysis and used the stset, sts graph, and stcox commands! How may we go about this using the online resources available to us?

if 0 {
    this is not a .do file for you to copy & paste
    instead, run the commands sequentially
    one-by-one, except, of course, the twoway command
    as well as the sts graph command. you'll need to 
    copy & paste that long line of code into a dofile and do
}

nhanes_mortality //install programs: click on "?" that ends the last paragraph
merge 1:1 seq using nhanes_mortality, keep(matched)
tab survey
lookfor age
lookfor follow
egen surveytag=tag(survey)
codebook surveytag
desc survey 
split survey, parse("-")
destring survey1, replace 
g years=permth_exm/12
lookfor mort 
g age_at_death=ridageyr + years if mortstat==1
bys survey: egen av_age_at_death=mean(age_at_death)
#delimit ;
twoway scatter av_age_at_death survey1 if surveytag,
     ti("Age at Death by Survey Year", pos(11))
  xti("Survey Year")
  text(76 2010 "obs: `c(N)', vars: `c(k)'");
graph export twoway_ageatdeath.png, replace ;
stset years, fail(mortstat);
sts graph, 
    by(ridreth1);
graph export km_race.png, replace ;
stcox i.ridreth1 ;
stcox i.ridreth1  ridageyr riagendr ;
#delimit cr

Let’s study this output and discuss a few issues:

  • egen command

  • by command

  • c(N), c(k) macros embedded in graph

  • improving the aesthetics
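The first three items can be tried out on a small scale. The sketch below assumes Stata's built-in auto.dta instead of the NHANES data, and the variable names grouptag and av_price as well as the text coordinates are made up for illustration.

```stata
sysuse auto, clear
egen grouptag = tag(foreign)                 // 1 on one row per group, 0 elsewhere
bysort foreign: egen av_price = mean(price)  // group mean repeated on every row
local note "obs: `c(N)', vars: `c(k)'"       // c(N), c(k) captured at this point
twoway scatter av_price foreign if grouptag, ///
    text(6200 0.5 "`note'")
```

Plotting only the tagged rows draws one marker per group instead of overplotting the same mean many times.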

Which of these is a twoway graph?

Then, in the second half of the class, we’ll recap .do file structure in the context of the solution we’ll share with you. Let’s first briefly study an .ado file that you can find on your computers here:


This is the native Stata program for the stcox command used in Cox proportional hazards regression. We are not presently interested in the content of the .ado file; we merely wish to use it as an exemplar for our own scripts and programs, including our ideal hw1 solution. What interests us is .do file or .ado file structure. Don’t be intimidated by the length of the script. Just look out for salient features:

  1. lines of code rarely cross the line (Stata’s suggested right margin)

  2. coder uses more than one method for line continuation including

//entirely new to me as of this week
sts graph, /*
    */ by(race)

//by far the most popular approach
sts graph, ///
    by(race)

//more efficient the longer the line of code
#delimit ;
sts graph, 
    by(race);
#delimit cr
  3. never uses #delimit ; (this is my personal fave, especially for a very long line of code)

  4. otherwise, the entire script is a bunch of if, else if, else code-blocks

  5. up to this point we’ve used integers like if 1 { to define a code-block

  6. henceforth we’ll get a little fancier and replace the integers with system-defined macros: c(), e(), r(); watch today’s video on if c(os)=="Windows" {

  7. and maybe occasionally with programmer-defined macros: N in the above script

  8. the limit is your imagination

  9. but i hope you appreciate the flexibility conditional code-blocks bring to programming!

  10. we have been hard-coding the values of if 0 {, if 1 {, etc as we build .do file structure. code-blocks have thus far been placeholders, elements required by syntactic constraints imposed on you by your instructor but that carry little or no semantic information.
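To make the step from hard-coded integers to system-defined macros concrete, here is a sketch that conditions on r() and e() results. It assumes the built-in auto.dta, and the threshold of 30 observations is an arbitrary choice for illustration.

```stata
sysuse auto, clear
quietly summarize price
if r(N) > 0 & r(mean) < . {       // r(): results left behind by summarize
    di "mean price = " %9.2f r(mean)
}
quietly regress price mpg
if e(N) >= 30 {                   // e(): results left behind by regress
    di "model fit on " e(N) " cars, slope = " %6.1f _b[mpg]
}
else {
    di "too few observations to trust the fit"
}
```

Type return list after summarize, or ereturn list after regress, to see what else is available to condition on.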

But the hw1 script of one of your classmates has serendipitously segued us to informative, functional conditional if statements:

de-identified hw1 script

edited hw script

Copy & paste first the original and then the edited versions into your .do file editor and run. Of course you’ll need to have hw1.txt in the appropriate pwd.


//c() class system-defined macros
h creturn
di c(os)
assert c(os)=="MacOSX"
assert c(os)=="Windows"
assert c(os)=="Unix"

This brings us to our first substantive discussion of conditional statements about code-blocks:

if c(os) == "Windows" {
}
else {
}
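A filled-in version of this skeleton might look like the sketch below. The local macro msg and the messages are hypothetical; the three values of c(os) are the ones tested with assert earlier.

```stata
if c(os) == "Windows" {
    local msg "running on Windows"
}
else if c(os) == "MacOSX" {
    local msg "running on a Mac"
}
else {
    local msg "running on `c(os)'"   // e.g., Unix
}
di "`msg'"
```

Exactly one of the three blocks is expressed on any given machine, which is what makes the same .do file portable across your classmates' computers.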
