by#

video 04/13/2023 Includes introduction to program define

video 04/20/2023 Includes exercise in debugging & discussion of HW1

The figure below outlines a schema of what we covered last week and attempts to generalize it to several scenarios beyond what we might have time to cover in 5 weeks. For instance, you may replace sum with count, regression with kappa statistic estimation (as one of your classmates did), or yet still lincom with a matrix define m=r(table) step.

For the latter case, each matrix value will contain r() class macros, which you may capture and format in a local macro like so:

local b_age: di %3.1f m[1,4]

Where m is the matrix you defined and m[1,4] is the value in row1, column4. And here is a script that outputs r(table) following a Cox regression into an excel file:

rtable.do

Hide code cell source
import networkx as nx
import matplotlib.pyplot as plt
#import numpy as np
#import sklearn as skl
# 

#plt.figure(figsize=[2, 2])
G = nx.DiGraph()
#G.add_node(" ",  pos = (0, 3) )
G.add_node("regress",  pos = (1, 9.2) )
G.add_node("sum", pos = (-3, 4.5) )
G.add_node("lincom", pos = (1, 4.5) )
G.add_node("r()", pos = (-3, 0) )
G.add_node("r()*", pos = (1, 0) )
G.add_node("local", pos = (-1, -4.5))
G.add_node("embed", pos = (-1, -9))
G.add_edges_from([      ("regress", "lincom")])
G.add_edges_from([("sum", "r()"),     ("lincom", "r()*") ])
G.add_edges_from([("r()", "local"), ("r()*", "local")])
G.add_edges_from([("local","embed")])
nx.draw(G, 
        nx.get_node_attributes(G, 'pos'), 
        with_labels=True, 
        font_weight='bold', 
        node_size = 3500,
        node_color = "lightblue",
        linewidths = 3)
ax= plt.gca()
ax.collections[0].set_edgecolor("#000000")
ax.set_xlim([-10, 10])
ax.set_ylim([-10, 10])
plt.show()
_images/05abf78f13dfdb0eb7fde2d7f9eb80f1c0eedaa15fa26611dca58528f4f5e1e5.png

I’ve not illustrated where e() and c() fit in this scheme. But once you appreciate that r() values in memory can be revealed using return list, e() values using ereturn list, and c() values using creturn list then you will be set for life!!

The local macros you define will ultimately be embedded in programs you write or the text, tables, and figures of your final manuscripts.

We’ve seen how to embed a local macro in text after appropriately formatting the r() or e() content:

di "We have seen how to embed a `local macro' in text, but such macros may be embeded in tables and graphs as well!"

Remember:

  1. names: r(), e(), c()

  2. content: return, estimates, constants

  3. these system-defined local macros may then be formatted to suit your needs in a user-defined local macro

In context of the first three weeks here is where things stand:

  1. pwd —> dofile structure, working directory, location of output

  2. r(mean) —> macros, local system-defined, user-defined

  3. by —> prelude to program define

For today’s session, by, you’ll use transplants.dta to demonstrate a few useful commands that we will put to good use over the coming weeks. You should be able to do most of this session on your own and at your convenience since it merely exposes you to various commands but not in the context of any specfic task.

use transplants, clear

What if we want a variable of the total number of records in each ABO blood type?

bys abo: gen cat_n = _N

So what did this code achieve?

tab cat_n

Let’s add labels to the groups:

#delimit ;
lab define abo_lab
   1 "A"
   2 "B"
   3 "AB"
   4 "O"
;
#delimit cr

lab values abo abo_lab

Now lets tabulate once again:

tab abo

Nice!

Ok. Do you remember these commands?

di c(N)

di c(k)

Lets create a disturbance to our setup to learn a few additional commands and their value to us:

qui do https://raw.githubusercontent.com/jhustata/book/main/sample-keep.ado

Is that a new command? Or are you already familiar with it? How about this:

samplekeep

What just happened?

di c(N)

di c(k)

Let’s rerun earlier commands to restore our sanity!

tab abo

Ok, then. It was a temporary disturbance and peace was restored!! We will talk more about the preserve and restore commands, often used together.

Back to the by command:

bys abo: egen age_byabo = mean(age)

Any idea what this command just achieved?

codebook age_byabo

What do you notice? Only four unique values! The egen command, like the gen command is used to define new variables in the dataset. However, the gen command applies values at the level of the individual. The egen command does so by group and yields summary statistics. To learn more type h egen.

qui regress peak_pra prev_ki
local coef_of_det: di %3.2f e(r2_a)*100

di "History of kidney transplant explains `coef_of_det'% of the variance in peak_pra."

From whence does e(r2_a) come? Where does it fit into our schema? (Hint: return list, ereturn list, creturn list)

//lets revisit this 
regress age i.abo
lincom _b[3.abo]

What blood group has value 3?

return list
di %3.2f r(p)
di %3.2f r(estimate)
//does this output look familiar?
 lincom _cons + _b[3.abo]
 return list 

//regression is a fancy way of estimating means for specified groups
 codebook age_byabo

Remember to use lincom appropriately, depending on what you wish to describe!

After a regression we may type ereturn list for all sorts of estimated values:

matrix define b = e(b)
matrix list b
di b[1,3]

You should explore the following commands, one-by-one, on your own:

tab rec_education, sum(age)
regress wait_yrs age, level(90)
count if age > 50
count if inrange(age,50,90)
list age dx rec_education in 1/5

//di c(N) " & " c(k)
use age dx using transplants, clear
use transplants in 1/50, clear
use transplants in 44
use transplants if rec_work==1
use a* p* using transplants in 1/100
//describe
describe, simple
describe, short
describe a*
describe using transplants.dta 
describe r*, fullnames

//codebook
codebook
codebook rec_education rec_work abo
codebook a* e*, compact
codebook, problems

//list 
list 
list prev_ki wait_yrs if race==9
list *date in 1/10
list dx in 1/10, clean noobs
list age, fast

//count
count
count if rec_work==1
count if rec_work!=1
count if bmi<20 | bmi > 35
count if bmi>25 & bmi <30
count if !(age>18)
count if inrange(wait_yrs,4,6)
count if !inlist(gender,0,1)

//tabulate
tab abo
tab abo rec_hcv
tab gender, sum(age)
tab dx gender if age<40
tab dx gender,row
tab dx gender,col nofreq

tab dx gender, col nofreq chi2
return list 

//summarize
sum
return list //not very useful; see schema above

sum age wait_yrs
return list //much more useful!

sum age wait_yrs, detail

//missingness
count if bmi > 30
count if bmi > 1000 //wha? 
sum bmi //whats going on?
assert c(N) == r(N) //when do these diverge?
tab bmi if bmi > 1000
tab bmi if bmi > 1000, missing

count if bmi==.
count if bmi==.a
count if missing(bmi)

Let’s discuss missingness as its relevance emerges.

//manipulating data
gen age_lastyear=age-1
gen any_college=(rec_educ>=3)
g youngman=(age<40>)&(gender==0)

Variable types

byte: integer range -127 to 100

int: integer range -32767 to 32740

long: integer range \(\pm\) 2 billion

float: decimal, range \(\pm\) \(10^{38}\)

double: decimal, range \(\pm\) \(10^{307}\)

strings: e.g. str5 = string of length 5

You’ll need to specify the variable types as you start to right flexible programs with the program define command and syntax command with options such as, time(varname) event(varname) tmax(float)

If you wish to better understand these variables types, please explore those available to you in transplants.dta. Open the dataset and type desc.

use transplants, clear
describe
//generate
g byte young=(age<30>)
g age_f=age if gender==1
g age_spline=(age>40)*(age-40)

//_n and _N
g new_id=_n
gen total_records=_N
g percentile=100*_n/_N

//drop/keep
keep age gender bmi fake_id ra*
drop end_date

//remember: to load the original dataset again
use transplants, clear
keep in 1/100
drop if wait_yrs<1
keep if prev_ki==1
drop if age<18 & abo==2

//load data again
use transplants, clear
replace prev=0
replace wait_yrs=5 if wait_yrs>5
replace gender=gender+1
replace bmi=. if !inrange(bmi,17,50)

//rename
rename age age_at_transplant
rename gender female

use transplants, clear
rename (age gender)(age_at_transplant female)

//sort, gsort
list in 1/10 
sort age
list in 1/10
sort age dx
list in 1/10
gsort -age dx -race 
list in 1/10

//recode
recode rec_education (3 4 5 =9)
recode dx (1=1)(2=2)(*=9)
recode gender (0=1)(1=0),gen(male)

//display
sum wait_yrs
di "variable" _col(20) "mean"
di "wait time" _col(20) r(mean)

//capture
capture sum nonexistent_variable
capture di 4+3
capture di 4+

//assert 
assert age<100
assert end_date>=transplant_date
assert inlist(gender,0,1)
assert 2<1
assert 0
assert c(edition_real)=="BE"
assert c(edition_real)=="MP"

//recode+label
#delimit ;
recode race 
    (1=0 "Cauc")
    (2=1 "AA)
    (4=2 "Hisp/Latino")
    (5/9=3 "Other"),
        gen(race_cat)
;
#delimit cr

//preserve & restore
sum age

//indentation!
preserve
    drop if age<r(mean)
    sum age
restore 

sum age
//comments

sum age //this displays mean, range, etc.

/*
Here's a long comment. It takes up several
lines, so I use the slash/asterisk.
*/

* Asterisk as first characater=comment

tab race * However, this doesn't work

if 0 { //background: ph.340.600

    1. we prefer `if 0 {' for comments
    2. and brief `//' annotations per `if int {' code-block
    3. do not use comments to explain a command
    4. that is what `help' is for
    5. our approach is avant-garde

}

Notebooks such as Jupyter notebook enable analysts to merge richer documentation together with analysis code, that isn’t quite possible in Stata. This class book is such an example of notebook that I have “hacked” from its Python roots for our Stata class.

For the curious few who looked closer at the schema outlined at the beginning of this class, you might have found some hidden but strange-looking code that produced the figure. That is python code.

I’m unable to input Stata code and display output in this Jupyter book. But the if integer { blocks that I’ve recommended for this class are actually inspired by Jupyter notebooks, where if 0 { represent documentation blocks and if int { represent code-blocks.

To me, if 0 { is infinitely more aesthetic than blocked-off green chuncks of annotated text, an approach familiar to most Stata users.

Jupyter books such as this are a major step forward in efforts toward reproducible research.

//line continuation

//Use `///` to make a Stata command span several lines in a .do file

recode race ///
    (1=1) ///
    (2=2) ///
    (4 5 6 7 9 = 9)

//That is the main-stream approach to line continuation. Ours?

#delimit ;
recode race
    (1=1)
    (2=2)
    (4 5 6 7 9)
;
#delimit cr
//labels
describe dx
label var dx "kidney disease diagnosis"

tab gender //is 1 male or female?
#delimit ;
label define g_label 
    0 "0=Male"
    1 "1-Female"
;
#delimit cr
label values gender g_label
tab gender //much better
tab gender, nolabel //don't show label

//or?
g female=gender==1 //intuitive meaning of 1 & 0
//r-class examples
count if age < 45
di r(N) " young patients"

tab race abo, chi2
di "p-value: " r(p)
sum age
assert inrange(r(mean),45,55)

tab race abo, chi2
asset r(p) < 0.05
//display formats
sum bmi //command
local bmi_mean: di %4.1f r(mean) //r() -> local
di "Mean BMI: `bmi_mean' " //embed in text or other final output
//_b coefficient matrix

regress bmi gender age

#delimit ;
di "estimated BMI: " %3.2f _b[_cons] " + " 
                     %3.2f _b[age] "*age (+ " 
                           _b[gender] " if male)"
;
#delimit cr
//local macros
qui sum age if gender==0
local age_m: %3.1f r(mean)

qui sum age if gender==1
local age_f: %3.1f r(mean)

di "Mean age: `age_m' (males) `age_f' (females) " 
//more local macros
local confounders age gender peak_pra
regress bmi wait_yrs `confoundres'
//program define
capture program drop table1
program define table1
    #delimit ;
    disp "Variable" _col(20) "mean (SD)" 
                    _col(40) "range"
                    ;
    #delimit cr

    quietly sum age
    #delimit ;
    disp "age" _col(20) %3.2f r(mean) 
              " (" %3.2f r(sd) ")"
               _col(40)  %3.2f r(min) "-" %3.2f r(max)
               ;
    #delimit cr
     //add more variables!
    end

    table1
//tagging
list abo gender spread
egen = tag
egen grouptag = tag(abo gender)
list abo gender spread if grouptag
use transplants, clear
egen ethtag = tag(ethcat)
bys ethcat: egen mean_bmi=mean(bmi) bys ethcat: egen mean_age=mean(age)
twoway scatter m~_bmi m~_age if ethtag
//conditionals
count if abo==1

if _N > 100 {
   
    di "Thats a big dataset

}

if _N > 10000 {

    di "That's a huge dataset"

}
//if and else
if _N > 10000 {

    di "That's a huge dataset"

}

else {

    di "Bah! Not so big."

}
//some `if` examples
quietly tab dx gender, chi2

if r(p) < 0.01 {

    di "p<0.01

}

else {

    di %3.2f r(p)
}
count if age<18
if r(N) >=200 {

    tab gender prev_ki, chi2

}

else {

    tab gender prev_ki, exact
    
}
count if !inlist(rec_hcv_antibody,0,1,.)

if r(N) ==0 {

    logistic rec_hcv_antibody age

}

else {

    di "Non-binary outcome."
    di "Can't do logistic regression"
}
//loops

forvalues i = 1/5 {

    di `i' //a local macro, given the notation!!

}
forvalues b = 35/45 {
    
    quietly count if age>=`b' & age<`b'+1 
    disp "Age of `b': " r(N) " patients"
    
}