Code block followed with documentation!

capture log close
log using lecture4.log, replace text

//This log file contains most of the examples used in Lecture 4 of
//summer Stata Programming and Data Management, 
///along with additional explanations and examples.

version 12 //We're teaching new-style merging
clear all //clear all data from memory
macro drop _all //clear all macros in memory
set more off   //give output all at once (not one screenful at a time)
set linesize 80 //maximum allowed width for output

In the code above, the capture log close command is used to close any previously open log files. The log using lecture4.log, replace text command opens a new log file named “lecture4.log” and replaces its contents if it already exists. The text option specifies that both the executed code and the output should be logged.

The subsequent comments provide a brief description of the log file’s purpose, stating that it contains examples used in Lecture 4 of a Stata Programming and Data Management course, along with additional explanations and examples.

The version 12 command sets the Stata version to 12, indicating that the code is compatible with that version. The clear all command clears all data from memory, removing any previously loaded datasets. The macro drop _all command clears all macros in memory. The set more off command disables the feature that displays output one screenful at a time, showing all output at once. Finally, the set linesize 80 command sets the maximum allowed width for output to 80 characters, ensuring that output is displayed within the specified limit.

These initial steps help prepare the environment for executing the subsequent code and logging the results in the “lecture4.log” file.

use transplants, clear

//examples of syntax for merging
//NOTE: after each example we reload transplants.dta

//simplest merge: link data in memory (transplants.dta) to donors_recipients.dta
//by matching on the variable fake_id which appears in both datasets
merge 1:1 fake_id using donors_recipients

//the merge command creates a new variable, _merge
//check how many records matched
tab _merge

//"master" = dataset that was in memory before we merged
//"using" = dataset that we are bringing in with the merge command
//"match" = appears in both master and using datasets

In the code above, we start by using the dataset transplants and clearing any existing data in memory. Then, we demonstrate examples of syntax for merging datasets.

The first example shows the simplest merge where we link the data in memory (transplants.dta) to donors_recipients.dta using a one-to-one match on the variable fake_id which appears in both datasets. The merge command is used for merging.

After the merge, a new variable called _merge is created. To check how many records matched during the merge, we use the tab command on _merge.

In the context of merging, the terms “master” and “using” refer to the datasets involved in the merge operation. “Master” refers to the dataset that was in memory before the merge, while “using” refers to the dataset that we are bringing in with the merge command. The term “match” indicates the records that appear in both the master and using datasets.

Please let me know if you need further assistance or if there’s anything else I can help you with!

use transplants, clear
merge 1:1 fake_id using donors_recipients, keep(match)
//only keep records that appear both in the original dataset and in the new dataset

use transplants, clear
merge 1:1 fake_id using donors_recipients, keep(master match)
//only keep records that appear in the master dataset only, or both datasets

use transplants, clear
merge 1:1 fake_id using donors_recipients, keep(master using)
//only keep records that appear in the master dataset only, or the using dataset only, but not both

use transplants, clear
merge 1:1 fake_id using donors_recipients, keep(master match) gen(mergevar)
//call the newly created variable "mergevar" instead of "_merge"

use transplants, clear
merge 1:1 fake_id using donors_recipients, keep(match) nogen
//don't create a new variable

merge m:1 fake_don_id using donors, keep(match) nogen keepusing(age_don)
//don't load all the new variables from the donors dataset, just load age_don.

corr age*
//are donor age and recipient age correlated in our fake data? No.

//merging: using assert to make sure that all your records match
use transplants, clear
merge 1:1 fake_id using donors_recipients, keep(master match)
assert _merge==3
drop _merge

//merging: using assert to make sure that *most* of your records match
use transplants, clear
merge 1:1 fake_id using donors_recipients, keep(master match)
assert inlist(_merge, 1, 3) //master only or matched
quietly count if _merge == 3
assert r(N)/_N > 0.99  //99% of records have _merge==3
drop _merge

In the code above, we continue with examples of merging datasets using different options:

The first example merges the datasets transplants.dta and donors_recipients.dta based on a one-to-one match on the variable fake_id. It keeps only the records that appear in both datasets using the keep(match) option.

The second example keeps records that appear in the master dataset only or in both datasets using the keep(master match) option. The third example keeps records that appear in the master dataset only or the using dataset only, but not in both, using the keep(master using) option.

The fourth example creates a new variable called “mergevar” instead of the default _merge using the gen(mergevar) option.
The fifth example performs the merge but does not create a new variable using the nogen option.
The sixth example demonstrates merging using a one-to-many match (m:1) and selects specific variables to load from the donors dataset using the keepusing() option.
The code then computes the correlation between the variables age_don and age_rec using the corr command to check if donor age and recipient age are correlated in the fake data.
The next set of examples introduces the use of assert to check the validity of the merge. The first example uses assert to ensure that all records have a _merge value of 3 (indicating a match) and then drops the _merge variable.
The second example uses assert to check that most of the records have a _merge value of 1 (indicating they are in the master dataset) or 3 (indicating a match), with at least 99% of the records meeting this condition. The _merge variable is dropped afterwards.

//by

use transplants, clear
sort abo
by abo: gen cat_n = _N //cat_n = # records in each abo category

sort abo age
by abo: gen cat_id = _n //1 for youngest patient, 2 for next youngest, etc.


//egen

//obtaining mean age and storing it in a variable
use transplants, clear
egen mean_age = mean(age)

//more egen examples
egen median_age = median(age)
egen max_age = max(age)
egen min_age = min(age)
egen age_q1 = pctile(age), p(25) //25th percentile
egen age_sd = sd(age)  //standard deviation
egen total_prev = sum(prev) //add all the values (all are 1 or 0 in this case)

//using egen with by
use transplants, clear
sort dx
by dx: egen mean_age = mean(age)
by dx: egen min_bmi = min(bmi)

//combining sort and by to get bysort / bys
use transplants, clear
bys abo: egen m_bmi=mean(bmi)
bys abo gender: egen max_bmi = max(bmi)
bys abo gender: egen min_bmi = min(bmi)
gen spread = max_bmi - min_bmi

//egen = tag
egen grouptag = tag(abo gender)
//now, for all groups of abo/gender, (1/M, 3/F, etc), there's exactly one
//record that has grouptag = "1". All others have grouptag = "0".

list abo gend spread if grouptag

//more examples of bys
use donors_recipients, clear
bys fake_don: gen n_tx = _N
bys fake_don: gen tx_id = _n
egen don_tag = tag(fake_don)
tab n_tx if don_tag


//graphing example with egen and tag
use transplants, clear
egen tag_race = tag(race)
bys race: egen mean_bmi=mean(bmi)
bys race: egen mean_age=mean(age)

label define race_label 1 "White" ///
    2 "Black/AA" 4 "Hispanic/Latino" ///
    5 "East Asian" 6 "Native American" ///
    7 "Asian Indian" 9 "Other"  //create a label race_label

//I personally find this more legible than cond() with seven cases,
//although cond() has its champions
gen racedesc = "White/Caucasian" if race==1
replace racedesc = "Black/AA" if race==2
replace racedesc = "Hispanic/Latino" if race==4
replace racedesc = "Native American" if race==5
replace racedesc = "East Asian" if race==6
replace racedesc = "Asian Indian" if race==7
replace racedesc = "Other" if race==9

twoway scatter m~_bmi m~_age if tag_race, ///
    title("Mean age and BMI by race/ethnicity") ///
    xtitle("Mean age") ytitle("Mean BMI") ///
    xlabel(40(2)60) ylabel(20(2)30) ///
    mlabel(racedesc)
graph export graph_lecture4.png, replace width(2400)

In the code above, we cover examples related to the by

//math functions

//setup for math functions
clear
set obs 1000
gen n = _n/100 - 5

//rounding method 1: floor
disp floor(0.3)
disp floor(4.5)
disp floor(8.9)

//rounding method 2: ceil
disp ceil(0.3)
disp ceil(4.5)
disp ceil(8.9)

//rounding method 3: round
disp round(0.3)
disp round(4.5)
disp round(8.9)

//using round to round to something other than nearest integer
disp round(-0.32, 0.1)  //round to nearest 0.1
disp round(4.5, 2)   //round to nearest 2
disp round(8.9, 10)  //round to nearest 10

//min and max
disp min(8,6,7,5,3,0,9) //display minimum value
disp max(8,6,7,5,3,0,9) //display maximum value

//other math functions
disp exp(1)  //exponent. e^1 = e.
disp ln(20)  //log of 20 ~ 3
disp sqrt(729)  //27

disp abs(-6)  //absolute value
disp mod(529, 10) //modulus (remainder)
disp c(pi)  //constant pi (for illustration of sine)
disp sin(c(pi)/2) //sine function

//string functions
use transplants, clear
list extended_dgn in 1/5, clean

disp word("Hello, is there anybody in there?", 4)
list extended_dgn if word(ext, 5) != "", clean noobs
disp strlen("Same as it ever was")
list extended_dgn if strlen(ext)< 6, clean
assert regexm("Earth", "art")
assert !regexm("team", "I")

tab ext if regexm(ext, "HTN")
list ext if regexm(ext, "^A") 
//starts with A

list ext if regexm(ext, "X$") 
//ends with X

tab ext if regexm(ext, "HIV.*Y") 
//contains "HIV", then some other stuff, then Y

That’s it with the math functions

//date and time functions
//review of number formats
disp %3.2f exp(1)
disp %4.1f 3.14159

//examples of the %td format for dates (days since 1/1/1960)
disp %td 19400
disp %td 366
disp %td -5

//doing math on Stata dates
use transplants, clear
gen oneweek = transplant_date +7
format %td oneweek
list transplant_date  oneweek in 1/3

//date functions
//td() function to give the integer date for a given calendar date
disp td(04jul1976)

//"Back to the future" start date
disp td(26oct1985) 

//mdy() function to give the integer date for a given month/day/year
disp mdy(7,4,1976)

//"Back to the future" start date
disp mdy(10,26,1985) 



//date() function to convert strings to dates
//Woodstock festival starts
disp date("August 15, 1969", "MDY")
//Next perihelion of Halley's comet
disp date("2061 28 July", "YDM")

use donors, clear
gen donor_dob = date(fake_don_dob, "DMY")

format %td donor_dob
list donor_dob fake_don_dob in 1/5

use transplants, clear
//survival analysis overview

//prep work
gen f_time = end_date-transplant_date

//stset (1)
stset f_time, failure(died)

//stset (2)
stset end_date, origin(transplant_date) failure(died)

//set a scale to measure time in years instead of days
stset end_date, origin(transplant_date) failure(died) scale(365.25)

//Kaplan-Meier curve
sts graph

//stratified K-M curve
sts graph, by(don_ecd)

//calculate incidence rate per person-year
stsum
stsum, by(don_ecd)

//rank-sum tests
sts test don_ecd
sts test gender
log close

In the code above, we cover examples related to various math functions, string functions, date and time functions, and survival analysis functions in Stata. Each section is clearly commented to explain the purpose and usage of the specific functions.