Stata Advanced File Management - 3 Hour Intensive#

Hour 1: Data Manipulation Essentials#

1.1 The by Command (15 min)#

Basic Syntax:

  • _n = row number within group

  • _N = total rows in group/dataset

// Create group totals and IDs
sort abo
by abo: gen cat_n = _N        // Total in each ABO group
by abo: gen cat_id = _n       // Sequential ID within group

// Shorthand: bysort (or bys)
bysort abo age: gen age_rank = _n

1.2 The egen Command (20 min)#

Key Functions:

// Basic statistics
egen mean_age = mean(age)
egen median_age = median(age)
egen max_age = max(age)
egen age_sd = sd(age)

// By groups
bysort dx: egen mean_age_by_dx = mean(age)
bysort abo gender: egen max_bmi = max(bmi)

1.3 Tagging with egen (10 min)#

// Tag one record per group
egen grouptag = tag(abo gender)
list abo gender spread if grouptag

// Useful for graphing large datasets
egen ethtag = tag(ethcat)
bysort ethcat: egen mean_bmi = mean(bmi)

1.4 Essential Functions (15 min)#

Numeric:

// Rounding
floor(8.9)    // Round down: 8
ceil(8.9)     // Round up: 9
round(8.9)    // Round nearest: 9

// Math
min(8,6,7,5)  // Minimum value
max(8,6,7,5)  // Maximum value
exp(1)        // e^1
ln(20)        // Natural log
sqrt(729)     // Square root
abs(-6)       // Absolute value

Hour 2: String Manipulation & Data Types#

2.1 String Functions (25 min)#

Core String Functions:

// Extract words
word("Hello world", 2)     // Returns "world"

// String length
strlen("Same as it ever was")  // Returns 19

// Find position
strpos("history", "story")     // Returns 4
strpos("history", "xyz")       // Returns 0 (not found)

// Pattern matching
regexm("Earth", "art")         // Returns 1 (true)
regexm("team", "I")            // Returns 0 (false)

// Advanced regex
regexm(text, "^A")             // Starts with A
regexm(text, "X$")             // Ends with X

Practical Example: Drug Classification

// Check for antihistamine exposure
gen has_benadryl = (strpos(drug_name, "DIPHENHYDRAMINE") != 0)
gen has_hydroxyzine = (strpos(drug_name, "HYDROXYZINE") != 0)
gen any_antihistamine = (has_benadryl | has_hydroxyzine)

2.2 Date Handling (20 min)#

Date Storage & Conversion:

// Stata stores dates as days since Jan 1, 1960
disp %td 19400    // Shows as date format

// Convert string to date
gen date_var = date("August 15, 1969", "MDY")
gen date_var2 = date("2061 28 July", "YDM")

// Date arithmetic
gen oneweek_later = transplant_date + 7
format %td oneweek_later

2.3 Code Quality Practices (15 min)#

Essential Habits:

  1. Indent your code - use consistent spacing

  2. Test iteratively - run small sections

  3. Comment extensively - explain your logic

  4. Use descriptive variable names

// Bad
program define myprogram
quietly {
forvalues i=1/10 {
if `i' < 5 {
noisily disp "`i' is less than five" }
else {
noisily disp "`i' is at least five" }
}
} 
end

// Good
program define display_numbers
    quietly {
        forvalues i = 1/10 {
            if `i' < 5 {
                noisily display "`i' is less than five"
            }
            else {
                noisily display "`i' is at least five"
            }
        }
    } 
end

Hour 3: Survival Analysis & Data Merging#

3.1 Survival Analysis Setup (25 min)#

Basic Setup:

// Method 1: Specify follow-up time
gen followup_time = end_date - transplant_date
stset followup_time, failure(died)

// Method 2: Specify start and end dates
stset end_date, origin(transplant_date) failure(died)

// Method 3: Scale to years
stset end_date, origin(transplant_date) failure(died) scale(365.25)

Analysis Commands:

// Kaplan-Meier curves
sts graph
sts graph, by(treatment_group)

// Summary statistics
stsum
stsum, by(treatment_group)

// Test differences
sts test treatment_group

// Cox regression
stcox age gender treatment_group

3.2 Merging Datasets (35 min)#

Merge Types:

  • 1:1 - One-to-one (each key appears once in each dataset)

  • 1:m - One-to-many (master has unique keys, using may have duplicates)

  • m:1 - Many-to-one (master may have duplicates, using has unique keys)

Basic Syntax:

// Simple merge
use master_data, clear
merge 1:1 patient_id using other_data

// Merge options
merge 1:1 patient_id using other_data, keep(match)      // Keep only matched
merge 1:1 patient_id using other_data, nogen            // Don't create _merge
merge 1:1 patient_id using other_data, gen(merge_flag)  // Custom merge variable

Sequential Merges:

// Link recipients to donors via crosswalk table
use transplants, clear
merge 1:1 fake_id using donors_recipients, keep(match) nogen
merge m:1 fake_don_id using donors, keep(match) nogen keepusing(age_don)

// Check merge success
quietly count if _merge == 3
assert r(N)/_N > 0.99  // Ensure 99%+ match rate

Merge Quality Checks:

// Verify expected matches
tab _merge

// Assert perfect match if expected
assert _merge == 3

// Check match rate
display "Match rate: " r(N)/_N * 100 "%"

Quick Reference Card#

Essential Commands#

// Data manipulation
bysort group: egen mean_var = mean(variable)
egen tag = tag(group1 group2)
gen new_var = function(old_var)

// String functions
strlen(string)
strpos(string, substring)
word(string, n)
regexm(string, pattern)

// Dates
date(string, format)
format %td date_variable

// Survival analysis
stset time, failure(event)
sts graph, by(group)
stcox var1 var2

// Merging
merge 1:1 key using dataset, keep(match)
merge m:1 key using dataset, nogen

Common Pitfalls#

  1. Forgot to sort before by - Always sort first

  2. Missing egen vs gen - Use egen for group functions

  3. Date format confusion - Remember MDY, DMY, YMD order

  4. Merge without checking - Always examine _merge results

  5. String case sensitivity - Use upper() or lower() for consistency

Pro Tips#

  • Use keep(match) in merges to avoid unexpected data loss

  • Tag variables are essential for efficient graphing

  • Test string functions with simple examples first

  • Always validate survival analysis setup with stsum

  • Comment your merge logic extensively