Stata Advanced File Management - 3 Hour Intensive

Stata Advanced File Management - 3 Hour Intensive#

Hour 1: Data Manipulation Essentials#

1.1 The `by` Command (15 min)#

Basic Syntax:

_n = row number within group
_N = total rows in group/dataset

// Create group totals and IDs
sort abo
by abo: gen cat_n = _N        // Total in each ABO group
by abo: gen cat_id = _n       // Sequential ID within group

// Shorthand: bysort (or bys)
bysort abo age: gen age_rank = _n

1.2 The `egen` Command (20 min)#

Key Functions:

// Basic statistics
egen mean_age = mean(age)
egen median_age = median(age)
egen max_age = max(age)
egen age_sd = sd(age)

// By groups
bysort dx: egen mean_age_by_dx = mean(age)
bysort abo gender: egen max_bmi = max(bmi)

1.3 Tagging with `egen` (10 min)#

// Tag one record per group
egen grouptag = tag(abo gender)
list abo gender spread if grouptag

// Useful for graphing large datasets
egen ethtag = tag(ethcat)
bysort ethcat: egen mean_bmi = mean(bmi)

1.4 Essential Functions (15 min)#

Numeric:

// Rounding
floor(8.9)    // Round down: 8
ceil(8.9)     // Round up: 9
round(8.9)    // Round nearest: 9

// Math
min(8,6,7,5)  // Minimum value
max(8,6,7,5)  // Maximum value
exp(1)        // e^1
ln(20)        // Natural log
sqrt(729)     // Square root
abs(-6)       // Absolute value

Hour 2: String Manipulation & Data Types#

2.1 String Functions (25 min)#

Core String Functions:

// Extract words
word("Hello world", 2)     // Returns "world"

// String length
strlen("Same as it ever was")  // Returns 19

// Find position
strpos("history", "story")     // Returns 4
strpos("history", "xyz")       // Returns 0 (not found)

// Pattern matching
regexm("Earth", "art")         // Returns 1 (true)
regexm("team", "I")            // Returns 0 (false)

// Advanced regex
regexm(text, "^A")             // Starts with A
regexm(text, "X$")             // Ends with X

Practical Example: Drug Classification

// Check for antihistamine exposure
gen has_benadryl = (strpos(drug_name, "DIPHENHYDRAMINE") != 0)
gen has_hydroxyzine = (strpos(drug_name, "HYDROXYZINE") != 0)
gen any_antihistamine = (has_benadryl | has_hydroxyzine)

2.2 Date Handling (20 min)#

Date Storage & Conversion:

// Stata stores dates as days since Jan 1, 1960
disp %td 19400    // Shows as date format

// Convert string to date
gen date_var = date("August 15, 1969", "MDY")
gen date_var2 = date("2061 28 July", "YDM")

// Date arithmetic
gen oneweek_later = transplant_date + 7
format %td oneweek_later

2.3 Code Quality Practices (15 min)#

Essential Habits:

Indent your code - use consistent spacing
Test iteratively - run small sections
Comment extensively - explain your logic
Use descriptive variable names

// Bad
program define myprogram
quietly {
forvalues i=1/10 {
if `i' < 5 {
noisily disp "`i' is less than five" }
else {
noisily disp "`i' is at least five" }
}
} 
end

// Good
program define display_numbers
    quietly {
        forvalues i = 1/10 {
            if `i' < 5 {
                noisily display "`i' is less than five"
            }
            else {
                noisily display "`i' is at least five"
            }
        }
    } 
end

Hour 3: Survival Analysis & Data Merging#

3.1 Survival Analysis Setup (25 min)#

Basic Setup:

// Method 1: Specify follow-up time
gen followup_time = end_date - transplant_date
stset followup_time, failure(died)

// Method 2: Specify start and end dates
stset end_date, origin(transplant_date) failure(died)

// Method 3: Scale to years
stset end_date, origin(transplant_date) failure(died) scale(365.25)

Analysis Commands:

// Kaplan-Meier curves
sts graph
sts graph, by(treatment_group)

// Summary statistics
stsum
stsum, by(treatment_group)

// Test differences
sts test treatment_group

// Cox regression
stcox age gender treatment_group

3.2 Merging Datasets (35 min)#

Merge Types:

1:1 - One-to-one (each key appears once in each dataset)
1:m - One-to-many (master has unique keys, using may have duplicates)
m:1 - Many-to-one (master may have duplicates, using has unique keys)

Basic Syntax:

// Simple merge
use master_data, clear
merge 1:1 patient_id using other_data

// Merge options
merge 1:1 patient_id using other_data, keep(match)      // Keep only matched
merge 1:1 patient_id using other_data, nogen            // Don't create _merge
merge 1:1 patient_id using other_data, gen(merge_flag)  // Custom merge variable

Sequential Merges:

// Link recipients to donors via crosswalk table
use transplants, clear
merge 1:1 fake_id using donors_recipients, keep(match) nogen
merge m:1 fake_don_id using donors, keep(match) nogen keepusing(age_don)

// Check merge success
quietly count if _merge == 3
assert r(N)/_N > 0.99  // Ensure 99%+ match rate

Merge Quality Checks:

// Verify expected matches
tab _merge

// Assert perfect match if expected
assert _merge == 3

// Check match rate
display "Match rate: " r(N)/_N * 100 "%"

Quick Reference Card#

Essential Commands#

// Data manipulation
bysort group: egen mean_var = mean(variable)
egen tag = tag(group1 group2)
gen new_var = function(old_var)

// String functions
strlen(string)
strpos(string, substring)
word(string, n)
regexm(string, pattern)

// Dates
date(string, format)
format %td date_variable

// Survival analysis
stset time, failure(event)
sts graph, by(group)
stcox var1 var2

// Merging
merge 1:1 key using dataset, keep(match)
merge m:1 key using dataset, nogen

Common Pitfalls#

Forgot to sort before by - Always sort first
Missing egen vs gen - Use egen for group functions
Date format confusion - Remember MDY, DMY, YMD order
Merge without checking - Always examine _merge results
String case sensitivity - Use upper() or lower() for consistency

Pro Tips#

Use keep(match) in merges to avoid unexpected data loss
Tag variables are essential for efficient graphing
Test string functions with simple examples first
Always validate survival analysis setup with stsum
Comment your merge logic extensively