6 Nested, Hierarchical, Multilevel, Longitudinal Data#

global repo https://github.com/jhustata/basic/raw/main/

Introduction to Debugging#


Before diving into nested data, let’s revisit the concept of debugging with a simple example from Lab 5 last week.

There’s no perfect time to introduce this topic, but many of you have likely experienced the frustration of running a script (your own or one shared with you), hitting an error, and not being able to pinpoint the issue. Understanding debugging early on will help you handle complex data structures more efficiently.

Refer to the detailed debugging steps here: Lab 5, Part I, Section 1: Settings (click on the token).

Revisiting Hardcoding#


Last week, we reviewed the basics from Weeks 1 to 4. Let’s look at the script we used and discuss how it can be adapted for different outputs, moving from .log files to .xlsx files, something you’ve all grappled with over the last few days while answering HW5.

Review the script here: Week 5 Review

cat ${repo}review.do
do ${repo}review.do

Hardcoding can be a useful starting point, but it’s essential to evolve your scripts to be dynamic and adaptable to different datasets and requirements. This involves using macros and avoiding fixed variable names or parameters wherever possible.
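
For instance, a minimal sketch of the idea (the variable name and cutoff here are illustrative, not taken from the review script):

use "${repo}transplants", clear    // any dataset works; transplants is used throughout this page

// hardcoded: the variable and threshold are baked into the line
// sum age if age > 50

// dynamic: pass them in as macros and reuse the same logic anywhere
local myvar age      // illustrative variable name
local cutoff 50      // illustrative threshold
sum `myvar' if `myvar' > `cutoff'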

Example adaptations for homework and further discussion are included in the HW5 solution (click here after 11:59 PM on May 3, 2024), which emphasizes replacing hardcoded components with flexible code suited for publication-quality outputs.

6.1 Understanding isid#

Moving away from levelsof, we focus on understanding how data is structured within datasets. This week’s themes revolve around different levels of data:

  • Visit

  • Patient

  • Hospital

  • Region

The isid command is crucial for ensuring data integrity and understanding the unique identifiers across these levels.
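
For example, a quick sketch (fake_id is the patient-level identifier used in the merge example later on this page):

use "${repo}transplants", clear
isid fake_id     // silent if fake_id uniquely identifies rows, error otherwise
//isid ctr_id    // would fail: many patients share a center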

Independently and Identically Distributed Data (i.i.d.)#

Classical statistics often assumes each observation is independently and identically distributed (i.i.d.). However, nested and multilevel data often violate this assumption, necessitating different analytical approaches.

Explore the i.i.d. assumption in detail here: i.i.d. in Statistics
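
To see the violation concretely in Stata, one quick check is the one-way ANOVA estimate of the intraclass correlation; a sketch using the transplants data introduced below (age clustered within centers):

use "${repo}transplants", clear
loneway age ctr_id   // a nonzero intraclass correlation means patients within a center are not independent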

6.2 The Concept of collapse#

When you describe aggregates like the average age or median blood pressure, you’re reducing the dimensionality of your data. This simplification, while useful, can sometimes obscure the nuances of individual data points or subgroups.

The collapse command in Stata helps in reducing data dimensions to essential summaries, but it’s crucial to be aware of the biases this might introduce, especially in hierarchical data settings. (Note: the expand command does something that is almost the opposite, but it’s really a case of “resampling” the same observations rather than an expansion of horizons to appreciate nuances and variances in real-world data; a one-line illustration follows the code below.) See Week 3, Section 3.3 for an example:

use "${repo}transplants", clear
describe ctr_id
tab ctr_id, sum(age)
//
collapse (mean) age, by(ctr_id)
list, clean
//
use transplants, clear
collapse (mean) age wait bmi, by(ctr_id)
list, clean noobs
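
As promised above, a one-line illustration of expand, purely to show the mechanics:

expand 2             // every row of the collapsed data now appears twice
list, clean noobs    // duplicated summaries, not new information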

6.3 Use of egen#

The egen command is incredibly powerful for creating complex summaries and transformations across grouped data. We’ll explore several practical examples of egen to handle multilevel data efficiently. See Week 3, Section 3.3 for an example:

bys abo: egen age_byabo = mean(age)    // mean age within each blood type
codebook age_byabo
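
A couple of further egen functions that come up constantly with multilevel data; a sketch (reloading the data so these lines stand alone):

use "${repo}transplants", clear
bys ctr_id: egen n_per_ctr = count(age)   // nonmissing-age count within each center
egen ctr_tag = tag(ctr_id)                // flags exactly one row per center
count if ctr_tag                          // number of distinct centers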

6.4 Data Integrity with preserve and restore#

Understanding how to safely manipulate datasets without losing original data is critical. The preserve and restore commands allow for temporary changes to the data, ensuring that the original structure and content are not permanently altered. See Week 3, Section 3.3 for an example:

use "${repo}transplants", clear
//preserve & restore
sum age
//note the indentation marking the temporary block
preserve
    drop if age<r(mean)    // r(mean) was stored by the sum above
    sum age
restore
sum age       // back to the full sample
di c(N)       // observations in memory
di c(k)       // variables in memory
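
When one snapshot is not enough (preserve allows only one at a time), tempfile plus save is a common alternative; a minimal sketch:

tempfile snapshot
save `snapshot'          // write a temporary copy to disk
drop if age < 30         // experiment freely
use `snapshot', clear    // return to the saved state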

6.5 Combining Data with merge#

Merging datasets is a common task in data analysis (see Week 4, Section 4.2), especially when dealing with longitudinal or multilevel data. We will cover best practices and common pitfalls in merging datasets from different sources or time points.

use "${repo}transplants", clear 
merge 1:1 fake_id ///
    using "${repo}donors_recipients", ///
    keep(match)
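
A common pitfall is keeping only the matches before checking how many records failed to match. One defensive pattern is to run the merge without keep() first and inspect _merge (a sketch):

use "${repo}transplants", clear
merge 1:1 fake_id using "${repo}donors_recipients"
tab _merge            // how many matched vs. master-only vs. using-only?
keep if _merge == 3   // then keep the matches deliberately
drop _merge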

6.6 Flexible Data Structures with reshape#

The reshape command allows for switching between wide and long formats, which is particularly useful in longitudinal studies where time-point data may need to be restructured for analysis. See Week 3, Section 3.3 for an example:

//help reshape
use ${repo}ctr_yr, clear 

//example of reshape wide
reshape wide n, i(ctr_id) j(yr)
list ctr_id n2007-n2010, clean noobs

//go back to long
reshape long

//and again
reshape wide
reshape long
reshape wide

//change missingness of n2006-n2015 variables to 0
foreach v of varlist n20* {
    replace `v' = 0 if missing(`v')
}

reshape long

//syntax for reshaping wide to long
//setup
reshape wide
reshape clear

reshape long n, i(ctr_id) j(yr)
list, clean noobs
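
If a reshape ever fails because i() and j() do not uniquely identify the rows, reshape error lists the offending observations; tying back to Section 6.1, isid states the same requirement up front:

isid ctr_id yr     // one row per center-year: what reshape wide requires
//reshape error    // after a failed reshape, lists the problem observations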
 

6.7 Other#

How would you define “exposure” to a drug in the simulation dataset below? (Note: simulation allows us to learn about real-world data without disclosure risks.)

use ${repo}strpos.dta, clear 

“Exposure” to a drug typically refers to whether and how subjects in the dataset have been administered or have taken the drug. This can include various aspects:

  1. Binary Exposure: Whether or not the subjects were exposed to the drug at any point during the study period. This is often coded as a binary variable (e.g., 1 for exposed, 0 for not exposed). Randomized trials are of this nature. Is this simplistic or true to reality?

  2. Dosage: The amount of the drug given to the subjects. This could be measured in units like mg, ml, etc., depending on the drug’s administration method. Is it cumulative dosage, or conditional (e.g., cumulative dosage in the last month)?

  3. Duration: How long the subjects have been exposed to the drug. This could be in days, months, or years. And has compliance been \(100\%\) throughout?

  4. Frequency: How often the drug was administered during the period of study (e.g., daily, weekly, monthly).

  5. Mode of Administration: How the drug was administered (e.g., oral, intravenous, topical).

Person-time might become the unit of analysis in such matters if you wish to capture reality as robustly as possible.
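
As a rough illustration of person-time as the unit of analysis, here is a toy sketch with made-up numbers (not strpos.dta itself; id, start, stop, and exposed are hypothetical variables):

clear
input id start stop exposed
1  0 30 1
1 30 90 0
2  0 60 1
end
gen days_at_risk = stop - start              // person-time contributed by each episode
collapse (sum) days_at_risk, by(id exposed)  // total person-time per person and exposure state
list, clean noobs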

In this simulation of large national registry data, the variables described above were generated from real-world distributions in the United States Renal Data System to mimic actual patient data without the risk of disclosing personal health information. The strpos.dta dataset defines aspects of drug exposure, a very complicated kind of nested data.

How many large national databases have we had some “exposure” to in this class?

  1. SRTR

  2. NHANES

  3. USRDS

  4. Linkage

These datasets have hierarchical structures along these dimensions:

  1. Person: demographic & clinical characteristics

  2. Place: geography, hospital, center

  3. Time: follow-up visit, date

Consider enrolling in the following courses to master the appropriate methods to analyze nested data:

  1. 140.655.01 Analysis of Multilevel and Longitudinal Data

  2. 140.654.01 Methods in Biostatistics IV

6.8 Lab#

Next week’s lab will involve hands-on exercises using datasets to apply some of the concepts learned about nested and hierarchical data (to be posted before Monday).

6.9 Homework#

Homework will focus on practical applications of merge, collapse, and reshape, challenging you to manipulate and analyze a provided dataset with hierarchical structures.