Randomized clinical trials are generally considered the 'gold standard' for establishing causal relationships because of their ability to balance the distributions of subject characteristics across treatment groups. Because treatment assignment is not confounded with patients' baseline characteristics, the treatment effect can be estimated simply by comparing outcomes between the treated and untreated groups. Owing to ethical and other concerns, however, randomized trials are not always an option, and researchers often rely on observational study designs to investigate the relationship between an outcome and exposures and other covariates. In this dissertation, we investigate statistical methods for analyzing correlated data from observational studies.
First, we consider case-cohort studies with multiple disease outcomes. The case-cohort design is widely used in large cohort studies when it is prohibitively costly to measure some exposures for all subjects in the full cohort, especially in studies where the disease rate is low. To investigate the effect of a risk factor on different diseases, multiple case-cohort studies using the same subcohort are usually conducted. To compare the effects of a risk factor on different types of disease, the times to the different disease events need to be modeled simultaneously. Existing case-cohort estimators for multiple disease outcomes use the relevant covariate information only from cases and subcohort controls, even though many covariates are measured for everyone in the full cohort. Intuitively, making full use of the relevant covariate information can improve efficiency. To this end, we consider a class of doubly weighted estimators for both regular and generalized case-cohort studies with multiple disease outcomes. We derive the asymptotic properties of the proposed estimators, and our simulation studies show that a gain in efficiency can be achieved with a properly chosen weight function. We illustrate the proposed method with a data set from the Atherosclerosis Risk in Communities (ARIC) study.
Second, we investigate the marginal structural Cox model for clusters of correlated failure time observations. In causal inference, the marginal structural Cox model has been widely used to analyze time-to-event data arising from observational studies in which observations are assumed to be independent. In many studies, however, subjects in the same community or clinic form natural clusters and are thus correlated. For example, in the INSPIRIS Inc. home visiting provider program, participants from the same region are considered to be in the same cluster. We formulate the marginal structural Cox model for this type of data and prove the consistency and asymptotic normality of the estimator. Simulation studies show that the marginal structural Cox model performs well, yielding unbiased estimates and satisfactory confidence interval coverage. We apply the proposed method to claims data to assess the effectiveness of the INSPIRIS home visiting health care program.
Third, we study cluster-based probability-dependent sampling (PDS). Because all studies are conducted with a limited budget, the maximum study size is often restricted by the cost of exposure ascertainment. When the outcome is continuous, the two-stage PDS is an appealing sampling scheme that allows investigators to over-sample the two distributional tails of the continuous exposure and to obtain a more informative sample than a simple random sample (SRS), without knowing the functional form of the underlying relationship between exposure and outcome. In the Collaborative Perinatal Project (CPP), subjects are clustered within each participating clinic, so statistical methods need to properly account for cluster-level random effects under the PDS scheme. We propose estimation and inference procedures based on a semiparametric profile likelihood function, and we show that our estimator is consistent and asymptotically normal. In simulation studies, our cluster-based PDS method provides more efficient estimators than linear mixed-effects models fitted to an SRS of the same size. We also apply the method to a data set from the CPP.