Many epidemiologic observational studies seek to relate a continuous outcome variable to an environmental exposure and other covariates through a specified regression function indexed by a set of unknown parameters. While observations on the outcome are often easy or cheap to obtain, measuring the exposure can be more difficult or expensive. In these situations, the outcome is often observed for every member of a finite study population, whereas exposure measurements are obtained only for a subsample of this population. When selection into the subsample depends on the observed outcome values, the design is referred to as outcome-dependent sampling (ODS). Under ODS, the marginal distribution function of the covariates, <italic>G<sub>X</sub></italic>, acts as an infinite-dimensional nuisance parameter and is inextricably tied into the likelihood function.
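To make the sampling scheme concrete, the following is a minimal sketch of one possible ODS design, given purely as an illustration and not as the design studied in this dissertation. It assumes a simple random sample supplemented by extra draws from the lower and upper tails of the outcome distribution. The function name `ods_sample`, the quantile cutoffs, the sample sizes, and the linear data-generating model are all hypothetical choices.

```python
import numpy as np

def ods_sample(y, n_srs, n_tail, lower_q=0.1, upper_q=0.9, seed=0):
    """Return sorted indices selected under an illustrative ODS design.

    Hypothetical design: a simple random sample (SRS) of size n_srs,
    plus up to n_tail supplemental units from each tail of the outcome.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    lo, hi = np.quantile(y, [lower_q, upper_q])

    # Stage 1: simple random sample from the whole study population.
    srs = rng.choice(n, size=n_srs, replace=False)

    # Stage 2: supplemental draws that depend only on the observed outcome.
    remaining = np.setdiff1d(np.arange(n), srs)
    low_pool = remaining[y[remaining] < lo]
    high_pool = remaining[y[remaining] > hi]
    low = rng.choice(low_pool, size=min(n_tail, len(low_pool)), replace=False)
    high = rng.choice(high_pool, size=min(n_tail, len(high_pool)), replace=False)

    return np.sort(np.concatenate([srs, low, high]))

# Assumed toy population: the outcome y is observed for everyone,
# the exposure x is measured only for the selected subsample.
rng = np.random.default_rng(1)
N = 2000
x = rng.normal(size=N)
y = 1.0 + 0.5 * x + rng.normal(size=N)

selected = ods_sample(y, n_srs=200, n_tail=50)

# Exposure is missing (NaN) for every unit outside the subsample.
x_observed = np.full(N, np.nan)
x_observed[selected] = x[selected]
```

In this sketch the outcome `y` remains available for all `N` units, while `x_observed` is missing outside the subsample. This is the data structure that the estimators considered here are designed to exploit.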
In this dissertation, we consider two semiparametric methods for estimating the regression parameters that do not require specifying a parametric form for <italic>G<sub>X</sub></italic>. The first method is a semiparametric maximum likelihood estimator that directly extends recent work by Lawless et al. (1999, <italic>J. Roy. Statis. Soc. B</italic>); the second method extends the estimated likelihood methods developed by Pepe and Fleming (1991, <italic>J. Amer. Statis. Assoc.</italic>). Both methods incorporate data observed for members of the study population who were not selected into the subsample and for whom exposure measurements are missing. The primary goal of this research is to show that these additional data can be used to obtain more efficient parameter estimates.
We show that both estimators are consistent and asymptotically normal, and we develop consistent estimators for the corresponding asymptotic variance matrices. Using simulated data, we study the small-sample properties of both estimators and compare the proposed methods to several other estimators that could be applied to the ODS problem. We also apply the proposed methods to data from a large environmental epidemiologic study. The results of these applications support the claim that significant efficiency gains can be achieved by incorporating all available data into the parameter estimates.