CASE WESTERN RESERVE UNIVERSITY
STATISTICS COLLOQUIUM
Abstract
Prediction is often based on a model that relates a response variable to a set of covariates. For reasons of cost and practicality, a larger set of covariates may be available at the model development stage than in a later application of the model for prediction. There are two approaches to handling covariates that will not be available later. The first simply ignores those covariates, even at the model development stage. The second applies a method similar to the simultaneous equation approach, in which the covariates that will be unavailable are estimated from the others. A prediction rule is then constructed from these equations, resulting in a "reduced" model that relates the response to the available predictors only. Both methods provide almost unbiased predictions. The purpose of this study is to compare their efficiencies. We find that when the unavailable covariates are in fact unrelated to the response variable, the two methods have the same asymptotic efficiency; otherwise, neither method is uniformly better than the other in terms of asymptotic efficiency. In specific situations, the asymptotic relative efficiency of the two methods can be estimated so that the better method can be selected.
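The two approaches can be written out in notation. What follows is only a sketch under assumed notation that does not appear in the abstract: Y is the response, X_a the covariates available at prediction time, X_u the covariates unavailable later, and f, g, h with parameters beta, theta, gamma are hypothetical model forms. The linear special case and the variance-ratio form of asymptotic relative efficiency are illustrative assumptions, not the definitions used in the talk.

```latex
% Direct approach: model the response on the available covariates only.
\[
  \mathrm{E}[Y \mid X_a] = g(X_a; \theta)
\]

% Reduced-model approach: fit a full model for Y and a model for X_u given X_a,
% then combine them by iterated expectation.
\[
  \mathrm{E}[Y \mid X_a, X_u] = f(X_a, X_u; \beta),
  \qquad
  \mathrm{E}[X_u \mid X_a] = h(X_a; \gamma)
\]
\[
  \mathrm{E}[Y \mid X_a] = \mathrm{E}\bigl[\, f(X_a, X_u; \beta) \mid X_a \,\bigr]
\]

% Assumed linear special case: f and h both linear.
\[
  f = \beta_0 + \beta_a^{\top} X_a + \beta_u^{\top} X_u,
  \quad
  h = \gamma_0 + \Gamma X_a
  \;\Longrightarrow\;
  \mathrm{E}[Y \mid X_a]
  = (\beta_0 + \beta_u^{\top} \gamma_0)
  + (\beta_a^{\top} + \beta_u^{\top} \Gamma)\, X_a
\]

% One common way to express asymptotic relative efficiency (assumed here):
% the limiting ratio of the variances of the two predictors at a fixed X_a.
\[
  \mathrm{ARE}(X_a)
  = \lim_{n \to \infty}
    \frac{\operatorname{Var}\bigl(\hat{Y}_{\mathrm{direct}}(X_a)\bigr)}
         {\operatorname{Var}\bigl(\hat{Y}_{\mathrm{reduced}}(X_a)\bigr)}
\]
```

Under this convention, a ratio above 1 would favor the reduced model and a ratio below 1 would favor ignoring the unavailable covariates from the start.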