A relevant s-matching classifier for the covariate shift machine learning problem
Date Issued
June 23, 2014
Author(s)
Abstract
In Machine Learning (ML), the x-covariate and its y-label may have different
joint probability distributions in the learning (also called training) and the
test populations. This situation occurs often in practice and is studied, among
others, in the covariate shift problem (Bickel et al., 2007), sample selection bias
(Zadrozny, 2004), domain adaptation (Daumé and Marcu, 2006) and distance
ML (Cao et al., 2009). For any loss l, the classifier minimizing, over a collection
of classifiers of interest d ∈ D, the (statistical) risk E l(d(x), y) in the training
population may not be the risk's minimizer over D in the test population;
see, for example, Bickel et al. (2007, section 2). An additional problem not
addressed is whether the whole learning population or its available subset
(called learning "data" or sample) is relevant when obtaining a classifier for
the test population or the test data.
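The first problem can be illustrated with a small sketch. All numbers below are hypothetical and are not taken from the paper: x and y are binary, P(y | x) is shared by both populations (the covariate shift assumption), only the distribution of x changes, and D is restricted to the two constant classifiers under 0-1 loss.

```python
# Hypothetical toy populations: x in {0, 1}, y in {0, 1}.
P_LEARN_X = {0: 0.8, 1: 0.2}     # distribution of x in the learning population
P_TEST_X = {0: 0.2, 1: 0.8}      # distribution of x in the test population
P_Y1_GIVEN_X = {0: 0.1, 1: 0.7}  # P(y = 1 | x), assumed equal in both populations


def risk(p_x, d):
    """Exact 0-1 risk E l(d(x), y) of classifier d when x follows p_x."""
    total = 0.0
    for x, px in p_x.items():
        p1 = P_Y1_GIVEN_X[x]
        # Predicting 0 errs when y = 1; predicting 1 errs when y = 0.
        total += px * (p1 if d(x) == 0 else 1.0 - p1)
    return total


# The collection D of classifiers of interest: two constant rules.
D = {"always_0": lambda x: 0, "always_1": lambda x: 1}

best_learn = min(D, key=lambda name: risk(P_LEARN_X, D[name]))
best_test = min(D, key=lambda name: risk(P_TEST_X, D[name]))
```

In this toy setting the risk minimizer over D is `always_0` in the learning population (risk 0.22 against 0.78) and `always_1` in the test population (risk 0.42 against 0.58), so the two minimizers differ even though P(y | x) is unchanged.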
Both problems are solved herein using tools from causal inference, the minimal
sufficient statistic S or equivalently ratios of generalized propensity scores
(Yatracos, 2011), to identify relevant "matching" groups of x-covariates; x1
and x2 belong to the same matching group when S(x1) ≈ S(x2). Due to sufficiency,
the conditional risks on each S-matching group coincide for the learning
and the test distributions of (x, y) and are minimized by the same classifier
that is used to predict the y-label in the test population. When D consists, for
example, of linear classifiers in x, the classifier obtained herein via matching
will consist of piecewise linear classifiers, one for each matching group. This
approach solves directly the problem of obtaining the same piecewise classifier
both for the training and the test populations and reduces the mean square
error.
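The matching idea can be sketched as follows. This is only an illustration under stated assumptions, not the paper's procedure: S(x) is taken to be the ratio of two known normal densities of x (test over learning), matching groups are formed by binning S at hypothetical cut points `edges`, and the linear classifier in each group is the sign of a least-squares line. All function names are made up for this sketch.

```python
import math


def normal_pdf(x, mu, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def s_value(x, mu_learn=0.0, mu_test=1.0):
    # Assumed form of S: test density of x divided by learning density of x.
    return normal_pdf(x, mu_test) / normal_pdf(x, mu_learn)


def group_of(x, edges):
    # Matching group of x = index of the S-bin that contains S(x).
    s = s_value(x)
    return sum(1 for e in edges if s > e)


def fit_line(xs, ys):
    # One-dimensional least-squares fit y ~ a + b * x; returns (a, b).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = cov / var if var > 0 else 0.0
    return my - b * mx, b


def fit_matched(xs, ys, edges):
    # Fit one linear rule per matching group, using only that group's data.
    groups = {}
    for x, y in zip(xs, ys):
        gx, gy = groups.setdefault(group_of(x, edges), ([], []))
        gx.append(x)
        gy.append(y)
    return {g: fit_line(gx, gy) for g, (gx, gy) in groups.items()}


def predict(x, models, edges):
    # Pick the model of the matching group of x and return a label in {-1, 1}.
    g = group_of(x, edges)
    if g not in models:
        raise ValueError("no learning data in the matching group of x")
    a, b = models[g]
    return 1 if a + b * x >= 0 else -1


# Toy learning data whose labels no single linear rule in x can reproduce.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [-1, -1, 1, 1, 1, -1]
edges = [1.0]  # two matching groups: S(x) <= 1 and S(x) > 1
models = fit_matched(xs, ys, edges)
preds = [predict(x, models, edges) for x in xs]
```

With these toy data the two per-group lines reproduce all six labels, which a single linear rule in x cannot do. A test point whose matching group contains no learning data raises an error, reflecting the idea that only learning data relevant to the test point should be used.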
The above description is now supplemented with the comments of a reader
trained in ML: "the basic idea of the paper seems sensible: learn localized
models for different regions of the input space, as defined by similarity to the
test distribution. Then, pick the appropriate model for a given test example,
and use this to make a prediction."
Previously, S(x) was used as a weight to adjust the log-likelihood function
for covariate shift and improve predictive inference (Shimodaira, 2000, p. 231);
x-covariates with the same S-value have equal "importance" (see, for example,
Shimodaira, 2000 or Zadrozny, 2004). More recently, S has been used to adjust
the loss function l to the randomized l* = Sl and obtain the same optimal classifier
in the l*-risk and l-risk minimization problems (Bickel et al., 2007).
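The effect of weighting the loss by S can be checked on a toy example. The sketch below rests on assumptions of its own: x and y are binary, P(y | x) is the same in both populations, and S(x) is taken to be the ratio of the test to the learning probability of x. All numbers are hypothetical. Under these assumptions the S-weighted risk computed in the learning population equals the unweighted risk in the test population for every classifier.

```python
# Hypothetical toy populations: x in {0, 1}, y in {0, 1}.
P_LEARN_X = {0: 0.8, 1: 0.2}
P_TEST_X = {0: 0.2, 1: 0.8}
P_Y1_GIVEN_X = {0: 0.1, 1: 0.7}  # P(y = 1 | x), shared by both populations


def s_weight(x):
    # Assumed form of S: test probability of x over learning probability of x.
    return P_TEST_X[x] / P_LEARN_X[x]


def risk(p_x, d, weight=None):
    """Exact (optionally weighted) 0-1 risk of d when x follows p_x."""
    total = 0.0
    for x, px in p_x.items():
        p1 = P_Y1_GIVEN_X[x]
        error_prob = p1 if d(x) == 0 else 1.0 - p1
        w = 1.0 if weight is None else weight(x)
        total += px * w * error_prob
    return total


# All four deterministic classifiers on a binary covariate.
CLASSIFIERS = {
    "always_0": lambda x: 0,
    "always_1": lambda x: 1,
    "identity": lambda x: x,
    "flip": lambda x: 1 - x,
}

weighted_learn = {n: risk(P_LEARN_X, d, s_weight) for n, d in CLASSIFIERS.items()}
test_risk = {n: risk(P_TEST_X, d) for n, d in CLASSIFIERS.items()}
```

Here `weighted_learn` and `test_risk` agree up to rounding for each classifier, so minimizing the S-weighted risk on the learning population selects the same classifier as minimizing the risk on the test population.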
In applications with learning and test data, conditional risk minimization
via S allows for the use of learning data relevant to the test data and reduces
potential sampling bias as well as the intensity of the optimization problem
when the sizes of the training data and the covariates' dimension are large.
With k (> 1) learning (x, y)-populations and the test population, the use
of Shimodaira's S factor is not possible unless the mixture distribution of the
learning populations is available. In this case, for several related "tasks", i.e.
parameters in the densities of the learning populations, Bickel et al. (2009)
provided for task t the Shimodaira-type weight r_t(x, y) and its estimate, in
order to "train a hypothesis for task t by minimizing the expected loss over
the distributions of all tasks", i.e. for the learning mixture distribution. It is
seen herein that r_t is the minimal sufficient statistic for the test distribution
of task t and the learning mixture distribution.
When the learning mixture distribution is unknown but the covariate shift
distributional assumption holds for the (k+1) populations, the k-dimensional
minimal sufficient statistic S is used to obtain matching groups of x-covariates
and the corresponding classifiers. With learning samples, conditional risk minimization
on S-matched groups pooled together from all learning populations
is used to obtain the corresponding classifiers as in the case k = 1. This matching
approach has been recently used with multiple treatments ("tasks" in ML),
when the data is obtained from an observational study (Yatracos, 2011).
For the interested reader, a recent review on matching, propensity scores
and causal inference is presented in Stuart (2010). In sections 2-4, results are
presented for k = 1; a brief description of the results for k > 1 is in section 5.
File(s)
Name
matchingandml14Aweb.pdf
Size
91.84 KB
Format
Adobe PDF
Checksum (MD5)
5673b26b5bcaf528868fe041d81336ce

