
Logistic Regression with an Auxiliary Data Source
Xuejun Liao (xjliao@ee.duke.edu), Ya Xue (yx10@ee.duke.edu), Lawrence Carin (lcarin@ee.duke.edu)
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708
Abstract
To achieve good generalization in supervised learning, the training and testing examples are usually required to be drawn from the same source distribution. In this paper we propose a method to relax this requirement in the context of logistic regression. Assuming D^p and D^a are two sets of examples drawn from two mismatched distributions, where D^a are fully labeled and D^p partially labeled, our objective is to complete the labels of D^p. We introduce an auxiliary variable μ for each example in D^a to reflect its mismatch with D^p. Under an appropriate constraint the μ's are estimated as a byproduct, along with the classifier. We also present an active learning approach for selecting the labeled examples in D^p. The proposed algorithm, called "Migratory-Logit" or M-Logit, is demonstrated successfully on simulated as well as real data sets.
1. Introduction
In supervised learning problems, the goal is to design a classifier using the training examples (labeled data) D^tr = {(x^tr_i, y^tr_i)}_{i=1}^{N^tr} such that the classifier predicts the label y^p_i correctly for the unlabeled primary test data D^p = {(x^p_i, y^p_i) : y^p_i missing}_{i=1}^{N^p}. The accuracy of the predictions is significantly affected by the quality of D^tr, which is assumed to contain essential information about D^p. A common assumption utilized by learning algorithms is that D^tr are a sufficient sample of the same source distribution from which D^p are drawn. Under this assumption, a classifier designed based on D^tr will generalize well when it is tested on D^p. This assumption, however, is often violated in practice. First, in many applications labeling an observation is an expensive process, resulting in insufficient labeled data in D^tr that are not able to characterize the statistics of the primary data. Second, D^tr and D^p are typically collected under different experimental conditions and therefore often exhibit differences in their statistics.
Methods to overcome the insufficiency of labeled data have been investigated in the past few years under the names "active learning" [Cohn et al., 1995; Krogh & Vedelsby, 1995] and "semi-supervised learning" [Nigam et al., 2000], which we do not discuss here, though we will revisit active learning in Section 5.

The problem of data mismatch has been studied in econometrics, where the available D^tr are often a non-randomly selected sample of the true distribution of interest. Heckman (1979) developed a method to correct the sample-selection bias for linear regression models. The basic idea of Heckman's method is that if one can estimate the probability of an observation being selected into the sample, one can use this probability estimate to correct the selection bias.
Heckman's model has recently been extended to classification problems [Zadrozny, 2004], where it is assumed that the primary test data D^p ~ Pr(x, y) while the training examples D^tr = D^a ~ Pr(x, y | s = 1), where the variable s controls the selection of D^a: if s = 1, (x, y) is selected into D^a; if s = 0, (x, y) is not selected into D^a. Evidently, unless s is independent of (x, y), Pr(x, y | s = 1) ≠ Pr(x, y) and hence D^a are mismatched with D^p. By Bayes rule,

Pr(x, y) = [Pr(s = 1) / Pr(s = 1 | x, y)] Pr(x, y | s = 1)    (1)

which implies that if one has access to Pr(s = 1) / Pr(s = 1 | x, y), one can correct the mismatch by weighting and resampling [Zadrozny et al., 2003; Zadrozny, 2004]. In the special case when Pr(s = 1 | x, y) = Pr(s = 1 | x), one may estimate Pr(s = 1 | x) from a sufficient sample of Pr(x, s) if such a sample is available [Zadrozny, 2004]. In the general case, however, it is difficult to estimate Pr(s = 1) / Pr(s = 1 | x, y), as we do not have a sufficient sample of Pr(x, y, s) (if we did, we would already have a sufficient sample of Pr(x, y), which contradicts the assumption of the problem).
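To make the weighting idea of (1) concrete, the sketch below fits a logistic regression in which every example carries a weight standing in for Pr(s = 1) / Pr(s = 1 | x, y). This is not the M-Logit method developed in this paper, only an illustration of the correction that (1) suggests; the array names, the assumption that the weights are supplied externally, and the plain gradient-ascent settings are all assumptions made for the example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def weighted_logistic_fit(X, y, weights, n_iter=500, lr=0.5):
        """Maximize sum_i weights[i] * log sigmoid(y_i w^T x_i) by gradient ascent.
        X: (N, d+1) with a leading column of ones; y in {-1, +1};
        weights[i] plays the role of Pr(s=1)/Pr(s=1|x_i, y_i) in equation (1)."""
        N, d1 = X.shape
        w = np.zeros(d1)
        for _ in range(n_iter):
            margins = y * (X @ w)
            grad = X.T @ (weights * y * sigmoid(-margins))
            w += lr * grad / N
        return w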
In this paper we consider the case in which we have a fully labeled auxiliary data set D^a and a partially labeled primary data set D^p = D^p_l ∪ D^p_u, where D^p_l are labeled and D^p_u unlabeled. We assume D^p and D^a are drawn from two distributions that are mismatched. Our objective is to use a mixed training set D^tr = D^p_l ∪ D^a to train a classifier that predicts the labels of D^p_u accurately. Assume D^p ~ Pr(x, y). In light of equation (1), we can write D^a ~ Pr(x, y | s = 1) as long as the source distributions of D^p and D^a have the same domain of nonzero probability.[1] As explained in the previous paragraph, it is difficult to correct the mismatch by directly estimating Pr(s = 1) / Pr(s = 1 | x, y). Therefore we take an alternative approach. We introduce an auxiliary variable μ_i for each (x^a_i, y^a_i) ∈ D^a to reflect its mismatch with D^p and to control its participation in the learning process. The μ's play a similar role as the weighting factors Pr(s = 1) / Pr(s = 1 | x, y) in (1). However, unlike the weighting factors, the auxiliary variables are estimated along with the classifier in the learning. We employ logistic regression as a specific classifier and develop our method in this context.
A related problem has been studied in [Wu & Dietterich, 2004], where the classifier is trained on two fixed and labeled data sets D^p and D^a, where D^a is of lower quality and provides weaker evidence for the classifier design. The problem is approached by minimizing a weighted sum of two separate loss functions, with one defined for the primary data and the other for the auxiliary data. Our method is distinct from that in [Wu & Dietterich, 2004] in two respects. First, we introduce an auxiliary variable μ_i for each (x^a_i, y^a_i) ∈ D^a and the auxiliary variables are estimated along with the classifier. A large μ_i implies a large mismatch of (x^a_i, y^a_i) with D^p and accordingly less participation of x^a_i in learning the classifier. Second, we present an active learning strategy to define D^p_l ⊂ D^p when D^p is initially fully unlabeled.
The remainder of the paper is organized as follows. A detailed description of the proposed method is provided in Section 2, followed by a description of a fast learning algorithm in Section 3 and a theoretical discussion in Section 4. In Section 5 we present a method to actively define D^p_l when D^p_l is initially empty. We demonstrate example results in Section 6. Finally, Section 7 contains the conclusions.

[1] For any Pr(x, y | s = 1) ≠ 0 and Pr(x, y) ≠ 0, there exists Pr(s = 1) / Pr(s = 1 | x, y) = Pr(x, y) / Pr(x, y | s = 1) ∈ (0, ∞) such that equation (1) is satisfied. For Pr(x, y | s = 1) = Pr(x, y) = 0, any Pr(s = 1) / Pr(s = 1 | x, y) ≠ 0 makes equation (1) satisfied.
2. Migratory-Logit: Learning Jointly on the Primary and Auxiliary Data
We assume D^p_l are fixed and nonempty, and without loss of generality, we assume D^p_l are always indexed prior to D^p_u: D^p_l = {(x^p_i, y^p_i)}_{i=1}^{N^p_l} and D^p_u = {(x^p_i, y^p_i) : y^p_i missing}_{i=N^p_l+1}^{N^p}. We use N^a, N^p, and N^p_l to denote the size (number of data points) of D^a, D^p, and D^p_l, respectively. In Section 5 we discuss how to actively determine D^p_l when D^p_l is initially empty.
We consider the binary classification problem and the labels y^a, y^p ∈ {−1, 1}. For notational simplicity, we let x always include a 1 as its first element to accommodate a bias (intercept) term, thus x^p, x^a ∈ R^{d+1}, where d is the number of features. For a primary data point (x^p_i, y^p_i) ∈ D^p_l, we follow standard logistic regression to write

Pr(y^p_i | x^p_i; w) = σ(y^p_i w^T x^p_i)    (2)

where w ∈ R^{d+1} is a column vector of classifier parameters and σ(u) = 1 / (1 + exp(−u)) is the sigmoid function.
For an auxiliary data point (x^a_i, y^a_i) ∈ D^a, we define

Pr(y^a_i | x^a_i; w, μ_i) = σ(y^a_i w^T x^a_i + y^a_i μ_i)    (3)

where μ_i is an auxiliary variable. Assuming the examples in D^p_l and D^a are drawn i.i.d., we have the log-likelihood function

ℓ(w, μ; D^p_l ∪ D^a) = Σ_{i=1}^{N^p_l} ln σ(y^p_i w^T x^p_i) + Σ_{i=1}^{N^a} ln σ(y^a_i w^T x^a_i + y^a_i μ_i)    (4)

where μ = [μ_1, ..., μ_{N^a}]^T is a column vector of all the auxiliary variables.
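For concreteness, the log-likelihood (4) can be written down directly in code. The sketch below is illustrative, assuming numpy arrays Xp, yp for D^p_l and Xa, ya, mu for D^a (names chosen here, not taken from the paper), with each design matrix already carrying the leading column of ones.

    import numpy as np

    def log_sigmoid(z):
        # numerically stable log(sigmoid(z)) = -log(1 + exp(-z))
        return -np.logaddexp(0.0, -z)

    def mlogit_loglik(w, Xp, yp, Xa, ya, mu):
        """Equation (4): primary plus auxiliary log-likelihood terms."""
        primary = log_sigmoid(yp * (Xp @ w)).sum()
        auxiliary = log_sigmoid(ya * (Xa @ w) + ya * mu).sum()
        return primary + auxiliary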
The auxiliary variable μ_i is introduced to reflect the mismatch of (x^a_i, y^a_i) with D^p and to control its participation in the learning of w. A larger y^a_i μ_i makes Pr(y^a_i | x^a_i; w, μ_i) less sensitive to w. When y^a_i μ_i = ∞, Pr(y^a_i | x^a_i; w, μ_i) = 1 becomes completely independent of w. Geometrically, μ_i is an extra intercept term that is uniquely associated with x^a_i and causes it to migrate towards class y^a_i. If (x^a_i, y^a_i) is mismatched with the primary data D^p, w cannot make Σ_{i=1}^{N^p_l} ln σ(y^p_i w^T x^p_i) and ln σ(y^a_i w^T x^a_i) large at the same time. In this case x^a_i will be given an appropriate μ_i to allow it to migrate towards class y^a_i, so that w is less sensitive to (x^a_i, y^a_i) and can focus more on fitting D^p_l. Evidently, if the μ's are allowed to change freely, their influence will override that of w in fitting the auxiliary data D^a, and then D^a will not participate in learning w. To prevent this from happening, we introduce constraints on μ_i and maximize the log-likelihood subject to the constraints:
max_{w, μ}  ℓ(w, μ; D^p_l ∪ D^a)    (5)

subject to

(1/N^a) Σ_{i=1}^{N^a} y^a_i μ_i ≤ C,   C ≥ 0    (6)

y^a_i μ_i ≥ 0,   i = 1, 2, ..., N^a    (7)

where the inequalities in (7) reflect the fact that in order for x^a_i to fit y^a_i = 1 (or y^a_i = −1) we need to have μ_i > 0 (or μ_i < 0), if we want μ_i to exert a positive influence in the fitting process. Under the constraints in (7), a larger value of y^a_i μ_i represents a larger mismatch between (x^a_i, y^a_i) and D^p and accordingly makes (x^a_i, y^a_i) play a less important role in determining w. The classifier resulting from solving the problem in (5)-(7) is referred to as "Migratory-Logit" or "M-Logit".

The C in (6) reflects the average mismatch between D^a and D^p and controls the average participation of D^a in determining w. It can be learned from data if we have a reasonable amount of D^p_l. However, in practice we usually have little or no D^p_l to begin with. In this case, we must rely on other information to set C. We will come back to a more detailed discussion of C in Section 4.
3. Fast Learning Algorithm
The optimization problem in (5), (6), and (7) is concave and any standard technique can be utilized to find the global maximum. However, there is a unique μ_i associated with every (x^a_i, y^a_i) ∈ D^a, and when D^a is large, using a standard method to estimate the μ's can consume most of the computational time.
In this section we give a fast algorithm for training the M-Logit, by taking a block-coordinate ascent approach [Bertsekas, 1999], in which we alternately solve for w and μ, keeping one fixed while solving for the other. The algorithm draws its efficiency from the analytic solution of μ, which we establish in the following theorem. Proof of the theorem is given in the appendix, and Section 4 contains a discussion that helps to understand the theorem from an intuitive perspective.

Theorem 1: Let f(z) be a twice continuously differentiable function with second derivative f''(z) < 0 for any z ∈ R. Let b_1 ≤ b_2 ≤ ... ≤ b_N, R ≥ 0, and

n = max{ m : m b_m − Σ_{i=1}^{m} b_i ≤ R, 1 ≤ m ≤ N }    (8)

Then the problem

max_{z_i} Σ_{i=1}^{N} f(b_i + z_i)    (9)

subject to

Σ_{i=1}^{N} z_i ≤ R,   R ≥ 0    (10)

z_i ≥ 0,   i = 1, 2, ..., N    (11)

has a unique global solution

z_i = (1/n) Σ_{j=1}^{n} b_j + (1/n) R − b_i   for 1 ≤ i ≤ n,
z_i = 0   for n < i ≤ N    (12)
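The solution (12) takes only a few lines to compute. The following sketch is a direct transcription of (8) and (12), assuming the inputs b_1, ..., b_N are already sorted in ascending order; it is written here for illustration and is not taken from the authors' implementation.

    import numpy as np

    def theorem1_solution(b, R):
        """b: ascending 1-D array (b_1 <= ... <= b_N); R >= 0.
        Returns z maximizing sum_i f(b_i + z_i) for any strictly concave f,
        subject to sum(z) <= R and z >= 0, per equations (8) and (12)."""
        b = np.asarray(b, dtype=float)
        N = len(b)
        cumsum = np.cumsum(b)
        m = np.arange(1, N + 1)
        n = int(m[m * b - cumsum <= R].max())   # (8)
        z = np.zeros(N)
        level = (cumsum[n - 1] + R) / n         # common level reached by the first n terms
        z[:n] = level - b[:n]                   # (12); the remaining z_i stay at zero
        return z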
For a fixed w, the problem in (5)-(7) is simplified to maximizing Σ_{i=1}^{N^a} ln σ(y^a_i w^T x^a_i + y^a_i μ_i) with respect to μ, subject to (1/N^a) Σ_{i=1}^{N^a} y^a_i μ_i ≤ C, C ≥ 0, and y^a_i μ_i ≥ 0 for i = 1, 2, ..., N^a. Clearly ln σ(z) is a twice continuously differentiable function of z and its second derivative ∂²/∂z² ln σ(z) = −σ(z) σ(−z) < 0 for −∞ < z < ∞. Thus Theorem 1 applies. We first solve for {y^a_i μ_i} using Theorem 1; then {μ_i} are trivially recovered using the fact y^a_i ∈ {−1, 1}. Assume y^a_{k_1} w^T x^a_{k_1} ≤ y^a_{k_2} w^T x^a_{k_2} ≤ ... ≤ y^a_{k_{N^a}} w^T x^a_{k_{N^a}}, where k_1, k_2, ..., k_{N^a} is a permutation of 1, 2, ..., N^a. Then we can write the solution of {μ_i} analytically,

μ_{k_i} = (1/y^a_{k_i}) [ (1/n) Σ_{j=1}^{n} y^a_{k_j} w^T x^a_{k_j} + (N^a/n) C ] − w^T x^a_{k_i}   for 1 ≤ i ≤ n,
μ_{k_i} = 0   for n < i ≤ N^a    (13)

where

n = max{ m : m y^a_{k_m} w^T x^a_{k_m} − Σ_{i=1}^{m} y^a_{k_i} w^T x^a_{k_i} ≤ N^a C, 1 ≤ m ≤ N^a }    (14)

For a fixed μ, we use a standard gradient-based method [Bertsekas, 1999] to find w. The main procedures of the fast training algorithm for M-Logit are summarized in Table 1, where the gradient ∇_w ℓ and the Hessian matrix ∇²_w ℓ are computed from (4).

Table 1. Fast learning algorithm of M-Logit.
Input: D^a ∪ D^p_l and C. Output: w and {μ_i}_{i=1}^{N^a}.
1. Initialize w and set μ_i = 0 for i = 1, 2, ..., N^a.
2. Compute the gradient ∇_w ℓ and the Hessian matrix ∇²_w ℓ.
3. Compute the ascent direction d = −(∇²_w ℓ)^{−1} ∇_w ℓ.
4. Do a line search for the step size α* = arg max_α ℓ(w + α d).
5. Update w: w ← w + α* d.
6. Sort {y^a_i w^T x^a_i}_{i=1}^{N^a} in ascending order. Assume the result is y^a_{k_1} w^T x^a_{k_1} ≤ y^a_{k_2} w^T x^a_{k_2} ≤ ... ≤ y^a_{k_{N^a}} w^T x^a_{k_{N^a}}, where k_1, k_2, ..., k_{N^a} is a permutation of 1, 2, ..., N^a.
7. Find n using (14).
8. Update the auxiliary variables {μ_i}_{i=1}^{N^a} using (13).
9. Check the convergence of ℓ: exit and output w and {μ_i}_{i=1}^{N^a} if converged; otherwise go back to step 2.
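The procedure in Table 1 can be sketched end to end as follows. This is a minimal illustration under the array conventions used earlier (Xp, yp for D^p_l and Xa, ya for D^a, each with a leading column of ones), not the authors' implementation: the line search of step 4 is replaced by a full Newton step, the convergence check by a fixed iteration count, and a small ridge term keeps the Hessian invertible.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def update_mu(w, Xa, ya, C):
        """Steps 6-8 of Table 1: closed-form mu update via (13)-(14)."""
        Na = Xa.shape[0]
        margins = ya * (Xa @ w)                      # y_i w^T x_i
        order = np.argsort(margins)                  # ascending: k_1, ..., k_{Na}
        b = margins[order]
        cumsum = np.cumsum(b)
        m = np.arange(1, Na + 1)
        n = int(m[m * b - cumsum <= Na * C].max())   # (14)
        z = np.zeros(Na)                             # z_i = y_i mu_i in sorted order
        z[:n] = (cumsum[n - 1] + Na * C) / n - b[:n]
        mu = np.zeros(Na)
        mu[order] = ya[order] * z                    # recover mu_i from y_i mu_i
        return mu

    def fit_mlogit(Xp, yp, Xa, ya, C, n_iter=50, ridge=1e-6):
        d1 = Xp.shape[1]
        w, mu = np.zeros(d1), np.zeros(Xa.shape[0])
        for _ in range(n_iter):
            # steps 2-5: Newton ascent step in w, gradient and Hessian from (4)
            sp = sigmoid(-yp * (Xp @ w))
            sa = sigmoid(-(ya * (Xa @ w) + ya * mu))
            grad = Xp.T @ (sp * yp) + Xa.T @ (sa * ya)
            hess = -((Xp * (sp * (1 - sp))[:, None]).T @ Xp
                     + (Xa * (sa * (1 - sa))[:, None]).T @ Xa) - ridge * np.eye(d1)
            w = w - np.linalg.solve(hess, grad)
            # steps 6-8: closed-form update of the auxiliary variables
            mu = update_mu(w, Xa, ya, C)
        return w, mu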
4. Auxiliary Variables and Choice of C
Theorem 1 and its constructive proof in the appendix offer some insight into the mechanism by which the mismatch between D^a and D^p is compensated through the auxiliary variables {μ_i}. To make the description easier, we think of each data point x^a_i ∈ D^a as getting a major "wealth" y^a_i w^T x^a_i from w and an additional wealth y^a_i μ_i from a given budget totaling N^a C (C represents the average budget for a single x^a). From the appendix, N^a C is distributed among the auxiliary data {x^a_i} by a "poorest-first" rule: the "poorest" x^a_{k_1} (that which has the smallest y^a_{k_1} w^T x^a_{k_1}) gets a portion y^a_{k_1} μ_{k_1} from N^a C first, and as soon as the total wealth y^a_{k_1} w^T x^a_{k_1} + y^a_{k_1} μ_{k_1} reaches the wealth of the second poorest x^a_{k_2}, N^a C becomes equally distributed to x^a_{k_1} and x^a_{k_2} such that their total wealths are always equal. Then, as soon as y^a_{k_1} w^T x^a_{k_1} + y^a_{k_1} μ_{k_1} = y^a_{k_2} w^T x^a_{k_2} + y^a_{k_2} μ_{k_2} reach the wealth of the third poorest, N^a C becomes equally distributed among the three of them to make them equally rich. The distribution continues in this way until the budget N^a C is used up. The "poorest-first" rule is essentially a result of the concavity of the logarithmic sigmoid function ln σ(·). The goal is to maximize Σ_{i=1}^{N^a} ln σ(y^a_i w^T x^a_i + y^a_i μ_i), and the concavity of ln σ(·) dictates that for any given portion of N^a C, distributing it to the poorest makes the maximum gain in ln σ.

The C is used as a means to compensate for the loss that D^a may suffer from w. The classifier w is responsible for correctly classifying both D^a and D^p. Because D^a and D^p are mismatched, w cannot satisfy both of them: one must suffer if the other is to gain. As D^p is the primary data set, we want w to classify D^p as accurately as possible. The auxiliary variables are therefore introduced to represent the compensation that D^a gets from C. When x^a gets small wealth from w and is poor, it is because x^a is mismatched and in conflict with D^p (assuming perfect separation of D^a, no conflict exists among the auxiliary data themselves). By the "poorest-first" rule, the most mismatched x^a gets compensation first.
A high compensation y^a_i μ_i whittles down the participation of x^a_i in learning w. This is easily seen from the contributions of (x^a_i, y^a_i) to ∇_w ℓ and ∇²_w ℓ, which are obtained from (4) as σ(−y^a_i w^T x^a_i − y^a_i μ_i) y^a_i x^a_i and −σ(−y^a_i w^T x^a_i − y^a_i μ_i) σ(y^a_i w^T x^a_i + y^a_i μ_i) x^a_i x^a_i^T, respectively. When y^a_i μ_i is large, σ(−y^a_i w^T x^a_i − y^a_i μ_i) is close to zero and hence the contributions of (x^a_i, y^a_i) to ∇_w ℓ and ∇²_w ℓ are negligible. In fact, we do not need an infinitely large y^a_i μ_i to make the contributions of x^a_i negligible, because σ(u) is almost saturated at u = ±6. If y^a_i w^T x^a_i = −6, then σ(−y^a_i w^T x^a_i) = 0.9975, implying a large contribution of (x^a_i, y^a_i) to ∇_w ℓ, which happens when w assigns x^a_i to the correct class y^a_i with a probability of only σ(y^a_i w^T x^a_i) = σ(−6) = 0.0025. In this nearly worst case, a compensation of y^a_i μ_i = 12 can effectively remove the contribution of (x^a_i, y^a_i), because σ(−y^a_i w^T x^a_i − y^a_i μ_i) = σ(6 − 12) = σ(−6) = 0.0025. To effectively remove the contributions of N_m auxiliary data points, one needs a total budget of 12 N_m, resulting in an average budget C = 12 N_m / N^a.

To make a right choice of C, N_m / N^a should represent the rate at which D^a is mismatched with D^p. This is so because we want N^a C to be distributed only to the part of D^a that is mismatched with D^p, thus permitting us to use the remaining part in learning w. The quantity N_m / N^a is usually unknown in practice. However, C = 12 N_m / N^a gives one a sense of at least what range C should be in. As 0 ≤ N_m ≤ N^a, letting 0 ≤ C ≤ 12 is usually a reasonable choice. In our experience, the performance of M-Logit is relatively robust to C, and this will be demonstrated in Section 6.2 using an example data set.
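The arithmetic behind the C = 12 N_m / N^a rule of thumb is easy to check numerically; the short snippet below uses made-up values of N_m and N^a purely for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    print(round(sigmoid(6.0), 4))          # 0.9975: gradient weight of a badly fit auxiliary point
    print(round(sigmoid(6.0 - 12.0), 4))   # 0.0025: the same weight after a compensation of 12
    N_m, N_a = 30, 150                     # hypothetical mismatch count and auxiliary set size
    print(12.0 * N_m / N_a)                # 2.4, within the suggested range 0 <= C <= 12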
5. Active Selection of D^p_l

In Section 2 we assumed that D^p_l had already been determined. In this section we describe how D^p_l can be actively selected from D^p, based on the Fisher information matrix [Fedorov, 1972; MacKay, 1992]. The approach is known as active learning [Cohn et al., 1995; Krogh & Vedelsby, 1995].
Let Q denote the Fisher information matrix of D^p_l ∪ D^a about w. By the definition of the Fisher information matrix [Cover & Thomas, 1991], Q = E_{{y^p_i},{y^a_i}} [ (∂ℓ/∂w) (∂ℓ/∂w^T) ], and substituting (4) into this equation gives (a brief derivation is given in the appendix)

Q = Σ_{i=1}^{N^p_l} σ^p_i (1 − σ^p_i) x^p_i x^p_i^T + Σ_{i=1}^{N^a} σ^a_i (1 − σ^a_i) x^a_i x^a_i^T    (15)

where σ^p_i = σ(w^T x^p_i) for i = 1, 2, ..., N^p_l, σ^a_i = σ(w^T x^a_i + μ_i) for i = 1, 2, ..., N^a, and w and {μ_i} represent the true classifier and auxiliary variables.
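In code, (15) is a pair of weighted outer-product sums; a minimal sketch under the same array conventions as before (an empty D^p_l simply contributes a zero matrix):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fisher_information(w, mu, Xp_l, Xa):
        """Equation (15); Xp_l holds the labeled primary points, Xa the auxiliary ones."""
        sp = sigmoid(Xp_l @ w)                  # sigma(w^T x_i^p)
        sa = sigmoid(Xa @ w + mu)               # sigma(w^T x_i^a + mu_i)
        return ((Xp_l * (sp * (1 - sp))[:, None]).T @ Xp_l
                + (Xa * (sa * (1 - sa))[:, None]).T @ Xa)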
It is well known that the inverse Fisher information Q^{−1} lower bounds the covariance matrix of the estimated w [Cover & Thomas, 1991]. In particular, [det(Q)]^{−1} lower bounds the product of the variances of the elements of w. The goal in selecting D^p_l is to reduce the variances, or uncertainty, of w. Thus we seek the D^p_l that maximizes det(Q).

The selection proceeds in a sequential manner. Initially D^p_u = D^p, D^p_l is empty, and Q = Σ_{i=1}^{N^a} σ^a_i (1 − σ^a_i) x^a_i x^a_i^T. Then, one at a time, a data point x^p_i ∈ D^p_u is selected and moved from D^p_u to D^p_l. This causes Q to be updated as Q ← Q + σ^p_i (1 − σ^p_i) x^p_i (x^p_i)^T. At each iteration, the selection maximizes det[Q + σ^p_i (1 − σ^p_i) x^p_i (x^p_i)^T] over x^p_i ∈ D^p_u which, since det(Q) does not depend on the candidate, is equivalent to

max_{x^p_i ∈ D^p_u} [ 1 + σ^p_i (1 − σ^p_i) (x^p_i)^T Q^{−1} x^p_i ]    (16)

where we assume the existence of Q^{−1}, which can often be assured by using sufficient auxiliary data D^a. Evaluation of (16) requires the true values of w and {μ_i}, which are not known a priori. We follow Fedorov (1972) and replace them with the w and {μ_i} that are estimated from D^a ∪ D^p_l, where D^p_l are the primary labeled data selected up to the present.
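One greedy step of this selection can be sketched as follows; it is an illustrative implementation rather than the authors' code, plugging in the current estimates of w and {μ_i} as described above and rebuilding Q from (15).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def select_next(w, mu, Xp_l, Xa, Xp_u):
        """Return the index into Xp_u that maximizes criterion (16)."""
        # Fisher information of the currently labeled data, equation (15)
        sp = sigmoid(Xp_l @ w)
        sa = sigmoid(Xa @ w + mu)
        Q = ((Xp_l * (sp * (1 - sp))[:, None]).T @ Xp_l
             + (Xa * (sa * (1 - sa))[:, None]).T @ Xa)
        Q_inv = np.linalg.inv(Q)
        # score each unlabeled candidate by 1 + s(1-s) x^T Q^{-1} x
        su = sigmoid(Xp_u @ w)
        scores = 1.0 + su * (1 - su) * np.einsum("ij,jk,ik->i", Xp_u, Q_inv, Xp_u)
        return int(np.argmax(scores))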
6. Results
In this section the performance of M-Logit is demonstrated and compared to standard logistic regression, using the test error rate as the performance index. The M-Logit is trained using D^a ∪ D^p_l, where D^p_l are either randomly selected from D^p or actively selected from D^p using the method in Section 5. When D^p_l are randomly selected, 50 independent trials are performed and the results are obtained as an average over the trials. Three logistic regression classifiers are trained using different combinations of D^a and D^p_l: D^a ∪ D^p_l, D^p_l alone, and D^a alone, where D^p_l are identical to the D^p_l used for M-Logit. The four classifiers are tested on D^p_u = D^p \ D^p_l, using the following decision rule: declare y^p = −1 if σ(w^T x^p) ≤ 0.5 and y^p = 1 otherwise, for any x^p ∈ D^p_u.

Throughout this section the C for M-Logit is set to C = 6 when the comparison is made to logistic regression. In addition, we present a comparison of M-Logit with different C's, to examine the sensitivity of M-Logit's performance to C.

6.1. A Toy Example
In the first example, the primary data are simulated from two bivariate Gaussian distributions representing class "−1" and class "+1", respectively. In particular, we have Pr(x^p | y^p = −1) = N(x^p; μ_0, Σ) and Pr(x^p | y^p = +1) = N(x^p; μ_1, Σ), where the Gaussian parameters are μ_0 = [0, 0]^T, μ_1 = [2.3, 2.3]^T, and

Σ = [ 1.75    −0.433
      −0.433   1.25 ]

The auxiliary data D^a are then a selected draw from the two Gaussian distributions, as described in [Zadrozny, 2004]. We take the selection probability Pr(s | x^p, y^p = −1) = σ(w_0 + w_1 K(x^p, μ^s_0; Σ)) and Pr(s | x^p, y^p = +1) = σ(w_0 + w_1 K(x^p, μ^s_1; Σ)), where σ is the sigmoid function, w_0 = −1, w_1 = exp(1), K(x^p, μ^s_0; Σ) = exp{−0.5 (x^p − μ^s_0)^T Σ^{−1} (x^p − μ^s_0)} with μ^s_0 = [2, 1]^T, and K(x^p, μ^s_1; Σ) = exp{−0.5 (x^p − μ^s_1)^T Σ^{−1} (x^p − μ^s_1)} with μ^s_1 = [0, 3]^T. We obtain 150 samples of D^p and 150 samples of D^a, which are shown in Figure 3.
The M-Logit and logistic regression classifiers are trained and tested as explained at the beginning of this section. The test error rates are shown in Figure 1 and Figure 2, as a function of the number of primary labeled data used in training. The D^p_l in Figure 1 are randomly selected and the D^p_l in Figure 2 are actively selected as described in Section 5.

Figure 1. Test error rates of M-Logit and logistic regression on the toy data, as a function of the size of D^p_l. The primary labeled data D^p_l are randomly selected from D^p. The error rates are an average over 50 independent trials of random selection of D^p_l.

Several observations are made from inspection of Figures 1 and 2:
• The M-Logit consistently outperforms the three standard logistic regression classifiers, by a considerable margin. This improvement is a result of properly fusing D^a and D^p_l, with D^a determining the classifier under the guidance of a few D^p_l.

• The performance of the logistic regression trained on D^p_l alone changes significantly with the size of D^p_l. This is understandable, considering that D^p_l are the only examples determining the classifier. The abrupt drop of errors from iteration 11 to iteration 12 in Figure 2 may be because the label found at iteration 12 is critical to determining w.

• The logistic regression trained on D^a alone performs significantly worse than M-Logit, reflecting a marked mismatch between D^a and D^p.

• The logistic regression trained on D^a ∪ D^p_l improves, but mildly, as D^p_l grows, and it is ultimately outperformed by the logistic regression trained on D^p_l alone, demonstrating that some data in D^a are mismatched with D^p and hence cannot be correctly classified along with D^p if the mismatch is not compensated.

• As D^p_l grows, the logistic regression trained on D^p_l alone finally approaches M-Logit, showing that without the interference of D^a, a sufficient D^p_l can define a correct classifier.

• All four classifiers benefit from the actively selected D^p_l; this is consistent with the general observation in active learning [Cohn et al., 1995; Krogh & Vedelsby, 1995].

Figure 2. Error rates of M-Logit and logistic regression on the toy data, as a function of the size of D^p_l. The primary labeled data D^p_l are actively selected from D^p, using the method in Section 5.

To better understand the active selection process, we show in Figure 3 the first few iterations of active learning. Iteration 0 corresponds to the initially empty D^p_l, and iterations 1, 5, 10, and 13 respectively correspond to 1, 5, 10, and 13 data points selected cumulatively from D^p_u into D^p_l. Each time a new data point is selected, w is re-trained, yielding different decision boundaries. As can be seen in Figure 3, the decision boundary does not change much after 10 data points are selected, demonstrating convergence.
In Figure 3, each auxiliary data point x^a_i ∈ D^a is displayed with a symbol whose size is proportional to exp(−y^a_i μ_i / 12); hence a small symbol corresponds to a large y^a_i μ_i and thus small participation in determining w. The auxiliary data that cannot be correctly classified along with the primary data are de-emphasized by the M-Logit. Usually the auxiliary data near the decision boundary are de-emphasized.
6.2. Results on the Wisconsin Breast Cancer Databases

In the second example we consider the Wisconsin Breast Cancer Databases from the UCI Machine Learning Repository. The data set consists of 569 instances with feature dimensionality 30. We randomly partition the data set into two subsets, one with 228 data points and the other with 341 data points. The first is used as D^p, and the second as D^a. We artificially make D^a mismatched with D^p by introducing errors into the labels and adding noise to the features. Specifically, we make changes to 50% randomly chosen (x^a_i, y^a_i) ∈ D^a: we change the sign of y^a_i and add 0 dB white Gaussian noise to x^a_i. We then proceed, as in Section 6.1, to train and test the four classifiers.
We again consider both randomly selected D^p_l and actively selected D^p_l. The test errors are summarized in Figures 4 and 5. The results are essentially consistent with those in Figures 1 and 2, extending the observations we made there to the real data considered here. It is particularly noted that the mismatch between D^a and D^p is more prominent here than in the toy data, as manifested by the error rates of the logistic regression trained on D^a alone. This makes M-Logit more advantageous in the comparison: not only does it give the best results, but it also converges faster than the others with the size of D^p_l.

To examine the effect of C on the performance of M-Logit, we present in Figure 6 the test error rates of M-Logit using five different values of C: C = 2, 4, 6, 8, 10. Here the D^p_l are determined by active learning as described in Section 5. Clearly, the results for the five different C's are almost indistinguishable. This relative insensitivity of M-Logit to C may partly be attributed to the adaptivity brought about by active learning. With different C, the D^p_l are also selected differently, thus counteracting the effect of C and keeping M-Logit robust.
