
Logistic Regression with an Auxiliary Data Source
Xuejun Liao (xjliao@ee.duke.edu), Ya Xue (yx10@ee.duke.edu), Lawrence Carin (lcarin@ee.duke.edu)
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708
Abstract
To achieve good generalization in supervised learning, the training and testing examples are usually required to be drawn from the same source distribution. In this paper we propose a method to relax this requirement in the context of logistic regression. Assuming D^p and D^a are two sets of examples drawn from two mismatched distributions, where D^a are fully labeled and D^p partially labeled, our objective is to complete the labels of D^p. We introduce an auxiliary variable μ for each example in D^a to reflect its mismatch with D^p. Under an appropriate constraint the μ's are estimated as a byproduct, along with the classifier. We also present an active learning approach for selecting the labeled examples in D^p. The proposed algorithm, called "Migratory-Logit" or M-Logit, is demonstrated successfully on simulated as well as real data sets.
1. Introduction
In supervised learning problems, the goal is to design a classifier using the training examples (labeled data) D^tr = {(x^tr_i, y^tr_i)}_{i=1}^{N^tr} such that the classifier predicts the label y^p_i correctly for the unlabeled primary test data D^p = {(x^p_i, y^p_i) : y^p_i missing}_{i=1}^{N^p}. The accuracy of the predictions is significantly affected by the quality of D^tr, which is assumed to contain essential information about D^p. A common assumption utilized by learning algorithms is that D^tr are a sufficient sample of the same source distribution from which D^p are drawn. Under this assumption, a classifier designed based on D^tr will generalize well when it is tested on D^p. This assumption, however, is often violated in practice. First, in many applications labeling an observation is an expensive process, resulting in insufficient labeled data in D^tr that are not able to characterize the statistics of the primary data. Second, D^tr and D^p are typically collected under different experimental conditions and therefore often exhibit differences in their statistics.
Methods to overcome the insufficiency of labeled data have been investigated in the past few years under the names "active learning" [Cohn et al., 1995; Krogh & Vedelsby, 1995] and "semi-supervised learning" [Nigam et al., 2000], which we do not discuss here, though we will revisit active learning in Section 5.

The problem of data mismatch has been studied in econometrics, where the available D^tr are often a non-randomly selected sample of the true distribution of interest. Heckman (1979) developed a method to correct the sample-selection bias for linear regression models. The basic idea of Heckman's method is that if one can estimate the probability of an observation being selected into the sample, one can use this probability estimate to correct the selection bias.
Heckman's model has recently been extended to classification problems [Zadrozny, 2004], where it is assumed that the primary test data D^p ~ Pr(x, y) while the training examples D^tr = D^a ~ Pr(x, y | s = 1), where the variable s controls the selection of D^a: if s = 1, (x, y) is selected into D^a; if s = 0, (x, y) is not selected into D^a. Evidently, unless s is independent of (x, y), Pr(x, y | s = 1) ≠ Pr(x, y) and hence D^a are mismatched with D^p. By Bayes rule,

Pr(x, y) = [Pr(s = 1) / Pr(s = 1 | x, y)] Pr(x, y | s = 1)    (1)

which implies that if one has access to Pr(s = 1) / Pr(s = 1 | x, y), one can correct the mismatch by weighting and resampling [Zadrozny et al., 2003; Zadrozny, 2004]. In the special case when Pr(s = 1 | x, y) = Pr(s = 1 | x), one may estimate Pr(s = 1 | x) from a sufficient sample of Pr(x, s) if such a sample is available [Zadrozny, 2004]. In the general case, however, it is difficult to estimate Pr(s = 1) / Pr(s = 1 | x, y), as we do not have a sufficient sample of Pr(x, y, s) (if we did, we would already have a sufficient sample of Pr(x, y), which contradicts the assumption of the problem).
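To make the weighting idea of (1) concrete, the sketch below fits a logistic regression in which every example carries a weight standing in for Pr(s = 1) / Pr(s = 1 | x, y). This is not the M-Logit method developed in this paper, only an illustration of the correction that (1) suggests; the array names, the assumption that the weights are supplied externally, and the plain gradient-ascent settings are all assumptions made for the example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def weighted_logistic_fit(X, y, weights, n_iter=500, lr=0.5):
        """Maximize sum_i weights[i] * log sigmoid(y_i w^T x_i) by gradient ascent.
        X: (N, d+1) with a leading column of ones; y in {-1, +1};
        weights[i] plays the role of Pr(s=1)/Pr(s=1|x_i, y_i) in equation (1)."""
        N, d1 = X.shape
        w = np.zeros(d1)
        for _ in range(n_iter):
            margins = y * (X @ w)
            grad = X.T @ (weights * y * sigmoid(-margins))
            w += lr * grad / N
        return w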
In this paper we consider the case in which we have a fully labeled auxiliary data set D^a and a partially labeled primary data set D^p = D^p_l ∪ D^p_u, where D^p_l are labeled and D^p_u unlabeled. We assume D^p and D^a are drawn from two distributions that are mismatched. Our objective is to use a mixed training set D^tr = D^p_l ∪ D^a to train a classifier that predicts the labels of D^p_u accurately. Assume D^p ~ Pr(x, y). In light of equation (1), we can write D^a ~ Pr(x, y | s = 1) as long as the source distributions of D^p and D^a have the same domain of nonzero probability.[1] As explained in the previous paragraph, it is difficult to correct the mismatch by directly estimating Pr(s = 1) / Pr(s = 1 | x, y). Therefore we take an alternative approach. We introduce an auxiliary variable μ_i for each (x^a_i, y^a_i) ∈ D^a to reflect its mismatch with D^p and to control its participation in the learning process. The μ's play a similar role as the weighting factors Pr(s = 1) / Pr(s = 1 | x, y) in (1). However, unlike the weighting factors, the auxiliary variables are estimated along with the classifier in the learning. We employ logistic regression as a specific classifier and develop our method in this context.
A related problem has been studied in [Wu & Dietterich, 2004], where the classifier is trained on two fixed and labeled data sets D^p and D^a, where D^a is of lower quality and provides weaker evidence for the classifier design. The problem is approached by minimizing a weighted sum of two separate loss functions, with one defined for the primary data and the other for the auxiliary data. Our method is distinct from that in [Wu & Dietterich, 2004] in two respects. First, we introduce an auxiliary variable μ_i for each (x^a_i, y^a_i) ∈ D^a and the auxiliary variables are estimated along with the classifier. A large μ_i implies a large mismatch of (x^a_i, y^a_i) with D^p and accordingly less participation of x^a_i in learning the classifier. Second, we present an active learning strategy to define D^p_l ⊂ D^p when D^p is initially fully unlabeled.
The remainder of the paper is organized as follows. A detailed description of the proposed method is provided in Section 2, followed by a description of a fast learning algorithm in Section 3 and a theoretical discussion in Section 4. In Section 5 we present a method to actively define D^p_l when D^p_l is initially empty. We demonstrate example results in Section 6. Finally, Section 7 contains the conclusions.

[1] For any Pr(x, y | s = 1) ≠ 0 and Pr(x, y) ≠ 0, there exists Pr(s = 1) / Pr(s = 1 | x, y) = Pr(x, y) / Pr(x, y | s = 1) ∈ (0, ∞) such that equation (1) is satisfied. For Pr(x, y | s = 1) = Pr(x, y) = 0, any Pr(s = 1) / Pr(s = 1 | x, y) ≠ 0 makes equation (1) satisfied.
2. Migratory-Logit: Learning Jointly on the Primary and Auxiliary Data
We assume D^p_l are fixed and nonempty, and without loss of generality, we assume D^p_l are always indexed prior to D^p_u: D^p_l = {(x^p_i, y^p_i)}_{i=1}^{N^p_l} and D^p_u = {(x^p_i, y^p_i) : y^p_i missing}_{i=N^p_l+1}^{N^p}. We use N^a, N^p, and N^p_l to denote the size (number of data points) of D^a, D^p, and D^p_l, respectively. In Section 5 we discuss how to actively determine D^p_l when D^p_l is initially empty.
We consider the binary classification problem and the labels y^a, y^p ∈ {−1, 1}. For notational simplicity, we let x always include a 1 as its first element to accommodate a bias (intercept) term, thus x^p, x^a ∈ R^{d+1}, where d is the number of features. For a primary data point (x^p_i, y^p_i) ∈ D^p_l, we follow standard logistic regression to write

Pr(y^p_i | x^p_i; w) = σ(y^p_i w^T x^p_i)    (2)

where w ∈ R^{d+1} is a column vector of classifier parameters and σ(u) = 1 / (1 + exp(−u)) is the sigmoid function.
For an auxiliary data point (x^a_i, y^a_i) ∈ D^a, we define

Pr(y^a_i | x^a_i; w, μ_i) = σ(y^a_i w^T x^a_i + y^a_i μ_i)    (3)

where μ_i is an auxiliary variable. Assuming the examples in D^p_l and D^a are drawn i.i.d., we have the log-likelihood function

ℓ(w, μ; D^p_l ∪ D^a) = Σ_{i=1}^{N^p_l} ln σ(y^p_i w^T x^p_i) + Σ_{i=1}^{N^a} ln σ(y^a_i w^T x^a_i + y^a_i μ_i)    (4)

where μ = [μ_1, ..., μ_{N^a}]^T is a column vector of all the auxiliary variables.
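For concreteness, the log-likelihood (4) can be written down directly in code. The sketch below is illustrative, assuming numpy arrays Xp, yp for D^p_l and Xa, ya, mu for D^a (names chosen here, not taken from the paper), with each design matrix already carrying the leading column of ones.

    import numpy as np

    def log_sigmoid(z):
        # numerically stable log(sigmoid(z)) = -log(1 + exp(-z))
        return -np.logaddexp(0.0, -z)

    def mlogit_loglik(w, Xp, yp, Xa, ya, mu):
        """Equation (4): primary plus auxiliary log-likelihood terms."""
        primary = log_sigmoid(yp * (Xp @ w)).sum()
        auxiliary = log_sigmoid(ya * (Xa @ w) + ya * mu).sum()
        return primary + auxiliary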
The auxiliary variable μ_i is introduced to reflect the mismatch of (x^a_i, y^a_i) with D^p and to control its participation in the learning of w. A larger y^a_i μ_i makes Pr(y^a_i | x^a_i; w, μ_i) less sensitive to w. When y^a_i μ_i = ∞, Pr(y^a_i | x^a_i; w, μ_i) = 1 becomes completely independent of w. Geometrically, μ_i is an extra intercept term that is uniquely associated with x^a_i and causes it to migrate towards class y^a_i. If (x^a_i, y^a_i) is mismatched with the primary data D^p, w cannot make Σ_{i=1}^{N^p_l} ln σ(y^p_i w^T x^p_i) and ln σ(y^a_i w^T x^a_i) large at the same time. In this case x^a_i will be given an appropriate μ_i to allow it to migrate towards class y^a_i, so that w is less sensitive to (x^a_i, y^a_i) and can focus more on fitting D^p_l. Evidently, if the μ's are allowed to change freely, their influence will override that of w in fitting the auxiliary data D^a, and then D^a will not participate in learning w. To prevent this from happening, we introduce constraints on μ_i and maximize the log-likelihood subject to the constraints:
max_{w, μ}  ℓ(w, μ; D^p_l ∪ D^a)    (5)

subject to

(1/N^a) Σ_{i=1}^{N^a} y^a_i μ_i ≤ C,   C ≥ 0    (6)

y^a_i μ_i ≥ 0,   i = 1, 2, ..., N^a    (7)

where the inequalities in (7) reflect the fact that in order for x^a_i to fit y^a_i = 1 (or y^a_i = −1) we need to have μ_i > 0 (or μ_i < 0), if we want μ_i to exert a positive influence in the fitting process. Under the constraints in (7), a larger value of y^a_i μ_i represents a larger mismatch between (x^a_i, y^a_i) and D^p and accordingly makes (x^a_i, y^a_i) play a less important role in determining w. The classifier resulting from solving the problem in (5)-(7) is referred to as "Migratory-Logit" or "M-Logit".

The C in (6) reflects the average mismatch between D^a and D^p and controls the average participation of D^a in determining w. It can be learned from data if we have a reasonable amount of D^p_l. However, in practice we usually have little or no D^p_l to begin with. In this case, we must rely on other information to set C. We will come back to a more detailed discussion of C in Section 4.
3. Fast Learning Algorithm
The optimization problem in (5), (6), and (7) is concave and any standard technique can be utilized to find the global maximum. However, there is a unique μ_i associated with every (x^a_i, y^a_i) ∈ D^a, and when D^a is large, using a standard method to estimate the μ's can consume most of the computational time.
In this section we give a fast algorithm for training the M-Logit, by taking a block-coordinate ascent approach [Bertsekas, 1999], in which we alternately solve for w and μ, keeping one fixed while solving for the other. The algorithm draws its efficiency from the analytic solution of μ, which we establish in the following theorem. Proof of the theorem is given in the appendix, and Section 4 contains a discussion that helps to understand the theorem from an intuitive perspective.

Theorem 1: Let f(z) be a twice continuously differentiable function with second derivative f''(z) < 0 for any z ∈ R. Let b_1 ≤ b_2 ≤ ... ≤ b_N, R ≥ 0, and

n = max{ m : m b_m − Σ_{i=1}^{m} b_i ≤ R, 1 ≤ m ≤ N }    (8)

Then the problem

max_{z_i} Σ_{i=1}^{N} f(b_i + z_i)    (9)

subject to

Σ_{i=1}^{N} z_i ≤ R,   R ≥ 0    (10)

z_i ≥ 0,   i = 1, 2, ..., N    (11)

has a unique global solution

z_i = (1/n) Σ_{j=1}^{n} b_j + (1/n) R − b_i   for 1 ≤ i ≤ n,
z_i = 0   for n < i ≤ N    (12)
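The solution (12) takes only a few lines to compute. The following sketch is a direct transcription of (8) and (12), assuming the inputs b_1, ..., b_N are already sorted in ascending order; it is written here for illustration and is not taken from the authors' implementation.

    import numpy as np

    def theorem1_solution(b, R):
        """b: ascending 1-D array (b_1 <= ... <= b_N); R >= 0.
        Returns z maximizing sum_i f(b_i + z_i) for any strictly concave f,
        subject to sum(z) <= R and z >= 0, per equations (8) and (12)."""
        b = np.asarray(b, dtype=float)
        N = len(b)
        cumsum = np.cumsum(b)
        m = np.arange(1, N + 1)
        n = int(m[m * b - cumsum <= R].max())   # (8)
        z = np.zeros(N)
        level = (cumsum[n - 1] + R) / n         # common level reached by the first n terms
        z[:n] = level - b[:n]                   # (12); the remaining z_i stay at zero
        return z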
For a fixed w, the problem in (5)-(7) is simplified to maximizing Σ_{i=1}^{N^a} ln σ(y^a_i w^T x^a_i + y^a_i μ_i) with respect to μ, subject to (1/N^a) Σ_{i=1}^{N^a} y^a_i μ_i ≤ C, C ≥ 0, and y^a_i μ_i ≥ 0 for i = 1, 2, ..., N^a. Clearly ln σ(z) is a twice continuously differentiable function of z and its second derivative ∂²/∂z² ln σ(z) = −σ(z) σ(−z) < 0 for −∞ < z < ∞. Thus Theorem 1 applies. We first solve for {y^a_i μ_i} using Theorem 1; then {μ_i} are trivially recovered using the fact y^a_i ∈ {−1, 1}. Assume y^a_{k_1} w^T x^a_{k_1} ≤ y^a_{k_2} w^T x^a_{k_2} ≤ ... ≤ y^a_{k_{N^a}} w^T x^a_{k_{N^a}}, where k_1, k_2, ..., k_{N^a} is a permutation of 1, 2, ..., N^a. Then we can write the solution of {μ_i} analytically,

μ_{k_i} = (1/y^a_{k_i}) [ (1/n) Σ_{j=1}^{n} y^a_{k_j} w^T x^a_{k_j} + (N^a/n) C ] − w^T x^a_{k_i}   for 1 ≤ i ≤ n,
μ_{k_i} = 0   for n < i ≤ N^a    (13)

where

n = max{ m : m y^a_{k_m} w^T x^a_{k_m} − Σ_{i=1}^{m} y^a_{k_i} w^T x^a_{k_i} ≤ N^a C, 1 ≤ m ≤ N^a }    (14)

For a fixed μ, we use a standard gradient-based method [Bertsekas, 1999] to find w. The main procedures of the fast training algorithm for M-Logit are summarized in Table 1, where the gradient ∇_w ℓ and the Hessian matrix ∇²_w ℓ are computed from (4).

Table 1. Fast learning algorithm of M-Logit.
Input: D^a ∪ D^p_l and C. Output: w and {μ_i}_{i=1}^{N^a}.
1. Initialize w and set μ_i = 0 for i = 1, 2, ..., N^a.
2. Compute the gradient ∇_w ℓ and the Hessian matrix ∇²_w ℓ.
3. Compute the ascent direction d = −(∇²_w ℓ)^{−1} ∇_w ℓ.
4. Do a line search for the step size α* = arg max_α ℓ(w + α d).
5. Update w: w ← w + α* d.
6. Sort {y^a_i w^T x^a_i}_{i=1}^{N^a} in ascending order. Assume the result is y^a_{k_1} w^T x^a_{k_1} ≤ y^a_{k_2} w^T x^a_{k_2} ≤ ... ≤ y^a_{k_{N^a}} w^T x^a_{k_{N^a}}, where k_1, k_2, ..., k_{N^a} is a permutation of 1, 2, ..., N^a.
7. Find n using (14).
8. Update the auxiliary variables {μ_i}_{i=1}^{N^a} using (13).
9. Check the convergence of ℓ: exit and output w and {μ_i}_{i=1}^{N^a} if converged; otherwise go back to step 2.
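The procedure in Table 1 can be sketched end to end as follows. This is a minimal illustration under the array conventions used earlier (Xp, yp for D^p_l and Xa, ya for D^a, each with a leading column of ones), not the authors' implementation: the line search of step 4 is replaced by a full Newton step, the convergence check by a fixed iteration count, and a small ridge term keeps the Hessian invertible.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def update_mu(w, Xa, ya, C):
        """Steps 6-8 of Table 1: closed-form mu update via (13)-(14)."""
        Na = Xa.shape[0]
        margins = ya * (Xa @ w)                      # y_i w^T x_i
        order = np.argsort(margins)                  # ascending: k_1, ..., k_{Na}
        b = margins[order]
        cumsum = np.cumsum(b)
        m = np.arange(1, Na + 1)
        n = int(m[m * b - cumsum <= Na * C].max())   # (14)
        z = np.zeros(Na)                             # z_i = y_i mu_i in sorted order
        z[:n] = (cumsum[n - 1] + Na * C) / n - b[:n]
        mu = np.zeros(Na)
        mu[order] = ya[order] * z                    # recover mu_i from y_i mu_i
        return mu

    def fit_mlogit(Xp, yp, Xa, ya, C, n_iter=50, ridge=1e-6):
        d1 = Xp.shape[1]
        w, mu = np.zeros(d1), np.zeros(Xa.shape[0])
        for _ in range(n_iter):
            # steps 2-5: Newton ascent step in w, gradient and Hessian from (4)
            sp = sigmoid(-yp * (Xp @ w))
            sa = sigmoid(-(ya * (Xa @ w) + ya * mu))
            grad = Xp.T @ (sp * yp) + Xa.T @ (sa * ya)
            hess = -((Xp * (sp * (1 - sp))[:, None]).T @ Xp
                     + (Xa * (sa * (1 - sa))[:, None]).T @ Xa) - ridge * np.eye(d1)
            w = w - np.linalg.solve(hess, grad)
            # steps 6-8: closed-form update of the auxiliary variables
            mu = update_mu(w, Xa, ya, C)
        return w, mu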
4. Auxiliary Variables and Choice of C
Theorem 1 and its constructive proof in the appendix offer some insight into the mechanism by which the mismatch between D^a and D^p is compensated through the auxiliary variables {μ_i}. To make the description easier, we think of each data point x^a_i ∈ D^a as getting a major "wealth" y^a_i w^T x^a_i from w and an additional wealth y^a_i μ_i from a given budget totaling N^a C (C represents the average budget for a single x^a). From the appendix, N^a C is distributed among the auxiliary data {x^a_i} by a "poorest-first" rule: the "poorest" x^a_{k_1} (that which has the smallest y^a_{k_1} w^T x^a_{k_1}) gets a portion y^a_{k_1} μ_{k_1} from N^a C first, and as soon as the total wealth y^a_{k_1} w^T x^a_{k_1} + y^a_{k_1} μ_{k_1} reaches the wealth of the second poorest x^a_{k_2}, N^a C becomes equally distributed to x^a_{k_1} and x^a_{k_2} such that their total wealths are always equal. Then, as soon as y^a_{k_1} w^T x^a_{k_1} + y^a_{k_1} μ_{k_1} = y^a_{k_2} w^T x^a_{k_2} + y^a_{k_2} μ_{k_2} reach the wealth of the third poorest, N^a C becomes equally distributed among the three of them to make them equally rich. The distribution continues in this way until the budget N^a C is used up. The "poorest-first" rule is essentially a result of the concavity of the logarithmic sigmoid function ln σ(·). The goal is to maximize Σ_{i=1}^{N^a} ln σ(y^a_i w^T x^a_i + y^a_i μ_i), and the concavity of ln σ(·) dictates that for any given portion of N^a C, distributing it to the poorest makes the maximum gain in ln σ.

The C is used as a means to compensate for the loss that D^a may suffer from w. The classifier w is responsible for correctly classifying both D^a and D^p. Because D^a and D^p are mismatched, w cannot satisfy both of them: one must suffer if the other is to gain. As D^p is the primary data set, we want w to classify D^p as accurately as possible. The auxiliary variables are therefore introduced to represent the compensation that D^a gets from C. When x^a gets small wealth from w and is poor, it is because x^a is mismatched and in conflict with D^p (assuming perfect separation of D^a, no conflict exists among the auxiliary data themselves). By the "poorest-first" rule, the most mismatched x^a gets compensation first.
A high compensation y^a_i μ_i whittles down the participation of x^a_i in learning w. This is easily seen from the contributions of (x^a_i, y^a_i) to ∇_w ℓ and ∇²_w ℓ, which are obtained from (4) as σ(−y^a_i w^T x^a_i − y^a_i μ_i) y^a_i x^a_i and −σ(−y^a_i w^T x^a_i − y^a_i μ_i) σ(y^a_i w^T x^a_i + y^a_i μ_i) x^a_i x^a_i^T, respectively. When y^a_i μ_i is large, σ(−y^a_i w^T x^a_i − y^a_i μ_i) is close to zero and hence the contributions of (x^a_i, y^a_i) to ∇_w ℓ and ∇²_w ℓ are negligible. In fact, we do not need an infinitely large y^a_i μ_i to make the contributions of x^a_i negligible, because σ(u) is almost saturated at u = ±6. If y^a_i w^T x^a_i = −6, then σ(−y^a_i w^T x^a_i) = 0.9975, implying a large contribution of (x^a_i, y^a_i) to ∇_w ℓ, which happens when w assigns x^a_i to the correct class y^a_i with a probability of only σ(y^a_i w^T x^a_i) = σ(−6) = 0.0025. In this nearly worst case, a compensation of y^a_i μ_i = 12 can effectively remove the contribution of (x^a_i, y^a_i), because σ(−y^a_i w^T x^a_i − y^a_i μ_i) = σ(6 − 12) = σ(−6) = 0.0025. To effectively remove the contributions of N_m auxiliary data points, one needs a total budget of 12 N_m, resulting in an average budget C = 12 N_m / N^a.

To make a right choice of C, N_m / N^a should represent the rate at which D^a is mismatched with D^p. This is so because we want N^a C to be distributed only to the part of D^a that is mismatched with D^p, thus permitting us to use the remaining part in learning w. The quantity N_m / N^a is usually unknown in practice. However, C = 12 N_m / N^a gives one a sense of at least what range C should be in. As 0 ≤ N_m ≤ N^a, letting 0 ≤ C ≤ 12 is usually a reasonable choice. In our experience, the performance of M-Logit is relatively robust to C, and this will be demonstrated in Section 6.2 using an example data set.
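The arithmetic behind the C = 12 N_m / N^a rule of thumb is easy to check numerically; the short snippet below uses made-up values of N_m and N^a purely for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    print(round(sigmoid(6.0), 4))          # 0.9975: gradient weight of a badly fit auxiliary point
    print(round(sigmoid(6.0 - 12.0), 4))   # 0.0025: the same weight after a compensation of 12
    N_m, N_a = 30, 150                     # hypothetical mismatch count and auxiliary set size
    print(12.0 * N_m / N_a)                # 2.4, within the suggested range 0 <= C <= 12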
5. Active Selection of D^p_l

In Section 2 we assumed that D^p_l had already been determined. In this section we describe how D^p_l can be actively selected from D^p, based on the Fisher information matrix [Fedorov, 1972; MacKay, 1992]. The approach is known as active learning [Cohn et al., 1995; Krogh & Vedelsby, 1995].
Let Q denote the Fisher information matrix of D^p_l ∪ D^a about w. By the definition of the Fisher information matrix [Cover & Thomas, 1991], Q = E_{{y^p_i},{y^a_i}} [ (∂ℓ/∂w) (∂ℓ/∂w^T) ], and substituting (4) into this equation gives (a brief derivation is given in the appendix)

Q = Σ_{i=1}^{N^p_l} σ^p_i (1 − σ^p_i) x^p_i x^p_i^T + Σ_{i=1}^{N^a} σ^a_i (1 − σ^a_i) x^a_i x^a_i^T    (15)

where σ^p_i = σ(w^T x^p_i) for i = 1, 2, ..., N^p_l, σ^a_i = σ(w^T x^a_i + μ_i) for i = 1, 2, ..., N^a, and w and {μ_i} represent the true classifier and auxiliary variables.
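In code, (15) is a pair of weighted outer-product sums; a minimal sketch under the same array conventions as before (an empty D^p_l simply contributes a zero matrix):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fisher_information(w, mu, Xp_l, Xa):
        """Equation (15); Xp_l holds the labeled primary points, Xa the auxiliary ones."""
        sp = sigmoid(Xp_l @ w)                  # sigma(w^T x_i^p)
        sa = sigmoid(Xa @ w + mu)               # sigma(w^T x_i^a + mu_i)
        return ((Xp_l * (sp * (1 - sp))[:, None]).T @ Xp_l
                + (Xa * (sa * (1 - sa))[:, None]).T @ Xa)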
It is well known that the inverse Fisher information Q^{−1} lower bounds the covariance matrix of the estimated w [Cover & Thomas, 1991]. In particular, [det(Q)]^{−1} lower bounds the product of the variances of the elements of w. The goal in selecting D^p_l is to reduce the variances, or uncertainty, of w. Thus we seek the D^p_l that maximizes det(Q).

The selection proceeds in a sequential manner. Initially D^p_u = D^p, D^p_l is empty, and Q = Σ_{i=1}^{N^a} σ^a_i (1 − σ^a_i) x^a_i x^a_i^T. Then, one at a time, a data point x^p_i ∈ D^p_u is selected and moved from D^p_u to D^p_l. This causes Q to be updated as Q ← Q + σ^p_i (1 − σ^p_i) x^p_i (x^p_i)^T. At each iteration, the selection maximizes det[Q + σ^p_i (1 − σ^p_i) x^p_i (x^p_i)^T] over x^p_i ∈ D^p_u which, since det(Q) does not depend on the candidate, is equivalent to

max_{x^p_i ∈ D^p_u} [ 1 + σ^p_i (1 − σ^p_i) (x^p_i)^T Q^{−1} x^p_i ]    (16)

where we assume the existence of Q^{−1}, which can often be assured by using sufficient auxiliary data D^a. Evaluation of (16) requires the true values of w and {μ_i}, which are not known a priori. We follow Fedorov (1972) and replace them with the w and {μ_i} that are estimated from D^a ∪ D^p_l, where D^p_l are the primary labeled data selected up to the present.
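One greedy step of this selection can be sketched as follows; it is an illustrative implementation rather than the authors' code, plugging in the current estimates of w and {μ_i} as described above and rebuilding Q from (15).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def select_next(w, mu, Xp_l, Xa, Xp_u):
        """Return the index into Xp_u that maximizes criterion (16)."""
        # Fisher information of the currently labeled data, equation (15)
        sp = sigmoid(Xp_l @ w)
        sa = sigmoid(Xa @ w + mu)
        Q = ((Xp_l * (sp * (1 - sp))[:, None]).T @ Xp_l
             + (Xa * (sa * (1 - sa))[:, None]).T @ Xa)
        Q_inv = np.linalg.inv(Q)
        # score each unlabeled candidate by 1 + s(1-s) x^T Q^{-1} x
        su = sigmoid(Xp_u @ w)
        scores = 1.0 + su * (1 - su) * np.einsum("ij,jk,ik->i", Xp_u, Q_inv, Xp_u)
        return int(np.argmax(scores))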
6. Results
In this section the performance of M-Logit is demonstrated and compared to standard logistic regression, using the test error rate as the performance index. The M-Logit is trained using D^a ∪ D^p_l, where D^p_l are either randomly selected from D^p or actively selected from D^p using the method in Section 5. When D^p_l are randomly selected, 50 independent trials are performed and the results are obtained as an average over the trials. Three logistic regression classifiers are trained using different combinations of D^a and D^p_l: D^a ∪ D^p_l, D^p_l alone, and D^a alone, where D^p_l are identical to the D^p_l used for M-Logit. The four classifiers are tested on D^p_u = D^p \ D^p_l, using the following decision rule: declare y^p = −1 if σ(w^T x^p) ≤ 0.5 and y^p = 1 otherwise, for any x^p ∈ D^p_u.

Throughout this section the C for M-Logit is set to C = 6 when the comparison is made to logistic regression. In addition, we present a comparison of M-Logit with different C's, to examine the sensitivity of M-Logit's performance to C.

6.1. A Toy Example
In the first example, the primary data are simulated from two bivariate Gaussian distributions representing class "−1" and class "+1", respectively. In particular, we have Pr(x^p | y^p = −1) = N(x^p; μ_0, Σ) and Pr(x^p | y^p = +1) = N(x^p; μ_1, Σ), where the Gaussian parameters are μ_0 = [0, 0]^T, μ_1 = [2.3, 2.3]^T, and

Σ = [ 1.75    −0.433
      −0.433   1.25 ]

The auxiliary data D^a are then a selected draw from the two Gaussian distributions, as described in [Zadrozny, 2004]. We take the selection probability Pr(s | x^p, y^p = −1) = σ(w_0 + w_1 K(x^p, μ^s_0; Σ)) and Pr(s | x^p, y^p = +1) = σ(w_0 + w_1 K(x^p, μ^s_1; Σ)), where σ is the sigmoid function, w_0 = −1, w_1 = exp(1), K(x^p, μ^s_0; Σ) = exp{−0.5 (x^p − μ^s_0)^T Σ^{−1} (x^p − μ^s_0)} with μ^s_0 = [2, 1]^T, and K(x^p, μ^s_1; Σ) = exp{−0.5 (x^p − μ^s_1)^T Σ^{−1} (x^p − μ^s_1)} with μ^s_1 = [0, 3]^T. We obtain 150 samples of D^p and 150 samples of D^a, which are shown in Figure 3.
The M-Logit and logistic regression classifiers are trained and tested as explained at the beginning of this section. The test error rates are shown in Figure 1 and Figure 2, as a function of the number of primary labeled data used in training. The D^p_l in Figure 1 are randomly selected and the D^p_l in Figure 2 are actively selected as described in Section 5.

Figure 1. Test error rates of M-Logit and logistic regression on the toy data, as a function of the size of D^p_l. The primary labeled data D^p_l are randomly selected from D^p. The error rates are an average over 50 independent trials of random selection of D^p_l.

Several observations are made from inspection of Figures 1 and 2:
• The M-Logit consistently outperforms the three standard logistic regression classifiers, by a considerable margin. This improvement is a result of properly fusing D^a and D^p_l, with D^a determining the classifier under the guidance of a few D^p_l.

• The performance of the logistic regression trained on D^p_l alone changes significantly with the size of D^p_l. This is understandable, considering that D^p_l are the only examples determining the classifier. The abrupt drop of errors from iteration 11 to iteration 12 in Figure 2 may be because the label found at iteration 12 is critical to determining w.

• The logistic regression trained on D^a alone performs significantly worse than M-Logit, reflecting a marked mismatch between D^a and D^p.

• The logistic regression trained on D^a ∪ D^p_l improves, but mildly, as D^p_l grows, and it is ultimately outperformed by the logistic regression trained on D^p_l alone, demonstrating that some data in D^a are mismatched with D^p and hence cannot be correctly classified along with D^p if the mismatch is not compensated.

• As D^p_l grows, the logistic regression trained on D^p_l alone finally approaches M-Logit, showing that without the interference of D^a, a sufficient D^p_l can define a correct classifier.

• All four classifiers benefit from the actively selected D^p_l; this is consistent with the general observation in active learning [Cohn et al., 1995; Krogh & Vedelsby, 1995].

Figure 2. Error rates of M-Logit and logistic regression on the toy data, as a function of the size of D^p_l. The primary labeled data D^p_l are actively selected from D^p, using the method in Section 5.

To better understand the active selection process, we show in Figure 3 the first few iterations of active learning. Iteration 0 corresponds to the initially empty D^p_l, and iterations 1, 5, 10, and 13 respectively correspond to 1, 5, 10, and 13 data points selected cumulatively from D^p_u into D^p_l. Each time a new data point is selected, w is re-trained, yielding different decision boundaries. As can be seen in Figure 3, the decision boundary does not change much after 10 data points are selected, demonstrating convergence.
In Figure 3, each auxiliary data point x^a_i ∈ D^a is displayed with a symbol whose size is proportional to exp(−y^a_i μ_i / 12); hence a small symbol corresponds to a large y^a_i μ_i and thus small participation in determining w. The auxiliary data that cannot be correctly classified along with the primary data are de-emphasized by the M-Logit. Usually the auxiliary data near the decision boundary are de-emphasized.
6.2. Results on the Wisconsin Breast Cancer Databases

In the second example we consider the Wisconsin Breast Cancer Databases from the UCI Machine Learning Repository. The data set consists of 569 instances with feature dimensionality 30. We randomly partition the data set into two subsets, one with 228 data points and the other with 341 data points. The first is used as D^p, and the second as D^a. We artificially make D^a mismatched with D^p by introducing errors into the labels and adding noise to the features. Specifically, we make changes to 50% randomly chosen (x^a_i, y^a_i) ∈ D^a: we change the sign of y^a_i and add 0 dB white Gaussian noise to x^a_i. We then proceed, as in Section 6.1, to train and test the four classifiers.
We again consider both randomly selected D^p_l and actively selected D^p_l. The test errors are summarized in Figures 4 and 5. The results are essentially consistent with those in Figures 1 and 2, extending the observations we made there to the real data considered here. It is particularly noted that the mismatch between D^a and D^p is more prominent here than in the toy data, as manifested by the error rates of the logistic regression trained on D^a alone. This makes M-Logit more advantageous in the comparison: not only does it give the best results, but it also converges faster than the others with the size of D^p_l.

To examine the effect of C on the performance of M-Logit, we present in Figure 6 the test error rates of M-Logit using five different values of C: C = 2, 4, 6, 8, 10. Here the D^p_l are determined by active learning as described in Section 5. Clearly, the results for the five different C's are almost indistinguishable. This relative insensitivity of M-Logit to C may partly be attributed to the adaptivity brought about by active learning. With different C, the D^p_l are also selected differently, thus counteracting the effect of C and keeping M-Logit robust.
