
Moderate diversity for better cluster ensembles
Stefan T. Hadjitodorov a,*, Ludmila I. Kuncheva b, Ludmila P. Todorova a

a Center for Biomedical Engineering (CLBME), Bulgarian Academy of Sciences, "Acad. G. Bonchev" Str., block 105, Sofia 1113, Bulgaria
b School of Informatics, University of Wales—Bangor, Bangor, Gwynedd LL57 1UT, United Kingdom
Received 21 September 2004; received in revised form 29 January 2005; accepted 29 January 2005
Available online 3 March 2005

* Corresponding author. Tel.: +359 2 987 5819; fax: +359 2 981 6629. E-mail address: sthadj@argo.bas.bg (S.T. Hadjitodorov).
Abstract
The adjusted Rand index is used to measure diversity in cluster ensembles, and a diversity measure is subsequently proposed. Although the measure was found to be related to the quality of the ensemble, this relationship appeared to be non-monotonic. In some cases, ensembles which exhibited a moderate level of diversity gave a more accurate clustering. Based on this, a procedure for building a cluster ensemble of a chosen type is proposed (assuming that an ensemble relies on one or more random parameters): generate a small random population of cluster ensembles, calculate the diversity of each ensemble and select the ensemble corresponding to the median diversity. We demonstrate the advantages of both our measure and procedure on 5 data sets and carry out statistical comparisons involving two diversity measures for cluster ensembles from the recent literature. An experiment with 9 data sets was also carried out to examine how the diversity-based selection procedure fares on ensembles of various sizes. For these experiments the classification accuracy was used as the performance criterion. The results suggest that selection by median diversity is no worse and in some cases is better than building and holding on to one ensemble.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Pattern recognition; Machine learning; Multiple classifiers; Cluster ensembles; Diversity measures
1. Introduction
Cluster ensembles emerged recently as a coherent stream out of the multiple classifier systems area [12,27,28,8–10,6,23,1,14,11]. They are deemed to be better than single clustering algorithms for discovering complex or noisy structures in the data. The strongest argument in favour of cluster ensembles is as follows. It is known that the current off-the-shelf clustering methods may suggest very different structures in the same data. This is the result of the different clustering criteria being optimized. There is no layman's guide to choosing a clustering method for a given data set, and so an inexperienced user runs the risk of picking an inappropriate clustering method. There is no ground truth against which the result can be matched, therefore there is no way to critique the user's choice. Cluster ensembles provide a more universal solution in that various structures and shapes of clusters present in the data may be discovered by the same ensemble method, and the solution is less dependent upon the chosen ensemble type [27].
Let Z be a data set and let P = {P_1, ..., P_L} be a set of partitions of Z. Each partition is obtained by applying a clustering algorithm to Z or a subset of it. We assume that the partitions are generated by varying a random parameter of the clustering algorithm, for example starting the algorithm from L random initializations. The clustering algorithm (or run) which produces P_i will be called here an "ensemble member" or "clusterer". The clusterers may be versions of the same clustering algorithm or different clustering algorithms. For simplicity, the same notation, P_i, will be used both
for the clusterer and for the corresponding partition. The goal is to find a single (resultant) partition, P*, based on the information contained in the set P.
The "accuracy" of a clustering algorithm (or a cluster ensemble) is measured by the match between the partition produced and some known ground-truth partition. A reliable ground-truth partition is seldom available, so most experimental studies employ generated data with a pre-specified cluster structure. From the many matching indices suggested in the literature [4,5,16,26], we chose the adjusted Rand index [16] because of the following properties: (1) it has a fixed value of 0 if the two compared partitions are formed independently from one another; (2) in our preliminary experiments, this index was found to have a greater sensitivity in picking out good partitions compared to other indices.
Diversity within an ensemble is of vital importance for its success. An ensemble of identical clusterers or classifiers will not outperform the individual ensemble members. However, finding a sensible quantitative measure of diversity in classifier ensembles has been notoriously hard [19–21]. Diversity in cluster ensembles is considered here. A diversity measure is proposed and its relationship with the accuracy of the ensemble is demonstrated. Based on the results, a procedure is suggested for selecting a cluster ensemble from a small population of ensembles. The proposed diversity measure as well as the match index for the ensemble accuracy are based on the adjusted Rand index.
The rest of the paper is organized as follows. Section 2 introduces cluster ensembles. Section 3 contains the proposed diversity measure together with some results on its relationship with the ensemble accuracy. At the end of this section we list the steps of our proposed methodology for selecting a cluster ensemble from a small population. Section 4 offers the results from a statistical comparison of the proposed diversity measure with two other measures due to Fern and Brodley [6] and Greene et al. [13]. Section 5 contains an experiment with 9 data sets looking into the relationship between the performance of the proposed selection method and the ensemble size. Section 6 concludes the study.
2. Cluster ensembles
There are various ways to build a cluster ensemble:

• Use different subsets of features (overlapping or disjoint), called feature-distributed clustering in [13,27,28].
• Use different clustering algorithms within the ensemble [15]. Such ensembles are called heterogeneous or hybrid. Ensembles with the same clustering method obtained by varying a random parameter will be called homogeneous.
• Vary a random parameter of the clustering algorithm. For example, run the k-means clustering method from different initializations, or generate L random projections of the data onto a low-dimensional space and run k-means for each projection [6,29].
• Use a different data set for each ensemble member, obtained by sampling with or without replacement [3,7,10,22,23]; this is called object-distributed clustering [27,28].
Any combination of the above construction heuristics is also a possible construction method. Once P_1, ..., P_L are constructed, the resultant partition P* has to be found.
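Before turning to how P* is derived, here is a minimal sketch of the homogeneous construction route: generating L partitions by running k-means from random initializations with a randomly chosen number of clusters. This is our illustration, assuming scikit-learn is available; the function name and parameter defaults are ours, not the authors'.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_partitions(Z, L=25, c_range=(2, 22), rng=None):
    """Generate L partitions of Z by running k-means from random
    initializations with a random number of overproduced clusters."""
    rng = np.random.default_rng(rng)
    partitions = []
    for _ in range(L):
        c = int(rng.integers(c_range[0], c_range[1] + 1))
        km = KMeans(n_clusters=c, init="random", n_init=1,
                    random_state=int(rng.integers(1_000_000)))
        partitions.append(km.fit_predict(Z))
    return partitions
```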
The direct approach (re-labeling) seeks correspondence between the cluster labels across the partitions and fuses the clusters of the same label [7,27,28,32]. Note that the labels that we assign to the clusters in the individual partitions are arbitrary. Thus two identical partitions might have permuted labels and be perceived as different. Suppose that the correspondence between the partitions has been solved. Then voting between the clusterers would be straightforward: just count the number of votes for the respective cluster. For c clusters, there are c! permutations of the labels, and an exhaustive experiment might not be feasible for large c.
The feature-based approach treats the output of each clusterer as a categorical feature. The collection of L features can be regarded as an "intermediate feature space" and another clustering algorithm can be run on it. A mixture model for this case is proposed in [30].
The hypergraph approach [27,28] organizes the L partitions into a hypergraph and uses methods for hypergraph partitioning to obtain the ensemble result.
Finally, the pairwise approach (also called the co-association approach) avoids the correspondence task altogether by using a coincidence matrix between all pairs of objects. The matrices for the individual partitions are then combined and a final clustering is derived from the combination.
This study is based on the pairwise approach, whose generic algorithm is detailed in Fig. 1. In the traditional implementation of this algorithm (voting k-means [8], evidence accumulation algorithm [9]), the following choices are made: (i) The clusterers are various runs of the k-means algorithm (see [2] for details). (ii) The same number of overproduced clusters, c, is assigned to each clusterer. (iii) The final consensus matrix, M, has an (i,j)th entry

$$m_{ij} = \frac{1}{L}\sum_{k=1}^{L} m^{(k)}_{ij},$$

where $m^{(k)}_{ij} = 1$ if clusterer k puts objects i and j in the same cluster and 0 otherwise; that is, $m_{ij}$ contains the proportion out of the L clusterers which have put objects i and j in the same cluster.
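The consensus-matrix computation can be sketched as follows, under the same naming assumptions as the previous snippet (the input is a list of integer label vectors):

```python
import numpy as np

def consensus_matrix(partitions):
    """M[i, j] = proportion of clusterers that put objects i and j
    in the same cluster (the co-association / consensus matrix)."""
    N = len(partitions[0])
    M = np.zeros((N, N))
    for labels in partitions:
        labels = np.asarray(labels)
        # m^(k)_{ij} = 1 iff objects i and j share a label under clusterer k
        M += (labels[:, None] == labels[None, :]).astype(float)
    return M / len(partitions)
```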
The resultant partition, P*, is traditionally found in one of two ways. In the first way, M is "cut" with a pre-specified threshold, h. All entries greater than h are set to 1 and the remaining entries are set to 0. The new matrix is treated as the co-association matrix and the respective partition is derived from it (a choice of h = 0.5 will correspond to majority voting between the clusterers as to the joint membership of objects i and j). In the second way, the entries of M are treated as "similarities" and another clustering algorithm is run on them. The common choice is the single-link algorithm (see [2]). In fact, cutting M at a certain threshold is equivalent to running the single-link algorithm and cutting the dendrogram obtained from the hierarchical clustering at similarity h. Viewed in this context, the cluster ensemble is a type of stacked clustering, where layers of similarity matrices are generated and clustering algorithms are subsequently applied to them. Our pilot experiments showed slightly better results when a new clustering procedure was applied on the consensus matrix M used as data. This corresponds to a method recently proposed in pattern recognition whereby similarities are treated as new features [24,25]. If not stated otherwise, the experiments in this paper are carried out with single-link clustering using the consensus matrix, M, as the data.
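The single-link derivation of P* can be sketched with SciPy's hierarchical clustering by converting the similarities in M to distances. This is our illustration of the general idea, not the authors' exact code; the default h = 0.5 mimics majority voting.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def resultant_partition(M, h=0.5):
    """Cut the consensus matrix M with single link at similarity h."""
    D = 1.0 - M                      # similarity -> distance
    np.fill_diagonal(D, 0.0)         # squareform expects a zero diagonal
    dendrogram = linkage(squareform(D, checks=False), method="single")
    # cutting the dendrogram at distance 1 - h == cutting M at similarity h
    return fcluster(dendrogram, t=1.0 - h, criterion="distance")
```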
3. Diversity measures for cluster ensembles
The adjusted Rand index, needed for both the diversity and the accuracy of the ensemble, is calculated as follows [16]. Let A and B be two partitions of a data set Z with N objects. Let A have c_A clusters and B have c_B clusters. Denote by

• N_ij the number of objects in cluster i in partition A and in cluster j in partition B;
• N_i. the number of objects in cluster i in partition A;
• N_.j the number of objects in cluster j in partition B.
The adjusted Rand index is

$$t_1 = \sum_{i=1}^{c_A} \binom{N_{i.}}{2}, \qquad t_2 = \sum_{j=1}^{c_B} \binom{N_{.j}}{2}, \qquad t_3 = \frac{2\,t_1 t_2}{N(N-1)},$$

$$ar(A,B) = \frac{\sum_{i=1}^{c_A}\sum_{j=1}^{c_B} \binom{N_{ij}}{2} - t_3}{\frac{1}{2}(t_1 + t_2) - t_3},$$

where $\binom{a}{b}$ is the binomial coefficient. If we fix the numbers of clusters c_A and c_B, and the number of objects in each cluster, and draw A and B randomly (generalized hypergeometric distribution), the adjusted Rand index ar(A,B) should be zero. Values of ar(A,B) close to zero indicate that by observing A, nothing can be predicted about B, and vice versa.
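The formulas translate directly into code. Below is a minimal NumPy sketch (our naming, assuming labels are non-negative integers); scikit-learn's adjusted_rand_score implements the same index and can serve as a cross-check.

```python
import numpy as np
from scipy.special import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index ar(A, B) between two partitions,
    following the contingency-table formulas above."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    N = a.size
    # contingency table: N_ij = objects in cluster i of A and cluster j of B
    cont = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(cont, (a, b), 1)
    t1 = comb(cont.sum(axis=1), 2).sum()   # sum_i C(N_i., 2)
    t2 = comb(cont.sum(axis=0), 2).sum()   # sum_j C(N_.j, 2)
    t3 = 2.0 * t1 * t2 / (N * (N - 1))
    return (comb(cont, 2).sum() - t3) / (0.5 * (t1 + t2) - t3)
```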
The accuracy of a clusterer, say P_i, is taken to be ar(P_i, P_T), where P_T is the known true partition of the data. Respectively, the accuracy of the ensemble is calculated as ar(P*, P_T). Note that 'ar' is used as a measure of accuracy. In fact, the adjusted Rand index measures the degree of departure from the assumption that the two compared clustering results have occurred by chance. A value of zero of this index means that the two partitions have been generated completely independently of one another. Thus a value of zero does not mean that there is no match between the labels! One distinguishing feature of our study is that we do not impose the correct (known) number of clusters onto the algorithm, either at the stage of building the individual clusterers or at the aggregation stage. Thus our clustering may end up with 2, 3, 4 or 5 clusters for a problem with 2 classes. Calculation of a classification error will be suitable when the number of clusters is set equal to the number of classes.
Two approaches to measuring the ensemble diversity are considered—pairwise and non-pairwise. In the pairwise approach, using the adjusted Rand index, the ensemble diversity is
$$D_p = \frac{2}{L(L-1)} \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} \bigl(1 - ar(P_i, P_j)\bigr).$$
Note that 'ar' gives the similarity between partitions, therefore 1 − ar is the pairwise diversity. Pairwise diversity in cluster ensembles is discussed in [6]. There, instead of the adjusted Rand index, the Normalized Mutual Information (NMI) is used. We have studied NMI in our previous experiments but chose 'ar' here for the reasons stated above. In any case, the two indices have very similar behaviour.
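A direct implementation of D_p, reusing the adjusted_rand helper sketched above:

```python
def pairwise_diversity(partitions):
    """D_p: average pairwise dissimilarity (1 - ar) over all clusterer pairs."""
    L = len(partitions)
    total = sum(1.0 - adjusted_rand(partitions[i], partitions[j])
                for i in range(L - 1) for j in range(i + 1, L))
    return 2.0 * total / (L * (L - 1))
```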
The non-pairwise approach can be subdivided into two kinds: group diversity and individual diversity.
The measure proposed in [13] can be branded as "group diversity". The mutual information of the consensus matrix M is calculated by regarding each entry in the consensus matrix as a probability distribution. The random variable in each cell of the matrix has two values: "Yes" (meaning that objects i and j belong in the same cluster), an estimate of Pr(Yes) being m_ij, and "No" with Pr(No) = 1 − m_ij. The entropy of this distribution is
$$H_{ij} = -\bigl(m_{ij}\log_2(m_{ij}) + (1 - m_{ij})\log_2(1 - m_{ij})\bigr),$$
and the overall measure of diversity is the averaged entropy across the consensus matrix M,

$$H = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} H_{ij}.$$
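A sketch of this group-diversity computation from a consensus matrix (our code, with a small clipping guard standing in for the 0·log(0) = 0 convention):

```python
import numpy as np

def group_diversity(M):
    """Averaged entropy of the consensus matrix entries above the diagonal."""
    iu = np.triu_indices_from(M, k=1)      # all pairs i < j
    m = np.clip(M[iu], 1e-12, 1 - 1e-12)   # guard: treat 0*log(0) as 0
    H = -(m * np.log2(m) + (1 - m) * np.log2(1 - m))
    return H.mean()                        # equals 2/(N(N-1)) * sum H_ij
```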
The larger the entropy, the more diverse the ensemble is. If all clusterers gave the same partition, then M would contain only 0s and 1s, and the entropy would be 0. (Note that by convention 0·log(0) = 0.)
For the individual diversity subgroup, the ensemble decision is derived first and each clusterer is assigned a diversity value measuring its difference from the ensemble decision. Recalling that P* denotes the resultant clustering (the ensemble decision), the individual diversity of clusterer P_i is 1 − ar(P_i, P*). To obtain an overall measure of diversity we may simply take the average of the L individual diversities,
$$D_{np\text{-}1} = \frac{1}{L} \sum_{i=1}^{L} \bigl(1 - ar(P_i, P^*)\bigr).$$
In our previous studies [18] we found that ensembles that exhibit a larger spread of individual diversities are generally better than ensembles with a smaller spread. Therefore, we choose as the second non-pairwise diversity measure, D_np-2, the standard deviation of the individual diversities
$$D_{np\text{-}2} = \sqrt{\frac{1}{L-1} \sum_{i=1}^{L} \bigl(1 - ar(P_i, P^*) - D_{np\text{-}1}\bigr)^2}.$$
It turned out that the spread alone was not strongly related to the ensemble accuracy either. Therefore a third non-pairwise diversity measure, D_np-3, is proposed here based on the following intuition. Since it is believed that the ensemble decision is close to the true labeling of the data, the accuracy of the individual clusterers may be estimated by how close they are to the ensemble decision. Thus larger values of 1 − D_np-1 should be preferred. On the other hand, variability within the ensemble can be estimated by the spread of the individual diversities. Large variability will be indicated by larger values of D_np-2. The simplest compromise measure, D_np-3, can be devised as

$$D_{np\text{-}3} = \frac{1}{2}\bigl(1 - D_{np\text{-}1} + D_{np\text{-}2}\bigr).$$
Another compromise measure can be constructed using the coefficient of variation¹

$$D_{np\text{-}4} = \frac{D_{np\text{-}2}}{D_{np\text{-}1}}.$$
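The four non-pairwise measures then reduce to a few lines (our sketch, reusing adjusted_rand and assuming the resultant partition P* has already been derived):

```python
import numpy as np

def nonpairwise_diversities(partitions, p_star):
    """D_np-1 .. D_np-4 from the individual diversities 1 - ar(P_i, P*)."""
    d = np.array([1.0 - adjusted_rand(p, p_star) for p in partitions])
    d1 = d.mean()               # D_np-1: mean individual diversity
    d2 = d.std(ddof=1)          # D_np-2: spread (sample std, 1/(L-1) factor)
    d3 = 0.5 * (1.0 - d1 + d2)  # D_np-3: compromise measure
    d4 = d2 / d1                # D_np-4: coefficient of variation
    return d1, d2, d3, d4
```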
The goal is to find a measure related to the quality of the ensemble so that we can pick from a set of ensembles the one that is most likely to be good. Figs. 2 and 3 show the relationship between the six diversity measures—D_p, H, D_np-1, D_np-2, D_np-3 and D_np-4—and the ensemble accuracy, calculated through the adjusted Rand index, for two data sets used in the experiments reported in the following section. Fig. 2 shows the six plots for an artificial data set consisting of 4 Gaussian clusters, called "four-gauss". Fig. 3 shows the plots for the UCI wine data (www.ics.uci.edu/~mlearn/MLRepository.html). Each plot contains the accuracies of 100 ensembles against their diversity. The ensembles in Fig. 2 were built on 100 different data sets of 4 Gaussian clusters sampled from the same distribution. The 100 ensembles in Fig. 3 were built by varying the random parameters of the ensemble (explained later in the experimental section). The solid line shows the ensemble accuracy and the line with the dot markers shows the averaged individual accuracy.
The behaviour of all the measures except D_np-1 and D_np-3 is rather erratic. On the other hand, D_np-1 shows an unexpected pattern. The four-gauss data has a clear-cut structure, yet the accuracy of the ensemble is inversely related to its diversity D_np-1. As seen in Fig. 2(c), more diverse ensembles are less accurate than less diverse ensembles. We could attribute this phenomenon to the intuition that more diversity would be associated with many clusterers not getting the clustering structure right and therefore having lower individual accuracy. It is interesting to observe that the average accuracy of the individual clusterers (dot marker) shows no substantial increase or decrease. We can say that, although D_np-1 is a valid measure of diversity by design, its behaviour with respect to accuracy is to some extent counterintuitive.
Figs. 2 and 3 display the two typical patterns that we found in our experiments. For data sets with a clear-cut cluster structure, such as four-gauss, the proposed indices D_np-3 and D_np-4 have a roughly monotonically increasing relationship with the ensemble accuracy. For other data sets, an example of which is the wine data set, there is a "cup" pattern, showing that moderate values of the two indices are associated with higher ensemble accuracy.
¹ Other combinations have been tried as well, including ones that involved both pairwise and non-pairwise indices (D_p and the standard deviation of the pairwise diversities). The results were similar to or worse than the ones reported here.
To ease the interpretation of Figs. 2 and 3, the correlation between the 6 diversity measures and the ensemble accuracy has been calculated and displayed in Table 1. Since the correlation coefficient is a measure of linear dependency, the "cup" pattern in Fig. 3(c) and (e) will not stand out.
Fig. 4 shows the two subplots (e) from Figs. 2 and 3 and a fitted polynomial curve of degree 3 (the thick line) of the ensemble accuracy as a function of D_np-3. The monotonic and the cup pattern can be seen from the interpolation. The problem is that it is not known in advance whether our data will exhibit one or the other.
The two patterns suggest that a compromise can be sought in the medium value of the diversity. Therefore we suggest the following simple procedure for building cluster ensembles:

1. Generate K ensembles, varying the random parameter(s) of the clustering algorithm.
2. Calculate the diversity of each ensemble using a chosen diversity measure (we prefer D_np-3 or D_np-1 for the reasons stated above and the explanations later in the experimental section).
3. Find the median of the diversity values and pick the corresponding ensemble.
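Putting the pieces together, the median-diversity selection could look as follows. This sketch chains the helper functions defined above and assumes D_np-3 as the chosen measure; everything outside the three numbered steps is our own scaffolding.

```python
import numpy as np

def select_by_median_diversity(Z, K=10, L=25, rng=None):
    """Step 1: generate K ensembles; step 2: score each by D_np-3;
    step 3: return the ensemble whose diversity is the median."""
    rng = np.random.default_rng(rng)
    candidates = []
    for _ in range(K):
        parts = generate_partitions(Z, L=L, rng=rng)
        p_star = resultant_partition(consensus_matrix(parts))
        _, _, d3, _ = nonpairwise_diversities(parts, p_star)
        candidates.append((d3, p_star))
    candidates.sort(key=lambda pair: pair[0])
    return candidates[len(candidates) // 2]   # (median diversity, its P*)
```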
Our hypothesis is that ensembles selected through median diversity will fare better than randomly selected ensembles or ensembles selected through maximum diversity.

4. Experiments
Seven types of homogeneous ensembles were constructed, as summarized in Table 2. The two most common types of clusterers were used: k-means and the mean link method (average link, average linkage). All ensembles consisted of L = 25 clusterers. The parameters that we varied were:
• the number of overproduced clusters, c. The value was either fixed at c = 20 (ensembles 1 and 5) or chosen randomly for each ensemble in the range from 2 to 22;
• the initialization of k-means, for ensembles 1, 2, 3 and 4;
• the sample submitted for clustering to each ensemble member. In ensemble models 1, 2 and 4, the whole data set Z was submitted to each clusterer. In the remaining ensembles, a random sub-sample of Z of size between N/2 and N was submitted to each clusterer;
• noise injection. For building ensembles 4 and 7 we altered the data for each ensemble by adding Gaussian noise with mean 0 and standard deviation 0.1.
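For concreteness, the sub-sampling and noise-injection variations could be realized as below. This is a hedged sketch: the text does not specify whether sampling is with or without replacement (we show without), nor whether noise is added per ensemble or per member (the helper applies it per call).

```python
import numpy as np

def perturbed_view(Z, rng, subsample=True, noise=False):
    """One clusterer's view of the data: optional random sub-sample of size
    between N/2 and N (without replacement) and optional Gaussian noise."""
    X = np.asarray(Z, dtype=float)
    if subsample:
        n = int(rng.integers(len(X) // 2, len(X) + 1))
        X = X[rng.choice(len(X), size=n, replace=False)]
    if noise:
        X = X + rng.normal(0.0, 0.1, size=X.shape)  # mean 0, std 0.1
    return X
```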
All ensemble types were applied to 5 data sets, summarized in Table 3. For each ensemble type and each data set, 100 ensembles were generated in order to measure diversity and its relationship with the ensemble accuracy. For the three artificial data sets, each ensemble was built on a different data set generated from the respective distribution. Fig. 5 shows an example of three such data sets.
All three sets were generated in 2-D (as plotted) and then 10 more dimensions of uniform random noise were appended to each data set. A total of 100 points were generated from each distribution. The noise is bound to introduce diversity, perhaps both helpful and harmful to the clustering. Usually noise is injected into data artificially to simulate real-life conditions; here it is appended exactly for the purpose of creating diversity. We felt that the 2-D
Table 2
Summary of the design of the 7 ensemble types

                                     1        2        3        4        5          6          7
Type of the base clusterer           k-Means  k-Means  k-Means  k-Means  Mean link  Mean link  Mean link
Number of overproduced clusters, c   20       Random   Random   Random   20         Random     Random
Sample size for the base clusterer   Whole    Whole    Random   Whole    Random     Random     Random
Noise added                          No       No       No       Yes      No         No         Yes
Table 1
Correlation coefficients between the 6 diversity measures and the ensemble accuracy for the examples in Figs. 2 and 3

Data set     D_p      H        D_np-1   D_np-2   D_np-3   D_np-4   Individual average
Four-gauss   -0.1313  -0.0329  -0.9263  0.2356   0.8701   0.6171   0.2245
Wine         0.4456   0.3378   -0.3796  0.4006   0.5394   0.3953   0.4095

Shown also is the correlation between the individual average and the ensemble accuracy.