
Mean Shift: A Robust Approach Toward Feature Space Analysis
Dorin Comaniciu, Member, IEEE, and Peter Meer, Senior Member, IEEE
Abstract: A general nonparametric technique is proposed for the analysis of a complex multimodal feature space and to delineate arbitrarily shaped clusters in it. The basic computational module of the technique is an old pattern recognition procedure, the mean shift. We prove for discrete data the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and, thus, its utility in detecting the modes of the density. The relation of the mean shift procedure to the Nadaraya-Watson estimator from kernel regression and the robust M-estimators of location is also established. Algorithms for two low-level vision tasks, discontinuity preserving smoothing and image segmentation, are described as applications. In these algorithms, the only user-set parameter is the resolution of the analysis and either gray level or color images are accepted as input. Extensive experimental results illustrate their excellent performance.
Index Terms: Mean shift, clustering, image segmentation, image smoothing, feature space, low-level vision.
1 INTRODUCTION
LOW-LEVEL computer vision tasks are misleadingly difficult. Incorrect results can be easily obtained since the employed techniques often rely upon the user correctly guessing the values for the tuning parameters. To improve performance, the execution of low-level tasks should be task driven, i.e., supported by independent high-level information. This approach, however, requires that, first, the low-level stage provides a reliable enough representation of the input and that the feature extraction process be controlled only by very few tuning parameters corresponding to intuitive measures in the input domain.
Feature space-based analysis of images is a paradigm which can achieve the above-stated goals. A feature space is a mapping of the input obtained through the processing of the data in small subsets at a time. For each subset, a parametric representation of the feature of interest is obtained and the result is mapped into a point in the multidimensional space of the parameter. After the entire input is processed, significant features correspond to denser regions in the feature space, i.e., to clusters, and the goal of the analysis is the delineation of these clusters.
The nature of the feature space is application dependent. The subsets employed in the mapping can range from individual pixels, as in the color space representation of an image, to a set of quasi-randomly chosen data points, as in the probabilistic Hough transform. Both the advantage and the disadvantage of the feature space paradigm arise from the global nature of the derived representation of the input. On one hand, all the evidence for the presence of a significant feature is pooled together, providing excellent tolerance to a noise level which may render local decisions unreliable. On the other hand, features with lesser support in the feature space may not be detected in spite of being salient for the task to be executed. This disadvantage, however, can be largely avoided by either augmenting the feature space with additional (spatial) parameters from the input domain or by robust postprocessing of the input domain guided by the results of the feature space analysis.

Analysis of the feature space is application independent. While there are a plethora of published clustering techniques, most of them are not adequate to analyze feature spaces derived from real data. Methods which rely upon a priori knowledge of the number of clusters present (including those which use optimization of a global criterion to find this number), as well as methods which implicitly assume the same shape (most often elliptical) for all the clusters in the space, are not able to handle the complexity of a real feature space. For a recent survey of such methods, see [29, Section 8].
In Fig. 1, a typical example is shown. The color image in Fig. 1a is mapped into the three-dimensional L*u*v* color space (to be discussed in Section 4). There is a continuous transition between the clusters arising from the dominant colors and a decomposition of the space into elliptical tiles will introduce severe artifacts. Enforcing a Gaussian mixture model over such data is doomed to failure, e.g., [49], and even the use of a robust approach with contaminated Gaussian densities [67] cannot be satisfactory for such complex cases. Note also that the mixture models require the number of clusters as a parameter, which raises its own challenges. For example, the method described in [45] proposes several different ways to determine this number.

Arbitrarily structured feature spaces can be analyzed only by nonparametric methods since these methods do not have embedded assumptions. Numerous nonparametric clustering methods were described in the literature and they can be classified into two large classes: hierarchical clustering and density estimation. Hierarchical clustering techniques either aggregate or divide the data based on
. D. Comaniciu is with the Imaging and Visualization Department, Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540. E-mail: comanici@scr.siemens.com.
. P. Meer is with the Electrical and Computer Engineering Department, Rutgers University, 94 Brett Road, Piscataway, NJ 08854-8058. E-mail: meer@caip.rutgers.edu.

Manuscript received 17 Jan. 2001; revised 16 July 2001; accepted 21 Nov. 2001.
Recommended for acceptance by V. Solo.
For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number 113483.
some proximity measure. See [28, Section 3.2] for a survey of hierarchical clustering methods. The hierarchical methods tend to be computationally expensive and the definition of a meaningful stopping criterion for the fusion (or division) of the data is not straightforward.
The rationale behind the density estimation-based nonparametric clustering approach is that the feature space can be regarded as the empirical probability density function (p.d.f.) of the represented parameter. Dense regions in the feature space thus correspond to local maxima of the p.d.f., that is, to the modes of the unknown density. Once the location of a mode is determined, the cluster associated with it is delineated based on the local structure of the feature space [25], [60], [63].
Our approach to mode detection and clustering is based on the mean shift procedure, proposed in 1975 by Fukunaga and Hostetler [21] and largely forgotten until Cheng's paper [7] rekindled interest in it. In spite of its excellent qualities, the mean shift procedure does not seem to be known in the statistical literature. While the book [54, Section 6.2.2] discusses [21], the advantages of employing a mean shift type procedure in density estimation were only recently rediscovered [8].
As will be proven in the sequel, a computational module based on the mean shift procedure is an extremely versatile tool for feature space analysis and can provide reliable solutions for many vision tasks. In Section 2, the mean shift procedure is defined and its properties are analyzed. In Section 3, the procedure is used as the computational module for robust feature space analysis and implementational issues are discussed. In Section 4, the feature space analysis technique is applied to two low-level vision tasks: discontinuity preserving filtering and image segmentation. Both algorithms can have as input either gray level or color images and the only parameter to be tuned by the user is the resolution of the analysis. The applicability of the mean shift procedure is not restricted to the presented examples. In Section 5, other applications are mentioned and the procedure is put into a more general context.
2 THE MEAN SHIFT PROCEDURE
Kernel density estimation (known as the Parzen window technique in the pattern recognition literature [17, Section 4.3]) is the most popular density estimation method. Given $n$ data points $\mathbf{x}_i$, $i = 1,\ldots,n$ in the $d$-dimensional space $\mathbb{R}^d$, the multivariate kernel density estimator with kernel $K(\mathbf{x})$ and a symmetric positive definite $d \times d$ bandwidth matrix $\mathbf{H}$, computed in the point $\mathbf{x}$, is given by
$$\hat f(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} K_{\mathbf{H}}(\mathbf{x}-\mathbf{x}_i), \qquad (1)$$

where
$$K_{\mathbf{H}}(\mathbf{x}) = |\mathbf{H}|^{-1/2}\,K(\mathbf{H}^{-1/2}\mathbf{x}). \qquad (2)$$

The $d$-variate kernel $K(\mathbf{x})$ is a bounded function with compact support satisfying [62, p. 95]
$$\int_{\mathbb{R}^d} K(\mathbf{x})\,d\mathbf{x} = 1 \qquad \lim_{\|\mathbf{x}\|\to\infty} \|\mathbf{x}\|^d K(\mathbf{x}) = 0$$
$$\int_{\mathbb{R}^d} \mathbf{x}\,K(\mathbf{x})\,d\mathbf{x} = \mathbf{0} \qquad \int_{\mathbb{R}^d} \mathbf{x}\mathbf{x}^\top K(\mathbf{x})\,d\mathbf{x} = c_K\,\mathbf{I}, \qquad (3)$$
where $c_K$ is a constant. The multivariate kernel can be generated from a symmetric univariate kernel $K_1(x)$ in two different ways,
$$K^P(\mathbf{x}) = \prod_{i=1}^{d} K_1(x_i) \qquad K^S(\mathbf{x}) = a_{k,d}\,K_1(\|\mathbf{x}\|), \qquad (4)$$

where $K^P(\mathbf{x})$ is obtained from the product of the univariate kernels and $K^S(\mathbf{x})$ from rotating $K_1(x)$ in $\mathbb{R}^d$, i.e., $K^S(\mathbf{x})$ is radially symmetric. The constant $a_{k,d}^{-1} = \int_{\mathbb{R}^d} K_1(\|\mathbf{x}\|)\,d\mathbf{x}$ assures that $K^S(\mathbf{x})$ integrates to one, though this condition can be relaxed in our context. Either type of multivariate kernel obeys (3) but, for our purposes, the radially symmetric kernels are often more suitable.
We are interested only in a special class of radially symmetric kernels satisfying

$$K(\mathbf{x}) = c_{k,d}\,k(\|\mathbf{x}\|^2), \qquad (5)$$

in which case it suffices to define the function $k(x)$, called the profile of the kernel, only for $x \ge 0$. The normalization constant $c_{k,d}$, which makes $K(\mathbf{x})$ integrate to one, is assumed strictly positive.
Fig. 1. Example of a feature space. (a) A 400×276 color image. (b) Corresponding L*u*v* color space with 110,400 data points.

Using a fully parameterized $\mathbf{H}$ increases the complexity of the estimation [62, p. 106] and, in practice, the bandwidth matrix $\mathbf{H}$ is chosen either as diagonal, $\mathbf{H} = \mathrm{diag}[h_1^2,\ldots,h_d^2]$, or proportional to the identity matrix, $\mathbf{H} = h^2\mathbf{I}$. The clear advantage of the latter case is that only one bandwidth parameter $h > 0$ must be provided; however, as can be seen from (2), then the validity of an Euclidean metric for the feature space should be confirmed first. Employing only one bandwidth parameter, the kernel density estimator (1) becomes the well-known expression

$$\hat f(\mathbf{x}) = \frac{1}{nh^d}\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h}\right). \qquad (6)$$
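To make (6) concrete, the following is a minimal numerical sketch (our own illustration, not code from the paper; the function and variable names are ours), evaluating the fixed-bandwidth estimator at a query point with the multivariate normal kernel that will be introduced in (10):

```python
import numpy as np

def normal_kernel(u):
    """Multivariate normal kernel K_N(u) = (2*pi)^(-d/2) exp(-||u||^2 / 2), cf. (10)."""
    d = u.shape[-1]
    return (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u ** 2, axis=-1))

def kde(x, X, h, kernel=normal_kernel):
    """Kernel density estimate (6): f_hat(x) = 1/(n h^d) * sum_i K((x - x_i)/h)."""
    n, d = X.shape
    return np.sum(kernel((x - X) / h)) / (n * h ** d)

# Toy usage: 500 points from two Gaussian clusters in R^2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (250, 2)), rng.normal(2, 0.3, (250, 2))])
print(kde(np.array([0.0, 0.0]), X, h=0.5))   # high density near a cluster center
print(kde(np.array([1.0, 1.0]), X, h=0.5))   # low density in the valley between
```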
The quality of a kernel density estimator is measured by the mean of the square error between the density and its estimate, integrated over the domain of definition. In practice, however, only an asymptotic approximation of this measure (denoted as AMISE) can be computed. Under the asymptotics, the number of data points $n \to \infty$, while the bandwidth $h \to 0$ at a rate slower than $n^{-1}$. For both types of multivariate kernels, the AMISE measure is minimized by the Epanechnikov kernel [51, p. 139], [62, p. 104] having the profile

$$k_E(x) = \begin{cases} 1 - x & 0 \le x \le 1 \\ 0 & x > 1, \end{cases} \qquad (7)$$

which yields the radially symmetric kernel

$$K_E(\mathbf{x}) = \begin{cases} \frac{1}{2}\,c_d^{-1}(d+2)\left(1 - \|\mathbf{x}\|^2\right) & \|\mathbf{x}\| \le 1 \\ 0 & \text{otherwise}, \end{cases} \qquad (8)$$

where $c_d$ is the volume of the unit $d$-dimensional sphere.
Note that the Epanechnikov profile is not differentiable at the boundary. The profile

$$k_N(x) = \exp\!\left(-\frac{1}{2}x\right) \qquad x \ge 0 \qquad (9)$$

yields the multivariate normal kernel

$$K_N(\mathbf{x}) = (2\pi)^{-d/2}\exp\!\left(-\frac{1}{2}\|\mathbf{x}\|^2\right) \qquad (10)$$

for both types of composition (4). The normal kernel is often symmetrically truncated to have a kernel with finite support. While these two kernels will suffice for most applications we are interested in, all the results presented below are valid for arbitrary kernels within the conditions to be stated. Employing the profile notation, the density estimator (6) can be rewritten as
$$\hat f_{h,K}(\mathbf{x}) = \frac{c_{k,d}}{nh^d}\sum_{i=1}^{n} k\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h}\right\|^2\right). \qquad (11)$$

The first step in the analysis of a feature space with the underlying density $f(\mathbf{x})$ is to find the modes of this density. The modes are located among the zeros of the gradient, $\nabla f(\mathbf{x}) = \mathbf{0}$, and the mean shift procedure is an elegant way to locate these zeros without estimating the density.
2.1 Density Gradient Estimation
The density gradient estimator is obtained as the gradient of the density estimator by exploiting the linearity of (11),

$$\hat\nabla f_{h,K}(\mathbf{x}) \equiv \nabla \hat f_{h,K}(\mathbf{x}) = \frac{2c_{k,d}}{nh^{d+2}}\sum_{i=1}^{n}(\mathbf{x}-\mathbf{x}_i)\,k'\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h}\right\|^2\right). \qquad (12)$$
We define the function

$$g(x) = -k'(x), \qquad (13)$$

assuming that the derivative of the kernel profile $k$ exists for all $x \in [0,\infty)$, except for a finite set of points. Now, using $g(x)$ for profile, the kernel $G(\mathbf{x})$ is defined as

$$G(\mathbf{x}) = c_{g,d}\,g(\|\mathbf{x}\|^2), \qquad (14)$$

where $c_{g,d}$ is the corresponding normalization constant. The kernel $K(\mathbf{x})$ was called the shadow of $G(\mathbf{x})$ in [7] in a slightly different context. Note that the Epanechnikov kernel is the shadow of the uniform kernel, i.e., the $d$-dimensional unit sphere, while the normal kernel and its shadow have the same expression.
Introducing $g(x)$ into (12) yields

$$\hat\nabla f_{h,K}(\mathbf{x}) = \frac{2c_{k,d}}{nh^{d+2}}\sum_{i=1}^{n}(\mathbf{x}_i-\mathbf{x})\,g\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h}\right\|^2\right)$$
$$= \frac{2c_{k,d}}{nh^{d+2}}\left[\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h}\right\|^2\right)\right]\left[\frac{\sum_{i=1}^{n}\mathbf{x}_i\,g\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h}\right\|^2\right)} - \mathbf{x}\right], \qquad (15)$$

where $\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h}\right\|^2\right)$ is assumed to be a positive number. This condition is easy to satisfy for all the profiles met in practice. Both terms of the product in (15) have special significance. From (11), the first term is proportional to the density estimate at $\mathbf{x}$ computed with the kernel $G$,
$$\hat f_{h,G}(\mathbf{x}) = \frac{c_{g,d}}{nh^d}\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h}\right\|^2\right). \qquad (16)$$

The second term is the mean shift

$$\mathbf{m}_{h,G}(\mathbf{x}) = \frac{\sum_{i=1}^{n}\mathbf{x}_i\,g\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h}\right\|^2\right)} - \mathbf{x}, \qquad (17)$$

i.e., the difference between the weighted mean, using the kernel $G$ for weights, and $\mathbf{x}$, the center of the kernel (window). From (16) and (17), (15) becomes
$$\hat\nabla f_{h,K}(\mathbf{x}) = \hat f_{h,G}(\mathbf{x})\,\frac{2c_{k,d}}{h^2 c_{g,d}}\,\mathbf{m}_{h,G}(\mathbf{x}), \qquad (18)$$

yielding

$$\mathbf{m}_{h,G}(\mathbf{x}) = \frac{1}{2}h^2 c\,\frac{\hat\nabla f_{h,K}(\mathbf{x})}{\hat f_{h,G}(\mathbf{x})}, \qquad (19)$$

where $c = c_{g,d}/c_{k,d}$. The expression (19) shows that, at location $\mathbf{x}$, the mean shift vector computed with kernel $G$ is proportional to the normalized density gradient estimate obtained with kernel $K$. The normalization is by the density estimate in $\mathbf{x}$ computed with the kernel $G$. The mean shift vector thus always points toward the direction of maximum increase in the density.
This is a more general formulation of the property first remarked by Fukunaga and Hostetler [20, p. 535], [21], and discussed in [7].
The relation captured in (19) is intuitive, the local mean is shifted toward the region in which the majority of the points reside. Since the mean shift vector is aligned with the local gradient estimate, it can define a path leading to a stationary point of the estimated density. The modes of the density are such stationary points. The mean shift procedure, obtained by successive

. computation of the mean shift vector $\mathbf{m}_{h,G}(\mathbf{x})$,
. translation of the kernel (window) $G(\mathbf{x})$ by $\mathbf{m}_{h,G}(\mathbf{x})$,

is guaranteed to converge at a nearby point where the estimate (11) has zero gradient, as will be shown in the next section. The presence of the normalization by the density estimate is a desirable feature. The regions of low-density values are of no interest for the feature space analysis and, in such regions, the mean shift steps are large. Similarly, near local maxima the steps are small and the analysis more refined. The mean shift procedure thus is an adaptive gradient ascent method.
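As a concrete illustration, here is a minimal sketch of the mean shift vector (17) (our own code, not from the paper). With the uniform kernel $G$, the shadow pair of the Epanechnikov kernel noted above, $g(u) = 1$ on $[0,1]$ and $0$ elsewhere, so the vector reduces to the mean of the points inside the window minus the window center:

```python
import numpy as np

def mean_shift_vector(x, X, h):
    """Mean shift vector (17) for the uniform kernel G, i.e., g(u) = 1 for
    u <= 1 and 0 otherwise (the shadow pair of the Epanechnikov kernel K).

    Reduces to: mean of the data points within distance h of x, minus x.
    """
    u = np.sum(((x - X) / h) ** 2, axis=1)   # ||(x - x_i)/h||^2 for all i
    w = (u <= 1.0).astype(float)             # g(u) = -k_E'(u)
    if w.sum() == 0:
        return np.zeros_like(x)              # empty window: return zero shift
    return (X * w[:, None]).sum(axis=0) / w.sum() - x
```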
2.2 Sufficient Condition for Convergence
Denote by $\{\mathbf{y}_j\}_{j=1,2,\ldots}$ the sequence of successive locations of the kernel $G$, where, from (17),

$$\mathbf{y}_{j+1} = \frac{\sum_{i=1}^{n}\mathbf{x}_i\,g\!\left(\left\|\frac{\mathbf{y}_j-\mathbf{x}_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{y}_j-\mathbf{x}_i}{h}\right\|^2\right)} \qquad j = 1,2,\ldots \qquad (20)$$

is the weighted mean at $\mathbf{y}_j$ computed with kernel $G$ and $\mathbf{y}_1$ is the center of the initial position of the kernel. The corresponding sequence of density estimates computed with kernel $K$, $\{\hat f_{h,K}(j)\}_{j=1,2,\ldots}$, is given by

$$\hat f_{h,K}(j) = \hat f_{h,K}(\mathbf{y}_j) \qquad j = 1,2,\ldots. \qquad (21)$$
As stated by the following theorem, a kernel $K$ that obeys some mild conditions suffices for the convergence of the sequences $\{\mathbf{y}_j\}_{j=1,2,\ldots}$ and $\{\hat f_{h,K}(j)\}_{j=1,2,\ldots}$.

Theorem 1. If the kernel $K$ has a convex and monotonically decreasing profile, the sequences $\{\mathbf{y}_j\}_{j=1,2,\ldots}$ and $\{\hat f_{h,K}(j)\}_{j=1,2,\ldots}$ converge and $\{\hat f_{h,K}(j)\}_{j=1,2,\ldots}$ is monotonically increasing.

The proof is given in the Appendix. The theorem
generalizes the result derived differently in [13], where $K$ was the Epanechnikov kernel and $G$ the uniform kernel. The theorem remains valid when each data point $\mathbf{x}_i$ is associated with a nonnegative weight $w_i$. An example of nonconvergence when the kernel $K$ is not convex is shown in [10, p. 16].

The convergence property of the mean shift was also discussed in [7, Section iv]. (Note, however, that almost all the discussion there is concerned with the "blurring" process in which the input is recursively modified after each mean shift step.) The convergence of the procedure as defined in this paper was attributed in [7] to the gradient ascent nature of (19). However, as shown in [4, Section 1.2], moving in the direction of the local gradient guarantees convergence only for infinitesimal steps. The step size of a gradient-based algorithm is crucial for the overall performance. If the step size is too large, the algorithm will diverge, while if the step size is too small, the rate of convergence may be very slow. A number of costly procedures have been developed for step size selection [4, p. 24]. The guaranteed convergence (as shown by Theorem 1) is due to the adaptive magnitude of the mean shift vector, which also eliminates the need for additional procedures to choose the adequate step sizes. This is a major advantage over the traditional gradient-based methods.

For discrete data, the number of steps to convergence depends on the employed kernel. When $G$ is the uniform kernel, convergence is achieved in a finite number of steps since the number of locations generating distinct mean values is finite. However, when the kernel $G$ imposes a weighting on the data points (according to the distance from its center), the mean shift procedure is infinitely convergent. The practical way to stop the iterations is to set a lower bound for the magnitude of the mean shift vector.
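Putting the pieces together, a minimal sketch of the iteration (20) with exactly this stopping rule follows (the code, the threshold eps, and the iteration cap are our own implementation choices, not prescribed by the paper; constants in the profile cancel in the ratio):

```python
import numpy as np

def mean_shift_mode(y, X, h, g, eps=1e-5, max_iter=500):
    """Iterate (20): y_{j+1} = G-weighted mean of the x_i, until the mean
    shift vector magnitude ||y_{j+1} - y_j|| drops below the bound eps."""
    for _ in range(max_iter):
        w = g(np.sum(((y - X) / h) ** 2, axis=1))    # g(||(y_j - x_i)/h||^2)
        y_next = (X * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(y_next - y) < eps:         # lower bound on |m_{h,G}|
            return y_next
        y = y_next
    return y

g_normal = lambda u: np.exp(-0.5 * u)                # infinitely convergent
g_uniform = lambda u: (u <= 1.0).astype(float)       # finite number of steps

# Usage: start at a data point so a uniform window is never empty.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (200, 2)), rng.normal(3, 0.3, (200, 2))])
print(mean_shift_mode(X[0].copy(), X, h=1.0, g=g_normal))  # lands near (0, 0)
```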
2.3 Mean Shift-Based Mode Detection
Let us denote by $\mathbf{y}_c$ and $\hat f_{h,K}^c = \hat f_{h,K}(\mathbf{y}_c)$ the convergence points of the sequences $\{\mathbf{y}_j\}_{j=1,2,\ldots}$ and $\{\hat f_{h,K}(j)\}_{j=1,2,\ldots}$, respectively. The implications of Theorem 1 are the following. First, the magnitude of the mean shift vector converges to zero. Indeed, from (17) and (20) the $j$th mean shift vector is

$$\mathbf{m}_{h,G}(\mathbf{y}_j) = \mathbf{y}_{j+1} - \mathbf{y}_j \qquad (22)$$

and, at the limit, $\mathbf{m}_{h,G}(\mathbf{y}_c) = \mathbf{y}_c - \mathbf{y}_c = \mathbf{0}$. In other words, the gradient of the density estimate (11) computed at $\mathbf{y}_c$ is zero,

$$\nabla \hat f_{h,K}(\mathbf{y}_c) = \mathbf{0}, \qquad (23)$$

due to (19). Hence, $\mathbf{y}_c$ is a stationary point of $\hat f_{h,K}$. Second, since $\{\hat f_{h,K}(j)\}_{j=1,2,\ldots}$ is monotonically increasing, the mean shift iterations satisfy the conditions required by the Capture Theorem [4, p. 45], which states that the trajectories of such gradient methods are attracted by local maxima if they are unique (within a small neighborhood) stationary points. That is, once $\mathbf{y}_j$ gets sufficiently close to a mode of $\hat f_{h,K}$, it converges to it. The set of all locations that converge to the same mode defines the basin of attraction of that mode.

The theoretical observations from above suggest a practical algorithm for mode detection:
. Run the mean shift procedure to find the stationary points of $\hat f_{h,K}$,
. Prune these points by retaining only the local maxima.
The local maxima points are defined, according to the Capture Theorem, as unique stationary points within some small open sphere. This property can be tested by perturbing each stationary point by a random vector of small norm and letting the mean shift procedure converge again. Should the point of convergence be unchanged (up to a tolerance), the point is a local maximum.
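The sketch below (ours; the tolerance, perturbation scale, and the choice of starting points are arbitrary implementation details) instantiates this two-step algorithm, reusing mean_shift_mode from the previous sketch:

```python
import numpy as np

def detect_modes(X, h, g, tol=1e-2, n_perturb=5, rng=None):
    """Mode detection per Section 2.3: find stationary points of f_hat, then
    prune by perturbing each one and checking that mean shift returns to it."""
    rng = rng or np.random.default_rng(0)
    # 1. Run the mean shift procedure from every data point.
    stationary = np.array([mean_shift_mode(x.copy(), X, h, g) for x in X])
    # 2. Merge stationary points that coincide up to the tolerance.
    candidates = []
    for p in stationary:
        if not any(np.linalg.norm(p - q) < tol for q in candidates):
            candidates.append(p)
    # 3. Keep only local maxima: perturb and verify reconvergence.
    modes = []
    for p in candidates:
        ok = all(
            np.linalg.norm(
                mean_shift_mode(p + rng.normal(0, 0.1 * h, p.shape), X, h, g) - p
            ) < tol
            for _ in range(n_perturb)
        )
        if ok:
            modes.append(p)
    return np.array(modes)
```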
2.4 Smooth Trajectory Property

The mean shift procedure employing a normal kernel has an interesting property. Its path toward the mode follows a smooth trajectory, the angle between two consecutive mean shift vectors being always less than 90 degrees.
Using the normal kernel (10), the $j$th mean shift vector is given by

$$\mathbf{m}_{h,N}(\mathbf{y}_j) = \mathbf{y}_{j+1}-\mathbf{y}_j = \frac{\sum_{i=1}^{n}\mathbf{x}_i\exp\!\left(-\left\|\frac{\mathbf{y}_j-\mathbf{x}_i}{h}\right\|^2\right)}{\sum_{i=1}^{n}\exp\!\left(-\left\|\frac{\mathbf{y}_j-\mathbf{x}_i}{h}\right\|^2\right)} - \mathbf{y}_j. \qquad (24)$$
The following theorem holds true for all $j = 1,2,\ldots$, according to the proof given in the Appendix.

Theorem 2. The cosine of the angle between two consecutive mean shift vectors is strictly positive when a normal kernel is employed, i.e.,

$$\frac{\mathbf{m}_{h,N}(\mathbf{y}_j)^\top\,\mathbf{m}_{h,N}(\mathbf{y}_{j+1})}{\|\mathbf{m}_{h,N}(\mathbf{y}_j)\|\;\|\mathbf{m}_{h,N}(\mathbf{y}_{j+1})\|} > 0. \qquad (25)$$
As a consequence of Theorem 2, the normal kernel appears to be the optimal one for the mean shift procedure. The smooth trajectory of the mean shift procedure is in contrast with the standard steepest ascent method [4, p. 21] (local gradient evaluation followed by line maximization) whose convergence rate on surfaces with deep narrow valleys is slow due to its zigzagging trajectory.
In practice, the convergence of the mean shift procedure based on the normal kernel requires a large number of steps, as was discussed at the end of Section 2.2. Therefore, in most of our experiments, we have used the uniform kernel, for which the convergence is finite, and not the normal kernel. Note, however, that the quality of the results almost always improves when the normal kernel is employed.
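Theorem 2 is also easy to verify numerically. The sketch below (our own, with arbitrary synthetic data) runs the normal-kernel iteration (24) from a distant start and asserts that consecutive mean shift vectors keep a strictly positive cosine:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (300, 2))    # unimodal data, mode near the origin
y, h, prev = np.array([4.0, -3.0]), 1.0, None

for _ in range(50):
    w = np.exp(-np.sum(((y - X) / h) ** 2, axis=1))   # weights as in (24)
    m = (X * w[:, None]).sum(axis=0) / w.sum() - y    # mean shift vector m_{h,N}(y_j)
    if prev is not None and min(np.linalg.norm(m), np.linalg.norm(prev)) > 1e-12:
        cos = prev @ m / (np.linalg.norm(prev) * np.linalg.norm(m))
        assert cos > 0                                 # (25): smooth trajectory
    y, prev = y + m, m
```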
2.5 Relation to Kernel Regression

Important insight can be gained when (19) is obtained by approaching the problem differently. Considering the univariate case suffices for this purpose.
Kernel regression is a nonparametric method to estimate complex trends from noisy data. See [62, chapter 5] for an introduction to the topic, [24] for a more in-depth treatment. Let $n$ measured data points be $(X_i, Z_i)$ and assume that the values $X_i$ are the outcomes of a random variable $x$ with probability density function $f(x)$, $x_i = X_i$, $i = 1,\ldots,n$, while the relation between $Z_i$ and $X_i$ is

$$Z_i = m(X_i) + \epsilon_i \qquad i = 1,\ldots,n, \qquad (26)$$

where $m(x)$ is called the regression function and $\epsilon_i$ is an independently distributed, zero-mean error, $E[\epsilon_i] = 0$.
A natural way to estimate the regression function is by locally fitting a degree $p$ polynomial to the data. For a window centered at $x$, the polynomial coefficients then can be obtained by weighted least squares, the weights being computed from a symmetric function $g(x)$. The size of the window is controlled by the parameter $h$, $g_h(x) = h^{-1}g(x/h)$. The simplest case is that of fitting a constant to the data in the window, i.e., $p = 0$. It can be shown, [24, Section 3.1], [62, Section 5.2], that the estimated constant is the value of the Nadaraya-Watson estimator,

$$\hat m(x;h) = \frac{\sum_{i=1}^{n} g_h(x - X_i)\,Z_i}{\sum_{i=1}^{n} g_h(x - X_i)}, \qquad (27)$$
introduced in the statistical literature 35 years ago. The asymptotic conditional bias of the estimator has the expression [24, p. 109], [62, p. 125],

$$E\left[\hat m(x;h) - m(x)\mid X_1,\ldots,X_n\right] \approx h^2\,\frac{m''(x)f(x) + 2m'(x)f'(x)}{2f(x)}\,\mu_2(g), \qquad (28)$$

where $\mu_2(g) = \int u^2 g(u)\,du$. Defining $m(x) \equiv x$ reduces the Nadaraya-Watson estimator to (20) (in the univariate case), while (28) becomes

$$E\left[\hat x - x \mid X_1,\ldots,X_n\right] \approx h^2\,\frac{f'(x)}{f(x)}\,\mu_2(g), \qquad (29)$$
which is similar to (19). The mean shift procedure thus exploits to its advantage the inherent bias of the zero-order kernel regression.
The connection to the kernel regression literature opens many interesting issues; however, most of these are more of a theoretical than practical importance.
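To ground the connection, a small sketch (ours, with a Gaussian weight function chosen for illustration): the Nadaraya-Watson estimator (27), plus the observation that feeding $Z_i = X_i$, i.e., the identity regression $m(x) = x$, turns one evaluation of (27) into one univariate mean shift step (20):

```python
import numpy as np

def nadaraya_watson(x, X, Z, h, g=lambda u: np.exp(-0.5 * u ** 2)):
    """Zero-order kernel regression (27). The 1/h factor of g_h cancels
    between numerator and denominator, so it is omitted here."""
    w = g((x - X) / h)              # symmetric weights g_h(x - X_i)
    return (w * Z).sum() / w.sum()

rng = np.random.default_rng(2)
X = rng.normal(1.0, 0.5, 400)             # univariate samples, mode near 1.0
Z = np.sin(X) + rng.normal(0, 0.1, 400)   # noisy responses for ordinary regression
print(nadaraya_watson(0.5, X, Z, h=0.2))      # estimates m(0.5) = sin(0.5) ~ 0.48
print(nadaraya_watson(0.0, X, Z=X, h=0.3))    # Z = X: one mean shift step from x = 0
```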
2.6 Relation to Location M-Estimators
The M-estimators are a family of robust techniques which can handle data in the presence of severe contaminations, i.e., outliers. See [26], [32] for introductory surveys. In our context, only the problem of location estimation has to be considered. Given the data $\mathbf{x}_i$, $i = 1,\ldots,n$, and the scale $h$, we will define $\hat{\boldsymbol\theta}$, the location estimator, as

$$\hat{\boldsymbol\theta} = \arg\min_{\boldsymbol\theta} J(\boldsymbol\theta) = \arg\min_{\boldsymbol\theta}\sum_{i=1}^{n}\rho\!\left(\left\|\frac{\boldsymbol\theta-\mathbf{x}_i}{h}\right\|^2\right), \qquad (30)$$
where $\rho(u)$ is a symmetric, nonnegative valued function, with a unique minimum at the origin and nondecreasing for $u \ge 0$. The estimator is obtained from the normal equations

$$\nabla J(\hat{\boldsymbol\theta}) = 2h^{-2}\sum_{i=1}^{n}\left(\hat{\boldsymbol\theta}-\mathbf{x}_i\right) w\!\left(\left\|\frac{\hat{\boldsymbol\theta}-\mathbf{x}_i}{h}\right\|^2\right) = \mathbf{0}, \qquad (31)$$

where

$$w(u) = \frac{d\rho(u)}{du}.$$
Therefore, the iterations to find the location M-estimate are based on

$$\hat{\boldsymbol\theta} = \frac{\sum_{i=1}^{n}\mathbf{x}_i\,w\!\left(\left\|\frac{\hat{\boldsymbol\theta}-\mathbf{x}_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} w\!\left(\left\|\frac{\hat{\boldsymbol\theta}-\mathbf{x}_i}{h}\right\|^2\right)}, \qquad (32)$$

which is identical to (20) when $w(u) = g(u)$.
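The fixed-point iteration (32) is the familiar iteratively reweighted mean. A sketch of the equivalence (ours, using $w(u) = e^{-u/2}$, i.e., the normal-kernel $g$, under which each update is literally one mean shift step (20)):

```python
import numpy as np

def location_m_estimate(theta, X, h, w=lambda u: np.exp(-0.5 * u), n_iter=100):
    """Iterate (32): theta <- sum_i x_i w(||(theta - x_i)/h||^2) / sum_i w(...).
    With w = g this is identical to the mean shift iteration (20)."""
    for _ in range(n_iter):
        wt = w(np.sum(((theta - X) / h) ** 2, axis=1))
        theta = (X * wt[:, None]).sum(axis=0) / wt.sum()
    return theta

# Outliers barely move the estimate: the weights w decay with distance.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.2, (95, 2)), rng.normal(8, 0.2, (5, 2))])  # 5% outliers
print(location_m_estimate(X.mean(axis=0), X, h=1.0))   # near (0, 0), not the sample mean
```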
Taking into account (13), the minimization (30) becomes

$$\hat{\boldsymbol\theta} = \arg\max_{\boldsymbol\theta}\sum_{i=1}^{n} k\!\left(\left\|\frac{\boldsymbol\theta-\mathbf{x}_i}{h}\right\|^2\right), \qquad (33)$$

which can also be interpreted as

$$\hat{\boldsymbol\theta} = \arg\max_{\boldsymbol\theta}\,\hat f_{h,K}(\boldsymbol\theta \mid \mathbf{x}_1,\ldots,\mathbf{x}_n). \qquad (34)$$

That is, the location estimator is the mode of the density estimated with the kernel $K$ from the available data. Note that the convexity of the $k(x)$ profile, the sufficient condition for the convergence of the mean shift procedure (Section 2.2), is in accordance with the requirements to be satisfied by the function $\rho(u)$.
