Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution



Lei Yu (leiyu@asu.edu) and Huan Liu (hliu@asu.edu)
Department of Computer Science & Engineering, Arizona State University, Tempe, AZ 85287-5406, USA

Abstract

Feature selection, as a preprocessing step to machine learning, has been effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. However, the recent increase of dimensionality of data poses a severe challenge to many existing feature selection methods with respect to efficiency and effectiveness. In this work, we introduce a novel concept, predominant correlation, and propose a fast filter method which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. The efficiency and effectiveness of our method is demonstrated through extensive comparisons with other methods using real-world data of high dimensionality.

1. Introduction

Feature selection is frequently used as a preprocessing step to machine learning. It is the process of choosing a subset of original features so that the feature space is optimally reduced according to a certain evaluation criterion. Feature selection has been a fertile field of research and development since the 1970's and has proven very effective in removing irrelevant and redundant features, increasing efficiency in learning tasks, improving learning performance like predictive accuracy, and enhancing comprehensibility of learned results (Blum & Langley, 1997; Dash & Liu, 1997; Kohavi & John, 1997). In recent years, data has become increasingly larger in both size (i.e., number of instances) and dimensionality (i.e., number of features) in many applications such as genome projects (Xing et al., 2001), text categorization (Yang & Pederson, 1997), image retrieval (Rui et al., 1999), and customer relationship management (Ng & Liu, 2000). This enormity may cause serious problems to many machine learning algorithms with respect to scalability and learning performance. For example, high-dimensional data (i.e., data sets with hundreds or thousands of features) can contain a high degree of irrelevant and redundant information which may greatly degrade the performance of learning algorithms. Therefore, feature selection becomes very necessary for machine learning tasks when facing high-dimensional data nowadays. However, this trend of enormity in both size and dimensionality also poses severe challenges to feature selection algorithms. Some of the recent research efforts in feature selection have been focused on these challenges, from handling a huge number of instances (Liu et al., 2002b) to dealing with high-dimensional data (Das, 2001; Xing et al., 2001). This work is concerned with feature selection for high-dimensional data. In the following, we first review models of feature selection and explain why a filter solution is suitable for high-dimensional data, and then review some recent efforts in feature selection for high-dimensional data.

Feature selection algorithms can broadly fall into the filter model or the wrapper model (Das, 2001; Kohavi & John, 1997). The filter model relies on general characteristics of the training data to select some features without involving any learning algorithm, and therefore does not inherit any bias of a learning algorithm. The wrapper model requires one predetermined learning algorithm in feature selection and uses its performance to evaluate and determine which features are selected. For each new subset of features, the wrapper model needs to learn a hypothesis (or a classifier). It tends to give superior performance as it finds features better suited to the predetermined learning algorithm, but it also tends to be more computationally expensive (Langley, 1994). When the number of features becomes very large, the filter model is usually a choice due to its computational efficiency.

Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

To combine the advantages of both models, algorithms in a hybrid model have recently been proposed to deal with high-dimensional data (Das, 2001; Ng, 1998; Xing et al., 2001). In these algorithms, first, a goodness measure of feature subsets based on data characteristics is used to choose best subsets for a given cardinality, and then cross validation is exploited to decide a final best subset across different cardinalities. These algorithms mainly focus on combining filter and wrapper algorithms to achieve the best possible performance with a particular learning algorithm at the time complexity of filter algorithms. In this work, we focus on the filter model and aim to develop a new feature selection algorithm which can effectively remove both irrelevant and redundant features and is less costly in computation than currently available algorithms.

In Section 2, we review current algorithms within the filter model and point out their problems in the context of high dimensionality. In Section 3, we describe correlation measures which form the base of our method in evaluating feature relevance and redundancy. In Section 4, we first propose our method which selects good features for classification based on a novel concept, predominant correlation, and then present a fast correlation-based filter algorithm. In Section 5, we evaluate the efficiency and effectiveness of this algorithm via extensive experiments on various real-world data sets, comparing with other representative feature selection algorithms, and discuss the implications of the findings. In Section 6, we conclude our work with some possible extensions.

2. Related Work

Within the filter model, different feature selection algorithms can be further categorized into two groups, namely feature weighting algorithms and subset search algorithms, based on whether they evaluate the goodness of features individually or through feature subsets. Below, we discuss the advantages and shortcomings of representative algorithms in each group.

Feature weighting algorithms assign weights to features individually and rank them based on their relevance to the target concept. There are a number of different definitions of feature relevance in the machine learning literature (Blum & Langley, 1997; Kohavi & John, 1997). A feature is good, and thus will be selected, if its weight of relevance is greater than a threshold value. A well-known algorithm that relies on relevance evaluation is Relief (Kira & Rendell, 1992). The key idea of Relief is to estimate the relevance of features according to how well their values distinguish between instances of the same and different classes that are near each other. Relief randomly samples a number (m) of instances from the training set and updates the relevance estimation of each feature based on the difference between the selected instance and the two nearest instances of the same and opposite classes. The time complexity of Relief for a data set with M instances and N features is O(mMN). With m being a constant, the time complexity becomes O(MN), which makes it very scalable to data sets with both a huge number of instances and a very high dimensionality. However, Relief does not help with removing redundant features. As long as features are deemed relevant to the class concept, they will all be selected even though many of them are highly correlated to each other (Kira & Rendell, 1992).

Many other algorithms in this group have similar problems as Relief does. They can only capture the relevance of features to the target concept, but cannot discover redundancy among features. However, empirical evidence from the feature selection literature shows that, along with irrelevant features, redundant features also affect the speed and accuracy of learning algorithms and thus should be eliminated as well (Hall, 2000; Kohavi & John, 1997). Therefore, in the context of feature selection for high-dimensional data where there may exist many redundant features, pure relevance-based feature weighting algorithms do not meet the need of feature selection very well.

Subset search algorithms search through candidate feature subsets guided by a certain evaluation measure (Liu & Motoda, 1998) which captures the goodness of each subset. An optimal (or near optimal) subset is selected when the search stops. Some existing evaluation measures that have been shown effective in removing both irrelevant and redundant features include the consistency measure (Dash et al., 2000) and the correlation measure (Hall, 1999; Hall, 2000). The consistency measure attempts to find a minimum number of features that separate classes as consistently as the full set of features can. An inconsistency is defined as two instances having the same feature values but different class labels. In Dash et al. (2000), different search strategies, namely exhaustive, heuristic, and random search, are combined with this evaluation measure to form different algorithms. The time complexity is exponential in terms of data dimensionality for exhaustive search and quadratic for heuristic search. The complexity can be linear to the number of iterations in a random search, but experiments show that in order to find the best feature subset, the number of iterations required is mostly at least quadratic to the number of features (Dash et al., 2000). In Hall (2000), a correlation measure is applied to evaluate the goodness of feature subsets based on the hypothesis that a good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other. The underlying algorithm, named CFS, also exploits heuristic search. Therefore, with quadratic or higher time complexity in terms of dimensionality, existing subset search algorithms do not have strong scalability to deal with high-dimensional data.

To overcome the problems of algorithms in both groups and meet the demand for feature selection for high-dimensional data, we develop a novel algorithm which can effectively identify both irrelevant and redundant features with less time complexity than subset search algorithms.

3. Correlation-Based Measures

In this section, we discuss how to evaluate the goodness of features for classification. In general, a feature is good if it is relevant to the class concept but is not redundant to any of the other relevant features. If we adopt the correlation between two variables as a goodness measure, the above definition becomes: a feature is good if it is highly correlated with the class but not highly correlated with any of the other features. In other words, if the correlation between a feature and the class is high enough to make it relevant to (or predictive of) the class, and the correlation between it and any other relevant feature does not reach a level at which it can be predicted by any of the other relevant features, it will be regarded as a good feature for the classification task. In this sense, the problem of feature selection boils down to finding a suitable measure of correlations between features and a sound procedure to select features based on this measure.

There exist broadly two approaches to measure the correlation between two random variables. One is based on classical linear correlation and the other is based on information theory. Under the first approach, the most well-known measure is the linear correlation coefficient. For a pair of variables (X, Y), the linear correlation coefficient r is given by the formula

    r = Σ_i (x_i − x̄)(y_i − ȳ) / ( √(Σ_i (x_i − x̄)²) · √(Σ_i (y_i − ȳ)²) )

where x̄ is the mean of X and ȳ is the mean of Y.
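As a quick illustration (not part of the paper itself), the linear correlation coefficient between a feature and a class variable can be computed directly from this formula. The following sketch uses made-up toy values purely for demonstration:

```python
import math

def linear_correlation(x, y):
    """Pearson linear correlation coefficient r for paired samples x, y."""
    n = len(x)
    mx = sum(x) / n                      # mean of X
    my = sum(y) / n                      # mean of Y
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Hypothetical toy data: a feature whose values roughly track a binary class.
feature = [0.1, 0.4, 0.3, 0.9, 0.8, 0.7]
labels  = [0,   0,   0,   1,   1,   1]
r = linear_correlation(feature, labels)
# |r| close to 1 suggests the feature is relevant to (predictive of) the class;
# |r| close to 0 suggests it is irrelevant under a linear-correlation view.
```

A fully correlated pair (e.g., a feature that is an exact linear function of another) yields r = ±1, which is exactly the redundancy situation the paper aims to detect and eliminate.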
