Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution



Lei Yu (leiyu@asu.edu) and Huan Liu (hliu@asu.edu)
Department of Computer Science & Engineering, Arizona State University, Tempe, AZ 85287-5406, USA

Abstract

Feature selection, as a preprocessing step to machine learning, has been effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. However, the recent increase of dimensionality of data poses a severe challenge to many existing feature selection methods with respect to efficiency and effectiveness. In this work, we introduce a novel concept, predominant correlation, and propose a fast filter method which can identify relevant features as well as redundancy among relevant features without pairwise correlation analysis. The efficiency and effectiveness of our method is demonstrated through extensive comparisons with other methods using real-world data of high dimensionality.

1. Introduction

Feature selection is frequently used as a preprocessing step to machine learning. It is the process of choosing a subset of original features so that the feature space is optimally reduced according to a certain evaluation criterion. Feature selection has been a fertile field of research and development since the 1970's and has proven very effective in removing irrelevant and redundant features, increasing efficiency in learning tasks, improving learning performance like predictive accuracy, and enhancing comprehensibility of learned results (Blum & Langley, 1997; Dash & Liu, 1997; Kohavi & John, 1997). In recent years, data has become increasingly larger in both size (i.e., number of instances) and dimensionality (i.e., number of features) in many applications such as genome projects (Xing et al., 2001), text categorization (Yang & Pederson, 1997), image retrieval (Rui et al., 1999), and customer relationship management (Ng & Liu, 2000). This enormity may cause serious problems to many machine learning algorithms with respect to scalability and learning performance. For example, high-dimensional data (i.e., data sets with hundreds or thousands of features) can contain a high degree of irrelevant and redundant information which may greatly degrade the performance of learning algorithms. Therefore, feature selection becomes very necessary for machine learning tasks when facing high-dimensional data nowadays. However, this trend of enormity in both size and dimensionality also poses severe challenges to feature selection algorithms. Some of the recent research efforts in feature selection have been focused on these challenges, from handling a huge number of instances (Liu et al., 2002b) to dealing with high-dimensional data (Das, 2001; Xing et al., 2001). This work is concerned with feature selection for high-dimensional data. In the following, we first review models of feature selection and explain why a filter solution is suitable for high-dimensional data, and then review some recent efforts in feature selection for high-dimensional data.

Feature selection algorithms can broadly fall into the filter model or the wrapper model (Das, 2001; Kohavi & John, 1997). The filter model relies on general characteristics of the training data to select some features without involving any learning algorithm, and therefore does not inherit any bias of a learning algorithm. The wrapper model requires one predetermined learning algorithm in feature selection and uses its performance to evaluate and determine which features are selected. For each new subset of features, the wrapper model needs to learn a hypothesis (or a classifier). It tends to give superior performance as it finds features better suited to the predetermined learning algorithm, but it also tends to be more computationally expensive (Langley, 1994). When the number of features becomes very large, the filter model is usually a choice due to its computational efficiency.

Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

To combine the advantages of both models, algorithms in a hybrid model have recently been proposed to deal with high-dimensional data (Das, 2001; Ng, 1998; Xing et al., 2001). In these algorithms, first, a goodness measure of feature subsets based on data characteristics is used to choose best subsets for a given cardinality, and then cross validation is exploited to decide a final best subset across different cardinalities. These algorithms mainly focus on combining filter and wrapper algorithms to achieve the best possible performance with a particular learning algorithm at the time complexity of filter algorithms. In this work, we focus on the filter model and aim to develop a new feature selection algorithm which can effectively remove both irrelevant and redundant features and is less costly in computation than currently available algorithms.

In Section 2, we review current algorithms within the filter model and point out their problems in the context of high dimensionality. In Section 3, we describe correlation measures which form the base of our method in evaluating feature relevance and redundancy. In Section 4, we first propose our method which selects good features for classification based on a novel concept, predominant correlation, and then present a fast correlation-based filter algorithm. In Section 5, we evaluate the efficiency and effectiveness of this algorithm via extensive experiments on various real-world data sets, comparing with other representative feature selection algorithms, and discuss the implications of the findings. In Section 6, we conclude our work with some possible extensions.

2. Related Work

Within the filter model, different feature selection algorithms can be further categorized into two groups, namely feature weighting algorithms and subset search algorithms, based on whether they evaluate the goodness of features individually or through feature subsets. Below, we discuss the advantages and shortcomings of representative algorithms in each group.

Feature weighting algorithms assign weights to features individually and rank them based on their relevance to the target concept. There are a number of different definitions of feature relevance in the machine learning literature (Blum & Langley, 1997; Kohavi & John, 1997). A feature is good, and thus will be selected, if its weight of relevance is greater than a threshold value. A well-known algorithm that relies on relevance evaluation is Relief (Kira & Rendell, 1992). The key idea of Relief is to estimate the relevance of features according to how well their values distinguish between instances of the same and different classes that are near each other. Relief randomly samples a number (m) of instances from the training set and updates the relevance estimation of each feature based on the difference between the selected instance and the two nearest instances of the same and opposite classes. The time complexity of Relief for a data set with M instances and N features is O(mMN). With m being a constant, the time complexity becomes O(MN), which makes it very scalable to data sets with both a huge number of instances and a very high dimensionality. However, Relief does not help with removing redundant features. As long as features are deemed relevant to the class concept, they will all be selected even though many of them are highly correlated to each other (Kira & Rendell, 1992).

Many other algorithms in this group have similar problems as Relief does. They can only capture the relevance of features to the target concept, but cannot discover redundancy among features. However, empirical evidence from the feature selection literature shows that, along with irrelevant features, redundant features also affect the speed and accuracy of learning algorithms and thus should be eliminated as well (Hall, 2000; Kohavi & John, 1997). Therefore, in the context of feature selection for high-dimensional data where there may exist many redundant features, pure relevance-based feature weighting algorithms do not meet the need of feature selection very well.

Subset search algorithms search through candidate feature subsets guided by a certain evaluation measure (Liu & Motoda, 1998) which captures the goodness of each subset. An optimal (or near optimal) subset is selected when the search stops. Some existing evaluation measures that have been shown effective in removing both irrelevant and redundant features include the consistency measure (Dash et al., 2000) and the correlation measure (Hall, 1999; Hall, 2000). The consistency measure attempts to find a minimum number of features that separate classes as consistently as the full set of features can. An inconsistency is defined as two instances having the same feature values but different class labels. In Dash et al. (2000), different search strategies, namely exhaustive, heuristic, and random search, are combined with this evaluation measure to form different algorithms. The time complexity is exponential in terms of data dimensionality for exhaustive search and quadratic for heuristic search. The complexity can be linear to the number of iterations in a random search, but experiments show that in order to find the best feature subset, the number of iterations required is mostly at least quadratic to the number of features (Dash et al., 2000). In Hall (2000), a correlation measure is applied to evaluate the goodness of feature subsets based on the hypothesis that a good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other. The underlying algorithm, named CFS, also exploits heuristic search. Therefore, with quadratic or higher time complexity in terms of dimensionality, existing subset search algorithms do not have strong scalability to deal with high-dimensional data.

To overcome the problems of algorithms in both groups and meet the demand for feature selection for high-dimensional data, we develop a novel algorithm which can effectively identify both irrelevant and redundant features with less time complexity than subset search algorithms.

3. Correlation-Based Measures

In this section, we discuss how to evaluate the goodness of features for classification. In general, a feature is good if it is relevant to the class concept but is not redundant to any of the other relevant features. If we adopt the correlation between two variables as a goodness measure, the above definition becomes: a feature is good if it is highly correlated with the class but not highly correlated with any of the other features. In other words, if the correlation between a feature and the class is high enough to make it relevant to (or predictive of) the class, and the correlation between it and any other relevant feature does not reach a level at which it can be predicted by any of the other relevant features, it will be regarded as a good feature for the classification task. In this sense, the problem of feature selection boils down to finding a suitable measure of correlations between features and a sound procedure to select features based on this measure.

There exist broadly two approaches to measure the correlation between two random variables. One is based on classical linear correlation and the other is based on information theory. Under the first approach, the most well-known measure is the linear correlation coefficient. For a pair of variables (X, Y), the linear correlation coefficient r is given by the formula

    r = Σ_i (x_i − x̄)(y_i − ȳ) / ( √(Σ_i (x_i − x̄)²) · √(Σ_i (y_i − ȳ)²) )

where x̄ is the mean of X and ȳ is the mean of Y.
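As a quick illustration (not part of the paper itself), the linear correlation coefficient between a feature and a class variable can be computed directly from this formula. The following sketch uses made-up toy values purely for demonstration:

```python
import math

def linear_correlation(x, y):
    """Pearson linear correlation coefficient r for paired samples x, y."""
    n = len(x)
    mx = sum(x) / n                      # mean of X
    my = sum(y) / n                      # mean of Y
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Hypothetical toy data: a feature whose values roughly track a binary class.
feature = [0.1, 0.4, 0.3, 0.9, 0.8, 0.7]
labels  = [0,   0,   0,   1,   1,   1]
r = linear_correlation(feature, labels)
# |r| close to 1 suggests the feature is relevant to (predictive of) the class;
# |r| close to 0 suggests it is irrelevant under a linear-correlation view.
```

A fully correlated pair (e.g., a feature that is an exact linear function of another) yields r = ±1, which is exactly the redundancy situation the paper aims to detect and eliminate.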
