structure User Guide

Posted December 14, 2023 (author: suggestive)

Documentation for structure software: […]

[…]ard (a), Xiaoquan Wen (a), Daniel Falush (b) [footnotes 1, 2, 3]

(a) Department of Human Genetics, University of Chicago
(b) Department of Statistics, University of Oxford

Software from […]

April 21, 2009

1 Our other colleagues in the structure project are Peter Donnelly, Matthew Stephens and Melissa Hubisz.
2 The first version of this program was developed while the authors (JP, MS, PD) were in the Department of Statistics, University of Oxford.
3 Discussion and questions about structure should be addressed to the online forum at structure-software@[…] check this document and search the previous discussion before posting questions.

Contents

1.2 What's new in Version 2.3?
2 Format for the data file
2.1 Components of the data file
[…]
4 Missing data, null alleles and dominant markers
4.1 Dominant markers, […]
5 Estimation of K (the number of populations)
5.3 Informal pointers for choosing K; is the structure real?
6 Background LD and other miscellanea
[…]
7.2 Parameters in fi[…]
7.3 Parameters in fi[…]
8.4 Confi[…]
8.7 Exporting parameter fi[…]
9.5 Printout of estimated allele frequencies (P)
11 How to cite this program
12 Bibliography

1 Introduction

The program structure implements a model-based clustering method for inferring p[…] method was introduced in a paper by Pritchard, Stephens and Donnelly (2000a) and extended in sequels by Falush, Stephens and Pritchard (2003a, 2007). Applications of our method include demonstrating the presence of population structure, identifying distinct genetic populations, assigning individuals to populations, […]

Briefly, we assume a model in which there are K populations (where K may be unknown), […] Individuals in the sample are assigned (probabilistically) to populations, or jointly t[…] assumed that within populations, the loci are at Hardy-Weinberg equilibrium, […]y speaking, […]

[…] model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers including microsatellites, […] model assumes that markers are not in linkage disequilibrium (LD) within subpopulations, so we can't […]ng with version 2.0, […]

[…]he computational approaches implemented here are fairly powerful, so[…] example, it is not possible to determine suitable run-lengths theoretically, […]

[…] document describes the use and interpretation of the software and supplements the published papers, […] distribute source code as well as executables for various platforms (currently Mac, Windows, Linux, Sun). The C executable reads a data fi[…]s also a Java front end that provides vario[…]

[…] document includes information about how to format the data file, how to choose appropriate models, […] has details on using the two interfaces (command line and front end) and a summary of the various user-defined parameters.

1.2 What's new in Version 2.3?

The 2.3 release (April 2009) introduces new models for improving structure inference for data sets where (1) the data are not informative enough for the usual structure models to provide accurate inference, but (2) […] situation, by making explicit use of sampling location information, we give structure a boost, often allowing much improved performance (Hubisz et al., 2009). We hope to release further improvements in the coming months.

                 loca   locb   locc   locd   loce
George      1     -9    145     66      0     92
George      1     -9     -9     64      0     94
Paula       …      …      …      …      …     92
Paula       …      …      …      …      …     94
Matthew     …      …      …      …      …     92
Matthew     …      …      …      …      …     -9
Bob         …      …      …      …      …     94
Bob         …      …      …      …      …     94
Anja        …      …      …      …      …     -9
Anja        …      …      …      …      …     94
Peter       …      …      …      …      …     -9
Peter       …      …      …      …      …     -9
Carsten     …      …      …      …      …     -9
Carsten     …      …      …      …      …     92

Table 1: Sample data fi[…] MARKERNAMES=1, LABEL=1, POPDATA=1, NUMINDS=7, NUMLOCI=5, and MISSING=-9. […] POPFLAG=0, LOCDATA=0, PHENOTYPE=0, EXTRACOLS=[…] also store the data with one row per individual (ONEROWPERIND=1), in which case the first row would read "George 1 -9 -9 145 -9 66 64 0 0 92 94".

2 Format for the data file

The format for the genotype data is shown in Table 2 (and Table 1 shows an example). Essentially, the entire data set is arranged as a matrix in a single file, in which the data for individuals are in rows, […]r can make several choices about format, and most of these data (apart from the genotypes!) […]

[…]ploid organism, data for each individual can be stored either as 2 consecutive rows, where each locus is in one column, or in one row, […] you plan to use the linkage model (see below) […]-genotype data columns (see below) are recorded twice for each individual. (More generally, for n-ploid organisms, data for each individual are stored in n consecutive rows unless the ONEROWPERIND option is used.)

2.1 Components of the data file:

The elements of the input fi[…]ent, they must be in the following order, however most are optional (as indicated) […]r specifies which data are present, either in the front end, or (when running structure from the command line), in a separate file, […]ame time, the user also specifies the number of individuals and the number of loci.

[…] Names (Optional; string) The first row in the file can contain a list of identifi[…]w contains L strings of integers or characters, […]

[…]ive Alleles (Data with dominant markers only; integer) […] the option RECESSIVEALLELES is set to 1, then the program requires this row to indicate which allele (if any) i[…]ion is use[…]

[…] Marker Distances (Optional; real) The next row in the file is a set of inter-marker distances, […], centiMorgans), or some proxy for this based, for example, […]ual units of distance do not matter too much, provided that the marker distances are (roughly) […] front end estimates an appropriate scaling from the data, but users of the command line version must set LOG10RMIN, LOG10RMAX and LOG10RSTART in the fi[…]nsecutive markers are from diff[…], different chromosomes), […]

[…]nformation (Optional; diploid data only; real number in the range [0,1]). […] a single row o[…]e is known completely, or no phase information is available, […]y be useful when there is partial phase information from family data or when haploid […]re two alternative representations for the phase information: (1) the two rows of data for an individual are assumed to correspond to the paternal and maternal contributions, […]se line indicates the probability that the ordering is correct at the current marker (set MARKOVPHASE=0); (2) the phase line indicates the probability that the phase of one allele relative to the previous allele is correct (set MARKOVPHASE=1). The first entry should be filled in with 0.5 to fi[…] example the following data input would represent the information from a male with 5 unphased autosomal microsatellite loci followed by three X chromosome loci, using the maternal/paternal phase model:

[…]1143 -9 -9 -9
0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0

where -9 indicates "missing data", here missing due to the absence of a second X chromosome, the 0.5 indicates that the autosomal loci are unphased, and the 1.0s indicate that the X chromosome loci have been maternally inherited with probability 1.0, […] case the input file would read:

[…]1143 -9 -9 -9
0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0

Here, the two 1.0s indicate that the first and second, […]a at the site by site output under these two models will be diff[…] first case, structure would ou[…] second case, it would output the probabilities for each allele listed in the input fi[…]

[…]dual/Genotype data (Required) Data for each sampled individual are arranged into one or more rows as described below.

2.3 Individual/[…]

[…]orm columns in the data fi[…]

Label (Optional; string) A strin[…]

PopData (Optional; integer) An integer designating a user-defined population from which the individual was obtained (for instance these might designate the geographic sampling locations of individuals). In the default models, this information is not used by the clustering algorithm, but can be used to help organize the output (for example, plotting individuals from the same pre-defined population next to each other).

PopFlag (Optional; 0 or 1) A Boolean flag which indicates whether to use the PopData when using learning samples (see USEPOPINFO, below). (Note: A Boolean variable (flag) is a variable which takes the values TRUE or FALSE, which are designated here by the integers 1 (use PopData) and 0 (don't use PopData), respectively.)

LocData (Optional; integer) An integer designating a user-defined sampling location (or other characteristic, such as a shared phenotype) for […] simply wish to use the PopData for the LOCPRIOR model, then you can omit the LocData column and set LOCISPOP=1 (this tells the program to use PopData to set the locations).

Phenotype (Optional; integer) An integer designating the value of a phenotype of interest, for each individual. (φ(i) in table.) ([…]re to permit a smooth interface with the program STRAT which is used for association mapping.)

Extra Columns (Optional; string) It may be convenient for the user to include additional data in the input fi[…]o here, […]

Genotype Data (Required; integer) Each allele at a given locus should be coded by a unique integer (eg microsatellite repeat score).
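The relationship between the two layouts described in Section 2 (n consecutive rows per individual versus one row with ONEROWPERIND=1) can be sketched in a few lines of Python. This is only an illustration, not part of structure: the function name `to_one_row` and the argument `n_lead_cols` are hypothetical, and the two rows for George are chosen to be consistent with the one-row form quoted in the Table 1 caption.

```python
def to_one_row(rows, n_lead_cols):
    """Merge the consecutive rows of one individual into a single row.

    rows        -- list of token lists, one per chromosome copy (2 for diploids)
    n_lead_cols -- number of non-genotype columns (here: Label and PopData)

    The leading columns are kept once; the alleles are interleaved locus by
    locus, so the copies of each locus end up in consecutive positions.
    """
    lead = list(rows[0][:n_lead_cols])
    allele_rows = [row[n_lead_cols:] for row in rows]
    merged = lead
    for alleles_at_locus in zip(*allele_rows):
        merged.extend(alleles_at_locus)
    return merged


# Two-row layout for one diploid individual (assumed values, see lead-in).
george = [
    "George 1 -9 145 66 0 92".split(),
    "George 1 -9 -9 64 0 94".split(),
]

print(" ".join(to_one_row(george, n_lead_cols=2)))
# George 1 -9 -9 145 -9 66 64 0 0 92 94
```

The same function works for n-ploid data as long as all n rows of an individual are passed in together.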

2.4 Missing genotype data

Missing data should be indicated by a number that doesn't occur elsewhere in the data (often -9 by convention). This number can also be used where there is a mixture of haploid and diploid data (eg X and autosomal loci in males). The missing-data value is set alon[…]

[…] implemented reasonably careful error checking to make sure that the data set is in the correct format, and the program will att[…] front end requires returns at the ends of each row, and does not allow returns within rows; the comman[…]

[…]blem that can arise is that editing programs used to assemble the data prior to importing them into structure can introduce hidden formatting characters, often at the ends of lines, or at the end of the fi[…] front end can remove many of these automatically, but this type of problem may be responsible for errors when the data fi[…]re importing data to a UNIX system, the dos2unix function can be helpful for cleaning these up.

3 Modelling decisions for the user

3.1 Ancestry Models

There are four main models for the ancestry of individuals: (1) no admixture model (individuals are discretely from one population or another); (2) the admixture model (each individual draws some fraction of his/her genome from each of the K populations); (3) the linkage model (like the admixture model, but linked loci are more likely to come from the same population); (4) models with informative priors (allow structure to use information about sampling locations: either to assist clustering with weak data, to detect migrants, or to pre-define some populations). See Pritchard et al. (2000a) and (Hubisz et al., 2009) for more on models 1, 2, and 4 and Falush et al. (2003a) […]

[…]or probability for each population is 1/[…]del is appropriate for studying fully discrete populations and is o[…]

[…] modelled by saying that individual i has inherited some fraction of his/[…]ional on the ancestry vector, q(i), […] reasonably fl[…]ure is a common feature of real data, and you probably won't fi[…]ixture model can also deal with hybrid zones in a natural way.

Label  Pop   Flag  Location  Phen  ExtraCols          Loc 1     Loc 2     Loc 3     ....  Loc L
                                                      M1        M2        M3        ....  ML
                                                      r1        r2        r3        ....  rL
                                                      -1        D1,2      D2,3      ....  DL-1,L
ID(1)  g(1)  f(1)  l(1)      φ(1)  y1(1),...,yn(1)    x1(1,1)   x2(1,1)   x3(1,1)   ....  xL(1,1)
ID(1)  g(1)  f(1)  l(1)      φ(1)  y1(1),...,yn(1)    x1(1,2)   x2(1,2)   x3(1,2)   ....  xL(1,2)
                                                      p1(1)     p2(1)     p3(1)     ....  pL(1)
ID(2)  g(2)  f(2)  l(2)      φ(2)  y1(2),...,yn(2)    x1(2,1)   x2(2,1)   x3(2,1)   ....  xL(2,1)
ID(2)  g(2)  f(2)  l(2)      φ(2)  y1(2),...,yn(2)    x1(2,2)   x2(2,2)   x3(2,2)   ....  xL(2,2)
                                                      p1(2)     p2(2)     p3(2)     ....  pL(2)
....
ID(i)  g(i)  f(i)  l(i)      φ(i)  y1(i),...,yn(i)    x1(i,1)   x2(i,1)   x3(i,1)   ....  xL(i,1)
ID(i)  g(i)  f(i)  l(i)      φ(i)  y1(i),...,yn(i)    x1(i,2)   x2(i,2)   x3(i,2)   ....  xL(i,2)
                                                      p1(i)     p2(i)     p3(i)     ....  pL(i)
....
ID(N)  g(N)  f(N)  l(N)      φ(N)  y1(N),...,yn(N)    x1(N,1)   x2(N,1)   x3(N,1)   ....  xL(N,1)
ID(N)  g(N)  f(N)  l(N)      φ(N)  y1(N),...,yn(N)    x1(N,2)   x2(N,2)   x3(N,2)   ....  xL(N,2)
                                                      p1(N)     p2(N)     p3(N)     ....  pL(N)

Table 2: Format of the data file, […] these components are optional (see text for details). Ml is an identifi[…]cates which allele, if any, is recessive at each marker (dominant genotype data only). Di,i+1 is the distance between markers i and i+[…] ID(i) is the label for individual i, g(i) is a predefined population index for individual i (PopData); f(i) is a flag used to incorporate learning samples (PopFlag); l(i) is the sampling location of individual i (LocData); φ(i) can store a phenotype for individual i; y1(i),...,yn(i) are for storing extra data (ignored by the program); (xl(i,1), xl(i,2)) […]

[…] essentially a generalization of the admixture model to deal with "admixture linkage disequilibrium" – i.e., the correl[…] et al. (2003a) describes the model, […]ic model is that, t generations in the past, […]onsider an individual chromosome, it is composed of a series of "chunks" that […]ure LD arises because linked alleles are often on the same chunk, […]es of the chunks are assumed to be independent exponential random variables with mean length 1/t (in Morgans). In practice we estimate a "recombination rate" r from the data

that corresponds to the rate of switching from the present chunk to a new chunk.[1] Each chunk in individual i is derived independently from population k with probability qk(i), where qk(i) is the proportion of that individual's […]l, the new model retains the main elements of the admixture model, but al[…]rts the overall ancestry for each individual, taking account of the linkage, and can also report the probability of origin of each bit of chromosome, […]w model performs better than the ori[…]ieves more accurate estimates of the ancestry vector, […]el is n[…]y, this model is a big simplifi[…]r, the major effect of admixture is to create long-range correlation among linked markers, an[…]putations are a bit slower than for the admixture model, […]eless, they are p[…]el can only be used if there is information about the relative positions of the markers (usually a genetic map).

1 Because of the way that this is parameterized, the map distances in the input file can be in arbitrary units – e.g., genetic distances, or physical distances (under the assumption that these are roughly proportional to genetic distances). Then the estimated value of r represents the rate of switching from one chunk to the next, per unit of whatever distance was assumed in the input fi[…], if an admixture event took place ten generations ago, then r should be estimated as 0.1 when the map distances are measured in cM (this is 10 ∗ 0.01, where 0.01 is the probability of recombination per centiMorgan), or as 10^−4 = 10 ∗ 10^−5 when the map distances are measured in KB (assuming a constant crossing-over rate of 1 cM/MB). […] front end tries to make some guesses about sensible upper and lower bounds for r, but the user should adjust these to match the biology of the situation.

2 Daniel refers to this as "Better priors for worse data."

[…]ault mode for str[…]r, the[…]r, physical characteristics of sampled individuals or geographic sampling locations). At present, structure can use this information in three ways:

• LOCPRIOR models: use sampling locations as prior information to assist the clustering – […], significant FST between sampling locations), often the case for data sets with few markers, few individuals, o[…]ove performance in this situation, Hubisz et al. (2009) develope[…]e models can often provide accurate inference of population structure and individual ancestry in data sets where the sig[…]fly, […]y, structure assumes tha[…]here is an immense number of possible partitions, it takes highly informative data for structure to

conclude that any particular para[…]ast, the LOCPRIOR models take the view that in practice, indiv[…]ore, the LOCPRIOR models are set […] suggest that the locations are informative, […] et al. (2009) developed a pair of LOCPRIOR models: […] cases, the underlying model (and the likelihood) […] difference is that str[…], by modifying the prior to prefer clustering solutions that correlate with the locations).

The LOCPRIOR models have the desirable properties that (i) they do not tend to find structure when none is present; (ii) they are able to ignore the sampling information when the ancestry of individuals is uncorrelated with sampling locations; and (iii) the old and new models give essentia[…], we recommend using the new models in most situations where the amount of available data is very limited, especially wh[…]r, since there is now a great deal of accumulated experience with the standard structure models, we recommend that the basic models remain the default for highly informative data sets (Hubisz et al., 2009).

To run the LOCPRIOR model, the user must first specify a "sampling location" for each individual, […], we assume the samples were collected at a set of discrete locations, and we do not use any spatial information about the locations. (We recognize that in some studies, every individual may be collected at a different location, and so clumping individuals into a smaller set of discrete locations may not be an ideal representation of the data.) The "locations" could also represent a phenotype, ecotype, […]ations are entered into the input file either in the PopData column (set LOCISPOP=1), or as a separate LocData column (see Section 2.3). To use the LOCPRIOR model you must fi[…]re using the Graphical User Interface version, tick the "use sampling locations as prior" […]re using the command-line version, set LOCPRIOR=1. (Note that LOCPRIOR is incompatible with the linkage model.)

Our experience so far is that the LOCPRIOR model […] use the same diagnostic[…]onally it may be helpful to look at the value of r, […] of r near 1, or […]

Note that this model assumes that the predefi[…]s quite strong data to overcome the prior against misclassifi[…] using the USEPOPINFO model, you should also run the program without population information to ensure that the pre-defi[…]his model set USEPOPINFO to 1, and choose a value of MIGRPRIOR (which is ν in Pritchard et al. (2000a)). You might choose something in the range 0.001 to 0.1 for ν.

The pre-defined population for each individual is set in the input data file (see PopData). In this mode, individuals assigned to population k in the input fi[…]ore, the predefined populations should be integers between 1 and MAXPOPS (K), […]ata for any individual is outside this range, their q will be updated in the normal way (ie without prior population information, according to the model that would be used if USEPOPINFO was turned off.[3])

3 If the admixture model is used to estimate q for those individuals without prior population information, α […]e are very few such individuals, you may need to fix α at a sensible value.

• USEPOPINFO model: pre-specify the population of origin of some ind[…]d way to use the USEPOPINFO model is to define "learning samples" that are pre-defi[…]: In the Front End, this option is switched on using the option "Update allele frequencies using only individuals with POPFLAG=1", located under the "Advanced Tab".

Learning samples are implemented using the PopFlag column in the data fi[…]-defined population is used for those individuals for whom PopFlag=1 […](K)). The PopData value is ignored for individuals for whom PopFlag=[…]e is no PopFlag column in the data file, then when USEPOPINFO is turned on, […]ry of individuals with PopFlag=0, […]K) are updated according to the admixture or no-admixture model, as specifi[…]d above, it may be helpful to set α to a sensible value if there are few individuals without predefi[…]

[…] example, there may be some individuals of known origin, […] example, […]K), and then use structure to estimate the ancestry for additional dogs of unknown (possibly hybrid) […]-setting the population numbers, we can ensure that the structure clusters correspond to pre-defined breeds, which makes the output more interpretable, and can improve the accuracy of the inference. (Of course, if two pre-defined breeds are genetically identical, […]

[…]r use of USEPOPINFO is for cases where the user […]rily, structure analys[…]r there are some settings where you might want to estimate ancestry for some individuals, without those individuals aff[…] example you may have a standard collection of learning samples, and then periodically you want to estimate ancestry for new batches of genotyped

[…] default options, the ancestry estimates for individuals would depend (somewhat) […]g PFROMPOPFLAGONLY you can ensure that the allele frequency estimates depend only on samples for which PopFlag=[…] different setting, Murgia et al. (2006) […]ours were so closely re[…]g PFROMPOPFLAGONLY, […]

[…]ts: We recommend running the basic version of structure first to verify that the predefi[…], when using learning samples, it may be sensible to allow for some misclassification by setting MIGRPRIOR larger than 0.

3.2 Alle[…]

[…]l assumes that the allele frequencies in each population are independent draws from a distribution that is specified by a parameter called λ. That is the original model that was used in Pritchard et al. (2000a). Usually we set λ = 1; […] et al. (2003a) […]ys that frequencies in the different populations are likely to be similar (probably due to migration or shared ancestry). […]y speaking, this prior says that we expect allele frequencies in different populations to be reasonably diff[…]ten improves clustering for closely related populations, but may increase the risk of over-estimating K (see below). If one population is quite divergent from the others, the correlated m[…]

4 Note that Pritchard et al. (2000a) also outlined a model of correlated allele frequencies; this was last available in Version 1.x

[…]ting λ: Fixing λ = 1 is a good idea for most data, but in some situations – e.g., SNP data where most minor alleles are rare – […]s reason, you can get the program to estimate λ […] want to do this once, perhaps for K = 1, and then fix λ at the estimated value thereafter, because there seem to be some problems with non-identifiability when trying to estimate too many of the hyperparameters (λ, α, F) at[…]

[…]ated allele frequencies model: As described by Falush et al. (2003a) the correlated frequencies model uses a (multidimensional) vector, PA, which records the allele frequencies in a hypothetical "ancestral" […] assumed that the K populations represented in our sample have each undergone independent drift away from these ancestral frequencies, at a rate that is parameterized by F1, F2, F3, ..., FK, […]imated Fk values should be numerically similar to FST values, apart from differences that stem from the slightly different model, and diff[…], it is diffi[…] assumed to have a Dirichlet prior of the same form as that used above for the population frequencies:

$$ p_{Al\cdot} \sim \mathcal{D}(\lambda_1, \lambda_2, \ldots, \lambda_{J_l}), \tag{1} $$

[…]e prior for the frequencies in population k is

$$ p_{kl\cdot} \sim \mathcal{D}\left( p_{Al1}\frac{1-F_k}{F_k},\; p_{Al2}\frac{1-F_k}{F_k},\; \ldots,\; p_{AlJ_l}\frac{1-F_k}{F_k} \right), \tag{2} $$

[…] model, the Fs have a close relationship to the standard measure of genetic distance, […] standard parametrization of FST, the expected frequency in each population is given by overall mean frequency, and the variance in frequency across subpopulations of an allele at overall frequency p is p(1−p)[…]el here is much the same, except that we generalize the model slightly by allowing each population to drift away from the ancestral population at a different rate (Fk), as might be expected if populations have diff[…] try to estimate "ancestral frequencies", […]

[…] placed independent priors on the Fk, proportional to a gamma distribution with mean 0.01 and standard deviation 0.05 (but with Pr[Fk ≥ 1] = 0). The parameters of the gamma prior can be modifi[…]perimentation suggests that the prior mean of 0.01, which corresponds to very low levels of subdivision, often leads to good performance for data that are diffi[…]r problems, where the differences among populations are more marked, it seems that the data usually overwhelm this prior on Fk.

3.3 How long to run the program

The program is started from a random configuration, and from there takes a series of steps through the parameter space, each of which depends (only) o[…]ocedure induces correlations between the state of the Markov chain at diffe[…]e is that by running the simulation for long enough, […]re two issues to worry about: (1) burnin length: how long to run the simulation before collecting data to minimize the effect of the starting configuration, and (2) how long […]

[…]se an appropriate burnin length, it is really helpful to look at the values of summary statistics that are printed out by the program (eg α, F, the divergence distances among populations Di,j, and the likelihood) […]lly a burnin of 10,000–100,[…]

[…]se an appropriate run length, you will need to do several runs at each K, possibly of different lengths, […]lly, you can get good estimates of the parameter values (P and Q) with runs of 10,000–100,000 steps, but accurate estimation of Pr(X|K) […]tice your run length m[…]re dealing with extremely large data sets and are frustrated with the run times, you might try trimming both the length of the runs, and the number of markers/individuals, […]

[…]uld look to see w[…]alues are still increasing or decreasing at the end of the burnin phase, […] estimate of α, not just during the burnin), you may get more accurate estimates of Pr(X|K) by increasing ALPHAPROPSD, which improves mixing in that situation. (See a related issue in section 5).
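Returning briefly to the correlated allele frequencies prior of Section 3.2 (equation 2): a small simulation can make the role of Fk more concrete. The sketch below is not taken from the structure source code. It simply draws population frequencies at one locus from a Dirichlet distribution with parameters p_A · (1 − F)/F, using the standard construction of a Dirichlet draw from independent gamma variates. The function names and the example frequencies are hypothetical.

```python
import random


def sample_dirichlet(params, rng):
    """Draw one sample from Dirichlet(params) via normalised gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


def sample_population_freqs(ancestral, f_k, rng):
    """Draw allele frequencies for one population at one locus.

    ancestral -- ancestral allele frequencies at the locus (positive, sum to 1)
    f_k       -- drift parameter for the population, strictly between 0 and 1
    """
    if not 0.0 < f_k < 1.0:
        raise ValueError("f_k must lie strictly between 0 and 1")
    scale = (1.0 - f_k) / f_k
    return sample_dirichlet([p * scale for p in ancestral], rng)


rng = random.Random(0)
ancestral = [0.5, 0.3, 0.2]            # assumed ancestral frequencies
for f_k in (0.01, 0.2):                # little drift versus substantial drift
    print(f_k, [round(x, 3) for x in sample_population_freqs(ancestral, f_k, rng)])
```

Under this Dirichlet parametrization each population frequency has mean equal to the ancestral frequency p and variance F·p(1 − p), so small F keeps the draws close to the ancestral values while larger F lets them wander further.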

4 Missing data, null alleles and dominant markers

[…]proach is correct when the probability of having missing data a[…]stimates of Q for individuals with missing data are less accurate, there is no particular reason to exclude such individuals from the analysis, […]

[…]us problem arises when data are missing in a systematic manner, […] do not fit the assumed model, and can lead to app[…]r the dominant markers model (below) […]n sometimes lead to overestimation of K, especially for the correlated frequencies model (Falush et al., 2003a), but there is little effect on the assignment of individuals to populations for fixed K.

4.1 Dominant markers, null alleles, and polyploid genotypes

For some types of genetic markers, such as AFLPs, […]ypes of markers may result in ambiguous genotypes if some fraction of the alleles are 'null' […]ng with Version 2.2 we implement a model f[…], we assume that at any particular locus there may be a single allele that is recessive to all other alleles (eg A), […] AB and BB would appear in the raw genotype data as a "phenotype" B, AC and CC would be recorded as C, […]ere is ambiguity, […]tails are given in Falush et al. (2007).

In order to perform these computations the algorithm must be told which allele (if any) […] done by setting RECESSIVEALLELES=1, and including a single row of L integers at the top of the input file, between the (optional) lines for marker names and map distances, […]iven locus all the markers are codominan[…]sely if the recessive allele is never observed in homozygous state but you think it might be present ([…]ight be null alleles) then set the recessive value to an allele that is not observed at that locus (but not MISSING!).

Coding the genotype data: If the phenotype is unambiguous, then it is coded in the structure input fi[…] ambiguous then it is coded as homozygous for the dominant allele(s). For example, phenotype A is coded AA, B is coded BB, BC is coded BC, […]arker is haploid in an otherwise diploid individual (eg for the X chromosome in a male), […]otypes AB, AC, etc are illegal in the input fi[…]CESSIVEALLELES is used to deal with null alleles, genotypes that appear to be homozygote null shoul[…]tice it may be un[…]ure should be robust to these being coded as missing unless null alleles are at high frequency at a locus.
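The diploid coding rule just described (phenotype A coded AA, B coded BB, BC coded BC, with AB and AC illegal when A is the recessive allele) can be written down as a small helper. This is a minimal sketch for preparing input, not code from structure itself. The function name is hypothetical, and letters are used only to mirror the text, since real data files use integer allele codes.

```python
def code_diploid_phenotype(observed, recessive):
    """Return the two alleles to write for one diploid individual at one locus.

    observed  -- the distinct alleles visible in the phenotype (1 or 2 of them)
    recessive -- the allele declared recessive at this locus

    A single visible allele is coded as a homozygote; two visible alleles are
    coded as that heterozygote. A recessive allele can never be seen next to
    a dominant one, so such a pair is rejected as illegal input.
    """
    alleles = sorted(set(observed))
    if len(alleles) == 1:
        return (alleles[0], alleles[0])
    if len(alleles) == 2:
        if recessive in alleles:
            raise ValueError("recessive allele paired with a dominant allele")
        return (alleles[0], alleles[1])
    raise ValueError("a diploid phenotype shows one or two alleles")


print(code_diploid_phenotype(["A"], recessive="A"))       # ('A', 'A')
print(code_diploid_phenotype(["B"], recessive="A"))       # ('B', 'B')
print(code_diploid_phenotype(["B", "C"], recessive="A"))  # ('B', 'C')
```

Rejecting the pair AB up front catches a common data-entry slip before the file ever reaches the program.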

In polyploids (PLOIDY>2) the situation is more complic[…]ten diffi[…]cture is run with RECESSIVEALLELES=[…]yploids, when RECESSIVEALLELES=1, structure allows the data to consist of a mixture of loci for which there is, and isn'[…] loci are not ambiguous then set the code NOTAMBIGUOUS to an integer that does not match any of the alleles in the data, […] the recessive alleles line at the top of the input fi[…]ead, at a particular locus the alleles are all codominant, but there is ambiguity about the number of each (eg for microsatellites in a tetraploid) […]y, if there is a recessive allele, and there is also ambiguity about the number of each allele, […] of alleles where there is c[…] example in a tetraploid where three codominant loci B, C and D observed, this should be coded as BCDD or equi[…]ld not be coded as BCD(MISSING), as this ind[…]

[…]tion of Pr(K): When RECESSIVEALLELES is used for diploids, the likelihood at each ste[…] of coding, when either PLOIDY>2 or the linkage model is used, […]creases the likelihood and seems to greatly infl[…]d experience indicates that this leads to poor performance for estimating K in the latter cases, and you should consider such estimates of K to be unreliable.

5 Estimation of K (the number of populations)

In our paper describing this program, we pointed out that this issue should be treated with care for two reasons: (1) it is computationally difficult to obtain accurate estimates of Pr(X|K), and our method merely provides an ad hoc approximation, and (2) […]xperience we find that the real diffi[…]cedure for estimating K […]ger, […], due to isolation by distance or inbreeding). In those cases there may not be a natural answer to what is the "correct" […]s for this kind of reason, it is not infrequent that in real data the v[…] usually makes sense to focus on values of K that capture most of the structure in the data and that seem biologically sensible.

5.1 Steps in estimating K

1. (Command-line version) Set COMPUTEPROBS and INFERALPHA to 1 in the file extraparams. (Front End version) Make sure that α […] MCMC scheme for different values of MAXPOPS (K). At the end it will output a line "Estimated Ln Prob of Data". This is the estimate of ln Pr(X|K). You should run

several independent runs for each K, […]ariability across runs for a given K is substantial compared to the variability of estimates obtained for different K, […](X|K) appears to be bimodal or multimodal, the MCMC scheme may be finding diff[…] check for this by comparing the Q for different runs at a single K. (cf Data Set 2A in Pritchard et al. (2000a), and see the section on Multimodality, below).

[…] example, for Data Set 2A in the paper (where K was 2), we got

K             1       2       3       4       5
ln Pr(X|K)    -4356   -3983   -3982   -3983   -4006

We can start by assuming a uniform prior on K = {1, ..., 5}. Then from Bayes' Rule, Pr(K = 2) is given by

$$ \frac{e^{-3983}}{e^{-4356} + e^{-3983} + e^{-3982} + e^{-3983} + e^{-4006}} \tag{3} $$

It's easier to compute this if we simplify the expression to

$$ \frac{e^{-1}}{e^{-374} + e^{-1} + e^{0} + e^{-1} + e^{-24}} = 0.21 \tag{4} $$

5.2 Mild departures from the model can lead to overestimating K

When there is real population structure, this leads […]y speaking, […]e departures include inbreeding, and genotyping errors such as occasional, undetected, […] the absence of population structure, these types of factors can lead to a weak statistical signal for K > […]ing in Version 2, we have suggested that the correlated allele frequency model should be used as a default because it often achieves better performance on difficult problems, but the user should be aware that this may make it easier to overestimate K in such settings than under the independent frequencies model (Falush et al., 2003a). The next subsection discusses how to decide whether inferred structure is real.

5.3 Informal pointers for choosing K; is the structure real?

T[…] first is that it's often the situation that Pr(K) is very small for K less than the appropriate value (effectively zero), and then more-or-less plateaus for larger K, […] sort of situation where several values of K give similar estimates of log Pr(X|K), it seems that the smallest of these is often "correct".
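The Bayes' Rule calculation of Section 5.1 (equations 3 and 4) is easy to get wrong on a computer, because exp(-3983) underflows to zero in floating point. A minimal sketch of the same calculation, subtracting the largest log value first exactly as equation (4) does, is shown below. The function name is hypothetical; the input values are the ln Pr(X|K) estimates quoted above, and a uniform prior on K is assumed as in the text.

```python
import math


def posterior_over_k(ln_prob_by_k):
    """Posterior Pr(K | X) under a uniform prior on the listed K values.

    ln_prob_by_k maps each K to its estimated ln Pr(X | K). Subtracting the
    maximum before exponentiating avoids floating-point underflow and leaves
    the normalised result unchanged.
    """
    peak = max(ln_prob_by_k.values())
    weights = {k: math.exp(v - peak) for k, v in ln_prob_by_k.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


ln_prob = {1: -4356, 2: -3983, 3: -3982, 4: -3983, 5: -4006}
posterior = posterior_over_k(ln_prob)
for k in sorted(posterior):
    print(k, round(posterior[k], 2))
# K = 2 gives 0.21, matching equation (4)
```

Because ln Pr(X|K) is itself only an ad hoc approximation, these posterior values are best read as a rough guide rather than as precise probabilities.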

It is a bit difficult to provide a firm rule for what we mean by a "more-or-less plateaus". For small data sets, this might mean that the values of log Pr(X|K) are within 5-10, but Daniel Falush writes that "in very big datasets, the difference between K = 3 and K = 4 may be 50, but if the difference between K = 3 and K = 2 is 5,000, then I would definitely choose K = 3." Readers who want to use a more formal criterion that takes this into account may be interested in the method of Evanno et al. (2005). […], we may not always be able to know the TRUE value of K, but we should […]

[…]ond pointer is that if there really are separate populations, there is typically a lot of information about the value of α, and once the Markov chain converges, α will normally settle down to be relatively constant (often with a range of perhaps 0.2 or less). However, if there isn't any real structure, α […]

[…]llary of this is that when there is no population structure, you will typically see that the proportion of the sample assigned to each population is roughly symmetric (∼ 1/K in each population), […] individuals are strongly assigned to one population or another, and if the proportions assigned to each group are asymmetric, […]

[…]e that you have a situation with two clear populations, but you are trying to decide whether one of these is further subdivided (ie, the value of Pr(X|K = 3) is similar to, or perhaps a little larger than Pr(X|K = 2)). Then one thing you could try is to run structure using only the individuals in the population that you suspect might be subdivided, […]

[…]ary, you should be skeptical about population structure inferred on the basis of small differences in Pr(K) if (1) there is no clear biological interpretation for the assignments, and (2) the assignments are roughly symmetric to all populations and no individuals are strongly assigned.

5.4 Isolation by distance data

Isolation by distance refers to the idea that individuals may be spatially distributed across some region, […] situation, […] occurs, the inferred value of K, and […]ting on the sampling scheme, […], the algorithm will attempt to model the allele frequ[…] situations, interpreting the results may be challenging.

6 Background LD and other miscellanea

6.1 Sequence data, tightly linked SNPs and haplotype data

[…], not in LD within populations). This assumption is likely to be violated for sequence data, […]ave sequence data or dense SNP data from multiple independent regions, then structure may actually perform reasonably well despite the data not completely fi[…]y

speaking, this will happen provided that there is enough inde[…]re are enough independent regions, the main cost of the dependence within regions will be that structure und[…] example, Falush et al. (2003b) applied structure to MLST (multilocus sequence) […] case, there is enough recombination within regions that the signal of population structure dominates background LD. (For more on MLST data, see also Section 10.) In an application to humans, Conrad et al. (2006) found that 3000 SNPs from 36 linked regions produced sensible (but noisy) answers in a worldwide sample of humans that largely agreed with previous results based on microsatellites [see their Supplementary Methods Figure SM2].

However, if the data are dominated by one or a few non- or low-recombining regions, […] example, if the data consisted of Y chromosome data only, then the estimated structure would presumably reflect something about the Y chromosome tree, […]act of using such data is likely to be that (1) the algorithm underestimates the degree of uncertainty in ancestry estimates, and in the worst case, may be biased or inaccurate; (2) […]

[…]ave Y or mtDNA data plus a number of nuclear markers, a safe and valid solution is to recode the haplotypes from each linke[…]e are very many haplotypes, […]

[…]at the linkage model is not necessarily any better than the (no)-[…]kage model is not designed to deal with background LD within populations, and is likely to be similarly confused.

6.2 Multimodality

The structure algorithm starts at a random place in parameter space, and then converges towards a mode of the parameter space. (In this context, a mode can be thought of, loosely speaking, as a clustering solution that has high posterior probability.) When prior labels are not used, there is no inherent meaning in the numbering of the K clusters, and so there are K! […]ry, structure might switch among these modes, but this does not normally occur for real data sets (Pritchard et al., 2000a). For preparing plots for publication, Noah Rosenberg's lab has a helpful program, CLUMPP, that lines up the cluster labels across different runs prior to data plotting (Section 10).

In addition to these symmetric modes, […]rent implementation of s[…]ans that different runs can produce substantially different answers, and longer runs will probably not fi[…] mainly an issue for very complex data sets, with large values of K, K > 5 or K > 10, say (but see the example of Data Set 2A in Pritchard et al. (2000a)). You can […]ul analysis of this type of situation was presented by Rosenberg et al. (2001), for a data set where the estimated K was around 19.
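Because the numbering of the K clusters is arbitrary, two runs that found the same solution may report their Q columns in different orders. For small K, a brute-force way to line the labels up before comparing runs is to try all K! column permutations and keep the one that best matches a reference run. The sketch below illustrates that idea only. It is not the algorithm used by CLUMPP, the function name is hypothetical, and the Q matrices are made-up examples.

```python
from itertools import permutations


def align_labels(q_ref, q_run):
    """Permute the columns of q_run to match q_ref as closely as possible.

    q_ref, q_run -- lists of rows; each row holds one individual's ancestry
                    proportions in the K clusters.
    Returns (aligned_q, perm), where aligned column j is original column
    perm[j]. The search is exhaustive, so it is only practical for small K.
    """
    k = len(q_ref[0])

    def mismatch(perm):
        return sum(
            abs(ref_row[j] - run_row[perm[j]])
            for ref_row, run_row in zip(q_ref, q_run)
            for j in range(k)
        )

    best = min(permutations(range(k)), key=mismatch)
    aligned = [[row[best[j]] for j in range(k)] for row in q_run]
    return aligned, best


run_a = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
run_b = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]   # same solution, labels swapped

aligned, perm = align_labels(run_a, run_b)
print(perm)      # (1, 0)
print(aligned)   # [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
```

If the best permutation still leaves a large mismatch, the runs have probably settled in genuinely different modes rather than relabelled versions of the same one.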

[…]ting admixture proportions can be particularly chall[…]s an example of this for simulated data in Pritchard et al. (2000b). The data were supposed to approximate a sample from an African American […]se data, the estimated ancestry proportions were highly correlated with the true (simulated) values, […]curs because in the absence of any non-admixed individuals, there may be some non-identifiability where it is possible to push the allele frequencies further apart, and squeeze the admixture proportions together (or vice-versa), and obtain much the same degree of model fi[…]OPALPHAS=1 (separate α for each population) can help a bit when to[…]ore, the admixture estimates in these situations should be treated with caution.

7 Running structure from the c[…]

[…]re in two files (mainparams and extraparams), […]rams specifies the input format for the data fi[…]arams specifi[…]l need to set all the values in mainparams, […]at the default model assumes admixture, and does not make use of the user-defi[…]

[…]rameter is printed in all-caps in one of these two files, preceded by the word "#define". (They are also printed in all-caps throughout this document.) The value is set immediately following the name of the parameter (eg "#define NUMREPS 1000" sets the number of MCMC repetitions to 1000). Following each parameter definition, there is a brief comment (marked "//"), […]nclude: "(str)", for string (used for the names of the input and output files); "(int)", for integer; "(d)", […], a real number such as 3.14); and "(B)", […], the parameter takes values TRUE or FALSE by setting this to 1 or 0, respectively).

The program is insensitive to the order of the parameters, so you can re-arrange them or add comments, […]ues of all parameters used for a given run are printed at the end of the output file.

7.1 Program param[…]

[…]re ordered according to the parameter files that are used in the command-line version of structure.

7.2 Parameters in fi[…]

[…]l of these parameters (LABEL, POPDATA, POPFLAG, PHENOTYPE, EXTRACOLS) indicate whether

particular types of data are present in the input file; […]

MAXPOPS (int) […]ard et al. (2000a) […]mes (depending on the nature of the data) there is a natural value of K that can be used, otherwise K can be estimated by checking the fit of the model at different values of K (see Section 5).

BURNIN (int) Length of burnin period before the start of data collection. (See Section 3.3.)

NUMREPS (int) Number of MCMC reps after burnin. (See Section 3.3.)

Input/Output fi[…]

[…] (string) Name of input data fi[…]gth 30 characters (or possibly less depending on operating system).

OUTFILE (string) Name for program output files (the suffixes "1", "2", ..., "m" (for intermediate results) and "f" (final results) are added to this name). Existing fi[…]gth of name 30 characters (or possibly less depending on operating system).

Data fi[…]

NUMINDS (int) Number of individuals in data fi[…]

NUMLOCI (int) Number of loci in data fi[…]

[…] (int) […]t is 2 (diploid).

MISSING (int) […] an integer, […]

ONEROWPERIND (Boolean) […], for diploid data, this would mean that the two alleles for each locus are in consecutive order in the same row, rather than being arranged in the same column, […]

LABEL (Boolean) Input file contains labels (names) for each individual. 1=Yes; 0=No.

POPDATA (Boolean) Input file contains a user-defined population-of-origin for each individual. 1=Yes; 0=No.

POPFLAG (Boolean) Input file contains an indicator variable which says whether to use popinfo when USEPOPINFO==1 (see below). 1=Yes; 0=No.

LOCDATA (Boolean) Input file contains a user-defined sampling location for each individual. 1=Yes; 0=[…] LOCISPOP=[…]

PHENOTYPE (Boolean) Input file contains a column of phenotype information. 1=Yes; 0=No.

EXTRACOLS (int) Number of additional [...] are ignored by the program. 0 = [...]

MARKERNAMES (Boolean) The top row of the data file [...]

RECESSIVEALLELES (Boolean) Next row of data file contains [...]

[...]TANCES (Boolean) The next row of the data file (or the first row if MARKERNAMES==0) [...]

[...] data file [...]

PHASED (Boolean) [...] (LINKAGE=1, PHASED=0), then PHASEINFO can be used; this is an extra line in the input file [...]. [...] PHASEINFO=0 each value is set to 0.5, [...]. [...] the linkage model is used with polyploids, PHASED=[...]

PHASEINFO (Boolean) The row(s) of genotype data for [...]

[...]PHASE (Boolean) [...]

[...]IGUOUS (int) For use with polyploids when RECESSIVEALLELES=[...] not match MISSING or any allele value in the data.

7.3 Parameters in file extraparams

[...] options allow the user to refine the model in various ways, [...]. [...] default values are probably fine [...]. For Boolean options, type 1 for “Yes”, or “Use this option”; 0 for “No” or “Don’t use this option”.

NOADMIX (Boolean) Assume the model without admixture (Pritchard et al., 2000a). (Each individual is assumed to be completely from one of the K populations.) In the output, instead of printing the average value of Q as in the admixture case, the program prints the posterior probability that each individual is from each population. 1 = no admixture; 0 = [...]

[...]E (Boolean) [...]. The front end makes some guesses about these, but some care on the part of the user is required to be sure that the values are sensible for the particular application.

USEPOPINFO (Boolean) [...] have POPDATA=[...]

LOCPRIOR (Boolean) Use location information to improve [...]

[...]ORR (double) Use the “F model”, in which the allele frequencies are correlated across populations (Falush et al., 2003a). More specifically, rather than assuming a prior in which the allele frequencies in each population are independent draws from a uniform Dirichlet distribution, we start with a distribution [...]. [...] model is more realistic for very closely related populations (where we expect the allele frequencies to be similar across populations), and can produce better clustering (section 3.2). The prior of F_k is set using FPRIORMEAN, [...]

ONEFST (Boolean) Assume the same value of F_k for all populations (analogous to Wright’s traditional F_ST). This is not recommended for most data, because in practice you probably expect different [...]. [...]=2 it may sometimes be difficult to estimate two values of F_ST separately (but see Harter et al. (2004)). When you’re trying to estimate K, you should use the same model for all K (we suggest ONEFST=0).

INFERALPHA (Boolean) Infer the value of the model parameter α from the data; otherwise α is fixed [...]. This option is ignored under the NOADMIX model. (The prior for the ancestry vector Q is Dirichlet with parameters (α, α, ..., α). Small α implies that most individuals are essentially from one population or another, while α > 1 implies that most individuals are admixed.)

POPALPHAS (Boolean) Infer a separate α for each population [...] recommend [...]

[...] (double) Dirichlet parameter (α) for degree of admixture (this is the initial value if INFERALPHA==1).

INFERLAMBDA (Boolean) Infer a suitable value for λ. [...]

[...]CIFICLAMBDA (Boolean) Infer a separate λ [...]

[...] (double) parameterizes the allele frequency prior, [...]. [...] frequencies at most markers are very skewed towards low/high frequencies, a smaller value of λ [...]. [...] doesn’t seem to work very well to estimate λ at the same time as the other hyperparameters, α [...]

[...] cases the default [...]

FPRIORMEAN, FPRIORSD (double) [...] prior for F_k is taken to be Gamma with mean FPRIORMEAN, [...]. [...] find that this makes the algorithm sensitive to subtle structure, but at some increased risk of overestimating K (Falush et al., 2003a).
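The effect of α on the Dirichlet prior for Q (see INFERALPHA above) can be illustrated by simulation. The Python sketch below is not structure's own code. It draws Q from Dirichlet(α, ..., α) by normalising independent Gamma(α, 1) draws, which is a standard construction, and reports the average of the largest ancestry share per simulated individual. The function names, K = 3, the number of simulated individuals and the seed are arbitrary assumptions made for the example.

```python
import random


def sample_q(alpha, k, rng):
    """Draw one ancestry vector Q ~ Dirichlet(alpha, ..., alpha) of length k."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, k=3, n=2000, seed=1):
    """Average, over n simulated individuals, of the largest component of Q."""
    rng = random.Random(seed)
    return sum(max(sample_q(alpha, k, rng)) for _ in range(n)) / n


small = mean_largest_share(0.1)  # small alpha: ancestry concentrated in one population
large = mean_largest_share(5.0)  # alpha > 1: ancestry spread across populations
print(f"alpha=0.1: {small:.2f}   alpha=5.0: {large:.2f}")
```

With small α the largest share should come out close to 1, meaning that most simulated individuals belong essentially to a single population; with α > 1 it should fall well below that, meaning that most are admixed.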

UNIFPRIORALPHA (Boolean), ALPHAMAX (double) Assume a uniform prior for α [...]. [...] model seems to work fine; the alternative model (when UNIFPRIORALPHA=0) is to take α as having a Gamma prior, with mean ALPHAPRIORA × ALPHAPRIORB, and variance ALPHAPRIORA × [...]

LOG10RMIN, LOG10RMAX, LOG10PROPSD, LOG10RSTART (double) When the linkage model is used, the switch rate r is taken to have a uniform prior on a log scale, [...] values need [...]

[...] prior population information (USEPOPINFO)

GENSBACK (int) This corresponds to G (Pritchard et al., 2000a). When using prior population information for individuals (USEPOPINFO=1), the program tests whether each individual has an immigrant ancestor in the last G generations, where G=[...]. In order to have decent power, G should be set fairly small (2, say) [...]

[...]IOR (double) Must be in [0,1]. This is ν in Pritchard et al. (2000a). Sensible values might be in the range 0.001 to [...]

[...]OPFLAGONLY (Boolean) This option, new with version 2.0, makes it possible to update the allele frequencies, P, using only a prespecified [...]. [...] this, include a POPFLAG column, and set POPFLAG=1 for individuals who should be used to update P, and POPFLAG=[...] can be used both with, [...]. This option will be useful, for example, if you have a standard reference set of individuals from known populations, [...]. [...] this option, the q estimate for each unknown individual depends only on the reference set, [...]

LOCISPOP (Boolean) This option instructs the program to use the PopData column in the input file [...]. [...] LOCISPOP=0, [...]

LOCPRIORINIT (double) Initial value for the LOCPRIOR parameter r, that parameterizes how informative the populations are (Hubisz et al., 2009). We found that LOCPRIORINIT=[...]

MAXLOCPRIOR (double) Range of r is from (0, MAXLOCPRIOR). We suggest MAXLOCPRIOR=[...]

[...] options

PRINTNET (Boolean) Print the “net nucleotide distance” [...]. The distance between populations A and B, D_AB, is calculated as [...]

Posted on 2024-09-22 11:38:09.
