structure User Guide


Posted December 14, 2023 (author: suggestive)

Documentation for structure software: [...]

[...] Pritchard (a), Xiaoquan Wen (a), Daniel Falush (b) (1, 2, 3)

(a) Department of Human Genetics, University of Chicago
(b) Department of Statistics, University of Oxford

Software from [...]

April 21, 2009

1 Our other colleagues in the structure project are Peter Donnelly, Matthew Stephens and Melissa Hubisz.
2 The first version of this program was developed while the authors (JP, MS, PD) were in the Department of Statistics, University of Oxford.
3 Discussion and questions about structure should be addressed to the online forum at structure-software@[...]. Please check this document and search the previous discussion before posting questions.

Contents (partial)

1.2 What's new in Version 2.3?
2 Format for the data file
2.1 Components of the data file
4 Missing data, null alleles and dominant markers
4.1 Dominant markers, null alleles, and polyploid genotypes
5 Estimation of K (the number of populations)
5.3 Informal pointers for choosing K; is the structure real?
6 Background LD and other miscellanea
7.2 Parameters in file mainparams
7.3 Parameters in file extraparams
8.4 Confi[...]
8.7 Exporting parameter fi[...]
9.5 Printout of estimated allele frequencies (P)
11 How to cite this program
12 Bibliography

1 Introduction

The program structure implements a model-based clustering method for inferring population structure [...]. The method was introduced in a paper by Pritchard, Stephens and Donnelly (2000a) and extended in sequels by Falush, Stephens and Pritchard (2003a, 2007). Applications of our method include demonstrating the presence of population structure, identifying distinct genetic populations, assigning individuals to populations, [...].

Briefly, we assume a model in which there are K populations (where K may be unknown), [...]. Individuals in the sample are assigned (probabilistically) to populations, or jointly t[...]. It is assumed that within populations, the loci are at Hardy-Weinberg equilibrium, [...]. Loosely speaking, [...].

The model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers including microsatellites, [...]. The model assumes that markers are not in linkage disequilibrium (LD) within subpopulations, so we can't [...]. Starting with version 2.0, [...].

While the computational approaches implemented here are fairly powerful, so[...]. For example, it is not possible to determine suitable run-lengths theoretically, [...].

This document describes the use and interpretation of the software and supplements the published papers, [...].

[...] We distribute source code as well as executables for various platforms (currently Mac, Windows, Linux, Sun). The C executable reads a data fi[...]. There is also a Java front end that provides vario[...]. This document includes information about how to format the data file, how to choose appropriate models, [...]. It also has details on using the two interfaces (command line and front end) and a summary of the various user-defined parameters.

1.2 What's new in Version 2.3?

The 2.3 release (April 2009) introduces new models for improving structure inference for data sets where (1) the data are not informative enough for the usual structure models to provide accurate inference, but (2) [...]. In this situation, by making explicit use of sampling location information, we give structure a boost, often allowing much improved performance (Hubisz et al., 2009). We hope to release further improvements in the coming months.

[Table 1 body: individuals George, Paula, Matthew, Bob, Anja, Peter and Carsten, two rows each, with a population column followed by genotypes at loci loc a, loc b, loc c, loc d and loc e; missing values are shown as -9.]

Table 1: Sample data file. Here MARKERNAMES=1, LABEL=1, POPDATA=1, NUMINDS=7, NUMLOCI=5, and MISSING=-9. Also, POPFLAG=0, LOCDATA=0, PHENOTYPE=0, EXTRACOLS=0. [...] We can also store the data with one row per individual (ONEROWPERIND=1), in which case the first row would read "George 1 -9 -9 145 -9 66 64 0 0 92 94".

2 Format for the data file

The format for the genotype data is shown in Table 2 (and Table 1 shows an example). Essentially, the entire data set is arranged as a matrix in a single file, in which the data for individuals are in rows, [...]. The user can make several choices about format, and most of these data (apart from the genotypes!) [...].

For a diploid organism, data for each individual can be stored either as 2 consecutive rows, where each locus is in one column, or in one row, [...]. Unless you plan to use the linkage model (see below) [...]. The pre-genotype data columns (see below) are recorded twice for each individual. (More generally, for n-ploid organisms, data for each individual are stored in n consecutive rows unless the ONEROWPERIND option is used.)

2.1 Components of the data file:

The elements of the input fi[...]. If present, they must be in the following order; however, most are optional (as indicated) [...]. The user specifies which data are present, either in the front end, or (when running structure from the command line), in a separate file, [...]. At the same time, the user also specifies the number of individuals and the number of loci.
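The relationship between the two-row layout and the ONEROWPERIND=1 layout described above can be sketched in a few lines of Python. This is only an illustration, not part of structure: the function name `to_one_row_per_ind` and the `n_pre_cols` argument are hypothetical, and George's two rows are inferred by de-interleaving the one-row example quoted in the Table 1 caption.

```python
def to_one_row_per_ind(rows, n_pre_cols):
    """Merge consecutive row pairs (diploid, two-row layout) into one row each.

    rows       -- whitespace-separated data lines, two per individual
    n_pre_cols -- number of pre-genotype columns (label, PopData, ...),
                  which are repeated on both rows of an individual
    """
    if len(rows) % 2 != 0:
        raise ValueError("expected two rows per diploid individual")
    merged = []
    for first, second in zip(rows[0::2], rows[1::2]):
        a, b = first.split(), second.split()
        if len(a) != len(b):
            raise ValueError("rows of one individual differ in length")
        if a[:n_pre_cols] != b[:n_pre_cols]:
            raise ValueError("pre-genotype columns differ between rows")
        out = a[:n_pre_cols]
        # Interleave the alleles so each locus occupies two adjacent columns.
        for x, y in zip(a[n_pre_cols:], b[n_pre_cols:]):
            out.extend([x, y])
        merged.append(" ".join(out))
    return merged


george = ["George 1 -9 145 66 0 92",
          "George 1 -9 -9 64 0 94"]
print(to_one_row_per_ind(george, n_pre_cols=2)[0])
# George 1 -9 -9 145 -9 66 64 0 0 92 94
```

The check on the pre-genotype columns reflects the statement that those columns are recorded twice for each individual in the two-row layout.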

[...] Names (Optional; string) The first row in the file can contain a list of identifi[...]. This row contains L strings of integers or characters, [...].

[...]ive Alleles (Data with dominant markers only; integer) [...] If the option RECESSIVEALLELES is set to 1, then the program requires this row to indicate which allele (if any) i[...]. [...]ion is use[...].

Marker Distances (Optional; real) The next row in the file is a set of inter-marker distances, [...], centiMorgans), or some proxy for this based, for example, [...]. The actual units of distance do not matter too much, provided that the marker distances are (roughly) [...]. The front end estimates an appropriate scaling from the data, but users of the command line version must set LOG10RMIN, LOG10RMAX and LOG10RSTART in the fi[...]. [...] consecutive markers are from diff[...], different chromosomes), [...].

[...] information (Optional; diploid data only; real number in the range [0,1]). [...] a single row o[...]. [...]e is known completely, or no phase information is available, [...]. [...] may be useful when there is partial phase information from family data or when haploid [...]. There are two alternative representations for the phase information: (1) the two rows of data for an individual are assumed to correspond to the paternal and maternal contributions, [...]. The phase line indicates the probability that the ordering is correct at the current marker (set MARKOVPHASE=0); (2) the phase line indicates the probability that the phase of one allele relative to the previous allele is correct (set MARKOVPHASE=1). The first entry should be filled in with 0.5 to fi[...]. For example, the following data input would represent the information from a male with 5 unphased autosomal microsatellite loci followed by three X chromosome loci, using the maternal/paternal phase model:

[...]1143 -9 -9 -9
0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0

where -9 indicates "missing data", here missing due to the absence of a second X chromosome, the 0.5 indicates that the autosomal loci are unphased, and the 1.0s indicate that the X chromosome loci have been maternally inherited with probability 1.0, [...]. [...] case the input file would read:

[...]1143 -9 -9 -9
0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0

Here, the two 1.0s indicate that the first and second, [...]. [...]a at the site by site output under these two models will be diff[...]. [...] first case, structure would ou[...]. [...] second case, it would output the probabilities for each allele listed in the input fi[...].

Individual/Genotype data (Required) Data for each sampled individual are arranged into one or more rows as described below.

2.3 Individual/[...]

[...]orm columns in the data fi[...].

Label (Optional; string) A strin[...].

PopData (Optional; integer) An integer designating a user-defined population from which the individual was obtained (for instance, these might designate the geographic sampling locations of individuals). In the default models, this information is not used by the clustering algorithm, but can be used to help organize the output (for example, plotting individuals from the same pre-defined population next to each other).

PopFlag (Optional; 0 or 1) A Boolean flag which indicates whether to use the PopData when using learning samples (see USEPOPINFO, below). (Note: A Boolean variable (flag) is a variable which takes the values TRUE or FALSE, which are designated here by the integers 1 (use PopData) and 0 (don't use PopData), respectively.)

LocData (Optional; integer) An integer designating a user-defined sampling location (or other characteristic, such as a shared phenotype) for [...]. If you simply wish to use the PopData for the LOCPRIOR model, then you can omit the LocData column and set LOCISPOP=1 (this tells the program to use PopData to set the locations).

Phenotype (Optional; integer) An integer designating the value of a phenotype of interest, for each individual. (φ(i) in table.) ([...] here to permit a smooth interface with the program STRAT, which is used for association mapping.)

Extra Columns (Optional; string) It may be convenient for the user to include additional data in the input fi[...] here, [...].

Genotype Data (Required; integer) Each allele at a given locus should be coded by a unique integer (e.g., microsatellite repeat score).
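When alleles are not already integers (for instance, SNP alleles recorded as letters), they need to be recoded before they can go into the genotype columns. The sketch below shows one possible way to do this; the function name `recode_locus`, the use of "N" as a missing-data label and the choice of codes 1, 2, 3, ... are all assumptions made for illustration, not something structure prescribes beyond "a unique integer per allele".

```python
def recode_locus(alleles, missing_label=None, missing_code=-9):
    """Map the allele labels seen at ONE locus to unique positive integers.

    Returns (codes, mapping). Missing entries receive missing_code, which
    must not collide with an allele code, so it is required to be < 1 here.
    """
    if missing_code >= 1:
        raise ValueError("missing_code must not collide with allele codes")
    observed = sorted({a for a in alleles if a != missing_label})
    mapping = {label: i + 1 for i, label in enumerate(observed)}
    codes = [missing_code if a == missing_label else mapping[a]
             for a in alleles]
    return codes, mapping


codes, mapping = recode_locus(["A", "G", "N", "A"], missing_label="N")
print(codes)    # [1, 2, -9, 1]
print(mapping)  # {'A': 1, 'G': 2}
```

Recoding is done per locus, so the same integer may stand for different alleles at different loci; only uniqueness within a locus matters for the coding rule stated above.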

2.4 Missing genotype data

Missing data should be indicated by a number that doesn't occur elsewhere in the data (often -9 by convention). This number can also be used where there is a mixture of haploid and diploid data (e.g., X and autosomal loci in males). The missing-data value is set alon[...].

[...] implemented reasonably careful error checking to make sure that the data set is in the correct format, and the program will att[...]. The front end requires returns at the ends of each row, and does not allow returns within rows; the comman[...]. One problem that can arise is that editing programs used to assemble the data prior to importing them into structure can introduce hidden formatting characters, often at the ends of lines, or at the end of the fi[...]. The front end can remove many of these automatically, but this type of problem may be responsible for errors when the data fi[...]. If you are importing data to a UNIX system, the dos2unix function can be helpful for cleaning these up.

3 Modelling decisions for the user

3.1 Ancestry Models

There are four main models for the ancestry of individuals: (1) the no-admixture model (individuals are discretely from one population or another); (2) the admixture model (each individual draws some fraction of his/her genome from each of the K populations); (3) the linkage model (like the admixture model, but linked loci are more likely to come from the same population); (4) models with informative priors (allow structure to use information about sampling locations: either to assist clustering with weak data, to detect migrants, or to pre-define some populations). See Pritchard et al. (2000a) and Hubisz et al. (2009) for more on models 1, 2, and 4 and Falush et al. (2003a) [...].

[...]pu[...] prior probability for each population is 1/[...]. This model is appropriate for studying fully discrete populations and is o[...].

[...] modelled by saying that individual i has inherited some fraction of his/[...]. [...]ional on the ancestry vector, q(i), [...] reasonably fl[...]. [...]ure is a common feature of real data, and you probably won't fi[...]. The admixture model can also deal with hybrid zones in a natural way.

Label  Pop   Flag  Location  Phen  ExtraCols          Loc 1    Loc 2    Loc 3    ....  Loc L
                                                      M1       M2       M3       ....  ML
                                                      r1       r2       r3       ....  rL
                                                      -1       D1,2     D2,3     ....  DL-1,L
ID(1)  g(1)  f(1)  l(1)      φ(1)  y1(1),...,yn(1)    x1(1,1)  x2(1,1)  x3(1,1)  ....  xL(1,1)
ID(1)  g(1)  f(1)  l(1)      φ(1)  y1(1),...,yn(1)    x1(1,2)  x2(1,2)  x3(1,2)  ....  xL(1,2)
                                                      p1(1)    p2(1)    p3(1)    ....  pL(1)
ID(2)  g(2)  f(2)  l(2)      φ(2)  y1(2),...,yn(2)    x1(2,1)  x2(2,1)  x3(2,1)  ....  xL(2,1)
ID(2)  g(2)  f(2)  l(2)      φ(2)  y1(2),...,yn(2)    x1(2,2)  x2(2,2)  x3(2,2)  ....  xL(2,2)
                                                      p1(2)    p2(2)    p3(2)    ....  pL(2)
....
ID(i)  g(i)  f(i)  l(i)      φ(i)  y1(i),...,yn(i)    x1(i,1)  x2(i,1)  x3(i,1)  ....  xL(i,1)
ID(i)  g(i)  f(i)  l(i)      φ(i)  y1(i),...,yn(i)    x1(i,2)  x2(i,2)  x3(i,2)  ....  xL(i,2)
                                                      p1(i)    p2(i)    p3(i)    ....  pL(i)
....
ID(N)  g(N)  f(N)  l(N)      φ(N)  y1(N),...,yn(N)    x1(N,1)  x2(N,1)  x3(N,1)  ....  xL(N,1)
ID(N)  g(N)  f(N)  l(N)      φ(N)  y1(N),...,yn(N)    x1(N,2)  x2(N,2)  x3(N,2)  ....  xL(N,2)
                                                      p1(N)    p2(N)    p3(N)    ....  pL(N)

Table 2: Format of the data file, [...] these components are optional (see text for details). Ml is an identifi[...]. rl indicates which allele, if any, is recessive at each marker (dominant genotype data only). Di,i+1 is the distance between markers i and i+1. ID(i) is the label for individual i; g(i) is a predefined population index for individual i (PopData); f(i) is a flag used to incorporate learning samples (PopFlag); l(i) is the sampling location of individual i (LocData); φ(i) can store a phenotype for individual i; y1(i),...,yn(i) are for storing extra data (ignored by the program); (xl(i,1), xl(i,2)) [...].

[...] essentially a generalization of the admixture model to deal with "admixture linkage disequilibrium", i.e., the correl[...]. [...] et al. (2003a) describes the model, [...].

The basic model is that, t generations in the past, [...]. If you consider an individual chromosome, it is composed of a series of "chunks" that [...]. Admixture LD arises because linked alleles are often on the same chunk, [...]. The sizes of the chunks are assumed to be independent exponential random variables with mean length 1/t (in Morgans). In practice we estimate a "recombination rate" r from the data

that corresponds to the rate of switching from the present chunk to a new chunk.(1) Each chunk in individual i is derived independently from population k with probability qk(i), where qk(i) is the proportion of that individual's [...].

Overall, the new model retains the main elements of the admixture model, but al[...]. [...] reports the overall ancestry for each individual, taking account of the linkage, and can also report the probability of origin of each bit of chromosome, [...]. This new model performs better than the ori[...]. It achieves more accurate estimates of the ancestry vector, [...]. [...]el is n[...].

Clearly, this model is a big simplifi[...]. However, the major effect of admixture is to create long-range correlation among linked markers, an[...]. The computations are a bit slower than for the admixture model, [...]. Nevertheless, they are p[...]. The model can only be used if there is information about the relative positions of the markers (usually a genetic map).

[...] default mode for str[...]. However, ther[...], physical characteristics of sampled individuals or geographic sampling locations). At present, structure can use this information in three ways:

• LOCPRIOR models: use sampling locations as prior information to assist the clustering [...], significant FST between sampling locations), [...] often the case for data sets with few markers, few individuals, o[...]. To improve performance in this situation, Hubisz et al. (2009) develope[...]. [...]e models can often provide accurate inference of population structure and individual ancestry in data sets where the sig[...].

Briefly, [...]. [...]y, structure assumes tha[...]. [...] there is an immense number of possible partitions, it takes highly informative data for structure to conclude that any particular par[...]. In contrast, the LOCPRIOR models take the view that in practice, indiv[...]. Therefore, the LOCPRIOR models are set [...]. [...]ata suggest that the locations are informative, [...].

[...] et al. (2009) developed a pair of LOCPRIOR models: [...] cases, the underlying model (and the likelihood) [...] difference is that str[...], by modifying the prior to prefer clustering solutions that correlate with the locations).

The LOCPRIOR models have the desirable properties that (i) they do not tend to find structure when none is present; (ii) they are able to ignore the sampling information when the ancestry of individuals is uncorrelated with sampling locations; and (iii) the old and new models give essentia[...]. [...], we recommend using the new models in most situations where the amount of available data is very limited, especially wh[...]. However, since there is now a great deal of accumulated experience with the standard structure models, we recommend that the basic models remain the default for highly informative data sets (Hubisz et al., 2009).

To run the LOCPRIOR model, the user must first specify a "sampling location" for each individual, [...], we assume the samples were collected at a set of discrete locations, and we do not use any spatial information about the locations. (We recognize that in some studies, every individual may be collected at a different location, and so clumping individuals into a smaller set of discrete locations may not be an ideal representation of the data.) The "locations" could also represent a phenotype, ecotype, [...].

The locations are entered into the input file either in the PopData column (set LOCISPOP=1), or as a separate LocData column (see Section 2.3). To use the LOCPRIOR model you must fi[...]. If you are using the Graphical User Interface version, tick the "use sampling locations as prior" [...]. If you are using the command-line version, set LOCPRIOR=1. (Note that LOCPRIOR is incompatible with the linkage model.)

Our experience so far is that the LOCPRIOR model [...] use the same diagnostic[...]. Additionally it may be helpful to look at the value of r, [...] of r near 1, or [...]

1 Because of the way that this is parameterized, the map distances in the input file can be in arbitrary units, e.g., genetic distances, or physical distances (under the assumption that these are roughly proportional to genetic distances). Then the estimated value of r represents the rate of switching from one chunk to the next, per unit of whatever distance was assumed in the input fi[...], if an admixture event took place ten generations ago, then r should be estimated as 0.1 when the map distances are measured in cM (this is 10 × 0.01, where 0.01 is the probability of recombination per centiMorgan), or as 10^-4 = 10 × 10^-5 when the map distances are measured in KB (assuming a constant crossing-over rate of 1 cM/MB). [...] The front end tries to make some guesses about sensible upper and lower bounds for r, but the user should adjust these to match the biology of the situation.

2 Daniel refers to this as "Better priors for worse data."
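The footnote on the linkage model's recombination rate r works through one numerical example. That arithmetic can be written out as a small Python sketch; the function name `expected_r` and its arguments are hypothetical and serve only to restate the footnote's calculation, assuming (as the footnote does) a constant crossing-over rate of 1 cM/Mb.

```python
import math


def expected_r(generations_since_admixture, recomb_prob_per_unit):
    """Expected chunk-switching rate per unit of map distance.

    With admixture t generations ago, chunk lengths have mean 1/t Morgans,
    so the switching rate per unit distance is t times the recombination
    probability per unit of that distance.
    """
    return generations_since_admixture * recomb_prob_per_unit


t = 10                      # admixture event ten generations ago
per_cM = 0.01               # recombination probability per centiMorgan
per_kb = per_cM / 1000.0    # 1 cM/Mb  ->  1e-5 per kb

r_cM = expected_r(t, per_cM)   # about 0.1, as in the footnote
r_kb = expected_r(t, per_kb)   # about 1e-4, as in the footnote

# The command-line bounds for r are given on a log10 scale, so these are
# the magnitudes the chosen bounds would need to bracket.
print(round(math.log10(r_cM), 6), round(math.log10(r_kb), 6))
```

The point of the sketch is that the plausible range for r depends entirely on the units of the map distances in the input file, which is why the footnote asks the user to check the bounds against the biology of the data.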

Note that this model assumes that the predefi[...]s quite strong data to overcome the prior against misclassifi[...]. [...] using the USEPOPINFO model, you should also run the program without population information to ensure that the pre-defi[...].

[...] this model set USEPOPINFO to 1, and choose a value of MIGRPRIOR (which is ν in Pritchard et al. (2000a)). You might choose something in the range 0.001 to 0.1 for ν.

The pre-defined population for each individual is set in the input data file (see PopData). In this mode, individuals assigned to population k in the input fi[...]. Therefore, the predefined populations should be integers between 1 and MAXPOPS (K), [...]. [...]ata for any individual is outside this range, their q will be updated in the normal way (i.e., without prior population information, according to the model that would be used if USEPOPINFO was turned off(3)).

• USEPOPINFO model: pre-specify the population of origin of some ind[...].

[...]d way to use the USEPOPINFO model is to define "learning samples" that are pre-defi[...]: In the Front End, this option is switched on using the option "Update allele frequencies using only individuals with POPFLAG=1", located under the "Advanced Tab".

Learning samples are implemented using the PopFlag column in the data fi[...]. The pre-defined population is used for those individuals for whom PopFlag=1 ([...]K)). The PopData value is ignored for individuals for whom PopFlag=[...]. [...]e is no PopFlag column in the data file, then when USEPOPINFO is turned on, [...]. [...]ry of individuals with PopFlag=0, [...]K) are updated according to the admixture or no-admixture model, as specifi[...]. [...]d above, it may be helpful to set α to a sensible value if there are few individuals without predefi[...].

[...] For example, there may be some individuals of known origin, [...]. For example, [...]K), and then use structure to estimate the ancestry for additional dogs of unknown (possibly hybrid) [...]. By pre-setting the population numbers, we can ensure that the structure clusters correspond to pre-defined breeds, which makes the output more interpretable, and can improve the accuracy of the inference. (Of course, if two pre-defined breeds are genetically identical, [...].

[...]r use of USEPOPINFO is for cases where the user [...]. Ordinarily, structure analys[...]. However, there are some settings where you might want to estimate ancestry for some individuals, without those individuals aff[...]. For example, you may have a standard collection of learning samples, and then periodically you want to estimate ancestry for new batches of genotyped [...]

3 If the admixture model is used to estimate q for those individuals without prior population information, α [...]. [...]e are very few such individuals, you may need to fix α at a sensible value.

[...]efault options, the ancestry estimates for individuals would depend (somewhat) [...]. [...]g PFROMPOPFLAGONLY you can ensure that the allele frequency estimates depend only on samples for which PopFlag=[...]. [...] different setting, Murgia et al. (2006) [...]ours were so closely re[...]g PFROMPOPFLAGONLY, [...].

[...]ts: We recommend running the basic version of structure first to verify that the predefi[...], when using learning samples, it may be sensible to allow for some misclassification by setting MIGRPRIOR larger than 0.

3.2 Allel[...]

[...]el assumes that the allele frequencies in each population are independent draws from a distribution that is specified by a parameter called λ. That is the original model that was used in Pritchard et al. (2000a). Usually we set λ=1; [...].

[...] et al. (2003a) [...]ys that frequencies in the different populations are likely to be similar (probably due to migration or shared ancestry). [...]

[...]y speaking, this prior says that we expect allele frequencies in different populations to be reasonably diff[...]. [...]ten improves clustering for closely related populations, but may increase the risk of over-estimating K (see below). If one population is quite divergent from the others, the correlated m[...].

Estimating λ: Fixing λ=1 is a good idea for most data, but in some situations, e.g., SNP data where most minor alleles are rare, s[...]. For this reason, you can get the program to estimate λ [...]. [...] want to do this once, perhaps for K=1, and then fix λ at the estimated value thereafter, because there seem to be some problems with non-identifiability when trying to estimate too many of the hyperparameters (λ, α, F) at [...].

Correlated allele frequencies model: As described by Falush et al. (2003a), the correlated frequencies model uses a (multidimensional) vector, PA, which records the allele frequencies in a hypothetical "ancestral" [...]. It is assumed that the K populations represented in our sample have each undergone independent drift away from these ancestral frequencies, at a rate that is parameterized by F1, F2, F3, ..., FK, [...]. The estimated Fk values should be numerically similar to FST values, apart from differences that stem from the slightly different model, and diff[...], it is diff[...].

[...] is assumed to have a Dirichlet prior of the same form as that used above for the population frequencies:

P_{Al\cdot} \sim \mathcal{D}(\lambda_1, \lambda_2, \ldots, \lambda_{J_l}), \quad (1)

4 Note that Pritchard et al. (2000a) also outlined a model of correlated allele frequencies; this was last available in Version 1.x.

[...]e prior for the frequencies in population k is

p_{kl\cdot} \sim \mathcal{D}\left( P_{Al1}\frac{1-F_k}{F_k},\; P_{Al2}\frac{1-F_k}{F_k},\; \ldots,\; P_{AlJ_l}\frac{1-F_k}{F_k} \right), \quad (2)

[...] model, the Fs have a close relationship to the standard measure of genetic distance, [...]. In the standard parametrization of FST, the expected frequency in each population is given by overall mean frequency, and the variance in frequency across subpopulations of an allele at overall frequency p is p(1−p)[...]. The model here is much the same, except that we generalize the model slightly by allowing each population to drift away from the ancestral population at a different rate (Fk), as might be expected if populations have diff[...]. [...] try to estimate "ancestral frequencies", [...].

[...] placed independent priors on the Fk, proportional to a gamma distribution with mean 0.01 and standard deviation 0.05 (but with Pr[Fk ≥ 1] = 0). The parameters of the gamma prior can be modifi[...]. [...]perimentation suggests that the prior mean of 0.01, which corresponds to very low levels of subdivision, often leads to good performance for data that are diffi[...]. In other problems, where the differences among populations are more marked, it seems that the data usually overwhelm this prior on Fk.

3.3 How long to run the program

The program is started from a random configuration, and from there takes a series of steps through the parameter space, each of which depends (only) o[...]. This procedure induces correlations between the state of the Markov chain at diffe[...]. [...]e is that by running the simulation for long enough, [...].

There are two issues to worry about: (1) burnin length: how long to run the simulation before collecting data to minimize the effect of the starting configuration, and (2) how long [...].

[...] choose an appropriate burnin length, it is really helpful to look at the values of summary statistics that are printed out by the program (e.g., α, F, the divergence distances among populations Di,j, and the likelihood) [...]. Typically a burnin of 10,000 to 100,000 [...].

[...] choose an appropriate run length, you will need to do several runs at each K, possibly of different lengths, [...]. Typically, you can get good estimates of the parameter values (P and Q) with runs of 10,000 to 100,000 steps, but accurate estimation of Pr(X|K) [...]. In practice your run length m[...]. If you are dealing with extremely large data sets and are frustrated with the run times, you might try trimming both the length of the runs, and the number of markers/individuals, [...].

[...] should look to see w[...] values are still increasing or decreasing at the end of the burnin phase, [...] estimate of α, not just during the burnin), you may get more accurate estimates of Pr(X|K) by increasing ALPHAPROPSD, which improves mixing in that situation. (See a related issue in section 5.)

4 Missing data, null alleles and domin[...]

[...] approach is correct when the probability of having missing data a[...]. [...] estimates of Q for individuals with missing data are less accurate, there is no particular reason to exclude such individuals from the analysis, [...].

A serious problem arises when data are missing in a systematic manner, [...]. [...] do not fit the assumed model, and can lead to appar[...] the dominant markers model (below) [...].

[...]n sometimes lead to overestimation of K, especially for the correlated frequencies model (Falush et al., 2003a), but there is little effect on the assignment of individuals to populations for fixed K.

4.1 Dominant markers, null alleles, and polyploid genotypes

For some types of genetic markers, such as AFLPs, [...]. [...] types of markers may result in ambiguous genotypes if some fraction of the alleles are 'null' [...]. Starting with Version 2.2 we implement a model f[...].

[...], we assume that at any particular locus there may be a single allele that is recessive to all other alleles (e.g., A), [...]B and BB would appear in the raw genotype data as a "phenotype" B, AC and CC would be recorded as C, [...]. [...] there is ambiguity, [...]. [...] details are given in Falush et al. (2007).

In order to perform these computations the algorithm must be told which allele (if any) [...]. [...] done by setting RECESSIVEALLELES=1, and including a single row of L integers at the top of the input file, between the (optional) lines for marker names and map distances, [...]. [...] given locus all the markers are codominan[...]. Conversely, if the recessive allele is never observed in homozygous state but you think it might be present ([...] might be null alleles) then set the recessive value to an allele that is not observed at that locus (but not MISSING!).

Coding the genotype data: If the phenotype is unambiguous, then it is coded in the structure input fi[...]. [...] ambiguous then it is coded as homozygous for the dominant allele(s). For example, phenotype A is coded AA, B is coded BB, BC is coded BC, [...]. [...] marker is haploid in an otherwise diploid individual (e.g., for the X chromosome in a male), [...]. [...]otypes AB, AC, etc. are illegal in the input fi[...].

[...] RECESSIVEALLELES is used to deal with null alleles, genotypes that appear to be homozygote null shoul[...]. In practice it may be un[...]. [...]ure should be robust to these being coded as missing unless null alleles are at high frequency at a locus.
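The coding rule for diploid dominant-marker data ("phenotype A is coded AA, B is coded BB, BC is coded BC", with genotypes such as AB illegal when A is recessive) can be illustrated with a short Python sketch. The function `code_diploid_phenotype` is hypothetical and covers only the diploid case described above; it does not handle haploid markers, polyploids or missing data.

```python
def code_diploid_phenotype(observed, recessive):
    """Return the two allele codes to write in the input file for a diploid.

    observed  -- the alleles actually visible in the assay (the "phenotype")
    recessive -- the allele declared recessive at this locus
    """
    alleles = sorted(set(observed))
    if not 1 <= len(alleles) <= 2:
        raise ValueError("a diploid phenotype shows one or two alleles")
    if len(alleles) == 2:
        if recessive in alleles:
            # A recessive allele cannot be seen next to a dominant one,
            # so entries such as AB or AC are illegal when A is recessive.
            raise ValueError("recessive allele listed with another allele")
        return alleles[0], alleles[1]
    # One visible allele: code as homozygous. For a dominant allele this
    # stands for the ambiguous phenotype (e.g. B could be AB or BB).
    return alleles[0], alleles[0]


print(code_diploid_phenotype(["A"], recessive="A"))       # ('A', 'A')
print(code_diploid_phenotype(["B"], recessive="A"))       # ('B', 'B')
print(code_diploid_phenotype(["B", "C"], recessive="A"))  # ('B', 'C')
```

Letters are used here only to mirror the text; in a real input file the alleles would be the integer codes used for the genotype data.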

In polyploids (PLOIDY>2) the situation is more complic[...]ten diffic[...]ture is run with RECESSIVEALLELES=[...]. For polyploids, when RECESSIVEALLELES=1, structure allows the data to consist of a mixture of loci for which there is, and isn't [...]. [...] loci are not ambiguous then set the code NOTAMBIGUOUS to an integer that does not match any of the alleles in the data, [...] the recessive alleles line at the top of the input fi[...]. [...]ead, at a particular locus the alleles are all codominant, but there is ambiguity about the number of each (e.g., for microsatellites in a tetraploid) [...]. Finally, if there is a recessive allele, and there is also ambiguity about the number of each allele, [...].

[...] of alleles where there is c[...]. For example, in a tetraploid where three codominant alleles B, C and D are observed, this should be coded as BCDD or equi[...]. It should not be coded as BCD(MISSING), as this ind[...].

[...]tion of Pr(K): When RECESSIVEALLELES is used for diploids, the likelihood at each ste[...] of coding, when either PLOIDY>2 or the linkage model is used, [...]creases the likelihood and seems to greatly infl[...]. [...]d experience indicates that this leads to poor performance for estimating K in the latter cases, and you should consider such estimates of K to be unreliable.

5 Estimation of K (the number of populations)

In our paper describing this program, we pointed out that this issue should be treated with care for two reasons: (1) it is computationally difficult to obtain accurate estimates of Pr(X|K), and our method merely provides an ad hoc approximation, and (2) [...].

[...] experience we find that the real diffic[...]. [...]edure for estimating K [...], due to isolation by distance or inbreeding). In those cases there may not be a natural answer to what is the "correct" [...]. [...]s for this kind of reason, it is not infrequent that in real data the v[...] usually makes sense to focus on values of K that capture most of the structure in the data and that seem biologically sensible.

5.1 Steps in estimating K

1. (Command-line version) Set COMPUTEPROBS and INFERALPHA to 1 in the file extraparams. (Front End version) Make sure that α [...].

2. [...] MCMC scheme for different values of MAXPOPS (K). At the end it will output a line "Estimated Ln Prob of Data". This is the estimate of ln Pr(X|K). You should run

several independent runs for each K, [...]. [...] variability across runs for a given K is substantial compared to the variability of estimates obtained for different K, [...]. [...](X|K) appears to be bimodal or multimodal, the MCMC scheme may be finding diff[...]. [...] check for this by comparing the Q for different runs at a single K. (cf. Data Set 2A in Pritchard et al. (2000a), and see the section on Multimodality, below).

[...] For example, for Data Set 2A in the paper (where K was 2), we got

K    ln Pr(X|K)
1    -4356
2    -3983
3    -3982
4    -3983
5    -4006

We can start by assuming a uniform prior on K = {1, ..., 5}. Then from Bayes' Rule, Pr(K=2) is given by

\frac{e^{-3983}}{e^{-4356} + e^{-3983} + e^{-3982} + e^{-3983} + e^{-4006}} \quad (3)

It's easier to compute this if we simplify the expression to

\frac{e^{-1}}{e^{-374} + e^{-1} + e^{0} + e^{-1} + e^{-24}} = 0.21 \quad (4)

5.2 Mild departures from the model can lead to overestimating K

When there is real population structure, this leads [...]y speaking, [...]. [...]e departures [...] include inbreeding, and genotyping errors such as occasional, undetected, [...]. [...] the absence of population structure, these types of factors can lead to a weak statistical signal for K>[...].

Starting in Version 2, we have suggested that the correlated allele frequency model should be used as a default because it often achieves better performance on difficult problems, but the user should be aware that this may make it easier to overestimate K in such settings than under the independent frequencies model (Falush et al., 2003a).

The next subsection discusses how to decide whether inferred structure is real.

5.3 Informal pointers for choosing K; is the structure real?

[...] first is that it's often the situation that Pr(K) is very small for K less than the appropriate value (effectively zero), and then more-or-less plateaus for larger K, [...]. [...] sort of situation where several values of K give similar estimates of log Pr(X|K), it seems that the smallest of these is often "correct".
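The simplification from equation (3) to equation (4) in Section 5.1 (subtracting the largest ln Pr(X|K) before exponentiating) is what makes the calculation numerically feasible, since e^-3983 underflows in ordinary floating point. A minimal Python sketch of that calculation follows; the function name `posterior_over_k` is hypothetical, and a uniform prior on K is assumed as in the worked example.

```python
import math


def posterior_over_k(ln_prob_by_k):
    """Posterior Pr(K | X) under a uniform prior on the listed K values.

    ln_prob_by_k -- dict mapping K to the estimated ln Pr(X | K).
    The maximum is subtracted before exponentiating, as in equation (4).
    """
    largest = max(ln_prob_by_k.values())
    weights = {k: math.exp(v - largest) for k, v in ln_prob_by_k.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Values reported in the text for Data Set 2A.
ln_probs = {1: -4356, 2: -3983, 3: -3982, 4: -3983, 5: -4006}
posterior = posterior_over_k(ln_probs)
print(round(posterior[2], 2))  # 0.21, matching equation (4)
```

With these numbers K=3 receives the largest posterior weight and K=2 and K=4 tie, which is an instance of the "plateau" behaviour discussed in Section 5.3, where the smallest K on the plateau is often the sensible choice.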

It is a bit difficult to provide a firm rule for what we mean by "more-or-less plateaus". For small data sets, this might mean that the values of log Pr(X|K) are within 5-10, but Daniel Falush writes that "in very big datasets, the difference between K=3 and K=4 may be 50, but if the difference between K=3 and K=2 is 5,000, then I would definitely choose K=3." Readers who want to use a more formal criterion that takes this into account may be interested in the method of Evanno et al. (2005).

[...], we may not always be able to know the TRUE value of K, but we should [...].

The second pointer is that if there really are separate populations, there is typically a lot of information about the value of α, and once the Markov chain converges, α will normally settle down to be relatively constant (often with a range of perhaps 0.2 or less). However, if there isn't any real structure, α [...].

A corollary of this is that when there is no population structure, you will typically see that the proportion of the sample assigned to each population is roughly symmetric (∼1/K in each population), [...] individuals are strongly assigned to one population or another, and if the proportions assigned to each group are asymmetric, [...].

Suppose that you have a situation with two clear populations, but you are trying to decide whether one of these is further subdivided (i.e., the value of Pr(X|K=3) is similar to, or perhaps a little larger than, Pr(X|K=2)). Then one thing you could try is to run structure using only the individuals in the population that you suspect might be subdivided, [...].

In summary, you should be skeptical about population structure inferred on the basis of small differences in Pr(K) if (1) there is no clear biological interpretation for the assignments, and (2) the assignments are roughly symmetric to all populations and no individuals are strongly assigned.

5.4 Isolation by distance data

Isolation by distance refers to the idea that individuals may be spatially distributed across some region, [...] situation, [...]. [...]e[...]is occurs, the inferred value of K, and [...]. [...]ting on the sampling scheme, [...], the algorithm will attempt to model the allele frequ[...] situations, interpreting the results may be challenging.

6 Background LD and other miscellanea

6.1 Sequence data, tightly linked SNPs and haplotype data

[...], not in LD within populations). This assumption is likely to be violated for sequence data, [...]. [...] have sequence data or dense SNP data from multiple independent regions, then structure may actually perform reasonably well despite the data not completely fi[...]. Roughly

speaking, this will happen provided that there is enough inde[...]. [...] there are enough independent regions, the main cost of the dependence within regions will be that structure und[...].

For example, Falush et al. (2003b) applied structure to MLST (multilocus sequence) [...] case, there is enough recombination within regions that the signal of population structure dominates background LD. (For more on MLST data, see also Section 10.)

In an application to humans, Conrad et al. (2006) found that 3000 SNPs from 36 linked regions produced sensible (but noisy) answers in a worldwide sample of humans that largely agreed with previous results based on microsatellites [see their Supplementary Methods Figure SM2].

However, if the data are dominated by one or a few non- or low-recombining regions, [...]. For example, if the data consisted of Y chromosome data only, then the estimated structure would presumably reflect something about the Y chromosome tree, [...]. [...]act of using such data is likely to be that (1) the algorithm underestimates the degree of uncertainty in ancestry estimates, and in the worst case, may be biased or inaccurate; (2) [...].

[...] have Y or mtDNA data plus a number of nuclear markers, a safe and valid solution is to recode the haplotypes from each linke[...]. [...]e are very many haplotypes, [...].

[...]at the linkage model is not necessarily any better than the (no)-[...]. The linkage model is not designed to deal with background LD within populations, and is likely to be similarly confused.

6.2 Multimodality

The structure algorithm starts at a random place in parameter space, and then converges towards a mode of the parameter space. (In this context, a mode can be thought of, loosely speaking, as a clustering solution that has high posterior probability.)

When prior labels are not used, there is no inherent meaning in the numbering of the K clusters, and so there are K! [...]. In theory, structure might switch among these modes, but this does not normally occur for real data sets (Pritchard et al., 2000a). For preparing plots for publication, Noah Rosenberg's lab has a helpful program, CLUMPP, that lines up the cluster labels across different runs prior to data plotting (Section 10).

In addition to these symmetric modes, [...]. [...]rent implementation of s[...]. This means that different runs can produce substantially different answers, and longer runs will probably not fi[...]. [...] mainly an issue for very complex data sets, with large values of K, K>5 or K>10, say (but see the example of Data Set 2A in Pritchard et al. (2000a)). You can [...].

[...]ul analysis of this type of situation was presented by Rosenberg et al. (2001), for a data set where the estimated K was around 19.
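Because cluster numbering is arbitrary, the Q matrices from two runs at the same K have to be relabelled before they can be compared. The Python sketch below illustrates the idea by trying all K! permutations of the columns; it is a deliberately naive illustration and is not CLUMPP's algorithm. The function name `align_labels` and the sum-of-absolute-differences criterion are assumptions chosen for simplicity, and the exhaustive search is only practical for small K.

```python
from itertools import permutations


def align_labels(q_ref, q_run):
    """Permute the cluster columns of q_run to best match q_ref.

    q_ref, q_run -- lists of rows, one row of K membership proportions per
                    individual, with individuals in the same order.
    Returns (aligned_q_run, permutation), where permutation[j] is the
    column of q_run that is matched to column j of q_ref.
    """
    k = len(q_ref[0])

    def mismatch(perm):
        # Total absolute difference between the runs under this relabelling.
        return sum(abs(ref_row[j] - run_row[perm[j]])
                   for ref_row, run_row in zip(q_ref, q_run)
                   for j in range(k))

    best = min(permutations(range(k)), key=mismatch)
    aligned = [[row[best[j]] for j in range(k)] for row in q_run]
    return aligned, best


run_a = [[0.9, 0.1], [0.2, 0.8]]
run_b = [[0.1, 0.9], [0.8, 0.2]]   # same solution with the labels swapped
aligned, perm = align_labels(run_a, run_b)
print(perm)     # (1, 0)
print(aligned)  # [[0.9, 0.1], [0.2, 0.8]]
```

If the best relabelling still leaves a large mismatch, the two runs have probably found genuinely different (non-symmetric) modes rather than the same solution with permuted labels.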

[...]ting admixture proportions can be particularly chall[...]

[...]s an example of this for simulated data in Pritchard et al. (2000b). The data were supposed to approximate a sample from an African American [...]. [...]se data, the estimated ancestry proportions were highly correlated with the true (simulated) values, [...]. [...] occurs because in the absence of any non-admixed individuals, there may be some non-identifiability where it is possible to push the allele frequencies further apart, and squeeze the admixture proportions together (or vice-versa), and obtain much the same degree of model fi[...]. [...]POPALPHAS=1 (separate α for each population) can help a bit when [...]. Therefore, the admixture estimates in these situations should be treated with caution.

7 Running structure from the c[...]

[...] are in two files (mainparams and extraparams), [...]. mainparams specifies the input format for the data fi[...]. extraparams specifi[...]. [...]l need to set all the values in mainparams, [...]. [...]at the default model assumes admixture, and does not make use of the user-defi[...].

Each parameter is printed in all-caps in one of these two files, preceded by the word "#define". (They are also printed in all-caps throughout this document.) The value is set immediately following the name of the parameter (e.g., "#define NUMREPS 1000" sets the number of MCMC repetitions to 1000). Following each parameter definition, there is a brief comment (marked "//"), [...] include: "(str)", for string (used for the names of the input and output files); "(int)", for integer; "(d)", [...], a real number such as 3.14); and "(B)", [...], the parameter takes values TRUE or FALSE by setting this to 1 or 0, respectively).

The program is insensitive to the order of the parameters, so you can re-arrange them or add comments, [...]. The values of all parameters used for a given run are printed at the end of the output file.

7.1 Program param[...]

[...] are ordered according to the parameter files that are used in the command-line version of structure.

7.2 Parameters in fil[...]

[...]l of these parameters (LABEL, POPDATA, POPFLAG, PHENOTYPE, EXTRACOLS) indicate whether particular types of data are present in the input file; [...].

MAXPOPS (int) [...]ard et al. (2000a) [...]. Sometimes (depending on the nature of the data) there is a natural value of K that can be used; otherwise K can be estimated by checking the fit of the model at different values of K (see Section 5).

BURNIN (int) Length of burnin period before the start of data collection. (See Section 3.3.)

NUMREPS (int) Number of MCMC reps after burnin. (See Section 3.3.)

Input/Output fi[...]

[...] (string) Name of input data fi[...]. Max length 30 characters (or possibly less depending on operating system).

OUTFILE (string) Name for program output files (the suffixes "1", "2", ..., "m" (for intermediate results) and "f" (final results) are added to this name). Existing fi[...]. Max length of name 30 characters (or possibly less depending on operating system).

Data fi[...]

NUMINDS (int) Number of individuals in data fi[...].

NUMLOCI (int) Number of loci in data fi[...].

PLOIDY (int) [...]. Default is 2 (diploid).

MISSING (int) [...] an integer, [...].

ONEROWPERIND (Boolean) [...], for diploid data, this would mean that the two alleles for each locus are in consecutive order in the same row, rather than being arranged in the same column, [...].

LABEL (Boolean) Input file contains labels (names) for each individual. 1=Yes; 0=No.

POPDATA (Boolean) Input file contains a user-defined population-of-origin for each individual. 1=Yes; 0=No.

POPFLAG (Boolean) Input file contains an indicator variable which says whether to use popinfo when USEPOPINFO==1 (see below). 1=Yes; 0=No.

LOCDATA (Boolean) Input file contains a user-defined sampling location for each individual. 1=Yes; 0=No. [...] LOCISPOP=[...].

PHENOTYPE (Boolean) Input file contains a column of phenotype information. 1=Yes; 0=No.
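The "#define NAME value // comment" layout described in Section 7 is simple enough to read with a few lines of Python, which can be handy for checking that the data-format parameters agree with the data file before a run. The sketch below is an assumption-laden illustration: the function name `parse_params` is hypothetical, the sample text is made up for the example (its values are those quoted in the Table 1 caption and the NUMREPS example), and the real parameter files may contain details this does not handle.

```python
def parse_params(text):
    """Collect '#define NAME value' settings from a parameter file's text.

    Anything after '//' on a line is treated as a comment. Values are
    converted to int or float where possible and kept as strings otherwise.
    Order does not matter; a later definition overrides an earlier one.
    """
    params = {}
    for line in text.splitlines():
        fields = line.split("//", 1)[0].split()
        if len(fields) >= 3 and fields[0] == "#define":
            name, raw = fields[1], fields[2]
            for convert in (int, float):
                try:
                    params[name] = convert(raw)
                    break
                except ValueError:
                    continue
            else:
                params[name] = raw
    return params


# Illustrative text only; the comments are invented for this example.
sample = """
#define NUMREPS 1000   // (int) number of MCMC reps after burnin
#define NUMINDS 7      // (int) number of individuals
#define NUMLOCI 5      // (int) number of loci
#define MISSING -9     // (int) value given to missing genotype data
#define LABEL 1        // (B) input file contains individual labels
"""
settings = parse_params(sample)
print(settings["NUMINDS"], settings["NUMLOCI"], settings["MISSING"])
# 7 5 -9
```

A natural use is a consistency check, for example confirming that the number of data rows equals NUMINDS times PLOIDY when ONEROWPERIND is 0.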

EXTRACOLS (int) Number of additional [...] are ignored by the program. 0 = [...]

MARKERNAMES (Boolean) The top row of the data file [...]

RECESSIVEALLELES (Boolean) Next row of data file contains [...]

[...]TANCES (Boolean) The next row of the data file (or the first row if MARKERNAMES==0) [...]

[...] data file [...]

[...] (Boolean) [...] (LINKAGE=1, PHASED=0), then PHASEINFO can be used: this is an extra line in the input file [...] PHASEINFO=0 each value is set to 0.5, [...] the linkage model is used with polyploids, PHASED= [...]

PHASEINFO (Boolean) The row(s) of genotype data for [...]

[...]PHASE (Boolean) [...]

[...]IGUOUS (int) For use with polyploids when RECESSIVEALLELES= [...]t match MISSING or any allele value in the data.

7.3 Parameters in file extraparams

[...] options allow the user to refine the model in various ways, [...] The default values are probably fine [...] Boolean options, type 1 for “Yes”, or “Use this option”; 0 for “No” or “Don’t use this option”.

NOADMIX (Boolean) Assume the model without admixture (Pritchard et al., 2000a). (Each individual is assumed to be completely from one of the K populations.) In the output, instead of printing the average value of Q as in the admixture case, the program prints the posterior probability that each individual is from each population. 1 = no admixture; 0 = admixture.

[...]E (Boolean) [...]

[...] front end makes some guesses about these, but some care on the part of the user is required to be sure that the values are sensible for the particular application.
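The “#define NAME value // comment” layout used by mainparams and extraparams is simple enough to read with a few lines of code, for example to record the settings of a run alongside its output. The following is only a minimal sketch and not part of structure itself: the function names, the sample values and the comment wording are all hypothetical, and it assumes that values never contain “//”.

```python
import re

# Matches e.g. "#define NUMREPS 1000"; group 1 is the name, group 2 the raw value.
DEFINE_RE = re.compile(r"^\s*#define\s+(\w+)\s+(\S+)")


def convert(raw):
    """Turn a raw token into int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def parse_params(text):
    """Collect every '#define NAME value' line into a dict, ignoring '//' comments."""
    params = {}
    for line in text.splitlines():
        code = line.split("//", 1)[0]  # assumption: values never contain "//"
        match = DEFINE_RE.match(code)
        if match:
            name, raw = match.groups()
            params[name] = convert(raw)
    return params


# Illustrative file contents; the values and comment wording are made up.
SAMPLE = """\
#define BURNIN   10000    // (int) length of burnin period
#define NUMREPS  1000     // (int) number of MCMC reps after burnin
#define OUTFILE  results  // (str) name for program output files
#define NOADMIX  0        // (B) 1 = no admixture
"""

params = parse_params(SAMPLE)
print(params)
```

Because the result is keyed by parameter name, the sketch is insensitive to the order of the “#define” lines, in the same way the program is.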

USEPOPINFO (Boolean) [...] have POPDATA= [...]

LOCPRIOR (Boolean) Use location information to improve [...]

[...]ORR (double) Use the “F model”, in which the allele frequencies are correlated across populations (Falush et al., 2003a). More specifically, rather than assuming a prior in which the allele frequencies in each population are independent draws from a uniform Dirichlet distribution, we start with a distribution [...] model is more realistic for very closely related populations (where we expect the allele frequencies to be similar across populations), and can produce better clustering (Section 3.2). The prior of Fk is set using FPRIORMEAN, [...]

ONEFST (Boolean) Assume the same value of Fk for all populations (analogous to Wright’s traditional FST). This is not recommended for most data, because in practice you probably expect different [...] =2 it may sometimes be difficult to estimate two values of FST separately (but see Harter et al. (2004)). When you’re trying to estimate K, you should use the same model for all K (we suggest ONEFST=0).

INFERALPHA (Boolean) Infer the value of the model parameter α from the data; otherwise α is fixed [...] option is ignored under the NOADMIX model. (The prior for the ancestry vector Q is Dirichlet with parameters (α, α, ..., α). Small α implies that most individuals are essentially from one population or another, while α > 1 implies that most individuals are admixed.)

POPALPHAS (Boolean) Infer a separate α [...] recommend [...]

[...] (double) Dirichlet parameter (α) for degree of admixture (this is the initial value if INFERALPHA==1).

INFERLAMBDA (Boolean) Infer a suitable value for λ. [...]

[...]CIFICLAMBDA (Boolean) Infer a separate λ [...]

[...] (double) [...] parameterizes the allele frequency prior, [...] frequencies at most markers are very skewed towards low/high frequencies, a smaller value of λ [...] doesn’t seem to work very well to estimate λ at the same time as the other hyperparameters, α [...] cases the default [...]

FPRIORMEAN, FPRIORSD (double) [...] prior for Fk is taken to be Gamma with mean FPRIORMEAN, [...] find that this makes the algorithm sensitive to subtle structure, but at some increased risk of overestimating K (Falush et al., 2003a).
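The remark under INFERALPHA, that small α concentrates each individual's ancestry in one population while α > 1 favours admixed individuals, can be checked numerically by drawing ancestry vectors Q from the symmetric Dirichlet(α, α, ..., α) prior. The sketch below is an illustration only and is not structure's own code: the function names, K = 3 and the three α values are arbitrary choices, and the Dirichlet draws are obtained by the standard device of normalizing independent Gamma(α, 1) variates.

```python
import random


def sample_q(alpha, k, rng):
    """Draw one ancestry vector from Dirichlet(alpha, ..., alpha) with k components."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0.0:  # guard against underflow when alpha is tiny
            return [d / total for d in draws]


def mean_largest_share(alpha, k=3, n=2000, seed=1):
    """Average, over n prior draws, of the largest ancestry proportion."""
    rng = random.Random(seed)
    return sum(max(sample_q(alpha, k, rng)) for _ in range(n)) / n


for alpha in (0.1, 1.0, 5.0):
    print(f"alpha={alpha}: mean largest share = {mean_largest_share(alpha):.2f}")
```

A mean largest share close to 1 means a typical prior draw is dominated by a single population; as α grows the largest share moves toward 1/K, which corresponds to evenly admixed individuals.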

UNIFPRIORALPHA (Boolean), ALPHAMAX (double) Assume a uniform prior for α [...] model seems to work fine; the alternative model (when UNIFPRIORALPHA=0) is to take α as having a Gamma prior, with mean ALPHAPRIORA × ALPHAPRIORB, and variance ALPHAPRIORA × [...]

LOG10RMIN, LOG10RMAX, LOG10PROPSD, LOG10RSTART (double) When the linkage model is used, the switch rate r is taken to have a uniform prior on a log scale, [...] values need [...]

[...] prior population information (USEPOPINFO)

GENSBACK (int) This corresponds to G (Pritchard et al., 2000a). When using prior population information for individuals (USEPOPINFO=1), the program tests whether each individual has an immigrant ancestor in the last G generations, where G= [...] In order to have decent power, G should be set fairly small (2, say) [...]

[...]IOR (double) Must be in [0,1]. This is ν in Pritchard et al. (2000a). Sensible values might be in the range 0.001 to [...]

[...]OPFLAGONLY (Boolean) This option, new with version 2.0, makes it possible to update the allele frequencies, P, using only a prespecified [...] this, include a POPFLAG column, and set POPFLAG=1 for individuals who should be used to update P, and POPFLAG= [...] can be used both with, [...] This option will be useful, for example, if you have a standard reference set of individuals from known populations, [...] this option, the q estimate for each unknown individual depends only on the reference set, [...]

LOCISPOP (Boolean) This option instructs the program to use the PopData column in the input file [...] LOCISPOP=0, [...]

LOCPRIORINIT (double) Initial value for the LOCPRIOR parameter r, that parameterizes how informative the populations are (Hubisz et al., 2009). We found that LOCPRIORINIT= [...]

MAXLOCPRIOR (double) Range of r is from (0, MAXLOCPRIOR). We suggest MAXLOCPRIOR= [...]

[...] options

PRINTNET (Boolean) Print the “net nucleotide distance” [...] distance between populations A and B, DAB, is calculated as [...]
