The Google File System



Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google∗

ABSTRACT
We have designed and implemented the Google File System, a scalable distributed file system [...]. It provides fault tolerance while running on inexpensive commodity hardware, [...]. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system [...]. This has led us to reexamine traditional choices and explore radically different [...].
[...] widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts [...]. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, [...].
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, [...].

Categories and Subject Descriptors
D [4]: 3 - Distributed file systems

General Terms
Design, reliability, performance, measurement

Keywords
Fault tolerance, scalability, data storage, clustered storage

∗ The authors can be reached at the following addresses: {sanjay,hgobioff,shuntak}@[...]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP’03, October 19–22, 2003, Bolton Landing, New York, USA.
Copyright 2003 ACM 1-58113-757-5/[...] $5.00.

1. INTRODUCTION
We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s [...]. [...] shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, [...]. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system [...]. [...] reexamined traditional choices and explored radically different [...].

[...] file system consists of hundreds or even thousands of storage machines built from inexpensive [...]. [...] quantity and quality of the components virtually guarantee that some are not functional [...]. [...] seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, [...]. Therefore, constant monitoring, error detection, fault tolerance, [...].

[...] Multi-GB files [...]. [...] file typically contains many application [...]. [...] are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system [...]. As a result, design assumptions and parameters such as I/O [...].

[...] most files are mutated by appending new data [...] writes within a file [...]. Once written, the files are only read, [...]. A variety of [...] may constitute large [...] may be intermediate results produced on one machine and processed on another, [...]. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, [...].

Fourth, co-designing the applications and the file system [...] benefits the overall system by increasing our flexibility.

For example, we have relaxed GFS’s consistency model to vastly simplify the file system [...]. [...] also introduced an atomic append operation so that multiple clients can append concurrently to a file [...].

[...] GFS clusters are currently deployed for different [...]. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.

2. DESIGN OVERVIEW

2.1 Assumptions
In designing a file system for our needs, we have been guided by assumptions that offer [...]. [...] alluded to some key observations earlier and now lay out our assumptions in more detail.

• The system [...] must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
• The system stores a modest number of large files. [...] expect a few million files, [...]. Multi-GB files are the common case and should be managed efficiently. [...] files must be supported, but we need not optimize for them.
• The workloads primarily consist of two kinds of reads: [...]. In large streaming reads, individual operations typically read hundreds of KBs, [...]. Successive operations from the same client often read through a contiguous region of a file. [...] random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.
• The workloads also have many large, sequential writes that append data to files. [...] Once written, files are seldom modified [...]. [...] writes at arbitrary positions in a file are supported but do not have to be efficient.
• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. [...] Hundreds of producers, running one per machine, will concurrently append to a file. [...] file may be read later, or a consumer may be reading through the file simultaneously.
• [...] our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.

2.2 Interface
GFS provides a familiar file system interface, [...]. Files are organized hierarchically in directories and identified [...]. [...] support the usual operations to create, delete, open, close, read, and write files.

Moreover, [...]. Snapshot creates a copy of a file [...]. [...] append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s [...]. [...] useful for implementing multi-way merge results and producer-consumer queues that [...]. [...] found these types of files to be invaluable in building large distributed [...].

2.3 Architecture
A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, [...]. [...] these is [...]. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky [...].

Files are divided into fixed-size [...]. Each chunk is identified by an immutable and globally unique [...]. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified [...]. For reliability, [...]. By default, we store three replicas, though users can designate different replication levels for different regions of the file [...].

The master maintains all file system [...]. This includes the namespace, access control information, the mapping from files to chunks, [...]. [...] controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, [...]. The master periodically communicates with each chunkserver [...].

[...] client code linked into each application implements the file system API and communicates with the [...]. Clients interact with the master for metadata operations, [...].

Neither the client nor the chunkserver caches file data. [...] caches offer little benefit because most applications stream through huge files [...]. [...] having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.

2.4 Single Master
Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement [...]

Figure 1: [...] (Diagram: the application’s GFS client sends (file name, chunk index) to the GFS master and receives (chunk handle, chunk locations); the master’s file namespace maps /foo/bar to chunk 2ef0; the client sends (chunk handle, byte range) to a GFS chunkserver and receives chunk data; the master sends instructions to the chunkservers and receives chunkserver state; each GFS chunkserver stores chunks in a Linux file system. Legend: data messages, control messages.)

[...] However, we must minimize [...]. Clients never read and write file data [...]. Instead, [...] caches this information for a limited time and [...].

[...], using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. [...], it sends the master a request containing the file [...]. The master [...]. The client caches this information using the file [...].

The client then sends a request to one of the replicas, [...]. The request specifies [...]. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file [...]. [...], the client typically asks for multiple chunks in the same request and the master can also include [...]. [...] extra information sidesteps [...].

2.5 Chunk Size
[...] chosen 64 MB, which is much larger than typical file system [...]. Each chunk replica is stored as a plain Linux file [...]. [...] space allocation avoids wasting space due to internal fragmentation, [...].

[...] chunk size offers [...]. [...], it reduces clients’ need to interact with the master because reads and writes on the same chunk require [...]. The reduction is especially significant for our workloads because applications mostly read and write large files [...]. [...] for small random reads, the client can comfortably [...]. [...], since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent [...]. [...] allows us to keep the metadata in memory, [...].

On the other hand, a large chunk size, even with lazy space allocation, [...]. [...] file consists of a small number of chunks, [...]. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files [...].

However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file [...]. [...] chunkservers storing [...]. [...] fixed this problem by storing such executables with a higher replication factor [...]. A potential long-term solution is to allow clients to read data from other clients in such situations.

2.6 Metadata
The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s [...]. All metadata is kept in the master’s [...]. [...] first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s [...]. [...] log allows us to update the master state simply, reliably, [...]. Instead, it asks each chunkserver about [...].

2.6.1 In-Memory Data Structures
Since metadata is stored in memory, [...]. Furthermore, it is easy and efficient for [...]. [...] periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space [...]
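The offset-to-chunk-index translation that the client performs in Section 2.4 is simple integer arithmetic over the fixed 64 MB chunk size. A minimal sketch follows; the function name and the tuple layout are hypothetical and are not part of any GFS client API described in the paper.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size stated in the text


def split_into_chunk_requests(offset, length, chunk_size=CHUNK_SIZE):
    """Split a byte range of a file into per-chunk pieces.

    Returns a list of (chunk_index, offset_within_chunk, num_bytes) tuples.
    The name and the return shape are illustrative assumptions.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    pieces = []
    end = offset + length
    while offset < end:
        chunk_index = offset // chunk_size      # which chunk of the file
        start_in_chunk = offset % chunk_size    # where the piece starts in it
        num_bytes = min(chunk_size - start_in_chunk, end - offset)
        pieces.append((chunk_index, start_in_chunk, num_bytes))
        offset += num_bytes
    return pieces
```

A read that straddles a chunk boundary yields one piece per chunk, so a client following this sketch would ask the master about each distinct chunk index only once and could cache the answers under the (file name, chunk index) key.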

[...] potential concern for this memory-only approach is that the number of chunks and hence [...]. [...] chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix [...].

If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility [...].

2.6.2 Chunk Locations
The master does not keep [...]. The master can keep itself up-to-date thereafter because it controls all chunk placement [...].

[...] initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, [...]. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, [...]. In a cluster with hundreds of servers, [...].

Another way to understand this design decision is to realize that a chunkserver has the final [...]. There is no point in trying to maintain a consistent view of this information on the master [...], a disk may go bad and be disabled) [...].

2.6.3 Operation Log
The [...]. Not only is it the only persistent record of metadata, but it also serves as a logical timeline that defines [...]. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified [...].

Since the operation log is critical, we must store it reliably and not [...]. Otherwise, we effectively lose the whole file [...]. Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing [...]. The master batches several log records together before flushing, thereby reducing the impact of flushing [...].

The master recovers its file [...]. To minimize startup time, [...]. The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the [...]. The checkpoint is in a compact B-tree like form that can be directly [...].

Because building a checkpoint can take a while, the master’s internal state is structured in such a way [...]. The master switches to a new log file [...]. [...] be created in a minute or so for a cluster with a few million files. When completed, [...].

Recovery needs only the latest complete checkpoint and subsequent log files. Older checkpoints and log files can be freely deleted, [...]. A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.

2.7 Consistency Model
GFS has a relaxed consistency model that supports our highly distributed applications well but remains relatively simple and efficient [...]. [...] discuss GFS’s [...]. [...] highlight how GFS maintains these guarantees but leave the details to other parts of the paper.

2.7.1 [...]
[...], file creation) are handled exclusively by the master: namespace locking guarantees atomicity and correctness (Section 4.1); the master’s operation log defines a global total order of these operations (Section 2.6.3).

Table 1: File Region State After Mutation

                      Write                      Record Append
Serial success        defined                    defined interspersed with inconsistent
Concurrent successes  consistent but undefined   defined interspersed with inconsistent
Failure               inconsistent               inconsistent

The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, [...]. Table 1 summarizes the result. A file region is consistent if all clients will always see the same data, [...]. A region is defined after a file data mutation if it [...]. When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): [...]. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect [...]. Typically, [...]. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different [...]. [...] describe below how our applications can distinguish defined regions from undefined regions.
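The recovery procedure from Section 2.6.3 (load the latest checkpoint, then replay only the log records written after it) can be sketched as follows. This is an illustration under stated assumptions, not the master's actual format: the record layout, the operation names (`create`, `add_chunk`, `delete`) and the sequence numbers are all hypothetical.

```python
import copy


def apply_record(state, record):
    """Apply one log record to the state (file path -> list of chunk handles).

    The three operations are hypothetical stand-ins for metadata mutations.
    """
    op = record[0]
    if op == "create":
        state[record[1]] = []
    elif op == "add_chunk":
        state[record[1]].append(record[2])
    elif op == "delete":
        del state[record[1]]
    else:
        raise ValueError(f"unknown operation: {op}")


def recover(checkpoint_state, checkpoint_seq, log):
    """Rebuild the state from a checkpoint plus the log.

    `log` is a list of (sequence_number, record) pairs in increasing order.
    Records already covered by the checkpoint are skipped, so only the tail
    of the log is replayed.
    """
    state = copy.deepcopy(checkpoint_state)
    for seq, record in log:
        if seq > checkpoint_seq:
            apply_record(state, record)
    return state
```

Replaying the full log from an empty state and replaying only the tail on top of a checkpoint give the same result; the checkpoint simply bounds how much of the log has to be read at startup.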

[...] applications do not need to further distinguish between different kinds of undefined [...].

[...] causes data to be written at an application-specified file offset. A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS’s choosing (Section 3.3). (In contrast, a “regular” append is merely a write at an offset that the client believes to be the current end of file.) The offset is returned to the client and marks the beginning of a defined [...]. In addition, [...] occupy regions considered [...].

[...] sequence of successful mutations, the mutated file region is guaranteed to be defined [...]. GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas (Section 3.1), and (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down (Section 4.5). Stale replicas will never be involved [...].

[...] clients cache chunk locations, [...]. [...] window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file. [...], as most of our files are append-only, a stale [...]. [...] reader retries and contacts the master, [...].

[...] after a successful mutation, [...]. [...] identifies failed chunkservers by regular handshakes between master and all chunkservers and detects data corruption by checksumming (Section 5.2). Once a problem surfaces, the data is restored from valid replicas as soon as possible (Section 4.3). A chunk is lost irreversibly only if all its replicas are lost before GFS can react, [...]. [...] this case, it becomes unavailable, not corrupted: applications receive clear errors rather than corrupt data.

2.7.2 Implications for Applications
GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, [...].

Practically all our applications mutate files [...]. In one typical use, a writer generates a file [...]. [...] atomically renames the file to a permanent name after writing all the data, [...]. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined [...]. Regardless of consistency and concurrency issues, [...]. Appending is far more efficient [...]. Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application’s [...].

In the other typical use, many writers concurrently append to a file [...]. [...] append’s append-at-least-once semantics preserves each writer’s [...]. [...] record prepared by the writer contains extra information like checksums so that its validity can be verified. [...], if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed [...]. [...] functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file [...]. With that, the same sequence of records, plus rare duplicates, [...].

3. SYSTEM INTERACTIONS
We designed the system to minimize the master’s [...]. With that background, we now describe how the client, master, and chunkservers interact to implement data mutations, atomic record append, and snapshot.

3.1 Leases and Mutation Order
A mutation is an operation that changes [...]. Each mutation is performed at all the chunk’s [...]. The master grants a chunk lease to one of the replicas, [...]. [...], the global mutation order is defined first by the lease grant order chosen by the master, [...].

[...] However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked [...], when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, [...].

In Figure 2, we illustrate this process by following the control flow [...].

1. The client asks the master which chunkserver [...]. If no one has a lease, the master grants one to a replica it chooses (not shown).
2. The master replies with the identity of the primary and the locations of the other (secondary) [...]. [...] needs to contact the master again only when the primary [...].

Figure 2: [...] (Diagram: numbered steps 1 to 7 among the client, the master, the primary replica, and secondary replicas A and B. Legend: control, data.)

3. [...] Each chunkserver will store the data in an internal LRU buffer [...]. By decoupling the data flow from the control flow, we can improve performance by scheduling the expensive data flow based [...].
4. Once all the replicas have acknowledged receiving the data, [...]. The request identifies [...]. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, [...].
5. [...] Each secondary replica [...].
6. The secondaries [...].
7. [...] In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and forwarded.) The client request is considered to have failed, and the modified [...]. [...] make a few attempts at steps (3) through (7) [...].

If a write by the application is large or straddles a chunk boundary, [...]. They all follow the control flow described above but may be interleaved [...]. Therefore, the shared file region may end up containing fragments from different clients, although the replicas will be identical because the individual [...]. This leaves the file region in consistent but undefined state as noted in Section 2.7.

3.2 Data Flow
[...] decouple the flow of data from the flow of control to use the network efficiently. While control flows from the client to the primary and then to all secondaries, data is pushed linearly [...]. Our goals are to fully utilize each machine’s network bandwidth, avoid network bottlenecks and high-latency links, [...].

To fully utilize each machine’s network bandwidth, the data is pushed linearly along [...], tree). Thus, each machine’s full outbound bandwidth [...].

[...], inter-switch links are often both) as much as possible, each machine forwards the data to the “closest” [...]. [...] sends the data to the closest chunkserver, say S1. S1 forwards it to the closest chunkserver S2 through S4 closest to S1, [...]. Similarly, S2 forwards it to S3 or S4, whichever is closer to S2, [...]. Our network topology is simple enough that “distances” [...].

Finally, [...]. [...] chunkserver receives some data, [...]. Pipelining is especially [...]. Without network congestion, the ideal elapsed time for transferring B bytes to R replicas is B/T + RL where T is the network [...]. Our network links are typically 100 Mbps (T), [...]. Therefore, 1 MB can ideally be distributed in about 80 ms.

3.3 Atomic Record [...]
[...] traditional write, the client specifies the offset [...]. Concurrent writes to the same region are not serializable: [...]. In a record append, however, the client specifies [...]. [...] appends it to the file [...], as one continuous sequence of bytes) at an offset of GFS’s choosing and returns that offset [...]. [...] similar to writing to a file opened in O_APPEND mode in [...].

[...] append is heavily used by our distributed applications in which many clients on different machines append to the same file [...]. Clients would need additional complicated and expensive synchronization, for example through a distributed lock manager, [...]. In our workloads, such files often

serve as multiple-producer/single-consumer queues or contain merged results from many different [...].

[...] append is a kind of mutation and follows the control flow [...]. The client pushes the data to all replicas of the last chunk of the file. Then, [...]. The primary checks to see if appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB). If so, it pads the chunk to the maximum size, tells secondaries to do the same, and replies to the client indicating that the operation should be retried on the next chunk. (Record append is restricted to be at most one-fourth of the maximum chunk size to keep worst-case fragmentation at an acceptable level.) If the record fits within the maximum size, which is the common case, the primary appends the data to its replica, tells the secondaries to write the data at the exact offset where it has, and finally [...].

If a record append fails at any replica, [...]. As a result, replicas of the same chunk may contain different [...]. [...] property follows readily from the simple observation that for the operation to report success, the data must have been written at the same offset [...]. Furthermore, after this, all replicas are at least as long as the end of record and therefore any future record will be assigned a higher offset or a different chunk even if a different [...]. In terms of our consistency guarantees, the regions in which successful record append operations have written their data are defined (hence consistent), whereas intervening regions are inconsistent (hence undefined). Our applications [...].

3.4 Snapshot
The snapshot operation makes a copy of a file or a directory tree (the “source”) almost instantaneously, [...]. Our users use it to quickly create branch copies of huge data sets (and often copies of those copies, recursively), or to checkpoint the current state before [...].

Like AFS [5], [...]. When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files [...]. This ensures that any subsequent writes to these chunks will require an interaction with the master to find [...]. This will give the master an opportunity to create a new copy of the chunk first.

After the leases have been revoked or have expired, [...] applies this log record to its in-memory state by duplicating the metadata for the source file [...]. The newly created snapshot files point to the same chunks as the source files.

The first time a client wants to write to a chunk C after the snapshot operation, it sends a request to the master to find [...]. [...] defers replying to the client request and instead picks a new chunk [...] C’. It then asks each chunkserver that has a current replica of C to create a new chunk called C’. By creating the new chunk on the same chunkservers as the original, we ensure that the data can be copied locally, not over the network (our disks are about three times as fast as our 100 Mb Ethernet links). From this point, request handling is no different from that for any chunk: the master grants one of the replicas a lease on the new chunk C’ and replies to the client, which can write the chunk normally, [...].

4. MASTER OPERATION
[...] In addition, it manages chunk replicas throughout the system: it makes placement decisions, creates new chunks and hence replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, [...]. [...] discuss each of these topics.

4.1 Namespace Management and Locking
Many master operations can take a long time: for example, a snapshot operation [...]. Therefore, we allow multiple operations to be active [...].

[...] many traditional file systems, GFS does not have a per-directory data structure that lists all the files [...]. [...] does it support aliases for the same file or directory (i.e., hard or symbolic links in Unix terms). GFS logically represents [...]. With prefix compression, this table can be efficiently [...]. Each node in the namespace tree (either an absolute file name or an absolute directory name) [...].

[...] Typically, if it involves /d1/d2/.../dn/leaf, it will acquire read-locks on the directory names /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read lock or a write lock on the full pathname /d1/d2/.../dn/[...] that leaf may be a file [...].

[...] illustrate how this locking mechanism can prevent a file /home/user/foo from being created while /home/user is being snapshotted to /save/[...]. The snapshot operation acquires read locks on /home and /save, and write locks on /home/user and /save/[...]. The file creation acquires read locks on /home and /home/user, and a write lock on /home/user/[...]. [...] operations will be serialized properly because they try to obtain conflicting locks on /home/[...]. File creation does not require a write lock on the parent directory because there is no “directory”, or inode-like, data structure to be protected from modification. The read lock on the name is sufficient [...].

One nice property of this locking [...]. For example, multiple file creations can be executed concurrently in the same directory: each acquires a read lock on the directory name and a write lock on the file name. The read lock on the directory name suffices to prevent the directory from being deleted, renamed, [...]. The write locks on

file names serialize attempts to create a file [...].

Since the namespace can have many nodes, read-[...]. [...], locks are acquired in a consistent total order to prevent deadlock: they are first ordered by level in the namespace tree and lexicographically within the same level.

4.2 [...]
[...] chunkservers in turn may be accessed from hundreds of clients from the same or different [...]. Communication between two machines on different [...]. Additionally, bandwidth into or out of a rack [...]. [...]-level distribution presents a unique challenge to distribute data for scalability, reliability, [...].

The chunk replica placement policy serves two purposes: maximize data reliability and availability, [...]. For both, it is not enough to spread replicas across machines, which only guards against disk or machine failures and fully utilizes each machine’s [...]. This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit). It also means that traffic, especially reads, [...]. On the other hand, write traffic has to flow through multiple racks, a tradeoff [...].

4.3 Creation, Re-replication, Rebalancing
Chunk replicas are created for three reasons: chunk creation, re-replication, [...].

When the master creates a chunk, [...]. It considers several factors. (1) We want to place [...]. Over time this will equalize disk utilization across chunkservers. (2) We want to limit the number of “recent” [...]. Although creation itself is cheap, it reliably predicts imminent heavy write traffic because chunks are created when demanded by writes, and in our append-once-read-many workload they typically become practically read-only once they have been completely written. (3) As discussed above, [...].

The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified [...]. This could happen for various reasons: a chunkserver becomes unavailable, it reports that its replica may be corrupted, one of its disks is disabled because of errors, [...]. For example, we give higher priority [...]. In addition, we prefer to first re-replicate chunks for live files as opposed to chunks that belong to recently deleted files (see Section 4.4). Finally, to minimize the impact of failures on running applications, we boost the priority of any chunk that is blocking client progress.

The master picks the highest priority chunk and “clones” it by instructing some [...]. [...] replica is placed with goals similar to those for creation: equalizing disk space utilization, limiting active clone operations on any single chunkserver, [...]. [...] cloning traffic from overwhelming client traffic, the master limits the number [...]. Additionally, each chunkserver limits the amount of bandwidth it spends on each [...].

Finally, the master rebalances replicas periodically: it examines the current replica [...]. [...] through this process, the master gradually fills up a new chunkserver rather than instantly swamping it with new chunks and the heavy write traffic [...]. In addition, [...]. In general, it prefers to remove those on chunkservers with below-average free space so as to equalize disk space usage.

4.4 Garbage Collection
After a file is deleted, [...] so only lazily during regular garbage collection at both the file [...]. [...] find that this approach makes the system much simpler and more reliable.

4.4.1 Mechanism
When a file is deleted by the application, [...]. However, instead of reclaiming resources immediately, the file [...]. [...] the master’s regular scan of the file system namespace, it removes any such hidden files if they have existed for more than three days (the interval is configurable). Until then, the file can still be read under the new, [...]. When the hidden file is removed from the namespace, [...].

In a similar regular scan of the chunk namespace, the master identifies [...], those not reachable from any file) [...]. In a HeartBeat message regularly exchanged with the master, each chunkserver reports a subset of the chunks it has, and the master replies with the identity of all chunks that are no longer present in the master’s [...]. The chunkserver is free to delete its replicas of such chunks.

4.4.2 Discussion
Although distributed garbage collection is a hard problem that demands complicated solutions in the context of programming languages, [...]. [...] easily identify all references to chunks: they are in the file [...]. [...] also easily identify all the chunk replicas: they are Linux files [...]. [...] such replica not known to the master is “garbage.”
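The lazy reclamation described in Section 4.4.1 can be sketched as a small in-memory model. This is a simplified illustration, not the master's implementation: the class, its method names and the hidden-name format are hypothetical, and time is passed in explicitly as seconds so the behaviour is easy to follow.

```python
THREE_DAYS = 3 * 24 * 3600  # grace period from the text, in seconds


class Master:
    """Toy model of deletion, namespace scanning and HeartBeat replies."""

    def __init__(self, grace_seconds=THREE_DAYS):
        self.grace_seconds = grace_seconds
        self.files = {}   # visible file name -> list of chunk handles
        self.hidden = {}  # hidden file name -> (deletion time, chunk handles)

    def delete(self, name, now):
        """Deleting a file only moves it to a hidden name; nothing is freed."""
        chunks = self.files.pop(name)
        # The hidden-name format below is an assumption for illustration.
        self.hidden[f"{name}.deleted.{now}"] = (now, chunks)

    def scan_namespace(self, now):
        """Remove hidden files that have existed longer than the grace period."""
        expired = [hidden_name for hidden_name, (deleted_at, _) in self.hidden.items()
                   if now - deleted_at > self.grace_seconds]
        for hidden_name in expired:
            del self.hidden[hidden_name]

    def known_chunks(self):
        """Chunks still reachable from a visible or hidden file."""
        known = set()
        for chunks in self.files.values():
            known.update(chunks)
        for _, chunks in self.hidden.values():
            known.update(chunks)
        return known

    def heartbeat_reply(self, reported_chunks):
        """Return the reported chunks the master no longer knows about.

        The chunkserver is then free to delete its replicas of these chunks.
        """
        return set(reported_chunks) - self.known_chunks()
```

In this model a deleted file's chunks stay protected until the namespace scan drops the hidden file; only after that does a HeartBeat reply name them as garbage. Replicas the master never knew about are reported as garbage immediately.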

The garbage collection approach to storage reclamation offers [...]. [...], it is simple and reliable [...]. [...] creation may succeed on some chunkservers but not others, [...]. [...] deletion messages may be lost, and the master has to remember to resend them across failures, both its own and the chunkserver’s. [...] collection provides [...]. [...], it merges storage reclamation into the regular background activities of the master, [...]. [...], the delay in reclaiming storage provides a safety net against accidental, [...].

In our experience, the main disadvantage is that the delay sometimes hinders user effort to fine [...]. Applications that repeatedly create and delete temporary files [...]. [...] address these issues by expediting storage reclamation if a deleted file [...]. [...] allow users to apply different replication and reclamation policies to different [...]. For example, users can specify that all the chunks in the files within some directory tree are to be stored without replication, and any deleted files are immediately and irrevocably removed from the file [...].

4.5 Stale Replica Detection
Chunk replicas may become stale [...]. For each chunk, the master maintains a [...].

Whenever the master grants a new lease on a chunk, [...]. The master and [...]. [...] occurs before any client is notified [...]. If another replica is currently unavailable, [...]. The master will detect that this chunkserver has a stale replica when the chunkserver [...]. If the master sees a version number greater than the one in its records, the master assumes that it failed [...].

[...] that, it effectively considers a stale replica [...]. As another safeguard, the master includes the chunk version number when it informs clients which chunkserver holds a lease on a chunk or when it instructs a chunkserver [...]. The client or the chunkserver verifies the version number when [...].

5. FAULT TOLERANCE AND DIAGNOSIS
One of our greatest challenges [...] quality and [...] quantity of components together make these problems more the norm than the exception: we cannot completely trust the machines, [...]. Component failures can result in an unavailable system or, worse, [...]. [...] discuss how we meet these challenges and the tools we have built into the system to diagnose problems when they inevitably occur.

5.1 High Availability
Among hundreds of servers in a GFS cluster, [...] the overall system highly available with two simple yet effective strategies: fast recovery and replication.

5.1.1 Fast Recovery
Both the master and the chunkserver are designed [...], we do not distinguish between normal and abnormal termination; [...]. Clients and other servers experience a minor hiccup as they time out on their outstanding requests, reconnect to the restarted server, [...]. Section 6.2.2 reports observed startup times.

5.1.2 Chunk Replication
As discussed earlier, each chunk is replicated on multiple chunkservers on different [...]. [...] can specify different replication levels for different parts of the file [...]. The master clones existing replicas as needed to keep each chunk fully replicated as chunkservers go offline or detect corrupted replicas through checksum verification (see Section 5.2). Although replication has served us well, we are exploring other forms of cross-server redundancy such as [...]. [...] expect that it is challenging but manageable to implement these more complicated redundancy schemes in our very loosely coupled system because our traffic is dominated by appends and reads rather than small random writes.

5.1.3 [...]
[...] mutation to the state is considered committed only after its log record has been flushed [...]. For simplicity, one master process remains in charge of all mutations as well as background [...]. [...] fails, [...]. If its machine or disk fails, monitoring infrastructure outside GFS [...]. Clients use only the canonical name of the master ([...]-test), which is [...].

[...], “shadow” masters provide read-only access to the file [...]. They are shadows, not mirrors, in that they may lag the primary slightly, [...]. They enhance read availability for files that are not being actively [...], since file content is read from chunkservers, applications do not observe stale file [...].

[...] stale within short windows is file metadata, [...].

[...] itself informed, a shadow master reads a replica of the growing operation log and applies the [...]. Like the primary, it polls chunkservers at startup (and infrequently thereafter) to locate chunk replicas and [...]. It depends on the primary master only for replica location updates resulting from the primary’s decisions to create and delete replicas.

5.2 Data Integrity
[...] that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. (See Section 7 for one cause.) We can recover from corruption using other chunk replicas, but it would be impractical [...]. [...], divergent replicas may be legal: the semantics of GFS mutations, in particular atomic record append as discussed earlier, [...]. Therefore, each chunkserver must independently [...]. [...] other metadata, checksums are kept in memory and stored persistently with logging, [...].

For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data to the requester, [...]. If a block does not match the recorded checksum, the chunkserver [...]. In response, the requestor will read from other replicas, [...]. [...] valid new replica is in place, the master [...].

Checksumming has little effect [...]. [...] most of our reads span at least a few blocks, we need to read and checksum only a relatively small amount of extra data for verification. [...] client code further reduces [...]. [...], checksum lookups and comparison on the chunkserver are done without any I/O, and checksum calculation can often be overlapped with I/O [...].

Checksum computation is heavily optimized for writes that append to the end of a chunk (as opposed to writes that overwrite existing data) [...] incrementally update the checksum for the last partial checksum block, and compute new checksums for any brand new checksum blocks filled [...]. [...] the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, [...].

In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally [...] not verify the first and last blocks before overwriting them partially, the new checksums [...].

[...] idle periods, [...]. Once corruption is detected, the master [...]. This prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk.

5.3 Diagnostic Tools
Extensive and detailed diagnostic logging has helped immeasurably in problem isolation, debugging, and performance analysis, [...]. Without logs, it is hard to understand transient, [...]. [...] servers generate diagnostic logs that record many significant events (such as chunkservers going up and down) [...]. These diagnostic logs can be freely deleted without affecting [...]. However, [...] logs include the exact requests and responses sent on the wire, except for the file [...]. By matching requests with replies and collating RPC records on different machines, [...].

The performance impact of logging is minimal (and far outweighed by the benefits) [...]. The most recent events [...].

6. MEASUREMENTS
In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation, and also some numbers from real clusters in use at Google.

6.1 Micro-benchmarks
We measured performance on a GFS cluster consisting of one master, two master replicas, 16 chunkservers, [...]. Note that this configuration [...]. [...] machines are configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, [...]. [...] 19 GFS server machines are connected to one switch, [...]. [...] switches are connected with a 1 Gbps link.

6.1.1 Reads
N clients read simultaneously from the file system. Each client reads a randomly selected 4 MB region from a 320 GB file [...]. The chunkservers taken together have only 32 GB of memory, so we expect at most a 10% hit rate in the Linux buffer cache. Our results should be close to cold cache results.
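The read-path verification described in Section 5.2 (check every checksum block that overlaps the requested range before returning any data) can be sketched as follows. Two choices here are assumptions made for illustration: the 64 KB block size, and the use of CRC32 from Python's standard `zlib` module as a stand-in 32-bit checksum. The function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumed checksum block size (64 KB)


def block_checksums(data, block_size=BLOCK_SIZE):
    """Compute one CRC32 per fixed-size block of the chunk data."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def verified_read(data, checksums, offset, length, block_size=BLOCK_SIZE):
    """Return data[offset:offset+length] after verifying overlapping blocks.

    Raises OSError if any block that overlaps the read range does not match
    its recorded checksum, so corrupt bytes are never handed to the caller.
    """
    end = min(offset + length, len(data))
    if offset >= end:
        return b""
    first_block = offset // block_size
    last_block = (end - 1) // block_size
    for block in range(first_block, last_block + 1):
        start = block * block_size
        if zlib.crc32(data[start:start + block_size]) != checksums[block]:
            raise OSError(f"checksum mismatch in block {block}")
    return data[offset:end]
```

Only the blocks touched by a read are verified, which is why a read aligned to block boundaries pays for no extra data, and why corruption in a block that is never read stays undetected until a background scan or a later read covers it.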

[...] Figure 3(a) [...]. It peaks at an aggregate of 125 MB/s when the 1 Gbps link between the two switches is saturated, or 12.5 MB/s per client when its 100 Mbps network interface gets saturated, [...]. The observed read rate is 10 MB/s, or 80% of the per-client limit, [...]. The aggregate read rate reaches 94 MB/s, about 75% of the 125 MB/s link limit, for 16 readers, or 6 MB/s per client. The efficiency drops from 80% to 75% because as the number of readers increases, so does the probability [...].

6.1.2 Writes

N clients write simultaneously to N distinct files. Each client writes 1 GB of data to a new file [...]. The aggregate write rate and its theoretical limit are shown in Figure 3(b). The limit plateaus at 67 MB/s because we need to write each byte to 3 of the 16 chunkservers, each with a 12.5 MB/s [...]. The write rate for one client is 6.3 MB/s, [...] not interact very [...]. Aggregate write rate reaches 35 MB/s for 16 clients (or 2.2 MB/s per client), [...]. As in the case of reads, it becomes more likely that multiple clients write [...]. Moreover, collision is more likely for 16 writers than for 16 readers because each write involves three different [...]. In practice this has not been a major problem because even though it increases the latencies as seen by individual clients, it does not significantly affect the aggregate write bandwidth delivered by the system to a large number of clients.

6.1.3 Record Appends

Figure 3(c) [...]. [...] clients append simultaneously to a single file. Performance is limited by the network bandwidth of the chunkservers that store the last chunk of the file, [...]. It starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients, mostly due to congestion and variances in network transfer rates seen by different [...]. Our applications tend to produce multiple such files [...]. In other words, N clients append to M shared files [...]. Therefore, the chunkserver network congestion in our experiment is not a significant issue in practice because a client can make progress on writing one file while the chunkservers for another file are busy.

6.2 Real World Clusters

We now examine two clusters [...]. Cluster A is [...] through a few MBs to a few TBs of data, transforms or analyzes the data, [...]. [...] tasks last much longer and continuously generate [...]. [...] cases, a single “task” consists of many processes on many machines reading and writing many files simultaneously.

Cluster                     A        B
Chunkservers                342      227
Available disk space        72 TB    180 TB
Used disk space             55 TB    155 TB
Number of Files             735 k    737 k
Number of Dead files        22 k     232 k
Number of Chunks            992 k    1550 k
Metadata at chunkservers    13 GB    21 GB
Metadata at master          48 MB    60 MB

Table 2: Characteristics of two GFS clusters

6.2.1 Storage

As shown by the first five entries in the table, both clusters have hundreds of chunkservers, support many TBs of disk space, and are fairly but not completely full. “Used space” [...]. Virtually all files [...]. Therefore, the clusters store 18 TB and 52 TB of file data [...]. [...] clusters have similar numbers of files, though B has a larger proportion of dead files, namely files which were deleted [...]. [...] has more chunks because its files tend to be larger.

6.2.2 Metadata

The chunkservers in aggregate store tens of GBs of metadata, [...]. The only other metadata [...]. The metadata kept at the master is much smaller, only tens of MBs, or about 100 bytes per file [...]. This agrees with our assumption that the size of the master’s memory does not limit the system’s [...]. [...] the per-file metadata is the file names stored in a prefix [...]. Other metadata includes file ownership and permissions, mapping from files to chunks, and each chunk’s [...]. In addition, for each chunk we store the current [...].

Each individual server, both chunkservers and the master, [...]. Therefore recovery is fast: it takes only a few seconds [...]. However, the master is somewhat hobbled for a period, typically 30 to 60 seconds, until it has fetched chunk location information from all chunkservers.

6.2.3 Read [...]

[...] clusters had been up for about one week when these measurements were taken. (The clusters had been restarted recently to upgrade to a new version of GFS.)

The average write rate was less than 30 MB/s [...]. When we took these measurements, B was in the middle of a burst of write activity generating about 100 MB/s of data, which produced a 300 MB/s network load because writes are propagated to three replicas.
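The theoretical limits quoted for the micro-benchmarks, and the 3x network amplification of writes, follow from simple link arithmetic. The sketch below is only an illustration of that arithmetic: the function names are hypothetical, and the defaults (100 Mbps machine links, a 1 Gbps inter-switch link, 16 chunkservers, 3 replicas) are the figures quoted in the text.

```python
"""Back-of-the-envelope network limits for the benchmark setup (illustrative)."""

BITS_PER_BYTE = 8


def mbps_to_mb_per_s(link_mbps: float) -> float:
    """Convert a link speed in megabits/s to megabytes/s."""
    return link_mbps / BITS_PER_BYTE


def aggregate_read_limit(n_clients: int,
                         client_link_mbps: float = 100.0,
                         inter_switch_mbps: float = 1000.0) -> float:
    """Readers are capped by their own NICs or by the shared inter-switch link."""
    per_client = mbps_to_mb_per_s(client_link_mbps)
    shared = mbps_to_mb_per_s(inter_switch_mbps)
    return min(n_clients * per_client, shared)


def aggregate_write_limit(n_chunkservers: int = 16,
                          chunkserver_link_mbps: float = 100.0,
                          replicas: int = 3) -> float:
    """Every byte lands on `replicas` chunkservers, so the total inbound
    capacity of the chunkservers is divided by the replication factor."""
    total_inbound = n_chunkservers * mbps_to_mb_per_s(chunkserver_link_mbps)
    return total_inbound / replicas


def replication_network_load(write_rate_mb_s: float, replicas: int = 3) -> float:
    """Network traffic generated by a given application-level write rate."""
    return write_rate_mb_s * replicas


if __name__ == "__main__":
    print(aggregate_read_limit(1))          # per-client read limit, MB/s
    print(aggregate_read_limit(16))         # link-bound read limit, MB/s
    print(round(aggregate_write_limit()))   # write plateau, MB/s
    print(replication_network_load(100))    # load from 100 MB/s of writes
```

With these inputs the read limit is 12.5 MB/s for one client and 125 MB/s for 16, the write limit rounds to 67 MB/s, and 100 MB/s of writes produces 300 MB/s of network traffic, matching the numbers in the text.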

Figure 3: [...] error bars that show 95% confidence intervals, [...]. Panels: (a) Reads, (b) Writes, (c) Record appends. Each panel plots the aggregate rate (MB/s) and the network limit against the number of clients N.

Cluster                        A           B
Read rate (last minute)        583 MB/s    [...] MB/s
Read rate (last hour)          562 MB/s    [...] MB/s
Read rate (since restart)      589 MB/s    [...] MB/s
Write rate (last minute)       [...] MB/s  [...] MB/s
Write rate (last hour)         [...] MB/s  [...] MB/s
Write rate (since restart)     [...] MB/s  [...] MB/s
Master ops (last minute)       [...] Ops/s [...] Ops/s
Master ops (last hour)         [...] Ops/s [...] Ops/s
Master ops (since restart)     [...] Ops/s 347 Ops/s

Table 3: Performance Metrics [...]

[...] In particular, A had been sustaining a read rate of 580 MB/s [...]. Its network configuration can support 750 MB/s, so it was using its resources efficiently. Cluster B can support peak read rates of 1300 MB/s, but its applications were using just 380 MB/s.

6.2.4 Master Load

Table 3 also shows that the rate [...]. The master can easily keep up with this rate, [...].

In an earlier version of GFS, [...]. It spent most of its time sequentially scanning through large directories (which contained hundreds of thousands of files) looking for particular files. We have since changed the master data structures to allow efficient [...]. [...] now easily support many thousands of file [...]. If necessary, we could speed it up further by placing name lookup caches in front of the namespace data structures.

6.3 Workload Breakdown

In this section, we present a detailed breakdown of the workloads [...]. Cluster X is for research and development while cluster Y is for production data processing.

6.3.1 Methodology and Caveats

These results include only client originated requests so that they reflect the workload generated by our applications for the file [...]. They do not include inter-server requests to carry out client requests or internal background activities, [...].

Statistics on I/O operations are based on information he[...]. For example, GFS client code may break a read into multiple RPCs to increase parallelism, [...]. [...] our access patterns are highly stylized, [...]. Explicit logging by applications might have provided slightly more accurate data, but it is logistically impossible to recompile and restart thousands of running c[...].

[...] Google completely controls both GFS and its applications, the applications tend to be tuned for GFS, [...]. [...] mutual influence may also exist between general applications [...]

6.2.5 Recovery Time

After a chunkserver fails, some chunks will become [...]. [...] experiment, [...]. The chunkserver had about 15,[...]. To limit the impact on running applications and provide leeway for scheduling decisions, our default parameters limit this cluster to 91 concurrent clonings (40% of the number of chunkservers) where each clone operation is allowed to consume at most 6.25 MB/s (50 Mbps). All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s.

In another experiment, we killed two chunkservers each with roughly 16,[...]. [...] 266 chunks were cloned at a higher priority, and were all restored to at least 2x replication within 2 minutes, thus putting the cluster in a state where it could tolerate another chunkserver failure without data loss.
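The throttling parameters quoted for the recovery experiment in Section 6.2.5 can be cross-checked with a few lines of arithmetic. This is a rough sketch, not the authors' tooling: the function names are hypothetical, and the figure of 227 chunkservers is an assumption taken from the Cluster B column of Table 2, chosen because 40% of 227 rounds to the quoted 91 clonings.

```python
"""Sanity-check arithmetic for the chunk re-replication experiment (illustrative)."""


def concurrent_clone_cap(n_chunkservers: int, fraction: float = 0.40) -> int:
    """Cap on simultaneous clone operations as a fraction of chunkservers."""
    return round(n_chunkservers * fraction)


def clone_bandwidth_ceiling_mb_s(n_clones: int,
                                 per_clone_mbps: float = 50.0) -> float:
    """Upper bound on aggregate cloning traffic if every clone ran at its cap."""
    return n_clones * per_clone_mbps / 8  # 50 Mbps == 6.25 MB/s


def data_restored_gb(rate_mb_s: float, minutes: float) -> float:
    """Volume moved at a sustained rate over a duration (1 GB = 1000 MB)."""
    return rate_mb_s * minutes * 60 / 1000


if __name__ == "__main__":
    # Assumption: 227 chunkservers, as listed for Cluster B in Table 2.
    clones = concurrent_clone_cap(227)
    print(clones)
    print(clone_bandwidth_ceiling_mb_s(clones))
    # 440 MB/s sustained for 23.2 minutes, both figures quoted in the text.
    print(data_restored_gb(440, 23.2))
```

Under these assumptions the cap works out to 91 clonings with a ceiling of 568.75 MB/s, so the reported 440 MB/s sits below the configured maximum, and 23.2 minutes at that rate corresponds to roughly 612 GB of re-replicated data.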

[...] file systems, but the eff[...]

Operation        Read            Write           Record Append
Cluster          X       Y       X       Y       X       Y
0K               0.4     2.6     0       0       0       0
1B..1K           0.1     4.1     6.6     4.9     0.2     9.2
1K..8K           65.2    38.5    0.4     1.0     18.9    15.2
8K..64K          29.9    45.1    17.8    43.0    78.0    2.8
64K..128K        0.1     0.7     2.3     1.9     <.1     4.3
128K..256K       0.2     0.3     31.6    0.4     <.1     10.6
256K..512K       0.1     0.1     4.2     7.7     <.1     31.2
512K..1M         3.9     6.9     35.5    28.7    2.2     25.5
1M..inf          0.1     1.8     1.5     12.3    0.7     2.2

Table 4: Operations Breakdown by Size (%). For reads, the size is the amount of data actually read and transferred, [...].

Operation        Read            Write           Record Append
Cluster          X       Y       X       Y       X       Y
1B..1K           <.1     <.1     <.1     <.1     <.1     <.1
1K..8K           13.8    3.9     <.1     <.1     <.1     0.1
8K..64K          11.4    9.3     2.4     5.9     2.3     0.3
64K..128K        0.3     0.7     0.3     0.3     22.7    1.2
128K..256K       0.8     0.6     16.5    0.2     <.1     5.8
256K..512K       1.4     0.3     3.4     7.7     <.1     38.4
512K..1M         65.9    55.1    74.1    58.0    .1      46.8
1M..inf          6.4     30.1    3.3     28.0    53.9    7.4

Table 5: Bytes Transferred Breakdown by Operation Size (%). For reads, the size is the amount of data actually read and transferred, [...] may differ if the read attempts to read beyond end of file, [...].

6.3.2 [...]

[...] small reads (under 64 KB) come from seek-intensive clients that look up small pieces of data within huge files. The large reads (over 512 KB) come from long sequential reads through entire files.

[...] applications, especially those in the production systems, often use files [...]. Producers append concurrently to a file while a consumer reads the end of file. Occasionally, [...]. Cluster X shows this less often because it is usually used for short-li[...].

[...] large writes (over 256 KB) typically result from significant buffering [...]. Writers that buffer less data, checkpoint or synchronize more often, or simply generate less data account for the smaller writes (under 64 KB).

As for record appends, cluster Y sees a much higher percentage of large record appends than cluster X does because our production systems, which use cluster Y, [...].

[...] kinds of operations, the larger operations (over 256 KB) [...]. Small reads (under 64 KB) do transfer a small but significant [...].

6.3.3 Appends versus Writes

[...] For cluster X, the ratio of writes to record appends is 108:1 by bytes transferred and 8:1 [...]. For cluster Y, used by the production systems, the ratios are 3.7:1 and 2.5:1 [...]. Moreover, these ratios [...]. For cluster X, however, the overall usage of record append during the measured period is fairly low and so the results are likely skewed by one or two applications with particular buffer [...].

As expected, our [...] approximates the case where a client de[...]. For cluster X, overwriting accounts for under 0.0001% of bytes mutated and under 0.0003% [...]. For cluster Y, the ratios are both 0.05%. Although this is minute, [...]. It turns out that [...] are not part of the workload per se but a consequence of the retry mechanism.

Cluster                X       Y
Open                   26.1    16.3
Delete                 0.7     1.5
FindLocation           64.3    65.8
FindLeaseHolder        7.8     13.4
FindMatchingFiles      0.6     2.2
All other combined     0.5     0.8

Table 6: Master Requests Breakdown by Type (%)

6.3.4 M[...]

[...] requests ask for chunk locations (FindLocation) for reads and lease holder information (FindLeaseLocker) [...].

Clusters X and Y see significantly different numbers of Delete requests because cluster Y stores production [...]. [...] this difference is further hidden in the difference in Open requests because an old version of a file may be implicitly deleted by being opened for write from scratch (mode “w” in Unix open terminology).

FindMatchingFiles is a pattern matching request that supports “ls” and similar file [...]. [...] other requests for the master, [...]. Cluster Y sees it much more often because automated data processing tasks tend to examine parts of the file [...]. In contrast, cluster X’s applications are under more explicit user control and usually know the names of all needed files in advance.

7. EXPERIENCES

In the process of building and deploying GFS, we have experienced a variety of issues, some operational and some technical.

Initially, GFS was conceived as the backend file [...]. Over time, [...]. It started with little support for things [...]. [...] production systems are well disciplined and controlled, [...].

[...] our disks claimed to the Linux driver that they supported a range of IDE protocol [...]. [...] the protocol versions are very similar, these drives mostly worked, but occasionally the mismatches would cause the drive and the kernel to disagree about the drive’s [...]. This problem motivated our use of checksums to detect data corruption, while concurrently we modified [...].

Earlier we had some problems with Linux 2.2 kernels due to the cost of fsync(). Its cost is proportional to the size of the file rather than the size of the modified [...]. This was a problem for [...]. [...] worked around th[...].

Another Linux problem was a single reader-writer lock which any thread in an address space must hold when it pages in from disk (reader lock) or modifies the address space in an mmap() call (writer lock). We saw transient timeouts in our system under light load [...]. Eventually, we found that this single lock blocked the primary network thread from mapping new data [...]. Since we are mainly limited by the network interface rather than by memory copy bandwidth, we worked around this by replacing mmap() with pread() [...].

Despite occasional problems, the availability of Linux [...]. [...] appropriate, [...].

8. RELATED WORK

[...] other large distributed file systems such as AFS [5], GFS provides a location independent namespace which [...]. Unlike AFS, GFS spreads a file’s data across storage servers in a way more akin to xFS [1] and Swift [3] [...].

[...] disks are relatively cheap and replication is simpler than more sophisticated RAID [9] approaches, GFS currently uses only re[...].

In contrast to systems like AFS, xFS, Frangipani [12], and Intermezzo [6], GFS does not provide any caching below the file [...]. Our target workloads have little reuse within a single application run because they either stream through a larg[...].

[...] distributed file systems like Frangipani, xFS, Minnesota’s GFS [11] and GPFS [10] [...] for the centralized approach in order to simplify the design, increase its reliability, and gain flexibility. In particular, a centralized master makes it much easier to implement sophisticated chunk placement and replication policies since the master [...]. [...] address fault tolerance [...]. Scalability and high availability (for reads) [...]. Therefore we could adapt a primary-copy scheme like the one in Harp [7] to provide high av[...].

[...] addressing a problem similar to Lustre [8] [...]. However, we have simplified the problem significantly by focusing on the needs of our applications rather than building a POSIX-compliant file [...]. Additionally, GFS assumes large numb[...].

[...] most closely resembles the NASD architecture [4]. While the NASD architecture is based on network-attached disk drives, GFS uses commodity machines as chunkservers, [...] the NASD work, our chunkservers use lazily allocated fi[...]. Additionally, GFS implements features such as rebalancing, replication, [...].

[...] Minnesota’s GFS and NASD, [...]. [...] focus on addressing day-to-day data processing needs [...].

The producer-consumer queues enabled by atomic record appends address a similar problem as the distributed queues in River [2]. While River uses memory-based queues distributed across machines and careful data flow control, GFS uses a persistent file [...]. The River model supports m-to-n distributed queues but lacks the fault tolerance that comes with persistent storage, while GFS only supports m-to-1 queues efficiently. Multiple consumers can read the same file, [...].

9. CONCLUSIONS

The Google File System demonstrates the qualities essential for [...]. While some design decisions are specific to our unique setting, many may app[...].

We started by reexamining traditional file system assumptions in light of our curre[...]. Our observations have led to radically different [...]. We treat component failures as the norm rather than the exception, optimize for huge files that are mostly appended to (perhaps concurrently) and then read (usually sequentially), and both extend and relax the standard file [...].

Our system provides fault tolerance by constant monitoring, replicating crucial data, [...]. [...] replication allows us to tolerate chunkserver [...]

[...] frequency of these failures motivated a novel online repair mechanism that regularly and transparently [...]. Additionally, we use checksumming to detect data corruption at the disk or IDE subsystem level, [...].

Our design delivers high aggregate throu[...]. We achieve this by separating file system control, which passes through the master, from data transfer, [...] involvement in common operations is minimized by a large chunk size and by chunk leases, [...]. This makes possible a simple, [...]. We believe that improvements in our networking stack will lift [...].

[...] successfully met our storage needs and is widely used within Google as the storage pl[...] important tool that enables [...].

ACKNOWLEDGMENTS

We wish to th[...]rshad (our shepherd) [...] Acharya, Jeff Dean, [...]ng, Urs Hoelzle, Max Ibel, Sharon Perl, Rob Pike, [...] our colleagues at Google bravely trusted their data to a new fi[...].

REFERENCES

[1] Thomas Anderson, Michael Dahlin, Jeanna Neefe, David Patterson, Drew Roselli, [...]. [...]less network fi[...]. In Proceedings of the 15th ACM Symposium on Operating System Principles, pages 109–126, Copper Mountain Resort, Colorado, December 1995.
[2] [...]-Dusseau, Eric Anderson, Noah Treuhaft, [...], [...]stein, David Patterson, [...]. [...]r I/O with River: [...]. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems (IOPADS ’99), pages 10–22, Atlanta, Georgia, May 1999.
[3] [...]: Using distributed disk striping to provide high I/[...]. [...]er Systems, 4(4):405–436, 1991.
[4] [...], [...], Khalil Amiri, Jeff Butler, [...], Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, [...]. [...]-effective, [...]. In Proceedings of the 8th Architectural Support for Programming Languages and Operating Systems, pages 92–103, San Jose, California, October 1998.
[5] John Howard, Michael Kazar, Sherri Menees, David Nichols, Mahadev Satyanarayanan, Robert Sidebotham, [...]. [...]nd performance in a distributed fi[...]. [...] Transactions on Computer Systems, 6(1):51–81, February 1988.
[6] [...]://, 2003.
[7] Barbara Liskov, Sanjay Ghemawat, Robert Gruber, Paul Johnson, Liuba Shrira, [...]. [...]ation in the Harp fi[...]. In 13th Symposium on Operating System Principles, pages 226–238, Pacific Grove, CA, October 1991.
[8] [...]://org, 2003.
[9] [...]son, [...], [...]. [...]or redundant arrays of inexpensive disks (RAID). In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pages 109–116, Chicago, Illinois, September 1988.
[10] [...]: A shared-disk fi[...]. In Proceedings of the First USENIX Conference on File and Storage Technologies, pages 231–244, Monterey, California, January 2002.
[11] [...], [...], and Matthew T. O’[...]. [...]. In Proceedings of the Fifth NASA Goddard Space Flight Center Conference on Mass Storage Systems and Technologies, College Park, Maryland, September 1996.
[12] [...]th, Timothy Mann, [...]. [...]pani: A scalable distributed fi[...]. In Proceedings of the 16th ACM Symposium on Operating System Principles, pages 224–237, Saint-Malo, France, October 1997.

