GFS: Evolution on Fast-forward
by Marshall Kirk McKusick, Sean Quinlan | August 7, 2009
A discussion between Kirk McKusick and Sean Quinlan about the origin and evolution of the Google File System.
During the early stages of development at Google, the initial thinking did not include plans for building a new file system. While work was still being done on one of the earliest versions of the company's crawl and indexing system, however, it became quite clear to the core engineers that they really had no other choice, and GFS (Google File System) was born.
First, given that Google's goal was to build a vast storage network out of inexpensive commodity hardware, it had to be assumed that component failures would be the norm—meaning that constant monitoring, error detection, fault tolerance, and automatic recovery would have to be an integral part of the file system. Also, even by Google's earliest estimates, the system's throughput requirements were going to be daunting by anybody's standards—featuring multi-gigabyte files and data sets containing terabytes of information and millions of objects. Clearly, this meant traditional assumptions about I/O operations and block sizes would have to be revisited. There was also the matter of scalability. This was a file system that would surely need to scale like no other. Of course, back in those earliest days, no one could have possibly imagined just how much scalability would be required. They would learn about that soon enough.
Still, nearly a decade later, most of Google's mind-boggling store of data and its ever-growing array of applications continue to rely upon GFS. Many adjustments have been made to the file system along the way, and—together with a fair number of accommodations implemented within the applications that use GFS—they have made the journey possible.
To explore the reasoning behind a few of the more crucial initial design decisions as well as some of the incremental adaptations that have been made since then, ACM asked Sean Quinlan to pull back the covers on the changing file-system requirements and the evolving thinking at Google. Since Quinlan served as the GFS tech leader for a couple of years and continues now as a principal engineer at Google, he's in a good position to offer that perspective. As a grounding point beyond the Googleplex, ACM asked Kirk McKusick to lead the discussion. He is best known for his work on BSD (Berkeley Software Distribution) Unix, including the original design of the Berkeley FFS (Fast File System).
The discussion starts, appropriately enough, at the beginning—with the unorthodox decision to base the initial GFS implementation on a single-master design. At first blush, the risk of a single centralized master becoming a bandwidth bottleneck—or, worse, a single point of failure—seems fairly obvious, but it turns out Google's engineers had their reasons for making this choice.
MCKUSICK One of the more interesting—and significant—aspects of the original GFS architecture was the decision to base it on a single master. Can you walk us through what led to that decision?
QUINLAN The decision to go with a single master was actually one of the very first decisions, mostly just to simplify the overall design problem. That is, building a distributed master right from the outset was deemed too difficult and would take too much time. Also, by going with the single-master approach, the engineers were able to simplify a lot of problems. Having a central place to control replication and garbage collection and many other activities was definitely simpler than handling it all on a distributed basis. So the decision was made to centralize that in one machine.
MCKUSICK Was this mostly about being able to roll out something within a reasonably short time frame?
QUINLAN Yes. In fact, some of the engineers who were involved in that early effort later went on to build BigTable, a distributed storage system, but that effort took many years. The decision to build the original GFS around the single master really helped get something out into the hands of users much more rapidly than would have otherwise been possible.
Also, in sketching out the use cases they anticipated, it didn't seem the single-master design would cause much of a problem. The scale they were thinking about back then was framed in terms of hundreds of terabytes and a few million files. In fact, the system worked just fine to start with.
MCKUSICK But then what?
QUINLAN Problems started to occur once the size of the underlying storage increased. Going from a few hundred terabytes up to petabytes, and then up to tens of petabytes... that really required a proportionate increase in the amount of metadata the master had to maintain. Also, operations such as scanning the metadata to look for recoveries all scaled linearly with the volume of data. So the amount of work required of the master grew substantially. The amount of storage needed to retain all that information grew as well.
In addition, this proved to be a bottleneck for the clients, even though the clients issue few metadata operations themselves—for example, a client talks to the master whenever it does an open. When you have thousands of clients all talking to the master at the same time, given that the master is capable of doing only a few thousand operations a second, the average client isn't able to command all that many operations per second. Also bear in mind that there are applications such as MapReduce, where you might suddenly have a thousand tasks, each wanting to open a number of files. Obviously, it would take a long time to handle all those requests, and the master would be under a fair amount of duress.
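A back-of-the-envelope sketch puts rough numbers on that bottleneck. Only the few-thousand-operations-per-second master rate comes from Quinlan's description; the fan-out figures below are illustrative assumptions.

```python
# Back-of-the-envelope sketch of the single-master bottleneck.
# Only master_ops_per_sec comes from the interview; the fan-out
# numbers below are illustrative assumptions.

master_ops_per_sec = 5_000     # "a few thousand operations a second"
mapreduce_tasks = 1_000        # "a thousand tasks" starting at once
files_opened_per_task = 100    # assumed fan-out per task

total_opens = mapreduce_tasks * files_opened_per_task
seconds_of_metadata_work = total_opens / master_ops_per_sec

print(f"{total_opens:,} opens at {master_ops_per_sec:,} ops/s "
      f"keeps the master busy for ~{seconds_of_metadata_work:.0f} s")
# => 100,000 opens at 5,000 ops/s keeps the master busy for ~20 s,
#    during which every other client's metadata requests queue up behind it.
```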
MCKUSICK Now, under the current schema for GFS, you have one master per cell, right?
QUINLAN That's correct.
MCKUSICK And historically you've had one cell per data center, right?
QUINLAN That was initially the goal, but it didn't work out like that to a large extent—partly because of the limitations of the single-master design and partly because isolation proved to be difficult. As a consequence, people generally ended up with more than one cell per data center. We also ended up doing what we call a "multi-cell" approach, which basically made it possible to put multiple GFS masters on top of a pool of chunkservers. That way, the chunkservers could be configured to have, say, eight GFS masters assigned to them, and that would give you at least one pool of underlying storage—with multiple master heads on it, if you will. Then the application was responsible for partitioning data across those different cells.
MCKUSICK Presumably each application would then essentially have its own master that would be responsible for managing its own little file system. Was that basically the idea?
QUINLAN Well, yes and no. Applications would tend to use either one master or a small set of the masters. We also have something we called Name Spaces, which are just a very static way of partitioning a namespace that people can use to hide all of this from the actual application. The Logs Processing System offers an example of this approach: once logs exhaust their ability to use just one cell, they move to multiple GFS cells; a namespace file describes how the log data is partitioned across those different cells and basically serves to hide the exact partitioning from the application. But this is all fairly static.
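The interview doesn't show what such a namespace file looks like, so the following is purely a hypothetical sketch of the idea: a static table mapping prefixes of a logical namespace onto GFS cells, consulted by a thin library so the application never sees the partitioning. The table format and cell names are invented for illustration.

```python
# Hypothetical sketch of static namespace partitioning across GFS cells.
# The table format and cell names are invented; the real namespace files
# are not described in the interview.

NAMESPACE_TABLE = {
    "/logs/frontend/": "gfs-cell-a",   # all frontend logs live in cell A
    "/logs/ads/":      "gfs-cell-b",   # ads logs were moved to cell B
    "/logs/":          "gfs-cell-a",   # default for anything else under /logs/
}

def resolve_cell(path: str) -> str:
    """Return the cell owning `path` by longest-prefix match."""
    best = max((p for p in NAMESPACE_TABLE if path.startswith(p)),
               key=len, default=None)
    if best is None:
        raise KeyError(f"no cell mapping for {path}")
    return NAMESPACE_TABLE[best]

# The application just asks for a path; the partitioning stays hidden.
print(resolve_cell("/logs/ads/2009-08-07/part-00042"))  # -> gfs-cell-b
```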
MCKUSICK What's the performance like, in light of all that?
QUINLAN We ended up putting a fair amount of effort into tuning master performance, and it's atypical of Google to put a lot of work into tuning any one particular binary. Generally, our approach is just to get things working reasonably well and then turn our focus to scalability—which usually works well in that you can generally get your performance back by scaling things. Because in this instance we had a single bottleneck that was starting to have an impact on operations, however, we felt that investing a bit of additional effort into making the master lighter weight would be really worthwhile. In the course of scaling from thousands of operations to tens of thousands and beyond, the single master had become somewhat less of a bottleneck. That was a case where paying more attention to the efficiency of that one binary definitely helped keep GFS going for quite a bit longer than would have otherwise been possible.
It could be argued that managing to get GFS ready for production in record time constituted a victory in its own right and that, by speeding Google to market, this ultimately contributed mightily to the company's success. A team of three was responsible for all of that—for the core of GFS—and for the system being readied for deployment in less than a year.
But then came the price that so often befalls any successful system—that is, once the scale and use cases have had time to expand far beyond what anyone could have possibly imagined. In Google's case, those pressures proved to be particularly intense.
Although organizations don't make a habit of exchanging file-system statistics, it's safe to assume that GFS is the largest file system in operation (in fact, that was probably true even before Google's acquisition of YouTube). Hence, even though the original architects of GFS felt they had provided adequately for at least a couple of orders of magnitude of growth, Google quickly zoomed right past that.
In addition, the number of applications GFS was called upon to support soon ballooned. In an interview with one of the original GFS architects, Howard Gobioff (conducted just prior to his surprising death in early 2008), he recalled, "The original consumer of all our earliest GFS versions was basically this tremendously large crawling and indexing system. The second wave came when our quality team and research groups started using GFS rather aggressively—and basically, they were all looking to use GFS to store large data sets. And then, before long, we had 50 users, all of whom required a little support from time to time so they'd all keep playing nicely with each other."
One thing that helped tremendously was that Google built not only the file system but also all of the applications running on top of it. While adjustments were continually made in GFS to make it more accommodating to all the new use cases, the applications themselves were also developed with the various strengths and weaknesses of GFS in mind. "Because we built everything, we were free to cheat whenever we wanted to," Gobioff neatly summarized. "We could push problems back and forth between the application space and the file-system space, and then work out accommodations between the two."
The matter of sheer scale, however, called for some more substantial adjustments. One coping strategy had to do with the use of multiple "cells" across the network, functioning essentially as related but distinct file systems. Besides helping to deal with the immediate problem of scale, this proved to be a more efficient arrangement for the operations of widely dispersed data centers.
Rapid growth also put pressure on another key parameter of the original GFS design: the choice to establish 64 MB as the standard chunk size. That, of course, was much larger than the typical file-system block size, but only because the files generated by Google's crawling and indexing system were unusually large. As the application mix changed over time, however, ways had to be found to let the system deal efficiently with large numbers of files requiring far less than 64 MB (think in terms of Gmail, for example). The problem was not so much with the number of files itself, but rather with the memory demands all of those files made on the centralized master, thus exposing one of the bottleneck risks inherent in the original GFS design.
MCKUSICK I gather from the original GFS paper [Ghemawat, S., Gobioff, H., Leung, S-T. 2003. The Google File System. SOSP (ACM Symposium on Operating Systems Principles)] that file counts have been a significant issue for you right along. Can you go into that a little bit?
QUINLAN The file-count issue came up fairly early because of the way people ended up designing their systems around GFS. Let me cite a specific example. Early in my time at Google, I was involved in the design of the Logs Processing System. We initially had a model where a front-end server would write a log, which we would then basically copy into GFS for processing and archival. That was fine to start with, but then the number of front-end servers increased, each rolling logs every day. At the same time, the number of log types was going up, and then you'd have front-end servers that would go through crash loops and generate lots more logs. So we ended up with a lot more files than we had anticipated based on our initial back-of-the-envelope estimates.
This became an area we really had to keep an eye on. Finally, we just had to concede there was no way we were going to survive a continuation of the sort of file-count growth we had been experiencing.
MCKUSICK Let me make sure I'm following this correctly: your issue with file-count growth is a result of your needing to have a piece of metadata on the master for each file, and that metadata has to fit in the master's memory.
QUINLAN That's correct.
MCKUSICK And there are only a finite number of files you can accommodate before the master runs out of memory?
QUINLAN Exactly. And there are two bits of metadata. One identifies the file, and the other points out the chunks that back that file. If you had a chunk that contained only 1 MB, it would take up only 1 MB of disk space, but it still would require those two bits of metadata on the master. If your average file size ends up dipping below 64 MB, the ratio of the number of objects on your master to what you have in storage starts to go up. That's where you run into problems.
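A rough calculation shows why shrinking files squeeze the master. The per-entry sizes below are assumptions in the spirit of the original GFS paper (which reports less than 64 bytes of metadata per chunk); the point is the ratio, not the exact byte counts.

```python
# Why small files blow up master memory: a hedged back-of-the-envelope.
# Per-entry sizes are assumptions (the GFS paper cites <64 bytes per chunk);
# only the 64-MB chunk size comes from the article.

bytes_per_file_entry = 64    # assumed namespace metadata per file
bytes_per_chunk_entry = 64   # assumed metadata per chunk
chunk_size = 64 * 2**20      # 64 MB

def master_bytes(total_storage_bytes, avg_file_size):
    n_files = total_storage_bytes // avg_file_size
    chunks_per_file = max(1, -(-avg_file_size // chunk_size))  # ceiling division
    return n_files * (bytes_per_file_entry +
                      chunks_per_file * bytes_per_chunk_entry)

petabyte = 2**50
for avg in (64 * 2**20, 1 * 2**20):          # 64-MB vs 1-MB average files
    gib = master_bytes(petabyte, avg) / 2**30
    print(f"1 PB of {avg // 2**20}-MB files -> ~{gib:.0f} GiB of master metadata")
# Dropping the average file size from 64 MB to 1 MB multiplies the number of
# metadata entries the single master must hold in RAM by roughly 64x.
```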
Going back to that logs example, it quickly became apparent that the natural mapping we had thought of—and which seemed to make perfect sense back when we were doing our back-of-the-envelope estimates—turned out not to be acceptable at all. We needed to find a way to work around this by figuring out how we could combine some number of underlying objects into larger files. In the case of the logs, that wasn't exactly rocket science, but it did require a lot of effort.
MCKUSICK That sounds like the old days when IBM had only a minimum disk allocation, so it provided you with a utility that let you pack a bunch of files together and then create a table of contents for that.
QUINLAN Exactly. For us, each application essentially ended up doing that to varying degrees. That proved to be less burdensome for some applications than for others. In the case of our logs, we hadn't really been planning to delete individual log files. It was more likely that we would end up rewriting the logs to anonymize them or do something else along those lines. That way, you don't get the garbage-collection problems that can come up if you delete only some of the files within a bundle.
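As an illustration of the kind of bundling Quinlan describes, here is a minimal sketch of packing many small log files into one large file with a table of contents, mirroring the "pack plus index" idea from the IBM analogy. The on-disk format is invented purely for illustration.

```python
# Minimal sketch of bundling small files into one large file with a
# table of contents, in the spirit of the approach described above.
# The on-disk format here is invented purely for illustration.
import json, struct

def write_bundle(path, small_files):
    """small_files: dict of name -> bytes. Appends a JSON TOC plus its offset."""
    toc = {}
    with open(path, "wb") as out:
        for name, data in small_files.items():
            toc[name] = (out.tell(), len(data))   # record (offset, length)
            out.write(data)
        toc_offset = out.tell()
        out.write(json.dumps(toc).encode())
        out.write(struct.pack("<Q", toc_offset))  # trailer: where the TOC starts

def read_member(path, name):
    with open(path, "rb") as f:
        f.seek(-8, 2)                             # read the 8-byte trailer
        (toc_offset,) = struct.unpack("<Q", f.read(8))
        f.seek(toc_offset)
        toc = json.loads(f.read()[:-8])
        offset, length = toc[name]
        f.seek(offset)
        return f.read(length)

write_bundle("frontend-logs.bundle",
             {"server01.log": b"GET /\n", "server02.log": b"POST /mail\n"})
print(read_member("frontend-logs.bundle", "server02.log"))  # b'POST /mail\n'
```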
For some other applications, however, the file-count problem was more acute. Many times, the most natural design for some application just wouldn't fit into GFS—even though at first glance you would think the file count would be perfectly acceptable, it would turn out to be a problem. When we started using more shared cells, we put quotas on both file counts and storage space. The limit that people have ended up running into most has been, by far, the file-count quota. In comparison, the underlying storage quota rarely proves to be a problem.
MCKUSICK What longer-term strategy have you come up with for dealing with the file-count issue? Certainly, it doesn't seem that a distributed master is really going to help with that—not if the master still has to keep all the metadata in memory, that is.
QUINLAN The distributed master certainly allows you to grow file counts, in line with the number of machines you're willing to throw at it. That certainly helps.
One of the appeals of the distributed multimaster model is that if you scale everything up by two orders of magnitude, then getting down to a 1-MB average file size is going to be a lot different from having a 64-MB average file size. If you end up going below 1 MB, then you're also going to run into other issues that you really need to be careful about. For example, if you end up having to read 10,000 10-KB files, you're going to be doing a lot more seeking than if you're just reading 100 1-MB files.
My gut feeling is that if you design for an average 1-MB file size, then that should provide for a much larger class of things than does a design that assumes a 64-MB average file size. Ideally, you would like to imagine a system that goes all the way down to much smaller file sizes, but 1 MB seems a reasonable compromise in our environment.
MCKUSICK What have you been doing to design GFS to work with 1-MB files?
QUINLAN We haven't been doing anything with the existing GFS design. Our distributed master system that will provide for 1-MB files is essentially a whole new design. That way, we can aim for something on the order of 100 million files per master. You can also have hundreds of masters.
MCKUSICK So, essentially no single master would have all this data on it?
QUINLAN That's the idea.
With the recent emergence within Google of BigTable, a distributed storage system for managing structured data, one potential remedy for the file-count problem—albeit perhaps not the very best one—is now available.
The significance of BigTable goes far beyond file counts, however. Specifically, it was designed to scale into the petabyte range across hundreds or thousands of machines, as well as to make it easy to add more machines to the system and automatically start taking advantage of those resources without reconfiguration. For a company predicated on the notion of employing the collective power, potential redundancy, and economies of scale inherent in a massive deployment of commodity hardware, these rate as significant advantages indeed.
Accordingly, BigTable is now used in conjunction with a growing number of Google applications. Although it represents a departure of sorts from the past, it also must be said that BigTable was built on GFS, runs on GFS, and was consciously designed to remain consistent with most GFS principles. Consider it, therefore, as one of the major adaptations made along the way to help keep GFS viable in the face of rapid and widespread change.
MCKUSICK You now have this thing called BigTable. Do you view that as an application in its own right?
QUINLAN From the GFS point of view, it's an application, but it's clearly more of an infrastructure piece.
MCKUSICK If I understand this correctly, BigTable is essentially a lightweight relational database.
QUINLAN It's not really a relational database. I mean, we're not doing SQL and it doesn't really support joins and such. But BigTable is a structured storage system that lets you have lots of key-value pairs and a schema.
MCKUSICK Who are the real clients of BigTable?
QUINLAN BigTable is increasingly being used within Google for crawling and indexing systems, and we use it a lot within many of our client-facing applications. The truth of the matter is that there are tons of BigTable clients. Basically, any app with lots of small data items tends to use BigTable. That's especially true wherever there's fairly structured data.
MCKUSICK I guess the question I'm really trying to pose here is: Did BigTable just get stuck into a lot of these applications as an attempt to deal with the small-file problem, basically by taking a whole bunch of small things and then aggregating them together?
QUINLAN That has certainly been one use case for BigTable, but it was actually intended for a much more general sort of problem. If you're using BigTable in that way—that is, as a way of fighting the file-count problem where you might have otherwise used a file system to handle that—then you would not end up employing all of BigTable's functionality by any means. BigTable isn't really ideal for that purpose in that it requires resources for its own operations that are nontrivial. Also, it has a garbage-collection policy that's not super-aggressive, so that might not be the most efficient way to use your space. I'd say that the people who have been using BigTable purely to deal with the file-count problem probably haven't been terribly happy, but there's no question that it is one way for people to handle that problem.
MCKUSICK What I've read about GFS seems to suggest that the idea was to have only two basic data structures: logs and SSTables (Sorted String Tables). Since I'm guessing the SSTables must be used to handle key-value pairs and that sort of thing, how is that different from BigTable?
QUINLAN The main difference is that SSTables are immutable, while BigTable provides mutable key-value storage, and a whole lot more. BigTable itself is actually built on top of logs and SSTables. Initially, it stores incoming data into transaction log files. Then it gets compacted—as we call it—into a series of SSTables, which in turn get compacted together over time. In some respects, it's reminiscent of a log-structured file system. Anyway, as you've observed, logs and SSTables do seem to be the two data structures underlying the way we structure most of our data. We have log files for mutable stuff as it's being recorded. Then, once you have enough of that, you sort it and put it into this structure that has an index.
Even though GFS does not provide a POSIX interface, it still has a pretty generic file-system interface, so people are essentially free to write any sort of data they like. It's just that, over time, the majority of our users have ended up using these two data structures. We also have something called protocol buffers, which is our data description language. The majority of data ends up being protocol buffers in these two structures. Both provide for compression and checksums. Even though there are some people internally who end up reinventing these things, most people are content just to use those two basic building blocks.
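To make the log-plus-SSTable pattern concrete, here is a toy sketch: mutable writes go to an append-only log, and a compaction step sorts them into an immutable, indexed table. It is a pedagogical reduction under those assumptions, not the actual BigTable or GFS formats.

```python
# Toy sketch of the log + SSTable pattern described above: an append-only
# log for mutable data, compacted into immutable sorted tables. This is a
# pedagogical reduction, not the real BigTable format.

class ToyLSM:
    def __init__(self):
        self.log = []          # append-only transaction log (mutable tail)
        self.sstables = []     # list of immutable sorted dicts, oldest first

    def put(self, key, value):
        self.log.append((key, value))          # cheap sequential append

    def compact(self):
        """Sort the log into a new immutable SSTable, then clear the log."""
        table = {}
        for key, value in self.log:            # last write wins
            table[key] = value
        self.sstables.append(dict(sorted(table.items())))
        self.log = []

    def get(self, key):
        for k, v in reversed(self.log):        # newest data first
            if k == key:
                return v
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

db = ToyLSM()
db.put("row:42", "hello")
db.compact()                                    # "it gets compacted ... into SSTables"
db.put("row:42", "world")
print(db.get("row:42"))                         # -> world (the log shadows the SSTable)
```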
Because GFS was designed initially to enable a crawling and indexing system, throughput was everything. In fact, the original paper written about the system makes this quite explicit: "High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response-time requirements for an individual read and write."
But then Google either developed or embraced many user-facing Internet services for which this is most definitely not the case.
One GFS shortcoming that this immediately exposed had to do with the original single-master design. A single point of failure may not have been a disaster for batch-oriented applications, but it was certainly unacceptable for latency-sensitive applications, such as video serving. The later addition of automated failover capabilities helped, but even then service could be out for up to a minute.
The other major challenge for GFS, of course, has revolved around finding ways to build latency-sensitive applications on top of a file system designed around an entirely different set of priorities.
MCKUSICK It's well documented that the initial emphasis in designing GFS was on batch efficiency as opposed to low latency. Now that has come back to cause you trouble, particularly in terms of handling things such as videos. How are you handling that?
QUINLAN The GFS design model from the get-go was all about achieving throughput, not about the latency at which that might be achieved. To give you a concrete example, if you're writing a file, it will typically be written in triplicate—meaning you'll actually be writing to three chunkservers. Should one of those chunkservers die or hiccup for a long period of time, the GFS master will notice the problem and schedule what we call a pullchunk, which means it will basically replicate one of those chunks. That will get you back up to three copies, and then the system will pass control back to the client, which will continue writing.
When we do a pullchunk, we limit it to something on the order of 5-10 MB a second. So, for 64 MB, you're talking about 10 seconds for this recovery to take place. There are lots of other things like this that might take 10 seconds to a minute, which works just fine for batch-type operations. If you're doing a large MapReduce operation, you're OK just so long as one of the items is not a real straggler, in which case you've got yourself a different sort of problem. Still, generally speaking, a hiccup on the order of a minute over the course of an hour-long batch job doesn't really show up. If you are working on Gmail, however, and you're trying to write a mutation that represents some user action, then getting stuck for a minute is really going to mess you up.
We've had similar issues with our master failover. Initially, GFS had no provision for automatic master failover. It was a manual process. Although it didn't happen a lot, whenever it did, the cell might be down for an hour. Even our initial master-failover implementation required on the order of minutes. Over the past year, however, we've taken that down to something on the order of tens of seconds.
MCKUSICK Still, for user-facing applications, that's not acceptable.
QUINLAN Right. While these instances—where you have to provide for failover and error recovery—may have been acceptable in the batch situation, they're definitely not OK from a latency point of view for a user-facing application. Another issue here is that there are places in the design where we've tried to optimize for throughput by dumping thousands of operations into a queue and then just processing through them. That leads to fine throughput, but it's not great for latency. You can easily get into situations where you might be stuck for seconds at a time in a queue just waiting to get to the head of the queue.
Our user base has definitely migrated from being a MapReduce-based world to more of an interactive world that relies on things such as BigTable. Gmail is an obvious example of that. Videos aren't quite as bad where GFS is concerned because you get to stream data, meaning you can buffer. Still, trying to build an interactive database on top of a file system that was designed from the start to support more batch-oriented operations has certainly proved to be a pain point.
MCKUSICK How exactly have you managed to deal with that?
QUINLAN Within GFS, we've managed to improve things to a certain degree, mostly by designing the applications to deal with the problems that come up. Take BigTable as a good concrete example. The BigTable transaction log is actually the biggest bottleneck for getting a transaction logged. In effect, we decided, "Well, we're going to see hiccups in these writes, so what we'll do is to have two logs open at any one time. Then we'll just basically merge the two. We'll write to one and if that gets stuck, we'll write to the other. We'll merge those logs once we do a replay—if we need to do a replay, that is." We tended to design our applications to function like that—which is to say they basically try to hide that latency since they know the system underneath isn't really all that great.
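A minimal sketch of the two-log trick Quinlan describes might look like the following: keep two logs open, fall back to the other when a write stalls, tag entries with sequence numbers, and merge and de-duplicate on replay. The timeout and sequencing details are assumptions; the interview only gives the general shape.

```python
# Sketch of the "two transaction logs" idea described above: write to one
# log, fall back to the other when a write stalls or fails, and merge by
# sequence number on replay. Timeouts and the entry format are assumptions.
import time

class DualLogWriter:
    def __init__(self, log_a, log_b, stall_seconds=1.0):
        self.logs = [log_a, log_b]     # two independently backed log files
        self.active = 0
        self.seq = 0
        self.stall_seconds = stall_seconds

    def append(self, mutation: bytes):
        self.seq += 1
        entry = (self.seq, mutation)
        start = time.monotonic()
        try:
            self.logs[self.active].append(entry)
            if time.monotonic() - start <= self.stall_seconds:
                return
        except IOError:
            pass
        # The active log hiccupped: switch and write the entry there too;
        # replay de-duplicates by sequence number.
        self.active ^= 1
        self.logs[self.active].append(entry)

def replay(log_a, log_b):
    """Merge both logs, drop duplicates, and return mutations in order."""
    merged = {seq: mutation for seq, mutation in list(log_a) + list(log_b)}
    return [merged[seq] for seq in sorted(merged)]

a, b = [], []                      # stand-ins for two GFS-backed log files
w = DualLogWriter(a, b)
w.append(b"put row:1")
w.append(b"put row:2")
print(replay(a, b))                # -> [b'put row:1', b'put row:2']
```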
The guys who built Gmail went to a multihomed model, so if one instance of your Gmail account got stuck, you would basically just get moved to another data center. Actually, that capability was needed anyway just to ensure availability. Still, part of the motivation was that they wanted to hide the GFS problems.
MCKUSICK I think it's fair to say that, by moving to a distributed-master file system, you're definitely going to be able to attack some of those latency issues.
QUINLAN That was certainly one of our design goals. Also, BigTable itself is a very failure-aware system that tries to respond to failures far more rapidly than we were able to before. Using that as our metadata storage helps with some of those latency issues as well.
The engineers who worked on the earliest versions of GFS weren't particularly shy about departing from traditional choices in file-system design whenever they felt the need to do so. It just so happens that the approach taken to consistency is one of the aspects of the system where this is particularly evident.
Part of this, of course, was driven by necessity. Since Google's plans rested largely on massive deployments of commodity hardware, failures and hardware-related faults were a given. Beyond that, according to the original GFS paper, there were a few compatibility issues. "Many of our disks claimed to the Linux driver that they supported a range of IDE protocol versions but in fact responded reliably only to the more recent ones. Since the protocol versions are very similar, these drives mostly worked, but occasionally the mismatches would cause the drive and the kernel to disagree about the drive's state. This would corrupt data silently due to problems in the kernel. This problem motivated our use of checksums to detect data corruption."
That didn't mean just any checksumming, however, but instead rigorous end-to-end checksumming, with an eye to everything from disk corruption to TCP/IP corruption to machine backplane corruption.
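The shape of that kind of defensive checksumming is easy to sketch: checksum the data in small blocks when it is written, store the checksums alongside the data, and verify them again on every read so corruption anywhere along the path is caught before the data is served. The block size and algorithm here are illustrative choices, not necessarily the actual GFS values.

```python
# Sketch of block-level, end-to-end checksumming in the spirit described
# above: checksum on write, re-verify on every read. The 64-KB block size
# and CRC-32 are illustrative choices, not necessarily what GFS uses.
import zlib

BLOCK = 64 * 1024

def write_with_checksums(data: bytes):
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    checksums = [zlib.crc32(b) for b in blocks]
    return blocks, checksums          # stored together on the chunkserver

def read_with_verification(blocks, checksums):
    out = bytearray()
    for i, (b, expected) in enumerate(zip(blocks, checksums)):
        if zlib.crc32(b) != expected:
            # In GFS terms: report the bad replica and read from another copy.
            raise IOError(f"checksum mismatch in block {i}")
        out += b
    return bytes(out)

blocks, sums = write_with_checksums(b"x" * 200_000)
assert read_with_verification(blocks, sums) == b"x" * 200_000
blocks[1] = b"corrupted" + blocks[1][9:]        # simulate silent disk corruption
try:
    read_with_verification(blocks, sums)
except IOError as e:
    print(e)                                     # -> checksum mismatch in block 1
```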
Interestingly, for all that checksumming vigilance, the GFS engineering team also opted for an approach to consistency that's relatively loose by file-system standards. Basically, GFS simply accepts that there will be times when people will end up reading slightly stale data. Since GFS is used mostly as an append-only system as opposed to an overwriting system, this generally means those people might end up missing something that was appended to the end of the file after they'd already opened it. To the GFS designers, this seemed an acceptable cost (although it turns out that there are applications for which this proves problematic).
Also, as Gobioff explained, "The risk of stale data in certain circumstances is just inherent to a highly distributed architecture that doesn't ask the master to maintain all that much information. We definitely could have made things a lot tighter if we were willing to dump a lot more data into the master and then have it maintain more state. But that just really wasn't all that critical to us."
Perhaps an even more important issue here is that the engineers making this decision owned not just the file system but also the applications intended to run on the file system. According to Gobioff, "The thing is that we controlled both the horizontal and the vertical—the file system and the application. So we could be sure our applications would know what to expect from the file system. And we just decided to push some of the complexity out to the applications to let them deal with it."
Still, there are some at Google who wonder whether that was the right call, if only because people can sometimes obtain different data in the course of reading a given file multiple times, which tends to be so strongly at odds with their whole notion of how data storage is supposed to work.
MCKUSICK Let's talk about consistency. The issue seems to be that it presumably takes some amount of time to get everything fully written to all the replicas. I think you said something earlier to the effect that GFS essentially requires that this all be fully written before you can continue.
QUINLAN That's correct.
MCKUSICK If that's the case, then how can you possibly end up with things that aren't consistent?
QUINLAN Client failures have a way of fouling things up. Basically, the model in GFS is that the client just continues to push the write until it succeeds. If the client ends up crashing in the middle of an operation, things are left in a bit of an indeterminate state.
Early on, that was sort of considered to be OK, but over time, we tightened the window for how long that inconsistency could be tolerated, and then we slowly continued to reduce that. Otherwise, whenever the data is in that inconsistent state, you may get different lengths for the file. That can lead to some confusion. We had to have some backdoor interfaces for checking the consistency of the file data in those instances. We also have something called RecordAppend, which is an interface designed for multiple writers to append to a log concurrently. There the consistency was designed to be very loose. In retrospect, that turned out to be a lot more painful than anyone expected.
MCKUSICK What exactly was loose? If the primary replica picks what the offset is for each write and then makes sure that actually occurs, I don't see where the inconsistencies are going to come up.
QUINLAN What happens is that the primary will try. It will pick an offset, it will do the writes, but then one of them won't actually get written. Then the primary might change, at which point it can pick a different offset. RecordAppend does not offer any replay protection either. You could end up getting the data multiple times in the file.
There were even situations where you could get the data in a different order. It might appear multiple times in one chunk replica, but not necessarily in all of them. If you were reading the file, you could discover the data in different ways at different times. At the record level, you could discover the records in different orders depending on which chunks you happened to be reading.
MCKUSICK Was this done by design?
QUINLAN At the time, it must have seemed like a good idea, but in retrospect I think the consensus is that it proved to be more painful than it was worth. It just doesn't meet the expectations people have of a file system, so they end up getting surprised. Then they had to figure out work-arounds.
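One common shape for such work-arounds, sketched here as an assumption rather than anything documented in the interview, is to give every record a unique identifier when it is appended and have readers drop any duplicates they encounter.

```python
# Sketch of a typical reader-side work-around for RecordAppend's
# at-least-once, possibly-reordered semantics: tag each record with a
# unique ID on the writer and de-duplicate on the reader. This pattern is
# illustrative; the interview only says applications had to work around
# the loose semantics.
import uuid

def make_record(payload: bytes) -> dict:
    return {"id": uuid.uuid4().hex, "payload": payload}

def read_unique(records):
    """Yield each logical record once, in the order first encountered."""
    seen = set()
    for rec in records:
        if rec["id"] in seen:
            continue          # duplicate produced by a retried append
        seen.add(rec["id"])
        yield rec["payload"]

r1 = make_record(b"click:front-page")
r2 = make_record(b"click:search")
# A retried RecordAppend can leave r2 in the file twice, possibly far apart.
file_contents = [r1, r2, r2]
print(list(read_unique(file_contents)))   # -> [b'click:front-page', b'click:search']
```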
MCKUSICK In retrospect, how would you handle this differently?
QUINLAN I think it makes more sense to have a single writer per file.
MCKUSICK All right, but what happens when you have multiple people wanting to append to a log?
QUINLAN You serialize the writes through a single process that can ensure the replicas are consistent.
MCKUSICK There's also this business where you essentially snapshot a chunk. Presumably, that's something you use when you're essentially replacing a replica, or whenever some chunkserver goes down and you need to replace some of its files.
QUINLAN Actually, two things are going on there. One, as you suggest, is the recovery mechanism, which definitely involves copying around replicas of the file. The way that works in GFS is that we basically revoke the lock so that the client can't write it anymore, and this is part of that latency issue we were talking about.
There's also a separate issue, which is to support the snapshot feature of GFS. GFS has the most general-purpose snapshot capability you can imagine. You could snapshot any directory somewhere, and then both copies would be entirely equivalent. They would share the unchanged data. You could change either one and you could further snapshot either one. So it was really more of a clone than what most people think of as a snapshot. It's an interesting thing, but it makes for difficulties—especially as you try to build more distributed systems and you want potentially to snapshot larger chunks of the file tree.
I also think it's interesting that the snapshot feature hasn't been used more since it's actually a very powerful feature. That is, from a file-system point of view, it really offers a pretty nice piece of functionality.
But putting snapshots into file systems, as I'm sure you know, is a real pain.
MCKUSICK I know. I've done it. It's excruciating—especially in an overwriting file system.
QUINLAN Exactly. This is a case where we didn't cheat, but from an implementation perspective, it's hard to create true snapshots. Still, it seems that in this case, going the full deal was the right decision. Just the same, it's an interesting contrast to some of the other decisions that were made early on in terms of the semantics.
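The clone-like behavior Quinlan describes can be approximated in a few lines: a snapshot copies only the file-to-chunk mapping and bumps reference counts, and a chunk is physically duplicated only when someone writes to it. This is a simplified copy-on-write model under stated assumptions, not GFS's actual implementation.

```python
# Simplified copy-on-write model of the clone-style snapshots described
# above: snapshotting copies only metadata (chunk references); chunk data
# is duplicated lazily, on the first write after the snapshot.

class ToyFS:
    def __init__(self):
        self.chunks = {}       # chunk_id -> bytes
        self.refcount = {}     # chunk_id -> number of files referencing it
        self.files = {}        # path -> list of chunk_ids
        self._next = 0

    def _new_chunk(self, data):
        cid = self._next; self._next += 1
        self.chunks[cid] = data
        self.refcount[cid] = 1
        return cid

    def create(self, path, data):
        self.files[path] = [self._new_chunk(data)]

    def snapshot(self, src, dst):
        """Clone: share every chunk, just bump reference counts."""
        self.files[dst] = list(self.files[src])
        for cid in self.files[dst]:
            self.refcount[cid] += 1

    def write(self, path, index, data):
        cid = self.files[path][index]
        if self.refcount[cid] > 1:                 # shared: copy before writing
            self.refcount[cid] -= 1
            cid = self._new_chunk(self.chunks[cid])
            self.files[path][index] = cid
        self.chunks[cid] = data

fs = ToyFS()
fs.create("/logs/today", b"original")
fs.snapshot("/logs/today", "/logs/today.snap")     # cheap: metadata only
fs.write("/logs/today", 0, b"rewritten")
print(fs.chunks[fs.files["/logs/today.snap"][0]])  # -> b'original' (unchanged)
```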
All in all, the report card on GFS nearly 10 years later seems positive. There have been problems and shortcomings, to be sure, but there's surely no arguing with Google's success, and GFS has without a doubt played an important role in that. What's more, its staying power has been nothing short of remarkable given that Google's operations have scaled orders of magnitude beyond anything the system had been designed to handle, while the application mix Google currently supports is not one that anyone could have possibly imagined back in the late '90s.
Still, there's no question that GFS faces many challenges now. For one thing, the awkwardness of supporting an ever-growing fleet of user-facing, latency-sensitive applications on top of a system initially designed for batch-system throughput is something that's obvious to all.
The advent of BigTable has helped somewhat in this regard. As it turns out, however, BigTable isn't actually all that great a fit for GFS. In fact, it just makes the bottleneck limitations of the system's single-master design more apparent than would otherwise be the case.
For these and other reasons, engineers at Google have been working for much of the past two years on a new distributed master system designed to take full advantage of BigTable to attack some of those problems that have proved particularly difficult for GFS.
Accordingly, it now seems that beyond all the adjustments made to ensure the continued survival of GFS, the newest branch on the evolutionary tree will continue to grow in significance over the years to come.
Source: http://queue.acm.org/detail.cfm?id=1594206