Google File System

Learning Notes of CMU Course 15-640 Distributed Systems

Posted by haohanz, May 08, 2019

GFS’s Focus

  • Fault tolerance
  • Scalability (focus on throughput; sacrifice response time/latency of individual reads and writes)
  • Low synchronization overhead between GFS entities
    • No caching
    • Inconsistency is allowed on appends, so there is no synchronization overhead between clients
  • Decoupling of control flow and data flow for throughput and scalability

GFS’s Assumptions

  • Files are large; chunk size = 64 MB
  • Streaming reads and writes (files are typically written once, then read sequentially)
  • Atomic appends without synchronization overhead among clients

Master-Slave Pattern

  • Role of master:
    • Garbage collection
    • Maintains metadata (namespace, file-to-chunk mapping, chunk locations)
    • Migrates chunks between chunk servers for fault tolerance
    • Delegates consistency management (via leases to chunk primaries)
  • Role of chunk server:
    • Knows nothing beyond its own chunks
    • Sends heartbeats to the master
    • Handles client read/write requests (the data flow)
    • No caching!
  • Role of client:
    • No caching!
      • Reduces consistency overhead; simpler design
      • With streaming reads and writes, caching brings little benefit anyway
    • Exchanges data directly with chunk servers (see the sketch after this list)
      • Reduces the master bottleneck and exploits parallelism
    • Control flow (metadata requests) goes through the master
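
To make the control-flow/data-flow split concrete, here is a minimal Go sketch. The names (`Master`, `ChunkServer`, `GetChunkLocations`, `ReadChunk`) are my own stand-ins, not GFS's real RPC interface: the client sends one small metadata request to the master, then streams the bytes directly from a chunk server replica.

```go
package gfs

// ChunkLocation is the metadata the master hands back: which chunk holds
// the requested byte range and which chunk servers store replicas of it.
type ChunkLocation struct {
	ChunkHandle uint64
	Replicas    []string // chunk server addresses
}

// Master handles control flow only: small metadata requests.
type Master interface {
	GetChunkLocations(path string, chunkIndex int) (ChunkLocation, error)
}

// ChunkServer handles data flow: the actual bytes.
type ChunkServer interface {
	ReadChunk(handle uint64, offset, length int64) ([]byte, error)
}

const chunkSize int64 = 64 << 20 // 64 MB

// Read shows the split: one metadata RPC to the master, then the (possibly
// large) data transfer goes straight to a chunk server replica.
func Read(m Master, dial func(addr string) ChunkServer,
	path string, offset, length int64) ([]byte, error) {

	loc, err := m.GetChunkLocations(path, int(offset/chunkSize)) // control flow
	if err != nil {
		return nil, err
	}
	cs := dial(loc.Replicas[0]) // a real client would pick the closest replica
	return cs.ReadChunk(loc.ChunkHandle, offset%chunkSize, length) // data flow
}
```

Because the master only touches metadata, many clients can stream from different chunk servers in parallel without queuing behind a single machine.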

Chunk Size

  • Why 64MB chunk size?
    • Cons:
      • Per-chunk overhead is wasted on small files, and a small file's single chunk can become a hot spot
    • Pros:
      • Less metadata, which reduces the master bottleneck (metadata fits in the master's RAM)
      • Less client-master communication and less master workload, hence better scalability
      • Matches the assumption of large files
      • With a smaller chunk size, streaming reads would need far more metadata lookups and more communication (see the sketch after this list)
  • 3 replicas by default: 2 in the same rack, 1 in a different rack
    • A tradeoff between access time and safety
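
A rough back-of-the-envelope in Go shows why the large chunk size matters for the master. The ~64 bytes of metadata per chunk is the figure the GFS paper cites, not something in these notes; the helper below is only illustrative.

```go
package gfs

// Assumed metadata cost on the master per chunk (roughly what the GFS
// paper cites); treat it as an order-of-magnitude figure.
const bytesPerChunkMeta = 64

// MetadataBytes estimates the master RAM needed to track one file of the
// given size at a given chunk size.
func MetadataBytes(fileSize, chunkSize int64) int64 {
	numChunks := (fileSize + chunkSize - 1) / chunkSize // ceil division
	return numChunks * bytesPerChunkMeta
}

// For a 1 TB file:
//   64 MB chunks ->    16,384 chunks -> ~1 MB of master metadata
//    1 MB chunks -> 1,048,576 chunks -> ~64 MB of metadata, and 64x more
//                   client-master lookups during a streaming read.
```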

Client Operations

  • Append:
    • Atomicity ensured: concurrent appends do not overwrite each other
    • At-least-once semantics: a failed append is retried, so duplicate records are possible (see the sketch after this list)
    • Why append?
      • Clients rarely overwrite; they mostly append
      • Better performance for writing lots of small records (e.g., log data, CPU info) than for rewriting large chunks
  • Read:
    • Read from the closest replica
  • Write:
    • Data is written to all replicas
    • Data is passed along a daisy chain of chunk servers to use each machine's full outbound bandwidth
    • All replicas apply mutations in the same order, decided by the primary
    • For a regular write the client decides the offset (for record append, the primary chooses it)
  • Delete:
    • The file is renamed to a hidden name
    • A background scan later deletes the metadata from the master's RAM and garbage-collects the chunks on the chunk servers
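
Here is a sketch of what "at-least-once" means for record append, using a hypothetical `Primary.RecordAppend` call (the real GFS RPC names are not in these notes): the primary picks the offset, and the client's only recovery strategy is to retry, which can leave duplicate records that readers must tolerate.

```go
package gfs

// Primary is a hypothetical handle to the chunk's current lease holder.
type Primary interface {
	// RecordAppend appends data to the chunk; the primary, not the
	// client, chooses and returns the offset.
	RecordAppend(chunkHandle uint64, data []byte) (offset int64, err error)
}

// AppendAtLeastOnce illustrates the at-least-once contract: if the append
// fails on any replica, the client simply retries the whole record, so the
// same record may end up in the file more than once.
func AppendAtLeastOnce(p Primary, handle uint64, record []byte, maxRetries int) (int64, error) {
	var lastErr error
	for i := 0; i < maxRetries; i++ {
		off, err := p.RecordAppend(handle, record)
		if err == nil {
			return off, nil // success: the record is at `off` on every replica
		}
		// Partial failure: some replicas may already hold this record;
		// the retry will append another copy at a new offset.
		lastErr = err
	}
	return 0, lastErr
}
```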

Fault Tolerance

  • Chunk server fault tolerance
    • Missed heartbeats tell the master the chunk server is down
    • The replica count of its chunks is decremented
    • Under-replicated chunks are re-replicated in the background
  • The chunk-server fault-tolerance mechanism for mutations - lease and version number (see the sketch after this list)
    • The master grants a lease to one replica, the primary (only during mutation operations!)
    • The lease is revoked if the file is being renamed or deleted!
    • The chunk version number is updated each time a new lease is granted
    • A lease has a 60-second timeout and is refreshed if the mutation is not finished
  • Why lease/version number?
    • Network partition
    • Primary failure
    • The lease is revoked when it expires or when the file is renamed/deleted, so a cut-off primary cannot keep mutating
    • The version number detects outdated chunk servers - a server that was down (e.g., shut down during a lease grant) holds a stale replica
  • Master fault tolerance (see the second sketch after this list)
    • Write-ahead log (WAL) of metadata operations
    • The log is replicated to master replicas (master-slave)
    • The master replies to a client only after the log record is safely flushed locally and on the replicas
    • The log cannot grow too long: replication becomes a bottleneck, and log replay lengthens recovery time (metadata must also fit in the master's RAM)
    • So the master builds periodic checkpoints
    • Recovery steps
      • Metadata recovery: load the latest checkpoint and replay the log suffix after it
      • File-to-chunk-ID mapping recovery: from the checkpoint and log
      • Chunk-ID-to-chunk-server mapping recovery: rebuilt by asking chunk servers which chunks and versions they hold
        • A replica with an older version number is stale (its server was down during a lease grant) and is discarded
        • A replica with a newer version number is accepted; the master must have failed before recording the new lease
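
A minimal sketch of the version-number bookkeeping described above, with made-up struct and method names (`MasterState`, `GrantLease`, `OnReplicaReport`): granting a lease bumps the chunk version, and comparing reported versions against the master's record separates stale replicas from a master that itself missed a lease grant.

```go
package gfs

// chunkInfo is the master's per-chunk record.
type chunkInfo struct {
	version  uint64   // bumped when a new lease is granted
	replicas []string // chunk servers believed to hold the chunk
}

// MasterState is a toy stand-in for the master's in-RAM chunk table.
type MasterState struct {
	chunks map[uint64]*chunkInfo // chunk handle -> info
}

// GrantLease picks a primary and increments the chunk version number, so a
// replica that misses this mutation epoch can be identified later.
func (m *MasterState) GrantLease(handle uint64) (primary string, version uint64) {
	c := m.chunks[handle]
	c.version++ // new lease => new version, pushed to all live replicas
	return c.replicas[0], c.version
}

// OnReplicaReport compares a reporting chunk server's version against the
// master's record, e.g. after a master or chunk server restart.
func (m *MasterState) OnReplicaReport(handle uint64, addr string, reported uint64) {
	c := m.chunks[handle]
	switch {
	case reported < c.version:
		// Stale replica: its server was down during a lease grant.
		// Drop it from the location list; it will be garbage-collected.
		c.replicas = removeAddr(c.replicas, addr)
	case reported > c.version:
		// The master must have failed before recording the newer
		// lease; trust the replica and adopt the higher version.
		c.version = reported
	}
	// Equal versions: the replica is up to date, nothing to do.
}

func removeAddr(addrs []string, addr string) []string {
	out := addrs[:0]
	for _, a := range addrs {
		if a != addr {
			out = append(out, a)
		}
	}
	return out
}
```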
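
And a second sketch for the master side: the write-ahead log and checkpoint-based recovery. `OperationLog`, `Recover`, and the map types are illustrative; the only rules taken from the notes are "reply only when the log is safe everywhere" and "checkpoint so replay stays short".

```go
package gfs

// LogRecord is one metadata mutation in the master's operation log.
type LogRecord struct {
	Seq uint64
	Op  string // e.g. "create /foo", "rename /foo /bar"
}

// OperationLog stands in for the WAL: a local copy plus replication hooks.
type OperationLog struct {
	local   []LogRecord             // stand-in for the local disk
	remotes []func(LogRecord) error // stand-ins for the replica masters
}

// Append flushes the record locally and to every remote replica; only after
// it returns nil may the master acknowledge the client's request.
func (l *OperationLog) Append(rec LogRecord) error {
	l.local = append(l.local, rec)
	for _, replicate := range l.remotes {
		if err := replicate(rec); err != nil {
			return err // do not ack the client if any replica missed the record
		}
	}
	return nil
}

// Recover rebuilds the file-to-chunk-ID map from the latest checkpoint plus
// the log suffix written after it. A long suffix means a long replay, which
// is why the master checkpoints periodically. Chunk-ID-to-chunk-server
// locations are not logged at all; they are rebuilt by polling chunk servers.
func Recover(checkpoint map[string][]uint64, logSuffix []LogRecord,
	apply func(state map[string][]uint64, rec LogRecord)) map[string][]uint64 {

	state := checkpoint
	for _, rec := range logSuffix {
		apply(state, rec) // replay each metadata mutation after the checkpoint
	}
	return state
}
```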

GFS Consistency Model

  • Failures leave appended regions inconsistent: replicas may contain duplicate records or padding
  • Changes to the namespace are atomic and consistent
    • Thanks to the WAL, master-slave replication, and the single master serializing namespace operations

GFS’s Limitations

  • Data corruption and duplicate records are left to application-level checksums (see the sketch after this list)
  • The single master is the biggest bottleneck to scaling
    • Rebuilding/recovering its metadata takes a long time
    • It limits both performance and availability
    • Possible solution: multiple master servers
  • Not very secure; users can interfere with each other
  • The chunk size cannot easily be made smaller
  • Optimized for streaming, write-once workloads; a poor fit for random access
  • No caching
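
Since corruption detection and duplicate records are pushed onto applications, here is one way a client could frame each appended record (this layout is my own choice, not something GFS prescribes): a length, a record ID, and a CRC, so readers can skip padding, reject corrupted bytes, and drop duplicates left by retried appends.

```go
package gfs

import (
	"encoding/binary"
	"hash/crc32"
)

// EncodeRecord frames payload as: [length:4][recordID:8][crc32:4][payload].
func EncodeRecord(recordID uint64, payload []byte) []byte {
	buf := make([]byte, 16+len(payload))
	binary.BigEndian.PutUint32(buf[0:4], uint32(len(payload)))
	binary.BigEndian.PutUint64(buf[4:12], recordID)
	binary.BigEndian.PutUint32(buf[12:16], crc32.ChecksumIEEE(payload))
	copy(buf[16:], payload)
	return buf
}

// DecodeRecord validates one frame: a checksum mismatch means corruption or
// padding, and a previously-seen recordID means a duplicate from a retried
// at-least-once append.
func DecodeRecord(buf []byte, seen map[uint64]bool) (payload []byte, ok bool) {
	if len(buf) < 16 {
		return nil, false
	}
	n := binary.BigEndian.Uint32(buf[0:4])
	if int(n) > len(buf)-16 {
		return nil, false // truncated frame or padding region
	}
	id := binary.BigEndian.Uint64(buf[4:12])
	sum := binary.BigEndian.Uint32(buf[12:16])
	payload = buf[16 : 16+n]
	if crc32.ChecksumIEEE(payload) != sum || seen[id] {
		return nil, false // corrupted record or duplicate append
	}
	seen[id] = true
	return payload, true
}
```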