HPC, MPI, MapReduce

Learning Notes of CMU Course 15-640 Distributed Systems

Posted by haohanz May 09, 2019 · Stay Hungry · Stay Foolish

HPC

(Supercomputers + MPI)

  • MPI’s focus (a point-to-point sketch follows this list):
    • Low latency
    • High scalability
  • HPC:
    • Characteristics
      • High bandwidth
      • Long-lived process
      • In memory computation
      • Spatial locality exploited via partitioning
    • Pros
      • High resource utilization
      • High computational efficiency
    • Cons
      • Requires careful tuning between the application and the hardware resources
      • Low generality: the tuned setup cannot tolerate variation in workload or hardware
      • Weak fault tolerance (detailed below)
    • Fault tolerance
      • Checkpointing (see the checkpoint sketch after this list)
        • Heavy I/O traffic
        • Computation performed since the last checkpoint is wasted on failure
      • Scaling is sensitive to the number of failures: the more nodes, the more frequent the failures and restarts
  • MPI:
    • Pros
      • Handles iterative, interactive computations
      • Handles tightly-coupled computations
      • Handles communication-heavy components
    • Cons
      • Weak fault tolerance
      • Applications cannot grow much more complex, since communication must be managed by hand
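
To make MPI’s low-latency, point-to-point style concrete, here is a minimal sketch using mpi4py (an assumption on my part: the course does not prescribe a particular binding). It would be launched with something like `mpiexec -n 2 python ping.py`; the file name is illustrative.

```python
# Minimal MPI point-to-point exchange between two ranks.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {"step": 1, "payload": [1.0, 2.0, 3.0]}
    comm.send(data, dest=1, tag=11)      # blocking send to rank 1
    reply = comm.recv(source=1, tag=22)  # wait for the acknowledgement
    print(f"rank 0 received: {reply}")
elif rank == 1:
    data = comm.recv(source=0, tag=11)   # blocking receive from rank 0
    comm.send("ack", dest=0, tag=22)
```

Each process keeps its partition of the data in memory and exchanges only the messages it needs, which is how HPC codes exploit spatial locality via partitioning.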
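The checkpointing trade-off above can also be shown in a few lines. This is a sketch only, not how any particular HPC scheduler does it; the file name `state.ckpt` and the function `compute_step` are hypothetical stand-ins.

```python
import os
import pickle

CHECKPOINT = "state.ckpt"  # hypothetical checkpoint file

def compute_step(i):
    return i * 0.001  # stand-in for one unit of real computation

def save_checkpoint(snapshot):
    # Write to a temp file and rename, so a crash mid-write
    # never corrupts the last good checkpoint.
    with open(CHECKPOINT + ".tmp", "wb") as f:
        pickle.dump(snapshot, f)
    os.replace(CHECKPOINT + ".tmp", CHECKPOINT)

def load_checkpoint():
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "state": 0.0}

snap = load_checkpoint()
for i in range(snap["iteration"], 1000):
    snap["state"] += compute_step(i)
    if (i + 1) % 100 == 0:  # the interval trades I/O traffic against wasted work
        snap["iteration"] = i + 1
        save_checkpoint(snap)
```

A shorter checkpoint interval means less wasted computation after a failure but more I/O traffic; a longer interval means the opposite. That is exactly the tension listed above.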

Cluster Computing

(MapReduce + Implementation)

  • How MapReduce deals with stragglers:
    • Solved: stragglers caused not by poor partitioning but by I/O latency, server performance, or bugs
      • Reschedule the unfinished tasks on other workers
    • Not solved: stragglers caused by poor partitioning (later addressed by Spark). Partitioning is done by hashing on the mapper side: with M mappers and R reducers, each mapper outputs R files (see the partitioning sketch after this list).
  • Why use a cluster file system (e.g., HDFS) for input and output?
    • Fault tolerance
    • Dynamic task scheduling across replicated file blocks
    • Heterogeneous nodes
    • Maximize cost-performance
    • Minimal downtime, minimal staffing
  • MapReduce:
    • Pros
      • Robust failure model: failed tasks are simply re-executed
      • High throughput
      • Simple programming model (see the word-count sketch below)
      • Loosely-coupled, coarse-grained parallel tasks
    • Cons
      • Disk I/O at every stage, so latency is high
      • Poor fit for iterative, interactive computation
      • Abstraction is not expressive enough, so specialized analytics systems have to be built on top
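
The mapper-side hashing mentioned under stragglers can be sketched as follows. This assumes the default hash partitioner described in the MapReduce paper; R = 4 and the sample records are illustrative.

```python
import zlib

R = 4  # number of reducers (illustrative)

def partition(key: str) -> int:
    # Deterministic hash so every mapper routes a given key to the same reducer.
    return zlib.crc32(key.encode()) % R

# One mapper's intermediate output; the dict stands in for its R output files.
mapper_output = [("cat", 1), ("dog", 1), ("cat", 1), ("fish", 1)]
buckets = {r: [] for r in range(R)}
for key, value in mapper_output:
    buckets[partition(key)].append((key, value))
```

When one key dominates, the reducer that owns its bucket does most of the work; rescheduling cannot help, because re-running that reduce task produces the same skewed partition. This is the straggler case MapReduce does not solve.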
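To show how little the programmer actually writes, here is an in-process word-count sketch of the programming model, with plain Python lists standing in for HDFS files and the shuffle done by a local sort.

```python
from itertools import groupby

def map_fn(document):
    # User-supplied map: emit (word, 1) for every word.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # User-supplied reduce: sum the counts for one word.
    yield (word, sum(counts))

def run_mapreduce(documents):
    # Map phase: every "mapper" emits intermediate key-value pairs.
    intermediate = [pair for doc in documents for pair in map_fn(doc)]
    # Shuffle phase: the framework groups values by key.
    intermediate.sort(key=lambda kv: kv[0])
    results = []
    for key, group in groupby(intermediate, key=lambda kv: kv[0]):
        results.extend(reduce_fn(key, (v for _, v in group)))
    return results

print(run_mapreduce(["the cat", "the dog the cat"]))
# [('cat', 2), ('dog', 1), ('the', 3)]
```

Everything else (partitioning, scheduling, rescheduling stragglers, fault tolerance) is the framework's job, which is the "simple programming model" pro above.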