Response time (latency) is high because of the disk I/O penalty, so interactive data analysis is not feasible
Iterative applications (e.g., ML) are slow: roughly 90% of the time is spent on I/O and the network
The MapReduce abstraction is not expressive enough; different use cases are now served by different specialized systems
What is Spark?
Berkeley’s extensions to Hadoop
Goal 1: keep and share data sets in main memory
Which leads to the problem of fault tolerance
Which is solved by lineage and small partitions
Goal 2: Unified computation abstraction
Which makes iterations viable (local work + message passing = BSP = Bulk Synchronous Parallel); see the sketch after this list
Which leads to the problem of how to balance ease of use and generality
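A minimal Scala sketch of both goals. Everything concrete here is an assumption, not from the notes: a local master, a hypothetical input path hdfs:///data/points.txt with comma-separated feature values followed by a label, and 3 features. The parsed RDD is cached in main memory once and shared by every gradient-descent iteration, so each pass is local work plus one aggregation step instead of a full re-read from disk:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical setup: local master and input path, for illustration only
    val sc = new SparkContext(new SparkConf().setAppName("iter-sketch").setMaster("local[*]"))

    val points = sc.textFile("hdfs:///data/points.txt")   // hypothetical path
      .map { line =>
        val cols = line.split(",").map(_.toDouble)
        (cols.init, cols.last)                             // (feature vector, label)
      }
      .cache()                                              // Goal 1: keep and share the data set in main memory

    var w = Array.fill(3)(0.0)                              // assumed 3 features, illustrative only
    for (_ <- 1 to 10) {                                    // Goal 2: each pass = local work + one aggregation (BSP-style)
      val gradient = points
        .map { case (x, y) =>
          val pred = x.zip(w).map { case (xi, wi) => xi * wi }.sum
          x.map(_ * (pred - y))                             // per-point gradient contribution
        }
        .reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - 0.1 * gi }
    }

    sc.stop()

Without the cache() call, every iteration would re-read and re-parse the input file, which is exactly the I/O penalty described above.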
Spark: Pros and Cons
Pros
Fast response time
In-memory computation
Data sharing and fault tolerance
Cons
Cannot handle fine-grained updates to shared state
Cannot handle datasets too large to fit in main memory
Cannot handle tasks that need very high efficiency, e.g., SIMD or GPU workloads
Why wasn’t in-memory computation and data sharing used before?
Because there was no efficient way to provide fault tolerance for in-memory computation; you would have to
Checkpoint often
Replicate data across nodes
Log each operation to node-local persistent storage
Either way, the latency ends up just as bad as before.
Spark makes fault tolerance (recovery) efficient
By using coarse-grained operations and a lineage graph instead of storing copies of the actual data
In contrast to Hadoop/GFS, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items (small partitions). This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data. If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition. Thus, lost data can be recovered, often quite quickly, without requiring costly replication.
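A short Scala sketch of what that lineage looks like in practice, assuming a local master and a hypothetical log file at hdfs:///logs/app.log whose first whitespace-separated field is a hostname (all illustrative, not from the text). Each coarse-grained transformation is recorded in the RDD’s lineage graph, so a lost partition of the final RDD can be recomputed by re-applying filter, map and reduceByKey to the needed input partitions, without replicating any data:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical setup and input path, for illustration only
    val sc = new SparkContext(new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))

    val logs         = sc.textFile("hdfs:///logs/app.log")           // hypothetical input
    val errors       = logs.filter(_.contains("ERROR"))              // same predicate applied to every item
    val errorsByHost = errors.map(line => (line.split(" ")(0), 1))   // assumes host is the first field
                             .reduceByKey(_ + _)

    println(errorsByHost.toDebugString)      // prints the lineage graph Spark keeps for recovery
    errorsByHost.collect().foreach(println)

    sc.stop()

Note that nothing is checkpointed or replicated here: the cost of recovering a lost partition is bounded by re-running these transformations on a small slice of the input.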