This talk covers ongoing work in Hadoop 1.x and 2.x to improve service availability. We're improving failover and recovery of the master nodes, adding resilience to the clients, and making the layers in the stack resilient to transient dependency outages. The goal: to make the entire stack Highly Available.
Hadoop was designed from the outset to be resilient to the failures of individual "worker nodes", because in a Yahoo!-scale datacentre, failures of worker-node servers and their disks are ongoing events. Hadoop's design reduces the noise of these failures from the beeping of pagers to statistics the ops team can look at at their leisure. The central manager nodes -the NameNode and the JobTracker- are a different story, and the fact that they are single points of failure sometimes counts against Hadoop.
This talk introduces recent availability work in Hadoop 1.x: deploying the NameNode in HA cluster systems, then doing the same for the JobTracker while extending it to be resilient to transient HDFS outages. The details of this approach are covered, including the complications encountered and the outcome of testing the entire Hadoop stack against cluster failures.
It then looks at Hadoop 2: what HDFS 2 will offer, how that complements the 1.x work with hot failover, and what we should think about doing for YARN and the applications that run in it.
The conclusion is that we can make the "Full Stack" Highly Available through a combination of effective use of existing HA infrastructure, designing new applications without central points of failure, and making each layer in the stack -and its clients- resilient to failures.