Twitter Storm and Apache Spark are Open Source solutions for real-time distributed computing and dataflow processing. Yahoo! has recently released a port of Storm for YARN: Storm-Yarn.
Therefore, if you need to build computing or dataflow processing topologies (to compute aggregates, predictions using R, or something else, …) on large data flows that feed the cluster in real time, you can now do so without building a separate cluster dedicated to Storm and without generating MapReduce jobs.
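As an illustration, here is a minimal sketch of such a topology: a spout feeding measurements to a bolt that maintains a running average, the kind of aggregate mentioned above. It assumes the Storm 0.9.x Java API (the `backtype.storm` packages; later Storm releases moved to `org.apache.storm`), and the spout is a stand-in for a real data source such as a message queue.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

import java.util.Map;
import java.util.Random;

public class RunningAverageTopology {

    // Hypothetical spout: emits one numeric measurement per tuple.
    // In a real deployment it would read from a queue (Kafka, Kestrel, ...).
    public static class MeasurementSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(random.nextDouble() * 100));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("value"));
        }
    }

    // Bolt computing a running average over the incoming stream.
    public static class AverageBolt extends BaseRichBolt {
        private long count;
        private double sum;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) { }

        @Override
        public void execute(Tuple tuple) {
            sum += tuple.getDouble(0);
            count++;
            System.out.printf("running average: %.2f%n", sum / count);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("measurements", new MeasurementSpout(), 1);
        builder.setBolt("average", new AverageBolt(), 1).shuffleGrouping("measurements");

        // Local test run; on a cluster (or on Storm-Yarn) you would use
        // StormSubmitter.submitTopology instead of LocalCluster.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("running-average", new Config(), builder.createTopology());
    }
}
```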
In a previous article, we discussed graph databases and how to process them. Apache Giraph is an Open Source solution for distributed graph processing. Facebook uses it to analyse its social graph.
Therefore, you can now rely on the resources of your Hadoop cluster to perform this kind of analysis.
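To give an idea of what Giraph programs look like, here is a sketch of single-source shortest paths, close to the example bundled with Giraph. It assumes the Giraph 1.1 `BasicComputation` API, and the source vertex id is a hypothetical choice.

```java
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

import java.io.IOException;

// Vertices exchange candidate distances at each superstep and
// vote to halt once no shorter path is found.
public class ShortestPathsComputation extends
        BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

    private static final long SOURCE_ID = 1; // hypothetical source vertex

    @Override
    public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                        Iterable<DoubleWritable> messages) throws IOException {
        if (getSuperstep() == 0) {
            vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
        }
        double minDist = (vertex.getId().get() == SOURCE_ID) ? 0d : Double.MAX_VALUE;
        for (DoubleWritable message : messages) {
            minDist = Math.min(minDist, message.get());
        }
        if (minDist < vertex.getValue().get()) {
            vertex.setValue(new DoubleWritable(minDist));
            // Propagate the improved distance to all neighbours.
            for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
                sendMessage(edge.getTargetVertexId(),
                        new DoubleWritable(minDist + edge.getValue().get()));
            }
        }
        vertex.voteToHalt();
    }
}
```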
In YARN, the ResourceManager and the NodeManager are dedicated to cluster resource management and scheduling. Since MapReduce is no longer tightly coupled to the resource manager, a Hadoop 2 cluster no longer manages mapper and reducer slots. Instead, we talk about containers, a more generic abstraction. A container can execute any kind of processing, depending on the framework you choose to use for that specific job.
Concretely, what is the difference?
Unlike Hadoop 1 slots, a container is sized by two parameters: the amount of memory and the number of virtual CPU cores.
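For illustration, here is a minimal sketch of how an ApplicationMaster would express such a request with the Hadoop 2 `AMRMClient` API; the sizes chosen below are arbitrary.

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerSizing {
    public static void main(String[] args) {
        // A container is sized by memory (in MB) and virtual CPU cores.
        Resource capability = Resource.newInstance(1024, 2);
        Priority priority = Priority.newInstance(0);

        // The request an ApplicationMaster hands to the ResourceManager;
        // nodes and racks are left null so the scheduler can place it anywhere.
        ContainerRequest request = new ContainerRequest(capability, null, null, priority);
        System.out.println("Requesting: " + request.getCapability());
    }
}
```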
The API used to develop a distributed computing framework on YARN (called an application) is a bit complex. New libraries have been written to simplify such development, thus fostering the integration of new frameworks on top of YARN.
Therefore, if you have a computing need that is currently not properly fulfilled by the existing frameworks, or if the cost of porting your existing distributed code is too high, you can write your own YARN application and thus take better advantage of the Hadoop distributed filesystem and resource allocation.
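As a rough sketch of what the client side of such an application looks like with the Hadoop 2 `YarnClient` API (the application name and the ApplicationMaster launch command below are placeholders):

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

import java.util.Collections;

public class SubmitApplication {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("my-yarn-application"); // hypothetical name

        // The container that will run the ApplicationMaster; the command is
        // a placeholder for whatever launches your own AM implementation.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                null, null,
                Collections.singletonList("java my.company.MyApplicationMaster"),
                null, null, null);
        context.setAMContainerSpec(amContainer);
        context.setResource(Resource.newInstance(512, 1));

        ApplicationId appId = yarnClient.submitApplication(context);
        System.out.println("Submitted application " + appId);
    }
}
```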
Hadoop 2 now officially supports Windows Server and Windows Azure. Moreover, Microsoft is increasingly involved in Hadoop development.
Therefore, if your IT infrastructure and your internal skills are mainly based on Windows platforms, moving to Hadoop is no longer synonymous with moving to GNU/Linux, which can cut some superfluous training costs.
HDFS also benefits from some major improvements, even though some of them are already widespread in the most common Hadoop distributions.
One of the first major improvements of HDFS 2 is the High Availability of the NameNode. You may already know it, since it has already been shipped in most if not all Hadoop distributions.
It is now possible to create read-only snapshots of HDFS directories.
These snapshots have the following important features:

- Snapshot creation is instantaneous: the cost is O(1), regardless of the size of the directory.
- No data is copied: a snapshot only records the block list and the file sizes, so it consumes no additional storage for unchanged data.
- Regular HDFS operations are not adversely affected: modifications are recorded in reverse chronological order, so the current data can still be accessed directly.
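Below is a minimal sketch of taking a snapshot through the Hadoop 2 Java API; the directory path and snapshot name are hypothetical. Note that allowing snapshots on a directory is an administrative operation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points to an HDFS cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path("/user/data"); // hypothetical directory

        // An administrator must first allow snapshots on the directory.
        ((DistributedFileSystem) fs).allowSnapshot(dir);

        // Create the snapshot; it is exposed under /user/data/.snapshot/backup-1
        Path snapshot = fs.createSnapshot(dir, "backup-1");
        System.out.println("Snapshot created at " + snapshot);
    }
}
```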
The HDFS command-line and web clients were not very user-friendly or convenient to use, especially for non-technical people.
Now, it is possible to access HDFS through an NFS share. Therefore, HDFS can be viewed and manipulated like any other network share.
Used in conjunction with Kerberos, which is already supported in Hadoop, it is now possible to provide HDFS shares to non-technical people in your company, thus facilitating its adoption.
However, it is a new feature and, like any other new feature, it will be necessary to assess it first in order to understand how HDFS-specific characteristics (block sizes, write-once files, append-only semantics, ...) impact the use of an NFS share.
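To illustrate the point, the sketch below browses HDFS through a hypothetical NFS mount point (/mnt/hdfs) using nothing but the standard Java file APIs: no Hadoop client library is involved.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadOverNfs {
    public static void main(String[] args) throws IOException {
        // Hypothetical mount point where the HDFS NFS gateway is mounted.
        Path hdfsMount = Paths.get("/mnt/hdfs/user/data");

        // HDFS is browsed like any local directory.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(hdfsMount)) {
            for (Path entry : entries) {
                System.out.println(entry.getFileName() + " (" + Files.size(entry) + " bytes)");
            }
        }
    }
}
```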
With this new major release, Hadoop has reached a new maturity level and is now ready for our IT departments and corporations.
Moreover, the integration of NFS, snapshots, and the ability to use new distributed computing frameworks clearly make Hadoop the glue that facilitates the development of large-scale processing on large volumes of data.