The stress test is a very important step when you go live. A good stress test tells you how the platform behaves under load and where its limits lie.
Hadoop is not a web application, a database or a web service: you don't stress test a Hadoop job with a heavy load of requests. Instead, you benchmark the cluster, which means assessing its performance by running a variety of jobs, each focused on a specific field (indexing, querying, predictive statistics, machine learning, ...).
Intel has released HiBench, a tool dedicated to running such benchmarks. In this article, we will talk about this tool.
HiBench is a collection of shell scripts published under the Apache License 2.0 on GitHub: https://github.com/intel-hadoop/HiBench
It lets you stress test a Hadoop cluster according to several usage profiles.
The first test, WordCount, distributes the counting of the number of words of a data source.
The data source is generated by a HiBench preparation script which relies on Hadoop's randomtextwriter.
This test belongs to a class of jobs which extract a small amount of information from a large data source.
It is a CPU bound test.
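For reference, the prepare and run stages boil down to something like the following sketch, assuming a Hadoop 1.x examples jar; the HDFS paths and the 1 GB size are only illustrative:

```
# Generate random text with Hadoop's randomtextwriter (the HiBench
# preparation script wraps this job), here about 1 GB of data
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar randomtextwriter \
    -D test.randomtextwrite.total_bytes=1073741824 \
    /benchmarks/wordcount/input

# Run the stock WordCount job against it
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount \
    /benchmarks/wordcount/input /benchmarks/wordcount/output
```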
The next test, Sort, distributes the sorting of a data source.
The data source is generated by a preparation script which relies on Hadoop's randomtextwriter.
This test is the simplest one you can imagine. Indeed, both Map and Reduce stages are identity functions. The sorting is done automatically during the Shuffle & Merge stage of MapReduce.
It is I/O bound.
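Under the hood this boils down to the stock sort example; a minimal sketch, assuming Text keys and values as produced by randomtextwriter, with illustrative paths:

```
# Map and Reduce are identity functions, so all the work happens
# in the Shuffle & Merge stage
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar sort \
    -outKey org.apache.hadoop.io.Text \
    -outValue org.apache.hadoop.io.Text \
    /benchmarks/sort/input /benchmarks/sort/output
```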
The TeraSort test also distributes the sorting of a data source.
The data source is generated by the TeraGen job which creates, by default, 1 billion lines of 100 bytes each.
These lines are then sorted by TeraSort. Unlike Sort, TeraSort provides its own input and output formats as well as its own Partitioner, which ensures that the keys are evenly distributed across all the nodes.
It is therefore an improved Sort which aims at putting an equal load on all the nodes during the test.
With this specificity, this test is CPU bound during the Map stage and I/O bound during the Reduce stage.
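Both jobs can also be launched by hand; a sketch with illustrative paths and the default sizes:

```
# TeraGen: 1 billion rows of 100 bytes each, i.e. roughly 100 GB
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar teragen \
    1000000000 /benchmarks/terasort/input

# TeraSort: sorts the rows with its own formats and partitioner
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar terasort \
    /benchmarks/terasort/input /benchmarks/terasort/output
```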
The DFSIOE test is dedicated to HDFS. It aims at measuring the aggregated I/O rate and throughput of HDFS during reads and writes.
During its preparation stage, a data source is generated and put on HDFS.
Then, two tests are run: a read test and a write test.
The write test is basically the same thing as the preparation stage.
This test is I/O bound.
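DFSIOE is an enhanced version of the TestDFSIO job shipped with Hadoop; the stock job can be run as follows (the number of files and the file size are arbitrary examples):

```
# Write test: 64 files of 1 GB each
hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO \
    -write -nrFiles 64 -fileSize 1024

# Read test: reads the same files back
hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO \
    -read -nrFiles 64 -fileSize 1024
```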
The Nutch Indexing test focuses on the performance of the cluster when it comes to indexing data.
To do so, the preparation stage generates the data to be indexed.
Then, indexing is performed with Apache Nutch.
This test is I/O bound with a high CPU utilization during the Map stage.
The PageRank test measures the performance of the cluster on PageRank jobs.
The preparation phase generates the data source in the form of a graph which can be processed using the PageRank algorithm.
Then, the actual ranking is performed by a chain of 6 MapReduce jobs.
This test is CPU bound.
The Bayes test performs a probabilistic classification of a data source using the Naive Bayes algorithm, which is explained in depth on Wikipedia.
The preparation stage generates the data source.
Then, the test chains two MapReduce jobs with Mahout: seq2sparse, which turns the raw text into term vectors, and the training of the classifier itself.
This test is I/O bound with a high CPU utilization during the Map stage of seq2sparse.
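As a sketch, the two chained jobs look roughly as follows; the paths are illustrative and the exact sub-commands and flags depend on the Mahout version bundled with your HiBench checkout:

```
# Turn the generated text (stored as SequenceFiles) into TF-IDF vectors;
# the Map stage of this job is the CPU-heavy part of the test
mahout seq2sparse -i /benchmarks/bayes/input \
    -o /benchmarks/bayes/vectors -ow

# Train the Naive Bayes model on the resulting vectors
mahout trainnb -i /benchmarks/bayes/vectors/tfidf-vectors \
    -o /benchmarks/bayes/model -li /benchmarks/bayes/labelindex -el -ow
```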
When using this test, we didn't observe any real load on the cluster. It looks like it is necessary to either provide your own data source or greatly increase the size of the data generated during the preparation stage.
The K-Means test partitions a data source into several clusters in which each element belongs to the cluster with the nearest mean.
The algorithm is explained in depth on Wikipedia.
The preparation stage generates the data source.
Then, the algorithm runs on this data source through Mahout.
The K-Means clustering algorithm is composed of two stages: an iteration stage, which recomputes the centroids until convergence, and a clustering stage, which assigns each element to its final cluster.
Each of these stages runs MapReduce jobs and has a specific usage profile: the iterations are CPU bound while the final clustering is rather I/O bound.
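With Mahout, a run looks roughly like this; paths and parameters are illustrative, and the -cl flag is what triggers the final clustering stage after the iterations:

```
# Run at most 5 iterations over 10 clusters, then (-cl) assign each
# point to its nearest final centroid
mahout kmeans -i /benchmarks/kmeans/input \
    -c /benchmarks/kmeans/initial-centroids \
    -o /benchmarks/kmeans/output \
    -k 10 -x 5 -cl
```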
This class of tests, hivebench, performs queries that correspond to the usage profile of business analysts and other database users.
The data source is generated during the preparation stage.
Two tables are created: rankings, which stores pages and their rank, and uservisits, which logs the visits (source IP, visited page, generated ad revenue, ...).
This is a common schema that can be found in many web applications.
Once the data source has been generated, two Hive queries are performed: a join and an aggregation.
These tests are I/O bound.
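To give an idea of the two query shapes, here is a simplified sketch; the exact column names may differ depending on the HiBench version, and the actual statements live in HiBench's hivebench scripts:

```
# Aggregation: total ad revenue per source IP
hive -e "SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP"

# Join: ad revenue correlated with the rank of the visited pages
hive -e "SELECT u.sourceIP, AVG(r.pageRank), SUM(u.adRevenue)
         FROM rankings r JOIN uservisits u ON (r.pageURL = u.destURL)
         GROUP BY u.sourceIP"
```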
Running HiBench is not very hard. First, get the scripts from the GitHub repository onto a machine which can submit jobs to the cluster.
Then, the file bin/hibench-config.sh contains all the options to tune before starting the stress test: the HDFS directory where you want to write both the source and result data, the path of the final report file on the local filesystem, ...
Once configured, ensure that the HDFS directory where you want to write your data sources and your results exists on the cluster and run the command bin/run-all.sh. Now you can take a coffee... or two.
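In short, the whole sequence looks like this (the HDFS path is just an example and must match what you set in bin/hibench-config.sh):

```
# 1. Fetch HiBench
git clone https://github.com/intel-hadoop/HiBench.git
cd HiBench

# 2. Tune the options: HDFS data directory, report path, ...
vi bin/hibench-config.sh

# 3. Make sure the target HDFS directory exists
hadoop fs -mkdir /benchmarks

# 4. Launch the whole suite
bin/run-all.sh
```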
Results are written into the hibench.report file with the following CSV format:
test_name end_date <jobstart_timestamp,jobend_timestamp> size_in_bytes duration_ms throughput
Beware that the actual result file does not contain the column header above.
The DFSIOE test also writes a CSV file and an interpretation of its results in its own dfsioe subdirectory.
Presently, HiBench runs on Hadoop 1.0. This means that the latest versions of the Cloudera or Hortonworks distributions, for example, won't be able to run all the tests since they rely on Hadoop 2.
However, the effort necessary to support Hadoop 2 is not that big for the majority of the tests since it is mainly a matter of updating configuration parameter names.
Also, HiBench alone is not enough to produce a good stress test report. You also need to retrieve the information provided by the JobTracker/ResourceManager, such as the mean execution time of the Maps, Reduces, Shuffles and Merges of every job, in order to build an accurate final report.
This is a gap that HiBench tries to address through its wiki page, which invites you to post your results, but so far without success.
Building a public benchmark repository that provides a set of meaningful metrics against which to compare a cluster is still an unsolved issue, but it would be interesting and quite useful.
An alternative to HiBench exists, but it is more focused on a specific usage profile.
GridMix is included in Hadoop alongside the example jobs like TeraSort, Sort, ...
However, it generates MapReduce jobs which are focused on sorting large amounts of data and it does not cover other profiles like machine learning.
In spite of these drawbacks, HiBench greatly simplifies the benchmarking of a Hadoop cluster.
In the future, this domain will certainly see new tools with more functionality and a better coverage of the different usage profiles. This is only the beginning.