In the first part, I described the potential interest of using Hadoop for Value At Risk calculation in order to analyze intermediate results. In the three (2,3, 4) next parts I have detailled how to implement the VAR calculation with Hadoop. Then in the fifth part, I have studied how to analyse the intermediate results with Hive. I will finally give you now some performance figures on Hadoop and compare them with GridGain ones. According to those figures, I will detail some performance key points by using such kind of tools. Finally, I will conclude why Hadoop is an interesting choice for such kind of computation.
With that implementation it is now possible to grab some performance measures. I used a virtual machine with 2 GB RAM and 4 cores on my laptop, a DELL Latitude E6510 with a core i7 quad core and 4 GB of RAM. Indeed, Hadoop is easier to install on Linux bearing in mind I'm using Windows for my day to day work. Performance comparison cannot be made directly with previous measures taken on a physical machine. So I replayed the GridGain run in which all results are stored on disk on the virtual machine. I have done the measures with 1 and then 4 CPUs. The following graph clearly shows the gain of using Hadoop in both cases: To be honest, I should have said the loss due to Hadoop is, in most cases, slower than GridGain. Hadoop is only quicker when we use one thread and 100,000 draws. I used Hadoop in that case in local (standalone) mode with one process working on the local file system. Even if I don't show you the figures, I should mention the fact that order of magnitude for Hive requests are very similar: 30 to 50 s. per request is a minimum. I shall conclude that distribution based on a distributed file system is more costly than the RPC based approach, even if I have to write the data to the disk with GridGain. It seems to me that the main reasons are:
For large volume of data, I could not do performance measures with GridGain due to my 32 bits architecture which limits my heap size. However, beginning with 100,000 draws we can see that Hadoop is as quick as GridGain. So my next step will be to analyze performance for larger volumes of data with the two implementations I have described and some optimizations.
Values are written in a text for or in a binary form.
The VAR is extracted by the main()
function or by the reduce phase
Some configuration parameters have been tuned as described below:
#core-site.xml
io.file.buffer.size=131072
#hdfs-site.xml
dfs.block.size=134217728
#mapred-site.xml
mapred.child.java.opts=-Xmx384m
io.sort.mb=250
io.sort.factor=100
mapred.inmem.merge.threshold=0.1
mapred.job.reduce.input.buffer=0.9
In brief, it allows having a bigger heap (mapred.child.java.opts
), processing larger batch in memory before going to disk (io.file.buffer.size
, io.sort.mb
, io.sort.factor
, mapred.inmem.merge.threshold
, mapred.job.reduce.input.buffer
) and reading/writing larger blocks on HDFS (dfs.block.size
).
These experiments show that the response time is quite linear for very high number of draws. So Hadoop bad performance is mainly due to a higher overhead compared to GridGain. In order to be able to clearly see the gain between GridGain and these different tests I constructed an indicator with a "Throughput dimension" in the sense of dimensional analysis: It allows us to compare the relative performance of GridGain and of the different implementations on Hadoop in a first order of magnitude. The represented runs are the same as above but the scale of the vertical axis is linear in order to see the gain more clearly.
This graph shows that on one laptop, Hadoop has a too high overhead in order to be competitive for less than 1,000,000 draws. Yet, for higher number of draws the relative performance is better than GridGain writing intermediate data to disk. A peak - a maximum throughput - appears on the lines. I guess it is due to I/O limitations as tuning that point moves it to the right (The peak is reached for a higher number of draws for Hadoop with reduce optimized). Finally a fully distributed test has been run. The results were correct up to 10,000,000 draws but the performance was bad, at maximum 1.3 better than with one PC. The experiment conditions were not very good, in particular disk size constraint on the other PC leads to very bad blocks repartition. Yet, it remains that the distribution for the VAR calculation of one scenario does not scale well to multiple PCs. The number of writes is very important leading to high number of blocks transfers. And the implementation with only one reduce phase needs to process all the data for reduce on a single machine. Distribution of independent scenarios should be much more efficient because the reduce phase can be distributed. But I didn't verify it with a performance run. Multiple other optimizations have been tested during investigations but with few or no gain at all. The less bad one was the binary comparator described in the 4th. article. It was tested with 100,000 scenarios of 1,000 draws which is the most favorable case - reading only the scenario key compares 99.9% of the results. Binary comparator allows computing 1.19x quicker in that particular example.
Hadoop is able to perform distributed Value At Risk calculation. It cannot compete directly with tools like GridGain as it has been designed to process large volumes of data on disk. However, in the case of VAR calculation with further analysis of all intermediate results, it does indeed provide a better framework. First, optimized file system distribution and job distribution in a colocalized way provides out of the box a very good scalability (up to 1,000,000,000 draws, 15.4 GB of compressed binary data on a single laptop). Then, being able to do BI directly on the file system removes a lot of transfer costs for large volume of data. From a developer point of view, Hadoop implements the same map/reduce pattern as GridGain. However, code has to be designed with Hadoop distribution mechanism to be really efficient. Performing financial computing on Hadoop can still be considered today as an R&D subject as the tools are still very young. Distribution for computer intensive tasks is used for about 10 years in Corporate and Investment banks. Benefiting from distribution on storages has allowed big internet actors to process huge volumes of data due to reasonable storage and processing costs that were not affordable with big and monolithic systems. Different initiatives around distributed storage and data processing - with the NoSQL movement - show that more and more integrated tools are currently in development. Such architectures can both help solving particular problems that do not fit very well with traditional approach or allow new data analysis in use cases where processing TB of data was a limitation with traditional architectures.