This talk will briefly show why Apache Hama's BSP model can be a good choice for running machine learning algorithms on huge data sets and how it differs from common MapReduce-based solutions, with some examples and benchmarks.
Apache Hama is a pure Bulk Synchronous Parallel (BSP) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph and network algorithms. Recently the community started working on a machine learning module that incorporates some widely known algorithms in their BSP form, starting with collaborative filtering and clustering tasks. This talk will briefly present the BSP paradigm and the Apache Hama API, and how they have been leveraged to create efficient machine learning algorithms. Benchmarks will also be provided to compare the BSP implementations with existing machine learning implementations (i.e. MapReduce-based ones), highlighting the pros and cons of each solution.
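As a rough illustration of the BSP paradigm mentioned above, the following Python sketch simulates two supersteps in a single process: each peer computes on its local partition, sends a partial result as a message, and the results are combined only after a barrier. It does not use the Apache Hama API; the function name `bsp_mean`, the `inbox`/`outbox` structures and the choice of peer 0 as the collector are assumptions made purely for this example. Aggregating partial sums in this way resembles the centroid update step of a clustering algorithm, but this is a simplified sketch and not the implementation shown in the talk.

```python
from collections import defaultdict


def bsp_mean(partitions):
    """Simulate two BSP supersteps that compute a global mean.

    `partitions` is a list of non-empty lists, one per simulated peer.
    All names here are hypothetical; this is not the Apache Hama API.
    """
    num_peers = len(partitions)
    outbox = defaultdict(list)  # messages queued during the current superstep

    # Superstep 1: each peer works on its local data only, then queues
    # a (partial_sum, count) message for peer 0.
    for peer in range(num_peers):
        values = partitions[peer]
        outbox[0].append((sum(values), len(values)))

    # Barrier synchronization: queued messages become visible to their
    # receivers only once every peer has finished the superstep.
    inbox, outbox = outbox, defaultdict(list)

    # Superstep 2: peer 0 combines the partial results it received.
    total = sum(partial_sum for partial_sum, _ in inbox[0])
    count = sum(n for _, n in inbox[0])
    return total / count


partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
print(bsp_mean(partitions))  # prints 3.5
```

In an iterative algorithm the two supersteps would repeat until convergence, with peers keeping their state in memory between iterations, which is one commonly cited difference from chaining separate MapReduce jobs.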