Watershed-ng: an extensible distributed stream processing framework
Date
2016
Abstract
Most high-performance data processing (a.k.a. big data) systems allow users to express their computation
using abstractions (like MapReduce), which simplify the extraction of parallelism from applications. Most
frameworks, however, do not allow users to specify how communication must take place: that element is
deeply embedded in the run-time system abstractions, making changes hard to implement. In this work, we
describe Watershed-ng, our re-engineering of the Watershed system, a framework based on the filter–stream
paradigm and originally focused on continuous stream processing. Like other big-data environments,
Watershed provided object-oriented abstractions to express computation (filters), but the implementation
of streams was a run-time system element. Isolating stream functionality into appropriate classes makes it
possible to combine communication patterns and to reuse common message-handling functions (like compression
and blocking). The new architecture also allows the design of new communication patterns; for example,
users can choose MPI, TCP, or shared-memory implementations of communication
channels as their problem demands. Applications designed for the new interface showed code size
reductions of 50% or more in some cases. The performance results also showed significant
improvements, because some implementation bottlenecks were removed in the re-engineering process.
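
The idea of isolating stream functionality into composable classes can be illustrated with a minimal sketch. This is not the Watershed-ng API: every name below (Channel, InMemoryChannel, CompressedChannel, BlockingChannel, block_size) is hypothetical, the in-memory queue merely stands in for an MPI, TCP, or shared-memory transport, and "blocking" is interpreted here as grouping messages into blocks, which is an assumption.

```python
import pickle
import zlib
from abc import ABC, abstractmethod
from collections import deque


class Channel(ABC):
    """Hypothetical stream endpoint that carries opaque byte messages."""

    @abstractmethod
    def send(self, payload: bytes) -> None: ...

    @abstractmethod
    def receive(self) -> bytes: ...


class InMemoryChannel(Channel):
    """Stand-in transport; an MPI or TCP channel would expose the same interface."""

    def __init__(self) -> None:
        self._queue = deque()

    def send(self, payload: bytes) -> None:
        self._queue.append(payload)

    def receive(self) -> bytes:
        return self._queue.popleft()


class CompressedChannel(Channel):
    """Wraps any channel and compresses payloads with zlib."""

    def __init__(self, inner: Channel) -> None:
        self._inner = inner

    def send(self, payload: bytes) -> None:
        self._inner.send(zlib.compress(payload))

    def receive(self) -> bytes:
        return zlib.decompress(self._inner.receive())


class BlockingChannel(Channel):
    """Wraps any channel and forwards messages in blocks of block_size."""

    def __init__(self, inner: Channel, block_size: int) -> None:
        self._inner = inner
        self._block_size = block_size
        self._outgoing = []       # messages waiting to fill a block
        self._incoming = deque()  # messages unpacked from the last block

    def send(self, payload: bytes) -> None:
        self._outgoing.append(payload)
        if len(self._outgoing) >= self._block_size:
            self.flush()

    def flush(self) -> None:
        # Ship whatever is buffered as a single serialized block.
        if self._outgoing:
            self._inner.send(pickle.dumps(self._outgoing))
            self._outgoing = []

    def receive(self) -> bytes:
        if not self._incoming:
            self._incoming.extend(pickle.loads(self._inner.receive()))
        return self._incoming.popleft()


# Compose the layers: blocks of three messages, compressed, over the stand-in transport.
transport = InMemoryChannel()
stream = BlockingChannel(CompressedChannel(transport), block_size=3)
for word in (b"alpha", b"beta", b"gamma"):
    stream.send(word)
blocks_in_transit = len(transport._queue)  # the three messages travel as one block
received = [stream.receive() for _ in range(3)]
```

Because each wrapper depends only on the Channel interface, swapping the transport or reordering the compression and blocking layers requires no change to the code that sends and receives messages, which is the kind of reuse the abstract describes.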
Keywords
Distributed systems, Watershed, Big data, Frameworks
Citation
ROCHA, R. et al. Watershed-ng: an extensible distributed stream processing framework. Concurrency and Computation, v. 28, p. 2487-2502, Jan. 2016. Available at: <https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.3779>. Accessed: 3 May 2023.