Writing Big Data Pipelines: the Apache Beam Project
Neal Glew – Software Engineer
Google, Inc., Sunnyvale, California, USA
Apache Beam is an open-source project for writing big-data pipelines (from terabytes to petabytes and beyond). At its heart is a programming model that unifies batch and stream processing, allowing the programmer to separate the what, where, when, and how of processing: what actual processing is performed on the data; where in event time that processing is done, i.e. how event times are windowed; when in processing time results are materialised; and how updates to results (e.g. due to late data) are combined. Beam also provides several language-specific SDKs that instantiate the model for particular languages: Java and Python are currently available, and Go is under development. In addition, Beam provides a portability framework that allows pipelines to run on a variety of execution technologies. Beam itself provides a reference runner, there are efforts to develop runners based on Apache Flink and Apache Spark, and Google provides a commercial managed runner on Google Cloud. Beam builds on the work of MapReduce, Hadoop, Flume, Spark, and Flink. In this talk I will give an overview of the Beam programming model and briefly describe the portability framework.
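To make the "where in event time" dimension concrete, here is a minimal plain-Python sketch (not the Beam API; `fixed_windows` is a hypothetical helper invented for illustration) of assigning events to fixed event-time windows and combining a sum per window. In a real Beam Python pipeline this would be expressed with transforms such as `beam.WindowInto(window.FixedWindows(60))` followed by a combine.

```python
from collections import defaultdict

def fixed_windows(events, window_size):
    """Toy illustration of fixed event-time windowing (not the Beam API).

    Each (event_time, value) pair is assigned to the window
    [n * window_size, (n + 1) * window_size) containing its event
    timestamp, and values are summed per window ('what' = a sum combine).
    """
    windows = defaultdict(int)
    for event_time, value in events:
        # Window assignment depends only on the event timestamp,
        # not on when the element arrives in processing time.
        window_start = (event_time // window_size) * window_size
        windows[window_start] += value
    return dict(windows)

# Events arrive out of order in processing time, but grouping by
# event timestamp puts late data into the correct window anyway.
events = [(3, 1), (65, 2), (10, 5), (62, 4)]
print(fixed_windows(events, 60))  # {0: 6, 60: 6}
```

This separates the "what" (the sum) from the "where" (60-second fixed windows); Beam's model additionally lets triggers control "when" results are emitted and accumulation modes control "how" refinements combine.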
Wednesday 13th February 2019 – 9:20 am – 9:50 am
Software Engineer, DataPLS, Google
Neal is a software engineer on the Flume project at Google, where he works mostly on the shuffle system. He previously worked on parallel programming models at Intel Labs. He has a PhD in computer science from Cornell University and a BSc (Hons) in computer science from Victoria University of Wellington.