Stop wasting time fighting fires in your data processing pipelines

Data Reply can help you simplify using Apache Airflow,
allowing you to get on with running your business

Apache Airflow

Scenario

For organisations seeking value from Big Data, time is of the essence. Recurring tasks need to be performed as soon as the data is ready to be processed. Purely time-based execution oversimplifies this and will not always meet the business's needs. By deploying Apache Airflow, Data Reply helps our clients benefit from both time-driven and event-driven task execution, enabling streamlined reporting and analytics, easier-to-manage Machine Learning pipelines, and more reliable delivery of data to your customer app or website.

APACHE AIRFLOW

Apache Airflow is an orchestration engine for building data pipelines with task dependencies. It provides close monitoring of the entire workflow, as well as of individual task performance over time, which allows for continuous improvement of the data pipeline and gives you a reliable and transparent basis for enforcing SLAs. Apache Airflow scales easily with increasing workloads and detects underperforming tasks for debugging.

ORCHESTRATION WITH APACHE AIRFLOW

At its core, Apache Airflow ensures that all process tasks are carried out in the right order and at the right time. The scheduling of these tasks is planned in a DAG (Directed Acyclic Graph – a way of representing how to run a workflow). One of the benefits of the DAG is that it facilitates parallelisation, meaning that several tasks can be executed simultaneously. The tasks themselves are commonly written in Python, but through operators, other technologies can be supported too.
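To make the idea of a DAG more concrete, the sketch below groups a small set of tasks into "waves" that could run in parallel. It is only an illustration of the ordering principle, written with Python's standard `graphlib` module (Python 3.9+) rather than with Airflow itself; the task names and the `execution_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
dependencies = {
    "extract_sales": set(),
    "extract_stock": set(),
    "transform": {"extract_sales", "extract_stock"},
    "load_warehouse": {"transform"},
    "refresh_report": {"load_warehouse"},
}


def execution_waves(deps):
    """Group tasks into waves.

    Tasks in the same wave do not depend on each other, so an
    orchestrator would be free to run them at the same time.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave as finished to unlock the next one
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In this example the two extract tasks land in the same wave because neither depends on the other, while the remaining tasks must wait for their predecessors.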

Apache Airflow can be used to build a data pipeline (ETL, Machine Learning, etc.) with task dependencies. It supports the scheduling of tasks and can handle task failures, so that certain actions will be triggered if a task results in an error: for example, issuing an alert, rerunning a task, or triggering alternative workflows. Also, because the DAG can branch into parallel paths, a task failure in one branch does not have to affect tasks in another.
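The failure handling described above (rerun, alert, and isolation between branches) can be sketched in a few lines of plain Python. This is a simplified illustration and not Airflow's API: `run_with_retries`, `run_pipeline`, the task names and the state labels are all hypothetical.

```python
def run_with_retries(task, retries, on_failure):
    """Run task(); rerun it up to `retries` more times, then alert on failure."""
    last_error = None
    for _ in range(retries + 1):
        try:
            task()
            return True
        except Exception as exc:  # any error counts as a failed attempt
            last_error = exc
    on_failure(last_error)  # every attempt failed, so raise the alert
    return False


def run_pipeline(tasks, upstream, retries=1, alert=print):
    """Run tasks in the order given, which must respect their dependencies.

    tasks: dict of task name -> callable.
    upstream: dict of task name -> set of task names it depends on.
    """
    state = {}
    for name, task in tasks.items():
        # A task is skipped only if one of its own upstream tasks did not succeed.
        if any(state[dep] != "success" for dep in upstream.get(name, ())):
            state[name] = "skipped"
            continue
        ok = run_with_retries(
            task, retries, lambda exc, n=name: alert(f"ALERT: {n} failed: {exc}")
        )
        state[name] = "success" if ok else "failed"
    return state


attempts = {"load_sales": 0}


def flaky_load_sales():
    # Fails on the first attempt only, so a single rerun recovers it.
    attempts["load_sales"] += 1
    if attempts["load_sales"] == 1:
        raise RuntimeError("temporary network error")


def broken_load_stock():
    raise ValueError("malformed file")  # fails on every attempt


def build_report():
    pass


tasks = {
    "load_sales": flaky_load_sales,
    "load_stock": broken_load_stock,
    "report_sales": build_report,
    "report_stock": build_report,
}
upstream = {"report_sales": {"load_sales"}, "report_stock": {"load_stock"}}

alerts = []
state = run_pipeline(tasks, upstream, retries=1, alert=alerts.append)
print(state)
print(alerts)
```

Here the sales branch recovers after one rerun and completes, while the stock branch raises an alert and its report is skipped, without affecting the sales branch.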

Apache Airflow offers a user interface that provides close monitoring of the entire workflow as well as individual task performance over time. This is essential for the continuous improvement of the data pipeline and gives you a reliable and transparent basis to enforce SLAs. Apache Airflow easily scales with increasing workloads, and will also detect underperforming tasks for debugging.

Why Data Reply?

Data Reply consultants are Apache Airflow experts. Our team includes developers who actively contribute to the open source project's codebase, and one of our team members sits on the Apache Airflow PPMC (the committee that oversees the project).

We have comprehensive experience in Big Data technologies, many of which can be orchestrated through Apache Airflow. Our experience across a wide array of industries means we have already encountered the common problems and can share best practice. Data Reply helps companies build custom features on top of Airflow to fit their needs and use cases.

By way of example, Data Reply built a configurable and automated Data Pipeline on Google Cloud Platform for a leading UK retailer. As soon as the data arrives in the Data Lake (Cloud Storage), Apache Airflow moves the data to a staging bucket and then inserts this data into an ODS (Operational Data Store) table in BigQuery (Google’s managed, petabyte-scale, low-cost enterprise data warehouse). Airflow then orchestrates joins to create a new table in a BigQuery Data Mart, to be accessed by Data Visualisation tools such as Tableau. The entire pipeline was automated, reducing the pipeline latency (the time taken from data arrival to report generation) from one week to a single day.

    Data Reply

    Data Reply is a Reply Group company that specialises in Big Data and Analytics. Our main focus is helping clients to run successful data engineering and machine learning projects. We are based in London, Munich and Milan. www.datareply.co.uk