Thursday, January 14, 2016

Designing workflow with Airflow

I have been using Oozie for a while now and was a little dissatisfied with it, both for managing Hadoop jobs and, not to mention, for debugging its vague errors. While evaluating alternative workflow engines, Airflow by Airbnb caught my eye. I'll skip the introduction for now; you can read more about it here. This post highlights its key features and demonstrates a Hadoop job.

Before I begin with the example, I'd like to mention the key advantages of Airflow over other tools:
  • Amazing UI for viewing the job flow (DAG), run stats, logs, etc.
  • You write an actual Python program instead of ugly configuration files
  • Exceptional monitoring options for batch jobs
  • Ability to query metadata and generate custom charts
  • Many contributors in the developer community have worked with or evaluated the other similar tools, so Airflow keeps picking up the best of each as it evolves.


Moving on to the example, let's consider an "Orders" table in a MySQL database that keeps receiving new order records over time, and our job is to load the new records into a Hive target table (assume the first full load into the Hive table is already done). We'll do an incremental load via Sqoop from MySQL to HDFS and store the data in Avro format. Then we'll have a Hive external table point to it and use HQL to load the target table (Parquet format) from the external table. This is just the basic sequence of tasks; in a real use case we'd also have file-check and record-check tasks in the workflow.

Here I have used a wrapper shell script with a property file for running our Sqoop job; you can get the full code here. Our main workflow will be the following Python script:
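(The exact script is in the repo linked above; below is only a minimal sketch of such a DAG, with hypothetical task ids, script paths, and schedule, chaining the Sqoop import and the Hive load described earlier.)

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators import BashOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('orders_incremental_load', default_args=default_args,
          schedule_interval=timedelta(days=1))

# Incremental Sqoop import (MySQL -> HDFS, Avro) via the wrapper shell script.
# The trailing space keeps Airflow from treating the .sh path as a Jinja template file.
sqoop_incremental = BashOperator(
    task_id='sqoop_incremental_import',
    bash_command='/path/to/sqoop_orders_incremental.sh ',  # hypothetical wrapper script
    dag=dag)

# Load the Parquet target table from the Avro-backed external table.
hive_load_target = BashOperator(
    task_id='hive_load_target',
    bash_command='hive -f /path/to/load_orders_target.hql ',  # hypothetical HQL file
    dag=dag)

hive_load_target.set_upstream(sqoop_incremental)

Drop the file into the DAGs folder configured in airflow.cfg and the scheduler will pick it up; a single task can also be sanity-checked from the command line with airflow test.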

On submitting the workflow, we can view its DAG, the scheduled instances, the run time of each task, and the code and logs for each task.

To stay up to date on Airflow questions and issues, join the user community on their Google group.
