Apache Airflow

  • Prerequisites
    • Python
    • Docker or Linux/Mac machine
  • Astronomer.io is one of the best places to run Airflow in the cloud.
  • Install uv, an extremely fast Python package and project manager, written in Rust.
    • A single tool to replace pip, pip-tools, pipx, Poetry, pyenv, twine, virtualenv, etc.
  • For data engineering projects, refer to the following website.
  • Apache Airflow provides the following benefits to our data applications.
    • Organisation
      • Apache Airflow helps us set the order of our tasks.
      • Apache Airflow makes sure each task starts only when the previous one is complete.
      • It controls the timing of our entire data process.
      • Our tasks have predefined functionality; Apache Airflow synchronises them.
      • It acts as an automated coordinator of our data tasks.
      • Example: we need to collect data from a database, clean it, perform some calculations, and then generate a report.
        • Airflow helps us define this sequence and makes sure each step happens in the correct order, even if some tasks take longer than others.
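The collect → clean → calculate → report example above can be written as a small Airflow DAG. This is a minimal sketch, assuming Airflow 2.4 or newer is installed; the dag_id, schedule, and function bodies are illustrative placeholders, not a production pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def collect():
    print("collect data from the database")

def clean():
    print("clean the collected data")

def calculate():
    print("perform some calculations")

def report():
    print("generate the report")

with DAG(
    dag_id="daily_sales_report",   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_collect = PythonOperator(task_id="collect", python_callable=collect)
    t_clean = PythonOperator(task_id="clean", python_callable=clean)
    t_calc = PythonOperator(task_id="calculate", python_callable=calculate)
    t_report = PythonOperator(task_id="report", python_callable=report)

    # Each task starts only when the previous one completes.
    t_collect >> t_clean >> t_calc >> t_report
```

The `>>` operator is how Airflow expresses "run the left task before the right one"; the scheduler then enforces that order even if one step runs long.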
    • Visibility (our control tower)
      • Airflow gives us a bird's-eye view of our data tasks.
      • It helps us monitor the progress of our workflows.
      • Quickly identify and troubleshoot issues.
      • Understand dependencies between tasks.
      • Example: if we are running multiple workflows for different projects, Airflow provides a dashboard where we can see the status of each pipeline at a glance.
      • If one task fails, we can easily spot it and take action, rather than discovering the problem hours later when our report doesn't arrive.
    • Flexibility and scalability
      • Airflow is like a Swiss Army knife for data workflows.
      • It is versatile enough to handle a large variety of tasks and can grow with your needs.
      • This flexibility allows us to connect to many data sources and tools.
      • Start small and grow as your project gets bigger.
      • Customise your workflows to fit your exact needs.
      • You might start by using Airflow to schedule simple database queries.
      • As your needs grow, you can add more complex tasks, like training AI models, checking data quality, or even starting outside programs by triggering external APIs, all using the same Airflow system.
  • Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows.
  • Apache Airflow is a tool that helps you create, organise, and keep track of all your data tasks automatically.
  • It is a very smart to-do list for your data work that runs itself, for example extracting data from a database, processing the data, and loading the data.
  • Airflow provides the following benefits
    • Dynamic
      • Airflow can adapt and change based on what's happening.
      • Python-based, so it is easy to use and powerful; write your workflows in Python.
      • Dynamic tasks help us generate tasks based on dynamic inputs.
        • For example, if today we have data from two sources to process and tomorrow we have data from three sources, Airflow can generate a task per source dynamically without any code change.
      • Dynamic workflows help us generate workflows based on static inputs.
        • For example, we can generate our workflows from configuration files.
      • Branching helps us execute a different set of tasks based on a condition or result.
      • For example, if we are analysing quotations received and we receive more quotations today, Airflow will automatically add extra tasks to process every quotation without us having to change anything manually.
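Dynamic task generation as described above is usually just a Python loop inside the DAG file. A sketch, again assuming Airflow 2.4 or newer; the `SOURCES` list, dag_id, and `process` function are illustrative placeholders (in practice the list might be read from a configuration file):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Today this list has two sources; tomorrow it could have three.
SOURCES = ["orders_db", "payments_api"]

def process(source):
    print(f"processing data from {source}")

with DAG(
    dag_id="per_source_processing",  # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    for source in SOURCES:
        # One task per source, generated dynamically at parse time.
        PythonOperator(
            task_id=f"process_{source}",
            python_callable=process,
            op_args=[source],
        )
```

Adding a third source means adding one string to `SOURCES`; Airflow picks up the extra task the next time it parses the file, with no other change.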
    • Scalability
      • Airflow can handle both small and large amounts of work.
      • Airflow can manage a few simple tasks or hundreds of large ones.
      • Airflow provides different execution modes, which depend on your infrastructure and budget.
      • For example, if we use Airflow to process data from one point of sale, then as our business grows and the number of points of sale increases, Airflow can easily scale up to handle data from all the newly added points of sale, even hundreds of them, without needing a new system.
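To make the scaling idea concrete without requiring an Airflow installation, here is a plain-Python sketch of fanning one processing function out over many point-of-sale feeds. In real Airflow this fan-out is handled by the executor across workers; `process_pos_feed` is a made-up placeholder, not an Airflow API:

```python
from concurrent.futures import ThreadPoolExecutor

def process_pos_feed(pos_id: str) -> str:
    # Placeholder: in a real pipeline this would pull and
    # aggregate one store's sales data.
    return f"processed:{pos_id}"

def process_all(pos_ids, workers=4):
    # Fan the same task out over however many stores exist today.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_pos_feed, pos_ids))

print(process_all(["store-1", "store-2", "store-3"]))
```

Whether `pos_ids` holds one store or hundreds, the same code runs; only the worker pool (in Airflow's case, the executor and its infrastructure) needs to grow.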
    • Fully functional user interface
      • Airflow has a visual dashboard where we can see and control our tasks and workflows.
      • It is like having a control panel for your data tasks, where we can see what is happening and make changes.
      • We can monitor and troubleshoot our workflows.
      • Highlight relationships between workflows and tasks.
        • Identify dependencies between workflows.
      • Identify bottlenecks and performance metrics.
      • Manage users and roles, for instance.
      • Example: we have a task that updates our sales report. In the Airflow UI, we can see if today's update is running, check when it last ran successfully, and even pause or restart it if needed.
    • Extensibility
      • We can add new features or connect Airflow to other tools easily.
      • We can add new capabilities to do more things.
      • We can connect it to the other data tools we use.
      • Many provider packages ship with functions to interact with a tool or service (AWS, Snowflake, etc.).
      • Customisable user interface.
      • Possibility to customise existing functions.
        • We can add an abstraction layer above a function to hide the complexity of using it.
      • Example: we are using a new cloud storage service. Even if Airflow doesn't work with it out of the box, we can write a small piece of code to connect Airflow to this new service, thus extending its capabilities.
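The abstraction-layer idea above can be sketched in plain Python. In real Airflow you would typically subclass `airflow.hooks.base.BaseHook`; to stay dependency-free, this sketch uses a hypothetical `ObjectStoreHook` that wraps a made-up storage client, so tasks call one simple method instead of the raw SDK:

```python
class FakeStorageClient:
    """Stand-in for a new cloud storage service's SDK (made up)."""
    def __init__(self):
        self.objects = {}

    def put_object(self, bucket, key, data):
        self.objects[(bucket, key)] = data

class ObjectStoreHook:
    """Hypothetical hook: hides the raw client behind one method."""
    def __init__(self, client):
        self.client = client

    def upload(self, bucket, key, data):
        # Abstraction layer: tasks call upload(), never the SDK.
        self.client.put_object(bucket, key, data)
        return (bucket, key)

hook = ObjectStoreHook(FakeStorageClient())
print(hook.upload("reports", "sales.csv", b"..."))
```

Swapping in the real SDK later only changes the client passed to the hook; every task that calls `upload()` stays untouched, which is exactly how provider packages keep pipelines decoupled from individual services.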
