General Airflow Questions
Q1: Define Apache Airflow
Apache Airflow is a workflow orchestration tool that enables the creation, scheduling, and monitoring of workflows structured as Directed Acyclic Graphs (DAGs).
Q2: What are Directed Acyclic Graphs (DAGs) in Airflow?
A DAG in Airflow is a representation of tasks linked together with dependencies, executed in a sequence without forming loops.
Q3: Explain the role of Operators in Airflow.
Operators specify the work to be done by a task, such as running Python code, executing shell commands, or interacting with databases.
Q4: What does a Task represent in Airflow?
A task in Airflow represents a single unit of work within a workflow, executed independently once its upstream dependencies are met.
Q5: Which databases are compatible with Airflow for metadata storage?
Airflow supports databases such as PostgreSQL, MySQL, and SQLite (primarily for testing and development).
Scenario-Based Questions
Q1: How can you recover from a task failure in Airflow?
Configure task retries using the retries parameter and a delay between attempts using retry_delay. For custom handling, implement on_failure_callback.
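A minimal sketch of these parameters (Airflow 2.4+ style); the DAG id, the callable, and the alert function are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Airflow calls this with the task context once all retries are exhausted;
    # replace the print with Slack/email/pager logic as needed.
    print(f"Task {context['task_instance'].task_id} failed")


def fragile_work():
    ...  # placeholder for work that may fail intermittently


with DAG(
    dag_id="retry_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fragile_task",
        python_callable=fragile_work,
        retries=3,                          # retry up to three times
        retry_delay=timedelta(minutes=5),   # wait five minutes between attempts
        on_failure_callback=notify_on_failure,
    )
```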
Q2: How do you dynamically generate tasks within a DAG?
Use Python loops during DAG definition to generate tasks programmatically, grouping them with TaskGroup for readability; Airflow 2.3+ also offers dynamic task mapping (expand()) to fan tasks out at runtime.
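As a sketch (Airflow 2.4+ style), with hypothetical table names and a placeholder processing callable:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup


def process(table):
    print(f"processing {table}")  # placeholder processing logic


with DAG(
    dag_id="dynamic_tasks_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # One task per table, generated in a loop and grouped for a tidy graph view.
    with TaskGroup(group_id="process_tables"):
        for table in ["orders", "customers", "payments"]:
            PythonOperator(
                task_id=f"process_{table}",
                python_callable=process,
                op_args=[table],
            )
```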
Q3: What steps would you take to set up Airflow for near real-time task execution?
Leverage sensors such as FileSensor for event-based task triggering, shorten the DAG's schedule interval, and trigger runs externally through the REST API when lower latency is needed.
Q4: How would you schedule a DAG to execute on the last day of each month?
Use a cron schedule such as 0 0 28-31 * * and add a check (for example with ShortCircuitOperator) so the run only proceeds when the execution date is the month's final day.
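A sketch of that pattern (Airflow 2.4+ style): the cron expression fires on days 28-31 and the check lets only the month's final day through; task ids are illustrative.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import ShortCircuitOperator


def is_last_day_of_month(ds=None):
    # 'ds' is the logical date as YYYY-MM-DD; continue only if tomorrow
    # belongs to a different month.
    current = datetime.strptime(ds, "%Y-%m-%d")
    return (current + timedelta(days=1)).month != current.month


with DAG(
    dag_id="month_end_example",
    start_date=datetime(2024, 1, 1),
    schedule="0 0 28-31 * *",
    catchup=False,
) as dag:
    check = ShortCircuitOperator(
        task_id="check_last_day",
        python_callable=is_last_day_of_month,
    )
    month_end_report = EmptyOperator(task_id="month_end_report")  # placeholder work

    check >> month_end_report
```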
Q5: How do you manage tasks dependent on an external workflow?
Use ExternalTaskSensor to monitor the state of the external workflow and proceed only when its tasks are complete.
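A sketch of this pattern; the upstream DAG and task ids are placeholders, and both DAGs are assumed to share the same schedule (otherwise set execution_delta or execution_date_fn):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="downstream_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream",
        external_dag_id="upstream_dag",   # DAG to watch (placeholder)
        external_task_id="final_task",    # task to watch (placeholder)
        mode="reschedule",                # free the worker slot while waiting
        timeout=60 * 60,                  # give up after one hour
    )
    start_processing = EmptyOperator(task_id="start_processing")

    wait_for_upstream >> start_processing
```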
Architecture-Level Questions
Q1: Describe Apache Airflow's architecture.
Airflow's core components include:
- Scheduler: Orchestrates task execution by assigning them to executors.
- Executor: Executes tasks, either locally or in distributed setups.
- Metadata Database: Tracks state, DAG definitions, and execution history.
- Web Server: Provides a dashboard for monitoring and control.
Q2: What is the purpose of the Executor in Airflow?
Executors determine how and where tasks run. For example, SequentialExecutor runs one task at a time, LocalExecutor runs tasks in parallel on a single machine, and CeleryExecutor distributes them across worker nodes.
Q3: How are task dependencies enforced in Airflow?
Task dependencies are defined programmatically using operators like >>, <<, or methods like set_upstream.
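For illustration, the three placeholder tasks below get the same extract -> transform -> load ordering whichever syntax is used:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="dependency_syntax_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")

    # Bitshift syntax; equivalent to the method calls commented below.
    extract >> transform >> load
    # transform.set_upstream(extract)
    # transform.set_downstream(load)
```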
Q4: Differentiate between CeleryExecutor and KubernetesExecutor.
CeleryExecutor uses Celery workers for task execution in a distributed fashion.
KubernetesExecutor creates isolated Kubernetes pods for each task, ideal for scalability and resource management.
Q5: How does Airflow manage concurrent task execution?
Concurrency is controlled through global settings (parallelism), DAG-specific limits (max_active_runs, max_active_tasks), and task-specific parameters (task_concurrency, renamed max_active_tis_per_dag in Airflow 2.2+).
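A sketch (Airflow 2.4+ style) of the DAG- and task-level knobs; the values are illustrative, and the global parallelism setting lives in airflow.cfg rather than in the DAG file.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="concurrency_example",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
    max_active_runs=1,     # only one run of this DAG at a time
    max_active_tasks=4,    # at most four tasks of this DAG running concurrently
) as dag:
    PythonOperator(
        task_id="heavy_task",
        python_callable=lambda: None,   # placeholder workload
        max_active_tis_per_dag=2,       # per-task cap across active runs
    )
```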
Performance and Optimization Questions
Q1: How do you enhance the efficiency of a DAG in Airflow?
Minimize inter-task dependencies, utilize parallelism, split large DAGs into smaller modular workflows, and fine-tune database settings.
Q2: What is parallelism in Airflow, and how is it achieved?
Parallelism allows multiple tasks to execute simultaneously. It is governed by the global parallelism setting in airflow.cfg together with per-DAG limits such as max_active_tasks, and requires an executor that supports parallel execution (for example LocalExecutor or CeleryExecutor).
Q3: How does Airflow implement task retry logic?
Task retries are handled via the retries count and retry_delay interval, ensuring failed tasks can be retried a specified number of times.
Q4: What is the function of Pools in Airflow?
Pools are resource management tools in Airflow, limiting the number of concurrent tasks accessing shared resources.
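A sketch of routing tasks through a pool; "db_pool" is an assumed pool created beforehand in the UI (Admin -> Pools) or with the airflow pools set CLI command.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="pool_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Ten queries, but only as many run at once as "db_pool" has free slots.
    for i in range(10):
        PythonOperator(
            task_id=f"query_{i}",
            python_callable=lambda: None,   # placeholder query
            pool="db_pool",
            pool_slots=1,                   # slots this task occupies while running
        )
```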
Q5: Explain the concept of XCom in Airflow.
XCom, or 'cross-communication,' is a feature allowing data sharing between tasks. Tasks can push data using xcom_push and retrieve it with xcom_pull.
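A sketch of the push/pull flow: the first task pushes by returning a value, the second pulls it from the task instance. Task ids and the payload are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def produce():
    # Return values from PythonOperator callables are pushed to XCom
    # under the key "return_value".
    return {"rows": 42}


def consume(ti=None):
    payload = ti.xcom_pull(task_ids="produce")
    print(f"received {payload}")


with DAG(
    dag_id="xcom_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    push = PythonOperator(task_id="produce", python_callable=produce)
    pull = PythonOperator(task_id="consume", python_callable=consume)

    push >> pull
```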
Workflow Management Questions
Q1: How do you manage dependencies between tasks in Airflow?
Use dependency operators (>>, <<) or task methods like set_upstream and set_downstream to enforce order.
Q2: What is a Sensor, and why is it important in Airflow?
A Sensor is a special operator designed to wait for a particular event or condition, such as a file being available, before moving to the next task.
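A sketch using the built-in FileSensor; the connection id and file path are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="file_sensor_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        fs_conn_id="fs_default",              # filesystem connection (placeholder)
        filepath="/data/incoming/daily.csv",  # file to wait for (placeholder)
        poke_interval=60,                     # check once a minute
        mode="reschedule",                    # release the worker slot between checks
    )
    process_file = EmptyOperator(task_id="process_file")

    wait_for_file >> process_file
```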
Q3: How do you approach building large-scale workflows in Airflow?
Divide the workflow into smaller, independent DAGs or use TaskGroups for better manageability.
Q4: What is the role of Branch Operators in Airflow?
Branch Operators (such as BranchPythonOperator) enable conditional execution paths within a workflow, allowing for dynamic decision-making at runtime.
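A sketch with BranchPythonOperator; the weekend/weekday rule and task ids are made up for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def choose_branch(ds=None):
    # Return the task_id to follow; the other branch is skipped.
    weekday = datetime.strptime(ds, "%Y-%m-%d").weekday()
    return "weekend_path" if weekday >= 5 else "weekday_path"


with DAG(
    dag_id="branching_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_branch)
    weekday_path = EmptyOperator(task_id="weekday_path")
    weekend_path = EmptyOperator(task_id="weekend_path")

    branch >> [weekday_path, weekend_path]
```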
Q5: How does Airflow handle missed task executions (backfilling)?
Backfilling runs a DAG for past schedule intervals that were missed or never executed. With catchup=True the scheduler creates these runs automatically; they can also be triggered on demand with the airflow dags backfill CLI command.
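A sketch of both options; the DAG id and dates are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="catchup_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,   # scheduler replays every missed interval since start_date
) as dag:
    EmptyOperator(task_id="daily_job")

# Past windows can also be filled manually from the CLI, e.g.:
#   airflow dags backfill -s 2024-01-01 -e 2024-01-31 catchup_example
```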
Scenario-Based Advanced Questions
Q1: Design a DAG to process and archive daily logs.
Create tasks for log processing using a PythonOperator, followed by an archiving task that uses a transfer operator (for example SFTPOperator) or a cloud storage integration such as the S3 or GCS providers.
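A minimal sketch under stated assumptions: the parsing logic is a placeholder, and the archive step moves files between assumed local paths where a real pipeline would likely upload to cloud storage instead.

```python
import shutil
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def process_logs(ds=None):
    print(f"parsing logs for {ds}")  # placeholder parsing logic


def archive_logs(ds=None):
    # Placeholder archive step: assumed local paths; swap for cloud storage in practice.
    shutil.move(f"/logs/{ds}.log", f"/archive/{ds}.log")


with DAG(
    dag_id="daily_log_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    process = PythonOperator(task_id="process_logs", python_callable=process_logs)
    archive = PythonOperator(task_id="archive_logs", python_callable=archive_logs)

    process >> archive
```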
Q2: How would you address a long-running task in Airflow?
Set an execution_timeout so excessively long tasks are failed rather than left hanging, and pair it with retries so transient slowdowns get another attempt.
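A sketch of the timeout-plus-retry combination; the durations and workload are illustrative.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="timeout_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="long_running_task",
        python_callable=lambda: None,           # placeholder workload
        execution_timeout=timedelta(hours=2),   # fail the attempt after two hours
        retries=1,                              # allow one further attempt
    )
```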
Q3: How do you ensure workflows only execute when specific conditions are met?
Use Sensors for event-driven triggers, or a ShortCircuitOperator to skip downstream tasks when a runtime condition is not met.
Q4: How do you integrate an external API trigger with Airflow?
Have the external system call the Airflow REST API's DAG-run trigger endpoint, or implement an event-driven process (for example a sensor or a message-queue consumer) that starts the DAG.
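A sketch of calling the stable REST API from an external process; the URL, credentials, and DAG id are placeholders, and basic auth is assumed to be enabled in the API's auth_backends setting.

```python
import requests

# Trigger a run of the (placeholder) DAG "my_dag" with optional run configuration.
response = requests.post(
    "http://localhost:8080/api/v1/dags/my_dag/dagRuns",
    auth=("admin", "admin"),                       # assumes basic auth is enabled
    json={"conf": {"source": "external-system"}},  # passed to the run as dag_run.conf
)
response.raise_for_status()
print(response.json())
```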
Q5: How do you monitor workflows in real-time using Airflow?
Utilize the Web UI to monitor task statuses and examine logs for runtime diagnostics.
Security and Compliance Questions
Q1: What measures would you take to secure an Airflow setup?
Enable RBAC (role-based access control), encrypt sensitive values in the metadata database with a Fernet key, serve the web server over HTTPS, and restrict network access to it.
Q2: What is the purpose of RBAC in Airflow?
RBAC restricts user access based on roles, ensuring only authorized personnel can access specific Airflow functions.
Q3: How do you store sensitive data like passwords in Airflow?
Use Airflow's Secret Backends, such as HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault.
Q4: How do you integrate authentication mechanisms with Airflow?
Airflow supports external authentication systems like LDAP, OAuth, and Kerberos for secure access control.
Q5: Why is logging critical in Airflow?
Logs provide detailed execution histories for debugging issues and fulfilling audit requirements.