Wednesday, November 20, 2024

Apache Airflow: Orchestrating Your Data Pipelines

Introduction

Apache Airflow, an open-source data pipeline orchestration tool, has revolutionized the way companies manage their workflows. Originally developed by Airbnb, Airflow has become an industry standard, offering flexibility, scalability, and extensibility for a wide range of applications.

What is Apache Airflow?

Simply put, Airflow is a platform for defining, scheduling, and monitoring workflows. These workflows, represented by Directed Acyclic Graphs (DAGs), describe the sequence of tasks to be executed, their dependencies, and the order in which they should occur.

Why use Airflow?

  • Flexibility: Create custom and complex pipelines, tailoring them to your specific needs.
  • Scalability: Handle large volumes of data and complex pipelines without compromising performance.
  • Extensibility: Integrate Airflow with a variety of tools and technologies, such as databases, cloud services, and machine learning frameworks.
  • Visibility: Monitor your pipelines in real time through an intuitive web interface, identifying problems and optimizing processes.
  • Community: Benefit from an active and constantly growing community, with several resources and tutorials available.

Fundamental Concepts

  • DAGs: The basic unit of Airflow, a DAG represents a complete workflow, with all its tasks and dependencies.
  • Tasks: The individual tasks that make up a DAG. Each task performs a specific action, such as extracting data from a database, processing a file or training a machine learning model.
  • Operators: Operators define what a task actually does. There are many operator types, such as BashOperator, PythonOperator, and HiveOperator (see the sketch after this list).
  • Scheduler: Responsible for monitoring DAGs and starting tasks according to the defined schedule.
  • Executor: Runs the tasks, either locally or distributed across a cluster.
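To make these concepts concrete, here is a minimal sketch of how they map to code (the DAG and task names are illustrative, and the imports assume Airflow 2.x):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # The DAG object is the workflow; BashOperator defines what kind of work the
    # task performs; the instantiated operator ("hello") is one task in that DAG.
    with DAG(
        dag_id="concepts_demo",            # illustrative name
        start_date=datetime(2024, 1, 1),
        schedule=None,                     # "schedule_interval" on Airflow < 2.4
    ) as dag:
        hello = BashOperator(task_id="hello", bash_command="echo 'hello airflow'")

The scheduler picks this file up from the DAGs folder and the executor runs the task; with no schedule defined, the DAG only runs when triggered manually.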

Airflow Architecture

Airflow has a modular architecture, consisting of:

  • Webserver: Web interface to interact with Airflow, view DAGs, monitor executions and configure alerts.
  • Scheduler: Orchestration engine that monitors DAGs and triggers tasks.
  • Metadata Database: Stores information about DAGs, tasks, executions and logs.
  • Workers: Processes that execute tasks.

Creating a Simple Pipeline

Creating a DAG in Airflow is relatively simple: you define the tasks, their dependencies, and the execution schedule. Airflow offers an intuitive Python API for this.
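As a sketch of what such a pipeline might look like (task names, commands, and the daily schedule are illustrative; the imports assume Airflow 2.x), a small extract-transform-load DAG could be written like this:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def transform():
        # Placeholder for a real transformation step.
        print("transforming data...")


    with DAG(
        dag_id="example_etl",               # illustrative name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",                  # "schedule_interval" on Airflow < 2.4
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo 'extracting data'")
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load = BashOperator(task_id="load", bash_command="echo 'loading data'")

        # Dependencies: extract runs first, then transform, then load.
        extract >> transform_task >> load

Once this file is placed in the DAGs folder, the scheduler triggers one run per day and the web interface shows its progress.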


Managing and Monitoring Pipelines

The Airflow web interface provides a complete view of your pipelines. You can:

  • View the DAG tree.
  • Monitor the status of each task.
  • View execution logs.
  • Set up alerts for failures and other important events (see the sketch below).
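For the alerting part, a common pattern is to attach notifications through default_args. Here is a minimal sketch (the e-mail address and callback are illustrative, and e-mail delivery assumes SMTP is configured in airflow.cfg):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator


    def notify_failure(context):
        # Hypothetical callback: in practice this might post to Slack, PagerDuty, etc.
        print(f"Task {context['task_instance'].task_id} failed")


    default_args = {
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "email": ["data-team@example.com"],   # illustrative address
        "email_on_failure": True,
        "on_failure_callback": notify_failure,
    }

    with DAG(
        dag_id="monitored_pipeline",          # illustrative name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:
        # This task fails on purpose so the alert path can be exercised.
        step = BashOperator(task_id="step", bash_command="exit 1")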

Use Cases

Airflow is used in several areas, such as:

  • ETL: Data extraction, transformation and loading.
  • Machine Learning: Model training, evaluation and deployment.
  • CI/CD: Continuous integration and continuous delivery.
  • Task automation: Automation of any process that can be represented as a workflow.

Conclusion

Apache Airflow is a powerful and versatile tool for orchestrating data pipelines. Its flexibility, scalability, and extensibility make it an ideal choice for companies looking to streamline their processes and ensure data quality.

Friday, October 21, 2022

Why are graphs important for data analysis?

Hello!

Why are graphs important for data analysis? My dear reader, you may already know the answer to this simple question, but rest assured that there are still those out there who just want to look at rows and columns in a spreadsheet.

If I remember correctly, I built my first graphs in good old Excel to keep track of my household budget. Nothing special so far. A few years later, I ended up on a Data Warehouse team with colleagues who were averse to colors, shapes, and graphs. I would sometimes hear questions like: "Who cares about this? All these colors, when what really matters is the data?"

Journalism has been using visual resources such as infographics to illustrate news stories for years. Corporate data warehouses have also used visuals to present data to their users, but it was the advent of BIG DATA that made data volumes impossible to handle in a spreadsheet. Graphs then reinforced their role of presenting data in a more intelligible way than rows and columns alone.

Many tools have been developed to make the most of data visualization. There are very sophisticated tools on the market offering dozens of chart types.

You can have a Ferrari as a tool, but if you don't know how to drive it, it will be nothing more than your favorite car in data visualization. Visual tools need to be used well so that you can not only drive them, but also extract the best interpretations the data can offer.

In this post, I share the graphs built in Microsoft's Power BI tool to illustrate how a bad graph can make data analysis difficult.

You, dear reader, can download the data set I used for free from the address Microsoft provides on the web: Sales Financial Data.

First, I wanted to analyze product sales over time, and my favorite chart for this is the line chart. See how it looks in the figure below.


You can see a jumble of colored lines overlapping each other. The graph has become so confusing that you feel discouraged from even starting to analyze it.

Determined to get a better view, I set out for a second attempt. Did I succeed? See below.


Well, not this time. I removed the quarter from the X-axis and kept only the year in an attempt to improve the visualization. The lines still overlap, making it difficult to analyze sales.

Don't be discouraged; I've come up with a solution. See below.


In the figure above, I used a horizontal bar chart stacked at 100%. I put the year on the y-axis and sales on the x-axis, and used the products as the legend. This stacked bar chart makes it possible to see how each product's sales change year by year without the colors and shapes competing with the data. For example, it is clear that there was an increase in sales of the product/car "Carretera" in 2014. With this visualization, we do not need to spend hours hunting for "Carretera" as in the line chart.
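The charts above were built in Power BI, but the same idea is easy to reproduce in code. Here is a rough pandas/matplotlib sketch of the 100% stacked horizontal bar chart; the file and column names ("Year", "Product", "Sales") follow Microsoft's Financial Sample workbook and may need adjusting to your copy of the data:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the workbook (assumed file name) and total sales per product per year.
    df = pd.read_excel("Financial Sample.xlsx")
    sales = df.groupby(["Year", "Product"])["Sales"].sum().unstack(fill_value=0)

    # Normalize each year (row) to 100% so the bars compare product mix, not absolute volume.
    share = sales.div(sales.sum(axis=1), axis=0) * 100

    ax = share.plot(kind="barh", stacked=True, figsize=(10, 4))
    ax.set_xlabel("Share of sales (%)")
    ax.set_ylabel("Year")
    ax.legend(title="Product", bbox_to_anchor=(1.02, 1), loc="upper left")
    plt.tight_layout()
    plt.show()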

Dear reader, it is worth spending a few minutes, or even hours, testing for the best visualization. Your user or client will certainly thank you for being able to easily draw inferences from a well-made graph.

Happy data analysis!

Friday, June 10, 2016

Dimensional Modeling


Information about business situations helps managers make decisions. Identifying and selecting a course of action to deal with a specific problem, or to take advantage of an opportunity, is a decision-making task.

The faster information reaches managers, the faster decisions are made. That means profit, competitiveness, and a range of other advantages that can help decide a company's future.

It is through knowledge that managers make their decisions, and to make this as effective as possible, information systems were created to support them.

Transactional systems, which simply generate reports, could not handle the analysis needed for decision-making. In this context, the data warehouse emerged and became a reality in large corporations.

To build a DW (Data Warehouse), one technique used is dimensional modeling. Just as civil engineers draw the blueprint of a house before putting up its walls, it is necessary to design the data warehouse before building it. A dimensional model is basically formed by a central fact table and dimension tables directly linked to it.

Some important concepts:
  • Fact: an event worthy of analysis and control in the organization;
  • Dimension: a perspective on the data. Ex.: per day, per salesperson, per customer, per product, per agency;
  • Metric: the values to be analyzed or measured. Ex.: total sales, percentage of absenteeism.
There are two types of modeling for a Data Warehouse: the "Star Schema" and the "Snowflake".

The "Snowflake" is composed of a fact table, dimensions, and sub-dimensions. The sub-dimensions are normalizations of the dimension tables, which reduces the number of records and consequently requires less disk space. However, because the model is normalized, answering a query requires several "joins", which slows queries down.

The "Star Schema" is characterized by denormalized dimension tables, that is, dimension tables in which data is repeated across several rows. This characteristic requires more disk space. Being denormalized, the star model reduces the number of "joins" between tables, which results in faster queries. Since it has fewer tables than the "snowflake" model, its maintenance can be said to be simpler.
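To make the star model concrete, here is a minimal sketch in Python with SQLite (table and column names are illustrative): one central fact table holding the metric, surrounded by denormalized dimension tables, each reached with a single join.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    cur.executescript("""
    -- Denormalized dimensions: one table per perspective of the data.
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);

    -- The fact table holds the metric (total_sales) plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        total_sales REAL
    );

    INSERT INTO dim_date    VALUES (1, '2016-06-10', 'June', 2016);
    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware');
    INSERT INTO fact_sales  VALUES (1, 1, 1500.0), (1, 1, 300.0);
    """)

    # A typical analytical query: total sales per product per year.
    # In a star schema each dimension is one join away; a snowflake version
    # would need extra joins to reach the normalized sub-dimension tables.
    query = """
        SELECT d.year, p.product_name, SUM(f.total_sales) AS total_sales
        FROM fact_sales f
        JOIN dim_date d    ON f.date_key = d.date_key
        JOIN dim_product p ON f.product_key = p.product_key
        GROUP BY d.year, p.product_name;
    """
    for row in cur.execute(query):
        print(row)   # (2016, 'Widget', 1800.0)

In a snowflake version, the category column would move out of dim_product into its own sub-dimension table, saving some space but adding one more join to the query above.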