This project builds an ETL pipeline that runs daily to gather historical data about flights into and out of an airport and load it into a data warehouse, where business users can query the data and do further analysis.
- The project works with flight data at Frankfurt Airport. The target airport can be changed in the settings (more on that later).
- Apache Airflow for data orchestration.
- Apache Spark for data transformation and query processing.
- Apache Hadoop's HDFS as the project's distributed file system.
- Apache Hive as the data warehousing tool.
- 2-tier architecture with a data lake to store raw data.
- Data sources:
  - Flight data: from the OpenSky API, extracted daily.
  - Aircraft, aircraft type & manufacturer data: from the OpenSky Metadata directory, downloaded as local files.
  - Airport & airline data: from FlightRadar24 (airports, airlines), saved as local files.
- All source data is ingested into HDFS. After transformation, it is loaded into the data warehouse as Hive tables.
Note
For simplicity, dimension data is extracted only once and saved as `.csv` and `.json` files, so the data will become outdated over time. In real-life cases, there should be a CDC system that detects changes in the dimension data.
- The `create_hive_tbls` task creates Hive tables in the data warehouse.
- The `upload_from_local` task uploads files from local storage to HDFS. If one of the files already exists in HDFS, its corresponding task is skipped.
- Once `create_hive_tbls` and `upload_from_local` have been skipped or executed successfully, `load_dim_tbls` starts loading dimension tables into the Hive data warehouse.
- The `extract_flights` task extracts daily flight data from the OpenSky API and ingests it into HDFS.
- After `load_dim_tbls` and `extract_flights` have executed successfully, `load_fct_flights` loads the transformed flight data into the Hive warehouse.
- Map of tasks to the above data flow (see the CLI example after this list):
  - Ingest: `extract_flights`.
  - Generate dates table: `load_dim_tbls.dates`.
  - Transform airports data: `upload_from_local.airports` → `load_dim_tbls.airports`.
  - Transform aircrafts data: `upload_from_local.{aircrafts,aircraft_types,airlines,manufacturers}` → `load_dim_tbls.aircrafts`.
  - Transform flights data: `load_fct_flights`.
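If you want to check how these tasks are wired together in the deployed DAG, you can print the task tree from the Airflow CLI. This is a sketch only: the DAG id is a placeholder, and `make airflow_shell` (described later in this README) is how you open a shell inside the Airflow container.

```sh
# Open a shell inside the Airflow container, list the available DAGs,
# then print the task tree of the pipeline DAG.
make airflow_shell
airflow dags list
airflow tasks list <dag_id> --tree
```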
- Clone this repository and navigate to the project directory:
git clone https://github.com/minkminkk/etl-opensky
cd etl-opensky
- Run the initial setup scripts through the `Makefile`:
make setup
You will be asked to enter your password. This is needed to modify read/write permissions on the directories that the containers bind mount.
- Build or pull the necessary images and start the containers:
make up
After the containers have successfully started, the system is ready to use.
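To double-check that everything is running (assuming a plain Docker setup, which the bind-mount note above implies), you can list the running containers:

```sh
# The Airflow, Spark, HDFS and Hive containers should all appear here.
docker ps
```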
After you are done and want to delete all containers:
make down
- Airflow web UI: `localhost:8080` (username: `admin`, password: `admin`).
- Spark master web UI: `localhost:8081`.
- HDFS web UI: `localhost:9870`.
- HiveServer web UI: `localhost:10002`.
- Browser-based: via the Airflow web UI.
- Command line-based:
make airflow_shell
Then use Airflow CLI commands to interact with Airflow (see the example after the warning below).
Warning
The OpenSky API usually errors out when retrieving data close to the current time. Therefore, you should only run the pipeline for dates at least about 2 months before the current date.
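For example, from the Airflow shell you can trigger a run for a date safely in the past. This is a sketch only: the DAG id below is a placeholder and the date is just an illustration.

```sh
# Inside the Airflow container (after `make airflow_shell`):
# trigger one DAG run for a past logical date.
# Replace <dag_id> with the pipeline's actual DAG id (see `airflow dags list`).
airflow dags trigger <dag_id> --exec-date 2024-01-01
```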
- Via Hive's command line interface `beeline`:
make beeline
The `Makefile` target also connects to the existing database. Once connected, you can write queries in HQL (see the example below).
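For example, once the `beeline` prompt is open you can explore the warehouse with plain HQL. This is a sketch only: the table name below is an assumption, so run `SHOW TABLES;` first to see the actual schema.

```sql
-- List the warehouse tables, then preview one of them.
-- "dim_airports" is a hypothetical table name; substitute a real one.
SHOW TABLES;
SELECT * FROM dim_airports LIMIT 10;
```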
- The target airport for flight data extraction is currently Frankfurt Airport (ICAO code `EDDF`).
- If you want to change the target airport:
  - At runtime: create a new variable `airport_icao` in the Airflow web UI.
  - At container creation: modify the environment variable `AIRFLOW_VAR_AIRPORT_ICAO` in `containers/airflow/airflow.env`.

  In both cases, the variable should contain the ICAO code of the target airport (see the example below).
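For example, to switch the pipeline to London Heathrow (ICAO code `EGLL`), the two approaches described above look roughly like this (a sketch only):

```sh
# Option 1 (at runtime): set the Airflow Variable from inside the
# Airflow container; equivalent to creating it in the web UI.
airflow variables set airport_icao EGLL

# Option 2 (at container creation): add the line below to
# containers/airflow/airflow.env before running `make up`.
# AIRFLOW_VAR_AIRPORT_ICAO=EGLL
```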
- Improve data modeling for the data warehouse so that data can be queried more effectively.
- Configure HDFS to persist data between runs.
- Implement CDC for dimension data.
- Configure cluster to save Spark logs.