OpenSky Tech Stack I - Architectural Overview


This is the first post of my series about the OpenSky Network. We will have a look at OpenSky’s architecture and the requirements that motivated this choice.

Requirements

We want to collect every single Mode S message emitted all over the globe. – OpenSky’s objective

OpenSky originally started as a research project to collect ADS-B and, later on, Mode S data for security research. Due to the lack of available sources, researchers at armasuisse Science and Technology set up a few receivers and collected the messages in a MySQL database.

It quickly turned out how useful this data collection was, as more and more researchers started using it. People installed new sensors and the network grew larger. At some point, MySQL became the bottleneck and could not cope with the insert rate of 600 messages per second. This doesn’t seem like much, but several factors contributed to the performance degradation. There were separate tables for raw messages, decoded information and flights, so every incoming message triggered an insertion into at least three tables. The size on disk reached 2TB for around 12 billion messages, which made updating the indices an expensive operation. Our goals were (and still are) quite ambitious: collecting and storing every single Mode S message emitted all over the globe. Hence, we needed a system which could easily cope with a massively growing network. Besides, we identified some other requirements within two years of operation.
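To put these figures in perspective, here is a rough back-of-the-envelope calculation in Python. It is only a sketch based on the numbers quoted above: it assumes that "2TB" means 2 * 10**12 bytes and that each message causes exactly three row inserts (the text says "at least three"). The function names are made up for illustration and are not part of OpenSky's code.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumption: "2TB" is read as 2 * 10**12 bytes (decimal terabytes).
# Assumption: exactly 3 row inserts per message (the lower bound from the text).

SECONDS_PER_DAY = 24 * 60 * 60


def row_inserts_per_second(messages_per_second: int, tables_per_message: int) -> int:
    """Lower bound on row inserts when every message touches several tables."""
    return messages_per_second * tables_per_message


def messages_per_day(messages_per_second: int) -> int:
    """Number of messages that arrive in one day at a constant rate."""
    return messages_per_second * SECONDS_PER_DAY


def bytes_per_message(total_bytes: float, total_messages: float) -> float:
    """Average on-disk footprint per message, including tables and indices."""
    return total_bytes / total_messages


if __name__ == "__main__":
    print(row_inserts_per_second(600, 3))        # row inserts per second
    print(messages_per_day(600))                 # messages per day
    print(round(bytes_per_message(2e12, 12e9)))  # average bytes per message
```

Under these assumptions, 600 messages per second translate into at least 1800 row inserts per second and roughly 52 million messages per day, with an average footprint of about 167 bytes per message on disk once the extra tables and indices are counted.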

Fault Tolerance & Data Storage

Data is OpenSky’s most valuable asset. The main priority of the project is to never stop collecting and never lose any data. For this purpose, fault-tolerant and highly available solutions need to be applied. High availability is not the main focus for every single part of the system, but it is a strict requirement for data ingestion. Losing a single machine should not interrupt data collection and must never lead to losing any part of the archived data.

On the other hand, OpenSky’s infrastructure resides in a single data center, and we need to live with the consequences, that is, service interruptions caused by local power outages, failing UPS or network breakdowns. Moving to the cloud is not an option due to high egress and storage costs.

In the end, the data ingestion and storage system should be robust against failures within a single data center, such as server outages in general and failing disks in particular.

When it comes to data storage, we quickly came to the conclusion that a relational database is not the right choice to persist our master data set. We needed a more flexible solution: new data sources will be exploited, and they demand a heterogeneous information system.

Data Ingestion

Processing


The Lambda Architecture

Ingestion Layer

Batch Layer

Speed Layer

Serving Layer