Architecture for data platforms

The organization of data plays an important role in modern companies. Beyond handling the flood of digital information from customers and partners, companies must also manage and analyze their internal data as efficiently and cost-effectively as possible to remain competitive in the long term. A data platform is specifically designed to collect data from a variety of sources, unify this data across the organization, and make it available to all data consumers.

Characteristics of a Data Platform

A modern data platform has to quickly adapt to the sheer scale, complexity, and diversity of data and its users. Storing and processing the data isn't enough—a modern data platform should exhibit a number of key characteristics.

Fast Setup

A key feature of a modern data platform is that it is quick and easy to set up and use. It should connect to your data sources without requiring complicated configuration, and once the platform has been deployed, users should be able to log in to the platform and use it, regardless of their technical skills.

Agile Data Management

The data platform must connect and integrate a large number of different data sources, meet data protection requirements, and offer sufficient flexibility to react promptly to new requirements, including both legal requirements and business requirements set within the organization. The data platform should enable agile data management, which means that the right data is provided to the right people at the right time.

Three fundamental principles that govern agile data platforms are:

Simplicity: A particular challenge in data management is providing the relevant departments of an organization with the data they need. Access to the required data should be as frictionless as possible so as not to create unnecessary overhead.
Speed: Modern data management has to be fast. This is true when it comes to processing and analyzing data, but also with regard to getting data where it needs to be. Your data management tool should speed up your workflows, not bog them down.
Elasticity: Modern data management is highly elastic with regard to the type and volume of data, as well as the resources required for data processing and analysis. Users should also be able to easily integrate new data sources into existing data processing flows. A key property of an elastic data platform is modularity. If a requirement changes for individual components of the data platform (e.g., changed data protection requirements), the rest of the infrastructure should be unaffected; only the specific components affected by the change should need to be adapted.
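
The modularity described under Elasticity can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed design: the `Pipeline` class, the step functions, and the field names are all hypothetical. It shows that when a data protection requirement changes, only the masking step is swapped while the other steps stay untouched.

```python
from typing import Callable, Dict

Record = Dict[str, str]
Step = Callable[[Record], Record]


def normalize_country(record: Record) -> Record:
    """Upper-case the country code (hypothetical cleaning step)."""
    return {**record, "country": record["country"].upper()}


def mask_email_partial(record: Record) -> Record:
    """Old policy (assumed): keep the domain, hide the local part."""
    _, _, domain = record["email"].partition("@")
    return {**record, "email": "***@" + domain}


def mask_email_full(record: Record) -> Record:
    """New policy (assumed): hide the whole address."""
    return {**record, "email": "***"}


class Pipeline:
    """Runs named steps in order; each step can be replaced on its own."""

    def __init__(self, steps: Dict[str, Step]) -> None:
        self.steps = dict(steps)

    def replace(self, name: str, step: Step) -> None:
        if name not in self.steps:
            raise KeyError(name)
        self.steps[name] = step  # position in the run order is preserved

    def run(self, record: Record) -> Record:
        for step in self.steps.values():
            record = step(record)
        return record


pipeline = Pipeline({"normalize": normalize_country, "mask": mask_email_partial})
row = {"email": "ada@example.com", "country": "de"}
before = pipeline.run(row)

# A changed data protection requirement touches only the "mask" component.
pipeline.replace("mask", mask_email_full)
after = pipeline.run(row)
```

Because each step returns a new record instead of mutating its input, swapping one step cannot silently change what the other steps see.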

Secure

A modern data platform must balance making data available to all its users and processes with the legal requirements around securing data, such as the GDPR or the Payment Card Industry Data Security Standard. The data platform must ensure that the data is securely loaded and stored on the platform, which it achieves by encrypting the data when loaded and masking all personal data. Platforms must support robust access controls to ensure that the data can be accessed or altered only by authorized persons.
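
As a rough illustration of masking personal data at load time, the sketch below replaces selected fields with a salted hash. The field names, the `mask_record` helper, and the salt value are assumptions made for this example. Encryption is deliberately not shown, since it should rely on a vetted cryptography library rather than hand-written code.

```python
import hashlib
from typing import Dict, Iterable

# Assumption: which columns count as personal data is configured per dataset.
PERSONAL_FIELDS = {"name", "email"}


def pseudonymize(value: str, salt: str) -> str:
    """Replace a value with a salted SHA-256 digest, truncated to 16 hex chars."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def mask_record(record: Dict[str, str], salt: str,
                personal_fields: Iterable[str] = PERSONAL_FIELDS) -> Dict[str, str]:
    """Return a copy of the record with personal fields masked."""
    fields = set(personal_fields)
    return {
        key: pseudonymize(value, salt) if key in fields else value
        for key, value in record.items()
    }


masked = mask_record(
    {"name": "Ada Lovelace", "email": "ada@example.com", "plan": "pro"},
    salt="demo-salt",  # hypothetical; a real salt would be kept secret
)
```

The same input and salt always produce the same token, so masked columns can still be joined or counted. Note that salted hashing is pseudonymization rather than full anonymization, so whether it satisfies a given regulation has to be assessed separately.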

User Focused

A data platform should be able to handle a wide range of users of various skill levels. Regardless of whether the users are engineers, analysts, project managers, or marketers, the data platform should enable them to locate, analyze, and understand the company's data. The context of the data, such as meta-description or source, must be easy to access and understand. Users must be able to derive insights and analytics from the data quickly and with minimal effort.

Components of a Data Platform

A data platform collects, stores, processes, and analyzes data to make it available to business users, who derive valuable insights from it. The architecture of a typical modern data platform consists of several layers, each fulfilling a different function. These layers are presented in detail below.

Figure: Layers of a data platform

Data Sources

The data source layer stores the data that is used by the data platform. The sources for the data can be:

Entire information systems, such as customer relationship management (CRM) or enterprise resource planning (ERP) systems
Unstructured sources, such as text files
Structured data, such as Excel documents
Audio, video, or streaming sources

In order to efficiently store these potentially vast amounts of data, it is advisable to use a data lake, such as Amazon S3 or Google Cloud Storage.

Data Ingestion Layer

The data ingestion layer merges the multiple sources in the data source layer and makes the data available on the data platform. In the first step, the ingestion layer extracts the data from the various data sources in the data source layer. Next, the data is checked for validity, ensuring that it's in the correct format and doesn't contain errors. The ingested data is stored in the staging area of the data platform, where the data is waiting for further processing steps.
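
The extract, validate, and stage steps described above might look roughly like this. It is a minimal sketch: the CSV layout, the `REQUIRED_FIELDS` schema, and the function names are hypothetical, and the staging area is represented by a plain Python list.

```python
import csv
import io
from typing import Dict, List, Tuple

REQUIRED_FIELDS = ("order_id", "amount")  # hypothetical schema


def extract(raw_csv: str) -> List[Dict[str, str]]:
    """Read rows from a CSV export of a source system."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def is_valid(row: Dict[str, str]) -> bool:
    """Check that required fields are present and the amount is numeric."""
    if any(not row.get(field) for field in REQUIRED_FIELDS):
        return False
    try:
        float(row["amount"])
    except ValueError:
        return False
    return True


def ingest(raw_csv: str) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split extracted rows into a staging list and a rejected list."""
    staged, rejected = [], []
    for row in extract(raw_csv):
        (staged if is_valid(row) else rejected).append(row)
    return staged, rejected


raw = "order_id,amount\n1,19.99\n2,not-a-number\n,5.00\n"
staged, rejected = ingest(raw)
```

Keeping rejected rows instead of dropping them makes it possible to report data quality problems back to the owners of the source system.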

Modern data ingestion tools fall into two groups: open source tools and SaaS tools. Open source tools are free to use, but often require more engineering hours spent on installing, configuring, and maintaining the tools. Paid SaaS solutions have an ongoing price associated with them, but generally benefit from more features and an easier integration process, as well as reduced maintenance.

Open Source Tools: StreamSets, Singer
SaaS Tools: Stitch, Fivetran, Hevo Data

Processing and Transformation Layer

In the processing and transformation layer, the source data is prepared for the storage layer, where it is stored in a specific data model. If this data model is the same as that of the source data, no preprocessing in the processing and transformation layer is necessary. Otherwise, the data is transformed to fit the data model of the storage layer.
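
To make the idea of fitting source data to a target data model concrete, here is a small sketch. The `CustomerRow` model and the CRM-style source field names are invented for illustration and do not come from any particular system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict


@dataclass(frozen=True)
class CustomerRow:
    """Target data model in the storage layer (hypothetical)."""
    customer_id: int
    full_name: str
    signup_date: date


def to_storage_model(source: Dict[str, str]) -> CustomerRow:
    """Convert a CRM-style source record into the target model."""
    return CustomerRow(
        customer_id=int(source["id"]),
        full_name=f'{source["first_name"]} {source["last_name"]}'.strip(),
        signup_date=date.fromisoformat(source["created"]),
    )


row = to_storage_model(
    {"id": "42", "first_name": "Ada", "last_name": "Lovelace", "created": "2023-05-01"}
)
```

If the source already delivered records in the shape of `CustomerRow`, this step could be skipped entirely, which is the case the paragraph above describes.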

Data processing can be applied in real time, or done in batches scheduled for a specific time of the day. Both data processing techniques can be executed in either extract-transform-load (ETL) or extract-load-transform (ELT) processes. For larger amounts of data, the latter procedure is recommended for performance reasons. Some key tools for data transformation and processing include Databricks, Athena, and Starburst.
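
The difference between ETL and ELT comes down to where the transformation runs. The following sketch uses two in-memory SQLite databases as stand-ins for the storage layer; the table names and sample rows are assumptions for illustration only.

```python
import sqlite3

source_rows = [("1", "19.99"), ("2", "5.00")]  # hypothetical raw export

# ETL: transform in application code, then load the finished rows.
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE orders (order_id INTEGER, amount_cents INTEGER)")
transformed = [(int(oid), round(float(amt) * 100)) for oid, amt in source_rows]
etl_db.executemany("INSERT INTO orders VALUES (?, ?)", transformed)

# ELT: load the raw rows first, then transform with SQL inside the store.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
elt_db.executemany("INSERT INTO raw_orders VALUES (?, ?)", source_rows)
elt_db.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(ROUND(CAST(amount AS REAL) * 100) AS INTEGER) AS amount_cents
    FROM raw_orders
    """
)

etl_result = etl_db.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
elt_result = elt_db.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
```

Both paths end with the same `orders` table. In the ELT variant the raw data stays available in `raw_orders`, and the transformation work is pushed down to the storage engine, which is one reason the approach tends to be preferred for larger volumes.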

Storage Layer

After data has been ingested from the data source layer and processed in the processing and transformation layer, it gets stored in the storage layer. The data storage layer of a data platform has several functions:

Making the data available to the data consumers, such as data scientists and developers.
Protecting the data from errors and failures in the system.
Archiving the data over a very long period of time.

The storage layer of a data platform can be implemented using different technologies:

NoSQL databases
Hadoop distributed file systems
Cloud storage
In-memory databases

Analytics Layer

The analytics layer serves the purpose of analyzing the data and gaining valuable insights from it by applying various analytics algorithms to the data. Such algorithms could involve descriptive and exploratory analytics, as well as more advanced algorithms based on machine learning and neural networks.
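
As a simple example of descriptive analytics, the sketch below summarizes a numeric column using only Python's standard library. The `describe` helper and the sample values are hypothetical; a real analytics layer would typically run such computations on much larger data with dedicated tooling.

```python
import statistics
from typing import Dict, List


def describe(values: List[float]) -> Dict[str, float]:
    """Summarize a numeric column with basic descriptive statistics."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


daily_orders = [12, 15, 11, 30, 14]  # hypothetical sample data
summary = describe(daily_orders)
```

Comparing the mean with the median already hints at the outlier in the sample, which is the kind of first insight exploratory analysis is meant to surface.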

Visualization Layer

The insights from the data gained in the analytics layer are presented to the end user in the visualization layer. This is usually done through business intelligence (BI) dashboards. BI dashboards allow the end user to explore the data more deeply than would be possible if they were just consuming static data reports.

Modern BI tools include Tableau, Looker, Sigma, and Superset.

Security and Privacy Layer

One challenge of a modern data platform is the management and application of privacy and security policies. This challenge is addressed in the security and privacy layer. These policies are enforced through user authentication and authorization, which ensure that access to the data in the data platform is granted only to authorized users. The security and confidentiality of the data can be further assured by encrypting the data during transmission and storage.

The security and privacy layer also tracks and audits all activities performed on the data, providing a comprehensive record of who accessed or modified data and when. Some tools used for data privacy and security include Immuta, Privacera, and Apache Ranger.
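
Authorization and auditing can be combined in a single checkpoint, as the following sketch suggests. The role names, the `ROLE_PERMISSIONS` mapping, and the `AccessGuard` class are assumptions for illustration and do not mirror the API of Immuta, Privacera, or Apache Ranger.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set

# Assumption: a static role-to-permission mapping; real platforms load policies.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
}


@dataclass
class AuditEvent:
    user: str
    action: str
    dataset: str
    allowed: bool
    at: datetime


@dataclass
class AccessGuard:
    """Checks permissions and records every attempt, allowed or not."""
    audit_log: List[AuditEvent] = field(default_factory=list)

    def check(self, user: str, role: str, action: str, dataset: str) -> bool:
        # Unknown roles get an empty permission set and are therefore denied.
        allowed = action in ROLE_PERMISSIONS.get(role, set())
        self.audit_log.append(
            AuditEvent(user, action, dataset, allowed, datetime.now(timezone.utc))
        )
        return allowed


guard = AccessGuard()
can_read = guard.check("ada", "analyst", "read", "sales")
can_write = guard.check("ada", "analyst", "write", "sales")
```

Logging denied attempts as well as granted ones is a deliberate choice here: the record of who tried to access data is often as relevant to an audit as the record of who succeeded.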

Data Catalog and Governance

A modern data platform brings many advantages, including agile data management, easy scalability, and speed. However, as metadata itself increasingly becomes big data, metadata management, data discovery, trust, and governance often become a struggle for modern data platforms. The data catalog and governance layer aims to avoid these complications. A data catalog is a metadata directory that can be used as a tool or service to manage the metadata of the data assets of the data platform. For example, data sources, data lineage, table names, attributes, value ranges, data types, and indices are stored in a data catalog. This way, the data catalog enables data governance in a data platform.
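
A data catalog can be pictured as a small metadata directory with lookup functions on top. The sketch below is a toy version: the `CatalogEntry` fields, the table names, and the methods are all hypothetical and chosen only to illustrate data discovery and lineage.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """Metadata for one table (fields chosen for illustration)."""
    table: str
    source: str
    columns: Dict[str, str]  # column name -> data type
    upstream: List[str] = field(default_factory=list)  # lineage: parent tables


class DataCatalog:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.table] = entry

    def find_by_column(self, column: str) -> List[str]:
        """Data discovery: which tables contain this column?"""
        return sorted(t for t, e in self._entries.items() if column in e.columns)

    def lineage(self, table: str) -> List[str]:
        """Collect all upstream tables depth-first (assumes no cycles)."""
        entry = self._entries.get(table)
        if entry is None:
            return []
        result: List[str] = []
        for parent in entry.upstream:
            for name in [parent, *self.lineage(parent)]:
                if name not in result:
                    result.append(name)
        return result


catalog = DataCatalog()
catalog.register(CatalogEntry("raw_orders", "crm", {"order_id": "TEXT", "amount": "TEXT"}))
catalog.register(CatalogEntry("orders", "platform", {"order_id": "INTEGER"}, ["raw_orders"]))
catalog.register(CatalogEntry("revenue_daily", "platform", {"day": "DATE"}, ["orders"]))

tables_with_order_id = catalog.find_by_column("order_id")
revenue_lineage = catalog.lineage("revenue_daily")
```

Being able to answer "where does this table come from?" with a single call is what makes a catalog useful for governance, since it shows which downstream assets are affected when a source changes.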

Hygraph is a convenient and flexible tool for metadata management. It equips its users with the ability to edit and source content from other APIs via remote fields.

Conclusion

In this article, you learned about the data platform as a tool for organizing and analyzing data in a data-driven organization. You looked at the key characteristics of a modern data platform, which include simplicity, adaptability, and security. You also got an overview of the architecture of a data platform, with detailed explanations of each individual layer.

Convenient and flexible metadata management is not the only benefit Hygraph can bring into your organization. Hygraph is a content management system that allows users to bring content to any platform with full reading and writing capabilities. With GraphQL, developers can use an intuitive interface and readable syntax to request exactly the data they need to support their platform.