The Concepts Data Professionals Should Know in 2025: Part 1
From Data Lakehouses to Event-Driven Architecture — Master 12 data concepts and turn them into simple projects to stay ahead in IT.
When I scroll through YouTube or LinkedIn and see topics like RAG, Agents or Quantum Computing, I sometimes get a queasy feeling about keeping up with these innovations as a data professional.
But when I then reflect on the topics my customers face daily as a Salesforce Consultant, or the ones I encounter as a Data Scientist at university, the challenges often seem more tangible: faster data access, better data quality or boosting employees’ tech skills. The key issues are often less futuristic and can usually be simplified. That’s the focus of this and the next article:
I have compiled 12 terms that you will certainly encounter as a data engineer, data scientist and data analyst in 2025. Why are they relevant? What are the challenges? And how can you apply them to a small project?
So — Let’s dive in.
Table of Contents
1 — Data Warehouse, Data Lake, Data Lakehouse
2 — Cloud platforms such as AWS, Azure & Google Cloud Platform
3 — Optimizing data storage
4 — Big data technologies such as Apache Spark, Kafka
5 — How data integration becomes real-time capable: ETL, ELT and Zero-ETL
6 — Event-Driven Architecture (EDA)
Terms 7–12 in part 2: Data Lineage & XAI, Gen AI, Agentic AI, Inference Time Compute, Near Infinite Memory, Human-In-The-Loop-Augmentation (will be published tomorrow)
Final Thoughts
1 — Data Warehouse, Data Lake, Data Lakehouse
We start with the foundation for data architecture and storage to understand modern data management systems.
Data warehouses became really well known in the 1990s thanks to Business Intelligence tools from Oracle and SAP, for example. Companies began to store structured data from various sources in a central database. An example is weekly processed sales data in a business intelligence tool.
The next innovation was data lakes, which arose from the need to be able to store unstructured or semi-structured data flexibly. A data lake is a large, open space for raw data. It stores both structured and unstructured data, such as sales data alongside social media posts and images.
The next step in innovation combined data lake architecture with warehouse architecture: Data lakehouses were created.
The term was popularized by companies such as Databricks, which introduced its Delta Lake technology. This concept combines the strengths of both previous data platforms. It allows us to store unstructured data as well as quickly query structured data in a single system. The need for this data architecture has arisen primarily because warehouses are often too restrictive, while lakes are difficult to search.
Why are the terms important?
We are living in the era of big data — companies and private individuals are generating more and more data (structured as well as semi-structured and unstructured data).
A short personal anecdote: The year I turned 15, Facebook cracked the 500 million active user mark for the first time, and Instagram was founded. The release of the iPhone 4 significantly accelerated the global spread of smartphones and shaped the mobile era. In the same year, Microsoft further developed and promoted Azure (first released in 2008) to compete with Google Cloud and AWS. From today’s perspective, all of these events made 2010 a decisive year in which digitalisation and the transition to cloud technologies gained momentum.
In 2010, around 2 zettabytes (ZB) of data were generated; in 2020 it was around 64 ZB; and in 2024 we are at around 149 ZB.
Due to the explosive data growth in recent years, we need to store the data somewhere — efficiently. This is where these three terms come into play. Hybrid architectures such as data lakehouses solve many of the challenges of big data. The demand for (near) real-time data analysis is also rising (see term 5 on zero ETL). And to remain competitive, companies are under pressure to use data faster and more efficiently. Data lakehouses are becoming more important as they offer the flexibility of a data lake and the efficiency of a data warehouse — without having to operate two separate systems.
What are the challenges?
- Data integration: As there are many different data sources (structured, semi-structured, unstructured), complex ETL / ELT processes are required.
- Scaling & costs: While data warehouses are expensive, data lakes can easily lead to data chaos (if no good data governance is in place) and lakehouses require technical know-how & investment.
- Access to the data: Permissions need to be clearly defined when data is held in centralized storage.
Small project idea to better understand the terms:
Create a mini data lake with AWS S3: Upload JSON or CSV data to an S3 bucket, then process the data with Python and perform data analysis with Pandas, for example.
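A minimal sketch of how this could look with boto3 and Pandas. The bucket name, the file name and the columns used in the analysis are assumptions for illustration; replace them with your own data:

```python
import io

import boto3
import pandas as pd

# Assumed names: replace with your own bucket and file
BUCKET = "my-mini-data-lake"
KEY = "raw/sales_data.csv"

s3 = boto3.client("s3")  # uses the credentials configured via the AWS CLI / environment

# 1) Upload a local CSV file into the "data lake" (the S3 bucket)
s3.upload_file("sales_data.csv", BUCKET, KEY)

# 2) Read the raw data back from S3 into a Pandas DataFrame
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

# 3) Run a first simple analysis, e.g. revenue per product category
print(df.groupby("category")["revenue"].sum())
```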
2 — Cloud Platforms such as AWS, Azure & Google Cloud Platform
Now we move on to the platforms on which the concepts from section 1 are often implemented.
Of course, everyone knows cloud platforms such as AWS, Azure or Google Cloud. These services provide us with a scalable infrastructure for storing large volumes of data. We can also use them to process data in real-time and to run Business Intelligence and Machine Learning tools efficiently.
But why are the terms important?
I work at a web design agency where one of the other departments hosts our clients’ websites. Before the easy availability of cloud platforms, this meant running our own servers in the basement — with all the challenges such as cooling, maintenance and limited scalability.
Today, most of our data architectures and AI applications run in the cloud. Cloud platforms have changed the way we store, process and analyse data over the last decades. Platforms such as AWS, Azure or Google Cloud offer us a completely new level of flexibility and scalability for model training, real-time analyses and generative AI.
What are the challenges?
- Vendor-specific complexity: a quick personal example of how complex things get. While preparing for my Salesforce Data Cloud Certification (a data lakehouse), I found myself diving into a sea of new terms — all specific to the Salesforce world. Each cloud platform has its own terminology and tools, which makes it time-consuming for employees in companies to familiarize themselves with them.
- Data security: Sensitive data can often be stored in the cloud. Access control must be clearly defined — user management is required.
Small project idea to better understand the terms:
Create a simple data pipeline: Register with AWS, Azure or GCP with a free account and upload a CSV file (e.g. to an AWS S3 bucket). Then load the data into a relational database and use an SQL tool to perform queries.
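A possible sketch of the second half of this pipeline, loading the CSV into a relational database with SQLAlchemy and querying it with SQL. The connection string is an assumption (it could just as well point to a managed database such as Amazon RDS), and the psycopg2 driver must be installed:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Assumed connection string: swap in your own RDS / Cloud SQL / local Postgres instance
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/demo")

# 1) Read the CSV file (here from a local copy of the file you uploaded to S3)
df = pd.read_csv("sales_data.csv")

# 2) Load it into a relational table
df.to_sql("sales", engine, if_exists="replace", index=False)

# 3) Query it with plain SQL
with engine.connect() as conn:
    result = conn.execute(text("SELECT COUNT(*) AS row_count FROM sales"))
    print(result.scalar())
```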
3 — Optimizing Data Storage
More and more data = more and more storage space required = more and more costs.
With the use of large amounts of data and the platforms and concepts from sections 1 and 2, the issues of efficiency and cost management also arise. To save on storage, reduce costs and speed up access, we need better ways to store, organize and access data more efficiently.
Strategies include data compression (e.g. Gzip), which shrinks files by encoding redundant information more compactly; data partitioning, which splits large data sets into smaller chunks; indexing to speed up queries; and the choice of storage format (e.g. CSV, Parquet, Avro).
Why is the term important?
Not only is my Google Drive and OneDrive storage nearly maxed out…
… in 2028, a total data volume of 394 zettabytes is expected.
It will therefore be necessary for us to be able to cope with growing data volumes and rising costs. In addition, large data centers consume immense amounts of energy, which in turn is critical in terms of the energy and climate crisis.
What are the challenges?
- Different formats are optimized for different use cases. Parquet, for example, is particularly suitable for analytical queries and large data sets, as it is organized on a column basis and read access is efficient. Avro, on the other hand, is ideal for streaming data because it can quickly convert data into a format that is sent over the network (serialization) and just as quickly convert it back to its original form when it is received (deserialization). Choosing the wrong format can hurt performance by either wasting disk space or increasing query times.
- Cost / benefit trade-off: Compression and partitioning save storage space but can slow down computing performance and data access.
- Dependency on cloud providers: As a lot of data is stored in the cloud today, optimization strategies are often tied to specific platforms.
Small project idea to better understand the terms:
Compare different storage optimization strategies: Generate a 1 GB dataset with random numbers. Save the data set in three different formats such as CSV, Parquet & Avro (using the corresponding Python libraries). Then compress the files with Gzip or Snappy. Now load the data into a Pandas DataFrame using Python and compare the query speed.
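A minimal sketch of such a comparison with Pandas, covering CSV (plain and Gzip-compressed) and Parquet with Snappy compression. Writing Avro would additionally require a library such as fastavro, which is left out here to keep the example short, and Parquet support assumes pyarrow (or fastparquet) is installed:

```python
import time

import numpy as np
import pandas as pd

# 1) Generate a synthetic dataset with random numbers (the size is an assumption, adjust as needed)
df = pd.DataFrame(np.random.rand(5_000_000, 10),
                  columns=[f"col_{i}" for i in range(10)])

# 2) Save it in different formats / compressions
df.to_csv("data.csv", index=False)
df.to_csv("data.csv.gz", index=False, compression="gzip")
df.to_parquet("data.parquet", compression="snappy")

# 3) Compare how long each file takes to read back
for path, reader in [("data.csv", pd.read_csv),
                     ("data.csv.gz", pd.read_csv),
                     ("data.parquet", pd.read_parquet)]:
    start = time.perf_counter()
    reader(path)
    print(f"{path}: {time.perf_counter() - start:.2f} s")
```

Comparing the resulting file sizes on disk is just as revealing as the read times.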
4 — Big Data Technologies such as Apache Spark & Kafka
Once the data has been stored using the storage concepts described in sections 1–3, we need technologies to process it efficiently.
We can use tools such as Apache Spark or Kafka to process and analyze huge amounts of data. They allow us to do this in real-time or in batch mode.
Spark is a framework that processes large amounts of data in a distributed manner and is used for tasks such as machine learning, data engineering and ETL processes.
Kafka is a tool that transfers data streams in real-time so that various applications can access and use them immediately. One example is the processing of real-time data streams in financial transactions or logistics.
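To make the Spark side more concrete, here is a minimal PySpark sketch that runs locally without a cluster; the file name and column names are assumptions:

```python
from pyspark.sql import SparkSession

# Start a local Spark session (no cluster needed for experimenting)
spark = SparkSession.builder.appName("spark-intro").getOrCreate()

# Read a CSV file; Spark distributes the work across partitions automatically
df = spark.read.csv("sales_data.csv", header=True, inferSchema=True)

# A typical batch aggregation, e.g. total revenue per country
df.groupBy("country").sum("revenue").show()

spark.stop()
```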
Why is the term important?
In addition to the exponential growth in data, AI and machine learning are becoming increasingly important. Companies want to be able to process data in (almost) real-time: These Big Data technologies are the basis for real-time and batch processing of large amounts of data and are required for AI and streaming applications.
What are the challenges?
- Complexity of implementation: Setting up, maintaining and optimizing tools such as Apache Spark and Kafka requires in-depth technical expertise. In many companies, this is not readily available and must be built up or brought in externally. Distributed systems in particular can be complex to coordinate. In addition, processing large volumes of data can lead to high costs if the computing capacities in the cloud need to be scaled.
- Data quality: If I had to name one of my customers’ biggest problems, it would probably be data quality. Anyone who works with data knows that data quality leaves room for improvement in many companies… When data streams are processed in real-time, this becomes even more important. Why? In real-time systems, data is processed without delay and the results sometimes feed directly into decisions or trigger follow-up actions. Incorrect or inaccurate data can lead to wrong decisions.
5 — How Data Integration Becomes Real-Time Capable: ETL, ELT and Zero-ETL
ETL, ELT and Zero-ETL describe different approaches to integrating and transforming data.
While ETL (Extract-Transform-Loading) and ELT (Extract-Loading-Transform) are familiar to most, Zero-ETL is a data integration concept introduced by AWS in 2022. It eliminates the need for separate extraction, transformation, and loading steps. Instead, data is analyzed directly in its original format — almost in real-time. The technology promises to reduce latency and simplify processes within a single platform.
Let’s take a look at an example: A company using Snowflake as a data warehouse can create a table that references the data in the Salesforce Data Cloud. This means that the organization can query the data directly in Snowflake, even if it remains in the Data Cloud.
Why are the terms important?
We live in an age where everything is instant — thanks to the success of platforms such as WhatsApp, Netflix and Spotify.
Cloud providers such as Amazon Web Services, Google Cloud and Microsoft Azure have taken this to heart: data should be processed and analyzed almost in real-time, without major delays.
What are the challenges?
Here, too, there are similar challenges as with big data technologies: Data quality must be adequate, as incorrect data can lead directly to incorrect decisions during real-time processing. In addition, integration can be complex, although less so than with tools such as Apache Spark or Kafka.
Let me share a quick example to illustrate this: We implemented Data Cloud for a customer — the first-ever implementation in Switzerland since Salesforce started offering the Data Lakehouse solution. The entire knowledge base had to be built at the customer’s side. What did that mean? 1:1 training sessions with the power users and writing a lot of documentation.
This demonstrates a key challenge companies face: They must first build up this knowledge internally or rely on external resources such as agencies or consulting companies.
Small project idea to better understand the terms:
Create a relational database with MySQL or PostgreSQL, add (simulated) real-time data from orders and use a cloud service such as AWS to stream the data directly into an analysis tool. Then visualize the data in a dashboard and show how new data becomes immediately visible.
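As a stripped-down local stand-in for this setup, the following sketch simulates incoming orders and writes them into a relational table that a dashboard could poll. SQLite replaces MySQL/PostgreSQL here only to keep the example self-contained, and the table layout is an assumption:

```python
import random
import sqlite3
import time
from datetime import datetime, timezone

# SQLite stands in for MySQL/PostgreSQL to keep the sketch self-contained
conn = sqlite3.connect("orders.db")
conn.execute("""CREATE TABLE IF NOT EXISTS orders (
                    order_id INTEGER PRIMARY KEY AUTOINCREMENT,
                    product TEXT,
                    amount REAL,
                    created_at TEXT)""")

# Simulate a stream of incoming orders, one per second
for _ in range(10):
    conn.execute(
        "INSERT INTO orders (product, amount, created_at) VALUES (?, ?, ?)",
        (random.choice(["book", "laptop", "phone"]),
         round(random.uniform(10, 500), 2),
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    time.sleep(1)

# A dashboard (or a scheduled query) would poll the latest state like this:
for row in conn.execute("SELECT product, SUM(amount) FROM orders GROUP BY product"):
    print(row)
```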
6 — Event-Driven Architecture (EDA)
If we can transfer data between systems in (almost) real time, we also want to be able to react to it in (almost) real time: This is where the term Event-Driven Architecture (EDA) comes into play.
EDA is an architectural pattern in which applications are driven by events. An event is any relevant change in the system. Examples are when customers log in to the application or when a payment is received. Components of the architecture react to these events without being directly connected to each other. This in turn increases the flexibility and scalability of the application. Typical technologies include Apache Kafka or AWS EventBridge.
Why is the term important?
EDA plays an important role in real-time data processing. With the growing demand for fast and efficient systems, this architecture pattern is becoming increasingly important as it makes the processing of large data streams more flexible and efficient. This is particularly crucial for IoT, e-commerce and financial technologies.
Event-driven architecture also decouples systems: By allowing components to communicate via events, the individual components do not have to be directly dependent on each other.
Let’s take a look at an example: In an online store, the “order sent” event can automatically start a payment process or inform the warehouse management system. The individual systems do not have to be directly connected to each other.
What are the challenges?
- Data consistency: The asynchronous nature of EDA makes it difficult to ensure that all parts of the system have consistent data. For example, an order may be saved as successful in the database while the warehouse component has not correctly reduced the stock due to a network issue.
- Scaling the infrastructure: With high data volumes, scaling the messaging infrastructure (e.g. Kafka cluster) is challenging and expensive.
Small project idea to better understand the terms:
Simulate an Event-Driven Architecture in Python that reacts to customer events:
- First define an event: An example could be ‘New order’.
- Then create two functions that react to the event: 1) Send an automatic message to a customer. 2) Reduce the stock level by 1.
- Call the two functions one after the other as soon as the event is triggered. If you want to extend the project, you can work with frameworks such as Flask or FastAPI to trigger the events through external user input.
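A minimal sketch of these steps in plain Python; the event name, the handler logic and the stock dictionary are assumptions for illustration:

```python
# Handlers subscribe to an event name and are called when the event is published,
# without knowing anything about each other (loose coupling).
subscribers = {}

def subscribe(event_name, handler):
    subscribers.setdefault(event_name, []).append(handler)

def publish(event_name, payload):
    for handler in subscribers.get(event_name, []):
        handler(payload)

# Handler 1: send a (simulated) automatic message to the customer
def notify_customer(order):
    print(f"Message to {order['customer']}: your order {order['id']} was received.")

# Handler 2: reduce the stock level by 1
stock = {"book": 10}

def reduce_stock(order):
    stock[order["item"]] -= 1
    print(f"Stock for {order['item']} is now {stock[order['item']]}.")

subscribe("new_order", notify_customer)
subscribe("new_order", reduce_stock)

# Trigger the event: both handlers react without being coupled to each other
publish("new_order", {"id": 42, "customer": "Alice", "item": "book"})
```

From here, wrapping publish() in a Flask or FastAPI endpoint turns external user input into events.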
Final Thoughts
In this part, we have looked at terms that focus primarily on the storage, management & processing of data. These terms lay the foundation for understanding modern data systems.
In part 2, we shift the focus to AI-driven concepts and explore some key terms such as Gen AI, agent-based AI and human-in-the-loop augmentation.
All information in this article is based on the current status in January 2025.