OLAP is Dead — Or Is It ?

10/21/2024 Jesus Santana

OLAP’s fate in the age of modern analytics

In 1993, E.F. Codd & Associates introduced the term OLAP (Online Analytical Processing) to describe techniques for answering multidimensional analytical queries from various perspectives. OLAP primarily involves three key operations (illustrated with a small code sketch right after the list):

  • Roll-up : Summarizing data at higher levels of aggregation,
  • Drill-down : Navigating to more detailed levels of data,
  • Slice and dice : Selecting and analyzing data from different viewpoints.
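
For readers who think in code, here is a minimal sketch of these three operations using pandas on a tiny, made-up sales table (the column names and figures are purely illustrative):

import pandas as pd

# Hypothetical sales fact table: one row per (year, quarter, region) combination.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "region":  ["EU", "US", "EU", "US", "EU", "US", "EU", "US"],
    "amount":  [100, 120, 110, 130, 115, 140, 125, 150],
})

# Roll-up: summarize at a higher level of aggregation (quarter -> year).
by_year = sales.groupby("year")["amount"].sum()

# Drill-down: navigate to a more detailed level (year -> year, quarter, region).
detailed = sales.groupby(["year", "quarter", "region"])["amount"].sum()

# Slice: fix a single member of one dimension (region = EU).
eu_slice = sales[sales["region"] == "EU"]

# Dice: select a sub-cube across several dimensions (2024, Q1 only).
dice = sales[(sales["year"] == 2024) & (sales["quarter"] == "Q1")]

print(by_year, detailed, sep="\n\n")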

Browsing the web nowadays, it feels like every data analytics issue is somehow tied to trendy self-service BI, focused on crunching Big Data with AI on steroids. Platforms like LinkedIn and Reddit are flooded with endless discussions about the disadvantages of outdated OLAP compared to the latest trends in data analytics for all. So yes, we can confidently declare: OLAP is dead. But wait… is it really?

RIP OLAP (Image by the author — AI generated)

Who Am I and Why This Post ?

Before we dive into that disputed subject, let me introduce myself and explain why I’m bothering you with this post. I work at icCube, where, among other things, I solve the technical challenges of our customers. Occasionally, the sales team asks me to join demos for potential clients, and, almost without fail, the central concern of data scalability comes up — to handle the (soon-to-be) Big Data of that customer. Being a technical and pragmatic person, my naive, non-sales response would be:

Could we first please define the actual problems to see if we really need to talk about Big Data ?

Ouch ;-) Told you, I’m a techie at heart. So, in this post, I’d like to clarify what OLAP means in 2024 and the kinds of challenges it can solve. I’ll draw from my experience at icCube, so I might be a bit biased, but I’ll do my best to remain objective. Feel free to share your thoughts in the comments.

OLAP != OLAP Cube

OLAP is often, if not always, used interchangeably with OLAP Cube — i.e., a materialized structure of pre-aggregated values in a multidimensional space. With this wrong definition, it’s easy to see why people might say OLAP is outdated, as advances in technology have reduced the need for pre-aggregation.

However, OLAP is not synonymous with OLAP Cubes. If there’s one thing I would highlight from the various definitions and discussions about OLAP, it’s that OLAP embodies a set of concepts and methods for efficiently analyzing multidimensional data.

Chris Webb captured this well in a post, looking back at the old days:

By “OLAP” I mean the idea of a centralised model containing not just all your data but also things like how your tables should be joined, how measures aggregate up, advanced calculations and KPIs and so on.

In his post, “Is OLAP Dead”, Chris Webb also referred to the FASMI Test as a way to qualify an OLAP system in just five keywords : “Fast Analysis of Shared Multidimensional Information”.

FAST : means that the system is targeted to deliver most responses to users within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds.

ANALYSIS : means that the system can cope with any business logic and statistical analysis that is relevant for the application and the user, and keep it easy enough for the target user.

SHARED : means that the system implements all the security requirements for confidentiality (possibly down to cell level).

MULTIDIMENSIONAL : is our key requirement. If we had to pick a one-word definition of OLAP, this is it. The system must provide a multidimensional conceptual view of the data, including full support for hierarchies and multiple hierarchies, as this is certainly the most logical way to analyze businesses and organizations.

INFORMATION : is all of the data and derived information needed, wherever it is and however much is relevant for the application.

I found it amusing to realize that this definition was introduced back in 2005, in a post subtitled :

An analysis of what the often misused OLAP term is supposed to mean.

So, it’s quite clear that this confusion is not something new, and our marketing and sales colleagues have contributed to it. Note that this definition does not specify how an OLAP system should be implemented. An OLAP cube is just one possible technology for implementing an OLAP solution.

Based on my experience in the field, MULTIDIMENSIONAL and SHARED are the key requirements. I would replace SHARED with SECURED and make “down to cell level” mandatory rather than optional: a complex multidimensional data model with security constraints inevitably leads to a complex security profile. Note that the FASMI Test does not mandate anything regarding the absolute size of the data being analyzed.

Before diving into the five key terms and showing how they apply to modern tools, let’s first challenge several widely held beliefs.

Data Analytics != Big Data Analytics

Inevitably, the Big Data argument is used to assert that OLAP is dead.

I could not agree less with that assertion. However, let’s see what Jordan Tigani is saying in the introduction of his “BIG DATA IS DEAD” post from early 2023 :

Of course, after the Big Data task force purchased all new tooling and migrated from Legacy systems, people found that they still were having trouble making sense of their data. They also may have noticed, if they were really paying attention, that data size wasn’t really the problem at all.

It’s a very engaging and informative post that goes beyond the marketing hype, and there’s no need for me to reiterate here what I experience on a much smaller scale in my job. His conclusion:

Big Data is real, but most people may not need to worry about it. Some questions that you can ask to figure out if you’re a “Big Data One-Percenter”:
- Are you really generating a huge amount of data?
- If so, do you really need to use a huge amount of data at once?
- If so, is the data really too big to fit on one machine?
- If so, are you sure you’re not just a data hoarder?
- If so, are you sure you wouldn’t be better off summarizing?
If you answer no to any of these questions, you might be a good candidate for a new generation of data tools that help you handle data at the size you actually have, not the size that people try to scare you into thinking that you might have someday.

I’ve nothing to add at this point. Later in this post, we’ll explore how modern OLAP tools can help you manage data at the scale you’re working with.

Data Analytics != Self-Service BI

Inevitably, Self-Service BI is another argument used to assert that OLAP is dead.

With Self-Service BI, business users are empowered to access and work with raw corporate data independently, without needing support from data professionals. This approach allows users to perform their own analyses, generate reports, and create dashboards using user-friendly tools and interfaces.

Even if we accept that the required analytics are straightforward enough for any businessperson to handle, or that the tools are advanced enough to manage more complex analytics and security profiles, there is still an underlying assumption: the data is already clean and ready for making business decisions.

At icCube, during the enablement phase of customer projects, 80% of the time is spent cleaning and understanding the actual data and the business model behind it. Perhaps surprisingly, a significant portion of this time is also spent communicating with the few individuals who know both the technical and the business worlds. On reflection, this is not surprising: the data model typically evolves over many years, becomes increasingly complex, and people come and go.

But let’s assume the raw data is clean and the business users understand it perfectly. Then what happens when hundreds (or even thousands) of reports are created, likely accessing the OLTP databases directly (as there is no IT involvement in creating an analytical data repository)? Are they consistent with each other? Do they follow the same business rules? Are they computing things correctly? Are they causing performance issues?

And assuming all is fine, then how do you maintain these reports? And more importantly, how do you manage any required change in the underlying raw data as there is no easy way to know what data is used where?

So similarly to the Big Data argument, I do not believe that Self-Service BI is the actual solution for every modern analytical challenge. In fact, it can create more problems in the long run.

Data Analytics != Generative AI Data Analytics

At last, the AI argument: you no longer need your OLAP engine, and by the way, you no longer need any analytical tool at all. AI is here to rule them all! I’m exaggerating a bit, but not by much, considering all the hype around AI ;-)

More seriously, at icCube, even though we’re currently skeptical about using AI to generate MDX code or to analyze data, that certainly does not mean we’re against AI. Quite the contrary, in fact. We’ve recently introduced a chatbot widget to help end users understand their data, and we’re actively investigating how to use AI to improve the productivity of our customers. The issues we’re facing with it are mainly:

  • It’s not accurate enough to put in front of end users who cannot spot the hallucinations.
  • It’s overkill for end users who are experts in the domain and can understand and fix the hallucinations themselves.
  • The cost of each query (that is, the LLM inference cost).

But don’t just take my word for it — I’d like to highlight the practical and similar approach shared by Marco Russo. You can check out his YouTube video here. For those short on time, skip ahead to the 32-minute mark, where Marco shares his thoughts on using ChatGPT to generate DAX code.

Right now, generative AI is not ready to replace any OLAP system and certainly cannot be used as an argument to say OLAP is dead.

Now, let’s return to the FASMI Test and take a look at the five key terms that define an OLAP system.

FASMI Test : Fast

means that the system is targeted to deliver most responses to users
within about five seconds, with the simplest analyses taking no more than
one second and very few taking more than 20 seconds.

Delivering fast response times to analytical queries is no longer exclusive to OLAP systems. However, it remains an added benefit of OLAP systems, which are specifically tailored for such queries. One significant advantage is that it helps avoid overloading OLTP databases (or whatever the actual sources of data are) because:

  • A dedicated data warehouse may have been created.
  • It may act as a cache in front of the actual data sources.

An additional benefit of this intermediate layer is that it can help reduce the costs associated with accessing the underlying raw data.
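
As a toy illustration of that caching idea (not how any specific OLAP engine implements it), the sketch below memoizes an aggregation so that repeated dashboard refreshes do not hit the operational source again; the in-memory SQLite table and its rows are invented for the example:

from functools import lru_cache
import sqlite3

# In-memory stand-in for the operational (OLTP) source; schema and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(2024, "EU", 100.0), (2024, "US", 120.0), (2024, "EU", 30.0)])

@lru_cache(maxsize=128)
def revenue_by_region(year: int):
    # The source is queried only on a cache miss; repeated calls with
    # the same parameters are answered from memory.
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales WHERE year = ? GROUP BY region", (year,))
    return tuple(cur.fetchall())

print(revenue_by_region(2024))  # hits the database
print(revenue_by_region(2024))  # served from the cache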

FASMI Test : Analysis

means that the system can cope with any business logic and statistical
analysis that is relevant for the application and the user, and keep it
easy enough for the target user.

OLAP systems are designed to perform complex analytical queries and, as such, offer a range of features that are often not available out of the box in other systems. Some of these features include :

  • Slice-and-dice capabilities : allow users to explore data from different perspectives and dimensions.
  • Natural navigation : supports intuitive navigation through parent/child hierarchies in the multidimensional model.
  • Aggregation measures : supports various aggregations such as sum, min, max, opening, closing values, and more.

To support all these capabilities, a specialized query language is needed. MDX (Multi-Dimensional Expressions) is the de facto standard for multidimensional analysis.

Some advanced and possibly non-standard features that we frequently use with our customers are :

  • Time period comparisons : facilitates time-based analyses like year-over-year comparisons.
  • Calculated measures : enables the creation of ad-hoc calculations at design or runtime (see the sketch after this list).
  • Calculated members : similar to calculated measures but can be applied to any dimension. For example, they can be used to create helper dimensions with members performing statistics based on the current evaluation context.
  • Advanced mathematical operations : provides vectors and other structures for performing complex mathematical calculations elegantly (statistics, regressions… ).
  • MDX extensions : functions, Java code embedding, result post-processing, and more.
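
To make the idea of a calculated measure concrete, here is a minimal pandas sketch (the fact table and the margin formula are purely illustrative; in an MDX-based engine the same measure would be written as an MDX expression over base measures):

import pandas as pd

# Hypothetical fact table with two base measures.
facts = pd.DataFrame({
    "product": ["A", "A", "B", "B"],
    "revenue": [200.0, 240.0, 150.0, 180.0],
    "cost":    [120.0, 150.0, 100.0, 110.0],
})

# Aggregate the base measures first, then derive the calculated measure,
# e.g. the MDX-style expression ([Measures].[Revenue] - [Measures].[Cost]) / [Measures].[Revenue].
by_product = facts.groupby("product")[["revenue", "cost"]].sum()
by_product["margin_pct"] = (by_product["revenue"] - by_product["cost"]) / by_product["revenue"] * 100

print(by_product)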

FASMI Test : Shared

means that the system implements all the security requirements for
confidentiality (possibly down to cell level).

Based on my experience, I believe this is the second most important requirement after the multidimensional model. In every customer model where security is needed, defining proper authorization becomes a significant challenge.

I would suggest improving the FASMI Test by making cell-level granularity mandatory.

Microsoft Analysis Services, icCube, and potentially other platforms allow security to be defined directly within the multidimensional model using the MDX language (introduced in the previous section). This approach is quite natural and often aligns with corporate hierarchical security structures.

Defining security at the multidimensional model level is particularly important when the model is built from multiple data sources. For instance, applying corporate security to data from sources like IoT sensors could be very complex without this capability.

Since the FASMI Test was introduced, embedding analytics directly into applications has become a critical requirement. Many OLAP systems, including Microsoft Analysis Services and icCube, now support the dynamic creation of security profiles at runtime — once users are authenticated — based on various user attributes. Once this security template is defined, it will be applied on-the-fly each time a user logs into the system.
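
As a rough sketch of the idea (the actual mechanism in Analysis Services or icCube is expressed declaratively in the model, not in application code), a runtime security profile boils down to filtering the multidimensional space with attributes resolved at login; the region attribute below is hypothetical:

import pandas as pd

# Hypothetical fact table.
facts = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "amount": [100, 110, 120, 130],
})

def apply_security_profile(df: pd.DataFrame, allowed_regions: set) -> pd.DataFrame:
    # Keep only the cells the authenticated user is entitled to see.
    return df[df["region"].isin(allowed_regions)]

# A user whose login resolves to the EU branch of the Region hierarchy.
print(apply_security_profile(facts, {"EU"}))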

FASMI Test : Multidimensional

is our key requirement. If we had to pick a one-word definition of OLAP,
this is it. The system must provide a multidimensional conceptual view of
the data, including full support for hierarchies and multiple
hierarchies, as this is certainly the most logical way to analyze
businesses and organizations.

I completely agree. A multidimensional model is essential for data analytics because it offers a structured approach to analyzing complex data from multiple perspectives (data doesn’t exist in isolation) and often aligns with corporate hierarchical security frameworks.

Intuitive for Business Users

This model mirrors how businesses naturally think about their data — whether it’s products, customers, or time periods. It’s much more intuitive for non-technical users, allowing them to explore data without needing to understand complex SQL queries. Key features like parent-child hierarchies and many-to-many relationships are seamlessly integrated.

Enhanced Data Aggregation and Summarization

The model is built to handle aggregations (like sum, average, count) across dimensions, which is crucial for summarizing data at various levels. It’s ideal for creating dashboards that present a high-level overview, with the ability to drill down into more detailed insights as needed.

Facilitates Time Series Analysis

Time is a critical dimension in many types of data analysis, such as tracking trends, forecasting, and measuring performance over periods. A multidimensional model easily integrates time as a dimension, enabling temporal analysis, such as year-over-year or month-over-month comparisons.
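
A minimal sketch of such a year-over-year comparison, using pandas on made-up monthly figures (an MDX engine would typically express this with a ParallelPeriod-style calculation):

import pandas as pd

# Hypothetical monthly revenue over two years.
monthly = pd.DataFrame({
    "month":   pd.date_range("2023-01-01", periods=24, freq="MS"),
    "revenue": [100 + i for i in range(24)],
})

# Compare each month with the same month one year earlier.
monthly["revenue_prev_year"] = monthly["revenue"].shift(12)
monthly["yoy_growth_pct"] = (monthly["revenue"] / monthly["revenue_prev_year"] - 1) * 100

print(monthly.tail(12))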

Data Complexity in the Real World

Despite the rise of no-code data tools, real-world data projects are rarely straightforward. Data sources tend to be messy, evolving over time with inconsistencies that add complexity. Accessing raw data can be challenging with traditional SQL-based approaches. Given the shortage of skilled talent, it’s wise to first establish a clean semantic layer, ensuring data is used correctly and that future data-driven decisions are well-informed.

Trust and Reliability in Analytics

One major advantage of a well-defined multidimensional model (or semantic layer) is the trust it fosters in the analytics provided to customers. This robust model allows for effective testing, enabling agile responses in today’s fast-paced environment.

Perceived Inflexibility

The semantic layer in OLAP serves as a crucial step before data access, and while it may initially seem to limit flexibility, it ensures that data is modeled correctly from the start, simplifying future reporting. In many cases, this “inflexibility” is more perceived than real. Modern OLAP tools, like icCube, don’t rely on outdated, cumbersome processes for creating OLAP cubes and can even support incremental updates. For example, icCube’s category feature allows even new dimensions to be created at runtime.

In summary, OLAP and dimensional models continue to offer critical advantages in handling complex business logic and security, despite the perceived inflexibility compared to direct access to raw data.

FASMI Test : Information

is all of the data and derived information needed, wherever it is and
however much is relevant for the application.

Pulling data from various sources — whether SQL, NoSQL, IoT, files, or SaaS platforms — is no longer something exclusive to OLAP systems. However, OLAP systems still offer a key advantage: they are designed specifically to create a secure multidimensional model that serves as the de facto semantic layer for your analytical needs.
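
As a simplified sketch of that idea (the sources, column names, and join are invented for illustration), the value of the semantic layer is that disparate extracts are joined once, centrally, so every report sees the same consolidated view:

import pandas as pd

# Hypothetical extracts: IoT-style readings and a product master from a SQL export.
readings = pd.DataFrame({
    "device_id":    [1, 1, 2],
    "product_code": ["P-10", "P-10", "P-20"],
    "value":        [0.8, 0.9, 1.1],
})
products = pd.DataFrame({
    "product_code": ["P-10", "P-20"],
    "product_name": ["Pump", "Valve"],
})

# Joined once in the shared model rather than separately in every report.
model = readings.merge(products, on="product_code", how="left")
print(model.groupby("product_name")["value"].mean())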

FASMI Test : Still Relevant in 2024 ?

The original definition of the FASMI Test aimed to offer a clear and memorable description of an Online Analytical Processing (OLAP) system: Fast Analysis of Shared Multidimensional Information. I believe this definition remains relevant and is more necessary than ever. In 2024, people should no longer confuse OLAP with one of its past implementations — the outdated OLAP Cube.

Do you Need OLAP in 2024 ?

As a practical person, I won’t suggest specific tools without understanding your current data analytics challenges. I recommend carefully identifying your current needs and then looking for the right tool. Most importantly, if you’re satisfied with your current analytical platform, don’t change it just for the sake of using the latest trendy tool.

However, if you’re :

  • struggling to query complex multidimensional business models,
  • struggling to apply complex security that must align with corporate hierarchical security models,
  • struggling to write complex calculations for advanced analytics,
  • struggling to manage hundreds or thousands of very disparate queries/dashboards,
  • struggling to open dashboards in under a second,
  • struggling to source and merge data from disparate systems,
  • struggling to trust your analytics insights,

then it is worth considering modern OLAP systems. Rest assured, they are not obsolete and are here to stay for a while. Modern OLAP tools are actively developed and remain relevant in 2024. Moreover, they benefit from the latest advances in:

  • big-data technologies,
  • self-service features,
  • generative AI,

to implement new features or enhance existing ones that improve the productivity of end users. But this is a topic for a future post. So stay tuned!

The interested reader can explore the available OLAP servers on this Wikipedia page.

