Data Architecture: Lessons Learned
Three important lessons I have learned on my journey as a data engineer and architect
It’s only been a few months since I once again had to experience what I sometimes refer to as the “self-satisfaction of IT.”
That may sound a bit harsh, but unfortunately I experience this time and again. It can be frustrating to see IT departments actually working against their business.
I remember one specific case where a running business solution had to be migrated to another execution platform solely for ‘technical’ reasons. Sure, the business was told that the target platform would be much cheaper to maintain, but IT offered no tangible evidence for that assertion. Ultimately, the decision to migrate was driven by ‘expert knowledge’ and so-called ‘best practices’, viewed solely from an IT-centric perspective. It cost a fortune to migrate what already worked, only to find that the promised cost reductions never materialized and, worse, that business functionality deteriorated in some cases.
IT professionals, not only in technology-oriented companies, tend to believe that technology, IT tools, and nowadays also data are ends in themselves.
Nothing could be further from the truth.
Although organizational changes are often recommended to improve cooperation between business and IT, the structure itself is not the critical factor. I’ve observed that companies with entirely different organizational setups can still achieve strong collaboration.
So what is the recipe for success of these companies?
What all these organizations had in common is a rigorous focus on the business. Not only in sales-related departments but in every other supporting unit, including and especially IT. It’s the mindset and attitude of their people, the enterprise culture if you like. The willingness to scrutinize everything against one core requirement: Does it generate a business benefit?
The following practices have proven remarkably effective at focusing on business value and preventing silo thinking. Follow them to move away from a one-sided belief in technology towards a modern company that optimally interconnects its digitalized business processes through universal data supply.
Beware of silo specialization
The definition of the ‘data engineering lifecycle’, as helpful and organizing as it might be, is actually a direct consequence of silo specialization.
It made us believe that ingestion is the unavoidable first step of working with data, followed by transformation, with data serving as the final step that concludes the process. It almost seems as if everyone has accepted this pattern as the definition of what data engineering is all about.
While this is a helpful general pattern for the current definition of data engineering, it is not at all the target we should aim for.
The fact that we have to extract data from a source application and feed it into a data processing tool, a data or machine learning platform, or business intelligence (BI) tools to do something meaningful with it is really just a workaround, one made necessary by the completely inadequate way enterprises deal with data today.
We should take a completely different approach, one that creates data as products to be exchanged across the enterprise. The producers and consumers of these products comprise machine learning (ML) models, business intelligence (BI) processes (including any manual processes producing valuable business results), and the operational applications that together create the digitalized business information in your enterprise. I have described this approach as universal data supply.
You may think it’s all well and good to come up with a completely new idea. But we have many systems in use that follow precisely these old, well-trodden paths. Most importantly, people have come to accept this as the standard way of doing things. It feels comfortable, and the status quo remains largely unquestioned.
What really needs to happen is a rethinking of how software engineers collaborate with data engineers, and how both groups work together with the business teams. Doing this also offers a practical way to bring software engineering and data engineering closer together again in overlapping areas. It doesn’t mean fully merging these disciplines, but rather acknowledging how much they share in common.
I have previously written about the need to redefine data engineering. This goes hand in hand with the realization that all advances in the software development of applications can and should be transferred in full to the discipline of data engineering.
After we have built all too many brittle data pipelines, it’s time for data engineers to acknowledge that fundamental software engineering principles are just as crucial for data engineering. Since data engineering is essentially a form of software engineering, foundational practices such as CI/CD, agile development, clean coding with version control, test-driven development (TDD), modularized architectures, and addressing security early in the development cycle should be applied in data engineering as well.
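To make that concrete, here is a minimal sketch of what test-driven development can look like for a data transformation. Everything in it (the OrderRecord type, the normalize_revenue function, the currency rule) is hypothetical and invented for illustration; the point is simply that pipeline logic lives in a plain, versioned Python module that can be tested like any other software.

```python
# transform.py (module and tests shown together for brevity)
# A hypothetical transformation module: plain, versioned, testable code.
from dataclasses import dataclass

import pytest


@dataclass(frozen=True)
class OrderRecord:
    order_id: str
    gross_amount: float  # amount in the source currency
    fx_rate: float       # conversion rate into the reporting currency


def normalize_revenue(record: OrderRecord) -> float:
    """Convert gross revenue into the reporting currency."""
    if record.fx_rate <= 0:
        raise ValueError(f"invalid fx_rate for order {record.order_id}")
    return round(record.gross_amount * record.fx_rate, 2)


# Tests written first, TDD-style; runnable with `pytest transform.py`.
def test_normalize_revenue_converts_currency():
    record = OrderRecord(order_id="o-1", gross_amount=100.0, fx_rate=1.1)
    assert normalize_revenue(record) == 110.0


def test_normalize_revenue_rejects_bad_fx_rate():
    record = OrderRecord(order_id="o-2", gross_amount=100.0, fx_rate=0.0)
    with pytest.raises(ValueError):
        normalize_revenue(record)
```

Exactly the same module then flows through version control and a CI/CD pipeline, no differently from any other piece of application code.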
But a narrow focus within an engineering discipline often leads to a kind of intellectual and organizational isolation in which the larger commonalities and interdisciplinary synergies are no longer recognized. This is how the ‘data engineering silo’ formed, in which not only knowledge and resources but also concepts and ways of thinking became isolated from the software engineering discipline. Collaboration and understanding between the two disciplines became more difficult. I think this undesirable situation needs to be corrected as quickly as possible.
Unfortunately, the very same silo thinking appears to be repeating itself with the hype around artificial intelligence (AI) and its sub-discipline, machine learning (ML). ML engineering is about to become the next big silo.
Although there is no doubt that much in the development of ML models differs from traditional software development, we must not overlook the large overlaps that remain. Most of the processes involved in developing and deploying an ML model are still grounded in traditional software development practices. The production-ready ML model is essentially just another application in the overall IT portfolio, and it needs to be integrated into universal data supply like any other software application.
Consequently, we should only specialize in areas that are truly different, collaborate in overlapping areas and take great care to avoid silo thinking.
Model your business
Yes, it’s true what many consultants often emphasize. Without a clear understanding of your business processes, it’s impossible to structure your IT or data architecture in a way that effectively aligns with your business needs.
To effectively model your business, you need engineers who have a deep understanding of your business operations — and that goes for software, data and machine learning engineers alike. Achieving this understanding requires close cooperation between your business teams and IT professionals.
Since this is non-negotiable, it’s essential to promote a culture of collaboration and encourage end-to-end thinking across teams.
IT serves the business
We need IT professionals who view technology solely as a tool to support the business case — nothing more, nothing less. While this might sound obvious, I’ve often noticed, especially in larger companies, that IT thinking tends to become disconnected from business objectives.
This situation may have been encouraged by CIOs attempting to reinvent their IT departments, arguing that IT should no longer merely act as a cost-cutting or supporting unit but should itself contribute to revenue generation.
While IT can contribute to innovation and new digital products, its primary role remains enabling and enhancing business operations, not acting as an independent entity with separate goals.
Business needs IT to be efficient and smart
On the other hand, it’s essential for business people to recognize that modern IT technology enables them to achieve things for customers that were previously impossible. It’s not just about the growing ability of IT to automate business processes; it’s also about creating products and services that could not even exist without this technology. Moreover, it involves empowering the company through the intelligent use of available data for optimal operations and for continuous improvement of business processes.
This requires end-to-end thinking from both business and IT professionals. Close collaboration and intensive exchange of ideas and perspectives help to stay aligned. The silo specialization mentioned above is just as harmful between business units and IT as it is within IT departments.
Business models shape your IT models
Business models are almost always process models. Since data is essentially the means of exchanging information between business processes, data teams need to focus on organizing this exchange as efficiently and seamlessly as possible.
Data models derived from the business models are absolutely crucial to organize the exchange. While it’s the main responsibility of data engineering to offer governance and guidance to moderate the overall modeling process in the enterprise, it’s essentially a joint exercise of all involved parties.
Unlock data from your applications
In information theory everything starts with data. Even logic is derived from data when it is compiled or interpreted from source code. Hence, we could argue that all data is to be kept in applications that implement the logic.
However, I have argued that data, which does not represent logic, has fundamentally different characteristics compared to applications. As a result, it seems much more efficient to manage data separately from the applications.
This fact, together with the reality that data in RAM is volatile and therefore needs to be persisted in durable storage, is the main reason why data engineering is justified as a discipline of its own. The sole purpose of data engineering is therefore to manage and organize data independently of applications. Data engineering must provide the infrastructure to unlock data from applications and enable its seamless sharing between them.
This is essentially the same challenge that relational databases faced when they became so popular that they were expected to serve as the enterprise-wide shared data storage for any application. Today, we recognize that one type of database isn’t enough to meet all the diverse requirements. However, the concept of a data infrastructure that allows for data sharing across all applications remains compelling.
To achieve this, we need to reconceptualize the shared database as a flexible data mesh that is highly distributed, can support both batch and stream processing, and enables the integration of business data with business context (also known as metadata and schema) into data as products. The mesh facilitates the seamless sharing of these products across all applications. This explicitly includes ML models that are derived from data and finally deployed as intelligent applications to generate valuable predictions at inference time, which can also be treated as new data to be shared across the enterprise.
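As a rough illustration of what such a data product might carry, the sketch below bundles the payload together with its schema and business context into a single shareable unit. All of the names (DataProduct, SchemaField, and so on) are hypothetical, not taken from any particular mesh implementation; the point is that business meaning travels with the data instead of staying locked in the source application.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class SchemaField:
    name: str
    dtype: str
    description: str  # the business meaning, not just a technical type


@dataclass(frozen=True)
class DataProduct:
    name: str                            # e.g. "customer-orders"
    owner: str                           # the producing business domain, not the data team
    version: str
    schema: tuple[SchemaField, ...]      # business context travels with the data
    records: tuple[dict[str, Any], ...]  # batch payload; a stream would publish these incrementally
    produced_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


orders = DataProduct(
    name="customer-orders",
    owner="sales",
    version="1.2.0",
    schema=(
        SchemaField("order_id", "string", "Unique order identifier issued at checkout"),
        SchemaField("net_amount", "decimal", "Order value after discounts, in EUR"),
    ),
    records=({"order_id": "o-1", "net_amount": 99.50},),
)
```

Whether such a product is materialized as a batch snapshot or published record by record on a stream is then an implementation detail of the mesh, not of the product’s definition.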
Moreover, any results that business analysts produce using business intelligence tools given to them as ‘end-user’ utilities should also be acknowledged as valuable new business data. Although these ‘end-user maintained applications’ are often not considered part of the official IT application portfolio, they generate business information that is crucial for the organization. As a result, any business data generated by these end-user applications also needs to be embraced by the engineering teams.
Digitalize business context as data
The data infrastructure enables you to combine your data with rich business context, allowing every consumer to accurately interpret the provided business data independently from the source applications. This capability is often referred to as the semantic layer within a data architecture.
However, it is the responsibility of the data producers, i.e. the owners of the applications, to provide the necessary business context. It is not the task of the data team to provide this information detached from the source application. Data engineers cannot reconstruct what the responsible business departments have failed to deliver. This is the main reason why I advocate not implementing business logic in data teams.
Instead, the data engineering team should focus on delivering the technical infrastructure and governance processes to support all business units in making this business context readily available to everyone.
From an organizational perspective, business teams must supply the content and rules that enable software engineers to deliver data as products that can be shared throughout the enterprise, leveraging the data mesh established by data engineers.
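One way to picture that split of responsibilities in code: data engineering ships a generic governance check, while the producing business unit supplies the actual context. A minimal sketch, reusing the hypothetical DataProduct from the previous example:

```python
def validate_business_context(product: DataProduct) -> list[str]:
    """A generic governance check owned by data engineering: it verifies
    that business context is present, but never invents that context itself."""
    problems = []
    if not product.owner:
        problems.append(f"{product.name}: no owning business domain declared")
    for f in product.schema:
        if not f.description.strip():
            problems.append(f"{product.name}.{f.name}: missing business description")
    return problems


# The 'orders' product defined above passes, because its owning
# business domain supplied a description for every field.
assert validate_business_context(orders) == []
```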
Decentralize
If a large enough organization has recognized that its business processes are too complex to be implemented in a single application, it should embrace decentralized architectures and principles in order to manage the loosely coupled applications and their data as a coherent whole.
While IT is frequently discussed when considering centralization, the only areas that truly require central management are foundational services like IT infrastructure, security, governance, and compliance.
I believe that the idea of collecting data in a central data repository or platform (be it a data warehouse or a data lake(house)) from sources organized in a highly decentralized IT application infrastructure is doomed to failure. The decentralized approach of universal data supply seems to be the better approach to truly empower business with the data that is created and transformed in all the different applications.
If we prevent silo thinking in the company by actively promoting collaboration and end-to-end thinking, we will not only have more efficient IT departments, but also a better alignment with business goals.
The rigorous focus on business objectives in IT departments ultimately leads to better applications and application-independent information models. This enables the efficient use of all available data for optimal operation and continuous improvement of business processes and their applications.
Centralizing data in a single repository (like a data warehouse or data lake) is increasingly unsustainable for large, complex organizations. We need to manage the exchange of data in a decentralized data architecture as described in universal data supply.
Implementing universal data supply as a new data architecture for enterprises holds great promise. However, the challenge lies in how to transition from our current systems without necessitating a complete redesign. How do we evolve our architecture toward this innovative concept without discarding what already works?
Well, the good news is that we can make incremental improvements. We won’t need a brand-new data platform, machine learning platform, or a complete overhaul of our existing IT architecture. We can also retain our data warehouse and data lake(house), although redefined in scope and role.
I will be launching a new series of articles that will outline practical strategies for implementing universal data supply, drawing on concrete industry examples. You can expect a step-by-step guide to adopting this decentralized approach. Stay tuned for more insights!