
Why the Single Source of Truth Paradigm in Data Warehousing is Outdated

The old paradigm of the data warehouse serving as the single source of truth in today's ever evolving data landscape can no longer be sustained. Find out why.

Robbert Naastepad

October 15, 2020 | 4 min read

Earlier this year I had a discussion with my fellow team members about a specific statement in a presentation we were about to give to one of our prospects. The statement was:

Use Teradata Vantage to automate data reconciliation into a single source of truth.

Never mind that Teradata Vantage has tons of other capabilities; let's just focus on this statement, because it gave me goosebumps all over. Why? Because the statement is awfully old-fashioned. Let me explain why I am convinced that creating an SSOT is impossible in the data landscapes of today's mid-sized to large enterprises.

For this we need to go back to the definition of Single Source of Truth (SSOT), which I copied from Wikipedia:

SSOT is the practice of structuring information models and associated data schema such that every data element is mastered (or edited) in only one place. Any possible linkages to this data element (possibly in other areas of the relational schema or even in distant federated databases) are by reference only. Because all other locations of the data just refer back to the primary "source of truth" location, updates to the data element in the primary location propagate to the entire system without the possibility of a duplicate value somewhere being forgotten (source: wikipedia).
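To make the definition concrete, here is a minimal Python sketch. The `Customer` and `Order` classes and their fields are hypothetical, invented purely for illustration; the point is only that the email address is mastered in one place and every other record reaches it by reference.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    """The one place where the email address is mastered (edited)."""
    customer_id: int
    email: str


@dataclass
class Order:
    """Holds a reference to the customer, never a copy of the email."""
    order_id: int
    customer: Customer

    @property
    def contact_email(self) -> str:
        # Resolved by reference at read time, so no stale copy can exist.
        return self.customer.email


alice = Customer(customer_id=1, email="alice@old.example")
orders = [Order(order_id=100, customer=alice), Order(order_id=101, customer=alice)]

# A single update in the primary location...
alice.email = "alice@new.example"

# ...is visible through every reference, with no duplicate left behind.
emails = [order.contact_email for order in orders]
print(emails)
```

Had each order stored its own copy of the email, the update would have had to be repeated per order, which is exactly the "forgotten duplicate" the definition warns about.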

I always learned that:

Information is a set of data in context with relevance to one or more people at a point in time or for a period of time. Information is more than data in context—it must have relevance and a time frame. Information is considered to be singular (source: dataversity).

In the early days of data warehousing, we built data warehouses to support data marts designed to answer business questions for one business line, sometimes even for only one person, such as the manager of a department or the director of a company. This was the Executive Information System era (late '80s to late '90s). The data was used in only ONE context: the context of that specific business line, for a specific purpose ... and it was time stamped. Here we could definitely speak of the data warehouse serving as the SSOT. Even Bill Inmon's classic definition of the data warehouse screamed SSOT. Inmon's definition states that a data warehouse is:

Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
Time-variant: The changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time.
Non-volatile: Data in the data warehouse is never overwritten or deleted. Once committed, the data is static, read-only and retained for future reporting.
Integrated: The database contains data from most or all of an organization's operational applications, and this data is made consistent.

The non-volatile part of this definition is specifically SSOT-orientated, meaning that the data elements in the data warehouse referenced only one "source of truth" location, making sure that every data element is mastered (or edited) in only one place.
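The time-variant and non-volatile properties can be sketched in a few lines of Python. This is an illustration under my own assumptions, not how any particular warehouse implements them: `PriceRow` and `PriceHistory` are hypothetical names, and the "table" is just an in-memory list that only ever grows.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a committed row is read-only
class PriceRow:
    product: str
    price: float
    valid_from: date


class PriceHistory:
    """Append-only table: rows are added, never overwritten or deleted."""

    def __init__(self) -> None:
        self._rows: List[PriceRow] = []

    def append(self, row: PriceRow) -> None:
        self._rows.append(row)

    def as_of(self, product: str, day: date) -> Optional[float]:
        """Return the price that was valid for `product` on `day`."""
        candidates = [
            r for r in self._rows
            if r.product == product and r.valid_from <= day
        ]
        if not candidates:
            return None
        # The most recent row on or before `day` wins.
        return max(candidates, key=lambda r: r.valid_from).price


history = PriceHistory()
history.append(PriceRow("widget", 9.99, date(2020, 1, 1)))
history.append(PriceRow("widget", 12.49, date(2020, 7, 1)))  # a change is a new row

price_in_march = history.as_of("widget", date(2020, 3, 15))
price_in_october = history.as_of("widget", date(2020, 10, 15))
print(price_in_march, price_in_october)
```

Because a price change is recorded as a new row instead of an update, a report for March still sees the March price long after the price has changed.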

For all the data in the data warehouse to become information, the context was created in the ETL, ELT or ELTL code. The business rules were built into the code.

In the Business Intelligence and Data Warehousing era (late '90s to early 2010s), we started to realize that the non-volatile part, and even the integrated part, of the data warehouse definition above was very hard to uphold because of the proliferation of data. We simply could not keep up with putting more and more context into our ETL or ELT code. We suddenly had multiple source datapoints feeding a single datapoint in our data warehouse. We were no longer able to integrate all the data from our operational applications. We needed metadata stores or intelligent business rule engines in which to manage our context. This is where the SSOT paradigm in data warehousing started to slip away. We could still hang on to it, but it was a struggle.

In today's era of Analytics (early 2010s until now), in the distributed data landscapes of medium to large companies, with data lakes, data warehouses, data hubs, metadata stores, master data repositories, data lakehouses(?) and whatever other repository or data store you can think of, it is impossible to create the SSOT, simply because it is impossible to manage all the context. Here the context is hidden in a myriad of ETL/ELT interfaces, built on patterns such as point-to-point, hub-and-spoke and enterprise service bus (ESB), and in data delivery agreement documents. The SSOT paradigm has died.
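A small Python sketch shows why context matters so much. Everything in it is an assumption made for illustration: the `Sale` record, the two departments and their business rules are hypothetical. The same facts pass through two interfaces, each with its own rule baked in, and come out as two different "truths".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sale:
    amount: float    # gross amount charged
    refunded: bool   # whether the sale was later refunded
    internal: bool   # whether the buyer was another business unit


# The same facts, loaded once.
sales = [
    Sale(amount=100.0, refunded=False, internal=False),
    Sale(amount=50.0, refunded=True, internal=False),
    Sale(amount=30.0, refunded=False, internal=True),
]


def finance_revenue(rows):
    """Hypothetical finance rule: refunded sales do not count as revenue."""
    return sum(r.amount for r in rows if not r.refunded)


def sales_ops_revenue(rows):
    """Hypothetical sales rule: internal transfers do not count, refunds do."""
    return sum(r.amount for r in rows if not r.internal)


finance_total = finance_revenue(sales)      # 100 + 30
sales_ops_total = sales_ops_revenue(sales)  # 100 + 50
print(finance_total, sales_ops_total)
```

Neither number is wrong. Each is the truth in its own context, and with hundreds of such interfaces no single store can hold all of those contexts at once.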

Besides SSOT, there is another paradigm in data warehousing: the Single Version of the Truth (SVOT):

SVOT is a technical concept describing the data warehousing ideal of having either a single centralised database, or at least a distributed synchronised database, which stores all of an organisation's data in a consistent and non-redundant form. The single version of the truth is the operating data that an entire company agrees is the real and trusted information (source: wikipedia).

To me this is Data Management Utopia. It means that the whole organization needs to have the highest level of data management maturity. All datapoints within an organization are defined and agreed upon, have an owner, are governed and have their metadata set and agreed upon. I have yet to see an enterprise this mature.

With Teradata Vantage we can instead create a Single Source of Facts (SSOF): one window through which to look at all your enterprise data. Use Teradata Vantage as the glue in your data and analytics ecosystem. Connect Teradata Vantage to your data lake, your (micro)services, enterprise applications and data warehouses, wherever they are: on premises or in the AWS, Azure or Google clouds.
