Most data warehouse professionals have probably heard about Data Vault modelling. But not all of them have direct experience of it. How does it work? What are the benefits? What are the pitfalls?
Data Vault was started a long time ago as a modelling technique, but it evolved into a comprehensive approach that includes architecture and methodology. It is no longer simply modelling. But the modelling is still a very important part of the approach and it is the main concern of this article.
When you compare 3NF modelling, dimensional modelling and DV modelling, you soon recognize that the DV modelling approach solves some of the issues you have faced in the other types of modelling. The benefits have become even more important over time, with the world around us changing and the operational systems used as your data warehouse sources changing too, sometimes even being replaced. Of course, things can seem ever so fine when you read through the DV theory, but they are not so easy when it comes to actually do the modelling.
Before we dive into the modelling approach, let’s start by thinking about the world around us, which we try to describe with the model. There are objects (e.g., a company, a car, a person, etc.), that are usually identified in some way (e.g., a company id, a vehicle identification number, a name, etc.) and have some properties (e.g., an area of business, a number of seats, an eye color, etc.). There can be relationships and hierarchies among the objects (for instance, a person can be a customer of a company and the owner of a car). There can be transactions related to the objects (for instance, the person purchased the car from the company). When you collect this data, you can integrate it, keep a history of it, analyze it, create reports based on it and make a decision based on it. That’s what data warehousing is all about.
How does the DV modelling approach address the above? What are the three basic data structures in the DV model? What are the additional structures that help you to describe reality more precisely?
Let’s start with the real-world objects. DV uses two types of structures to keep data about the objects. HUBs are used for storing the business keys of real-world objects. SATELLITEs are used to store their properties and record their changes over time. If you remember how to count from one to three;-), you are probably not surprised to find that the third structure (LINK) is used for relationships, hierarchies and transactions. But the relationships, hierarchies or transactions can also have some properties and, therefore, SATELLITEs are used not only for storing time-variant descriptive information about the HUBs, but also about the links. It all sounds quite easy. So, let’s start the modelling.
The first task is to look around and ascertain what are the real-world objects. It is sometimes obvious. Nobody has any doubt that a person is a real-world object. But what about an invoice? Is it a real-world object or does it just represent a transaction among multiple real-world objects? Such issues are among the first you have to face when you start the DV modelling. Don’t be afraid. Some tricks can help you to overcome them.
Having real-world objects is a good start. Let’s continue with their identification. You probably remember some of your projects integrating data from several source systems into one data warehouse. One task was to match the corresponding records from different source systems. The DV comes with a good approach to this task. The real-world objects are usually identified by some real-world identifiers called “business keys”. These are stored in the source systems. Using these business keys should simplify the integration. So forget about some artificial keys, which are unique to the operational systems and should not be known to anyone else, and use the business keys. The HUBs are the right place to store them. It again sounds easy, but it is not always easily achievable because some systems do not store the right data. One system could store plate numbers for the cars. Another could store their VINs. A third could store both these identifiers. However, there are various approaches for how to deal with such scenarios, so don’t get held up, go ahead.
When you have the real-world objects and their identifiers, it is a good time to model their relationships and hierarchies and transactions related to them. Adding the links to your model should connect all your HUBs together. An isolated HUB or an island of connected HUBs would stand out as strange. You probably have some gaps in your knowledge of the business. Add more links to connect everything into one “continent”.
When you have such a skeleton model, adding the descriptive information to the SATELLITEs is the easy part. You will probably end up with multiple SATELLITEs for each HUB because that allows you to deal with different frequencies of the source data; it could save disk space, or it could help keep sensitive data separate, etc. The links usually have less SATELLITEs, but do not forget about them.
You will probably add some additional structures (e.g., reference tables, record-tracking SATELLITEs, stage-out views, PIT tables, etc.) to have the complete picture and improve query performance and then you will have your first DV model done. Let’s deploy it and populate it. But do not forget about your business users. The DV model is tailored for the back-end of your data-warehouse (core). It is not meant to form front-end delivery for business users. Let’s build a business vault and data marts on top of it in order to help your business users use your data warehouse easily instead of having to navigate through the complex web of HUBs, links and SATELLITEs.
Let’s start with the second iteration. The beauty of DV modelling is that it is prepared for change regardless of its scale. A small change in the source systems is quite easy to accommodate. Your data warehouse is prepared for the addition of new data sources. Replacing the core operational system is a challenge for your DV model but it is not a catastrophe that leads to rebuilding everything from scratch.
So, what can you expect when building your data warehouse using the DV approach? The main benefit when building properly is that it should survive over time. It becomes increasingly complex over time, but if you follow the methodology during the development, you should not get lost even in a complex data warehouse.
Thanks for reading and expect more information in the next article.
Ready to go but still a bit afraid? Ask someone who has already done this and avoid the pitfalls.