In the previous article, I shared with you some basic ideas about DV (trying to be realistic: not everything is quick and easy;-). On the other hand, there is no reason to be afraid). In this article, I will share lessons learned from using artificial keys from primary source systems to identify records in your DW and I will discuss one issue that people could encounter when building a Raw Vault model.
Let’s consider a situation in which you are about to add the regions subject area to your DW model. This is required by the sales department and you will load data about the regions from one source system only. The main source table is very simple:
So, without having to think too much, you add this table to your core model as is, i.e., using the values of PK directly from the source system.
After some time, there are requirements related to the regions from the purchasing department and you are required to load data about regions from another source system. The main table in the other source system looks like this:
You are not too concerned about this main table because you already have a good table in your DW with the same structure. The data required by the purchasing department is in the other tables referencing the main table. But the foreign keys in the other source system use different identifiers to the identifiers used in your DW:-( In order to integrate data from the two source systems together (you know that integration is one of the key benefits of DW), you decided to use the matching column in the main tables from both source systems (i.e., region_name in this case). Job done.
After some time, the purchasing department started operating in a new region, e.g., North-West.
A new record was added to their system but nobody told you about it. They should have told you a day before at the latest, but there is no process and no master data management system in your company. So, the end result is that your loading failed because the lookup of “North-West” failed and no primary and foreign keys were assigned. So, you manually added primary key value 7 for “North-West” region to your DW table, hoping (7 is your lucky number;-) that the sales department will not add more than two regions (5 and 6) before you implement a better solution. Based on this experience, you decided to redesign the region subject area in your DW. You decided to use surrogate keys (sequence or hash) controlled by the DW department. The source column for your surrogate keys would be region_name. In addition, you decided to add some additional metadata because an experienced friend told you that it could be useful. You added the date the value of region_name appeared for the very first time and the name of the source system it was loaded from. The structure of your re-designed table is as follows:
In the end, you modelled a Data Vault HUB even if you didn’t know it. The structure of the table is in accordance with the HUB definition. region_name is your business key. If you’d learnt a little bit about Data Vault before you started adding regions to your DW model, you would model the regions table this way from the very beginning and avoid relying on lucky 7;-) The Data Vault approach was tailored for DW modelling and it is based on the experience and pitfalls like those described above. The HUB will survive the adding of new records in any source system, the adding of new source systems or even the replacing or decommissioning of existing source systems. Does it all sound too good to be true? The reality is that there are some issues with business keys and I will discuss them in the following article.
So, you finally made up your mind about the modelling approach for your new DW and started with Data Vault model. You’ve already modelled the hubs for products and customers. You’ve just finished the hubs for your branches and other channels through which you sell your products and you’ve started thinking about modelling objects for your selling process. There are many events, which are recorded in your source data. One of them is the customer placing an order. Another one is you issuing an invoice. Shipping could be another of your events. The events/transactions are usually modelled as links in the Raw Vault model. But let’s think a bit more about your events.
The placing of an order by a customer is an event. However, you usually consider this event as important enough to identify it by a number or some alphanumeric code. This identifier is printed on some documents, sent to your customer via e-mail or displayed in the ordering application for smartphones. The customer uses this identifier when they check the status of their order on your website or when they make some changes. Sounds familiar? Yes, this identifier is a business key. In such a case, the order should be modelled as a hub. Some other businesses could consider the order as a mere event and model it as a link. However, in your case, “placing the order” is the event, but the order itself is the hub. So, when modelling hubs and links, consider more than just whether something is an entity or a transaction. If you have widely used a business key to identify an event, it is usually a good candidate for a hub and not a link.
Thanks for reading. Expect a discussion of some issues with business keys in the next article.