Hi Sadie:
I just spotted this on wikipedia made me think of your project. You keep thinning about it and yoll getter done.
John.
Keys are some of the most important objects in all relational databases as they tie everything together. A
primary key is a column which is the identifier for a given entity, where a
foreign key is a column in another table which refers a primary key.
These keys can also be made up from several columns, in which case they are composite keys. In many cases the primary key is an auto generated integer which has no meaning for the business entity being represented, but solely exists for the purpose of the relational database - commonly referred to as a
surrogate key.
As there will usually be more than one datasource being loaded into the warehouse the keys are an important concern to be addressed.
Your customers might be represented in several data sources, and in one SSN (Social Security Number) might be the primary key, phone in another and a surrogate in the third. All the customers needs to be consolidated into one dimension table.
A recommended way to deal with the concern is to add a warehouse surrogate key, which will be used as foreign key from the fact table.
[1]
Usually updates will occur to a dimensions source data, which obviously must be reflected in the data warehouse.
If the primary key of the source data is required for reporting the dimension already contains that piece of information for each row If the source data uses a surrogate key, the ware house must keep track of it even though it is never used in queries or reports.
That is done by creating a lookup table which contains the warehouse surrogate key and the originating key
[2]. This way the dimension is not polluted with surrogates from various source systems, while the ability to update is preserved.
The lookup table is used in different ways depending on the nature of the source data. There are 5 types to consider
[3], where three selected ones are included here:
Type 1:
- The dimension row is simply updated to match the current state of the source system. The warehouse does not capture history. The lookup table is used to identify which dimension row to update/overwrite.
Type 2:
- A new dimension row is added with the new state of the source system. A new surrogate key is assigned. Source key is no longer unique in the lookup table.
Fully-logged:
- A new dimension row is added with the new state of the source system, while the previous dimension row is updated to reflect it is no longer active and record time of deactivation.
Work should be put in to guidance on which situations the options apply to. Is that solely a business decision?
Which factors influence the choice? The update strategy might (full wipe, incremental etc.)