Sunday, September 15, 2019

Referential Integrity in Data warehouse Design?

I was involved in a project and the project was asking about implementing referential integrity for the data warehouse design. I was surprised, since I never did such design for Data Warehouse(DW).

If someone ask you to put the referential integrity in your data warehouse project, how do you explain them that it's not required for DW project?

If you design the control /config tables to maintain logs or control the flow of your ETL run then you may follow the referential integrity for only those tables, but when you start working with data load for the tables, don't go with referential integrity which will kill the performance.

So No 1 : Poor performance while loading

If your DW tables (dimensions and facts) have referential integrity which will consume extra time to insert or update the records. So standard for data warehouse design is, you don't maintain referential integrity.

Let's explain why you don't need it?

When you are doing dimensional modeling it's not require that you need to have referential integrity because ETL does it for you. The ETL process is the source of all key values that will be inserted into the Star or Snowflake model. Any ETL process should have surrogate keys in the dimension tables and while loading facts natural keys will be transformed with those surrogate keys from dimension. Any data that would violate referential integrity will be caught by these process.

In summary, If your ETL is built correct then you don't need to have referential integrity in the Data Warehouse design.