In 2019, Canadian Broadcasting Corporation (CBC) News reported a massive data breach at the Desjardins Group, a Canadian financial services cooperative and the largest federation of credit unions in North America. According to the report, a "malicious" employee copied sensitive personal information, collected by Desjardins, out of its data warehouse. The breach compromised the data of nearly 9.7 million Canadians.
The question now is: how do we secure a data warehouse so that employees of the same organization cannot breach it? Sensitive data has to be protected inside the organization, not only from outsiders.
When an organization builds an IT solution, two types of environment exist: non-production and production. The two are physically separated. A non-production environment is used for development, testing, or quality assurance, and end users do not consume the solution from it on a daily basis. The production environment is where the latest released version of the software or IT solution is available and ready to be used by its intended users.
As stated at the beginning, a rogue employee was involved in the massive data breach at the Desjardins Group. Hence, an organization building a data-driven IT solution needs to set up both its production and non-production environments in a secure way. This article describes how sensitive data can be protected in each of them.
A. Protecting Sensitive Data in Non-Production Environment in a Data Warehouse:
In general, a non-production environment is not tightly secured. Many different personas can have access to the non-production environment of a data warehouse, e.g. developers, testers, and business stakeholders.
So it is important to protect sensitive data inside the organization. The very first rule is that whenever data is copied from an application into the data warehouse (non-production environment), the sensitive data needs to be scrambled.
A few steps help us scramble the data in the non-production environment of a data warehouse:
Step 1: A business owner or data steward identifies the list of sensitive or Personally Identifiable Information (PII) data.
Step 2: A data engineer or ETL developer uses a standard tool such as Azure Data Factory (ADF) to mask the data and store it in the data warehouse.
Step 3: A test engineer or business stakeholder verifies all the sensitive columns in the database before it is released to the rest of the team.
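The three steps above can be sketched in a few lines of plain Python. This is only an illustration and not how ADF does it: the column names in PII_COLUMNS, the helper names scramble_value, scramble_row and find_unmasked, and the choice of a keyed SHA-256 hash as the scrambling method are all assumptions made for the example.

```python
import hashlib
import hmac

# Step 1 (assumed output): columns the data steward flagged as PII.
# These column names are hypothetical.
PII_COLUMNS = {"name", "email", "sin"}


def scramble_value(value: str, secret_key: bytes) -> str:
    """Replace a value with a keyed hash so the original cannot be read back.

    The same input always maps to the same token, so joins across
    tables still line up after scrambling.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def scramble_row(row: dict, secret_key: bytes, pii_columns=PII_COLUMNS) -> dict:
    """Step 2: return a copy of the row with every PII column scrambled."""
    return {
        col: scramble_value(str(val), secret_key)
        if col in pii_columns and val is not None
        else val
        for col, val in row.items()
    }


def find_unmasked(original_rows, scrambled_rows, pii_columns=PII_COLUMNS):
    """Step 3: list (row_index, column) pairs where PII passed through unchanged."""
    leaks = []
    for i, (orig, scr) in enumerate(zip(original_rows, scrambled_rows)):
        for col in pii_columns:
            if col in orig and orig[col] is not None and orig[col] == scr.get(col):
                leaks.append((i, col))
    return leaks
```

In this sketch the secret key would live outside the non-production environment, and the load would only be released to the team when find_unmasked returns an empty list.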
B. Protecting Sensitive Data in Production Environment in a Data Warehouse:
In a production environment, we cannot scramble the data irreversibly. We need to keep the original data intact while making sure only the intended users have access to it. The approach is therefore to mask the columns that hold sensitive or PII data so that only privileged users can see the real values. The figure below shows what is expected from the solution:
As shown in Fig 2, when users access the data through an application such as Power BI, only the intended users see the data intact; everyone else sees it obfuscated. This can be done with dynamic data masking, a feature provided by Microsoft databases. The stored data is not changed: masking is applied on the fly, at query time. If you would like to learn more about dynamic data masking, please refer to the Microsoft documentation.
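To make the idea of masking at query time concrete, here is a small Python simulation. It is not Microsoft's implementation and does not call any database: the role names, the MASKS table, and the query helper are hypothetical, and the mask shapes are only loosely modelled on the kind of output dynamic data masking produces.

```python
def mask_email(value: str) -> str:
    """Keep the first character and hide the rest of the address."""
    return value[:1] + "XXX@XXXX.com"


def mask_sin(value: str) -> str:
    """Hide everything except the last three characters."""
    return "XXX-XXX-" + value[-3:]


# Hypothetical masking rules: column name -> masking function.
MASKS = {"email": mask_email, "sin": mask_sin}

# Hypothetical roles that are allowed to see unmasked data.
PRIVILEGED_ROLES = {"data_steward"}


def query(rows, role):
    """Return rows as the given role would see them.

    The stored rows are never modified; masking is applied to the
    copies returned to the caller, i.e. at query time.
    """
    if role in PRIVILEGED_ROLES:
        return [dict(row) for row in rows]
    return [
        {
            col: MASKS[col](val) if col in MASKS and val is not None else val
            for col, val in row.items()
        }
        for row in rows
    ]


# The warehouse keeps the original values.
customers = [{"id": 1, "email": "alice@example.com", "sin": "123-456-789"}]

steward_view = query(customers, "data_steward")  # sees the intact values
analyst_view = query(customers, "analyst")       # sees masked values
```

Here analyst_view holds "aXXX@XXXX.com" and "XXX-XXX-789" for the two sensitive columns, while customers and steward_view still hold the original values.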
In summary, whenever PII data is taken from an operational system into a non-production environment to build an analytics solution, the data needs to be scrambled. In a production environment, dynamic data masking can prevent unintended users from viewing the data, but it is still important to manage database permissions on the data warehouse properly. In addition, make sure auditing is enabled so that all activity in the production data warehouse is tracked.