Friday, December 2, 2022

Step-by-step guidelines to provision Azure Synapse Analytics for your organization

This blog post will cover end-to-end deployment guidelines including the right roles and permission required to provision Azure Synapse Analytics for your organization.

Prerequisites: You need to have an Azure Portal login.

After logging into the Azure Portal search with the keyword "Synapse", you will find Azure Synapse Analytics as it is shown in the below diagram 1:


Fig 1: Finding out Azure Synapse Analytics


After you find it click the item and hit the "Create" button which will take you to the next screen which will look like below (Fig 2).

There are five different sections (tabs) you need to fill up to provision Azure Synapse Analytics. You will find each section elaborated on details below. 

 Basics: You need to fill up the basic information about the Synapse workspace.



Fig 2: Basic steps of Synapse configuration

1. Choose your subscription: Please choose the subscription where you want to provision the Azure Synapse Analytics resource

2. Choose or create a Resource Group: If you already have a resource group then you can choose the resource group or create a new one.

3. Managed resource group: Managed resource group will be automatically created, if you want to give a name please fill up this field otherwise it will be chosen automatically.

4. Workspace name: You need to pick up a globally unique name, it will suggest if the name is not unique.

5. Region: Please pick the region where you want to provision the Azure Synapse Analytics resource. Since I am in the Canadian region so put the region "Canada Central"

6. Data Lake Storage: Please choose Data Lake from the subscription or put it manually.

7. Data Lake Storage Account: This is the Data Lake Storage you have already created or if you need to create one please do so by choosing "Create new"

8.  Data Lake Storage File System: The file System name is a container name in the Data Lake storage.


This is how it looks after filling up the Basics information (shown in Fig 3)


Fig 3: After filling up the basic information


Security:

Let's move to the Security tab as shown below in figure 4:

Fig 4: Security 


The Security part is to connect with both serverless and dedicated SQL Pools. You can choose either local user and AAD login or only ADD login. I have chosen SQL and AAD login like in the old days when you provision SQL database instances. So you have both options available whenever or if required. 

And the check box "Allow network to Data Lake Storage Gen2 account" will be automatically chosen if you put the Data Lake Storage information the under the "Basics" tab. Synapse Serverless SQL pool required communication with Data Lake Storage and in this step Synapse workspace network allows to access a Data Lake Storage account.


Networking: You need to choose the right networking options for your organization. If you are provisioning for demo purposes then you can allow the public network or allow outbound traffic. However, if data security is top of your mind I would suggest following the below setup (fig 5) for the networking.

Fig 5: Networking

1. Managed virtual network: Enable the Managed Virtual network so that communication between resources inside Synapse happens through the private network.

2. Managed Private End Point: Create a managed private endpoint for the primary storage account (we did a storage account under the Basic tab and step #6)

3. Allow outbound Traffic:  I have set this "No" for not limiting only the approved Azure AD tenants. However, the data security is tightened through the next point #4

4. Public Network Access: Public network access has been disabled, which means there is no risk of exposing the data to the public network, and communication between resources will take place via private endpoints.

Tags: It's the best practice to have Tags. The tagging helps identify resources and deployment environments and so on.

Fig 6: Tags

Review and Create: It's the final steps that show you a summary for you to review. Please verify your storage account, database details, security, and networking settings before you hit the create button (shown below fig 7)


Fig 7: Review and Create



You have done with provisioning the Azure Synapse Analytics, as an Admin, you can work with the Azure Synapse Analytics. 

However, if any additional team members want to work with the Azure Synapse Analytics tool you need to do a few more steps.

You need to add a user to the Synapse workspace, as shown in fig 8




Fig 8: Synapse workspace
After clicking "Access control"  you will find "+Add" button to add user with the right role. At first you need to choose role as shown in below figure 9.



                                                           Fig 9: Choosing the right role
If users are data engineers or developers you may want to choose "contributor" role which I have chosen as shown in fig 9. After choosing the role you need to choose members, it can be individual members or AD group members.

Fig 10: Choosing the right member




The above fig 10 shown I have chosen a member and then click "Next' button to review and assign the role to the members. You have completed the steps for adding right role and members to the Synapse workspace.

After adding the right role and member to the Synapse workspace, you also need to add the user to the Azure Synapse Portal as shown in below fig 11. At first click "Access Control" and then by clicking "+Add" button you can assign members or AD group to the right role. If you are giving access to Data Engineers or Developers they will require Contributor role. In below fig 9, I have given Contributor role to the member.



Fig 11: Synapse administrator from the Synapse Portal

Hpwever, to have access to Serverless SQL Pool and Linked Service creation the members will require more permission. To know more about Synapse roles please go through this Microsoft documentation.

In summary, by following up the above step by step guidelines you can provision Azure Synapse Analytics for your organization. And please make sure through this process work closely with your organization's cloud infrastructure team who can guide you through all networking and security questions you may have.






Saturday, February 12, 2022

How do you secure sensitive data in a modern Data Warehouse?

In 2019 Canadian Broadcasting Corporation (CBC) news reported a massive data breach at the Desjardins Group, which is a Canadian financial service cooperative and the largest federation of credit unions in North America. The report indicated, a "malicious" employee copied sensitive personal information collected by Desjardins from their data warehouse. The data breach compromised the data of nearly 9.7 million Canadians.

Now the question is, how do we secure data warehouses so that employees of the same organization can't breach the data? We need to make sure sensitive data is protected inside the organization.

When any IT solution is built in an organization, there are two types of environment that exist, one is called non-production and the other is a production environment. Production and non-production environments are physically separated. A non-Production Environment means an environment for development, testing, or quality assurance, and the solution is not consumed by end-users daily basis from a non-production environment. However, the Production environment is a place where the latest version of the software or IT solution is available and ready to be used by the intended users.

As stated at the beginning, a rogue employee was involved in the massive data breach at the Desjardins Group. Hence; an organization building a data-driven IT solution needs to work on setting up both Production and Non-Production environments secure way. This article will describe how sensitive data can be protected in both the Production and Non-Production environments.

A. Protecting Sensitive Data in Non-Production Environment in a Data Warehouse:

In general, a Non-Production environment is not well guarded with security. Different personas can have access to a Non-Production environment in a data warehouse e.g. Developers, Testers, business stakeholders. 

So it's important to protect sensitive data inside the organization. The very first thing we need to do is whenever copying data from any application to a data warehouse (non-production environment) sensitive data need to be scrambled. 

Fig 1: Moving data from IT application to non-prod data warehouse

There are a few steps that can help us to scramble the data in  Non-Production Environment in a Data Warehouse:

Step 1: Business or data steward find the list of sensitive or Personal Identifiable Information (PII) data

Step 2: Data Engineer or ETL Developer will use any standard tool like Azure Data Factory (ADF) to mask the data and store it in the data warehouse.

Step 3: Either Test Engineer or Business Stakeholder will verify all the sensitive columns in the database before it releases to the rest of the team.


B. Protecting Sensitive Data in Production Environment in a Data Warehouse:

In a Production environment, we can't scramble the data in such a way that is irreversible. We need to keep the original data intact but make sure only the intended users have access to the data. So if we can mask the columns that hold sensitive or PII data in such a way so that only privileged users get access to the data. Below figure shows what is expected from the solution:




Fig 2: Protect sensitive data from Data Warehouse (PROD)


As shown in Fig 2, when users try to access the data via an application such as Power BI, only intended users will be able to see the intact data. Non-intended users will find the data obfuscated. The above-explained process can be done by using dynamic data masking provided by Microsoft Databases. The process only masks the data on the fly at the query time. If you would like to learn about dynamic data masking, please follow the Microsoft document.

In Summary, whenever PII data is taken from the operational system to the Non-Production environment to build any analytics solution data need to be scrambled. And in a Production environment, though dynamic data masking can prevent viewing the data by unintended users, however; it's important to properly manage database permission on the Data Warehouse. As well as, make sure to have auditing enabled to track all activities taking place in the Data Warehouse in the Production environment.