Data Sovereignty in Watson Knowledge Catalog
Why Locations: As more personal data moves online, data protection rules are becoming more stringent. Governments are enacting rigorous laws to protect the personal data of their citizens. For instance, under the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), regulators can impose hefty penalties on companies that fail to protect the private data they handle. Under some regulations, Personally Identifiable Information (PII) is not allowed to leave the sovereign country. Users therefore need to know the sovereignty of their data so that data governance can be enforced when data is restricted from leaving that sovereignty.
For example, Mounika is a legal and compliance officer at a bank in Germany. Personal information in Germany is governed by Germany’s sovereign rules. Mounika must ensure that the customer PII she manages is masked before it leaves Germany and is accessed in another country. Therefore, it is important to establish data sovereignty so that data governance can be enforced.
How does Watson Knowledge Catalog enforce Sovereignty rules: Let’s dig deeper into how Watson Knowledge Catalog (WKC) enforces data sovereignty rules. As a first step, you define the business locations in WKC, much as you would define terms in a business glossary. These locations can then be assigned to the data assets imported into the catalog. WKC ships with a pre-defined set of location values that customers can use.
How to Manage Locations in WKC: Every WKC tenant gets pre-defined locations for IBM data centers along with the list of sovereign countries. The location codes are managed in glossary reference datasets. Reference datasets (RDS) in WKC provide a logical grouping of code values. Pre-defined codes can be used on assets as attribute values and can be reliably validated, which prevents free-form text. The reference dataset values provide a common language for all the services on the Cloud Pak for Data (CPD) platform. This common vocabulary ensures that data governance policies and rules can be enforced on assets across the platform.
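To illustrate why validated codes matter, here is a minimal sketch of checking a location value against a reference dataset before it is set on an asset. It does not use any WKC API: the dictionary `PHYSICAL_LOCATIONS` and the function `set_location` are hypothetical stand-ins, and only the codes FRA and DAL are taken from this article.

```python
# Hypothetical in-memory stand-in for a reference dataset; this is not the WKC API.
PHYSICAL_LOCATIONS = {"FRA": "Frankfurt", "DAL": "Dallas"}


def set_location(asset, code):
    """Return a copy of the asset with the location set, rejecting unknown codes."""
    if code not in PHYSICAL_LOCATIONS:
        # Free-form text such as "Frankfurt, Germany" is rejected here.
        raise ValueError(f"Unknown location code: {code!r}")
    return {**asset, "physical_location": code}


asset = set_location({"name": "customers.csv"}, "FRA")
print(asset)
```

Because only known codes are accepted, every service that reads the attribute can rely on the same vocabulary.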
The values from one reference dataset can be associated with the values from another reference dataset. We use this capability to establish relationships between the physical location values and the sovereign location values. The associations are used to derive the sovereignty of the data from the location of the data. For instance, the physical locations Frankfurt and London are associated with the sovereign countries Germany and Great Britain, respectively.
In the RDS, users can also create a hierarchical structure that can be used for location rule validations. A hierarchical data model provides a convenient way to group locations under one parent and define common rules on that parent. Users can add more child locations or define new parent locations in the reference datasets. Note that the predefined location codes are immutable, although customers can still add new location codes of their own to the lists or remove them.
Once you click on Governance >> Reference datasets >> Locations, you can see two pre-defined datasets in your account.
- Physical Locations
- Sovereign Locations
Physical Locations: The physical location of an asset is where the asset is physically stored. It could be in the customer’s data sources (DBs), in data lakes, or in a cloud storage location at a data center.
The location codes for IBM data centers are defined in the ‘Physical Locations’ dataset. Each physical location has an immutable physical code. The physical location code is associated with a sovereign location code from the other dataset, Sovereign Locations. For instance, the physical location Frankfurt (FRA) is associated with the sovereign location Germany (DEU). Users can add new locations to this reference dataset. It is important to establish a correct association between physical and sovereign locations, because these relationships are used to derive the sovereignty of assets whose sovereignty is not explicitly specified by the users. The association is also used during data location rule and policy enforcement.
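Since sovereignty is derived from these associations, it can help to check that every physical location points at a known sovereign location. The sketch below is illustrative only and does not call WKC: the data layout and the function `find_broken_associations` are assumptions, FRA/DEU and DAL/USA come from this article, and LON is a made-up code with no association, included to show what the check reports.

```python
# Illustrative data, not read from WKC.
SOVEREIGN_LOCATIONS = {"DEU": "Germany", "USA": "United States"}
PHYSICAL_LOCATIONS = {
    "FRA": {"name": "Frankfurt", "sovereign": "DEU"},
    "DAL": {"name": "Dallas", "sovereign": "USA"},
    "LON": {"name": "London", "sovereign": None},  # hypothetical, unassociated
}


def find_broken_associations(physical, sovereign):
    """Return physical codes whose sovereign association is missing or unknown."""
    return sorted(
        code
        for code, entry in physical.items()
        if entry["sovereign"] not in sovereign
    )


broken = find_broken_associations(PHYSICAL_LOCATIONS, SOVEREIGN_LOCATIONS)
print(broken)
```

Any code reported by such a check would have no derivable sovereignty until its association is fixed.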
Sovereign Location: Sovereign locations represent the sovereignty of the data. For instance, for a global company, the data of employees of an Asian subsidiary could be located at US locations. That means the data’s sovereign location is Asia while the physical location is the USA.
Like Physical locations, WKC also comes with pre-defined sovereign locations. Below is the predefined list of sovereign countries and codes.
Each location can have a list of sub-locations. For instance, the European Union has a list of sovereign countries such as Germany, Belgium, and Italy, each with its own location code. To define a new hierarchical structure, a user sets the ‘parent’ value on a location code while defining a new code or editing an existing one. For instance, to add Belgium with code BGM under the EU, the user chooses EU as the parent value for BGM. This sets up the hierarchical structure. Once the hierarchy is set, users can define location rules on the parent, and those rules are enforced for all the child locations under that parent. This provides an easy way to define and enforce rules like GDPR across sovereignties.
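The parent and child behavior described above can be sketched as a walk up the hierarchy that collects the rules defined on each ancestor. This is a simplified model, not how WKC implements enforcement: the `PARENT` and `RULES` dictionaries, the rule text, and the function `rules_for` are hypothetical, while the codes EU, DEU, and BGM follow the example in this article.

```python
# Hypothetical parent links between sovereign location codes.
PARENT = {"DEU": "EU", "BGM": "EU", "EU": None, "USA": None}
# Hypothetical rules attached to a parent location.
RULES = {"EU": ["mask PII before data leaves the EU"]}


def rules_for(code):
    """Collect rules defined on a location and on all of its ancestors."""
    collected = []
    seen = set()  # guards against accidental cycles in the parent links
    while code is not None and code not in seen:
        seen.add(code)
        collected.extend(RULES.get(code, []))
        code = PARENT.get(code)
    return collected


print(rules_for("BGM"))
print(rules_for("USA"))
```

A rule defined once on EU applies to BGM and DEU without being repeated on each child, which is the convenience the hierarchy is meant to provide.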
Each predefined sovereign location is associated with a physical data center location from the other reference dataset. For instance, the pre-defined location ‘Dallas’ with code ‘DAL’ is associated with the sovereign location ‘USA’. These relationships are used to derive the sovereignty of datasets from their physical locations when it is not explicitly provided by the users. So, when location rules are applied to a data asset that has no sovereignty defined, the sovereignty is determined from the physical-to-sovereign location relationship defined in the reference datasets.
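The fallback described above amounts to "use the explicit sovereignty if present, otherwise look it up from the physical location". Here is a minimal sketch under that reading; the asset dictionaries and the function `effective_sovereignty` are hypothetical and not part of WKC, and the code pairs DAL/USA and FRA/DEU are the ones mentioned in this article.

```python
# Physical-to-sovereign associations, as an illustrative dictionary.
PHYSICAL_TO_SOVEREIGN = {"DAL": "USA", "FRA": "DEU"}


def effective_sovereignty(asset):
    """Prefer the explicit sovereignty; otherwise derive it from the physical location."""
    explicit = asset.get("sovereign_location")
    if explicit:
        return explicit
    return PHYSICAL_TO_SOVEREIGN.get(asset.get("physical_location"))


derived = effective_sovereignty({"name": "orders", "physical_location": "DAL"})
explicit = effective_sovereignty(
    {"name": "hr", "physical_location": "DAL", "sovereign_location": "DEU"}
)
print(derived, explicit)
```

If neither an explicit value nor an association exists, the sketch returns `None`, leaving the sovereignty undetermined.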
Once a user defines these locations then they can use defined location codes to set the location attribute on the WKC assets.
Note: We allow you to delete a location code, but deletion of location codes does not result in deletion of location attributes already set on the assets.
How to add locations to the assets: There are three different ways to add assets to a catalog.
1) Upload local files to the catalog.
2) Add a connection to the data source.
3) Add connected assets from the data source.
A user can set and edit the location attribute in all three cases.
Locations on catalog buckets: As the first step in WKC, the user creates a catalog with a storage location. During catalog creation (on SaaS), users specify a catalog bucket from IBM Cloud Object Storage (COS). COS buckets are usually provisioned in a cloud region, and this regional information is used to determine the location of the catalog storage. For instance, if a bucket is provisioned in the IBM US data center (‘us_geo’), the physical location of the catalog is assigned as ‘USA’. The location on the catalog is immutable: users cannot change it, because it is determined by the storage bucket location.
Whenever a user uploads a file (for example, a CSV) to the catalog, the catalog bucket location is automatically populated on the uploaded asset. Users cannot change this location, but they can provide the sovereignty of the asset. By default, the sovereign location associated with the physical location in the reference datasets is automatically assigned to the asset unless users choose to overwrite the sovereignty.
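The upload behavior can be modeled roughly as follows. This is an illustration only and does not call WKC or IBM Cloud Object Storage: the function `upload_local_file` and both lookup tables are hypothetical, the ‘us_geo’ to ‘USA’ mapping is the example from this article, and the USA-to-USA sovereign association is an assumption.

```python
# Bucket region to physical location; 'us_geo' -> 'USA' is the article's example.
REGION_TO_PHYSICAL = {"us_geo": "USA"}
# Assumed association between the physical and the sovereign location.
PHYSICAL_TO_SOVEREIGN = {"USA": "USA"}


def upload_local_file(name, bucket_region, sovereign_override=None):
    """Build asset metadata for a local upload into a catalog bucket."""
    # The physical location always comes from the bucket and is not user-editable.
    physical = REGION_TO_PHYSICAL[bucket_region]
    # Sovereignty defaults to the associated value unless the user overrides it.
    sovereign = sovereign_override or PHYSICAL_TO_SOVEREIGN.get(physical)
    return {
        "name": name,
        "physical_location": physical,
        "sovereign_location": sovereign,
    }


default_asset = upload_local_file("sales.csv", "us_geo")
overridden_asset = upload_local_file("emea_staff.csv", "us_geo", sovereign_override="DEU")
print(default_asset)
print(overridden_asset)
```

The second call mirrors the case where data stored in the US belongs to another sovereignty, so the user overrides the default.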
Automatically populated locations from catalog bucket on locally uploaded files
Locations on connections: While creating a connection, users can add the location attribute using the location codes defined in the reference datasets under the ‘Location’ category (details above). Users set the physical location to the data center where the data source resides. They can also set the sovereignty of the data source if they know it. If the data in a data source belongs to multiple nations, the sovereign location for the connection should remain empty, and users should provide the sovereign location for each asset instead, as discussed next. If the location is set on the connection, users do not need to set it on every connected asset: the location attribute is auto-populated on connected assets from the source connection.
Note that the UI assists users by automatically populating the sovereign location from the physical-to-sovereign location relationship defined in the reference datasets. Users need to confirm the sovereignty to set it on the connection. If users do not confirm, the sovereignty of the connection is not persisted.
Locations on assets: Users can add connected assets from the data sources to the catalog. If the data source (connection) has a location attribute set on it, that location is automatically populated on the connected assets. Users can overwrite the sovereignty of a connected asset as well. The physical location of the asset is derived from the connection and is immutable.
If the sovereignty of the connection is empty, the UI suggests the sovereignty of the asset based on the physical-to-sovereign location relationships defined in the reference datasets. The user needs to confirm the sovereignty to have it assigned to the asset. If users do not confirm, the sovereign location is not set on the asset.
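The interplay between connection values, user overrides, and confirmed suggestions can be summarized in a short sketch. It is a simplified reading of the behavior described above, not WKC code: the function `add_connected_asset`, its parameters, and the dictionary layout are all hypothetical.

```python
# Illustrative association; FRA -> DEU is the example used in this article.
PHYSICAL_TO_SOVEREIGN = {"FRA": "DEU"}


def add_connected_asset(connection, name, sovereign_override=None,
                        confirm_suggestion=False):
    """Build asset metadata for an asset added through a connection."""
    # The physical location is inherited from the connection and cannot be changed.
    physical = connection["physical_location"]
    # An explicit override wins, then the sovereignty stored on the connection.
    sovereign = sovereign_override or connection.get("sovereign_location")
    # Otherwise the suggested value is applied only if the user confirms it.
    if sovereign is None and confirm_suggestion:
        sovereign = PHYSICAL_TO_SOVEREIGN.get(physical)
    return {
        "name": name,
        "physical_location": physical,
        "sovereign_location": sovereign,
    }


connection = {"name": "bank-db", "physical_location": "FRA"}
unconfirmed = add_connected_asset(connection, "CUSTOMERS")
confirmed = add_connected_asset(connection, "CUSTOMERS", confirm_suggestion=True)
print(unconfirmed["sovereign_location"], confirmed["sovereign_location"])
```

Without confirmation the sovereign location stays empty, which matches the note that unconfirmed suggestions are not persisted.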
Once locations are set on the catalog assets, users can edit the locations.
Once you have the locations set on the assets, you can define data location rules to restrict the movement of the data to other locations. We will discuss the data location rules in the next post.
Official documentation:
https://dataplatform.cloud.ibm.com/docs/content/wsj/catalog/assets-catalog.html?audience=wdp#prop
https://dataplatform.cloud.ibm.com/docs/content/wsj/governance/data-location-rules.html?audience=wdp
In this article, we learned how to manage location codes in WKC and how to assign the defined location codes to catalog assets. In the next article, we will learn about defining different types of data location rules and policies.
Disclaimer: This capability is released in CPD SaaS for now. It will be added to CPD in future releases. Stay tuned!