- Purview Data Governance helps customers discover and oversee data from a range of data sources.
- The feature set of Purview Data Governance has changed significantly since the service first became generally available under a different name.
- Microsoft’s documentation for Purview Data Governance is poor, and the multiple licensing models are complex and cumbersome.
Purview Data Governance is intended to help discover, curate, manage, and control access to data, whether it resides in Azure, Amazon, Google, or on-premises. With Purview Data Governance, customers must first discover and map data, then define how it will be managed, and finally perform governance tasks. As it stands, the service’s limited range of supported data sources and its poor documentation will restrict the number of customers for whom it proves adequate.
Note: while Purview Data Governance (formerly Azure Purview) shares branding with Microsoft 365 Purview offerings, the latter are generally focused on compliance related to Microsoft 365, while Purview Data Governance focuses predominantly on data stored externally to Microsoft 365.
Data Governance Basics
Data governance as defined within Purview consists of two key phases:
- Data discovery and definition using Data Map
- Data governance and oversight using Unified Catalog.
Data Map
Data Map is used for the initial phases of governance, when organizations define how data should be overseen and how it should be structured for management and reuse.
Domain Definition
A key step in using Data Map is defining up to four “domains,” which typically represent functional areas, business units, or some other organizational grouping. They are not analogous to AD (Active Directory) domains, domain names, or any other existing structure.
Domains are containers for assets (individual tables or files) and collections (which are organizationally defined groupings of assets).
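The containment relationship described above can be sketched as a small data model. This is illustrative only: the class names, fields, and the MAX_DOMAINS constant are hypothetical and do not correspond to any Purview API, and the limit of four simply reflects the figure cited above.

```python
from dataclasses import dataclass, field

MAX_DOMAINS = 4  # limit cited in the text above; not read from any Purview API


@dataclass
class Asset:
    """An individual table or file."""
    name: str


@dataclass
class Collection:
    """An organizationally defined grouping of assets."""
    name: str
    assets: list[Asset] = field(default_factory=list)


@dataclass
class Domain:
    """A container for collections and stand-alone assets."""
    name: str
    collections: list[Collection] = field(default_factory=list)
    assets: list[Asset] = field(default_factory=list)

    def all_assets(self) -> list[Asset]:
        """Return every asset in the domain, held directly or via a collection."""
        grouped = [a for c in self.collections for a in c.assets]
        return self.assets + grouped


@dataclass
class DataMap:
    """Hypothetical top-level holder that enforces the domain limit."""
    domains: list[Domain] = field(default_factory=list)

    def add_domain(self, domain: Domain) -> None:
        if len(self.domains) >= MAX_DOMAINS:
            raise ValueError(f"Data Map allows at most {MAX_DOMAINS} domains")
        self.domains.append(domain)
```

Modeling the limit in one place makes the planning consequence visible: with so few domains available, most of the organizational structure has to be expressed through collections.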
Data Source Registration
The next step with Data Map is defining the sources of data to be included in data governance. The range of data sources supported by Microsoft is broad but may not include every data source the organization requires.
Data sources supported by Data Map include the following:
- Most data sources within Azure
- Key on-premises data sources and data sources within AWS (Amazon Web Services) or Google Cloud Platform
- Amazon S3 or HDFS (Hadoop Distributed File System)
- A range of key data sources including Salesforce, Dataverse, and SAP.
Scanning, Lineage, and Ingestion
In addition to specifying the data source, the organization will need to choose the Integration Runtime (IR) that will be used to connect to and scan the data source. Where the data source resides largely determines which IR is required, how the data source is scanned, and how its schema and selected metadata are ingested into Data Map.
A key concern is granting the IR the right level of access: enough to connect to and query the data source, but not enough to expose the underlying data or protected information. Typically, the IR is configured and managed by an admin with high-level privileges.
The available IRs include the following:
- An Azure IR for connecting to data sources in Azure
- A self-hosted IR for connecting to data sources on-premises or across a virtual network (VNet)
- A Kubernetes-supported self-hosted IR, hosted on a Kubernetes cluster, for connecting to data sources on-premises or across a VNet
- An AWS IR for connecting to Amazon data sources, including S3 and RDS.
Scanning a connected data source involves Purview retrieving and storing the structure of the data (but not the data itself). The scan’s depth level determines how much metadata is gathered. The possible depth levels include the following:
- Autodetect (the default level, which applies to all resources outside of Azure)
- Level-1 (L1) scan: Extracts basic information and metadata including file names, size, and fully qualified name
- Level-2 (L2) scan: Extracts schema for structured file types and database tables
- Level-3 (L3) scan: Extracts schema (where applicable) and subjects sampled data to the system and custom classification rules.
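The numbered scan levels listed above can be summarized in a short sketch. The enum and function names are hypothetical, and the code assumes that each level also gathers everything the lower levels do, which the list implies but does not state outright. Autodetect is left out because it selects a level rather than defining one.

```python
from enum import IntEnum


class ScanLevel(IntEnum):
    L1 = 1  # basic information and metadata
    L2 = 2  # schema for structured files and database tables
    L3 = 3  # schema plus classification of sampled data


def metadata_gathered(level: ScanLevel) -> list[str]:
    """List what a scan at the given level collects.

    Assumption: levels are cumulative, so L3 includes L2 and L1 output.
    """
    gathered = ["file name", "size", "fully qualified name"]
    if level >= ScanLevel.L2:
        gathered.append("schema")
    if level >= ScanLevel.L3:
        gathered.append("classifications from sampled data")
    return gathered
```

Only the L3 level reads sampled data values, which is worth bearing in mind when deciding how much access to grant the IR.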
Sources that use Azure Data Factory or Azure Synapse Analytics as a part of their processes can also provide data lineage information, which is shown in Data Map. Lineage information includes the inputs to and outputs from Azure Data Factory or Azure Synapse Analytics, along with the data activity.
As noted, select scans can also classify scanned assets, automatically or manually, depending on the organization’s existing Purview classifications. Purview provides more than 200 built-in classifications, including driver’s license number, passport number, credit card number, and ABA routing number. Custom classifications can be defined using a regular expression.
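A regular-expression classification of the kind described above can be sketched as follows. This is a minimal illustration, not Purview’s implementation: the employee ID pattern, the function name, and the 60% match threshold are all hypothetical choices made for the example.

```python
import re

# Hypothetical custom pattern: an internal employee ID such as "EMP-004211".
EMPLOYEE_ID = re.compile(r"EMP-\d{6}")


def classify_column(samples: list[str], pattern: re.Pattern,
                    min_match_ratio: float = 0.6) -> bool:
    """Return True when enough sampled values fully match the pattern.

    min_match_ratio is an illustrative threshold chosen for this sketch.
    """
    values = [s for s in samples if s]  # ignore empty samples
    if not values:
        return False
    hits = sum(1 for v in values if pattern.fullmatch(v))
    return hits / len(values) >= min_match_ratio


sampled = ["EMP-004211", "EMP-120044", "n/a", "EMP-000001", ""]
is_employee_id = classify_column(sampled, EMPLOYEE_ID)
```

Requiring a ratio of matches rather than a single hit keeps one stray value from classifying an entire column, at the cost of missing columns that are only sparsely populated.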
Data Map currently offers two features in preview: data map history and sensitivity labeling.
Unified Catalog
The later phases of Purview Data Governance use Unified Catalog to define data products, manage access to them, and manage their health.
Data Curation
The data curation process involves defining data products and adding the assets they offer.
As a part of data curation, the organization specifies what type of data is available and adds data assets. These assets could include a dataset, an analytics model, master or reference data, or ML training data.
Organizations can also optionally link glossary terms (concepts that define the business) and OKRs (objectives and key results that define measurable/trackable business objectives).
After the data product has been defined, users can search, browse, or filter across that data product, and the organization can define access management processes for that data product.
Access management includes both managing existing access to the data product and handling incoming access requests.
Health Management
Ongoing health management capabilities offered by Purview Data Governance are limited but include management of the following:
- Data access and use
- Data discoverability
- Health observability
- Metadata management
- Data trust (ownership, etc.)
- Value creation (OKRs, etc.).
Master Data Management Requires Partners
Unlike SQL Server, Purview Data Governance does not include a master data management solution, so organizations must depend on partner solutions. Partner software supported by Purview Data Governance includes the following:
- CluedIn
- Profisee
- Reltio
- Semarchy.
Licensing
Purview Data Governance licensing has two components, both of them complicated:
- Data Catalog (Standard) – Governed assets within Unified Catalog
- Data Health Management – Management of health and quality of data.
There is no charge for assets that exist in Data Map but have not been linked to any governance concepts. Each Data Catalog (Standard) governed asset that exists within Unified Catalog, including tables, views, AI models, semantic models, etc., accrues a charge of US$0.0165 per day (approximately US$0.50 per month).
Data Health Management charges accrue based on the compute consumed each day by the jobs that manage the health and quality of data. These charges are measured in DGPUs (data governance processing units). A DGPU is the equivalent of 60 minutes of fully managed compute time taken to produce data management results. DGPUs are offered at Basic, Standard, and Advanced performance levels, priced at US$15, US$60, and US$240 per DGPU, respectively.
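The two charges can be combined into a rough estimate. The sketch below uses only the rates quoted above. The function names are hypothetical, and billing partial DGPUs in proportion to minutes used is an assumption made for the example, so actual invoices may differ.

```python
from decimal import Decimal

ASSET_RATE_PER_DAY = Decimal("0.0165")  # US$ per governed asset per day
DGPU_RATES = {                          # US$ per DGPU (60 minutes of compute)
    "Basic": Decimal("15"),
    "Standard": Decimal("60"),
    "Advanced": Decimal("240"),
}


def catalog_cost(governed_assets: int, days: int = 30) -> Decimal:
    """Data Catalog (Standard) charge for governed assets over a period."""
    return ASSET_RATE_PER_DAY * governed_assets * days


def health_cost(compute_minutes: int, tier: str) -> Decimal:
    """Data Health Management charge, assuming linear proration by the minute."""
    return DGPU_RATES[tier] * compute_minutes / Decimal(60)


# Example: 10,000 governed assets for 30 days plus 90 minutes of Standard compute.
monthly = catalog_cost(10_000) + health_cost(90, "Standard")
print(monthly)
```

Decimal is used instead of float so that small per-asset rates multiply out without rounding noise. The per-asset charge scales with catalog size, while the DGPU charge scales with how often and how heavily health jobs run, so the two should be tracked separately.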
Directions Recommends
Evaluate first. Organizations considering Purview Data Governance versus third-party offerings should evaluate them against each other to ensure Purview can connect to the range of data sources and offer the features that they require.
Plan carefully. Customers should plan their Purview Data Governance design (domains, data products, access plans, etc.) prior to deployment, and should expect that the organization may require consulting help to assemble it.
Monitor costs. Customers should track DGPU costs over time and ensure they are using the appropriate DGPU tier. Purview Data Governance appears in the Azure calculator, but future costs are difficult to estimate without several months’ worth of existing charges as a baseline.
Watch carefully. Customers should be patient when using Purview Data Governance, as the service has changed significantly since Azure Purview first became available and is likely to keep changing to put Microsoft’s AI initiatives first.
Resources
Purview Data Governance is described at “Data governance with Microsoft Purview” (Microsoft).