The Data Map in Azure Purview is a unified map of all your data assets and their relationships. This intelligent graph describes the data across your data estate and can capture metadata from on-premises, hybrid, and multi-cloud environments. It lets you manage your data sources and helps ensure that your organization follows its data governance procedures. The Data Map includes the following parts: collections and sources, scans, and classifications.
Collections and sources
Consider a scenario with more than 1,000 data sources. Without a proper grouping practice, it would be impractical to go through all of them. Furthermore, changing the access level of 1,000 sources that belong to the same category, one source at a time, would be an unproductive task that could take hours. In Purview, collections and sources address this problem and help users manage their data sources effectively. Think of them as a hierarchy: each collection can contain several related sources, and roles can be assigned per collection so that key users and groups can access, and take responsibility for, the sources in that collection. All sources in a collection are secured by credentials. The Data Map stores the metadata from all sources with elastic auto-scaling: it starts at the lowest capacity of one unit and grows with the load.
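The hierarchy idea can be illustrated with a small Python sketch. It is not Purview SDK code: the `Collection` class, the role names, and the rule that a role assigned on a parent collection also applies to its children are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Collection:
    """A node in a hypothetical collection tree (not a Purview SDK type)."""
    name: str
    parent: Optional["Collection"] = None
    sources: List[str] = field(default_factory=list)
    # role name -> users or groups assigned directly on this collection
    roles: Dict[str, Set[str]] = field(default_factory=dict)

    def assign(self, role: str, principal: str) -> None:
        self.roles.setdefault(role, set()).add(principal)

    def has_role(self, role: str, principal: str) -> bool:
        """Check this collection first, then walk up through its parents."""
        node = self
        while node is not None:
            if principal in node.roles.get(role, set()):
                return True
            node = node.parent
        return False


root = Collection("Contoso")
finance = Collection("Finance", parent=root, sources=["sql-finance", "adls-finance"])
sales = Collection("Sales", parent=root, sources=["sql-sales"])

# One assignment covers every source registered under the collection.
finance.assign("Data Reader", "finance-analysts")
root.assign("Collection Admin", "governance-team")

print(finance.has_role("Data Reader", "finance-analysts"))    # True
print(sales.has_role("Data Reader", "finance-analysts"))      # False
print(sales.has_role("Collection Admin", "governance-team"))  # True
```

With this structure, changing access for a whole category of sources is a single assignment on the collection rather than one change per source.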
Register and scan
In Purview, we can register data sources in a collection and scan them to obtain their metadata. Purview supports many kinds of data sources, not only Azure services but also services from other platforms. These include back-end structured and unstructured data stores such as Hive and SQL Server, as well as the front-end data visualization tool Power BI. As a result, users can trace lineage from the raw data all the way to the visualization. When something looks wrong in a visualization, key users can easily identify the raw sources of the values in the report and also find out who the Data Owners and Data Stewards are. This significantly shortens the process of identifying the cause and the person responsible.
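Tracing a report back to its raw sources amounts to walking a lineage graph upstream. The following is a minimal sketch of that walk, assuming lineage is available as a plain mapping from each asset to the assets it is derived from; the asset names and the `raw_sources` helper are hypothetical and do not reflect how Purview stores or exposes lineage.

```python
from typing import Dict, List, Set

# asset -> the assets it is derived from (hypothetical names)
upstream: Dict[str, List[str]] = {
    "powerbi/sales-report": ["sql/sales_summary"],
    "sql/sales_summary": ["hive/orders", "hive/customers"],
    "hive/orders": [],
    "hive/customers": [],
}


def raw_sources(asset: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return the assets reachable upstream of `asset` that have no upstream
    of their own (the asset itself if it has none)."""
    found: Set[str] = set()
    seen: Set[str] = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = graph.get(current, [])
        if not parents:
            found.add(current)
        stack.extend(parents)
    return found


print(sorted(raw_sources("powerbi/sales-report", upstream)))
# ['hive/customers', 'hive/orders']
```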
Purview scanning discovers technical information about the data sources, such as technical name, data type, and size. For structured data, it can also extract the schema. Business information such as classifications, glossary terms, and descriptions can be detected automatically during scanning or added manually to each asset.
Each scan needs credentials for the source it targets. This authorization method protects customers' data, because Purview does not store any password or access key directly. Instead, users set up a Key Vault connection in Purview and add secrets as credentials. There are four possible ways for the Purview account to authenticate: Managed Identity, Service Principal, SQL Authentication, and Account Key or Basic Authentication. When users add a secret as a credential, they should also choose the matching authentication method. Credential management provides the interface for managing Key Vault connections and secrets.
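The key point is that a credential holds only a reference to a secret, never the secret itself. A minimal sketch of that idea follows. The `Credential` class and its field names are hypothetical, and the rule that Managed Identity needs no secret reference is an assumption of this sketch rather than a statement about Purview's validation.

```python
from dataclasses import dataclass
from enum import Enum


class AuthMethod(Enum):
    MANAGED_IDENTITY = "Managed Identity"
    SERVICE_PRINCIPAL = "Service Principal"
    SQL_AUTHENTICATION = "SQL Authentication"
    ACCOUNT_KEY_OR_BASIC = "Account Key or Basic Authentication"


@dataclass(frozen=True)
class Credential:
    """Hypothetical credential record: stores where the secret lives, not its value."""
    name: str
    method: AuthMethod
    key_vault: str = ""    # name of the Key Vault connection
    secret_name: str = ""  # name of the secret inside that vault

    def __post_init__(self) -> None:
        # Assumption: every method except Managed Identity needs a secret reference.
        needs_secret = self.method is not AuthMethod.MANAGED_IDENTITY
        if needs_secret and not (self.key_vault and self.secret_name):
            raise ValueError(f"{self.method.value} needs a Key Vault secret reference")


sql_cred = Credential(
    name="sql-finance-cred",
    method=AuthMethod.SQL_AUTHENTICATION,
    key_vault="kv-governance",
    secret_name="sql-finance-password",
)
print(sql_cred.secret_name)  # sql-finance-password
```

Because only the vault and secret names are recorded, rotating a password in the vault does not require changing the credential definition.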
After the connection to the source has been tested, the assets in it can be scanned. Users can scan the full scope or only specific folders or tables; the available options depend on the type of source. A scan can run once or be scheduled to recur, e.g. weekly or monthly. For recurring scans, the settings of the first scan apply to every later scan as well. Each scan follows a scan rule. This rule defines what information the scan should look for, which classification rules it should apply to columns, and so on. System default scan rules already exist for many kinds of sources, such as Azure Data Lake and SQL databases. Users can also define custom scan rules with specific classification rules and pattern rules as needed, so that classifications and pattern mappings are generated automatically after the scan.
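To make the recurring-scan behaviour concrete, here is a small sketch in which the settings chosen for the first scan are reused for each later run. `ScanSettings`, `scheduled_runs`, and the example scope and rule names are hypothetical; the month-end clamping in `add_months` is a choice made for this sketch, not documented Purview behaviour.

```python
import calendar
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class ScanSettings:
    scope: Tuple[str, ...]  # folders or tables to scan; empty means full scope
    rule: str               # name of the scan rule the scan follows
    recurrence: str         # "weekly" or "monthly"


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of the month."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def scheduled_runs(settings: ScanSettings, first: date, count: int) -> List[Tuple[date, ScanSettings]]:
    """Pair each of the first `count` run dates with the unchanged settings."""
    if settings.recurrence == "weekly":
        dates = [first + timedelta(weeks=i) for i in range(count)]
    elif settings.recurrence == "monthly":
        dates = [add_months(first, i) for i in range(count)]
    else:
        raise ValueError(f"unsupported recurrence: {settings.recurrence}")
    return [(d, settings) for d in dates]


settings = ScanSettings(scope=("sales/2023",), rule="custom-adls-rule", recurrence="monthly")
for run_date, used in scheduled_runs(settings, date(2024, 1, 31), 3):
    print(run_date, used.scope, used.rule)
```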
Scan rules contain two types of rules, pattern rules and classification rules, which help categorize your data assets or folders.
Pattern rules are used for Azure Files, Azure Data Lake Storage, Azure Blob Storage, and Amazon S3. Once a storage account is chosen for the pattern rule, the rule applies to that account or container from the next scan onward. Users can use static and dynamic replacers to match the qualified names of assets, so that one rule maps a group of data assets and shows them under the display name given in the rule. When files should not be grouped, users can enable 'do not group as a resource set'.
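The idea of matching qualified names with replacers can be sketched as a translation from a pattern into a regular expression. The placeholder syntax below (`{{name}}` for a replacer with a fixed value, `{name:type}` for a typed wildcard) is a simplified notation invented for this example and should not be read as Purview's exact rule syntax; `compile_pattern` and the type table are likewise hypothetical.

```python
import re
from typing import Dict

# Regex fragments for the wildcard types supported by this sketch.
DYNAMIC_TYPES = {
    "int": r"\d+",
    "string": r"[^/]+",
}

TOKEN = re.compile(r"\{\{(\w+)\}\}|\{(\w+):(\w+)\}")


def compile_pattern(pattern: str, static: Dict[str, str]) -> re.Pattern:
    """Turn a pattern with {{static}} and {name:type} replacers into a regex."""
    parts = []
    pos = 0
    for m in TOKEN.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))
        if m.group(1):  # static replacer: substitute its fixed text
            parts.append(re.escape(static[m.group(1)]))
        else:           # dynamic replacer: typed wildcard captured by name
            parts.append(f"(?P<{m.group(2)}>{DYNAMIC_TYPES[m.group(3)]})")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


rule = compile_pattern("{{env}}/sales/{year:int}/{file:string}.csv", {"env": "prod"})

match = rule.fullmatch("prod/sales/2023/orders.csv")
print(match.groupdict())                          # {'year': '2023', 'file': 'orders'}
print(rule.fullmatch("dev/sales/2023/orders.csv"))  # None
```

Every qualified name that matches the compiled rule would belong to the same group and could be shown under the rule's display name.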
Purview classification helps key users find specific data or certain data types in the data estate. Many system default classification rules are available for use in scan rule sets. Customers can also create custom classification rules to detect data types specific to their own datasets, such as a primary key or production ID. A custom rule matches column content against either a regular expression or a dictionary; the dictionary lists all the possible values that the column can contain. There is also a minimum match threshold, which specifies the minimum percentage of column content that must match the expression or dictionary.
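The minimum match threshold is easiest to see with a small calculation. The sketch below computes the percentage of column values that match a regular expression or appear in a dictionary and compares it with the threshold. The function names, the sample column, and the use of whole-value matching are assumptions for illustration, not Purview's implementation.

```python
import re
from typing import Callable, Iterable, Set


def match_percentage(values: Iterable[str], matches: Callable[[str], bool]) -> float:
    """Percentage of values for which `matches` returns True (0.0 if empty)."""
    items = list(values)
    if not items:
        return 0.0
    hits = sum(1 for v in items if matches(v))
    return 100.0 * hits / len(items)


def classify_by_regex(values: Iterable[str], pattern: str, threshold: float) -> bool:
    regex = re.compile(pattern)
    return match_percentage(values, lambda v: regex.fullmatch(v) is not None) >= threshold


def classify_by_dictionary(values: Iterable[str], dictionary: Set[str], threshold: float) -> bool:
    return match_percentage(values, lambda v: v in dictionary) >= threshold


column = ["PRD-0001", "PRD-0002", "n/a", "PRD-0004"]  # 3 of 4 values match: 75%

print(classify_by_regex(column, r"PRD-\d{4}", threshold=60))  # True
print(classify_by_regex(column, r"PRD-\d{4}", threshold=80))  # False
print(classify_by_dictionary(column, {"PRD-0001", "PRD-0002"}, threshold=50))  # True
```

A lower threshold tolerates more dirty values in a column, while a higher one reduces false classifications.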
The Data Map offers an elastic pay-as-you-go model, so customers don't need to worry about scaling or size limits for these features. It is the foundation of Purview for data discovery and data governance. With the register and scan features available for all your data sources, you can start your data governance journey with minimal effort.