The DQS knowledge base is a container of metadata to use in improving data quality through data cleansing and data matching. A knowledge base consists of domains, each of which represents the data in a data field. DQS knowledge management includes the processes used to create and manage the knowledge base, both in a computer-assisted manner and interactively.
Knowledge Discovery
Knowledge discovery is a computer-assisted process that analyzes samples of data to build knowledge about the data. Based on the analysis results, we can validate and enhance the knowledge, and then apply it to perform data cleansing, matching, and profiling.
To prepare knowledge for a data quality project, we can build and maintain a knowledge base (KB) that DQS can use to identify incorrect or invalid data. DQS enables us to use both computer-assisted and interactive processes to create, build, and update the knowledge base. Knowledge in a knowledge base is maintained in domains, each of which is specific to a data field. The knowledge base is a repository of knowledge about data that enables us to understand the data and maintain its integrity.
DQS knowledge bases have the following benefits:
- Building knowledge about data is a detailed process. DQS makes it much easier by extracting knowledge automatically from sample data.
- DQS enables us to see its analysis of the data, and to augment the knowledge in the knowledge base by creating rules and changing data values. We can do so repeatedly to improve the knowledge over time.
- We can leverage pre-existing data quality knowledge by basing a knowledge base on an existing KB, importing domain knowledge from files into the KB, importing knowledge from a project back into a KB, or using the DQS default KB, DQS Data.
- We can ensure the quality of data by comparing it to the data maintained by a reference data provider.
- There is a clear separation between building a knowledge base and applying it in the data correction process, which gives flexibility in how to build and update the knowledge base.
We can use the Data Quality Client application to execute and control the computer-assisted steps, and to perform the interactive steps. The knowledge discovery activity builds the knowledge base by analyzing a sample of data for data quality criteria, looking for data inconsistencies and syntax errors, and proposing changes to the data. This analysis is based on algorithms built into DQS.
We prepare for the process by linking a knowledge base to a SQL Server database table or view that contains sample data similar to the data that the knowledge base will be used to analyze. We can then map a knowledge base domain to each column of sample data to be analyzed. A domain can either be a single domain that is mapped to a single field, or it can be a composite domain that consists of multiple single domains, each of which is mapped to part of the data in a single field. When we run knowledge discovery, DQS extracts data quality information from the sample data into domains in the knowledge base.
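The mapping idea can be pictured with a small Python sketch. This is not DQS code and DQS does not expose this API: the `Domain` class, the `discover` function, the column names, and the comma-based split for the composite field are all assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Domain:
    """Hypothetical stand-in for a knowledge base domain."""
    name: str
    values: Counter = field(default_factory=Counter)


def discover(rows, mapping):
    """Collect the values seen in each mapped column into its domain(s).

    mapping: column name -> list of domains. One domain models a single
    domain mapped to the field. Several domains model a composite domain,
    where the field is split on commas (an assumption of this sketch).
    """
    for row in rows:
        for column, domains in mapping.items():
            raw = row.get(column)
            if raw is None:
                continue
            if len(domains) == 1:
                parts = [raw.strip()]
            else:
                parts = [p.strip() for p in raw.split(",")]
            for part, domain in zip(parts, domains):
                domain.values[part] += 1


# Made-up sample data standing in for a table or view of sample rows.
sample_rows = [
    {"Name": "Contoso", "Location": "Seattle, WA"},
    {"Name": "Fabrikam", "Location": "Portland, OR"},
    {"Name": "Contoso", "Location": "Seattle, WA"},
]

name = Domain("Name")
city = Domain("City")
state = Domain("State")

# "Name" maps to a single domain; "Location" maps to a composite of two.
discover(sample_rows, {"Name": [name], "Location": [city, state]})
```

After the call, each domain holds the distinct values found in its part of the sample, together with how often each one occurred. That is roughly the raw material a data steward would then review and correct.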
We can manually add value changes and we can import domain values from an Excel file. In addition, we can run the knowledge discovery process again at a later point if the data in the sample has changed. We can apply more knowledge from within the Domain Management activity and from within the Data matching activity.
The knowledge discovery process need not be performed on the same data that data correction is performed on. DQS provides the flexibility to create knowledge from one set of database fields and apply it to a second set of related data that needs to be cleansed. The data steward can create a new knowledge base from scratch, base it on an existing knowledge base, or import a knowledge base from a data file. We can also re-run knowledge discovery on an existing knowledge base. We can maintain multiple knowledge bases on a single Data Quality Server, and we can connect multiple instances of the Data Quality Client application to the same knowledge base. DQS prevents concurrency conflicts by locking the knowledge base to the user who opens it in a knowledge management session.
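The locking behavior can be sketched in Python as follows. This is a toy model and not how DQS implements locking internally: the `KnowledgeBase` class, its `open` and `close` methods, and the `KnowledgeBaseLockedError` exception are hypothetical names invented for this example.

```python
from typing import Optional


class KnowledgeBaseLockedError(Exception):
    """Raised when a second user tries to open a locked knowledge base."""


class KnowledgeBase:
    """Hypothetical model of a knowledge base with a per-user lock."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.locked_by: Optional[str] = None

    def open(self, user: str) -> None:
        """Start a knowledge management session and take the lock."""
        if self.locked_by is not None and self.locked_by != user:
            raise KnowledgeBaseLockedError(
                f"{self.name} is locked by {self.locked_by}"
            )
        self.locked_by = user

    def close(self, user: str) -> None:
        """End the session; only the lock holder releases the lock."""
        if self.locked_by == user:
            self.locked_by = None


kb = KnowledgeBase("Customers")
kb.open("alice")

try:
    kb.open("bob")  # a second user is refused while alice holds the lock
    bob_blocked = False
except KnowledgeBaseLockedError:
    bob_blocked = True

kb.close("alice")
kb.open("bob")  # succeeds once the lock has been released
```

The point of the sketch is that only one user at a time can change a given knowledge base, so two sessions cannot overwrite each other's edits.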
Case Insensitivity in DQS
Values in DQS are case-insensitive. That means that when DQS performs knowledge discovery, domain management, or matching, it does not distinguish values by case. If we add a value in value management that differs from another value only by case, they will be considered the same value, not synonyms. If two values that differ only by case are compared in the matching process, they will be considered an exact match.
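A minimal Python sketch of this behavior is shown below. It illustrates the rule rather than the DQS implementation: the helper names `normalize`, `add_value`, and `is_exact_match` are hypothetical, and using `str.casefold()` as the comparison key is an assumption of this example.

```python
from typing import Dict


def normalize(value: str) -> str:
    """Fold case so that values differing only by case compare equal."""
    return value.casefold()


def add_value(domain_values: Dict[str, str], new_value: str) -> bool:
    """Add new_value unless a case-insensitive duplicate already exists.

    Returns True if the value was added, or False if it was treated as
    the same value as an existing entry (not as a synonym).
    """
    key = normalize(new_value)
    if key in domain_values:
        return False
    domain_values[key] = new_value
    return True


def is_exact_match(a: str, b: str) -> bool:
    """Compare two values without distinguishing them by case."""
    return normalize(a) == normalize(b)


values: Dict[str, str] = {}
added_first = add_value(values, "Seattle")   # new value, stored
added_second = add_value(values, "SEATTLE")  # same value, not stored again
```

Here the domain ends up with a single entry for "Seattle", and `is_exact_match("New York", "new york")` evaluates to `True`.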