Domain management enables the user to interactively change and augment the metadata that is generated by the computer-assisted knowledge discovery activity.
The domain management activity lets us do the following:
- Create a new domain. The new domain can be linked to or copied from an existing domain.
- Set domain properties that apply to each term in the domain.
- Apply domain rules that perform validation or standardization for a range of values.
- Interactively apply changes to any specific data value in the domain.
- Use the DQS Speller to check the syntax, spelling, and sentence structure of string values.
- Import a domain from a .dqs data file or domain values from a Microsoft Excel file.
- Import values that have been found by a cleansing process in a data quality project back into a knowledge base.
- Attach a domain to the reference data maintained by a reference data provider, with the result that the domain values are compared to the reference data to determine their integrity and correctness.
- Apply term-based relations for a single domain.
When the domain management activity is completed, we can publish the knowledge base for use in a data project.
Domain properties define and drive the processing that will be applied to the associated values. We can set these properties in the domain management activity. We can set the data type of the values, specify that only the leading value in a group of synonyms will be exported, configure the formatting of the output (to upper case, lower case, or initial capitalization), and define which algorithms (syntax error, speller, and string normalization) will be activated.
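To make the effect of these properties concrete, the following is a minimal Python sketch of how the leading-value and output-format settings might change an exported value. It is an illustration only, not DQS code: DQS is configured through the Data Quality Client, and the names `DomainProperties`, `export_value`, and the `synonyms` mapping are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DomainProperties:
    """Hypothetical stand-in for two of the domain properties described above."""
    use_leading_values: bool = True  # export only the leading value of a synonym group
    format_output: str = "none"      # "upper", "lower", "capitalize", or "none"


def export_value(value: str, synonyms: Dict[str, str], props: DomainProperties) -> str:
    """Return the value as it might be exported under the given properties.

    `synonyms` maps a value to the leading value of its synonym group
    (an assumed structure, chosen for this sketch).
    """
    if props.use_leading_values:
        value = synonyms.get(value, value)
    if props.format_output == "upper":
        return value.upper()
    if props.format_output == "lower":
        return value.lower()
    if props.format_output == "capitalize":
        # Upper-cases the first character and lower-cases the rest.
        return value.capitalize()
    return value


synonyms = {"NYC": "New York"}
print(export_value("NYC", synonyms, DomainProperties(True, "upper")))
```

With the leading-value setting turned off, the same call would return the original value with only the formatting applied.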
Reference Data Services
In the domain management process, we can attach online reference data to a domain.
This is how we compare the data in our domain to the data maintained by a reference data provider. We must first configure the reference data provider through the DQS configuration capabilities in the Administration section of the Data Quality Client application.
Applying Domain Rules
We can create domain rules for validation or standardization. A validation rule ensures the accuracy of data, ranging from a basic constraint, such as the set of terms that a string value may take, to a more complex regular expression, such as the valid forms of an email address. A standardization rule is applied to achieve a common data representation. It ensures that data values from multiple sources with the same meaning do not appear in different representations. A standardization rule changes the format or presentation of a value according to a generic function, so that the value conforms to metadata such as data type, length, precision, scale, and formatting patterns. A standardization rule can be based on a character, date/time, numeric, or SQL function.
For a composite domain, we can create a composite domain (CD) rule that specifies a relation between a value in one single domain and a value in another single domain, both of which are part of the same composite domain.
When a domain rule is applied and a domain value fails the rule, the value is designated invalid. For example, we can create a phone rule that validates the length of a phone number based on the country.
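The two kinds of validation mentioned above, a regular expression for email addresses and a phone-length check by country, can be sketched in Python as follows. This is a conceptual illustration rather than the way rules are defined in DQS, where they are built in the Data Quality Client. The simplified email pattern, the function names, and the `PHONE_LENGTH_BY_COUNTRY` table are assumptions made for the example.

```python
import re

# Deliberately simplified pattern (assumption): something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Illustrative digit counts per country code; not a complete or authoritative table.
PHONE_LENGTH_BY_COUNTRY = {"US": 10, "SG": 8}


def validate_email(value: str) -> str:
    """Return 'valid' if the value matches the email pattern, else 'invalid'."""
    return "valid" if EMAIL_PATTERN.match(value) else "invalid"


def validate_phone(value: str, country: str) -> str:
    """Return 'valid' if the digit count matches the length expected for the country."""
    digits = re.sub(r"\D", "", value)  # keep digits only
    expected = PHONE_LENGTH_BY_COUNTRY.get(country)
    if expected is None:
        # Unknown country: this sketch treats the value as failing the rule.
        return "invalid"
    return "valid" if len(digits) == expected else "invalid"


print(validate_email("someone@example.com"))
print(validate_phone("(425) 555-0100", "US"))
```

A value that fails either check is marked invalid, mirroring the status DQS assigns when a domain value fails a rule.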
After we have built a knowledge base, we can populate and display data values in each domain of the knowledge base. After knowledge discovery, DQS will show how many times each term appears, what the status of each term is, and any corrections that it proposes. We can manage this knowledge as follows:
- Change the status of a value, making it correct, in error, or not valid
- Add a specific value to, or delete a specific value from, the knowledge base
- Change the relation of one value to another value, including designating a replacement for a term that is in error or not valid
- Add, remove, or change knowledge associated with the domain
Values can be created explicitly by the user, or they can be added by data discovery or import. This lets us align the domain with the business and makes it easily extensible.
We can set domain values either in the domain management activity or in the Manage Domain Values step at the end of the knowledge discovery activity. The domain-value functionality is the same in both activities.
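The value-management operations listed above can be pictured with a small in-memory model. The sketch below is purely illustrative and does not reflect how DQS stores a knowledge base. The `Domain` and `DomainValue` classes, their fields, and the status names are assumptions chosen to mirror the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional

VALID_STATUSES = {"correct", "error", "invalid"}


@dataclass
class DomainValue:
    value: str
    frequency: int = 0                # how many times the term appeared
    status: str = "correct"           # "correct", "error", or "invalid"
    correct_to: Optional[str] = None  # replacement for an error or invalid value


class Domain:
    """Hypothetical container for the values of one domain."""

    def __init__(self) -> None:
        self.values: Dict[str, DomainValue] = {}

    def add(self, value: str, frequency: int = 0) -> None:
        self.values[value] = DomainValue(value, frequency)

    def delete(self, value: str) -> None:
        del self.values[value]

    def set_status(self, value: str, status: str,
                   correct_to: Optional[str] = None) -> None:
        """Change a value's status and, optionally, designate its replacement."""
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {status}")
        entry = self.values[value]
        entry.status = status
        entry.correct_to = correct_to


city = Domain()
city.add("Seattle", frequency=12)
city.add("Seatle", frequency=1)
city.set_status("Seatle", "error", correct_to="Seattle")
print(city.values["Seatle"])
```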
Setting Term Relations
In domain management, we can specify a term-based relation for a single domain, which defines a correction for a single term. This builds a list of Value/Correct To pairs, such as “LTD.” and “Limited”, or “CO.” and “Company”. A term-based relation lets us change a term throughout the domain without manually setting individual domain values as synonyms. If a term-based relation transformation causes two values to become identical, DQS creates a synonym relationship between them (in knowledge discovery).
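As a rough illustration of the idea, the Python sketch below applies a list of Value/Correct To pairs to each term in a value, and then groups values that become identical after the transformation, which is the situation in which the text says a synonym relationship is created. Exact, case-sensitive, whitespace-delimited term matching is an assumption of this sketch, and the function names are hypothetical. DQS's own matching behavior may differ.

```python
from typing import Dict, List


def apply_term_relations(value: str, relations: Dict[str, str]) -> str:
    """Replace every whole term in `value` that has a Correct To entry."""
    return " ".join(relations.get(term, term) for term in value.split())


def find_synonym_candidates(values: List[str],
                            relations: Dict[str, str]) -> Dict[str, List[str]]:
    """Group original values that collapse to the same transformed value."""
    groups: Dict[str, List[str]] = {}
    for original in values:
        transformed = apply_term_relations(original, relations)
        groups.setdefault(transformed, []).append(original)
    # Only groups with more than one member are synonym candidates.
    return {key: members for key, members in groups.items() if len(members) > 1}


relations = {"LTD.": "Limited", "CO.": "Company"}
values = ["Contoso LTD.", "Contoso Limited", "Fabrikam CO."]

print(apply_term_relations("Contoso LTD.", relations))
print(find_synonym_candidates(values, relations))
```

Here "Contoso LTD." and "Contoso Limited" end up in the same group, while "Fabrikam CO." is transformed but has no counterpart.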
A composite domain is a structure composed of two or more single domains that each contain knowledge about common data. Examples of data that can be addressed by composite domains are the first, middle, and family names in a name field, and the house number and street, city, state, postal code, and country in an address field. When we map a single field to a composite domain, DQS parses the data from the one field into the multiple domains that make up the composite. Sometimes a single domain does not fully represent the data in a field. Grouping two or more domains in a composite domain lets us represent the data more efficiently.
The following are advantages of using composite domains:
- Analyzing the different single domains that make up a composite domain can be a more effective way of assessing data quality.
- When we use a composite domain, we can also create cross-domain rules that verify that the relationship between the data in multiple domains is appropriate. For example, we can verify that the string “London” in a city domain corresponds to the string “England” in a country domain. Note that cross-domain rules are applied after domain rules.
- Data in composite domains can be attached to a reference data source, in which case the composite domain will be sent to the reference data provider. This is often done with address data.
The data can be parsed by a delimiter, by the order of the columns, or based upon reference data.
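Delimiter-based parsing and a cross-domain rule can be sketched together in Python. This is a simplified illustration, not the DQS parser: the function names, the comma delimiter, and the `City` and `Country` domain names are assumptions, and the rule simply encodes the London/England example from the list above.

```python
from typing import Dict, List


def parse_composite(value: str, domains: List[str],
                    delimiter: str = ",") -> Dict[str, str]:
    """Split one field into the single domains of a composite, in order."""
    parts = [part.strip() for part in value.split(delimiter)]
    if len(parts) != len(domains):
        raise ValueError(
            f"expected {len(domains)} parts, got {len(parts)}: {value!r}")
    return dict(zip(domains, parts))


def check_city_country(record: Dict[str, str]) -> bool:
    """Cross-domain rule: if City is 'London', Country must be 'England'."""
    if record["City"] == "London":
        return record["Country"] == "England"
    return True  # the rule does not apply to other cities


record = parse_composite("London, England", ["City", "Country"])
print(record, check_city_country(record))
```

Parsing by column order or by reference data would replace the `split` step with a different strategy, while the cross-domain check would stay the same.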
Composite domains are managed differently from single domains. We do not manage values in a composite domain; we manage them in the single domains that make up the composite domain. However, from the domain list in the Domain Management activity, we can see the relationships between the different values in a composite domain, and the statistics that apply to them.
In the Discover step of the Knowledge Discovery activity, profiling is performed on the single domains within a composite domain, not on the composite domain. However, in interactive cleansing, we cleanse data in the composite domain, not the single domains.
Matching can be performed on the single domains that comprise the composite domain, but not on the composite domain itself.