Data Quality Gets Personal at Information Builders


By  
Position
Technical Content Services Director and Technical Support Specialist
Body

data quality gets personal

More than ever before, businesses today are challenged with harnessing and making sense of the growing volume and variety of data they have amassed. This is especially true in the digital era, which is dominated by mobile devices, social media platforms, Big Data, and the Internet of Things (IoT). This trend will only continue as more applications and services powered by Artificial Intelligence (AI) emerge on the scene. As a result, it is no surprise that a growing number of industry leaders are accepting the notion that data is the new oil (in terms of opportunity and as a valuable commodity).

However, as organizations journey across these growing oceans of data, the importance of data quality should never be underestimated or overlooked. After all, as we all know and surely heard before, it’s the quality that matters, not quantity.

At Information Builders, you can say that providing solutions that improve data quality for our customers and users is actually our middle name (Intelligence, Integrity, and Integration). We strongly believe that better data quality can lead to reduced costs, increased sales, improved performance, more customer engagement, and better business decisions. 

Information Builders recognizes that the majority of data governance tools on the market today are geared towards more technical users, which usually require direct experience with coding or programming. As a result, these tools may be too advanced and out of reach for the average business user or data steward who is assigned with data quality-related workloads.

The new Omni-Gen™ Personal Edition is a direct response from Information Builders to provide the data governance industry with a highly accessible, lightweight, and user-friendly tool without sacrificing the data quality and profiling capabilities usually found in larger and more complex solutions. 

Omni-Gen Welcome Screen

Installed in just a few minutes, Omni-Gen™ Personal Edition is a browser-based, self-service application that allows you to import, profile, inspect, and cleanse data from various data sources in a unified interface. The resulting trusted data can be quickly exported to provide insightful analytics and decision-making, or made easily available for other data integration initiatives and solutions.

Key advantages of Omni-Gen™ Personal Edition include the ability to:

  • Quickly import, understand, and cleanse personal data sets.
  • Create cleansing and validation rules without the need for developers.
  • Experiment with ways to improve data quality.
  • Add multiple data sources from CSV, Excel®, and database tables.
  • Export individual data sources after cleansing.
  • Export cleansing rules to Omni-Gen™ Data Quality Edition and Omni-Gen™ Master Data Management Edition.

Importing Your Data

Importing your data to begin analyzing and improving your data quality has never been easier with Omni-Gen™ Personal Edition, which allows you to quickly import:

  • Comma-separated value (CSV) files
  • Microsoft Excel spreadsheets
  • Remote database tables through JDBC

Select data source

For the purposes of this article, we will be importing an Excel spreadsheet containing customer-related data. Select Excel file and then click Continue. The Add Data Source dialog opens, as shown in the following image.

Add Data Source

After browsing through your file system and selecting an Excel spreadsheet you want to import, click Load File. Note, in this example, the spreadsheet that is being imported includes a header, so the Has Header Row check box is selected.

Once your data is processed and loaded successfully, the Omni-Gen™ Personal Edition workbench opens as a new tab in your browser with your data already populated, as shown in the following image.

Your data displays

That’s it! You are now ready to profile and cleanse your imported data.

Navigating the User Interface

With a focus on the user experience in mind, Omni-Gen™ Personal Edition was developed to provide a user-friendly workbench that is browser-based and simple to navigate and use. Omni-Gen™ Personal Edition provides two distinct views of your data that you can quickly toggle directly within the existing interface.

Two options to view your data

Clicking the Profile or Data Grid icon in the left pane changes how your imported data is viewed and presented in the Omni-Gen™ Personal Edition workbench.

The Profile view mode provides a data profile of your imported data that is currently selected.

Profile view mode

This view includes metrics that show the current state of the data (for example, subject completeness and completeness by attribute).

The Data Grid view mode provides an actual view of your imported data that is currently selected (as records on a page).

Data Grid view

Using this view, you can quickly see the relationship of one attribute in your data to the other attributes in the grid. The Data Grid view enables you to create Data Quality Rules and run experiments on the data.

Shown in the left pane, the Data Source view provides a summary of the imported data that is currently selected in Omni-Gen™ Personal Edition, as shown in the following image.

Data Source view

Useful metadata below the header (for example, CUSTOMER) quickly identifies the name of the Excel spreadsheet that has been imported, the number of rows it contains, and the date it was added and last modified. The list of individual values (for example, customer_sid, active_status, legal_name, and so on) represents the attributes in this imported data. You can select each attribute to view information (for example, a data profile) about that attribute on a more granular level.

Profiling Your Data

Profiling is the process of analyzing your data to look for quality issues, such as missing values or incorrect formats. When you open Omni-Gen™ Personal Edition, a high-level profile of the loaded data currently selected is automatically displayed in the workbench, as shown in the following image.

Data displayed in the workbench

This data profile shows a summary of data completeness. From this summary you can quickly learn that there are 1,935 customer records in the loaded data source, and none are completely blank. Each column (also referred to as an attribute) in this data source is analyzed for completeness. Analyzing this profile further, there is one attribute that is completely blank in every record (legal_name).

You can drill into each attribute to see more detailed statistics simply by clicking on its name in the left pane. In the following example, a data profile for the soc_sec_num attribute (Social Security Number) is shown.

Data profile for Social Security Number

In addition to the high-level data profile, the most and least frequent values of the selected attribute are displayed. Another useful statistic for identifying data quality issues is the frequency of patterns, which can be observed by scrolling down further on this page. Specifically, the sequence of Digits (D), Letters (L), and punctuation characters. For example, a Social Security Number (SSN) is expected to have a pattern that matches DDD-DD-DDDD, which is nine digits separated by hyphens.

Most frequent SSN patterns

As you can see, many of the records in this data source match the expected format. The second most frequent pattern includes nine digits, without the hyphens. These instances should probably be transformed into a standardized SSN format. However, other records also have data quality issues, since they contain letters or are incomplete.

Filtering Your Data

You can filter the data that is displayed in Omni-Gen™ Personal Edition based on values in one or more attributes. Simply right-click any attribute in the data grid and select Add to Filter, as shown in the following image.

filter data based on attribute values

Select an operation (for example, Equals) and then enter the filter value. In the following example, only records for active customers will be displayed (active_status = Y).

Select an operation

You can repeat this process for multiple attributes. In the following example, only records for active customers that are located in the West with a credit rating of A-1+ will be displayed.  Filters can also be combined using logical AND/OR operations.

Only records for active customers that are located in the West with a credit rating of A-1+ display

Cleansing Your Data

One of the basic goals of data cleansing is to get data into a standardized format. Sometimes data can originate from different source systems or free-form data entry. This can result in various representations of data that mean the same thing, for example, “Male” versus “M” and “Female” versus “F”.

Let’s take a closer look at the credit_rating_val attribute.

20 most frequent values

It would seem that two different source systems were used, a letter-based system (A, B, C) and a more descriptive system (Prime, Speculative). Once you determine how these values correlate, you can easily replace values as required. For example, Prime may mean the same as A-1+.

The following image shows a cleansing rule being added to the credit_rating_val attribute to replace a value.

A a cleansing rule being added to the credit_rating_val attribute to replace a value

In this example, the defined rule will replace all instances of the credit rating UpperMedium with B+ in the data source.

the defined rule will replace all instances of the credit rating UpperMedium with B+ in the data source

After executing the rule, you can see all of the instances where the replacement was made, as shown in the following image.

Instances where the replacement was made.

Validating Special Content Types

Omni-Gen™ Personal Edition also provides predefined operations that are dedicated for special content types. These go beyond simply checking for basic data types, such as a numeric or string. For example, Social Security Numbers (SSNs) have several restrictions and you can apply predefined SSN operations to your data field.

First, the predefined operation allows you to standardize the SSN format. This is useful since the source data in this example includes a variety of formats. Some records include dashes that separate the digits in the SSN (DDD-DD-DDDD) and some do not. Let’s choose to use dashes as a standard format in our SSNs, as shown in the following image.

The predefined operation allows you to standardize the SSN format.

It would also be a good idea to validate the SSN values and flag any invalid SSNs. For example, since a valid SSN must contain nine digits, it would be helpful to identify any instances that do not contain nine digits with an appropriate tag. To do this, simply click Add tag, as shown in the following image.

You can validate the SSN values and flag any invalid SSNs.

The first tagging operation (SSN has invalid length) is selected by default, which actually coincides with the validation check you require. There are several other restrictions regarding SSNs and corresponding operations for those are also provided.

After your rule is executed, you can see that the current version of the data has been standardized with the desired SSN format (DDD-DD-DDDD), as shown in the following image.

The current version of the data has been standardized.

In addition, notice that 10 records have been tagged for having an incorrect number of digits in the SSN, as shown in the following image.

10 records have been tagged for having an incorrect number of digits in the SSN.

You can drill down further into this list to examine specific records simply by clicking on the SSN_NO_9_DIGIT tag.

Prepackaged Operations

Omni-Gen™ Personal Edition allows you to rapidly cleanse and validate your data using prepackaged operations for common data types, such as Social Security Numbers, U.S. phone numbers, U.S. states, first and last names, e-mail addresses, and countries.

Rapidly cleanse and validate your data using prepackaged operations for common data types.

For case-sensitive data, you can quickly apply capitalization, lowercase, or uppercase styling as required.

Quickly apply capitalization, lowercase, or uppercase styling as required.

You can also remove or replace values in your data source as needed for data cleansing purposes.

Remove or replace values in your data source.

These are just a few examples of many prepackaged operations available in Omni-Gen™ Personal Edition, which allow you to save time as you govern your data. No coding or development is required. As you can see, you can simply select these prepackaged operations from a drop-down list and add them to your rules.

Conclusion

Omni-Gen™ Personal Edition helps your organization profile, standardize, and cleanse your data, without requiring developers, so you can immediately get the high-quality data you require. Information Builders believes that Omni-Gen™ Personal Edition will encourage more organizations to explore data quality initiatives by providing a larger number of their business users and data stewards with a tool they can begin using right out of the box.