Data Profiling: The First Step in Creating a Data Warehouse

No matter what form of data conversion project your organization undertakes, the work can be time-consuming and risky. But one thing is certain: The critical first step in data migration is data profiling.

As the need to consolidate, move, improve, convert or find new uses for existing data increases, companies and organizations with large information systems are experiencing a rise in the number of projects that involve migrating or accessing data from older information systems.

AMR Research estimates that ERP conversion projects alone will generate $52 billion by 2002, and GartnerGroup/Dataquest forecasts that $2.6 billion will be spent on data warehousing software in 1999, growing to $6.9 billion by 2002.

According to recent studies conducted by The Standish Group, a research advisory firm based in Dennis, Mass., 15,000 ERP conversion projects with budgets of $3 million or greater will begin in 1999 at a total cost of $95 billion. Standish says an incredible 88 percent of these projects are doomed to experience cost and timeline overruns, or fail entirely, in large part due to a fundamental inability to understand source data.

These projects share certain characteristics: they use data collected and stored in an older system, they require that a new database structure be defined, and they require that data from the older system be moved or copied into the new system.

Projects with these characteristics include migrations to ERP applications, data warehouses and data marts; data consolidation projects associated with mergers and acquisitions; and data improvement projects. Also included are Enterprise Application Integration (EAI) projects where old data needs to be accessed via relational or object views through gateways.

All of these projects have very different objectives, but all share a common requirement for successful completion: the source data must be thoroughly understood through a process called data profiling. Although data profiling is a critical first step in data migration projects, conventional methods for it have proven to be time-consuming, labor-intensive and error-prone. Recently, the concept of using interactive data profiling software to organize the process and automate the tasks has been introduced.

The success of a data migration project rests on the completeness and accuracy of the information collected from the data sources during profiling. If the project works from anything less, subsequent data movement tasks may fail, the target may be populated with incorrect data, or the target database design itself may turn out to be incorrect or inefficient.

Data profiling is a rigorous process that leads a project designer through steps that will significantly improve the execution of a project and remove the risk of project failure due to an incomplete understanding of the data. As such, data profiling constitutes the essential first step in any successful data migration project.

What Is Data Profiling and Mapping?

In broad terms, data profiling is the process of collecting all pertinent information on the content and structure of an existing set of data sources and examining that data to provide a correct and complete model for a target database. A complete data profile contains detailed information on data values, attributes, dependencies and relationships.

The data profile forms the basis for developing accurate source-to-target transformation maps during the subsequent data mapping steps. Data mapping is the process of matching the source data profile to an existing target database design, or of using the profile to create one. Data mapping needs to be performed at the entity, attribute and, at times, data value level. This information, in turn, is used in conjunction with third-party data migration tools to extract, scrub, transform and load the data from the old system.
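To make the idea concrete, a mapping specification at these levels can be captured as plain data that downstream migration tools consume. The sketch below is hypothetical: every file, field, column and code value in it is invented for illustration, not drawn from any particular system.

```python
# Hypothetical sketch of a source-to-target mapping specification.
# Entity level: legacy file CUSTMAST maps to target table CUSTOMER.
# Attribute level: each tuple is (source field, target column, transform).
mapping = {
    ("CUSTMAST", "CUSTOMER"): [
        ("CUST-NO",   "customer_id", "copy"),
        ("CUST-NAME", "full_name",   "trim"),
        ("ST-CD",     "state_code",  "lookup"),  # needs a value-level map
    ],
}

# Value level: coded values whose representation changes in the target.
value_maps = {
    ("CUSTMAST", "ST-CD"): {"01": "TX", "02": "CA"},  # invented codes
}
```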

Although metadata, source code, old project documentation, old data dictionaries and the like can shed some light on the content of existing systems, they cannot be relied upon for accurate or complete information. Digging out the essential characteristics of older systems can be accomplished accurately only through analysis of the actual data, and much of that digging can only be completed through automated processes. This is the strength of data profiling and mapping software.

The up-front data profiling and mapping phases are the most difficult parts of the data migration process. Approximately 60 to 80 percent of the project time is spent on these phases (understanding the legacy data sources and creating an accurate data model and set of mapping specifications). A successful outcome of a data migration project (in terms of project completion time, cost and quality of results) depends on doing this part correctly.

Data profiling and mapping software serves organizations faced with analyzing existing data sources as the basis for migration, integration, re-engineering or consolidation. Given the significant investment organizations already have in legacy data, such software is likely to prove indispensable in organizations that have large amounts of non-relational data (in quantity and/or diversity) that needs to be integrated with or migrated to a relational model.

The Data Profiling Process

Before data profiling is done, the user needs to collect information on the data sources and understand the objectives of the project. Actual source data should be made available to the profiling process. Also, existing metadata for the data sources should be gathered to help identify the probable data elements that will be used in the project.

Each attribute in the data sources needs to be analyzed to determine the following: the cardinality of the data, its type and form, the length of the data, the range of values found and the frequency of each value.
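A minimal sketch of this pass in Python, treating an attribute as a list of raw values, might look like the following. The numeric check is deliberately crude; a real profiler would also infer dates, packed decimals and other legacy forms.

```python
from collections import Counter

def profile_attribute(values):
    """Compute the measures listed above for one attribute:
    cardinality, a rough type/form check, length range, value range
    and the frequency of each value."""
    freq = Counter(values)
    strings = [str(v) for v in values if v not in (None, "")]
    return {
        "cardinality": len(freq),               # number of distinct values
        "all_numeric": bool(strings) and all(
            s.strip().lstrip("-").isdigit() for s in strings),
        "min_length": min((len(s) for s in strings), default=0),
        "max_length": max((len(s) for s in strings), default=0),
        "min_value": min(strings, default=None),
        "max_value": max(strings, default=None),
        "frequencies": freq,                    # value -> occurrence count
    }
```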

Examination of this information will determine whether any data is missing (the presence of NULLs, blanks or zeros) and whether the values are unique over the data source. The data values themselves need to be examined to see if they are as expected. Information on cardinality and frequency distribution can be used to determine whether the attribute is a viable candidate for an aggregating variable in the target.
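Continuing the sketch, these checks read straight off the computed profile. The blank-and-zero convention for "missing" and the cardinality threshold for an aggregation candidate are assumptions for illustration, not fixed rules.

```python
def assess_attribute(values, profile):
    """Interpret one attribute's profile: missing data, uniqueness and
    suitability as an aggregating (grouping) variable in the target."""
    missing = sum(1 for v in values if v in (None, "", 0, "0"))
    return {
        "missing_count": missing,               # NULLs, blanks or zeros
        "is_unique": profile["cardinality"] == len(values),
        # Few distinct values relative to row count suggests a grouping
        # variable; the divisor of 10 is an arbitrary illustrative cutoff.
        "aggregation_candidate":
            1 < profile["cardinality"] <= len(values) // 10,
    }
```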

Data quality problems can easily be spotted at this level of examination. Incorrect data values, lopsided frequencies, values outside acceptable ranges, missing information, lack of uniqueness and other quality problems are noted.

The next level of analysis in developing a profile is to determine the dependencies between attributes that exist within each data source. The attributes making up a source file or segment need to be processed to find all functional dependencies. Older systems are commonly found to be de-normalized, or to have expected dependencies violated in the data. The process of finding all functional dependencies is impractical without the sophisticated and automated support of data profiling software.
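The underlying test is simple to state even though an exhaustive search is expensive. As a sketch, covering single-attribute determinants only over rows represented as dicts, a dependency A -> B holds when every value of A maps to exactly one value of B; production tools also search composite determinants and tolerate dirty data.

```python
from itertools import permutations

def functional_dependencies(rows):
    """Return all attribute pairs (a, b) such that a -> b holds, i.e.
    each value of a co-occurs with exactly one value of b."""
    fds = []
    for a, b in permutations(rows[0].keys(), 2):
        seen = {}
        if all(seen.setdefault(r[a], r[b]) == r[b] for r in rows):
            fds.append((a, b))
    return fds
```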

The collection of identified dependencies will reveal all attribute sets that can be normalized into separate target tables, all candidate primary keys and all derived attributes. Data quality problems can often mask an expected dependency. Knowing where quality problems exist that will affect structural relationships can head off migration disasters.
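For example, a candidate primary key is a minimal attribute set whose value combinations are unique over the rows. A brute-force check, again a sketch with the search width capped to keep it tractable, might look like this:

```python
from itertools import combinations

def candidate_keys(rows, max_width=2):
    """Return minimal attribute sets whose combined values uniquely
    identify every row; these are the candidate primary keys."""
    attrs = list(rows[0].keys())
    keys = []
    for width in range(1, max_width + 1):
        for combo in combinations(attrs, width):
            if any(set(k) <= set(combo) for k in keys):
                continue  # a subset is already a key, so combo is not minimal
            if len({tuple(r[a] for a in combo) for r in rows}) == len(rows):
                keys.append(combo)
    return keys
```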

The next level of data profiling is to examine attributes in one source data entity with attributes in other source data entities. The process consists primarily of finding data value overlaps between attributes in order to locate potential attributes with the same domain. Data profiling software can automate this process.
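A common measure for this is the overlap between two attributes' value sets. The sketch below compares every attribute pair across two sources, each represented as a dict of attribute name to value list; the 0.8 threshold is an arbitrary illustration, and real tools weigh sample sizes and value formats as well.

```python
def domain_overlap(values_a, values_b):
    """Jaccard overlap of two attributes' value sets; values near 1.0
    suggest the attributes draw on the same domain."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def overlap_candidates(source1, source2, threshold=0.8):
    """Report cross-source attribute pairs whose domains overlap heavily;
    these are potential integration (foreign-key) points."""
    return [(c1, c2, round(domain_overlap(v1, v2), 2))
            for c1, v1 in source1.items()
            for c2, v2 in source2.items()
            if domain_overlap(v1, v2) >= threshold]
```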

Finding same-domain attributes is essential for determining integration points between source entities. These will commonly result in foreign-key relationships in the target design. The analysis will not only identify the integration pairs but will reveal attributes that are truly redundant and can be eliminated during migration.

Addressing Quality Issues

Once profiling is complete across all of the levels described above, the issue of data quality must be addressed. The various phases of profiling will clearly separate the quality issues that will interfere with the transformation and movement of the data from those that will affect its intended use at the target system. Quality issues typically fall into one or more of the following categories: attribute value errors, missing data errors, dependency errors and connection errors.
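One way to organize the findings is to carry each detected problem forward as a record tagged with its category, so the team can see at a glance what must be scrubbed before movement and what merely degrades the data's use at the target. The record shape here is an assumption for illustration:

```python
CATEGORIES = ("attribute value error", "missing data error",
              "dependency error", "connection error")

def triage(issues):
    """Group issue records of the form (category, attribute, detail)
    into the four quality categories described above."""
    buckets = {c: [] for c in CATEGORIES}
    for category, attribute, detail in issues:
        buckets[category].append((attribute, detail))
    return buckets
```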

This information is critical in deciding whether the data can support the goals of the project. Missing data or data quality problems may make project cancellation the best choice. If the project proceeds, the profile guides the design of the needed data scrubbing mechanisms and the assessment of the time and cost of scrubbing.

The target database design should be based on the information gathered during profiling. This gives the project team confidence that the new structure is compatible with the data that will be brought to it, or the team will understand the areas of dissonance and be able to plan for them. The profile will reveal the correct table structures to ensure third normal form and referential integrity where needed. It will also indicate where redundancies can be eliminated. Where the target is already set, as in moving to a packaged ERP application, the profile will be indispensable in matching the data available against the data requirements of the target.

Each level of profiling contributes information essential to the design and implementation of the subsequent data movement processes, including data extraction, scrubbing, transformation and loading. By profiling first, one can precisely determine the requirements for each step of the process and account for them in the implementation. Building a data movement process set without a thorough profile of the source data will inevitably result in process errors when the real data flows through. Building a process derived from the real data will significantly increase the likelihood that it will work correctly the first time.

About the Author:

Jack Olson is Vice President of Technology at Evoke Software. He can be reached at (512) 372-9372, or via e-mail at jolson@evokesoft.com.
