In-Depth
It’s Time for Your Data to Talk: A Look at Data Integration
Successful data integration can mean the difference between a company that realizes its full potential and one that just keeps missing the mark. We present an overview of what’s involved to successfully cleanse and integrate data on disparate systems in multiple locations.
Your company has gone global with offices in the United States, France, England and Spain. The expansion has been a big success, except for one problem: you not only have valuable business data in four different locations but it’s also in four different formats. Now imagine if you had a tool that could collect all of the data from these various sources, drop it into one location, and actually make it compatible? That’s the idea behind data integration.
What Is Data Integration?
Data integration is the process of accumulating and combining data sets from disparate sources at various locations. This data can then be used for business intelligence, CRM, data mining, or other applications that involve the analysis of data in order to make key decisions. Data integration incorporates a series of processes—the sequence of applications that extract data from various sources, bring them to a data staging area, and then programmatically prepare the data for migration into the data warehouse and the actual loading of the data into the data warehouse and data marts.
There are numerous software products to choose from in the marketplace to help get this done. But because of the wide range of choices, selecting the right tools can be a difficult process. It’s important to consider all the capabilities you’ll need now and as your projects expand, including data conversion, cleansing, formatting, and aggregation. Usually, after the data is extracted, a number of transformations may be applied in preparation for data consolidation and loads into data warehouses, data marts, or dimensional data structures used for decision support systems or business intelligence systems.
The Benefits of Data Integration
Why would you want to embark on a data integration project when it sounds like a complicated and time-consuming process?
Companies integrate their data for a variety of reasons including cost savings and improved analysis capabilities. Among the key benefits a company may experience with integrated data:
- Availability of data. Once the data has passed through the data integration process, it is now in a format that can be used by departments within a company. More decision makers have access to the data in a form they can use. For example, the data can now be passed on to data marts, and managers in each department can access and analyze the data and make critical decisions.
- Enhanced data quality. After the data has gone through the data cleansing process, it will supply a company with clear, accurate information. For instance, one financial services company now uses the data to better understand customer behavior and satisfy customer needs as efficiently as possible. The company mines data from its day-to-day operations with a view to improving customer relations.
- Better manageability. The data can be seamlessly used in a wide variety of applications, providing users with greater accessibility and manageability. For example, a company can use the same data for data mining, customer relationship management programs, and business intelligence applications.
- Improved decision making. The data is now in a clearly defined format, reducing confusion as to how it can be used and the type of information it can provide. For instance, one retailer integrates data from its 2800 stores and 40 distributors throughout the United States. This allows them to determine which stores have what inventory; they can improve their negotiations with vendors because they can find out the exact number of parts they’ve ordered in the past. The data warehouse also provides up-to-date sales snapshots, allowing quick determination of how much sales a particular store has had in the past three months.
- Higher return on investment. By transforming the data for use in multiple applications, a company can achieve a higher ROI. The amount of data is reduced, allowing you to leverage your existing hardware investment. Also, by selecting the right solutions for your data integration project, you’ll be able to maximize the ROI through reduced elapsed times and a reduction in the time to administer, monitor, and manage data warehousing applications.
Beginning the Process
Data integration begins with the definition of the data requirements of your company. This includes deciding what data analysis applications you’ll need as well as examining the type of data that is available.
You’ll also have to identify where each kind of data comes from, how often it is updated, how it is currently being used, and where it can be stored within the company. Then you’ll have to decide how you are to going to use it and what must be done to clean the data and transform it.
Once this is completed, the next step is to preprocess the data prior to loading it into the data warehouse and database. Preprocessing allows you to reformat the data for seamless integration and also results in faster and more efficient database and data warehouse loads.
Preprocessing Techniques
Five major preprocessing techniques will help you improve the success rate of your data integration project.
- Selection: Many sites begin their database and data warehouse loads with a mountain of data gathered from heterogeneous systems and/or multiple processing locations. The important thing is to not only select the data you need, but to also make the selection as quickly and efficiently as possible. You could create a custom application to do this, but the time spent on writing the code would probably be too time consuming. Instead, it’s faster to add software that is designed to specifically extract just the parts of the data you specify. Selection should usually be done first, so that you are working with the least amount of data possible as you perform your preprocessing work.
- Reformatting: Chances are that records from several different systems will arrive with radically different (and possibly incompatible) formats. Reformatting allows you to rearrange the fields in these records so that all records have the same load record image. As an added bonus, reformatting lets you eliminate any fields that you don’t need to store in your database or data warehouse tables.
- Aggregation: Aggregation is used to eliminate duplicate records. At the same time, you can aggregate and “sum” a set of records by adding important numeric fields together, then only load one record in a set that contains the total. This kind of “summing” can be a critical element in optimizing query processing with aggregate tables.
- Grouping: Grouping lets you split your data into separate partitioned table ranges after it has been selected, reformatted, and summarized. Splitting data into several files during one operation is far more efficient than running multiple select applications to create the same number of files.
- Sorting; The data should be sorted into the order in which it will be stored in the tables or by which it will be indexed. This can significantly improve the performance of your data analysis application.
The Data Quality Challenge
While preprocessing is a key step, the success of a data integration project also depends on the quality of the data. Here are some of the consequences that companies face when their data is not of the highest quality:
- Due to invalid data, a manufacturer mistakenly sold off what it thought was excess stock. However, the company was actually short of stock, and the result was thousands of unfilled orders, unhappy customers, and lost revenue.
- Inaccurate data caused an insurance company to raise its risk exposure too high and suffer expensive losses on many of its policies.
- Employment data that was not updated on time caused a government agency to request 25 percent fewer staff members than it actually needed. The result was a substantial layoff of staffers and excessive overtime and poor morale among the remaining staff members.
All of these problems could have been avoided if more attention had been paid to data quality.
Variant Data Invalidates Intelligence
The quality dilemma is easy to see in a simple example of variant data. Here are five variations that might appear in the key field for a data warehouse query:
- Jon Smith
- Jonathan Smith
- J. Smith
- Jon R. Smith
- Jonathan R. Smith
Although it is unlikely that these variants will occur all at once, even two or three can severely skew results. Variants can cause overestimates if duplicate records have been created, or underestimates if a substantial number of records with variants such as misspellings or abbreviations are missed during aggregation.
Three-Step Data Quality Plan
High-quality data is defined as data that is complete, valid, consistent, timely, and accurate. Ensuring your data is of the highest quality is a critical part of any data integration project.
To improve the quality of your data, consider this three-step procedure:
- Research
- Remediate
- Enhance
Although the third step is optional, enhancement is important because it can improve query results in a variety of ways, especially for marketing and sales applications.
Research: This step can be divided into two phases: identifying inconsistent data and determining how that data is entering the system.
A variety of tools can be used to identify bad data, some of which are dedicated tools and often quite slow while others are fast and flexible. It is important to know how bad data enters the system, especially if data cleaning procedures were followed rigorously. Among the possible sources of bad data: poor training of data entry personnel and inadequate or missing cleaning programs. For example, keys that are sometimes in mixed upper and lower case and at other times all in upper case could be the problem if a cleaning program is not case sensitive but the query program is.
Remediate: Once the variant data has been identified, a procedure to remediate that data and to prevent future corruption must be created and implemented. Again, a dedicated tool can be used for this step.
Enhance: High-quality data is prepared for loading by either being captured in a standard, error-proof way or by being cleaned in preparation for loading, as has been discussed above. Data can also be enhanced for further analysis. For example, demographic and/or lifestyle information can be added to customer records before the actual load. Enhancing customer data usually entails combining multiple sources of data; this data is often held in multiple databases on disparate platforms. Using a data-manipulation tool makes coordinating all these different sources much easier.
Get Ready, Get Set, Load
At this point, you’ve preprocessed and cleansed your data. Now it’s ready to be loaded into the data warehouse prior to being used by your various applications. By completing the sorting and aggregation previously discussed, you’ll be able to achieve dramatic performance gains when loading. You can use a specialized load program to complete the process.
Conclusion
Successful data integration can mean the difference between a company that realizes its full potential and one that just keeps missing the mark. After all, data integration offers the user a higher degree of accuracy, increased availability, improved manageability, and better decision-making capabilities, because you’ve now taken all the data from a variety of different sources, cleansed and reformatted it, and transferred it into a data warehouse where multiple users can access it. The quality of the data is high, giving these users a degree of accuracy that they didn’t have previously.