Talend Makes Splash with Open Source Data Quality Suite
Open source DQ is just the beginning. RDBMS vendors are cooking up homegrown data quality, too. Is DQ about to become a commodity?
Open source data quality (DQ) tools are here and more are on the horizon.
For example, consider open source data integration (DI) specialist Talend. It made a splash at last week's TDWI World Conference in San Diego, announcing Talend Data Quality, a DQ complement to its flagship ETL software that the company claims is comparable to alternatives from Business Objects (an SAP company), DataFlux (a SAS Institute Inc. company), IBM Corp., and Informatica Corp.
DQ has been on Talend's roadmap for some time. Just two months ago, the company unveiled Talend Open Profiler, which it billed as the industry's first open source data profiling tool.
There's a clear sense in which DQ is an established concern in the open source software (OSS) community. In fact, Talend Data Quality isn't strictly the first OSS data quality tool. Already available are the Open Source Data Quality and Profiling project at SourceForge, along with DataCleaner and Mural, two newer OSS data quality efforts. There's also the obliquely OSS OpenDQ 2.0 from InfoSolve Technologies, though it does not include built-in service or support, and its source code is available only to customers, an arrangement that some folks claim violates the spirit -- if not the letter -- of the GPL.
Even so, Talend Data Quality is the first OSS tool to deliver both commercial support and integration as part of a larger suite.
Data quality is a surprisingly involved proposition, given both the complexity of its underlying technology (best-of-breed data quality assumes the use of speedy deduplication, matching, and other specialty algorithms) and the extreme breadth of its reach (best-of-breed data quality requires the development and maintenance of multi-lingual dictionaries, for example). Talend may be up to the challenge. It recently demonstrated its innovative open source DI platform: an ETL engine that produces extracted, cleansed, and transformed data in the form of an executable binary (http://www.tdwi.org/News/display.aspx?ID=8967).
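To give a rough sense of what "matching" and "deduplication" involve, here is a minimal sketch in Python. It is not Talend's algorithm, which the company has not described here. It assumes a simple string-similarity score from the standard library's `difflib`, and the function names and the 0.85 threshold are illustrative choices only. Commercial tools rely on far faster and more specialized techniques.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase, drop periods and commas, and collapse whitespace."""
    stripped = value.lower().replace(".", "").replace(",", "")
    return " ".join(stripped.split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def deduplicate(values: list[str], threshold: float = 0.85) -> list[str]:
    """Keep the first occurrence of each group of near-identical values.

    The threshold is an assumed cutoff, not an industry standard. This
    naive loop compares every value against every kept value, which is
    quadratic and only suitable for small inputs.
    """
    kept: list[str] = []
    for value in values:
        if not any(similarity(value, seen) >= threshold for seen in kept):
            kept.append(value)
    return kept


if __name__ == "__main__":
    names = ["Acme Corp.", "ACME  Corp", "Globex Inc."]
    print(deduplicate(names))
```

The pairwise comparison is the part that best-of-breed tools work hardest to speed up, typically by grouping candidate records before comparing them.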
"What we're offering is a complete data quality suite that … is composed of two different modules: a data profiler, which is a [more advanced] version of Talend Data Profiler, and a cleanser. This is a complete solution. With it, we are addressing most of the requirements [typically addressed by] standalone data quality tools," says Yves de Montcheuil, vice-president of marketing with Talend.
Talend Data Quality fills in quite a few DQ checkboxes. Among other features, it boasts data identification (i.e., the ability to determine whether data is reliable or unreliable on a record-by-record basis); data cleansing (i.e., the ability to clean incorrect, incomplete, or inconsistent data, either by using built-in routines or by cross-referencing it against master databases or reference data); and data enrichment (i.e., the ability to flesh out data with "nice-to-have" information, including latitude and longitude information, census data, or credit scores).
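The three capabilities can be illustrated with a short Python sketch. Everything in it is hypothetical: the field names, the tiny `STATE_CODES` and `CITY_COORDS` lookup tables, and the function names are stand-ins for the reference data and routines a real DQ tool would supply, and none of it reflects Talend's actual implementation.

```python
# Hypothetical reference data standing in for master databases and
# third-party feeds; a real deployment would load these externally.
STATE_CODES = {"california": "CA", "calif.": "CA", "ca": "CA",
               "texas": "TX", "tx": "TX"}
CITY_COORDS = {("San Diego", "CA"): (32.7157, -117.1611)}


def identify(record: dict) -> list[str]:
    """Data identification: list the problems found in one record.

    An empty list means the record is considered reliable.
    """
    problems = []
    if not record.get("name", "").strip():
        problems.append("missing name")
    if record.get("state", "").strip().lower() not in STATE_CODES:
        problems.append("unknown state")
    return problems


def cleanse(record: dict) -> dict:
    """Data cleansing: tidy whitespace and case, standardize the state."""
    cleaned = dict(record)
    cleaned["name"] = " ".join(record.get("name", "").split()).title()
    cleaned["city"] = " ".join(record.get("city", "").split()).title()
    state = record.get("state", "").strip().lower()
    # Cross-reference against the reference table; keep the original
    # value when no match is found.
    cleaned["state"] = STATE_CODES.get(state, record.get("state", ""))
    return cleaned


def enrich(record: dict) -> dict:
    """Data enrichment: add coordinates when the city is known."""
    enriched = dict(record)
    coords = CITY_COORDS.get((record.get("city"), record.get("state")))
    if coords is not None:
        enriched["latitude"], enriched["longitude"] = coords
    return enriched


if __name__ == "__main__":
    raw = {"name": "  jane   doe ", "city": "san diego", "state": "Calif."}
    if not identify(raw):
        print(enrich(cleanse(raw)))
```

In this sketch the lookup is a local dictionary. A production feed would more likely arrive through a Web service or a subscription file delivery.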
In the last scenario, organizations are actually consuming data from external -- or supplementary -- sources. It's a fast-emerging requirement, according to Montcheuil.
"[Customers] are starting to include mapping information [in their data integration feeds]. When you do that, you need to have accurate geographical coordinates if you're going to include this [mapping] information. Some [customers] want to use [i.e., incorporate] credit scores, or information from the U.S. Census Bureau," he says. "Most of those feeds are going to go through Web services, although it will vary, based on the industry you're looking at. In some cases, it will require a subscription with the delivery of files. Because we have an architecture based on a SOA stack, it's very, very easy for us to incorporate third-party data through integrated components."
Although it boasts capabilities that are "comparable" to commercial best-of-breed offerings, Talend Data Quality is a work in progress, Montcheuil concedes.
"We have a very aggressive release cycle, and [customers] need to understand that because of [the nature of] our platform, there is a lot less risk of serious problems than if we were launching new products entirely from scratch," he comments. "We'll continue to improve the platform. With ETL, we have a release every four months; [Talend] Data Quality will be on a similar cycle. One major direction we're working on now, for future release … is the SOA stack."
Industry veteran Mark Madsen is cautiously optimistic about the OSS community's first suite-centric DQ offering. He's intrigued by the Talend product, which (because it won't formally debut until next month) he hasn't yet had a chance to put through its paces. Madsen isn't alone: according to Montcheuil, Talend isn't releasing any reference or beta customer information.
Madsen also thinks that Talend Data Quality, considered alongside other open source DQ tools -- and in the context of several nascent relational database DQ projects -- could augur coming commoditization in the DQ segment.
"[Talend has] not released, so I can't say how much there is in the tools yet, but it will improve over time," he asserts. "DQ is the latest product to become a commodity in the integration space. Everyone has ETL, profiling and [data] quality now. At my vendor session at TDWI, all three [relational database] vendors [i.e., IBM Corp., Microsoft Corp., and Oracle Corp.] demoed data quality in the open slot where they had ten minutes to show something of interest."
As for the OSS data quality tools, Madsen thinks Talend -- with its services, support, community base, and built-in development environment -- could be the pick of the litter.
"Today there are only rudimentary open source offerings, but there are three others focused on this. They [the other OSS tools] just don't have a good set of features or come married to DI tools," he concludes.