Silver Creek Accelerates Product Data Integration

Existing DI and DQ tools can't easily be adapted to address product data integration. What's needed, proponents say, is a better, dedicated tool.

Product data integration (PDI), some industry watchers say, is a highly specialized project category with highly specialized requirements. Existing DI tools -- which have been designed for high-volume, high-complexity integration scenarios -- and data quality (DQ) tools -- which have focused on customer data integration (CDI) -- can't easily be adapted to address product data integration scenarios, according to Philip Russom, a senior manager with The Data Warehousing Institute (TDWI).

"Almost all DQ tools were designed for customer data -- which is admittedly the lion's share of the DQ market -- whether the vendor admits it or not. The problem with product data is that it's nowhere as predictable as customer data, so DI/DQ tools tend to choke on it," Russom points out. "The second problem is that product data is loaded with unstructured data -- [i.e.,] text describing the product and its numerous specifications and performance characteristics, usually in indecipherable acronyms and product-specific lingo."

Enter Silver Creek Systems, a DI and DQ player that claims to have tackled PDI on its own terms.

"We have focused on product data pretty much from the beginning. Our technology is capable of handling a lot more than product data, of course, but when we formulated our data management strategy as a company, we looked around and we saw that the product data segment was far and away the least well-served portion of what you would call the data integration or data quality market place," says Martin Boyd, vice-president of marketing with Silver Creek.

"There are a lot of vendors that focus on name and address cleanup, and those are the ones that call themselves data quality [vendors]. They do a great job. I don't want to undermine the job that they do. However, when you go around [customer data] into less structured data, data that comes in an infinite number of shapes and forms, those vendors do much less well."

This is in spite of the fact, Boyd argues, that Business Objects SA (an SAP Company), IBM Corp., Informatica Corp. (proprietor of the former Similarity Systems), and other players have recently started to make noise about product data quality. The salient point, Boyd claims, is that a customer-data-centric approach cannot easily be brought to bear against product data.

"[Data quality vendors have] discovered that product data is harder than they thought. The technologies that their [PDI offerings are] being built on are not well-suited for handling product data. The reason is that they have all started with a pattern-based recognition of the world, and when you're dealing with names and addresses, that works really well. When you're dealing with less structured data -- when you're dealing with product data -- it works very poorly."

By contrast, Boyd touts Silver Creek's DataLens, a product that develops a "semantic-based understanding" of product data. It uses natural language processing (NLP) to automatically categorize products and their attendant attributes, mapping incoming product data into an internal product catalog, Boyd says.

"The interface that we present to users is very simple, very streamlined; [and] under the surface. [However,] it's using a semantic-based technique. We're targeting not the patterns in the data but the meaning of the data. Because of that, the system can deal with variations in word order, punctuation, [and] spelling. Even where there's no white space, we can do character-level parsing."

DataLens also learns on the fly. "[Learning is] a construct that none of these other systems has, because they're pattern-based and they've been dealing with name-and-address type problems, so they've never had to deal with this [issue of understanding data and its context]," Boyd explains.
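
The contrast Boyd draws between matching patterns and targeting meaning can be illustrated with a toy sketch. This is not DataLens's implementation, which the article does not describe; the vocabulary, attribute names, and function names below are all hypothetical. A rigid pattern accepts one word order only, while a vocabulary lookup tolerates reordering, punctuation, listed abbreviations, and runs of text with no white space.

```python
import re

# Hypothetical vocabulary: surface form -> (attribute, canonical value).
VOCAB = {
    "bolt": ("item", "bolt"), "blt": ("item", "bolt"),
    "steel": ("material", "steel"), "stl": ("material", "steel"),
    "zinc": ("finish", "zinc"), "zn": ("finish", "zinc"),
}

# Pattern-based approach: one fixed order, one fixed spelling.
RIGID = re.compile(r"^(?P<item>bolt) (?P<material>steel) (?P<finish>zinc)$")


def parse_by_pattern(text):
    """Return the attributes only if the text fits the fixed pattern."""
    match = RIGID.match(text.lower())
    return match.groupdict() if match else None


def segment(chunk):
    """Greedy longest-match split of a run of characters into known terms."""
    tokens, i = [], 0
    while i < len(chunk):
        for j in range(len(chunk), i, -1):
            if chunk[i:j] in VOCAB:
                tokens.append(chunk[i:j])
                i = j
                break
        else:
            i += 1  # no known term starts here; skip one character
    return tokens


def parse(text):
    """Vocabulary-based approach: order and punctuation do not matter."""
    attrs = {}
    for chunk in re.split(r"[^a-z0-9]+", text.lower()):
        for token in segment(chunk):
            name, value = VOCAB[token]
            attrs[name] = value
    return attrs
```

Under these assumptions, `parse("ZINC; Stl, BLT")` and `parse("stlboltzn")` both yield the same item, material, and finish as `parse("bolt steel zinc")`, whereas `parse_by_pattern("zinc steel bolt")` returns `None`. A real system would need far more than a lookup table, but the sketch shows why a fixed pattern struggles once the order or spelling shifts.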

TDWI's Russom describes Silver Creek as part of a select group of "vendors [that are] willing to address the arcane and complicated world of product data quality." It's a grouping that's thus far been closed to the big DQ players, Russom maintains -- thanks largely to the emphasis its members place on esoteric technology, such as Silver Creek's natural language processing capability.

"DataLens' secret sauce is natural language processing, which can read the text, understand what it means, and map between data sources and targets," Russom explains, adding that Silver Creek recently added the ability to map incoming product data directly into an internal product catalog. "For exceptions that need remediation, there's a fast click-and-drag tool."

For some PDI scenarios, Russom suggests, Silver Creek's DataLens can be nothing short of a godsend. "For companies with massive product catalogs [such as eCommerce, manufacturing, retail, CPG, and the like] this automation is a godsend that enables users to process thousands of records of product data with accuracy and minimal human intervention," he concludes.

Silver Creek, not surprisingly, touts a typically rapid time-to-implementation: Boyd cites the case of one anonymous customer that went live with more than 1,000 cleansed and catalogued product categories in 10 weeks.

In many cases, he says, Silver Creek and DataLens are brought in to enhance an existing DQ or DI toolset; in others, customers bring in DataLens to clean up in the aftermath of a failed (or kludgey) DQ-driven PDI effort. Thanks to Auto Learn, implementation is typically speedy, he claims.

"You start by assembling metadata from the organization for particular product categories. There are always standards in the company for how the data is supposed to be represented and what attributes it's supposed to have. Customers always have standards like that -- they just haven't been able to put them into practice. Usually, with just a few questions, we can do that for them," Boyd explains.

A customer starts by feeding raw data into DataLens. From there, Silver Creek's NLP technology works to ferret out and recognize product attributes, prompting a user with questions, inferences, or suggestions based on an analysis of the data. "Very quickly, it builds up a very rich understanding including all of the variations of how the data might be represented," Boyd says. "That's how someone can go live with 1,000 categories in 10 weeks."
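
The feed-data, prompt-the-user, remember-the-answer loop described here can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Silver Creek's Auto Learn: the `LearningParser` class and its `ask` callback (a stand-in for a question put to a human reviewer) are hypothetical names invented for the example.

```python
import re


class LearningParser:
    """Toy parser that asks about unknown terms once, then remembers them."""

    def __init__(self, ask):
        self.vocab = {}  # term -> (attribute, value); grows with each answer
        self.ask = ask   # callback standing in for a prompt to a reviewer

    def parse(self, text):
        attrs = {}
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token not in self.vocab:
                answer = self.ask(token)  # (attribute, value) or None
                if answer is None:
                    continue              # reviewer could not classify it
                self.vocab[token] = answer
            name, value = self.vocab[token]
            attrs[name] = value
        return attrs
```

With a callback that knows "blk" means the color black and "hose" is the item, the first call to `parse("BLK hose")` triggers two questions; a later call to `parse("Hose, blk")` triggers none, because both terms are already in the vocabulary. Each answer makes later records cheaper to process, which is the general idea behind the rapid ramp-up Boyd describes.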