In-Depth

Sypherlink Tackles the Messiest of Data Integration Dilemmas

Sypherlink can help automate the identification and mapping of data -- and it has the patents to prove it.

Sometimes the essential simply gets lost. Consider, for example, the rush to construct new and more sophisticated business intelligence (BI), performance management (PM), and data integration (DI) edifices. Isn't there a sense in which many vendors are putting the cart before the horse?

After all, how can you hope to connect to data and integrate it into your BI or PM processes if you don't even know where -- or what -- it is?

That's the kind of fundamental problem that Sypherlink Inc. proposes to address. The company, whose products are used by the Federal Bureau of Investigation and more than 500 other agencies, touts a technology prescription -- a cocktail of heuristics-matching and probability-based matching algorithms -- that it claims helps automate the identification and mapping of data.

Sypherlink's Harvester determines where field-level relationships exist between multiple data sources; Sypherlink's Exploratory Warehouse facilitates connectivity and helps accelerate the configuration and prototyping process; and Sypherlink's Integrity ensures the quality and consistency of source data.

Sypherlink, then, does a lot of things. Its identification and mapping technology, Harvester, is nonetheless the technology on which the company was first built.

"Our objective when we started the company [back in 2001] wasn't to build a competing ETL tool or query application," Sypherlink CEO James Paat told us. "Our objective was to build the process required to enable information sharing or data integration to take place and to look for ways to accelerate that process."

The upshot, Paat says, has been a long, strange trip of sorts. "The first three and a half years we spent doing research and development. The outcome was that we developed an application that we patented.

"The patent … revolves around two key areas: the first was around the ability to leverage heuristics and artificial intelligence to map the fields between database management systems," Paat continued.

"The second part [of the patent] is around virtualization: the ability to take the multiple DBs and then once we have [generated] the mapping relationships, [to] present them to a user as a single system. So [in this way], we're able to generate a prototype" of what the finished product -- i.e., a data warehouse -- might look like.
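To make the idea of automated field mapping concrete, here is a minimal sketch of how a name-based heuristic and a data-driven score might be combined to propose mappings between two sources. It is an illustration only and not Sypherlink's patented method, which the article does not describe in detail. The function names, the weighting, the threshold, and the sample tables are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Heuristic evidence: similarity of two column names, ignoring case and separators."""
    def norm(s):
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def value_overlap(xs, ys):
    """Data evidence: Jaccard overlap of the distinct values found in two columns."""
    sx, sy = set(xs), set(ys)
    if not sx or not sy:
        return 0.0
    return len(sx & sy) / len(sx | sy)


def map_fields(source, target, name_weight=0.4, threshold=0.5):
    """Propose {source_field: (target_field, score)} for the best match at or above threshold.

    name_weight and threshold are illustrative values, not tuned ones.
    """
    mappings = {}
    for s_name, s_vals in source.items():
        scored = [
            (t_name,
             name_weight * name_similarity(s_name, t_name)
             + (1 - name_weight) * value_overlap(s_vals, t_vals))
            for t_name, t_vals in target.items()
        ]
        best = max(scored, key=lambda pair: pair[1])
        if best[1] >= threshold:
            mappings[s_name] = best
    return mappings


# Hypothetical sample sources: column name -> column values.
crm = {
    "cust_id": [101, 102, 103],
    "LastName": ["Ng", "Ruiz", "Cole"],
}
billing = {
    "customer_id": [101, 102, 104],
    "surname": ["Ng", "Ruiz", "Cole"],
    "amount": [9.5, 20.0, 3.25],
}

mappings = map_fields(crm, billing)
for field, (match, score) in mappings.items():
    print(f"{field} -> {match} (score {score:.2f})")
```

In a sketch like this, proposed mappings that score below the threshold are simply left out, which is where a human reviewer would step in to decide the exceptions.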

Turnkey appliance vendors might say otherwise, but data management groups can't just roll out a data warehouse overnight, Paat explained. Prototyping and validation are important, if oft-neglected -- or jury-rigged -- parts of the process.

"We automate the discovery and mapping, we then prototype it and validate it. We support both the physical data warehouse model and a production-ready federated model," he noted. "For a physical data warehouse approach, we take all of our metadata, all of our rules that we discover and we … drive all of our mappings directly into [ETL] tools. We're an enabler. For the federated system, the Exploratory Warehouse demonstrates non-production modes so [data management groups] can [validate] that view, too."

From there, Paat said, Sypherlink is typically ready to hand off to production systems. "Once they validate it, we have adapters that can take it and configure it for production-ready systems. Whether it's Composite or IBM, … all of those federation engines require someone to configure the sources, the relationships -- we do all of that," he indicated.

"Even for the federated views, the business has to know what views to federate, and then where those views federate in the system. All of that is work that Sypherlink does: prototyping, discovering, mapping."

In the absence of a tool such as Sypherlink, Paat argued, much of this work falls, perforce, into the laps of domain experts. The problem, he said, is that such experts are, by definition, few and far between. Moreover, he pointed out, human domain expertise is considerably more expensive than its machine-based equivalent, such as that facilitated by Sypherlink. Then there's that most intractable -- and widespread -- of problems: poor documentation.

"Those domain experts rely on how well the systems have been documented over the years. The first challenge is that there's not a high availability of domain experts. The second is that most of the systems have been poorly documented," he indicated.

"There are also human cognitive limitations. Users are relied on to determine, for example, that these fields map to this field, so the quality of the mapping is never consistent." There are also cases -- typically those that involve an enormous mix of heterogeneous data sources -- in which human expertise simply breaks down, Paat suggested.

"[Y]ou can only mentally process mapping one field to one field. When you look at efforts like national security, where there are thousands of databases trying to map thousands of fields, with our technologies, we overcome all of those limitations. Now a computer application can look across all of these sources without documentation, without understanding the systems."

This doesn't take human domain expertise out of the loop, Paat stressed -- it just lets organizations make more effective use of the domain expertise that they already have. Given the sparseness of domain expertise in any given organization, domain experts will likely be very busy even with the use of an automated tool. "We don't remove the requirement to have a domain expert. You still need a domain expert, but now their job is to review the mappings that Sypherlink found and to determine exceptions."

Sypherlink's customers include a who's who of processors of large data volumes: the Transportation Security Administration (TSA), the FBI (which is using Sypherlink to help construct its National Data Warehouse), and the state of Florida. On the corporate side, Sypherlink is found at companies such as TIAA-CREF and The Mayo Clinic.

Its technology is also OEMed by a number of vendor partners, according to Paat. "We have a very strong partnering mentality, so we partner with everything from the application vendors or the traditional database vendors -- so whether it's the Teradatas or the Oracles or the IBMs on the physical data warehouse side, or the Composites on the federated side, we partner with them."
