In-Depth
Greenplum Touts Speedy Parallel Data Warehouse Bulk Loading
Greenplum says its new Scatter/Gather Streaming capability can meaningfully accelerate loading times
In a highly competitive data warehousing (DW) segment, differentiation matters.
Differentiation, of course, is in the eye of the beholder, but Greenplum Inc. this week touted what it claims is best-in-class differentiation in the form of a new high-speed bulk-loading option for its DW software.
Bulk loading is a big issue for some customers, officials maintain, and -- for those shops that need it -- Greenplum says its new parallel loading feature can meaningfully accelerate loading times.
Ben Werther, Greenplum's director of product management, cites the example of one customer -- he declined to disclose its name -- that uses Scatter/Gather Streaming to achieve load rates of up to 4 TB per hour. Werther touts Scatter/Gather as a "drastic" improvement over the bulk-loading approach that's still used in many DW environments.
"The classical bulk-loading approach is where you have a command line program. It pushes the file … to one server on the database, or in some cases [it] can open a few [different] sessions and send it to a few sessions on that database, but you're usually limited to a single file on the source side. No one else that we're aware of can be pulling from multiple servers simultaneously and coordinate [those transfers] as one load and doing transactions at the same time," he argues.
Greenplum is one of several vendors that market massively parallel processing (MPP) DW technology. Scatter/Gather can be seen as a highly parallelized take on bulk loading.
"We take the source, whatever it is, and we're essentially breaking it up into chunks that are sent over to the database in streams. Those scattered pieces are going across all of those different streams concurrently. [They're] flowing in parallel fashion to the database, so this [data] may be coming form one source or many sources," Werther explains.
"In the gathering phase, we don't simply land it to disc and call it a day. Because a lot of customers want to support ELT or ETLT … we want to be able to do processing on the data on the way in, so we support arbitrary processing through SQL in parallel as the data flows in to those data servers. … The data is automatically partitioned and … flows across the network and lands where it needs to land."
Parallel loading isn't a new idea. Oracle Corp., for example, has long supported parallel loading in its flagship DBMS. Ditto for high-end data warehousing kingpin Teradata, which -- like Greenplum -- touts the ability to "perform continuous loading of transactional data" via its Parallel Transporter offering.
Parallel loading of the kind Greenplum touts with Scatter/Gather could prove to be a disruptive feature in the broader DW segment, particularly among vendors that market DW products based on open source DBMSes such as MySQL or PostgreSQL.
Jos van Dongen is a data warehousing consultant who works with open source software (OSS) analytic databases, a category that includes Greenplum, which runs on an MPP-ized version of PostgreSQL. He says loading can be a time-consuming process in many OSS environments. Van Dongen and other like-minded specialists are currently experimenting with different approaches to accelerating data loading in OSS data warehouses, using sample sets of between 10 and 100 GB. It's proving to be a tough nut to crack.
"Our problem with this is two-fold: [a lack of] serious hardware resources -- let alone an MPP cluster -- and [a] lack of parallel capabilities in the software," van Dongen explained in an e-mail. He noted that few, if any, OSS data warehouse platforms -- e.g., Infobright's Community Edition, LucidDB, or MonetDB -- support parallel loading. For example, he says, "[T]wo weeks ago I got the results from a TPC-H SF100 benchmark on PostgreSQL using a single quad-core machine with 8GB of RAM and [two] 500GB RAID 0 disks. [It] took them over 10 hours to load the data, and about half the queries didn't respond within 10 hours or just crashed." In most cases, van Dongen indicates, OSS DW platforms can achieve load speeds of up to 20-30 GB an hour -- a far cry from the load performance of an Oracle or a Teradata.
At the same time, OSS platforms are considerably cheaper than offerings from Oracle, Teradata, and other proprietary vendors. "What would be interesting … is [to] add another item to the equation: cost. [What] do load times look like in [dollars] per TB [per] hour?" van Dongen continues.
"[I]n my case, … Infobright loaded 100 GB in a little less than 4.5 hours on a $2,500 machine. All software is free. This results in 22.2 GB-per-hour, or … $112,612 [per TB/hour]. If Greenplum can do 4.5 TB an hour, their system would have to cost less than $506,756 to be relatively faster than mine. Is it?"
A Greenplum representative told BI This Week: "Feedback from customers is that TPC-H numbers aren't meaningful for the kinds of workloads they are running. For this reason, none of the data warehouse appliance leaders -- Greenplum, Teradata, or Netezza -- publish[es] TPC-H benchmarks. Instead of artificial benchmarks, Greenplum works with customers to show them the performance it can bring to their unique workload. Cost-wise, Greenplum offers pricing as low as $20k per TB of usable data. This is significantly less than Oracle's pricing -- particularly when you include their RAC and parallel query options, and the need to purchase either SAN or Exadata servers."
The upshot, according to van Dongen, is that speedy load times come at a cost. He cites published TPC-H benchmarks showing that Oracle can load 1 TB of data in 1 hour and 35 minutes; the cost of doing so, however, works out to about $4 million per TB/hour. Other MPP data warehouse players can do so at lower cost, van Dongen concludes -- Exasol can load 1 TB in 1 hour and 3 minutes (at roughly $1 million per TB/hour), and ParAccel can load a single terabyte in 41 minutes (at a cost of about $861,000 per TB/hour) -- but, relative to his own example, they're still extremely expensive.
Of course, some customers are willing to pay a premium -- in some cases a significant premium -- to get speedy loading capabilities. "[T]his discussion is purely academic, though insightful: if you need to load large amounts of data in a very short time, there are no cheap alternatives," he says.
Randy Lea, vice president of product and services marketing with Teradata, concurs -- to a degree.
Some of Teradata's customers load "as much as 50 TB per day in real production environments," according to Lea. For these customers, extreme loading is an important issue, but isn't the most important issue, he argues. "Depending on customer requirements, sometimes the load is [a fraction of a percentage] of your system resources; sometimes it's 10 percent; sometimes it can be 50-60 percent if you have batch windows. What is the cost of that? You never really buy a system for a load."
On the other hand, Lea points out, extreme loading separates the enterprise data warehousing (EDW) haves from the EDW have-nots. "Make no mistake, you do have to have a parallel loading environment. If you don't have a parallel loading environment, it is tough to play in this space. It's just that the number of customers who actually have those extreme [loading] requirements is very small."
Greenplum officials say that Scatter/Gather Streaming isn't just a one-trick pony. In addition to its bulk-loading benefits, its implementation is both DBA- and developer-friendly, Werther maintains.
"[Another neat thing] is the way this is exposed to the DBAs and the developers. Traditionally, you'd have a command-line program that you'd run and pipe your file into it. We have that and we support it [via] a wrapper around this type of technology with something that we call gpload," he explains. "When you query a table, you're causing this whole Scatter/Gather Streaming technology to fire up and you pull from that source into the database and it's streamed into the database and appears to you like a table in the database. You can join against that … and pull it in in parallel, no matter where that data is -- on the subnet or on the Internet, half way around the world."
The point, Werther concludes, is that "it looks to the database like it's doing a parallel query. It looks like a scan that's an internal scan. Under the covers it's actually pulling data from all of these systems in parallel. It's not just about big bulk loading."