RainStor Delivers Big Data Retention on Cloudera’s Distribution including Apache Hadoop

Extreme compression and shared infrastructure lower TCO for retaining massive data sets.

Note: ESJ’s editors carefully choose vendor-issued press releases about new or upgraded products and services. We have edited and/or condensed this release to highlight key features but make no claims as to the accuracy of the vendor's statements.

RainStor, an infrastructure software company specializing in Online Data Retention (OLDR), has announced that RainStor 4.5 can be deployed on Cloudera’s Distribution including Apache Hadoop. The combination offers a pragmatic, scalable approach to Big Data: fast analytics over data retained at a lower overall total cost of ownership (TCO).

RainStor can be used to retain and access massive data sets on the Hadoop Distributed File System (HDFS) in a physical footprint at least 97 percent smaller than the raw data. The result combines Hadoop’s Big Data processing, management, and analytics with RainStor’s compliant data retention on existing, low-cost servers and storage.

RainStor on HDFS, using locally attached commodity storage, offers low initial capital investment and low ongoing total cost of ownership for retaining petabytes of data. RainStor’s specialized repository compresses the data using a patented value- and pattern-de-duplication technique and stores it in immutable form on HDFS. RainStor has built-in security, audit trails, and granular retention and expiry policies for managing the life cycle of stored data. Data within RainStor can be accessed through standard SQL, RDBMS-native SQL dialects, and standard BI tools via ODBC/JDBC.
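Because access is through standard SQL over ODBC/JDBC, existing reporting and BI code should work largely unchanged. The minimal Java sketch below illustrates the pattern; the JDBC URL, table name, and credentials are hypothetical placeholders, and the actual driver class and connection-string format would come from RainStor’s documentation.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class RainStorQueryExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection string -- the real URL format and
            // driver JAR are defined by RainStor's JDBC documentation.
            String url = "jdbc:rainstor://archive-host:5000/retained_data";

            try (Connection conn = DriverManager.getConnection(url, "analyst", "password");
                 Statement stmt = conn.createStatement();
                 // Standard SQL runs against the compressed, immutable archive;
                 // "cdr_archive" is an illustrative table name.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT caller_id, COUNT(*) AS call_count " +
                         "FROM cdr_archive " +
                         "WHERE call_date >= DATE '2011-01-01' " +
                         "GROUP BY caller_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString("caller_id")
                            + ": " + rs.getLong("call_count"));
                }
            }
        }
    }

The same query could equally be issued from any ODBC/JDBC-capable BI tool; nothing in it depends on how the data is physically stored.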

Depending on the Hadoop replication factor, the stored data can occupy a significant multiple of the raw data loaded. To counteract this, most Hadoop deployments rely on binary compression (such as LZO), which typically yields around 5-to-1 compression and carries a re-inflation performance penalty on access. In contrast, RainStor achieves compression rates of 40-to-1 or greater and allows data to be accessed without re-inflation.

Example: With 2 petabytes (PB) of raw data to be stored for a 6-month period, the disk savings could look like this:

  • Data in HDFS: 2 PB x 3 (for replication) = 6 PB + results of analysis
  • Data in HDFS with RainStor: 0.05 PB (original source data compressed 40 to 1) x 3 (for replication) = 0.15 PB + results of analysis, a physical storage savings of 5.85 PB (see the sketch below)
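The arithmetic in the example is easy to verify; this short Java sketch reproduces it using the release’s own inputs (2 PB raw, 3x HDFS replication, 40-to-1 RainStor compression), which are vendor figures rather than guarantees.

    public class StorageFootprint {
        public static void main(String[] args) {
            double rawPB = 2.0;          // raw data to retain, in petabytes
            int replication = 3;         // HDFS replication factor
            double compression = 40.0;   // RainStor's claimed compression ratio

            // Plain HDFS stores every replica of the uncompressed data.
            double hdfsOnlyPB = rawPB * replication;                      // 6.0 PB

            // With RainStor, the data is compressed once, then replicated.
            double withRainStorPB = (rawPB / compression) * replication;  // 0.15 PB

            System.out.printf("HDFS alone:      %.2f PB%n", hdfsOnlyPB);
            System.out.printf("HDFS + RainStor: %.2f PB%n", withRainStorPB);
            System.out.printf("Savings:         %.2f PB%n",
                    hdfsOnlyPB - withRainStorPB);
        }
    }

Note that both scenarios add the (unquantified) results of analysis on top of these figures, so the 5.85 PB delta applies to the retained source data only.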

Even using low-cost commodity disk, as data volumes reach multiple petabytes and beyond, the initial capital expenditure can be significant. More importantly, the ongoing operating cost of a large number of storage drives is a significant expense that can reach millions of dollars over multiple years. RainStor’s compression, life-cycle management, and compliant retention features, combined with HDFS’ low-cost commodity disk and scale-out benefits, provide significant value and cost savings for Big Data analysis and retention.

RainStor 4.5 for Hadoop is available immediately. For more information, visit www.rainstor.com.
