Amazon Unveils Public Data Sets

Service provides access to a centralized repository of publicly available scientific, demographic, and medical data

Amazon Web Services (AWS) grabbed headlines last month for extending its Elastic Compute Cloud, better known as EC2, to Europe. The suite of Web services designed to provide resizable compute capacity in the cloud is now available to overseas developers. Amazon claims about 440,000 Linux and Unix developers in the United States now use EC2.

Perhaps more important for developers was AWS' earlier (and quieter) launch of "Public Data Sets on AWS," a service that provides access to a centralized repository of publicly available scientific, demographic, and medical data that can be integrated into AWS cloud-based applications. The repository hosts nonconfidential data sets from such large sources as the U.S. Census databases, 3-D chemical structures provided by Indiana University, and an annotated form of the human genome from Ensembl, among others. More data is coming, Amazon said, including a "wide range of economic statistics from the Bureau of Economic Analysis and additional scientific data sets."

AWS is hosting these data sets at no charge, but users must have an EC2 account and pay for the compute and storage their applications consume when they access the data.

According to Adam Selipsky, vice president of product management and developer relations for AWS, select public data sets are hosted on EC2 as Amazon "Elastic Block Store" (EBS) snapshots. EC2 users can access this data by creating their own EBS volumes, he said, using the public data set snapshots as a starting point. They are then able to access, modify, and perform computations on these volumes directly via their EC2 instances.
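
As a rough illustration of the workflow Selipsky describes, the sketch below creates an EBS volume from a snapshot and attaches it to a running instance. It assumes a client object that behaves like the EC2 client in Amazon's boto3 Python SDK, a tool the article does not mention. The function name, the default device name, and every ID shown are hypothetical placeholders, not values from AWS or from this article.

```python
def volume_from_public_snapshot(ec2, snapshot_id, availability_zone,
                                instance_id, device="/dev/sdf"):
    """Create an EBS volume from a snapshot and attach it to an instance.

    Assumption: `ec2` behaves like a boto3 EC2 client, for example
    boto3.client("ec2", region_name="us-east-1").
    """
    # The volume has to live in the same Availability Zone as the
    # instance that will mount it.
    volume = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )
    volume_id = volume["VolumeId"]

    # Block until the new volume is ready to be attached.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Expose the volume to the instance as a block device.
    ec2.attach_volume(
        VolumeId=volume_id,
        InstanceId=instance_id,
        Device=device,
    )
    return volume_id


# Hypothetical usage (placeholder IDs; requires boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   volume_from_public_snapshot(ec2, "snap-EXAMPLE", "us-east-1a", "i-EXAMPLE")
```

Once attached, the volume still has to be mounted from inside the instance before an application can read the data. Because the volume is a private copy made from the snapshot, changes to it do not affect the shared data set.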

"We delivered the first Web service in July of 2002," Selipsky said. "And we've been working to lower the barriers to entry ever since. We are making it possible for our customers to succeed based on their ideas, not their resources."

It appears that at least one of the goals behind Public Data Sets on AWS is to lure the scientific community into Amazon's cloud. At least two members of that community already like the idea: Dr. Peter Tonellato from the Harvard Medical School and Dr. Glenn Proctor, Ensembl Software Coordinator at the European Bioinformatics Institute.

"Public Data Sets on AWS will enable me and many of my colleagues to collaborate with each other by sharing our commonly used data sets, research environments and tools," Tonellato said in a statement. "We can set up a controlled environment in minutes, run our computational analysis for a couple of hours, and shut down the environment. Our results are completely repeatable."

"Bioinformatics is a hugely exciting area which is providing much insight into our understanding of biology and, particularly, the genetic basis of many human diseases like cancer and diabetes," Proctor said in a statement. "The genome is a complex thing, however; it presents us with a potential source of invaluable information but also with great challenges in how to store, analyze and annotate it, and how to make both the raw genomic information and our annotations available to as many people as possible. ... Amazon EC2 allows us to go even further and make all our data available in a robust, scalable and flexible form that anyone with an AWS account can use."

About the Author

John K. Waters is the editor in chief of a number of sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and the culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS.
