What if you could combine the power and scalability of cloud computing and storage with access to thousands of datasets hosted in a reliable, feature-rich data repository platform? Cloud Dataverse does exactly that: it brings a mature and widely used data repository platform, Dataverse, together with the OpenStack cloud platform.
This is a necessary next step. In the era of Big Data, when large and streaming datasets are becoming more common, repository and cloud platforms must converge so that data need not be moved constantly as they are processed, shared, and archived.
In the last decade, the use of data repository and cloud platforms has grown significantly, but mostly separately, with neither taking advantage of the other. According to the re3data website, there are now more than 1,800 public research data repositories used in academia, government, and business, supported by open-source repository platforms (e.g., Dataverse, DSpace, CKAN, and Fedora), proprietary software (e.g., Figshare), or one-off databases (e.g., the Protein Data Bank). Likewise, academia, government, and businesses alike have increasingly adopted open-source software for creating private and public clouds (e.g., OpenStack and OpenNebula), as well as commercial public clouds (e.g., Amazon Web Services, Microsoft Azure, and Google Cloud). Now it is time to bring these together.
To understand the value of converging data repositories and cloud computing, consider the popular AWS public dataset service. Amazon hosts a variety of public datasets, ranging from census data to an inventory of Google Books to Human Genome information. Rather than spending hours downloading these datasets, AWS users can browse the AWS repository, locate a dataset, and then analyze it in situ using Amazon Elastic MapReduce. However, this is not a satisfying solution for data repository stakeholders. While AWS's public dataset service demonstrates the value of integrating access to datasets with cloud computing, it lacks the power of today's fully functional research data repositories, which follow best practices for data sharing. In our experience, many data owners, while willing to share datasets widely, are uncomfortable with making them fully public a priori. For example, some agencies require dataset users to sign terms of use so that the agency is not liable for fully anonymizing the dataset. A mechanism by which users can locate datasets and apply for access is critical to making these datasets available. Also, the effort to curate and publish a dataset can be enormous, and it is becoming critical to give dataset authors credit for this work. Finally, access to the metadata associated with a dataset is key to finding and reusing the data.
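As a concrete illustration of metadata-driven discovery, the sketch below builds a request URL for Dataverse's Search API, which returns JSON metadata (titles, descriptions, persistent identifiers) for matching datasets. The endpoint and parameters follow the published Search API; the demo host and the query string are only illustrative.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str,
                     dtype: str = "dataset", per_page: int = 10) -> str:
    """Build a Dataverse Search API URL.

    The /api/search endpoint returns dataset metadata as JSON, which is
    what lets users locate a dataset (and then apply for access) without
    first downloading the data itself.
    """
    params = urlencode({"q": query, "type": dtype, "per_page": per_page})
    return f"{base_url}/api/search?{params}"


# Illustrative host; any Dataverse installation exposes the same endpoint.
url = build_search_url("https://demo.dataverse.org", "census")
```

Fetching `url` with any HTTP client would return a JSON payload whose `data.items` list carries the per-dataset metadata records.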