SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long-Term Data Preservation in Sustainability Science
1. SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long-Term Data Preservation in Sustainability Science
Beth Plale, Indiana University, Bloomington, Indiana, USA
Robert H. McDonald, Indiana University, Bloomington, Indiana, USA
Kavitha Chandrasekar, Indiana University, Bloomington, Indiana, USA
Inna Kouper, Indiana University, Bloomington, Indiana, USA
Stacy Konkiel, Indiana University, Bloomington, Indiana, USA
Margaret L. Hedstrom, University of Michigan, Ann Arbor, Michigan, USA
Jim Myers, Rensselaer Polytechnic Institute, Troy, New York, USA
Praveen Kumar, University of Illinois, Urbana, Illinois, USA
Cooperative agreement #OCI0940824
IDCC 2013 – Amsterdam – Jan. 16, 2013
2. SEAD TEAMS
• Michigan: Margaret Hedstrom (PI), Marietta Van Buhler, Karen Woollams, George Alter (ICPSR), Bryan Beecher (ICPSR)
• Indiana: Beth Plale (Co-PI), Katy Börner, Robert H. McDonald, Robert Light, Kavitha Chandrasekar, Stacy Kowalczyk, Inna Kouper, Stacy Konkiel, Robert Ping, Ryan Cobine
• Rensselaer: James Myers (Co-PI), Ram Prasanna Govind Krishnan, Lindsay Todd
• Illinois: Praveen Kumar (Co-PI), Terry McLaren (NCSA), Rob Kooper (NCSA), Luigi Marini (NCSA)
3. Challenge: The Data Deluge
1. Scientific data ingestion must be quick and minimally intrusive on a scientist’s time.
2. Ingesting must be flexible enough to handle the varied kinds of data.
sizes // formats // composition
3. Tools for advertising and serving data from an institutional repository need to be consistent with tools and processes of the scientific community.
4. Challenge: Long Tail Scientific Research
• Many research niches
– customized methods & toolsets
– localized storage
• Less consideration for long-term availability and data reuse
5. Requirements of Virtual Archive for Sustainability Science
• Must connect multiple IRs
• Must be minimally intrusive on a scientist’s time
• Must handle varied data:
– multi-GB collection,
– vastly heterogeneous collection of files,
– small complex database of a thousand variables, or
– set of files in formats that are unique to the subdiscipline
• Must be consistent with tools and processes of the community
6. SEAD
[Diagram: data moves through SEAD in a cycle of ingest, publish, associate, and discover. The SEAD Virtual Archive (SVA) manages a sustainability science window to multiple IRs and follows the OAIS model. IRs shown: IUScholarWorks, UIUC IDEALS, UMich Deep Blue.]
7. SEAD Virtual Archive (SVA)
• Design
• Policy Decisions
• Progress to Date
[Diagram: the SEAD Virtual Archive (SVA) manages a sustainability science window to multiple IRs and follows the OAIS model. Callouts: single view into data; easy deposit.]
8. SEAD Virtual Archive Workflow
[Workflow diagram. Step labels recoverable from the slide: Accept Repository Agreement; Upload Data to VA; Run Virus Checking; File Characterization; Preview Data; Mint DOI; Deposit to IR (& cloud); Update DOI target; Index Metadata. Additional boxes: IR Matchmaker Decision; Version Scientific Data; Large Scientific Dataset; Index Scientific Metadata. Some boxes are marked as ongoing work.]
9. Architecture: SEAD VA Matchmaker
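[Architecture diagram. Components: IR Matchmaker Client, VIVO, VA Load Monitor Agent, IR Repository Agent, IR Matchmaker Service. Labeled interactions: the client queries VIVO for data contributor metadata, and VIVO returns the data contributor’s affiliation information; the client sends Query Match and Get Match requests to the matchmaker service; the service queries for IRs’ details and receives all IRs’ details in return; the service queries the VA load and receives VA load constraints in return.]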
10. Policy: Licensing Agreements
Repository rights
• Right to store and re-format files (preservation)
• Allow editing to protect human subjects, sensitive data (protection)
• Make metadata public (discoverability)
• Ensure sponsor compliance (liability)
11. Policy: Licensing Agreements
Depositor rights
• Retain copyright/moral rights
• Deposits will not be changed from original intent
• Embargoes will be honored
12. Policy: Licensing Agreements
Single-license solution
• Satisfy all repository requirements
• Mitigate rights on behalf of depositor
Matchmaking solution
• Connect requirements of:
– End users
– Repositories
– SEAD Virtual Archive
14. Policy: Author IDs
VIVO ID
• Used primarily at domain/institutional level
• Supports many researcher ID systems, including ORCID
ORCID
• Global system
• Buy-in from and integration with major publishers and institutions
Other researcher ID systems shown: ResearcherID, Scopus Author ID, Pivot ID
16. Progress to Date
• Ingested all NCED data
– Small-sized collection (overall < 150 MB)
– File organization for heterogeneous collection of related files with flat or hierarchical structure
• Tested deposit between the VA, UIUC IDEALS, and IUScholarWorks
17. Future Work
• Address other use cases
– Large size collections (overall > 1 GB)
– Relational database / interconnected variables
– Unique formats (to project, discipline, community)
• Interoperability with other DataNets
• Support for API access
• Determine how prototype fits researcher workflows
Good morning, my name is Stacy Konkiel and I’m an E-Science Librarian at Indiana University in the United States. SEAD (Sustainable Environment, Actionable Data) is an NSF DataNet project that aims to create a federation of institutional repositories for long-term data preservation in sustainability science.
My colleague Robert McDonald, who is also here today, and I are part of a larger team of scientists and research data specialists, led by PIs Margaret Hedstrom, Beth Plale, Jim Myers, and Praveen Kumar, that is working to build the SEAD Virtual Archive. Before I go into too much detail about the project, I’m going to set up our research problem, which should help explain why this federation is both unique and essential.
It will come as no surprise to many of you that universities are grappling with how to respond to the deluge of scientific data emerging from their faculty’s research. Many are looking to their libraries and institutional repositories as a solution. Research libraries have considerable expertise in preserving the scholarly record. In cooperation with the university’s IT organization, a research library can play a significant role in preserving scientific data and in connecting it to the published record of scholarship. However, the general consensus is that institutional repositories are difficult to use: depositing data often takes a long time and requires manual intervention. If university IRs are to compete with commercial and specialized scientific repositories, they need to provide services that are easy, fast, reliable, and community friendly.
Geyser photo source: http://www.flickr.com/photos/worldtrek/5782815999/sizes/m/
Usability photo source: http://www.localwin.com/julie/system/files/lu10/Usability_Testing.jpg
Another challenge is how to deal with long tail scientific research. Long tail science, such as sustainability science, often has many research niches, which rely on customized methods and toolsets and on localized storage. The data are often designed to satisfy the immediate needs of researchers, with less consideration for long-term availability and data reuse.
Photo credit: http://www.pdx.edu/sustainability/sites/www.pdx.edu.sustainability/files/styles/pdx_collage_medium/public/kj8-07PSU%20DJ00425x7.jpg
How can institutions, using their IRs, address these problems? One approach is to create a virtual archive over IR interactions. Any virtual archive for sustainability science would require the following features:
• It must mediate over multiple institutional repositories, because no single IR will hold enough sustainability science data to be meaningful.
• Policies and tools for ingest of scientific data must be minimally intrusive on a scientist’s time.
• Processes for ingest of data must handle varied data: a multi-GB collection, a vastly heterogeneous collection of files, a small complex database of a thousand variables, or a set of files in formats that are unique to the subdiscipline.
• Tools for serving data from an institutional repository must be consistent with tools and processes of the sustainability science community.
The SEAD Virtual Archive (VA) supports not only the preservation of sustainability science data in the institutional repository today, but also aims to facilitate rich access and use into the future. In our specific prototype, we worked with the National Center for Earth-Surface Dynamics (NCED), one of the original NSF Science and Technology Centers, to see how data would work in such an archive.
The community needs described in the previous slide have heavily shaped the creation of the SEAD project as a whole. There are three major components to SEAD, which I’ll describe by way of explaining how data move within the framework. A user “ingests” his or her data into the ACR, SEAD’s Active Content Repository, where metadata are harvested and additional annotation by the user and community can take place. Data sets that are considered “fixed” are then “published” to the SVA, a long-term preservation layer that manages a sustainability science window to multiple IRs and prepares data sets for deposit. The data sets published to the SVA are registered with SEAD VIVO, where associations are made between researchers, publications, and data sets. This relationship information is fed back to the ACR as discovery information. Users of the ACR (not just the ingester) can then use the discovery information to work with the data and improve it further.
The Virtual Archive offers users a single view into data, with easy deposit. The software extends the open source code developed by the Data Conservancy. Currently, the SEAD VA provides mechanisms to automatically deposit data into the Indiana University repository, IUScholarWorks, and the University of Illinois repository, IDEALS, with expansion to other repositories planned.
I’m going to first discuss the SEAD Virtual Archive’s design, which is influenced heavily by policies related to the IRs that support the federation and by the specific needs of our user group (discussed previously). I’ll then talk a bit about our policy decisions, such as deposit and permanent IDs, and finish with a brief update on our progress to date.
Now, let’s drill down into what the Virtual Archive workflow looks like. When a data set arrives in the Virtual Archive, it is presumed to be “ready to publish.” At this point, data and metadata are considered ready to be versioned/fixed. We also assume that the metadata identify the terms of access and use (i.e., the datasets were marked and selected for preservation, and repository licensing agreements have been accepted). Once data are accepted into the Virtual Archive:
• An automated check of dataset integrity is run.
• A DOI is minted.
• A matchmaking service automates the decision about which repository should receive the data (more on that in a moment).
• Data are deposited to the IR.
• Queries are used to assemble metadata that have been either entered manually or harvested automatically by earlier services such as the ACR. The metadata object also contains provenance about the file(s).
SEAD VA then transforms the RDF package into the SIP (Submission Information Package), an OAIS-compliant, XML-encoded store for metadata.
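To make the order of operations concrete, the workflow just described can be sketched as a linear pipeline. This is only an illustration: the `Submission` class, the step functions, and the placeholder values are hypothetical names invented for this sketch and are not part of the SEAD VA code base. Each step is a stub that records what a real service would do.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Submission:
    """Hypothetical stand-in for a data set arriving 'ready to publish'."""
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


def check_integrity(sub: Submission) -> Submission:
    # A real service would verify the files; this stub only records the step.
    sub.log.append("integrity-checked")
    return sub


def mint_doi(sub: Submission) -> Submission:
    # Placeholder value only; a real DOI comes from a registration service.
    sub.metadata["doi"] = "PLACEHOLDER-DOI-" + sub.name
    sub.log.append("doi-minted")
    return sub


def choose_repository(sub: Submission) -> Submission:
    # The matchmaker would decide here; the target name is made up.
    sub.metadata["target_ir"] = "example-ir"
    sub.log.append("repository-matched")
    return sub


def deposit(sub: Submission) -> Submission:
    sub.log.append("deposited")
    return sub


def assemble_metadata(sub: Submission) -> Submission:
    sub.log.append("metadata-assembled")
    return sub


def build_sip(sub: Submission) -> Submission:
    sub.log.append("sip-built")
    return sub


# Steps in the order given in the talk.
PIPELINE: List[Callable[[Submission], Submission]] = [
    check_integrity,
    mint_doi,
    choose_repository,
    deposit,
    assemble_metadata,
    build_sip,
]


def publish(sub: Submission) -> Submission:
    """Run every step in order and return the updated submission."""
    for step in PIPELINE:
        sub = step(sub)
    return sub
```

Keeping the steps in a plain list makes it easy to see, and to change, where services such as matchmaking sit relative to deposit.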
Next I’d like to discuss an interesting step in the data publishing process that helps to provide a single, seamless IR deposit interface for researchers: the matchmaking process.
Matchmaking is a technical solution for automated deposit that reconciles the needs of institutional repositories with the needs of end users and the SEAD VA. For instance, a repository may only be able to take small files of less than 150 MB, or may only want data from its affiliated researchers. These policy statements are encoded in the matchmaker, a query process that evaluates the rules when deciding where to deposit a package. The matchmaker is, for the most part, hidden from the researcher: all a researcher encounters is a single deposit interface, which might include some questions whose answers the matchmaker uses to create rules.
As an easy example, the question “With which university are you affiliated?” is a means of matching authors from Indiana, Illinois, or Michigan with the appropriate repository. For authors unaffiliated with any of those universities, a second question could be posed: “With which research centers are you affiliated?” If they answer something like NCED, the data could be deposited without a problem to an NCED-affiliated repository. Of course, in practice, affiliation information would be gleaned from VIVO. As you can see here, the matchmaker client queries VIVO and the repositories themselves to gather enough information to create appropriate rules.
Because the SEAD VA interacts with several repositories that all have different scopes, missions, service orientations, and depositor requirements, we need to make sure that such reconciliation satisfies all partners. Once the matchmaking service has decided where to deposit, many processes, policies, and tools need to be in place to make the ingest of scientific data into an institutional repository easy and smooth.
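The two kinds of rules mentioned for the matchmaker, a size limit and an affiliation requirement, can be illustrated with a minimal sketch. It assumes a simple in-memory rule model; the `Repository` and `Package` classes, the `accepts` and `match` functions, and the repository names are hypothetical and do not reflect the actual matchmaker service, which gathers its inputs from VIVO and from the repositories.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

MB = 1024 * 1024


@dataclass(frozen=True)
class Repository:
    name: str
    # Packages must be smaller than this many bytes; None means no limit.
    size_limit: Optional[int] = None
    # Accepted affiliations; an empty set means any depositor is accepted.
    affiliations: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Package:
    size_bytes: int
    depositor_affiliations: FrozenSet[str]


def accepts(repo: Repository, pkg: Package) -> bool:
    """Return True if the package satisfies every rule the repository states."""
    if repo.size_limit is not None and pkg.size_bytes >= repo.size_limit:
        return False
    if repo.affiliations and not (repo.affiliations & pkg.depositor_affiliations):
        return False
    return True


def match(repos: List[Repository], pkg: Package) -> List[str]:
    """Return the names of all repositories willing to take the package."""
    return [repo.name for repo in repos if accepts(repo, pkg)]


# Made-up repositories for illustration only.
repos = [
    Repository("ir-a", size_limit=150 * MB, affiliations=frozenset({"Indiana"})),
    Repository("ir-b", affiliations=frozenset({"NCED"})),
    Repository("ir-c"),
]
small = Package(10 * MB, frozenset({"Indiana"}))
print(match(repos, small))  # prints ['ir-a', 'ir-c']
```

Returning every acceptable repository, rather than a single choice, leaves room for a later tie-breaking step such as load balancing, which the sketch does not attempt.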
The SEAD Virtual Archive has implemented a SWORD API client for automated data deposit into the repositories. Our plan is to extend the SWORD API deposit to non-DSpace repositories (for example, repositories based on Fedora software). It’s also worth noting that we are exploring mapping current IR author data to VIVO using BibApp.
Many of the rules for the matchmaking service are built on logic derived from the various deposit licensing agreements. When depositing to a single repository, you will often be required to accept a university-sanctioned deposit licensing agreement: a sort of Terms of Service for the repository that clearly explains the rights of the repository and the rights the depositor retains. These deposit licensing agreements always set terms that grant repositories certain rights, such as:
• basic rights to store and re-format files for preservation purposes;
• the right to change data to protect human subjects; and
• assurance that the depositor has complied with research sponsors’ requirements, to limit liability.
Determining the rights that authors are willing to grant repositories is one step in reconciling researchers’ needs with those of the IRs, and also those of SEAD.
Licensing agreements also often ensure rights for depositors, such as:
• deposits will not be changed in any way that differs from the depositor’s original intent;
• data under embargo will not be distributed until the embargo has ended; and
• the data creator retains copyright and/or moral rights to any deposits.
These depositor rights are pretty similar across the repositories. We are continuing to track them (and attempting to reconcile them, where necessary) because we feel that a more elegant solution than the matchmaking service might be to work with each of the institutions to create a single site license for SEAD.
Ultimately, a single-license solution that would satisfy all repository (and institutional) legal requirements is our goal. Such a license would also need to mitigate rights on behalf of the depositor. Until then, it’s important that the project develops clear instructions, displayed during the deposit process, that:
• explain the rights and responsibilities associated with the use of datasets; and
• direct users to relevant information regarding licensing.
It’s also important that our programmatic matchmaking service can identify when there is a match between the independently stated requirements of specific end users, repositories, and the SEAD Virtual Archive.
The second major policy decision that shaped the architecture of our project is which permanent identifier scheme to use for assigning IDs to datasets and for identifying authors. The SEAD project applies persistent identifiers (PIDs) to datasets and authors to track usage and citations. For author identification, we currently use internal VIVO identifiers, which are unique links to each researcher profile. For dataset identification, the SEAD project uses DOIs, created by DataCite, that resolve to the institutional repositories where the data are stored.
Currently, the SEAD project uses the built-in VIVO unique ID for all researcher profile information stored in our Active/Social Content Repository. This VIVO ID can be used effectively primarily at the domain or institutional level. However, the VIVO ontology currently supports many different researcher PID systems, including ORCID and systems such as Thomson Reuters’ ResearcherID. When the SEAD project began, ORCID was still under development as an author ID system. Now that the ORCID technical infrastructure has been established, SEAD hopes to take advantage of that service by enabling ORCID ID registration as the main unique identifier system. This would enable researcher identification at a more global level. We know that more is coming from ORCID, and we look forward to helping implement resolution services between ORCID and VIVO systems.
For data, we decided fairly early on to adopt DOIs as our permanent ID scheme. However, DSpace primarily uses Handles as a permanent ID, so a challenge has been reconciling these two identifier schemes within our system. The Handle system is an established standard for PID resolution supported by the DSpace institutional repository software, and is often managed locally. Many repository support staff are already familiar with Handles, and there are over 1300 installations worldwide. Handle IDs do not store associated metadata, while DOIs do, which is an important concern for a project developing rich metadata and an additional preservation measure. Further, while Handles are generated primarily for use in DSpace institutional repositories, DOIs are supported by a wide range of publishers and vendors as the citation standard and are used widely for data identification.
We wanted to implement a service that had the right international scope, and we felt that DOIs were that service. Using DOIs therefore allows SEAD VA to go beyond DSpace-based repositories, to federate across a larger number and variety of repositories, and to integrate with other data preservation and citation technologies.
One limitation of the DOI standard is that even though it allows for multiple destination resolution, most implementations do not support it. Ideally, the DOI would resolve to the data set’s locations in both the VA and ACR components of SEAD, so end users could access both the active and the fixed data version. Currently, DataCite does not allow this, so we work around the limitation by maintaining a DOI and a separate link to the ACR in both the SEAD VA and SEAD-VIVO. In any event, given that two permanent IDs exist for any given dataset by virtue of its being deposited to a DSpace repository, we need to find a good way to cross-reference Handles and DOIs. It’s a challenge we’re working to address.
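One simple way to picture the cross-referencing problem is a small lookup table that keeps the DOI, the Handle, and the separate ACR link together in a single record. This is a sketch under stated assumptions, not the approach SEAD adopted: the `DatasetIds` and `CrossReference` names are hypothetical, and any identifier values passed in would be supplied by the caller. The only external fact relied on is that DOI names are case-insensitive, so they are normalized before lookup.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DatasetIds:
    """All identifiers known for one deposited data set."""
    doi: str                       # minted for the fixed version in the VA
    handle: str                    # assigned by the DSpace repository
    acr_url: Optional[str] = None  # separate link to the active version


class CrossReference:
    """In-memory index that resolves a DOI or a Handle to the same record."""

    def __init__(self) -> None:
        self._by_doi: Dict[str, DatasetIds] = {}
        self._by_handle: Dict[str, DatasetIds] = {}

    def register(self, ids: DatasetIds) -> None:
        # DOI names are case-insensitive, so store a lowercased key.
        self._by_doi[ids.doi.lower()] = ids
        self._by_handle[ids.handle] = ids

    def from_doi(self, doi: str) -> Optional[DatasetIds]:
        return self._by_doi.get(doi.lower())

    def from_handle(self, handle: str) -> Optional[DatasetIds]:
        return self._by_handle.get(handle)
```

A record found through either identifier exposes the other one and the ACR link, which is the behavior an end user would need to reach both the active and the fixed version of a data set.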
Finally, I’d like to briefly let you know of some of our progress to date with the Virtual Archive. As I pointed out, sustainability science is broad, and the data are diverse. As our starting point, we have ingested all NCED data in order to test how SEAD would handle a small-sized collection. We’ve also addressed another use case: organization for a heterogeneous collection of related files. Our file organization is based on how we received the files from NCED; another file organization may be better, and we’re looking into the best way to structure it. We have also successfully tested deposit between the VA, UIUC IDEALS, and IUScholarWorks. We’ve been able to demonstrate the end-to-end prototype across the three-part functionality of SEAD: the social network, the virtual archive, and the repositories.
Work does remain on the prototype. We have to address the other use cases for data, including the heterogeneous file structure I mentioned, and also how to deal with relational databases and unique formats. Our other major task is to determine how the prototype fits researcher workflows.
That is a very basic overview of the SEAD Virtual Archive and some of the policy work we’ve done to date. I encourage you to check out the paper, which should be included on the conference USB drives, for more information. I have a few minutes left for questions.