SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long-Term Data Preservation in Sustainability Science

Beth Plale, Indiana University, Bloomington, Indiana, USA
Robert H. McDonald, Indiana University, Bloomington, Indiana, USA
Kavitha Chandrasekar, Indiana University, Bloomington, Indiana, USA
Inna Kouper, Indiana University, Bloomington, Indiana, USA
Stacy Konkiel, Indiana University, Bloomington, Indiana, USA
Margaret L. Hedstrom, University of Michigan, Ann Arbor, Michigan, USA
Jim Myers, Rensselaer Polytechnic Institute, Troy, New York, USA
Praveen Kumar, University of Illinois, Urbana, Illinois, USA

Cooperative agreement
#OCI0940824
IDCC 2013 – Amsterdam – Jan. 16, 2013

SEAD TEAMS
Michigan: Margaret Hedstrom (PI), Marietta Van Buhler, Karen Woollams, George Alter (ICPSR), Bryan Beecher (ICPSR)

Indiana: Beth Plale (Co-PI), Katy Börner, Robert H. McDonald, Robert Light, Kavitha Chandrasekar, Stacy Kowalczyk, Inna Kouper, Stacy Konkiel, Robert Ping, Ryan Cobine

Rensselaer: James Myers (Co-PI), Ram Prasanna Govind Krishnan, Lindsay Todd

Illinois: Praveen Kumar (Co-PI), Terry McLaren (NCSA), Rob Kooper (NCSA), Luigi Marini (NCSA)


Challenge: The Data Deluge

1. Scientific data ingestion must be quick and minimally intrusive on a scientist’s time.
2. Ingesting must be flexible enough to handle the varied kinds of data (sizes, formats, composition).
3. Tools for advertising and serving data from an institutional repository need to be
consistent with tools and processes of the scientific community.


Challenge: Long Tail Scientific Research
• Many research niches
– customized methods & toolsets
– localized storage
• Less consideration for long-term availability and data reuse


Requirements of Virtual Archive for
Sustainability Science
• Must connect multiple IRs
• Must be minimally intrusive on a scientist’s time
• Must handle varied data:
– multi-GB collection,
– vastly heterogeneous collection of files,
– small complex database of a thousand variables, or
– set of files in formats that are unique to the
subdiscipline
• Must be consistent with tools and processes of
the community

SEAD
[Figure: SEAD functions: discover, ingest, publish, associate]

SEAD Virtual Archive (SVA)
-- manage sustainability science window to multiple IRs
-- OAIS model

IRs: IUScholarWorks, UIUC IDEALS, UMich Deep Blue

[Figure: talk outline (Design, Policy Decisions, Progress to Date) around the SVA, which manages a sustainability science window to multiple IRs under the OAIS model and offers a single view into data and easy deposit]

SEAD Virtual Archive Workflow

[Figure: workflow diagram. Boxes: Accept Repository Agreement; Upload Data to VA; Run Virus Checking; File Characterization; Preview Data; Mint DOI; Deposit to IR (& cloud); Update DOI target; Index Metadata]

Ongoing work: IR Matchmaker Decision; Large Scientific Data; Index Scientific Metadata; Index Dataset Metadata; Version
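
The workflow boxes on this slide can be read as an ordered pipeline. The following is a minimal Python sketch of that reading, not the SEAD implementation: the `Deposit` record, the step functions, and the exact step order are assumptions made for illustration, and every step except the first two is a stand-in that always succeeds.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Deposit:
    """Hypothetical record for one dataset moving through the workflow."""
    files: List[str]
    agreement_accepted: bool = False
    log: List[str] = field(default_factory=list)


# A step returns False to halt the pipeline.
Step = Callable[[Deposit], bool]


def accept_agreement(d: Deposit) -> bool:
    return d.agreement_accepted


def upload_to_va(d: Deposit) -> bool:
    return len(d.files) > 0


def stand_in(d: Deposit) -> bool:
    # Placeholder for steps whose behavior the slide does not describe.
    return True


# Step names come from the slide; the ordering is an assumption.
WORKFLOW: List[Tuple[str, Step]] = [
    ("accept repository agreement", accept_agreement),
    ("upload data to VA", upload_to_va),
    ("run virus checking", stand_in),
    ("file characterization", stand_in),
    ("preview data", stand_in),
    ("mint DOI", stand_in),
    ("deposit to IR (& cloud)", stand_in),
    ("update DOI target", stand_in),
    ("index metadata", stand_in),
]


def run_workflow(d: Deposit, steps: List[Tuple[str, Step]] = WORKFLOW) -> bool:
    """Run steps in order, logging each one; stop at the first failure."""
    for name, step in steps:
        if not step(d):
            d.log.append(f"halted at: {name}")
            return False
        d.log.append(f"done: {name}")
    return True
```

Modeling each box as a function that can halt the run keeps the agreement and virus-check gates explicit, so a deposit never reaches the DOI or IR steps unless the earlier ones passed.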


Architecture: SEAD VA Matchmaker

[Figure: IR Matchmaker architecture. Components: IR Matchmaker Client, VIVO, IR Matchmaker Service, Repository Agent, VA Load Monitor Agent]
• Client and VIVO: query for data contributor metadata; return data contributor's affiliation information
• Client and IR Matchmaker Service: Query Match; Get Match
• Repository Agent: query for IRs' details; return all IRs' details
• VA Load Monitor Agent: query VA load; return VA load constraints
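
One way to picture the matchmaking decision is as a filter and a ranking over the inputs the diagram names: the contributor's affiliation (from VIVO), the IRs' details, and the VA load. The sketch below is an illustration under stated assumptions, not the SEAD matchmaker: the `RepositoryInfo` fields, the size limits, and the rule "prefer the contributor's home institution" are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RepositoryInfo:
    """Hypothetical summary of one IR as a repository agent might report it."""
    name: str
    institution: str
    max_deposit_mb: int  # assumed constraint, not taken from the slides


def match_repositories(
    affiliation: str,
    deposit_size_mb: int,
    repositories: List[RepositoryInfo],
    va_load_ok: bool = True,
) -> List[RepositoryInfo]:
    """Return candidate IRs, home-institution repositories first."""
    if not va_load_ok:
        # Assumed behavior: defer matching while the VA is overloaded.
        return []
    fits = [r for r in repositories if deposit_size_mb <= r.max_deposit_mb]
    # sorted() is stable, so ties keep their original order.
    return sorted(fits, key=lambda r: r.institution != affiliation)


# Example data: repository names are from the slides, limits are invented.
repos = [
    RepositoryInfo("IUScholarWorks", "Indiana University", 500),
    RepositoryInfo("UIUC IDEALS", "University of Illinois", 2000),
    RepositoryInfo("UMich Deep Blue", "University of Michigan", 2000),
]
ranked = match_repositories("University of Michigan", 100, repos)
```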

Policy: Licensing Agreements

Repository rights
• Right to store and re-format files (preservation)
• Allow editing to protect human subjects, sensitive data (protection)
• Make metadata public (discoverability)
• Ensure sponsor compliance (liability)

Depositor rights
• Retain copyright/moral rights
• Deposits will not be changed from original intent
• Embargoes will be honored

Single-license solution
• Satisfy all repository requirements
• Mitigate rights on behalf of depositor

Matchmaking solution
• Connect requirements of:
– End users
– Repositories
– SEAD Virtual Archive
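
A matchmaking approach to licensing can be sketched as a two-way compatibility check. This is only an illustration of the idea, assuming that rights and conditions can be expressed as simple labels; the function name and the labels are hypothetical and do not come from the SEAD agreements.

```python
from typing import FrozenSet


def license_compatible(
    granted_rights: FrozenSet[str],
    depositor_conditions: FrozenSet[str],
    repo_required_rights: FrozenSet[str],
    repo_honored_conditions: FrozenSet[str],
) -> bool:
    """True when the repository gets every right it requires and
    honors every condition the depositor sets."""
    return (
        repo_required_rights <= granted_rights
        and depositor_conditions <= repo_honored_conditions
    )


# Hypothetical labels loosely based on the rights listed on the slides.
granted = frozenset({"store", "reformat", "public-metadata"})
conditions = frozenset({"honor-embargo"})
compatible = license_compatible(
    granted,
    conditions,
    repo_required_rights=frozenset({"store", "reformat"}),
    repo_honored_conditions=frozenset({"honor-embargo", "retain-copyright"}),
)
```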


Policy: Permanent Identifiers

Author IDs
• VIVO identifiers

Dataset IDs
• Digital Object Identifiers (DOIs)

Policy: Author IDs

VIVO
• Used primarily at domain/institutional level
• Supports many researcher ID systems, including ORCID

ORCID
• Global system
• Buy-in from and integration with major publishers and institutions

[Figure: researcher ID systems shown: ResearcherID, Scopus Author ID, VIVO ID, Pivot ID]


Policy: Dataset IDs

Handles DOIs


Progress to Date
• Ingested all NCED data
– Small-sized collection (overall < 150 MB)
– File organization for heterogeneous collection of
related files with flat or hierarchical structure
• Tested deposit between the VA, UIUC IDEALS,
and IUScholarWorks


Future Work
• Address other use cases
– Large-size collections (overall > 1 GB)
– Relational database / interconnected variables
– Unique formats (to project, discipline, community)
• Interoperability with other DataNets
• Support for API access
• Determine how prototype fits researcher
workflows

Thank you

http://www.sead-data.net
@SEADdatanet
