Leveraging publication metadata to help overcome the data ingest bottleneck

Todd J. Vision
National Evolutionary Synthesis Center
Department of Biology, University of North Carolina at Chapel Hill
ORCID Participant Meeting, Harvard, May 2011

The End: To make data archiving integral to scientific publishing.
The Scope: Data underlying findings in the peer-reviewed biological literature.
The Means:
  Integrated submission of data with the manuscript
  Low barrier to submission (at the datafile level)
  Free reuse of data (free as in both speech & beer)
  Journals share responsibility for governance and sustainability

The long tail of orphan data in “small science” (after B. Heidorn)
[Figure: long-tail curve. Axis labels: Volume; Rank frequency of datatype. Curve labels: Specialized repositories (e.g. GenBank, PDB); Orphan data.]
Example shown on the slide: Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow, Passer domesticus. A Fourth Contribution to the Study of Variation. pp. 209-226 in Biological Lectures from the Marine Biological Laboratory, Woods Hole, Mass.

A publication package
1. Integrated manuscript and data submission
2. Handshaking with specialized repositories

[Diagram: submission workflows, built up over several slides]
Integrated: Submit manuscript; Submit data; Manuscript metadata; Review passcode; Peer review; Acceptance notification; Curation; Data DOI; Production; Article metadata; Curation; Article publication; Data publication
Non-integrated (steps shown): Submit data (after peer review); Data DOI; Author adds DOI; Article publication; Article metadata harvested

Article: Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
Dryad data package: Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Data from: Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. Dryad Digital Repository. doi:10.5061/dryad.8384
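The relationship between the article citation and the data package citation can be sketched in code. This is a minimal illustration, assuming the data package citation is derived from article metadata in the way this single example suggests; the function name and its parameters are hypothetical and are not taken from Dryad's software.

```python
def data_package_citation(authors, year, title, data_doi):
    """Build a data package citation from article metadata.

    The format is inferred from the one example on the slide
    (an assumption, not a documented Dryad rule).
    """
    author_list = ", ".join(authors)
    return (
        f"{author_list} ({year}) Data from: {title}. "
        f"Dryad Digital Repository. doi:{data_doi}"
    )


# Article metadata taken from the citation shown on the slide.
citation = data_package_citation(
    authors=["Wu D", "Wu M", "Halpern A", "Rusch DB",
             "Yooseph S", "Frazier M", "Venter JC", "Eisen JA"],
    year=2011,
    title=("Stalking the fourth domain in metagenomic data: searching for, "
           "discovering, and interpreting novel, deep branches in "
           "phylogenetic trees of phylogenetic marker genes"),
    data_doi="10.5061/dryad.8384",
)
print(citation)
```

Only the data DOI is new information here; the authors, year, and title are reused from the article record.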

Integrated submission
Currently integrated or in process: 20
All journals with Dryad content: >70
A minority require data prior to review
Journals published by a variety of organizations:
  Traditional (incl. Oxford University Press, Wiley-Blackwell)
  Open Access (incl. BMC, BMJ Open)
  Society publishers (e.g. with Allen Press, or independent)

Dryad vs. Supplementary Online Materials

Member nodes; Coordinating nodes; Investigator toolkit

Why Dryad yearns for ORCIDs
Replace name strings with identities:
  Disambiguation of like names
  Clustering of synonymous names
  Confidently recognizing different data packages that share an author
Enabling:
  Accurate author searches
  Internal and external author hyperlinks
  Aggregation of author contributions
  Inclusion of data records in the profiles of coauthors
  Propagation of ORCIDs with Dryad metadata
Manual curation of names not feasible:
  Only ~20% of Dryad authors are in the Library of Congress name authority file
  Manual control would explode curation costs
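The difference between name strings and identities can be illustrated with a small sketch. It assumes a naive normalization (surname plus first initial) as a stand-in for string matching; the author names, the `author_id` values, and all function names are invented for illustration and do not come from Dryad or ORCID.

```python
from collections import defaultdict


def name_key(name):
    """Reduce 'Surname, Given' or 'Surname GN' to 'surname g' (naive, illustrative)."""
    parts = name.replace(",", " ").split()
    if not parts:
        return ""
    surname = parts[0].lower()
    initial = parts[1][0].lower() if len(parts) > 1 else ""
    return f"{surname} {initial}".strip()


def group_by(records, key_func):
    """Group the name strings of author records under the key that key_func returns."""
    groups = defaultdict(list)
    for record in records:
        groups[key_func(record)].append(record["name"])
    return dict(groups)


# Hypothetical author records; 'author_id' stands in for a persistent identifier.
records = [
    {"name": "Smith J", "author_id": "id-A"},
    {"name": "Smith, Jane", "author_id": "id-A"},
    {"name": "Smith, John", "author_id": "id-B"},
]

by_string = group_by(records, lambda r: name_key(r["name"]))
by_identity = group_by(records, lambda r: r["author_id"])

print(by_string)    # one cluster: the two people are merged
print(by_identity)  # two clusters: synonymous names joined, like names kept apart
```

String matching clusters the synonymous forms of one author but also merges two different people who share a surname and initial; grouping on an identifier does the first without the second.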

How to get ORCIDs into Dryad
Ideally sent to Dryad by integrated journals:
  Pre-review/Pre-production: allows coauthors to edit data packages
  Post-production: works for all other uses
Non-integrated journals:
  Lookup API based on article or affiliation data
To be avoided:
  Authors required to enter ORCIDs during submission
  Authors required to register during submission

What do we know about authors?
Names:
  Often abbreviated except for corresponding or submitting author
At least one article they have written:
  Title, journal, volume, pages, DOI, abstract
Other identifiable information:
  An email for submitting authors
  Sometimes: institutional affiliation and contact information for corresponding authors
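A lookup that uses only this information (an abbreviated name plus one known article) could be sketched as follows. This is an assumption-laden illustration, not an actual ORCID or Dryad API: the in-memory `registry`, its field names, the placeholder DOIs, and the `lookup_author_id` function are all hypothetical.

```python
def surname_initial(name):
    """Return (surname, first initial) in lower case; a naive normalization."""
    parts = name.replace(",", " ").split()
    if not parts:
        return ("", "")
    initial = parts[1][0].lower() if len(parts) > 1 else ""
    return (parts[0].lower(), initial)


def lookup_author_id(registry, article_doi, author_name):
    """Return an identifier only when exactly one registered author matches.

    Ambiguous or missing matches return None so that a curator can decide.
    """
    wanted = surname_initial(author_name)
    matches = {
        entry["author_id"]
        for entry in registry
        if article_doi in entry["article_dois"]
        and surname_initial(entry["name"]) == wanted
    }
    return matches.pop() if len(matches) == 1 else None


# Hypothetical registry entries with placeholder DOIs and identifiers.
registry = [
    {"name": "Smith, Jane", "author_id": "id-A",
     "article_dois": {"10.1234/example.1", "10.1234/example.3"}},
    {"name": "Smith, John", "author_id": "id-B",
     "article_dois": {"10.1234/example.2", "10.1234/example.3"}},
]

print(lookup_author_id(registry, "10.1234/example.1", "Smith J"))
print(lookup_author_id(registry, "10.1234/example.3", "Smith J"))
```

Returning nothing for an ambiguous match, rather than guessing, keeps errors out of the metadata and leaves those cases for curator verification.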

Some requirements
Recognizing ORCIDs for authenticated users:
  Mapping to InCommon Silver profiles
ORCIDs for organizations (e.g. consortia)
DSpace support:
  Curator interface for ORCID lookup/verification
  Lookup/registration option from submission interface
  Allowing metadata relationships (e.g. of an ORCID with a name)
Mechanisms for curator to:
  Flag duplicates and errors
  Register provisional ORCIDs
  Map to other profiles (e.g. InCommon)

Business model issues
Dryad is (will be) supported by subscriptions and deposit charges, primarily from journals.
With a not-for-profit budget:
  Feasibility requires wide adoption by publishers
  And manuscript-submission system developers!
Favored model:
  Pay for use of automated lookup services, with costs scaled by usage level
  Credit for curator contributions

"Cherish old knowledge that you may acquire new" The Analects of Confucius Special thanks to Elena Feinstein Jane Greenberg Ryan Scherle For more information: http://datadryad.org http://blog.datadryad.org http://datadryad.org/wiki http://code.google.com/p/dryad dryad-users@nescent.org Facebook: Dryad Twitter: @datadryad

Dryad Metadata Profile (v3.0): Article; Data Package
