Leveraging publication metadata to help overcome the data ingest bottleneck. Todd J. Vision, National Evolutionary Synthesis Center; Department of Biology, University of North Carolina at Chapel Hill. ORCID Participant Meeting, Harvard, May 2011.
The End: to make data archiving integral to scientific publishing. The Scope: data underlying findings in the peer-reviewed biological literature. The Means: integrated submission of data with the manuscript; low barrier to submission (at the datafile level); free reuse of data (free as in both speech & beer); journals share responsibility for governance and sustainability.
The long tail of orphan data in “small science” (after B. Heidorn). [Figure: volume versus rank frequency of datatype, with labels for specialized repositories (e.g. GenBank, PDB) and orphan data.] Example cited on the slide: Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow, Passer domesticus. A Fourth Contribution to the Study of Variation. pp. 209-226 in Biological Lectures from the Marine Biological Laboratory, Woods Hole, Mass.
A publication package: 1. Integrated manuscript and data submission. 2. Handshaking with specialized repositories.
[Workflow diagram, built up over several slides; labels listed in the order they appear.] Integrated: Submit manuscript; Manuscript metadata; Submit data; Review passcode; Peer review; Acceptance notification; Curation; Data DOI; Production; Article metadata; Curation; Article publication; Data publication. Non-integrated (additional labels): Submit data; Author adds DOI; Data DOI; Article publication; Article metadata harvested.
Article: Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
Dryad data package: Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Data from: Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. Dryad Digital Repository. doi:10.5061/dryad.8384
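The two citations on this slide differ only in the "Data from:" prefix, the repository name, and the DOI, which suggests the data-package citation can be derived from article metadata. A minimal sketch of that derivation follows; the `Article` class and `data_package_citation` function are illustrative names, not part of Dryad's software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Article metadata as it appears in the citation on the slide."""
    authors: str
    year: int
    title: str
    doi: str


def data_package_citation(article: Article, data_doi: str) -> str:
    """Build a data-package citation that mirrors the article citation.

    Assumption: the pattern is inferred from the single example on the
    slide (same authors and year, title prefixed with "Data from:").
    """
    return (
        f"{article.authors} ({article.year}) Data from: {article.title}. "
        f"Dryad Digital Repository. doi:{data_doi}"
    )


article = Article(
    authors="Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA",
    year=2011,
    title=(
        "Stalking the fourth domain in metagenomic data: searching for, "
        "discovering, and interpreting novel, deep branches in phylogenetic "
        "trees of phylogenetic marker genes"
    ),
    doi="10.1371/journal.pone.0018011",
)

citation = data_package_citation(article, "10.5061/dryad.8384")
print(citation)
```

Because the author string is copied verbatim from the article, any ambiguity in the article's name strings carries over to the data package.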
Integrated submission. Currently integrated or in process: 20. All journals with Dryad content: >70. A minority require data prior to review. Journals are published by a variety of organizations: traditional (incl. Oxford University Press, Wiley-Blackwell), open access (incl. BMC, BMJ Open), and society publishers (e.g. with Allen Press, or independent).
Dryad vs. Supplementary Online Materials
612 downloads
Member nodes; Coordinating nodes; Investigator toolkit
Why Dryad yearns for ORCIDs. Replace name strings with identities: disambiguation of like names; clustering of synonymous names; confidently recognizing different data packages that share an author. Enabling: accurate author searches; internal and external author hyperlinks; aggregation of author contributions; inclusion of data records in the profiles of coauthors; propagation of ORCIDs with Dryad metadata. Manual curation of names is not feasible: only ~20% of Dryad authors are in the Library of Congress name authority file; manual control would explode curation costs.
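To illustrate why name strings alone fall short, here is a small sketch of a naive clustering key (surname plus first initial). It is not Dryad's method; `name_key` and `cluster_names` are illustrative, and the two "Wu" full names are hypothetical people invented for the example.

```python
from collections import defaultdict


def name_key(name: str) -> tuple:
    """Reduce a name string to (surname, first initial), lowercased.

    Handles "Surname, Given", "Surname INITIALS" and "Given Surname" forms.
    """
    name = name.replace(".", " ").strip()
    if "," in name:
        surname, given = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        # A short all-caps trailing token is treated as initials ("Eisen JA").
        if len(parts) > 1 and parts[-1].isupper() and len(parts[-1]) <= 3:
            surname, given = " ".join(parts[:-1]), parts[-1]
        else:
            surname, given = parts[-1], " ".join(parts[:-1])
    return surname.lower(), given[:1].lower()


def cluster_names(names):
    """Group name strings that share the same naive key."""
    clusters = defaultdict(list)
    for name in names:
        clusters[name_key(name)].append(name)
    return dict(clusters)


# Synonymous strings for one author collapse into one cluster (desired).
synonyms = cluster_names(["Eisen JA", "Eisen, J.", "J. A. Eisen"])

# Two hypothetical, distinct people also collapse into one cluster (not desired).
collisions = cluster_names(["Wu, Dana", "Wu, David"])

print(synonyms)
print(collisions)
```

The same key that merges synonymous strings also merges distinct people with like names, so string heuristics cannot deliver both disambiguation and clustering; a persistent identifier attached to each name can.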
How to get ORCIDs into Dryad. Ideally sent to Dryad by integrated journals: pre-review/pre-production allows coauthors to edit data packages; post-production works for all other uses. Non-integrated journals: lookup API based on article or affiliation data. To be avoided: authors required to enter ORCIDs during submission; authors required to register during submission.
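The slide does not describe the lookup API itself. As a rough sketch of what "lookup based on article data" could mean, the example below matches an author name string against an in-memory stand-in for an identity registry, using the article DOI as the key evidence. Everything here is an assumption: `IdentityRecord`, `lookup_by_article`, the placeholder identifiers (`id-0001` and so on, which are not real ORCIDs), and the second DOI.

```python
from dataclasses import dataclass, field


@dataclass
class IdentityRecord:
    """One entry in a hypothetical identity registry."""
    identifier: str                      # placeholder, not a real ORCID
    surname: str
    work_dois: set = field(default_factory=set)


def lookup_by_article(registry, article_doi: str, author_name: str):
    """Return identifiers whose record lists the article and shares the surname.

    Assumes the "Surname INITIALS" form used in the article citation.
    """
    surname = author_name.split()[0].lower()
    doi = article_doi.lower()
    return [
        record.identifier
        for record in registry
        if record.surname.lower() == surname
        and doi in {d.lower() for d in record.work_dois}
    ]


registry = [
    IdentityRecord("id-0001", "Eisen", {"10.1371/journal.pone.0018011"}),
    IdentityRecord("id-0002", "Eisen", {"10.0000/hypothetical.1"}),  # a different Eisen
    IdentityRecord("id-0003", "Venter", {"10.1371/journal.pone.0018011"}),
]

matches = lookup_by_article(registry, "10.1371/journal.pone.0018011", "Eisen JA")
print(matches)
```

A lookup of this kind needs no action from the author at submission time, which is consistent with the slide's aim of avoiding mandatory ORCID entry or registration.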
What do we know about authors? Names: often abbreviated except for corresponding or submitting author. At least one article they have written: title, journal, volume, pages, DOI, abstract. Other identifiable information: an email for submitting authors; sometimes institutional affiliation and contact information for corresponding authors.
Some requirements. Recognizing ORCIDs for authenticated users; mapping to InCommon Silver profiles; ORCIDs for organizations (e.g. consortia); DSpace support; curator interface for ORCID lookup/verification; lookup/registration option from submission interface; allowing metadata relationships (e.g. of an ORCID with a name). Mechanisms for curator to: flag duplicates and errors; register provisional ORCIDs; map to other profiles (e.g. InCommon).
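The slide does not say how verification would work. One cheap first step a curator interface could take is a syntactic check. The sketch below assumes the identifier format that ORCID uses (16 characters in four hyphenated groups, with an ISO 7064 MOD 11-2 check character); the function names are illustrative, and passing this check says nothing about whether the identifier belongs to the right person.

```python
import re

# Four groups of four characters; only the last character may be "X".
_ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_char(base15: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(value: str) -> bool:
    """Return True if the string has the ORCID layout and a valid check character."""
    if not _ORCID_RE.fullmatch(value):
        return False
    digits = value.replace("-", "")
    return orcid_check_char(digits[:15]) == digits[15]


# 0000-0002-1825-0097 is the sample identifier used in ORCID's documentation.
print(is_well_formed_orcid("0000-0002-1825-0097"))
print(is_well_formed_orcid("0000-0002-1825-0098"))
```

A check like this catches typing errors before a curator spends time on a lookup; flagging duplicates and mapping to other profiles still require the record-level mechanisms listed on the slide.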
Business model issues. Dryad is (will be) supported by subscriptions and deposit charges, primarily from journals, with a not-for-profit budget. Feasibility requires wide adoption by publishers, and by manuscript-submission system developers! Favored model: pay for use of automated lookup services, with costs scaled by usage level; credit for curator contributions.
"Cherish old knowledge that you may acquire new" (The Analects of Confucius). Special thanks to Elena Feinstein, Jane Greenberg, and Ryan Scherle. For more information: http://datadryad.org, http://blog.datadryad.org, http://datadryad.org/wiki, http://code.google.com/p/dryad, dryad-users@nescent.org, Facebook: Dryad, Twitter: @datadryad
Dryad Metadata Profile (v3.0): Article; Data Package

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 

Leveraging publication metadata to help overcome the data ingest bottleneck

  • 1. Leveraging publication metadata to help overcome the data ingest bottleneck Todd J. Vision National Evolutionary Synthesis Center Department of Biology University of North Carolina at Chapel Hill ORCID Participant Meeting, Harvard, May 2011
  • 2. The End To make data archiving integral to scientific publishing. The scope Data underlying findings in the peer-reviewed biological literature. The Means Integrated submission of data with the manuscript Low barrier to submission (at the datafile level) Free reuse of data (free as in both speech & beer) Journals share responsibility for governance and sustainability
  • 3. The long tail of orphan data in “small science” after B. Heidorn Specialized repositories (e.g. GenBank, PDB) Volume Orphan data Rank frequency of datatype
  • 4. The long tail of orphan data in “small science” after B. Heidorn Specialized repositories (e.g. GenBank, PDB) Volume Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow, Passer domesticus. A Fourth Contribution to the Study of Variation.pp. 209-226 in Biological Lectures from the Marine Biological Laboratory, Woods Hole, Mass. Orphan data Rank frequency of datatype
  • 6. A publication package 1 1. Integrated manuscript and data submission
  • 7. A publication package 2 1 1. Integrated manuscript and data submission 2. Handshaking with specialized repositories
  • 9. Integrated Submit manuscript Manuscript metadata
  • 10. Integrated Submit manuscript Submit data Manuscript metadata
  • 11. Integrated Submit manuscript Submit data Manuscript metadata Review passcode Peer review
  • 12. Integrated Submit manuscript Submit data Manuscript metadata Review passcode Peer review Acceptance notification Curation Data DOI Production
  • 13. Integrated Submit manuscript Submit data Manuscript metadata Review passcode Peer review Acceptance notification Curation Data DOI Production Article metadata Curation
  • 14. Integrated Submit manuscript Submit data Manuscript metadata Review passcode Peer review Acceptance notification Curation Data DOI Production Article metadata Curation Article Publication Data publication
  • 16. Non-integrated Integrated Submit manuscript Submit data Manuscript metadata Review passcode Peer review Submit data Acceptance notification Curation Data DOI Production Article metadata Curation Article Publication Data publication
  • 17. Non-integrated Integrated Submit manuscript Submit data Manuscript metadata Review passcode Peer review Submit data Acceptance notification Curation Data DOI Production Author adds DOI Data DOI Article metadata Curation Article publication Article Publication Article metadata harvested Data publication
  • 18. Article Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011 Dryad data package Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Data from: Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. Dryad Digital Repository. doi:10.5061/dryad.8384
  • 19. Integrated submission Currently integrated or in process: 20 All journals with Dryad content: >70 A minority require data prior to review Journals published by a variety of organizations Traditional (incl. Oxford University Press, Wiley-Blackwell) Open Access (incl. BMC, BMJ Open) Society publishers (e.g. with Allen Press, or independent)
  • 20. Dryad vs. Supplementary Online Materials
  • 23. Why Dryad yearns for ORCIDs Replace name strings with identities Disambiguation of like names Clustering of synonymous names Confidently recognizing different data packages that share an author Enabling Accurate author searches Internal and external author hyperlinks Aggregation of author contributions Inclusion of data records in the profiles of coauthors Propagation of ORCIDs with Dryad metadata Manual curation of names not feasible Only ~20% of Dryad authors in Library of Congress name auth. file Manual control would explode curation costs
  • 24. How to get ORCIDs into Dryad Ideally sent to Dryad by integrated journals Pre-review/Pre-production: allows coauthors to edit data packages Post-production: works for all other uses Non-integrated journals Lookup API based on article or affiliation data To be avoided Authors required to enter ORCIDs during submission Authors required to register during submission
  • 25. What do we know about authors? Names Often abbreviated except for corresponding or submitting author At least one article they have written Title, journal, volume, pages, DOI, abstract Other identifiable information An email for submitting authors Sometimes: institutional affiliation and contact information for corresponding authors
  • 26. Some requirements Recognizing ORCIDs for authenticated users Mapping to InCommon Silver profiles ORCIDs for organizations (e.g. consortia) DSpace support Curator interface for ORCID lookup/verification Lookup/registration option from submission interface Allowing metadata relationships (e.g. of an ORCID with a name) Mechanisms for curator to Flag duplicates and errors Register provisional ORCIDs Map to other profiles (e.g. InCommon)
  • 27. Business model issues Dryad is (will be) supported by subscriptions and deposit charges, primarily from journals. With a not-for-profit budget Feasibility requires wide adoption by publishers And manuscript-submission system developers! Favored model Pay for use of automated lookup services, with costs scaled by usage level Credit for curator contributions
  • 28. "Cherish old knowledge that you may acquire new" The Analects of Confucius Special thanks to Elena Feinstein Jane Greenberg Ryan Scherle For more information: http://datadryad.org http://blog.datadryad.org http://datadryad.org/wiki http://code.google.com/p/dryad dryad-users@nescent.org Facebook: Dryad Twitter: @datadryad
  • 30. bibo.status = article publication status
  • 31. dc.creator = authors of article
  • 32. dc.issued = article publication date
  • 33. dc.title = title of article
  • 34. bibo.journal = journal title
  • 35. bibo.issn and bibo.eissn
  • 38. bibo.pageStart and bibo.pageEnd
  • 39. dc.abstract = article abstract
  • 40. dc.isReferencedBy = data package doi
  • 41. dc.identifier = doi of data package
  • 42. dc.relation.hasPart = dois of data files
  • 43. dc.references = handle of article description record
  • 44. dc.title = title of data package
  • 45. dc.description (not article abstract, optional)
  • 46. dc.creator = authors of data package
  • 47. dc.date (with refinements – dates associated with submission to Dryad and archiving in the repository)
  • 48. dryad.external = GenBank accession number, TreeBASE identifier
  • 49. dc.relation = URL of related resource
  • 50. dc.subject = general keywords
  • 52. dc.spatial = geographic keywords
  • 53. dc.temporal = timespan keywords
  • 55. dc.relation.isPartOf = doi of data package
  • 56. file-specific description: keywords, authors, format, size, checksum, etc.
  • 57. embargo information (type, end date)
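The metadata fields listed in the last part of the transcript describe three linked record types: the article, the data package, and the individual data files. As a rough illustration only, and not Dryad's actual implementation, the Python sketch below shows how the package and file records might point at each other. The field names and the package DOI come from the slides; the dictionary layout, the file DOI, the article handle, the presence of `dc.identifier` on a file record, and the `link_errors` helper are all assumptions made for the example.

```python
# Hypothetical records. Field names follow the slides; the record layout,
# the file DOI, and the article handle are placeholders, not real values.
package = {
    "dc.identifier": "doi:10.5061/dryad.8384",
    "dc.title": (
        "Data from: Stalking the fourth domain in metagenomic data: "
        "searching for, discovering, and interpreting novel, deep branches "
        "in phylogenetic trees of phylogenetic marker genes"
    ),
    "dc.creator": [
        "Wu D", "Wu M", "Halpern A", "Rusch DB",
        "Yooseph S", "Frazier M", "Venter JC", "Eisen JA",
    ],
    "dc.references": "hdl:EXAMPLE/article-record",          # placeholder handle
    "dc.relation.hasPart": ["doi:10.5061/dryad.EXAMPLE/1"],  # placeholder DOI
}

files = [
    {
        "dc.identifier": "doi:10.5061/dryad.EXAMPLE/1",      # placeholder DOI
        "dc.relation.isPartOf": "doi:10.5061/dryad.8384",
    },
]


def link_errors(package, files):
    """Return a list of problems with the package <-> file cross-links."""
    errors = []
    package_doi = package["dc.identifier"]
    listed = set(package["dc.relation.hasPart"])
    seen = set()
    for record in files:
        file_doi = record["dc.identifier"]
        seen.add(file_doi)
        # Each file should point back to the package that contains it.
        if record["dc.relation.isPartOf"] != package_doi:
            errors.append(f"{file_doi} does not point back to {package_doi}")
        # Each file should also be listed by the package.
        if file_doi not in listed:
            errors.append(f"{file_doi} is missing from dc.relation.hasPart")
    # Files listed by the package but with no record of their own.
    for missing in sorted(listed - seen):
        errors.append(f"{missing} is listed but has no file record")
    return errors
```

Calling `link_errors(package, files)` on the records above returns an empty list, since the file points to the package and the package lists the file. A curator-facing tool might run a check of this kind before a data DOI is registered, though the slides do not say whether Dryad does so.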

Editor's Notes

  1. Demand from the user community (i.e. biologists) has led to a distributed network of sometimes inadequate, often unsustainable, generally non-interoperable solutions, from personal websites to publisher-hosted supplementary materials. Dryad, by contrast, is designed to be a self-sustaining service that rationalizes the space and is responsive to multiple stakeholders: journals, publishers, societies, funders, authors, and data users.
  2. There is widely used infrastructure for certain well-defined “easy” biological datatypes like DNA sequences and protein structures. But these repositories are not adequate to capture the many datasets that require more context to be reusable. Dryad is designed specifically to enable archiving and reuse of this long tail of orphan data. Our civilization is not wealthy enough to ever support the variety of specialized repositories that would be needed, or the curation that would be needed to standardize these data.
  3. A classic example of orphan data is Bumpus’ (1898) sparrows: “... on February 1 of the present year, when, after an uncommonly severe storm of snow, rain, and sleet, a number of English [house] sparrows were brought to the Anatomical Laboratory of Brown University. Seventy-two of these birds revived; sixty-four perished; ...” He published pages of data on the measurements of birds that died versus those that revived, and the data have been used ever since to test statistical methods of measuring natural selection on multivariate traits, and for teaching evolutionary biology. This is not high-throughput biology. This is a single clever, opportunistic, low-tech and idiosyncratic study by an individual investigator who, by virtue of having published his data, enhanced the value of his science, and is still being read to this day. These data are nearly meaningless out of context. But even without elaborate machine-readable metadata, they have been used in hundreds of papers and by many thousands of students, and are still being reused over a century later. The other lesson to take away is that publication-related data is only the tip of the iceberg of the long tail, but it is the low-hanging fruit. It is the most consistently valuable, the most consistently reusable, and the easiest to archive in a systematic way, because there is a vast socio-technical-economic infrastructure of publication that can be leveraged.
  4. Two aspects of the way Dryad works merit more detailed explanation; both contribute to lowering the burden of data submission.
  5. Integrated manuscript submission is described in more detail, since it is more relevant to how Dryad data records get populated with ORCIDs.
  6. Canonical handshaking workflow: data files are exchanged in a BagIt tarball (files + manifest), and completed submissions are harvested via OAI-PMH updates. Future plans are to use OAI-ORE for the manifest. The mechanism could be extended to allow deposit from research software (e.g. R, digital lab notebooks) or institutional repositories (e.g. with SWORD).
  7. An example of an email sent to Dryad from an integrated journal (Molecular Ecology) with author names highlighted.
  8. One can see these benefits in action with a 2009 data package that compiled wood anatomy data from 8412 plant species. It has already been downloaded over 600 times! While some of these downloads may lead to citations, there is probably a good deal of data reuse for educational purposes, and for exploration of analytical methods on this unique dataset. The inset from the corresponding Ecology Letters article shows the geographical distribution of wood density in North and South America. Each data point is the mean wood density value of all unique species occurrences in that cell. Wood density clearly varies in a very predictable way with temperature, precipitation, and seasonality. Dryad contains the data underlying this figure; without Dryad, researchers would be unable to reconstruct the original data from this image for testing new hypotheses.
  9. Dryad is one of many member nodes in the DataONE network, an NSF-funded DataNet that includes federal labs, research stations, earth observatory networks, citizen science data archives, etc., and that covers both earth science and life science data. Data is replicated across member nodes, while metadata and a layer of services are provided by a smaller number of redundant coordinating nodes. Dryad is the only member node that focuses on publication data, although data from many of the others are used in the published literature. One of the technical goals is to support distributed authentication, so that CRUD rights can be propagated from node to node for individuals, groups and organizations. DataONE has adopted InCommon, a single sign-on technology for US research and education, which is built on SAML-based authentication and authorization (e.g. Shibboleth). Specifically, DataONE uses InCommon Silver, which uses verified profiles allowing a high degree of trust.
  10. Dryad Metadata Application Profile 3.0
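The handshaking exchange in note 6 relies on a BagIt tarball, that is, payload files plus a checksum manifest. The Python sketch below is one possible way to build and verify such a bag using only the standard library; it is not Dryad's code. The function names `make_bag` and `verify_bag`, the file names, the choice of SHA-256, and the `BagIt-Version: 1.0` declaration are assumptions for the example, and the OAI-PMH harvesting step is not shown.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict) -> Path:
    """Write payload files into a minimal BagIt layout and tar the result."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest format: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    tar_path = bag_dir.parent / (bag_dir.name + ".tar.gz")
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(bag_dir, arcname=bag_dir.name)
    return tar_path


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example_bag"
        tarball = make_bag(bag, {"measurements.csv": b"id,length\n1,5.2\n"})
        print(tarball.name, verify_bag(bag))
```

The manifest is what lets the receiving repository confirm that files arrived intact, which matters when data pass between a general repository and a specialized one without a curator inspecting every transfer.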