Analyzing the Persistence of Referenced
    Web Resources with Memento


                                         Robert Sanderson
                                           Mark Phillips
                                       Herbert Van de Sompel


                                      http://mementoweb.org/


                                            Memento is funded by
                                           The Library of Congress

         Persistence of Referenced Web Resources
        Open Repositories 2011, Austin TX, June 6-11
Overview



•  Motivating Horror Story

•  Memento

•  Experiment

•  Results

•  Conclusions/Future Work




A Motivating Academic Horror Story




Another Motivating Academic Horror Story




Question 1
To what extent are web resources that are referenced from
 works in repositories still available at their original URL?



                                           Significant prior art!

                                      But very small in scale, other
                                     than Lawrence's early work
                                             on CiteSeer



                                           (See paper for references)

Our Hero Enters the Scene!




Question 1(redux)

To what extent are web resources that are referenced from
 works in repositories still available at their original URL …

            or from archives of web resources?




  Prior art is sketchy at best, as it lacks an automated method
        for discovering archived web resources.




Memento Framework


Memento wants to make it easy to navigate the web of the past



                             •  Global version indicator: Time

                             •  Based on the primitives of the Web:
                                resource, representation, content
                                negotiation, link

                             •  Functionality: Given a URI and a
                                Datetime, resolve the closest
                                archived copy
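
As a rough illustration of that functionality, the sketch below picks the archived copy whose datetime is nearest to a requested one. The list of Mementos, the archive.example URLs, and the function names `accept_datetime` and `closest_memento` are hypothetical. In the Memento protocol itself, the preferred datetime is sent to the server in an `Accept-Datetime` request header formatted as an HTTP date, and the selection happens on the archive side; the local selection here only mimics that behaviour.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(when):
    """Format an aware datetime as an HTTP date for an Accept-Datetime header."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def closest_memento(mementos, when):
    """Return the (uri, datetime) pair nearest in time to `when`, or None."""
    if not mementos:
        return None
    return min(mementos, key=lambda m: abs(m[1] - when))


# Hypothetical archived copies of one original resource.
mementos = [
    ("http://archive.example/2005/page", datetime(2005, 3, 1, tzinfo=timezone.utc)),
    ("http://archive.example/2008/page", datetime(2008, 7, 15, tzinfo=timezone.utc)),
    ("http://archive.example/2010/page", datetime(2010, 1, 2, tzinfo=timezone.utc)),
]
wanted = datetime(2008, 1, 1, tzinfo=timezone.utc)

best = closest_memento(mementos, wanted)
header = accept_datetime(wanted)
print(header, "->", best[0])
```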




Original Resources and Mementos




Memento: Bridge from Present to Past




Multiple Archives




Original Resource’s Server Gone




Question 2

How long is the period between the publication of a paper
  and the archiving of a resource cited by that paper?




       Memento allows us to answer this question.




Experiment

Using Memento, check all of the links extracted from papers in
repositories to discover:
    •  Are they still resolvable at their Original URI?
    •  Are Mementos available in archives?
    •  What is the Memento-Datetime of the closest copy?
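
The first two checks above combine into four possible outcomes per URL. The sketch below shows one way to tally them. The category labels, the `classify` function, and the sample observations are illustrative assumptions, not the authors' code or data.

```python
from collections import Counter


def classify(is_live, memento_count):
    """Label a URL by whether it still resolves and whether it is archived."""
    if is_live and memento_count > 0:
        return "live and archived"
    if is_live:
        return "live, not archived"
    if memento_count > 0:
        return "archived only"
    return "lost"


# Hypothetical observations: (url, resolves today?, number of Mementos found).
observations = [
    ("http://example.org/a", True, 3),
    ("http://example.org/b", True, 0),
    ("http://example.org/c", False, 2),
    ("http://example.org/d", False, 0),
    ("http://example.org/e", True, 1),
]

summary = Counter(classify(live, n) for _, live, n in observations)
# URLs that are in an archive and/or still exist at their original location.
reachable = len(observations) - summary["lost"]
print(dict(summary), reachable)
```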

Data Set:
   •  University of North Texas Institutional Repository
        •  3595 works, 17965 unique URLs
        •  May 1999 to August 2010
   •  arXiv
        •  400144 works, 144087 unique URLs
        •  December 1993 to December 2009
   •  Total:
        •  162052 URLs, generating 306452 (URL, Paper) tuples


Experimental Process

 Links:       Extract Links -> Filter Links * -> Normalize Links
 Metadata:    Extract Metadata -> Normalize Metadata

 Output of both tracks:   (URL, Paper, Time, Subject)

 Results:                 (URL, Time, Memento-Time, Paper, Subject)

 * We filter broken and intra/inter-repository links.
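
The link extraction, filtering, and normalization steps above can be sketched roughly as follows. The regular expression, the normalization rules (lower-casing the host, dropping the fragment), and the host list in `REPOSITORY_HOSTS` are assumptions chosen for illustration. The slides do not specify how the authors implemented these steps, and `repository.example` is a placeholder host.

```python
import re
from urllib.parse import urlsplit, urlunsplit

URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+", re.IGNORECASE)
# Assumed hosts whose links count as intra/inter-repository and are dropped.
REPOSITORY_HOSTS = {"arxiv.org", "repository.example"}


def extract_links(text):
    """Find candidate URLs and trim trailing sentence punctuation."""
    return [m.rstrip(".,;:") for m in URL_PATTERN.findall(text)]


def normalize(url):
    """Lower-case scheme and host, default the path to '/', drop the fragment."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def is_repository_link(url):
    """True if the URL points at one of the assumed repository hosts."""
    host = urlsplit(url).hostname or ""
    return host in REPOSITORY_HOSTS or any(host.endswith("." + h) for h in REPOSITORY_HOSTS)


def links_for_paper(paper_id, published, subject, text):
    """Return unique (URL, Paper, Time, Subject) tuples for one paper."""
    seen = set()
    rows = []
    for raw in extract_links(text):
        url = normalize(raw)
        if is_repository_link(url) or url in seen:
            continue
        seen.add(url)
        rows.append((url, paper_id, published, subject))
    return rows


sample = ("See HTTP://Example.ORG/data.html#top, and http://arxiv.org/abs/1105.3459. "
          "Also http://example.org/data.html")
rows = links_for_paper("paper-1", "2009-05-01", "Physics", sample)
print(rows)
```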



Results: Archiving Extent per Repository


            UNT             •  72% in archives and/or still exist

                            •  High proportion of archived
                            URLs, possibly due to academic
                            level and general disciplines



           arXiv            •  78% in archives and/or still exist

                            •  45% still exist, but are not archived!
                            Possibly due to high-value, but
                            very discipline-specific references




Results: Days between Publication and Archive




    Typical long tail, but inexplicably similar curves at
    different scales for repositories.
    arXiv: 45% within a month, 80% within a year
    UNT: 48% within a month, 80% within a year
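
Figures of this kind can be derived from the publication date and the Memento-Datetime of the closest copy, as in the minimal sketch below. The dates are made up for illustration, and treating "within a month" as an absolute difference of at most 30 days (so copies archived shortly before publication also count) is an assumption, not something the slides state.

```python
from datetime import date


def days_to_archive(published, archived):
    """Signed number of days from publication to the closest archived copy."""
    return (archived - published).days


def fraction_within(delays, limit_days):
    """Share of delays whose absolute value is at most `limit_days`."""
    if not delays:
        return 0.0
    return sum(1 for d in delays if abs(d) <= limit_days) / len(delays)


# Hypothetical (publication date, Memento-Datetime of closest copy) pairs.
pairs = [
    (date(2009, 1, 10), date(2009, 1, 25)),
    (date(2009, 1, 10), date(2009, 6, 1)),
    (date(2008, 3, 1), date(2010, 3, 1)),
    (date(2007, 5, 5), date(2007, 5, 1)),
]

delays = [days_to_archive(p, a) for p, a in pairs]
within_month = fraction_within(delays, 30)
within_year = fraction_within(delays, 365)
print(delays, within_month, within_year)
```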

Results: Archiving Extent Per Discipline


                              UNT          •  Most disciplines exhibit
                                           similar behavior, except
                                           History, Journalism and
                                           English with lower
                                           percentage archived



                              arXiv        •  Most disciplines exhibit
                                           similar behavior with very
                                           low percentage archived
                                           within one month, and
                                           very high percentage still
                                           dereferenceable



Conclusions

Biggest Issues:

   •  Need access to the URIs extracted from repository resources
   •  Need a web archive of scholarly communication's context
   •  WebCite is good, but requires proactive archiving request

Proposal:

   •  Repositories should expose the links extracted from the full
   text of their resources
        •  In metadata for the resource
        •  In an Atom feed …

   •  To act as seed URL list for a (Memento compliant) web archive
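
One possible shape for such a feed is sketched below using Python's standard library. The slides do not define a feed layout, so the choice of one Atom entry per work with one `link rel="related"` element per extracted URL is an assumption, as are the `build_feed` function and all identifiers and URLs in the sample data.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)


def build_feed(feed_id, title, updated, works):
    """Serialize works and their extracted links as an Atom feed string."""
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}id").text = feed_id
    ET.SubElement(feed, f"{{{ATOM}}}title").text = title
    ET.SubElement(feed, f"{{{ATOM}}}updated").text = updated
    for work in works:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}id").text = work["id"]
        ET.SubElement(entry, f"{{{ATOM}}}title").text = work["title"]
        ET.SubElement(entry, f"{{{ATOM}}}updated").text = work["updated"]
        # One link element per URL extracted from the work's full text.
        for url in work["links"]:
            ET.SubElement(entry, f"{{{ATOM}}}link", rel="related", href=url)
    return ET.tostring(feed, encoding="unicode")


# Hypothetical repository record with two extracted links.
works = [{
    "id": "http://repository.example/works/1",
    "title": "An example work",
    "updated": "2011-06-06T00:00:00Z",
    "links": ["http://example.org/data.html", "http://example.org/tool/"],
}]
xml = build_feed("http://repository.example/links.atom",
                 "Links extracted from repository works",
                 "2011-06-06T00:00:00Z", works)
print(xml)
```

A crawler for a web archive could read the `href` values from a feed like this and use them as its seed list.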




Future Work


•  Repeat with much larger dataset
     •  JSTOR
     •  CiteSeer
     •  Astrophysics Data System
     •  RePEc
     •  PubMed
     •  arXiv
     •  10+ ETD Repositories
     •  SSRN (discussion ongoing)
     •  Your repository?

•  Investigate 45/80 similarity

•  Community support for automated scholarly web archive project



Thank You!

                •  Rob Sanderson
                     •  Twitter: @azaroth42
                     •  Email: azaroth42@gmail.com
                          or rsanderson@lanl.gov

                •  Paper: http://arxiv.org/abs/1105.3459

                •  Slides: http://slidesha.re/

                •  Memento:
                     •  http://www.mementoweb.org/
                     •  http://groups.google.com/group/memento-dev




Analyzing the Persistence of Referenced Web Resources with Memento

  • 1. Analyzing the Persistence of Referenced Web Resources with Memento
      Robert Sanderson, Mark Phillips, Herbert Van de Sompel
      http://mementoweb.org/
      Memento is funded by The Library of Congress
      Persistence of Referenced Web Resources, Open Repositories 2011, Austin TX, June 6-11
  • 2. Overview
      •  Motivating Horror Story
      •  Memento
      •  Experiment
      •  Results
      •  Conclusions/Future Work
  • 3. A Motivating Academic Horror Story
  • 4. A Motivating Academic Horror Story
  • 5. A Motivating Academic Horror Story
  • 6. A Motivating Academic Horror Story
  • 7. Another Motivating Academic Horror Story
  • 8. Another Motivating Academic Horror Story
  • 9. Question 1
      To what extent are web resources that are referenced from works in repositories still available at their original URL?
      There is significant prior art, but it is very small in scale other than Lawrence's early work on CiteSeer (see paper for references).
  • 10. Our Hero Enters the Scene!
  • 11. Question 1 (redux)
      To what extent are web resources that are referenced from works in repositories still available at their original URL … or from archives of web resources?
      Prior art is sketchy at best, as it lacks an automated method to enable discovery of archived web resources.
  • 12. Memento Framework
      Memento wants to make it easy to navigate the web of the past.
      •  Global version indicator: Time
      •  Based on the primitives of the Web: resource, representation, content negotiation, link
      •  Functionality: Given a URI and a Datetime, resolve the closest archived copy
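The resolution step described on slide 12 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Memento reference implementation: `accept_datetime` formats the desired time as an HTTP date suitable for the `Accept-Datetime` request header, and `closest_memento` is a hypothetical helper that picks the nearest copy from a list of archived versions that is already known. The archive URLs are placeholders.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(dt):
    """Format a timezone-aware datetime as an HTTP date in GMT."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


def closest_memento(mementos, target):
    """Return the (memento_datetime, uri) pair nearest to target, or None."""
    if not mementos:
        return None
    return min(mementos, key=lambda m: abs(m[0] - target))


# Hypothetical archived copies of one original resource (placeholder URLs).
copies = [
    (datetime(2009, 3, 1, tzinfo=timezone.utc), "http://archive.example.org/2009/page"),
    (datetime(2011, 5, 20, tzinfo=timezone.utc), "http://archive.example.org/2011/page"),
]
wanted = datetime(2011, 6, 6, tzinfo=timezone.utc)

# Header a client would send when asking for the copy closest to `wanted`.
headers = {"Accept-Datetime": accept_datetime(wanted)}
best = closest_memento(copies, wanted)
```

In a real deployment the selection happens on the server side; the local `closest_memento` only shows what "closest archived copy" means.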
  • 13. Original Resources and Mementos
  • 14. Memento: Bridge from Present to Past
  • 15. Memento: Bridge from Present to Past
  • 16. Multiple Archives
  • 17. Original Resource’s Server Gone
  • 18. Question 2
      How long is the period between the publication of a paper and the archiving of a resource cited by that paper?
      Memento allows us to answer this question.
  • 19. Experiment
      Using Memento, check all of the links extracted from papers in repositories to discover:
      •  Are they still resolvable at their Original URI?
      •  Are Mementos available in archives?
      •  What is the Memento-Datetime of the closest copy?
      Data Set:
      •  University of North Texas Institutional Repository
          •  3595 works, 17965 unique URLs
          •  May 1999 to August 2010
      •  arXiv
          •  400144 works, 144087 unique URLs
          •  December 1993 to December 2009
      •  Total:
          •  162052 URLs, generating 306452 (URL, Paper) tuples
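The first two checks on slide 19 amount to sorting every URL into one of four categories and reporting the share of each. The sketch below shows that tallying step only; the category names, the `classify` and `tally` helpers, and the sample records are all hypothetical, and the live/archived flags are assumed to have been collected beforehand.

```python
from collections import Counter


def classify(live, archived):
    """Map the two boolean checks for a URL onto one category label."""
    if live and archived:
        return "live and archived"
    if live:
        return "live only"
    if archived:
        return "archived only"
    return "lost"


def tally(checks):
    """checks: iterable of (url, live, archived). Returns percent per category."""
    counts = Counter(classify(live, archived) for _, live, archived in checks)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Made-up records, one per category, purely for illustration.
sample = [
    ("http://example.org/a", True, True),
    ("http://example.org/b", True, False),
    ("http://example.org/c", False, True),
    ("http://example.org/d", False, False),
]
shares = tally(sample)
# "In archives and/or still exist" is everything except the "lost" category.
available = 100.0 - shares.get("lost", 0.0)
```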
  • 20. Experimental Process
      [Diagram] Two pipelines: Extract Links, Filter Links*, Normalize Links; Extract Metadata, Normalize Metadata.
      Tuples shown: (URL, Paper, Time, Subject); Results: (URL, Time, Memento-Time, Paper, Subject)
      * We filter broken and intra/inter-repository links.
  • 21. Results: Archiving Extent per Repository
      UNT
      •  72% in archives and/or still exist
      •  High proportion of archived URLs, possibly due to academic level and general disciplines
      arXiv
      •  78% in archives and/or still exist
      •  45% still exist, but not archived! Possibly due to high value, but very discipline-specific references
  • 22. Results: Days between Publication and Archive
      Typical long tail, but inexplicably similar curves at different scales for repositories.
      •  arXiv: 45% within a month, 80% within a year
      •  UNT: 48% within a month, 80% within a year
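The "within a month" and "within a year" figures on slide 22 are cumulative shares over the gap between a paper's date and the Memento-Datetime of the closest archived copy. The sketch below shows one way to compute such shares. It assumes the gap is measured as an absolute number of days and that a month means 30 days; the slides do not state either choice, and the dates are invented for illustration.

```python
from datetime import date


def days_to_archive(published, memento_date):
    """Absolute gap in days between publication and the closest archived copy."""
    return abs((memento_date - published).days)


def share_within(gaps, limit_days):
    """Percentage of gaps that are at most limit_days long."""
    if not gaps:
        return 0.0
    return 100.0 * sum(1 for g in gaps if g <= limit_days) / len(gaps)


# Invented (publication date, Memento date) pairs.
pairs = [
    (date(2009, 1, 10), date(2009, 1, 25)),
    (date(2009, 1, 10), date(2009, 6, 10)),
    (date(2008, 5, 1), date(2010, 5, 1)),
    (date(2009, 3, 1), date(2009, 2, 20)),
]
gaps = [days_to_archive(p, m) for p, m in pairs]
within_month = share_within(gaps, 30)
within_year = share_within(gaps, 365)
```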
  • 23. Results: Archiving Extent Per Discipline
      UNT
      •  Most disciplines exhibit similar behavior, except History, Journalism and English, which have a lower percentage archived
      arXiv
      •  Most disciplines exhibit similar behavior, with a very low percentage archived within one month and a very high percentage still dereferenceable
  • 24. Conclusions
      Biggest Issues:
      •  Need access to the URIs extracted from repository resources
      •  Need a web archive of scholarly communication's context
      •  WebCite is good, but requires a proactive archiving request
      Proposal:
      •  Repositories should expose the links extracted from the full text of their resources
          •  In metadata for the resource
          •  In an Atom feed …
      •  To act as a seed URL list for a (Memento-compliant) web archive
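As an illustration of the Atom feed option in the proposal, the sketch below builds a small feed with one entry per repository work and one `link` element per extracted URL. The layout is an assumption made for this example, not a format specified in the talk: the use of `rel="related"`, the feed title, and all identifiers and URLs are placeholders, and the feed omits elements (such as authors) that a complete Atom document would carry.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize Atom as the default namespace


def q(tag):
    """Qualify a tag name with the Atom namespace."""
    return "{%s}%s" % (ATOM, tag)


def links_feed(feed_id, updated, papers):
    """papers: list of (paper_uri, title, cited_urls). Returns Atom XML text."""
    feed = ET.Element(q("feed"))
    ET.SubElement(feed, q("id")).text = feed_id
    ET.SubElement(feed, q("title")).text = "Links extracted from full text"
    ET.SubElement(feed, q("updated")).text = updated
    for paper_uri, title, cited_urls in papers:
        entry = ET.SubElement(feed, q("entry"))
        ET.SubElement(entry, q("id")).text = paper_uri
        ET.SubElement(entry, q("title")).text = title
        ET.SubElement(entry, q("updated")).text = updated
        for url in cited_urls:
            # Assumed convention: each cited URL becomes a rel="related" link.
            ET.SubElement(entry, q("link"), {"rel": "related", "href": url})
    return ET.tostring(feed, encoding="unicode")


xml_text = links_feed(
    "http://repository.example.org/feeds/links",
    "2011-06-06T00:00:00Z",
    [("http://repository.example.org/work/1", "Example work",
      ["http://example.org/a", "http://example.org/b"])],
)
```

A crawler for a web archive could read the `href` values from such a feed as its seed list.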
  • 25. Future Work
      •  Repeat with much larger dataset
          •  JSTOR
          •  CiteSeer
          •  Astrophysics Data System
          •  RePEc
          •  PubMed
          •  arXiv
          •  10+ ETD Repositories
          •  SSRN (discussion ongoing)
          •  Your repository?
      •  Investigate 45/80 similarity
      •  Community support for automated scholarly web archive project
  • 26. Thank You!
      •  Rob Sanderson
      •  Twitter: @azaroth42
      •  Email: azaroth42@gmail.com or rsanderson@lanl.gov
      •  Paper: http://arxiv.org/abs/1105.3459
      •  Slides: http://slidesha.re/
      •  Memento:
          •  http://www.mementoweb.org/
          •  http://groups.google.com/group/memento-dev