Analyzing the Persistence of Referenced Web Resources with Memento
1. Analyzing the Persistence of Referenced Web Resources with Memento
Robert Sanderson
Mark Phillips
Herbert Van de Sompel
http://mementoweb.org/
Memento is funded by the Library of Congress
Persistence of Referenced Web Resources
Open Repositories 2011, Austin TX, June 6-11
2. Overview
• Motivating Horror Story
• Memento
• Experiment
• Results
• Conclusions/Future Work
3. A Motivating Academic Horror Story
4. A Motivating Academic Horror Story
5. A Motivating Academic Horror Story
6. A Motivating Academic Horror Story
7. Another Motivating Academic Horror Story
8. Another Motivating Academic Horror Story
9. Question 1
To what extent are web resources that are referenced from
works in repositories still available at their original URL?
Significant prior art, but very small in scale apart from
Lawrence's early work on CiteSeer.
(See paper for references)
10. Our Hero Enters the Scene!
11. Question 1 (redux)
To what extent are web resources that are referenced from
works in repositories still available at their original URL …
or from archives of web resources?
Prior art is sketchy at best, as it lacked an automated method
for discovering archived web resources.
12. Memento Framework
Memento aims to make it easy to navigate the web of the past
• Global version indicator: Time
• Based on the primitives of the Web: resource, representation, content negotiation, link
• Functionality: Given a URI and a Datetime, resolve the closest archived copy
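A minimal sketch of what "given a URI and a Datetime" can look like on the wire, assuming the Accept-Datetime request header used for Memento datetime negotiation. The TimeGate address and the pattern of appending the original URI to it are illustrative assumptions, not a real service; the sketch only builds the request and sends nothing.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_request(timegate_base: str, original_uri: str, when: datetime):
    """Return (url, headers) for a datetime-negotiation request.

    timegate_base is a hypothetical TimeGate prefix; real archives
    publish their own endpoints.
    """
    # Accept-Datetime carries the desired datetime as an HTTP-date in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return timegate_base + original_uri, {"Accept-Datetime": http_date}


url, headers = timegate_request(
    "http://timegate.example.org/",  # assumed endpoint, for illustration only
    "http://mementoweb.org/",
    datetime(2010, 8, 1, tzinfo=timezone.utc),
)
print(url)
print(headers)
```

The returned pair can be passed to any HTTP client; the server's response would then point to the archived copy closest to the requested datetime.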
13. Original Resources and Mementos
14. Memento: Bridge from Present to Past
15. Memento: Bridge from Present to Past
17. Original Resource’s Server Gone
18. Question 2
How long is the period between the publication of a paper
and the archiving of a resource cited by that paper?
Memento allows us to answer this question.
19. Experiment
Using Memento, check all of the links extracted from papers in
repositories to discover:
• Are they still resolvable at their Original URI?
• Are Mementos available in archives?
• What is the Memento-Datetime of the closest copy?
Data Set:
• University of North Texas Institutional Repository
  • 3595 works, 17965 unique URLs
  • May 1999 to August 2010
• arXiv
  • 400144 works, 144087 unique URLs
  • December 1993 to December 2009
• Total:
  • 162052 URLs, generating 306452 (URL, Paper) tuples
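The three checks listed for the experiment can be sketched per URL as below. This is an illustration only: the function names and category labels are hypothetical, the live check and the list of Memento-Datetimes are assumed to have been gathered already, and "closest" is assumed to mean nearest in time to the paper's publication date.

```python
from datetime import datetime


def closest_memento(published: datetime, memento_datetimes: list):
    """Return the Memento-Datetime nearest to the publication date, or None."""
    if not memento_datetimes:
        return None
    return min(memento_datetimes, key=lambda m: abs(m - published))


def classify(is_live: bool, memento_datetimes: list) -> str:
    """Label a URL by whether it still resolves and whether it is archived."""
    archived = bool(memento_datetimes)
    if is_live and archived:
        return "live and archived"
    if is_live:
        return "live, not archived"
    if archived:
        return "gone, but archived"
    return "lost"


# Made-up example values, not data from the study.
published = datetime(2005, 3, 1)
mementos = [datetime(2004, 1, 10), datetime(2005, 4, 2), datetime(2009, 7, 7)]
print(classify(False, mementos), closest_memento(published, mementos))
```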
20. Experimental Process
• Links: Extract Links → Filter Links* → Normalize Links
• Metadata: Extract Metadata → Normalize Metadata
• Tuples: (URL, Paper, Time, Subject)
• Results: (URL, Time, Memento-Time, Paper, Subject)
* We filter broken and intra/inter-repository links.
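The extract, filter and normalize steps could be sketched as follows. This is not the authors' code: the URL pattern, the normalization rules, the "no dot in the host name" test for broken links, and the repository_hosts parameter are all simplifying assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed, deliberately simple URL pattern.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def normalize(url: str) -> str:
    """Lower-case scheme and host, strip trailing punctuation and fragment."""
    parts = urlsplit(url.rstrip(".,;"))
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def extract_tuples(text, paper_id, published, subject, repository_hosts):
    """Return sorted, de-duplicated (URL, Paper, Time, Subject) tuples."""
    tuples = set()
    for raw in URL_RE.findall(text):
        url = normalize(raw)
        host = urlsplit(url).hostname
        if not host or "." not in host:
            continue  # treat as a broken link
        if host in repository_hosts:
            continue  # intra/inter-repository link
        tuples.add((url, paper_id, published, subject))
    return sorted(tuples)


text = "See http://Example.org/data. Also http://arxiv.org/abs/1105.3459 and http://broken"
print(extract_tuples(text, "paper-1", "2010-01", "Physics", {"arxiv.org"}))
```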
21. Results: Archiving Extent per Repository
UNT
• 72% in archives and/or still exist
• High proportion of archived URLs, possibly due to academic level and general disciplines
arXiv
• 78% in archives and/or still exist
• 45% still exist, but not archived! Possibly due to high-value but very discipline-specific references
22. Results: Days between Publication and Archive
Typical long tail, but inexplicably similar curves for the two
repositories despite their different scales.
arXiv: 45% within a month, 80% within a year
UNT: 48% within a month, 80% within a year
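Percentages of this kind can be derived from (publication date, Memento-Datetime) pairs. The sketch below uses made-up dates rather than the study's data, and it assumes that "a month" means 30 days, "a year" means 365 days, and that the gap is measured in either direction from publication.

```python
from datetime import date


def fraction_within(pairs, max_days):
    """Share of (published, archived) pairs whose gap is at most max_days."""
    gaps = [abs((archived - published).days) for published, archived in pairs]
    return sum(gap <= max_days for gap in gaps) / len(gaps)


# Hypothetical example pairs: (publication date, closest Memento-Datetime).
pairs = [
    (date(2009, 1, 1), date(2009, 1, 15)),
    (date(2009, 1, 1), date(2009, 6, 1)),
    (date(2009, 1, 1), date(2011, 1, 1)),
    (date(2009, 1, 1), date(2008, 12, 20)),
]
print(fraction_within(pairs, 30))   # share archived within a month
print(fraction_within(pairs, 365))  # share archived within a year
```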
23. Results: Archiving Extent Per Discipline
UNT
• Most disciplines exhibit similar behavior, except History, Journalism and English, which have a lower percentage archived
arXiv
• Most disciplines exhibit similar behavior, with a very low percentage archived within one month and a very high percentage still dereferenceable
24. Conclusions
Biggest Issues:
• Need access to the URIs extracted from repository resources
• Need a web archive of scholarly communication's context
  • WebCite is good, but requires a proactive archiving request
Proposal:
• Repositories should expose the links extracted from the full text of their resources
  • In metadata for the resource
  • In an Atom feed …
  • To act as a seed URL list for a (Memento-compliant) web archive
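One way the Atom option might look is sketched below. The slides do not specify a format, so the use of link elements with rel="related" (a link relation defined for Atom) is an assumption, as are the function name and example URIs. The entry is also incomplete: a valid Atom entry additionally needs an updated element, among others.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"


def atom_entry(work_uri, title, extracted_links):
    """Serialize one repository work with its extracted links as an Atom entry."""
    ET.register_namespace("", ATOM)  # emit Atom as the default namespace
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = work_uri
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    for href in extracted_links:
        # rel="related" is an assumed choice for "URL referenced by this work".
        ET.SubElement(entry, f"{{{ATOM}}}link", {"rel": "related", "href": href})
    return ET.tostring(entry, encoding="unicode")


print(atom_entry(
    "http://repository.example.org/work/1",  # hypothetical work URI
    "An example work",
    ["http://example.org/a", "http://example.org/b"],
))
```

A crawler for a web archive could read such a feed and use the href values as its seed list.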
25. Future Work
• Repeat with a much larger dataset
  • JSTOR
  • CiteSeer
  • Astrophysics Data System
  • RePEc
  • PubMed
  • arXiv
  • 10+ ETD Repositories
  • SSRN (discussion ongoing)
  • Your repository?
• Investigate the 45/80 similarity between repositories
• Community support for an automated scholarly web archive project
26. Thank You!
• Rob Sanderson
• Twitter: @azaroth42
• Email: azaroth42@gmail.com or rsanderson@lanl.gov
• Paper: http://arxiv.org/abs/1105.3459
• Slides: http://slidesha.re/
• Memento:
  • http://www.mementoweb.org/
  • http://groups.google.com/group/memento-dev