This SAA 2014 lightning talk session (Session 703, http://sched.co/1hIEcE2) highlights challenges in, and solutions for, promoting access to and discovery of web archives. Speakers discussed descriptive strategies for integrating web archives with EAD finding aids, MARC records in library catalogs, and other discovery methods and tools.
Improving Access to Web Archives: Models, Metadata, and Integration
1. From Crawling to Walking:
Improving Access to Web Archives
SAA 2014
Session 703
2. From Crawling to Walking:
Improving Access to Web Archives
1. Jane Zhang
2. Michael Paulmeno
3. Meg Tuomala
4. Benn Joseph
5. Polina Ilieva
6. Jennifer Wright
7. John Bence
8. Olga Virakhovskaya
9. Anna Perricci
10. Rick Fitzgerald
11. Rosalie Lack
4. Web Records,
Web Archived Files, and
Web Archives Access Models
Jane Zhang, Catholic University of America
Session 703 - From Crawling to Walking:
Improving Access to Web Archives
SAA 2014, Washington DC
Saturday, August 16
5. Introduction
Web as records
The Web ARChive files as
recordkeeping formats
Web archives access models
6. Web Archiving Initiatives
• A survey on web archiving initiatives
–Daniel Gomes et al., Foundation for
National Scientific Computing, Portuguese
Web Archive Team
–International Conference on Theory and
Practice of Digital Libraries 2011, 25-29
September 2011
• Wikipedia: List of Web archiving
initiatives
7. Web Archiving Initiatives
A survey on web archiving initiatives
(2011)
42 web archiving initiatives worldwide
9 initiatives from the United States
List of Web Archiving Initiatives
(July 2014)
70 web archiving initiatives worldwide
17 initiatives from the United States
8. Web File Formats
2011 Worldwide Survey
The ARC and WARC formats are
dominant, being used by 54% of the
initiatives.
2014 List – USA
10 out of 17 initiatives identified as
using the ARC and/or WARC formats
59% of the US Web archiving initiatives
9. Web Archives Access Models
2011 Worldwide Survey
89% support access to URL history
79% enable searching metadata
67% provide full-text search over archived
content
2014 List – USA
URL history: 12 out of 17 – 71%
Metadata: 13 out of 17 – 76%
Full-text: 12 out of 17 – 71%
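The US shares quoted on this slide and the previous one can be rechecked from the raw counts. A minimal Python sketch, assuming only the counts given above; the dictionary labels and the function name are illustrative and not taken from the survey. Shares are rounded to the nearest whole number, so a figure that was truncated instead can come out one point lower.

```python
# Counts quoted on the slides: US initiatives on the July 2014 list.
TOTAL_US_INITIATIVES = 17

# Labels are illustrative shorthand for the slide categories.
us_counts = {
    "ARC/WARC formats": 10,
    "URL history": 12,
    "Metadata search": 13,
    "Full-text search": 12,
}


def share(count, total):
    """Return count/total as a percentage rounded to the nearest whole number."""
    return round(100 * count / total)


us_shares = {label: share(n, TOTAL_US_INITIATIVES) for label, n in us_counts.items()}

for label, pct in us_shares.items():
    print(f"{label}: {us_counts[label]} out of {TOTAL_US_INITIATIVES} = {pct}%")
```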
10. Metadata:
Theme-based Collections
Collection overview, name, title,
subject, abstract, language, year
captured
Site title, subject, place, language
Collection description, keyword, filter
by site title, and/or file type, topic
group
Catalog records (collection or website)
11. Metadata:
Provenance-based Collections
Site owner, business activity, topic, sub-
topic, region, country, language, year
created, date archived
Collection/series description, site title
Keyword search, browse by agency
Collection description, title keyword,
browse by agency name, government
branch, or agency expiration date
Browse by region, then site owner
15. Overview
• Many challenges to making web archives
accessible
• Archival description not fully compatible with
library catalogs
• Problem not unique to web archives
• Differing metadata and content standards lead
to separation between libraries and archives
(i.e. silos)
• Researchers who access archives through
library systems tend to use them longer 1
1 Noah Huffman, “More than Just Linking: Integrating MARC and EAD in a Single Discovery Interface at Duke, UNC-Chapel
Hill, and NCSU”, 14
16. The Current State of Affairs
• Collections accessed through multiple access
points
• Subject headings 2
• Many organizations create two descriptions
and link via MARC 856 field; this can cause
confusion 3
• Yet significant discovery occurs through search
engines 4
2 Michelle Mascaro, “Controlled Access Headings in EAD Finding Aids: Current Practices in Number of and Types of Headings
Assigned,” 223.
3 Noah Huffman, “More than Just Linking: Integrating MARC and EAD in a Single Discovery Interface at Duke, UNC-Chapel
Hill, and NCSU,” 3–5.
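The linking practice mentioned on this slide, a catalog record that points to a separate finding aid through MARC field 856 (Electronic Location and Access), can be illustrated with a short sketch. It renders the field as a plain text line rather than using a MARC library. The URL is hypothetical, and the choice of second indicator 2 (related resource) with subfield $3 "Finding aid" is one common pattern, assumed here rather than taken from the talk.

```python
def format_856(url, materials="Finding aid", first_indicator="4", second_indicator="2"):
    """Render a MARC 856 field as a human-readable text line.

    first_indicator "4" means the access method is HTTP.
    second_indicator "2" marks the link as a related resource
    (an assumption for this sketch; local practice varies).
    $3 names the materials the link covers, $u carries the URI.
    """
    return f"856 {first_indicator}{second_indicator} $3 {materials} $u {url}"


# Hypothetical finding aid URL, used only for illustration.
finding_aid_url = "https://example.org/findingaids/collection-0001"
field_856 = format_856(finding_aid_url)
print(field_856)
```

Because the catalog record and the finding aid remain two separate descriptions joined only by this link, a user has to follow the URL to see the fuller archival context.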
17. Challenges to Integration
• MARC records lack detail5 6
• Archivists uncertain about readiness to adopt
new standards 7
• Many different systems (ArchivesSpace, Ebsco
Discovery, Blacklight, various Integrated Library
Systems) and metadata standards
• Other challenges specific to web archives
• Ex. How to represent a continuously
accessioned resource?
5 Peter Carini and Kelcy Shepherd, “The MARC Standard and Encoded Archival Description,” 19.
6 Karen F. Gracy and Frank Lambert, “Who’s Ready to Surf the Next Wave? A Study of Perceived Challenges to Implementing
New and Revised Standards for Archival Description,” 102.
7 Ibid., 117.
18. Towards the Future
• Increasing efforts to integrate archival
description and library catalogs
– University of Denver Penrose Library 8
– Triangle Research Libraries Network 9
– Library of Congress
– UNC Chapel Hill
• Adaptability key to future collaboration
• What affects archives, affects web archives
as well
8 Gregory C. Colati, Katherine M. Crowe, and Elizabeth S. Meagher, “Better, Faster, Stronger: Integrating Archives Processing
and Technical Services.”
9 Noah Huffman, “More than Just Linking: Integrating MARC and EAD in a Single Discovery Interface at Duke, UNC-Chapel
Hill, and NCSU.”
19. Works Cited
• Carini, Peter, and Kelcy Shepherd. “The MARC Standard and Encoded Archival Description.” Library
Hi Tech 22, no. 1 (2004): 18–27. doi:10.1108/07378830410524468.
• Colati, Gregory C., Katherine M. Crowe, and Elizabeth S. Meagher. “Better, Faster, Stronger:
Integrating Archives Processing and Technical Services.” Library Resources and Technical Services 53,
no. 4 (October 2009): 261–270.
• Gracy, Karen F., and Frank Lambert. “Who’s Ready to Surf the Next Wave? A Study of Perceived
Challenges to Implementing New and Revised Standards for Archival Description.” The American
Archivist 77, no. 1 (Spring/Summer 2014): 96–132.
• Huffman, Noah. “More than Just Linking: Integrating MARC and EAD in a Single Discovery Interface
at Duke, UNC-Chapel Hill, and NCSU.” Journal for the Society of North Carolina Archivists 8, no. 2
(April 2011): 2–17.
• Mascaro, Michelle. “Controlled Access Headings in EAD Finding Aids: Current Practices in Number of
and Types of Headings Assigned.” Journal of Archival Organization 9, no. 3–4 (January 2011): 208–225.
doi:10.1080/15332748.2011.643690.
21. Different strokes for
different folks / Meeting the
descriptive & access needs of multiple web
archive collections / With minimal workflow and
process change
Meg Tuomala
Assistant archivist, Gates Archive
Formerly e-records archivist at UNC-Chapel Hill
22. Web archiving at UNC: context
● Started in 2013; using Archive-It
● 6 web archive collections
● Extension of / supplement to existing
collections
● Special collections at UNC consolidated;
archival & biblio tech services are one
dept
23. Different folks: the collections
Biblio
● North Carolina Collection
● Rare Book Collection
● Digital Artists’ File
27. WASsup?: Describing Web
Archives Using Archon
SAA Washington, D.C.
August 16, 2014
Benn Joseph
Manuscript Librarian
Northwestern University Library
b-joseph@northwestern.edu
30. NU version of Archon:
• Only used for collection
management
• Separate Blacklight/Solr public
interface that searches and displays
the finding aids
• Finding aids all live in a Fedora
repository
• “Ingest EAD” button added to
Archon, which puts the XML into Fedora
to be served via the finding aids portal
38. August 16, 2014
Polina Ilieva, UCSF Archives & Special Collections
Science Online:
Evaluating usage, impact and
appraisal
Since they’re so easily
accessible, lab websites
are used as reference
tools by lab members
Sharing datasets
Channels for scholarly
communications
After funding ends
website can be the only
place where the data is
preserved and available
Why collect?
40. Not just preserved for
future use, scientists
need instant access
Websites become
integral part of scientific
scholarly output
Impact
41. Curation and Appraisal
How to select from hundreds
of labs?
Web Archive pilot project in
collaboration with the library’s
Research Informationist:
Research @UCSF collection
Will use UCSF Profiles:
Research Networking and
Expertise Mining Tool
Collect and analyze info about
faculty and researchers who
lead labs: the length of
service/title, # of scholarly
publications, availability of
websites, grants and awards.
42. Protocols
Data
Images
Lectures (a/v)
Publications
List of lab members
What to collect?
47. Square Peg in a Round Hole:
Integrating Web Archives into
Existing Descriptive Practices
Jennifer Wright
Archives and Information Management Team
Leader
SAA 2014
Session 703
wrightjm@si.edu
siarchives.si.edu
48. Accession-based Collections
Management
• Each transfer is separate accession
• Each accession cataloged separately in CMS
• Each accession has own finding aid
Solution for websites:
Crawls with similar dates and the same creator are
combined into one accession
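The grouping rule above (crawls with similar dates and the same creator become one accession) can be sketched in Python. This is an illustration under stated assumptions, not the Smithsonian's actual procedure: the 30-day window for "similar dates", the tuple layout, the function name, and the sample data are all hypothetical.

```python
from datetime import date, timedelta
from itertools import groupby

# Assumption: crawls by one creator less than 30 days apart count as "similar dates".
MAX_GAP = timedelta(days=30)


def group_crawls(crawls, max_gap=MAX_GAP):
    """Group (creator, crawl_date, url) tuples into accessions.

    A new accession starts when the creator changes or when the gap
    since the previous crawl by the same creator exceeds max_gap.
    """
    accessions = []
    ordered = sorted(crawls, key=lambda c: (c[0], c[1]))
    for _creator, items in groupby(ordered, key=lambda c: c[0]):
        current = []
        for crawl in items:
            if current and crawl[1] - current[-1][1] > max_gap:
                accessions.append(current)
                current = []
            current.append(crawl)
        accessions.append(current)
    return accessions


# Hypothetical sample data.
crawls = [
    ("Museum A", date(2014, 1, 10), "http://example.org/a"),
    ("Museum A", date(2014, 1, 12), "http://example.org/a/blog"),
    ("Museum A", date(2014, 6, 1), "http://example.org/a"),
    ("Museum B", date(2014, 1, 11), "http://example.org/b"),
]
accessions = group_crawls(crawls)
for number, accession in enumerate(accessions, start=1):
    print(number, [(creator, str(day)) for creator, day, _url in accession])
```

Each resulting group would then be cataloged and given its own finding aid, in line with the accession-based practice described on this slide.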
49. Description and Cataloging
• Describes each
website/blog in
accession
• Notes technical and
other issues
• Includes crawl date(s)
• Indexes subjects,
website/blog/
exhibition titles, and
other creators
50. EAD Finding Aid
• Includes descriptive
data from CMS
• Lists each
website/blog
included in
accession
• Uses DAO tag to
link to crawl on
Archive-It
Search on “Website Records” at
http://siarchives.si.edu/search/sia_search_findingaids
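As an illustration of the DAO link described above, the following Python sketch builds one finding aid component whose `<dao>` element points to a crawl. It is a hedged example, not the Smithsonian's encoding: the component level, title, and URL are hypothetical, and the `xlink:href` attribute follows the schema version of EAD 2002 (the DTD version uses a plain `href` attribute instead).

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def make_component(title, crawl_url):
    """Build a <c01> component with a <dao> that links to an archived crawl."""
    c01 = ET.Element("c01", {"level": "item"})
    did = ET.SubElement(c01, "did")
    ET.SubElement(did, "unittitle").text = title
    # The href attribute carries the link to the crawl.
    ET.SubElement(did, "dao", {f"{{{XLINK_NS}}}href": crawl_url})
    return c01


# Hypothetical title and URL, used only for illustration.
crawl_url = "https://example.org/archived-crawl/museum-blog"
component = make_component("Example museum blog", crawl_url)
xml_text = ET.tostring(component, encoding="unicode")
print(xml_text)
```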
51. Archive-It
• Browse URLs
• Search across all
Smithsonian
crawls
• Search by
keyword or
limiting options
• Plan to take
better advantage
of metadata
Smithsonian on Archive-It:
https://archive-it.org/organizations/660
58. Next steps
• UX testing on finding aids integration vs. local
search page
• Gather (read: develop) additional use analytics
• For more go to:
• http://marbl.library.emory.edu/collections/archives/web.html
• http://findingaids.library.emory.edu/
Google analytics for
search interface from
Feb 2013 to June
2014. Page went live
in June 2013.
• #1 referral:
Redirected URL
of single web
archive
• #2 referral:
MARBL website
search interface
• #3 referral:
finding aids
database
Thanks!
60. Describing <archived> web content
from single sites to web archives
Olga Virakhovskaya
volga@umich.edu
http://bentley.umich.edu/
61. Local subject heading (MARC fields 690)
LC subject headings (MARC fields 6xx)
MARC field 260/264
MARC fields 1xx/7xx
MARC fields 520 &
545 / History & Scope
and Content notes
MARC field 245
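The field list on this slide can be read as a mapping from descriptive elements to MARC tags. A minimal Python sketch of that mapping follows; the element labels are glosses added for illustration, and the sample record is entirely hypothetical.

```python
# Descriptive element -> MARC tag(s), following the fields listed on the slide.
# The element labels are illustrative glosses, not wording from the talk.
marc_mapping = [
    ("Title", "245"),
    ("Creator / added entries", "1xx/7xx"),
    ("Publication / production", "260/264"),
    ("Scope and content note", "520"),
    ("History note", "545"),
    ("LC subject headings", "6xx"),
    ("Local subject heading", "690"),
]


def describe(record):
    """Return one text line per element that has a value, prefixed by its MARC tag."""
    lines = []
    for element, tag in marc_mapping:
        value = record.get(element)
        if value:
            lines.append(f"{tag:<8}{element}: {value}")
    return lines


# Hypothetical description of a single archived website.
sample_record = {
    "Title": "Example Department website",
    "Scope and content note": "Archived captures of the department's public website.",
    "Local subject heading": "Web archives",
}
record_lines = describe(sample_record)
print("\n".join(record_lines))
```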
62. – Think BIG
– Automate
– Follow standards
– Be consistently clear
– Communicate
– Be human
…because machines don’t know everything
64. MARC records for the Contemporary
Composers Web Archive
Anna Perricci
Columbia University Libraries
SAA Lightning Talk (August 16, 2014)
65. Web Archiving at Columbia
We’ve only got 5 minutes!
• Columbia University
Libraries web archiving
program precedents
• Current Mellon grant
• Collaborative web archiving
66. Contemporary Composers Web Archive
Selectors
• Borrow Direct Music Librarians Group: music librarians at Brown,
Columbia, Cornell, Dartmouth, Harvard, Johns Hopkins, Princeton,
and Yale universities, MIT, and the universities of Chicago and
Pennsylvania
Cataloging expertise
• Russell Merritt (cataloger specializing in music resources)
• Kate Harcourt (Director of Original and Special Materials Cataloging)
• Alex Thurman (Web Resources Collection Coordinator)
68. Creating MARC records for web archives
• Creating MARC records for
archived websites is
standard practice at CUL
– MARC records make web
archives discoverable in
CLIO (Columbia Libraries
Information Online)
• Collection level and seed
level records
• Will use Archive-It interface
to make Dublin Core records
71. Anticipating wider use of MARC records
• Records have been released
to WorldCat
• Collaborators on cataloging
were attentive to which
fields will ordinarily be
stripped out when a MARC
record is imported to
another institution’s OPAC
72. Conclusions
• So far sample of 10 records
has taught us…
• Positive feedback from
music librarians
• Next we will add another 44
records for the archived
sites in CCWA
76. Migration effort
• Began in 2013, ongoing
• Move web archives from stand-alone web
application at http://loc.gov/lcwa to
library-wide discovery system at
http://loc.gov/websites/
• Metadata and content migration
• Cross-functional team effort
78. New Possibilities
• Web archives discoverable alongside other LC
collections for first time
• Web archives searchable from LC main page
for first time – greater visibility
• Consistent navigation; look and feel mirror the LC
website
80. New Challenges
• Thousands of MODS records already created
for access; how to repurpose them?
• Different interfaces, different needs
• Enable new ideas (combined records)
• Keeping useful elements, old and new
83. From Crawling to Walking:
Improving Access to Web Archives
SAA 2014
Rosalie Lack
rosalie.lack@ucop.edu
84. SAA Web Archiving Roundtable
Follow the blog!
• http://webarchivingrt.wordpress.com/
Learn more!
• http://www2.archivists.org/groups/web-archiving-roundtable
86. What We’re Doing
• Creating finding aids for each web archive
• Adding links to existing finding aids for the
relevant archived sites
• Providing a web archive collection search page
• Uploading records into library catalogs
• Sending records to OCLC
• Building collaborative collections and providing
unified access
• Integrating access with other formats in our
discovery systems
88. Image credits
Title: The razing of silos on the former Roy Ranch, San Geronimo,
California, May, 1964 [photograph]
Creator/Contributor: unknown
Date: May, 1964
Contributing Institution: Marin County Free Library
http://content.cdlib.org/ark:/13030/kt3489r96r/?order=1
http://content.cdlib.org/ark:/13030/kt067nf0kk/?order=1
http://content.cdlib.org/ark:/13030/kt467nf1dq/?order=1