Digital History seminar
4 November 2014
Live Stream: http://ihrdighist.blogs.sas.ac.uk/2014/10/28/tuesday-4-november-interrogating-the-archived-uk-web-historians-and-social-scientists-research-experiences/
1. Interrogating the
archived UK web
“RNIB”
Gareth Millward – gareth.millward@lshtm.ac.uk – Centre for History in Public Health
Improving health worldwide
http://history.lshtm.ac.uk
2. “The best-laid schemes
o’ mice an’ men…
• Original plan to investigate
the presence of information
for disabled people on the
UK web
• Also to look at the
accessibility of that info
through the Web Content
Accessibility Guidelines 1.0 (1999)
• Search for major
organisations and key
disability words
• Run sample through
validation tools
Pieter Bruegel the Elder, The Tower of Babel (Vienna), Google Art
Project (edited). From Wikipedia.
3. … Gang aft
agley.”
• Far too much stuff!
• Search terms such as “RADAR”,
“SCOPE” and “MIND”
obviously… problematic…
• No discernible pattern from
code validation
• “Experience” of using screen
readers impossible (for now)*
• Defining “information” or
“reach” not a simple task
• Still major problems with
assessing “importance” and
“relevance”
* At least within the design scope of this project!
Macintosh Performa 5200, a mid-90s Apple
computer. From Wikipedia.
8. The trouble
begins - links
Links to                 Instances
-> rnib.org.uk             259,421
-> w3.org                   71,798
-> mla.gov.uk               34,435
-> openharmonise.org        32,071
-> facebook.com             31,098
• Disaggregated statistics are
basically meaningless
• Second most common link
target is W3.org, which has
virtually nothing to do with the
actual activities of RNIB
• openharmonise.org – the CMS
for mla.gov.uk. Reflects
references on MLA site, not
the activity of RNIB
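A minimal sketch of how a per-domain tally like the one on this slide could be produced. It assumes the link data is available as (source URL, target URL) pairs; the sample pairs and the function names are hypothetical and are not the format of the actual dataset.

```python
from collections import Counter
from urllib.parse import urlsplit


def target_domain(url):
    """Return the host of a URL in lower case, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def tally_link_targets(links):
    """Count how often each target domain appears in (source, target) pairs."""
    return Counter(target_domain(target) for _source, target in links)


# Hypothetical sample; a real run would read the pairs from the archive's link data.
sample_links = [
    ("http://example.org.uk/a", "http://www.rnib.org.uk/"),
    ("http://example.org.uk/b", "http://www.w3.org/WAI/"),
    ("http://example.ac.uk/c", "http://www.rnib.org.uk/help"),
]

counts = tally_link_targets(sample_links)
for domain, instances in counts.most_common():
    print(f"-> {domain}\t{instances}")
```

A tally of this kind only says where links point. It says nothing about why a page links there, which is the problem the bullets above describe.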
10. Commensurability goes
out the window…
• Once you start filtering out the
areas that aren’t “really” part
of your search, it becomes
impossible to compare one
search term with another.
• You will lose “useful”
information and keep
“useless” stuff
• Can begin to build a “human
readable” corpus – but what
the heck do I actually have
here? Certainly not what I
originally intended to look at…
xkcd: Thesis Defence
11. Whittling down
• REMOVED LINKS TO W3.org (usually just a mention of WAI)
• REMOVED RNIB.org.uk (I can browse the main site – more interested
in external material)
• REMOVED 2009 & 2010 (made the sample smaller, and these use
different crawling system)
• REMOVED RNIB.co.uk
• REMOVED big-print.co.uk
• REMOVED MLA.gov.uk (mentions RNIB a lot, but becomes noise)
• The result of all this? The corpus is down to 71,112
• (Actually, by reducing the date range further and adding a couple of
extra tweaks, now down to 39,270)
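The removals listed on this slide can be sketched as a single filter. This is an illustration under stated assumptions, not the procedure actually used: each record is assumed to carry a "host" and a crawl "year", the slide does not say whether a removal applies to the linking page or the linked page, and the record layout and function names are hypothetical.

```python
# Domains and crawl years named on the slide as removed from the corpus.
EXCLUDED_DOMAINS = {
    "w3.org",
    "rnib.org.uk",
    "rnib.co.uk",
    "big-print.co.uk",
    "mla.gov.uk",
}
EXCLUDED_YEARS = {2009, 2010}


def is_excluded_domain(host):
    """True if host is an excluded domain or one of its subdomains."""
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)


def whittle(records):
    """Keep only records outside the excluded domains and crawl years."""
    return [
        r for r in records
        if r["year"] not in EXCLUDED_YEARS and not is_excluded_domain(r["host"])
    ]


# Hypothetical records; the field names are assumptions for this sketch.
sample_records = [
    {"host": "www.w3.org", "year": 2004},
    {"host": "example.org.uk", "year": 2005},
    {"host": "example.org.uk", "year": 2010},
    {"host": "www.mla.gov.uk", "year": 2006},
]

corpus = whittle(sample_records)
print(len(corpus), "of", len(sample_records), "records kept")
```

Keeping the exclusions in two named sets makes every filtering decision visible and repeatable, which matters once different search terms are filtered differently and can no longer be compared directly.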
12. What did we learn
today?
• Visible effects of the impact of
RNIB on UK web standards
• Sheer presence suggests RNIB
was better than its peers at
establishing itself on the
internet
• Google has made us (me) lazy
• An archive without an archivist
or a catalogue is highly
problematic for researchers
The British Library – from Wikimedia Commons