This document provides an overview of an introductory course on Web Science. It discusses key topics including:
1. What is Web Science and why it matters as an area of scientific study.
2. Key aspects of web architecture like URIs, URLs, HTTP, and file formats.
3. Methods of measuring the web through network analysis and studying structures like the blogosphere and social networks.
4. The Web Science Method, which takes an iterative, mixed-methods approach to engineering, measuring, and analyzing the web.
5. The social aspects of the web and challenges of incorporating human behavior.
6. Issues of web governance, security, and standards setting.
Intro to Web Science (Fall 2013)
1. Intro to Web Science
September 19, 2013
ITWS 1100
John Erickson, Kristine Gloria, Qingpeng Zhang
Tetherless World Constellation
Rensselaer Polytechnic Institute
2. Agenda
1. A Science of The Web and why it matters
2. Web Architecture/Engineering the Web
3. Measuring the Web
4. The Web Science Method
5. Social Aspects of the Web
a. Evolution of methodology
b. Hurdles of incorporating the “social”
c. Why humans aren’t just "nodes" in a network
6. Web and other Governance
3. What is Web Science?
● Positions the World Wide Web as an object
of scientific study unto itself
● Recognizes the Web as a transformational,
disruptive technology
● Its practitioners focus on understanding the
Web...
○ ...its components, facets and characteristics
● The Web Science Method: “the process of
designing things in a very large space..."
4. What does Web Science ask?
● What processes have driven the Web’s
growth, and will they persist?
● How does large-scale structure emerge from
a simple set of protocols?
● How does the Web function as a socio-
technical system?
● What drives the viral uptake of certain Web
phenomena?
Bottom line: What might fragment the Web?
5. What is the Web?
● "The Web is not a thing..."
● Continuously changing due to coordinated
and conflicting processes
● An evolving large-scale structure
dependent on static and emerging protocols
● A socio-technical system that reflects and
obfuscates social and technical structures
● Always goes where we allow it to go...but
seldom where we want or expect it to go!
Les Carr et al. http://slidesha.re/142MFrV
7. Web Architecture
It's quite simple, really! ;)
● A standard system for identifying resources
● Standard formats for representing
resources
● A standard protocol for exchanging
resources
Relevant core standards:
● URIs (URLs): Uniform Resource Identifiers
● HTML: Hypertext Markup Language
● HTTP: Hypertext Transfer Protocol
11. Identifying Resources (1)
● A global identification system is essential
○ to share information about resources
○ to reason about resources
○ to modify or exchange resources
● "Resources" are anything that can be linked
to or spoken of
○ Documents, cat videos, people, ideas...
● Not all resources are "on" the Web
○ They might be referenced from the Web...
○ ...while not being retrievable from it
○ These are (so-called) "non-information resources"
Les Carr et al. http://slidesha.re/142MFrV
12. Identifying Resources (2)
● A global standard is required; the URI is it
● Other systems are possible...
○ ...but the added value of a single global system of
identifiers is high
○ Enables linking, bookmarking and other functions
across heterogeneous applications
● How are URIs used?
○ All resources have URIs associated with them
○ Each URI identifies a single resource in a context-
independent manner
○ URIs act as names and (usually) addresses
○ In general URIs are "opaque"
Uniform Resource Identifier (URI): Generic Syntax (RFC 3986) http://www.ietf.org/rfc/rfc3986.txt
13. Identifying Resources (4)
● "URIs identify and URLs locate..."
○ ...and identify
● URLs are URIs aligned with protocols
○ URLs include the "access mechanism" or "network
location", e.g. http:// or ftp://
○ How to "dereference" the URI and retrieve the thing
● URL examples
○ ftp://ftp.is.co.za/rfc/rfc1808.txt
○ http://www.ietf.org/rfc/rfc2396.txt
○ mailto:John.Doe@example.com
○ telnet://192.0.2.16:80/
Uniform Resource Identifier (URI): Generic Syntax (RFC 3986) http://www.ietf.org/rfc/rfc3986.txt
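The URL examples above can be taken apart with a standard URI parser. The following is a minimal sketch using Python's urllib.parse; the helper name describe is an illustrative assumption, not something from the slides. It shows that the scheme (the "access mechanism") is always present, while a network location only appears for some schemes.

```python
from urllib.parse import urlsplit

# The four URL examples listed on the slide.
EXAMPLE_URLS = [
    "ftp://ftp.is.co.za/rfc/rfc1808.txt",
    "http://www.ietf.org/rfc/rfc2396.txt",
    "mailto:John.Doe@example.com",
    "telnet://192.0.2.16:80/",
]


def describe(url):
    """Hypothetical helper: split a URL into scheme, host, port, and path."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,   # access mechanism, e.g. "http" or "ftp"
        "host": parts.hostname,   # network location, None if the scheme has none
        "port": parts.port,       # explicit port, None if not given
        "path": parts.path,       # remainder that names the resource
    }


for url in EXAMPLE_URLS:
    print(describe(url))
```

Note that the mailto: example has a scheme and a path but no host, which is one reason to treat URIs as opaque names rather than guessing at their internal structure.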
14. Representing Resources (1)
● Resources are manifest as digital files
● The Web recognizes a (growing) set of file
formats
○ The original and workhorse is HTML...
○ ...but there are many others
● Retrievable resources on the web serve
multiple purposes
○ Resources encode information and data
○ Resources aggregate links to other resources
● This is what makes The Web(tm) a "web..."
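To illustrate how a resource "aggregates links to other resources", here is a minimal sketch that pulls the hyperlinks out of an HTML document using Python's standard html.parser. The class name LinkCollector, the sample HTML, and the base URL are assumptions made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the absolute targets of <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative references against the document's own URL.
                self.links.append(urljoin(self.base_url, value))


# Made-up sample document, assumed to live at http://www.ietf.org/
SAMPLE_HTML = """<html><body>
<p>See <a href="/rfc/rfc3986.txt">RFC 3986</a> and
<a href="http://www.w3.org/DesignIssues/LinkedData.html">Linked Data</a>.</p>
</body></html>"""

collector = LinkCollector("http://www.ietf.org/")
collector.feed(SAMPLE_HTML)
print(collector.links)
```

Each extracted link is an edge from this page to another resource; collecting such edges over many pages yields the Web page network discussed later in the deck.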
16. Retrieving Resources (1)
● Review: URIs that reference retrievable
resources -- URLs -- must specify a protocol
for retrieval
● The original and most common Web protocol
is HTTP
● Specialized protocols are possible but
resources may appear "off the grid..."
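As a rough sketch of what "specifying a protocol for retrieval" means in practice, the code below builds the text of a simple HTTP/1.1 GET request from a URL without sending it. The function name build_get_request and the choice of headers are assumptions for illustration; real clients send more headers and handle many more cases.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Hypothetical helper: return the raw HTTP/1.1 GET request text for a URL."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError("this sketch only handles http:// URLs")
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",    # request line: method, target, protocol version
        f"Host: {parts.netloc}",   # the Host header is mandatory in HTTP/1.1
        "Accept: text/html",       # assumed preference for an HTML representation
        "Connection: close",
        "",                        # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines)


print(build_get_request("http://www.ietf.org/rfc/rfc2396.txt"))
```

The URL supplies everything the client needs: the scheme selects the protocol, the network location says where to connect, and the path says what to ask for.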
18. Principles for creating a healthy Web
Tim Berners-Lee http://www.w3.org/DesignIssues/LinkedData.html
● Use URIs as names for things
● Use HTTP URIs so people can "look up"
those names
● When someone "looks up" a URI, return
useful information
○ use the standards to do it
● Include links to other URIs, so the
Consumer can discover more things
○ People or applications
Why is linking important?
19. Implications of a well-connected
Web: Google PageRank
● Links to other nodes act as "votes" of quality
and/or relevance
PageRank https://en.wikipedia.org/wiki/PageRank
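The "links as votes" idea can be sketched with a simplified PageRank computed by power iteration. This is a teaching sketch, not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed convention: a page with no out-links votes for everyone.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Made-up toy Web: C is linked to by A, B, and D; nobody links to D.
TOY_WEB = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(TOY_WEB)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, C ranks highest because it collects the most votes, and A ranks well mainly because the highly ranked C links to it: a vote counts for more when it comes from an important page.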
21. Measuring the Web
● The rich variety of networks on the Web
○ Router network
○ Web page network (linking via hyperlinks)
○ Document network (citation network on DBLP*, etc.)
○ Social networks
■ Facebook: friendship, comment-reply, tag, and other
kinds of social relationships
■ Twitter: follower, retweet, mention, reply, etc.
■ Blogosphere: friendship, visiting, comment, etc.
■ LinkedIn: colleague, classmate, etc.
■ Crowdsourcing: collaboration, co-worker, etc.
■ Other social media...
*The DBLP Computer Science Bibliography http://www.informatik.uni-trier.de/~ley/db/
22. Measuring the Web - Blogosphere
Political Blogosphere
2004 US Presidential Election
Bloggers:
Blue - Democrat
Red - Republican
Pink - Neutral
L. Adamic, N. Glance, The political blogosphere and the 2004 U.S. election:
divided they blog, LinkKDD’05
27. Analyzing networks on the Web
Measure...
■ # of nodes
■ # of edges
■ Diameter and radius
■ Network density
■ Degree distribution
■ Clustering coefficient
■ Average shortest path length
■ Strongly/weakly connected components
■ Betweenness/Closeness centrality
■ Bow-tie structure
■ Community discovery
■ Key nodes discovery
■ etc...
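Several of the measures listed above can be computed directly on a small graph. The sketch below uses only the Python standard library on a made-up undirected five-node network; the function names and the edge list are assumptions for illustration, and tools such as NodeXL compute the same quantities at scale.

```python
from collections import Counter, deque
from itertools import combinations

# Made-up undirected toy network: a triangle (a, b, c) with a tail (c-d-e).
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]


def build_adjacency(edges):
    """Map each node to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Fraction of possible edges that are present."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2
    return 2 * m / (n * (n - 1))


def degree_distribution(adj):
    """Count how many nodes have each degree."""
    return Counter(len(nb) for nb in adj.values())


def local_clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nb = adj[node]
    k = len(nb)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(nb, 2) if v in adj[u])
    return 2 * linked / (k * (k - 1))


def bfs_distances(adj, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_stats(adj):
    """Return (average shortest path length, diameter) of a connected graph."""
    lengths = [d for s in adj for t, d in bfs_distances(adj, s).items() if t != s]
    return sum(lengths) / len(lengths), max(lengths)


adj = build_adjacency(EDGES)
print("nodes:", len(adj), "edges:", len(EDGES))
print("density:", density(adj))
print("degree distribution:", dict(degree_distribution(adj)))
print("clustering of c:", local_clustering(adj, "c"))
print("average path length, diameter:", path_stats(adj))
```

The path statistics here assume a connected graph; on real Web data they are usually reported for the largest connected component.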
28. Measuring the Web
“Bow-tie” structure: overall view of the structure of the Web
Components shown in the figure: SCC, IN, OUT, Tendrils, Tubes, Disconnected
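The bow-tie decomposition can be sketched on a small directed graph: find the largest strongly connected component (SCC), then label nodes that can reach it (IN) and nodes reachable from it (OUT). The code below is a simplified illustration that assumes a non-empty edge list; the function names and toy edges are made up, and tendrils, tubes, and disconnected pieces are lumped together as "OTHER" rather than separated.

```python
from collections import deque


def reachable(adj, start):
    """All nodes reachable from start by following edges in adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def bow_tie(edges):
    """Split a directed graph into SCC, IN, OUT, and everything else."""
    forward, backward, nodes = {}, {}, set()
    for u, v in edges:
        forward.setdefault(u, set()).add(v)
        backward.setdefault(v, set()).add(u)
        nodes.update((u, v))

    # A node's SCC is the set of nodes it can both reach and be reached from.
    # This brute-force search is fine for toy graphs, not for the real Web.
    core, assigned = set(), set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        scc = reachable(forward, node) & reachable(backward, node)
        assigned |= scc
        if len(scc) > len(core):
            core = scc

    seed = min(core)
    out_part = reachable(forward, seed) - core   # reachable from the core
    in_part = reachable(backward, seed) - core   # can reach the core
    other = nodes - core - in_part - out_part    # tendrils, tubes, disconnected
    return {"SCC": core, "IN": in_part, "OUT": out_part, "OTHER": other}


# Made-up toy Web: core cycle a->b->c->a, one IN page, one OUT page,
# a tendril hanging off IN, and a disconnected pair.
TOY_EDGES = [
    ("in1", "a"), ("a", "b"), ("b", "c"), ("c", "a"),
    ("c", "out1"), ("in1", "tendril1"), ("x", "y"),
]
parts = bow_tie(TOY_EDGES)
for name, members in parts.items():
    print(name, sorted(members))
```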
29. Measuring the Web
● It's a “Small World” after all...
○ Most pairs of pages separated by small # of links
○ Almost always by fewer than 20 links
○ "Diameter" of central core is 28, very small
compared to the size of the Web
○ Analysis suggests diameter will grow logarithmically
with the size of the Web (i.e., slowly)
○ Diameter of social networks decreases over time
● Conclusion: The Web is “smaller” than we thought!
● “Six degrees of separation” verified in Social Web
R. Albert, H. Jeong and A.-L. Barabasi, Diameter of the World Wide Web,
Nature 401 (1999) 130–131. http://bit.ly/18atsYA
J. Leskovec et al., Graphs over Time: Densification Laws, Shrinking Diameters
and Possible Explanations, KDD (2005)
30. Measuring the Web
“Scale-free” property
● Highly connected hubs
● “Rich get richer”
A live model: http://ccl.northwestern.edu/netlogo/models/PreferentialAttachment
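The "rich get richer" mechanism can also be sketched in a few lines. In this minimal simulation, each new node attaches a single link to an existing node chosen with probability proportional to that node's current degree. The function name, the one-link-per-node rule, the starting pair of nodes, and the fixed random seed are assumptions for illustration; the NetLogo model linked above may differ in its details.

```python
import random
from collections import Counter


def preferential_attachment(num_nodes, seed=0):
    """Grow a network where new nodes prefer already well-connected nodes.

    Returns a dict mapping node id to degree.
    """
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}   # start from two nodes joined by one edge
    endpoints = [0, 1]      # each node appears once per edge it touches
    for new_node in range(2, num_nodes):
        # Picking uniformly from the endpoint list selects a node with
        # probability proportional to its degree.
        target = rng.choice(endpoints)
        degree[new_node] = 1
        degree[target] += 1
        endpoints.extend((new_node, target))
    return degree


degrees = preferential_attachment(1000)
histogram = Counter(degrees.values())
print("largest degree:", max(degrees.values()))
print("nodes with degree 1:", histogram[1])
```

Running this typically shows many nodes with a single link and a handful of hubs with far more, which is the heavy-tailed pattern the slide calls "scale-free".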
31. The Web Science Method
Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(10)
32. The Web Science Method
Berners-Lee, T. (2007). W3C. http://www.w3.org/2007/Talks/0509-www-keynote-tbl/#(10)
Figure labels: "Science" and "Engineering"
38. Social Aspects of the Web
"Visual complexity produces opacity.
Massive individualizing data produces
beautiful, playful hairballs which show us
nothing."
- Bruno Latour @ CHI2013
For discussion, see "What baboon notebooks, monads, state surveillance and network diagrams have
in common: Bruno Latour at CHI2013" http://bit.ly/14Y3d3u
39. Multiple disciplines, multiple
methods
● Given its multiple disciplines, the argument is for a mixed-methods
approach to measuring the web. This means both
quantitative AND qualitative methods should be employed
by researchers.
○ Pros: More robust, comprehensive understanding of
“human social behavior”
○ Cons: Diametrically opposed philosophies in data
gathering and analysis
● Unanswered questions:
○ Replicability
○ Bias
○ Objectivity and Accuracy
● Ethics
49. Web Science meets (Web) Governance?
Diagram labels:
● Policy Design
● Policy Implementation
● Analysis and understanding of policy implications
● Policy Conception
50. Review...
1. A Science of The Web
2. Web Architecture
3. Measuring the Web
4. The Web Science Method
5. Social Aspects of the Web
6. Web and other Governance
51. Assignment:
1. Preferential Attachment Simulator: http://bit.ly/18bd0p2
○ Try the THINGS TO TRY!
2. Excel-based Network Analysis Tutorial
○ Follow instructions at: http://bit.ly/1a3mtzW
○ Install NodeXL from: http://bit.ly/1a3mnbx
○ Use Senate 2007 data from: http://bit.ly/1a3mhAI
○ Play with other data at: http://bit.ly/1a3mJ1X
3. Social Network Exploration
○ Twitalyzer: http://twitalyzer.com
○ TweetArchivist: http://tweetarchivist.com
○ MentionMapp: http://mentionmapp.com
4. Create a Web Science Scenario:
○ Identify a (social) problem
○ Propose an engineered solution
○ Identify how to measure, analyze, evaluate, iterate