The Ethics of Large-Scale Web Data Analysis (Webmetrics)
Mike Thelwall, Statistical Cybermetrics Research Group, University of Wolverhampton, UK
Rob Ackland, Australian Demographic and Social Research Institute, Australian National University
Virtual Knowledge Studio (VKS)
Information Studies
Contents
What is webmetrics?
Context: Online access to personal information
Researchers’ use of personal information
Confidentiality and anonymity
Resource issues
What ethical considerations apply to collecting and analysing web data on a large scale from unaware web “publishers”?
1. What is webmetrics?
Large-scale analysis of web-based data
Collecting and quantitatively analysing online information
The objective is not to find information about individuals but to identify trends
Data gathered with VOSON, SocSciBot, Issue Crawler, LexiURL, …
Example
VOSON hyperlink network of political parties from 6 countries (Ackland and Gibson, 2006). Node size proportional to outdegree (computed as sketched below); 76 nodes.
Example: Links between EU universities
Hyperlink network of universities in European countries (Austria, Belgium, Finland, France, Germany, Italy, the Netherlands, Norway, Poland, Spain, Sweden, Switzerland, UK), with geopolitically connected countries grouped together. Normalised linking, smallest countries removed; data from AltaVista link searches.
Example: Link associations between social network sites
Example: Blog searching
2. Context: Online access to personal information
Blogs, social network sites, and personal web sites contain information that is:
Private and protected (invisible to researchers)
Intentionally public
Publicly private¹ (intended for friends but allowed to be public)
Unintentionally public (public but believed by the owner to be private)
1. Lange (2007)
Accessing “public” information
Commercial search engines
Web crawlers
Internet Archive (includes deleted info)
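As an illustration of the last point, the Internet Archive's Wayback Machine can be queried programmatically for old (possibly deleted) copies of a page. The sketch below uses its public availability endpoint; "example.com" is a placeholder, not a real research target, and the JSON structure is assumed to match the Archive's documented response format.

```python
# Sketch: checking whether the Internet Archive holds a copy of a page.
# "example.com" is a placeholder URL.
import json
import urllib.parse
import urllib.request

def wayback_snapshot(url, timestamp="20080101"):
    """Return the closest archived snapshot record for `url`, or None."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    api = "https://archive.org/wayback/available?" + query
    with urllib.request.urlopen(api, timeout=30) as response:
        data = json.load(response)
    return data.get("archived_snapshots", {}).get("closest")

print(wayback_snapshot("http://example.com/"))
```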
Who is using dataveillance?
Dataveillance¹: downloading or otherwise gathering data on internet users in order to influence their behaviour
Google – can use email, searching, blogging, and social network activities to target advertising (and may report to the US government)
Amazon – can use past activities to target adverts or improve its web site
1. Zimmer (2008)
3. Researchers’ use of personal information
Key issue: for large-scale research, data from/about unaware people is used without their approval, and possibly for purposes that they might disagree with
Which ethical safeguards should be taken for this kind of research?
Issue 1: People vs. Documents
Traditionally, documents can be researched without approval, but people can’t
Even harsh criticism is fair practice (e.g., a book review/analysis)
Since web pages are documents, researching them without permission is normally OK
Issue 2: Invasion of privacy?
Natural vs. normative
A situation is naturally private¹ if a reasonable person would expect privacy
A situation is normatively private¹ if a reasonable person would expect others to protect their privacy
Non-secure web pages/data are typically naturally private
Accessing them is not normally invading privacy, even if undesired by page owners and with negative consequences
1. Moor (2004)
4. Confidentiality and anonymity
When should anonymity be granted to research “subjects” (page owners)?
When a possibly undesired label is attached (e.g., hate group, terrorist)
When undesired groups might benefit (e.g., a league table of hate groups)
When publicly private individuals are singled out (e.g., a detailed analysis of an “average” blogger)
Should data be anonymised, as for Census data used for research? (see the sketch below)
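One simple, illustrative way to anonymise such data (not prescribed by these slides) is to replace page owners' names or URLs with salted hashes before analysis or publication. The salt and identifiers below are made up; a real project would need a documented, securely stored salt.

```python
# Sketch: pseudonymising identifiers (e.g., blog URLs or author names)
# before analysis or publication. The salt and identifier are placeholders.
import hashlib

SALT = b"project-specific-secret"   # assumption: kept private by the researcher

def pseudonym(identifier: str, length: int = 12) -> str:
    """Return a stable, non-reversible label for an identifier."""
    digest = hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()
    return "subject_" + digest[:length]

print(pseudonym("http://exampleblogger.example.com/"))
```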
5. Resource issues
Accessing a web page uses the owner’s server time/bandwidth
Crawling a web site can use a lot of the owner’s server time/bandwidth
This may incur charges or a loss of service quality
Robots.txt protocol
This file lists the pages/folders in a web site that may not be crawled
It does not restrict crawling speed
It should be obeyed in research (see the sketch below)
Most individual users are probably unaware of it and so don’t use its protection
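A minimal sketch of checking robots.txt before fetching a page, using Python's standard urllib.robotparser module; the site and user-agent names are placeholders.

```python
# Sketch: respecting robots.txt before fetching a page.
# "example.com" and "EthicsDemoCrawler" are placeholders.
from urllib.robotparser import RobotFileParser

USER_AGENT = "EthicsDemoCrawler"

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()   # downloads and parses the robots.txt file

page = "https://example.com/some/page.html"
if rp.can_fetch(USER_AGENT, page):
    print("Allowed to crawl:", page)
else:
    print("robots.txt forbids crawling:", page)

# Some sites also declare a Crawl-delay; honour it if present.
print("Suggested crawl delay:", rp.crawl_delay(USER_AGENT))
```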
Crawling speed
Web crawlers should not run so fast that they cause service issues
Full speed is probably OK on a UK university web site but not on a Burkina Faso library web site
Use judgement to decide how quickly to crawl, i.e., the length of pauses between requests (see the sketch below)
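A minimal sketch of pausing between requests; the delay value and URLs are illustrative assumptions chosen by the researcher, not recommended settings.

```python
# Sketch: pausing between requests so the crawl does not overload the server.
# The delay and URLs are illustrative placeholders.
import time
import urllib.request

CRAWL_DELAY_SECONDS = 5.0   # assumption: chosen to suit the target server

urls = [
    "https://example.com/page1.html",
    "https://example.com/page2.html",
]

for url in urls:
    with urllib.request.urlopen(url, timeout=30) as response:
        html = response.read()
    print(url, len(html), "bytes")
    time.sleep(CRAWL_DELAY_SECONDS)   # polite pause before the next request
```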
How many pages to crawl?
Crawling too many pages puts unnecessary strain on the server being crawled
Use judgement to decide the minimum number of pages/crawl depth that is enough (see the sketch below)
Use search engine queries as a substitute, if possible
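A minimal sketch of capping a crawl by page count and depth. The limits and seed URL are assumptions, and extract_links is a placeholder for the fetching/parsing a real crawler would do (while also respecting robots.txt and a crawl delay).

```python
# Sketch: capping the number of pages and the crawl depth.
# MAX_PAGES, MAX_DEPTH, and the seed URL are illustrative assumptions.
from collections import deque

MAX_PAGES = 50   # stop after this many pages
MAX_DEPTH = 2    # do not follow links deeper than this

def extract_links(url):
    """Placeholder: return the outgoing links of `url`.

    A real crawler would fetch the page and parse its <a href> values;
    omitted here for brevity.
    """
    return []

def limited_crawl(seed):
    seen = {seed}
    queue = deque([(seed, 0)])          # (url, depth) pairs
    crawled = []
    while queue and len(crawled) < MAX_PAGES:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth < MAX_DEPTH:
            for link in extract_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return crawled

print(limited_crawl("https://example.com/"))
```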
Automatic search engine searches
Research can piggyback on the crawling of commercial search engines
No resource implications for site owners
Uses search engine Application Programming Interfaces (APIs)
Search engines specify the maximum number of searches per day (see the sketch below)
Results are limited to the imperfect web crawling/coverage of the search engine’s crawlers
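A minimal sketch of staying within a daily query quota. The endpoint, API key, query syntax, and quota below are hypothetical placeholders: each search engine's API and terms of use differ, so consult the relevant documentation before running automatic searches.

```python
# Sketch: keeping automatic queries within a daily quota.
# The endpoint, API key, query syntax, and quota are hypothetical placeholders.
import json
import time
import urllib.parse
import urllib.request

API_ENDPOINT = "https://api.example-search.com/search"   # hypothetical
API_KEY = "YOUR-API-KEY"                                  # hypothetical
MAX_QUERIES_PER_DAY = 1000                                # hypothetical quota
PAUSE_SECONDS = 2.0

def run_queries(queries):
    results = {}
    for count, query in enumerate(queries, start=1):
        if count > MAX_QUERIES_PER_DAY:
            print("Daily quota reached; stopping.")
            break
        params = urllib.parse.urlencode({"q": query, "key": API_KEY})
        with urllib.request.urlopen(API_ENDPOINT + "?" + params, timeout=30) as r:
            results[query] = json.load(r)
        time.sleep(PAUSE_SECONDS)   # avoid bursts even within the quota
    return results

# Example: link searches for a few (placeholder) sites.
print(run_queries(['link:"example.org"', 'link:"example.net"']))
```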
Summary
Researchers need to be aware of the potential issues when doing large-scale data analysis research
Judgement is called for on all of these issues
Research does not normally need participant permission
Be sensitive to the impact of findings and any need for anonymity
References
Lange, P. G. (2007). Publicly private and privately public: Social networking on YouTube. Journal of Computer-Mediated Communication, 13(1). Retrieved May 8, 2008 from: http://jcmc.indiana.edu/vol13/issue1/lange.html
Moor, J. H. (2004). Towards a theory of privacy for the information age. In R. A. Spinello & H. T. Tavani (Eds.), Readings in CyberEthics (2nd ed., pp. 407-417). Sudbury, MA: Jones and Bartlett.
Zimmer, M. (2008). The gaze of the perfect search engine: Google as an infrastructure of dataveillance. In A. Spink & M. Zimmer (Eds.), Web search: Multidisciplinary perspectives (pp. 77-99). Berlin: Springer.