Methods for Exploiting Academic
Hyperlinks
Mike Thelwall
Statistical Cybermetrics Research Group
University of Wolverhampton, UK
The Problem
To map patterns of communication between
researchers in a country based upon university
web sites
Patterns of communication are also mapped based
upon journal citations or journal title words
Provides useful information about the structure and
evolution of research fields
Can identify previously unknown field connections
Web analysis could illustrate wider and more
current patterns
Data collection
Web crawler
AltaVista advanced queries
host:wlv.ac.uk AND link:gla.ac.uk
AllTheWeb advanced queries
Google
Does not support same level of Boolean
querying
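The AltaVista query shown above pairs a host: restriction with a link: restriction. As a minimal sketch, assuming only the two operators that appear on the slide, query strings for every ordered pair of sites could be built as follows. The function name link_query and the site list are illustrative, and no search engine is actually contacted.

```python
from itertools import permutations


def link_query(source_site: str, target_site: str) -> str:
    """Build a query for pages hosted at source_site that link to target_site."""
    return f"host:{source_site} AND link:{target_site}"


# Illustrative site list; a real study would list every university domain.
sites = ["wlv.ac.uk", "gla.ac.uk"]

# One query per ordered pair, since links are directional.
queries = {(s, t): link_query(s, t) for s, t in permutations(sites, 2)}
```

Each query would then be submitted by hand or through whatever interface the search engine offers, and the reported hit count recorded as the link count for that pair.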
Types of link count
Direct link counts
Co-inlink counts
Co-outlink counts
Inter-site links only
[Diagram: link graph of nodes A to F, in which B and C are co-inlinked and D and E are co-outlinked]
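The three link counts can be sketched from a list of (source, target) links. The edge list below is hypothetical, chosen only so that B and C come out co-inlinked and D and E co-outlinked as on the slide; it is not the slide's actual diagram, and the function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical inter-site links as (source, target) pairs.
links = [("A", "B"), ("A", "C"), ("D", "F"), ("E", "F")]


def direct_link_counts(links):
    """Count links for each ordered (source, target) pair, inter-site only."""
    counts = defaultdict(int)
    for source, target in links:
        if source != target:
            counts[(source, target)] += 1
    return dict(counts)


def co_inlink_counts(links):
    """For each unordered pair of targets, count the sources linking to both."""
    targets_of = defaultdict(set)
    for source, target in links:
        if source != target:
            targets_of[source].add(target)
    counts = defaultdict(int)
    for targets in targets_of.values():
        for pair in combinations(sorted(targets), 2):
            counts[pair] += 1
    return dict(counts)


def co_outlink_counts(links):
    """For each unordered pair of sources, count the targets both link to."""
    # Co-outlinking is co-inlinking on the reversed graph.
    return co_inlink_counts([(target, source) for source, target in links])
```

With this edge list, A links to both B and C, so the pair (B, C) has a co-inlink count of 1; D and E both link to F, so the pair (D, E) has a co-outlink count of 1.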
Alternative Document Models
Domain ADM
Count links between domains (ignoring
multiple links) instead of pages
[Diagram: pages P1, P2, P3 on www.scit.wlv.ac.uk and pages P4, P5, P6 on www.dcs.gla.ac.uk]
Alternative Document Models
Directory ADM
Counts links between directories
Estimated using URL slashes
University ADM
Counts links between entire university Web sites
Too extreme for most purposes
ADMs reduce the impact of replicated links
E.g. a subsite of 1000 pages linking to another
university home page in its navigation bar
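The effect of the ADMs on replicated links can be sketched by mapping each page URL to its unit under a given model and counting distinct unit pairs. This is a rough illustration, not the software used in the research: the names adm_unit and count_links are hypothetical, directories are taken as the path up to its last slash, and a university site is assumed to be the last three labels of the host name (as in wlv.ac.uk).

```python
from urllib.parse import urlsplit


def adm_unit(url: str, model: str) -> str:
    """Map a page URL to its document unit under the given model."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if model == "page":
        return host + parts.path
    if model == "directory":
        # Keep the path up to and including its last slash.
        return host + parts.path.rsplit("/", 1)[0] + "/"
    if model == "domain":
        return host
    if model == "university":
        # Assumption: the university site is the last three host labels.
        return ".".join(host.split(".")[-3:])
    raise ValueError(f"unknown model: {model}")


def count_links(page_links, model: str) -> int:
    """Count distinct links between different units, ignoring multiples."""
    pairs = {(adm_unit(s, model), adm_unit(t, model)) for s, t in page_links}
    return len({(s, t) for s, t in pairs if s != t})


# A subsite of 1000 pages, each linking to another university's home page.
page_links = [
    (f"http://www.scit.wlv.ac.uk/sub/p{i}.html", "http://www.gla.ac.uk/")
    for i in range(1000)
]
```

Under the page model this subsite contributes 1000 links; under the directory, domain and university models it contributes 1, which is the reduction the slide describes.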
Some Inter-University
Hyperlink Patterns
For the UK and Europe
Citation-Style Hyperlink
Analysis
Citation counts are known to be reasonable
indicators of research quality but is the same true
for inlink counts?
Counts of links to universities within a country can
correlate significantly with measures of research
productivity
The significance of this result is in giving
‘permission’ to investigate the use of inter-university
links for researching scholarly communication
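A correlation between inlink counts and research productivity can be sketched as below. The slides do not name the coefficient used, so Spearman's rank correlation is an assumption here, and the numbers are made-up placeholders rather than real university data.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder values only: one entry per hypothetical university.
inlink_counts = [120, 450, 300, 900, 60]
productivity = [1.5, 3.0, 2.0, 4.5, 1.0]
rho = spearman(inlink_counts, productivity)
```

A rank-based coefficient is a cautious choice when link counts are heavily skewed, since it depends only on the ordering of the universities.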
Most links are only loosely
related to research
90% of links between UK university sites have
some connection with scholarly activity,
including teaching and research
But less than 1% are equivalent to citations
So link counts do not measure research
dissemination but are more a natural by-product
of scholarly activity
Cannot use link counts to assess research
Can use link counts to track an aspect of
communication
Links to UK universities against
their research productivity
The reason for the
strong correlation is
the quantity of Web
publication, not its
quality
This is different to
citation analysis
Universities tend to link to
neighbours
Universities
cluster
geographically
Language is a factor in
international interlinking
English the dominant language for Web sites in
the Western EU
In a typical country, 50% of pages are in the
national language(s) and 50% in English
Non-English-speaking countries interlink extensively in
English
{Research with Rong Tang & Liz Price}
Can map patterns of international
communication
[Figure: counts of links between EU universities in Swedish, represented by arrow thickness]
[Figure: counts of links between EU universities in French, represented by arrow thickness]
Which language???
Linking patterns vary enormously
by discipline
No evidence of a significant geographic trend
Disciplinary differences in the extent of
interlinking: e.g., history Web use is very low,
Chemistry is very high
Individual research projects can have an
enormous impact upon individual departments
E.g. Arts web sites are often for specific exhibitions
or for digital media projects
Links not frequent enough to reliably reveal
patterns of interdisciplinarity
Clustering using links
Background: Power laws in
Academic Webs
Academic Webs have a topology dominated by
power laws, including
Counts of links to pages (inlink counts)
Counts of links from pages (outlink counts)
Groups of interconnected pages
Directed component sizes
Undirected component sizes
Power laws mean that clustering connected
components will not yield useful results
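Undirected component sizes and their frequency distribution can be sketched with a union-find structure. The node and link lists are a toy example, and the function names are illustrative. With a power-law size distribution, the frequencies fall roughly on a straight line when plotted against size on logarithmic axes, so such a Web is likely to contain a few very large components and many very small ones, which is presumably why components alone make poor clusters.

```python
from collections import Counter


def component_sizes(nodes, links):
    """Return the sizes of the undirected connected components."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in links:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_s] = root_t

    return sorted(Counter(find(node) for node in nodes).values())


def size_frequencies(sizes):
    """Map each component size to the number of components of that size."""
    return dict(sorted(Counter(sizes).items()))


# Toy data: seven pages and three links, direction ignored.
nodes = list("ABCDEFG")
links = [("A", "B"), ("B", "C"), ("D", "E")]
sizes = component_sizes(nodes, links)
frequencies = size_frequencies(sizes)
```

Here the components are {A, B, C}, {D, E}, {F} and {G}, so the size frequencies are two components of size 1, one of size 2 and one of size 3.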
[Figures: page outlink counts; topological component sizes]
Community Identification
Algorithm
Can apply to page, directory and domain models
Gives complementary results: a “layered
approach”
[Figures: frequency against community size on logarithmic axes, for the page model (k = 32) and the Directory model (k = 32)]
Stretching links further: coinlinks, co-outlinks
For the UK academic Web, about 42% of
domains connected by links alone host similar
disciplines, and about 43% connected by links,
co-inlinks and co-outlinks
But over 100 times more domains are colinked or
coupled than are directly linked
Links in any form are less than 50% reliable as
indicators of subject similarity
Summary
Studies of the relatively restricted
subdomain of university web sites
Produce directly useful results
For Web IR, they also
Help refine methodologies
Help build intuition