Steps Towards Mapping e-Research and Measuring Impact


Steps Towards Mapping
e-Research and Measuring Impact
Alex Voss, Rob Procter, Peter
Halfpenny, Meik Poschen,
Marzieh Asgari-Targhi
AHM’08: Workshop on Profiling e-Research:
Mapping Communities and Measuring Impacts
Edinburgh, 10th September 2008
Aims
 To compile a comprehensive* database of
e-Social Science activities in the UK and
elsewhere
 To analyse the data in order to capture a
snapshot of e-Social Science
 To provide a monitoring tool that flags up new
content
 To provide an infrastructure for further research
Problem
 What I would call e-Social Science is not
always labelled e-Social Science
 Simply googling for the term will provide
only a partial view
 Need to establish a network of relevant
nodes with context information on the web
and expand search from there
Approach
 Using lists of conference and workshop
attendees
 Search for relevant URLs
 Review resulting data
 Harvest web pages connected to these
 Extract key terms
 Visualise results
 Further steps…
Seed List
 Data about attendees of events
(Intl. Conference and Agenda Setting)
 226 individuals
 Removal of duplicates and erroneous
entries
 Import into SQL database
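The import step above can be sketched as follows; this is a minimal illustration assuming hypothetical column names (the original used a SQL database of 226 delegates), with duplicates dropped via a uniqueness constraint.

```python
# Minimal sketch of the seed-list import, with de-duplication handled
# by a UNIQUE constraint. Column names are assumptions for illustration.
import sqlite3

def import_delegates(rows):
    """Load (name, surname, affiliation) tuples, dropping exact duplicates."""
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE delegate (
                       id INTEGER PRIMARY KEY,
                       name TEXT, surname TEXT, affiliation TEXT,
                       UNIQUE (name, surname, affiliation))""")
    con.executemany(
        "INSERT OR IGNORE INTO delegate (name, surname, affiliation) "
        "VALUES (?, ?, ?)", rows)
    con.commit()
    return con

con = import_delegates([
    ("Alex", "Voss", "NCeSS"),
    ("Alex", "Voss", "NCeSS"),   # duplicate entry, silently dropped
    ("Rob", "Procter", "NCeSS"),
])
print(con.execute("SELECT count(*) FROM delegate").fetchone()[0])  # -> 2
```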
Search
 Using the Yahoo Search API, generating a list
of URLs matching name, surname and
affiliation
 Restricted to .ac.uk, .edu, .nhs.uk and
.gov.uk
 Results in 30k hits for 226 people
 Extraction of hostnames from URL
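The query construction can be sketched as follows. The Yahoo Search API used in the talk has since been retired, so only the per-delegate query strings are shown, not the HTTP call; the query format is an assumption.

```python
# Hypothetical sketch of building one search query per permitted domain,
# matching name, surname and affiliation (the actual query syntax the
# project used is not shown in the slides).
DOMAINS = ["ac.uk", "edu", "nhs.uk", "gov.uk"]

def build_queries(name, surname, affiliation):
    """Return one domain-restricted query string per permitted domain."""
    return ['"%s %s" %s site:%s' % (name, surname, affiliation, d)
            for d in DOMAINS]

queries = build_queries("Alex", "Voss", "Manchester")
print(queries[0])  # -> "Alex Voss" Manchester site:ac.uk
```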
Removing False Positives
 Clustering of hostnames by frequency
showed some systematic false positives
through long lists of names on some sites
 e.g., lists of alumni, sports teams etc.
 Manually removing these for the top 80
hostnames reduced the number of URLs by
10k, to 20k
Review
 Clustering of hostnames by frequency (after cleaning):

select count(host) as size, host from url group by host order by size desc;
+------+---------------------------------+
| size | host                            |
+------+---------------------------------+
|  211 | www.geog.leeds.ac.uk            |
|  204 | www.nottingham.ac.uk            |
|  140 | www.shef.ac.uk                  |
|  126 | www.ncess.ac.uk                 |
|  109 | www.manchester.ac.uk            |
|   97 | www.lancs.ac.uk                 |
|   95 | www.psychology.nottingham.ac.uk |
|   93 | redress.lancs.ac.uk             |
|   92 | www.cs.bris.ac.uk               |
|   91 | www.comlab.ox.ac.uk             |
+------+---------------------------------+
Review (II)
 Clustering of URLs by number of persons mentioned
(after cleaning):

+------+------------------------------------------------------------+
| size | url                                                        |
+------+------------------------------------------------------------+
|   24 | http://ess.si.umich.edu/papers.htm                         |
|   17 | http://www.ncess.ac.uk/events/ASW/visualisation/           |
|   17 | http://www.ncess.ac.uk/events/conference/2006/papers/      |
|   12 | http://ess.si.umich.edu/committee.htm                      |
|   12 | http://redress.lancs.ac.uk/resources/                      |
|   10 | http://www.kato.mvc.mcc.ac.uk/rss-wiki/VizNET              |
|   10 | http://www.informatics.manchester.ac.uk/aboutus/staff/     |
|    8 | http://www.ncess.ac.uk/about_us/people/?centre=            |
|    7 | http://www.geog.leeds.ac.uk/people/a.turner/personal/blog/ |
+------+------------------------------------------------------------+
Checking Completeness



select id from url where url = 'http://ess.si.umich.edu/committee.htm';
> 59765
select surname, name from delegate join delegate_url
  on id = delegate_id where url_id = 59765;
 This returns a list of 12 people, but the actual list of
conference PC members is much longer
 Some people who are in the database are missed, and others
are missing from the database altogether
 Potential to expand list of people involved in e-Social
Science
Harvesting Content
 Harvesting 20k web pages takes time
 Using multithreaded code to mask latency
 With 40 harvesters it still takes about 4h
 All but 230 pages harvested
 1.3GB of data
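The latency-masking pattern described above can be sketched with a thread pool; this is an assumption about the shape of the harvester, not the project's actual code, and the fetcher is injectable so the pattern can be tried without network access.

```python
# Sketch of a multithreaded harvester: 40 worker threads mask network
# latency while fetching the ~20k cleaned URLs.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def default_fetch(url, timeout=30):
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()

def harvest(urls, fetch=default_fetch, workers=40):
    """Return {url: page_bytes}; pages that fail map to None."""
    def safe(url):
        try:
            return url, fetch(url)
        except Exception:
            return url, None   # the talk reports ~230 pages that failed
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(safe, urls))

# Offline usage example with a stub fetcher:
pages = harvest(["a", "b"], fetch=lambda u: b"<html>" + u.encode() + b"</html>")
```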
Amending Seed Data
 Extracting email addresses
 Finding mailto: links actually works quite well
 Not much need to deal with obfuscation (such as alex.voss-atncess.ac.uk)
 But doing this may improve results
 How to deal with multiple valid emails
 Extracting affiliations
 Again, surprising how effective this was
 Again, how to deal with multiple affiliations
 Affiliation does not map 1:1 to research area
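The mailto: extraction mentioned above can be sketched with a single regular expression over the harvested HTML; the regex here is an illustrative assumption, and it deliberately ignores obfuscated addresses.

```python
# Sketch of mailto: extraction from harvested pages. One regex finds
# most addresses, since few pages obfuscate them.
import re

MAILTO = re.compile(r'href=["\']mailto:([^"\'?]+)', re.IGNORECASE)

def extract_emails(html):
    """Return the distinct addresses linked via mailto: on a page."""
    return sorted(set(MAILTO.findall(html)))

page = '<a href="mailto:alex.voss@ncess.ac.uk">email me</a>'
print(extract_emails(page))  # -> ['alex.voss@ncess.ac.uk']
```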
Key Term Extraction
 Using NaCTeM's TerMine (using the website
at the moment, web service soon)

Rank  Term
  5   e-social science
 10   national centre
 11   rob procter
 12   social science
 13   marina jirotka
 14   international conference
 15   social sciences
 18   mark rouncefield
 19   computer science
 22   research centre
 27   science studies unit
 35   lancaster university
 40   computer supported cooperative work
 46   text mining
 48   paul luff
Key Term Extraction (II)
 Next steps:
Change code to use web services API
Repeat key term extraction for 226 individuals
Create unified key term list
Review and create stop-list
Factor this into tailored Termine service
Named entity recognition to extend seed list
Social Map
Co-occurrence of names on web pages
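The co-occurrence map can be sketched as follows; the page data is illustrative. Two people are linked whenever their names appear on the same page, and counting repeated co-occurrences yields the edge weights mentioned under further steps.

```python
# Sketch of building a weighted social map from name co-occurrence on
# web pages. The page-to-names mapping here is made-up example data.
from collections import Counter
from itertools import combinations

page_mentions = {
    "http://example.ac.uk/papers": ["Voss", "Procter", "Halfpenny"],
    "http://example.ac.uk/staff": ["Voss", "Procter"],
}

edges = Counter()
for people in page_mentions.values():
    # each unordered pair on a page adds one to that pair's edge weight
    for a, b in combinations(sorted(set(people)), 2):
        edges[(a, b)] += 1

print(edges[("Procter", "Voss")])  # -> 2
```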
Further Next Steps
 Add weights to social map – how strongly are
people connected?
 Drawing social network graphs for interactive
analysis using information about link structure
 Repeating Yahoo searches to flag up new data
appearing
 RSS feed on what’s new in e-Social Science
 Doing Yahoo searches on the top key terms
emerging
Next Steps?
 FOAF-type semantic data on e-Social Science
projects
 What incentives could we leverage to get people
to provide the information we are interested in?
 Combining with bibliometric work
 New kinds of entities:
 Publications
 Projects, Organisations