ETD 2005
International Accesses to a Digital Library of ETDs
Ana Pavani
Departamento de Engenharia Elétrica
Pontifícia Universidade Católica do Rio de Janeiro
[email protected]
http://www.maxwell.lambda.ele.puc-rio.br/
Presentation outline
• Profile of the digital library
• Generation of data
• Combination and analysis of data – interesting results
• Next steps
Profile of the digital library
• Beginning of the collection – 2nd semester of 1995
• Items to start the collection – courseware (texts, exercises, technical manuals, tests, etc.)
• The digital library is part of a system that:
  - Is an LMS (Learning Management System)
  - Has administrative functions that allow data exchange with the university’s administrative system
  - Is linked (in both directions) to CNPq’s Lattes Platform (curricula database with more than 595,000 CVs)
  - Allows the control of series collections
  - Is multilingual and has interfaces in 3 languages
• Evolution of the collection:
  - Administrative documents
  - Preprints, published papers & online articles
  - Interactive courseware
  - ETDs (2000)
  - Online journals (2003)
  - Senior projects (2003)
  - Online bulletins – distributed through mailing lists, archived and published automatically (2004)
  - Books (Oct. 2005)
• Numbers of titles in the collection:
  - Courseware (many types) – 2,700+
  - Administrative documents – 33
  - Technical documents – 94
  - ETDs – 1,873 (PUC-Rio) + 31 (UNICAP)
  - Preprints, published papers & online articles – 280
  - Senior projects – 305
  - Online journals – 3 (+ 1 in Oct. 2005 + 1 in Dec. 2005)
  - Online bulletins – 2
  - Books – 1 (to be published in Oct. 2005)
• Total number of digital objects (DOs): 16,400+
• Technological characteristics:
  - Machine – IBM RS/6000
  - Operating system – IBM AIX
  - Web server – Apache
  - DBMS – IBM DB2
• The Apache log contains information on accesses to ALL digital contents on the system, as well as on every transaction that users perform (clicking buttons, reading posts, reading help pages, etc.). Data on transactions with contents must therefore be extracted from the server log to generate the numbers to be analyzed.
Generation of data
• Data are of 2 different kinds: production and accesses
• Production data come from functions of the system that are not related to the Apache server but only to the DB
Example:
[Figure: ETD time series, PUC-Rio & UNICAP. Vertical axis: numbers, 0 to 2,000. Horizontal axis: semesters, Jun. 2000 to Sep. 2005.]
(*) PUC-Rio started requiring ETDs in Aug. 2002.
(*) UNICAP does not require ETDs.
• Access data are obtained from both the Apache Server log and the DB:
  - Logs are mined (according to the following definitions) and the results are stored on the DB
  - Mined data are combined with production data (metadata) already in the database (types of contents, authors, programs, areas of knowledge, dates, countries, etc.) to yield results
Definitions for mining the log
• When access statistics came under discussion, it was necessary to define how data should be mined from the log and how they should be combined afterwards
• The definitions follow – (M) mining definitions and (C) combining definitions
(M) Visits and complete visits
An ETD can have one or many digital objects. The number of visits is the sum of all accesses to all of its digital objects in a given month. A complete visit is a set of visits to all digital objects of an ETD from a country in a given month.
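The two definitions can be sketched as follows. This is only an illustration: the record layout, the names, and the sample values are hypothetical, and reading "complete visit" as "every digital object of the ETD was accessed from that country in that month" is an interpretation of the slide, not a confirmed rule of the system.

```python
from collections import Counter, defaultdict

# Hypothetical catalog: ETD identifier -> digital objects that make it up.
CATALOG = {
    "etd-1": {"etd-1/ch1.pdf", "etd-1/ch2.pdf"},
    "etd-2": {"etd-2/full.pdf"},
}

# Hypothetical mined accesses: (etd, digital object, country, month).
ACCESSES = [
    ("etd-1", "etd-1/ch1.pdf", "BR", "2005-06"),
    ("etd-1", "etd-1/ch2.pdf", "BR", "2005-06"),
    ("etd-1", "etd-1/ch1.pdf", "PT", "2005-06"),
    ("etd-2", "etd-2/full.pdf", "PT", "2005-06"),
]


def count_visits(accesses):
    """Visits per (ETD, month): the sum of accesses to all its digital objects."""
    return Counter((etd, month) for etd, _do, _country, month in accesses)


def complete_visits(accesses, catalog):
    """(ETD, country, month) keys for which every digital object was accessed."""
    seen = defaultdict(set)
    for etd, do, country, month in accesses:
        seen[(etd, country, month)].add(do)
    return {key for key, objects in seen.items() if objects >= catalog[key[0]]}


visits = count_visits(ACCESSES)
complete = complete_visits(ACCESSES, CATALOG)
```

In this sample, Brazil reaches both objects of `etd-1` and so yields a complete visit, while Portugal reaches only one of them and does not.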
(M) Country x IP address
The decision to use the country and not
the IP address to establish a visit was
based on the fact that the visits to an ETD
can be made at different times (and
reconnecting may assign a new IP address) and
from different locations (with fixed IP
addresses).
(M) Counting visits from the same IP
address
Visits from the same IP are counted individually
due to the fact that networks with many machines
can be identified by the IP address of a firewall.
(M) Counting visits to restricted
digital objects
Some ETDs are totally or partially restricted –
approximately 30% have some type of permanent
or temporary restriction. Metadata, abstracts
included, are publicly available for all of them. It
was decided that attempts followed by denials of
access would be counted as accesses.
Note: this is stated in the help pages of the system, which suggest that authors consider making their contents public if many attempts occur.
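The rule on restricted digital objects could be expressed as a check on the HTTP status recorded in the log. The status codes below are assumptions: the slides do not say which codes the system returns for delivered or denied requests.

```python
# Assumption: successful deliveries are logged as 200 (or 206 for partial
# content) and denied attempts as 401 or 403; the statuses the real system
# emits are not stated in the slides.
DELIVERED = {200, 206}
DENIED = {401, 403}


def counts_as_access(status: int) -> bool:
    """Count both deliveries and denied attempts, following the rule on restricted objects."""
    return status in DELIVERED or status in DENIED
```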
(C) Lines to mine
Since the interest was in access to digital objects, the decision was to mine the lines with extensions .dcr, .doc, .htm, .pdf, etc. All extensions present in the database are considered, as long as the corresponding item is cataloged in the digital library, so that a static HTML system page is never counted.
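A minimal sketch of this filter, assuming the requested path has already been taken from the log line. The extension set and the cataloged paths are hypothetical stand-ins for what the real system reads from its database.

```python
from pathlib import PurePosixPath

# Hypothetical values: the real extension list and catalog come from the database.
EXTENSIONS = {".dcr", ".doc", ".htm", ".pdf"}
CATALOGED = {"/etd/1234/chapter1.pdf", "/etd/1234/index.htm"}


def should_mine(path: str) -> bool:
    """Keep a requested path only if its extension is known and it is cataloged."""
    extension = PurePosixPath(path).suffix.lower()
    return extension in EXTENSIONS and path in CATALOGED
```

The catalog test is what excludes a static system page such as a help page in `.htm`: its extension qualifies, but it is not an item of the digital library.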
Observations
(1) Statistics were planned on a monthly basis. The model treats data as sequences of points at discrete-time intervals of one month. Data for past months remain unchanged and the current month is updated according to the Update definition.
(2) IP addresses are resolved to countries using a plug-in called GeoIP Free that is available with AWStats.
(C) Information to get from a log line
The month and the year are extracted
along with identification of the digital
object and the country of the IP address
that accessed the digital object.
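The extraction could look like the sketch below, assuming the log is in Apache's Common Log Format. The dictionary lookup is a hypothetical stand-in for the GeoIP resolution, and the sample line and addresses (documentation ranges) are invented for illustration.

```python
import re
from datetime import datetime

# Apache Common Log Format: host ident user [time] "request" status bytes
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical stand-in for the GeoIP lookup; addresses are documentation ranges.
COUNTRY_BY_IP = {"192.0.2.10": "BR", "198.51.100.7": "PT"}


def parse_line(line: str):
    """Return (year, month, digital object path, country), or None if unparsable."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    country = COUNTRY_BY_IP.get(match["ip"], "unidentified")
    return when.year, when.month, match["path"], country


sample = (
    '192.0.2.10 - - [28/Sep/2005:10:15:00 -0300] '
    '"GET /etd/1234/chapter1.pdf HTTP/1.1" 200 51234'
)
record = parse_line(sample)
```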
(C) Update of the DB
The lines are read every hour on the hour (00:00, 01:00, etc.); only the incremental lines are mined. Accesses are summed for each month-year-DO-country, so the table is not very big – in the first 6 months of 2005 the average number of lines per month was 10,000.
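The summing step can be sketched with one row per year-month-DO-country. SQLite is used here only so the example is self-contained (the system itself runs on IBM DB2), and the table and column names are hypothetical.

```python
import sqlite3


def open_stats_db(path=":memory:"):
    """Create the summary table: one row per year-month-DO-country."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS access_stats ("
        " year INTEGER, month INTEGER, do_id TEXT, country TEXT,"
        " accesses INTEGER NOT NULL,"
        " PRIMARY KEY (year, month, do_id, country))"
    )
    return db


def add_accesses(db, records):
    """Sum newly mined records, each a (year, month, do_id, country) tuple."""
    for key in records:
        updated = db.execute(
            "UPDATE access_stats SET accesses = accesses + 1"
            " WHERE year = ? AND month = ? AND do_id = ? AND country = ?",
            key,
        )
        if updated.rowcount == 0:
            # First access for this combination: create the row.
            db.execute("INSERT INTO access_stats VALUES (?, ?, ?, ?, 1)", key)
    db.commit()


db = open_stats_db()
# Two hourly batches (hypothetical values); only new lines are passed each time.
add_accesses(db, [(2005, 9, "do-1", "BR"), (2005, 9, "do-1", "BR")])
add_accesses(db, [(2005, 9, "do-1", "BR"), (2005, 9, "do-1", "PT")])
rows = db.execute(
    "SELECT country, accesses FROM access_stats ORDER BY country"
).fetchall()
```

Because repeated accesses only increment a counter, the table grows with the number of distinct combinations rather than with the number of log lines.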
(C) When to start computing
The log of the Apache Server started being
saved on Jun 01, 2004. So, either this date
was used or a later one, for example Jan
01, 2005. The decision was to use all
available monthly logs.
When the process started, some days of offline processing were required. Afterwards, updating became automatic according to the Update definition.
Observations
(1) Maybe these were not the best definitions – we are willing to discuss alternatives!!
(2) The (original) logs are stored and saved offline in case some change in the mining strategy is decided (we have not sunk the ships!!).
Definitions for computing statistics
• By author
  - Visited ETDs by year, month and country
  - Visited ETDs by country, month and year
• 25 most visited ETDs (on the system = PUC-Rio + UNICAP)
• 20 most visited ETDs by institution
• 10 most visited ETDs by graduate program
• Visited ETDs by institution, program, year and month
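The "most visited" rankings can be sketched as a sum over the summary rows followed by a cut at the top n. The row layout, the program names, and the numbers below are hypothetical.

```python
from collections import Counter

# Hypothetical summary rows: (etd, institution, program, country, year, month, visits).
ROWS = [
    ("etd-1", "PUC-Rio", "Education", "BR", 2005, 8, 40),
    ("etd-1", "PUC-Rio", "Education", "PT", 2005, 9, 15),
    ("etd-2", "PUC-Rio", "Electrical Engineering", "US", 2005, 9, 30),
    ("etd-3", "UNICAP", "Law", "BR", 2005, 9, 20),
]


def most_visited(rows, n, institution=None, program=None):
    """Top-n ETDs by total visits, optionally restricted to an institution or program."""
    totals = Counter()
    for etd, inst, prog, _country, _year, _month, visits in rows:
        if institution is not None and inst != institution:
            continue
        if program is not None and prog != program:
            continue
        totals[etd] += visits
    return totals.most_common(n)


top_system = most_visited(ROWS, 25)                      # whole system
top_puc = most_visited(ROWS, 20, institution="PUC-Rio")  # one institution
```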
Initial Results
Access to ETDs is increasing (Sep 28, 2005)
[Figure: monthly visits by origin (Brazil, United States, pt speaking countries, es speaking countries, others). Vertical axis: visits, 0 to 1,600. Horizontal axis: months 1 to 16, Jun 04 - Sep 05.]
# ETDs May/Sep – up 13%
# accesses May/Sep – up 54.6%
Number of total visits is increasing (Sep 28, 2005)
[Figure: total visits. Vertical axis: 0 to 3,500. Horizontal axis: months 1 to 16, Jun 04 - Sep 05.]
Accumulated average total visits is increasing (Sep 28, 2005)
[Figure: accumulated average total visits. Vertical axis: 0 to 900. Horizontal axis: months 1 to 16, Jun 04 - Sep 05.]
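The slides do not define "accumulated average". One plausible reading, assumed here, is a running mean: the value for month k is the mean of the monthly totals from month 1 through month k. The input values below are hypothetical, not the plotted ones.

```python
from itertools import accumulate


def accumulated_average(monthly_totals):
    """Running mean: element k is the mean of the first k monthly totals."""
    return [
        running_sum / count
        for count, running_sum in enumerate(accumulate(monthly_totals), start=1)
    ]


# Hypothetical monthly totals, not the values plotted on the slide.
averages = accumulated_average([100, 300, 500])
```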
Brazil accounts for 55% of the accesses since Jun 01, 2004 (Sep 28, 2005)
[Pie chart: share of accesses by origin, in %]
Brazil – 54.87
United States – 11.94
pt speaking countries – 7.00
es speaking countries – 13.53
others – 12.66
But…
Brazil + pt speaking + es speaking = 75%
Brazil + US + pt speaking + es speaking = 87%
On Jun 15, 2007 the numbers of ETDs in Iberian languages on the NDLTD DB were:

Institution               Country            Language(s)                    Number
National Library          Portugal           Portuguese                        185
IBICT (includes PUC-Rio)  Brazil             Portuguese                     11,118
UAB                       Spain (Catalunya)  Catalan or English or Spanish   1,011
UIB                       Spain (Catalunya)  Catalan or English or Spanish      22
UJI                       Spain (Catalunya)  Catalan or English or Spanish      42
UOC                       Spain (Catalunya)  Catalan                             1
UPC                       Spain (Catalunya)  Catalan or English or Spanish     415
UPF                       Spain (Catalunya)  Catalan or English or Spanish      67
URL                       Spain (Catalunya)  Spanish                             1
URV                       Spain (Catalunya)  Catalan or English or Spanish     106
UdG                       Spain (Catalunya)  Catalan or English or Spanish     131
UdL                       Spain (Catalunya)  Catalan or English or Spanish      70
UV                        Spain (Catalunya)  Catalan or English or Spanish     200

Brazilian ETDs were 83% of all ETDs in Iberian languages (total number 13,369)
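The total and the 83% share can be checked from the per-institution numbers. In this sketch the three numbers that the slide shows detached from their rows (11,118, 1,011, 1 and 1) are assigned to the rows with missing values in row order, which is an assumption; the stated total of 13,369 is consistent with it.

```python
# Numbers of ETDs per institution as read from the slide; the pairing of
# IBICT, UAB, UOC and URL with their values is an assumption (see lead-in).
COUNTS = {
    "National Library": 185,
    "IBICT": 11_118,
    "UAB": 1_011,
    "UIB": 22,
    "UJI": 42,
    "UOC": 1,
    "UPC": 415,
    "UPF": 67,
    "URL": 1,
    "URV": 106,
    "UdG": 131,
    "UdL": 70,
    "UV": 200,
}

total = sum(COUNTS.values())
brazil_share = round(100 * COUNTS["IBICT"] / total)  # percentage, rounded
```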
Percentage of visits from Brazil is decreasing (Sep 28, 2005)
[Figure: percentage of visits from Brazil. Vertical axis: 0 to 80. Horizontal axis: months 1 to 16, Jun 04 - Sep 05.]
Accumulated percentage averages of visits from Brazil (Sep 28, 2005)
[Figure: accumulated average percentage of visits from Brazil. Vertical axis: 50 to 68. Horizontal axis: months 1 to 16, Jun 04 - Sep 05.]
Total accesses top 10 countries (Sep 28, 2005)

Country    Visits
Brazil     12,845
USA         2,795
Portugal    1,489
Spain         679
Peru          652
Mexico        432
Chile         364
France        245
Colombia      225
Argentina     224

# identified countries: 122 + unidentified countries + satellite access host
Some interesting results
• Some ETDs are permanent ‘best sellers’
  - They are on specific subjects (examples: a specific philosopher and the history of modern architecture in Brazil)
  - They are linked from sites on the subjects (examples: the first from the US & Brazil and the second from Germany)
  - They are accessed from different countries
• Some topics are permanent ‘best sellers’ (example: energy)
• Some ETDs are temporary ‘best sellers’ – this seems to happen when they are displayed by the ‘last published ETDs’ functions (system and graduate program)
• Some graduate programs are permanent ‘best sellers’
  - They research topics that are very specific to the country (examples: education and history of culture)
  - They are indexed in other sites and/or digital libraries (example: Universia in Spain for social sciences and humanities)
  - They are accessed from different countries
The 25 most visited ETDs have a large number of visits
[Figure: average number of visits of the 25 most visited ETDs. Vertical axis: 0 to 400. Horizontal axis: months 1 to 16, Jan 04 - Sep 05.]
No average is lower than 100 visits per month
Next steps
• Find out how readers got to ETDs (BDTD, NDLTD, SCIRUS, etc.) – an online survey is planned
• Interview faculty to check if some ETDs are recommended reading in courses
• Gather more data and analyze in a ‘more scientific’ manner (must find a student!!)
• Develop additional functions comparing accesses with production
• Extend to other digital contents (at the moment only ETDs and online journals have access statistics)
Thank you!
Muito obrigada!