Improving the ETD Landscape
ETD 2014: 17th Int’l Symposium on ETDs
Leicester, England
Edward A. Fox
Executive Director, NDLTD, www.ndltd.org
[email protected]
http://fox.cs.vt.edu/talks/2014
Virginia Tech, Blacksburg, VA 24061 USA
Outline
• Acknowledgments
• Why, what, who, how
• Improving, quality
• Related technical contributions
• DLs and DL curriculum
Acknowledgments
• Family, mentors, teachers, students
• Dissertations: Sung Hee Park, Venkat
Srinivasan, Seungwon Yang
• NSF: IIS-0535057, 0916733, 1319578
• All those working with ETDs
• NDLTD, including its Members, Board,
Committees, and Working Groups
Why, What, Who?
• Why?
– enhance graduate education
– expand global research collaboration
• What?
– help students communicate more effectively
– get ETDs for all TDs: next goal 5 million
– help make ETDs open, accessible, preserved
• Who?
– levels: students, faculty, staff, (grad) administrators
– professions: CS, IT, LIS, librarians, archivists
How?
• Authoring systems, tools, methods
• Data and auxiliary information management aids
• Metadata creation software and techniques
• Submission, approval, refinement workflows
• Local access and information management
• Sharing, disseminating, discovering
– OAI, data providers, harvesting
– Regional/national, global institutions
• Services: access, preservation, adding value
• Add back files
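The OAI harvesting step listed above can be sketched as follows. This is a minimal illustration, assuming the OAI-PMH `ListRecords` verb with the `oai_dc` metadata prefix; the base URL `http://example.edu/oai` and the sample response are hypothetical and hand-written, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Dublin Core element namespace used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def titles_from_response(xml_text):
    """Return the Dublin Core titles found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("{%s}title" % DC_NS)]


# Hypothetical response fragment (not taken from a real repository).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.edu:etd-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Thesis</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("http://example.edu/oai")
titles = titles_from_response(SAMPLE)
```

A real harvester would also follow resumption tokens and handle errors; both are omitted here to keep the sketch short.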
Improving – 1 of 2
• Context: Quality frameworks, references on quality
• Guidelines and documentation for all of this
• Works
– XML + PDF + raw/original representations
– Multimedia, software, simulations, websites, dynamic
content
• Data, auxiliary information, references/bibliographies
– Reproducibility
• Metadata
– Completeness: subject classification, faculty by role
– Authority info
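One way to make metadata completeness measurable is to count how many expected fields a record actually fills. The sketch below assumes a hypothetical list of required fields and a hypothetical record; neither comes from an NDLTD standard.

```python
# Hypothetical set of fields a complete ETD record is expected to carry.
REQUIRED_FIELDS = ("title", "author", "date", "subject", "advisor")


def completeness(record, required=REQUIRED_FIELDS):
    """Return the fraction of required fields that are present and non-empty."""
    filled = 0
    for field in required:
        value = record.get(field)
        if value is not None and str(value).strip():
            filled += 1
    return filled / len(required)


# Hypothetical record: subject is empty and advisor is missing.
record = {"title": "A Sample Thesis", "author": "J. Doe",
          "date": "2014", "subject": ""}
score = completeness(record)  # 3 of the 5 required fields are filled
```

Averaging this score over a collection gives a rough indicator of where metadata enhancement (for example, subject classification) would help most.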
Improving – 2 of 2
• Local services
– Training, assistance
– IR, archives, archival consortia
• Global services
– Browse, faceted search, full-text search
– Recommend, CLIR, CBIR, summaries, topics
– Linked data, hyperlinks, citation linking
– Alerts, notifications, RSS feeds, filtering
Information Life Cycle (adapted)
[Figure: cycle diagram of information activities. Labels shown: Authoring, Modifying, Classifying, Indexing, Mining, Storing, Retrieving, Distributing, Networking, Discovering, Filtering, Downloading, Using, Tagging, Recommending, Citing, Retention.]
Borgman et al. 1996
http://is.gseis.ucla.edu/research/dig_libraries/
Quality and the Information Life Cycle
[Figure: the information life cycle annotated with quality dimensions.
Phases: Active, Semi-Active, Inactive.
Stages and activities: Creation, Authoring, Modifying, Describing, Organizing, Indexing, Storing, Accessing, Filtering, Distribution, Networking, Seeking, Searching, Browsing, Recommending, Utilization, Retention, Mining, Archiving, Discard.
Quality dimensions: Accuracy, Completeness, Conformance, Timeliness, Similarity, Preservability, Pertinence, Significance, Accessibility, Relevance.]
Quality Dimensions
DL Concept: Dimensions of Quality
• Digital object: Accessibility, Pertinence, Preservability, Relevance, Similarity, Significance, Timeliness
• Metadata specification: Accuracy, Completeness, Conformance
• Collection: Completeness, Impact Factor
• Catalog: Completeness, Consistency
• Repository: Completeness, Consistency
• Services: Composability, Efficiency, Effectiveness, Extensibility, Reusability, Reliability
Digital Library Service Taxonomy
• Infrastructure Services
– Repository-Building, Creational: Acquiring, Cataloging, Crawling (focused), Describing, Digitizing, Federating, Harvesting, Purchasing, Submitting
– Repository-Building, Preservational: Conserving, Converting, Copying/Replicating, Emulating, Renewing, Translating (format)
– Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Measuring, Publicizing, Rating, Reviewing (peer), Surveying, Translating (language)
• Information Satisfaction Services: Browsing, Collaborating, Customizing, Filtering, Providing access, Recommending, Requesting, Searching, Visualizing
Improve related movements
• Make related efforts work for graduate
researchers, ETDs, and university ETD activities:
• Open access, institutional repositories
• Sharing references and citations: Zotero, …
• Sharing data, datasets, workflows; reproducible
science: reproducibleresearch.net, …
• Building author profiles: ORCID, ISNI, …
• Digital libraries and DL education (DL2014)
Related technical contributions
• Broadly: new/better systems, user/usage studies,
added services, improved practices
• Automatically assign topics or categories to ETDs
or to portions (e.g., chapters) to aid browsing and
(faceted) searching
• Build a union reference collection: by aiding
authors (e.g., Hiberlink) and/or by automatic ETD
text mining
• Enhanced information retrieval: cross-language IR, content-based IR (image/video/music) …
Topic determination
• Given a document, extract or generate
generalized description of its topics
• Statistical approaches, e.g., LDA
• Knowledge based approaches, e.g., Xpantrac
– Take a webpage or document
– Use portions of it to build queries to a knowledge
source (Web, Wikipedia, and ETD collection)
– Combine, analyze, and summarize the results
– Seungwon Yang, "Automatic Identification of Topic
Tags from Texts Based on Expansion-Extraction
Approach", Jan. 2014, Ph.D. dissertation,
http://hdl.handle.net/10919/25111
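The knowledge-based steps above (build queries from portions of a document, query a knowledge source, combine and summarize the results) can be illustrated with a toy sketch. This is not the Xpantrac implementation: the in-memory knowledge source, the stopword list, the chunk size, and all function names are assumptions made for illustration.

```python
from collections import Counter
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "in", "to", "for", "is", "are", "on", "with"}


def tokens(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_queries(document, size=3):
    """Split the document's tokens into consecutive chunks used as queries."""
    words = tokens(document)
    return [words[i:i + size] for i in range(0, len(words), size)]


def search(query, knowledge_source):
    """Toy retrieval: return entries sharing at least one term with the query."""
    wanted = set(query)
    return [text for text in knowledge_source if wanted & set(tokens(text))]


def topic_tags(document, knowledge_source, k=3):
    """Expand via retrieved texts, then extract the k most frequent terms."""
    counts = Counter()
    for query in build_queries(document):
        for hit in search(query, knowledge_source):
            counts.update(tokens(hit))
    return [term for term, _ in counts.most_common(k)]


# Hypothetical knowledge source standing in for the Web or Wikipedia.
KNOWLEDGE_SOURCE = [
    "digital libraries store electronic theses",
    "electronic theses and dissertations in digital libraries",
    "soccer results and league tables",
]
DOC = "Electronic theses in libraries"
tags = topic_tags(DOC, KNOWLEDGE_SOURCE)
```

The point of the expansion step is that tags can include terms, such as "digital" here, that never occur in the input document itself.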
ETD Classification: Venkat Srinivasan
• Enhance metadata by adding subject categories
• Hierarchical classification of ETDs (and chapters
thereof) using Library of Congress categories
• Training data
– OCLC’s WorldCat: records from 1M books have good
labels but little metadata; labels on ETDs not usable
– Results coming from queries each designed to
describe a category
– Need to balance negative and positive examples
throughout the LoC taxonomy
ETD Classification: Algorithm Pipeline
[Figure: pipeline diagram.
Training: Category Tree -> category label for each node used as query -> Google -> top 50 webpages (for each node in the tree) -> Web Document Sets -> Cleanup (stemming, stopword removal, etc.) -> Training Sets -> Naïve Bayes Classifiers.
Categorization: ETD Collection -> ETD metadata used for categorization -> level-wise categorization -> Categorized ETDs (ETDs categorized into a node of the category tree after classification) -> Browsing Interface.]
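Level-wise categorization with naive Bayes classifiers, as named in the pipeline, can be sketched in a few lines. The code below is an illustrative toy, not the dissertation's system: the two-level tree, the training documents (hand-written stand-ins for retrieved webpages), and all names are assumptions, and the cleanup step (stemming, stopword removal) is omitted.

```python
from collections import Counter, defaultdict
import math


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over bags of words."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, words):
        total_docs = sum(self.doc_counts.values())

        def score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            s = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                if w in self.vocab:  # ignore words never seen in training
                    s += math.log((counts[w] + 1) / denom)
            return s

        return max(self.doc_counts, key=score)


# Hypothetical category tree and training documents per node.
TREE = {"root": ["Science", "Technology"],
        "Science": ["Physics", "Chemistry"]}
TRAIN = {
    "Science": ["theory experiment nature".split()],
    "Technology": ["engineering machine design software".split()],
    "Physics": ["quantum particle energy experiment".split()],
    "Chemistry": ["molecule reaction acid compound".split()],
}


def subtree_docs(node):
    """Training documents for a node plus all of its descendants."""
    docs = list(TRAIN.get(node, []))
    for child in TREE.get(node, []):
        docs += subtree_docs(child)
    return docs


def classify(words, node="root"):
    """Descend the tree, training one classifier over each node's children."""
    path = []
    while TREE.get(node):
        docs, labels = [], []
        for child in TREE[node]:
            for doc in subtree_docs(child):
                docs.append(doc)
                labels.append(child)
        node = NaiveBayes().fit(docs, labels).predict(words)
        path.append(node)
    return path


path = classify("quantum energy of a particle".split())
```

A production system would train each node's classifier once and balance positive and negative examples per node; retraining on every call is done here only for brevity.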
Reference Extraction and Databasing
1. How can we implement metadata schema for
bibliographic information?
2. What machine learning methods are effective
to extract reference sections including
footnotes and chapter references?
Sung Hee Park, "Discipline-Independent Text
Information Extraction from Heterogeneous
Styled References Using Knowledge from the
Web", June 2013, VT CS Ph.D. dissertation
Dataflow of Reference Section Extraction
[Figure: dataflow diagram.
Training: Training data -> Feature Extraction -> Learning.
Extraction: ETD in PDF -> Pdf2txt -> Feature Extraction -> Reference Section Extraction -> Tagged data.]
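The feature extraction stage of this dataflow can be illustrated on plain text lines. The sketch below uses a hypothetical, hand-picked feature set and a simple heading rule in place of the learned model; it is not the method from the dissertation, and the sample lines are invented.

```python
import re

# Headings assumed to open a reference section (an illustrative list).
HEADINGS = {"references", "bibliography", "works cited"}


def line_features(line):
    """Surface features of one text line (hypothetical feature set)."""
    return {
        "starts_with_marker": bool(re.match(r"\s*(\[\d+\]|\d+\.)\s", line)),
        "has_year": bool(re.search(r"\b(19|20)\d{2}\b", line)),
        "has_page_range": bool(re.search(r"\b\d+\s*-\s*\d+\b", line)),
        "is_heading": line.strip().lower() in HEADINGS,
    }


def extract_reference_lines(lines):
    """Return lines after a reference heading that look like citations.

    A fixed rule stands in for the trained classifier in the dataflow.
    """
    in_refs, found = False, []
    for line in lines:
        feats = line_features(line)
        if feats["is_heading"]:
            in_refs = True
        elif in_refs and (feats["starts_with_marker"] or feats["has_year"]):
            found.append(line.strip())
    return found


# Hypothetical text as it might come out of a PDF-to-text step.
TEXT = [
    "1. Introduction",
    "Digital libraries grew in 1996.",
    "References",
    "[1] Doe, J. 2010. A sample title.",
    "[2] Roe, R. 2012. Another sample title.",
]
refs = extract_reference_lines(TEXT)
```

In a learned version, the feature dictionaries would be fed to a sequence or line classifier trained on tagged data, which is what lets the approach cope with footnotes and chapter-level reference lists.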
ETD References: System Architecture
[Figure: architecture diagram. Components: Users; Web App (e.g., ETD-db), https://github.com/VTUL/etddb2; ETD Repository; Extracting Reference Sections; Metadata with References; Searching, Browsing, Manipulating; Union ETD References?]
Discovery, Search Engines, Info. Retrieval
(to be extended for images, etc.)
[Figure: search diagram. A query Q and documents D enter Search; the best matches (Q with D) are selected, and Ranking produces the Results.]
Quality of many systems is low, with recall and precision each only around 0.5, rather than the ideal of precision 1 at recall 1.
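Precision and recall, the two measures mentioned above, can be computed from a result list and a set of relevant documents. The document identifiers below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents returned, 4 relevant overall, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5", "d6"])
```

Here half of the returned documents are relevant and half of the relevant documents are returned, the situation the slide describes for many systems.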
Search Module Detail
(features can be about text, images, …)
[Figure: Query Q and Document D1 are each mapped to a feature vector; a Similarity Function compares the two vectors to give S = Sim(Q, D1).]
• In CBIR (Content Based Image Retrieval), search is based on the visual content of images
– Color
– Shape
– Texture …
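The similarity function S = Sim(Q, D1) can take many forms. The sketch below assumes cosine similarity, one common choice that the slide does not prescribe, over hypothetical three-bin feature vectors that could stand for color histograms.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return document names ordered from most to least similar."""
    scored = [(cosine_similarity(query, vec), name)
              for name, vec in documents.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical feature vectors for a query and two documents.
Q = [1.0, 0.0, 1.0]
DOCS = {"D1": [1.0, 0.0, 1.0], "D2": [0.0, 1.0, 0.0]}
s = cosine_similarity(Q, DOCS["D1"])
order = rank(Q, DOCS)
```

The same ranking code works whether the vectors describe text terms, color, shape, or texture; only the feature extraction step changes.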
DL Definitions: Informal 5S
DLs are complex systems that
• help satisfy info needs of users (societies)
• provide info services (scenarios)
• organize info in usable ways (structures)
• present info in usable ways (spaces)
• communicate info with users (streams)
• Use this as: checklist, design guidelines, basis for
formal description, specification for software
implementation; e.g., Spaces help re GIS, VR
Digital Library Books
• Edward A. Fox and Jonathan P. Leidig, eds. Digital Library
Applications: CBIR, Education, Social Networks,
eScience/Simulation, and GIS. Morgan & Claypool Publishers, 2014,
175 p., http://dx.doi.org/10.2200/S00565ED1V01Y201401ICR032
• Edward A. Fox and Ricardo da Silva Torres, eds. Digital Library
Technologies: Complex Objects, Annotation, Ontologies,
Classification, Extraction, and Security. Morgan & Claypool, 2014,
205 p., http://dx.doi.org/10.2200/S00566ED1V01Y201401ICR033
• Rao Shen, Marcos Andre Goncalves, and Edward A. Fox. Key Issues
Regarding Digital Libraries: Evaluation and Integration. Morgan &
Claypool, 2013, 110 p.,
http://dx.doi.org/10.2200/S00474ED1V01Y201301ICR026
• Edward A. Fox, Marcos Andre Goncalves, and Rao Shen. Theoretical
Foundations for Digital Libraries: The 5S (Societies, Scenarios,
Spaces, Structures, Streams) Approach. Morgan & Claypool, 2012,
180 p., http://dx.doi.org/10.2200/S00434ED1V01Y201207ICR022,
supplementary website
https://sites.google.com/a/morganclaypool.com/dlibrary/
DL Curriculum Project
• NSF awards to VT and UNC-CH: CS and LIS
• Project server: http://curric.dlib.vt.edu/
• Wikiversity: http://en.wikiversity.org/wiki/Curriculum_on_Digital_Libraries
• Table 1: Core DL Curriculum
• Table 2: Information Retrieval Packages
• Table 3: LucidWorks Big Data Software
• Table 4: Multimedia Software
DL Curriculum Module Template
1. Module name
2. Scope
3. Learning objectives
4. 5S characteristics of the module
(streams, structures, spaces, scenarios,
society)
5. Level of effort required (in-class and
out-of-class time required for students)
6. Relationships with other modules
(flow between modules)
7. Prerequisite knowledge/skills
required (what the students need to
know prior to beginning the module;
completion optional; complete only if
prerequisite knowledge/skills are not
included in other modules)
8. Introductory remedial instruction
(the body of knowledge to be taught
for the prerequisite knowledge/skills
required; completion optional)
9. Body of knowledge (theory +
practice; an outline that could be used
as the basis for class lectures)
10. Resources (required readings for
students; additional suggested readings
for instructor and students)
11. Exercises / Learning activities
12. Evaluation of learning objective
achievement (graded exercises or
assignments)
13. Glossary
14. Additional useful links
15. Contributors (authors of module,
reviewers of module)
DL Curriculum Framework
[Figure: framework chart with three rows: COURSE STRUCTURE, CORE DL TOPICS, RELATED TOPICS.
Course structure: Semester 1: DL collections: development/creation; Semester 2: DL services and sustainability.
Topic groups shown:
– Digitization, Storage, Interchange
– Metadata, Cataloging, Author submission
– Digital objects, Composites, Packages
– Architectures (agents, buses, wrappers/mediators), Interoperability
– Spaces (conceptual, geographic, 2/3D, VR)
– Documents, E-publishing, Markup
– Multimedia streams/structures, Capture/representation, Compression/coding
– Bibliographic information, Bibliometrics, Citations
– Content-based analysis, Multimedia indexing
– Naming, Repositories, Archives
– Services (searching, linking, browsing, etc.)
– Archiving and preservation, Integrity
– Thesauri, Ontologies, Classification, Categorization
– Multimedia presentation, rendering
– Info. needs, Relevance, Evaluation, Effectiveness
– Intellectual property rights mgmt., Privacy, Protection (watermarking)
– Routing, Filtering, Community filtering
– Search & search strategy, Info seeking behavior, User modeling, Feedback
– Info summarization, Visualization]
DL Curriculum Modules - examples
• Module 1-b: History of digital libraries and
library automation
• Module 2-c: File Formats, Transformation,
and Migration
• Module 3-b: Digitization
• Module 4-b: Metadata
• Module 5-a: Architecture overviews
• …
Summary Scene
Conclusion: Improving together
• Who will help?
• What can we do?
• What knowledge and education is needed?
• What connections, integrations, collaborations can help with ETDs?
• Please comment and share! – Ed Fox
([email protected], http://fox.cs.vt.edu/talks/2014)