
COMP3740 CR32:
Knowledge Management
and Adaptive Systems
Overview and example KM exam questions
By Eric Atwell, School of Computing,
University of Leeds
S1: Eric Atwell
Office: 6.06a
[email protected]
S2: Vania Dimitrova
Office: 9.10p
[email protected]
http://www.comp.leeds.ac.uk/eric http://www.comp.leeds.ac.uk/vania
http://www.comp.leeds.ac.uk/nlp http://www.comp.leeds.ac.uk/agc/krgroup.html
Semester 1 Topics in KM
• Knowledge in Knowledge Management
– the nature of knowledge, definitions and different types
– Knowledge used in Knowledge Based Systems, KM systems
• Knowledge and Information Retrieval / Extraction
– Analysis of WWW data: Google tools, SketchEngine, BootCat
– IR: finding documents which match keywords / concepts
– IE: extracting key terms, facts (DB-fields) from documents
– Matching user requirements, advanced/intelligent matching
– Mining WWW as source of data and knowledge
• Knowledge Discovery
– Collating data in data warehouse; transforming and cleaning
– Cross-industry standard process for data mining (CRISP-DM)
– OLAP, knowledge visualisation, machine learning in WEKA
– Analysis of WWW-sourced data
Past Exam Papers?
• One way to see what you need to learn is to look at
past exam papers – this gives a “bird’s eye view”
• COMP3740 CR32 is a new module …
• BUT developed from
– COMP3410 Technologies for Knowledge Management
– COMP3640 Personalisation and User-Adaptive Systems
• For example, a past COMP3410 exam paper covers
some topics in CR32
Q1a: KM for bibliographic search
Serge Sharoff is a lecturer at Leeds University who has
published many research papers relating to
technologies for knowledge management, for
example: …
(i) Imagine you are asked to assess the impact of Dr
Sharoff’s research, by finding a list of papers by
other researchers which cite these publications.
Suggest three Information Retrieval tools you could
use for this task. State an advantage and a
disadvantage of each of these three IR tools for this
search task, in comparison to the other tools.
A1a: KM for bibliographic search
(i) Name 3 appropriate tools, e.g. Google Scholar, CiteSeer, ISI
Web of Knowledge, Google Books
An appropriate pro and con of each, eg:
Google Scholar:
Pro: wider coverage, all publications on open WWW;
Con: does not give full references, just URL and some details
Citeseer:
Pro: stores papers in several formats plus BibTeX references;
Con: coverage not as good, especially for interdisciplinary work
ISI Web of Knowledge or Web of Science:
Pro: good coverage of top journals including “paid-for”
Con: most papers in this field are not in top journals
Q1a (ii): KM doesn’t always work
Q: Suggest three reasons why citations for some papers
might not be found by any of your suggested IR tools
A: - Two of these papers are in Russian, citations may
also be; these tools focus on English-language papers;
- Papers in this field are mainly in conference/workshop
proceedings, not journals, hence less likely to be
indexed by IR tools (esp Web of Science)
- older papers may not be online, so less likely to be
found and cited by others
Q1b: Info Retrieval v Info Extraction
What is the difference between Information Retrieval
and Information Extraction? A Knowledge
Management consultancy aims to build a database of
all Data Mining tools available for download via the
WWW, including name, cost, implementation
language, input/output format(s), and Machine
Learning algorithm(s) included; should they use IR
or IE for this task, and why?
A1b: Info Retrieval v Info Extraction
IR: finding whole documents which match a query
IE: extracting data/info from a given text to populate
fields in database or knowledge-base records
Both IR and IE are appropriate:
this task requires IR to find DM-tool description
webpages from the whole WWW, but then finding the
specific details in each webpage is an "identifying fields
in records for DB population" task
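A quick illustrative sketch of the contrast (not part of the model answer): IR can be caricatured as returning whole documents that match query keywords, while IE pulls a specific field value out of one document. The toy document texts and the regular expression below are made up for illustration.

```python
import re

# Toy document collection (hypothetical texts, for illustration only)
docs = {
    "weka.html":  "WEKA is a data mining tool written in Java, free to download.",
    "rapid.html": "RapidMiner is a data mining tool; the starter edition costs nothing.",
    "news.html":  "Leeds United won again at the weekend.",
}

# IR: return the names of whole documents whose text contains all query terms
def retrieve(query):
    terms = query.lower().split()
    return [name for name, text in docs.items()
            if all(t in text.lower() for t in terms)]

# IE: extract one specific field (implementation language) from a given text,
# ready to populate a database record
def extract_language(text):
    match = re.search(r"written in (\w+)", text)
    return match.group(1) if match else None

print(retrieve("data mining tool"))         # ['weka.html', 'rapid.html']
print(extract_language(docs["weka.html"]))  # 'Java'
```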
Q1c: using relevance feedback to adapt a query
IR query finds “matching” documents.
The user may say some are not relevant.
Relevance feedback can guide the system to adapt the
initial query – new query finds “more of the same”
This may look complicated but it’s just putting the
numbers into the equation…
Relevance feedback example
• [4 marks: 1 for correct q vector, 1 for realising Σ sums a
single d vector, 1 for 3 weighted vectors, 1 for answer]
• q' = 0.5·q + 0.5·Σ di / |HR| − 0.5·Σ di / |HNR|
•    = 0.5·q + 0.5·d1 − 0.5·d4
•    = 0.5 × (1.0, 0.6, 0.0, 0.0, 0.0)
•      + 0.5 × (0.8, 0.8, 0.0, 0.0, 0.4)
•      − 0.5 × (0.6, 0.8, 0.4, 0.6, 0.0)
•    = (0.5, 0.3, 0.0, 0.0, 0.0)
•      + (0.4, 0.4, 0.0, 0.0, 0.2)
•      − (0.3, 0.4, 0.2, 0.3, 0.0)
•    = (0.6, 0.3, −0.2, −0.3, 0.2)
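As a quick check of the arithmetic, here is a minimal Python sketch of this weighted relevance-feedback update, using the q, d1 and d4 vectors from the worked example; the 0.5 weights follow the example, and everything else is purely illustrative.

```python
# Relevance-feedback update from the worked example above.
# HR contains only d1 and HNR only d4, so each sum collapses to one vector.
q  = [1.0, 0.6, 0.0, 0.0, 0.0]
d1 = [0.8, 0.8, 0.0, 0.0, 0.4]   # the relevant document
d4 = [0.6, 0.8, 0.4, 0.6, 0.0]   # the non-relevant document

w = 0.5                          # weight used in the worked example

q_new = [w * qi + w * di - w * dj for qi, di, dj in zip(q, d1, d4)]
print([round(x, 1) for x in q_new])   # [0.6, 0.3, -0.2, -0.3, 0.2]
```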
Q2: Knowledge processes
• “In 2008, Leeds University adopted the Blackboard Virtual Learning
Environment (VLE) to be used in undergraduate taught modules in all
schools and departments. In future, lectures and tutorials may become
redundant at Leeds University: if we assume that student learning fits
Coleman’s model of Knowledge Management processes, then the
Virtual Learning Environment provides technologies to deal with all
stages in this model. All relevant explicit, implicit, tacit and cultural
knowledge can be captured and stored in our Virtual Learning
Environment, for students to access using Information Retrieval
technologies.”
• Is this claim plausible? In your answer, explain what is meant by
Coleman’s model of Knowledge Management processes, citing examples
relating to learning and teaching at Leeds University. Define and give
relevant examples of the four types of knowledge; and state whether they
could be captured and stored in our VLE, and searched for via an
Information Retrieval system.
[20 marks]
even an “essay” has a marking scheme
• Key points:
• - Coleman process of knowledge gathering/acquisition: big problem would be data
capture and preparation
• - Coleman process of knowledge storage/organisation: KM/IR could be of great
benefit
• - Coleman process of knowledge refining/adding value: lectures aim at more than
“rote learning”
• - Coleman process of knowledge transfer/dissemination: students prefer human
factors of lectures?
• - Explicit Knowledge has been articulated
• - example: lecture notes, course handbooks
• - already captured, and already accessible via IR search
• - Implicit Knowledge hasn't been articulated (but could be)
• - example: extra material known to the lecturer but not on the handouts
• - could potentially be captured; accessible if in text form, e.g. transcripts
• - Tacit Knowledge can't be articulated but is applied "without thinking"
• - example: how to design and implement elegant programs
• - tacit knowledge cannot be captured, hence cannot be searched for via IR
• - Cultural Knowledge is shared norms/beliefs that enable concerted action
• - example: students cooperate in groupwork
• - written guidelines can be captured and retrieved, but not "group spirit"
Q3: Data Mining with WEKA
• Association rules link arbitrary features;
• e.g. (center = 0) => (color = 0) (100% - perfect
predictor);
• Classification rules predict the final feature (the class),
english=UK/US;
• e.g. (color < 3) => (english = UK) (100% - perfect
predictor)
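To make the "100% = perfect predictor" reading concrete, the sketch below checks the confidence of an association rule of the form (attribute = value) => (attribute = value) against a handful of instances. The data rows are hypothetical, not the module's actual arff file.

```python
# Hypothetical instances (NOT the module's real dataset):
# each row maps attribute -> value, with 'english' as the class attribute.
instances = [
    {"center": 0, "color": 0, "colorpercent": 20, "english": "UK"},
    {"center": 0, "color": 0, "colorpercent": 35, "english": "UK"},
    {"center": 1, "color": 4, "colorpercent": 75, "english": "US"},
]

def rule_confidence(rows, antecedent, consequent):
    """Confidence of the rule antecedent => consequent: of the rows that
    match the antecedent, the fraction that also match the consequent.
    1.0 (100%) means the rule is a perfect predictor on these rows."""
    matching = [r for r in rows if all(r[a] == v for a, v in antecedent.items())]
    if not matching:
        return 0.0
    hits = [r for r in matching if all(r[c] == v for c, v in consequent.items())]
    return len(hits) / len(matching)

# Association rule linking two arbitrary features: (center = 0) => (color = 0)
print(rule_confidence(instances, {"center": 0}, {"color": 0}))   # 1.0

# A classification rule such as (color < 3) => (english = UK) is checked the
# same way, except the antecedent is a threshold test rather than an equality.
```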
Simple decision tree
(colorpercent <= 40)?
– Yes => UK
– No => US
How to choose the root?
• aim to balance the decision tree: the best attribute is the one
which naturally splits instances into homogeneous
subtrees with the fewest errors. E.g. (colorpercent <= 40)
splits the training set into perfectly-predictive subsets.
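A minimal Python sketch of this "fewest errors" criterion: try each candidate decision point for a one-level (colorpercent <= t) split and count the misclassifications. The attribute values and class labels below are hypothetical stand-ins for the module's arff instances.

```python
# Hypothetical training instances: (colorpercent value, class label)
instances = [(20, "UK"), (35, "UK"), (40, "UK"), (60, "US"), (75, "US")]

def split_errors(threshold):
    """Errors made by the one-level tree:
    (colorpercent <= threshold) => UK, otherwise => US."""
    errors = 0
    for value, label in instances:
        predicted = "UK" if value <= threshold else "US"
        if predicted != label:
            errors += 1
    return errors

# Try a few candidate decision points; keep the one with fewest errors.
for t in (20, 40, 60):
    print(t, split_errors(t))   # 40 gives 0 errors: perfectly homogeneous subsets
```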
Confusion matrix
depends on decision-point given in (b); eg:
for (colorpercent <= 40) we get 2 wrong
classifications:
=== Confusion Matrix ===
a b <-- classified as
1 2 | a = UK
0 0 | b = US
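For completeness, a small Python sketch of how such a 2 x 2 confusion matrix is tallied from actual versus predicted class labels; the label lists are hypothetical, chosen only so the counts reproduce the matrix above.

```python
# Actual vs predicted labels, chosen so the counts match the matrix above
# (1 UK classified correctly, 2 UK misclassified as US, no US instances).
actual    = ["UK", "UK", "UK"]
predicted = ["UK", "US", "US"]

classes = ["UK", "US"]
# matrix[i][j] = number of instances of actual class i classified as class j
matrix = [[0, 0], [0, 0]]
for a, p in zip(actual, predicted):
    matrix[classes.index(a)][classes.index(p)] += 1

for row, cls in zip(matrix, classes):
    print(row, "| actual =", cls)
# [1, 2] | actual = UK
# [0, 0] | actual = US
```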
Supervised v unsupervised ML
• Supervised learning involves learning from example
instances with desired "answer" or classification, eg building
decision tree to predict the last attribute, English=UK/US,
given the arff instances;
• Unsupervised learning involves learning from example
instances but not being shown desired "answer" for each, eg
clustering instances into groups of similar documents on the
basis of discriminative feature-values, not including English
as the target class; this may yield another division of
documents.
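The contrast can be shown in a few lines of code. This sketch uses scikit-learn in Python rather than WEKA, purely for illustration, and the feature values and labels are made up: the classifier is given the desired answers y (supervised), while the clusterer sees only the feature vectors X (unsupervised).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # supervised learner
from sklearn.cluster import KMeans                 # unsupervised learner

# Hypothetical instances: [colorpercent, color], with english as the class
X = np.array([[20, 0], [35, 1], [60, 3], [75, 4]])
y = np.array(["UK", "UK", "US", "US"])

# Supervised: the desired "answer" y is given for every training instance
tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[30, 1]]))          # e.g. ['UK']

# Unsupervised: only X is given; instances are grouped by similarity,
# which may or may not coincide with the UK/US division
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)                          # e.g. [0 0 1 1] (cluster ids, not classes)
```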
Reminder: bird’s eye overview of KM
• Knowledge in Knowledge Management
• Knowledge and Information Retrieval /
Extraction
• Knowledge Discovery
January mock exam: Knowledge Management