Transcript PPT
1
Open Mind Initiative
David G. Stork
Ricoh Silicon Valley
[email protected]
2
Outline
One-sentence description
Background
Open Mind Initiative
Sample projects
Relation to Open Source and to Data mining
Related efforts elsewhere
What do we do next Monday?
3
Open Mind Initiative
A collaborative framework (based on
Open Source methodology) for
developing “intelligent” software,
where...
» domain experts provide algorithms,
» tool developers provide software
infrastructure and tools, and
» non-expert ‘e-citizens’ provide raw data.
4
Background: Market need
Speech recognition
OCR
Web searching
......
Some software (e.g., common sense)
too costly for a single company to build
Background: E-community & Open Source Waves
GNU
SendMail
Linux
» 10M lines; 10M seats; doubling time ≈ 6 mo.; 10^5 contributors
Apache
» Half of all web servers
Beowulf
» Supercomputer power from networked PCs
Newhoo! dmoz.org
» Open web directory (527,991 sites, 10,943 editors, 82,003 categories)
Infomedia
» Open source encyclopedia
5
6
Growth of new software methods
1990: 10^5 programmers; 1995: Linux
1995: 10^6 web authors; 1999: Newhoo!
1999: 10^9 e-citizens; 2003: Open Mind
New communication allows communities and
collaboration, and thus new software
methods
Opportunities expand to less-skilled users
Background: Pattern recognition/intelligent systems
Recognizer = Theory + Model + Data
Theory excellent
Models depend on problem
Never enough data
» “the group with the most data wins”
» e.g., OCR
» ...
7
8
Background: Tools
Tools for customization/experimentation
» CSLU (Open Source)
» Nuance
» HTK
» S+
» ...
Non-experts can use these!
9
Background: Infrastructure
Collaborative software
Animals (Shapiro 75, Lo & Stork 99)
Answer Garden (Ackerman 90)
BBN UNIPEN data collection software
(Schwartz 97)
Infrastructure:
Relevance rating
DirectHit, Inc.
» improved web indexing by monitoring
users’ selections
FireFly
» target advertisements based on user
profile
Amazon.com
» book recommendations
10
11
Open Mind Initiative
Three main functions provided by
» Domain Experts
– fundamental algorithms, process control,
education/proselytizing, ...
» Tool developers
– software infrastructure, tools, ...
» e-citizens
– raw data, low-level bug reports, ...
12
Domain Experts
Provide algorithms (e.g., OCR, ...)
Provide general algorithms (e.g., Bayes nets, ...)
Process control, algorithm development and truthing
» detect outliers for review/rejection
» data “voting”
» catch trials
» signal detection theory (d’)
» method of limits
» two-alternative forced-choice hidden staircase
» bias avoidance
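Two of the process-control ideas above, data “voting” and the signal-detection sensitivity index d’, can be sketched in a few lines. This is only an illustrative sketch, not the Open Mind implementation: the function names, the 0.6 agreement threshold, and the 0.5 count correction are assumptions chosen for the example.

```python
from collections import Counter
from statistics import NormalDist


def vote(labels, min_agreement=0.6):
    """Return the majority label, or None when agreement is too low.

    A None result means the pattern should go to an expert for review.
    The 0.6 threshold is an assumed value, not taken from the slides.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= min_agreement else None


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Adding 0.5 to each count (an assumed, commonly used correction) keeps
    both rates strictly between 0 and 1, where the inverse normal CDF
    is defined.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)
```

A contributor who answers catch trials at chance gets d’ near 0, while a careful contributor gets a clearly positive value, so d’ could serve as one input when deciding whose data to accept.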
Trend to publish data and algorithms on the web
More university work will be done with Linux
Tool/infrastructure developers
Get maximum information for minimum
e-citizen effort (e.g., informative patterns)
Make it easy (fast) for contributors
Web infrastructure
Collaborative software (version control)
Reward contributors
13
14
e-citizens
Incentives
» benefits in used system
» fun (games: Marathon, MUDD, ...)
» recognition (post names by amount of info. accepted)
» general interest (note progress: data and performance)
» altruism/philanthropy (cf. OED, SETI, ...)
» education (linguistics in schools, ...)
» lottery
» money
» frequent flyer miles
1.5M inmates, 1M in nursing homes, ...
Sample Projects (1)
Handwritten isolated character OCR
Recognizer: simple neural net, decision tree,
nearest-neighbor, ...
Patterns presented on contributors’ browsers,
cached, ...
Synthetic data (rotate, skew, line thicken/thin)
Learning with queries (ask informative patterns);
each pattern more valuable than a sampled one
Cooperative improvement (submit characters over
internet, download improved OCR the next day)
Improved OCR
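“Learning with queries” means the host chooses which pattern to show a contributor rather than sampling at random. One common way to choose, shown here as a sketch under assumptions rather than the method the project used, is to ask about the pattern whose two most probable classes are closest (the smallest margin). The pattern names and probabilities below are made up for illustration.

```python
def most_informative(candidates):
    """Pick the unlabeled pattern the current recognizer is least sure about.

    `candidates` maps a pattern name to the recognizer's class
    probabilities for that pattern. The pattern with the smallest gap
    between its top two probabilities is returned.
    """
    def margin(probs):
        top, second = sorted(probs, reverse=True)[:2]
        return top - second

    return min(candidates, key=lambda name: margin(candidates[name]))


# Hypothetical recognizer outputs for three unlabeled characters.
candidates = {
    "pattern_a": [0.95, 0.03, 0.02],  # confident: little to learn
    "pattern_b": [0.48, 0.46, 0.06],  # ambiguous, e.g. a "4" vs. a "9"
    "pattern_c": [0.70, 0.20, 0.10],
}
query = most_informative(candidates)
```

Here the ambiguous `pattern_b` is selected, which is the sense in which a queried pattern can be worth more than a randomly sampled one.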
15
16
OCR example
[Figure: the Open Mind host exchanging handwritten character samples (e.g., “4” and “9”) with e-citizens]
Sample Projects (2)
Handwritten word recognition
Recognizer: “off the shelf”
Words scanned from handwritten docs
Three alternatives shown, best selected
by naive contributor (as in commercial
speech recognizers)
Improved handwritten OCR
17
Sample Projects (3)
Open Mind chatbot game
MUDD-like game
Goal: find the route through the castle
to the “human”
» choose the “most natural” paragraph
Linguistic information learned in
background
More natural interfaces
18
Sample Projects (4)
Common sense about computers
Facts
» programs compiled, interpreted, run, ...
» a mouse is a peripheral
» early versions of code are generally buggy
» COBOL is a programming language
More natural text interfaces
19
Sample Projects (5)
Open Mind chess/go
Chess/go = fast search + board scoring
Allow contributors to score positions
» weighted by FIDE chess rating/go dan
» weighted by score on on-line test
» weighted by “confidence”
port to multiple PCs (Beowulf) for speed
Improved beam search via improved
scoring (more humanlike style?)
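The weighting of contributor scores can be sketched as a weighted mean. The slide does not say how rating and confidence are combined, so the `weight` formula below (rating above an assumed floor, multiplied by a self-reported confidence) is purely hypothetical.

```python
def weight(rating, confidence, rating_floor=1000):
    """Hypothetical weight: rating above a floor, scaled by confidence (0..1)."""
    return max(rating - rating_floor, 0) * confidence


def combined_score(evaluations):
    """Weighted mean of position scores.

    `evaluations` is a list of (score, weight) pairs, one per contributor.
    """
    total_weight = sum(w for _, w in evaluations)
    if total_weight == 0:
        raise ValueError("no evaluations with positive weight")
    return sum(s * w for s, w in evaluations) / total_weight


# Two contributors score the same board position in [-1, 1].
evaluations = [
    (0.8, weight(2200, 1.0)),   # strong player, fully confident
    (-0.2, weight(1400, 0.5)),  # weaker player, half confident
]
score = combined_score(evaluations)
```

The stronger, more confident contributor dominates the result; the combined score could then be fed to the search as the board evaluation.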
20
Sample Projects (6)
Open Mind Animals
(Lam & Stork 99)
challenges:
» truthing
– insure valid animals
– name/synonym check
– insure data quality (“voting,” “accept if used”)
» bug reporting
– forwarding errors to domain experts
» crediting contributors
– ordered by amount contributed
– avoid ID clashes; allow anonymity
» query simplification
– reduce average number of queries/new animal
» tree simplification
– better taxonomy
– arbitrary branching factor
– tree reflects the structure of domain
» generalizable to other domains
– other forms of queries
» human-machine interface
– natural, show current query set (selectable)
» display progress
– number of animals, contributors, show tree
[Figure: tree of yes/no questions (2 legs? can fly? mane? feathers? can swim?) with animals at the leaves (elephant, dog, human, bat, horse, parrot)]
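The core of an Animals-style game is a binary tree of yes/no questions that grows whenever it guesses wrong. The following is a minimal sketch of that classic idea, not the Open Mind Animals code; the class and function names are assumptions, and it omits the truthing, crediting, and simplification steps listed above.

```python
class Node:
    """A question (internal node) or an animal name (leaf)."""

    def __init__(self, text, yes=None, no=None):
        self.text = text
        self.yes = yes
        self.no = no

    def is_leaf(self):
        return self.yes is None and self.no is None


def guess(node, answer):
    """Follow answers down the tree and return the leaf reached.

    `answer(question)` must return True for yes and False for no.
    """
    while not node.is_leaf():
        node = node.yes if answer(node.text) else node.no
    return node


def learn(leaf, new_animal, question, new_animal_answer):
    """Turn a wrong leaf into a question separating the old and new animal."""
    old, new = Node(leaf.text), Node(new_animal)
    leaf.text = question
    leaf.yes, leaf.no = (new, old) if new_animal_answer else (old, new)


# Start with one question, then teach the tree about parrots.
tree = Node("can fly?", yes=Node("bat"), no=Node("dog"))
parrot = {"can fly?": True, "feathers?": True}
wrong_leaf = guess(tree, lambda q: parrot[q])    # reaches "bat"
learn(wrong_leaf, "parrot", "feathers?", True)
```

Each contribution adds one question and one animal, so the number of queries per new animal depends on tree depth, which is why query and tree simplification appear among the challenges.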
21
22
Sample Projects (7)
Open Mind Investment Assistant (Lo & Stork 99)
[Figure: stock ticker symbols TOY, DOL, AMD, BTFD, K, XLNX, MAT, ALTR, ATT, BRDCY, GM, AAPL, F, DELL, MSFT, IBM]
23
Problems in Machine learning
Relative value of learning with queries vs. iid samples
Data truthing/outlier detection
Optimal learning strategies given...
» Bayes error
» probability of hostile data
» probability of data error
Learn reliability of e-citizens, individually and as a
group
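As one possible illustration of learning e-citizen reliability, each contributor's accuracy can be estimated from items with known answers and then used to weight their votes. This is a sketch under stated assumptions: the smoothing rule, the log-odds weighting (which treats contributors as independent), and the contributor names are all choices made for the example, not something the slides specify.

```python
import math
from collections import defaultdict


def reliability(correct, total):
    """Smoothed accuracy on items with known answers.

    Adding 1 to the correct count and 2 to the total keeps the estimate
    strictly between 0 and 1, even for contributors with few answers.
    """
    return (correct + 1) / (total + 2)


def weighted_vote(votes, reliabilities):
    """Combine labels, weighting each vote by the log-odds of reliability.

    `votes` maps contributor -> label; `reliabilities` maps
    contributor -> estimated probability of answering correctly.
    """
    tally = defaultdict(float)
    for who, label in votes.items():
        p = reliabilities[who]
        tally[label] += math.log(p / (1.0 - p))
    return max(tally, key=tally.get)


# Hypothetical contributors and their records on known-answer items.
reliabilities = {
    "ann": reliability(48, 50),  # about 0.94
    "bob": reliability(6, 10),   # about 0.58
    "cy": reliability(5, 10),    # 0.5, carries no weight
}
label = weighted_vote({"ann": "4", "bob": "9", "cy": "9"}, reliabilities)
```

Here one highly reliable contributor outvotes two near-chance ones. A hostile contributor with reliability below 0.5 would receive a negative weight under this scheme.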
24
Relation to Open Source
Open Source
• no e-citizens
• expert knowledge (C++filt,gdbm)
• machine learning irrelevant
• web infrastructure useful
• most work is directly
on the final software
• hacker culture (≈10^5)
Open Mind
• e-citizens crucial
• informal knowledge (read, hear)
• machine learning essential
• web infrastructure essential
• most work is on the
infrastructure
• e-citizen and business culture (≈10^9)
25
Relation to Data Mining
Data Mining
• type of data may not be available
for the project desired (e.g., OCR)
• no interactive queries
– slower learning
– ambiguities not resolved
• relatively fixed amount of data
• little or no e-citizen support
Open Mind
• data tailored to the project desired
(e.g., OCR)
• interactive queries
– faster learning
– ambiguities resolved
• new data encouraged
• e-citizen support
26
Open Mind project Taxonomy
[Table: projects (OCR, chess/go, common sense, comp c-s, dialog, speech, grammar, Animals) each rated H/M/L on benefit (World, OpenMind), ease/simplicity, and use of e-citizens]
27
Related efforts elsewhere
Speech
» Macrophone
» Human phoneme project
» Linguistic Data Consortium
» VoiceControl (Open Source speech for Linux)
» CSLU (Center for Spoken Language Understanding) Open
Source speech tools
OCR
» NIST, CEDAR, ARPA, UNIPEN
GNU dictionary
Newhoo!
28
It is inevitable
Need is here
Web is here
Theory/Machine learning is here
[Figure: e-citizens’ knowledge, Open Mind, intelligent systems]
This collaboration is going to happen!
» Less radical than Richard Stallman or Linus Torvalds...
29
Possible value to corporations
Most companies could never develop
most of this software, nor preserve a
competitive advantage through
proprietary software
Expand functionality/niches for all
Low-cost, possibly high-payoff research
Leverage university work
30
Technical Specifications
Language: Java
» Portable
Operating System: Linux
» Open Source, portable, multiprocessor version (Beowulf)
Data representation: Resource Description
Framework (RDF)
» Source: www.w3.org/RDF/
» Code: lxr.mozilla.org/mozilla/source/rdf/base/
» Docs: www.mozilla.org/rdf/doc/
31
Licenses
No license choice will satisfy everyone
» GNU: any linked code must include source and follow FSF
copyright -- “copyleft”
» FreeBSD: do whatever you like (can charge)
But... you cannot link GNU & FreeBSD!
Practical (not moral) decision
Open Source will benefit from competitive
commercialization
» BSD license best for Open Mind
32
What do we do next Monday?
Put up OpenMind.org
Demonstration projects: Open Mind Animals
Limited seeding (proselytizing)
Solicit projects; introduce domain experts to tool
developers
Get corporate donations (e.g., books, CDs, ...)
33
Summary
Open Mind
» Collaborative framework for developing
“intelligent systems”
» Experts, tool developers, e-citizens
Projects
Vision of the future
34
Questions/Comments...
Contact: [email protected]