
1
Open Mind Initiative
David G. Stork
Ricoh Silicon Valley
[email protected]
2
Outline







One-sentence description
Background
Open Mind Initiative
Sample projects
Relation to Open Source and to Data mining
Related efforts elsewhere
What do we do next Monday?
3
Open Mind Initiative
A collaborative framework (based on
Open Source methodology) for
developing “intelligent” software,
where...
» domain experts provide algorithms,
» tool developers provide software
infrastructure and tools, and
» non-expert ‘e-citizens’ provide raw data.
4
Background: Market need
Speech recognition
OCR
Web searching
...
Some software (e.g., common sense) too costly for a single company to build

Background: E-community & Open Source Waves



GNU
SendMail
Linux
» 10M lines; 10M seats; doubling time ≈ 6 mo.; 10^5 contributors

Apache
» Half of all web servers

Beowulf
» Supercomputer power from networked PCs

Newhoo! dmoz.org

» Open web directory (527,991 sites, 10,943 editors, 82,003 categories)
Infomedia
» Open source encyclopedia
5
6
Growth of new software methods
1990: 10^5 programmers → 1995: Linux
1995: 10^6 web authors → 1999: Newhoo!
1999: 10^9 e-citizens → 2003: Open Mind


New communication allows communities and
collaboration, and thus new software
methods
Opportunities expand to less-skilled users
Background: Pattern recognition/intelligent systems
Recognizer = Theory + Model + Data
Theory excellent
Models depend on problem
Never enough data

» “the group with the most data wins”
» e.g., OCR
» ...
7
8
Background: Tools

Tools for customization/experimentation
» CSLU (Open Source)
» Nuance
» HTK
» S+
» ...

Non-experts can use these!
9
Background: Infrastructure
Collaborative software
Animals (Shapiro 75, Lo & Stork 99)
Answer Garden (Ackerman 90)
BBN UNIPEN data collection software (Schwartz 97)
Infrastructure: Relevance rating

DirectHit, Inc.
» improved web indexing by monitoring
users’ selections

FireFly
» target advertisements based on user
profile

Amazon.com
» book recommendations
10
11
Open Mind Initiative

Three main functions provided by
» Domain Experts
– fundamental algorithms, process control,
education/proselytizing, ...
» Tool developers
– software infrastructure, tools, ...
» e-citizens
– raw data, low-level bug reports, ...
12
Domain Experts



Provide algorithms (e.g., OCR, ...)
Provide general algorithms (e.g., Bayes nets, ...)
Process control, algorithm development and truthing
» detect outliers for review/rejection
» data “voting”
» catch trials
» signal detection theory (d’)
» method of limits
» two-alternative forced-choice hidden staircase
» bias avoidance
Trend to publish data and algorithms on the web
More university work will be done with Linux
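Two of the truthing techniques listed above, data “voting” and the signal detection sensitivity index d’, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the 60% agreement threshold, and the 0.5 smoothing correction are hypothetical choices, not part of the Open Mind design.

```python
from collections import Counter
from statistics import NormalDist


def majority_vote(labels, min_agreement=0.6):
    """Return the most common label, or None if agreement is too low.

    min_agreement is an assumed threshold; patterns returning None
    could be forwarded to a domain expert for review/rejection.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= min_agreement else None


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Adds 0.5 to each count (a common correction) so that rates of
    exactly 0 or 1 do not give infinite z-scores.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)


# Three e-citizens label the same handwritten character.
print(majority_vote(["4", "4", "9"]))
# A contributor's performance on catch trials with known answers.
print(round(d_prime(90, 10, 10, 90), 2))
```

A contributor answering at chance gets d’ near 0, so d’ measured on catch trials is one possible way to decide how much to trust that contributor's other labels.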
Tool/infrastructure developers





Get maximum information for minimum
e-citizen effort (e.g., informative patterns)
Make it easy (fast) for contributors
Web infrastructure
Collaborative software (version control)
Reward contributors
13
14
e-citizens

Incentives
» benefits in used system
» fun (games: Marathon, MUDD, ...)
» recognition (post names by amount of info. accepted)
» general interest (note progress: data and performance)
» altruism/philanthropy (cf. OED, SETI, ...)
» education (linguistics in schools, ...)
» lottery
» money
» frequent flyer miles
1.5M inmates, 1M in nursing homes, ...
Sample Projects (1)
Handwritten isolated character OCR





Recognizer: simple neural net, decision tree,
nearest-neighbor, ...
Patterns presented on contributors’ browsers,
cached, ...
Synthetic data (rotate, skew, line thicken/thin)
Learning with queries (ask informative patterns);
each pattern more valuable than a sampled one
Cooperative improvement (submit characters over
internet, download improved OCR the next day)
Improved OCR
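“Learning with queries” means asking contributors to label the patterns the current recognizer is least sure about, rather than randomly sampled ones. A minimal sketch follows, assuming the recognizer outputs class probabilities for each unlabeled pattern. The selection rule shown (smallest margin between the top two classes) is one common choice and is an assumption here, as are the function names and pattern IDs.

```python
def margin(probabilities):
    """Gap between the two most probable classes; a small gap means uncertain."""
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]


def select_queries(candidates, k):
    """Pick the k pattern IDs with the smallest margin.

    candidates: dict mapping a pattern ID to a list of class
    probabilities (at least two classes).
    """
    return sorted(candidates, key=lambda pid: margin(candidates[pid]))[:k]


# Hypothetical recognizer outputs for three unlabeled characters
# (probability of "4" vs. probability of "9").
candidates = {
    "img_a": [0.90, 0.10],
    "img_b": [0.55, 0.45],
    "img_c": [0.70, 0.30],
}
print(select_queries(candidates, 2))  # the two most ambiguous patterns
```

Patterns chosen this way would be the ones presented on contributors' browsers first.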
15
16
OCR example
[Figure: the Open Mind host and e-citizens exchanging handwritten “4” and “9” patterns]
Sample Projects (2)
Handwritten word recognition
Recognizer: “off the shelf”
Words scanned from handwritten docs
Three alternatives shown, best selected by naive contributor (as in commercial speech recognizers)
Improved handwritten OCR

17
Sample Projects (3)
Open Mind chatbot game
MUDD-like game
Goal: find the route through the castle to the “human”
» choose the “most natural” paragraph
Linguistic information learned in background
More natural interfaces

18
Sample Projects (4)
Common sense about computers

Facts
» programs compiled, interpreted, run, ...
» a mouse is a peripheral
» early versions of code are generally buggy
» COBOL is a programming language
More natural text interfaces
19
Sample Projects (5)
Open Mind chess/go
Chess/go = fast search + board scoring
Allow contributors to score positions
» weighted by FIDE chess rating/go dan
» weighted by score on on-line test
» weighted by “confidence”
Port to multiple PCs (Beowulf) for speed
Improved beam search via improved scoring (more humanlike style?)
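Combining contributors' position scores with a weight per contributor can be sketched as a weighted mean. The slide does not say how the weights (rating, test score, or confidence) would be combined, so this hypothetical function simply takes one already-chosen weight per evaluation.

```python
def weighted_position_score(evaluations):
    """Weighted mean of contributor scores for one board position.

    evaluations: list of (score, weight) pairs, where the weight might
    be a FIDE rating, an on-line test score, or a stated confidence.
    """
    total_weight = sum(weight for _, weight in evaluations)
    if total_weight <= 0:
        raise ValueError("at least one evaluation needs a positive weight")
    return sum(score * weight for score, weight in evaluations) / total_weight


# Two contributors score the same position; the stronger player counts more.
print(weighted_position_score([(1.0, 2400), (0.0, 1200)]))
```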

20
Sample Projects (6)
Open Mind Animals
(Lam & Stork 99)
[Figure: decision tree of yes/no questions (2 legs?, can fly?, mane?, feathers?, can swim?) with animals at the leaves (elephant, dog, human, bat, horse, parrot)]
Challenges:
» truthing
– ensure valid animals
– name/synonym check
– ensure data quality (“voting,” “accept if used”)
» bug reporting
– forwarding errors to domain experts
» crediting contributors
– ordered by amount contributed
– avoid ID clashes; allow anonymity
» query simplification
– reduce average number of queries/new animal
» tree simplification
– better taxonomy
– arbitrary branching factor
– tree reflects the structure of domain
» generalizable to other domains
– other forms of queries
» human-machine interface
– natural, show current query set (selectable)
» display progress
– number of animals, contributors, show tree
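The tree of yes/no questions behind an Animals-style game can be sketched as below. This follows the classic guessing-game scheme (walk the tree, and on a wrong guess replace the leaf with a new question) and is an assumption about the general approach, not the Open Mind Animals code. The class and function names are hypothetical, and truthing, crediting, and non-binary branching are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    text: str                      # a question (internal node) or an animal (leaf)
    yes: Optional["Node"] = None
    no: Optional["Node"] = None

    def is_leaf(self):
        return self.yes is None and self.no is None


def guess(root, answer):
    """Walk the tree; answer(question) returns True/False. Returns the leaf reached."""
    node = root
    while not node.is_leaf():
        node = node.yes if answer(node.text) else node.no
    return node


def learn(leaf, new_animal, question, new_animal_answer):
    """Turn a wrongly guessed leaf into a question separating it from new_animal."""
    old, new = Node(leaf.text), Node(new_animal)
    leaf.text = question
    leaf.yes, leaf.no = (new, old) if new_animal_answer else (old, new)


# A contributor is thinking of a parrot; the tree only knows bat and dog.
root = Node("can fly?", yes=Node("bat"), no=Node("dog"))
wrong = guess(root, lambda q: True)           # reaches "bat"
learn(wrong, "parrot", "feathers?", True)     # contributor supplies a new question
print(guess(root, lambda q: True).text)       # now reaches "parrot"
```

Each new animal costs the contributor one question, which is where the query-simplification and tree-simplification challenges come from.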
21
22
Sample Projects (7)
Open Mind Investment Assistant (Lo & Stork 99)
[Figure: ticker symbols TOY, DOL, AMD, BTFD, K, XLNX, MAT, ALTR, ATT, BRDCY, GM, AAPL, F, DELL, MSFT, IBM]
23
Problems in Machine learning



Relative value of learning with queries vs. iid samples
Data truthing/outlier detection
Optimal learning strategies given...
» Bayes error
» probability of hostile data
» probability of data error

Learn reliability of e-citizens, individually and as a
group
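One simple way to approach the last problem, learning the reliability of individual e-citizens and using it when combining their labels, is sketched below. The assumptions are loud ones: reliability is estimated as smoothed accuracy on catch trials, contributors err independently, and votes are weighted by log-odds of reliability. None of this comes from the slides, and all names are hypothetical.

```python
import math
from collections import defaultdict


def reliability(correct, total, prior_correct=1, prior_total=2):
    """Smoothed accuracy on catch trials; stays strictly between 0 and 1."""
    return (correct + prior_correct) / (total + prior_total)


def weighted_vote(votes, reliabilities):
    """Pick the label with the largest sum of log-odds weights.

    votes: dict contributor -> label
    reliabilities: dict contributor -> value strictly between 0 and 1
    """
    scores = defaultdict(float)
    for who, label in votes.items():
        p = reliabilities[who]
        scores[label] += math.log(p / (1 - p))
    return max(scores, key=scores.get)


rel = {
    "alice": reliability(19, 20),   # very reliable on catch trials
    "bob": reliability(5, 10),      # at chance, so weight 0
    "carol": reliability(6, 10),    # slightly better than chance
}
votes = {"alice": "4", "bob": "9", "carol": "9"}
print(weighted_vote(votes, rel))    # one reliable vote outweighs two weak ones
```

A contributor whose reliability falls below 0.5 gets a negative weight, which is a crude way of handling the “probability of hostile data” case listed above.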
24
Relation to Open Source
Open Source
• no e-citizens
• expert knowledge (C++filt,gdbm)
• machine learning irrelevant
• web infrastructure useful
• most work is directly
on the final software
• hacker culture (≈10^5)
Open Mind
• e-citizens crucial
• informal knowledge (read, hear)
• machine learning essential
• web infrastructure essential
• most work is on the
infrastructure
• e-citizen and business culture (≈10^9)
25
Relation to Data Mining
Data Mining
• type of data may not be available
for the project desired (e.g., OCR)
• no interactive queries
slower learning
ambiguities not resolved
• relatively fixed amount of data
• little or no e-citizen support
Open Mind
• data tailored to the project desired
(e.g., OCR)
• interactive queries
faster learning
ambiguities resolved
• new data encouraged
• e-citizen support
26
Open Mind project Taxonomy
[Table: projects (OCR, chess/go, common sense, comp c-s, dialog, speech, grammar, Animals), each rated H/M/L on Benefit (World, OpenMind), ease/simplicity, and use of e-citizens]
27
Related efforts elsewhere

Speech
» Macrophone
» Human phoneme project
» Linguistic Data Consortium
» VoiceControl (Open Source speech for Linux)
» CSLU (Center for Spoken Language Understanding) Open
Source speech tools

OCR

» NIST, CEDAR, ARPA, UNIPEN
GNU dictionary
Newhoo!

28
It is inevitable
Need is here
Web is here
Theory/Machine learning is here
[Figure labels: Intelligent systems; Open Mind; e-citizens’ knowledge]
This collaboration is going to happen!
» Less radical than Richard Stallman or Linus Torvalds...
29
Possible value to corporations
Most companies could never develop most of this software, nor preserve a competitive advantage through proprietary software
Expand functionality/niches for all
Low-cost, possibly high-payoff research
Leverage university work

30
Technical Specifications

Language: Java
» Portable

Operating System: Linux
» Open Source, portable, multiprocessor version (Beowulf)

Data representation: Resource Description
Framework (RDF)
» Source: www.w3.org/RDF/
» Code: lxr.mozilla.org/mozilla/source/rdf/base/
» Docs: www.mozilla.org/rdf/doc/
31
Licenses

No license choice will satisfy everyone
» GNU: any linked code must include source and follow FSF
copyright -- “copyleft”
» FreeBSD: do whatever you like (can charge)



But... you cannot link GNU & FreeBSD!
Practical (not moral) decision
Open Source will benefit from competitive
commercialization
» BSD license best for Open Mind
32
What do we do next Monday?

Put up OpenMind.org

Demonstration projects: Open Mind Animals
Limited seeding (proselytizing)
Solicit projects; introduce domain experts to tool developers
Get corporate donations (e.g., books, CDs, ...)



33
Summary

Open Mind
» Collaborative framework for developing
“intelligent systems”
» Experts, tool developers, e-citizens
Projects
Vision of the future

34
Questions/Comments...
Contact: [email protected]