A Linguistic Solution to Perfecting
Search Technology
A new tool to improve the filtering options in advanced searching
Fernando Moreno-Torres
General Manager
MTC Soft
www.mtcsoft.es
The Problem (I): different meanings
European Commission concrete
Results: 967.000
European Commission - Research: Industrial technologies - Robot ...
Concrete repair The permanence of our built environment can easily be taken for granted. Materials such as concrete,
that are used to construct our roads, ...
Concrete tasks or activities contained in White Paper Financial ...
European Commission End 2006. 46) Concrete actions in mortgage credit (follow-up White Paper) European
Commission 2007and ...
UNCTAD XI: European Commission calls for concrete action to assist ...
The European Commission will call for concrete global measures to support commodity dependent developing
countries and their producers at the 11 ...
European Commission – Information roadshow on the European ...
European citizenship… not just words, concrete rights! - Information roadshow organised by the European Union.
European Commission. ...
Madrid – April, 2009
The Problem (II): no relations between words
Tom Cruise bought house
Results: 169.000
Doctors not fans of Tom Cruise's baby gift - Women's health- msnbc.com
6 Dec 2005 ... Tom Cruise bought a sonogram machine so that he and girlfriend Katie Holmes can see their unborn ....
Video: From 'House' to White House ...
He shows them the money - Los Angeles Times
6 May 2007 ... Despite speculation that Tom Cruise and Katie Holmes are having marital problems, he just bought a
Beverly Hills home for his family that cost ... The one-story, 13000-square-foot house was built on speculation this
year …
Tom Cruise Buys $35 Million Beverly Hills Mansion - Katie Holmes ...
7 May 2007 ... ft. structure in 1937, the house was expanded four years ago and ... have bought a multi-million-dollar
house only two minutes away. They plan to move in July. Tom Cruise's $35 million Beverly Hills mansion Photo by:
Tom Cruise Bought Katie Holmes An Engagement Ring Right After ...
Tom Cruise bought wife Katie Holmes an engagement ring after their first date. .... Charlotte Church Kicks Her
Boyfriend Out Of House After Heated Argument ...
The problem
Several-word searches
Common words
Too many results, most of them not related to the query
Why?
Search engines don’t discriminate different meanings of words
Words are not related in the texts
The solution
Parsing and tagging the texts
- Discovering the real meaning of every word in every sentence
- Discovering the relations between words
The search engines can use this information:
to filter the results
to get accurate results
With these “enriched” texts we can give users new
possibilities to filter the search using additional options.
The advanced search query
Who?               Tom Cruise     The sentence subject        The actor
What does he do?   To buy         The sentence verb           The action
Concerning what?   House          The sentence direct object  The object
Where?             In Hollywood   (optional)                  The place
When?              Last Month     (optional)                  The time
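One way to picture the query form above is as a small record with three required slots and two optional ones. This is only a sketch: the class name `AdvancedQuery` and its field names are assumptions made for illustration and do not come from the MTC Soft software.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdvancedQuery:
    """Hypothetical container: one field per question in the query form."""
    subject: str                 # Who?             -> the sentence subject
    verb: str                    # What does he do? -> the sentence verb
    direct_object: str           # Concerning what? -> the sentence direct object
    place: Optional[str] = None  # Where? (optional)
    time: Optional[str] = None   # When?  (optional)

    def describe(self) -> str:
        """Spell the query out in words, skipping unset optional slots."""
        parts = [
            f"{self.subject} is the subject of a sentence in which "
            f"{self.verb} is the verb and {self.direct_object} is the direct object"
        ]
        if self.place:
            parts.append(f"place: {self.place}")
        if self.time:
            parts.append(f"time: {self.time}")
        return "; ".join(parts)


query = AdvancedQuery("Tom Cruise", "buy", "house", place="Hollywood", time="last month")
print(query.describe())
```

Leaving `place` and `time` unset gives the three-slot query used in the rest of the talk.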
The new query and the new results
Tom Cruise is the subject of a sentence in which buy is the verb
and house is the direct object
This query would find:
- Tom Cruise has bought a house…
- Tom Cruise and Katie Holmes will buy a new house ….
- Yesterday Tom Cruise bought two dogs and a red house…
This query would not find:
- Tom Cruise bought a dog and visited a house in …
- Tom Cruise is very famous. Madonna has bought an expensive house…
- Tom Cruise doesn’t house any friends, because he has bought a new car
and …
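The six examples above can be checked with a short sketch. It assumes the sentences have already been split into clauses and tagged by hand; the `Clause` structure and the `matches` function are hypothetical and only illustrate why requiring all three roles in the same clause keeps the first three sentences and rejects the last three.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clause:
    subjects: List[str]        # head words of the subject
    verb: str                  # root form of the verb
    direct_objects: List[str]  # head words of the direct object


def matches(clauses: List[Clause], subject: str, verb: str, obj: str) -> bool:
    """True if one single clause has all three words in the requested roles."""
    return any(
        subject in c.subjects and c.verb == verb and obj in c.direct_objects
        for c in clauses
    )


# Hand-tagged versions of the slide's examples (assumed annotations).
found = [
    [Clause(["Tom Cruise"], "buy", ["house"])],
    [Clause(["Tom Cruise", "Katie Holmes"], "buy", ["house"])],
    [Clause(["Tom Cruise"], "buy", ["dog", "house"])],
]
not_found = [
    # "house" is the object of "visit", not of "buy"
    [Clause(["Tom Cruise"], "buy", ["dog"]), Clause(["Tom Cruise"], "visit", ["house"])],
    # the buyer is Madonna
    [Clause(["Tom Cruise"], "be", []), Clause(["Madonna"], "buy", ["house"])],
    # "house" is used as a verb
    [Clause(["Tom Cruise"], "house", ["friend"]), Clause(["he"], "buy", ["car"])],
]

for clauses in found + not_found:
    print(matches(clauses, "Tom Cruise", "buy", "house"))
```

A plain keyword search would accept all six sentences, since each contains the three words.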
How did we make it?
A 10-year project
Based on the linguistic analysis from a philologist and professional
translator, working for the European Commission for more than 20 years
It’s not Statistical Natural Language Processing
It’s not Artificial Intelligence software
It’s not the work of software engineers
It’s the core of our new Automatic Translator (to be released next May 20th)
It’s a design of a philologist, constructed by software developers, MTC
Soft, over the last three years
We perform Word Sense Disambiguation using linguistic knowledge
Where it comes from
VICTOR Translator (SPANISH / ENGLISH)
PHASE 1: Symbol identification
PHASE 2: Grammatical Analysis
PHASE 3: Syntactic Analysis
PHASE 4: Translation
Parsalyser: PHASES 1 to 3
The translator has 4 main phases or modules.
The first 3, in English, analyze and tag every word of the text.
This is the “Parsalyser.”
How does it work?
Original text
Germany is supporting the
development of infrastructure
(especially energy and water
supplies), promotion of the economy
and employment, advising the PISG
on the restructuring of administrative
structures in the education and
vocational training area and the
improvement of the general
economic climate.
In just a few seconds
Tagged text
Germany is supporting the
development of infrastructure
(especially energy and water
supplies), promotion of the
economy and employment,
advising the PISG on the
restructuring of administrative
structures in the education and
vocational training area and the
improvement of the general
economic climate.
+ 140 Processes
+ 500 linguistic rules
The text is analyzed in every process
Every process does its work
- Word identification
- Capital letter identification
- “How” gets special treatment
- “Could” gets special treatment
- …
- Disambiguation:
  - Verb – Noun
  - Adjective – Pronoun
  - Verb – Preposition
  - …
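The idea that the whole text passes through every process, with each process doing one job, can be sketched as a simple pipeline. The three processes below are stand-ins named after items in the list above; their bodies, the `Token` layout and the function names are assumptions, not the actual processes of the Parsalyser.

```python
import re
from typing import Callable, Dict, List

Token = Dict[str, object]
Process = Callable[[List[Token]], List[Token]]


def identify_words(text: str) -> List[Token]:
    """Word identification: split raw text into word and symbol tokens."""
    return [{"text": t} for t in re.findall(r"\w+|[^\w\s]", text)]


def mark_capitals(tokens: List[Token]) -> List[Token]:
    """Capital letter identification."""
    for tok in tokens:
        tok["capitalized"] = str(tok["text"])[:1].isupper()
    return tokens


def mark_special_words(tokens: List[Token]) -> List[Token]:
    """Flag the words the slide says get special treatment."""
    for tok in tokens:
        tok["special"] = str(tok["text"]).lower() in {"how", "could"}
    return tokens


# A real system would chain far more processes; two are enough for the idea.
PROCESSES: List[Process] = [mark_capitals, mark_special_words]


def run_pipeline(text: str) -> List[Token]:
    tokens = identify_words(text)
    for process in PROCESSES:  # every process sees the whole token list
        tokens = process(tokens)
    return tokens


tokens = run_pipeline("How could Germany help?")
for tok in tokens:
    print(tok)
```

Keeping each process small makes it possible to add or reorder steps without touching the others.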
Phase 1 – Symbol identification
Identification of: words, symbols and expressions
Identification of root words (using our dictionary)
Assignment of possible grammar functions
Original text:
Tom Cruise has just bought a house that costs $35 million.
Words identified:
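As a rough illustration of what Phase 1 produces, the sketch below splits the sentence into words, symbols and expressions, looks up each root word, and lists the possible grammar functions without choosing between them. The toy `DICTIONARY`, its entries and the `identify` function are assumptions made for this example; they are not MTC Soft's dictionary or code.

```python
import re
from typing import Dict, List, Tuple

# Toy dictionary (hypothetical entries): form -> (root word, possible functions).
DICTIONARY: Dict[str, Tuple[str, List[str]]] = {
    "tom cruise": ("Tom Cruise", ["substantive"]),
    "has": ("have", ["verb"]),
    "just": ("just", ["adverb", "adjective"]),
    "bought": ("buy", ["verb"]),
    "a": ("a", ["article"]),
    "house": ("house", ["substantive", "verb"]),
    "that": ("that", ["pronoun", "conjunction", "adjective"]),
    "costs": ("cost", ["verb", "substantive"]),
    "million": ("million", ["substantive"]),
}

Token = Tuple[str, str, List[str]]  # (text, root word, possible grammar functions)


def identify(text: str) -> List[Token]:
    raw = re.findall(r"\$\d+|\w+|[^\w\s]", text)  # amounts, words, symbols
    tokens: List[Token] = []
    i = 0
    while i < len(raw):
        pair = " ".join(raw[i:i + 2])
        if pair.lower() in DICTIONARY:            # expression (or last single word)
            root, functions = DICTIONARY[pair.lower()]
            tokens.append((pair, root, functions))
            i += 2
            continue
        word = raw[i]
        if word.lower() in DICTIONARY:
            root, functions = DICTIONARY[word.lower()]
        elif re.fullmatch(r"\$\d+", word):
            root, functions = word, ["amount"]
        elif re.fullmatch(r"[^\w\s]", word):
            root, functions = word, ["symbol"]
        else:
            root, functions = word, ["unknown"]
        tokens.append((word, root, functions))
        i += 1
    return tokens


result = identify("Tom Cruise has just bought a house that costs $35 million.")
for token in result:
    print(token)
```

Words such as "house" and "costs" leave this phase with more than one possible function; resolving them is left to the next phase.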
Phase 2 – Grammatical Analysis
Disambiguation:
Performing an exhaustive linguistic analysis
Original text:
Tom Cruise has just bought a house that costs $35 million.
Text after this phase:
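A minimal sketch of rule-based disambiguation, under stated assumptions: each word arrives with its list of possible grammar functions, and context rules pick one. The two rules here are invented for illustration only; the talk mentions more than 500 linguistic rules, none of which are reproduced.

```python
from typing import List, Tuple

Candidate = Tuple[str, List[str]]  # (word, possible grammar functions)
Resolved = Tuple[str, str]         # (word, chosen grammar function)


def disambiguate(tokens: List[Candidate]) -> List[Resolved]:
    resolved: List[Resolved] = []
    for i, (word, candidates) in enumerate(tokens):
        previous = resolved[i - 1][1] if i > 0 else None
        if len(candidates) == 1:
            choice = candidates[0]      # nothing to decide
        elif previous == "article" and "substantive" in candidates:
            choice = "substantive"      # toy rule: article + noun/verb -> noun
        elif previous == "pronoun" and "verb" in candidates:
            choice = "verb"             # toy rule: pronoun + noun/verb -> verb
        else:
            choice = candidates[0]      # fallback: first listed function
        resolved.append((word, choice))
    return resolved


# Candidate order is chosen so that the fallback alone would get both words wrong.
fragment = [
    ("a", ["article"]),
    ("house", ["verb", "substantive"]),
    ("that", ["pronoun"]),
    ("costs", ["substantive", "verb"]),
]
tagged = disambiguate(fragment)
print(tagged)
```

The point of the sketch is that the decision comes from explicit linguistic context rules rather than from frequency counts.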
Phase 3 – Syntactic Analysis
Identification of the syntagms:
Both simple and complex
Both autonomous and subordinate
Identification of syntactic and semantic relations between them
Original text:
Tom Cruise has just bought a house that costs $35 million.
Syntactic functions: Subject, Verb, Object
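To make the three syntactic functions concrete, here is a deliberately naive sketch that splits the already-tagged example sentence into subject, verb and object. It assumes the grammar tags are given, handles only the main clause, and leaves the subordinate clause inside the object; the function name `split_svo` and the tag names are hypothetical.

```python
from typing import Dict, List, Tuple

Tagged = Tuple[str, str]  # (word, grammar function)


def split_svo(tagged: List[Tagged]) -> Dict[str, str]:
    """Naive split: words before the first verb, the verb group, the rest."""
    words = [t for t in tagged if t[1] != "symbol"]
    start = next(i for i, (_, function) in enumerate(words) if function == "verb")
    end = start
    while end < len(words) and words[end][1] in ("verb", "adverb"):
        end += 1  # extend over auxiliaries and adverbs: "has just bought"

    def join(part: List[Tagged]) -> str:
        return " ".join(word for word, _ in part)

    return {
        "subject": join(words[:start]),
        "verb": join(words[start:end]),
        "object": join(words[end:]),
    }


sentence = [
    ("Tom Cruise", "substantive"), ("has", "verb"), ("just", "adverb"),
    ("bought", "verb"), ("a", "article"), ("house", "substantive"),
    ("that", "pronoun"), ("costs", "verb"), ("$35", "amount"),
    ("million", "substantive"), (".", "symbol"),
]
functions = split_svo(sentence)
print(functions)
```

A full syntactic analysis would also identify simple and complex syntagms and the relations between them, which this sketch does not attempt.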
An example
Original text:
On the basis of an annual report from the Commission, it shall also examine
the effects of the special arrangements with regard to drugs, including the
progress in the fight against drugs made by countries listed in the second
annex and, if progress is insufficient, the Commission will consider taking any
measures to suspend in whole or in part the application of Article 7, in
accordance with the established procedure in Article 32 and after consulting
the country concerned.
An example
After Phase 1: Words identified:
An example
After Phase 2: Words disambiguated:
An example
After Phase 3: Syntactic functions
The Dictionary
In collaboration with: Facultad de Traducción e Interpretación, Granada University
7 professors
24 students
4 months
Dictionary Main Statistics:
- 106.634 English words
- 31.417 Fixed expressions
- 9.587 Open expressions
- 20.268 Idiomatic verbs
- Grammar functions:
  12.558 Adjectives
  4.388 Adverbs
  47.150 Substantives
  8.558 Verbs (2.590 phrasal verbs)
  2.136 acronyms
  2.217 incidental phrases
The Dictionary
10 Dictionaries
Every one has its type of record
Every word has its data
State of the “Beta”
1. Parser + Tagger = Parsalyser: 90%
2. Key words Indexes: 15%
3. Advanced Search Interface: 15%
2. Indexes
Key words Indexes
(Simplified example. In the real table, words are codified using the dictionary)

Word        Grammar function  Syntactic function  Sentence  File
Tom Cruise  1                 1                   1         123
Tom Cruise  1                 1                   3         234
Tom Cruise  1                 1                   12        345
Tom Cruise  1                 3                   2         12
Tom Cruise  1                 3                   23        134
Tom Cruise  1                 3                   2         567
Tom Cruise  1                 3                   12        1.234
Tom Cruise  1                 3                   33        2.345
buy         1                 1                   3         1.233
buy         1                 1                   124       3.432
buy         1                 2                   44        4.333
buy         3                 2                   3         234
buy         3                 2                   23        134
buy         3                 2                   2         345
buy         3                 3                   34        4.322
house       1                 2                   3         234
house       1                 2                   21        1.456
house       1                 3                   3         234
house       3                 3                   12        4.444

Grammar function: 1 = Substantive, 2 = Adjective, 3 = Verb
Syntactic function: 1 = Subject, 2 = Verb, 3 = Direct Object
File: 234, Sentence: 3
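The lookup that such an index makes possible can be sketched as a set intersection: collect the (file, sentence) pairs where each word appears in the requested syntactic function, then keep the pairs common to all three. The rows below are a subset copied from the simplified table (file numbers written without the thousands separator); the `Entry` tuple and the function names are assumptions for illustration.

```python
from collections import namedtuple
from typing import List, Set, Tuple

Entry = namedtuple("Entry", "word grammar syntactic sentence file")

# Syntactic function codes from the slide's legend.
SUBJECT, VERB, DIRECT_OBJECT = 1, 2, 3

# A few rows of the simplified index table.
INDEX = [
    Entry("Tom Cruise", 1, 1, 1, 123),
    Entry("Tom Cruise", 1, 1, 3, 234),
    Entry("Tom Cruise", 1, 3, 2, 12),
    Entry("buy", 1, 1, 3, 1233),
    Entry("buy", 3, 2, 3, 234),
    Entry("buy", 3, 2, 23, 134),
    Entry("house", 1, 2, 3, 234),
    Entry("house", 1, 3, 3, 234),
    Entry("house", 3, 3, 12, 4444),
]


def locations(word: str, syntactic: int) -> Set[Tuple[int, int]]:
    """(file, sentence) pairs where the word has the given syntactic function."""
    return {
        (e.file, e.sentence)
        for e in INDEX
        if e.word == word and e.syntactic == syntactic
    }


def search(subject: str, verb: str, obj: str) -> List[Tuple[int, int]]:
    """Sentences in which all three words occur in the requested roles."""
    hits = (
        locations(subject, SUBJECT)
        & locations(verb, VERB)
        & locations(obj, DIRECT_OBJECT)
    )
    return sorted(hits)


print(search("Tom Cruise", "buy", "house"))  # [(234, 3)]
```

The only sentence that survives the intersection is sentence 3 of file 234, which is the pair singled out on the slide.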
3. Advanced Search Interface
User Interface
There is a lot of work done in this area.
A first approach:
We have not developed the interface.
Just a simple window to test the system.
Even so, it is very powerful.
Query: Tom Cruise bought house
[X] Only texts with the words related
Just a simple “click” to filter most of the results
3. Advanced Search Interface
User Interface
Advanced searches:
For “advanced” users, a more powerful interface
Analyses the sentence
Proposes grammar functions
Proposes syntactic functions
Allows inflected forms for every word
It works!
Other uses
Plain texts
Germany is supporting the
development of infrastructure
(especially energy and water
supplies), promotion of the economy
and employment, advising the PISG
on the restructuring of administrative
structures in the education and
vocational training area and the
improvement of the general
economic climate.
Search Engines
Semantic Web
Data Mining
Statistical Analysis
…
Other uses
Plain texts
Germany is supporting the
development of infrastructure
(especially energy and water
supplies), promotion of the economy
and employment, advising the PISG
on the restructuring of administrative
structures in the education and
vocational training area and the
improvement of the general
economic climate.
Tagged text
Germany is supporting the
development of infrastructure
(especially energy and water
supplies), promotion of the
economy and employment,
advising the PISG on the
restructuring of administrative
structures in the education and
vocational training area and the
improvement of the general
economic climate.
The parser + tagger can be used for a lot of Internet-related tasks:
Search Engines
Semantic Web
Data Mining
Statistical Analysis
…
A new Standard?
Tagged text
Germany is supporting the
development of infrastructure
(especially energy and water
supplies), promotion of the
economy and employment,
advising the PISG on the
restructuring of administrative
structures in the education and
vocational training area and the
improvement of the general
economic climate.
The question is:
Why use “plain texts” to start the work with, if you can use “enriched texts”?
Change the “raw material” at the beginning.
These “enriched texts”:
- are created automatically (no human work)
- have a “universal” format (no standard to discuss)
Thank you for attending
Fernando Moreno-Torres
[email protected]
+34.609.575.000
+34.958.215.280
C/ Concepción, 47
18009 – Granada (Spain)
www.mtcsoft.es