Transcript PPT1

Information Retrieval
and the projects we have done
Group Members:
Aditya Tiwari (08005036)
Harshit Mittal (08005032)
Rohit Kumar Saraf (08005040)
Vinay Surana (08005031)
Guided by Prof. Pushpak Bhattacharyya
Motivation


The Web, documents, and encyclopedias all hold a
tremendous amount of data and information. As it
stands, that information serves only the intent of
whoever created or collected the data.
However, the same data can have other uses
as well. The need is to mine the
right information from the data and use it
appropriately.
Information Retrieval
Applications
Web search: Google, Yahoo
Querying/QA systems like Watson
(developed by IBM)
Spam filtering
Automatic summarization
Cross-lingual retrieval

en.wikipedia.org/wiki/Information_retrieval_applications
Information Retrieval

IR is the area of study concerned with
searching for documents and for
metadata about documents, as well as
searching relational databases and the
WWW.

The data objects that are collected can be images,
documents, videos, mind maps, or music.
en.wikipedia.org/wiki/Information_Retrieval
Wiki Mind Mapping
Harshit Mittal (IIT-B)
Aditya Tiwari (IIT-B)
Akhil Bhiwal (VIT University)
Project Idea

Represent textual information in a
graphical form that is easier to
understand and more intuitive to read.
The visual representation should be able
to summarize the text.
Research Goal

Use of phrases to represent semantic
information.

Hierarchical representation of the
information in a given text.
Mind maps
A mind map is a diagram used to
represent words, ideas, tasks, or other items
linked to and arranged around a central key
word or idea.
An example mind map is on the next slide.

http://en.wikipedia.org/wiki/Mind_maps
Mind map
http://www.spicynodes.org/blog/2010/05/21/stuff-we-like-climate-change-mind-maps/
What’s the difficult part?

We can’t put the information from an
article into a mind map as it is. That would
make the map incoherent and cluttered.

Phrase extraction

General rules of grammar don’t apply
here.
Possible Solution

Develop new linguistic rules for
representation of text in visual form.

Use existing summarization tools to
generate a summary and try to represent
that in a mind map.
How we did it.

Pulling the article out of the Wikipedia page,
section by section.

Parsing each section, sentence by sentence, using the
Stanford parser.

Extracting “relevant” phrases using Tregex
(another Stanford tool).

Putting these phrases into a mind map,
section by section.
http://nlp.stanford.edu/software/tregex.shtml
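
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: build_mind_map, print_map and toy_extractor are hypothetical names, the sentence splitter is a naive regular expression, and toy_extractor only stands in for the Stanford parser plus Tregex step.

```python
import re

def build_mind_map(title, sections, extract_phrases):
    """Build a three-level map: article title -> section name -> phrases.

    `sections` maps a section name to its text; `extract_phrases` is any
    callable that turns one sentence into a list of phrases (a stand-in
    for the parser + Tregex step, which is not reproduced here).
    """
    mind_map = {"label": title, "children": []}
    for name, text in sections.items():
        # Naive sentence split on ., ! or ? followed by whitespace.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        phrases = []
        for sentence in sentences:
            phrases.extend(extract_phrases(sentence))
        section_node = {
            "label": name,
            "children": [{"label": p, "children": []} for p in phrases],
        }
        mind_map["children"].append(section_node)
    return mind_map

def print_map(node, depth=0):
    """Print the map as an indented outline, one node per line."""
    print("  " * depth + node["label"])
    for child in node["children"]:
        print_map(child, depth + 1)

def toy_extractor(sentence):
    """Placeholder heuristic: capitalised words after the first word."""
    return [w.strip(".,") for w in sentence.split()[1:] if w[0].isupper()]

example = build_mind_map(
    "India",
    {"Etymology": "The name India is derived from Indus."},
    toy_extractor,
)
print_map(example)
```

Keeping the extractor as a parameter means the outline-building code does not change when the phrase heuristics are refined.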
Extraction of relevant information

Identifying the important subtrees in the parse tree of a
sentence.

This was done using a few heuristics like:
◦ Presence of a superlative adjective in a noun phrase
http://nlp.stanford.edu/software/tregex.shtml
Extraction of relevant information

Presence of a cardinal number in a noun
phrase
http://nlp.stanford.edu/software/tregex.shtml
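
The two noun-phrase heuristics (a superlative adjective or a cardinal number inside a noun phrase) could look roughly like the sketch below. The slides do not show the actual Tregex expressions, so this only assumes something in the spirit of the patterns NP < JJS and NP < CD (an NP that immediately dominates a JJS or CD node). The bracketed parse is hand-written for illustration, not real Stanford parser output, and all function names are hypothetical.

```python
def parse_tree(text):
    """Parse a Penn-Treebank-style bracketed string into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())        # nested subtree
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()

def leaves(node):
    """Return the words under a node, left to right."""
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def noun_phrases_with(node, tag):
    """Yield the text of every NP that immediately dominates a `tag` node."""
    label, children = node
    subtrees = [c for c in children if isinstance(c, tuple)]
    if label == "NP" and any(c[0] == tag for c in subtrees):
        yield " ".join(leaves(node))
    for c in subtrees:
        yield from noun_phrases_with(c, tag)

# Hand-written parse of a made-up sentence (an assumption, for illustration only).
tree = parse_tree(
    "(S (NP (DT The) (NN region)) (VP (VBZ has) "
    "(NP (NP (DT the) (JJS oldest) (NN temple)) (CC and) "
    "(NP (CD seven) (NNS rivers)))))"
)
print(list(noun_phrases_with(tree, "JJS")))  # superlative-adjective heuristic
print(list(noun_phrases_with(tree, "CD")))   # cardinal-number heuristic
```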
Extraction of relevant information

Matching a verb against the bag of verbs
that were considered relevant for a particular
article. For example, for the history section, verbs
like find, discover, settle, and decline were considered
“more useful” than verbs like derive and
deduce, which were considered useful for some
other section.
Extraction of relevant information
Example: The name India is derived from Indus.
http://nlp.stanford.edu/software/tregex.shtml
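
The verb heuristic can be sketched as a lookup against hand-coded verb bags, one per section. Everything below is an assumption made for illustration: the section name "etymology" for the derive/deduce bag (the slides only say "some other section"), the tiny lemma table that stands in for a real lemmatizer, and the hand-tagged example sentence.

```python
# Hypothetical section-specific verb bags; only the verbs come from the slides.
VERB_BAGS = {
    "history": {"find", "discover", "settle", "decline"},
    "etymology": {"derive", "deduce"},  # section name is an assumption
}

# Tiny hand-written lemma table standing in for a real lemmatizer.
LEMMAS = {
    "found": "find", "discovered": "discover", "settled": "settle",
    "declined": "decline", "derived": "derive", "deduced": "deduce",
}

def relevant_verbs(tagged_sentence, section):
    """Return lemmas of verbs in the sentence that are in the section's bag."""
    bag = VERB_BAGS.get(section, set())
    hits = []
    for word, tag in tagged_sentence:
        if tag.startswith("VB"):  # Penn Treebank verb tags: VB, VBD, VBN, ...
            lemma = LEMMAS.get(word.lower(), word.lower())
            if lemma in bag:
                hits.append(lemma)
    return hits

# Hand-tagged version of the example sentence from the slide.
sentence = [
    ("The", "DT"), ("name", "NN"), ("India", "NNP"), ("is", "VBZ"),
    ("derived", "VBN"), ("from", "IN"), ("Indus", "NNP"), (".", "."),
]
print(relevant_verbs(sentence, "etymology"))
print(relevant_verbs(sentence, "history"))
```

With these bags the example sentence is kept for the hypothetical etymology section and skipped for the history section.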
Code Generated Mind Map
Evaluation

http://en.wikipedia.org/wiki/Precision_and_recall
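
For reference, precision is the fraction of extracted items that are relevant, and recall is the fraction of relevant items that were extracted. The sketch below computes both over sets of phrases. How the project actually counted "extracted" and "relevant" items is not stated in the slides, so the set-based reading and the sample phrases p1..p10, x are assumptions for illustration only.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for two collections of items."""
    extracted, relevant = set(extracted), set(relevant)
    hits = extracted & relevant  # items that are both extracted and relevant
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up example: 4 phrases extracted, 3 of them among 10 relevant phrases.
extracted = ["p1", "p2", "p3", "x"]
relevant = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"]
precision, recall = precision_recall(extracted, relevant)
print(precision, recall)
```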
Evaluation

Survey based:

Asking a person to generate 10 questions
from a given article.

Asking another person to answer those
questions with the help of the mind map.

Repeating the same exercise with the roles
reversed for another article.
Observations

Pros:
◦ Extraction of the right information with high
accuracy.
◦ The concept of phrase extraction works well.
◦ High precision values were obtained (between
0.5 and 0.75).
Observations

Cons
◦ Information presented in a mind map of low depth
is cluttered.
◦ Low recall values (0.2 to 0.4).
◦ Linking node phrases to their apt
descriptions remains a problem.
◦ Heuristics defining “important phrases” need to
be refined.
Limitations

The bag of words and the Tregex expressions are
hand-coded instead of machine-learned.

Garbage phrases are being generated.

The hierarchy is limited to 3 levels.
Future work

Using machine learning to determine the
important keywords for a given sentence.

We want to explore the possibility of
finding patterns in subtree expressions
using a machine-learned approach.

Refinement of generated phrases.
References
http://en.wikipedia.org/wiki/Mind_maps
http://en.wikipedia.org/wiki/Precision_and_recall
Tools: Stanford Parser and Stanford Tregex Match
http://nlp.stanford.edu/software/tregex.shtml

Vision-Based Attribute
Segmentation from Lists in Web
Pages
by
Rohit Kumar Saraf