HomePage_final2

Search for personal information using
Yahoo BOSS
by
Evgeny Dosychev
Dmitry Kichin
Supervisor: Eddie Bortnikov
HomePage Project


Finding personal information on the web is not
an easy task.
We want to create an automatic tool that will
find and present personal information for the
requested person.
Technical Issues

We need an effective way to find information on
the web.
We will use Yahoo BOSS.

Personal information on the web is not in a
standard format.
We focus on working with IEEE pdf documents.
Technical Issues
 How will we parse the info and identify the
different details?
PDF to text: using a dedicated Java package.
Using the standard structure of IEEE
documents.
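
As an illustration of relying on the standard structure of IEEE documents, the sketch below pulls the short summary out of already-extracted text. It assumes the common layout in which the abstract starts with "Abstract" followed by a dash or colon and ends at "Index Terms", "Keywords", or the "I. INTRODUCTION" heading. The class and method names are hypothetical and are not taken from the project code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AbstractFinder {
    // Assumed IEEE layout: "Abstract" + dash/colon, then the summary text,
    // ending at "Index Terms", "Keywords", "I. INTRODUCTION", or end of input.
    private static final Pattern ABSTRACT = Pattern.compile(
        "(?:Abstract|ABSTRACT)\\s*[\\u2014:-]+\\s*(.+?)"
            + "(?=Index\\s+Terms|Keywords|I\\.\\s+INTRODUCTION|\\z)",
        Pattern.DOTALL);

    /** Returns the abstract with whitespace collapsed, or empty if no marker is found. */
    static Optional<String> findAbstract(String text) {
        Matcher m = ABSTRACT.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        // Text extracted from a PDF keeps the column line breaks; collapse them.
        return Optional.of(m.group(1).replaceAll("\\s+", " ").trim());
    }

    public static void main(String[] args) {
        String text = "Some Title\nAbstract: We study name\nsearch on the web.\n"
            + "Index Terms: search\nI. INTRODUCTION\n...";
        System.out.println(findAbstract(text).orElse("(no abstract found)"));
    }
}
```

Documents that do not follow this layout simply yield an empty result, so the caller can skip the summary field for them.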

How will we avoid confusion between different
people with the same name (name ambiguity)?
Divide the info into clusters.
Let the user choose between the
clusters*.
Technologies
•
Java
Will be used to build the Windows desktop application.
•
Yahoo! BOSS
Provides free access to Yahoo search index.
•
PDFbox
Java library. Used for extracting text from PDF
documents
BOSS
•
Yahoo! Search BOSS (Build your Own Search
Service) is a Yahoo! initiative that gives the
developers free access to the Yahoo! Search index.
•
The results can be supplied into the developer's
application so that they can manipulate the
resources according to their needs.
•
Up to 500 results can be retrieved.
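
A possible way to respect the 500-result cap is to request results page by page and stop at the cap or when the service runs out of results. The sketch below does not call the real BOSS API: PageFetcher is a hypothetical stand-in for one search request, and the page size is an assumption supplied by the caller.

```java
import java.util.ArrayList;
import java.util.List;

class ResultPager {
    static final int MAX_RESULTS = 500; // cap stated on the slide

    /** Hypothetical stand-in for one search request; not the BOSS API. */
    interface PageFetcher {
        List<String> fetch(String query, int start, int count);
    }

    /** Collects up to min(wanted, MAX_RESULTS) result URLs, one page at a time. */
    static List<String> collect(String query, int wanted, int pageSize, PageFetcher fetcher) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int limit = Math.min(wanted, MAX_RESULTS);
        List<String> urls = new ArrayList<>();
        while (urls.size() < limit) {
            int count = Math.min(pageSize, limit - urls.size());
            List<String> page = fetcher.fetch(query, urls.size(), count);
            if (page.isEmpty()) {
                break; // the service has no more results
            }
            for (String url : page) {
                if (urls.size() < limit) {
                    urls.add(url); // never exceed the cap, even if a page is oversized
                }
            }
        }
        return urls;
    }
}
```

Keeping the request behind an interface also lets the paging logic be tested with a fake fetcher, without network access.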
Based on Wikipedia
HomePage functionality

Desktop Java application.
• Gets the search target from the user.
• Searches the web using Yahoo! BOSS.
• Downloads and parses PDF documents and
images, and produces an HTML page with the
information that was found.
(Currently: email, publication titles, short
publication summaries, images, and links to the full
document)
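
Of the extracted fields, the email address is the simplest to illustrate. The sketch below scans already-extracted document text with a deliberately simplified pattern. This is an assumption for illustration, not the project's actual parser, and it does not handle grouped forms such as {a,b}@example.org that some papers use.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailExtractor {
    // Simplified pattern (assumption): local part, "@", dot-separated domain labels.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    /** Returns the distinct addresses found in the text, lower-cased, in order of appearance. */
    static Set<String> extractEmails(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group().toLowerCase(Locale.ROOT));
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "A. Author (a.author@example.edu) and B. Writer\n"
            + "Contact: B.Writer@example.org";
        System.out.println(extractEmails(page));
    }
}
```

Lower-casing the addresses matters later, because the email serves as the cluster key.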
HomePage functionality
Divides the information into clusters (using
the email as the key)


Gets the user's choice to decide which info fits.

Produces HTML page with all the details.
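
The last step can be sketched as a small renderer that turns the chosen clusters into an HTML page. The class name, the map layout (email key to publication titles), and the page structure are assumptions for illustration. Escaping is included because titles taken from documents may contain characters such as < or &.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HtmlReport {
    /** Escapes the characters that are special in HTML text and attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders one section per chosen cluster: the email key, then its publication titles. */
    static String render(String person, Map<String, List<String>> chosenClusters) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>").append(escape(person))
            .append("</title></head><body>\n");
        html.append("<h1>").append(escape(person)).append("</h1>\n");
        for (Map.Entry<String, List<String>> cluster : chosenClusters.entrySet()) {
            html.append("<h2>").append(escape(cluster.getKey())).append("</h2>\n<ul>\n");
            for (String title : cluster.getValue()) {
                html.append("  <li>").append(escape(title)).append("</li>\n");
            }
            html.append("</ul>\n");
        }
        html.append("</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> chosen = new LinkedHashMap<>();
        chosen.put("a.author@example.edu", List.of("Search & Retrieval", "Name Ambiguity"));
        System.out.print(render("A. Author", chosen));
    }
}
```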
Screenshots
Clustering algorithm
It is very hard for a computer to resolve name ambiguity.
We leave this task to the user.
Each group of information items (cluster) is defined
by its key (email), and the user makes the choice.
The result page is produced from the chosen
clusters.
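
A minimal sketch of the grouping step, assuming each parsed document yields an item with an email and a title. The Item class and the method names are hypothetical. Items are grouped under their normalized email, and the result page would then be built only from the clusters the user picks.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class EmailClustering {
    /** Hypothetical record of the details found in one document. */
    static final class Item {
        final String email;
        final String title;

        Item(String email, String title) {
            this.email = email;
            this.title = title;
        }
    }

    /** Groups items by email (trimmed, lower-cased), keeping first-seen order of keys. */
    static Map<String, List<Item>> clusterByEmail(List<Item> items) {
        Map<String, List<Item>> clusters = new LinkedHashMap<>();
        for (Item item : items) {
            String key = item.email.trim().toLowerCase(Locale.ROOT);
            clusters.computeIfAbsent(key, k -> new ArrayList<>()).add(item);
        }
        return clusters;
    }

    /** Keeps only the items of the clusters whose keys the user selected. */
    static List<Item> select(Map<String, List<Item>> clusters, Collection<String> chosenKeys) {
        List<Item> chosen = new ArrayList<>();
        for (String key : chosenKeys) {
            chosen.addAll(clusters.getOrDefault(key, new ArrayList<>()));
        }
        return chosen;
    }
}
```

Two people with the same name but different addresses end up in separate clusters. One person who uses several addresses is also split, which is why the final choice is left to the user.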
Workflow
Class Diagram
Flow Diagram
Challenges
• PDFbox turned out to be unreliable and problematic. It
is not the best solution for PDF parsing.
• Perhaps the main challenge was the semantic parsing
(finding information in the text). We discovered that
semantic parsing is by itself a very difficult task that
requires time and resources beyond the project scope.
Conclusions
•
We learned the principles of the BOSS project and
used the power that it provides.
•
We prepared a well-designed object-oriented
infrastructure for the task.
•
HomePage can be a good infrastructure for adding
additional algorithms that find additional information
in the texts.
•
In order to extract and identify information from the
text, we need to use specific algorithms and
methods.