Design a search engine for a specific website

Design a full-text search engine for a website based on Lucene
Presented by: Lijia Li, Yingyu Wu, Xiao Zhu
Outline
• Introduction
• Our goal
• System architecture
• Conclusion and future work
• Show demo
Introduction
• As the Internet has developed, the amount of information
on it has grown explosively, which makes the target
information harder to find. Search engines bring great
convenience to people looking for information, and the
Internet has become an indispensable tool.
Our goal
• In this project, our goal is to implement a full-text retrieval engine based on Lucene.
Full-text retrieval engine
• A full-text search engine indexes and searches the
entire text of each document.
• Features:
(1) The unstructured index file database
(2) Flexible retrieval methods
(3) Supports natural-language retrieval
(4) High retrieval efficiency
System Architecture
• The search engine provides a search service to
users. It has two main parts: online and offline.
[Figure: system architecture. Online part: Users (enter keyword), User Interface, analyzer, Search module (searches the Index File), Result sorting. Offline part: crawler (requests web pages from the website), Website database, Index module, Index File.]
Why Lucene
• The index file format is independent of the application platform
• Inverted index
• Object-oriented system architecture
• Chinese parsers (SmartChineseAnalyzer, IKAnalyzer)
• Provides a powerful set of query types (RangeQuery, FuzzyQuery, ...)
• Open source
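The inverted index named in the list above maps each term to the documents that contain it, so a search looks up terms instead of scanning every page. A minimal sketch in plain Java, assuming whitespace tokenization and lowercase normalization; the class TinyInvertedIndex and its methods are hypothetical illustrations, not Lucene's API or its index file format.

```java
import java.util.*;

public class TinyInvertedIndex {
    // term -> sorted set of ids of the documents containing that term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Tokenize on whitespace, lowercase, and record docId under each term.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Ids of the documents that contain every given term (AND semantics).
    public SortedSet<Integer> searchAll(String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> docs =
                    postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // copy, so postings stay untouched
            } else {
                result.retainAll(docs);         // intersect with the next term
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

Lucene additionally stores positions, frequencies and scoring data per term; the sketch keeps only document ids to show the core idea.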
Web Crawler
[Figure: Architecture of web crawler. Components: Collection of start URLs, URL analysis, Get robots.txt, Analysis of robots.txt, Unprocessed URL queue, Page fetch module, Internet, Page analysis module, Extract links, Page database.]
Workflow of web crawler
1. Put the initial URLs into the unprocessed URL queue
2. Take a URL from the head of the queue
3. Download the page at that URL
4. Extract the hyperlinks from the downloaded page
5. Add the extracted hyperlinks to the unprocessed URL queue
6. Check whether the unprocessed URL queue is empty;
if yes, the program terminates,
otherwise step 2 is executed again.
7. Loop
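The steps above can be sketched as a loop over a queue. This is a minimal sketch under stated assumptions, not the crawler used in the project: pages come from a hypothetical fetch function (so it runs without network access), links are found with a simple href regex, and a seen set is added to avoid queuing the same URL twice, which the slides do not mention. robots.txt handling is omitted.

```java
import java.util.*;
import java.util.function.Function;
import java.util.regex.*;

public class CrawlerSketch {
    // Naive link pattern; an assumption, real HTML needs a proper parser.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Returns the URLs in the order their pages were downloaded.
    // fetch maps a URL to its HTML, or to null if the page is unavailable.
    public static List<String> crawl(List<String> startUrls,
                                     Function<String, String> fetch) {
        Deque<String> unprocessed = new ArrayDeque<>(startUrls); // step 1
        Set<String> seen = new HashSet<>(startUrls);  // assumption: skip duplicates
        List<String> downloaded = new ArrayList<>();

        while (!unprocessed.isEmpty()) {              // step 6: stop when empty
            String url = unprocessed.poll();          // step 2: head of the queue
            String html = fetch.apply(url);           // step 3: download the page
            if (html == null) {
                continue;                             // nothing to analyze
            }
            downloaded.add(url);
            Matcher m = HREF.matcher(html);           // step 4: extract hyperlinks
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) {
                    unprocessed.add(link);            // step 5: enqueue new links
                }
            }
        }
        return downloaded;
    }
}
```

In a real crawler, fetch would wrap an HTTP client, and each URL would be checked against the site's robots.txt before being downloaded.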
Index
[Figure: Workflow of indexing. Nodes: A set of documents to be indexed; Read and analyze document; Whether indexed? (yes/no); Index date earlier than the creation date? (yes/no); Determine the type of document; Whether a parser of the same type exists? (yes/no); Call the corresponding document parser to parse the document; Parse document; Build index file.]
Document indexing steps
1. Create an IndexWriter instance
// create: true builds a new index (or overwrites one), false appends
IndexWriter writer = new IndexWriter(indexPath, analyzer, create,
maxFieldLength);
2. Create a Document record
Document doc = new Document();
3. Add Field objects to the Document record
doc.add(new Field(name, tokenStream));
4. Write the Document record to the index
writer.addDocument(doc);
5. Close the IndexWriter object to end indexing
writer.close();
Flow chart of searching
1. Start
2. Accept the search string from the user
3. QueryParser analyzes the search string and outputs a Query object
4. Set up the Searcher
5. The IndexSearcher object searches the Index File for related documents
6. Output the related documents
7. End
Example:
User input:
“大连理工 计算机” (Dalian University of Technology, computer),
“america ohio”
After QueryParser:
“大连理工” AND “计算机”,
“america” AND “ohio”
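The example suggests that whitespace-separated keywords are combined with AND, presumably because the parser's default operator is set to AND. The sketch below only reproduces that string transformation in plain Java; the class AndQuerySketch and the method toAndQuery are hypothetical and are not Lucene's QueryParser.

```java
import java.util.*;

public class AndQuerySketch {
    // Quotes each whitespace-separated keyword and joins them with AND.
    public static String toAndQuery(String userInput) {
        List<String> quoted = new ArrayList<>();
        for (String term : userInput.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                quoted.add("\"" + term + "\"");
            }
        }
        return String.join(" AND ", quoted);
    }
}
```

For the input america ohio this yields "america" AND "ohio". A real parser also tokenizes each term with the configured analyzer, which matters for Chinese text.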
Highlight search keywords
1. Get the position of the search keyword
2. Extract the text fragment around the search keyword, according to its position
3. Use HTML and CSS attributes to highlight the search keyword
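The three highlighting steps can be sketched in plain Java as follows. The assumptions are: only the first occurrence is highlighted, matching is case-insensitive, the fragment is a fixed number of characters on each side, and the CSS class name highlight is hypothetical. HTML escaping of the page text is left out, and Lucene's own highlighter module is not used here.

```java
import java.util.regex.*;

public class HighlightSketch {
    // Returns a fragment of text around the first occurrence of keyword with
    // the keyword wrapped in a span, or null if the keyword does not occur.
    public static String highlight(String text, String keyword, int context) {
        if (keyword.isEmpty()) {
            return null;
        }
        // Step 1: position of the keyword (literal, case-insensitive match).
        Matcher m = Pattern.compile(Pattern.quote(keyword),
                Pattern.CASE_INSENSITIVE).matcher(text);
        if (!m.find()) {
            return null;
        }
        int pos = m.start();
        int end = m.end();
        // Step 2: fragment of up to `context` characters on each side.
        int from = Math.max(0, pos - context);
        int to = Math.min(text.length(), end + context);
        // Step 3: wrap the keyword in an element that CSS can style.
        return text.substring(from, pos)
                + "<span class=\"highlight\">" + text.substring(pos, end) + "</span>"
                + text.substring(end, to);
    }
}
```

A style rule such as .highlight { background: yellow; } on the result page would then make the keyword stand out.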
Conclusion and future work
• What we learned through this project is how to
use a web crawler and Lucene to implement a
full-text search engine.
• Future work: making the system work on Hadoop
• Thank you!