What is a Search Engine?

NeuroSearch: A Specialised Search Engine for Neuroscience Web Pages
Fatma Y. ELDRESI (MPhil)
Systems Analysis / Programming Specialist, AGOCO
Part-time lecturer at the University of Garyounis,
[email protected]
Contents
Introduction
Components of NeuroSearch and its architecture
Implementation
Software lifecycle: (1) WebCrawler Engine, (2) Indexer Engine, (3) Query Engine, (4) Re-Crawler Engine (Specialised Crawler)
Challenges
Testing
Conclusions
Introduction
What is a search engine?
A server, or a collection of servers, dedicated to indexing web pages, storing the results, and returning lists of pages that match particular queries.
Conventional search engines generate their indexes in different ways:
•Google uses a spider
•Yahoo uses a directory
•“NeuroSearch” uses a spider together with the advance knowledge
Introduction (cont.)
Why is a specialised search engine needed?
Defining the problem
•The Web has no centralised organisation and holds a huge, mixed collection of information.
•It is updated continuously and has no standard format.
•Pages are extensively linked.
Therefore, establishing standard measures of relevance is a very challenging task.
In addition:
(1) users find it hard to choose relevant keywords;
(2) professionals sometimes fail in their searches and get disappointing results, because the retrieved pages are sometimes
A. unrelated to the query, or
B. different from what they are looking for.
The objective
•Create a specialised search engine (i.e. one driven by advance knowledge) to read web documents.
•Index and update all the content on the local server.
•Answer queries from the local database.
•Update the system at regular intervals.
Components of “NeuroSearch”
NeuroSearch has two components:
1. Search/Crawler Engine
2. Query Engine
Components explained
•Query Engine: Retriever
•Crawler Engine: Spider, Indexer, Re-crawler
“NeuroSearch” Architecture Model
[Architecture diagram. Components shown: Users, Search Engine Interface, Query Engine, Index, Indexer, Re-Crawler, WebCrawler, World Wide Web (WWW).]
Implementation and Case Study
•Creating the database using an Access database.
•Implementing all parts of “NeuroSearch” using Java and SQL.
NeuroSearch Database
[Database diagram. Labels shown: Advance Knowledge data, WebCrawler data, Re-crawler data, Indexer data, Query, Data.]
The advance knowledge
Case study: Neuroscience (Vision)
Phase 1: NeuroSearch uses advance knowledge about neuroscience (vision) as a case study.
Phase 2: Data mining is then applied to this domain knowledge of vision to construct the keywords and the relations between them.
Phase 3: This knowledge is stored in the database and categorised by numbers; related knowledge is categorised too and stored in the database in the form of a data network.
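The idea of numbered categories plus a network of related keywords can be sketched as follows. This is only an illustration in Python (the original system is implemented in Java and SQL and keeps this data in its database). The category numbers, the relations, every keyword other than "eye", and the function names are hypothetical.

```python
# Hypothetical advance knowledge: keywords filed under numbered categories.
CATEGORIES = {
    1: {"eye", "retina", "cornea"},
    2: {"visual cortex", "optic nerve"},
}

# Hypothetical relations between keywords, stored as an undirected network.
RELATIONS = [
    ("eye", "retina"),
    ("retina", "optic nerve"),
    ("optic nerve", "visual cortex"),
]


def related_keywords(keyword):
    """Return the keywords directly linked to `keyword` in the network."""
    linked = set()
    for first, second in RELATIONS:
        if first == keyword:
            linked.add(second)
        elif second == keyword:
            linked.add(first)
    return sorted(linked)


def category_of(keyword):
    """Return the category number a keyword is filed under, or None."""
    for number, keywords in CATEGORIES.items():
        if keyword in keywords:
            return number
    return None
```

With this layout, a lookup such as `related_keywords("retina")` gives the neighbouring keywords that a query could be widened to.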
Software lifecycle
The Crawler Engine consists of:
1. WebCrawler/Spider Engine
2. Indexer Engine
3. Re-Crawler (specialised)
WebCrawler (Spider)
1) This is a general web crawler that can download any kind of web page.
2) It fetches a URL, retrieves all of its web pages, and saves them on the local drive.
3) In addition, the WebCrawler has to pass through the proxy firewall (i.e. on the Newcastle University LAN) before downloading any website.
4) The crawler performs a breadth-first search, which means it collects a list of all the links on the current page before it follows any of them to a new page.
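A minimal Python sketch of the breadth-first order described above (the original crawler is written in Java). The function name, the `fetch_links` callback, the `max_pages` limit, and the toy link graph are all assumptions made for illustration; a real crawler would download each page and parse its links.

```python
from collections import deque


def crawl_breadth_first(start_url, fetch_links, max_pages=100):
    """Visit pages in breadth-first order and return the visit order.

    `fetch_links(url)` is a caller-supplied function that returns the
    URLs linked from the page at `url`.
    """
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        # Collect every link on the current page before following any of them.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Toy link graph standing in for real web pages.
SITE = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(crawl_breadth_first("a", lambda url: SITE.get(url, [])))
```

The queue is what makes the traversal breadth-first: links found on a page wait behind all links found earlier, so both pages linked from "a" are visited before "d".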
WebCrawler: the real challenge
Challenges 1 and 2: connecting to the WWW and accessing private websites.
Solution 1: The crawler's socket has to connect to the proxy server first; the connection is then extended from the proxy to the WWW.
Solution 2: With a direct socket connection, the GET method takes just the file name. In this case, however, the GET command has to take the full URL.
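The difference between the two GET forms can be sketched in Python as follows (the original crawler is written in Java). The function name and the example.org URL are hypothetical, and the sketch only builds the request text; it does not open the socket to the proxy.

```python
from urllib.parse import urlsplit


def build_get_request(url, via_proxy):
    """Return the raw HTTP/1.0 GET request text for `url`."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # Direct connection: the request line carries only the path.
    # Through a proxy: the request line carries the full URL.
    target = url if via_proxy else path
    return f"GET {target} HTTP/1.0\r\nHost: {parts.netloc}\r\n\r\n"


print(build_get_request("http://example.org/vision/index.html", via_proxy=True))
```

The proxy needs the full URL because the socket is connected to the proxy itself, so the request line is the only place that names the destination server.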
Indexer Engine
1) First, it searches the web page using its advance knowledge. The web page is deleted if it is not related to the case study subject.
2) If the page is related to the case study subject (neuroscience), the indexer collects the following information from the document:
3) all the keywords it contains, how many times each is repeated, the title, and the contents. It then saves them in the database for later display in the query results and for other calculations.
4) The ranking method
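Steps 1 to 3 can be sketched in Python as follows (the original indexer is written in Java and saves its records to the database). The function name, the record layout, and the sample keywords are assumptions. The sketch matches single-word keywords only and does not cover the ranking method, which the slides do not describe.

```python
import re
from collections import Counter


def index_page(title, text, knowledge_keywords):
    """Return an index record for a page, or None if the page is unrelated."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word in knowledge_keywords)
    if not counts:
        # No advance-knowledge keyword found: the page would be discarded.
        return None
    return {"title": title, "contents": text, "keyword_counts": dict(counts)}


record = index_page(
    "The eye",
    "The eye focuses light on the retina. Each eye has one retina.",
    {"eye", "retina"},
)
print(record["keyword_counts"])
```

Returning None for an unrelated page keeps the relevance filter and the keyword counting in one pass over the text.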
Query Engine
•It has an interface that accepts keywords from the user.
•It searches for the query keywords in the index database and retrieves the results in HTML format.
•It gives the user two choices: display only the most relevant results, or display the whole result set, which includes the related results.
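A minimal Python sketch of such a query step (the original query engine is written in Java and reads the index database with SQL). It rests on several assumptions: the index is an in-memory dictionary, "related results" are pages matching related keywords, and results are ordered by keyword count. All names and URLs are hypothetical.

```python
from html import escape


def search(index, keyword, related=(), include_related=False):
    """Return (url, title) pairs ordered by how often the keywords occur.

    `index` maps each URL to a record with "title" and "keyword_counts".
    """
    terms = [keyword] + (list(related) if include_related else [])
    hits = []
    for url, record in index.items():
        score = sum(record["keyword_counts"].get(term, 0) for term in terms)
        if score > 0:
            hits.append((score, url, record["title"]))
    # Highest score first; ties are broken by URL for a stable order.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [(url, title) for _, url, title in hits]


def render_html(results):
    """Render the results as an HTML list of links."""
    items = "".join(
        f'<li><a href="{escape(url)}">{escape(title)}</a></li>'
        for url, title in results
    )
    return f"<ul>{items}</ul>"
```

Calling `search` with `include_related=False` corresponds to the first choice offered to the user, and `include_related=True` to the second.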
Query Result:
This is indeed an edge over other conventional search engines.
Re-Crawling
1) The re-crawler is a WebCrawler specialised in any subject defined by the advance knowledge in the database. It achieves this by reading the URLs from the index database using SQL.
2) Its interface lets special users decide whether to continue crawling a website or cancel it.
3) This part of the software aims to update the index when new links are found.
This makes it easier to search and crawl websites related to any “advance knowledge” subject.
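Reading the indexed URLs with SQL and adding newly found links can be sketched as follows. This Python example uses an in-memory SQLite database as a stand-in for the original Access database; the table name `page_index`, its single `url` column, the function names, and the example.org URLs are all hypothetical.

```python
import sqlite3


def urls_to_recrawl(connection):
    """Read the distinct page URLs already stored in the index table."""
    rows = connection.execute("SELECT DISTINCT url FROM page_index ORDER BY url")
    return [url for (url,) in rows]


def add_new_links(connection, links):
    """Insert links that are not yet indexed and return the new ones."""
    known = set(urls_to_recrawl(connection))
    new_links = [link for link in dict.fromkeys(links) if link not in known]
    connection.executemany(
        "INSERT INTO page_index (url) VALUES (?)",
        [(link,) for link in new_links],
    )
    connection.commit()
    return new_links


# In-memory database standing in for the real index database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page_index (url TEXT)")
db.execute("INSERT INTO page_index (url) VALUES ('http://example.org/eye')")
```

A re-crawl would visit each URL from `urls_to_recrawl`, extract its links, and pass them to `add_new_links` so that only unseen pages enter the index.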
Testing phase
The test phase compares the first 10 ranked results of “NeuroSearch” for each query with the first 10 results of another search engine, such as Google, for the same query.
Keyword categories tested:
•general keywords
•specific keywords
•abbreviation keywords
•combined keywords
•abbreviation & combined keywords
20 tests for each category
Total of 1000 tests
Testing (cont.)
Ranking query test results for general keywords:
Header labels: First 10 results | Google: Rank, Keyword repeated | NeuroSearch: Rank, Keyword repeated, Related keyword repeated, Quality/percentage
1 | 0 | 0 | 0 | 10 | 1 | 3 | 53 | 3 | 37%
2 | 10 | 1 | 3 | 10 | 1 | 3 | 51 | 3 | 27%
3 | 0 | 0 | 0 | 10 | 1 | 3 | 37 | 3 | 36%
4 | 0 | 0 | 0 | 10 | 1 | 3 | 37 | 3 | 33.6%
5 | 0 | 0 | 0 | 10 | 1 | 3 | 34 | 3 | 36.7%
6 | 0 | 0 | 0 | 10 | 1 | 3 | 29 | 3 | 38.4%
7 | 0 | 0 | 0 | 10 | 1 | 3 | 28 | 3 | 38.1%
8 | 0 | 0 | 0 | 10 | 1 | 3 | 28 | 3 | 38%
9 | 0 | 0 | 0 | 10 | 1 | 3 | 28 | 3 | 24.9%
10 | 0 | 0 | 0 | 10 | 1 | 3 | 28 | 3 | 13.8%
Average % | 10% | 10% | 100% | 100%
Table 1: (Query 1) Ranking query test results for general keywords (Eye)
Testing (cont.)
[Chart 1: Average keyword performance for category-based test results (Google). Chart title: "The Average Ranking Performance Engine Query test results (Category based)". Vertical axis: ranking performance, 0 to 100. Horizontal axis: categories 1 to 5. Error bars: +/- 1 standard deviation. Data labels shown: 80.96, 48.99, 36.66, 6.33, 1.99.]
[Chart 2: Average keyword performance for category-based test results (NeuroSearch). Chart title: "The Average Keyword Performance Engine Query test results (Category based)". Vertical axis: ranking performance, 0 to 100. Horizontal axis: categories 1 to 5. Error bars: +/- 1 standard deviation. Data labels shown: 92.33, 88.49, 92.99, 98.16, 79.49.]
Analysing the search engines' ranking results by category
Independent-samples t-test: Google search engine vs. NeuroSearch search engine
Category | t-value | Sig. (2-tailed) | df (degrees of freedom) | Result
General keywords | -16.920 | .000 | 9 | Statistically significant
Specific keywords | -4.394 | .000 | 19 | Statistically significant
Abbreviation keywords | -63.50 | .000 | 19 | Statistically significant
Combined keywords | -3.387 | .003 | 19 | Statistically significant
Abbreviation, combined and specific keywords | -2.904 | .009 | 19 | Statistically significant
Table 4. The average ranking performance of the engines in the query test results, category based
Analysing the average ranking performance of the engines (category-based query test results)
t-test: used to compare two groups' scores on the same variable.
Result analysis: the p-value is below .05 in every category. This indicates that NeuroSearch has a statistically significantly higher mean score for ranking results across all categories (100) than Google (52.35).
Result analysis (cont.): the negative t-values reflect the direction of the difference. Google's mean score is lower than NeuroSearch's in every category.
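For reference, the t statistic of an independent-samples test can be computed as in the Python sketch below. It uses the standard pooled-variance form; the slides do not state which variant or software produced the reported values. The function name and the sample scores are hypothetical and are not the study's data.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(sample_a, sample_b):
    """Return (t, df) for a pooled-variance independent-samples t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    standard_error = sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df


# Hypothetical scores for two engines, for illustration only.
engine_a = [1, 2, 3, 4, 5]
engine_b = [2, 4, 6, 8, 10]
t_value, df = independent_t(engine_a, engine_b)
print(t_value, df)
```

The sign of t follows the order of the groups: it is negative whenever the first sample's mean is below the second's. The sketch assumes at least one sample has non-zero variance, since otherwise the standard error is zero.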
Visual representation
[Chart 3: Average category-based ranking performance of the engines. Axis: ranking performance, 0 to 100. Google: 52.35; NeuroSearch: 100.]
[Chart 4: Average keyword performance in the documents of the query test results (category-based queries). Axis: average of keywords, 0 to 100. Google: 34.98; NeuroSearch: 90.29.]
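The Chart 4 values can be cross-checked against the data labels of Charts 1 and 2. The short Python check below assumes that the five labels on each of those charts are the per-category averages; which category each label belongs to does not affect the mean.

```python
from statistics import mean

# Data labels read from Chart 1 (Google) and Chart 2 (NeuroSearch).
google_by_category = [80.96, 48.99, 36.66, 6.33, 1.99]
neurosearch_by_category = [92.33, 88.49, 92.99, 98.16, 79.49]

google_average = mean(google_by_category)            # about 34.986
neurosearch_average = mean(neurosearch_by_category)  # about 90.292

print(google_average, neurosearch_average)
```

Both means agree with the Chart 4 values (34.98 and 90.29) to within 0.01.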
Conclusion
Although the “NeuroSearch” search engine uses a simple algorithm to judge page quality compared with other conventional search engines, it proves to be very powerful in obtaining relevant results, particularly if its advance knowledge is built by a specialist with domain knowledge, e.g. in oil, medicine, or the arts.
Ready for Questions!!!