Transcript ppt
By Morris Wright, Ryan Caplet, Bryan Chapman
Overview
Crawler-Based Search Engine (a
script/bot that searches the web in a
methodical, automated manner)
(Wikipedia, "web crawler")
Limited to a subset of UConn's School of
Engineering websites
Resources: Web server and MySQL
servers provided by ECS
Languages Used: HTML, PHP, SQL,
Perl
Task Breakdown
Bryan
Design Crawler
Analyze files and fill database with Urls to search
Morris
Search functionality
Database/Account Management
Ryan
UI Development
Ranking Algorithm and Keyword
Extraction (done by the group)
Crawler Summary
The crawler creates a "mirror" of our intended
scope of websites on the local hard drive
Using a script, the title is then extracted from
the relevant files and placed into a DB table
Another script then visits each url and extracts
keywords to populate the second DB table
When a user types a word into the search
engine, the word is queried in the
keyword database, and from that result a
second query displays all the URLs/titles
matching that specific keyword
Crawler - Wget
The Linux command wget is used in our script,
along with the base domain
www.engr.uconn.edu/, to limit our crawler to
sites within the School of Engineering
“Wget can follow links in HTML pages and
create local versions of remote web sites, fully
recreating the directory structure of the original
site” (linux.about.com)
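As a rough sketch of this step (the deck only names wget and the base domain, so the exact flag set here is an assumption), the mirroring command could be assembled like this:

```python
# Hypothetical sketch of the wget mirroring step described above.
# The flags are assumptions; the deck only names wget and the base domain.

BASE = "http://www.engr.uconn.edu/"

def build_wget_cmd(base=BASE):
    """Build a wget command that mirrors the site while staying
    inside the School of Engineering domain."""
    return [
        "wget",
        "--mirror",                      # recursive download, preserving directory structure
        "--no-parent",                   # never ascend above the base path
        "--domains", "engr.uconn.edu",   # restrict the crawl to this domain
        base,
    ]
```

In a real script this list would be handed to something like subprocess.run; here the command is only constructed, so the example runs without network access.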
Our “Mirror”
A script then makes a recursive pass that
extracts the contents of the <title> tags from
the files, preparing them for storage in the
database
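The title-extraction pass might look like the following sketch (the project's actual scripts are in Perl/PHP; the regex-based approach here is an assumption):

```python
import re

def extract_title(html):
    """Pull the text between <title> tags, ready for the index table."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None
```

For example, extract_title on a page containing `<title>For all your Technology Needs</title>` returns "For all your Technology Needs", and None when no title tag is present.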
Crawler – Stem Words
A script is used to remove common
"stop" words and to combine like words
by stripping suffixes, for example:
the
if
however
-ion, -ing, -ier, etc.
"Running" is treated the same as "Run"
This saves space in the database
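A minimal sketch of this cleanup step, assuming a tiny stop-word list and a deliberately crude suffix stripper (a real stemmer, e.g. Porter, is more involved):

```python
# Illustrative sketch only: the stop-word list and suffix rules are
# small assumption-based samples, not the project's actual script.
STOP_WORDS = {"the", "if", "however"}

def stem(word):
    """Crudely strip the suffixes named on the slide (-ion, -ing, -ier)."""
    word = word.lower()
    for suffix in ("ing", "ion", "ier"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            if len(word) > 2 and word[-1] == word[-2]:  # "runn" -> "run"
                word = word[:-1]
            break
    return word

def clean_words(words):
    """Drop stop words, then stem what remains."""
    return [stem(w) for w in words if w.lower() not in STOP_WORDS]
```

With this, "Running" and "Run" both normalize to "run", so one database entry covers both forms.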
Crawler Functionality
Once this is accomplished our first database is
populated with indexing information and has a
layout as seen below.
Site Index Table
ID – used as the primary key
URL – stores the site's URL address
TITLE – stores the extracted title
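The layout above can be sketched in code; the project uses MySQL, so sqlite3 and the column types here are stand-in assumptions, and the sample row comes from the deck's later example results:

```python
import sqlite3

# Sketch only: sqlite3 stands in for the project's MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE site_index (
           ID    INTEGER PRIMARY KEY,  -- primary key
           URL   TEXT NOT NULL,        -- site's URL address
           TITLE TEXT                  -- extracted <title> text
       )"""
)
conn.execute(
    "INSERT INTO site_index (URL, TITLE) VALUES (?, ?)",
    ("http://www.uconn.edu/resnet", "For all your Technology Needs"),
)
```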
Crawler Functionality
PHP is then used to loop through all the URL
listings in our indexing database to create
keywords
Unwanted HTML syntax is removed and
PHP's built-in function
array_count_values is used to create a
list of keywords and frequency
For the time being, these keyword
frequencies will be used to determine page
rank and ordering on the search page
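PHP's array_count_values maps each value in an array to its count; collections.Counter is the Python equivalent, so the keyword step can be sketched like this (the tag-stripping regex and tokenizer are assumptions):

```python
import re
from collections import Counter

def keyword_frequencies(html):
    """Strip HTML syntax, then count word occurrences -- the same job
    PHP's array_count_values does in the deck's script."""
    text = re.sub(r"<[^>]+>", " ", html)         # remove unwanted HTML tags
    words = re.findall(r"[a-z]+", text.lower())  # crude tokenizer (an assumption)
    return Counter(words)
```

For example, keyword_frequencies("<p>information and more information</p>") counts "information" twice and never counts the tag names themselves.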
MySQL Database
Crawler Functionality
Once this list is created for a given website, we
then populate our keyword database by either
creating a new table for the keyword, or simply
adding a new entry into an existing table
'Keyword' Table
ID – used as the primary key
URL – stores the site's URL address
Freq – stores the keyword frequency
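A sketch of this populate step, following the deck's one-table-per-keyword design (sqlite3 stands in for MySQL, and the table-naming scheme is an assumption; the sample rows come from the deck's example results):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def record_keyword(conn, keyword, url, freq):
    """Create the keyword's table if it does not exist yet,
    then add the URL entry -- the insert-or-create step above."""
    table = "kw_" + "".join(c for c in keyword if c.isalnum())  # naive sanitizing
    conn.execute(f"""CREATE TABLE IF NOT EXISTS {table} (
        ID   INTEGER PRIMARY KEY,  -- primary key
        URL  TEXT NOT NULL,        -- site's URL address
        Freq INTEGER               -- keyword frequency
    )""")
    conn.execute(f"INSERT INTO {table} (URL, Freq) VALUES (?, ?)", (url, freq))

record_keyword(conn, "information", "http://www.uconn.edu/resnet", 10)
record_keyword(conn, "information", "http://www.uconn.edu/sports", 12)
```

The first call creates the kw_information table; the second finds it already exists and only appends a row.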
Sample Keyword Results
Consider the following results
URL: http://www.uconn.edu/resnet
Title: For all your Technology Needs
Keyword: technology 4
Keyword: information 10
URL: http://www.uconn.edu/sports
Title: For all your Sports Information
Keyword: football 10
Keyword: information 12
Crawler Functionality
Once the databases have been populated, they
just need to be integrated with the search
function of the page and the UI to be fully
functional
The current UI is fine for displaying a few
results, but we will need something more
efficient and better-looking when there are
hundreds of results
Search Function
When a word is entered into the search
bar, a query of that word is entered into the
database
If the word is in the database, the query will
pull up all the URLs and their associated
titles and display them on the page
The pages should be ordered by their page
rank – the higher the frequency of the
keyword, the higher the rank
The search function code is written in PHP
and the queries are written in SQL
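The query flow above can be sketched as follows. The real code is PHP plus MySQL, and the deck describes one table per keyword; a single "keywords" table with a Word column is an assumption made here for brevity, with sample rows taken from the deck's example results:

```python
import sqlite3

# Sketch only: sqlite3 stands in for MySQL; sample data is from the
# deck's "Sample Keyword Results" slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE site_index (ID INTEGER PRIMARY KEY, URL TEXT, TITLE TEXT);
    CREATE TABLE keywords  (ID INTEGER PRIMARY KEY, Word TEXT, URL TEXT, Freq INTEGER);
    INSERT INTO site_index (URL, TITLE) VALUES
        ('http://www.uconn.edu/resnet', 'For all your Technology Needs'),
        ('http://www.uconn.edu/sports', 'For all your Sports Information');
    INSERT INTO keywords (Word, URL, Freq) VALUES
        ('information', 'http://www.uconn.edu/resnet', 10),
        ('information', 'http://www.uconn.edu/sports', 12);
""")

def search(conn, word):
    """Query the keyword, join back to the index for URLs/titles, and
    order by frequency so higher-ranked pages are listed first."""
    return conn.execute(
        """SELECT s.URL, s.TITLE, k.Freq
           FROM keywords k JOIN site_index s ON s.URL = k.URL
           WHERE k.Word = ?
           ORDER BY k.Freq DESC""",
        (word.lower(),),
    ).fetchall()
```

Searching for "information" returns the sports page first (frequency 12), then the resnet page (frequency 10), matching the frequency-based ranking described above.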
Search Function Test
Search Function Example
Search Function - Mail
Search Function - Uconn
Search Function – N/a
Changes needed for Integration
Need to set up the test database fields to
match the schema of the crawler
database
The test setup uses only one database,
whereas the crawler uses two –
one for the URLs/titles, one for the
keywords
Need to work on security measures such
as input validation and testing with Hackbar
Hackbar is a tool used for testing SQL
injection, XSS holes, and site security.
Questions?