
By Morris Wright, Ryan Caplet, Bryan Chapman
Crawler-Based Search Engine (A
script/bot that searches the web in a
methodical, automated manner)
(Wikipedia, “web crawler”)
 Limited to a subset of UConn’s School of
Engineering websites
 Resources: Web server and MySQL
servers provided by ECS
 Languages Used: HTML, PHP, SQL,
Task Breakdown
 Design Crawler
 Analyze files and fill database with Urls to search
 Search functionality
 Database/Account Management
 UI Development
Ranking Algorithm and Keyword extraction
done by group
Crawler Summary
The crawler creates a “mirror” of our intended
scope of websites on a local hard drive
Using a script, the title is then extracted from
the relevant files and placed into a DB table
Another script then visits each url and extracts
keywords to populate the second DB table
When a user types in a word in the search
engine, the word will be queried in the
keyword database, and from that word another
query will be sent to display all the urls/titles
matching that specific keyword
Crawler - Wget
The Linux command wget is used in our script,
along with the base domain, to limit our crawler to
sites within the School of Engineering
“Wget can follow links in HTML pages and
create local versions of remote web sites, fully
recreating the directory structure of the original
site” (
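As a rough illustration of the mirroring step, the sketch below builds a wget command line restricted to one domain. The function name, the domain engr.example.edu and the output folder are placeholders, and the flags actually used in our script are not recorded here, so this is only one plausible combination of standard wget options.

```python
def build_wget_command(base_domain, mirror_dir):
    """Return an argument list for a wget mirror limited to one domain."""
    return [
        "wget",
        "--mirror",                          # recursive download with timestamping
        "--no-parent",                       # never climb above the start directory
        "--domains=" + base_domain,          # only follow links on this domain
        "--directory-prefix=" + mirror_dir,  # folder that receives the local mirror
        "http://" + base_domain + "/",
    ]
```

The list can be passed to subprocess.run on a machine where wget is installed.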
Our “Mirror”
A script is then used to run a recursive call that
extracts the contents of the <title> tags from the
mirrored files, preparing them for storage into the database
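The title extraction could look roughly like the following. It is a sketch in Python rather than our actual script, and the names TitleParser, extract_title and collect_titles are made up for the example. It walks the mirror folder recursively and keeps the text of the first <title> element in each HTML file.

```python
import os
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

def extract_title(html_text):
    parser = TitleParser()
    parser.feed(html_text)
    # Collapse line breaks and repeated spaces inside the title.
    return " ".join("".join(parser.parts).split())

def collect_titles(mirror_dir):
    """Map each mirrored HTML file path to its extracted title."""
    titles = {}
    for folder, _subdirs, files in os.walk(mirror_dir):
        for name in files:
            if name.lower().endswith((".html", ".htm")):
                path = os.path.join(folder, name)
                with open(path, encoding="utf-8", errors="replace") as handle:
                    titles[path] = extract_title(handle.read())
    return titles
```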
Crawler – Stem Words
A script is used to remove common filler words
and combine like words under a shared “stem”, such as:
 the
 if
 however
 -ion, -ing, -ier… etc
 “Running” is the same as “Run”
Helps with space in the database
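A very small sketch of this clean-up step is shown below. The word list and suffix list are taken from the examples above, the function names are invented, and the suffix stripping is deliberately crude (it will mangle some words), so it should be read as an illustration of the idea rather than the script we ran.

```python
STOP_WORDS = {"the", "if", "however"}
SUFFIXES = ("ion", "ing", "ier")

def stem(word):
    """Lower-case a word and strip one known suffix, if present."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so short words such as "ring" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # "running" -> "runn" -> "run": undo a doubled final consonant.
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word

def normalize(words):
    """Stem every word, then drop the filler words."""
    stems = (stem(w) for w in words)
    return [s for s in stems if s not in STOP_WORDS]
```

With this sketch, normalize(["The", "Running", "run"]) yields ["run", "run"], so both forms end up under one keyword.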
Crawler Functionality
Once this is accomplished our first database is
populated with indexing information and has a
layout as seen below.
Site Index Table
Used as a primary key
Stores site's url address
Stores extracted title
Crawler Functionality
PHP is then used to loop through all the URL
listings in our indexing database to create
a keyword list for each site
 Unwanted HTML syntax is removed and
PHP's built-in function
array_count_values is used to create a
list of keywords and frequency
 For the time being, these keyword
frequencies will be used to determine page
rank and ordering on the search page
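The counting step can be sketched as follows. Our code uses PHP's array_count_values; the Python Counter below plays the same role, and the regular expressions for removing markup are an assumption about what "unwanted HTML syntax" covers.

```python
import re
from collections import Counter

def keyword_frequencies(html_text):
    """Return a word -> count mapping for the visible text of a page."""
    # Drop script/style bodies first, then any remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html_text)
    text = re.sub(r"(?s)<[^>]*>", " ", text)
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)
```

Calling most_common() on the result lists the keywords from most to least frequent, which is the ordering used for ranking for the time being.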
MySQL Database
Crawler Functionality
Once this list is created for a given website, we
then populate our keyword database by either
creating a new table for the keyword, or simply
adding a new entry into an existing table
'Keyword' Table
Used as a primary key
Stores site's url address
Stores keyword frequency
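The "create a table or add to an existing one" step might be sketched like this. SQLite stands in for our MySQL server so the example runs on its own, and the kw_ table prefix, column names and function name are assumptions, not our real schema.

```python
import re
import sqlite3

def store_keyword(conn, keyword, url, frequency):
    """Add one (url, frequency) row to the table for this keyword."""
    # Table names cannot be bound as parameters, so allow plain letters only.
    if not re.fullmatch(r"[a-z]+", keyword):
        raise ValueError("unsupported keyword: %r" % keyword)
    table = "kw_" + keyword  # prefix avoids clashes with SQL reserved words
    # Creates the table on first use; later calls just add a new entry.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "id INTEGER PRIMARY KEY, url TEXT NOT NULL, frequency INTEGER NOT NULL)"
    )
    conn.execute(
        f"INSERT INTO {table} (url, frequency) VALUES (?, ?)", (url, frequency)
    )
```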
Sample Keyword Results
Consider the following results
 URL:
 Title: For all your Technology Needs
 Keyword: technology 4
 Keyword: information 10
 Title: For all your Sports Information
 Keyword: football 10
 Keyword: information 12
Crawler Functionality
Once the databases have been populated, they
just need to be integrated with the search
function of the page and the UI to be fully functional
 The current UI is good for displaying a few
results, but we will need something more
efficient and better looking when there are
hundreds of results
Search Function
When a word is entered into the search
bar, a query for that word is sent to the database
 If the word is in the database, the query will
pull up all the URLs and their associated
titles and display them on the page
 The pages should be ordered by their page
rank – the higher the frequency of the
keyword, the higher the rank
 The search function code is written in PHP
and the queries are written in SQL
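The lookup described above can be sketched as follows. Our implementation is PHP with MySQL; this version uses Python with SQLite so it is self-contained, and the table names site_index and kw_<word>, along with their columns, are assumptions made for the example.

```python
import re
import sqlite3

def search(conn, word):
    """Return (url, title, frequency) rows for a keyword, best match first."""
    keyword = word.strip().lower()
    if not re.fullmatch(r"[a-z]+", keyword):
        return []  # nothing that is not a plain word can be in the database
    table = "kw_" + keyword
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    if exists is None:
        return []  # the word has never been indexed
    return conn.execute(
        f"SELECT s.url, s.title, k.frequency FROM {table} AS k "
        "JOIN site_index AS s ON s.url = k.url "
        "ORDER BY k.frequency DESC"  # higher frequency means higher rank
    ).fetchall()
```

A word that was never indexed, or input such as "N/a", simply returns an empty list.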
Search Function Test
Search Function Example
Search Function - Mail
Search Function - Uconn
Search Function – N/a
Changes needed for Integration
Need to set up the test database fields to
match the criteria of the crawler
 Test database only uses 1 database,
whereas the crawler uses 2:
one for the URLs/titles, one for the keywords
 Need to work on security measures such
as input validation, and test them with Hackbar
 Hackbar is a tool used for testing SQL
injections, XSS holes and site security.
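As a starting point for the input validation, something along these lines could be applied before any query is built and before any result is printed. It is a sketch only: the length limit, the allowed characters and both function names are assumptions, and the real site would do the equivalent in PHP.

```python
import html
import re

MAX_TERM_LENGTH = 64  # hypothetical limit for a single search word

def validate_search_term(raw):
    """Return a cleaned search word, or None if the input is rejected."""
    term = raw.strip().lower()
    if len(term) > MAX_TERM_LENGTH or not re.fullmatch(r"[a-z0-9]+", term):
        return None
    return term

def render_result(url, title):
    """Build one result link with markup characters escaped (XSS defence)."""
    return '<a href="%s">%s</a>' % (html.escape(url, quote=True), html.escape(title))
```

Rejecting anything that is not a plain word blocks the usual quote-based injection strings, and escaping stored titles keeps markup found on a crawled page from running in the results page.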