Transcript ppt

By Morris Wright, Ryan Caplet, Bryan Chapman
Overview
Crawler-based search engine (a script/bot
that browses the web in a methodical,
automated manner) (Wikipedia, "web crawler")
 Limited to a subset of UConn's School of
Engineering websites
 Resources: web server and MySQL server
provided by ECS
 Languages used: HTML, PHP, SQL, Perl

Task Breakdown

Bryan
 Design Crawler
 Analyze files and fill the database with URLs to search

Morris
 Search functionality
 Database/Account Management

Ryan
 UI Development

Ranking algorithm and keyword extraction
done by the group
Crawler Summary

 The crawler creates a "mirror" of our intended
scope of websites on the local hard drive
 Using a script, the title is then extracted from
the relevant files and placed into a DB table
 Another script then visits each URL and
extracts keywords to populate the second DB
table
 When a user types a word into the search
engine, that word is queried against the
keyword database, and from that match a
second query displays all the URLs/titles
matching that specific keyword
Crawler - Wget

 The Linux command wget is used in our script
along with the base domain
www.engr.uconn.edu/ to limit our crawler to
sites within the School of Engineering
 "Wget can follow links in HTML pages and
create local versions of remote web sites, fully
recreating the directory structure of the original
site" (linux.about.com)
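The mirroring step can be sketched as below. This Python snippet only assembles the command line; the actual crawl is driven from our script, and every flag except the base domain is an assumption here.

```python
# Sketch of the wget invocation behind the mirror step. Only the base domain
# comes from the slides; the specific flags are assumptions.
base = "http://www.engr.uconn.edu/"
cmd = [
    "wget",
    "--mirror",                      # recurse and re-create the remote directory tree
    "--no-parent",                   # never ascend above the starting path
    "--domains=www.engr.uconn.edu",  # stay within the School of Engineering domain
    "-P", "mirror",                  # write the local copy under ./mirror
    base,
]
print(" ".join(cmd))
# subprocess.run(cmd, check=True) would start the actual crawl
```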
Our "Mirror"
 A script is then used to run a recursive pass
that extracts the contents of the <title> tags
from the files, preparing them for storage in
the database
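The recursive title pass can be sketched as follows. The real script is written in Perl; this Python version, the mirror directory name, and the regex approach are all illustrative assumptions.

```python
import os
import re

# Sketch of the title-extraction pass over the local wget mirror.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_titles(mirror_root):
    """Walk the mirrored tree and yield (url, title) pairs for the index table."""
    for dirpath, _dirs, files in os.walk(mirror_root):
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                match = TITLE_RE.search(f.read())
            if match:
                # Rebuild the URL from the file's path relative to the mirror root
                rel = os.path.relpath(path, mirror_root)
                yield "http://" + rel.replace(os.sep, "/"), match.group(1).strip()
```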
Crawler – Stem Words

 A script is used to remove arbitrary "stop"
words and to combine like words by trimming
their stems, such as:
 the
 if
 however
 -ion, -ing, -ier, etc.
 "Running" is treated the same as "Run"

 This helps save space in the database
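A minimal sketch of this pass is below; the word list and suffix handling are simplified stand-ins for whatever the project's script actually uses.

```python
# Simplified stop-word removal and suffix stemming (lists are illustrative).
STOP_WORDS = {"the", "if", "however", "a", "an", "and", "of"}
SUFFIXES = ("ing", "ion", "ier", "ed", "s")

def normalize(words):
    """Drop stop words and strip common suffixes so like words combine."""
    out = []
    for w in words:
        w = w.lower()
        if w in STOP_WORDS:
            continue
        for suf in SUFFIXES:
            if w.endswith(suf) and len(w) > len(suf) + 2:
                w = w[: -len(suf)]
                if len(w) > 2 and w[-1] == w[-2]:
                    w = w[:-1]          # "runn" -> "run"
                break
        out.append(w)
    return out
```

With this, normalize(["Running", "the", "Run"]) collapses "Running" and "Run" onto the same keyword and drops "the" entirely.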
Crawler Functionality

 Once this is accomplished, our first database
is populated with indexing information and has
the layout seen below.

Site Index Table
 ID    – used as the primary key
 URL   – stores the site's URL address
 TITLE – stores the extracted title
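The table above can be sketched in SQL as follows. SQLite is used here only to keep the example self-contained; the project runs on the MySQL servers provided by ECS, and the column types are assumptions.

```python
import sqlite3

# Schema sketch for the site-index table (SQLite stand-in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE site_index (
           ID    INTEGER PRIMARY KEY,   -- used as the primary key
           URL   TEXT NOT NULL,         -- site's URL address
           TITLE TEXT                   -- extracted <title> text
       )"""
)
conn.execute(
    "INSERT INTO site_index (URL, TITLE) VALUES (?, ?)",
    ("http://www.engr.uconn.edu/index.html", "School of Engineering"),
)
row = conn.execute("SELECT URL, TITLE FROM site_index").fetchone()
```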
Crawler Functionality
 PHP is then used to loop through all the URL
listings in our indexing database to create
keywords
 Unwanted HTML syntax is removed, and
PHP's built-in function
array_count_values is used to create a
list of keywords and their frequencies
 For the time being, these keyword
frequencies will be used to determine page
rank and ordering on the search page
[Diagram: MySQL database]
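The counting step can be sketched like this; Python's collections.Counter plays the role of PHP's array_count_values, and the tag-stripping regex is a simplification of whatever cleanup the project's script does.

```python
import re
from collections import Counter

def keyword_frequencies(html):
    """Strip tags, then count word occurrences (Counter stands in for
    PHP's array_count_values from the project's script)."""
    text = re.sub(r"<[^>]+>", " ", html)          # drop unwanted HTML syntax (simplified)
    words = re.findall(r"[a-z]+", text.lower())   # extract lowercase words
    return Counter(words)

freqs = keyword_frequencies("<p>Information about <b>information</b> technology</p>")
```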
Crawler Functionality

 Once this list is created for a given website, we
then populate our keyword database by either
creating a new table for the keyword or simply
adding a new entry into an existing table

'Keyword' Table
 ID   – used as the primary key
 URL  – stores the site's URL address
 Freq – stores the keyword frequency
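The "create the keyword's table if needed, otherwise insert" step can be sketched as below (again with SQLite standing in for MySQL). Because the keyword becomes a table name, it cannot be passed as a query parameter, so the sketch sanitizes it before interpolating.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def store_keyword(conn, keyword, url, freq):
    """Create the keyword's table if it does not exist yet, then add a row."""
    if not keyword.isalnum():          # keyword becomes a table name, so sanitize it
        raise ValueError("unsafe keyword")
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{keyword}" ('
        "ID INTEGER PRIMARY KEY, URL TEXT, Freq INTEGER)"
    )
    conn.execute(f'INSERT INTO "{keyword}" (URL, Freq) VALUES (?, ?)', (url, freq))

store_keyword(conn, "information", "http://www.uconn.edu/resnet", 10)
store_keyword(conn, "information", "http://www.uconn.edu/sports", 12)
rows = conn.execute(
    'SELECT URL, Freq FROM "information" ORDER BY Freq DESC'
).fetchall()
```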
Sample Keyword Results
Consider the following results:

 URL: http://www.uconn.edu/resnet
 Title: For all your Technology Needs
 Keyword: technology 4
 Keyword: information 10

 URL: http://www.uconn.edu/sports
 Title: For all your Sports Information
 Keyword: football 10
 Keyword: information 12
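Using the sample data above, ranking for the query "information" reduces to sorting by that keyword's frequency, which puts the sports page (freq 12) ahead of the resnet page (freq 10):

```python
# Ranking sketch over the sample results for the query "information".
results = [
    ("http://www.uconn.edu/resnet", "For all your Technology Needs", 10),
    ("http://www.uconn.edu/sports", "For all your Sports Information", 12),
]
ranked = sorted(results, key=lambda r: r[2], reverse=True)
```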
Crawler Functionality
 Once the databases have been populated, they
just need to be integrated with the page's
search function and the UI to be fully
functional
 The current UI is fine for displaying a few
results, but we will need something more
efficient and better looking when there are
hundreds of results
Search Function
 When a word is entered into the search bar, a
query for that word is sent to the database
 If the word is in the database, the query will
pull up all the URLs and their associated
titles and display them on the page
 The pages should be ordered by their page
rank – the higher the frequency of the
keyword, the higher the rank
 The search function code is written in PHP
and the queries are written in SQL
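The query shape described above can be sketched as below. The project writes this in PHP against MySQL; SQLite, the table names, and the join between the keyword table and the site index are assumptions based on the slides' description.

```python
import sqlite3

# Sketch of the search query: look the word up in its keyword table, then
# pull the matching URLs/titles from the site index, highest frequency first.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE site_index (ID INTEGER PRIMARY KEY, URL TEXT, TITLE TEXT);
    CREATE TABLE information (ID INTEGER PRIMARY KEY, URL TEXT, Freq INTEGER);
    INSERT INTO site_index (URL, TITLE) VALUES
        ('http://www.uconn.edu/resnet', 'For all your Technology Needs'),
        ('http://www.uconn.edu/sports', 'For all your Sports Information');
    INSERT INTO information (URL, Freq) VALUES
        ('http://www.uconn.edu/resnet', 10),
        ('http://www.uconn.edu/sports', 12);
""")
rows = conn.execute("""
    SELECT s.URL, s.TITLE
    FROM information AS k JOIN site_index AS s ON s.URL = k.URL
    ORDER BY k.Freq DESC
""").fetchall()
```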

Search Function Test
Search Function Example
Search Function - Mail
Search Function - Uconn
Search Function – N/a
Changes Needed for Integration
 Need to set up the test database fields to
match the criteria of the crawler database
 The test setup only uses one database,
whereas the crawler uses two – one for the
URLs/titles, one for the keywords
 Need to work on security measures such
as input validation, tested with HackBar
 HackBar is a tool used for testing SQL
injection, XSS holes, and site security
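The main defense against the SQL injection attempts HackBar generates is parameterized queries. A sketch is below (SQLite and the table layout are stand-ins; the project's PHP code would use prepared statements the same way):

```python
import sqlite3

# Input-validation sketch: a parameterized query keeps a malicious search
# term from being executed as SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (word TEXT, url TEXT, freq INTEGER)")
conn.execute("INSERT INTO keywords VALUES ('uconn', 'http://www.uconn.edu/', 5)")

def search(term):
    # The ? placeholder passes the term as data, never as SQL text
    return conn.execute(
        "SELECT url, freq FROM keywords WHERE word = ?", (term,)
    ).fetchall()
```

A normal term returns its rows, while an injection attempt like "'; DROP TABLE keywords; --" simply matches nothing instead of running as SQL.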

Questions?