
CPSC 8985
Fall 2015
P10
Web Crawler
Mike Schmidt
Overview
• A web crawler is a script or program that visits pages on
the Internet and pulls information or data from them
• A crawler can be configured to look only for certain
information; saving that data into a database is known as
web scraping
• Analysis can then be performed on the stored data to show
trends or similarities between data sets
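The crawl loop behind this idea can be sketched with a queue of pages to visit and a set of pages already seen. The class and method names below are illustrative, not taken from the project; the actual fetch step (handled by Jsoup in this application) is left out so the logic stands on its own.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Minimal sketch of a crawl frontier: URLs wait in a queue, and a
// visited set prevents the crawler from fetching the same page twice.
public class CrawlFrontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> visited = new HashSet<>();

    // Enqueue a URL only if it has not been seen; returns true if added.
    public boolean enqueue(String url) {
        if (visited.contains(url)) {
            return false;
        }
        visited.add(url);
        queue.add(url);
        return true;
    }

    // Next URL to fetch, or null when the frontier is empty.
    public String next() {
        return queue.poll();
    }

    public boolean isEmpty() {
        return queue.isEmpty();
    }
}
```

A crawler would repeatedly call `next()`, fetch and parse that page, and `enqueue()` any links it discovers.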
Architecture
• Java – the language in which the
business objects and data
access objects are written
• Jsoup – the Java library the
application uses to pull HTML
elements from web pages
• MongoDB – the NoSQL
database the application uses to
store information collected
from the web
MongoDB
• The name Mongo comes from the word "humongous," as
MongoDB is designed to store massive amounts of data
• MongoDB is a NoSQL database that stores information in
a JSON-like way, using document objects
• A Mongo database can be spread over multiple servers
(sharding), which makes it well suited to large data sets
that must be accessed in a timely manner
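A store step like the one described above might look as follows with the MongoDB Java sync driver. The connection string, database name, collection name, and record fields are all assumptions for illustration, not details from the project.

```java
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;

public class MongoStore {
    // Build a BSON document from scraped fields. MongoDB documents are
    // schemaless, so each record carries whatever fields the scrape produced.
    public static Document toRecord(String city, String conditions, double tempF) {
        return new Document("city", city)
                .append("conditions", conditions)
                .append("tempF", tempF);
    }

    public static void main(String[] args) {
        // "crawler" and "weather" are assumed names for this sketch.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("crawler");
            MongoCollection<Document> weather = db.getCollection("weather");
            weather.insertOne(toRecord("St. Louis", "Cloudy", 41.0));
        }
    }
}
```

Because each record is just a document, different scrapes (weather, scores, movies) can go into separate collections without defining a schema up front.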
JSoup
• The jsoup Java library parses web pages into elements
using HTML tags and attributes
• Jsoup breaks pages down using CSS- and jQuery-like
selector methods
• Scraped jsoup elements can easily be added to a
document object, which is then sent to the MongoDB
server
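A minimal sketch of the jsoup selection step is below. The HTML snippet and the `span.score` selector are illustrative assumptions; a live run would fetch the page with `Jsoup.connect(url).get()` instead of parsing a string.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScoreParser {
    // Pull the text of the first element matching a CSS-style selector,
    // the same way a jQuery expression would.
    public static String firstScore(String html) {
        Document doc = Jsoup.parse(html);
        Element score = doc.select("span.score").first();
        return score == null ? null : score.text();
    }
}
```

The same `select()` call works on pages fetched live, which is what makes the scrape step so compact.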
Scraped Data
• The application scrapes data live from the Internet
(weather, sports scores, and movie listings)
• Collected data is stored in a Mongo database, where
analysis can be performed
• Scraping lets users pull information from multiple
sources and aggregate it in one central location
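The scrape-then-store pipeline above amounts to converting parsed jsoup elements into a BSON document ready for insertion. The HTML structure, CSS class names, and field names below are illustrative assumptions, not the project's actual markup.

```java
import org.bson.Document;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class WeatherScrape {
    // Parse a (hypothetical) weather card and map its pieces onto a
    // BSON document that could be passed straight to insertOne().
    public static Document fromHtml(String html) {
        Element card = Jsoup.parse(html).selectFirst("div.weather");
        return new Document("city", card.selectFirst(".city").text())
                .append("temp", card.selectFirst(".temp").text());
    }
}
```

Running the same conversion over several sources and inserting each result is what aggregates everything into the one central Mongo collection.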
Live Demo of Application