Data Collection and Web Crawling

Overview
• Data-intensive applications are likely to be
powered by a database.
• How do you get the data in your database?
– Your private secret data source
– Public data from the Internet
• In this tutorial, we will introduce how to
collect data from the Internet.
– Use APIs
– Web Crawlers
Collecting data from Internet:
Use APIs
• The easiest way to get data from the Internet.
• Steps:
– 1. Make sure the data source provides APIs for data
collection.
– 2. Obtain an API key or other form of authorization.
– 3. Read the documentation.
– 4. Write the code.
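Steps 2 to 4 can be sketched in Java roughly as follows. This is a minimal illustration, not any particular provider's API: the endpoint https://api.example.com/search, the q parameter, and the bearer-token scheme are all assumptions, and the real values come from the provider's documentation. It uses the java.net.http classes available since Java 11.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class ApiRequestSketch {
    // Hypothetical endpoint; replace with the one named in the provider's documentation.
    static final String BASE_URL = "https://api.example.com/search";

    /** Builds the request URI, encoding the user-supplied query string. */
    static URI buildSearchUri(String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(BASE_URL + "?q=" + encoded);
    }

    /** Attaches the API key as a bearer token (an assumed scheme; check the provider's docs). */
    static HttpRequest buildRequest(String query, String apiKey) {
        return HttpRequest.newBuilder(buildSearchUri(query))
                .header("Authorization", "Bearer " + apiKey)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("data mining", "YOUR_API_KEY");
        System.out.println(request.method() + " " + request.uri());
        // To actually send the request:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Keeping request construction separate from sending makes it possible to check the URL and headers without touching the network or spending any of the rate limit.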
Collecting data from Internet:
Use APIs
• Example: Twitter Search API
• 1. Make sure the data source provides APIs for data
collection.
– “Search API is focused on relevance and not completeness”
– “Requests to the Search API, hosted on search.twitter.com,
do not count towards the REST API limit. However, all
requests coming from an IP address are applied to a Search
Rate Limit. The Search Rate Limit isn't made public to
discourage unnecessary search usage and abuse, but it is
higher than the REST Rate Limit. We feel the Search Rate
Limit is both liberal and sufficient for most applications
and know that many application vendors have found it
suitable for their needs.”
Collecting data from Internet:
Use APIs
• 2. Obtain API key or other forms of
authorization.
– Read through
https://dev.twitter.com/docs/auth/tokensdevtwittercom and obtain the tokens
• 3. Read documentation
– We found a Java implementation of the Twitter API and read its
documentation files and sample code at
http://twitter4j.org/en/index.html
Collecting data from Internet:
Use APIs
• 4. Coding
– Write the code based on the documentation and code samples.
– Refer to our sample code
(DataCollection/TweetsCollector.java)
Collecting data from Internet:
Web Crawlers
• However, other providers hosting the data you
are interested in may not provide an API.
– Example case: You want information on all movies from
IMDB, but IMDB doesn’t provide an API for programmers.
– e.g. You want all the movie information reachable from the
starting page
http://www.imdb.com/features/video/browse/
• You need to develop your own crawler.
• Prerequisites: an HTTP client and regular expressions
Collecting data from Internet:
Web Crawlers
• After browsing the website, you find out that
each movie’s information can be found at
http://www.imdb.com/title/tt******/ where
****** = movie id
• Pseudo Code:
extract the movie ids from the starting page
    http://www.imdb.com/features/video/browse/
for each id in {ids}
    access http://www.imdb.com/title/tt{id}/, store page content in d
    obtain the movie’s title t, year y, storyline s from d
    store (id, t, y, s) in database
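The pseudo code above could look roughly like this in Java. It is a sketch under stated assumptions, not the provided sample code: the two regular expressions guess at the page markup, only the title is extracted (year and storyline would need patterns of their own), a Map stands in for the database, and the page fetcher is passed in as a function so the loop can run against canned pages instead of the live site.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlLoopSketch {
    // Assumes movie links look like /title/tt0123456/ (digits after "tt").
    static final Pattern ID_PATTERN = Pattern.compile("/title/tt(\\d+)/");
    // Assumes the movie title is the text of the page's <title> element.
    static final Pattern TITLE_PATTERN = Pattern.compile("(?is)<title>(.*?)</title>");

    /** Collects the distinct movie ids linked from a page, in order of appearance. */
    static Set<String> extractMovieIds(String html) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher m = ID_PATTERN.matcher(html);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    /** Visits each movie page and records id -> title; the map stands in for a database. */
    static Map<String, String> crawl(String startUrl, Function<String, String> fetch) {
        Map<String, String> store = new LinkedHashMap<>();
        for (String id : extractMovieIds(fetch.apply(startUrl))) {
            String page = fetch.apply("http://www.imdb.com/title/tt" + id + "/");
            Matcher m = TITLE_PATTERN.matcher(page);
            if (m.find()) {
                store.put(id, m.group(1).trim());
            }
        }
        return store;
    }

    public static void main(String[] args) {
        // Canned pages instead of real HTTP, so the sketch runs offline.
        Map<String, String> fakeWeb = Map.of(
            "start", "<a href=\"/title/tt0000001/\">A</a> <a href=\"/title/tt0000002/\">B</a>",
            "http://www.imdb.com/title/tt0000001/", "<title>First Movie</title>",
            "http://www.imdb.com/title/tt0000002/", "<title>Second Movie</title>");
        System.out.println(crawl("start", url -> fakeWeb.getOrDefault(url, "")));
    }
}
```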
Collecting data from Internet:
Web Crawlers
• Selected useful Java methods:
• Read HTML files:
InputStream in = new URL(url).openConnection().getInputStream();
// Returns an InputStream that delivers the HTML source of url.
• Regex that finds specific patterns in a text:
Matcher m = Pattern.compile(regex).matcher(source_text);
while (m.find()) { String result = m.group(i); }
// In source_text, find each string that matches the pattern specified by regex,
// then store the ith parenthesized group of that match in result.
• Wait for several seconds to reduce the risk of being detected and
banned:
Thread.sleep((long) (1000 * Math.random() * k));
// Wait for 0 to k seconds.
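The reading and waiting snippets above might be combined into small helpers like these. This is a sketch that assumes the page is UTF-8 encoded and needs Java 9 or later for readAllBytes; the class and method names are made up for illustration. The demo in main reads from an in-memory stream so it runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class FetchSketch {
    /** Reads an entire stream into a String and closes it; assumes UTF-8 content. */
    static String readAll(InputStream in) throws IOException {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    /** Downloads the HTML source of a page. */
    static String fetch(String url) throws IOException {
        return readAll(URI.create(url).toURL().openConnection().getInputStream());
    }

    /** Returns a random pause length between 0 and k seconds, in milliseconds. */
    static long randomDelayMillis(int k) {
        return (long) (1000 * Math.random() * k);
    }

    public static void main(String[] args) throws Exception {
        // Offline demonstration of readAll; call fetch(url) for a real page.
        byte[] bytes = "<html>hi</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes)));
        Thread.sleep(randomDelayMillis(2)); // pause before the next request
    }
}
```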
Regular Expression
• Regex - An advanced search.
– “Normal search” only deals with finding fixed
character sequences.
– Regex can handle various patterns.
• An interactive tutorial:
– http://regexone.com/
• A place to quickly test a written regex against a
source text:
– http://regexpal.com/
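To make the contrast concrete, here is a small Java sketch. The sample sentence and the four-digit-year pattern are chosen only for illustration. A fixed search can ask whether "1994" occurs, while a regex can ask for every four-digit number without knowing the values in advance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchVsRegex {
    /** Finds every standalone four-digit number in the text. */
    static List<String> findYears(String text) {
        List<String> years = new ArrayList<>();
        Matcher m = Pattern.compile("\\b\\d{4}\\b").matcher(text);
        while (m.find()) {
            years.add(m.group());
        }
        return years;
    }

    public static void main(String[] args) {
        String text = "Released in 1994, remade in 2010.";
        System.out.println(text.contains("1994")); // fixed search: true
        System.out.println(findYears(text));       // regex search: [1994, 2010]
    }
}
```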
Regular Expression
The most useful pattern for web crawlers:
<tag>(.*?)</tag>
matches everything surrounded by
<tag></tag>
Example
html content:
<div class="txt-block" itemprop="actors" itemscope
itemtype="http://schema.org/Person">
<h4 class="inline">Stars:</h4>
<name size=3>Ben Ziegler</name>,
<name size=5>Glenna Hill</name>,
<name size=4>Jason Woolfolk</name>
<span class="ghost">|</span>
<span class="see-more inline nobr">
<a href="fullcredits?ref_=tt_ov_st_sm" itemprop='url'> See full cast and crew</a>
&raquo;
</span>
</div>
Example
• Match the three names surrounded by <name> tags
– <name size=\d>(.*?)</name>
Example
• Convert this regex into a Java expression:
Matcher m = Pattern.compile("(?mis)<name size=\\d>(.*?)</name>").matcher(html_content);
while (m.find()) { System.out.println("name: " + m.group(1)); }
– We use \\d instead of \d in order to escape the escape
character “\”.
– () controls the group to be extracted. With a second group around \\d, the name becomes group 2:
Matcher m = Pattern.compile("(?mis)<name size=(\\d)>(.*?)</name>").matcher(html_content);
while (m.find()) { System.out.println("name: " + m.group(2)); }
– Feel the difference:
• What if we use (.*) instead of (.*?) ?
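One way to explore that question is to run both patterns on the same input. The sketch below uses a shortened, single-line version of the example HTML; the helper name firstGroups is made up for illustration. The lazy (.*?) stops at the first </name> it can, while the greedy (.*) runs on to the last one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    /** Returns group 1 of every match of regex in text. */
    static List<String> firstGroups(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<name size=3>Ben Ziegler</name>, <name size=5>Glenna Hill</name>";
        // Lazy: two matches, one per name.
        System.out.println(firstGroups("<name size=\\d>(.*?)</name>", html));
        // Greedy: one match that swallows the markup between the two names.
        System.out.println(firstGroups("<name size=\\d>(.*)</name>", html));
    }
}
```

The lazy pattern yields "Ben Ziegler" and "Glenna Hill" separately; the greedy one yields the single string "Ben Ziegler</name>, <name size=5>Glenna Hill".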
Collecting data from Internet:
Web Crawlers
• Complete sample code is provided in
DataCollection/MovieSpider.java
Summary
• Third-party APIs
– Pros: convenient and easy to use; safe, won’t be blocked; fast
– Cons: need to manage API keys; inflexible; limit on access
• Your own web crawlers
– Pros: very flexible; theoretically, you can collect anything you find
– Cons: a lot of coding; may be blocked