Data Collection


Web Crawler
Data Collection Module
Web Crawler
• A web crawler (also known as a web spider or web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner.
• Search engines such as Google and Bing use web crawlers to index newly created data on the Internet.
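One way to picture "methodical" browsing is a breadth-first traversal with a frontier queue and a visited set. The sketch below is only an illustration under stated assumptions: the class name CrawlSketch, the fetchLinks function, and the toy in-memory "web" are hypothetical stand-ins for downloading a page and extracting its hyperlinks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

class CrawlSketch {

    /**
     * Breadth-first crawl starting at seed. fetchLinks is a hypothetical
     * stand-in for "download the page and return the links found on it".
     * Returns the pages in the order they were visited.
     */
    static List<String> crawl(String seed,
                              Function<String, List<String>> fetchLinks,
                              int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already crawled this page
            }
            for (String link : fetchLinks.apply(url)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // Toy link graph used instead of real HTTP requests (assumption).
        Map<String, List<String>> toyWeb = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c", "d"),
            "c", List.of("a"),
            "d", List.<String>of());
        // prints [a, b, c, d]
        System.out.println(crawl("a", u -> toyWeb.getOrDefault(u, List.of()), 10));
    }
}
```

A real crawler would replace fetchLinks with page download and link extraction, and would typically add politeness delays and robots.txt handling, which this sketch leaves out.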
16BIT
IITR
News Crawler
• News crawlers focus on retrieving newly published news data.
• A news crawler monitors a defined set of news sources and captures each news item as soon as it is published.
[Figure: Architecture of News Crawler at IITR. Labeled components: Predefined Set of News Sources, News URL Downloader (crawl every 30 min), New URLs, News Article Downloader, News Articles, News Database]
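The step in which only new URLs are passed on to the article downloader can be sketched as a "seen URL" filter run on each crawl round. This is a minimal sketch, not the IITR implementation: the class NewsCrawlSketch, the method names, and the example.com URLs are hypothetical, and the 30-minute period is taken from the figure.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

class NewsCrawlSketch {
    // URLs handed to the article downloader in earlier rounds.
    private final Set<String> seenUrls = new HashSet<>();

    /** Returns only the URLs that were not seen in any earlier round. */
    synchronized List<String> filterNew(List<String> listedUrls) {
        List<String> fresh = new ArrayList<>();
        for (String url : listedUrls) {
            if (seenUrls.add(url)) { // add() is false for known URLs
                fresh.add(url);
            }
        }
        return fresh;
    }

    /**
     * Runs one crawl round every 30 minutes. urlDownloader and
     * articleDownloader are hypothetical stand-ins for the
     * "News URL Downloader" and "News Article Downloader" components.
     */
    ScheduledExecutorService startEvery30Minutes(Supplier<List<String>> urlDownloader,
                                                 Consumer<String> articleDownloader) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
            () -> filterNew(urlDownloader.get()).forEach(articleDownloader),
            0, 30, TimeUnit.MINUTES);
        return scheduler;
    }

    public static void main(String[] args) {
        NewsCrawlSketch crawler = new NewsCrawlSketch();
        // Two simulated rounds with made-up URLs; only unseen URLs come back.
        System.out.println(crawler.filterNew(
            List.of("http://example.com/news/1", "http://example.com/news/2")));
        System.out.println(crawler.filterNew(
            List.of("http://example.com/news/2", "http://example.com/news/3")));
    }
}
```

Keeping the seen set in memory is a simplification; a long-running crawler would more likely check the news database for already stored URLs.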
Web Crawler
A Simple Java Program for Downloading a Web Page
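A minimal sketch of such a program, using only the JDK, might look like the following. It is not the original slide listing: the class name DownloadPage, the 10-second timeouts, and the "Mozilla" user agent are assumptions chosen for illustration, and the page is assumed to be UTF-8 encoded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class DownloadPage {

    /** Reads the whole stream as UTF-8 text and closes it. */
    static String readAll(InputStream in) {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Downloads the page at the given HTTP(S) address and returns its HTML. */
    static String download(String address) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(10_000); // milliseconds (assumed value)
        conn.setReadTimeout(10_000);
        conn.setRequestProperty("User-Agent", "Mozilla");
        try {
            return readAll(conn.getInputStream());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("Usage: java DownloadPage <url>");
            return;
        }
        System.out.println(download(args[0]));
    }
}
```

The downloaded HTML can then be saved to a file such as data.html and handed to a parser.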
Parsing a Web Page
• Given a web page, we can retrieve its different components by parsing it.
• Many HTML parsers are available, such as Jsoup, Xerces, and NekoHTML.
• The following Java program uses the Jsoup parser to extract hyperlinks from a web page.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.io.File;
public class ExtractLinks {
    public static void main(String[] args) throws IOException {
        // Parse a saved HTML file. The third argument is the base URI used to
        // resolve relative links; it is left empty here, so "abs:href" yields
        // an empty string for relative links.
        File input = new File("data.html");
        Document doc = Jsoup.parse(input, "UTF-8", "");
        // Select every anchor element that has an href attribute.
        Elements links = doc.select("a[href]");
        System.out.println("Total Number of Links: " + links.size());
        for (Element link : links) {
            System.out.println(link.attr("abs:href"));
        }
    }
}
Retrieving Article Text
• Many APIs are available for extracting the main content from web pages, such as the Boilerpipe API.
• The following Java program demonstrates the use of the Boilerpipe API to extract the article text from a news article.
import java.io.PrintWriter;
import java.net.URL;
import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.HTMLHighlighter;
public class BoilerplateDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.thehindu.com/news/national/land-acquisition-ordinance-bill-gets-a-burial/article7597517.ece");
        final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
        // choose the operation mode (i.e., highlighting or extraction)
        //final HTMLHighlighter hh = HTMLHighlighter.newHighlightingInstance();
        final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
        PrintWriter out = new PrintWriter("highlighted.html", "UTF-8");
        out.println(hh.process(url, extractor));
        out.close();
        System.out.println("Now open file highlighted.html in your web browser");
    }
}
Article Extractor
Article Extraction
• Objective: Extract the article content from a given news URL.
• News URL: http://www.hindustantimes.com/world-t20/amitabh-bachchan-to-sing-national-anthem-before-indiapakistan-match/story-QXxnQAvmJsisvIYtSFv33L.html
Bollywood superstar Amitabh Bachchan will sing the National
Anthem before the start of the marquee India-Pakistan World
Twenty20 cricket match at the Eden Gardens on March 19.
Bachchan has confirmed the development by retweeting a post in
his official Twitter handle while sources in the Cricket Association
of Bengal today said this was an effort by its president Sourav
Ganguly.
“The president was involved and the plan was on for a long time,”
CAB sources said.
While the ‘Big B’ will sing the National Anthem in his signature
baritone, Pakistan will also make their presence felt with classical
singer Shafaqat Amanat Ali who is slated to sing the Pakistani
National Anthem.
Add-ons: Noise
http://timesofindia.indiatimes.com/india/India-became-3rd-largest-economy-in-2011-from-10th-in-2005/articleshow/34416429.cms
Article Extraction
String url = "input_url.html";
String name = "CLASS or ID name";
Document doc = Jsoup.connect(url).timeout(100*1000).userAgent("Mozilla").get();
// select the article container by its class name
String article = doc.getElementsByClass(name).text();
Or
// select it by element ID (getElementById returns null if no element has that ID)
String article = doc.getElementById(name).text();
Example
String url = "http://www.dnaindia.com/world/report-pakistan-blast-in-peshawar-buskills-at-least-15-govt-employees-over-25-injured-2189902";
String name = "body-text";
Document doc = Jsoup.connect(url).timeout(100*1000).userAgent("Mozilla").get();
String article = doc.getElementsByClass(name).text();
Extract Meta-Key Phrase
Metadata of News Webpages
• Metadata is data about data. In a web page it is typically expressed as key-value pairs in <meta> tags, for example:
Key : name = "author"
Value : content = "TCA Sharad Raghavan"
• Metadata content of a typical news webpage:
• Title,
• Description,
• News keywords,
• Author name,
• Last modified date,
• Publishing date,
• etc.
• News websites use various protocols to insert metadata. OGP (Open Graph Protocol) is one of them.
• Some of the well-known OGP tags are:
• og:title - The title of your object as it should appear within the graph, e.g., "The Rock".
• og:type - The type of your object, e.g., "video.movie". Depending on the type you specify, other properties may also be required.
• og:image - An image URL which should represent your object within the graph.
• og:url - The canonical URL of your object that will be used as its permanent ID in the graph, e.g., "http://www.imdb.com/title/tt0117500/".
Open Graph Protocol
• The Open Graph Protocol (OGP), provided by Facebook, allows web content to be embedded as Facebook social graph objects.
• It defines tags that web content generators can use to convert web objects into the corresponding graph objects.
Facebook Graph Object
<meta property="og:image" content="http://www.thehindu.com/multimedia/dynamic/02459/LPG_2459323c.jpg">
<meta property="og:title" content="Centre changes tack on LPG subsidy campaign">
<meta property="og:description" content="The government seems to have given up on the Give It Up….">
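Tags like these can be read back into key-value pairs. The following is a rough sketch that uses only the JDK's regular expressions so that it stays self-contained. It assumes the name or property attribute comes before content and that values are double-quoted, so a real HTML parser such as Jsoup would be the more robust choice. The class name MetaTagSketch is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaTagSketch {
    // Matches <meta name="..." content="..."> and <meta property="..." content="...">.
    // Assumption: the key attribute precedes content and both are double-quoted.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+(?:name|property)\\s*=\\s*\"([^\"]*)\"\\s+content\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    /** Returns the meta key-value pairs in document order. */
    static Map<String, String> extract(String html) {
        Map<String, String> meta = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            meta.put(m.group(1), m.group(2));
        }
        return meta;
    }

    public static void main(String[] args) {
        String html =
            "<meta property=\"og:title\" content=\"Centre changes tack on LPG subsidy campaign\">\n"
            + "<meta name=\"author\" content=\"TCA Sharad Raghavan\">";
        extract(html).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Filtering the resulting map for keys that start with "og:" gives the Open Graph properties of the page.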
"news_keywords" Tag
• Lists the keywords that are most relevant to the article.
<meta name="news_keywords" content="LPG subsidy, LPG subsidy campaign, Give It Up Campaign ,economy, business and finance, energy and resource">
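Once the content value of this tag has been read, it can be turned into a list of key phrases. The sketch below assumes the phrases are comma-separated, as in the example above, and tolerates stray spaces around the commas. The class name NewsKeywordsSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class NewsKeywordsSketch {

    /** Splits a news_keywords content value on commas and tidies whitespace. */
    static List<String> parse(String content) {
        List<String> keywords = new ArrayList<>();
        for (String part : content.split(",")) {
            // Trim the ends and collapse inner runs of whitespace or line breaks.
            String keyword = part.trim().replaceAll("\\s+", " ");
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        String content = "LPG subsidy, LPG subsidy campaign, Give It Up Campaign ,economy, "
            + "business and finance, energy and resource";
        // One key phrase per line
        parse(content).forEach(System.out::println);
    }
}
```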
Twitter Crawler
Twitter
• Twitter is an online social networking and microblogging service.
• It enables registered users to read and send messages of up to 140 characters, known as tweets.
• Twitter data takes the following forms:
• Tweet: A message of 140 characters or fewer.
• Follower: A person who has chosen to read your tweets on an ongoing basis.
• Reply or @: The @ symbol means you are talking to or about the person.
• Retweet or RT: The act of repeating what someone else has tweeted so that your followers can see it.
• HashTag or #: A hashtag gives the tweet a theme, so that all tweets on that theme can be searched.
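These conventions can be recognised directly in the text of a tweet. The following sketch pulls out hashtags and @-mentions with simple regular expressions and checks the 140-character limit stated above. The class name TweetTextSketch, the patterns, and the sample tweet are illustrative assumptions, not Twitter's official parsing rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TweetTextSketch {
    static final int MAX_LENGTH = 140; // limit as stated in the slides

    // Simplified patterns (assumption): a marker followed by word characters.
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");
    private static final Pattern MENTION = Pattern.compile("@(\\w+)");

    private static List<String> findAll(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(1)); // the name without the # or @ marker
        }
        return found;
    }

    static List<String> hashtags(String text) { return findAll(HASHTAG, text); }

    static List<String> mentions(String text) { return findAll(MENTION, text); }

    /** Treats text starting with "RT @" as a retweet (a common convention). */
    static boolean isRetweet(String text) { return text.startsWith("RT @"); }

    /** Counts Java chars (UTF-16 units), which is only an approximation. */
    static boolean fitsLimit(String text) { return text.length() <= MAX_LENGTH; }

    public static void main(String[] args) {
        // Made-up sample tweet for illustration.
        String tweet = "RT @some_user: India vs Pakistan at Eden Gardens #WT20 #INDvPAK";
        System.out.println("Hashtags: " + hashtags(tweet));
        System.out.println("Mentions: " + mentions(tweet));
        System.out.println("Retweet:  " + isRetweet(tweet));
        System.out.println("Fits:     " + fitsLimit(tweet));
    }
}
```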
Twitter
[Figure: annotated screenshot of a tweet, with labels: To Follow, Tweet, Persons Retweeted, Reply, HashTag, Retweets]
Data Extraction from Twitter
• Data from Twitter can be extracted using either the Twitter APIs or R packages.
1. Twitter APIs:
• REST API
• Streaming API
2. R packages:
• twitteR
• RTwitterAPI
Data Extraction from Twitter using the REST API: Twitter4J
1. Log in to your Twitter account.
2. Open https://apps.twitter.com/app/new and create an application.
3. Generate an access token.
4. Create a new Java project and include the Twitter4J library from https://dl.dropboxusercontent.com/u/1737239/twitter4j-core-2.2.5.jar
Java Code to Extract Tweets related to Query “World Cup”
Java Code to Extract Trends from Twitter