Projects
(2012-2013)
Ida Mele
Rules
• Students have to work in teams (max 2 people).
• The project has to be delivered by the deadline that will be
published on my web site.
– Usually the project deadline is the same day as the written
exam. Students who pass the exam in the first session
can deliver the project by the second session.
• The project score is from 0 to 10.
– The professor decides the final mark, also considering the
score of the written exam.
• A project can be assigned to max 2 groups.
Project Request
• Students have to send me an email with the subject "WebIR project request" and the following information:
– Name and last name of each student in the group.
– Title of the project.
– Short description of what the students intend to do (up to
250 words).
Important: all the members of the group should be CC'd on
the email.
• If everything is OK, you will receive a confirmation email.
• There is no deadline for the request of the project.
Project Delivery
• The presentation of the project takes 15 minutes. The
presentation should cover the description of the
problem, the design decisions, the most important issues
related to the implementation, and the results achieved.
Students use slides for their presentations and, if they
want, they can also prepare a demo.
• Students have to deliver the source code and the slides.
More instructions about the project delivery will be
published on my web site.
Project list
1. Analyze the link structure of the web graph of Sapienza University.
2. Analyze the link structure of Twitter social network.
3. Find communities in Facebook.
4. Find communities in IMDB.
5. Find communities in DBLP.
6. Hadoop implementation of PageRank.
7. Hadoop implementation of HITS.
8. Realize a reverse web graph with Hadoop.
9. Realize an inverted index with Hadoop.
10. Personalized ranking of news.
11. Enrich News using Tweets.
12. Enrich News using Wikipedia.
Projects
1) Analyze the link structure of the web graph of Sapienza
University.
– Crawl the portion of the Web related to the domain
uniroma1.it and create the corresponding web graph. Analyze
its link structure and identify the authoritative web sites.
– Tip: the students can use node features such as degree,
in-degree, out-degree, PageRank, etc., and plot the
distribution of these measures. The
students can enrich their analysis by studying the edge
reciprocity and the graph assortativity.
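The degree features mentioned in the tip can be computed directly from an edge list. The following is a minimal sketch, assuming the crawl has already been reduced to (source, target) pairs; the function names and the toy edges are hypothetical and only serve as an illustration.

```python
from collections import Counter

def degree_distributions(edges):
    """Return (in_degree, out_degree) Counters mapping node -> degree.

    Nodes with degree 0 do not appear in the corresponding Counter.
    """
    in_deg, out_deg = Counter(), Counter()
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return in_deg, out_deg

def distribution(degrees):
    """Map each degree value to the number of nodes that have it."""
    return Counter(degrees.values())

# Hypothetical toy web graph: each pair is a link source -> target.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
in_deg, out_deg = degree_distributions(edges)
in_dist = distribution(in_deg)  # the data one would plot
```

The resulting distribution (degree value versus number of nodes) is what would typically be plotted, often on a log-log scale for large web graphs.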
Projects
2) Analyze the link structure of the Twitter social network.
– Use the Twitter API to create the who-follows-whom network.
Analyze the distributions of followers and followings, and
identify the most popular users. Study the edge reciprocity
and determine whether the network is assortative.
– Tip: the students can use PageRank and/or other node
features to identify the most popular users.
– Tip: the network is assortative when nodes tend to be
connected with similar nodes; for example, nodes with high
degree have edges to other nodes with high degree.
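Both measures can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a prescribed method: reciprocity is taken as the fraction of directed edges whose reverse edge also exists, and assortativity is computed as the Pearson correlation between the degrees at the two ends of each edge, with the graph treated as undirected. Function names are hypothetical.

```python
from collections import Counter

def reciprocity(edges):
    """Fraction of directed edges (u, v) for which (v, u) also exists."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)

def degree_assortativity(undirected_edges):
    """Pearson correlation of the degrees found at the two ends of each edge."""
    degree = Counter()
    for u, v in undirected_edges:
        degree[u] += 1
        degree[v] += 1
    xs, ys = [], []
    for u, v in undirected_edges:
        # Count each edge in both directions so the measure is symmetric.
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    mean = sum(xs) / len(xs)  # xs and ys share the same mean and variance
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    # Undefined when all nodes have the same degree.
    return cov / var if var else float("nan")

# Hypothetical examples: a small follower graph and a star graph.
follows = [("a", "b"), ("b", "a"), ("a", "c")]
star = [("hub", "x"), ("hub", "y"), ("hub", "z")]
r = reciprocity(follows)
s = degree_assortativity(star)
```

A positive coefficient suggests an assortative network and a negative one a disassortative network; the star graph, where the high-degree hub connects only to low-degree leaves, is the extreme disassortative case.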
Projects
3) Find communities in Facebook.
– Use the Facebook API to download the data of your friends and of
friends of friends. Create the corresponding friendship
graph and find communities of users. Check whether the
communities correspond to groups of users who live in the
same city, work for the same organization, or attend the
same school, university, etc.
– Tip: the students can identify clusters of users by using a
graph-partitioning tool.
Projects
4 and 5) Find communities in a network of collaborations.
Project n.4: use IMDB: http://www.imdb.com/interfaces
Project n.5: use DBLP: http://dblp.uni-trier.de/xml/
– Create a graph where nodes are people and a link between two
people means that they have worked together. Use
this graph to find communities of people. Check whether the people in a
community come from the same country, are famous (for project n.4), or belong to
the same university (for project n.5).
– Tip: the number of collaborations is
important; students can use weighted edges to represent it.
– Tip: the students can use a graph-partitioning tool to
find clusters of people.
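The weighted edges from the tip can be built by counting how many works each pair of people shares. The sketch below assumes the IMDB or DBLP data has already been parsed into one list of names per movie or paper; the input format, the function name, and the sample names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def collaboration_graph(works):
    """Return a Counter mapping an (a, b) pair, with a < b, to the number of shared works."""
    weights = Counter()
    for people in works:
        # sorted(set(...)) removes duplicates and gives each pair a canonical order.
        for a, b in combinations(sorted(set(people)), 2):
            weights[(a, b)] += 1
    return weights

# Hypothetical input: each inner list is the cast of a movie or the authors of a paper.
works = [["Ann", "Bob", "Cy"], ["Ann", "Bob"]]
weights = collaboration_graph(works)
```

The resulting weighted edge list can then be exported in the input format of the chosen graph-partitioning tool.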
Projects
6 and 7) Hadoop implementation of a ranking algorithm.
Project n.6: implementation of PageRank.
Project n.7: implementation of HITS.
– Given a web graph, where nodes represent web pages and
an edge between two nodes u and v represents the link
from the source page u to the target page v, implement a
ranking algorithm that computes the scores of the nodes.
Plot and analyze the distribution of the obtained scores.
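Before moving to Hadoop, it can help to have a small in-memory reference to check results against. The following is a minimal PageRank sketch using power iteration. The damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from pages without out-links are assumptions of this sketch, not requirements of the project. In a MapReduce version, the inner loop that distributes each page's share would roughly correspond to the map phase, and the per-page summation to the reduce phase.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Compute PageRank scores for a graph given as {source: [targets]}."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly (an assumption).
        dangling = sum(rank[u] for u in nodes if not graph.get(u))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for u, targets in graph.items():
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

On this toy graph the scores sum to 1, and page c, which receives links from both a and b, gets the highest score.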
Projects
8) Realize a reverse web graph with Hadoop.
– Given a web graph, the algorithm creates the graph with
reversed edges. For example, if the input graph has the
edge (u,v), the output graph will have the edge (v,u).
Represent the input and output graphs (or portions of
them) using a graph tool.
– Tip: for each link, the mapper emits a <target, source> pair.
The reducer concatenates the sources and
emits <target, list of sources> pairs.
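The map and reduce steps from the tip can be simulated locally in plain Python to check the logic before writing the Hadoop job. In this sketch the shuffle phase is imitated with a dictionary that groups values by key; the function names are hypothetical, and a real job would express the same two steps through Hadoop's own mapper and reducer interfaces.

```python
from collections import defaultdict

def map_link(source, target):
    """Map step: for a link source -> target, emit the pair (target, source)."""
    yield target, source

def reduce_sources(target, sources):
    """Reduce step: collect all sources that point to the same target."""
    return target, sorted(sources)

def reverse_graph(edges):
    """Simulate map, shuffle (group by key), and reduce on a list of edges."""
    grouped = defaultdict(list)
    for source, target in edges:
        for key, value in map_link(source, target):
            grouped[key].append(value)
    return dict(reduce_sources(t, s) for t, s in grouped.items())

# The edge (u, v) becomes (v, u) in the output, and so on.
reversed_edges = reverse_graph([("u", "v"), ("w", "v"), ("v", "u")])
```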
Projects
9) Realize an inverted index with Hadoop.
– Given a large collection of documents, the algorithm
creates the inverted index, where the dictionary contains
the indexed terms and the list of postings is stored for
each term.
– Tip (for the dictionary): the students can decide to use
stemming or to remove stop-words.
– Tip (for the postings): the students can build an inverted
index where each posting has the ID of the document
containing the term and other information, such as the
frequency of the term in the document and the positions of
the occurrences of the term in the document.
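The posting structure described in the tips can be prototyped in memory before distributing the work with Hadoop. The sketch below is only an illustration: the tokenizer, the tiny stop-word list, and the (document ID, frequency, positions) posting layout are assumptions, and stemming is left out.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "of", "and"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents, stop_words=STOP_WORDS):
    """Return {term: [(doc_id, frequency, positions), ...]} sorted by doc_id."""
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        # Positions refer to the token stream before stop-word removal.
        for position, term in enumerate(tokenize(text)):
            if term not in stop_words:
                positions[term][doc_id].append(position)
    return {
        term: [(doc_id, len(pos), pos) for doc_id, pos in sorted(by_doc.items())]
        for term, by_doc in positions.items()
    }

# Hypothetical two-document collection keyed by document ID.
docs = {1: "the web graph of the web", 2: "graph mining"}
index = build_inverted_index(docs)
```

In a MapReduce version, the mapper would typically emit one (term, posting) pair per document and the reducer would merge the postings of each term.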
Projects
10) Personalized ranking of news.
– Create a system that re-ranks news articles according to
the user's interests. Users can specify their interests by
selecting them from a list of keywords (e.g., gossip, sport,
politics, …). The system uses an algorithm that ranks the
news articles according to the user's preferences.
– Tip: the students can use different sources for collecting
the news articles.
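As a starting point, one very simple baseline is to score each article by how often the selected keywords occur in its text and to sort by that score. This is only a sketch of one possible approach, not the algorithm the project expects; the article format (a dictionary with "title" and "text" fields) and the function names are hypothetical.

```python
import re

def score(article_text, interests):
    """Count how many tokens of the article match one of the user's keywords."""
    tokens = re.findall(r"[a-z]+", article_text.lower())
    return sum(1 for token in tokens if token in interests)

def rerank(articles, interests):
    """Return the articles sorted by decreasing keyword score."""
    interests = {keyword.lower() for keyword in interests}
    return sorted(articles, key=lambda a: score(a["text"], interests), reverse=True)

# Hypothetical articles and user profile.
articles = [
    {"title": "A", "text": "The match ended in a draw"},
    {"title": "B", "text": "Sport news: sport fans celebrate"},
    {"title": "C", "text": "Politics and sport: politics wins"},
]
ranked_titles = [a["title"] for a in rerank(articles, {"sport", "politics"})]
```

A more realistic system would likely replace raw keyword counts with a weighting scheme such as TF-IDF or with a topic classifier.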
Projects
11) Enrich News using Tweets.
– Enrich a news site with the information published by the
users of Twitter. Given a news article, the system
gathers the user tweets about it and shows the news
article along with the tweets.
– Tip: students can use news about concerts of famous
singers, or about strikes, riots…
– Tip: students can decide to show a timeline of tweets at the
top of the page, or to rank them and show the top-n
tweets on the left of the page.
Projects
12) Enrich News using Wikipedia.
– Enrich the facts reported in news pages with information
extracted from Wikipedia. Given a news article, identify the
names of the people mentioned in the article and, for each of
them, report the Wikipedia information about their life.
– Tip: the students can use the Stanford Named Entity Recognizer
(http://nlp.stanford.edu/software/CRF-NER.shtml) for the
entity-extraction task. It makes it easy to find the names of
famous people.
– Tip: the students can use the whole Wikipedia page or
paragraphs extracted from it.
Other important information
• Graph datasets: students who want to work on graphs
but cannot crawl a portion of the Web can find
some large graphs here: http://law.di.unimi.it/datasets.php.
• News datasets: students who want to work on news
articles but cannot collect the pages from the Web should send
me an email.
• Some famous graph tools:
– Gephi (https://gephi.org/),
– METIS (http://glaros.dtc.umn.edu/gkhome/views/metis) for graph partitioning.
• For questions, send me an email; I will reply ASAP.