Web Information Retrieval
Projects
Ida Mele
Rules
• Students can work in teams (max 3 people)
• The project must be delivered by the deadline that will
be published on my web site. The project discussion is
usually held on the same day as the written exam. Students
who register for the first exam call can present the
software project in either the first or the second exam call
• The project score is from 0 to 10. The professor decides
the final mark
• The same project can be assigned to max 2 groups
• For any question/doubt/problem, send me an email
Project Request
• Students have to send me an email with the subject "WebIR project request", specifying:
• Name and last name of each student in the group
• Title of the project and dataset the students intend to use
• Short description of what the students intend to do (up to
250 words)
Important: all the members of the group should be cc-ed in
the email
• If everything is OK, you will receive a confirmation email
• There is no deadline for the request of the project
Project Delivery
• The presentation of the project takes 15 minutes
• The presentation should contain:
• the description of the problem and of the dataset
• the most important issues related to the implementation, and
how they have been addressed
• the results achieved
• Students can use slides for their presentations and, if they
wish, can also prepare a demo
• Deadline and more instructions about the project
delivery will be published on my web site
List of Projects
1) Analyze the link structure of a large graph from the Web
2) Find circles in a social network through link analysis
3) Find communities in a network of users
4) Classification of online reviews
5) Topic classification of tweets
6) Personalized ranking of query results
7) Hadoop implementation of a link-based ranking algorithm
8) Hadoop implementation of an inverted index
Projects
1) Analyze the link structure of a large graph from the Web
• Create the web graph and analyze its link structure by
computing degree, in-degree, out-degree, PageRank,
TruncatedPageRank, edge reciprocity, graph assortativity,
number of triangles, etc. Plot the distributions of the features
• List of datasets you can use:
• http://law.di.unimi.it/datasets.php → use one of the graphs
available in Section Larger crawls
• http://snap.stanford.edu/data/index.html → use graphs in Section
Web graphs (e.g., web-Google, web-Stanford, web-NotreDame)
• http://webdatacommons.org/hyperlinkgraph/ → use the graph
representing subdomains
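As a rough illustration of the features listed for project 1, the sketch below computes in-degree, out-degree, edge reciprocity, triangle count, and PageRank on a hypothetical six-edge toy graph using only the Python standard library. The edge list, the function name `pagerank`, and the parameter values are assumptions made for the example. On a real crawl, a graph library such as NetworkX or WebGraph would normally replace the hand-written loops.

```python
from collections import Counter

# Toy directed web graph as (source, target) edges; real datasets are far larger.
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("b", "c"), ("d", "c")]
nodes = sorted({n for e in edges for n in e})
edge_set = set(edges)

out_degree = Counter(u for u, _ in edges)
in_degree = Counter(v for _, v in edges)

# Edge reciprocity: fraction of edges whose reverse edge also exists.
reciprocity = sum((v, u) in edge_set for u, v in edges) / len(edges)

# Triangles are counted on the undirected version of the graph.
neighbors = {n: set() for n in nodes}
for u, v in edges:
    if u != v:
        neighbors[u].add(v)
        neighbors[v].add(u)
triangles = sum(
    1
    for u in nodes
    for v in neighbors[u]
    for w in neighbors[v]
    if u < v < w and w in neighbors[u]
)


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling nodes spread their score uniformly."""
    n = len(nodes)
    successors = {u: [] for u in nodes}
    for u, v in edges:
        successors[u].append(v)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        dangling = sum(rank[u] for u in nodes if not successors[u])
        new_rank = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in successors[u]:
                new_rank[v] += damping * rank[u] / len(successors[u])
        rank = new_rank
    return rank


ranks = pagerank(nodes, edges)
# Distribution of in-degrees: how many nodes have each in-degree value.
in_degree_distribution = Counter(in_degree[n] for n in nodes)
```

The `in_degree_distribution` counter is the kind of data one would plot (typically on log-log axes) to inspect the distribution of a feature.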
2) Find circles in a social network through link analysis
• Create the graph of the users of a popular social network (e.g.,
Twitter, Facebook, or Google+). Analyze the network and apply
link-based features to identify circles. Check if the circles you get
match the ones obtained from the analysis of common features
• List of datasets you can use:
• http://snap.stanford.edu/data/index.html → use one of the ego
graphs available in Section Social networks: ego-Facebook,
ego-Gplus, or ego-Twitter. Each dataset is made of the ego network,
the set of circles for the ego node, and the connections among ego
networks. You can use the file with the set of circles as ground truth
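One possible way to check detected circles against the ground truth is to match each ground-truth circle to the detected circle with the highest Jaccard similarity. The sketch below assumes that each line of the circles file holds a circle name followed by member IDs separated by whitespace. The helper names (`parse_circles`, `jaccard`, `best_matches`) and the toy data are hypothetical, and other evaluation measures are equally valid.

```python
def parse_circles(text):
    """Parse lines of the assumed form '<circle name> <id> <id> ...'."""
    circles = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) > 1:
            circles[parts[0]] = set(parts[1:])
    return circles


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(found, truth):
    """For each ground-truth circle, return (index of best found circle, score)."""
    result = {}
    for name, members in truth.items():
        score, match = max((jaccard(members, f), i) for i, f in enumerate(found))
        result[name] = (match, score)
    return result


# Toy ground truth and toy output of a circle-detection step.
truth = parse_circles("circle0 1 2 3 4\ncircle1 5 6 7")
found = [{"1", "2", "3"}, {"5", "6", "7", "8"}]
matches = best_matches(found, truth)
```

Averaging the best scores over all ground-truth circles gives a single number for comparing different detection methods.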
3) Find communities in a network of users
• Create a graph where nodes are people and a link between two
people represents the fact that they have something in common.
For example, they are collaborators (DBLP co-authorship network)
or they have bought the same product (Amazon product
co-purchasing network), etc. Use this graph to find communities of
people and check the results against the ground truth provided in the
dataset
• List of datasets you can use:
• http://snap.stanford.edu/data/index.html → use one of the
graphs available in Section Networks with ground-truth
communities (e.g., com-DBLP, com-Amazon, com-YouTube,
com-Friendster)
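The graph construction described for project 3 can be sketched as follows, assuming a hypothetical input that lists the authors of each paper. Two people are linked if they share at least one paper, and the edge weight counts the shared papers. Connected components are used here only as a crude community baseline, and a real project would apply a proper community-detection algorithm. The SNAP datasets already ship as edge lists, so the construction step applies mainly to custom data.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy input: each paper is listed with its authors.
papers = {
    "p1": ["ann", "bob"],
    "p2": ["bob", "cat"],
    "p3": ["dan", "eve"],
    "p4": ["ann", "bob"],
}

# Edge weight = number of papers the two people share.
weights = defaultdict(int)
for authors in papers.values():
    for u, v in combinations(sorted(set(authors)), 2):
        weights[(u, v)] += 1

# Undirected adjacency sets derived from the weighted edges.
adjacency = defaultdict(set)
for u, v in weights:
    adjacency[u].add(v)
    adjacency[v].add(u)


def connected_components(adjacency):
    """Return connected components as a list of sets (a crude community baseline)."""
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


communities = connected_components(adjacency)
```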
4) Classification of online reviews
• Given a set of user reviews about products (food, wine,
etc.), analyze the text and other features to classify the
reviews. Possible classifications include grouping reviews
by kind/brand of product, by judgment
(positive/neutral/negative), by helpfulness, etc.
• List of datasets you can use:
• http://snap.stanford.edu/data/index.html → use data
available in Section Online Reviews (e.g., CellarTracker,
Amazon reviews, Fine Foods, Movies)
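For the judgment classification (positive/negative), one common baseline is a multinomial naive Bayes classifier over the review text. The sketch below is a minimal standard-library version with add-one smoothing. The class name `NaiveBayes`, the tokenizer, and the four training reviews are made up for illustration, and the slides do not prescribe this or any other classifier.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocabulary:  # words unseen in training are ignored
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training reviews, for illustration only.
train_texts = [
    "great wine, lovely aroma",
    "excellent taste, great value",
    "terrible taste, awful finish",
    "awful product, terrible value",
]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_texts, train_labels)
```

On a real dataset the model should be evaluated on held-out reviews, and non-textual features such as ratings or helpfulness votes can be added.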
5) Topic classification of tweets
• Given a set of English tweets, implement a topic-classification algorithm that divides tweets into
categories. Possible categories are personal updates, news,
politics, economics, sports, music, gossip, etc. You can also
use ODP categories (http://www.dmoz.org/) for creating
the list of possible topics
• List of datasets you can use:
• Send me an email, and I will give you the link to the dataset
you can download
6) Personalized ranking of query results
• Create a system for query-result personalization. The users
of the system can specify their interests by selecting them
from a list of keywords (e.g., gossip, sport, politics, …). You
can use an HTML form for registering with the system.
• Crawl a portion of the web (e.g., news websites) and
create the corresponding webgraph. Use a personalized
ranking algorithm, for example, Topic-Specific PageRank,
for ranking the pages according to user interests and
compare the personalized ranking against the non-personalized one.
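The idea behind Topic-Specific PageRank is to bias the random jump toward pages that belong to the user's topics of interest. The sketch below shows this on a hypothetical four-page graph and compares it with the uniform (non-personalized) case. The function name `personalized_pagerank`, the page names, and the topic assignment are all assumptions made for the example.

```python
def personalized_pagerank(successors, teleport, damping=0.85, iterations=100):
    """PageRank where random jumps land on nodes according to `teleport`.

    `successors` maps each page to the pages it links to; `teleport` maps pages
    to jump probabilities that sum to 1. Score from dangling pages is also
    redistributed according to `teleport`.
    """
    nodes = list(successors)
    rank = {u: teleport.get(u, 0.0) for u in nodes}
    for _ in range(iterations):
        dangling = sum(rank[u] for u in nodes if not successors[u])
        new_rank = {
            u: (1 - damping + damping * dangling) * teleport.get(u, 0.0)
            for u in nodes
        }
        for u in nodes:
            for v in successors[u]:
                new_rank[v] += damping * rank[u] / len(successors[u])
        rank = new_rank
    return rank


# Hypothetical toy web graph: page -> pages it links to.
graph = {
    "news": ["sport1", "politics1"],
    "sport1": ["sport2", "news"],
    "sport2": ["sport1"],
    "politics1": ["news"],
}
# Hypothetical topic assignment: pages labeled as "sport".
sport_pages = ["sport1", "sport2"]

uniform = {page: 1 / len(graph) for page in graph}
sport_teleport = {page: 1 / len(sport_pages) for page in sport_pages}

generic = personalized_pagerank(graph, uniform)
personalized = personalized_pagerank(graph, sport_teleport)
ranking = sorted(graph, key=personalized.get, reverse=True)
```

For a user interested in several topics, one option is to combine the per-topic teleport vectors with weights that reflect the user's stated interests.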
7) Hadoop implementation of a link-based ranking algorithm
• Given a web graph, where nodes represent web pages and the
edge between two nodes u and v represents the link from the
source page u to the target page v, implement in Hadoop a
ranking algorithm (PageRank or HITS) to compute the scores of
the nodes. Plot and analyze the distribution of the obtained
scores
• List of datasets you can use:
• http://law.di.unimi.it/datasets.php → use one of the graphs
available in Section Larger crawls
• http://snap.stanford.edu/data/index.html → use graphs in Section
Web graphs (e.g., web-Google, web-Stanford, web-NotreDame)
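To make the MapReduce formulation of PageRank concrete, the sketch below simulates one map/shuffle/reduce round per iteration in plain Python. It is a local simulation, not Hadoop code: the names `mapper`, `reducer`, and `run_iteration` are hypothetical, the toy graph has no dangling nodes, and passing the node count straight to the reducer is an assumption that a real job would have to handle through its configuration.

```python
from collections import defaultdict

DAMPING = 0.85


def mapper(node, value):
    """Emit the adjacency list (to keep the graph) and rank shares for successors."""
    rank, successors = value
    yield node, ("links", successors)
    for target in successors:
        yield target, ("share", rank / len(successors))


def reducer(node, values, num_nodes):
    """Rebuild the node record from its adjacency list and incoming rank shares."""
    successors, incoming = [], 0.0
    for kind, payload in values:
        if kind == "links":
            successors = payload
        else:
            incoming += payload
    rank = (1 - DAMPING) / num_nodes + DAMPING * incoming
    return node, (rank, successors)


def run_iteration(state):
    """Simulate map, shuffle (group by key), and reduce in memory."""
    grouped = defaultdict(list)
    for node, value in state.items():
        for key, emitted in mapper(node, value):
            grouped[key].append(emitted)
    return dict(reducer(k, v, len(state)) for k, v in grouped.items())


# Toy graph without dangling nodes: node -> successors.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
state = {u: (1 / len(graph), succ) for u, succ in graph.items()}
for _ in range(30):
    state = run_iteration(state)
```

Each node re-emits its own adjacency list because the output of one iteration is the input of the next, so the graph structure has to travel along with the scores.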
8) Hadoop implementation of an inverted index
• Given a large collection of documents, create the inverted index,
which is made of a dictionary and the posting lists. The
dictionary contains indexed terms (remove stop-words and use
stemming for preprocessing). For each term in the dictionary,
the posting list contains information about documents where
the term appears. Each posting has the ID of the document, the
frequency of the term in the document, and the positions of the
occurrences of the term in the document
• List of datasets you can use:
• Project Gutenberg (http://www.gutenberg.org/) offers free ebooks
that can be used for creating the document collection
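The index structure described for project 8 can be sketched in memory as follows: each term maps to its postings, and each posting records the document ID and the positions of the term, with the term frequency given by the number of positions. The stop-word list and the `stem` function are deliberately tiny placeholders, not a real stop list or stemmer. In a Hadoop version, the map phase would typically emit (term, posting) pairs and the reduce phase would assemble the posting lists.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}  # tiny illustrative list


def stem(token):
    """Naive suffix stripping, a placeholder for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(documents):
    """Map each term to {doc_id: [positions]}; term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, token in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            if token in STOP_WORDS:
                continue  # stop words are skipped but still advance the position
            index[stem(token)].setdefault(doc_id, []).append(position)
    return index


# Two made-up documents standing in for a large collection.
documents = {
    1: "The whale hunted the ships",
    2: "Ships sailing in the storm",
}
index = build_index(documents)
# Postings for one term as {doc_id: (frequency, positions)}.
postings = {doc: (len(pos), pos) for doc, pos in index["ship"].items()}
```

Positions here are token offsets in the original text, so phrase queries can still account for removed stop words.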
Important Information
• Students can choose one of the projects in the list, or they can
propose a different project
• There are no constraints on the datasets to use: The students
can use the datasets suggested in the list of projects or
different datasets available on the Web, or they can even
create a new dataset for their project
• Links to other dataset sources:
• http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm
• http://www.trustlet.org/wiki/Repositories_of_datasets
• http://www-personal.umich.edu/~mejn/netdata/
• There are no constraints on programming languages, libraries,
and tools to use
• Links to some tools/libraries for working with graphs:
• Graph visualization: Gephi (http://gephi.org/), Graphviz
(http://www.graphviz.org/)
• Large-graph partitioning: METIS
(http://glaros.dtc.umn.edu/gkhome/metis/metis/overview)
• Java Library: WebGraph (http://webgraph.di.unimi.it/), JUNG
(http://jung.sourceforge.net/)
• Python library: NetworkX (http://networkx.github.io/)