“Miscellaneous Group”
What we learned
in Compiegne
Sven Abels, David Parry,
Katarzyna Wegrzyn-Wolska,
Wai Gen Yee
Contents
• Abels: Splitting compounds
• Parry: Attribution
• Wegrzyn-Wolska: Web page lifetimes
• Yee: P2P information retrieval
jWordSplitter
• Usage of "Bloom filters"?
– To test whether or not an element is a member of a set
– Strong space advantage compared to hash tables
• Connect words instead of splitting them?
– Needed e.g. in China (Google vs. Baidu)
• Reduction of dictionary to atomic words?
– To further reduce dictionary size and checking time
• Consideration of further language-specific rules
– For words that might need a grammatical change after
the decomposition
• Next: Evaluation of the improvement
– Using the two projects introduced in the presentation
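The Bloom filter suggestion above can be illustrated with a minimal sketch. It is not jWordSplitter code: the class name, the bit-array size, the number of hash functions, and the sample words are all assumptions chosen for illustration. A Bloom filter never reports a stored word as missing, but it may occasionally report an absent word as present, which is the price of the space saving.

```python
import hashlib


class BloomFilter:
    """Hypothetical minimal Bloom filter for dictionary membership tests."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly in the set"; False means "definitely not".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Sample atomic words (assumed for the example).
words = ["haus", "tür", "schlüssel"]
dictionary = BloomFilter()
for word in words:
    dictionary.add(word)

print(all(word in dictionary for word in words))
```

The filter stores only 128 bytes here regardless of word length, whereas a hash table would also have to keep the words themselves.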
Attribution
• K-distance has some similarity to n-grams,
but the compression algorithms give more
flexibility.
• Location of centroids for clustering can be
simplified; this may make clustering via
this approach more practical.
• The work in computational linguistics for
author identification is related.
• Using the compression dictionary directly
may allow comparison between
dictionaries rather than the “black box”
approach.
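The “black box” use of a compressor can be sketched as follows. The slides do not define the K-distance, so this example substitutes the normalized compression distance (NCD) with zlib as a stand-in; the function names and sample texts are assumptions. The distance is computed only from compressed sizes, without looking inside the compressor's dictionary.

```python
import zlib


def compressed_size(data: bytes) -> int:
    # zlib at maximum compression level; any general-purpose compressor could be used.
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: near 0 for similar texts, near 1 for unrelated ones."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Sample texts (assumed for the example).
text_a = ("The committee reviewed the proposal in detail and agreed that "
          "further measurements were needed before any decision could be "
          "made about funding.")
text_c = ("Zebras gallop quickly; jazz wizards fix twelve bright neon lamps "
          "by hand, juggling vexed quotes from my old xylophone workshop.")

# A text compared with itself should be much closer than two unrelated texts.
print(ncd(text_a, text_a) < ncd(text_a, text_c))
```

For attribution, one would compare an unknown text against samples from each candidate author and pick the smallest distance. On very short texts the compressor's fixed overhead makes the values noisy.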
Lifetimes of Web Pages
• The lifespan, accessibility and
archiving of dynamic documents
• Why is this problematic?
• Interesting comments and questions:
– Measuring the lifespan of dynamic documents and
its relevance to search engines.
– Defining the lifespan: when should a page be
considered a new one?
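One possible way to make the lifespan question concrete is sketched below. This is an assumption, not the definition from the talk: a page counts as "new" once its word-level similarity to the first snapshot falls below a threshold. The function names, the 0.5 threshold, the day-based timestamps, and the sample snapshots are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(old: str, new: str) -> float:
    # Compare word sequences rather than characters, so small edits score high.
    return SequenceMatcher(None, old.split(), new.split()).ratio()


def lifespan(snapshots, threshold=0.5):
    """Return the time until the page differs enough to count as new, else None.

    snapshots: list of (timestamp, text) pairs in chronological order.
    """
    if not snapshots:
        return None
    first_time, first_text = snapshots[0]
    for time, text in snapshots[1:]:
        if similarity(first_text, text) < threshold:
            return time - first_time
    return None  # the page never changed enough within the observed period


# Hypothetical snapshots; timestamps are in days.
snapshots = [
    (0, "breaking news: storm hits coast"),
    (10, "breaking news: storm hits the coast"),
    (20, "sports: local team wins championship final"),
]

print(lifespan(snapshots))
```

The small edit on day 10 keeps the page "the same", while the complete replacement on day 20 ends its lifespan. The choice of threshold is exactly the open question raised in the discussion.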
Peer-to-Peer IR
• Reputations of peers.
– Identify spoofers, spam.
• Expand the model:
– Development of P2P Googles.