AND_DatasetsWorkgroup


Transcript: Datasets Working Group, Second Workshop on Analytics for
Noisy Unstructured Text Data (AND 2008), July 24, 2008

Group-1
 Task: Datasets, benchmarks, and evaluation techniques for the analysis of
noisy texts.
 Participants: Maarten de Rijke, Amaresh Pandey, Donna Harman, Venu
Govindaraju, Aixin Sun, and Venkat Subramaniam
Datasets
 It is important to list the datasets that are already out there.
 A list of publicly available datasets can be added to the proceedings,
along with descriptions and comments.
 Create a table: dataset name and source; application; usability; tools
for creating and analyzing the datasets (see the sketch after this list).
 Take references from AND '07.
 List what is missing from existing datasets.
 Datasets can cover speech, text, OCR, etc.
 LDC and ELDC can be sources for speech data.
 NIST can be a source for OCR data.
 List tools and sources that provide data for academic and industry
research work.
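
As a sketch of the proposed table, an illustrative layout follows; only the
column headings and the two sources named on this slide come from the notes,
and the remaining cells are left for the working group to fill in.

Dataset name and source   Application   Usability          Tools
LDC / ELDC corpora        Speech        (to be assessed)   (to be listed)
NIST OCR collections      OCR           (to be assessed)   (to be listed)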
Benchmarks
 Identify popular tasks and organize competitions to create benchmarks.
 List past evaluations and benchmarks, say from TREC, and note what can
be done next.
 Blogs, speech, OCR in TREC-5, legal, spam, cross-language text,
historical texts, etc.
 Create a table: popular tasks; what benchmarks exist; new benchmarks
(see the sketch after this list).
 Give emphasis to certain types of datasets, such as blogs and OCR.
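
As an illustrative starting point for that table, the tasks named on this
slide can be paired with the TREC efforts that existed at the time; the
pairings below are an assumed mapping, and the new-benchmarks column is
deliberately left open.

Popular task                 What benchmarks exist      New benchmarks
Retrieval over OCR output    TREC-5 Confusion Track     (open)
Blog search and opinion      TREC Blog Track            (open)
Spam filtering               TREC Spam Track            (open)
Legal discovery              TREC Legal Track           (open)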
Evaluation
 Evaluation should be cascaded: measure the noise itself, the effect of
the noise, and the effects of the different stages of processing.
 Evaluation requires ground-truth data, and creating labeled truth is
costly; defining a common task on a given dataset is one way to get truth
data generated as a side effect.
 List evaluation techniques and metrics for common tasks (see the sketch
after this list).
 Create a table containing: task, evaluation technique, source, and
references.
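
As one concrete example of such a metric, below is a minimal Python sketch
of word error rate (WER), a standard measure for comparing noisy OCR or
speech-recognition output against ground truth; the function name and the
sample strings are illustrative, not from the talk.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with standard Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: OCR output with one substitution ("brwn") and one deletion ("fox").
print(word_error_rate("the quick brown fox", "the quick brwn"))  # 0.5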
Datasets, Benchmarks, Evaluation Techniques
 What datasets, benchmarks, and evaluation techniques are
needed for the analysis of noisy texts?
 Datasets today comprise mostly newswire data. Blogs, SMS, email,
voice, and other spontaneous-communication datasets are needed.
• TREC tracks have recently started including such
datasets.
 Are benchmarks and evaluations dependent on the task?
• QA over blogs: blogs are not factual.
• Business intelligence over customer calls and emails.
• Opinion and sentiment mining from emails and blogs.
• On such datasets, agreement between human annotators is also
very low (see the sketch after this list).
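
To make the agreement point concrete, below is a minimal Python sketch of
Cohen's kappa, a standard inter-annotator agreement statistic; the labels
are invented purely for illustration.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from the label marginals."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two annotators labelling the sentiment of five blog posts.
a = ["pos", "neg", "neg", "pos", "neu"]
b = ["pos", "neg", "pos", "pos", "neg"]
print(cohens_kappa(a, b))  # about 0.33: modest, despite 60% raw overlap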