AND Datasets Workgroup
Second Workshop on Analytics for Noisy Unstructured Text Data (AND), July 24, 2008
Group 1
Task: Datasets, benchmarks, and evaluation techniques for the analysis of noisy texts.
Participants: Maarten de Rijke, Amaresh Pandey, Donna Harman, Venu Govindaraju, Aixin Sun, and Venkat Subramaniam
Datasets
It is important to list the datasets that are already out there.
A list of publicly available datasets can be added to the proceedings, along with descriptions and comments.
Create a table: dataset name and source; application; usability; tools for creating and analyzing the datasets (see the sketch after this list).
Take references from AND '07.
List what is missing from existing datasets.
Datasets can cover speech, text, OCR, etc.
LDC and ELDC can be sources for speech data.
NIST can be a source for OCR data.
List tools and sources that provide data for academic and industry research.
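As an illustration only, here is a minimal sketch (in Python) of what one record of the proposed table could look like; the class name, field names, and the two example rows are assumptions, and the rows merely restate sources already named above (LDC for speech data, NIST for OCR data).

    # Sketch (assumption, not part of the working-group notes) of the proposed
    # dataset table as Python records. Field names follow the columns above.
    from dataclasses import dataclass

    @dataclass
    class DatasetEntry:
        name_and_source: str  # dataset name and where it comes from
        application: str      # e.g. speech recognition, OCR, text analytics
        usability: str        # licensing / availability notes
        tools: str            # tools for creating and analyzing the dataset

    catalogue = [
        DatasetEntry("LDC speech corpora (LDC)", "speech", "licensed", "speech toolkits"),
        DatasetEntry("NIST OCR collections (NIST)", "OCR", "publicly available", "OCR engines"),
    ]

    for row in catalogue:
        print(row)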
Benchmarks
Identify popular tasks and organize competitions to create benchmarks.
List past evaluations and benchmarks, for example from TREC, and note what can be done:
blogs, speech, OCR in TREC-5, legal, spam, cross-language text, historical texts, etc.
Create a table: popular tasks; what benchmarks exist; new benchmarks (see the sketch after this list).
Give emphasis to certain types of datasets, such as blogs and OCR.
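Below is a hedged sketch of the proposed benchmark table as a Python dictionary; the task names and layout are assumptions, the "existing" entries only restate the TREC efforts listed on this slide, and the new-benchmark column is left open for the working group.

    # Sketch (assumption) of the benchmark table keyed by popular task.
    benchmark_table = {
        "blog analytics":        {"existing": ["TREC Blog track"],         "new": []},
        "retrieval of OCR text": {"existing": ["TREC-5 Confusion track"],  "new": []},
        "spam filtering":        {"existing": ["TREC Spam track"],         "new": []},
        "legal discovery":       {"existing": ["TREC Legal track"],        "new": []},
    }

    for task, row in benchmark_table.items():
        print(f"{task}: existing={row['existing']}, proposed={row['new']}")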
Evaluation
Cascaded evaluation should be done: the noise itself, the effect of noise, and the effects of the different stages of processing.
Evaluation requires truth data, and creating labeled truth is costly; defining a common task on a given dataset means truth data gets generated as a by-product.
List evaluation techniques and metrics for common tasks (see the sketch after this list).
Create a table that contains: task, evaluation technique, source, and references.
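As one concrete example of an evaluation metric for noisy text, here is a hedged sketch of word error rate computed against a labeled reference; the function name and example strings are illustrative, but the metric shows why truth data is needed and could serve as one stage of a cascaded evaluation.

    # Illustrative sketch: word error rate (WER) of a system hypothesis against
    # a labeled reference, using Levenshtein (edit) distance over word tokens.
    def word_error_rate(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # Hypothetical OCR output with one character confusion and one dropped word.
    print(word_error_rate("the quick brown fox", "the qu1ck brown"))  # prints 0.5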
Datasets, Benchmarks, Evaluation Techniques
What datasets, benchmarks, and evaluation techniques are
needed for the analysis of noisy texts?
Datasets today comprise mostly newswire data. Blog, SMS, email,
voice, and other spontaneous-communication datasets are needed.
• TREC tracks have recently started including such datasets.
Are benchmarks and evaluations dependent on the task?
• QA over blogs: blogs are not factual.
• Business intelligence over customer calls and emails.
• Opinion and sentiment mining from emails and blogs.
• On such datasets, agreement between humans is also very low
(see the agreement sketch after this list).
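Agreement between annotators is commonly quantified with Cohen's kappa; the sketch below (function name and example labels are illustrative assumptions) shows how a low or negative kappa makes "very low agreement" concrete.

    # Illustrative sketch: Cohen's kappa for two annotators labeling the same items.
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        # Observed agreement between the two annotators.
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Agreement expected by chance, given each annotator's label frequencies.
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
        return (observed - expected) / (1 - expected)

    # Hypothetical sentiment labels from two annotators on six blog posts.
    a = ["pos", "neg", "pos", "neg", "pos", "neg"]
    b = ["pos", "pos", "neg", "neg", "neg", "pos"]
    print(cohens_kappa(a, b))  # about -0.33: worse than chance agreement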