Generating Semantic Annotations For Research Datasets
GENERATING AUTOMATIC SEMANTIC
ANNOTATIONS FOR RESEARCH DATASETS
AYUSH SINGHAL AND JAIDEEP SRIVASTAVA
CS DEPT.,
UNIVERSITY OF MINNESOTA, MN, USA
CONTENTS
Motivation
Problem statement
Proposed approach
Data type labelling
Experiments and results
Application concept
Experiments and results
Similar dataset identification
Experiments and results
Conclusions and future work
MOTIVATION
Annotation is the act of adding a note by way of comment or explanation.
Beyond documents, images and videos are searchable only when they carry tags or
annotations (i.e., descriptive content).
Recently, genomic and archaeological databases have been annotated to make them indexable.
ANNOTATING RESEARCH DATASETS
Without contextual annotation, research datasets are hard to find through popular search engines.
The goal is to make each dataset visible and informative.
EXAMPLE OF STRUCTURED ANNOTATION
PROBLEM STATEMENT
Given a dataset name “D” as a string of English characters, the research task is to
generate semantic annotations for the dataset denoted by “D” in the following
categories:
Characteristic data type
Application domain
List of similar datasets
PROPOSED APPROACH
Research challenges:
No universal schema for describing the content of a dataset; the only common attribute is the dataset name.
No well-known structure for the semantic annotation of research datasets.
The proposed structure should positively impact a user’s search for datasets.
CONTEXT GENERATION
Critical step: how to generate useful context for a dataset.
• Usage of the dataset in research.
• Research articles and journals.
• Get a proxy using web knowledge: the Google Scholar search engine.
The top-50 search results are used to build the context for the dataset, called its “global context” (a sketch follows below).
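A minimal sketch of this context-generation step, assuming a hypothetical scholar_search helper that returns the top Google Scholar hits (with title and snippet fields) for a dataset name; the talk does not specify how the results are fetched or parsed.

def build_global_context(dataset_name, scholar_search, k=50):
    # scholar_search is a hypothetical helper returning a list of result
    # dicts with 'title' and 'snippet' keys for a Google Scholar query.
    results = scholar_search(dataset_name)[:k]
    # Concatenate titles and snippets of the top-k hits into one document
    # that acts as a proxy for how the dataset is used in research.
    parts = []
    for r in results:
        parts.append(r.get("title", ""))
        parts.append(r.get("snippet", ""))
    return " ".join(p for p in parts if p)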
IDENTIFYING DATA TYPE LABELS
For a dataset ‘D’:
Given: global context of ‘D’, a list of data types
Required: data type of ‘D’
Approach: Supervised Multi-label classification
Feature construction (a pipeline sketch follows below):
0. Preprocessing of the global context: stop-word removal, etc.
1. BOW and TF-IDF representation of the global context of ‘D’.
2. Dimensionality reduction with PCA, retaining 98% of the variance.
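A minimal sketch of this feature-construction and classification pipeline using scikit-learn; the talk uses AdaBoost.MH, for which a one-vs-rest AdaBoost classifier is substituted here, and contexts / label_sets are placeholders for the global contexts and the author-provided data type labels.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import AdaBoostClassifier

def train_data_type_classifier(contexts, label_sets):
    # contexts: list of global-context strings, one per dataset (placeholder).
    # label_sets: list of data type label sets, one per dataset (placeholder).
    # 1. TF-IDF representation of the preprocessed global contexts
    #    (stop-word removal, as in the talk).
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(contexts).toarray()
    # 2. Dimensionality reduction with PCA, keeping 98% of the variance.
    pca = PCA(n_components=0.98, svd_solver="full")
    X_reduced = pca.fit_transform(X)
    # 3. Multi-label classification; one-vs-rest AdaBoost stands in for
    #    the AdaBoost.MH learner used in the talk.
    binarizer = MultiLabelBinarizer()
    y = binarizer.fit_transform(label_sets)
    clf = OneVsRestClassifier(AdaBoostClassifier()).fit(X_reduced, y)
    return vectorizer, pca, binarizer, clf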
EXPERIMENTS AND RESULTS
Dataset   Instances   Label count   Label density   Label cardinality
SNAP      42          5             0.34            1.69
UCI       110         4             0.275           1.1
Ground truth: author-provided data type labels.
Baseline: ZeroR classifier.
Evaluation metrics: typical multi-label classification metrics (Tsoumakas et al., 2010).
SNAP dataset:
Measure               ZeroR   AdaBoostMH (tfidf)
F-measure ↑           0.025   0.172
Average Precision ↑   0.657   0.663
Macro AUC ↑           0.5     0.54

UCI dataset:
Measure               ZeroR   AdaBoostMH (BOW)
F-measure ↑           0.854   0.873
Average Precision ↑   0.908   0.924
Macro AUC ↑           0.5     0.555
CONCEPT GENERATION
Given a dataset ‘D’, find k descriptors (n-gram phrases) for the application of the dataset.
Approach: concept extraction from world knowledge (Wikipedia, DBpedia).
Input feature: global context of ‘D’.
Preprocessing of the global context.
Used text-analytics tools (AlchemyAPI) for concept generation.
Pruning of the input query terms (a sketch follows below).
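A minimal sketch of this concept-generation step, assuming a hypothetical extract_concepts helper that stands in for the external concept-tagging service (AlchemyAPI over Wikipedia/DBpedia in the talk) and returns (concept, relevance) pairs; the exact pruning rules are not given in the slides.

def generate_application_concepts(global_context, dataset_name, extract_concepts, k=10):
    # Prune the input query terms so the dataset name itself does not
    # dominate the extracted concepts (assumed pruning rule).
    name_tokens = set(dataset_name.lower().split())
    pruned = " ".join(
        tok for tok in global_context.split() if tok.lower() not in name_tokens
    )
    # extract_concepts is a hypothetical wrapper around the concept-tagging
    # service; it is assumed to return a list of (concept, relevance) pairs.
    concepts = sorted(extract_concepts(pruned), key=lambda c: c[1], reverse=True)
    # Keep the k highest-relevance descriptors as the application concepts.
    return [name for name, _ in concepts[:k]]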
EXPERIMENTS AND RESULTS
Baseline: context generated from the short description provided by the dataset owner; the same text pre-processing was applied.
Evaluation metric: user rating.
(Figures: comparison of average user ratings on the UCI and SNAP datasets.)
IDENTIFYING SIMILAR DATASETS
Given a dataset ‘D’, find the k most similar datasets from a list of datasets.
Approach: cosine similarity between the TF-IDF vector of the global context of ‘D’ and the global context of each dataset d_i in the list.
Top-k selection from the list ranked in descending order of similarity (a sketch follows below).
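A minimal sketch of this similarity step using scikit-learn; the function and variable names are placeholders, and a shared TF-IDF vocabulary over the query and candidate contexts is assumed.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def most_similar_datasets(query_context, candidate_contexts, candidate_names, k=5):
    # Fit TF-IDF on the query context plus all candidate contexts so the
    # vectors share one vocabulary.
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([query_context] + candidate_contexts)
    # Cosine similarity between the query vector and every candidate vector.
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    # Rank candidates in descending order of similarity and keep the top k.
    ranked = sorted(zip(candidate_names, sims), key=lambda p: p[1], reverse=True)
    return ranked[:k]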
EXPERIMENTS AND RESULTS
Ground truth: dataset categorization provided by the dataset repository owners; SNAP and UCI use different categorizations.
Baseline: context generated from the owner’s description.
Evaluation metric: precision@k (a sketch follows the figures below).
(Figures: precision@k on the SNAP and UCI datasets.)
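A minimal sketch of the precision@k metric as it is commonly defined; the slides do not spell out the formula, so relevance is assumed to mean that a retrieved dataset shares the query dataset’s repository-provided category.

def precision_at_k(retrieved, relevant, k):
    # retrieved: ranked list of dataset names returned for a query dataset.
    # relevant: set of dataset names in the same repository-provided category.
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for d in top_k if d in relevant) / len(top_k)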
USE CASE: SYNTHETIC QUERYING
Synthetic querying on the annotated database of research datasets.
50 queries on the SNAP database and 50 queries on the UCI database.
Query structure: find a <data type> dataset used for <concept> like <similar to>.
The <fields> are generated at random from their respective lists (a sketch follows below).
Evaluation metric: overlap between the context of the retrieved results and the input query.
Baseline: querying Google and extracting dataset names from the retrieved results.
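A minimal sketch of the synthetic-query construction and an overlap score; the exact overlap measure is not specified in the talk, so a simple Jaccard overlap of word sets is assumed here.

import random

def synthetic_query(data_types, concepts, dataset_names):
    # Build one query of the form used in the talk, with each field drawn
    # at random from its respective list.
    return "find a {} dataset used for {} like {}".format(
        random.choice(data_types),
        random.choice(concepts),
        random.choice(dataset_names),
    )

def context_overlap(query, retrieved_context):
    # Assumed metric: Jaccard overlap between the query's word set and the
    # word set of a retrieved dataset's context.
    q = set(query.lower().split())
    c = set(retrieved_context.lower().split())
    return len(q & c) / len(q | c) if (q | c) else 0.0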
QUANTITATIVE AND QUALITATIVE EVALUATION
Comparison of Google results with the annotated DB for a few sample queries.
CONCLUSIONS AND FUTURE WORK
Real-world datasets play an important role for testing and validation purposes.
General-purpose search engines cannot find datasets due to the lack of annotation.
A novel concept of structured semantic annotation of datasets is proposed: data type labels, application concepts, and similar datasets.
Annotations are generated using the global context built from the web corpus.
Data type labels are identified with a multi-label classifier; using web context helps improve accuracy for both the SNAP and UCI test datasets.
CONCLUSIONS AND FUTURE WORK
Concept generation using web context performs better than the baseline, based on user ratings.
Web context is not significantly helpful in identifying similar datasets for the UCI and SNAP datasets.
18% improvement in accuracy over a normal Google search for datasets (for synthetic queries).
Future work: finding an overall encompassing structure of annotation; extending the analysis across different domains.
THANK YOU