Final - QUADS


SMART QUALITATIVE DATA: METHODS AND COMMUNITY TOOLS FOR DATA MARK-UP
UK DATA ARCHIVE-NLP COLLABORATION
ESDS Qualidata is applying semi-automated mark-up to
some components of its data collections, using
natural language processing (NLP) and information
extraction:
• new partnerships created – new methods, tools
and jargon to learn
• new area of application for NLP to social science
data
• growing interest in UK in applying NLP and text
mining to social science texts – data and research
outputs such as publications’ abstracts
Collaboration between:
• UK Data Archive, University of Essex (lead partner)
• Language Technology Group, Human
Communication Research Centre, School of
Informatics, University of Edinburgh
Duration: 18 months, 1 March 2005 – 31 October 2006
METADATA STANDARDS
The XML schema will specify a ‘reduced’ set of Text
Encoding Initiative (TEI) elements:
core tag set for transcription
names, numbers, dates <persName>
links and cross references <ref>
notes and annotations <note>
text structure <body>
unique to spoken texts <kinesic>
linking, segmentation and alignment <link>
advanced pointing - XPointer framework
text and AV synchronisation
contextual information (participants, setting, text)
XML: enabling a standardised format for interview transcripts
Interview text with XML tags embedded:
<u who="#interviewer" xml:id="u1">There's just one or two factual things first of all do you mind my asking how old you are?</u>
<u who="#subject" xml:id="u2">49.</u>
<u who="#interviewer" xml:id="u3">And what schools did you go to?</u>
<u who="#subject" xml:id="u4">
<orgName>King Street</orgName> ,
<orgName>Woodside</orgName> and
<orgName>Hilton</orgName> .
</u>
<u who="#interviewer" xml:id="u5">Uh-huh .. and how old were you when you left the school?</u>
<u who="#subject" xml:id="u6">14.</u>
<u who="#interviewer" xml:id="u7">And you work at the moment? What sort of work do you do?</u>
<u who="#subject" xml:id="u8">
Well I've gone back to get shorter hours, I've went back to domestic, which I dinna really care for. But then I used to be in the pharmacy department at
<orgName>ARI</orgName>
... just
<seg type="occupation">pharmacy assistant</seg>
Information about interviewee
Date of birth: 1930
Gender: female
Marital status: married
Occupation: pharmacy assistant
Geographic region: Scotland
LP: There's just one or two factual things first of all do you mind my asking how old you are?
G24: 49.
LP: And what schools did you go to?
G24: King Street, Woodside and Hilton.
LP: Uh-huh .. and how old were you when you left the school?
G24: 14.
LP: And you work at the moment? What sort of work do you do?
G24: Well I've gone back to get shorter hours, I've went back to domestic, which I dinna really care for. But then I used to be in the pharmacy department at ARI ... just pharmacy assistant. At least it was better than cleanin'! But then they've nae part-time workers there so..
LP: And did you work in the pharmacy long?
XML: enabling web-enabled display, search and browse
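As an illustration of why a standardised transcript format helps with display and search, the sketch below reads a few of the tagged utterances shown above with Python's standard XML parser. It is only a minimal example, not the project's own software: the wrapping <body> element, the function name read_utterances and the output field names are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Fragment of the tagged transcript, wrapped in a <body> element so that it
# parses as a single document (the wrapper is an assumption of this sketch).
TRANSCRIPT = """<body>
<u who="#interviewer" xml:id="u3">And what schools did you go to?</u>
<u who="#subject" xml:id="u4">
<orgName>King Street</orgName> ,
<orgName>Woodside</orgName> and
<orgName>Hilton</orgName> .
</u>
<u who="#interviewer" xml:id="u5">Uh-huh .. and how old were you when you left the school?</u>
<u who="#subject" xml:id="u6">14.</u>
</body>"""

# ElementTree exposes the reserved xml: prefix under this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def read_utterances(xml_text):
    """Return one record per <u> element: id, speaker, text, organisations."""
    root = ET.fromstring(xml_text)
    rows = []
    for u in root.iter("u"):
        rows.append({
            "id": u.get(XML_ID),
            # who="#subject" points at a participant; drop the leading '#'
            "speaker": u.get("who", "").lstrip("#"),
            # join all nested text and collapse runs of whitespace
            "text": " ".join("".join(u.itertext()).split()),
            "organisations": [o.text for o in u.iter("orgName")],
        })
    return rows


rows = read_utterances(TRANSCRIPT)
for row in rows:
    print(row["id"], row["speaker"], row["organisations"])
```

Because each turn carries its speaker and identifier as attributes, the same records could feed a web display or a search index without re-parsing the free text.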
WHAT IS SQUAD?
Main aim: to explore methodological and technical solutions for ‘exposing’ digital
qualitative data to make them fully shareable and exploitable.
Main objectives
• specify, test and propose an eXtensible Markup Language (XML) schema for
storing and marking up qualitative data
• investigate requirements for contextualising qualitative data and developing
standards for data documentation
• develop semi-automated tools, using natural language processing, for preparing
marked-up qualitative data for sharing
• research tools for publishing and interrogating data via the web – Qualitative
Data Mark-Up Tools (QDMT)
WHAT FEATURES DO WE NEED TO MARK-UP AND WHY?
Spoken interview texts provide the clearest and most common example of the types
of encoding features needed. There are three basic groups of structural features:
• utterance, specific turn taker, defining idiosyncrasies in transcription
• links to analytic annotation and other data types (e.g. thematic codes,
concepts, audio or video links, researcher annotations)
• identifying information such as real names, company names, place names,
occupations, temporal information
Identify atomic elements of information in text:
• personal names
• company/organisation names
• locations
• dates
• times
• percentages
• occupations
• monetary amounts
Example:
Italy's business world was rocked by
the announcement last Thursday that
Mr. Verdi would leave his job as vice-president of Music Masters of Milan,
Inc to become operations director of
Arthur Anderson.
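To make the idea of marking up atomic elements concrete, here is a toy sketch that tags the example sentence using a hand-built lookup list. It is not the NLP system used in the project, which this poster does not describe in detail: the GAZETTEER dictionary, the mark_up function and the use of a generic <seg type="..."> wrapper for every entity type are all assumptions made for illustration.

```python
import re

TEXT = ("Italy's business world was rocked by the announcement last Thursday "
        "that Mr. Verdi would leave his job as vice-president of Music Masters "
        "of Milan, Inc to become operations director of Arthur Anderson.")

# Hypothetical hand-built gazetteer covering only this sentence.
GAZETTEER = {
    "Italy": "location",
    "last Thursday": "date",
    "Mr. Verdi": "person",
    "vice-president": "occupation",
    "Music Masters of Milan, Inc": "organisation",
    "operations director": "occupation",
    "Arthur Anderson": "organisation",
}


def mark_up(text, gazetteer):
    """Wrap every gazetteer entry found in text in a <seg type="..."> element."""
    # Longest entries first, so a long name is preferred over a shorter one
    # that starts at the same position.
    entries = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in entries))

    def wrap(match):
        entity = match.group(0)
        return f'<seg type="{gazetteer[entity]}">{entity}</seg>'

    return pattern.sub(wrap, text)


marked = mark_up(TEXT, GAZETTEER)
print(marked)
```

A real information extraction system would recognise unseen names from context rather than from a fixed list; the lookup here only shows what the resulting mark-up looks like.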
USING NLP TOOLS
Information Extraction (IE) is a sub-field of NLP
which aims to identify key pieces of information in
texts using 'shallow' analysis techniques.
A typical IE system will perform Named Entity
Recognition where particular kinds of proper names
and terms are identified, classified and marked up.
This is a means of annotating documents with
semantic metadata – enabling resource
discovery and data exploration. The Java interface
tool developed in SQUAD is called CME.
ANNOTATION TOOL - ANONYMISE
This tool imports marked-up data from the CME
NLP system. Named entities are highlighted and
co-reference chains – e.g. numerous references to a
single person – are identified.
CAPTURING AND DEFINING DATA CONTEXT
Rich context enables informed re-use of data. But defining how to provide context
for raw data to make it more ‘usable’ is complex. ESDS Qualidata has spent ten
years working in the area of sharing qualitative data, and has done much to
establish informal ways of documenting raw data.
Both micro and macro level features should be considered, including: how the
research question was framed, the research application process, project progress,
fieldwork situations and analysis processes. Fieldwork observations are useful, as are
timelines and political chronologies. Equally, when undertaking a replication or
restudy, detailed information on sampling procedures, fieldwork approaches and
question guides will be essential.
SQUAD has identified a minimal generic set of elements that
represent a baseline for contextualising data. QUADS has
produced an edited collection on this issue as a special issue of
the journal Methodological Innovations Online.
sirius.soc.plymouth.ac.uk/~andyp/.
AUDIOVISUAL ARCHIVING
Archiving and exposure of qualitative data in a way that faithfully represents its
origins and context is important. Linking qualitative data to other distributed data
sources, such as audio-visual or geo-coded data (e.g. maps), can afford
creative and exciting ways of visualising data.
The formalised and systematic
archiving and sharing of digital
audio-visual data from
qualitative research is fairly
new.
Names can be anonymised with chosen
pseudonyms. The mapping from names to
pseudonyms is saved.
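The pseudonym step can be sketched as a simple substitution that also records which replacements were made. This is a minimal illustration, not the annotation tool itself: the pseudonymise function and the pseudonyms chosen below are hypothetical, and the example uses an organisation name from the interview excerpt.

```python
import json
import re


def pseudonymise(text, pseudonyms):
    """Replace each listed name with its chosen pseudonym.

    Returns the new text and a dict of the replacements actually used,
    so that the name-to-pseudonym mapping can be saved.
    """
    used = {}
    if not pseudonyms:
        return text, used
    # Longest names first, matched on word boundaries only.
    names = sorted(pseudonyms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    def swap(match):
        real = match.group(1)
        used[real] = pseudonyms[real]
        return pseudonyms[real]

    return pattern.sub(swap, text), used


# Hypothetical pseudonyms chosen by the person preparing the data.
CHOSEN = {"ARI": "City Hospital", "King Street": "School A"}

anonymised, mapping = pseudonymise(
    "I used to be in the pharmacy department at ARI.", CHOSEN)
print(anonymised)
print(json.dumps(mapping))  # the saved record of name -> pseudonym
```

Keeping the mapping separate from the anonymised text means the same pseudonym can be reused consistently across a collection while the key is stored under tighter access control.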
Annotations are explored in an XML format in the
NITE NXT model. NXT uses ‘stand off’ annotation –
where annotation is linked to or referenced by
words.
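The idea of stand-off annotation can be shown with a small sketch in which the words are stored once, each with an identifier, and annotations live in a separate list that only points at those identifiers. This does not reproduce NXT's actual file format; the identifier scheme, the annotation types and the function names are assumptions for illustration.

```python
def tokenise(text):
    """Split text on whitespace and give each word an id: w1, w2, ..."""
    return [{"id": f"w{i}", "text": w}
            for i, w in enumerate(text.split(), start=1)]


# The base layer: words only, never modified by annotation.
WORDS = tokenise("I used to be in the pharmacy department at ARI")

# Stand-off layers: each annotation refers to word ids instead of
# embedding tags in the text. Types shown here are hypothetical.
ANNOTATIONS = [
    {"id": "ne1", "type": "organisation", "words": ["w10"]},
    {"id": "code1", "type": "thematic-code:employment", "words": ["w7", "w8"]},
]


def resolve(annotation, words):
    """Return the stretch of text that an annotation points at."""
    by_id = {w["id"]: w["text"] for w in words}
    return " ".join(by_id[i] for i in annotation["words"])


for a in ANNOTATIONS:
    print(a["id"], a["type"], "->", resolve(a, WORDS))
```

Because the annotations never alter the word layer, several overlapping layers (named entities, thematic codes, researcher notes) can refer to the same words without conflicting with one another.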
DATA EXCHANGE STANDARDS
A uniform format for richly encoding qualitative
research is necessary as it:
• enables preservation and re-use of metadata,
data and annotation
• ensures consistency of presentation and
description of data
• supports the development of common web-based
publishing and search tools
• facilitates data interchange (e.g. CAQDAS
packages) and comparison among datasets
Progress:
• limited formal definition of a common XML
vocabulary and Document Type Definition (DTD)
based on the Text Encoding Initiative (TEI)
• testing of a new Qualitative Data Interchange
Format (QDIF)
SQUAD is helping to explore
XML representation and display
of audio-visual data.
TOOLS PROGRESS
• defined header metadata for a standardised transcript
• defined and tested generic XML models for qualitative data
• tested and refined NLP tools for qualitative data
• built front end to NLP named entity tools
• chosen software to enable annotation of data
• explored data export formats for longer-term archiving
• investigated powerful XML-based indexing tools for searching and
retrieving data
• investigated web display of multimedia data and pointers to other
resources using XML - extending the functionality of ESDS Qualidata
From Autumn 2006:
• formalising data exchange standard
• key word extraction systems to help
conceptually index qualitative data – text
mining collaboration
• exploring grid-enabling data: e-social science
collaboration
CONTACT
Louise Corti and Claire Grover
UK Data Archive
University of Essex
Colchester, Essex CO4 3SQ
Email: [email protected]
Tel: +44 (0)1206 872145
URL: quads.esds.ac.uk/squad