
Techniques for Information Searching and Retrieval of Web-based Multimedia Digital Library
Presented by:
Supervisors:
Markers:
Vincent Cheung
Prof. Michael Lyu
Prof. K.W. Ng
Prof. K. H. Lee
Prof. Y. S. Moon
3 May 2000
Abstract




- Digital libraries are becoming increasingly popular because of their strength in searching and retrieving information.
- A web-based environment provides a better medium for information sharing.
- There is a trend towards storing more multimedia information instead of pure text.
- This work researches techniques for multimedia information searching and retrieval in a web-based digital library.
Presentation Outline




- XML overview
- Data structures for multimedia news archives
  - for video clips
  - using graph structures of XML
  - giving annotation
- Architecture and agents of digital library
- Research plan and conclusion
Overview of XML
XML - eXtensible Markup Language
- Proposed by the World Wide Web Consortium (W3C) in 1998
- Defines a complete, platform-independent and system-independent environment for the authoring and delivery of information resources across the web
- Semistructured

How XML differs from HTML
- Extensibility - new tags may be defined at will
- Structure - XML structures can be nested to arbitrary depth
- Validation - an XML document can contain an optional description of its grammar

XML Documents

Use elements and attributes to describe your document:
<database>
  <news>
    <date year="2000" month="4" day="15"/>
    <title>Press warning appropriate, says Beijing</title>
    <reporter>Kong Lai-fan</reporter>
    <reporter>Greg Torode</reporter>
    <content>
      Beijing yesterday defended remarks made by senior
      SAR-based official Wang Fengchao that local media
      should avoid reporting separatist views.
    </content>
  </news>
  <news>
    . . .
  </news>
</database>
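As a minimal sketch of how a program might read such a document, the following Python uses the standard library's xml.etree.ElementTree. The slides do not name a parser, so the choice of library, the helper name read_news, and the inline sample string are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document mirroring the slide (straight quotes, as XML requires).
SAMPLE = """<database>
  <news>
    <date year="2000" month="4" day="15"/>
    <title>Press warning appropriate, says Beijing</title>
    <reporter>Kong Lai-fan</reporter>
    <reporter>Greg Torode</reporter>
    <content>Beijing yesterday defended remarks made by senior
    SAR-based official Wang Fengchao that local media
    should avoid reporting separatist views.</content>
  </news>
</database>"""


def read_news(xml_text):
    """Return one dict per <news> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    entries = []
    for news in root.findall("news"):
        date = news.find("date")
        entries.append({
            # Attributes carry the date; child elements carry the rest.
            "date": tuple(int(date.get(k)) for k in ("year", "month", "day")),
            "title": news.findtext("title"),
            "reporters": [r.text for r in news.findall("reporter")],
            # Collapse the line breaks inside <content> into single spaces.
            "content": " ".join(news.findtext("content").split()),
        })
    return entries


entries = read_news(SAMPLE)
```

The repeated reporter element becomes a list, which is the kind of irregular, optional structure that a fixed relational row would not capture as directly.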
Document Type Definition

Provides the definition of a document type for member documents to follow:
<!DOCTYPE database [
  <!ELEMENT database (news*)>
  <!ELEMENT news (date,title,reporter*,content)>
  <!ELEMENT date EMPTY>
  <!ATTLIST date
    year  CDATA #REQUIRED
    month CDATA #REQUIRED
    day   CDATA #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT reporter (#PCDATA)>
  <!ELEMENT content (#PCDATA)>
]>
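To make the grammar concrete, here is a rough Python sketch that checks two of the rules stated in the DTD: the child order of news (date, title, zero or more reporter, content) and the three required attributes of date. Python's standard-library parser does not validate against a DTD, so this is a hand-written stand-in, not a real validator; the function name conforms is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content model of <news> from the DTD: (date,title,reporter*,content).
# Child tag names are joined with trailing commas and matched as a string.
NEWS_MODEL = re.compile(r"date,title,(reporter,)*content,")
DATE_ATTRS = {"year", "month", "day"}


def conforms(xml_text):
    """Check a document against the two rules above (not a full validator)."""
    root = ET.fromstring(xml_text)
    if root.tag != "database":
        return False
    for news in root:
        if news.tag != "news":
            return False
        sequence = "".join(child.tag + "," for child in news)
        if not NEWS_MODEL.fullmatch(sequence):
            return False
        # All three attributes are #REQUIRED and no others are declared.
        if set(news.find("date").attrib) != DATE_ATTRS:
            return False
    return True


VALID = ('<database><news><date year="2000" month="4" day="15"/>'
         '<title>t</title><reporter>r</reporter><content>c</content>'
         '</news></database>')
NO_TITLE = ('<database><news><date year="2000" month="4" day="15"/>'
            '<content>c</content></news></database>')
```

A validating parser would apply the whole DTD automatically; the sketch only shows what "an optional description of its grammar" buys in practice.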
Data Structure for News Videos
Multimedia presentation
- Graph structure property
  - keyword directory
  - thesaurus / classification directory
  - person / place directory
  - Chinese-English dictionary
- Semistructure property
  - annotation

Indexing a Video




- Segment the video hierarchically into scenes. (A video is composed of one or more related scenes.)
- Describe the complete news video using bibliographic information (title, source, reporters, abstract, etc.) plus format, duration, etc.
- Describe each scene: id, start frame (time), end frame (time), keyframe, and scripts.
- An OCR tool for indexing the videos was implemented last semester.
Indexing a Video
For a news clip:
id = 1234
title = N. T. swamped after torrential downpour
date = 1999-9-9
source = Hong Kong ATV
reporter = Chan Tai Man
abstract = Large areas of the northwest New
Territories were under water yesterday as torrential
rain swept across the SAR.
duration = 2:34:56
has_scene = 1234.1, 1234.2, 1234.3
format = MPEG
language = Cantonese
identifier = http://www.cse.cuhk.edu.hk/1.mpg
Indexing a Video
For a scene:
id = 1234.1
belong_to = 1234
next_scene = 1234.2
prev_scene = null
start_time = 0:0:00
end_time = 0:30:45
keyframe = 1238
transcript = . . .
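The clip and scene records can be sketched as two small Python data classes. This is only one possible in-memory representation, assumed for illustration: the class names, the add_scene helper, and the times and keyframes of the second and third scenes are placeholders that do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Scene:
    id: str
    belong_to: str
    start_time: str
    end_time: str
    keyframe: int
    transcript: str = ""
    prev_scene: Optional[str] = None
    next_scene: Optional[str] = None


@dataclass
class NewsClip:
    id: str
    title: str
    scenes: List[Scene] = field(default_factory=list)

    def add_scene(self, start_time, end_time, keyframe, transcript=""):
        """Append a scene, number it <clip id>.<n>, and link its neighbours."""
        scene = Scene(
            id=f"{self.id}.{len(self.scenes) + 1}",
            belong_to=self.id,
            start_time=start_time,
            end_time=end_time,
            keyframe=keyframe,
            transcript=transcript,
        )
        if self.scenes:
            scene.prev_scene = self.scenes[-1].id
            self.scenes[-1].next_scene = scene.id
        self.scenes.append(scene)
        return scene

    @property
    def has_scene(self):
        return [s.id for s in self.scenes]


clip = NewsClip(id="1234", title="N. T. swamped after torrential downpour")
clip.add_scene("0:0:00", "0:30:45", keyframe=1238)
clip.add_scene("0:30:45", "1:10:00", keyframe=0)  # placeholder values
clip.add_scene("1:10:00", "2:34:56", keyframe=0)  # placeholder values
```

Keeping prev_scene and next_scene as ids rather than object references matches the slide's fields and maps directly onto XML attributes or elements.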
Sample News Entry
In NewsDatabase.XML:
<database>
  <news>
    <date><year>2000</year><month>4</month><day>15</day></date>
    <title>N.T. swamped after torrential downpour</title>
    <content>Large areas of the northwest New
    Territories were under water yesterday as
    torrential rain swept across the SAR.
    </content>
  </news>
  . . .
</database>
Keyword Directory

- Each news entry has its own keyword elements.
- Build a keyword directory containing all keywords.
- Every keyword points to the news entries that share that keyword.
Keyword Directory
[Figure: the news database as a tree. News entry ID = 0010 has title "N. T. swamped after torrential downpour", date 15 April, 2000, keyword "flood", and reporter Clifford Lo. The keyword directory lists keywords (flood, France, fuel, gun, …); the entry for "flood" holds news IDs 0010, 0017, and 0137. The database holds news entries with IDs 0010, 0015, 0017, 0043, …]
The news database is a tree structure.
News entries point to the keyword directory, and the keyword directory points back to news entries.
Because keywords point back into the news database, the whole forms a graph structure.
Keyword Directory
In NewsDatabase.XML:
<database>
  <news ID="0010">
    <date><year>2000</year><month>4</month><day>15</day></date>
    <title>N.T. swamped after torrential downpour</title>
    <keyword>flood</keyword>
    <keyword>storm</keyword>
    <content>Large areas of the northwest New
    Territories were under water yesterday as
    torrential rain swept across the SAR.
    </content>
  </news>
  . . .
</database>
Keyword Directory
In KeywordDirectory.XML:
<keyworddirectory>
  . . .
  <keyword word="flood">
    <newsid>0010</newsid>
    <newsid>0017</newsid>
    <newsid>0137</newsid>
    . . .
  </keyword>
  . . .
</keyworddirectory>
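One way to derive the keyword directory from the news database is an inverted index built in a single pass. The sketch below is an assumption about how this could be done, not the project's actual code; build_keyword_directory is a hypothetical name, and the second news entry is a placeholder whose title is not given in the slides.

```python
import xml.etree.ElementTree as ET

NEWS_XML = """<database>
  <news ID="0010">
    <title>N.T. swamped after torrential downpour</title>
    <keyword>flood</keyword>
    <keyword>storm</keyword>
  </news>
  <news ID="0017">
    <title>(placeholder entry)</title>
    <keyword>flood</keyword>
  </news>
</database>"""


def build_keyword_directory(news_xml):
    """Invert news -> keywords into keyword -> news ids (hypothetical helper)."""
    database = ET.fromstring(news_xml)
    index = {}
    for news in database.findall("news"):
        for keyword in news.findall("keyword"):
            index.setdefault(keyword.text.strip(), []).append(news.get("ID"))

    directory = ET.Element("keyworddirectory")
    for word in sorted(index):
        entry = ET.SubElement(directory, "keyword", word=word)
        for news_id in index[word]:
            ET.SubElement(entry, "newsid").text = news_id
    return directory


directory = build_keyword_directory(NEWS_XML)
flood = directory.find("keyword[@word='flood']")
```

Calling ET.tostring(directory, encoding="unicode") would serialise the result in the same shape as KeywordDirectory.XML.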
Thesaurus/Classification Directory
To search for terms with a meaning similar to the keyword:
<thesaurus>
  <item term="organisation">
    <spelling>organization</spelling>
    <similar>association</similar>
  </item>
  <item term="World Trade Organization">
    <spelling>World Trade Organisation</spelling>
    <abbreviation>WTO</abbreviation>
  </item>
  . . .
</thesaurus>
Thesaurus/Classification Directory
To search for subset terms of the given keyword:
<thesaurus>
  <item term="organisation">
    <spelling>organization</spelling>
    <similar>association</similar>
  </item>
  <item term="disaster">
    <contains>flood</contains>
    <contains>earthquake</contains>
    <contains>fire</contains>
    <contains>storm</contains>
  </item>
  <item term="flood">
    <belongs>disaster</belongs>
  </item>
  . . .
</thesaurus>
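A search front end could use the thesaurus to expand a query term before looking it up in the keyword directory. The following Python is a sketch under that assumption; the expand function is hypothetical, and the decision to follow spelling, similar, abbreviation and contains (but not belongs, which would broaden the query) is a design choice made here, not one stated in the slides. The sample uses well-formed closing tags.

```python
import xml.etree.ElementTree as ET

THESAURUS_XML = """<thesaurus>
  <item term="organisation">
    <spelling>organization</spelling>
    <similar>association</similar>
  </item>
  <item term="disaster">
    <contains>flood</contains>
    <contains>earthquake</contains>
    <contains>fire</contains>
    <contains>storm</contains>
  </item>
  <item term="flood">
    <belongs>disaster</belongs>
  </item>
</thesaurus>"""

# Relations that keep the query at the same or a narrower meaning.
FOLLOWED = ("spelling", "similar", "abbreviation", "contains")


def expand(term, thesaurus):
    """Return the term plus every term reachable through FOLLOWED relations."""
    items = {item.get("term"): item for item in thesaurus.findall("item")}
    seen, pending = [], [term]
    while pending:
        current = pending.pop(0)
        if current in seen:
            continue
        seen.append(current)
        item = items.get(current)
        if item is None:
            continue
        for child in item:
            if child.tag in FOLLOWED:
                pending.append(child.text.strip())
    return seen


thesaurus = ET.fromstring(THESAURUS_XML)
```

With this sample, a search for "disaster" would also look up flood, earthquake, fire and storm, while a search for "flood" stays narrow.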
Web Search Engine
Person / Place Directory
Person Directory (person ID, name, newsid, …)
<person_directory>
  <person id="wangfengchao">
    <name><first>Fengchao</first><last>Wang</last></name>
    <nationality>Chinese</nationality>
    <organization>The central Government's Liaison Office</organization>
    <position>deputy director</position>
    <newsid>0123</newsid> <newsid>0245</newsid> ...
  </person>
  . . .
</person_directory>
Person / Place Directory
In news database:
<newsdatabase>
  <news id="0123">
    <date year="2000" month="4" day="15"/>
    <title>Press warning appropriate, says Beijing</title>
    <reporter>Kong Lai-fan</reporter>
    <content>
      Beijing yesterday defended remarks made by senior
      SAR-based official <person id="wangfengchao">Wang
      Fengchao</person> that local media should
      avoid reporting separatist views.
    </content>
  </news>
  . . .
</newsdatabase>
Person / Place Directory
[Figure: news entry ID = 0123 (title "Press warning appropriate, says Beijing", date 15 April, 2000, keyword "media") mentions the person Wang Fengchao in its content. The person directory lists persons (Wang Fengchao, John, Tom, Robert, …); the entry for Wang Fengchao holds news IDs 0123, 0246, and 0369. The database holds news entries with IDs 0123, 0155, 0246, 0258, …]
News entries point to the person directory, and the person directory points back to news entries.
Person entries point back into the news database, forming a graph structure.
Person / Place Directory
Place Directory: category structure
<place_directory>
  <place id="china" class="country">
    <name>China</name>
    <newsid>5839</newsid> . . .
    <have_places>
      <place id="hongkong" class="SAR">
        <name>Hong Kong</name>
        <have_places>
          <place id="NT" class="district">
            <name>New Territories</name>
          </place>
          . . .
        </have_places>
        <newsid>0010</newsid> . . .
      </place>
      . . .
    </have_places>
  </place>
</place_directory>
Person / Place Directory
In news database:
<newsdatabase>
  <news id="0010" place="hongkong">
    <date year="2000" month="4" day="15"/>
    <title>N.T. swamped after torrential downpour</title>
    <reporter>Clifford Lo</reporter>
    <content>
      Large areas of the northwest <place id="NT">New
      Territories</place> were under water
      yesterday as torrential rain swept across the
      <place id="hongkong">SAR</place>.
    </content>
  </news>
  . . .
</newsdatabase>
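Because the place directory nests places inside places, a query for a country can also return news filed under its regions. The Python sketch below illustrates that idea on a well-formed copy of the place directory; news_for_place is a hypothetical helper and the slides do not say how the lookup is implemented.

```python
import xml.etree.ElementTree as ET

PLACES_XML = """<place_directory>
  <place id="china" class="country">
    <name>China</name>
    <newsid>5839</newsid>
    <have_places>
      <place id="hongkong" class="SAR">
        <name>Hong Kong</name>
        <have_places>
          <place id="NT" class="district">
            <name>New Territories</name>
          </place>
        </have_places>
        <newsid>0010</newsid>
      </place>
    </have_places>
  </place>
</place_directory>"""


def news_for_place(directory, place_id):
    """News ids attached to a place or to any place nested inside it."""
    for place in directory.iter("place"):
        if place.get("id") == place_id:
            # iter() walks the whole subtree in document order,
            # so news ids of nested places are included.
            return [n.text for n in place.iter("newsid")]
    return []


places = ET.fromstring(PLACES_XML)
```

A query for "china" therefore also reaches the Hong Kong entry, while a query for "hongkong" does not reach news filed only under China.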
Chinese-English Dictionary
Translate the keywords for searching
- We can have an English-to-Chinese dictionary:

<e2cdict>
  <english char="f">
    <english char="l">
      <english char="o">
        <english char="o">
          <english char="d">
            <chinese>氾濫</chinese>
            <chinese>水災</chinese>
            <chinese>洪水</chinese>
            . . .
          </english>
        </english>
      </english>
    </english>
  </english>
  . . .
</e2cdict>
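The nested english elements form a character trie: each level consumes one letter of the word. A lookup can be sketched in Python as a walk down that trie. The translate function is a hypothetical name, and the sample contains only the single path for "flood".

```python
import xml.etree.ElementTree as ET

E2C_XML = """<e2cdict>
  <english char="f"><english char="l"><english char="o">
    <english char="o"><english char="d">
      <chinese>氾濫</chinese>
      <chinese>水災</chinese>
      <chinese>洪水</chinese>
    </english></english>
  </english></english></english>
</e2cdict>"""


def translate(dictionary, word):
    """Follow one <english char="..."> child per letter, then read <chinese>."""
    node = dictionary
    for letter in word:
        node = next(
            (c for c in node.findall("english") if c.get("char") == letter),
            None,
        )
        if node is None:
            return []  # the word is not in the dictionary
    return [c.text for c in node.findall("chinese")]


e2c = ET.fromstring(E2C_XML)
```

Words that share a prefix share the outer elements, so the trie keeps the dictionary compact and makes prefix search straightforward.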
Chinese-English Dictionary

We can have a Chinese-to-English dictionary:
<c2edict>
  <chinese term="世">
    <chinese term="貿">
      <english>WTO</english>
      <english>World Trade Organization</english>
    </chinese>
    . . .
  </chinese>
  . . .
</c2edict>
Annotation
XML is semistructured!
- More flexibility in adding tags to content.
- We add our own tags to annotate strings and give them "meaning".
- Hence, more expressive queries can be supported.

Annotation: example
<content>
Radioactive coolant water leaked at a nuclear
reactor in western Japan yesterday, but the
accident had no impact on the environment, the
plant director said. "Today when the plant was
operating with its usual output, a worker found
a small leak of primary coolant water from a
pipe of the No 2 reactor," said Katsuhiko
Takahashi.
</content>

We understand… but the system doesn’t…
Annotation: example
<content>
<disaster nature="radioactive" death="0" injured="0">Radioactive
coolant water leaked at a nuclear reactor</disaster> in western
<place id="japan">Japan</place> yesterday, but the
accident had no impact on the environment, the
plant director said. "<speech speaker="Katsuhiko Takahashi">Today
when the plant was operating with its usual output, a worker
found a small leak of primary coolant water from a pipe of the
No 2 reactor</speech>," said
<person name="Katsuhiko Takahashi">Katsuhiko Takahashi</person>.
</content>
Usage of Annotation

So, we can have queries like:
- All speeches by Zhu Rongji in the last month
- All storms that killed more than 200 people

We can also add links that give more details about people, places, etc.
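Queries of this kind reduce to filtering annotated elements by their attributes. The Python sketch below shows the idea on an abridged sample. The function names are hypothetical, and the second news entry, including its death count, is made up purely so the second query has something to match. Restricting results to "the last month" would additionally need each entry's date and is left out.

```python
import xml.etree.ElementTree as ET

# Abridged sample; the storm entry and its figures are placeholders.
ANNOTATED_XML = '''<database>
  <news>
    <content>
      <disaster nature="radioactive" death="0">Radioactive coolant water
      leaked at a nuclear reactor</disaster> in western
      <place id="japan">Japan</place> yesterday.
      "<speech speaker="Katsuhiko Takahashi">Today when the plant was
      operating with its usual output, a worker found a small leak of
      primary coolant water</speech>," said Katsuhiko Takahashi.
    </content>
  </news>
  <news>
    <content>
      <disaster nature="storm" death="250">A placeholder storm
      report</disaster> used only for this example.
    </content>
  </news>
</database>'''


def speeches_by(root, speaker):
    """Whitespace-normalised text of every <speech> by the given speaker."""
    return [
        " ".join("".join(speech.itertext()).split())
        for speech in root.iter("speech")
        if speech.get("speaker") == speaker
    ]


def disasters(root, nature, min_deaths):
    """<disaster> elements of one nature with more than min_deaths deaths."""
    return [
        d for d in root.iter("disaster")
        if d.get("nature") == nature and int(d.get("death", "0")) > min_deaths
    ]


root = ET.fromstring(ANNOTATED_XML)
```

Without the annotations, both queries would have to fall back on free-text matching, which cannot tell a speaker from a mention or read a death count reliably.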
Architecture of Digital Library




- Design stores and query processors for semistructured data.
- Traditional database systems use a client/server architecture.
- The distributed environment has given rise to two new architectures: data warehouses and mediators.
- Video servers will also be integrated into our system to provide video streaming.
Data Warehouse
[Figure: clients send queries to the warehouse and receive answers. Each server holds its own data and sends updates to the warehouse, which keeps a copy of the data.]
Mediator
[Figure: clients send queries to the mediator and receive answers. The mediator forwards queries to the servers, each of which holds its own data, and collects their answers.]
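The difference between the two architectures can be sketched in a few lines of Python. All class names and the record layout are assumptions made for illustration: the warehouse answers from a local copy that servers refresh through updates, while the mediator holds no data and forwards every query to the servers at query time.

```python
class Server:
    """A data source holding its own records."""

    def __init__(self, records):
        self.records = list(records)

    def query(self, keyword):
        return [r for r in self.records if keyword in r["keywords"]]


class Warehouse:
    """Answers from a local copy; it only changes when an update arrives."""

    def __init__(self):
        self.records = []

    def update(self, records):
        self.records.extend(records)

    def query(self, keyword):
        return [r for r in self.records if keyword in r["keywords"]]


class Mediator:
    """Stores nothing; forwards each query and merges the answers."""

    def __init__(self, servers):
        self.servers = servers

    def query(self, keyword):
        answers = []
        for server in self.servers:
            answers.extend(server.query(keyword))
        return answers


def ids(records):
    return [r["id"] for r in records]


server_a = Server([{"id": "0010", "keywords": ["flood", "storm"]}])
server_b = Server([{"id": "0017", "keywords": ["flood"]}])

warehouse = Warehouse()
warehouse.update(server_a.records)
warehouse.update(server_b.records)
mediator = Mediator([server_a, server_b])

# A record added after the last update is visible through the mediator
# but not through the warehouse until the server pushes another update.
server_a.records.append({"id": "0137", "keywords": ["flood"]})
```

The trade-off is the usual one: the warehouse can answer without contacting any server but may be stale, while the mediator is always current but depends on every server being reachable at query time.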
Agents Using Structured Data

- There is growing demand for data that is more structured than loosely structured HTML.
- Semistructured XML data can provide a very good environment for Web agents.
- The main aim of implementing our agent is to illustrate that our semistructured XML data provides a better environment for an agent to work in.
Research Plan & Conclusion
- Design the XML semistructured data format to support multimedia data, multilingual data, and various kinds of retrieval.
- Design a system architecture that allows multiple sources of data.
- Implement an agent to illustrate that our semistructured data provides a better environment for an agent to work in.
Q & A Session