Transcript Slide 1
Special Topics in Computer Science
The Art of Information Retrieval
Chapter 6: Text and Multimedia
Languages and Properties
Alexander Gelbukh
www.Gelbukh.com
Previous chapter: Conclusions
Query operations:
Relevance feedback
o Simple, understandable
o Needs user attention
o Term re-weighting
Local analysis for query expansion
o Co-occurrences in the retrieved docs
o Usually gives better results than global analysis
o Computationally expensive
Global analysis
o Worse results. What is good for collection is not for a query
o Linguistic methods, dictionaries, ontologies, stemming, ...
2
Previous chapter:
Trends and research topics
Interactive interfaces
o Graphical, 2D or 3D
Refining global analysis techniques
Application of linguistics methods. Stemming.
Ontologies
Local analysis for the Web (now too expensive)
Combine the tree techniques (feedback, local, global)
3
Anatomy of a document...
We search for documents
What is a document?
4
Characteristics of a document
Syntax is a device that “plays” the document
producing semantics (kind of: presentation)
Like CD drive plays CD to produce music
Knowing Korean + paper w/glyphs meaning
5
...Anatomy of a document
Queries are conditions on semantics/presentation, not
on (binary?) data of the document
Thus need to know syntax
o Example: search in PS or PDF
How to describe formally?
6
Metadata
Info about the organization of data
o Data about the data
Descriptive vs. Semantic metadata
o Descriptive: about creation: author, date, ...
o Semantic: about meaning: keywords, subject codes, ...
Ontologies
o Others: who and how to use. E.g.: adult, confident, signature
Standards (many)
o Dublin Core Metadata Element Set: 15 fields. Descriptive.
o Machine Readable Catalog Record (MARC): bibliographic
WEB – very important
o Many projects on Web ontologies. Semantic Web.
7
Text
Encoding. ASCII-7, 8. UNICODE: oriental
Format. Binary vs. ASCII (better). DOC, RTF, PDF, PS
Compression. ZIP, ARJ
Binary in ASCII: uuencode
To predict behavior of tools and systems, need to model text
Entropy: the limit of compression, degree of chaos
Statistics of the letters and words
o Very skewed
8
Zipf law
9
Zipf law, etc.
i-th most frequent word appears k/i times, = 1.5 – 2
Mandelbrot form: k/(c+i)
50% of text are few hundred words
Most of them are stopwords: the, of, and, a, to, in...
Not indexed smaller indices
Distribution of words by docs
o p, k depend on collection and word
10
Heaps’ law
11
Heaps’ law, etc.
# of distinct words (size of vocabulary) = Kn,
o = 0.4 ... 0.6 square root
Applies to collections
To WWW
Average length of word. English:
o 4.8 ... 5.3 letters per word, average by text
o 6 ... 7 without stopwords
o 8 ... 9 average by vocabulary
12
Similarity between strings
symmetric; triangle: dist (a,c) dist (a,b) + dist (b,c)
Hamming: # of different positions. Also for sets.
Soundex: phonetic similarity
Levenshtein: min # insertions, deletions, substitutions
o dist (survey, surgery) = 2
o A very good measure
Longest common subsequence: survey, surgery surey
Various metrics to compare whole docs
o E.g., consider strings as symbols, or similarity of strings, etc.
13
Markup languages
“Our documents do not belong to us but to Bill Gates!”
Extra textual syntax to describe formatting, structure, ...
Marks are called tags.
Initial and ending tags surround the marked text.
Standard metalanguage: SGML (Standard Generalized
Markup Language)
o XML (eXtensible), its subset: new metalanguage for Web
o HTML is an instance of SGML
14
SGML
Provides rules for defining tags
A document consists of:
o Definitions of tags
Document Type Declaration, DTD
Informal comments or an additional description
o Text with tags
Tags: <tag>text</tag>
Mostly defines semantics, not printing format
o Defined in other languages
15
HTML
1992; 4.0: 1997
Instance of SGML
o Exists DTD, usually not used
Also does not define (much of) formatting. Thus:
Cascade Style Sheets (CSS)
o define aspects of formatting
o can be combined (cascaded)
o not well supported by browsers
Does NOT (unlike generic SGML ( too expensive))
o allow to specify new tags
o support nesting structures
o support validity checks
16
XML (eXtensible ...)
More flexible than HTML, simpler than SGML
Simplified subset of SGML
o Much simpler in implementation
Allows for human- and machine-readable markup
o Good for development of Web docs
o Allow to do things that now are done with Java scripts
Using DTD is optional, parser can discover tags
Extensible Style sheet Language (like CSS in HTML)
o Like macros in a word processor
Extensible Linking Language
17
Uses of XML
MathML: Mathematical Markup Language
o Not only presentation but also meaning of expressions!
SMIL: Synchronized Multimedia Integration Language
o Declarative language to specify positions and timing
Resource Description Format
o Metadata for XML
Trend: HTML evolutions to model and describe the
structure of data, not presentation details
18
Multimedia
Text, sound, images, video
Image formats. BMP. Compression:
o GIF. Good for few colors
o JPG. Lossy compression. Parametric: can be controlled
o TIFF is used for exchange; can contain metadata
Moving images:
o MPEG: Moving Pictures Expert Group. Encodes changes
Textual images. Compression. Retrieval:
o Metadata, keywords
o OCR. Many typos; keyword search should be approximate
o Treat as a sequence of images, convert query similarly
19
Taxonomy of Web languages
20
Conclusions
Modeling of text helps predict behavior of systems
o Zipf law, Heaps’ law
Describing formally the structure of documents allows
to treat a part of their meaning automatically, e.g.,
search
Languages to describe document syntax
o SGML, too expensive
o HTML, too simple
o XML, good combination
21
The class of Oct 30
is cancelled
Thank you!
Till November 6
22