Transcript Slide 1

Special Topics in Computer Science
The Art of Information Retrieval
Chapter 6: Text and Multimedia
Languages and Properties
Alexander Gelbukh
www.Gelbukh.com
Previous chapter: Conclusions
Query operations:
 Relevance feedback
o Simple, understandable
o Needs user attention
o Term re-weighting
 Local analysis for query expansion
o Co-occurrences in the retrieved docs
o Usually gives better results than global analysis
o Computationally expensive
 Global analysis
o Worse results. What is good for collection is not for a query
o Linguistic methods, dictionaries, ontologies, stemming, ...
2
Previous chapter:
Trends and research topics
 Interactive interfaces
o Graphical, 2D or 3D
 Refining global analysis techniques
 Application of linguistics methods. Stemming.
Ontologies
 Local analysis for the Web (now too expensive)
 Combine the tree techniques (feedback, local, global)
3
Anatomy of a document...
 We search for documents
 What is a document?
4
Characteristics of a document
 Syntax is a device that “plays” the document
producing semantics (kind of: presentation)
 Like CD drive plays CD to produce music
 Knowing Korean + paper w/glyphs  meaning
5
...Anatomy of a document
 Queries are conditions on semantics/presentation, not
on (binary?) data of the document
 Thus need to know syntax
o Example: search in PS or PDF
 How to describe formally?
6
Metadata
 Info about the organization of data
o Data about the data
 Descriptive vs. Semantic metadata
o Descriptive: about creation: author, date, ...
o Semantic: about meaning: keywords, subject codes, ...
 Ontologies
o Others: who and how to use. E.g.: adult, confident, signature
 Standards (many)
o Dublin Core Metadata Element Set: 15 fields. Descriptive.
o Machine Readable Catalog Record (MARC): bibliographic
 WEB – very important
o Many projects on Web ontologies. Semantic Web.
7
Text




Encoding. ASCII-7, 8. UNICODE: oriental
Format. Binary vs. ASCII (better). DOC, RTF, PDF, PS
Compression. ZIP, ARJ
Binary in ASCII: uuencode
To predict behavior of tools and systems, need to model text
 Entropy: the limit of compression, degree of chaos
 Statistics of the letters and words
o Very skewed
8
Zipf law
9
Zipf law, etc.





i-th most frequent word appears k/i times,  = 1.5 – 2
Mandelbrot form: k/(c+i)
50% of text are few hundred words
Most of them are stopwords: the, of, and, a, to, in...
Not indexed  smaller indices
 Distribution of words by docs
o p, k depend on collection and word
10
Heaps’ law
11
Heaps’ law, etc.
 # of distinct words (size of vocabulary) = Kn,
o  = 0.4 ... 0.6  square root
 Applies to collections
 To WWW
 Average length of word. English:
o 4.8 ... 5.3 letters per word, average by text
o 6 ... 7 without stopwords
o 8 ... 9 average by vocabulary
12
Similarity between strings




symmetric; triangle: dist (a,c)  dist (a,b) + dist (b,c)
Hamming: # of different positions. Also for sets.
Soundex: phonetic similarity
Levenshtein: min # insertions, deletions, substitutions
o dist (survey, surgery) = 2
o A very good measure
 Longest common subsequence: survey, surgery  surey
 Various metrics to compare whole docs
o E.g., consider strings as symbols, or similarity of strings, etc.
13
Markup languages
“Our documents do not belong to us but to Bill Gates!”




Extra textual syntax to describe formatting, structure, ...
Marks are called tags.
Initial and ending tags surround the marked text.
Standard metalanguage: SGML (Standard Generalized
Markup Language)
o XML (eXtensible), its subset: new metalanguage for Web
o HTML is an instance of SGML
14
SGML
 Provides rules for defining tags
 A document consists of:
o Definitions of tags
 Document Type Declaration, DTD
 Informal comments or an additional description
o Text with tags
 Tags: <tag>text</tag>
 Mostly defines semantics, not printing format
o Defined in other languages
15
HTML
 1992; 4.0: 1997
 Instance of SGML
o Exists DTD, usually not used
 Also does not define (much of) formatting. Thus:
 Cascade Style Sheets (CSS)
o define aspects of formatting
o can be combined (cascaded)
o not well supported by browsers
 Does NOT (unlike generic SGML ( too expensive))
o allow to specify new tags
o support nesting structures
o support validity checks
16
XML (eXtensible ...)
 More flexible than HTML, simpler than SGML
 Simplified subset of SGML
o Much simpler in implementation
 Allows for human- and machine-readable markup
o Good for development of Web docs
o Allow to do things that now are done with Java scripts
 Using DTD is optional, parser can discover tags
 Extensible Style sheet Language (like CSS in HTML)
o Like macros in a word processor
 Extensible Linking Language
17
Uses of XML
 MathML: Mathematical Markup Language
o Not only presentation but also meaning of expressions!
 SMIL: Synchronized Multimedia Integration Language
o Declarative language to specify positions and timing
 Resource Description Format
o Metadata for XML
Trend: HTML evolutions to model and describe the
structure of data, not presentation details
18
Multimedia
 Text, sound, images, video
 Image formats. BMP. Compression:
o GIF. Good for few colors
o JPG. Lossy compression. Parametric: can be controlled
o TIFF is used for exchange; can contain metadata
 Moving images:
o MPEG: Moving Pictures Expert Group. Encodes changes
 Textual images. Compression. Retrieval:
o Metadata, keywords
o OCR. Many typos; keyword search should be approximate
o Treat as a sequence of images, convert query similarly
19
Taxonomy of Web languages
20
Conclusions
 Modeling of text helps predict behavior of systems
o Zipf law, Heaps’ law
 Describing formally the structure of documents allows
to treat a part of their meaning automatically, e.g.,
search
 Languages to describe document syntax
o SGML, too expensive
o HTML, too simple
o XML, good combination
21
The class of Oct 30
is cancelled
Thank you!
Till November 6
22