Class_05 - UNC School of Information and Library Science

information retrieval
wed sept 02 2015
data…
-start at 6.45
framework for today’s lecture…
• data
• organizing data
• retrieving data
• tools supporting the process
Structured Data
• information with a high degree of organization
• easy to put into a relational database
• search is simple and straightforward
Unstructured Data
• essentially the opposite of structured data
• natural language / free text
STRUCTURED vs unstructured data
easy to envision structured data in terms of “tables”
Employee   Manager   Salary
Smith      Jones     68000
Chang      Smith     65000
Ivy        Smith     50000
Typically allows numerical range and exact match (for text)
queries, e.g., Salary < 60000 AND Manager = Smith.
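The example query can be tried against an in-memory copy of the table. This is a minimal sketch using Python's built-in sqlite3 module; the table name `employees` and the lowercase column names are illustrative assumptions, not something the slides specify.

```python
import sqlite3

# In-memory database holding the example table from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee TEXT, manager TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Smith", "Jones", 68000), ("Chang", "Smith", 65000), ("Ivy", "Smith", 50000)],
)

# Numerical range on salary combined with an exact text match on manager.
rows = conn.execute(
    "SELECT employee, manager, salary FROM employees "
    "WHERE salary < ? AND manager = ?",
    (60000, "Smith"),
).fetchall()

print(rows)  # [('Ivy', 'Smith', 50000)]
```

Only Ivy satisfies both conditions, which illustrates the "exact answer" behavior of a query over structured fields.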
Relational Databases
• Structured data
• Designed to provide search results with exact answers
• Queries built on schema of structured fields
• Lack of ranking mechanism (initially)
• We know the schema in advance, so semantic correlation between queries and data is clear
• We can get exact answers
Information Retrieval Systems
tables in an MS Access relational database, together defining a social networking site
data entry form in an MS Access relational database, used to create each record
structured vs UNSTRUCTURED data
• typically refers to free text
• email is a good example of unstructured data. It's indexed by date, time, sender, recipient, and subject, but the body of an email remains unstructured
• other examples of unstructured data include books, documents, medical records, and social media posts
a magazine article is an example of unstructured data
Relational Databases
Information Retrieval Systems
• Unstructured / semi-structured data
• Designed to support unstructured natural language full text search
• Ranking mechanism is very important: results must be sorted by relevance in order to satisfy the user’s information need
• We get inexact, estimated answers
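To make "sorted by relevance" concrete, here is a rough sketch of ranked retrieval over free text. The scoring rule (count how often the query terms occur) is a deliberately crude assumption chosen for illustration, not the method any particular IR system uses, and the three short documents are made up.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def rank(query, documents):
    """Return (doc_id, score) pairs sorted by descending score.

    The score is the total number of times the query terms occur in the
    document, a crude stand-in for a relevance estimate.
    """
    query_terms = tokenize(query)
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical mini-collection of free-text documents.
documents = {
    "d1": "Metadata is data about data.",
    "d2": "Email bodies are free text.",
    "d3": "Structured data fits a relational database.",
}
ranking = rank("data", documents)
print(ranking)  # [('d1', 2), ('d3', 1)]
```

Unlike the relational query, nothing here is an exact answer: the system estimates which documents are most likely to satisfy the information need and orders them accordingly.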
[Diagram: model of an IR system. Query → Representation function → Matching function → Results. Document collection (corpus) → Representation function → Index → Matching function. Labels: CATEGORIES, SUBJECT HEADINGS]
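The components in this model can be sketched in a few lines. This is an illustrative toy under stated assumptions: the representation function simply reduces text to a set of lowercase words, the index is an inverted index, and the matching function is a Boolean AND over the query terms. The function names and the three-document corpus are hypothetical.

```python
import re
from collections import defaultdict

def represent(text):
    """Representation function: reduce text to a set of lowercase terms."""
    return set(re.findall(r"[a-z]+", text.lower()))

def build_index(corpus):
    """Index: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in represent(text):
            index[term].add(doc_id)
    return index

def match(query, index):
    """Matching function: ids of documents containing every query term."""
    terms = represent(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

# Hypothetical document collection (corpus).
corpus = {
    1: "Magazine article about social media",
    2: "Medical records and social workers",
    3: "Books and documents",
}
index = build_index(corpus)
results = match("social media", index)
print(results)  # [1]
```

The same representation function is applied to both the query and the documents, which is what lets the matching function compare them.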
KWIC
Key word in context
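A KWIC display shows each occurrence of a search term with a few words of surrounding text. The sketch below is one simple way to produce such lines; the function name `kwic`, the bracket notation, and the window of three words are assumptions made for illustration. The sample text is the NISO definition of metadata quoted in this lecture.

```python
def kwic(text, keyword, width=3):
    """Return one line per occurrence of keyword, with `width` words each side."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively and ignore surrounding punctuation.
        if word.strip(".,;:!?").lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{words[i]}] {right}".strip())
    return lines

sample = (
    "Metadata is structured information that describes, explains, locates, "
    "or otherwise makes it easier to retrieve, use, or manage an "
    "information resource."
)
for line in kwic(sample, "information"):
    print(line)
# Metadata is structured [information] that describes, explains,
# or manage an [information] resource.
```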
metadata
What is Metadata?
• Classic definition: data about data
• Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. (NISO)
• 3 primary “types”:
– Descriptive
– Structural
– Administrative (rights management, preservation)
digital forensics
This reading really made me think about how easily
accessible and organized information is today because
of the implementation of metadata.
It sparked a few questions: Without metadata, how
would accessing data, resources and information be
different in today’s society?
-Chris
More Metadata: A Cataloging Record
http://search.lib.unc.edu/search?R=UNCb7097376
The Idea of Facets
• Facets are a way of labeling data
– A kind of Metadata (data about data)
– Can be thought of as properties of items
• Facets vs. Categories
– Items are placed INTO a category system
– Multiple facet labels are ASSIGNED TO items
Facets Epicurious example
http://www.epicurious.com/
• Create INDEPENDENT categories (facets)
– Each facet has labels (sometimes arranged in a hierarchy)
• Assign labels from the facets to every item
– Example: recipe collection
Item: Curry
    Ingredient: Chicken, Bell Pepper
    Cooking Method: Stir-fry
    Course: Main Course
    Cuisine: Thai
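The recipe example can be written down as a small data structure. This sketch assumes the item being labeled is the curry, with labels taken from the four facets on the slide; the second recipe ("Brownies") and the helper `filter_by_facets` are made up for illustration.

```python
# Each item carries labels from several independent facets.
# "Curry" follows the slide; "Brownies" is a hypothetical second item.
recipes = {
    "Curry": {
        "Ingredient": {"Chicken", "Bell Pepper"},
        "Cooking Method": {"Stir-fry"},
        "Course": {"Main Course"},
        "Cuisine": {"Thai"},
    },
    "Brownies": {
        "Ingredient": {"Chocolate"},
        "Cooking Method": {"Bake"},
        "Course": {"Dessert"},
        "Cuisine": {"American"},
    },
}

def filter_by_facets(items, wanted):
    """Return names of items that carry every requested facet label."""
    return sorted(
        name
        for name, facets in items.items()
        if all(label in facets.get(facet, set()) for facet, label in wanted.items())
    )

thai_mains = filter_by_facets(recipes, {"Cuisine": "Thai", "Course": "Main Course"})
with_chicken = filter_by_facets(recipes, {"Ingredient": "Chicken"})
print(thai_mains)    # ['Curry']
print(with_chicken)  # ['Curry']
```

Because the facets are independent, the same item is reachable by cuisine, by course, or by ingredient, without choosing a single category to file it under.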
The Idea of Facets
• Break out all the important concepts into their own facets
• Sometimes the facets are hierarchical
– Assign labels to items from any level of the hierarchy
Preparation Method
    Fry
    Saute
    Boil
    Bake
    Broil
    Freeze
Desserts
    Cakes
    Cookies
    Dairy
        Ice Cream
        Sherbet
        Flan
Fruits
    Cherries
    Berries
        Blueberries
        Strawberries
    Bananas
    Pineapple
Using Facets
• Now there are multiple ways to get to each item
• Same Preparation Method, Desserts, and Fruits facets as before
– One item: Fruit > Pineapple; Dessert > Cake; Preparation > Bake
– Another item: Dessert > Dairy > Sherbet; Fruit > Berries > Strawberries; Preparation > Freeze
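The two sets of paths above can be modeled as hierarchical facet labels. In this sketch each label is stored as a path from the facet root, and browsing means matching a path prefix. The item names ("item A", "item B") and the function `browse` are placeholders, since the slides do not name the two items.

```python
# Each facet label is a path from the facet root down the hierarchy.
# The item names are placeholders; the slides give only the paths.
items = {
    "item A": [
        ("Fruit", "Pineapple"),
        ("Dessert", "Cake"),
        ("Preparation", "Bake"),
    ],
    "item B": [
        ("Dessert", "Dairy", "Sherbet"),
        ("Fruit", "Berries", "Strawberries"),
        ("Preparation", "Freeze"),
    ],
}

def browse(items, prefix):
    """Return names of items having a label path that starts with prefix."""
    prefix = tuple(prefix)
    return sorted(
        name
        for name, paths in items.items()
        if any(path[:len(prefix)] == prefix for path in paths)
    )

print(browse(items, ["Fruit"]))                   # ['item A', 'item B']
print(browse(items, ["Dessert", "Dairy"]))        # ['item B']
print(browse(items, ["Preparation", "Freeze"]))   # ['item B']
```

Starting from any facet, at any level of its hierarchy, leads to the same item, which is the sense in which there are multiple ways to get to each one.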
labor intensive?
expensive?
UNC Libraries Online Catalog
http://www.lib.unc.edu/
e.g. personal crisis
caveat: semi-structured data
• in fact almost no data is absolutely “unstructured”
• e.g., this slide has distinctly identified zones such as the title and bullets
• facilitates “semi-structured” search such as
– title contains “data” AND bullets contain “structure”
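A zone-restricted search of this kind can be sketched by storing the slide as named zones. The dictionary layout and the function `zone_contains` are assumptions made for illustration, and matching is plain case-insensitive substring search, so "structure" also matches inside "unstructured".

```python
# A slide represented as named zones; the text comes from the slide itself.
slide = {
    "title": "caveat: semi-structured data",
    "bullets": [
        "in fact almost no data is absolutely unstructured",
        "this slide has distinctly identified zones such as the title and bullets",
    ],
}

def zone_contains(doc, zone, term):
    """True if term appears as a substring of the zone's text, ignoring case."""
    content = doc[zone]
    text = " ".join(content) if isinstance(content, list) else content
    return term.lower() in text.lower()

# title contains "data" AND bullets contain "structure"
hit = zone_contains(slide, "title", "data") and zone_contains(slide, "bullets", "structure")
# The word "bullets" appears in the bullets zone but not in the title zone.
miss = zone_contains(slide, "title", "bullets")
print(hit, miss)  # True False
```

The second check shows why zones matter: the same term can match or fail depending on which part of the document the query is restricted to.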
Let’s look at a database of magazine & journal articles…
…Academic Search Complete
>> UNC Libraries Homepage: http://www.lib.unc.edu/
>> E-Research by Discipline
>> Frequently Used
>> Academic Search Premier
[off-campus log in with onyen/password]
Organization / Search
• We organize to enable retrieval
• The more effort we put into organizing information, the more effectively it can be retrieved
• The more effort we put into retrieving information, the less it needs to be organized first
• We need to think in terms of investment, allocation of costs and benefits between the organizer and retriever
• The allocation differs according to the relationship between them; who does the work and who gets the benefit?