Lecture slides - UNC School of Information and Library Science
information retrieval
mon feb 08 2016
data…
& information organization
SPSS Workshop in Odum…
• Monday, February 29
• 2:00 – 3:30 pm
• Davis Library, Room 219 (same lab room)
• introduction to SPSS; teaches how to work with data saved in SPSS format
• no registration required
Anyone need an “SPSS Cheat Sheet”?
framework for today’s lecture…
• data
• organizing data
• retrieving data
• tools supporting the process
info organization activity
• in a small group, examine the cards that identify
various “documents” in a collection
• on the table organize the document surrogates into
some sort of schema – grouping by category (like
items with like)
• choose your own organization scheme and hierarchy
• if desired, write on the blank cards to create new or
uber categories
• be ready to share your organization method with the
class
Structured Data
• information with a high degree of organization
• easy to put into a relational database
• search is simple and straightforward
Unstructured data
• essentially the opposite of structured data
• natural language / free text
STRUCTURED vs unstructured data
easy to envision structured data in terms of “tables”
Employee   Manager   Salary
Smith      Jones     68000
Chang      Smith     65000
Ivy        Smith     50000
Typically allows numerical range and exact match (for text)
queries, e.g., Salary < 60000 AND Manager = Smith.
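The example query can be tried against the table above. This is a minimal sketch using Python's built-in sqlite3 module; the table name `staff` and the lowercase column names are assumptions made for illustration, not something the slide specifies.

```python
import sqlite3

# In-memory database holding the example table from the slide.
# The table name "staff" is a hypothetical choice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (employee TEXT, manager TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("Smith", "Jones", 68000), ("Chang", "Smith", 65000), ("Ivy", "Smith", 50000)],
)

# A numerical range condition combined with an exact text match.
rows = conn.execute(
    "SELECT employee, manager, salary FROM staff "
    "WHERE salary < 60000 AND manager = 'Smith'"
).fetchall()
print(rows)  # [('Ivy', 'Smith', 50000)]
```

Only Ivy satisfies both conditions, so the answer is exact: a row either matches or it does not.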
Relational Databases
• Structured data
• Designed to provide search results with exact answers
• Queries built on schema of structured fields
• Lack of ranking mechanism (initially)
• We know the schema in advance, so semantic correlation between queries and data is clear
• We can get exact answers
Information Retrieval Systems
tables in an MS Access relational database – define each entity in a social networking site
data entry form in an MS Access relational database – creates each record
structured vs UNSTRUCTURED data
• typically refers to free text
• email is a good example of unstructured data: it is indexed by date, time, sender, recipient, and subject, but the body of an email remains unstructured
• other examples of unstructured data include
books, documents, medical records, and social
media posts
a journal article is an example of unstructured data
Relational Databases
Information Retrieval Systems
• Unstructured / semistructured data
• Designed to support unstructured natural language full text search
• Ranking mechanism is very important – results must be sorted by relevance in order to satisfy user’s information need
• We get inexact, estimated answers
[Diagram: information retrieval model]
Query → Representation function → Matching function → Results
Document collection (corpus) → Representation function → Index → Matching function
Index labels: CATEGORIES, SUBJECT HEADINGS
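The pieces named in the diagram (representation function, index, matching function, results) can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular retrieval system works: the function names `represent`, `build_index`, and `match` are hypothetical, the three-document corpus is made up, and ranking by summed term counts stands in for a real relevance model.

```python
import re
from collections import Counter, defaultdict

def represent(text):
    """Representation function: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(corpus):
    """Build an inverted index mapping each term to {doc_id: term count}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for term, count in Counter(represent(text)).items():
            index[term][doc_id] = count
    return index

def match(query, index):
    """Matching function: score documents by summed counts of the query terms."""
    scores = Counter()
    for term in set(represent(query)):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up document collection (corpus) for illustration only.
corpus = {
    "d1": "Salary data for library staff",
    "d2": "Organizing data and retrieving data",
    "d3": "An email about the SPSS workshop",
}
index = build_index(corpus)
results = match("retrieving data", index)
print(results)  # [('d2', 3), ('d1', 1)]
```

Unlike an exact-match database query, the result is a ranked list: d1 is returned even though it contains only one of the two query terms, just lower down.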
KWIC: key word in context
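A key-word-in-context display shows each hit with a few surrounding words so the searcher can judge relevance at a glance. The sketch below is one simple way to produce it; the function name `kwic`, the `width` parameter, and the bracket formatting are assumptions chosen for illustration.

```python
def kwic(text, keyword, width=3):
    """Return one line per occurrence of keyword, with up to `width` words
    of context on each side and the hit wrapped in brackets."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring punctuation attached to the word.
        if word.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines

text = (
    "Email is a good example of unstructured data. "
    "Other examples of unstructured data include books and medical records."
)
for line in kwic(text, "data"):
    print(line)
# example of unstructured [data.] Other examples of
# examples of unstructured [data] include books and
```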
metadata
What is Metadata?
• Classic definition: data about data
• Metadata is structured information that
describes, explains, locates, or otherwise
makes it easier to retrieve, use, or manage an
information resource. (NISO)
• 3 primary “types”:
– Descriptive
– Structural
– Administrative (rights management, preservation)
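As a rough illustration of the three types, a single record might group its elements as follows. This is a sketch only: the class name `MetadataRecord`, the element names, and all values describe an imaginary scanned report and are not taken from NISO or any real metadata schema.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataRecord:
    """Groups metadata elements under the three primary types."""
    descriptive: dict = field(default_factory=dict)     # helps find and identify the resource
    structural: dict = field(default_factory=dict)      # how the parts of the resource fit together
    administrative: dict = field(default_factory=dict)  # rights management, preservation

# Hypothetical values for an imaginary scanned report.
record = MetadataRecord(
    descriptive={"title": "Example annual report", "subject": ["budgets", "libraries"]},
    structural={"page_order": ["cover.tif", "page1.tif", "page2.tif"]},
    administrative={"rights": "in copyright", "file_format": "TIFF"},
)

def has_subject(rec, term):
    """Exact-match lookup against the descriptive subject terms."""
    return term in rec.descriptive.get("subject", [])

print(has_subject(record, "libraries"))  # True
```

The descriptive elements are the ones a searcher would typically query; the structural and administrative elements mostly serve display, rights, and preservation.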
digital forensics
How do we organize a collection of
“documents” so that users can find what
they need?
from Glushko reading…
• what three types/forms of categorization does
Glushko discuss in the Categorization in the
Wild piece?
• give a real-world example of a categorization
system and briefly describe the purpose
behind it (i.e. what problem is it trying to
address?)
from Glushko reading…
• Cultural categorization
– Embodied in culture and language
– Acquired implicitly through development via
parent-child interactions, language, and
experience
– Formal education can build on this, but the non-formal cultural system can often dominate
– Traditional perspective for thinking and research
about categorization
From Glushko reading…
• Individual categorization
– A system developed by an individual for organizing
a personal domain to aid memory, retrieval, or
usage
– Can serve social goals to convey information,
develop a community, manage reputation
– Have exploded with the advent of social
computing, especially in applications based on
“tagging”
– An individual’s system of tags in web applications
is sometimes called a “folksonomy”
From Glushko reading…
• Institutional categorization
– Systems created to serve institutional goals and
facilitate sharing of information and increase
interoperability
– Helps to streamline interactions and transactions
so that consistency, fairness and higher yields can
result.
Let’s look at a database of magazine & journal articles…to see
how information is organized – with particular attention to
value-added SUBJECT TERMS/HEADINGS (categorization)
…Academic Search Premier
>> UNC Libraries Homepage: http://www.lib.unc.edu/
>> E-Research by Discipline
>> Frequently Used
>> Academic Search Premier
[off-campus log in with onyen/password]
Handout Activity #2
info organization & search
• We organize to enable retrieval
• The more effort put into organizing information, the more
effectively it can be retrieved
• The more effort we put into retrieving information, the less it
needs to be organized first
• We need to think in terms of investment, allocation of costs
and benefits between the organizer and retriever
• The allocation differs according to the relationship between
them; who does the work and who gets the benefit?
final notes…
• Homework #2: Database report
– sign up for a database – or talk with me about a suggestion
– next Wednesday – 5-min reports in class
• Wednesday: “Information Retrieval” intro with
Dr. Jaime Arguello (required reading prep)
• Wednesday: Data to Story Project – speed date/pitch