Just-In-Time Browsing - BYU Data Extraction Research Group

Toward Automatic Processing
and Indexing of Microfilm
Microfilm Processing
Images are scanned
from ribbons of
microfilm.
Each image on the
microfilm ribbon is
then cropped and
de-skewed.
Microfilm Processing
Cropped and De-skewed Image
Image Zoning
• The lines printed in a document
produce a distinctive signature.
• The algorithm searches for these
signatures to detect the lines that
define a table.
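The slides do not say what the line signature is. As a minimal sketch, assuming the signature is a projection profile (a row or column of pixels that is almost entirely dark), ruled lines could be located as below. The function names, the 0/1 pixel encoding, and the `min_fill` threshold are illustrative assumptions, not the group's actual algorithm.

```python
from typing import List

Image = List[List[int]]  # assumed encoding: 1 = dark pixel, 0 = light pixel


def find_ruled_rows(image: Image, min_fill: float = 0.9) -> List[int]:
    """Return the indices of pixel rows that are at least min_fill dark."""
    rows = []
    for y, row in enumerate(image):
        if row and sum(row) / len(row) >= min_fill:
            rows.append(y)
    return rows


def find_ruled_columns(image: Image, min_fill: float = 0.9) -> List[int]:
    """Return the indices of pixel columns that are at least min_fill dark."""
    transposed = [list(column) for column in zip(*image)]
    return find_ruled_rows(transposed, min_fill)


# A tiny 5x5 "table": a border plus one interior horizontal rule.
sample = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
]
horizontal_rules = find_ruled_rows(sample)
vertical_rules = find_ruled_columns(sample)
```

The regions between consecutive rules would then be the candidate zones passed on to later stages.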
Image Zoning
Automatically Identifies
Table Structure.
Optical Character Recognition
• A neural net evaluates each zone in
the image.
• The neural net converts the printed
characters in each zone into
ASCII text.
Optical Character Recognition
Automatically Converts
Printed Text to ASCII.
Column-Row Recognition
• The algorithm uses the geometry of
each zone to identify the table’s
columns and rows.
• The algorithm associates each column
and row label with its values in the
table.
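One way to picture this step, as a sketch only: if each zone carries the pixel position of its top-left corner, zones whose left edges nearly coincide share a column, and zones whose top edges nearly coincide share a row. The `Zone` type, the `tol` tolerance, and the simplification to column labels in the first row are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Zone:
    x: int      # left edge of the zone, in pixels
    y: int      # top edge of the zone, in pixels
    text: str   # OCR output for the zone


def cluster(coords: List[int], tol: int) -> Dict[int, int]:
    """Map each coordinate to a group index; a gap larger than tol starts a new group."""
    groups: Dict[int, int] = {}
    current, last = -1, None
    for value in sorted(set(coords)):
        if last is None or value - last > tol:
            current += 1
        groups[value] = current
        last = value
    return groups


def build_grid(zones: List[Zone], tol: int = 5) -> List[List[str]]:
    """Place each zone's text into a row/column grid derived from its geometry."""
    cols = cluster([z.x for z in zones], tol)
    rows = cluster([z.y for z in zones], tol)
    grid = [[""] * (max(cols.values()) + 1) for _ in range(max(rows.values()) + 1)]
    for z in zones:
        grid[rows[z.y]][cols[z.x]] = z.text
    return grid


def label_values(grid: List[List[str]]) -> List[Dict[str, str]]:
    """Associate each column label (assumed to be in the first row) with its values."""
    header, *body = grid
    return [dict(zip(header, row)) for row in body]


zones = [
    Zone(10, 10, "Address"), Zone(200, 12, "Full Name"),
    Zone(11, 50, "Collafer"), Zone(202, 49, "John Eyres"),
]
records = label_values(build_grid(zones))
```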
Column-Row Recognition
Identify Labels
• The algorithm maps the printed text of
each label to a standardized name.
• The standardized names correspond
to the fields in a database.
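A simple way to sketch this mapping is a keyword lookup. The rule table below is hypothetical: the keywords were chosen only to cover the three census labels shown in this presentation, and the slides do not describe how the actual mapping is done.

```python
import re
from typing import Optional

# Hypothetical rules: standardized field name -> keywords that signal it.
RULES = [
    ("Address", ("road", "street", "house")),
    ("Full Name", ("surname",)),
    ("Relationship", ("relation",)),
]


def standardize_label(printed: str) -> Optional[str]:
    """Map a printed column label to a standardized field name, or None if unknown."""
    words = set(re.findall(r"[a-z]+", printed.lower()))
    for field, keywords in RULES:
        if words & set(keywords):
            return field
    return None


address = standardize_label("ROAD, STREET, &c., and No. or NAME of HOUSE")
full_name = standardize_label("NAME and Surname of each Person")
relationship = standardize_label("RELATION to Head of Family")
```

Rule order matters here: the address label also contains the word "NAME", so the sketch keys the name field on "surname" rather than "name".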
Identify Labels
ROAD, STREET, &c., and No. or NAME of HOUSE → Address
Identify Labels
Address
NAME and Surname of each Person → Full Name
Identify Labels
Address
Full Name
RELATION to Head of Family → Relationship
Extract Data
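• The algorithm identifies factored table
values, that is, values written once that
apply to several rows (such as the
address shared by a household).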
• The algorithm stores each record
in an XML file.
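The two bullets above can be sketched as follows, reading a "factored" value as one that is written once and applies to the rows beneath it. The helper names, the `record` element, and the tag names derived from the field names are all assumptions; the slides do not give the actual XML schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List


def expand_factored(rows: List[Dict[str, str]], factored: Iterable[str]) -> List[Dict[str, str]]:
    """Copy the most recent non-empty value of each factored field into rows that lack it."""
    factored = list(factored)
    carried: Dict[str, str] = {}
    expanded = []
    for row in rows:
        full = dict(row)
        for field in factored:
            if full.get(field):
                carried[field] = full[field]
            elif field in carried:
                full[field] = carried[field]
        expanded.append(full)
    return expanded


def record_to_xml(record: Dict[str, str]) -> str:
    """Serialize one record; tag names are the field names with spaces removed."""
    root = ET.Element("record")
    for field, value in record.items():
        ET.SubElement(root, field.replace(" ", "")).text = value
    return ET.tostring(root, encoding="unicode")


rows = [
    {"Address": "Collafer", "Full Name": "John Eyres", "Relationship": "Head"},
    {"Address": "", "Full Name": "Annie Eyres", "Relationship": "Wife"},
]
complete = expand_factored(rows, ["Address"])
xml_text = record_to_xml(complete[1])
```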
Extract Data
*
Address: Collafer
Full Name:
Relationship:
* Extracted by hand.
Extract Data
*
Address: Collafer
Full Name: John Eyres
Relationship: Head
* Extracted by hand.
Extract Data
*
Address: Collafer
Full Name: Annie Eyres
Relationship: Wife
* Extracted by hand.
Extract Data
*
Address: Collafer
Full Name: Lehailes Eyre
Relationship: Son
* Extracted by hand.
Microfilm Queries
• A web form provides the
interface to query the microfilm
database.
• Users can enter keywords
(such as a first and last name), and
the system locates the matching records
in the indexed microfilm documents.
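A keyword lookup of this kind is often built on an inverted index. The sketch below is an assumption about how such a query could work, not a description of the system behind the web form; `build_index` and `search` are illustrative names.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(records: List[Dict[str, str]]) -> Dict[str, Set[int]]:
    """Map each lowercase word to the set of record numbers that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for record_id, record in enumerate(records):
        for value in record.values():
            for token in value.lower().split():
                index[token].add(record_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return the records that contain every keyword in the query."""
    tokens = query.lower().split()
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)


records = [
    {"Address": "Collafer", "Full Name": "John Eyres", "Relationship": "Head"},
    {"Address": "Collafer", "Full Name": "Annie Eyres", "Relationship": "Wife"},
    {"Address": "Collafer", "Full Name": "Lehailes Eyre", "Relationship": "Son"},
]
index = build_index(records)
john = search(index, "john eyres")
family = search(index, "Eyres")
```

Exact word matching like this would miss spelling variants such as "Eyre" versus "Eyres"; a real system would likely need some tolerance for them.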
Web Query
John
Eyre
Search Results
• The system returns the indexed images
that contain the results.
• Since the database indexes both the
text and geometry of the document,
the system can return just the
relevant regions of the microfilm
image.
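Because each indexed value has a known position on the page, the relevant region can be computed from the bounding boxes of the matching zones. The sketch below assumes boxes stored as (left, top, right, bottom) pixel tuples and a hypothetical `margin` of surrounding context; neither detail comes from the slides.

```python
from typing import Iterable, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def relevant_region(boxes: Iterable[Box], image_size: Tuple[int, int], margin: int = 10) -> Box:
    """Return the smallest box covering all matches, padded by margin and clamped to the image."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one matching zone is required")
    width, height = image_size
    left = max(0, min(b[0] for b in boxes) - margin)
    top = max(0, min(b[1] for b in boxes) - margin)
    right = min(width, max(b[2] for b in boxes) + margin)
    bottom = min(height, max(b[3] for b in boxes) + margin)
    return (left, top, right, bottom)


# Two matching zones on one row of a 2000x3000 page image.
region = relevant_region([(100, 200, 300, 230), (320, 200, 400, 230)], (2000, 3000))
```

Only the pixels inside `region` would need to be cropped and sent to the browser.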
Search Results
Click an image to select
a result document.
Search Results
Relevant region of the
document is displayed.
Just-In-Time Browsing
• To make the query results display quickly,
the system uses Just-In-Time Browsing.
• Just-In-Time Browsing will allow people
to browse digitized microfilm and other
large collections of images over the
Internet at interactive rates.
Just-In-Time Browsing
Small versions of each image will
allow rapid browsing of the collection
as a whole.
Just-In-Time Browsing
People will be able to “Zoom In” on
individual images and parts of images
as necessary.
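The slides do not explain how Just-In-Time Browsing works internally. A common way to get small overviews plus on-demand zooming is a multi-resolution image pyramid cut into tiles, and the sketch below illustrates that idea only; the halving scheme, the 256-pixel thumbnail and tile sizes, and both function names are assumptions.

```python
from typing import List, Tuple


def pyramid_levels(width: int, height: int, thumb: int = 256) -> List[Tuple[int, int]]:
    """List image sizes from full resolution down to one that fits in thumb pixels."""
    levels = [(width, height)]
    while max(width, height) > thumb:
        width = max(1, (width + 1) // 2)
        height = max(1, (height + 1) // 2)
        levels.append((width, height))
    return levels


def tiles_for_viewport(viewport: Tuple[int, int, int, int], tile: int = 256) -> List[Tuple[int, int]]:
    """Return (column, row) indices of the tiles that overlap a (left, top, right, bottom) viewport."""
    left, top, right, bottom = viewport
    return [
        (col, row)
        for row in range(top // tile, (bottom - 1) // tile + 1)
        for col in range(left // tile, (right - 1) // tile + 1)
    ]


levels = pyramid_levels(2048, 1024)
needed = tiles_for_viewport((300, 100, 700, 300))
```

Under this scheme a browser would fetch the smallest level for the overview and request only the tiles in `needed` when someone zooms in.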