Two_Approachesx

Economic Data Time Travel
Adrienne Brennecke
September 30, 2011
New York Times Article
Quick demo
• http://alfred.stlouisfed.org/
Value of this history
• Determine the accuracy of early estimates
• Evaluate policy decisions using information available
at the time, not what is known in hindsight
• Allows economists to model the economy using data
that was actually available
Value of this website
• Users can save data sets to their own account
• Share Published Data Lists
• Average about 3,000 unique visitors a month
Looking for revisions, and then solutions
• Former Research Director was looking for the
economic data that were released originally—not
the revised data
• We searched high and low...
– Libraries removed news releases when the final version was
published
– Agencies historically wrote over the data, as the computing
storage costs were high
Help from libraries
• Searched online catalogs for press releases
• Called documents librarians all over the country
• Contacted issuing agencies and the Library of
Congress
• Depository libraries came through for us
Challenges
• How to design ALFRED to store revisions
– See Developing Time-Oriented Database Applications in SQL
• Finding and verifying old data and release dates
• Early electronic information lost
• Underestimating amount of work involved
• Figuring out the best process, and dealing with
changing workloads for staff
Technical details
• These data are saved only when there are revisions;
each data value has three pieces of information
– The time period it applies to (e.g., 2nd quarter 2011)
– The time period it is true for (e.g., from July 30th to
August 26th)
– The date that the information was entered into the
database to allow for tracking of data entry errors
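The three pieces of information above can be sketched as a small record plus an "as of" lookup. This is only an illustration of the idea, not ALFRED's actual schema: the class name `Vintage`, the field names, the `as_of` helper, and the numeric values are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Vintage:
    period: str                # time period the value applies to, e.g. "2011Q2"
    value: float               # the published figure (illustrative numbers only)
    valid_from: date           # first day this figure was the published one
    valid_to: Optional[date]   # last day it was the published one; None = still current
    entered_on: date           # when the row was entered, for tracking data entry errors


def as_of(rows: List[Vintage], period: str, when: date) -> Optional[float]:
    """Return the value for `period` as it was known on the date `when`."""
    for row in rows:
        if row.period != period:
            continue
        still_open = row.valid_to is None or when <= row.valid_to
        if row.valid_from <= when and still_open:
            return row.value
    return None  # nothing had been published for that period yet


# Hypothetical rows: one figure true from July 30th to August 26th, then a revision.
rows = [
    Vintage("2011Q2", 1.3, date(2011, 7, 30), date(2011, 8, 26), date(2011, 7, 30)),
    Vintage("2011Q2", 1.0, date(2011, 8, 27), None, date(2011, 8, 27)),
]

print(as_of(rows, "2011Q2", date(2011, 8, 1)))   # figure known on August 1st
print(as_of(rows, "2011Q2", date(2011, 9, 1)))   # figure known after the revision
```

Keeping the entry date separate from the validity dates means a keying mistake can be corrected later without changing the record of what was actually published at the time.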
Technical details
• Under the hood, FRED and ALFRED are the
same application.
– ALFRED was populated by collecting historical data for
series in FRED, and ALFRED continues to be extended
by capturing "expiring" FRED values when new ones
are published.
– The coverage dates for data series are the same in
both FRED and ALFRED
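One way to picture the capture of "expiring" values is a publish step that closes out the old figure before storing the new one. The sketch below is a guess at the general pattern, not the FRED/ALFRED code: the `publish` and `current_view` functions and the dictionary keys are hypothetical names, and the values are invented.

```python
from datetime import date, timedelta


def publish(history, period, value, release_date):
    """Record a newly released value, expiring the one it replaces.

    `history` is a list of dicts with keys: period, value, valid_from, valid_to.
    A row whose valid_to is None is the currently published figure.
    """
    for row in history:
        if row["period"] == period and row["valid_to"] is None:
            if row["value"] == value:
                return history  # no revision, so nothing new is saved
            # The old figure stops being true the day before the new release.
            row["valid_to"] = release_date - timedelta(days=1)
    history.append(
        {"period": period, "value": value,
         "valid_from": release_date, "valid_to": None}
    )
    return history


def current_view(history):
    """The 'latest values only' view: just the rows that are still current."""
    return {r["period"]: r["value"] for r in history if r["valid_to"] is None}


history = []
publish(history, "2011Q2", 1.3, date(2011, 7, 30))   # first release
publish(history, "2011Q2", 1.3, date(2011, 8, 10))   # unchanged: ignored
publish(history, "2011Q2", 1.0, date(2011, 8, 27))   # revision: old row expires
print(current_view(history))
print(history[0]["valid_to"])
```

In this picture the current-values view and the full revision history are two readings of the same table, which is consistent with the two sites sharing one application.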
Conclusion
• ALFRED shows revisions to a series and presents
data as they were at a particular point in time
• Unique information, FREE and easily accessed
• Preserving important data for future research
FRASER:
Federal Reserve Archival System for Economic Research
Technical Aspects of FRASER
• Variation on LAMP software bundle
– Linux operating system
– Apache web server
– PostgreSQL database (rather than the more
common MySQL)
– PHP programming
• Google search appliance
– Metadata plus full text
(OCR)
– Basic and advanced
search options available
– Standard Google search
functions, plus a couple
filters unique to FRASER
Topic Collections
Special/Archival Collections
Publications
• Originally, data
publications
• Now include various
types of serials and
monographs
• Statistical releases
Available issues,
arranged by date
Bibliographic
information
Historical Documents
• Based on categories
• Originally “non-data”
publications
Categories
Documents
Special Collections
Page Stacking
• Purpose:
– View a single data
series over time
• Solution:
– Grouped page files
– PDFLib+PDI
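The grouping half of page stacking can be sketched as follows: given the table name recorded for each page of each issue, collect the pages for one table in date order. The slides name PDFLib+PDI for the PDF work itself; this sketch does not call that library and only builds the page list that such a tool would then assemble. The function name `stack_pages`, the file names, and the table names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def stack_pages(issues):
    """Group pages by table name across issues, oldest issue first.

    `issues` is an iterable of (issue_date, pdf_name, {page_number: table_name}).
    Returns {table_name: [(issue_date, pdf_name, page_number), ...]}.
    """
    stacks = defaultdict(list)
    for issue_date, pdf_name, tables in sorted(issues, key=lambda i: i[0]):
        for page_number, table_name in sorted(tables.items()):
            stacks[table_name].append((issue_date, pdf_name, page_number))
    return dict(stacks)


# Hypothetical issues of one statistical release, deliberately out of date order.
issues = [
    (date(1950, 2, 1), "release_1950_02.pdf", {3: "Bank Debits", 4: "Money Rates"}),
    (date(1950, 1, 1), "release_1950_01.pdf", {2: "Bank Debits", 5: "Money Rates"}),
]

stacks = stack_pages(issues)
for issue_date, pdf_name, page in stacks["Bank Debits"]:
    print(issue_date, pdf_name, page)
```

Each stack is the ordered list of source pages for one data series, so a reader can follow that table over time without opening every issue.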
Personnel
• Center for Economic Documents Digitization
(CEDD) consists of
– 1 manager
– 1 librarian
– 5 part-time scanning clerks
• Additional support from
– Web group
– Library director
Digitization Process
Selection and preparation
Scan
Quality check (QC)
• Review paper documents & establish scanning procedures
• Additional review, page by page
• This is done by a person other than the scanner
Transfer to server
Create PDF
Clean scanned image
• This must be done by one of the two librarians
• OCR
• Add metadata
• QC (brief)
• Process varies based on project
Post to FRASER
• Items can be posted as publications, historical documents, special collections, each with their own interface and metadata options
Add link to catalog record and OCLC record
• This is done by the library’s cataloger, outside of the CEDD
Locating Paper Copies
• We scan documents from
– Our own library collection, and other Fed libraries
– FDLP Needs and Offers lists
– Interlibrary loan
– Partner institutions
• But…
– As we digitize, libraries throw out paper copies
Copyright
• We focus on public
domain materials
– Federal Reserve Bank
publications
• Not technically public
domain, but we have an
agreement to digitize
– Federal Government
publications
– Pre-1923 publications
Hardware and Software
Hardware
• Automatic Document Feeder (ADF)
– 3 - Fujitsu fi-5650C
– 2 - Fujitsu fi-6670 (newer model)
• Overhead/planetary scanner
– 1 - Indus Color Book Scanner 5002
• Flatbed scanner
– 1 - Epson Expression 10000XL Graphic Arts
Software
• ImageWare BCS-2
– Indus scanning
• Techsoft PixEdit7
– Fujitsu and Epson scanning, and all cleaning
• ABBYY FineReader 10.0
– OCR
• Adobe Acrobat 9 Pro
– Metadata
Also:
• Microsoft Access 2007
– Metadata and tracking purposes for some larger collections
• PDF Summary Maker
– Embedding metadata from Access into PDFs
• Image/text areas as recognized by OCR software
– Green = text, Blue = table, Red = picture
• Text recognized by OCR software
– Blue = uncertain character(s)
Data Entry
• Web-based
forms for data
entry
• Here: setting up
the overall
publication
(library catalog-level metadata)
Data Entry
• Issue-level
metadata
– Issue date
– Issue title (text-formatted date,
or other title)
– Attach pdf
– Enter table
names and page
titles for the
page stacking
described earlier
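The issue-level fields listed above could be held in a record like the one below. This is a rough sketch under stated assumptions, not the FRASER data entry code: the class name `IssueRecord`, its field names, the PDF check, and the choice of date format for the default title are all made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class IssueRecord:
    issue_date: date
    pdf_path: str                      # the attached PDF
    title: Optional[str] = None        # None means "use the text-formatted date"
    page_tables: Dict[int, str] = field(default_factory=dict)  # page -> table name

    def __post_init__(self):
        if not self.pdf_path.lower().endswith(".pdf"):
            raise ValueError("attachment must be a PDF")
        if self.title is None:
            # Assumed format; month names follow the active locale (English by default).
            d = self.issue_date
            self.title = f"{d:%B} {d.day}, {d.year}"


# Hypothetical issue: no explicit title, so the date is spelled out as the title.
issue = IssueRecord(
    issue_date=date(2011, 9, 30),
    pdf_path="example_release_20110930.pdf",
    page_tables={2: "Table 1", 3: "Table 2"},
)
print(issue.title)
```

Recording table names per page at data entry time is what makes the page stacking described earlier possible without re-reading the PDFs.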
Data Entry
• Historical and Special Collection documents have
both publication- and issue-level metadata
Special Collection
Document
Output
• 3 image files
– Original multipage tiff
– Cleaned multipage tiff
– PDF
• 3 types of
text/metadata
– Underlying text in pdf
(OCR)
– Title and author
embedded in pdf
– Other metadata entered
in database when
posting
Contact Us
Adrienne Brennecke
Data Acquisitions, Reference Librarian
314-444-7479
[email protected]
alfred.stlouisfed.org/
Pamela Campbell
Digital Projects Librarian
314-444-8907
[email protected]
fraser.stlouisfed.org/