POTENTIAL OCR SOFTWARE
FOR NUTRITION FACTS LABELS
Dennis Given
THE GENERAL OPTICAL CHARACTER
RECOGNITION CONCEPT
[Diagram: Input -> OCR -> Output]
PREFERENCES FOR THE OCRS
Accurate
Fast
Written in Java
Open-source (free)
This will make it easier to find someone to work on
the software in the future.
Commercial options, although considerably faster
and more accurate, are costly solutions.
Editable
So that, if needed, I can go into the OCR engine and
change whatever is necessary.
Commercial OCRs don’t always allow this.
OCRS THAT MEET SOME OF THE
PREFERENCES
Aspire OCR SDK
Java OCR
ABBYY FineReader
Tesseract Version 3.01
COMPARISONS
Preferences:
Accuracy?
Speed?
Java?
Editable?
Open-source?
Example image used to determine the best OCR:
GIF Image
1204x2004 image. This resolution is close to that of
photos taken with the iPhone 3GS camera (1500x2000)
and the iPhone 3G camera (1200x1600).
EXAMPLE IMAGE
2004 pixels
1200 pixels
ASPIRE OCR
Pros:
Runs across many platforms
Relatively fast
Written in Java and meant to be added to Java
applications
Cons:
Not very accurate.
Must pay for the full SDK (Software Development
Kit).
ASPIRE RESULTS
JAVAOCR
Pros:
Written entirely in Java
Full source code is given (easy to edit)
Easy graphical user interface
Relatively fast
Cons:
Instead of converting the image to text, it splits the
image into .png files, one per character
Not very accurate (sometimes it won’t even convert the
image into more than one character)
Even the images that were converted were not done
very well…
JAVAOCR RESULTS
ABBYY FINEREADER
Pros:
Very good interface
Lots of tools to edit the area being scanned
The most accurate program tried
Cons:
Not in Java
Commercial (not open-source) and VERY expensive
ABBYY FINEREADER RESULTS
TESSERACT VERSION 3.01
Developed by HP Labs
Now used by Google
Pros:
Close in accuracy to the commercial OCRs
Easy to use from the command line
Lots of documentation available
Cons:
Must use a Java Wrapper if we want future edits to
be done in Java
Source code is written in C/C++ - will be difficult to
edit
TESSERACT RESULTS
COMMERCIAL OCR VS. TESSERACT
Commercial OCR:
100+ languages
Accuracy is good
Sophisticated application with complex user interface
Mostly meant for Windows OS
Costs $100+ to use
Tesseract:
6+ languages
Accuracy is good, but not as good as commercial OCRs
No user interface
Runs on Linux, Mac, Windows, and more…
Open Source – Free!
WHERE TO GO FROM HERE…
Tesseract is our best option at this point.
It is…
Fast
Free
Outperforms the other available open-source OCR
engines
Plenty of documentation
An Overview of the Tesseract OCR Engine by Ray Smith
Tesseract OSCON pdf
http://code.google.com/p/tesseract-ocr/
Three different ways to go
OPTION 1: ~5 WEEKS
USE TESSERACT ENGINE AND WRAP IT
Wrapper Library
A collection of subroutines or classes used to develop
software. Libraries expose interfaces which clients of the
library use to execute library routines. Wrapper
libraries (or library wrappers) consist of a thin layer of
code which translates a library's existing interface into a
compatible interface.
By wrapping Tesseract, it won’t matter that
Tesseract’s source code is written in C++
However, this means we will still not be able to
customize the Tesseract engine to do exactly what we
want (specific to Nutrition labels).
We can control the input and output, but the process
of determining characters will remain the same.
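As a rough illustration of Option 1, the simplest possible "wrapper" does not bind to the C++ code at all. It just runs the tesseract command-line program from Java and reads back the text file it writes. The sketch below assumes a `tesseract` executable is installed and on the PATH. The class and method names (`TesseractCli`, `buildCommand`, `recognize`) are made up for this example and are not part of Tesseract.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical thin wrapper: runs the tesseract command-line tool as a child process. */
class TesseractCli {

    /** Builds the command "tesseract <image> <outputBase>"; Tesseract adds ".txt" to the output base. */
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    /** Runs Tesseract on one image and returns the recognized text. */
    static String recognize(Path image) throws IOException, InterruptedException {
        Path dir = Files.createTempDirectory("ocr");
        Path outputBase = dir.resolve("result");

        // inheritIO() passes Tesseract's console messages through to our own console.
        Process process = new ProcessBuilder(buildCommand(image, outputBase))
                .inheritIO()
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }

        // The engine itself is untouched; we only control what goes in and what we do with the output.
        byte[] bytes = Files.readAllBytes(dir.resolve("result.txt"));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(recognize(Paths.get(args[0])));
    }
}
```

This matches the limitation noted above: the Java side chooses the image and handles the text, while character recognition stays inside Tesseract.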
OPTION 2: ~7 WEEKS
BUILD AN OCR ENGINE FROM SCRATCH
Understand general concepts
Can use ideas and implementations from OCRs such
as Tesseract and JavaOCR.
Can customize the engine to run specifically for
nutrition facts labels.
Would be more effective than a “general” OCR which isn’t
looking for specifics.
The whole thing can be written in Java (easier for
future developers to work on).
However:
It will take more time
Will probably have more bugs in it
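To give a feel for what "looking for specifics" could mean, here is a small sketch that only handles text after recognition. It takes one line of recognized text and tries to read it as a nutrient name, an amount, and a unit. This is not an OCR engine and not the planned design. The class name, the pattern, and the supported units (g and mg only) are assumptions made for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for single lines of a nutrition facts label. */
class NutritionLineParser {

    // A name made of letters and spaces, then a number, then an optional unit,
    // e.g. "Total Fat 8g", "Sodium 160 mg", "Calories 110".
    private static final Pattern LINE = Pattern.compile(
            "^\\s*([A-Za-z][A-Za-z ]*?)\\s+(\\d+(?:\\.\\d+)?)\\s*(mg|g)?\\b.*$");

    /** One parsed label entry; unit is "" when the line has none. */
    static final class Entry {
        final String name;
        final double amount;
        final String unit;

        Entry(String name, double amount, String unit) {
            this.name = name;
            this.amount = amount;
            this.unit = unit;
        }
    }

    /** Returns the parsed entry, or empty if the line does not look like "name amount unit". */
    static Optional<Entry> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        String unit = m.group(3) == null ? "" : m.group(3);
        return Optional.of(new Entry(m.group(1).trim(), Double.parseDouble(m.group(2)), unit));
    }

    public static void main(String[] args) {
        parse("Total Fat 8g").ifPresent(e ->
                System.out.println(e.name + " = " + e.amount + " " + e.unit));
    }
}
```

A general OCR has no reason to expect this shape, which is why a label-specific step could catch lines that do not fit it.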
Option 3 is to take more time to determine the OCR…
GOALS
At the end of the time frame, I plan to have:
A running OCR application that will:
At least be able to scan in cereal box (flat) images effectively
and convert the labels to usable data.
Have minimal bugs (although some will definitely exist).
Have an accuracy rate of at least 95%.
Begin to identify effective ways to manage images with
curved (jars, bottles, etc.) and wrinkled (bags, packaging,
etc.) nutrition facts labels.
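The slides do not say how the 95% accuracy rate would be measured. One common choice, assumed here, is character accuracy: compare the OCR output with a hand-typed correct version of the label and count the single-character edits needed to turn one into the other (Levenshtein distance). The class and method names below are invented for this sketch.

```java
/** Hypothetical accuracy check: compares OCR output against hand-typed reference text. */
class OcrAccuracy {

    /** Levenshtein distance: fewest single-character insertions, deletions, or substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Character accuracy between 0 and 1, relative to the length of the reference text. */
    static double characterAccuracy(String reference, String ocrOutput) {
        if (reference.isEmpty()) {
            return ocrOutput.isEmpty() ? 1.0 : 0.0;
        }
        double errors = editDistance(reference, ocrOutput);
        return Math.max(0.0, 1.0 - errors / reference.length());
    }

    public static void main(String[] args) {
        // One wrong character ("1" instead of "l") out of twelve.
        System.out.println(characterAccuracy("Total Fat 8g", "Tota1 Fat 8g"));
    }
}
```

Under this measure, a single misread character in a 12-character line gives about 92%, which would fall short of the 95% goal for that line.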