AutoIdentification of Personally Identifiable


Data Mining Personally
Identifiable Information
(PII)
Monday, June 22, 2015
1:00 pm ET
Sandra Serkes, President & CEO
Valora Technologies, Inc.
AGENDA for Data Mining Private Identification Information (PII)
1. What is PII? How is it different from PHI or other sensitive information?
2. Where does PII live? How can I find it?
3. What do I do once I find PII? What PII management obligations are there?
4. How does data mining help identify PII?
5. Why data mine? What is driving this practice? Why is it new/hot/etc? Who is doing this & why?
6. How do you data mine documents? What does that mean? What about attachments & versions? What are the typical techniques to do data mining? How does it work?
7. How to get started on a PII data mining project. The Basics You Need to Know
   1. Important Terms & Concepts
   2. Typical PII Data Mining Workflow
   3. Tips & Tools
8. Things to Watch Out For
What is all this?
• Personally Identifiable Information (aka Private Identification
Information) is:
– Information that can be used on its own or with other information to
identify, contact, or locate a single person, or to identify an individual
in context.
• Protected Health Information (aka Private Health Information
or Personal Health Information) is:
– Information, medical history, test and laboratory results, insurance
information and other data that is collected by a health care
professional
• Sensitive Information (aka Trade Secrets or Classified
Information) is:
– Information that is protected against unwarranted disclosure. Access
to sensitive information should be safeguarded. Protection of sensitive
information may be required for legal or ethical reasons, for issues
pertaining to personal privacy, or for proprietary considerations
Pop Quiz! PII, PHI or SI?
• Patent Application
• Layoff List (pink slip)
• Home address
• Blood type
• Secret formula
• Credit Card Number
• Household Income
• Cell phone number
• Age of a minor
• Employee ID
Where is PII likely to show up?
• Forms & Applications
• Employee Information (HR)
• Supplier Information (Procurement)
• Customer Information (Sales & Marketing)
• Litigation & Investigations
• Records Retention
Does your organization know what
exactly your data says?
Why should we care about PII?
“We don’t know what’s in our data”
VERSUS
“We know what’s in our data, but we aren’t dealing with it.”
Courts, shareholders, consumers, government agencies, watchdog groups, media spotlight reports and more are demanding responsible data management (aka Information Governance).
Why Data Mine Corporate Documents?
• Litigation
– Doc Review & Productions
• Investigations
• Compliance
– Legal, regulatory & ethics
– Financial & investor
– Health & safety
• Business Intelligence
• Information Governance
– Management & control
– Cost savings
– Exposure mitigation
Document is a loose term here.
Really, it means any structured or
unstructured form of textual or
metadata content.
Voicemails, tweets, texts, websites,
audio & video files, receipts and
transactions are all “documents” as
far as data analytics are concerned.
We data mine documents to learn
where they are & what they say.
Ultimately, we gain management
and control over the contents,
obligations, storage, access,
retrieval, use and exposure of our
information.
Who is Data Mining Documents for PII?
• Large multi-national corporations
– Sometimes litigation or investigation collections, legal hold
– Sometimes part of larger Information Governance initiative
– Sometimes part of compliance and/or retention strategies
– Typically happening at the departmental level
• Corporate Advisory (Outside Counsel & Consultants)
– Looking to assist clients in the items above
– Often part of business process re-engineering or IG engagements
• Health Care, Financial Services & Insurance
– Analysis for HIPAA compliance
– Analysis of mortgages, stock trades, tax forms, and other financial
transactions
– Analysis to de-identify documents for aggregate data mining purposes
Data Mining PII is a Good Example of
Information Governance
• Universal Issue
• Involves several key IG problems:
– Storage/hosting
– Content analysis & classification
– Sub-Context: terms, provisions, obligations & stipulations
– Administration, management & maintenance
• Elements of Backfile and Day Forward records management
• Typically a mix of paper and ESI documents
• Signatures & affirmations play a key role
• PII Management is a hot button issue with real budgets available
– Investor & media attention
– Customer concerns
– Risk & compliance danger zone
• Predecessor to Big Data mining
How a computer identifies PII
with data mining (analytics)
Clear PII: SSN
Likely PII
(“warning sign”)
Clear PII: Home
Phone Number
Not PII: Interest
Rate
Implied classification:
Active PII, needs
protection & redaction
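The callouts above (clear PII, likely PII, not PII, and an implied document classification) can be approximated with simple pattern rules. The following is only a minimal sketch and not Valora's actual method: the rule names, the regular expressions and the classification labels for the "likely" and "none" cases are all assumptions made for illustration.

```python
import re

# Hypothetical rule set: label -> (pattern, confidence level).
# The patterns are simplified US formats chosen for illustration only.
PII_RULES = {
    "SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "clear"),
    "PHONE": (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "clear"),
    "DOB": (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "likely"),
}


def find_pii(text):
    """Return (label, level, matched text) for every rule hit in the text."""
    hits = []
    for label, (pattern, level) in PII_RULES.items():
        for match in pattern.finditer(text):
            hits.append((label, level, match.group()))
    return hits


def classify_document(text):
    """Derive an implied classification for the whole document from its hits."""
    levels = {level for _, level, _ in find_pii(text)}
    if "clear" in levels:
        return "Active PII, needs protection & redaction"
    if "likely" in levels:
        return "Likely PII, needs review"
    return "No PII found"


sample = "Borrower SSN: 123-45-6789, home phone 781-555-0100, interest rate 4.25%"
print(find_pii(sample))
print(classify_document(sample))
```

In this sketch the interest rate is ignored because no rule matches it, which mirrors the "Not PII" callout. A production rule set would need many more patterns and some validation of each match.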
How a computer classifies a contract
with data mining (analytics)
Authors/Parties
Agreement Type
Author Validation
& Contact Info
Contract Date
Key Provisions
Contract Term
No Survivorship
Clauses
Implied classification:
Active contract
How a computer classifies an attachment or exhibit
with data mining (analytics)
DocType = Patent Application
Date = 10/18/2007
Date Format = US
Author = Patent Authors,
Author City, Author Country
Assignee = RIM
Tone = Neutral to slightly positive
Embedded Graphic with Title
Other Capturable Data Elements:
• Patent Number
• Filing Date
• Key Phrases & Terms
• Managing PTO
• Implied/Attached Docs
• Bar Code Present
• And many more . . .
Up to 160 unique attributes... and counting!
Additional Info Data Mining Determines
• What type of document is this?
– Email vs. contract vs. employment application, etc.
• Who executed this? Who are the parties?
– Authors, Recipients, Copyees
• What are the key content areas?
– How similar to other/past provisions?
– Which provisions most popular?
• What is the scope of the PII?
– Internal or external PII?
– Financial? Health? Personal? Company-Sensitive?
– Special conditions
• What attachments and exhibits are part of this record or file?
– What does their association imply?
• What other context can be inferred?
– Can someone put 2 + 2 together?
– What obligations?
– What risk?
• What workflow is needed for this document?
– Exception Handling & Escalation
– Approval process
– Obsolescence planning, scheduled deletion
• What predictions about future PII activity? What trends?
The importance of Context – is it PII?
• Single instances of “likely” PII are not necessarily PII, unless
and until they can be linked to a specific person
• Compare the following:
• SSN is always a ringer; it is a unique ID
• PII problems compound quickly
– Not just in single instances, but across populations
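The context rule above can be sketched in code: an SSN counts as PII on its own, while a "likely" item only becomes PII once it can be linked to a person. This is an illustrative sketch under stated assumptions. The name detector is a deliberately crude "First Last" guess, and the list of likely items (age, blood type) is hypothetical.

```python
import re

# Assumed patterns for illustration; none of these come from the presentation.
NAME = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")  # crude "First Last" guess
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
LIKELY = re.compile(r"\b(?:age \d{1,3}|blood type (?:AB|A|B|O)[+-])", re.IGNORECASE)


def assess(record):
    """Apply the context rule: unique IDs are always PII, likely items need a person."""
    if SSN.search(record):
        return "PII"  # a unique ID identifies a person by itself
    if LIKELY.search(record):
        if NAME.search(record):
            return "PII"  # likely item linked to a specific person
        return "likely PII (not yet linked to a person)"
    return "not PII"


print(assess("blood type O+ recorded at intake"))
print(assess("Jane Doe, blood type O+"))
print(assess("SSN 123-45-6789"))
```

The same check applied across a whole population, rather than one record at a time, is where single data points start to combine into a composite picture.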
When does PII become a problem?
• The existence of PII isn’t a problem per se
• It becomes a problem when:
– It is exposed to others (knowingly or unknowingly)
– It needs to be produced
– It needs to be evaluated
– There is an ongoing ethical obligation to treat such information properly
– Single points of data can be connected to others to build a composite picture (connect-the-dots)
• Force your group to evaluate: What would happen in a data
breach?
– What would be exposed?
– How quickly could you recover?
– What can you do now to mitigate expense & crisis later?
Think you are immune?
Data Breaches Last Year (2014)
• 28%, a record high
• 15 incidents per week
• Hacking is #1 cause (28%)
• 675 million records compromised since 2005
• Every 3 seconds there is a new victim
What to do once you’ve reached
the “it’s a problem” stage
• Use analytics (or “safe” brute force labor) to:
– Cull out documents no longer needed (Retention Schedule)
– Mark documents heading for removal
– Build a broad dashboard of your data contents
• Population Analysis is beyond just a Data Map
• What do you have where, and in what concentrations?
– Analyze & assess documents into 3 categories: critical, nervous & safe
– ID specifics for each document type, person & information class (build
your Ruleset)
– Determine who should/should not have access “at rest”
– Determine presentation of data to different parties, particularly in
production/presentation scenarios
• Consider selective redactions with on/off toggle
• Run Analytics to ID, mark & redact offending information
• Host documents in an environment that can utilize and maintain
existing information (Backfile), as well as proactively analyze new
material entering the system (Day Forward)
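The population analysis step ("what do you have where, and in what concentrations?") can be sketched as a small triage and tally. This is an assumption-laden illustration: the mapping of clear hits to "critical" and likely hits to "nervous" is one possible Ruleset, and the document inventory is made up.

```python
from collections import Counter


def triage(clear_hits, likely_hits):
    """Sort a document into critical, nervous or safe (assumed mapping)."""
    if clear_hits > 0:
        return "critical"
    if likely_hits > 0:
        return "nervous"
    return "safe"


# Hypothetical inventory: (location, clear PII hits, likely PII hits) per document.
documents = [
    ("HR share", 3, 1),
    ("HR share", 0, 2),
    ("Sales share", 0, 0),
    ("Sales share", 1, 0),
]


def dashboard(docs):
    """Count documents per (location, category) to show concentrations."""
    return Counter((loc, triage(clear, likely)) for loc, clear, likely in docs)


for (location, category), count in sorted(dashboard(documents).items()):
    print(f"{location:12} {category:9} {count}")
```

A tally like this goes one step beyond a data map, since it shows how much sensitive material sits in each location, not just which locations exist.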
AutoRedaction Defined
• Automated redaction of “offending” text or phrases
– Software performs the redaction based on Rules
• Multiple choice presentation
– Image, text or both
– Solid Black, Black with white writing, Translucent Yellow, Translucent
Gray
• Available for all kinds of information
– List provided or “derived” from tags
– Ex: SSN, DOB, Name, Age, Address, Account Number, Product
Name/ID…
• Unlimited redactions in a single document
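Rule-based redaction of text can be sketched as follows. This is a simplified illustration, not the AutoRedaction product: the rule list, the `style` and `enabled` parameters and the replacement formats are assumptions, and only text (not image) redaction is shown.

```python
import re

# Hypothetical redaction rules: tag -> pattern (simplified US formats).
RULES = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DOB": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def auto_redact(text, style="solid", enabled=True):
    """Redact every rule match; `enabled` acts as an on/off toggle."""
    if not enabled:
        return text
    for label, pattern in RULES.items():
        if style == "solid":
            # Stand-in for a solid black bar: mask each character with X.
            text = pattern.sub(lambda m: "X" * len(m.group()), text)
        else:
            # Stand-in for a labelled redaction box.
            text = pattern.sub(f"[REDACTED {label}]", text)
    return text


line = "Applicant SSN 123-45-6789, DOB 10/18/2007"
print(auto_redact(line))
print(auto_redact(line, style="label"))
print(auto_redact(line, enabled=False))
```

Because the function loops over all rules and all matches, there is no limit on the number of redactions in a single document. The toggle leaves the source text untouched, so a redaction can be shown or hidden depending on who is viewing.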
What kind of redaction makes sense?
Serkes
Sandra
123-45-6789
226-588-98
• Should redactions be visible: always, sometimes or never?
• Does someone need to approve or override system
redactions?
What is Data Visualization? (aka Data Presentation)
• Simple visual representation of relationships and patterns in
document data
• Common examples
– Graph sales over time
– Distribution by ethnicity
– Word Clouds & Heat Maps
– USA Today-style graphics
Data visualization is a general term that
describes any effort to help people understand
the significance of data by placing it in a visual
context. -TechTarget
• Use of charts, graphs, dashboards, animation and sound to
help convey important connections
Data Visualization examples you already know
Data Visualization of Document Data Mining
BlackCat Screen Shots
Data Mining with Redactions
(PowerHouse QCUI Main Screen)
• Screen capture showing details of PowerHouse QCUI (Quality Control User Interface).
• Quality Control environment customized for each matter.
• Full record-by-record and field-by-field display, as well as many automated tools to improve throughput & efficiency.
PH AutoCoder has previously filled in these fields
QC Analysts edit content right in the document view.
PH AutoRedaction of private identification information (PII).
Understanding the Basics of a Contracts
Data Mining Project - Vocabulary
• Important Terms
– Backfile
– Day Forward
– Custodians & Paths
– Rules & Confidences
– Exception Handling
– Data flow model
– Maintenance & Tuning
– Identical Duplicates & Near Duplicates
– Data Visualization
– Taxonomy
– Data Obsolescence
– AutoRedaction
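One term from the list, identical versus near duplicates, is easy to show in code. The sketch below uses a content hash for identical duplicates and Jaccard similarity over word shingles for near duplicates. Both techniques and the 0.5 threshold are assumptions chosen for illustration; the presentation does not say how duplicates are detected.

```python
import hashlib


def fingerprint(text):
    """Content hash: two documents with the same hash are identical duplicates."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text, size=3):
    """Set of overlapping word sequences of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def similarity(a, b):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def compare(a, b, threshold=0.5):
    """Label a pair of documents; the threshold is an assumed tuning value."""
    if fingerprint(a) == fingerprint(b):
        return "identical duplicate"
    if similarity(a, b) >= threshold:
        return "near duplicate"
    return "distinct"


v1 = "the employee agrees to keep all salary data confidential"
v2 = "the employee agrees to keep all salary data private"
print(compare(v1, v1))
print(compare(v1, v2))
```

The threshold is the kind of setting that Maintenance & Tuning would adjust over time as the document population changes.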
Typical PII Data Mining Project Workflow
Typical Phase 1: reduced scope, offsite/offline backfile, limited users
– Project design & scoping
– Budgetary approval
– Initial data transfer
– Rules creation & testing
– Limited access operation
Typical Phase 2: full scope, onsite full backfile, full users
– Full data transfer/access
– Full rules creation & testing
– Complete backfile processing
– Full backfile operation
Typical Phase 3: full scope, onsite full backfile + day forward, full users
– Integration with live systems
– Full Day Forward operation
– Admin & User Training
– Ongoing Maintenance
Tips & Tools for Getting Your PII
Data Mining Project Started
1. Start at the departmental level
– Identify 3 critical pain points in that department’s document
usage/management/etc.
– Ex: classifying & managing departing old/inherited documents; creating
standardized PII management terms; or identifying PII exposure areas
2. Pick a department that is already one of the critical buy-in parties: legal,
procurement or marketing
3. Start with a financially & logistically palatable Phase 1:
– Examples (< $30K, 1-5 parties affected, 20% of ultimate work spec)
– Keep onsite system installations to a minimum
4. Work on Backlog first – before Day Forward (new) files
5. Have an end point in mind
– Where/How will PII ultimately be stored?
– What is the ending file structure?
– How will new documents revise the existing taxonomy?
6. Remember that PII is a cross-population issue, not just single documents
– Effectively all file types & purposes
7. Have a project champion
– The litigation matter has the senior partner. Who is driving the PII data mining
project? Who is the point person, and internal advocate?
Things to Watch Out For
• Be prepared for “surprise” content
– Most organizations hold onto key information forever. Be prepared for defunct companies, groups, policies, provisions, etc.
– Files may be anything
– Document content may say anything
– Suggestion: start with some basic rules about what’s in/out for the analysis population before the project “officially” starts.
– Make sure you know what the obligations are once certain types of document content & patterns of behavior are made apparent.
• Data mining & management can have “Big Brother” overtones
– Content oversight makes people nervous.
– Suggestion: Share rules & classification criteria with those concerned enough to ask about it. Be transparent.
– Once senior mgmt learns about the ability to monitor, track & predict behavior, they will want regular reporting on these topics
– Suggestion: Make sure your analysis & classification tools include easy reporting & monitoring of system behavior & usage patterns
More Things to Watch Out For
• System needs & priorities will change over time
– Unlike discovery, which has a fixed time window for document
collection, PII data mining typically endures forever
– What is acceptable today may not be tomorrow, when other concerns
dominate or additional material is added
– Suggestion: make sure your systems & workflow are flexible enough to
add or delete processing stages, adapt rules confidences, and grow
with your needs. Look for systems that have a “tuning” component.
• Don’t forget other related content stores
– Stored contracts and agreements
– Email and attachments
– Field office documents
• Remember: PII is probably lurking in many of the documents
that your organization has likely kept for several decades or
more. (Organization is easier to swallow than deletion/removal.)
Valora Technologies
• Bedford, MA software firm specializing in machine-assisted
document processing capabilities (aka analytics)
– World experts in the automated analysis, indexing, mining and presentation of documents, data & content
– 20 staff, 200+ clients, 1,500,000+ pages every week
“The power of Big Data is the story about the ability to compete and win with few resources and limited dollars.” - Forbes, March 2012 (this is Valora’s story, too)
• Customers: corporate legal departments, government agencies, and their professional advisory colleagues (law firms & consultancies)
• Target market: those who wish to harness and profit from the 2.5
quintillion bytes of document & content data being created each
day, aka “Big Data”
• Objective: to overtake traditional information repository creation
(manual data entry), management, analysis (search, review) and
workflow (retention, production, routing) with high quality, low cost,
scalable technology & best practices in analytics.
– Provide cost competitive document analytics solutions in the United States
– Provide efficient, world-class, targeted solutions to data, document & content
utilization problems
Typical Problems Valora Solves
Legal/Litigation/eDiscovery
Problems
• Too many documents to review, cull &
produce by hand
• Cost-effective alternative solutions to
contract attorney & offshore labor
“armies”
• Missing, poor, or ineffective metadata
• Re-unitization, organization, indexing &
redacting of documents
• Bridging multi-language document
populations to English
Business Intelligence Problems
• Organize & control decades of contracts &
agreements
• Provide brand integrity/protection data
mining of public/private documents
• Forecast & trending of topics, people &
locations over time
• Loose, shared files analysis & control
Records Management Problems
• Help automate defensible deletion efforts
for IG
• Organize & control loose documents on
shared drives, desktops, networks &
devices
• Eliminate expensive and information-poor
storage options
• Serve as automated intake for multiple
content generation sources
Health Care Problems
• Heavy expense & time converting hardcopy
medical records to EMRs/EHRs
• Cannot keep up with fax server data
collection
• Cost effective alternative solutions to
“armies” of temp data entry coders
Thank You!
For More Information:
Valora Technologies, Inc.
101 Great Road, Suite 220
Bedford, MA 01730
781.229.2265
www.valoratech.com