fininfoeng - University of Connecticut
Informatics and Information Engineering
CSE
300
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
371 Fairfield Road, Box U-255
Storrs, CT 06269-2155
[email protected]
http://www.engr.uconn.edu/~steve
(860) 486 - 4818
Copyright © 2008 by S. Demurjian, Storrs, CT.
Portions of these slides are being used with the permission
of Dr. Ling Liu, Associate Professor,
College of Computing, Georgia Tech.
IIE-1
Overview
Informatics
What is Informatics?
What is Biomedical Informatics?
What are Key Biomedical Informatics Challenges?
Information Engineering
Data vs. Information vs. Knowledge
What is Science? What is Engineering?
What is Information Consistency?
Information Usage and Repositories
How do we Store and Utilize Information?
Role of Web in Informatics
Sharing, Collaboration, and Security
Databases vs. Data Mining
IIE-2
Informatics
Informatics is:
Management and Processing of Data
From Multiple Sources/Contexts
Involves Classification (Ontologies), Collection,
Storage, Analysis, Dissemination
Informatics is Multi-Disciplinary
Computing (Model, Store, Process Information)
Social Science (User Interactions, HCI)
Statistics (Analysis)
Informatics Can Apply to Multiple Domains:
Business, Biology, Fine Arts, Humanities
Pharmacology, Nursing, Medicine, etc.
IIE-3
What is Informatics?
Heterogeneous Field – Interaction between People, Information and Technology
Computer Science and Engineering
Social Science (Human Computer Interface)
Information Science (Data Storage, Retrieval and Mining)
[Figure: Informatics – People, Information, Technology]
Adapted from Shortliffe textbook
IIE-4
What is Biomedical Informatics (BMI)?
BMI is Information and its Usage Associated with the
Research and Practice of Medicine Including:
Clinical Informatics for Patient Care
Medical Record + Personal Health Record
Bioinformatics for Research/Biology to Bedside
From Genomics To Proteomics
Public Health Informatics (State and Federal)
Tracking Trends in Public Sector
Clinical Research Informatics
Deidentified Repositories and Databases
Facilitate Epidemiological Research and Ongoing
Clinical Studies (Drug Trials, Data Analysis, etc.)
IIE-5
What are Key BMI Focal Areas?
T1 Research
Transition Bench Results into Clinical Research
Clinical Research
Applying Clinical Research Results via Trials with
Patients on Medication, Devices, Treatment Plans
T2 Research
Translating “Successful” Clinical Trials into
Practice and the Community
Clinical Practice
Tracking all of the Information Associated with a
Patient and his/her Care
Integrated and Inter-Disciplinary Information
Spectrum
IIE-6
What is Medical Informatics?
Clinical Informatics, Pharmacy Informatics
Public Health Informatics
Consumer Health Informatics
Nursing Informatics
Systems and People Issues
Intended to Improve Clinical outcomes,
Satisfaction and Efficiency
Workflow Changes, Business Implications,
Implementation, etc…
Patient Centered – Personal Health Record and
Medical Home
Care Centered – Pay for Performance, Improving
Treatment Compliance
IIE-7
What is Bioinformatics?
Focused on Research Tools for T1:
Genomic and Proteomic Tools, Evaluation
Methods, Computing And Database Needs
Information Retrieval and Manipulation of Large
Distributed (caBIG) Data Sets
(cabig.cancer.gov/index.asp)
Often Requires Grid Computing
Includes Cancer and Immunology Research
Increasing Need to Tie These Separate Types of
Systems Together = Personalized Medicine
Biology and the Bedside (www.i2b2.org)
IIE-8
Where is Data/How is it Used?
Medical And Administrative Data Found in Clinical
Information Systems (CIS) Such As:
Hospital Info. Systems, Electronic Medical Records,
Personal Health Records, …
Pharmacy, Nursing, Picture Archiving Systems
Complex Data Storage and Retrieval – Many
Different Systems
T1 Research Increasingly Reliant on CIS
T2 Research is Reliant on:
End Systems for Embedding EBM (Evidence-Based Medicine) Guidelines
Measuring Outcomes, Looking at Policy
IIE-9
What are Major Informatics Challenges?
Shortage of Trained People Nationally
Slows adoption of Health Information Technology
Results in Poor Planning and Coordination,
Duplication of Efforts and Incomplete Evaluation
What are Critical Needs?
Dually Trained Clinicians or Researchers in
Leadership of some Initiatives
Connect all folks with Informatics Roles across
Institutions to Improve Efficiency
Multi-Disciplinary: CSE, Statistics, Biology,
Medicine, Nursing, Pharmacy, etc.
Emerging Standards for Information Modeling and
Exchange (www.hl7.org) based on XML
IIE-10
Information Engineering
Data vs. Information vs. Knowledge
How do we Differentiate Between them?
Where are they used in BMI?
Science vs. Engineering
What is each of their Roles in Informatics?
How can we Engineer Information?
What is their Role in BMI?
What is Information Engineering?
What are the Unique Challenges and
Opportunities?
What is Available Today and Tomorrow?
IIE-11
From American Heritage
Data
Information, esp. information organized for
analysis or used as the basis for a decision.
Numerical information in a form suitable for
processing by computer.
Information
The act of informing or the condition of being
informed; communication of knowledge.
A non-accidental signal used as an input to a
computer or communications system.
Knowledge
The state or fact of knowing.
The sum or range of what has been perceived,
discovered, or learned.
Specific information about something.
IIE-12
From Webster’s 9th Collegiate
Data
Factual information (e.g. statistics) used as a basis
for reasoning, discussion, or calculation.
Information
The communication of knowledge or intelligence
Something (as a message, experimental data, or a
picture) which justifies change in a construct (as a
plan or theory) that represents physical or mental
experience or another construct
quantitative measure of the content of information
Knowledge
The fact or condition of having information or of
being learned.
The sum of what is known: the body of truth,
information, and principles acquired by mankind.
IIE-13
Data vs. Information vs. Knowledge
Overlapping Definitions
Conflicting Definitions
Agreement on Data
Knowledge and Information - Synonyms
Discussion Questions:
Equivalence of Knowledge/Information?
How can we Distinguish them?
Do these Three Terms Cover Possibilities?
IIE-14
Data, Information, and Knowledge in BMI
Data – Basic Level
BP, Pulse, Temperature
Peak Flow, Glucose Level, Biopsy Result
X-Ray, MRI, Cat Scan
Information - First level of Interpretation
BPs, Peak Flow, Glucose over Time
Interpreting Scan (Radiologist) or Biopsy Result
(Oncologist)
Knowledge – Applying Experience towards Diagnosis
What can Low Peak Flows over Time lead to?
What Next Step after Positive Scan or Biopsy?
What if Glucose Level is Yo-yoing?
IIE-15
From American Heritage
Science
The observation, identification, description,
experimental investigation, and theoretical
explanation of natural phenomena.
Methodological activity, discipline, or study.
An activity that appears to require study & method.
Knowledge, esp. gained through experience.
Engineering
The application of scientific and mathematical
principles to practical ends such as the design,
construction, and operation of efficient and
economical structures, equipment, and systems.
IIE-16
From Webster’s 9th Collegiate
Science
The state of knowing: knowledge as distinguished
from ignorance or misunderstanding
A department of systemized knowledge as an
object of study
A system or method reconciling practical ends with
scientific laws.
Engineering
The application of science and mathematics by
which the properties of matter and the sources of
energy in nature are made useful to people in
structures, machines, products, systems, and
processes.
IIE-17
Science and Engineering in BMI
Science
Data/Information Collection & Analysis to Reach
Hypothesis
Patients with CHF and Lipitor have Fewer Heart
Attacks than Patients with CHF and Baby Aspirin
Verify in Clinical Research/Epidemiological Study
Engineering
Usage of Information in Practice
Apply Scientific Results to Medical Practice
Image Processing used to Identify Tumors in CT
and MRI Scans
Transfer of Radiologist's Knowledge into Computer
Based (Assisted) Solution
An Engineering Solution to Scientific Result
IIE-18
What is Information Engineering?
Incorporation of an Engineering Approach and
Discipline to the Generation of Information and the
Promotion of the Better Use of Information and
Resources
Information Engineering Unifies and Combines:
Software Engineering
Database Engineering
Security Engineering
Performance Engineering
Etc...
Moral: Systems Cannot and Must Not be Engineered
in a Vacuum!
Particularly true in BMI (T1, T2, Clinical Research,
and Clinical Practice)
IIE-19
Information Engineering is Motivated by:
Realization that Management/Control of Information
will be a Primary Concern as we Continue through the
1990s and into the 21st Century
Currently in an Age of Information - Volume and
Complexity Dependencies
Critical Systems Heavily Depend on Information:
Airline/Hotel/Auto Reservations
Telecommunications
Banking/ATMs
ATM/Credit Cards at Gas Stations/Supermarkets
Credit Bureaus Electronically Collect Information
from Many Diverse Sources
E-Tailing
Medical Care/All Aspects of BMI
IIE-20
Info. Engrg. - Challenge for 21st Century
Timely and Efficient Utilization of Information
Significantly Impacts on Productivity
Supports and Promotes Collaboration for
Competitive Advantage
Use Information in New and Different Ways
Collection, Synthesis, Analyses of Information
Better Understanding of Processes, Sales,
Productivity, etc.
Dissemination of Only Relevant/Significant
Information - Reduce Overload
Implications for BMI?
Sharing of Results – Benefit Mankind
Ability to Research on Rare Diseases
Are there Unknown Isolated “Cures”?
IIE-21
How is Information Engineered?
Careful Thought to its Definition/Purpose & Thorough
Understanding of its Intended Usage/Potential Impact
Ensure and Maintain its Consistency
Quality, Correctness, and Relevance
Protect and Control its Availability (Secure Access)
Who can Access What Information in Which
Location and at What Time?
Long-Term Persistent Storage/Recoverability
Cost, Reusability, Longitudinal, and Cumulative
Experience
Integration of Past, Present and Future Information via
Intranet and Internet Access
What are Implications/Challenges for BMI?
Let’s Discuss Briefly…
IIE-22
Towards Information Consistency
Consistency of Information is Key!
Consistency Gauged with respect to:
Usage of Information
Persistency of Information
Integrity/Security of Information
Allowable Values and Protection from Misuse
Validity (Relevance) of Information
Means Something to Someone in a Positive Way
Discussion Questions:
Why is Consistency Important for BMI?
How is Consistency Attained for BMI?
What Else Impacts Consistency for BMI?
IIE-23
What's Available to Support IE?
What Can be Provided to Make the Advanced
Application Design Process:
More Complete?
More Robust?
More Responsive?
Less Error Prone?
Current Choices to Support Information Engineering:
Conventional Programming Languages and Data
Models
Object-Oriented Programming Languages
Object-Oriented DBS
XML Databases
Middleware and SOA (Web)
Data Mining/Warehouses
IIE-24
What are Key Questions?
Focus on Information and its Behavior
What are Different Kinds of Information?
How is Information Manipulated?
Is Same Information Stored in Different Ways?
What are Information Interdependencies?
Will Information Persist? Long-Term DB?
Versions of Information?
What Past Info. is Needed from Legacy DBs or
Applications?
Who Needs Access to What Info. When?
What Information is Available Across WWW?
All of these Questions Apply to BMI!
IIE-25
Information Usage and Repositories
How do we Store and Utilize Information?
Databases
Data Mining
What are Key Issues?
Information Sharing/Data Correctness
Collaboration
1. Among Providers and Researchers
2. Among Providers and Patients
3. Among Patients (Support Groups)
Security
1. Control of Patient Information (De-identified)
2. Secure Exchange/Patient Ownership
3. Establish Custom Patient Controlled Groups
What is the Role of Web in Informatics?
IIE-26
The Role of a Database
Database is a Norm in Today's and Tomorrow's
Applications
Usage of Information Tightly Linked to its Storage
Integration of Database - Key Component
Support Many Representations of “Same” Information
Promotes Retrieval of Information Geared Towards
User Needs and Responsibilities
Gap Exists Between Standalone Programming
Applications and Database Systems
For BMI:
Database (Data Warehouse) is a Key Feature
Need for Access to Data (De-identified)
Need to Share and Interact among Stakeholders
IIE-27
DBMS Architecture
DBMS Languages
Data Definition Language (DDL)
Data Manipulation Language (DML)
From Embedded Queries or DB Commands Within a
Program
“Stand-alone” Query Language
Host Language:
DML Specification (e.g., SQL) is Embedded in a
“Host” Programming Language (e.g., Java, C++)
DBMS Interfaces
Menu-Based Interface
Graphical Interface
Forms-Based Interface
Interface for DBA (DB Administrator)
IIE-28
ANSI/SPARC - Three Schema Architecture
External Data Schema (Users’ view)
Conceptual Data Schema (Logical Schema)
Internal Data Schema (Physical Schema)
IIE-29
How are these Used for BMI?
Internal Data Schema (Physical Schema)
Hidden Data Representation for Storage of BMI
Data in Proprietary Format
Under the Control of DB System
Conceptual Data Schema (Logical Schema)
The Data Model for the BMI Application
Access to Schema Controllable via SQL
External Data Schema (Users’ view)
Subsets of the Data Model for Different Users
External View for Patients
External View for Providers
External View for Clinical Researchers
Need Ability for a Patient to Control Access to
his/her Own External View
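The three schema levels can be sketched with SQL views. This is a minimal illustration using Python's built-in sqlite3 module; the `visit` table, its columns, and the view names are all hypothetical and are not taken from the course project. Note that dropping name and insurance columns alone is not a complete de-identification method.

```python
import sqlite3

# Hypothetical conceptual schema: a single table of patient visits.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit (
    patient_id   INTEGER,
    patient_name TEXT,
    visit_date   TEXT,
    diagnosis    TEXT,
    insurance_no TEXT
);
INSERT INTO visit VALUES (1, 'Pat A', '2008-01-01', 'Flu', 'INS-001');
INSERT INTO visit VALUES (2, 'Pat B', '2008-01-02', 'Asthma', 'INS-002');

-- External view for providers: clinical fields, no insurance data.
CREATE VIEW provider_view AS
    SELECT patient_id, patient_name, visit_date, diagnosis FROM visit;

-- External view for clinical researchers: identifying columns removed.
CREATE VIEW researcher_view AS
    SELECT visit_date, diagnosis FROM visit;
""")


def patient_view(conn, patient_id):
    """External view for one patient: only that patient's own rows."""
    cur = conn.execute(
        "SELECT visit_date, diagnosis FROM visit WHERE patient_id = ?",
        (patient_id,),
    )
    return cur.fetchall()


def columns(conn, view_name):
    """Return the column names that a view exposes."""
    cur = conn.execute(f"SELECT * FROM {view_name}")
    return [d[0] for d in cur.description]
```

Each user class queries only its own view; the internal storage format stays under the control of the DBMS.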
IIE-30
Data Independence
Ability of Application Programs to Remain
Unaffected by Changes in Irrelevant Parts of the
Conceptual Data Representation, Data Storage
Structure and Data Access Methods
Invisibility (Transparency) of the Details of Entire
Database Organization, Storage Structure and Access
Strategy to the Users
Both Logical and Physical
Recall Software Engineering Concepts:
Abstraction: the Details of an Application's
Components Can Be Hidden, Providing a Broad
Perspective on the Design
Representation Independence: Changes Can Be
Made to the Implementation that have No Impact
on the Interface and Its Users
IIE-31
Physical Data Independence
The Ability to Modify the Physical Data
Representation Without Causing Application Programs
to Be Rewritten
Examples:
Transparency of the Physical Storage Organization
Transparency of Physical Access Paths
Numeric Data Representation and Units
Character Data Representation
Data Coding
Physical Data Structure
All of these are Vital for BMI – Particularly if we Use
Standards to Achieve Application Independence
IIE-32
Physical Data Independence
Physical Data Independence is a Measure of How
Much the Internal Schema Can Change Without
Affecting the Application Programs
In BMI – Allows us to Plug and Play Different DBMS
Platforms – Extensible and Versatile Integration
IIE-33
Logical Data Independence
Transparency of the Entire Database Conceptual
Organization
As a Result:
Transparency of Logical Access Strategy
Addition of New Entities
Removal of Entities
Virtual (Derived) Data Items
Union of Records
Views
Common Mechanism for Logical Data
Independence
Provide Different Logical Data Contexts to
Different Users Based on Their Needs
Update Views vs. Read-Only Views
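A small sketch of how a view can provide logical data independence, again with Python's sqlite3 module. The table and view names are hypothetical. The application function is written once against the view; the conceptual schema is then reorganized underneath it and the view is redefined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Version 1 of a hypothetical conceptual schema: one wide table.
conn.executescript("""
CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT, bp TEXT);
INSERT INTO patient VALUES (1, 'Pat A', '120/80');
CREATE VIEW patient_summary AS SELECT id, name, bp FROM patient;
""")


def app_query(conn):
    """Application code: depends only on the view, never on base tables."""
    return conn.execute(
        "SELECT id, name, bp FROM patient_summary ORDER BY id"
    ).fetchall()


before = app_query(conn)

# Version 2: vitals move to their own table and the view is redefined
# so that it still exposes the same columns.
conn.executescript("""
DROP VIEW patient_summary;
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE vitals (person_id INTEGER, bp TEXT);
INSERT INTO person SELECT id, name FROM patient;
INSERT INTO vitals SELECT id, bp FROM patient;
DROP TABLE patient;
CREATE VIEW patient_summary AS
    SELECT p.id AS id, p.name AS name, v.bp AS bp
    FROM person p JOIN vitals v ON v.person_id = p.id;
""")

after = app_query(conn)
```

`app_query` is unchanged between the two schema versions, which is the property the slide describes.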
IIE-34
Logical Data Independence
Logical Data Independence is a Measure of How
Much the Conceptual Schema Can Change Without
Affecting the Application Programs
For BMI – Allows us to Separate End User
Applications (Patients, Providers, etc.) from DB
IIE-35
Classic Information System Design
IIE-36
Data vs. Information
IIE-37
Programming Language Systems vs. DBS
Similarities and Differences Exist At System Level:
Shared Resources vs. Shared Data
Execution Granularity - Programs vs. Transactions
Granularity Difference - Files vs. Instances
Classic Problem of “Impedance Mismatch”
Thin Layer of Overlap between PLS (C++, Java,
etc.) and Relational Database System
What will Future Bring?
SQL3 with Object-Oriented Extensions
XML Databases (Apache Xindice, Sedna, etc.)
Today: PLS and RDBS
Tomorrow?: PLS and XML DBS
IIE-38
What is Today’s Impedance Mismatch?
Relational Data Organizes Information into Flat Files
Relational Tables with Primary Key
High Number of Tuples per Table (1000s & more)
Limited Number of Tables (10-50) for Even Large
Size Application
Limited Linkages Among Tables (Foreign Keys)
What Does BMI/PHR/EMR Require?
For Each Patient, Track Multiple Dependencies
Visits per Patient
Tests per Patient
Prescriptions per Patient
Data Inherently Complex and Interdependent
Flattened into Relational Format
IIE-39
The Health Care Application - Classes
IIE-40
The Health Care Application - Relationships
IIE-43
How Does Mismatch Occur?
On Left – OO Classes
Inheritance
Dependencies
Programmatic View
C++ or Java Usage
Staging from DB to OO
Item(Phy_Name*, Date*,
Visit_Flag, Symptom, Diagnosis, Treatment,
Presc_Flag, Pre_No, Pharm_Name, Medication,
Test_Flag, Test_Code, Spec_No, Status, Tech)
Above – Relational Tables
Stage Data from Tables into OO (e.g. Java) format
Utilize JDBC
What are the Implications/Impacts?
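The staging step can be sketched as follows. The slide names Java and JDBC; this sketch swaps in Python's sqlite3 module to keep the example short. The `Visit`, `Prescription`, and `Item` classes are hypothetical stand-ins for the OO classes, the test columns of the `Item` table are omitted for brevity, and the sample row is illustrative only.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Visit:
    symptom: str
    diagnosis: str
    treatment: str


@dataclass
class Prescription:
    pre_no: str
    pharm_name: str
    medication: str


@dataclass
class Item:
    phy_name: str
    date: str
    visit: Optional[Visit] = None
    prescription: Optional[Prescription] = None


def stage_items(conn):
    """Turn flat Item rows into nested objects, guided by the flag columns."""
    rows = conn.execute(
        "SELECT Phy_Name, Date, Visit_Flag, Symptom, Diagnosis, Treatment, "
        "Presc_Flag, Pre_No, Pharm_Name, Medication FROM Item"
    )
    items = []
    for phy, date, v_flag, sym, diag, treat, p_flag, pre_no, pharm, med in rows:
        item = Item(phy, date)
        if v_flag:
            item.visit = Visit(sym, diag, treat)
        if p_flag:
            item.prescription = Prescription(pre_no, pharm, med)
        items.append(item)
    return items


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Item (Phy_Name TEXT, Date TEXT, Visit_Flag INTEGER, "
    "Symptom TEXT, Diagnosis TEXT, Treatment TEXT, Presc_Flag INTEGER, "
    "Pre_No TEXT, Pharm_Name TEXT, Medication TEXT, "
    "PRIMARY KEY (Phy_Name, Date))"
)
conn.execute(
    "INSERT INTO Item VALUES ('Dr. D', '2008-01-01', 1, 'Fever', 'Flu', "
    "'Bed Rest', 0, NULL, NULL, NULL)"
)
staged = stage_items(conn)
```

The flag columns and NULL-filled fields exist only because one flat table has to carry several kinds of object, which is one visible symptom of the mismatch.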
IIE-44
Implications and Impact
Three Copies of “Same” Information in Different
Database Table (Item)
OO Representation – Server Side (Classes)
GUI Display – Client Side (html/xml)
What can this Lead to?
Dr. D, Jan 01, 08
Fever, Flu, Bed
Rest
No Scripts
No Tests
Item(Phy_Name*, Date*,
Visit_Flag, Symptom, Diagnosis, Treatment,
Presc_Flag, Pre_No, Pharm_Name, Medication,
Test_Flag, Test_Code, Spec_No, Status, Tech)
IIE-45
What is one Possible Solution?
Standards and Usage of XML
Consider CDA – Clinical Document Architecture
Standard for Clinical (Provider) Medical Record
Clinical Record Organized as:
<patient_encounter> - location
<legal_authenticator> - MD
<originating_organization> and <provider>
<patient> - name, birthdate, gender
<body confidentiality="CONF1"> - note
History
Past Medical History
Medications
Allergies
Social History
Physical Exam
Vitals (BP, Resp, Temp, HR)
Etc...
IIE-46
What is one Possible Solution?
Let’s Explore this in Greater Detail
Starting with the CDA Header
<?xml version="1.0"?>
<!DOCTYPE levelone PUBLIC "-//HL7//DTD CDA Level One 1.0//EN" "levelone_1.0.dtd">
<levelone>
<clinical_document_header>
<id EX="a123" RT="2.16.840.1.113883.3.933"/>
<set_id EX="B" RT="2.16.840.1.113883.3.933"/>
<version_nbr V="2"/>
<document_type_cd V="11488-4" S="2.16.840.1.113883.6.1"
DN="Consultation note"/>
<origination_dttm V="2000-04-07"/>
<confidentiality_cd ID="CONF1" V="N" S="2.16.840.1.113883.5.1xxx"/>
<confidentiality_cd ID="CONF2" V="R" S="2.16.840.1.113883.5.1xxx"/>
<document_relationship>
<document_relationship.type_cd V="RPLC"/>
<related_document>
<id EX="a234" RT="2.16.840.1.113883.3.933"/>
<set_id EX="B" RT="2.16.840.1.113883.3.933"/>
<version_nbr V="1"/>
</related_document>
</document_relationship>
<fulfills_order>
<fulfills_order.type_cd V="FLFS"/>
<order><id EX="x23ABC" RT="2.16.840.1.113883.3.933"/></order>
<order><id EX="x42CDE" RT="2.16.840.1.113883.3.933"/></order>
</fulfills_order>
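A header like this can be read with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree on a shortened copy of the fields shown above; the closing tags are added here only so the fragment parses on its own, and the `summarize_header` function and its dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, closed copy of a few header fields from the slide.
CDA_FRAGMENT = """<?xml version="1.0"?>
<levelone>
  <clinical_document_header>
    <id EX="a123" RT="2.16.840.1.113883.3.933"/>
    <version_nbr V="2"/>
    <document_type_cd V="11488-4" S="2.16.840.1.113883.6.1"
                      DN="Consultation note"/>
    <origination_dttm V="2000-04-07"/>
    <document_relationship>
      <document_relationship.type_cd V="RPLC"/>
      <related_document>
        <id EX="a234" RT="2.16.840.1.113883.3.933"/>
        <version_nbr V="1"/>
      </related_document>
    </document_relationship>
  </clinical_document_header>
</levelone>"""


def summarize_header(xml_text):
    """Pull a few identifying fields out of a CDA-style header."""
    root = ET.fromstring(xml_text)
    header = root.find("clinical_document_header")
    rel = header.find("document_relationship")
    # Match the dotted tag name by direct comparison rather than a path.
    rel_type = next(
        child for child in rel if child.tag == "document_relationship.type_cd"
    )
    return {
        "id": header.find("id").get("EX"),
        "version": int(header.find("version_nbr").get("V")),
        "type_name": header.find("document_type_cd").get("DN"),
        "originated": header.find("origination_dttm").get("V"),
        "relationship": rel_type.get("V"),
        "related_id": rel.find("related_document/id").get("EX"),
    }


summary = summarize_header(CDA_FRAGMENT)
```

Because every field is a named element with named attributes, the same extraction code works for any provider that emits the same structure.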
IIE-47
CDA Example - Continued
IIE-48
Information Sharing/Access: Potential Pitfalls
Another Critical Issue is Information Sharing
Perception: How do I see/understand Data/Info?
Differences: What is the Reality?
Dealing with Information at Different Levels
Syntax – Format of Information
Semantics – Meaning of Information
Pragmatics – Usage of Information
When Unifying Databases/Information Repositories,
Must Address all Three!
Data Integrity and Data Security
Correct and Consistent Values
Assurance in All Secure Accesses
For BMI – All of the Above are Critical for Correct
Usage and Interpretation in All Contexts (T1, T2, …)
IIE-56
Information Syntactic Considerations
Syntax is Structure and Format of the Information
That is Needed to Support a Coalition
Incorrect Structure or Format Could Result in Anything
from a Simple Error Message to a Catastrophic Event
For Sharing, Strict Formats Need to be Maintained
Health Care Data Suffers from Lack of Standards
Standards for Diagnosis (Insurance Industry)
Emerging Standards Include:
Health Level 7 (HL7)
Based on XML
Formats Non-Standard for Different Health
Organizations, Insurers, Pharmacy Networks, etc.
N*N Translations Prone to Errors!
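The N*N growth can be made concrete with a little arithmetic. The sketch below assumes one translator per ordered pair of distinct formats, and compares that with routing everything through one shared standard; both function names are hypothetical.

```python
def pairwise_translators(n):
    """Directed translators needed when every format maps to every other."""
    return n * (n - 1)


def hub_translators(n):
    """Translators needed when each format maps to and from one standard."""
    return 2 * n
```

With 10 formats this is 90 pairwise translators against 20 through a shared standard, which is the practical argument for a common exchange format.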
IIE-57
Information Semantics Concerns
Semantics (Meaning and Interpretation)
NATO and US - Different Message Formats
Distances (Miles vs. Kilometers)
Grid Coordinates (Mils, Degrees)
Maps (Grid, True, and Magnetic North)
What Can Happen in Health Care Data?
Possible to Confuse Dosages of Medications?
Weight of Patients (Pounds vs. Kilos)?
Measurement of Vital Signs?
Dana Farber Chemo Death – Checks/Balances
What Others are Possible?
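One common guard against pounds vs. kilos confusion is to carry the unit with the value and convert explicitly. This is a minimal sketch; the `Weight` class is hypothetical, and a real system would draw its units from an agreed vocabulary rather than bare strings.

```python
from dataclasses import dataclass

LB_TO_KG = 0.45359237  # exact, by definition of the international pound


@dataclass(frozen=True)
class Weight:
    value: float
    unit: str  # "kg" or "lb"

    def to_kg(self):
        """Return the weight in kilograms, refusing unknown units."""
        if self.unit == "kg":
            return self.value
        if self.unit == "lb":
            return self.value * LB_TO_KG
        raise ValueError(f"unknown weight unit: {self.unit!r}")
```

A bare number such as 220 is ambiguous; `Weight(220, "lb")` is not, and an unrecognized unit fails loudly instead of being silently treated as kilograms.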
IIE-58
Syntactic & Semantic Considerations
What’s Available to Support Information Sharing?
How do we Ensure that Information can be Accurately
and Precisely Exchanged?
How do we Associate Semantics with the Information
to be Exchanged?
What Can we Do to Verify the Syntactic Exchange and
that Semantics are Maintained?
Can Information Exchange Facilitate Federation?
Can this be Handled Dynamically?
Or, Must we Statically Solve Information Sharing in
Advance?
IIE-59
Information Pragmatics Considerations
Pragmatics Require that we Totally Understand
Information Usage and Information Meaning
What are the Critical Information Sources?
How will Information Flow Among Them?
What Systems Need Access to these Sources?
How will that Access be Delivered?
Who (People/Roles) will Need to See What When?
How will What a Person Sees Impact Other
Sources?
Focus on: Way that Information is Utilized and
Understood in its Specific Context
Can Medical Info be Misused even if Understood?
IIE-60
Information Pragmatics Considerations
What are Pragmatics Issues re. Underinsured and
Uninsured Populations in Event?
How Can we Use Info Effectively if we Don’t
Know if it is Complete?
Has Info from All Sources Been Collected?
What Happens if Same Patient in Different
Repositories Can’t be Reconciled?
What if Patient is Unresponsive and Can’t Supply
any Info?
Is Usage of Info Complicated due to
Incompleteness? Multiple Locations?
Or, if the Event is Major – will all Patient
Populations Suffer Same Substandard Care?
IIE-61
Collaboration and Security
Two Concepts go Hand in Hand
Strong Parallels
Collaboration
Among Providers and Researchers
Among Providers and Patients
Among Patients (Support Groups)
Security
Control of Patient Information (De-identified)
Secure Exchange/Patient Ownership
Establish Custom Patient Controlled Groups
Let’s Explore them Both via our Semester Project
Also Consider Emergent and Policy Issues
IIE-62
Collaboration: Providers and Researchers
Providers
Seeking new Treatment Plans
Looking for Clinical Research Studies for Patients
Looking to Communicate with Clinical
Researchers
Researchers
Publish Evidence-Based Guidelines
New Treatments
Collect Data on Provider Visits
Provide Forum to Discuss with Provider
Allow Provider to Upload Anonymous Outcomes
Also – Need to Collaborate Among Researchers of All
Types (Sharepoint, WIKIs, etc.)
IIE-63
Collaboration: Providers and Patients
Patients
Open Personal Health Record to Providers
Patients have
Data Entry Facility for Chronic Conditions
Ability to Graph and Track their Disease
Education Materials also Available
Providers
Securely Communicate (email) with Patients (see
https://www.relayhealth.com/rh/specific/patients/default.aspx)
Access to Authorized Patient Data
Tracking of Patients (to Reduce Office Visits)
Proactive Intervention to Head off Potential
Hospitalizations/Problems via Treatment
Algorithms to Auto-Notify Based on Data Values
IIE-64
Collaboration: Among Patients
Patients
Provide Each with a List of Support Groups
Allow them to Join Groups or Form New Groups
Secure Communication via:
Email
Chatting Environment
Link to Actual (Physical) Meetings
Repository of Available Support Groups
Overall:
Patients can Meet other Patients with Same Issues
Vital for Patients with Rare Diseases
Form On-Line Communities
IIE-65
Security: General Concepts
Authentication
Proving you are who you are
Signing a Message
Is the Client who S/he Says they are?
Authorization
Granting/Denying Access
Revoking Access
Does the Client have Permission to do what S/he
Wants?
Encryption
Establishing Communications Such that No One
but Receiver will Get the Content of the Message
Symmetric Encryption
Public Key Encryption
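Message authentication can be sketched with Python's standard hmac module. Note the substitution: an HMAC uses a key shared by sender and receiver, so it is the symmetric counterpart of a signature, not a public-key signature. The `sign` and `verify` helper names are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Compute an HMAC-SHA256 tag for the message under a shared key."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Check a tag in constant time to avoid timing side channels."""
    return hmac.compare_digest(sign(key, message), tag)
```

A receiver holding the same key accepts the message only if the tag matches; any change to the message or the key makes verification fail.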
IIE-66
Key Security Issues
Legal and Ethical Issues
Information that Must be Protected
Information that Must be Accessible
Policy Issues
Who Can See What Information When?
Applications Limits w.r.t. Data vs. Users?
System Level Enforcement
What is Provided by the DBMS? Programming
Language? OS? Application?
How Do All of the Pieces Interact?
Multiple Security Levels/Organizational Enforcement
Mapping Security to Organizational Hierarchy
Protecting Information in Organization
IIE-67
What are Key Access Control Concepts?
Assurance
Are the Security Privileges for Each User
Adequate to Support their Activities?
Do the Security Privileges for Each User Meet but
Not Exceed their Capabilities?
Consistency
Are the Defined Security Privileges for Each User
Internally Consistent?
Least-Privilege Principle: Just Enough Access
Are the Defined Security Privileges for Related
Users Globally Consistent?
Mutual-Exclusion: Read for Some-Write for Others
IIE-68
Available Security Approaches
Mandatory Access Control (MAC)
Bell-LaPadula Security Model
Security Classification Levels for Data Items
Access Based on Security Clearance of User
Role Based Access Control (RBAC)
Govern Access to Information based on Role
Users can Play Different Roles at Different Times
Responsibilities of Users Guiding Factor
Facilitate User Interactions while Simultaneously
Protecting Sensitive Data
Discretionary Access Control (DAC)
Richer Set of Access Modes - Govern Access to
Information based on User Id
Discretionary Rules on Access Privileges
Focused on Application Needs/Requirements
IIE-69
Mandatory Security Mechanism
Typical Security Classification Levels for
Subjects/programs and Objects/resources
Top Secret (TS) and Secret (S)
Confidential (C) and Unclassified (U)
Rules:
TS is the Highest and U is the Lowest Level
TS > S > C > U
Security Levels:
C1 is Security Clearance Given to User U1
C2 is Security Classification Given to Object O1
U1 can Access O1 iff C1 ≥ C2
This is Referred to as the Domination of U1 Over O1
Not Prevalent in BMI – But May have Relevance
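The domination rule (a user may access an object only when the user's clearance is at least as high as the object's classification) can be sketched in a few lines. The numeric ranks and the `dominates` function name are hypothetical; only the ordering TS > S > C > U comes from the slide.

```python
# Numeric ranks encode the ordering TS > S > C > U.
LEVELS = {"U": 0, "C": 1, "S": 2, "TS": 3}


def dominates(clearance, classification):
    """True when a user clearance is at least the object classification."""
    return LEVELS[clearance] >= LEVELS[classification]
```

For example, a user cleared to Secret dominates a Confidential object, while a user cleared to Confidential does not dominate a Secret one.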
IIE-70
Role Based Access Control (RBAC)
Focuses on Defining Roles of Typical Behavior
Nurse, Nurse-Manager, Education-RN
Physician, Attending-MD, Specialist
Student, Faculty-Advisor, Head
Focus on Duties that are Shared
During Authorization of Roles to Users
Establish Boundaries of Access
User Steve with Role Faculty-Advisor
Limited to Faculty Capabilities on Peoplesoft
Only Can Manipulate His Advisees
User Steve with Role Associate Head
Possible Overlap in Responsibilities w/ Faculty-Advisor
Other Activities not given to Faculty-Advisor Role
IIE-71
Why is RBAC Needed?
In Health Care, different professionals (e.g., Nurses
vs. Physicians vs. Administrators, etc.) Require Select
Access to Sensitive Patient Data
Suppose we have a Patient Access Client
Lois playing the Nurse Role would be Allowed to
Enter Patient History, Record Vital Signs, etc.
Steve playing M.D. Role would be Allowed to do
all of a Nurse plus Write Orders, Enter Scripts, etc.
Vicky playing Admin Role would be Allowed to
Enter Demographic/Insurance Info.
Role Dictates Client Behavior
Physicians Write Scripts
Nurses Enter Patient Data (Vitals + History)
All Access Shared Medical Record
Access is Limited Based on Role
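The Patient Access Client scenario can be sketched as a role-to-permission table. The users and roles follow the slide; the permission strings and the `is_allowed` function are hypothetical names chosen for illustration.

```python
# Permissions attach to roles, never directly to users.
ROLE_PERMISSIONS = {
    "Nurse": {"enter_patient_history", "record_vital_signs"},
    "Admin": {"enter_demographics", "enter_insurance"},
}
# The M.D. role can do everything a Nurse can, plus orders and scripts.
ROLE_PERMISSIONS["MD"] = ROLE_PERMISSIONS["Nurse"] | {
    "write_orders",
    "enter_scripts",
}

USER_ROLES = {"Lois": "Nurse", "Steve": "MD", "Vicky": "Admin"}


def is_allowed(user, action):
    """A user may perform an action only if the user's role grants it."""
    role = USER_ROLES.get(user)
    return action in ROLE_PERMISSIONS.get(role, set())
```

Changing what nurses may do means editing one entry in `ROLE_PERMISSIONS`, not every nurse's account.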
IIE-72
Discretionary Access Control
Discretionary
Grant Privileges to Users, Including Capabilities to
Access Specific Data Items in a Specific Mode
Available in Most Commercial DBMSs
Aspects of DAC
User’s Identity
Predefined Discretionary “Rules” Defined by the
Security Administrator
Allows User to “Delegate” Capabilities to Another
User
Delegate Capabilities and Ability to Delegate
Role Delegation and Delegation Authority
DAC Available in SQL2
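Discretionary grants with an optional right to pass the privilege on can be modeled as below. This is a hypothetical in-memory sketch in the spirit of SQL's GRANT with a grant option, not the behavior of any particular DBMS; the `DacTable` class, user names, and modes are assumptions.

```python
class DacTable:
    """Tracks who holds which access mode on which object."""

    def __init__(self):
        # (user, obj, mode) -> True if the user may also grant it onward
        self._grants = {}

    def add_owner(self, owner, obj, modes):
        """An owner holds every listed mode and may grant each onward."""
        for mode in modes:
            self._grants[(owner, obj, mode)] = True

    def grant(self, grantor, grantee, obj, mode, with_grant_option=False):
        """Pass a privilege on; only allowed if the grantor may delegate it."""
        if not self._grants.get((grantor, obj, mode), False):
            raise PermissionError(f"{grantor} may not grant {mode} on {obj}")
        existing = self._grants.get((grantee, obj, mode), False)
        self._grants[(grantee, obj, mode)] = existing or with_grant_option

    def can(self, user, obj, mode):
        """True if the user holds the mode on the object."""
        return (user, obj, mode) in self._grants
```

A user who receives a privilege without the grant option can use it but cannot hand it to anyone else, which separates delegating a capability from delegating the ability to delegate.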
IIE-73
What is Role Delegation?
Role Delegation, a User-to-User Relationship, Allows
an Original User (OU) to Transfer Responsibility for a
Particular Role to a Delegated User (DU)
Two Major Types of Delegation
Administratively-directed Delegation: an
Administrative Infrastructure Outside the Direct
Control of a User Mediates Delegation
User-directed Delegation: a User (Playing a
Role) Determines If and When to Delegate a Role
to Another User
In Both, Security Administrators Still Oversee Who
Can Do What When w.r.t. Delegation
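A time-bounded role delegation from an original user (OU) to a delegated user (DU) can be sketched as follows. The `Delegation` record, the `active_roles` function, and the sample names and dates used with them are hypothetical; the check that the OU actually holds the role is one possible enforcement rule, not one stated on the slide.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Delegation:
    original_user: str
    delegated_user: str
    role: str
    start: date
    end: date


def active_roles(user, own_roles, delegations, today):
    """Roles a user may play today: own roles plus valid delegated ones."""
    roles = set(own_roles.get(user, ()))
    for d in delegations:
        in_window = d.start <= today <= d.end
        ou_holds_role = d.role in own_roles.get(d.original_user, ())
        if d.delegated_user == user and in_window and ou_holds_role:
            roles.add(d.role)
    return roles
```

The delegated role appears only inside the stated window, so the DU's extra privileges lapse without anyone having to revoke them.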
IIE-74
Why is Role Delegation Important?
Many Different Scenarios Under Which Privileges
May Want to be Passed to Other Individuals
Large organizations often require delegation to
meet demands on individuals in specific roles for
certain periods of time
True in Many Different Sectors
Health Care and Financial Services
Engineering and Academic Setting
Example:
Reda Delegates Head Role to Steve when Traveling
Key Issues:
Who Controls Delegation to Whom?
How are Delegation Requirements Enforced?
IIE-75
Coalitions for Clinical/Translational Science
Coalition Participants: Pfizer, Bayer; UConn Storrs; UConn Health Center; Saint Francis, CCMC, …; DCF, DSS, etc.
Info. Sharing - Joint R&D
Support T1, T2, and Clinical Research
Company and University Partnerships
Collaborative Funding Opportunities
Cohesive and Trusted Environment
Existing Systems/Databases
and New Applications
How do you Protect Commercial Interests?
Promote Research Advancement?
Free Read for Some Data/Limited for Other?
Commercialization vs. Intellectual Property?
NIH, FDA, NSF
Balancing Cooperation with Propriety
IIE-76
Emergent Public Policy Issues
How do we Protect a Person’s DNA?
Who Owns a Person’s DNA?
Who Can Profit from Person’s DNA?
Can Person’s DNA be Used to Deny Insurance?
Employment? Etc.
How do you Define Security Limitations/Access?
What about i2b2 – Informatics for Integrating Biology
and the Bedside (see https://www.i2b2.org/)
Scalable Informatics Framework to Bridge
Clinical Research Data
Vast Data Banks for Basic Science Research
Goal: Understand Genetic Bases of Diseases
IIE-77
Emergent Public Policy Issues
Can DNA Repositories be Anonymously Available for
Medical Research?
Do Societal Needs Trump Individual Rights?
Can DNA be Made Available Anonymously for
Medical Research?
De-identified Data Repositories
Privacy Protecting Data Mining
International Repository Might Allow Medical
Researchers Access to Large Enough Data Set for
Rare Conditions (e.g., Orphan Drug Act)
Individual Rights vs. Medical Advances
IIE-78
Internet and the Web
A Major Opportunity for Business
A Global Marketplace
Business Across State and Country Boundaries
A Way of Extending Services
Online Payment vs. VISA, Mastercard
A Medium for Creation of New Services
Publishers, Travel Agents, Teller, Virtual Yellow Pages,
Online Auctions …
A Boon for Academia
Research Interactions and Collaborations
Free Software for Classroom/Research Usage
Opportunities for Exploration of Technologies in
Student Projects
What are the Implications for BMI? Where is the Advantage?
IIE-79
WWW: Three Market Segments
Intranet (Server on Corporate Network)
Decision Support
Mfg. System Monitoring
Corporate Repositories
Workgroups
Business to Business (Corporate Network to Corporate Network)
Information Sharing
Ordering Info./Status
Targeted Electronic Commerce
Internet (Server on Provider Network)
Sales
Marketing
Information Services
Exposure to Outside
IIE-80
Information Delivery Problems on the Net
Everyone can Publish Information on the Web
Independently at Any Time
Consequently, there is an Information Explosion
Identifying Information Content More Difficult
There are too Many Search Engines but too Few
Capable of Returning High Quality Data
Most Search Engines are Useful for Ad-hoc Searches
but Awkward for Tracking Changes
What are Information Delivery Issues for BMI?
Publishing of Patient Education Materials
Publishing of Provider Education Materials
How Can Patients/Providers Find what they Need?
How do they Know if it is Relevant? Reputable?
IIE-81
Example Web Applications
Scenario 1: World Wide Wait
A Major Event is Underway and the Latest, Up-to-the-Minute
Results are Being Posted on the Web
You Want to Monitor the Results for this Important
Event, so you Fire up your Trusty Web Browser,
Pointing at the Result Posting Site, and Wait, and
Wait, and Wait …
What is the Problem?
The Scalability Problems are the Result of a
Mismatch Between the Data Access Characteristics
of the Application and the Technology Used to
Implement the Application
May not be Relevant to BMI: Hard to Apply Scenario
IIE-82
Example Web Applications
Scenario 2:
Many Applications Today have the Need for
Tracking Changes in Local and Remote Data
Sources and Notifying Users of Changes If Some Condition
Over the Data Source(s) is Met
To Monitor Changes on the Web, You Need to Fire up
Your Trusty Web Browser from Time to Time,
Cache the Most Recent Result, and Compare it
Manually Each Time You Poll the Data Source(s)
Issue: Pure Pull is Not the Answer to All Problems
BMI: If a Patient Enters Data that Sets off a Chain
Reaction, how Can Provider be Notified and in Turn
the Provider Notify the Patient (Bad Health Event)
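The manual routine described in Scenario 2 (poll, cache the most recent result, compare) can be sketched in a few lines. This is only an illustration of pure pull. The class `PollingMonitor` and the sample readings are hypothetical, and a real monitor would fetch a remote page instead of reading from a list.

```python
import hashlib


class PollingMonitor:
    """Pull-based change tracking: cache the last result and compare on each poll."""

    def __init__(self, fetch):
        self.fetch = fetch          # callable returning the current content (str)
        self.last_digest = None     # digest of the most recently seen content

    def poll(self):
        content = self.fetch()
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        # The first poll only primes the cache; later polls report differences.
        changed = self.last_digest is not None and digest != self.last_digest
        self.last_digest = digest
        return changed


# Hypothetical data source standing in for a remote page of patient-entered data.
readings = iter(["glucose=98", "glucose=98", "glucose=240"])
monitor = PollingMonitor(lambda: next(readings))
results = [monitor.poll() for _ in range(3)]
```

The client learns of the change only when it happens to poll, which is the weakness the slide points out: between polls, nobody is notified.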
IIE-83
What is the Problem?
Applications are Asymmetric but the Web is Not
Computation Centric vs. Information Flow Centric
Types of Asymmetry
Network Asymmetry
Satellite, CATV, Mobile Clients, Etc.
Client to Server Ratio
Too Many Clients can Swamp Servers
Data Volume
Mouse and Key Click vs. Content Delivery
Update and Information Creation
Clients Need to be Informed or Must Poll
Clearly, for BMI, Simple Web Environment/Browser
is Not Sufficient – No Auto-Notification
IIE-84
What are Information Delivery Styles?
Pull-Based System
Transfer of Data from Server to Client is Initiated
by a Client Pull
Clients Determine when to Get Information
Potential for Information to be Old Unless Client
Periodically Pulls
Push-Based System
Transfer of Data from Server to Client is Initiated
by a Server Push
Clients may get Overloaded if Push is Too
Frequent
Hybrid
Pull and Push Combined
Pull First and then Push Continually
IIE-85
Publish/Subscribe
Semantics: Servers Publish/Clients Subscribe
Servers Publish Information Online
Clients Subscribe to the Information of Interest
(Subscription-based Information Delivery)
Data Flow is Initiated by the Data Sources
(Servers) and is Aperiodic
Danger: Subscriptions can Lead to Other
Unwanted Subscriptions
Applications
Unicast: Database Triggers and Active Databases
1-to-n: Online News Groups
May work for Clinical Researcher to Provider Push
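A minimal in-process sketch of the publish/subscribe semantics follows: clients register interest in a topic, and delivery is initiated by the publishing side whenever new information appears. The class `PubSubHub`, the topic names, and the messages are hypothetical. A deployed system would add networking, persistence, and access control.

```python
from collections import defaultdict


class PubSubHub:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """Client side: register interest in a topic."""
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Server side: push the message to every subscriber of the topic."""
        for callback in self.subscribers[topic]:
            callback(message)
        return len(self.subscribers[topic])  # number of deliveries


hub = PubSubHub()
provider_inbox = []
# A provider subscribes to results pushed by a clinical researcher.
hub.subscribe("trial-results", provider_inbox.append)
delivered = hub.publish("trial-results", "New dosing guidance available")
ignored = hub.publish("unrelated-topic", "No one is listening")
```

Because the subscriber never polls, the data flow is aperiodic: a message arrives only when the source has something to publish.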
IIE-86
Design Options for Nodes
Three Types of Nodes:
Data Sources
Provide Base Data which is to be Disseminated
Clients
Who are the Net Consumers of the Information
Information Brokers
Acquire Information from Other Data Sources, Add
Value to that Information and then Distribute this
Information to Other Consumers
By Creating a Hierarchy of Brokers, Information
Delivery can be Tailored to the Need of Many Users
Brokers may be Ideal Intermediaries for BMI!
Act on Behalf of Patients, Providers
Incorporate Secure Access
IIE-87
Research Challenges
Ubiquitous/Pervasive
Many computers and information
appliances everywhere,
networked together
Inherent Complexity:
Coping with Latency (Sometimes
Unpredictable)
Failure Detection and Recovery
(Partial Failure)
Concurrency, Load Balancing,
Availability, Scale
Service Partitioning
Ordering of Distributed Events
“Accidental” Complexity:
Heterogeneity: Beyond the Local
Case: Platform, Protocol, Plus All
Local Heterogeneity in Spades.
Autonomy: Change and Evolve
Autonomously
Tool Deficiencies: Language
Support (Sockets, RPC),
Debugging, Etc.
IIE-88
Infosphere
Problem: Too Many Sources, Too Much Information
Internet:
Information Jungle
Infopipes
Clean, Reliable,
Timely Information,
Anywhere
Digital
Earth
Personalized
Filtering &
Info. Delivery
Sensors
IIE-89
Current State-of-Art
Web
Server
Mainframe
Database
Server
Thin
Client
IIE-90
Infosphere Scenario – for BMI
Infotaps &
Fat Clients
Sensors
Variety
of Servers
Many sources
Database
Server
IIE-91
Heterogeneity and Autonomy
Heterogeneity:
How Much can we Really Integrate?
Syntactic Integration
Different Formats and Models
Web/SQL Query Languages
Semantic Interoperability
Basic Research on Ontology, Etc.
Autonomy
No Central DBA on the Net
Independent Evolution of Schema and Content
Interoperation is Voluntary
Interface Technology (Support for ISVs)
DCOM: Microsoft Standard
CORBA, Etc...
IIE-92
Security and Data Quality
Security
System Security in the Broad Sense
Attacks: Penetrations, Denial of Service
System (and Information) Survivability
Security Fault Tolerance
Replication for Performance, Availability, and
Survivability
Data Quality
Web Data Quality Problems
Local Updates with Global Effects
Unchecked Redundancy (Mutual Copying)
Registration of Unchecked Information
Spam on the Rise
IIE-93
Legacy Data Challenge
Legacy Applications and Data
Definition: Important and Difficult to Replace
Typically, Mainframe Mission Critical Code
Most are OLTP and Database Applications
Evolution of Legacy Databases
Client-server Architectures
Wrappers
Expensive and Gradual in Any Case
IIE-94
Potential Value Added/Jumping on Bandwagon
Sophisticated Query Capability
Combining SQL with Keyword Queries
Consistent Updates
Atomic Transactions and Beyond
But Everything has to be in a Database!
Only If we Stick with Classic DB Assumptions
Relaxing DB Assumptions
Interoperable Query Processing
Extended Transaction Updates
Commodity DB Software
A Little Help is Still Good If it is Cheap
Internet Facilitates Software Distribution
Databases as Middleware
IIE-95
Data Warehousing and Data Mining
Data Warehousing
Provide Access to Data for Complex Analysis,
Knowledge Discovery, and Decision Making
Underlying Infrastructure in Support of Mining
Provides Means to Interact with Multiple DBs
OLAP (On-Line Analytical Processing) vs. OLTP
Data Mining
Discovery of Information in Vast Data Sets
Search for Patterns and Common Features
Discover Information not Previously Known
Medical Records Accessible Nationwide
Research/Discover Cures for Rare Diseases
Relies on Knowledge Discovery in DBs (KDD)
IIE-96
Data Warehousing and OLAP
A Data Warehouse
Database is Maintained Separately from an
Operational Database
“A Subject-Oriented, Integrated, Time-Variant, and
Non-Volatile Collection of Data in Support of
Management’s Decision-Making Process”
[W. H. Inmon]
OLAP (On-Line Analytical Processing)
Analysis of Complex Data in the Warehouse
Attempt to Attain “Value” through Analysis
Relies on Trained, Skilled Knowledge
Workers who Discover Information
Data Mart
Organized Data for a Subset of an Organization
Establish De-Identified Marts for BMI Research
IIE-97
Building a Data Warehouse
Option 1
Leverage Existing
Repositories
Collate and Collect
May Not Capture All
Relevant Data
Option 2
Start from Scratch
Utilize Underlying
Corporate Data
[Figure: Corporate Data Warehouse. Option 1: Consolidate Data Marts. Option 2: Build from Scratch from Corporate Data.]
IIE-98
BMI – Partition/Excerpt Data Warehouse
Clinical and Epidemiological Research (and for T2 and T1)
Each Study Submitted to Institutional Review Board (IRB)
For Human Subjects (Assess Risks, Protect Privacy)
See: http://resadm.uchc.edu/hspo/irb/
To Satisfy IRB (and Privacy, Security, etc.), Reverse Process to
Create a Data Mart for each Approved Study
Export/Excerpt Study Data from Warehouse
May be Single or Multiple Sources
[Figure: BMI Data Warehouse Excerpted into Multiple Data Marts]
IIE-99
Data Warehouse Characteristics
Utilizes a “Multi-Dimensional” Data Model
Warehouse Comprised of
Store of Integrated Data from Multiple Sources
Processed into Multi-Dimensional Model
Warehouse Supports
Time Series and Trend Analysis
“Super-Excel” Integrated with DB Technologies
Data is Less Volatile than Regular DB
Doesn’t Dramatically Change Over Time
Updates at Regular Intervals
Specific Refresh Policy Regarding Some Data
IIE-100
Three Tier Architecture
[Figure: Three Tier Architecture]
Sources: Operational Databases, External Data Sources
Extract, Transform, Load, Refresh (Monitor, Integrator)
Data Warehouse, Data Marts, Metadata
OLAP Server (Serves Warehouse Data)
Summarization Report, Query Report, Data Mining
IIE-101
Data Warehouse Design
Most Data Warehouses use a Star Schema to
Represent the Multi-Dimensional Data Model
Each Dimension is Represented by a Dimension
Table that Describes it
A Fact Table Connects All Dimension Tables with a
Multiple Join
Each Tuple in the Fact Table Represents One Fact at
the Intersection of the Dimensions
Each Tuple in the Fact Table Consists of a Pointer
to Each of the Dimension Tables, which Provide its
Multidimensional Coordinates, and Stores Measures
for those Coordinates
Links Between the Fact Table and the Dimension
Tables Form a Shape Like a Star
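The design above can be sketched with Python's built-in `sqlite3` module: one fact table holding a key per dimension plus the measures, joined to its dimension tables in a single query. The table names, column names, and all rows are hypothetical and chosen only to illustrate the shape of a star schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim    (date_id INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE store_dim   (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
-- Fact table: one pointer per dimension plus the measures.
CREATE TABLE sale_fact (
    date_id      INTEGER REFERENCES date_dim(date_id),
    product_id   INTEGER REFERENCES product_dim(product_id),
    store_id     INTEGER REFERENCES store_dim(store_id),
    unit_sales   INTEGER,
    dollar_sales REAL
);
INSERT INTO date_dim    VALUES (1, 'Jan', 1999), (2, 'Feb', 1999);
INSERT INTO product_dim VALUES (1, 'Beer', 'Beverage'), (2, 'Nuts', 'Snack');
INSERT INTO store_dim   VALUES (1, 'Atlanta', 'South'), (2, 'Hartford', 'East');
INSERT INTO sale_fact   VALUES (1, 1, 1, 10, 50.0), (1, 2, 1, 4, 12.0),
                               (2, 1, 2, 6, 30.0);
""")

# The "multiple join": the fact table joined to the dimensions it points to.
rows = conn.execute("""
    SELECT s.region, p.name, SUM(f.dollar_sales)
    FROM sale_fact f
    JOIN product_dim p ON p.product_id = f.product_id
    JOIN store_dim   s ON s.store_id   = f.store_id
    GROUP BY s.region, p.name
    ORDER BY s.region, p.name
""").fetchall()
```

Keeping measures only in the fact table means a new way of grouping (by month, by category) needs only a different join and `GROUP BY`, not a new table.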
IIE-102
What is a Multi-Dimensional Data Cube?
Representation of Information in Two or More
Dimensions
Typical Two-Dimensional - Spreadsheet
In Practice, to Track Trends or Conduct Analysis,
Three or More Dimensions are Useful
For BMI – Axes for Diagnosis, Drug, Subject Age
IIE-103
Multi-Dimensional Schemas
Supporting Multi-Dimensional Schemas Requires Two
Types of Tables:
Dimension Table: Tuples of Attributes for Each
Dimension
Fact Table: Measured/Observed Variables with
Pointers into Dimension Table
Star Schema
Characterizes Data Cubes by having a Fact Table
with a Single Table for Each Dimension
Snowflake Schema
Dimension Tables from Star Schema are Organized
into Hierarchy via Normalization
Both Represent Storage Structures for Cubes
IIE-104
Example of Star Schema
Sale Fact Table: Date, Product, Store, Customer, Unit_Sales, Dollar_Sales
Date Dimension: Date, Month, Year
Product Dimension: ProductNo, ProdName, ProdDesc, Category
Store Dimension: StoreID, City, State, Country, Region
Customer Dimension: CustID, CustName, CustCity, CustCountry
IIE-105
Example of Star Schema for BMI
Patient Fact Table: Visit Date, Vitals, Symptoms, Patient, Medications
Date Dimension: Date, Month, Year
Vitals Dimension: BP, Temp, Resp, HR (Pulse)
Symptoms Dimension: Pulmonary, Heart, Mus-Skel, Skin, Digestive, Etc.
Patient Dimension: PatientID, PatientName, PatientCity, PatientCountry
Medications: Reference another Star Schema for all Meds
IIE-106
A Second Example of Star Schema …
IIE-107
and Corresponding Snowflake Schema
IIE-108
Data Warehouse Issues
Data Acquisition
Extraction from Heterogeneous Sources
Reformatted into Warehouse Context - Names,
Meanings, Data Domains Must be Consistent
Data Cleaning for Validity and Quality
Is the Data as Expected w.r.t. Content? Value?
Transition of Data into Data Model of Warehouse
Loading of Data into the Warehouse
Other Issues Include:
How Current is the Data? Frequency of Update?
Availability of Warehouse? Dependencies of Data?
Distribution, Replication, and Partitioning Needs?
Loading Time (Clean, Format, Copy, Transmit,
Index Creation, etc.)?
For CTSA – Data Ownership (Competing Hospitals)
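The acquisition steps listed above (extract from heterogeneous sources, reformat into one set of names and domains, clean, load) can be sketched as follows. The two source layouts, the field names, and the plausibility range used for cleaning are all assumptions made for this illustration.

```python
import sqlite3

# Hypothetical extracts: two sources that name and encode the same facts differently.
source_a = [{"pid": "17", "sbp": "120", "visit": "2008-01-05"}]
source_b = [{"patient_id": 42, "systolic": 300, "date": "01/07/2008"},
            {"patient_id": 43, "systolic": 135, "date": "01/08/2008"}]


def transform_a(row):
    # Source A stores numbers as strings and already uses ISO dates.
    return (int(row["pid"]), int(row["sbp"]), row["visit"])


def transform_b(row):
    # Source B uses MM/DD/YYYY; convert to the warehouse's ISO format.
    month, day, year = row["date"].split("/")
    return (row["patient_id"], row["systolic"], f"{year}-{month}-{day}")


def is_valid(record):
    # Cleaning rule (an assumption for this sketch): plausible systolic range.
    return 50 <= record[1] <= 260


records = [transform_a(r) for r in source_a] + [transform_b(r) for r in source_b]
clean = [r for r in records if is_valid(r)]

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE bp_fact (patient_id INTEGER, systolic INTEGER, visit_date TEXT)"
)
warehouse.executemany("INSERT INTO bp_fact VALUES (?, ?, ?)", clean)
loaded = warehouse.execute("SELECT COUNT(*) FROM bp_fact").fetchone()[0]
```

Rejected records are simply dropped here. A production load would more likely route them to a review queue so the source data can be corrected.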
IIE-109
Knowledge Discovery
Data Warehousing Requires Knowledge Discovery to
Organize/Extract Information Meaningfully
Knowledge Discovery
Technology to Extract Interesting Knowledge
(Rules, Patterns, Regularities, Constraints) from a
Vast Data Set
Process of Non-trivial Extraction of Implicit,
Previously Unknown, and Potentially Useful
Information from Large Collection of Data
Data Mining
A Critical Step in the Knowledge Discovery
Process
Extracts Implicit Information from Large Data Set
IIE-110
Steps in a KDD Process
Learning the Application Domain (goals)
Gathering and Integrating Data
Data Cleaning
Data Integration
Data Transformation/Consolidation
Data Mining
Choosing the Mining Method(s) and Algorithm(s)
Mining: Search for Patterns or Rules of Interest
Analysis and Evaluation of the Mining Results
Use of Discovered Knowledge in Decision Making
Important Caveats
This is Not an Automated Process!
Requires Significant Human Interaction!
IIE-111
OLAP Strategies
OLAP Strategies
Roll-Up: Summarization of Data
Drill-Down: from the General to Specific (Details)
Pivot: Cross Tabulate the Data Cubes
Slice and Dice: Projection Operations Across
Dimensions
Sorting: Ordering Result Sets
Selection: Access by Value or Value Range
Implementation Issues
Persistent with Infrequent Updates (Loading)
Optimization for Performance on Queries is More
Complex - Across Multi-Dimensional Cubes
Recovery Less Critical - Mostly Read Only
Temporal Aspects of Data (Versions) Important
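The strategies above can be illustrated on a tiny in-memory fact list. This is a sketch in plain Python, not how an OLAP server is implemented. The rows and the helper names `roll_up`, `slice_`, and `dice` are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, state, product, month, sales)
facts = [
    ("Atlanta",  "GA", "Electronics", "Jan", 100),
    ("Atlanta",  "GA", "Electronics", "Feb", 150),
    ("Atlanta",  "GA", "Clothing",    "Jan", 80),
    ("Savannah", "GA", "Electronics", "Jan", 40),
    ("Boston",   "MA", "Clothing",    "Mar", 60),
]
CITY, STATE, PRODUCT, MONTH, SALES = range(5)


def roll_up(rows, *dims):
    """Summarize sales over the chosen dimensions (fewer dims = coarser view)."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[SALES]
    return dict(totals)


def slice_(rows, dim, value):
    """Fix one dimension to a single value."""
    return [r for r in rows if r[dim] == value]


def dice(rows, conditions):
    """Constrain several dimensions at once; conditions maps dim -> allowed values."""
    return [r for r in rows
            if all(r[d] in allowed for d, allowed in conditions.items())]


by_state = roll_up(facts, STATE)             # roll-up: cities summarized to state
by_city = roll_up(facts, STATE, CITY)        # drill-down: state broken out by city
atlanta = slice_(facts, CITY, "Atlanta")     # slice on one city
atl_elec = dice(facts, {CITY: {"Atlanta"}, PRODUCT: {"Electronics"}})
```

Roll-up and drill-down are the same aggregation run at different levels of detail, which is why a single `roll_up` helper covers both here.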
IIE-112
On-Line Analytical Processing
Data Cube
A Multidimensional Array
Each Attribute is a Dimension
In Example Below, the Data Must be Interpreted so
that it Can be Aggregated by Region/Product/Date
Product      Store     Date     Sale
acron        Rolla,MO  7/3/99   325.24
budwiser     LA,CA     5/22/99  833.92
large pants  NY,NY     2/12/99  771.24
3’ diaper    Cuba,MO   7/30/99  81.99
Cube Axes:
Product: Pants, Diapers, Beer, Nuts
Region: West, East, Central, Mountain, South
Date: Jan, Feb, March, April
IIE-113
On-Line Analytical Processing
For BMI – Imagine a Data Table with Patient Data
Define Axis
Summarize Data
Create Perspective to Match Research Goal
Essentially De-identified Data Mart
Patient  Med      BirthDat  Dosage
Steve    Lipitor  1/1/45    10mg
John     Zocor    2/2/55    80mg
Harry    Crestor  3/3/65    5mg
Lois     Lipitor  4/4/66    20mg
Charles  Crestor  7/1/59    10mg
Cube Axes:
Medication: Lescol, Crestor, Zocor, Lipitor
Dosage: 5, 10, 20, 40, 80
Decade: 1940s, 1950s, 1960s, 1970s
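A rough sketch of how such a patient table might be summarized into a cube with Medication, Dosage, and Decade axes is shown below. The rows are illustrative values modeled on the patient table above, and the helper `decade` is a hypothetical name.

```python
from collections import Counter

# Illustrative rows modeled on the patient table:
# (patient, medication, birth_year, dosage_mg)
patients = [
    ("Steve", "Lipitor", 1945, 10),
    ("Harry", "Crestor", 1965, 5),
    ("Lois",  "Lipitor", 1966, 20),
]


def decade(year):
    """Generalize a birth year to its decade, e.g. 1945 -> '1940s'."""
    return f"{year // 10 * 10}s"


# Cube cell = (medication, dosage, decade); value = number of patients.
# Names are dropped and birth dates generalized, so no direct identifiers remain.
cube = Counter((med, dose, decade(year)) for _name, med, year, dose in patients)

# Summarize over dosage to get patients per medication and decade.
by_med_decade = Counter()
for (med, _dose, dec), n in cube.items():
    by_med_decade[(med, dec)] += n
```

Dropping names does not by itself guarantee anonymity: cells with very small counts can still point to an individual, so a real de-identified mart would typically also suppress or merge such cells.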
IIE-114
Examples of Data Mining
The Slicing Action
A Vertical or Horizontal Slice Across Entire Cube
[Figure: Multi-Dimensional Data Cube (Months by Products by Sales), Slice on City Atlanta]
IIE-115
Examples of Data Mining
The Dicing Action
A Slice First Identifies One Dimension
A Selection of Any Cube within the Slice which
Essentially Constrains All Three Dimensions
[Figure: Dice on Electronics and Atlanta, March 2000 (Months by Products by Sales)]
IIE-116
Examples of Data Mining
Drill Down: Takes a Facet (e.g., Q1) and Decomposes into Finer Detail
Drill Down on Q1: Jan, Feb, March
Roll Up: Combines Multiple Dimensions
From Individual Cities to State
Roll Up on Location (State, USA)
[Figure: Products and Sales by Quarter (Q1 Q2 Q3 Q4)]
IIE-117
Mining Other Types of Data
Analysis and Access Dramatically More Complicated!
Time Series Data for Glucose, BP, Peak Flow, etc.
Spatial databases
Multimedia databases
World Wide Web
Time series data
Geographical and Satellite Data
IIE-118
Advantages/Objectives of Data Mining
Descriptive Mining
Discover and Describe General Properties
60% of People who Buy Beer on Friday also have
Bought Nuts or Chips in the Past Three Months
Predictive Mining
Infer Interesting Properties based on Available
Data
People who Buy Beer on Friday usually also Buy
Nuts or Chips
Result of Mining
Order from Chaos
Mining Large Data Sets in Multiple Dimensions
Allows Businesses, Individuals, etc. to Learn about
Trends, Behavior, etc.
Impact on Marketing Strategy
IIE-119
Data Mining Methods (1)
Association
Discover the Frequency of Items Occurring
Together in a Transaction or an Event
Example
80% of Customers who Buy Milk also Buy Bread
Hence - Bread and Milk Adjacent in Supermarket
50% of Customers Forget to Buy Milk/Soda/Drinks
Hence - Available at Register
Prediction
Predicts Some Unknown or Missing Information
based on Available Data
Example
Forecast Sale Value of Electronic Products for Next
Quarter via Available Data from Past Three Quarters
IIE-120
Association Rules
Motivated by Market Analysis
Rules of the Form
Item1 ^ Item2 ^ … ^ Itemk → Itemk+1 ^ … ^ Itemn
Example
“Beer ^ Soft Drink → Pop Corn”
Problem: Discovering All Interesting Association
Rules in a Large Database is Difficult!
Issues
Interestingness
Completeness
Efficiency
Basic Measurement for Association Rules
Support of the Rule
Confidence of the Rule
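The two basic measurements can be computed directly from a list of transactions. In this sketch, support of a rule is the fraction of transactions containing every item in the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. The baskets are hypothetical and exist only to make the arithmetic concrete.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


# Hypothetical market baskets, for illustration only.
baskets = [
    {"beer", "soft drink", "popcorn"},
    {"beer", "soft drink", "popcorn", "nuts"},
    {"beer", "soft drink"},
    {"milk", "bread"},
]

# Rule under test: beer ^ soft drink -> popcorn
rule_support = support(baskets, {"beer", "soft drink", "popcorn"})
rule_confidence = confidence(baskets, {"beer", "soft drink"}, {"popcorn"})
```

Here two of the four baskets contain all three items, and two of the three baskets containing beer and a soft drink also contain popcorn. Discovering all such rules is hard because the number of candidate item sets grows exponentially with the number of items.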
IIE-121
Data Mining Methods (2)
Classification
Determine the Class or Category of an Object
based on its Properties
Example
Classify Companies based on the Final Sale Results in
the Past Quarter
Clustering
Organize a Set of Multi-dimensional Data Objects
in Groups to Minimize Inter-group Similarity
and Maximize Intra-group Similarity
Example
Group Crime Locations to Find Distribution Patterns
IIE-122
Classification
Two Stages
Learning Stage: Construction of a Classification
Function or Model
Classification Stage: Prediction of Classes of
Objects Using the Function or Model
Tools for Classification
Decision Tree
Bayesian Network
Neural Network
Regression
Problem
Given a Set of Objects whose Classes are Known
(Training Set), Derive a Classification Model
which can Correctly Classify Future Objects
IIE-123
An Example
Attributes
Attribute    Possible Values
outlook      sunny, overcast, rain
temperature  continuous
humidity     continuous
windy        true, false
Class Attribute - Play/Don’t Play the Game
Training Set
Values that Set the Condition for the Classification
What are the Patterns Below?
Outlook   Temperature  Humidity  Windy  Play
sunny     85           85        false  No
overcast  83           78        false  Yes
sunny     80           90        true   No
sunny     72           95        false  No
sunny     72           70        false  Yes
…         …            …         …      ...
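To make the two stages concrete, here is a minimal sketch that learns a one-attribute rule (a decision stump on humidity) from the five training rows shown, then applies it to new values. It is not a full decision-tree learner, the remaining rows of the training set are elided on the slide, and the function names are hypothetical, so the learned threshold reflects only the rows listed.

```python
# Training rows from the table above: (outlook, temperature, humidity, windy, play)
training = [
    ("sunny",    85, 85, False, "No"),
    ("overcast", 83, 78, False, "Yes"),
    ("sunny",    80, 90, True,  "No"),
    ("sunny",    72, 95, False, "No"),
    ("sunny",    72, 70, False, "Yes"),
]
HUMIDITY, PLAY = 2, 4


def learn_humidity_stump(rows):
    """Learning stage: pick the humidity threshold with the fewest training errors."""
    best = None
    for threshold in sorted({r[HUMIDITY] for r in rows}):
        # Rule under test: humidity <= threshold -> "Yes", otherwise "No".
        errors = sum(
            ("Yes" if r[HUMIDITY] <= threshold else "No") != r[PLAY] for r in rows
        )
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best


def classify(threshold, humidity):
    """Classification stage: apply the learned rule to a new object."""
    return "Yes" if humidity <= threshold else "No"


threshold, errors = learn_humidity_stump(training)
```

On these five rows, the rule "play when humidity is at most 78" separates the classes without error. Whether that pattern holds on the full training set cannot be checked from the slide.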
IIE-124
Data Mining Methods (3)
Summarization
Characterization (Summarization) of General
Features of Objects in the Target Class
Example
Characterize People’s Buying Patterns on the Weekend
Potential Impact on “Sale Items” & “When Sales Start”
Department Stores with Bonus Coupons
Discrimination
Comparison of General Features of Objects
Between a Target Class and a Contrasting Class
Example
Comparing Students in Engineering and in Art
Attempt to Arrive at Commonalities/Differences
IIE-125
Summarization Technique
Attribute-Oriented Induction
Generalization using Concept Hierarchy (Taxonomy)
Item Table:
barcode  category    brand       content    size
14998    milk        Dairyland   Skim       2L
12998    mechanical  MotorCraft  valve 23a  12in
…        …           …           …          ...
Concept Hierarchy:
food: Milk, …, bread
Milk: Skim milk, …, 2% milk (Brands: Lucern, …, Dairyland)
bread: White bread, …, whole wheat (Brands: Wonder, …, Safeway)
Generalized Table:
Category  Content  Count
milk      skim     280
milk      2%       98
…         …        ...
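Attribute-oriented induction can be sketched as two moves: drop attributes that are too specific to summarize (barcode, brand), then climb the concept hierarchy and count identical generalized tuples. The item rows, the extra barcodes, and the `parent` mapping below are hypothetical and serve only to show the mechanics.

```python
from collections import Counter

# Hypothetical item-level rows: (barcode, category, brand, content)
items = [
    (14998, "milk",  "Dairyland", "skim"),
    (14999, "milk",  "Lucern",    "skim"),
    (15000, "milk",  "Dairyland", "2%"),
    (20001, "bread", "Wonder",    "white"),
]

# Step 1: remove barcode and brand, then merge identical tuples with a count.
generalized = Counter(
    (category, content) for _barcode, category, _brand, content in items
)

# Step 2: climb one level of the concept hierarchy (category -> parent concept).
parent = {"milk": "food", "bread": "food"}
by_parent = Counter()
for (category, _content), n in generalized.items():
    by_parent[parent[category]] += n
```

The result of step 1 has the same shape as the Category/Content/Count summary on the slide. How far to climb the hierarchy is a judgment call that trades detail for compactness.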
IIE-126
Why is Data Mining Popular?
Technology Push
Technology for Collecting Large Quantity of Data
Bar Code, Scanners, Satellites, Cameras
Technology for Storing Large Collection of Data
Databases, Data Warehouses
Variety of Data Repositories, such as Virtual Worlds,
Digital Media, World Wide Web
Corporations want to Improve Direct Marketing and
Promotions - Driving Technology Advances
Targeted Marketing by Age, Region, Income, etc.
Exploiting User Preferences/Customized Shopping
What is Potential for BMI?
How do you see Data Mining Utilized?
What are Key Issues to Worry About?
IIE-127
Requirements & Challenges in Data Mining
Security and Social
What Information is Available to Mine?
Preferences via Store Cards/Web Purchases
What is Your Comfort Level with Trends?
User Interfaces and Visualization
What Tools Must be Provided for End Users of
Data Mining Systems?
How are Results for Multi-Dimensional Data
Displayed?
Performance Guarantees
Range from Real-Time for Some Queries to
Long-Term for Other Queries
Data Sources of Complex Data Types or Unstructured
Data - Ability to Format, Clean, and Load Data Sets
IIE-128
Concluding Remarks
We’ve looked at:
Informatics
Information Engineering
Information Usage and Repositories
Focused on Their Applicability and Relevance for
BMI
Likely Generated More Questions than Answers
IIE-129