Informatics and Information Engineering
CSE
300
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
371 Fairfield Road, Box U-255
Storrs, CT 06269-2155
[email protected]
http://www.engr.uconn.edu/~steve
(860) 486 - 4818
Copyright © 2008 by S. Demurjian, Storrs, CT.
Portions of these slides are being used with the permission
of Dr. Ling Liu, Associate Professor,
College of Computing, Georgia Tech.
IIE-1
Overview



Informatics
 What is Informatics?
 What is Biomedical Informatics?
 What are Key Biomedical Informatics Challenges?
Information Engineering
 Data vs. Information vs. Knowledge
 What is Science? What is Engineering?
 What is Information Consistency?
Information Usage and Repositories
 How do we Store and Utilize Information?
 Role of Web in Informatics
 Sharing, Collaboration, and Security
 Databases vs. Data Mining
IIE-2
Informatics



Informatics is:
 Management and Processing of Data
 From Multiple Sources/Contexts
 Involves Classification (Ontologies), Collection,
Storage, Analysis, Dissemination
Informatics is Multi-Disciplinary
 Computing (Model, Store, Process Information)
 Social Science (User Interactions, HCI)
 Statistics (Analysis)
Informatics Can Apply to Multiple Domains:
 Business, Biology, Fine Arts, Humanities
 Pharmacology, Nursing, Medicine, etc.
IIE-3
What is Informatics?

Heterogeneous Field –
Interaction between
People, Information and
Technology
 Computer Science
and Engineering
 Social Science
(Human Computer
Interface)
 Information Science
(Data Storage,
Retrieval and
Mining)
Informatics
People
Information
Technology
Adapted from Shortliffe textbook
IIE-4
What is Biomedical Informatics (BMI)?

BMI is Information and its Usage Associated with the
Research and Practice of Medicine Including:
 Clinical Informatics for Patient Care
 Medical Record + Personal Health Record

Bioinformatics for Research/Biology to Bedside
 From Genomics To Proteomics

Public Health Informatics (State and Federal)
 Tracking Trends in Public Sector

Clinical Research Informatics
 Deidentified Repositories and Databases
 Facilitate Epidemiological Research and Ongoing
Clinical Studies (Drug Trials, Data Analysis, etc.)
IIE-5
What are Key BMI Focal Areas?





T1 Research
 Transition Bench Results into Clinical Research
Clinical Research
 Applying Clinical Research Results via Trials with
Patients on Medication, Devices, Treatment Plans
T2 Research
 Translating “Successful” Clinical Trials into
Practice and the Community
Clinical Practice
 Tracking all of the Information Associated with a
Patient and his/her Care
Integrated and Inter-Disciplinary Information
Spectrum
IIE-6
What is Medical Informatics?




Clinical Informatics, Pharmacy Informatics
Public Health Informatics
Consumer Health Informatics
Nursing Informatics
Systems and People Issues
 Intended to Improve Clinical outcomes,
Satisfaction and Efficiency
 Workflow Changes, Business Implications,
Implementation, etc…
 Patient Centered – Personal Health Record and
Medical Home
 Care Centered – Pay for Performance, Improving
Treatment Compliance
IIE-7
What is Bioinformatics?



Focused on Research Tools for T1:
 Genomic and Proteomic Tools, Evaluation
Methods, Computing And Database Needs
 Information Retrieval and Manipulation of Large
Distributed (caBIG) Data Sets
(cabig.cancer.gov/index.asp)
 Often Requires Grid Computing
 Includes Cancer and Immunology Research
Increasing Need to Tie These Separate Types of
Systems Together = Personalized Medicine
Biology and the Bedside (www.i2b2.org)
IIE-8
Where is Data/How is it Used?



Medical And Administrative Data Found in Clinical
Information Systems (CIS) Such As:
 Hospital Info. Systems, Electronic Medical Records
 Personal Health Records…
 Pharmacy, Nursing, Picture Archiving Systems
 Complex Data Storage and Retrieval – Many
Different Systems
T1 Research Increasingly Reliant on CIS
T2 Research is Reliant on:
 End Systems for Embedding EBM (Evidence-Based Medicine) Guidelines
 Measuring Outcomes, Looking at Policy
IIE-9
What are Major Informatics Challenges?




Shortage of Trained People Nationally
Slows adoption of Health Information Technology
Results in Poor Planning and Coordination,
Duplication of Efforts and Incomplete Evaluation
What are Critical Needs?
 Dually Trained Clinicians or Researchers in
Leadership of some Initiatives
 Connect all folks with Informatics Roles across
Institutions to Improve Efficiency
 Multi-Disciplinary: CSE, Statistics, Biology,
Medicine, Nursing, Pharmacy, etc.
Emerging Standards for Information Modeling and
Exchange (www.hl7.org) based on XML
IIE-10
Information Engineering



Data vs. Information vs. Knowledge
 How do we Differentiate Between them?
 Where are they used in BMI?
Science vs. Engineering
 What is each of their Roles in Informatics?
 How can we Engineer Information?
 What is their Role in BMI?
What is Information Engineering?
 What are the Unique Challenges and
Opportunities?
 What is Available Today and Tomorrow?
IIE-11
From American Heritage



Data
 Information, esp. information organized for
analysis or used as the basis for a decision.
 Numerical information in a form suitable for
processing by computer.
Information
 The act of informing or the condition of being
informed; communication of knowledge.
 A non-accidental signal used as an input to a
computer or communications system.
Knowledge
 The state or fact of knowing.
 The sum or range of what has been perceived,
discovered, or learned.
 Specific information about something.
IIE-12
From Webster’s 9th Collegiate



Data
 Factual information (e.g. statistics) used as a basis
for reasoning, discussion, or calculation.
Information
 The communication of knowledge or intelligence
 Something (as a message, experimental data, or a
picture) which justifies change in a construct (as a
plan or theory) that represents physical or mental
experience or another construct
 quantitative measure of the content of information
Knowledge
 The fact or condition of having information or of
being learned.
 The sum of what is known: the body of truth,
information, and principles acquired by mankind.
IIE-13
Data vs. Information vs. Knowledge




Overlapping Definitions
Conflicting Definitions
Agreement on Data
Knowledge and Information - Synonyms
Discussion Questions:
 Equivalence of Knowledge/Information?
 How can we Distinguish them?
 Do these Three Terms Cover Possibilities?
IIE-14
Data, Information, and Knowledge in BMI



Data – Basic Level
 BP, Pulse, Temperature
 Peak Flow, Glucose Level, Biopsy Result
 X-Ray, MRI, CAT Scan
Information - First level of Interpretation
 BPs, Peak Flow, Glucose over Time
 Interpreting Scan (Radiologist) or Biopsy Result
(Oncologist)
Knowledge – Applying Experience towards Diagnosis
 What can Low Peak Flows over Time lead to?
 What Next Step after Positive Scan or Biopsy?
 What if Glucose Level is Yo-yoing?
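The three levels above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the glucose readings are made up, and `swing_limit` is a hypothetical threshold standing in for clinical experience, not clinical guidance.

```python
from statistics import mean

# Data: raw glucose readings (mg/dL) for one patient; values are made up.
readings = [95, 180, 70, 210, 65, 190]

def summarize(values):
    """Information: a first-level interpretation of the raw data over time."""
    return {
        "mean": mean(values),
        "low": min(values),
        "high": max(values),
        "swing": max(values) - min(values),
    }

def flag_yoyo(summary, swing_limit=100):
    """Knowledge (stand-in): a rule applied to the summary.

    swing_limit is a hypothetical threshold, not clinical guidance.
    """
    return summary["swing"] > swing_limit

info = summarize(readings)
needs_review = flag_yoyo(info)
```

The raw list is data, the summary is information, and the rule is a crude stand-in for the knowledge a provider applies when glucose is yo-yoing.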
IIE-15
From American Heritage


Science
 The observation, identification, description,
experimental investigation, and theoretical
explanation of natural phenomena.
 Methodological activity, discipline, or study.
 An activity that appears to require study & method.
 Knowledge, esp. gained through experience.
Engineering
 The application of scientific and mathematical
principles to practical ends such as the design,
construction, and operation of efficient and
economical structures, equipment, and systems.
IIE-16
From Webster’s 9th Collegiate


Science
 The state of knowing: knowledge as distinguished
from ignorance or misunderstanding
 A department of systematized knowledge as an
object of study
 A system or method reconciling practical ends with
scientific laws.
Engineering
 The application of science and mathematics by
which the properties of matter and the sources of
energy in nature are made useful to people in
structures, machines, products, systems, and
processes.
IIE-17
Science and Engineering in BMI


Science
 Data/Information Collection & Analysis to Reach
Hypothesis
 Patients with CHF and Lipitor have Fewer Heart
Attacks than CHF and Baby Aspirin
 Verify in Clinical Research/Epidemiological Study
Engineering
 Usage of Information in Practice
 Apply Scientific Results to Medical Practice
 Image Processing used to Identify Tumors in CT
and MRI Scans
 Transfer of Radiologist’s Knowledge into Computer
Based (Assisted) Solution
 An Engineering Solution to Scientific Result
IIE-18
What is Information Engineering?



Incorporation of an Engineering Approach and
Discipline to the Generation of Information and the
Promotion of the Better Use of Information and
Resources
Information Engineering Unifies and Combines:
 Software Engineering
 Database Engineering
 Security Engineering
 Performance Engineering
 Etc...
Moral: Systems Cannot and Must Not be Engineered
in a Vacuum!
Particularly true in BMI (T1, T2, Clinical Research,
and Clinical Practice)
IIE-19
Information Engineering is Motivated by:



Realization that Management/Control of Information
will be a Primary Concern as we Continue through the
1990s and into the 21st Century
Currently in an Age of Information - Volume and
Complexity Dependencies
Critical Systems Heavily Depend on Information:
 Airline/Hotel/Auto Reservations
 Telecommunications
 Banking/ATMs
 ATM/Credit Cards at Gas Stations/Supermarkets
 Credit Bureaus Electronically Collect Information
from Many Diverse Sources
 E-Tailing
 Medical Care/All Aspects of BMI
IIE-20
Info. Engrg. - Challenge for 21st Century



Timely and Efficient Utilization of Information
 Significantly Impacts Productivity
 Supports and Promotes Collaboration for
Competitive Advantage
 Use Information in New and Different Ways
Collection, Synthesis, Analyses of Information
 Better Understanding of Processes, Sales,
Productivity, etc.
 Dissemination of Only Relevant/Significant
Information - Reduce Overload
Implications for BMI?
 Sharing of Results – Benefit Mankind
 Ability to Research Rare Diseases
 Are there Unknown Isolated “Cures”?
IIE-21
How is Information Engineered?






Careful Thought to its Definition/Purpose & Thorough
Understanding of its Intended Usage/Potential Impact
Ensure and Maintain its Consistency
 Quality, Correctness, and Relevance
Protect and Control its Availability (Secure Access)
 Who can Access What Information in Which
Location and at What Time?
Long-Term Persistent Storage/Recoverability
 Cost, Reusability, Longitudinal, and Cumulative
Experience
Integration of Past, Present and Future Information via
Intranet and Internet Access
What are Implications/Challenges for BMI?
 Let’s Discuss Briefly…
IIE-22
Towards Information Consistency

Consistency of Information is Key!
Consistency Gauged with respect to:
 Usage of Information
 Persistency of Information
 Integrity/Security of Information
 Allowable Values and Protection from Misuse

Validity (Relevance) of Information
 Means Something to Someone in a Positive Way

Discussion Questions:
 Why is Consistency Important for BMI?
 How is Consistency Attained for BMI?
 What Else Impacts Consistency for BMI?
IIE-23
What's Available to Support IE?


What Can be Provided to Make the Advanced
Application Design Process:
 More Complete?
 More Robust?
 More Responsive?
 Less Error Prone?
Current Choices to Support Information Engineering:
 Conventional Programming Languages and Data
Models
 Object-Oriented Programming Languages
 Object-Oriented DBS
 XML Databases
 Middleware and SOA (Web)
 Data Mining/Warehouses
IIE-24
What are Key Questions?


Focus on Information and its Behavior
 What are Different Kinds of Information?
 How is Information Manipulated?
 Is Same Information Stored in Different Ways?
 What are Information Interdependencies?
 Will Information Persist? Long-Term DB?
Versions of Information?
 What Past Info. is Needed from Legacy DBs or
Applications?
 Who Needs Access to What Info. When?
 What Information is Available Across WWW?
All of these Questions Apply to BMI!
IIE-25
Information Usage and Repositories


How do we Store and Utilize Information?
 Databases
 Data Mining
What are Key Issues?
 Information Sharing/Data Correctness
 Collaboration
1. Among Providers and Researchers
2. Among Providers and Patients
3. Among Patients (Support Groups)

Security
1. Control of Patient Information (De-identified)
2. Secure Exchange/Patient Ownership
3. Establish Custom Patient Controlled Groups

What is the Role of Web in Informatics?
IIE-26
The Role of a Database







Database is a Norm in Today's and Tomorrow's
Applications
Usage of Information Tightly Linked to its Storage
Integration of Database - Key Component
Support Many Representations of “Same” Information
Promotes Retrieval of Information Geared Towards
User Needs and Responsibilities
Gap Exists Between Standalone Programming
Applications and Database Systems
For BMI:
 Database (Data Warehouse) is a Key Feature
 Need for Access to Data (De-identified)
 Need to Share and Interact among Stakeholders
IIE-27
DBMS Architecture

DBMS Languages
 Data Definition Language (DDL)
 Data Manipulation Language (DML)
 From Embedded Queries or DB Commands Within a
Program
 “Stand-alone” Query Language


Host Language:
 DML Specification (e.g., SQL) is Embedded in a
“Host” Programming Language (e.g., Java, C++)
DBMS Interfaces
 Menu-Based Interface
 Graphical Interface
 Forms-Based Interface
 Interface for DBA (DB Administrator)
IIE-28
ANSI/SPARC - Three Schema Architecture


External Data Schema (Users’ view)
Conceptual Data Schema (Logical Schema)
Internal Data Schema (Physical Schema)
IIE-29
How are these Used for BMI?



Internal Data Schema (Physical Schema)
 Hidden Data Representation for Storage of BMI
Data in Proprietary Format
 Under the Control of DB System
Conceptual Data Schema (Logical Schema)
 The Data Model for the BMI Application
 Access to Schema Controllable via SQL
External Data Schema (Users’ view)
 Subsets of the Data Model for Different Users
 External View for Patients
 External View for Providers
 External View for Clinical Researchers
 Need Ability for a Patient to Control Access to
his/her Own External View
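A minimal sketch of external views over one conceptual schema, using Python's built-in `sqlite3` module. The `visit` table, its columns, and the view names are hypothetical and chosen only for illustration. SQLite has no GRANT statement, so the sketch shows the view definitions only; a full DBMS would also control who may query each view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conceptual schema: one hypothetical table for the BMI application.
    CREATE TABLE visit (
        patient_id INTEGER, patient_name TEXT, physician TEXT,
        visit_date TEXT, diagnosis TEXT
    );
    INSERT INTO visit VALUES (1, 'Pat A', 'Dr. D', '2008-01-01', 'Flu');
    INSERT INTO visit VALUES (2, 'Pat B', 'Dr. D', '2008-01-02', 'Asthma');

    -- External view for providers: identified detail.
    CREATE VIEW provider_view AS
        SELECT patient_id, patient_name, visit_date, diagnosis FROM visit;

    -- External view for clinical researchers: de-identified (no name, no id).
    CREATE VIEW researcher_view AS
        SELECT visit_date, diagnosis FROM visit;
""")

def patient_view(conn, patient_id):
    """External view for one patient: only that patient's own rows."""
    return conn.execute(
        "SELECT visit_date, physician, diagnosis FROM visit WHERE patient_id = ?",
        (patient_id,),
    ).fetchall()

researcher_rows = conn.execute(
    "SELECT * FROM researcher_view ORDER BY visit_date"
).fetchall()
```

Each audience sees a subset of the same conceptual schema, and the researcher view never exposes the identifying columns.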
IIE-30
Data Independence



Ability of Application Programs to Remain Unaffected
by Changes in Irrelevant Parts of the
Conceptual Data Representation, Data Storage
Structure and Data Access Methods
Invisibility (Transparency) of the Details of Entire
Database Organization, Storage Structure and Access
Strategy to the Users
 Both Logical and Physical
Recall Software Engineering Concepts:
 Abstraction: the Details of an Application's
Components Can Be Hidden, Providing a Broad
Perspective on the Design
 Representation Independence: Changes Can Be
Made to the Implementation that have No Impact
on the Interface and Its Users
IIE-31
Physical Data Independence



The Ability to Modify the Physical Data
Representation Without Causing Application Programs
to Be Rewritten
Examples:
 Transparency of the Physical Storage Organization
 Transparency of Physical Access Paths
 Numeric Data Representation and Units
 Character Data Representation
 Data Coding
 Physical Data Structure
All of these are Vital for BMI – Particularly if we Use
Standards to Achieve Application Independence
IIE-32
Physical Data Independence


Physical Data Independence is a Measure of How
Much the Internal Schema Can Change Without
Affecting the Application Programs
In BMI – Allows us to Plug and Play Different DBMS
Platforms – Extensible and Versatile Integration
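One small instance of physical data independence, sketched with Python's built-in `sqlite3` module: an index (a physical access path) is added while the application query stays untouched. The `lab_result` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lab_result (patient_id INTEGER, test_code TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO lab_result VALUES (?, ?, ?)",
    [(1, "GLU", 95.0), (2, "GLU", 180.0), (1, "BP_SYS", 120.0)],
)

QUERY = "SELECT value FROM lab_result WHERE patient_id = ? AND test_code = ?"

def glucose_for(patient_id):
    """Application code: depends only on table and column names."""
    return conn.execute(QUERY, (patient_id, "GLU")).fetchall()

before = glucose_for(1)

# Physical change: add an access path. The application query is not edited.
conn.execute("CREATE INDEX idx_lab_patient ON lab_result (patient_id, test_code)")

after = glucose_for(1)
```

The application returns the same rows before and after the change; only the internal schema differs.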
IIE-33
Logical Data Independence



Transparency of the Entire Database Conceptual
Organization
As a Result:
 Transparency of Logical Access Strategy
 Addition of New Entities
 Removal of Entities
 Virtual (Derived) Data Items
 Union of Records
Views
 Common Mechanism for Logical Data
Independence
 Provide Different Logical Data Contexts to
Different Users Based on Their Needs
 Update Views vs. Read-Only Views
IIE-34
Logical Data Independence


Logical Data Independence is a Measure of How
Much the Conceptual Schema Can Change Without
Affecting the Application Programs
For BMI – Allows us to Separate End User
Applications (Patients, Providers, etc.) from DB
IIE-35
Classic Information System Design
IIE-36
Data vs. Information
IIE-37
Programming Language Systems vs. DBS


Similarities and Differences Exist At System Level:
 Shared Resources vs. Shared Data
 Execution Granularity - Programs vs. Transactions
 Granularity Difference - Files vs. Instances
Classic Problem of “Impedance Mismatch”
 Thin Layer of Overlap between PLS (C++, Java,
etc.) and Relational Database System
 What will Future Bring?
 SQL3 with Object-Oriented Extensions
 XML Databases (Apache Xindice, Sedna, etc.)
[Figure: Today, PLS paired with RDBS; Tomorrow?, PLS paired with XML DBS]
IIE-38
What is Today’s Impedance Mismatch?


Relational Data Organizes Information into Flat Files
 Relational Tables with Primary Key
 High Number of Tuples per Table (1000s & more)
 Limited Number of Tables (10-50) for Even Large
Size Application
 Limited Linkages Among Tables (Foreign Keys)
What Does BMI/PHR/EMR Require?
 For Each Patient, Track Multiple Dependencies
 Visits per Patient
 Tests per Patient
 Prescriptions per Patient


Data Inherently Complex and Interdependent
Flattened into Relational Format
IIE-39
The Health Care Application - Classes
IIE-42
The Health Care Application - Relationships
IIE-43
How Does Mismatch Occur?


On Left – OO Classes
 Inheritance
 Dependencies
Programmatic View
 C++ or Java Usage
 Staging from DB to OO
Item(Phy_Name*, Date*,
Visit_Flag, Symptom, Diagnosis, Treatment,
Presc_Flag, Pre_No, Pharm_Name, Medication,
Test_Flag, Test_Code, Spec_No, Status, Tech)

Above – Relational Tables
 Stage Data from Tables into OO (e.g. Java) format
 Utilize JDBC
 What are the Implications/Impacts?
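The staging step can be sketched as follows. The slide names JDBC with Java; this sketch swaps in Python's built-in `sqlite3` module so that it is self-contained. The `Visit` and `Prescription` classes are hypothetical stand-ins for the class diagrams, which are not reproduced in this transcript. It also assumes that `*` marks the primary key columns of `Item`. Tests are omitted for brevity.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Visit:
    physician: str
    date: str
    symptom: str
    diagnosis: str
    treatment: str

@dataclass
class Prescription:
    physician: str
    date: str
    pre_no: str
    pharm_name: str
    medication: str

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to columns by name
conn.execute("""CREATE TABLE Item (
    Phy_Name TEXT, Date TEXT,
    Visit_Flag INTEGER, Symptom TEXT, Diagnosis TEXT, Treatment TEXT,
    Presc_Flag INTEGER, Pre_No TEXT, Pharm_Name TEXT, Medication TEXT,
    Test_Flag INTEGER, Test_Code TEXT, Spec_No TEXT, Status TEXT, Tech TEXT,
    PRIMARY KEY (Phy_Name, Date))""")
conn.execute(
    "INSERT INTO Item (Phy_Name, Date, Visit_Flag, Symptom, Diagnosis, Treatment,"
    " Presc_Flag, Test_Flag) VALUES (?, ?, 1, ?, ?, ?, 0, 0)",
    ("Dr. D", "2008-01-01", "Fever", "Flu", "Bed Rest"),
)

def stage(row):
    """Turn one flat Item row into zero or more typed objects, driven by the flags."""
    objs = []
    if row["Visit_Flag"]:
        objs.append(Visit(row["Phy_Name"], row["Date"], row["Symptom"],
                          row["Diagnosis"], row["Treatment"]))
    if row["Presc_Flag"]:
        objs.append(Prescription(row["Phy_Name"], row["Date"], row["Pre_No"],
                                 row["Pharm_Name"], row["Medication"]))
    return objs

staged = [obj for row in conn.execute("SELECT * FROM Item") for obj in stage(row)]
```

Every flat row is unpacked by inspecting its flag columns, which is the kind of glue code the mismatch forces onto the application.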
IIE-44
Implications and Impact


Three Copies of “Same” Information in Different
 Database Table (Item)
 OO Representation – Server Side (Classes)
 GUI Display – Client Side (html/xml)
What can this Lead to?
Example Record: Dr. D, Jan 01, 08; Fever, Flu, Bed Rest; No Scripts; No Tests
Item(Phy_Name*, Date*,
Visit_Flag, Symptom, Diagnosis, Treatment,
Presc_Flag, Pre_No, Pharm_Name, Medication,
Test_Flag, Test_Code, Spec_No, Status, Tech)
IIE-45
What is one Possible Solution?


Standards and Usage of XML
 Consider CDA – Clinical Document Architecture
 Standard for Clinical (Provider) Medical Record
Clinical Record Organized as:





<patient_encounter> - location
<legal_authenticator> - MD
<originating_organization> and <provider>
<patient> - name, birthdate, gender
<body_confidentiality-”CONF1”> - note








History
Past Medical History
Medications
Allergies
Social History
Physical Exam
Vitals (BP, Resp, Temp, HR)
Etc...
IIE-46
What is one Possible Solution?

Let’s Explore this in Greater Detail
Starting with the CDA Header
<?xml version="1.0"?>
<!DOCTYPE levelone PUBLIC "-//HL7//DTD CDA Level One 1.0//EN" "levelone_1.0.dtd">
<levelone>
<clinical_document_header>
<id EX="a123" RT="2.16.840.1.113883.3.933"/>
<set_id EX="B" RT="2.16.840.1.113883.3.933"/>
<version_nbr V="2"/>
<document_type_cd V="11488-4" S="2.16.840.1.113883.6.1"
DN="Consultation note"/>
<origination_dttm V="2000-04-07"/>
<confidentiality_cd ID="CONF1" V="N" S="2.16.840.1.113883.5.1xxx"/>
<confidentiality_cd ID="CONF2" V="R" S="2.16.840.1.113883.5.1xxx"/>
<document_relationship>
<document_relationship.type_cd V="RPLC"/>
<related_document>
<id EX="a234" RT="2.16.840.1.113883.3.933"/>
<set_id EX="B" RT="2.16.840.1.113883.3.933"/>
<version_nbr V="1"/>
</related_document>
</document_relationship>
<fulfills_order>
<fulfills_order.type_cd V="FLFS"/>
<order><id EX="x23ABC" RT="2.16.840.1.113883.3.933"/></order>
<order><id EX="x42CDE" RT="2.16.840.1.113883.3.933"/></order>
</fulfills_order>
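A minimal sketch of reading a few of these header fields with Python's standard `xml.etree.ElementTree`. The header as transcribed is not closed, so the sketch uses a shortened, closed fragment built from elements copied from it. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# A closed fragment with a few elements copied from the header above.
FRAGMENT = """
<clinical_document_header>
  <id EX="a123" RT="2.16.840.1.113883.3.933"/>
  <version_nbr V="2"/>
  <document_type_cd V="11488-4" S="2.16.840.1.113883.6.1" DN="Consultation note"/>
  <origination_dttm V="2000-04-07"/>
  <confidentiality_cd ID="CONF1" V="N" S="2.16.840.1.113883.5.1xxx"/>
  <confidentiality_cd ID="CONF2" V="R" S="2.16.840.1.113883.5.1xxx"/>
</clinical_document_header>
"""

header = ET.fromstring(FRAGMENT)

doc_id = header.find("id").get("EX")
doc_type = header.find("document_type_cd").get("DN")
version = int(header.find("version_nbr").get("V"))

# Map each confidentiality ID to its code value.
confidentiality = {
    e.get("ID"): e.get("V") for e in header.findall("confidentiality_cd")
}
```

Because the structure is standardized, any receiving system can locate the same fields by element and attribute name.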
IIE-47
CDA Example - Continued
IIE-55
Information Sharing/Access: Potential Pitfalls





Another Critical Issue is Information Sharing
 Perception: How do I see/understand Data/Info?
 Differences: What is the Reality?
Dealing with Information at Different Levels
 Syntax – Format of Information
 Semantics – Meaning of Information
 Pragmatics – Usage of Information
When Unifying Databases/Information Repositories,
Must Address all Three!
Data Integrity and Data Security
 Correct and Consistent Values
 Assurance in All Secure Accesses
For BMI – All of the Above are Critical for Correct
Usage and Interpretation in All Contexts (T1, T2, …)
IIE-56
Information Syntactic Considerations




Syntax is Structure and Format of the Information
That is Needed to Support a Coalition
Incorrect Structure or Format Could Result in Anything
from a Simple Error Message to a Catastrophic Event
For Sharing, Strict Formats Need to be Maintained
Health Care Data Suffers from Lack of Standards
 Standards for Diagnosis (Insurance Industry)
 Emerging Standards Include:
 Health Level 7 (HL7)
 Based on XML

Formats Non-Standard for Different Health
Organizations, Insurers, Pharmacy Networks, etc.
 N*N Translations Prone to Errors!
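The arithmetic behind the N*N concern can be sketched as below. It assumes one directed translator per ordered pair of formats, which gives n(n-1), roughly the N*N on the slide. It compares that with translating only to and from one shared standard such as HL7. The function names are illustrative.

```python
def pairwise_translators(n):
    """Direct translators needed when each of n formats maps to every other one."""
    return n * (n - 1)

def hub_translators(n):
    """Translators needed when each format maps only to and from one shared standard."""
    return 2 * n

# (pairwise, hub) counts for a few sizes of the format population.
counts = {n: (pairwise_translators(n), hub_translators(n)) for n in (3, 5, 10)}
```

With ten organizations each using its own format, the pairwise approach needs 90 translators against 20 for the shared standard, and each translator is a place where errors can enter.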
IIE-57
Information Semantics Concerns

Semantics (Meaning and Interpretation)
 NATO and US - Different Message Formats
 Distances (Miles vs. Kilometers)
 Grid Coordinates (Mils, Degrees)
 Maps (Grid, True, and Magnetic North)

What Can Happen in Health Care Data?
 Possible to Confuse Dosages of Medications?
 Weight of Patients (Pounds vs. Kilos)?
 Measurement of Vital Signs?
 Dana Farber Chemo Death – Checks/Balances
 What Others are Possible?
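One way to guard against the pounds vs. kilos confusion is to carry the unit with the value. This is a hedged sketch: the `Weight` class, the `dose_mg` helper, and the `mg_per_kg` figure are hypothetical and are not clinical guidance.

```python
from dataclasses import dataclass

LB_PER_KG = 2.20462  # approximate pounds per kilogram

@dataclass(frozen=True)
class Weight:
    value: float
    unit: str  # "kg" or "lb"

    def to_kg(self):
        if self.unit == "kg":
            return self.value
        if self.unit == "lb":
            return self.value / LB_PER_KG
        raise ValueError(f"unknown weight unit: {self.unit!r}")

def dose_mg(weight, mg_per_kg):
    """Weight-based dose; mg_per_kg is a hypothetical parameter."""
    return weight.to_kg() * mg_per_kg

correct = dose_mg(Weight(154, "lb"), 2.0)   # about 140 mg
mistaken = dose_mg(Weight(154, "kg"), 2.0)  # 308 mg if pounds are misread as kilos
```

The same number with the wrong unit more than doubles the computed dose. Here the syntax is valid but the semantics are wrong.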
IIE-58
Syntactic & Semantic Considerations






What’s Available to Support Information Sharing?
How do we Ensure that Information can be Accurately
and Precisely Exchanged?
How do we Associate Semantics with the Information
to be Exchanged?
What Can we Do to Verify the Syntactic Exchange and
that Semantics are Maintained?
Can Information Exchange Facilitate Federation?
Can this be Handled Dynamically?
Or, Must we Statically Solve Information Sharing in
Advance?
IIE-59
Information Pragmatics Considerations



Pragmatics Require that we Totally Understand
Information Usage and Information Meaning
 What are the Critical Information Sources?
 How will Information Flow Among Them?
 What Systems Need Access to these Sources?
 How will that Access be Delivered?
 Who (People/Roles) will Need to See What When?
 How will What a Person Sees Impact Other
Sources?
Focus on: Way that Information is Utilized and
Understood in its Specific Context
Can Medical Info be Misused even if Understood?
IIE-60
Information Pragmatics Considerations


What are Pragmatics Issues re. Underinsured and
Uninsured Populations in an Event?
 How Can we Use Info Effectively if we Don’t
Know if it is Complete?
 Has Info from All Sources Been Collected?
 What Happens if Same Patient in Different
Repositories Can’t be Reconciled?
 What if Patient is Unresponsive and Can’t Supply
any Info?
 Is Usage of Info Complicated due to
Incompleteness? Multiple Locations?
Or, if the Event is Major – will all Patient
Populations Suffer Same Substandard Care?
IIE-61
Collaboration and Security

Two Concepts go Hand in Hand
Strong Parallels
 Collaboration
 Among Providers and Researchers
 Among Providers and Patients
 Among Patients (Support Groups)

Security
 Control of Patient Information (De-identified)
 Secure Exchange/Patient Ownership
 Establish Custom Patient Controlled Groups


Let’s Explore them Both via our Semester Project
Also Consider Emergent and Policy Issues
IIE-62
Collaboration: Providers and Researchers



Providers
 Seeking new Treatment Plans
 Looking for Clinical Research Studies for Patients
 Looking to Communicate with Clinical
Researchers
Researchers
 Publish Evidence-Based Guidelines
 New Treatments
 Collect Data on Provider Visits
 Provide Forum to Discuss with Provider
 Allow Provider to Upload Anonymous Outcomes
Also – Need to Collaborate Among Researchers of All
Types (Sharepoint, WIKIs, etc.)
IIE-63
Collaboration: Providers and Patients

Patients
 Open Personal Health Record to Providers
 Patients have
 Data Entry Facility for Chronic Conditions
 Ability to Graph and Track their Disease
Education Materials also Available
Providers
 Securely Communicate (email) with Patients (see
https://www.relayhealth.com/rh/specific/patients/default.aspx)
 Access to Authorized Patient Data
 Tracking of Patients (to Reduce Office Visits)
 Proactive Intervention to Head off Potential
Hospitalizations/Problems via Treatment
Algorithms to Auto-Notify Based on Data Values


IIE-64
Collaboration: Among Patients

Patients
 Provide Each with a List of Support Groups
 Allow them to Join Groups or Form New Groups
 Secure Communication via:
 Email
 Chatting Environment
 Link to Actual (Physical) Meetings
Repository of Available Support Groups
Overall:
 Patients can Meet other Patients with Same Issues
 Vital for Patients with Rare Diseases
 Form On-Line Communities


IIE-65
Security: General Concepts



Authentication
 Proving you are who you Say you are
 Signing a Message
 Is the Client who S/he Says they are?
Authorization
 Granting/Denying Access
 Revoking Access
 Does the Client have Permission to do what S/he
Wants?
Encryption
 Establishing Communications Such that No One
but Receiver will Get the Content of the Message
 Symmetric Encryption
 Public Key Encryption
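As a small illustration of message authentication with a shared (symmetric) key, the sketch below uses Python's standard `hmac` and `hashlib` modules. This is a keyed message authentication code, not a public-key digital signature, and the key and message values are made up.

```python
import hashlib
import hmac

def sign(key: bytes, message: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the message with the shared key."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(key, message), tag)

shared_key = b"demo-key-not-for-real-use"
message = b"BP 120/80 recorded for patient 1"
tag = sign(shared_key, message)
```

A receiver holding the same key can tell whether the message was altered in transit. Keeping the content secret is a separate step that requires encryption.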
IIE-66
Key Security Issues




Legal and Ethical Issues
 Information that Must be Protected
 Information that Must be Accessible
Policy Issues
 Who Can See What Information When?
 Application Limits w.r.t. Data vs. Users?
System Level Enforcement
 What is Provided by the DBMS? Programming
Language? OS? Application?
 How Do All of the Pieces Interact?
Multiple Security Levels/Organizational Enforcement
 Mapping Security to Organizational Hierarchy
 Protecting Information in Organization
IIE-67
What are Key Access Control Concepts?


Assurance
 Are the Security Privileges for Each User
Adequate to Support their Activities?
 Do the Security Privileges for Each User Meet but
Not Exceed their Capabilities?
Consistency
 Are the Defined Security Privileges for Each User
Internally Consistent?
 Least-Privilege Principle: Just Enough Access

Are the Defined Security Privileges for Related
Users Globally Consistent?
 Mutual-Exclusion: Read for Some-Write for Others
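The "meet but not exceed" question can be checked mechanically by comparing the privileges a user was granted with the privileges their activities require. This is a sketch with hypothetical privilege names.

```python
def audit_privileges(granted, required):
    """Compare what a user has been granted with what their activities require."""
    return {"missing": required - granted, "excess": granted - required}

# Hypothetical privilege sets for a user in a nurse role.
nurse_required = {"record_vitals", "enter_history"}
nurse_granted = {"record_vitals", "enter_history", "write_orders"}

report = audit_privileges(nurse_granted, nurse_required)
```

An empty "missing" set means the privileges are adequate. A non-empty "excess" set means least privilege is violated.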
IIE-68
Available Security Approaches



Mandatory Access Control (MAC)
 Bell-LaPadula Security Model
 Security Classification Levels for Data Items
 Access Based on Security Clearance of User
Role Based Access Control (RBAC)
 Govern Access to Information based on Role
 Users can Play Different Roles at Different Times
Responsibilities of Users Guiding Factor
 Facilitate User Interactions while Simultaneously
Protecting Sensitive Data
Discretionary Access Control (DAC)
 Richer Set of Access Modes - Govern Access to
Information based on User Id
 Discretionary Rules on Access Privileges
 Focused on Application Needs/Requirements
IIE-69
Mandatory Security Mechanism


Typical Security Classification Levels for
Subjects/programs and Objects/resources
 Top Secret (TS) and Secret (S)
 Confidential (C) and Unclassified (U)
Rules:
 TS is the Highest and U is the Lowest Level
 TS > S > C > U
 Security Levels:





C1 is Security Clearance Given to User U1
C2 is Security Classification Given to Object O1
U1 can Access O1 iff C1 ≥ C2
This is Referred to as the Domination of U1 Over O1
Not Prevalent in BMI – But May have Relevance
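The domination rule above can be sketched directly. The sketch covers only the access check stated on this slide (C1 ≥ C2); the Bell-LaPadula model also restricts writes, which is not shown. The names `Level` and `can_access` are illustrative.

```python
from enum import IntEnum

class Level(IntEnum):
    U = 0   # Unclassified
    C = 1   # Confidential
    S = 2   # Secret
    TS = 3  # Top Secret

def can_access(clearance, classification):
    """True iff the user's clearance dominates the object's classification."""
    return clearance >= classification

# A Secret-cleared user may read a Confidential object but not a Top Secret one.
allowed = can_access(Level.S, Level.C)
denied = can_access(Level.S, Level.TS)
```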
IIE-70
Role Based Access Control (RBAC)


Focuses on Defining Roles of Typical Behavior
 Nurse, Nurse-Manager, Education-RN
 Physician, Attending-MD, Specialist
 Student, Faculty-Advisor, Head
 Focus on Duties that are Shared
During Authorization of Roles to Users
 Establish Boundaries of Access
 User Steve with Role Faculty-Advisor
 Limited to Faculty Capabilities on Peoplesoft
 Only Can Manipulate His Advisees

User Steve with Role Associate Head
 Possible Overlap in Responsibilities w/ Faculty-Advisor
 Other Activities not given to Faculty-Advisor Role
IIE-71
Why is RBAC Needed?



In Health Care, different professionals (e.g., Nurses
vs. Physicians vs. Administrators, etc.) Require Select
Access to Sensitive Patient Data
Suppose we have a Patient Access Client
 Lois playing the Nurse Role would be Allowed to
Enter Patient History, Record Vital Signs, etc.
 Steve playing M.D. Role would be Allowed to do
all of a Nurse plus Write Orders, Enter Scripts, etc.
 Vicky playing Admin Role would be Allowed to
Enter Demographic/Insurance Info.
Role Dictates Client Behavior
 Physicians Write Scripts
 Nurses Enter Patient Data (Vitals + History)
 All Access Shared Medical Record
 Access is Limited Based on Role
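The Patient Access Client example can be sketched as a role-to-permission table. The users and roles come from the slide, while the permission identifiers and the `is_allowed` helper are hypothetical names chosen for illustration.

```python
ROLE_PERMISSIONS = {
    "Nurse": {"enter_history", "record_vitals"},
    "Admin": {"enter_demographics", "enter_insurance"},
}
# Per the slide, the M.D. role can do all of a Nurse's actions plus more.
ROLE_PERMISSIONS["MD"] = ROLE_PERMISSIONS["Nurse"] | {"write_orders", "enter_scripts"}

USER_ROLES = {"Lois": "Nurse", "Steve": "MD", "Vicky": "Admin"}

def is_allowed(user, action):
    """Access is decided by the user's role, not by the user's identity."""
    role = USER_ROLES.get(user)
    return action in ROLE_PERMISSIONS.get(role, set())
```

For example, `is_allowed("Lois", "record_vitals")` holds while `is_allowed("Lois", "enter_scripts")` does not. Changing what nurses may do means editing one role entry rather than every nurse's account.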
IIE-72
Discretionary Access Control




Discretionary
 Grant Privileges to Users, Including Capabilities to
Access Specific Data Items in a Specific Mode
 Available in Most Commercial DBMSs
Aspects of DAC
 User’s Identity
 Predefined Discretionary “Rules” Defined by the
Security Administrator
 Allows User to “Delegate” Capabilities to Another
User
 Delegate Capabilities and Ability to Delegate
Role Delegation and Delegation Authority
DAC Available in SQL2
IIE-73
What is Role Delegation?



Role Delegation, a User-to-User Relationship, Allows
an Original User (OU) to Transfer Responsibility for a
Particular Role to a Delegated User (DU)
Two Major Types of Delegation
 Administratively-directed Delegation: an
Administrative Infrastructure Outside the Direct
Control of a User Mediates Delegation
 User-directed Delegation: a User (Playing a
Role) Determines If and When to Delegate a Role
to Another User
In Both, Security Administrators Still Oversee Who
Can Do What When w.r.t. Delegation
IIE-74
Why is Role Delegation Important?

Many Different Scenarios Under Which Privileges
May Want to be Passed to Other Individuals
 Large organizations often require delegation to
meet demands on individuals in specific roles for
certain periods of time
 True in Many Different Sectors
 Health Care and Financial Services
 Engineering and Academic Setting

Example:
 Reda Delegates Head Role to Steve when Traveling

Key Issues:
 Who Controls Delegation to Whom?
 How are Delegation Requirements Enforced?
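A hedged sketch of user-directed delegation with administrator-set limits, using the Reda and Steve example. The data structures, the time window, and the rule that only some roles are delegable are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Delegation:
    role: str
    original_user: str
    delegated_user: str
    start: date
    end: date

# Roles each user holds directly, and roles an administrator has marked delegable.
HELD_ROLES = {"Reda": {"Head"}, "Steve": {"Faculty-Advisor"}}
DELEGABLE_ROLES = {"Head"}

def delegate(role, ou, du, start, end):
    """User-directed delegation, allowed only within administrator-set limits."""
    if role not in HELD_ROLES.get(ou, set()):
        raise PermissionError(f"{ou} does not hold role {role}")
    if role not in DELEGABLE_ROLES:
        raise PermissionError(f"role {role} may not be delegated")
    return Delegation(role, ou, du, start, end)

def active_roles(user, today, delegations):
    """Directly held roles plus any delegated roles whose window covers today."""
    roles = set(HELD_ROLES.get(user, set()))
    roles |= {d.role for d in delegations
              if d.delegated_user == user and d.start <= today <= d.end}
    return roles

# Reda delegates the Head role to Steve for a (made-up) week of travel.
trip = delegate("Head", "Reda", "Steve", date(2008, 3, 1), date(2008, 3, 7))
```

The delegated role expires on its own when the window closes, and the delegable-role list is where the security administrator keeps control over who can pass on what.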
IIE-75
Coalitions for Clinical/Translational Science
[Figure: Coalition Participants: Pfizer; Bayer; UConn Storrs; UConn Health Center; Saint Francis, CCMC, …; DCF, DSS, etc.; NIH; FDA; NSF]
Info. Sharing - Joint R&D
Support T1, T2, and Clinical Research
Company and University Partnerships
Collaborative Funding Opportunities
Cohesive and Trusted Environment
Existing Systems/Databases and New Applications
How do you Protect Commercial Interests?
Promote Research Advancement?
Free Read for Some Data/Limited for Other?
Commercialization vs. Intellectual Property?
Balancing Cooperation with Propriety
IIE-76
Emergent Public Policy Issues


How do we Protect a Person’s DNA?
 Who Owns a Person’s DNA?
 Who Can Profit from a Person’s DNA?
 Can a Person’s DNA be Used to Deny Insurance?
Employment? Etc.
 How do you Define Security Limitations/Access?
What about i2b2 – Informatics for Integrating Biology
and the Bedside (see https://www.i2b2.org/)
 Scalable Informatics Framework to Bridge
 Clinical Research Data
 Vast Data Banks for Basic Science Research

Goal: Understand Genetic Bases of Diseases
IIE-77
Emergent Public Policy Issues

Can DNA Repositories be Anonymously Available for
Medical Research?
 Do Societal Needs Trump Individual Rights?
 Can DNA be Made Available Anonymously for
Medical Research?
 De-identified Data Repositories
 Privacy Protecting Data Mining
International Repository Might Allow Medical
Researchers Access to Large Enough Data Set for
Rare Conditions (e.g., Orphan Drug Act)
Individual Rights vs. Medical Advances


IIE-78
Internet and the Web

A Major Opportunity for Business
 A Global Marketplace
 Business Across State and Country Boundaries

A Way of Extending Services
 Online Payment vs. VISA, Mastercard

A Medium for Creation of New Services
 Publishers, Travel Agents, Teller, Virtual Yellow Pages,
Online Auctions …


A Boon for Academia
 Research Interactions and Collaborations
 Free Software for Classroom/Research Usage
 Opportunities for Exploration of Technologies in
Student Projects
What are Implications for BMI? Where is the Advantage?
IIE-79
WWW: Three Market Segments
Intranet (Corporate Network)
 Decision Support
 Mfg. System Monitoring
 Corporate Repositories
 Workgroups
Business to Business (Corporate Network to Corporate Network over the Internet)
 Information Sharing
 Ordering Info./Status
 Targeted Electronic Commerce
Internet (Provider Network, Exposure to Outside)
 Sales
 Marketing
 Information
 Services
IIE-80
Information Delivery Problems on the Net




Everyone can Publish Information on the Web
Independently at Any Time
 Consequently, there is an Information Explosion
 Identifying Information Content More Difficult
There are too Many Search Engines but too Few
Capable of Returning High Quality Data
Most Search Engines are Useful for Ad-hoc Searches
but Awkward for Tracking Changes
What are Information Delivery Issues for BMI?
 Publishing of Patient Education Materials
 Publishing of Provider Education Materials
 How Can Patients/Providers Find What They Need?
 How do they Know if it’s Relevant? Reputable?
IIE-81
Example Web Applications



Scenario 1: World Wide Wait
 A Major Event is Underway and the Latest, Up-to-the-Minute Results are Being Posted on the Web
 You Want to Monitor the Results for this Important
Event, so you Fire up your Trusty Web Browser,
Pointing at the Result Posting Site, and Wait, and
Wait, and Wait …
What is the Problem?
 The Scalability Problems are the Result of a
Mismatch Between the Data Access Characteristics
of the Application and the Technology Used to
Implement the Application
May not be Relevant to BMI: Hard to Apply Scenario
IIE-82
Example Web Applications



Scenario 2:
 Many Applications Today Need to Track
Changes in Local and Remote Data Sources and
Notify Users When Some Condition Over the
Data Source(s) is Met
 To Monitor Changes on Web, You Need to Fire up
Your Trusty Web Browser from Time to Time,
Cache the Most Recent Result, and Difference
Manually Each Time You Poll the Data Source(s)
Issue: Pure Pull is Not the Answer to All Problems
BMI: If a Patient Enters Data that Sets off a Chain
Reaction, How Can the Provider be Notified and in Turn
Notify the Patient (Bad Health Event)?
IIE-83
What is the Problem?


Applications are Asymmetric but the Web is Not
 Computation Centric vs. Information Flow Centric
Types of Asymmetry
 Network Asymmetry
 Satellite, CATV, Mobile Clients, Etc.

Client to Server Ratio
 Too Many Clients can Swamp Servers

Data Volume
 Mouse and Key Click vs. Content Delivery

Update and Information Creation
 Clients Need to be Informed or Must Poll

Clearly, for BMI, Simple Web Environment/Browser
is Not Sufficient – No Auto-Notification
IIE-84
What are Information Delivery Styles?



Pull-Based System
 Transfer of Data from Server to Client is Initiated
by a Client Pull
 Clients Determine when to Get Information
 Potential for Information to be Old Unless Client
Periodically Pulls
Push-Based System
 Transfer of Data from Server to Client is Initiated
by a Server Push
 Clients may get Overloaded if Push is Too
Frequent
Hybrid
 Pull and Push Combined
 Pull First and then Push Continually
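The three delivery styles can be contrasted with a small sketch. The `Server` class and its method names are hypothetical and the "network" is just in-process function calls; the point is only to show why a pure pull client can miss intermediate values while a hybrid client (pull once, then receive pushes) does not.

```python
class Server:
    def __init__(self):
        self.value = 0
        self.listeners = []      # callbacks registered for push delivery

    def get(self):
        # Pull: the client asks for the current value.
        return self.value

    def register(self, callback):
        self.listeners.append(callback)

    def update(self, value):
        # Push: the server notifies every registered client.
        self.value = value
        for callback in self.listeners:
            callback(value)


server = Server()

# Hybrid client: pull once for the current state, then rely on pushes.
pushed = [server.get()]
server.register(pushed.append)

# Pure pull client: polls now and once more later.
pulled = [server.get()]
server.update(1)
server.update(2)
pulled.append(server.get())

print("pull client saw:", pulled)
print("hybrid client saw:", pushed)
```

The pull client never observes the intermediate value because it did not poll between the two updates, which is the staleness risk noted for pull-based systems.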
IIE-85
Publish/Subscribe



Semantics: Servers Publish/Clients Subscribe
 Servers Publish Information Online
 Clients Subscribe to the Information of Interest
(Subscription-based Information Delivery)
 Data Flow is Initiated by the Data Sources
(Servers) and is Aperiodic
 Danger: Subscriptions can Lead to Other
Unwanted Subscriptions
Applications
 Unicast: Database Triggers and Active Databases
 1-to-n: Online News Groups
May work for Clinical Researcher to Provider Push
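A minimal publish/subscribe sketch, assuming an in-process dispatcher. The class `PubSub`, the topic names, and the messages are invented for illustration; a real deployment would add networking, persistence, and access control.

```python
from collections import defaultdict


class PubSub:
    """Toy topic-based publish/subscribe dispatcher (hypothetical API)."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Data flow is initiated by the publisher, whenever it has news.
        for callback in self.subscribers[topic]:
            callback(message)
        return len(self.subscribers[topic])


provider_a, provider_b = [], []          # stand-ins for provider inboxes
bus = PubSub()
bus.subscribe("study-results", provider_a.append)
bus.subscribe("study-results", provider_b.append)
bus.subscribe("recalls", provider_b.append)

delivered = bus.publish("study-results", "new finding")   # 1-to-n delivery
bus.publish("recalls", "drug X recalled")
print(delivered, provider_a, provider_b)
```

Each provider receives only the topics it subscribed to, which is one way a clinical researcher could push results to interested providers without polling.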
IIE-86
Design Options for Nodes

Three Types of Nodes:
 Data Sources
 Provide Base Data which is to be Disseminated

Clients
 Who are the Net Consumers of the Information

Information Brokers
 Acquire Information from Other Data Sources, Add
Value to that Information and then Distribute this
Information to Other Consumers
 By Creating a Hierarchy of Brokers, Information
Delivery can be Tailored to the Need of Many Users

Brokers may be Ideal Intermediaries for BMI!
 Act on Behalf of Patients, Providers
 Incorporate Secure Access
IIE-87
Research Challenges

Ubiquitous/Pervasive
Many computers and information
appliances everywhere,
networked together

Inherent Complexity:
 Coping with Latency (Sometimes
Unpredictable)
 Failure Detection and Recovery
(Partial Failure)
 Concurrency, Load Balancing,
Availability, Scale
 Service Partitioning
 Ordering of Distributed Events
“Accidental” Complexity:
 Heterogeneity: Beyond the Local
Case: Platform, Protocol, Plus All
Local Heterogeneity in Spades.
 Autonomy: Change and Evolve
Autonomously
 Tool Deficiencies: Language
Support (Sockets, RPC),
Debugging, Etc.
IIE-88
Infosphere
Problem: Too Many Sources, Too Much Information
Internet: Information Jungle
Infopipes: Clean, Reliable, Timely Information, Anywhere
Sources: Digital Earth, Sensors
Personalized Filtering & Info. Delivery
IIE-89
Current State-of-Art
Figure: Thin Client, Web Server, Mainframe, Database Server
IIE-90
Infosphere Scenario – for BMI
Figure: Infotaps & Fat Clients, Sensors, Variety of Servers, Many Sources, Database Server
IIE-91
Heterogeneity and Autonomy

Heterogeneity:
 How Much can we Really Integrate?
 Syntactic Integration
 Different Formats and Models
 Web/SQL Query Languages

Semantic Interoperability
 Basic Research on Ontology, Etc

Autonomy
 No Central DBA on the Net
 Independent Evolution of Schema and Content
 Interoperation is Voluntary
 Interface Technology (Support for ISVs)
 DCOM: Microsoft Standard
 CORBA, Etc...
IIE-92
Security and Data Quality

Security
 System Security in the Broad Sense
 Attacks: Penetrations, Denial of Service
 System (and Information) Survivability
 Security Fault Tolerance
 Replication for Performance, Availability, and
Survivability

Data Quality
 Web Data Quality Problems
 Local Updates with Global Effects
 Unchecked Redundancy (Mutual Copying)
 Registration of Unchecked Information
 Spam on the Rise
IIE-93
Legacy Data Challenge


Legacy Applications and Data
 Definition: Important and Difficult to Replace
 Typically, Mainframe Mission Critical Code
 Most are OLTP and Database Applications
Evolution of Legacy Databases
 Client-server Architectures
 Wrappers
 Expensive and Gradual in Any Case
IIE-94
Potential Value Added/Jumping on Bandwagon





Sophisticated Query Capability
 Combining SQL with Keyword Queries
Consistent Updates
 Atomic Transactions and Beyond
But Everything has to be in a Database!
 Only If we Stick with Classic DB Assumptions
Relaxing DB Assumptions
 Interoperable Query Processing
 Extended Transaction Updates
Commodity DB Software
 A Little Help is Still Good If it is Cheap
 Internet Facilitates Software Distribution
 Databases as Middleware
IIE-95
Data Warehousing and Data Mining


Data Warehousing
 Provide Access to Data for Complex Analysis,
Knowledge Discovery, and Decision Making
 Underlying Infrastructure in Support of Mining
 Provides Means to Interact with Multiple DBs
 OLAP (on-Line Analytical Processing) vs. OLTP
Data Mining
 Discovery of Information in Vast Data Sets
 Search for Patterns and Common Features based
 Discover Information not Previously Known
 Medical Records Accessible Nationwide
 Research/Discover Cures for Rare Diseases

Relies on Knowledge Discovery in DBs (KDD)
IIE-96
Data Warehousing and OLAP



A Data Warehouse
 Database is Maintained Separately from an
Operational Database
 “A Subject-Oriented, Integrated, Time-Variant, and
Non-Volatile Collection of Data in Support of
Management’s Decision Making Process”
[W.H. Inmon]
OLAP (on-Line Analytical Processing)
 Analysis of Complex Data in the Warehouse
 Attempt to Attain “Value” through Analysis
 Relies on Trained, Skilled Knowledge
Workers who Discover Information
Data Mart
 Organized Data for a Subset of an Organization
 Establish De-Identified Marts for BMI Research
IIE-97
Building a Data Warehouse

Option 1
 Leverage Existing
Repositories
 Collate and Collect
 May Not Capture All
Relevant Data

Option 2
 Start from Scratch
 Utilize Underlying
Corporate Data
Figure: Corporate Data Warehouse, Reached Either by Consolidating Data Marts (Option 1) or by Building from Scratch on Corporate Data (Option 2)
IIE-98
BMI – Partition/Excerpt Data Warehouse


Clinical and Epidemiological Research (and for T2 and T1)
Each Study Submitted to Institutional Review Board (IRB)
 For Human Subjects (Assess Risks, Protect Privacy)
 See: http://resadm.uchc.edu/hspo/irb/
To Satisfy IRB (and Privacy, Security, etc.), Reverse Process to
Create a Data Mart for each Approved Study
 Export/Excerpt Study Data from Warehouse
 May be Single or Multiple Sources
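The "reverse process" of excerpting a study-specific data mart can be sketched as a filter plus a projection. Everything here is hypothetical: the warehouse rows, the field names, the function `build_mart`, and the study secret. Replacing identifiers with a salted hash is only pseudonymization; on its own it should not be assumed to satisfy an IRB or a de-identification standard.

```python
import hashlib

WAREHOUSE = [  # hypothetical warehouse rows
    {"patient_id": "P001", "name": "Steve", "diagnosis": "asthma", "peak_flow": 410},
    {"patient_id": "P002", "name": "Lois", "diagnosis": "diabetes", "glucose": 130},
    {"patient_id": "P003", "name": "John", "diagnosis": "asthma", "peak_flow": 380},
]


def build_mart(rows, diagnosis, fields, study_secret):
    """Excerpt one study's rows, keeping approved fields and a study-specific pseudonym."""
    mart = []
    for row in rows:
        if row["diagnosis"] != diagnosis:
            continue
        digest = hashlib.sha256((study_secret + row["patient_id"]).encode()).hexdigest()
        mart.append({"subject": digest[:12], **{f: row[f] for f in fields}})
    return mart


asthma_mart = build_mart(WAREHOUSE, "asthma", ["peak_flow"], study_secret="study-17")
for record in asthma_mart:
    print(record)
```

Using a different secret per approved study means the same patient gets unrelated pseudonyms in different marts, which limits linking across studies.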
Figure: BMI Data Warehouse Excerpted into Data Marts
IIE-99
Data Warehouse Characteristics



Utilizes a “Multi-Dimensional” Data Model
Warehouse Comprised of
 Store of Integrated Data from Multiple Sources
 Processed into Multi-Dimensional Model
Warehouse Supports
 Time Series and Trend Analysis
 “Super-Excel” Integrated with DB Technologies
Data is Less Volatile than Regular DB
 Doesn’t Dramatically Change Over Time
 Updates at Regular Intervals
 Specific Refresh Policy Regarding Some Data
IIE-100
Three Tier Architecture
Data Sources: Operational Databases, External Data Sources
 Extract, Transform, Load, Refresh (Monitor, Integrator)
Data Storage: Data Warehouse, Data Marts, Metadata
 Serve
OLAP Server
 Summarization Report, Query Report, Data Mining
IIE-101
Data Warehouse Design



Most Data Warehouses Use a Star Schema to
Represent the Multi-Dimensional Data Model
Each Dimension is Represented by a Dimension
Table that Describes its Multidimensional Coordinates
A Fact Table Stores the Measures and Connects All
Dimension Tables with a Multiple Join
 Each Tuple in the Fact Table Records the Measures
for One Combination of Dimension Values
 Each Tuple in the Fact Table Consists of a Pointer
to Each of the Dimensional Tables
 Links Between the Fact Table and the Dimensional
Tables Form a Shape Like a Star
IIE-102
What is a Multi-Dimensional Data Cube?




Representation of Information in Two or More
Dimensions
Typical Two-Dimensional - Spreadsheet
In Practice, to Track Trends or Conduct Analysis,
Three or More Dimensions are Useful
For BMI – Axes for Diagnosis, Drug, Subject Age
IIE-103
Multi-Dimensional Schemas




Supporting Multi-Dimensional Schemas Requires Two
Types of Tables:
 Dimension Table: Tuples of Attributes for Each
Dimension
 Fact Table: Measured/Observed Variables with
Pointers into Dimension Table
Star Schema
 Characterizes Data Cubes by having a Single Fact
Table with a Single Table for Each Dimension
Snowflake Schema
 Dimension Tables from Star Schema are Organized
into Hierarchy via Normalization
Both Represent Storage Structures for Cubes
IIE-104
Example of Star Schema
Sale Fact Table: Date, Product, Store, Customer, Unit_Sales, Dollar_Sales
Date Dimension: Date, Month, Year
Product Dimension: ProductNo, ProdName, ProdDesc, Category
Store Dimension: StoreID, City, State, Country, Region
Customer Dimension: CustID, CustName, CustCity, CustCountry
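The sales star schema can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table names (`sale`, `date_dim`, `product`, `store`), the trimmed column lists, and all inserted rows are invented for illustration, and the Customer dimension is omitted for brevity.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product  (ProductNo INTEGER PRIMARY KEY, ProdName TEXT, Category TEXT);
CREATE TABLE store    (StoreID INTEGER PRIMARY KEY, City TEXT, State TEXT, Region TEXT);
CREATE TABLE date_dim (Date TEXT PRIMARY KEY, Month INTEGER, Year INTEGER);
-- Fact table: one pointer per dimension plus the measures.
CREATE TABLE sale (
    Date    TEXT    REFERENCES date_dim(Date),
    Product INTEGER REFERENCES product(ProductNo),
    Store   INTEGER REFERENCES store(StoreID),
    Unit_Sales   INTEGER,
    Dollar_Sales REAL
);
INSERT INTO product  VALUES (1, 'Beer', 'Beverage'), (2, 'Nuts', 'Snack');
INSERT INTO store    VALUES (10, 'Atlanta', 'GA', 'South'), (20, 'Hartford', 'CT', 'East');
INSERT INTO date_dim VALUES ('2000-03-01', 3, 2000), ('2000-04-01', 4, 2000);
INSERT INTO sale VALUES
    ('2000-03-01', 1, 10, 5, 25.0),
    ('2000-03-01', 2, 10, 2, 6.0),
    ('2000-04-01', 1, 20, 3, 15.0);
""")

# The "multiple join": follow the fact table's pointers, then aggregate a measure.
rows = con.execute("""
    SELECT s.Region, d.Month, SUM(f.Dollar_Sales)
    FROM sale f
    JOIN store s    ON f.Store = s.StoreID
    JOIN date_dim d ON f.Date = d.Date
    GROUP BY s.Region, d.Month
    ORDER BY s.Region, d.Month
""").fetchall()
print(rows)
```

Measures live only in the fact table, while descriptive attributes such as `Region` and `Month` live in the dimension tables, so a new grouping only changes the `GROUP BY` clause.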
IIE-105
Example of Star Schema for BMI
Patient Fact Table: Visit Date, Vitals, Symptoms, Patient, Medications
Date Dimension: Date, Month, Year
Vitals Dimension: BP, Temp, Resp, HR (Pulse)
Symptoms Dimension: Pulmonary, Heart, Mus-Skel, Skin, Digestive, Etc.
Patient Dimension: PatientID, PatientName, PatientCity, PatientCountry
Medications: Reference another Star Schema for all Meds
IIE-106
A Second Example of Star Schema …
IIE-107
and Corresponding Snowflake Schema
IIE-108
Data Warehouse Issues


Data Acquisition
 Extraction from Heterogeneous Sources
 Reformatted into Warehouse Context - Names,
Meanings, Data Domains Must be Consistent
 Data Cleaning for Validity and Quality:
Is the Data as Expected w.r.t. Content? Value?
 Transition of Data into Data Model of Warehouse
 Loading of Data into the Warehouse
Other Issues Include:
 How Current is the Data? Frequency of Update?
 Availability of Warehouse? Dependencies of Data?
 Distribution, Replication, and Partitioning Needs?
 Loading Time (Clean, Format, Copy, Transmit,
Index Creation, etc.)?
 For CTSA – Data Ownership (Competing Hospitals).
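The acquisition steps above (extract, reformat, clean, load) can be sketched with two made-up source systems. The source layouts, field names, and the plausible-temperature range are assumptions chosen for illustration only.

```python
# Two hypothetical sources that name and encode the same fact differently.
HOSPITAL_A = [{"pid": "A1", "temp_f": 98.6}, {"pid": "A2", "temp_f": 9860.0}]
HOSPITAL_B = [{"patient": "B7", "temp_c": 37.5}]


def extract_transform():
    """Extraction and reformatting: one name (patient_id) and one unit (Celsius)."""
    for row in HOSPITAL_A:
        celsius = round((row["temp_f"] - 32) * 5 / 9, 1)
        yield {"patient_id": row["pid"], "temp_c": celsius}
    for row in HOSPITAL_B:
        yield {"patient_id": row["patient"], "temp_c": row["temp_c"]}


def clean(rows, low=30.0, high=45.0):
    """Data cleaning: keep only values inside the expected range (assumed bounds)."""
    return [r for r in rows if low <= r["temp_c"] <= high]


warehouse = clean(extract_transform())   # "loading" is just a list here
print(warehouse)
```

The second row from the first source is dropped because its value is far outside the expected range, which is the kind of content check the cleaning step is meant to perform.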
IIE-109
Knowledge Discovery



Data Warehousing Requires Knowledge Discovery to
Organize/Extract Information Meaningfully
Knowledge Discovery
 Technology to Extract Interesting Knowledge
(Rules, Patterns, Regularities, Constraints) from a
Vast Data Set
 Process of Non-trivial Extraction of Implicit,
Previously Unknown, and Potentially Useful
Information from Large Collection of Data
Data Mining
 A Critical Step in the Knowledge Discovery
Process
 Extracts Implicit Information from Large Data Set
IIE-110
Steps in a KDD Process








Learning the Application Domain (goals)
Gathering and Integrating Data
Data Cleaning
Data Integration
Data Transformation/Consolidation
Data Mining
 Choosing the Mining Method(s) and Algorithm(s)
 Mining: Search for Patterns or Rules of Interest
Analysis and Evaluation of the Mining Results
Use of Discovered Knowledge in Decision Making
Important Caveats
 This is Not an Automated Process!
 Requires Significant Human Interaction!
IIE-111
OLAP Strategies


OLAP Strategies
 Roll-Up: Summarization of Data
 Drill-Down: from the General to Specific (Details)
 Pivot: Cross Tabulate the Data Cubes
 Slice and Dice: Projection Operations Across
Dimensions
 Sorting: Ordering Result Sets
 Selection: Access by Value or Value Range
Implementation Issues
 Persistent with Infrequent Updates (Loading)
 Optimization for Performance on Queries is More
Complex - Across Multi-Dimensional Cubes
 Recovery Less Critical - Mostly Read Only
 Temporal Aspects of Data (Versions) Important
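The OLAP strategies listed above can be sketched over a small in-memory fact list. The facts, the dimension layout in `DIMS`, and the helper names `roll_up` and `dice` are all invented for illustration; an OLAP server would run the same operations over a stored cube.

```python
from collections import defaultdict

# Each fact: (product, city, state, month, sales). Values are made up.
FACTS = [
    ("Electronics", "Atlanta", "GA", "Jan", 100.0),
    ("Electronics", "Atlanta", "GA", "Mar", 150.0),
    ("Electronics", "Savannah", "GA", "Mar", 50.0),
    ("Clothing", "Atlanta", "GA", "Mar", 70.0),
    ("Clothing", "Hartford", "CT", "Feb", 30.0),
]
DIMS = {"product": 0, "city": 1, "state": 2, "month": 3}


def roll_up(facts, *dims):
    """Summarize sales over the named dimensions only."""
    totals = defaultdict(float)
    for fact in facts:
        totals[tuple(fact[DIMS[d]] for d in dims)] += fact[4]
    return dict(totals)


def dice(facts, **fixed):
    """Selection: keep facts that match every fixed dimension value."""
    return [f for f in facts if all(f[DIMS[d]] == v for d, v in fixed.items())]


by_state = roll_up(FACTS, "state")                      # roll-up: city to state
ga_by_city = roll_up(dice(FACTS, state="GA"), "city")   # drill-down into GA
atlanta = dice(FACTS, city="Atlanta")                   # slice: fix one dimension
atl_elec = dice(FACTS, city="Atlanta", product="Electronics")   # dice: fix two
print(by_state, ga_by_city, len(atlanta), len(atl_elec))
```

A slice is expressed here as a dice with a single fixed dimension, so both operations share one function.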
IIE-112
On-Line Analytical Processing


Data Cube
 A Multidimensional Array
 Each Attribute is a Dimension
In Example Below, the Data Must be Interpreted so
that it Can be Aggregated by Region/Product/Date
Product       Store     Date      Sale
acron         Rolla,MO  7/3/99    325.24
budwiser      LA,CA     5/22/99   833.92
large pants   NY,NY     2/12/99   771.24
3’ diaper     Cuba,MO   7/30/99   81.99
Cube Axes:
 Product: Pants, Diapers, Beer, Nuts
 Region: West, East, Central, Mountain, South
 Date: Jan, Feb, March, April
IIE-113
On-Line Analytical Processing

For BMI – Imagine a Data Table with Patient Data
 Define Axis
 Summarize Data
 Create Perspective to Match Research Goal
 Essentially De-identified Data Mart
Patient   Med       BirthDate  Dosage
Steve     Lipitor   1/1/45     10mg
John      Zocor     2/2/55     80mg
Harry     Crestor   3/3/65     5mg
Lois      Lipitor   4/4/66     20mg
Charles   Crestor   7/1/59     10mg
Cube Axes:
 Medication: Lescol, Crestor, Zocor, Lipitor
 Dosage: 5, 10, 20, 40, 80
 Decade: 1940s, 1950s, 1960s, 1970s
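A sketch of how the patient table could be summarized along the Medication, Dosage, and Decade axes. Two assumptions are made loudly: the two-digit birth years are read as 19xx, and the dosages for John (80mg) and Charles (10mg) are taken from the stray values in the slide, which may not be how the original table paired them.

```python
from collections import Counter

# (patient, medication, birth year, dosage in mg), read from the slide's table.
# The John and Charles dosages are an assumption (see the note above).
PATIENTS = [
    ("Steve", "Lipitor", 1945, 10),
    ("John", "Zocor", 1955, 80),
    ("Harry", "Crestor", 1965, 5),
    ("Lois", "Lipitor", 1966, 20),
    ("Charles", "Crestor", 1959, 10),
]


def decade(year):
    """Map a birth year to its decade label, e.g. 1945 -> '1940s'."""
    return f"{year // 10 * 10}s"


# Define the axes and summarize: names are dropped, only counts per cell remain.
cube = Counter((med, dose, decade(born)) for _name, med, born, dose in PATIENTS)

# Roll up over the dosage axis to get a Medication x Decade view.
by_med_decade = Counter()
for (med, _dose, dec), n in cube.items():
    by_med_decade[(med, dec)] += n

print(cube)
print(by_med_decade)
```

With only five patients every cell holds a single person, so dropping names alone would not protect anyone; a real de-identified mart would need larger cells or further safeguards.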
IIE-114
Examples of Data Mining

The Slicing Action
 A Vertical or Horizontal Slice Across Entire Cube
Figure: Slice of the Multi-Dimensional Data Cube (Products, Sales, Months) on City Atlanta
IIE-115
Examples of Data Mining

The Dicing Action
 A Slice First Identifies One Dimension
 A Selection of Any Cube within the Slice which
Essentially Constrains All Three Dimensions
Figure: Dice on Electronics and Atlanta (March 2000) within the Products, Sales, Months Cube
IIE-116
Examples of Data Mining
Drill Down: Takes a Facet (e.g., Q1) and Decomposes into Finer Detail
 Drill Down on Q1: from Q1 Q2 Q3 Q4 to Jan Feb March
Roll Up: Combines Multiple Values Along a Dimension, e.g., From Individual Cities to State
 Roll Up on Location (State, USA)
IIE-117
Mining Other Types of Data

Analysis and Access Dramatically More Complicated!
Time Series Data for Glucose, BP, Peak Flow, etc.
Spatial databases
Multimedia databases
World Wide Web
Time series data
Geographical and Satellite Data
IIE-118
Advantages/Objectives of Data Mining



Descriptive Mining
 Discover and Describe General Properties
 60% of People who Buy Beer on Friday have also
Bought Nuts or Chips in the Past Three Months
Predictive Mining
 Infer Interesting Properties based on Available
Data
 People who Buy Beer on Friday usually also Buy
Nuts or Chips
Result of Mining
 Order from Chaos
 Mining Large Data Sets in Multiple Dimensions
Allows Businesses, Individuals, etc. to Learn about
Trends, Behavior, etc.
 Impact on Marketing Strategy
IIE-119
Data Mining Methods (1)

Association
 Discover the Frequency of Items Occurring
Together in a Transaction or an Event
 Example
 80% of Customers who Buy Milk also Buy Bread
Hence - Bread and Milk Adjacent in Supermarket
 50% of Customers Forget to Buy Milk/Soda/Drinks
Hence - Available at Register

Prediction
 Predicts Some Unknown or Missing Information
based on Available Data
 Example
 Forecast Sale Value of Electronic Products for Next
Quarter via Available Data from Past Three Quarters
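One simple way to forecast the next quarter from the past three is a least-squares trend line. This is only one possible prediction method, chosen here for brevity; the function name `linear_forecast` and the sales figures are made up.

```python
def linear_forecast(values):
    """Fit y = a + b*x by least squares to x = 0..n-1 and predict x = n."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
        / sum((x - mean_x) ** 2 for x in range(n))
    )
    # The fitted line passes through (mean_x, mean_y).
    return mean_y + slope * (n - mean_x)


past_three_quarters = [100.0, 120.0, 140.0]   # hypothetical sale values
next_quarter = linear_forecast(past_three_quarters)
print(next_quarter)
```

With only three points the fit is fragile, so in practice the prediction would be checked against more history or a richer model.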
IIE-120
Association Rules



Motivated by Market Analysis
Rules of the Form
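 $\text{Item}_1 \wedge \text{Item}_2 \wedge \dots \wedge \text{Item}_k \Rightarrow \text{Item}_{k+1} \wedge \dots \wedge \text{Item}_n$
Example
 “Beer $\wedge$ Soft Drink $\Rightarrow$ Pop Corn”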
Problem: Discovering All Interesting Association
Rules in a Large Database is Difficult!
 Issues
 Interestingness
 Completeness
 Efficiency

Basic Measurement for Association Rules
 Support of the Rule
 Confidence of the Rule
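The two basic measurements can be computed directly from a list of transactions. The sketch uses the usual definitions: support is the fraction of transactions containing every item in the rule, and confidence is the support of the whole rule divided by the support of its left-hand side. The baskets are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


BASKETS = [  # hypothetical market baskets
    {"beer", "soft drink", "pop corn"},
    {"beer", "soft drink", "pop corn", "milk"},
    {"beer", "soft drink"},
    {"milk", "bread"},
    {"beer", "pop corn"},
]

# Rule: Beer and Soft Drink imply Pop Corn.
rule_support = support(BASKETS, {"beer", "soft drink", "pop corn"})
rule_confidence = confidence(BASKETS, {"beer", "soft drink"}, {"pop corn"})
print(rule_support, rule_confidence)
```

Scanning every candidate rule this way is exponential in the number of items, which is why efficiency is listed among the issues above.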
IIE-121
Data Mining Methods (2)

Classification
 Determine the Class or Category of an Object
based on its Properties
 Example
 Classify Companies based on the Final Sale Results in
the Past Quarter

Clustering
 Organize a Set of Multi-dimensional Data Objects
in Groups to Minimize Inter-group Similarity
and Maximize Intra-group Similarity
 Example
 Group Crime Locations to Find Distribution Patterns
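Clustering can be illustrated with plain k-means, one common algorithm among many. The coordinates stand in for crime locations and are made up, and the starting centers are fixed by hand so the run is deterministic.

```python
def kmeans(points, centers, rounds=10):
    """Plain k-means on 2-D points, starting from the given centers."""
    groups = []
    for _ in range(rounds):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for px, py in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: (px - centers[i][0]) ** 2 + (py - centers[i][1]) ** 2,
            )
            groups[nearest].append((px, py))
        # Update step: move each center to the mean of its group.
        centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups


CRIMES = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, groups = kmeans(CRIMES, centers=[CRIMES[0], CRIMES[3]])
print(centers)
print(groups)
```

Points within a group end up close to each other and far from the other group's center, which is the intra-group versus inter-group objective stated above.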
IIE-122
Classification



Two Stages
 Learning Stage: Construction of a Classification
Function or Model
 Classification Stage: Prediction of Classes of
Objects Using the Function or Model
Tools for Classification
 Decision Tree
 Bayesian Network
 Neural Network
 Regression
Problem
 Given a Set of Objects whose Classes are Known
(Training Set), Derive a Classification Model
which can Correctly Classify Future Objects
IIE-123
An Example



Attributes
Attribute     Possible Values
outlook       sunny, overcast, rain
temperature   continuous
humidity      continuous
windy         true, false
Class Attribute - Play/Don’t Play the Game
Training Set
 Values that Set the Condition for the Classification
 What are the Patterns Below?
Outlook    Temperature   Humidity   Windy   Play
sunny      85            85         false   No
overcast   83            78         false   Yes
sunny      80            90         true    No
sunny      72            95         false   No
sunny      72            70         false   Yes
…          …             …          …       ...
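The two stages can be shown on the five training rows visible in the slide, using a decision stump (a one-level decision tree on a single attribute). This is a deliberately small sketch: it looks only at humidity and assumes that lower humidity favors Play, whereas a full decision tree would consider every attribute and the elided rows.

```python
ROWS = [  # (outlook, temperature, humidity, windy, play), from the slide
    ("sunny", 85, 85, False, "No"),
    ("overcast", 83, 78, False, "Yes"),
    ("sunny", 80, 90, True, "No"),
    ("sunny", 72, 95, False, "No"),
    ("sunny", 72, 70, False, "Yes"),
]


def learn_stump(rows, column):
    """Learning stage: find the threshold on one numeric column that best splits Yes/No."""
    values = sorted({r[column] for r in rows})
    best_accuracy, best_threshold = -1.0, None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        # Rule under test: predict Yes when the value is at or below the threshold.
        correct = sum((r[column] <= threshold) == (r[4] == "Yes") for r in rows)
        if correct / len(rows) > best_accuracy:
            best_accuracy, best_threshold = correct / len(rows), threshold
    return best_accuracy, best_threshold


def classify(threshold, value):
    """Classification stage: apply the learned rule to a new object."""
    return "Yes" if value <= threshold else "No"


accuracy, threshold = learn_stump(ROWS, column=2)   # column 2 is humidity
print(accuracy, threshold, classify(threshold, 75))
```

Perfect accuracy on five training rows says little about future objects, which is why the problem statement asks for a model that classifies unseen cases correctly.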
IIE-124
Data Mining Methods (3)

Summarization
 Characterization (Summarization) of General
Features of Objects in the Target Class
 Example
 Characterize People’s Buying Patterns on the Weekend
 Potential Impact on “Sale Items” & “When Sales Start”
 Department Stores with Bonus Coupons

Discrimination
 Comparison of General Features of Objects
Between a Target Class and a Contrasting Class
 Example
 Comparing Students in Engineering and in Art
 Attempt to Arrive at Commonalities/Differences
IIE-125
Summarization Technique

Attribute-Oriented Induction
Generalization using Concept Hierarchy (Taxonomy)
barcode   category     brand        content     size
14998     milk         dairyland    Skim        2L
12998     mechanical   MotorCraft   valve 23a   12in
…         …            …            …           ...
Concept Hierarchy:
 food: Milk, …, bread
 Milk: Skim milk, …, 2% milk (Brands: Lucern, …, Dairyland)
 bread: White bread, …, whole wheat (Brands: Wonder, …, Safeway)
Generalized Table:
Category   Content   Count
milk       skim      280
milk       2%        98
…          …         ...
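Attribute-oriented induction can be sketched as dropping overly specific attributes, optionally climbing the concept hierarchy, and counting the tuples that become identical. Only the first row below comes from the slide; the other rows, the `HIERARCHY` mapping, and the function `induce` are invented for illustration.

```python
from collections import Counter

ITEMS = [  # (barcode, category, brand, content, size); only row one is from the slide
    (14998, "milk", "dairyland", "skim", "2L"),
    (14999, "milk", "lucern", "skim", "1L"),
    (15000, "milk", "dairyland", "2%", "2L"),
    (15001, "bread", "wonder", "white", "1 loaf"),
]
HIERARCHY = {"milk": "food", "bread": "food"}   # category -> more general concept


def induce(items, climb=False):
    """Drop barcode, brand, and size; count what remains, optionally one level up."""
    counts = Counter()
    for _barcode, category, _brand, content, _size in items:
        if climb:
            counts[(HIERARCHY.get(category, category),)] += 1
        else:
            counts[(category, content)] += 1
    return counts


by_content = induce(ITEMS)               # Category, Content, Count
by_concept = induce(ITEMS, climb=True)   # generalized one level up the hierarchy
print(by_content)
print(by_concept)
```

The first result has the same shape as the generalized table (Category, Content, Count), and climbing the hierarchy collapses it further into a single food row.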
IIE-126
Why is Data Mining Popular?

Technology Push
 Technology for Collecting Large Quantity of Data
 Bar Code, Scanners, Satellites, Cameras

Technology for Storing Large Collection of Data
 Databases, Data Warehouses
 Variety of Data Repositories, such as Virtual Worlds,
Digital Media, World Wide Web


Corporations want to Improve Direct Marketing and
Promotions - Driving Technology Advances
 Targeted Marketing by Age, Region, Income, etc.
 Exploiting User Preferences/Customized Shopping
What is Potential for BMI?
 How do you see Data Mining Utilized?
 What are Key Issues to Worry About?
IIE-127
Requirements & Challenges in Data Mining




Security and Social
 What Information is Available to Mine?
 Preferences via Store Cards/Web Purchases
 What is Your Comfort Level with Trends?
User Interfaces and Visualization
 What Tools Must be Provided for End Users of
Data Mining Systems?
 How are Results for Multi-Dimensional Data
Displayed?
Performance Guarantees
 Range from Real-Time for Some Queries to Long-Term for Other Queries
Data Sources of Complex Data Types or Unstructured
Data - Ability to Format, Clean, and Load Data Sets
IIE-128
Concluding Remarks



We’ve looked at:
 Informatics
 Information Engineering
 Information Usage and Repositories
Focused on Their Applicability and Relevance for
BMI
Likely Generated More Questions than Answers
IIE-129