Transcript Chap1

STEGANOGRAPHY:
Data Mining:
SOUNDARARAJAN EZEKIEL
Department of Computer Science
Indiana University of Pennsylvania
Indiana, PA 15705
Steganography
– Art of hiding information in ways that prevent the detection of the hidden message
– The existence of the message is not known
Cryptography
– Science of writing in secret code
– It encodes a message so it cannot be understood
Data Mining
– Discovering hidden values in your data warehouse
– That is, the extraction of hidden predictive information from large databases
– Knowledge discovery method: extraction of implicit and interesting patterns from large data collections
Data Mining: Introduction
• It started when we (businesses) began to store data on computers
• Continued improvements: technology that navigates through data in real time
• Examples:
Single case:
– A web server collects data for every single click
– Logs are too big and contain gibberish
– Lots of data and statistics
– What we collected is not really useful
Multiple case:
– A collection of web servers with large bandwidth
– Think about the size of the data we collect
Data Mining (continued)
• It helps to design better and more intelligent businesses (e-learning environments) because it is supported by
– Massive data collection
– Powerful multiprocessor computers
– Good data mining algorithms
• It has existed for at least 10 years, but it has become popular only recently
• Example: Winter Corporation Report
– Data warehouses with as much as 100 to 200 terabytes of raw data will be operational by next year, performing nearly 2,000 concurrent queries and occupying nearly 1 petabyte (1,000 terabytes) of disk space. In the same time period, transaction-processing databases will handle workloads of nearly 66,000 transactions per second
Evolution of Data mining
Evolutionary step | Question | Technology | Product providers | Characteristics
Data collection (60's) | What was my total revenue over the last few years? | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery
Data access (80's) | What were unit sales in India in January of last year? | RDBMS (relational databases), SQL (Structured Query Language), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Dynamic data delivery
Data warehouse and decision support (90's) | What was the unit sales price in India last March? | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Dynamic data delivery at multiple levels
Data mining (now) | What will be the unit price in India next month? Why? | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, many more… | Prospective, proactive information delivery
The scope of Data mining
• It is similar to sifting gold from an immense amount of dirt: searching for valuable information in gigabytes of data
• Automated prediction of trends and behaviors: data mining automates the process of finding predictive information in a large database
– Example: questions related to target marketing
– Data mining can use mailing list data and other previous data to identify the best targets
– Another example: forecasting bankruptcy by identifying segments of a population likely to respond similarly to given events
• Automated discovery of previously unknown patterns: it sweeps through the database and identifies previously hidden patterns in one step
– Example: unrelated items purchased together in a store
– Detecting fraudulent credit card transactions, etc.
• Databases can be larger in both depth and breadth
– High-performance data mining needs to analyze the full depth of a database without pre-selecting subsets
– Larger samples yield lower estimation errors and variances
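The last point, that larger samples yield lower estimation errors, can be checked with a small simulation. This is only a sketch: the "database" is a made-up list of purchase amounts, and the names `spread_of_sample_means`, `population`, and the sample sizes are assumptions chosen for illustration.

```python
import random
import statistics


def spread_of_sample_means(population, sample_size, trials, rng):
    """Standard deviation of the sample mean over repeated random samples."""
    means = [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "database": 10,000 purchase amounts from a skewed distribution.
population = [rng.expovariate(1 / 50.0) for _ in range(10_000)]

# Estimate the average purchase from small and from large samples.
spread_small = spread_of_sample_means(population, 10, 200, rng)
spread_large = spread_of_sample_means(population, 1000, 200, rng)

print(f"spread of the estimate with n=10:   {spread_small:.2f}")
print(f"spread of the estimate with n=1000: {spread_large:.2f}")
```

The estimate from the larger samples should scatter much less around the true average, which is why analyzing the full depth of a database is preferable to working on a small pre-selected subset.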
Research Rank
• 2001: according to MIT's Technology Review, data mining is a top 10 research area
• Recently: according to a Gartner Group Advanced Technology Research Note, data mining and AI are among the top 5 key research areas
• Multi-disciplinary field with broad applicability
• Has several applications
– Market basket analysis
– Customer relationship management
– Fraud detection
– Network intrusion detection
– Non-destructive evaluation
– Astronomy (look-up data)
– Remote sensing data (look-down data)
– Text and multimedia mining
– Medical imaging
– Automated target recognition
My point of view of Data mining
Borrowing ideas from
• Machine Learning
• Artificial Intelligence
• Statistics
• High performance computing
• Signal and Image Processing
• Mathematical Optimization
• Pattern Recognition
• Natural Language Processing
• Steganography
• Cryptography
Combined ideas from several different fields
– Steganography, Cryptography
General view of Data mining
[Figure: the data mining pipeline]
Raw data -> Target data -> Preprocessed data -> Transformed data -> Patterns -> Knowledge
– Data processing: data fusion, sampling, MRA (multiresolution analysis), de-noising, object identification, feature extraction, normalization, dimension reduction
– Pattern recognition: classification, clustering, regression
– Interpreting results: visualization, validation
An Iterative and Interactive Process
Our Research Based On
• Data Preprocessing
– Multiresolution Analysis
– De-noising (wavelet-based methods)
– Object Classifications
– Feature Extraction
• Pattern Recognition
– Classification
– Clustering
• Visualization and Validation
– Steganography
– Cryptography
Where we are going from here
• More robust, accurate, scalable algorithms
– For pre-processing and pattern recognition
– Wavelets and fractals
• Newer data types
– Video and multimedia
– Multi-sensor data
• More complex problems
– Dynamic tracking in video
– Mining text, audio, video, images
• Investigating steganography in images, analysis of data hiding methods, attacks against hidden information, and countermeasures to attacks against digital watermarking (detection and distortion)
How does data mining work?
• How exactly is data mining able to tell you important things that you did not know, or what is going to happen next?
• The technique used to perform these feats in data mining is called modeling
– Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation where you don't
– Example 1: a sunken treasure ship off the Bermuda shore. Collect information about other ships and their paths, keep all this information, and build the model. If the model is good, you find the treasure in the ocean
– Example 2: identifying telephone customers. Suppose your model is the information that 98% of customers who make $60K per year spend more than $80 per month on long distance
• With this model, new customers can be selectively targeted
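The telephone example can be sketched in a few lines: build the model on customers whose spending is already known, then apply it to prospects whose spending is not. The customer records, field names, and the 0.9 cut-off below are invented for illustration and are not data from the talk.

```python
def build_model(known_customers, income_cutoff=60_000, spend_cutoff=80):
    """Learn, from customers whose spending is known, how often
    high earners are also heavy long-distance users."""
    high_earners = [c for c in known_customers if c["income"] >= income_cutoff]
    heavy_users = [
        c for c in high_earners if c["monthly_long_distance"] > spend_cutoff
    ]
    rate = len(heavy_users) / len(high_earners) if high_earners else 0.0
    return {"income_cutoff": income_cutoff, "heavy_user_rate": rate}


def select_targets(model, prospects, min_rate=0.9):
    """Apply the model to prospects whose spending is NOT known."""
    if model["heavy_user_rate"] < min_rate:
        return []  # the model is too weak to justify targeting
    return [p["name"] for p in prospects if p["income"] >= model["income_cutoff"]]


# Situation where we know the answer (hypothetical records).
known = [
    {"income": 72_000, "monthly_long_distance": 95},
    {"income": 65_000, "monthly_long_distance": 110},
    {"income": 61_000, "monthly_long_distance": 85},
    {"income": 40_000, "monthly_long_distance": 20},
    {"income": 35_000, "monthly_long_distance": 90},
]

# Situation where we do not know the answer (hypothetical prospects).
prospects = [
    {"name": "A", "income": 80_000},
    {"name": "B", "income": 30_000},
    {"name": "C", "income": 62_000},
]

model = build_model(known)
targets = select_targets(model, prospects)
print(model["heavy_user_rate"], targets)
```

A real model would be validated on held-out customers before being used for targeting; the sketch skips that step.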
Most commonly used techniques
• Artificial Neural Networks: non-linear predictive models that learn through training and resemble biological neural networks in structure
• Decision Trees: tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID)
• Genetic Algorithms: optimization techniques that use processes such as genetic combination, mutation, and selection in a design based on the concept of evolution
• Nearest Neighbor Method:
• Rule Induction:
OUR METHODS WILL BE BASED ON WAVELETS, FRACTALS, STEG, AND CRYPT
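As one illustration of the techniques listed above, here is a minimal sketch of the nearest neighbor method in its usual textbook form: a new record is given the class held by the majority of the k most similar records in a historical dataset. The slide leaves this entry blank, so the form shown, the function name `knn_classify`, and the tiny dataset are assumptions.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training records."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical records: (feature vector, known class).
training = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"), ((9.0, 7.5), "high"), ((8.5, 9.0), "high"),
]

print(knn_classify(training, (1.2, 1.4)))
print(knn_classify(training, (8.2, 8.1)))
```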
Steganography Methods
Let us discuss a few methods and their advantages and disadvantages
• 1. Least Significant Bit (LSB) Method
– Idea: hide the message in the least significant bits (LSBs) of the pixels
– Advantage: quick and easy; works well in gray-scale images
– Disadvantage: inserting into 8-bit images changes colors and causes noticeable change; vulnerable to image processing such as cropping and compression
• Redundant method
– Store the message more than once so that it withstands cropping
• Spread Spectrum
– Store the hidden message everywhere
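The least significant bit idea can be sketched on a flat list of gray-scale pixel values (0 to 255). This is a minimal illustration, not the tool used in the talk: the helper names and the generated cover "image" are assumptions, and a real implementation would read pixels from an image file.

```python
def text_to_bits(text):
    """Turn text into a list of bits, most significant bit of each byte first."""
    return [
        (byte >> shift) & 1
        for byte in text.encode("utf-8")
        for shift in range(7, -1, -1)
    ]


def bits_to_text(bits):
    """Inverse of text_to_bits."""
    data = bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[pos:pos + 8]))
        for pos in range(0, len(bits), 8)
    )
    return data.decode("utf-8")


def lsb_embed(pixels, bits):
    """Overwrite the least significant bit of the first len(bits) pixels."""
    if len(bits) > len(pixels):
        raise ValueError("message is too long for this cover image")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit  # clear the LSB, then set it to the bit
    return stego


def lsb_extract(pixels, n_bits):
    """Read the least significant bit of the first n_bits pixels."""
    return [p & 1 for p in pixels[:n_bits]]


# Hypothetical 64-pixel gray-scale cover image.
cover = [200 + (i * 7) % 50 for i in range(64)]
bits = text_to_bits("hi")
stego = lsb_embed(cover, bits)
recovered = bits_to_text(lsb_extract(stego, len(bits)))
print(recovered)
```

Each pixel changes by at most 1, which is why the change is hard to see in a gray-scale image. Cropping or lossy compression destroys these bits, which is the weakness noted above.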
STEGANALYSIS
• Detection ("Seeing the Unseen")
– The analyst observes various relationships between cover, message, stego-media, and steganography tool
• Distortion
– The analyst manipulates the stego-media to render the embedded information useless or remove it altogether
DCT - Discrete Cosine Transformation
– Encode
• Take image
• Divide into 8x8 blocks
• Apply 2-D DCT to get the DCT coefficients
• Apply threshold value T
• Store the hidden message in that place
• Take inverse and store as image
– Decode
• Start with modified image
• Apply DCT
• Find coefficients less than T
• Extract bits
• Combine bits and make message
[Figure: an example 8x8 block of pixel values (ranging from 210 to 219) and its 8x8 matrix of 2-D DCT coefficients (DC coefficient 1720)]
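The encode and decode steps can be sketched for a single 8x8 block as follows. The slide does not say how a bit is stored in a coefficient below the threshold, so the rule used here (set the coefficient to +T/2 for a 1 and -T/2 for a 0) is an assumption, as are all function names. Pixel values are kept as floats; rounding back to integers, which a real image format requires, is not handled.

```python
import math

N = 8  # block size used on the slide


def _scale(k):
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def _cos(x, u):
    return math.cos((2 * x + 1) * u * math.pi / (2 * N))


def dct2(block):
    """Orthonormal 2-D DCT of an N x N block."""
    return [[_scale(u) * _scale(v) * sum(block[x][y] * _cos(x, u) * _cos(y, v)
                                         for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def idct2(coeffs):
    """Inverse of dct2."""
    return [[sum(_scale(u) * _scale(v) * coeffs[u][v] * _cos(x, u) * _cos(y, v)
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]


def slots(coeffs, t):
    """Positions of non-DC coefficients with magnitude below t, in scan order."""
    return [(u, v) for u in range(N) for v in range(N)
            if (u, v) != (0, 0) and abs(coeffs[u][v]) < t]


def embed(block, bits, t):
    coeffs = dct2(block)
    free = slots(coeffs, t)
    if len(bits) > len(free):
        raise ValueError("not enough coefficients below the threshold")
    for (u, v), bit in zip(free, bits):
        coeffs[u][v] = t / 2 if bit else -t / 2  # assumed rule: sign carries the bit
    return idct2(coeffs)


def extract(stego_block, n_bits, t):
    coeffs = dct2(stego_block)
    return [1 if coeffs[u][v] > 0 else 0 for u, v in slots(coeffs, t)[:n_bits]]


# Hypothetical smooth block with values close to those in the figure.
block = [[210 + x + y for y in range(N)] for x in range(N)]
bits = [1, 0, 1, 1, 0, 0, 1, 0]
stego = embed(block, bits, t=2.0)
recovered = extract(stego, len(bits), t=2.0)
print(recovered)
```

With this orthonormal scaling, a block whose pixels average 215 has a DC coefficient of 8 x 215 = 1720, which matches the DC value in the figure.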
Wavelets Transformation
Wavelets are basis functions $w_{jk}(t)$ in continuous time. A basis is a set of linearly independent functions that can be used to produce all admissible functions $f(t)$:
$f(t) = \text{combination of basis functions} = \sum_{j,k} b_{jk}\, w_{jk}(t)$
The special feature of a wavelet basis is that all functions $w_{jk}(t)$ are constructed from a single mother wavelet $w(t)$. This wavelet is a small wave (a pulse). Normally it starts at time $t = 0$ and ends at time $t = N$.
Shifted in time: $w_{0k}(t) = w(t - k)$
Compressed: $w_{j0}(t) = w(2^j t)$
Combining both, we have $w_{jk}(t) = w(2^j t - k)$
Haar wavelet: 1909 Haar; 1984 theory; 1988 Daubechies; 1989 Mallat (2-D, MRA); 1992 bi-orthogonal
Haar = [figure]
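The shift-and-compress construction can be tried out with the Haar wavelet. The definition used below is the standard textbook Haar mother wavelet (+1 on the first half of the unit interval, -1 on the second half, so here N = 1); the function names `haar` and `w` are chosen only for this sketch.

```python
def haar(t):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1:
        return -1.0
    return 0.0


def w(j, k, t, mother=haar):
    """w_jk(t) = w(2^j t - k): compressed by 2^j and shifted by k."""
    return mother(2 ** j * t - k)


# w_00 is the mother wavelet itself.
print(w(0, 0, 0.25), w(0, 0, 0.75))
# w_01 is the same pulse shifted one unit to the right.
print(w(0, 1, 1.25))
# w_11 is compressed to half the width and lives on [1/2, 1).
print(w(1, 1, 0.625), w(1, 1, 0.875), w(1, 1, 0.25))
```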
[Figure: block diagram with the stages Carrier, Wavelet Transformation, Thresholding, Compression, Message to be Hidden, Error Image, Inverse Transformation, Stego image, Extract the Hidden Message]
Information security and data mining
• Goal of intrusion detection: discover intrusions into a computer or network
• With the internet and readily available tools for attacking networks, security becomes a critical component of a network
• Misuse detection: finds intrusions by looking for activity corresponding to known techniques for intrusion
• Anomaly detection: the system defines the expected behavior of the network in advance, and significant deviations from it are flagged
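The two detection styles can be contrasted in a short sketch. The attack signatures, the request strings, the per-minute request counts, and the three-standard-deviation rule are all invented for illustration; real intrusion detection systems use far richer features.

```python
import statistics

# Misuse detection: look for activity matching KNOWN intrusion techniques.
KNOWN_ATTACK_PATTERNS = ["../../etc/passwd", "' OR '1'='1"]  # hypothetical signatures


def misuse_detect(request):
    return any(pattern in request for pattern in KNOWN_ATTACK_PATTERNS)


# Anomaly detection: define expected behavior in advance, then flag deviations.
def build_profile(normal_counts):
    """Expected behavior, summarized as mean and standard deviation."""
    return statistics.fmean(normal_counts), statistics.stdev(normal_counts)


def anomaly_detect(count, profile, max_sigmas=3.0):
    mean, stdev = profile
    return abs(count - mean) > max_sigmas * stdev


# Hypothetical requests per minute observed during normal operation.
profile = build_profile([98, 102, 100, 97, 103, 101, 99])

print(misuse_detect("GET /index.html"))
print(misuse_detect("GET /../../etc/passwd"))
print(anomaly_detect(104, profile))
print(anomaly_detect(450, profile))
```

Misuse detection cannot catch an attack it has no signature for, while anomaly detection can, at the cost of false alarms whenever legitimate behavior changes.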
What we want
• Tools to filter and classify information
• Tools to find and retrieve the relevant information when you need it
• Tools that adapt to your pace and needs
• Tools to predict information needs
• Tools to recommend tasks and information sources
• Tools that can be personalized, manually or automatically
The tools should be…
• Non-intrusive
• Secure
• Integrated
• Adaptable
• Controllable
• Automatic or semi-automatic
• Useful
– For learners
– For educators
Integrate operational data with customer,
suppliers and market --
Profitable application
• A wide range of companies have deployed successful applications of data mining
• Some application areas include
– A pharmaceutical company can analyze its recent sales force activity and results to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the next few months
– A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product
– A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services
– A large consumer packaged goods company can apply data mining to improve its sales process to retailers
Conclusion
• In this talk, we have discussed data mining and related topics
• Our goals
– Research
– Software and algorithms
– Application
• Our main focus is science data, though the methods are applicable to other data sets as well
• More information: check out the website
http://www.cosc.iup.eud/sezekiel
Contact: [email protected]