Unit-4,5 - SRM CSE-A


DATA MINING
Introductory and Advanced Topics
Part III
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M.H.Dunham, Data Mining, Introductory
and Advanced Topics, Prentice Hall, 2002.
Data Mining Outline
• PART I
– Introduction
– Related Concepts
– Data Mining Techniques
• PART II
– Classification
– Clustering
– Association Rules
• PART III
– Web Mining
– Spatial Mining
– Temporal Mining
Web Mining Outline
Goal: Examine the use of data mining on the
World Wide Web
• Introduction
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
Web Mining Issues
• Size
– >350 million pages (1999)
– Grows at about 1 million pages a day
– Google indexes 3 billion documents
• Diverse types of data
Web Data
• Web pages
• Intra-page structures
• Inter-page structures
• Usage data
• Supplemental data
– Profiles
– Registration information
– Cookies
Web Mining Taxonomy
Modified from [zai01]
Web Content Mining
• Extends work of basic search engines
• Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis
Crawlers
• Robot (spider) traverses the hypertext structure in the Web.
• Collect information from visited pages
• Used to construct indexes for search engines
• Traditional Crawler – visits entire Web (?) and
replaces index
• Periodic Crawler – visits portions of the Web and
updates subset of index
• Incremental Crawler – selectively searches the Web
and incrementally modifies index
• Focused Crawler – visits pages related to a particular
subject
Focused Crawler
• Only visit links from a page if that page is
determined to be relevant.
• Classifier is static after learning phase.
• Components:
– Classifier which assigns relevance score to each
page based on crawl topic.
– Distiller to identify hub pages, which contain links to many relevant pages.
– Crawler visits pages based on classifier and
distiller scores.
Focused Crawler
• Classifier relates documents to topics.
• Classifier also determines how useful outgoing links are.
• Hub Pages contain links to many relevant pages. They must be visited even if they do not have a high relevance score.
Focused Crawler
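The figure for this slide is not reproduced in the transcript. As an illustration only (not the algorithm from the text), a minimal Python sketch of a focused crawl loop follows; fetch, score, and extract_links are hypothetical callables standing in for the page fetcher, the classifier/distiller scoring, and the link parser, and the 0.5 threshold is an invented example.

import heapq

def focused_crawl(seed_urls, fetch, score, extract_links, limit=100):
    # Priority queue of (negated relevance, URL); heapq is a min-heap,
    # so the most relevant frontier page is popped first.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    visited = set()
    results = []
    while frontier and len(results) < limit:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        relevance = score(page)          # classifier: relevance to crawl topic
        results.append((url, relevance))
        # Only follow links from pages judged relevant (focused crawling).
        if relevance >= 0.5:
            for link in extract_links(page):
                if link not in visited:
                    heapq.heappush(frontier, (-relevance, link))
    return results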
Context Focused Crawler
• Context Graph:
– Context graph created for each seed document.
– Root is the seed document.
– Nodes at each level show documents with links to documents at the next higher level.
– Updated during the crawl itself.
• Approach:
1. Construct context graph and classifiers using seed documents as training data.
2. Perform crawling using the classifiers and context graph created.
Context Graph
Harvest System
• Based on the use of caching, indexing, and crawling.
• Harvest is a set of tools that facilitate the gathering of information from diverse sources.
• Gatherers – Obtain information for indexing from an Internet Service Provider.
• Brokers – Provide the index and query interface.
• Gatherers use the Essence system to assist in collecting data.
• Essence – Classifies documents by creating a semantic index.
Virtual Web View
• Multiple Layered DataBase (MLDB) built on top of the
Web.
• Each layer of the database is more generalized (and
smaller) and centralized than the one beneath it.
• Upper layers of MLDB are structured and can be
accessed with SQL type queries.
• Translation tools convert Web documents to XML.
• Extraction tools extract desired information to place in
first layer of MLDB.
• Higher levels contain more summarized data obtained
through generalizations of the lower levels.
Personalization
• Web access or contents tuned to better fit the
desires of each user.
• Manual techniques identify user’s preferences
based on profiles or demographics.
• Collaborative filtering identifies preferences based
on ratings from similar users.
• Content based filtering retrieves pages based on
similarity between pages and user profiles.
Web Structure Mining
• Mine structure (links, graph) of the Web
• Techniques
– PageRank
– CLEVER
• Create a model of the Web organization.
• May be combined with content mining to
more effectively retrieve important pages.
PageRank
• Used by Google
• Prioritize pages returned from search by
looking at Web structure.
• Importance of page is calculated based on
number of pages which point to it –
Backlinks.
• Weighting is used to give more importance to backlinks coming from important pages.
PageRank (cont’d)
• PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
– PR(i): PageRank for page i, one of the n pages that point to target page p.
– Ni: number of links coming out of page i.
– c: a normalization constant (c < 1).
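As a concrete reading of the formula above, the sketch below iterates the PageRank equation over a small invented link graph until the values settle. It follows the slide's formula only (no added damping term); the value of c, the iteration count, and the example graph are assumptions.

def pagerank(graph, c=0.85, iters=50):
    # graph: {page: [pages it links to]}; Ni = out-degree of page i.
    pages = list(graph)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new_pr = {}
        for p in pages:
            # Sum PR(i)/Ni over every page i that has a backlink to p.
            backlink_sum = sum(pr[i] / len(graph[i])
                               for i in pages if p in graph[i])
            new_pr[p] = c * backlink_sum
        pr = new_pr
    return pr

print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))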
CLEVER
• Identify authoritative and hub pages.
• Authoritative Pages:
– Highly important pages.
– Best source for requested information.
• Hub Pages:
– Contain links to highly important pages.
HITS
• Hyperlink-Induced Topic Search
• Based on a set of keywords, find set of relevant
pages – R.
• Identify hub and authority pages for these.
– Expand R to a base set, B, of pages linked to or from R.
– Calculate weights for authorities and hubs.
• Pages with highest ranks in R are returned.
HITS Algorithm
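The original slide presents the algorithm only as a figure. A minimal sketch of the standard HITS weight updates is given below, assuming the graph is already restricted to the base set B: a page's authority weight sums the hub weights of the pages linking to it, its hub weight sums the authority weights of the pages it links to, and the weights are normalized each round.

def hits(graph, iters=20):
    # graph: {page: [pages it links to]} restricted to the base set B.
    pages = list(graph)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority weight: sum of hub weights of pages pointing to p.
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        # Hub weight: sum of authority weights of pages p points to.
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        # Normalize so the weights stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub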
Web Usage Mining
• Mines Web usage data: logs of the pages accessed by users at a Web server.
• Goal: discover patterns in how users access and navigate pages (sessions, frequently referenced sequences).
Web Usage Mining Applications
• Personalization
• Improve structure of a site’s Web pages
• Aid in caching and prediction of future page
references
• Improve design of individual pages
• Improve effectiveness of e-commerce (sales
and advertising)
Web Usage Mining Activities
• Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a sitting.
• Pattern Discovery
– Count patterns that occur in sessions.
– A pattern is a sequence of page references in a session.
– Similar to association rules:
• Transaction: session
• Itemset: pattern (or subset)
• Order is important
• Pattern Analysis
ARs in Web Mining
• Web Mining:
– Content
– Structure
– Usage
• Frequent patterns of sequential page references
in Web searching.
• Uses:
– Caching
– Clustering users
– Developing user profiles
– Identifying important pages
Web Usage Mining Issues
• Identification of the exact user is not possible.
• The exact sequence of pages referenced by a user cannot be determined, due to caching.
• Sessions are not well defined.
• Security, privacy, and legal issues.
Web Log Cleansing
• Replace source IP address with unique but
non-identifying ID.
• Replace exact URL of pages referenced with
unique but non-identifying ID.
• Delete error records and records not containing page data (such as figures and code).
Sessionizing
• Divide Web log into sessions.
• Two common techniques:
– Number of consecutive page references from a
source IP address occurring within a predefined
time interval (e.g. 25 minutes).
– All consecutive page references from a source IP
address where the interclick time is less than a
predefined threshold.
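A minimal sketch of the first technique above (a fixed time window per session); the log format and the 25-minute window are illustrative assumptions.

from datetime import timedelta

def sessionize(log, window_minutes=25):
    # log: list of (ip, timestamp, page) tuples sorted by timestamp.
    window = timedelta(minutes=window_minutes)
    sessions = {}   # ip -> list of sessions; each session is a list of pages
    start = {}      # ip -> timestamp when the current session began
    for ip, ts, page in log:
        if ip not in sessions or ts - start[ip] > window:
            sessions.setdefault(ip, []).append([])   # open a new session
            start[ip] = ts
        sessions[ip][-1].append(page)
    return sessions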
Data Structures
• Keep track of patterns identified during Web
usage mining process
• Common techniques:
– Trie
– Suffix Tree
– Generalized Suffix Tree
– WAP Tree
Trie vs. Suffix Tree
• Trie:
– Rooted tree.
– Each edge is labeled with a character (page) from the pattern.
– A path from root to leaf represents a pattern.
• Suffix Tree:
– A single child is collapsed with its parent; the edge carries the labels of both prior edges.
Trie and Suffix Tree
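The trie and suffix tree figures are not reproduced. A minimal trie sketch follows, storing each session as a root-to-leaf path with an occurrence count at the end of the pattern ('#' is an invented count marker).

def trie_insert(root, pattern):
    # root: nested dict; each key is a page (character), '#' holds a count.
    node = root
    for page in pattern:
        node = node.setdefault(page, {})
    node['#'] = node.get('#', 0) + 1

root = {}
for session in ["ABC", "ABD", "AB"]:
    trie_insert(root, session)
print(root)   # counts recorded at the end of each inserted pattern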
Generalized Suffix Tree
• Suffix tree for multiple sessions.
• Contains patterns from all sessions.
• Each node maintains a count of the frequency of occurrence of its pattern.
• WAP Tree:
A compressed version of the generalized suffix tree.
Types of Patterns
• Algorithms have been developed to discover
different types of patterns.
• Properties:
– Ordered – Characters (pages) must occur in the exact order as in the original session.
– Duplicates – Duplicate characters are allowed in the pattern.
– Consecutive – All characters in the pattern must occur consecutively in the given session.
– Maximal – The pattern is not a subsequence of another pattern.
Pattern Types
• Association Rules
None of the properties hold
• Episodes
Only ordering holds
• Sequential Patterns
Ordered and maximal
• Forward Sequences
Ordered, consecutive, and maximal
• Maximal Frequent Sequences
All properties hold
Episodes
• Partially ordered set of pages
• Serial episode – totally ordered with time
constraint
• Parallel episode – partially ordered with time constraint
• General episode – partially ordered with no time constraint
DAG for Episode
Spatial Mining Outline
Goal: Provide an introduction to some spatial
mining techniques.
• Introduction
• Spatial Data Overview
• Spatial Data Mining Primitives
• Generalization/Specialization
• Spatial Rules
• Spatial Classification
• Spatial Clustering
Spatial Object
• Contains both spatial and nonspatial
attributes.
• Must have a location-type attribute:
– Latitude/longitude
– Zip code
– Street address
• May retrieve object using either (or both)
spatial or nonspatial attributes.
Spatial Data Mining Applications
• Geology
• GIS Systems
• Environmental Science
• Agriculture
• Medicine
• Robotics
• May involve both spatial and temporal aspects
Spatial Queries
• Spatial selection may involve specialized selection comparison operations:
– Near
– North, South, East, West
– Contained in
– Overlap/intersect
• Region (Range) Query – find objects that intersect a given region.
• Nearest Neighbor Query – find the object closest to an identified object.
• Distance Scan – find objects within a certain distance of an identified object, where the distance is made increasingly larger.
Spatial Data Structures
• Data structures designed specifically to store or index
spatial data.
• Often based on B-trees or binary search trees.
• Cluster data on disk based on geographic location.
• May represent a complex spatial structure by placing the spatial object in a containing structure of a specific geographic shape.
• Techniques:
– Quad Tree
– R-Tree
– k-D Tree
MBR
• Minimum Bounding Rectangle
• Smallest rectangle that completely contains
the object
MBR Examples
Quad Tree
• Hierarchical decomposition of the space into
quadrants (MBRs)
• Each level in the tree represents the object as
the set of quadrants which contain any
portion of the object.
• Each level is a more exact representation of
the object.
• The number of levels is determined by the
degree of accuracy desired.
Quad Tree Example
R-Tree
• As with the quad tree, the region is divided into successively smaller rectangles (MBRs).
• Rectangles need not be of the same size or
number at each level.
• Rectangles may actually overlap.
• Lowest level cell has only one object.
• Tree maintenance algorithms similar to those
for B-trees.
R-Tree Example
K-D Tree
• Designed for multi-attribute data, not
necessarily spatial
• Variation of binary search tree
• Each level is used to index one of the
dimensions of the spatial object.
• Lowest level cell has only one object
• Divisions not based on MBRs but successive
divisions of the dimension range.
k-D Tree Example
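The example figure is not reproduced. A minimal k-d tree insertion sketch follows, cycling through the dimensions level by level as described above; the example points are invented.

class KDNode:
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None

def kd_insert(node, point, depth=0):
    # Each tree level splits on one dimension, cycling through them.
    if node is None:
        return KDNode(point)
    axis = depth % len(point)
    if point[axis] < node.point[axis]:
        node.left = kd_insert(node.left, point, depth + 1)
    else:
        node.right = kd_insert(node.right, point, depth + 1)
    return node

root = None
for p in [(35, 60), (20, 45), (51, 75), (60, 80)]:
    root = kd_insert(root, p)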
Topological Relationships
• Disjoint
• Overlaps or intersects
• Equals
• Covered by, inside, or contained in
• Covers or contains
Distance Between Objects
• Euclidean
• Manhattan
• Extensions:
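For concreteness, the two basic distances between point objects (the slide's extensions to nonpoint objects are not covered in this sketch):

def euclidean(a, b):
    # Straight-line distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    # City-block distance: sum of per-dimension differences.
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))  # 5.0 7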
Progressive Refinement
• Produce approximate answers prior to more accurate ones.
• Filter out data that cannot be part of the answer.
• Hierarchical view of data based on spatial relationships.
• A coarse predicate is recursively refined.
Progressive Refinement
Spatial Data Dominant Algorithm
STING
• STatistical Information Grid-based
• Hierarchical technique to divide area into
rectangular cells
• Grid data structure contains summary
information about each cell
• Hierarchical clustering
• Similar to quad tree
STING
STING Build Algorithm
STING Algorithm
Spatial Rules
• Characteristic Rule
The average family income in Dallas is $50,000.
• Discriminant Rule
The average family income in Dallas is $50,000,
while in Plano the average income is $75,000.
• Association Rule
The average family income in Dallas for families
living near White Rock Lake is $100,000.
Spatial Association Rules
• Either antecedent or consequent must contain
spatial predicates.
• View underlying database as set of spatial
objects.
• May create using a type of progressive
refinement
Spatial Association Rule Algorithm
Spatial Classification
• Partition spatial objects
• May use nonspatial attributes and/or spatial
attributes
• Generalization and progressive refinement
may be used.
ID3 Extension
• Neighborhood Graph
– Nodes – objects
– Edges – connects neighbors
• Definition of neighborhood varies
• ID3 considers nonspatial attributes of all
objects in a neighborhood (not just one) for
classification.
Spatial Decision Tree
• Approach similar to that used for spatial
association rules.
• Spatial objects can be described based on
objects close to them – Buffer.
• Description of class based on aggregation of
nearby objects.
Spatial Decision Tree Algorithm
Spatial Clustering
• Detect clusters of irregular shapes
• Use of centroids and simple distance
approaches may not work well.
• Clusters should be independent of order of
input.
Spatial Clustering
CLARANS Extensions
• Remove main memory assumption of
CLARANS.
• Use spatial index techniques.
• Use sampling and R*-tree to identify central
objects.
• Change cost calculations by reducing the
number of objects examined.
• Voronoi Diagram
Voronoi
SD(CLARANS)
• Spatial Dominant
• First clusters spatial components using
CLARANS
• Then iteratively replaces medoids, but limits
number of pairs to be searched.
• Uses generalization.
• Uses a learning tool to derive a description of the cluster.
SD(CLARANS) Algorithm
DBCLASD
• Extension of DBSCAN
• Distribution Based Clustering of LArge Spatial
Databases
• Assumes items in cluster are uniformly
distributed.
• Identifies distribution satisfied by distances
between nearest neighbors.
• Objects added if distribution is uniform.
DBCLASD Algorithm
Aggregate Proximity
• Aggregate Proximity – measure of how close
a cluster is to a feature.
• Aggregate proximity relationship finds the k
closest features to a cluster.
• CRH Algorithm – uses different shapes:
– Encompassing Circle
– Isothetic Rectangle
– Convex Hull
CRH
Temporal Mining Outline
Goal: Examine some temporal data mining
issues and approaches.
• Introduction
• Modeling Temporal Events
• Time Series
• Pattern Detection
• Sequences
• Temporal Association Rules
Temporal Database
• Snapshot – Traditional database
• Temporal – Multiple time points
Temporal Queries
• Notation: [tsq, teq] is the time interval of the query; [tsd, ted] is the valid-time interval of a database tuple.
• Intersection Query – retrieves tuples whose interval intersects the query interval.
• Inclusion Query – retrieves tuples whose interval lies entirely within the query interval.
• Containment Query – retrieves tuples whose interval contains the query interval.
• Point Query – the tuple retrieved is valid at a particular point in time.
Types of Databases
• Snapshot – No temporal support
• Transaction Time – Supports time when
transaction inserted data
– Timestamp
– Range
• Valid Time – Supports time range when
data values are valid
• Bitemporal – Supports both transaction and
valid time.
Modeling Temporal Events
• Techniques to model temporal events.
• Often based on earlier approaches.
• Finite State Recognizer (Machine) (FSR)
– Each event recognizes one character.
– Temporal ordering indicated by arcs.
– May recognize a sequence.
– Requires precisely defined transitions between states.
• Approaches
– Markov Model
– Hidden Markov Model
– Recurrent Neural Network
FSR
Markov Model (MM)
• Directed graph
– Vertices represent states.
– Arcs show transitions between states.
– Each arc has a probability of transition.
– At any time, one state is designated as the current state.
• Markov Property – Given the current state, the transition probability is independent of any previous states.
• Applications: speech recognition, natural language processing
Markov Model
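The model in the figure is not reproduced. A minimal sketch of walking a Markov model follows: the next state is chosen using only the current state's transition probabilities (the Markov property). The two-state transition matrix is invented.

import random

# Transition probabilities: transitions[state] = {next_state: probability}
transitions = {
    "A": {"A": 0.2, "B": 0.8},
    "B": {"A": 0.6, "B": 0.4},
}

def walk(state, steps):
    path = [state]
    for _ in range(steps):
        # Sample the next state from the current state's distribution.
        state = random.choices(list(transitions[state]),
                               weights=list(transitions[state].values()))[0]
        path.append(state)
    return path

print(walk("A", 5))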
Hidden Markov Model (HMM)
• Like a Markov Model, but states need not correspond to observable states.
• An HMM models a process that produces as output a sequence of observable symbols.
• The HMM will actually output these symbols.
• Associated with each node is the probability of the observation of an event.
• Train the HMM to recognize a sequence.
• Transition and observation probabilities are learned from a training set.
Hidden Markov Model
Modified from [RJ86]
HMM Algorithm
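The HMM algorithm appears only as a figure in the original. As a hedged sketch, the standard forward algorithm below computes the probability that an HMM produced an observed sequence (the first question on the next slide); the two-state model and its probabilities are invented.

def forward(obs, states, start_p, trans_p, emit_p):
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

states = ("Hot", "Cold")
start_p = {"Hot": 0.6, "Cold": 0.4}
trans_p = {"Hot": {"Hot": 0.7, "Cold": 0.3}, "Cold": {"Hot": 0.4, "Cold": 0.6}}
emit_p = {"Hot": {"H": 0.8, "T": 0.2}, "Cold": {"H": 0.3, "T": 0.7}}
print(forward("HTH", states, start_p, trans_p, emit_p))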
HMM Applications
• Given a sequence of events and an HMM,
what is the probability that the HMM
produced the sequence?
• Given a sequence and an HMM, what is the
most likely state sequence which produced
this sequence?
Recurrent Neural Network (RNN)
• Extension to basic NN
• A neuron can obtain input from any other neuron (including those in the output layer).
• Can be used for both recognition and prediction applications.
• Time to produce output is unknown.
• Temporal aspect added by backlinks.
RNN
Time Series
• Set of attribute values over time
• Time Series Analysis – finding patterns in the
values.
– Trends
– Cycles
– Seasonal
– Outliers
Analysis Techniques
• Smoothing – Moving average of attribute
values.
• Autocorrelation – relationships between
different subseries
– Yearly, seasonal
– Lag – Time difference between related items.
– Correlation Coefficient r
Smoothing
Correlation with Lag of 3
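The smoothing and lag-correlation figures are not reproduced. A minimal sketch of both computations, a moving average and the correlation coefficient r at a given lag, on an invented series:

def moving_average(series, k=3):
    # Smooth by averaging each window of k consecutive values.
    return [sum(series[i:i + k]) / k for i in range(len(series) - k + 1)]

def lag_correlation(series, lag):
    # Pearson correlation between the series and itself shifted by `lag`.
    x, y = series[:-lag], series[lag:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

data = [5, 8, 12, 6, 9, 13, 5, 8, 12, 6]
print(moving_average(data), lag_correlation(data, 3))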
Similarity
• Determine similarity between a target pattern,
X, and sequence, Y: sim(X,Y)
• Similar to Web usage mining
• Similar to earlier word processing and spelling
corrector applications.
• Issues:
– Length
– Scale
– Gaps
– Outliers
– Baseline
Longest Common Subseries
• Find longest subseries they have in common.
• Ex:
– X = <10,5,6,9,22,15,4,2>
– Y = <6,9,10,5,6,22,15,4,2>
– Output: <22,15,4,2>
– Sim(X,Y) = l/n = 4/9
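A sketch of the computation behind this example: dynamic programming finds the longest run of values the two series share consecutively, and similarity is that length divided by the length of the longer series (l/n = 4/9 here).

def longest_common_subseries(x, y):
    # best[i][j] = length of the common run ending at x[i-1], y[j-1].
    best = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    length = end = 0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                best[i][j] = best[i - 1][j - 1] + 1
                if best[i][j] > length:
                    length, end = best[i][j], i
    return x[end - length:end]

X = [10, 5, 6, 9, 22, 15, 4, 2]
Y = [6, 9, 10, 5, 6, 22, 15, 4, 2]
common = longest_common_subseries(X, Y)
print(common, len(common) / max(len(X), len(Y)))   # [22, 15, 4, 2] 0.444...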
Similarity based on Linear Transformation
• Linear transformation function f
– Converts a value from one series to a value in the second.
• ef – tolerated difference in results
• d – time value difference allowed
Prediction
• Predict future value for time series
• Regression may not be sufficient
• Statistical Techniques
– ARMA
– ARIMA
• NN
Pattern Detection
• Identify patterns of behavior in time series
• Speech recognition, signal processing
• FSR, MM, HMM
String Matching
• Find given pattern in sequence
• Knuth-Morris-Pratt: Construct FSM
• Boyer-Moore: Construct FSM
Distance between Strings
• Cost to convert one to the other
• Transformations
– Match: Current characters in both strings are the same.
– Delete: Delete the current character in the input string.
– Insert: Insert the current character of the target string into the input string.
Distance between Strings
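The distance table in the figure is not reproduced. A minimal sketch of the conversion cost using only the three transformations above (a match is free, insert and delete cost 1 each, so a mismatch costs a delete plus an insert):

def string_distance(source, target):
    # dist[i][j] = cost to convert source[:i] into target[:j].
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i                     # delete all remaining input
    for j in range(1, n + 1):
        dist[0][j] = j                     # insert all remaining target
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # A mismatch must be handled as delete + insert (cost 2).
            step = 0 if source[i - 1] == target[j - 1] else 2
            dist[i][j] = min(dist[i - 1][j - 1] + step,   # match
                             dist[i - 1][j] + 1,          # delete
                             dist[i][j - 1] + 1)          # insert
    return dist[m][n]

print(string_distance("ABCDE", "ABDEF"))   # 2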
Frequent Sequence
Frequent Sequence Example
• Purchases made by
customers
• s(<{A},{C}>) = 1/3
• s(<{A},{D}>) = 2/3
• s(<{B,C},{D}>) = 2/3
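A sketch of how supports like those above can be counted: a sequence pattern is contained in a customer's purchase sequence if its itemsets appear, in order, as subsets of the customer's transactions, and support is the fraction of customers containing the pattern. The three customer sequences below are invented so that they reproduce the slide's support values.

def contains(sequence, pattern):
    # True if pattern's itemsets appear, in order, as subsets of transactions.
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(customers, pattern):
    return sum(contains(seq, pattern) for seq in customers) / len(customers)

customers = [
    [{"B", "C"}, {"A"}, {"C"}],        # customer 1
    [{"A", "B", "C"}, {"D"}],          # customer 2
    [{"B", "C"}, {"A"}, {"D"}],        # customer 3
]
print(support(customers, [{"A"}, {"C"}]))       # 1/3
print(support(customers, [{"A"}, {"D"}]))       # 2/3
print(support(customers, [{"B", "C"}, {"D"}]))  # 2/3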
Frequent Sequence Lattice
SPADE
• Sequential Pattern Discovery using
Equivalence classes
• Identifies patterns by traversing lattice in a top
down manner.
• Divides lattice into equivalent classes and
searches each separately.
• ID-List: Associates customers and transactions
with each item.
SPADE Example
• ID-List for Sequences of length 1:
• Count for <{A}> is 3
• Count for <{A},{D}> is 2
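The ID-lists themselves are not reproduced. A hedged sketch of the ID-list idea follows: each item maps to (customer, transaction time) pairs, the ID-list of a longer sequence comes from a temporal join, and the support count is the number of distinct customers remaining. The example ID-lists are invented to match the counts above.

def join(idlist_prefix, idlist_item):
    # Temporal join: keep (customer, time) of the item when the same
    # customer has an earlier occurrence of the prefix.
    return [(cust, t) for cust, t in idlist_item
            if any(c == cust and pt < t for c, pt in idlist_prefix)]

def count(idlist):
    # Support count = number of distinct customers in the ID-list.
    return len({cust for cust, _ in idlist})

# Invented ID-lists: (customer, transaction time) pairs per item.
A = [(1, 10), (2, 15), (3, 10)]
D = [(1, 20), (3, 25)]
print(count(A))           # 3  -> count for <{A}>
print(count(join(A, D)))  # 2  -> count for <{A},{D}>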
Q1 Equivalence Classes
SPADE Algorithm
Temporal Association Rules
• Transaction has time:
<TID,CID,I1,I2, …, Im,ts,te>
• [ts,te] is range of time the transaction is active.
• Types:
– Inter-transaction rules
– Episode rules
– Trend dependencies
– Sequence association rules
– Calendric association rules
Inter-transaction Rules
• Intra-transaction association rules
Traditional association Rules
• Inter-transaction association rules
– Rules across transactions
– Sliding window – How far apart (time or number
of transactions) to look for related itemsets.
Episode Rules
• Association rules applied to sequences of
events.
• Episode – set of event predicates and partial
ordering on them
Sequence Association Rules
• Association rules involving sequences
• Ex:
<{A},{C}> → <{A},{D}>
Support = 1/3
Confidence = 1
Calendric Association Rules
• Each transaction has a unique timestamp.
• Group transactions based on time interval
within which they occur.
• Identify large itemsets by looking at
transactions only in this predefined interval.
UNIT V-CASE STUDY
DATA WAREHOUSING FOR THE TAMIL NADU GOVERNMENT
GISTNIC DATA WAREHOUSE
The General Information Services Terminal of National Informatics Centre (GISTNIC) data warehouse is an initiative
taken by the National Informatics Centre to provide a comprehensive information database on national issues, ranging
across diverse subjects from food and agriculture to trends in the economy and the latest updates in science and technology.
The GISTNIC data warehouse is a web-enabled SAS software solution.
The data warehouse aims to provide online information to key decision makers in the government sector,
enabling them to make better strategic decisions with regard to administrative policies and investments.
The Government of Tamil Nadu is the first to perceive the need for and importance of converting data into
valuable information for better decision making.
The GISTNIC web site has an online data warehouse which includes data marts on village amenities, rainfall,
agricultural census data, essential commodity prices, etc.
OBJECTIVES OF WEB ENABLED DATA WAREHOUSE
The objective is to place powerful decision-making tools in the hands of the end users in order to facilitate prompt
decision making.
DATA WAREHOUSING FOR THE MINISTRY OF
COMMERCE
The Ministry of Commerce has set up the following seven export processing zones at various locations:
1. Kandla Free Trade Zone, Gandhidham
2. Santacruz Electronics Export Processing Zone, Bombay
3. Falta Export Processing Zone, West Bengal
4. Madras Export Processing Zone, Chennai
5. Cochin Export Processing Zone, Cochin
6. Noida Export Processing Zone, Noida
7. Visakhapatnam Export Processing Zone, Visakhapatnam
This case study report presents how a data warehouse can be effectively built for the Ministry of Commerce.
The following are the objectives:
1. Globalization of India's foreign trade
2. Attracting foreign investment
3. Scaling down tariff barriers
4. Encouraging high and internationally acceptable standards of quality
5. Simplification and streamlining of procedures governing imports and exports
DATA WAREHOUSE:
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of
management's decision-making process.
OBJECTIVES OF DATA WAREHOUSE FOR THE MINISTRY OF COMMERCE
The Ministry of Commerce has been regularly reviewing the data warehouse in its board meetings. The ministry is
equipped with all analysis variables in a reporting form, and the performance of each zone and the progress of exports are
reviewed for every zone of the country.
The data warehouse takes all analysis variables into consideration from all the zones, with an option to drill down
to daily data.
DATA WAREHOUSE IMPLEMENTATION STRATEGY FOR EXPORT PROCESSING ZONES
INFRASTRUCTURE
The basic infrastructure required for building the warehouse for the Ministry of Commerce is based on the
communication infrastructure, hardware/software/tools, and manpower.
The common infrastructure
This includes all the tasks necessary to provide the technical basis for the warehouse environment, including the
connectivity between the legacy environment and the new warehouse environment at the network as well as the database level.
Manpower requirements
The senior officials in the Ministry of Commerce sponsored the whole warehouse implementation and played active
roles as EXIM policy and business architects for the data warehouse, and also as subject-area specialists.
KEY AREAS FOR ANALYSIS
The data topic deals with data access, mapping, derivation, transformation, and aggregation according to the
requirements of the ministry. The key decision-making areas, which will be used for analytical processing and for data mining
at the Ministry of Commerce, are listed as follows:
1. Unit-wise, sector-wise, and country-wise imports and exports
2. Direction of imports and exports
3. Sector-wise, country-wise, and zone-wise trends in imports and exports
4. Comparative country-wise exports and imports for each sector
5. DTA sales
6. Claims of reimbursement of central sales tax of zones
7. Deemed export benefits
8. Employment generation
9. Investments in the zone
10. Deployment of infrastructure
11. Growth of industrial units
12. Occupancy details
IMPLEMENTATION OF DATA WAREHOUSE
Operational data systems
The operational data system keeps the organization running by supporting daily transactions such as import
and export bills submitted to the customs department in each zone, daily transactions of permissions, allotments, etc.
ARCHITECTURE
The architecture of the OLAP implementation consists of five layers.
All seven zones have DBMS/RDBMS data for internal management of zone activities and have been forwarding
MIS reports to the MoC, New Delhi.
The second layer is located at the MoC, New Delhi, with a large metadata repository and the data warehouse.
The metadata, the warehouse tools, and the OLAP server handling and maintaining the same are the focus of
level 3 of the architecture.
[Architecture diagram: front-end tools used by the secretary, an OLAP server with an MDDB and metadata, the central
data warehouse, and data marts for the zones (MEPZ, SEPZ, CEPZ, KAEPZ, NEPZ, VEPZ, FEPZ).]
DESIGN OF ANALYSIS/CATEGORY VARIABLES
The data model is prepared after the entire data availability and the data requirements are analyzed; the analysis
variables for building the multidimensional cube are listed as follows:
1. Employment generation, with managerial/skilled/unskilled classification and zone/unit/industry break-up
2. Investments in the zone
3. Performance of units and close monitoring during production
4. Deployment of infrastructure, etc.
Related tables:
ep_mast, ep_stmst, dist_mast, states, indu_mast, eo_mast, eo_stmst
Analysis variables:
1. EPZ or EOU
2. Zone
3. Type of approval
4. Type of industry
5. State
6. District
7. Year of approval
8. Month of approval
9. Day of approval
10. Year of production commencement
11. Month of production commencement
12. Day of production commencement
13. Current status
14. Date of current status
15. Net foreign exchange (NFE) percentage
16. Stipulated NFE
17. Number of approvals
18. Number of units
Export and import values, with zone, sector, port, year, month, and day break-ups, deal with the following
performances:
Zone-wise performance
Industry-wise performance
Country-wise performance
This will indicate the following directions of exports:
Country-wise performance, overall and with zone break-up
Port-wise performance, which will inform the examination of infrastructure in these ports
Export performance during different time periods, and the analysis of the same
Trends over years/quarters/months for different countries, sectors, and zones
Comparative country-wise imports/exports for each sector
Related tables:
shipbill, country, industry, currency, shipment_mode, export, bill entry
Export analysis variables:
Zone, export type, year of shipping bill, month of shipping bill, country, currency, mode of shipment, destination port,
value of export, etc.
Import analysis variables:
Auto/approval, zone, import type, import year, import month, import day, type of goods, import value, duty foregone,
import country, and mode of shipment
DEEMED EXPORT BENEFITS
Analysis variables:
Zone, sector, claims received, amount disbursed, year/quarter/month
EMPLOYMENT GENERATION
Analysis variables:
Zone, male/female, managerial/skilled/unskilled, number of employees
INVESTMENTS IN THE ZONE
Analysis variables:
Zone, unit, NRI foreign investment, Indian investment, remittances received, approved value.
CONCLUSION
The data warehouse for the EPZs provides the ability to scale to large volumes of data and a seamless presentation of
historical, projected, and derived data.
It helps the planners in what-if analysis and planning without depending on the zones to supply the data.
The time lag between the zones and the ministry is eliminated, and the analysis can then be carried out at the
speed of thought.
The data warehouse for the Ministry of Commerce can add more dimensions to the proposed warehouse with data
collected from other offices, to evolve a data warehouse model for better analysis of the promotion of imports and exports
in the country.
This will provide an excellent executive information system to the secretary and joint secretaries of the ministry.
DATA WAREHOUSING FOR THE GOVERNMENT OF ANDHRA
PRADESH
DATA WAREHOUSE FOR FINANCE DEPARTMENT
Responsibilities of the finance department:
The finance department of the Government of Andhra Pradesh has the following responsibilities:
1. Preparing a department-wise budget up to the sub-detail head and submitting it to the legislature for approval.
2. Watching government expenditure and revenue department-wise.
3. Looking after development activities under various plan schemes.
4. Monitoring other administrative matters related to all heads of departments.
Treasuries in Andhra Pradesh:
Money standing in the government account is kept either in treasuries or in banks. Money deposited in banks is
considered a general fund held in the books of the banks on behalf of the state.
Director of treasuries:
Treasuries are under the general control of the Director of Treasuries and Accounts.
Sub-treasuries:
If the requirements of public business make it necessary, one or more sub-treasuries may be established under a
district treasury.
The accounts of receipts and payments at a sub-treasury must be included monthly in the accounts of the district
treasury.
Treasuries handle all government receipts and payments.
Every transaction in the government is made through the related departments.
ACCOUNT HEAD STRUCTURE (DESCRIPTION OF THE ACCOUNT HEAD – LENGTH OF THE CODE):
Major – 4
Sub-major – 2
Minor – 3
Group sub-head – 1
Sub-head – 2
Detail head – 3
Sub-detail head – 3
MAJOR HEAD NUMBER RANGES:
Less than 2000 – Receipts
2000 to 3999 – Service major heads
4000 to 5999 – Capital outlay
6000 to 7999 – Loans
More than 8000 – Deposits
A project for building a data warehouse for OLAP was implemented by NIC for the Department of Treasuries. The
concept of building a DW in the Department of Treasuries was established to provide easy access to integrated, up-to-date
data related to various aspects of the department's functions.
DW technology is used to develop analytical tools designed to provide support for decision making at all levels
of the department.
Traditional information systems implemented in the Department of Treasuries are based on transactional
databases, which are not designed to provide fast and efficient access to the information critical for decision making.
Data required for analysis are typically distributed among a number of isolated information systems meeting the
needs of different sub-treasuries.
The data warehouse technology provided to the Department of Treasuries by the National Informatics Centre
eliminates these problems by storing current and historical data from disparate information systems.
The data warehouse provides efficient analysis and monitoring of the financial data of treasuries.
It also evaluates the internal and external business factors related to the operational, economic, and financial
conditions of treasuries' budget utilization.
DIMENSIONS COVERED UNDER FINANCE DATA WAREHOUSE
The different dimensions taken for the drill-down approach, against two measures (payments and receipts), are:
1. Department
2. District treasury office
3. Sub-treasury office
4. Drawing and disbursing officer
5. Time
6. Bank-wise
7. Based on different forms
COGNOS GRAPHIC USER INTERFACE FOR TREASURIES DATA WAREHOUSE
Impromptu:
Impromptu is used for generating various kinds of reports, such as simple crosstabs.
Transformer:
Transformer model objects contain definitions of queries, dimensions, measures, dimension views, user classes, and
related authentication information, as well as objects for one or more cubes that Transformer creates for viewing and
reporting in PowerPlay.
PowerPlay:
Cognos PowerPlay is used to populate reports with a drill-down facility.
Scheduler:
Scheduler coordinates the execution of automated processes, called tasks, on a set date and time or at recurring
intervals.
Authenticator:
Authenticator is a user-class management system. It provides Cognos client applications with the ability to create
and show data based on user-authenticated access.
DATA WAREHOUSING IN HEWLETT-PACKARD
HP can easily access and quickly analyze enormous volumes of sell-through data to help its reseller
customers improve the efficiency and profitability of their businesses, with an OLAP system based on Microsoft SQL Server
and Knosys ProClarity.
Hewlett-Packard is a worldwide market leader in the $18 million inkjet industry.
In the past, HP's brand recognition and reputation for reliability were enough to ensure that a reseller would carry
HP products.
ACCESS TO INFORMATION NEEDED USING DATA WAREHOUSING TECHNOLOGY
HP has captured and stored information both from primary research and from third parties.
The business analysis group decided they needed a system that would provide market-metric data to help field
sales force managers or account teams make brand and channel management decisions.
HP required a system that was low cost, low maintenance, and as simple to administer as possible. So the
group turned to Knosys, a Boise, Idaho-based software company that has developed a business analysis/online analytical
processing (OLAP) package called ProClarity.
The solution enables HP to move the data so quickly, and at such a low cost of maintenance and ownership, that
it solves these problems.
Knosys helped the group build the data flow algorithms with the Microsoft Visual Basic development system
and SQL Server 7.0 Data Transformation Services.
Due to HP's enormous sell-through data volumes, it would take too long to build analytical models with a
pure, multidimensional OLAP solution.
HP used SQL Server 7.0 virtual cube and cube partitioning capabilities.
The virtual cube capability allows decision makers to cross-analyze data from all these OLAP sources.
Cube partitioning allows HP to manage a large number of OLAP cubes more effectively.
Knosys ProClarity provides HP decision makers with the key to analyzing masses of data.
ProClarity is fully integrated with Microsoft products, and its PC-based client is modeled after Internet Explorer 4.0.
ProClarity's powerful analytical features take full advantage of the robust capabilities found in SQL Server 7.0
OLAP Services.
CONCLUSION
HP's ProClarity system and SQL Server 7.0 provide the system with more accurate, detailed, and timely data,
which makes the business more efficient and helps its resellers make their businesses more efficient.
DATA WAREHOUSING IN
LEVI STRAUSS
In 1998, ArsDigita Corporation built a web service as a front end to an experimental custom clothing factory
operated by Levi Strauss.
The whole purpose of the factory and web service was to test and analyze consumer reaction to this
method of buying clothes. Therefore, a data warehouse was built into the project almost from the beginning.
The public website was supported by a mid-range HP Unix server that had ample leftover capacity to run the data
warehouse.
A new 'dw' Oracle user was created to support the data warehouse.
SELECT was GRANTed on the OLTP tables to the 'dw' user, and procedures were written to copy all the data from
the OLTP system into a star schema of tables owned by the 'dw' user.
This kind of schema has been proven to scale to the world's largest data warehouses.
In a star join schema, there is one fact table that references a number of dimension tables.
The following dimension tables were designed after discussion with Levi's executives:
1. TIME: for queries comparing sales by season, quarter, or holiday
2. PRODUCT: for queries comparing sales by color or style
3. SHIP TO: for queries comparing sales by region or state
4. PROMOTION: for queries aimed at determining the relationship between discounts and sales
5. USER EXPERIENCE: for queries looking at returned versus exchanged versus accepted items
DATA WAREHOUSING IN
WORLD BANK
The World Bank collects and maintains huge data on economic and developmental parameters for all the
third-world countries across the globe.
The bank used to analyze this huge body of data manually, and later with limited analysis tools.
The bank collects and analyzes macroeconomic and financial statistics, and also information on parameters such as
poverty, health, education, the environment, and the public sector.
THE LIVE DATABASE DATA WAREHOUSE
The OLAP cubes were defined for this database using the OLAP server module of SQL Server 2000.
Universal access was provided for this data warehouse, which was called the Live Database (LDB).
BENEFITS OF THE SECOND-GENERATION LDB DATA WAREHOUSE
The first-generation LDB data warehouse had certain limitations. Therefore, the second-generation LDB data
warehouse was built using SQL Server 2000 Analysis Server along with ProClarity from Knosys Corporation.
The package offered direct user functionality which would otherwise require technical intervention by a programmer.
ProClarity also provides web enablement, thereby ensuring universal accessibility.
This results in significant cost savings by reducing the time and effort required to prepare the large variety of reports
needed to suit the varying needs of economists and other governmental decision makers, aiding effective and better
economic planning.