BI-Keynote - SiliconIndia

Connect. Collaborate. Innovate.
GlobalLogic
Emerging Trends and Technologies in BI
Sunil K Singh
January, 2011
© Copyright GlobalLogic 2011
Company Overview
– 10 years of leadership in global software R&D services
– Provides full lifecycle product engineering and advisory services for ISVs and software-enabled businesses
– Privately held and backed by Sequoia Capital, NEA, Draper Atlantic / NAV and Goldman Sachs
– US $170M in revenue, 40%+ CAGR
– 175+ client partnerships under active management
– 5,500+ employees
– Headquartered in the US with business offices in the UK, Germany, Israel and India
– Global R&D Centers and Innovation Labs in the US, Ukraine, India, China and Argentina
“A product development
company like GlobalLogic is
doing more than just
providing offshore
developers — it is seeking to
collaborate with clients at a
strategic level and provide
executives with on-demand
access to global innovation
networks.”
— Forrester Research
“Being Innovative Means
Moving Beyond the Hype”
GlobalLogic: A Software R&D Services Company
GlobalLogic has created a network of global innovation
hubs made up of some of the brightest and most
innovative software minds, connected by a platform
that supports agile collaboration. Together they
accelerate breakthrough products to market.
Industry Focus
Digital Media
Retail
Finance
Infrastructure
Electronics
Healthcare
Telecom
Mobile
The BI (R) Evolution!
First came the Relational Database
Typical Retail Operational Database
create table product_categories (
    product_category_id     integer primary key,
    product_category_name   varchar(100) not null
);

create table manufacturers (
    manufacturer_id         integer primary key,
    manufacturer_name       varchar(100) not null
);

create table products (
    product_id              integer primary key,
    product_name            varchar(100) not null,
    product_category_id     references product_categories,
    manufacturer_id         references manufacturers
);

create table cities (
    city_id                 integer primary key,
    city_name               varchar(100) not null,
    state                   varchar(100) not null,
    population              integer not null
);

create table stores (
    store_id                integer primary key,
    city_id                 references cities,
    store_location          varchar(200) not null,
    phone_number            varchar(20)
);

create table sales (
    product_id              not null references products,
    store_id                not null references stores,
    quantity_sold           integer not null,
    date_time_of_sale       date not null
);
Marketing Trying to do Some Sales Analysis
How many Oreo cookies were sold yesterday in cities with population less than
fifty thousand people?
select sum(sales.quantity_sold)
from sales, products, product_categories,
manufacturers, stores, cities
where manufacturer_name = 'Oreo'
and product_category_name = 'cookie'
and cities.population < 50000
and trunc(sales.date_time_of_sale) =
trunc(sysdate-1) -- restrict to yesterday
and sales.product_id = products.product_id
and sales.store_id = stores.store_id
and products.product_category_id =
product_categories.product_category_id
and products.manufacturer_id =
manufacturers.manufacturer_id
and stores.city_id = cities.city_id;
This query joins six tables through five join conditions. It is a very expensive query
Let’s copy the data to another database for the marketing people
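To make the cost of this query concrete, here is a minimal sketch that runs it against an in-memory SQLite database from Python. It is an illustration under assumptions, not the presenter's setup: the column types on the foreign keys, the sample rows, and the helper names `build_demo_db` and `oreo_sales_in_small_cities` are all hypothetical, and SQLite's `date()` stands in for Oracle's `trunc(sysdate-1)` so the target day can be passed as a parameter.

```python
import sqlite3

# Operational schema from the slide, adapted to SQLite (explicit types on the
# foreign-key columns, ISO date strings instead of Oracle DATE values).
SCHEMA = """
create table product_categories (product_category_id integer primary key,
    product_category_name varchar(100) not null);
create table manufacturers (manufacturer_id integer primary key,
    manufacturer_name varchar(100) not null);
create table products (product_id integer primary key,
    product_name varchar(100) not null,
    product_category_id integer references product_categories,
    manufacturer_id integer references manufacturers);
create table cities (city_id integer primary key, city_name varchar(100) not null,
    state varchar(100) not null, population integer not null);
create table stores (store_id integer primary key,
    city_id integer references cities,
    store_location varchar(200) not null, phone_number varchar(20));
create table sales (product_id integer not null references products,
    store_id integer not null references stores,
    quantity_sold integer not null, date_time_of_sale text not null);
"""

# Same question as the slide: five join conditions across six tables.
QUERY = """
select sum(sales.quantity_sold)
from sales
join products on sales.product_id = products.product_id
join product_categories
  on products.product_category_id = product_categories.product_category_id
join manufacturers on products.manufacturer_id = manufacturers.manufacturer_id
join stores on sales.store_id = stores.store_id
join cities on stores.city_id = cities.city_id
where manufacturer_name = 'Oreo'
  and product_category_name = 'cookie'
  and cities.population < 50000
  and date(sales.date_time_of_sale) = ?
"""

def build_demo_db():
    """Create the schema and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("insert into product_categories values (1, 'cookie')")
    conn.execute("insert into manufacturers values (1, 'Oreo')")
    conn.execute("insert into products values (1, 'Oreo Original', 1, 1)")
    conn.executemany("insert into cities values (?, ?, ?, ?)",
                     [(1, "Smallville", "KS", 45000),
                      (2, "Metropolis", "NY", 8000000)])
    conn.executemany("insert into stores values (?, ?, ?, ?)",
                     [(1, 1, "Main St", None), (2, 2, "Fifth Ave", None)])
    conn.executemany("insert into sales values (?, ?, ?, ?)", [
        (1, 1, 12, "2011-01-10 09:30:00"),  # small city, target day
        (1, 1, 5, "2011-01-10 17:45:00"),   # small city, target day
        (1, 2, 40, "2011-01-10 11:00:00"),  # large city: filtered out
        (1, 1, 7, "2011-01-09 10:00:00"),   # previous day: filtered out
    ])
    return conn

def oreo_sales_in_small_cities(conn, day):
    """Return the summed quantity for the given ISO day, or None if no rows match."""
    return conn.execute(QUERY, (day,)).fetchone()[0]

conn = build_demo_db()
print(oreo_sales_in_small_cities(conn, "2011-01-10"))
```

Every row of `sales` has to be matched through all five joins before the filters can discard it, which is why running this shape of query on a busy operational database gets expensive.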
Then Came the Data Warehouse
Pick a FACT as the Center of Data Warehouse
Marketing Cares Most About Sales
Let us create a Fact table on sales
create table sales_fact (
    sales_date      date not null,
    product_id      integer,
    store_id        integer,
    unit_sales      integer,
    dollar_sales    number
);
You can fill this table at a scheduled time from the operational database
This is your ETL process
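A scheduled load of this kind can be sketched in a few lines of Python with SQLite. This is only an illustration under assumptions: the operational schema on the earlier slide has no price, so the `unit_price` column on `products` is hypothetical, as are the `run_etl` name and the sample rows. The job does a full refresh (delete, then reload), which is the simplest possible strategy and not necessarily what a production ETL would do.

```python
import sqlite3

def run_etl(conn):
    """Rebuild sales_fact from the operational sales table (full refresh)."""
    conn.execute("delete from sales_fact")
    conn.execute("""
        insert into sales_fact
            (sales_date, product_id, store_id, unit_sales, dollar_sales)
        select date(s.date_time_of_sale), s.product_id, s.store_id,
               sum(s.quantity_sold),
               sum(s.quantity_sold * p.unit_price)  -- unit_price is hypothetical
        from sales s
        join products p on p.product_id = s.product_id
        group by date(s.date_time_of_sale), s.product_id, s.store_id
    """)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table products (product_id integer primary key,
                           unit_price real not null);
    create table sales (product_id integer not null, store_id integer not null,
                        quantity_sold integer not null,
                        date_time_of_sale text not null);
    create table sales_fact (sales_date text not null, product_id integer,
                             store_id integer, unit_sales integer,
                             dollar_sales real);
    insert into products values (1, 2.50);
    insert into sales values (1, 1, 12, '2011-01-10 09:30:00');
    insert into sales values (1, 1, 5,  '2011-01-10 17:45:00');
    insert into sales values (1, 2, 40, '2011-01-10 11:00:00');
""")

run_etl(conn)
rows = conn.execute("select * from sales_fact order by store_id").fetchall()
for row in rows:
    print(row)
```

The three transaction rows collapse into two fact rows, one per day, product and store, so marketing queries never touch the transaction-level table.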
Different DIMENSIONS can be created about the FACT
For example, we are interested in sales from a store
Let us create a DIMENSION table
create table stores_dimension (
    stores_key      integer primary key,
    name            varchar(100),
    city            varchar(100),
    county          varchar(100),
    state           varchar(100),
    zip_code        varchar(100),
    date_opened     date,
    date_remodeled  date,
    store_size      varchar(100),
    ...
);
Now a query on sales by city takes one join across two tables
select sd.city, sum(f.dollar_sales)
from sales_fact f, stores_dimension sd
where f.store_id = sd.stores_key
group by sd.city
Traditional Approach to BI
[Figure: traditional BI pipeline.
Source systems: enterprise systems, core production systems, sales systems, financial systems, other systems & flat files, external data.
Staging: data is extracted from each source and loaded.
Transform: data cleanup, lookup, validation, value mapping, sort, join, aggregation, etc.
Storage: data warehouse, datamart, OLAP layer (slice, dice, rollup, drilldown, pivot).
End user tools: enterprise reporting (Crystal, BIRT…), analytic applications (SAS, SPSS …), machine learning, decision modeling, with a feedback loop.]
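The OLAP-layer operations named in this pipeline (slice, dice, rollup, drilldown) can be illustrated on a tiny in-memory cube. The sketch below is a teaching aid under assumptions, not how a MOLAP or ROLAP engine stores data: the cube contents, the dimension names and the three helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, product, month) -> dollar sales.
DIMS = ("city", "product", "month")
CUBE = {
    ("Boston", "cookies", "2011-01"): 120.0,
    ("Boston", "milk", "2011-01"): 80.0,
    ("Boston", "cookies", "2011-02"): 100.0,
    ("Chicago", "cookies", "2011-01"): 60.0,
    ("Chicago", "milk", "2011-02"): 40.0,
}

def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def dice_cube(cube, **allowed):
    """Dice: keep cells whose coordinates fall inside the allowed value sets."""
    idx = {DIMS.index(d): set(vals) for d, vals in allowed.items()}
    return {k: v for k, v in cube.items()
            if all(k[i] in vals for i, vals in idx.items())}

def rollup(cube, keep):
    """Rollup: sum away every dimension that is not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(float)
    for key, value in cube.items():
        totals[tuple(key[i] for i in idx)] += value
    return dict(totals)

print(rollup(CUBE, ["city"]))           # totals per city
print(rollup(CUBE, ["city", "month"]))  # drilldown: same totals split by month
print(slice_cube(CUBE, "month", "2011-01"))
print(dice_cube(CUBE, city=["Boston"], product=["cookies"]))
```

Drilldown needs no separate function here: it is a rollup that keeps one more dimension than the previous view.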
Data Warehouse
Collection of a large amount of data which is cleaned, transformed and cataloged, and is made available for use in data mining, online analytical processing, market research and decision support
Method of storage – Normalized vs. Dimensional
Normalized: Similar to database normalization rules. Tables are grouped by subject area
Dimensional: Transactions are split into “Facts” and “Dimensions”. Facts are numbers, whereas Dimensions are reference information about the Facts
Data Warehouse (Cont.)
Schema design – Snowflake or Star Schema
Read-only access
The term OLAP (OnLine Analytical Processing) was created as a slight modification of the traditional database term OLTP (OnLine Transaction Processing)
MOLAP: Multi-dimensional OLAP, which uses a multi-dimensional cube to store the data
ROLAP: Relational OLAP, with an RDBMS as the underlying storage technology
HOLAP: Hybrid OLAP, which uses a mix of relational and multi-dimensional technology
ETL stands for Extract, Transform, Load
Some shops use home-grown ETL
Languages: Shell script, Perl, Python, Ruby, Java
Others use ETL tools
Informatica, SAP and MS SSIS (Commercial)
Talend and Pentaho Kettle (Open Source)
Then Came the Internet and the
Explosion of Data on the Web
Web 2.0 BI Approach
[Figure: Web 2.0 BI architecture. Components shown:
Internet, load balancer, DMZ, website / web application, request dispatcher, service processor, service responser (request / response / result).
Logger writing log entries; operation data & rules.
Corporate data center: Map/Reduce tasks, user behavior analysis, trend analysis, customer behavior statistics, transaction-related info, decision support producing new rules.
Data providers: web crawler, third-party supplier (e.g. Doubleclick).]
And suddenly Data Mining is the new BI!
Data Mining – a process view
Many Definitions
Non-trivial extraction of implicit, previously unknown and potentially useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Why Mine Data – Commercial Viewpoint
Lots of data is being collected and warehoused
Web data
• Yahoo! collects 10GB/hour
Purchases at department / grocery stores
• Walmart records 20 million transactions per day
Bank / credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data – Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
Remote sensors on a satellite
• NASA EOSDIS archives over 1 petabyte of Earth Science data per year
Telescopes scanning the skies
• Sky survey data
Gene expression data
Scientific simulations
• Terabytes of data generated in a few hours
Traditional techniques are infeasible for raw data
Data mining may help scientists
in automated analysis of massive data sets
in hypothesis formation
Common Data Mining Techniques
Data
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
11    No       Married          60K              No
12    Yes      Divorced         220K             No
13    No       Single           85K              Yes
14    No       Married          75K              No
15    No       Single           90K              Yes
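Classification is one of the common techniques this table is typically used to illustrate: learn a rule that predicts the Cheat column from the other attributes. The sketch below is a hand-written decision tree, not one produced by a learning algorithm and not taken from the slide. The split order and the 80K income threshold are assumptions chosen to fit the first ten records, and the income for Tid 6, which is garbled in the source, is assumed to be 60K.

```python
# (tid, refund, marital_status, taxable_income_in_thousands, cheat)
RECORDS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),   # income assumed; unclear in the source
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

def predict_cheat(refund, marital_status, income_k):
    """Hand-written decision tree; splits and threshold are assumptions."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or divorced, no refund: split on income.
    return "Yes" if income_k >= 80 else "No"

def accuracy(records):
    """Fraction of records whose predicted label matches the recorded one."""
    hits = sum(
        predict_cheat(refund, status, income) == cheat
        for _, refund, status, income, cheat in records
    )
    return hits / len(records)

print(accuracy(RECORDS))
```

A tree-learning algorithm would search for such splits automatically and would be evaluated on records it had not seen; fitting and scoring on the same ten rows, as here, only shows that the rule is consistent with the table.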
Amazon.com Case Study:
Personalized Customer Relationship
Management
Amazon.com 5-step loyalty model
Step                        Amazon’s action
Need creation               anticipate / stimulate
Information search          provide / assist
Evaluate alternatives       assist / negate
Purchase transaction        optimise / reward
Post-purchase experience    add value
Step1: Need Creation
Need creation: anticipate / stimulate
Step2: Information Search
Information search: provide / assist
Step3: Evaluation of Alternatives
Evaluate alternatives: assist / negate
Step4: Purchase Optimisation/Reward
Purchase transaction: optimise / reward
• 1-click purchase
• ‘Slippery check out counter’ vs. ‘sticky aisles’
Step5: Post-purchase experience
Post-purchase experience: add value
Internet Marketing Insight – Jeff Bezos
Role of
Advertisement – get customer to the store
Customer experience – get customer to buy
Brick & mortar stores
Getting customer to store is the hard part
Shopping cart abandonment is not common, since the overhead of going to another store is very high – especially in Minnesota winters!
Marketing expenses
80% for advertisement; 20% for customer experience
The 80-20 rule should be reversed for on-line stores
Difference in Two BI Approaches
Traditional (Enterprise approach)
Mainly used for executive reports, consumed by humans
Medium-sized data volume at enterprise scale, not web scale
Very batch-oriented; weekly or monthly is the norm
ETL (Informatica)
Data Warehouse (RDBMS, Fact / Dimension tables, Star / Snowflake schema)
Multi-dimensional (ROLAP, MOLAP, Slice / Dice / Rollup / Drilldown)
Analytic Tool (Business Objects)
Modern (Web 2.0 company approach)
Mainly used for data mining and an automatic feedback loop for adaptation
Gigantic data volume at web scale, from many different sources
Tight feedback loop; latency is within seconds or minutes
ETL (more tolerant of unclean data, but must process it at high speed)
Data Warehouse (Distributed File Systems, NoSQL)
Map/Reduce Parallel Processing (Hadoop)
Analytic Tool (Hive / R)
BI with Unstructured Data
Hadoop + Vertica
Big Data comes in Three Forms
Near Time BI Reporting on Continuous Data Stream
[Figure: near-time BI reporting architecture. Components shown:
Operational system producing an expected high-volume incoming data stream (streaming data).
Processing system: MOM / CEP / HOP aggregator, lookup DB, MapReduce (map and reduce tasks) over HDFS using commodity hardware, BI adaptor.
BI reporting system: any BI reporting tool, serving real-time dashboard queries and near-time reporting.]
The data volume will determine the underlying technology framework (MOM, CEP or HOP)
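The aggregator stage in this architecture can be pictured as a windowed counter sitting between the stream and the reporting tool. The sketch below is a plain-Python stand-in, not a MOM, CEP or HOP product: the event shape `(timestamp_seconds, key)`, the 60-second tumbling window and the `aggregate_stream` name are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def aggregate_stream(events, window_seconds=60):
    """Count events per key inside fixed, non-overlapping time windows.

    events is an iterable of (timestamp_seconds, key) pairs. The result maps
    each window start time to a dict of per-key counts.
    """
    windows = defaultdict(Counter)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # bucket by window start
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}

# Hypothetical click-stream events: (seconds since start, event type).
events = [(0, "view"), (15, "click"), (59, "view"), (60, "view"), (125, "click")]
print(aggregate_stream(events))
```

A dashboard would read the most recent window for real-time queries, while completed windows could be appended to the reporting store for near-time reports.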
What do people do with Hadoop?
> Look for patterns
> Parse Logs
> Archive data
> Transform data
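Log parsing is the classic first Hadoop job, and its shape can be shown without a cluster. The sketch below imitates the map, shuffle and reduce phases in a single Python process; it does not use the Hadoop API. The log format (`METHOD PATH STATUS`), the sample lines and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_log_line(line):
    """Map: emit (status_code, 1) for one log line, skipping unclean lines."""
    parts = line.split()
    if len(parts) < 3:
        return []  # tolerate malformed input instead of failing the job
    return [(parts[2], 1)]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce: sum the counts emitted for one key."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_log_line(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())

LINES = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /cart 200",
    "garbage",
]
print(run_job(LINES))
```

On a real cluster the map calls run in parallel next to the data blocks and the framework performs the shuffle over the network; the per-record logic stays this small.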
Vertica® Analytic Database
MPP columnar architecture
Second to sub-second queries
300GB/node load times
Scales to hundreds of TBs
Standard ETL & Reporting Tools
www.vertica.com
Availability, Scalability and Efficiency
…how fast can you go from data to answers?
Unstructured data needs to be analyzed to make sense of it.
Semi-structured data is parsed based on a spec (or by brute force).
Structured data can be optimized for ad hoc analysis.
Hadoop / Vertica
Distributed processing framework (MapReduce)
Distributed storage layer (HDFS)
> Vertica can be used as a data source and target for MapReduce
> Data can also be moved between Vertica and HDFS (Sqoop)
> Hadoop talks to Vertica via custom Input and Output Formatters
Hadoop / Vertica
[Figure: a Hadoop compute cluster with Map and Reduce tasks.]
Vertica serves as a structured data repository for Hadoop
Hadoop / Vertica
Vertica’s input formatter takes a parameterized query
Relational Map operations can be pushed down to the database
Vertica’s output formatter takes an existing table name or a description
Vertica output tables can be optimized directly from Hadoop
Hadoop / Vertica
[Figure: four Hadoop compute clusters, each with Map and Reduce tasks.]
Federate multiple Vertica database clusters with Hadoop
Data Mining for Computational Social
Sciences
A Case Study from Virtual Worlds
Online Games
Massively Multiplayer Online Role Playing Games (MMORPGs) are computer games that allow hundreds to thousands of players to interact and play together in a persistent online world
Popular MMO games: EverQuest 2, World of Warcraft and Second Life
MMORPG – Everquest 2
MMORPGs (MMO Role Playing Games) are the most popular of MMO games
Examples: World of Warcraft by Blizzard and EverQuest 2 by Sony Online Entertainment
Various logs of players’ behavior are maintained
Player activity in the environment, as well as his/her chat, is recorded at regular time instances; each such record carries a time stamp and a location ID
Some of the logs capture different aspects of player behavior
Guild membership history (member of, kicked out of, joined, left)
Achievements (quests completed, experience gained)
Items exchanged and sold/bought between players
Economy (items/properties possessed/sold/bought, banking activity, looting, items found/crafted)
Faction membership (faction affiliation, record of actions affecting faction affiliation)
Social Science Data Mining with EverQuest 2 Data
Improve understanding of the dynamics of group behavior
MMORPG data enables us to look at the dynamics of groups in a new way
Multiple groups are part of a large social network
Individuals from the social network can join or leave groups
Groups are not isolated and some of them can be related, i.e. they may be geared towards specific objectives, each of which works towards a larger goal (e.g. different teams working towards disaster recovery)
The emergence, destruction and dynamic memberships of the groups depend on the underlying social network as well as the environment
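Group dynamics of this kind can be reconstructed by replaying a membership log. The sketch below is a toy illustration and not the study's actual method or data format: the event tuple `(timestamp, player, guild, action)`, the action names and the sample events are hypothetical, loosely modeled on the guild membership history described for the EverQuest 2 logs.

```python
from collections import defaultdict

DEPARTURES = ("left", "kicked")

def replay_memberships(events):
    """Apply events in timestamp order; return {guild: set_of_members}.

    Guilds that end up with no members are dropped, which models a group
    dissolving once its last member leaves.
    """
    members = defaultdict(set)
    for _, player, guild, action in sorted(events):
        if action == "joined":
            members[guild].add(player)
        elif action in DEPARTURES:
            members[guild].discard(player)
    return {guild: group for guild, group in members.items() if group}

def churn(events):
    """Count joins and departures per guild."""
    counts = defaultdict(lambda: {"joined": 0, "departed": 0})
    for _, _, guild, action in events:
        if action == "joined":
            counts[guild]["joined"] += 1
        elif action in DEPARTURES:
            counts[guild]["departed"] += 1
    return dict(counts)

# Hypothetical log: (timestamp, player, guild, action).
EVENTS = [
    (1, "ann", "Wolves", "joined"),
    (2, "bob", "Wolves", "joined"),
    (3, "ann", "Wolves", "left"),
    (4, "ann", "Ravens", "joined"),
    (5, "bob", "Wolves", "kicked"),
]
print(replay_memberships(EVENTS))
print(churn(EVENTS))
```

Snapshots taken at successive timestamps would show groups emerging, shrinking and dissolving, which is the raw material for the group-level analysis described above.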
Thank You!
We are always looking for good engineers who are
passionate about technology.
For more information, please contact @
[email protected]