Folie 1 - Stanford University

Situational Business Intelligence
Volker Markl
Technische Universität Berlin
Database Systems and Information Management
Technische Universität Berlin
© Chair for Database Systems and Information Management
Agenda
► Traditional Business Intelligence
► Next Generation Business Intelligence
► Building Blocks
  - Cloud Computing, MapReduce, Hadoop, and PigLatin
  - UIMA, Social Tagging
► The Long Tail of Situational Applications
► Situational Business Intelligence
► Challenges
Traditional Business Intelligence
How Did We Get Here?
[Figure: Actual and forecasted BI tools software revenue as reported by IDC, 1990 to 2008 (y-axis from 0 to 7000). Labeled phases: Batch Reporting, Query/Reporting, OLAP, Client-Server Business Intelligence, Web-Enabled Business Intelligence, BI over Text.]
Sources: IDC, Gartner
2008 CIO Priorities
2008 CIO Technology Priorities
To what extent will each of the following technologies be a Top 5 priority for you in 2008?

Technology                                         Rank 2008  Rank 2007  Rank 2006  2008 Increase*
Business Intelligence Applications                 1          1          1          11.20%
Enterprise Applications (ERP, SCM, and CRM)        2          2          **         8.02%
Server and Storage Technologies (Virtualization)   3          5          9          8.45%
Legacy Application Modernization                   4          3          10         5.79%
Security Technologies                              5          6          2          8.53%
Technical Infrastructure                           6          8          12         4.67%
Networking, Voice, and Data Communications (VoIP)  7          4          8          6.83%
Collaboration Technologies                         8          10         4          7.75%
Document Management                                9          9          **         7.91%
Service-Oriented Technologies (SOA and SOBA)       10         7          6          6.71%

* Unweighted average budget change
** New question for 2007
Source: 2008 Gartner Executive Programs CIO Survey, January 10, 2008
What are CIOs missing?
Survey question: "Please give me an example of how your business intelligence solution could better meet your organization's main objective."
► Better/more information: 22.9%
► Faster/quick retrieval: 14.3%
► Accurate/updated data: 11.4%
► Consistent platform: 8.6%
► Better integration: 8.6%
► Standardization: 8.6%
► Other single mentions: 40.0%
Source: Business Intelligence Survey, IDC
Next Generation Business Intelligence
[Diagram labels: Internet and Intranet sources (Text, XLS, XML); Information Extraction; Semantic Integration; Load/Refresh or ad-hoc; Schema and Entities; Analysis; Data Warehouse; Data Marts]
Example questions:
► Who is leading in American Idol?
► Who are the biggest players in the Linux market?
► Which insurance policy customers are at risk of being hit by a current storm?
The next generation of Business Intelligence (NGBI) correlates data warehouses with text and semi-structured data from web services of corporate intranets and the Internet.
Answering an NGBI Query
Who are the biggest players in the “Linux” market?
Web 2.0 documents: 332 Wiki News docs (January–March 2007)
Data Source Identification
Process: Data Source Identification -> Atomic Entity Extraction -> Schema Extraction -> Data Cleansing -> Data Fusion
► Data Warehouse
► Master data
► Information Providers
► Information Marketplaces
► Crawling (Internet/Intranet)
Atomic Entity Extraction
(Options are arranged along an axis of additional extraction and data cleansing effort.)
Out-of-the-box data
► Web services for complex, atomic, and named entities
Frameworks
► Infrastructures for extracting, managing, and scalably storing named entities
► Web services for extracting named entities
Basic components
► Screen scraper
Ad hoc analysis process
Data Source Identification -> Atomic Entity Extraction -> Schema Extraction -> Data Cleansing -> Data Fusion
Schema Extraction
Process: Pre-Process -> Base Extraction -> Schema Extraction -> Data Cleansing -> Data Fusion
Company Technology -> Technology
Company Technology -> Company
Data Cleansing
Duplicates
Data Fusion
Information Integration: Schema Mapping -> Duplicate Detection -> Data Fusion
Example: duplicate records from Data Source A and Data Source B are fused value by value.
  Source values               Resolution    Fused value
  Apple, Apple                match         Apple
  iPhone 3 Gen, iPhone 3G     max length    iPhone 3 Gen
  299.95, 199.99              min           199.99
e.g., HumMer (U Potsdam)
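The value-by-value fusion on this slide can be sketched in Python. This is only an illustration under assumptions: the field names `maker`, `name`, and `price`, the assignment of each price to a source, and the helper `fuse` are hypothetical. Only the resolution functions (match, max length, min) and the values come from the slide.

```python
# Hypothetical sketch of attribute-wise conflict resolution during data fusion.
# Field names and the assignment of values to sources are assumptions.

def match(values):
    """Return the common value; fail if the duplicates disagree."""
    first = values[0]
    if any(v != first for v in values):
        raise ValueError(f"values do not match: {values}")
    return first

def max_length(values):
    """Prefer the longest string, assuming it is the most descriptive."""
    return max(values, key=len)

# One resolution function per attribute, as on the slide.
RESOLUTION = {"maker": match, "name": max_length, "price": min}

def fuse(records):
    """Fuse records already identified as duplicates into one record."""
    return {attr: resolve([r[attr] for r in records])
            for attr, resolve in RESOLUTION.items()}

source_a = {"maker": "Apple", "name": "iPhone 3 Gen", "price": 299.95}
source_b = {"maker": "Apple", "name": "iPhone 3G", "price": 199.99}
fused = fuse([source_a, source_b])
print(fused)
```

Keeping the resolution functions in a table makes it easy to swap in another policy per attribute (for example, most recent value) without touching `fuse`.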
Data Fusion
Fusing pairs of tuples ("-" marks a missing value):
  Input tuples                  Fused tuple          Operation
  (a, b, c, -)  (a, b, -, d)    (a, b, c, d)         Integration of complementary tuples
  (a, b, -, -)  (a, b, -, -)    (a, b, -, -)         Elimination of identical tuples
  (a, b, c, -)  (a, b, -, -)    (a, b, c, -)         Elimination of subsumed tuples
  (a, b, c, -)  (a, e, -, d)    (a, f(b,e), c, d)    Conflict resolution
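The four fusion cases can be covered by one pairwise rule, sketched here in Python. The function `fuse_pair` and the use of `None` for a missing value are assumptions made for illustration; the sketch also assumes the two tuples are already known to describe the same entity.

```python
# Hypothetical sketch: fuse two tuples that describe the same entity.
# None stands for a missing value ("-" on the slide).

def fuse_pair(t1, t2, resolve):
    """Fuse position by position; call resolve(x, y) only on real conflicts."""
    fused = []
    for x, y in zip(t1, t2):
        if x is None:
            fused.append(y)              # take whatever the other tuple has
        elif y is None or x == y:
            fused.append(x)              # complementary or identical values
        else:
            fused.append(resolve(x, y))  # conflicting values
    return tuple(fused)

# Symbolic resolution function, mirroring f(b,e) on the slide.
f = lambda x, y: f"f({x},{y})"

complementary = fuse_pair(("a", "b", "c", None), ("a", "b", None, "d"), f)
identical = fuse_pair(("a", "b", None, None), ("a", "b", None, None), f)
subsumed = fuse_pair(("a", "b", "c", None), ("a", "b", None, None), f)
conflict = fuse_pair(("a", "b", "c", None), ("a", "e", None, "d"), f)
```

In practice `resolve` would be a concrete function such as min, max, or a vote, chosen per attribute.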
Address Uncertainty: Query Refinement
► Extract -> SELECT -> PROJECT -> JOIN -> (COUNT, AVG, SUM, MEAN, ...)
► “Everything” about Dell?
► The market of “Linux” from 2007-2008?
► “What's the average analyst quote about the IBM stock price for the last month?”
► Drill down on region, time, organization, ...
[Diagram labels: QUERY, U, S, S, U, DATA]
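A toy version of the Extract -> SELECT -> PROJECT -> aggregate chain is sketched below in Python for the analyst-quote question. Everything in it is an assumption made for illustration: the sample sentences, the regular expression, and the field names are invented, and real extraction would be far less regular.

```python
# Hypothetical sketch: extract facts from text, then select, project, aggregate.
# The sentences and the pattern are made up for illustration.
import re
from statistics import mean

texts = [
    "Analyst A rates IBM at 120 USD.",
    "Analyst B rates Dell at 25 USD.",
    "Analyst C rates IBM at 130 USD.",
]

PATTERN = re.compile(r"rates (\w+) at (\d+(?:\.\d+)?) USD")

def extract(docs):
    """Extract: turn each matching sentence into a structured fact."""
    for doc in docs:
        m = PATTERN.search(doc)
        if m:
            yield {"company": m.group(1), "quote": float(m.group(2))}

facts = list(extract(texts))
# SELECT company = 'IBM', PROJECT quote
ibm_quotes = [fact["quote"] for fact in facts if fact["company"] == "IBM"]
# AVG
average_quote = mean(ibm_quotes)
```

Refining the query then amounts to changing the selection predicate or the grouping, for example drilling down by time or region once those fields are extracted.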
Building Blocks
► Cloud Computing
► MapReduce
► Pig
► UIMA
► Social Tagging
Cloud Computing
► What is Cloud Computing?
  - Computing platform architecture
  - Scales to any application
  - High fault tolerance
  - No generally accepted definition available
  - The distinction from utility or grid computing is not obvious
Cloud Computing
► How does Cloud Computing work?
  - Lots of loosely coupled computers
  - Use of commodity hardware
  - Resources can be flexibly scaled up or down
  - APIs offer access to cloud computing systems
  - Software takes care of parallelization, hardware failures, and error handling
  - Resources (e.g., storage, computing power) can be bought as services (pay per use, e.g., Amazon)
MapReduce – Programming Model
► Program logic is split into two functions: Map(k,v) and Reduce(k,list(v))
► Functions receive and produce (key, value) pairs
► Map(k,v) computes intermediate (ki,vi) pairs for each (k,v) pair
► Reduce(k,list(v)) merges all values with the same key k and outputs the result
► MapReduce programs are easy to develop
  - Frameworks provide libraries
  - Frameworks take care of parallelization, distribution, and error handling
  - Only application-specific source code is required (no parallelization or error-handling code)
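The programming model can be sketched as a single-process Python driver. This is not how Hadoop or Google MapReduce is implemented; `map_reduce`, `wc_map`, `wc_reduce`, and the sample documents are hypothetical names and data chosen for illustration, with word counting as the example application.

```python
# Hypothetical single-process sketch of the MapReduce programming model.
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every (k, v), group by intermediate key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k_i, v_i in map_fn(key, value):
            groups[k_i].append(v_i)          # the "shuffle": group by key
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Application-specific code: only these two functions are written by the user.
def wc_map(doc_id, text):
    for word in text.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

docs = [("d1", "hadoop pig hadoop"), ("d2", "pig latin")]
counts = map_reduce(docs, wc_map, wc_reduce)
```

A real framework runs the map and reduce calls on many nodes and handles failures; the user-written functions stay the same.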
MapReduce – Group AVG Example
Input data:
  NewYork, US, 10
  LosAngeles, US, 40
  London, GB, 20
  Berlin, DE, 60
  Glasgow, GB, 10
  Munich, DE, 30
  …
MAP(k,v) emits intermediate (K,V) pairs:
  (US,10) (US,40) (GB,20) (DE,60) (GB,10) (DE,30)
Pairs grouped by key as input to REDUCE(k,list(v)):
  (US,10) (US,40)
  (GB,20) (GB,10)
  (DE,60) (DE,30)
Result:
  (US,25) (GB,15) (DE,45)
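The group-average example can be traced in a few lines of Python. The rows are the ones on the slide; the function names `map_fn` and `reduce_fn` and the in-memory grouping are assumptions that stand in for what a MapReduce engine would do across nodes.

```python
# Sketch of the Group AVG example, run in a single process.
from collections import defaultdict

rows = [
    ("NewYork", "US", 10), ("LosAngeles", "US", 40), ("London", "GB", 20),
    ("Berlin", "DE", 60), ("Glasgow", "GB", 10), ("Munich", "DE", 30),
]

def map_fn(row):
    """Emit (country, value) as the intermediate pair."""
    _city, country, value = row
    return country, value

def reduce_fn(country, values):
    """Average all values that share the same key."""
    return country, sum(values) / len(values)

grouped = defaultdict(list)
for row in rows:
    key, value = map_fn(row)
    grouped[key].append(value)

result = [reduce_fn(key, values) for key, values in grouped.items()]
print(result)
```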
MapReduce
► MapReduce Programming Model
  - For processing huge amounts of data
  - Massive parallelization of computing tasks
  - Applicable to many real-world applications
  - MapReduce programs are easy to implement
► MapReduce Engine
  - Environment to run MapReduce programs
  - Distributes computing tasks
  - Errors are handled transparently
  - Very scalable architecture
  - Examples: Google MapReduce and Apache Hadoop
Hadoop
► What is Hadoop?
  - Free software framework for data-intensive applications
  - Enables distributed processing of vast amounts of data on cloud computing architectures
  - Supports clouds with 1000+ nodes
  - Two components:
    1) Hadoop Distributed File System (HDFS)
    2) MapReduce Engine
► Where can you get Hadoop?
  - Top-level Apache Project: http://hadoop.apache.org/core/
Hadoop - HDFS
► Inspired by the Google File System
► Distributed storage for large files
► Files are split into multiple parts (default size 64 MB)
► Parts are spread over the HDFS nodes
► Each part is replicated (default replication factor 3)
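The effect of the two defaults can be worked out with a short Python calculation. The function name `hdfs_footprint` and the 1000 MB example file are hypothetical; the sketch assumes the last part of a file occupies only the space it needs.

```python
# Hypothetical sketch: parts and storage needed for one file under HDFS defaults.
import math

BLOCK_SIZE_MB = 64   # default part size from the slide
REPLICATION = 3      # default number of copies from the slide

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION):
    """Return (parts, stored part copies, raw storage in MB)."""
    parts = math.ceil(file_size_mb / block_size_mb)
    return parts, parts * replication, file_size_mb * replication

parts, stored_copies, raw_mb = hdfs_footprint(1000)
print(parts, stored_copies, raw_mb)
```

For the assumed 1000 MB file this gives 16 parts, 48 stored copies, and 3000 MB of raw storage spread over the nodes.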
Hadoop – MapReduce Engine
► Runs MapReduce programs
► Libraries for Java and C++
► Assigns Map and Reduce tasks to computing nodes
► Reduction of data transfer volume
  - Tasks are assigned to nodes holding the data
► Node failures are handled transparently
  - Tasks are restarted on a node holding a replica of the data
[Diagram: the TaskManager runs several MAP(…) tasks; one of them FAILS]
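The two scheduling rules (run a task where its data lives, restart it on another replica holder after a failure) can be sketched in Python. This is a simplification under assumptions: the node names, the part names, and the function `assign` are hypothetical and do not reflect Hadoop's actual scheduler.

```python
# Hypothetical sketch of data-local task assignment with restart on failure.

def assign(part, replicas, failed=frozenset()):
    """Pick the first live node that holds a replica of the given part."""
    for node in replicas[part]:
        if node not in failed:
            return node
    raise RuntimeError(f"no live replica for {part}")

# Which nodes hold a copy of which part (three copies each, as in HDFS).
replicas = {
    "part-0": ["node1", "node2", "node3"],
    "part-1": ["node2", "node3", "node4"],
}

# Initial assignment: every map task runs on a node that holds its data.
first = {part: assign(part, replicas) for part in replicas}

# node2 fails: only the tasks that ran there are restarted elsewhere.
failed = {"node2"}
retry = {part: assign(part, replicas, failed)
         for part, node in first.items() if node in failed}
```

Because every part has several replica holders, the restarted task still reads its input locally.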
Hadoop
► Who uses Hadoop?
  - Amazon A9.com (Search Index Building, Analytics)
  - Facebook (Logfile Analysis)
  - Google & IBM (University Initiative to Address Internet-Scale Computing Challenges)
  - Yahoo! (Crawling, Indexing, Searching)
    Yahoo! Hadoop Cluster runs Terabyte Sort Benchmark in 209 seconds
  - And many others… (see http://wiki.apache.org/hadoop/PoweredBy)
► Hadoop resembles Google's MapReduce Framework
  - J. Dean, S. Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"
The Pig Project
► A platform for analyzing large data sets
► Pig consists of two parts:
  - PigLatin: A Data Processing Language
  - Pig Infrastructure (Grunt): An Evaluator for PigLatin programs
► Where can you get Pig?
  - Apache Incubator Project: http://incubator.apache.org/pig
► Alternatives:
  - HIVE (Facebook)
  - JAQL (IBM Research)
PigLatin Data Processing Language
► PigLatin is imperative (whereas SQL is declarative)
  - Step-by-step programming approach
  - PigLatin queries are easy to write and understand
► Fully nestable data model
  - Atomic values, tuples, bags, maps
► Operators of two flavors:
  - Relational-style operators (filter, join, etc.)
  - Functional-programming-style operators (map, reduce)
► Easy to extend by user functions
► Example: “Find the top 10 most visited pages in each category”
    visits      = load '/data/visits' as (user, url, time);
    gVisits     = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo     = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls     = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';
Example taken from:
“Pig Latin: A Not-So-Foreign Language For Data Processing” Talk, Sigmod 2008
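For readers unfamiliar with PigLatin, the same dataflow can be mirrored step by step in plain Python. This is only a sketch of the semantics, not of how Pig executes the script; the sample rows and the function name `top_urls` are invented for illustration.

```python
# Hypothetical sketch: the "top pages per category" dataflow in plain Python.
from collections import Counter, defaultdict

# Made-up stand-ins for /data/visits and /data/urlInfo.
visits = [("amy", "a.com", 1), ("bob", "a.com", 2), ("amy", "b.com", 3),
          ("cat", "c.com", 4), ("bob", "c.com", 5), ("cat", "c.com", 6)]
url_info = [("a.com", "news", 0.9), ("b.com", "news", 0.4),
            ("c.com", "sports", 0.7)]

def top_urls(visits, url_info, n=10):
    # group visits by url; foreach group generate url, count
    visit_counts = Counter(url for _user, url, _time in visits)
    # join visitCounts by url, urlInfo by url
    category_of = {url: category for url, category, _rank in url_info}
    # group by category
    by_category = defaultdict(list)
    for url, count in visit_counts.items():
        if url in category_of:
            by_category[category_of[url]].append((url, count))
    # foreach category generate the n most visited urls
    return {category: sorted(rows, key=lambda r: r[1], reverse=True)[:n]
            for category, rows in by_category.items()}

result = top_urls(visits, url_info)
```

Each commented step corresponds to one PigLatin statement, which reflects the step-by-step style the slide describes.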
Pig Infrastructure
► Currently two modes:
  - Local: PigLatin programs are evaluated locally (run in a single JVM)
  - MapReduce: PigLatin programs are compiled to sequences of MapReduce programs and executed (e.g., on Hadoop)
► Example:
  Map1 (LOAD visits) -> GROUP BY url -> Reduce1 (FOREACH url GENERATE count)
  Map2 (LOAD url info) -> JOIN on url -> Reduce2
  Map3 -> GROUP BY category -> Reduce3 (FOREACH category GENERATE top10(urls))
Example taken from:
“Pig Latin: A Not-So-Foreign Language For Data Processing” Talk, Sigmod 2008
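The compilation into three chained MapReduce jobs can be imitated in Python with a toy single-process engine. The helper `run_job`, the tagging scheme used for the join, and the sample data are assumptions made for illustration; Pig's real compiler and Hadoop's engine work differently in detail.

```python
# Hypothetical sketch: the example query as three chained MapReduce jobs.
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """One job: map each record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output

visits = [("amy", "a.com"), ("bob", "a.com"), ("amy", "b.com"), ("cat", "c.com")]
url_info = [("a.com", "news"), ("b.com", "news"), ("c.com", "sports")]

# Job 1: group visits by url and count them.
counts = run_job(visits,
                 lambda rec: [(rec[1], 1)],
                 lambda url, ones: [("count", url, sum(ones))])

# Job 2: join counts with url info on url (inputs are tagged by origin).
tagged_info = [("info", url, category) for url, category in url_info]

def join_map(record):
    tag, url, payload = record
    return [(url, (tag, payload))]

def join_reduce(url, tagged):
    categories = [p for t, p in tagged if t == "info"]
    totals = [p for t, p in tagged if t == "count"]
    return [(url, category, total) for category in categories for total in totals]

joined = run_job(counts + tagged_info, join_map, join_reduce)

# Job 3: group by category and keep the n most visited urls.
def top_reduce(category, rows, n=10):
    return [(category, sorted(rows, key=lambda r: r[1], reverse=True)[:n])]

top = run_job(joined, lambda rec: [(rec[1], (rec[0], rec[2]))], top_reduce)
```

Each grouping or join forces a new job because it needs all records with the same key in one place, which is why the example needs three map/reduce rounds.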
UIMA
UIMA
[Diagram: Pre-Processing -> Analysis Phase -> Post-Processing]
UIMA
► Annotators for part-of-speech detection, named-entity detection, and relation detection
The Stratosphere Project
► Many BI queries exceed the capabilities of today's BI systems
  - "Who are the biggest players in the Linux market?"
  - "Which insurance policy customers are at risk of being hit by a current storm?"
► The Internet offers valuable information
  - Enterprise announcements and public business reports
  - User-generated content: blogs, wikis, reviews, comments, etc.
  - News websites and feeds
► Next Generation Business Intelligence (NGBI) requires joint analysis of Internet and enterprise data
  - Internet, intranet, data warehouse, and local data must be processed
The goal of the Stratosphere Project is to build an NGBI system on a cloud computing platform.
Stratosphere - Architecture
[Architecture diagram labels]
Data sources: Intranet, Internet, Data Warehouse; further data sources: office documents (spreadsheets), email
Computing Cloud (HADOOP): Crawl, Scan, Cache, Filter, Retrieve, Extract (UIMA), Process, Join, Group
UI: Query, Query Translation, Query Plan, Result
Stratosphere – Research Challenges
► Definition of an algebra for expressing NGBI queries
  - Includes traditional database operators, data retrieval operators, information extraction operators, and information integration operators
► Implementation of NGBI query operators
  - Requirements: highly scalable, robust, self-tuning
  - Leveraging Hadoop and MapReduce frameworks
► Implementation of a cloud computing monitoring infrastructure
  - Enables self-tuning NGBI operators
Related Project: DBLife
Related Projects: Avatar Email Search
Situational Business Intelligence Example
Which insurance policy customers are at risk of being hit by a current storm?
Severe weather: meet Pete, an insurance agent in Louisiana.
1. He sees a news report of a severe storm. What is the company's risk?
2. Pete has an Excel spreadsheet with all policy holders he manages, which he filters to select only properties insured for more than $250,000.
3. Pete searches for a website that can predict flood levels for his area and finds www.floodlevels.com, a mashup which predicts the flood level for a geographic area based on USGS flood level forecasts, and GIS databases from
4. Pete connects his spreadsheet to www.floodlevels.com.
5. He then forwards a risk summary to executives.
Data sources and their keys:
  Policy spreadsheet (Zipcode)
  http://water.usgs.gov/waterwatch/ (HUC = Hydrological Unit Code)
  edc.usgs.gov/ (Geocode = Latitude/Longitude)
  http://www.dotd.florida.gov/ (Geocode = Latitude/Longitude)
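Pete's steps 2, 4, and 5 amount to a filter, a lookup by zip code, and a summary, sketched below in Python. All data is invented: the policies, the flood levels standing in for www.floodlevels.com, the field names, and the flood-level threshold `min_level` are assumptions, since the slide gives only the $250,000 filter.

```python
# Hypothetical sketch of the flood risk mashup; all data below is made up.

policies = [
    {"holder": "A", "zipcode": "70112", "insured": 300_000},
    {"holder": "B", "zipcode": "70112", "insured": 200_000},
    {"holder": "C", "zipcode": "70801", "insured": 500_000},
    {"holder": "D", "zipcode": "70119", "insured": 400_000},
]

# Stand-in for the predicted flood level per zip code.
flood_levels = {"70112": 2.5, "70801": 0.2}

def at_risk(policies, flood_levels, min_insured=250_000, min_level=1.0):
    """Keep high-value policies whose zip code has a high predicted flood level."""
    selected = [p for p in policies if p["insured"] > min_insured]
    return [p for p in selected
            if flood_levels.get(p["zipcode"], 0.0) >= min_level]

risky = at_risk(policies, flood_levels)
summary = {
    "holders": [p["holder"] for p in risky],
    "exposure": sum(p["insured"] for p in risky),
}
print(summary)
```

Policies in zip codes without a forecast are treated as not at risk here, which is itself an assumption a real application would need to revisit.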
Flood Risk Assessment Mashup
[Diagram labels: Report; Mashup Search; Screen Scraping; Standardization; Lineage; sources: policy XLS, www.floodlevels.com, water.usgs.gov, edc.usgs.gov, dotd.louisiana.gov]
Situational BI Evolution
[Diagram: BI solutions placed by owner (IT Dept vs. Line of Business), by criticality (Mission Critical vs. Best Effort, Ad Hoc), and by available time (Lots of Time vs. Limited Time, Immediate). Items shown: DataWarehouse, DataMart, SCA, Portals, New Initiatives, Proof of Concept, Mashups.]
Select Literature
(Algebraic) Extraction
► Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. International Joint Conference on Artificial Intelligence (IJCAI) 2007: 2670-2676
► Frederick Reiss, Shivakumar Vaithyanathan, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu: An Algebraic Approach to Rule-Based Information Extraction. International Conference on Data Engineering (ICDE) 2008: 933-942
Schema generation from extracted uncertain data
► Xin Dong, Alon Y. Halevy: Malleable Schemas: A Preliminary Report. WebDB 2005: 139-144
► Marcos Antonio Vaz Salles, Jens-Peter Dittrich, Shant Kirakos Karakashian, Olivier René Girard, Lukas Blunschi: iTrails: Pay-as-you-go Information Integration in Dataspaces. International Conference on Very Large Data Bases (VLDB) 2007: 663-674
Optimization
► Alpa Jain, AnHai Doan, Luis Gravano: Optimizing SQL Queries over Text Databases. International Conference on Data Engineering (ICDE) 2008: 636-645
BI over text
► Alpa Jain, AnHai Doan, Luis Gravano: Optimizing SQL Queries over Text Databases. International Conference on Data Engineering (ICDE) 2008: 636-645
► Raghu Ramakrishnan, Andrew Tomkins: Towards a PeopleWeb. IEEE Computer 40(8): 63-72
► Alexander Löser, Gregor Hackenbroich, Hong-Hai Do, Henrike Berthold: Web 2.0 Business Analytics. Datenbank Spektrum 25/2008
► T. S. Jayram, Andrew McGregor, S. Muthukrishnan, Erik Vee: Estimating Statistical Aggregates on Probabilistic Data Streams. PODS 07
Conclusion
► BI over text will tap into a huge set of additional information for BI
► The next generation of business intelligence applications will utilize technologies for scalable processing and service computing to integrate data sources from warehouses, intranet, and Internet
► Situational BI will create ad hoc applications to answer complex questions over integrated data sources
► Open research problems:
  - Which is the right extraction service?
  - “How much” schema can be generated?
  - “How much” optimization does the user have to add?
  - How to optimize UIMA-based extraction plans on a Hadoop cloud?
  - What is a suitable query language over Hadoop?
  - Data cleansing, completion, and duplicate detection of extracted data?
  - Data explanation: lineage, but also: why do I NOT see that data tuple?
Acknowledgements
► Discussions at IBM Research and IBM SWG
  - Anant Jhingran
  - Hamid Pirahesh
  - Kevin Beyer
  - David Simmen
  - Mehmet Altinel
  - et al.
► My team at TU Berlin
  - Alexander Löser
  - Fabian Hüske
  - Stephan Ewen
  - Helko Glathe
Thank You (English)
Gracias (Spanish)
Obrigado (Brazilian Portuguese)
Danke (German)
Grazie (Italian)
Merci (French)
[Also shown in Hindi, Thai, Traditional Chinese, Russian, Arabic, Simplified Chinese, Tamil, Japanese, and Korean]