Data Quality Services - Microsoft NT konferenca

Download Report

Transcript Data Quality Services - Microsoft NT konferenca

SQL SERVER ‘DENALI’
WHAT’S NEXT IN THE SQL SERVER PLATFORM
Tara Seppa, CEE Application Platform Product Manager, Microsoft
CAVEATS
• Dates and capabilities are subject to change.
• Any screen captures or concepts are prototypes only.
• Packaging of new capabilities is yet to be determined.
• This deck is not a complete view of all
investments.
Disclaimer: This presentation contains preliminary information that may be changed substantially prior to final commercial release of the
software described herein. The information contained in this presentation represents the current view of Microsoft Corporation on the issues
discussed as of the date of the presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted
to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of
the presentation.
This presentation is for informational purposes only.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this
presentation. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this information does not give
you any license to these patents, trademarks, copyrights, or other intellectual property.
© 2010 Microsoft Corporation. All rights reserved.
AGENDA
• The Growth of Data
• SQL Server Today
• SQL Server Denali:
 Contain Database, AlwaysOn, Availability Groups,
Data Quality Services
 Business Intelligence (Project Crescent)
• Demo
• Resources
THE GROWTH OF DATA
• Data will grow 44x over the next ten years
• By end of 2011:
 The amount of data is expected to reach 1.2M
Zetabytes
 1 Million Terabytes = 1 exabyte
 1000 exabytes = 1 zetabyte
 All spoken words in history = 5 exabytes
MICROSOFT SQL SERVER
EVOLUTION
INCREASED FUNCTIONALITY AND ENTERPRISE-LEVEL
PERFORMANCE OVER THE LAST DECADE
ANALYST REVIEWS
• Gartner positions Microsoft in “Leader” Quadrant
for BI
• Forrester positions Microsoft in the “Leader”
Category for BI
• Forrester positions Microsoft in the “Leader”
Category for Data Warehouse
WHAT ANALYSTS ARE SAYING…
“By incorporating BI capabilities into
Microsoft's most ubiquitous products,
it virtually guarantees its BI
offering's continued adoption
“While BI spending grew in 2010, the
economic downturn caused an increased
focus on cost. Organizations showed an
increased willingness to consider
traditional low-cost options such as
Microsoft.”
“Based on the Magic Quadrant customer survey,
Microsoft is executing well on this strategy, with
customers rating it above average across many
key satisfaction metrics, including "sales and
customer experience" (including product quality and
support), "achievement of business benefits," as
well as for "vendor success." Strong customer
satisfaction with its approach is reflected in
Microsoft's Ability to Execute position.”
INTRODUCING SQL DENALI
CODE NAMES DEFINED
• SQL Server AlwaysOn: High availability
solution delivering increased application
availability, lower TCO and ease of use.
• Project “Apollo”: New column-store that
delivers significantly greater query performance
• Project “Juneau”: Single environment for
developing database, BI, and web solutions
• Project “Crescent”: Web-based, highly
interactive, data visualization solution
DELIVERING BUSINESS
“Denali” AGILITY AND
INNOVATION TO GAIN STRATEGIC VALUE
OUT OF YOUR INFORMATION
MISSION-CRITICAL
PLATFORM
DEVELOPER and IT
PRODUCTIVITY
PERVASIVE INSIGHT
Increase flexibility with
integrated high availability
Realize lower cost of
IT administration
Expand the reach of BI to
business users
Gain higher return on high
availability investments
Accelerate time-tomarket for developers
Get advanced analytic
performance
Get breakthrough data
warehousing scale &
performance
Build more flexible,
innovative applications
Deliver credible,
consistent data to right
users
MISSION-CRITICAL PLATFORM
NEW, INTEGRATED
HIGH AVAILABILITY
HIGHER RETURN ON
HIGH AVAILABILITY
INVESTMENTS
A
A
BREAKTHROUGH
PERFORMANCE
AND SCALE
A
Greater flexibility with
HADR
New HADR solution with
SQL Server “Denali”
AlwaysOn
Support for Windows
Server Core to reduce
planned downtime
Increased use of
hardware
Simplified configuration
and monitoring
Column-based query
accelerator for dramatic
performance gains
Up to 15,000 partitions
for larger scale slidingwindow scenarios
MAXIMIZE RESOURCES
Higher return on high availability investments
•
Increase hardware utilization through active secondaries
for backups, reporting, and ad hoc queries
•
Reuse existing infrastructure with support for both SAN
and direct attached storage
Simplify management and administration
•
Integrated manageability for one-stop configuration
•
Easy setup and monitoring integrated into Microsoft SQL
Server Management Studio
•
Availability Groups that provide failover units with
contained dependencies (such as logons)
ALWAYS ON (HADR)
HADR “ALWAYS ON”
• Deploying "HADR" involves configuring one or
more availability groups.
• Each availability group establishes the context
within which a selected set of user databases
can fail over as a single unit.
• The availability group involves a set of failover
partners, known as availability replicas.
HADR “ALWAYS ON”
Each replica within an availability group is
assigned a specific role, as follows:
• Primary role: Current primary replica. Since
only one replica can exist as a primary replica at
a given time.
• Secondary role: Indicates that the availability
replica is currently a secondary replica.
• Resolving role: Indicates that the current status
of an availability replica is uncertain for some
reason or its role is changing.
ALWAYSON
Integrated, Flexible, Efficient high Availability for
mission critical business
A high availability platform built for the future
AlwaysOn provides database level and instance
level protection
AlwaysOn Availability Groups
for database protection
AlwaysOn Failover Cluster Instances
for instance level protection
Multi-Database Failover
Multisite Clustering
Multiple Secondaries
Flexible Failover Policy
Active Secondaries
Improved Diagnostics
Integrated HA Management
Built for consolidation scenarios
ALWAYS ON – FLEXIBILITY
AlwaysOn provides the flexibility of different HA
configurations
A
A
A
A
A
A
A
Direct attached storage local, regional and geo target
Synchronous
Data Movement
Asynchcronous
Data
Movement
Shared Storage, regional and geo secondaries
17
QUERY ACCELERATION
BREAKTHROUGH PERFORMANCE
AND SCALE
Column-based query accelerator
• Dramatically faster star-join query processing—much
faster than current SQL Server (~10X)
•
Query speed increase varies with query and data
• Reduced I/O
• Reduced performance tuning effort
Up to 15,000 partitioning for very large databases
• Support for large sliding-window scenarios
11001010
01010010
10011101
01100101
001
ROW VS COLUMN STORE
EmpId
Lastname
Firstname
Salary
1
Smith
Joe
40000
2
Jones
Mary
50000
3
Johnson
Cathy
44000
Row Oriented Storage:
Column Oriented Storage:
1,Smith,Joe,40000;
2,Jones,Mary,50000;
3,Johnson,Cathy,44000;
1,2,3;
Smith,Jones,Johnson;
Joe,Mary,Cathy;
40000,50000,44000;
COLUMN VS ROW
COLUMN Oriented:
• Minimize the time to read the disk
 Need to compute an aggregate on a value - no need to read the rest of the
row
 This time adds up when executing large-scale calculations (data
warehouse)
• Data compression: column data is likely similar, which means you
get greater compression ratios
• Best for OLAP workloads
ROW Oriented:
• Writing data to disk in row format is faster than doing so by columns.
• Key for high-transaction database applications where data is
constantly being read and written to the database
• Best for OLTP workloads
CONTAINED DATABASE
SEPARATING CONCERNS
Application /
DB
Instance
Management Settings
HA Setup
Application Settings
Collation
Authentication
Security
Jobs
DQ
Authorization
CONTAINED DATABASES
• Includes all database settings and metadata required to define the
database
• Has no configuration dependencies on the instance of the SQL
Server Database Engine where the database is installed.
• Users can connect to the database without authenticating a login at
the Database Engine level.
• Simplifies database moves (can easily move the database to
another instance of SQL Server)
• All the database settings are included which enables DB owners to
manage all the configuration settings.
BENEFITS OF CONTAINMENT
• Database Movement:
 Login info, agent jobs, etc
• Initial Application Development
 Developer productivity
 No consideration for impact on database instance
• Application Administration
 Easier to maintain applications
MDS & DQS
MASTER DATA SERVICES
DATA QUALITY SERVICES
DELIVER CONSISTENT DATA
Enterprise Data Integration & Management
Master Data Services
Data quality
Data Quality Services
Profiling
Familiar tools
Cleansing
Performance
Matching
Integration Services
Impact Analysis
Lineage tracking
Easy data loading
with data sources
MASTER DATA MANAGEMENT
WORKFLOW
APPROVAL
HIERARCHY MGMT
Purchasing DB
CRM
ERP
HR Doc
Asset Mgmt
MDS: EXCEL PROTOTYPE
WHAT IS DATA QUALITY ?
 Data Quality represents the degree to which
the data is suitable for business usages
 The quality of data can be defined, measured
and managed through various Data Quality
dimensions and metrics such as accuracy,
consistency, completeness, duplicates,
timeliness
 Data Quality is built through Processes
involving People and Technology
REAL LIFE EXAMPLES –
CUSTOMER DATA
Different
representations
Initials Misplaced
Missing
Zip
Is It the Same
Customer?
Will she get
her mail?
What is the
right data ?
Wrong Zip
Non
Standard
Can we save
on mail
expenses?
32
REAL LIFE EXAMPLES –
PRODUCT DATA
10029
10129
Format
OFSUP10029
A4
white
Laser paper
80 gr
Offc Dep
OUT OF STOCK
Premium
10408
Id
convention
OFSUP10129
8.26” x 11.69”
weiss
Copy, multipurpose
80 gr
Office Depot
48 left
Bright white
10408
A4
white
Printing paper
20 kg
Staples
12 left
Pack of 5
Typos
Free text
Language
Database
error, the
same
product?
Are these
substituting
products?
Can we
consolidate
suppliers?
Who is the
cheapest
supplier for a
component?
33
COMMON DATA QUALITY ISSUES
Data Quality
Issue
Sample Data Problem
Do values follow consistent
formatting standards ?
Telephone # may appear as xxxxxxxxxx,
(+xx) xxx-xxxx, 1.xxx.xxx.xxxx, etc.
Standard
Are data elements consistently
defined and understood ?
Gender code = M, F, U in one system and
Gender code = 0, 1, 2 in another system
Consistent
Do values represent the same
meaning ?
Is revenue always presented in Dollars or
also in Euros?
Complete
Is all necessary data present ?
20% of customers’ last name is blank,
50% of zip-codes are 99999
Accurate
Does data accurately represent
reality or a verifiable source?
A Supplier is listed as ‘Active’ but went out
of business six years ago
Do data values fall within
acceptable ranges?
Salary values should be between
60,000-120,000
Data appears several times
Both William Ryan and Bill Ryan appear in
the system – are they the same person?
Format
Valid
Duplicates
34
REQUIREMENTS FOR DATA
QUALITY SOLUTIONS
Monitoring – Tracking
and monitoring the
state of Quality
activities and Quality of
Data
Cleansing – amend,
remove or enrich
data that is incorrect
or incomplete. This
includes correction,
standardization and
enrichment.
Profiling analysis of the
data source to
provide insight into
the quality of the
data and help
identify data
quality issues.
Matching identifying, linking or
merging related
entries within or
across sets of data.
MATCHING – SCENARIO
Linas would like to consolidate his company’s customer’s database by
matching and grouping the customers into households.
“Householding” helps marketers understand which of their customers or
targets live together for better targeting
Before:
Name
Address
Postal Code City
State
John Smith
Margaret & John smith
Maggie Smith
John Smith
545 S Valley View Drive # 136
545 Valley View ave unit 136
545 S Valley View Dr
545 Valley Drive St.
34563
34563-2341
New York
New York
New York
NY
34253
Anytown
Anytown
Anytown
NY
After:
Name
Address
Postal Code City
State
John Smith
Margaret & John smith
Maggie Smith
John Smith
545 S Valley View Drive # 136
545 Valley View ave unit 136
545 S Valley View Dr
545 Valley Drive St.
34563
34563-2341
New York
New York
New York
NY
34253
Anytown
Anytown
Anytown
NY
37
Cluster
1
1
1
2
MATCHING - EXAMPLE
BUSINESS INTELLIGENCE
DELIVERING BUSINESS
“Denali” AGILITY AND
INNOVATION TO GAIN STRATEGIC VALUE
OUT OF YOUR INFORMATION
MISSION-CRITICAL
PLATFORM
DEVELOPER and IT
PRODUCTIVITY
PERVASIVE INSIGHT
Increase flexibility with
integrated high availability
Realize lower cost of
IT administration
Expand the reach of BI to
business users
Gain higher return on high
availability investments
Accelerate time-tomarket for developers
Get advanced analytic
performance
Get breakthrough data
warehousing scale &
performance
Build more flexible,
innovative applications
Deliver credible,
consistent data to right
users
PERVASIVE INSIGHT
PERFORM ADVANCED
EXPAND THE
REACH OF BI TO
BUSINESS USERS
ANALYTICS AND
INCREASE
ORGANIZATIONAL
DELIVER CONSISTENT
DATA TO THE RIGHT
USERS AT THE RIGHT
TIME
ALIGNMENT
Magnifying the reach of BI
with Project “Crescent”
Removing barriers between
data and analytics
Amplifying collaboration
Fast and scalable new BI
semantic model
Enhanced experience
Unified modeling
experience
Greater ease-of-use for
quick ramp on SSIS and
MDS
New capabilities to
better track, analyze,
and cleanse data
EXPAND REACH OF BI TO BUSINESS
USERS
Magnify the reach of your business intelligence
solutions with Project “Crescent”
• Highly interactive web-based authoring and sharing
of information
• Drive greater insights through smart and powerful data querying
• New presentation and storyboard turns pervasive into persuasive
Remove barriers between data and analytics
• Rich new features in PowerPivot, including hierarchy and
KPI creation
• Sophisticated calculations that span personal to corporate BI
Amplify collaboration
• Turn pervasive into persuasive with new presentation mode
and insight storyboard
• Proactive and personalized alerts on data-driven events
INTRODUCING BISM
• BISM: Business Intelligence Semantic Model
• What is it?
 Semantic model designed specifically for all BI
workloads
 Evolution of the previous Unified Dimension Model
into a BI Semantic Model
 Supports modeling of multi-dimensional and tabular
(think Relational) BI solutions
 For end users the BI Semantic Model will be
transparent regardless of how it was implemented
BI Semantic Model
Third-Party BI
Applications
Reporting
Services
Reports
Excel
Workbooks
Relational
DAX
BI Semantic
VertiPaq
PowerPivot
Applications
SharePoint
Dashboards
& Scorecards
Multidimensional
Data Model
MDX
Model
Business
Logic
DirectQuery
Files
BI Applications
Data Access
OData
Feeds
Data Sources
DEMO
PROJECT CRESCENT
IN SUMMARY“Denali”
MISSION-CRITICAL
PLATFORM
DEVELOPER and IT
PRODUCTIVITY
PERVASIVE INSIGHT
Increase flexibility with
integrated high availability
Realize lower cost of
IT administration
Expand the reach of BI to
business users
Gain higher return on high
availability investments
Accelerate time-tomarket for developers
Get advanced analytic
performance
Get breakthrough data
warehousing scale &
performance
Build more flexible,
innovative applications
Deliver credible,
consistent data to right
users
RESOURCES
• Sign up for Denali CTP 3 notification:
http://www.sqlserverlaunch.com/
• Benchmarks:
http://www.microsoft.com/sqlserver/en/us/product-info/benchmarks.aspx
• SQL Server Team Blog:
http://blogs.technet.com/b/dataplatforminsider/
• SQL Denali on MSDN:
http://msdn.microsoft.com/en-us/library/bb500435(v=SQL.110).aspx
THANK YOU
APPENDIX
SELF-SERVICE
BI APPLIANCE
End-to-end,
pre-configured stack
quickly enables BI for Excel power users
Rich insight: Empower users to easily create PowerPivot workbooks from real-time business data for faster, more
accurate insights
Reduced complexity: Overcome cost and complexity of BI; shift IT resources from running ad-hoc reports to
innovation initiatives
Improved deployment & manageability: Appliance management dashboard and BI Getting Started Home Page
streamline deployment and simplify administration
Target: Small and medium sized businesses, and enterprise departments or POCs
SKUs
Components
SSBI Appliance
Server
Affordable industry standard hardware balanced with RAM to support BI
workloads (RS, AS, & PowerPivot)
Storage
>2 TB onboard storage tuned to support BI workload
Software
Windows Server 2008 R2 Enterprise Edition, SQL Server 2008 R2
Enterprise Edition, SharePoint 2010 E, PowerPivot, custom code for
installation and management
Infrastructure
Existing Active Directory, 1U w/ power & network to install in existing
rack
Services
Solution support from basic to Mission Critical
SSBI Assessment Services
DB CONSOLIDATION
APPLIANCE
Aggregate 100s of Tier 2/3 DB/Applications in virtual, scalable infrastructure
•
Reduced TCO: Virtual, scalable infrastructure drives database deployment and
management costs dramatically down
•
Reliable, low-risk platform: Cost-effective high availability ensures continuous
SKUs
Components
uptime
•
Base DB
Consolidation SKU
(~40 instances and
management
infrastructure)
Servers
4 x HP BL465c G7 Blades with 256GB RAM with internal
storage
Simplified operation: Integrated management and automated P2V migration
HP P2000 G3 Modular Smart Arrays (48 spindles)
ease ongoing administration Storage
of virtualized26 xxresources
HP StorageWorks D2700 Disk Enclosures (150 spindles)
Building Blocks (~40
additional instances)
Software
Windows Server 2008 R2 DC, SQL Server 2008 R2 DC,
Hyper-V, System Center, Virtual Machine Manager, SQL
Fabric Controller,
custom scripts, custom VMs
Infrastructure
HP BladeSystem c3000 Enclosure, HP Virtual Connect Flex10, 10GbE switches and connectivity, 42u rack
Services
Consolidation & Virtualization services; Software technical
support
Servers
4 x HP BL465c G7 Blades with 256GB RAM with internal
storage
Storage
(proposed)
2 x HP P2000 G3 Modular Smart Arrays (48 spindles)
6 x HP StorageWorks D2700 Disk Enclosures (150 spindles)
Software
None (Requires CALs for additional processors)
Infrastructure
None (Plugs into base infrastructure)
MICROSOFT DATA
Fast Track Data
Parallel Data
HP Starter DW
Enterprise
Warehouse RA
Warehouse
WAREHOUSING
OFFERINGS
Appliance
Scalable and reliable SMP
platform for data
warehousing on any
hardware
Scalable and reliable
platform for data
warehousing on any
hardware
Reference architectures
offering best price
performance for data
warehousing
Appliance for high end
MPP Data Warehousing
delivering highest
scalability and
performance
Ideal for data marts or small
to mid-sized enterprise data
warehouses (EDWs)
Ideal for large data marts or
mid-sized EDWs
Ideal for data marts or small
to mid-sized data
warehouses with scan-centric
workloads
Ideal for high scale or high
performance data marts
and EDWs
Software only
Integrated Appliance
(Software and Hardware)
Reference Architectures
(Software and Hardware)
DW Appliance
(Fully integrated Software
and Hardware)
Scale-Up DW
Scale-Up DW
Scale-Up DW
Scale-Out DW with MPP
10s of terabytes
<5 terabytes
5–80 terabytes
10s - 100s of TB
Software Assurance;
Premier Mission Critical
Support
TBD
Software Assurance;
Premier Mission Critical
Support
Mission Critical Advantage
Program
SQL SERVER “DENALI”
High Availability
Scalability and
Performance
Security &
Manageability
> 1000
Partitions /
table
User-Defined
Server Roles
PHP Driver
Unified Semantic
Model
Data Lineage
Reliable &
Integrated Failover
Detection
Column store
DW
SQL Studio
Semantic
Platform
Crescent
Impact Analysis
Application
Centric Failover
Fast FileStream
LocalDB
In-memory BI for
corporate
SSIS Server
Multiple
Readable
Secondaries
Win32 access to
database files
Contained
Database
Authentication
Database
Replay
UTF-16
Alerting
Data Quality
Online
Operations
Audit
Enhancements
Paging for
result sets
Sysprep for AS
Enhanced MDS
Fast Full Text
HA for
StreamInsight
FileTable
Windows Server
Core Support
AlwaysOn
Management
Pack for High
Availability
Backup
Secondaries
SSMS
Enhancements
Default
Schema for
Windows
Group
Web & Breadth
Full Globe
Spatial Support
DAC
Enhancements
ODBC for
Linux
JDBC 4.0
driver
Support for
ARM
processors
Business
Intelligence
PowerPivot
Enhancements
Reporting as
SharePoint
Shared Service
EDIM