A Database Reverse Engineering Case Study

Download Report

Transcript A Database Reverse Engineering Case Study

DAMA Chicago
Aug 17, 2016
A Database Reverse
Engineering Case Study
Michael R. Blaha, DSc.
[email protected]
www.superdataguy.com
What is Database Reverse
Engineering?
Reverse engineering is the inverse to normal
development
• Start with an application and work backwards to
understand the software and infer its intent
• Reverse engineering can apply to a variety of
artifacts
– Hardware, programming code, databases, …
• Our focus here is on databases
Why Would Anyone Want
to Do DBRE?
• To elicit requirements
– DBRE is not intended to perpetuate past flaws
– DBRE is merely a source of tentative requirements
•
•
•
•
•
To convert legacy data
To integrate application stovepipes
To assess software
To assist maintenance
To construct documentation
Case Study 1:
Reverse Engineer
WordPress
Rationale
• WordPress is an interesting DBRE case study
because...
– WordPress is a well-known application
– The case study has a populated database
The data is real (not synthetic)
The data is not proprietary
• Illustrate DBRE techniques for a small database
Processing Details
1. Export MySQL db from www.superdataguy.com
website
– The exported localhost.sql is unreadable
2. Import SQL code into a local MySQL db
3. Export schema only from local MySQL db
– The exported file is readable
4. Manually edit the SQL by deleting…
– `, unsigned, COLLATE, KEY, UNIQUE KEY, ENGINE
5. Reverse engineer schema with ERwin
Initial ERwin Model
Record Counts
From querying the MySQL database
wp_fqir_commentmeta
68
wp_fqir_comments
23
wp_fqir_links
0
wp_fqir_options
359
wp_fqir_postmeta
656
wp_fqir_posts
406
wp_fqir_term_relationships
59
wp_fqir_term_taxonomy
15
wp_fqir_termmeta
0
wp_fqir_terms
15
wp_fqir_usermeta
47
wp_fqir_users
1
Manually Add FKs
• Look for name similarity
• Verify with data analysis
– SELECT * FROM wp_fqir_commentmeta
WHERE comment_ID NOT IN
(SELECT comment_ID FROM wp_fqir_comments);
-- 0 records
– SELECT * FROM wp_fqir_commentmeta
WHERE comment_ID IS NULL;
-- 0 records
DBRE ERwin Model
Data Dictionary
By inspecting the data of each table…
wp_fqir_commentmeta
Key-value pairs from Akismet plugin
wp_fqir_comments
Comments posted to the Website (mostly spam)
wp_fqir_links
wp_fqir_options
Key-value pairs for WordPress and plugins
wp_fqir_postmeta
Key-value pairs for WordPress, embedded media, and plugins
wp_fqir_posts
Each record is a WordPress page or post
Appears to keep a history
wp_fqir_term_relationships
wp_fqir_term_taxonomy
wp_fqir_termmeta
wp_fqir_terms
Mostly user defined categories in WordPress
wp_fqir_usermeta
Key-value pairs for users
wp_fqir_users
Authorized users for the Website
Commentary
• WordPress has a very small database
– Only 12 tables
– I had expected more tables
• Note the lack of connectivity among tables
– Typically tables are more highly interconnected
• WordPress has no dangling references
– I don’t think WordPress uses RI
• ERwin only partially reverse engineers MySQL
– It chokes on some keywords
Case Study 2:
Reverse Engineer
Adventure Works 2012
Rationale
• Adventure Works 2012 is an interesting DBRE
case study because...
– Adventure Works is a free database provided with MS
SQL Server
– The case study has a populated database
The data is not proprietary
– The database is of medium size (71 tables)
– The database defines referential integrity
Only one FK is missing
Mechanical Approach
• Strip down the schema to get to a core model
• This is like skimming a book
– We are working towards an abridgement of a model
• We can quickly get a sense of a schema
• We will use ER/Studio
Record Counts
Query the SQL Server database
dbo.AWBuildVersion
dbo.DatabaseLog
dbo.ErrorLog
HumanResources.Department
Person.BusinessEntity
20777
1597
Person.BusinessEntityAddress
19614
0
Person.BusinessEntityContact
909
1
16
Person.ContactType
20
HumanResources.Employee
290
Person.CountryRegion
HumanResources.
EmployeeDepartmentHistory
296
Person.EmailAddress
19972
Person.Password
19972
HumanResources.EmployeePayHistory
316
Person.Person
19972
Person.PersonPhone
19972
HumanResources.JobCandidate
HumanResources.Shift
Person.Address
Person.AddressType
13
3
19614
6
Person.PhoneNumberType
Person.StateProvince
238
3
181
Record Counts
Record counts partially indicate table purpose
Production.BillOfMaterials
Production.Culture
2679
8
Production.Document
13
Production.Illustration
5
Production.Location
14
Production.Product
504
Production.ProductCategory
4
Production.ProductCostHistory
395
Production.ProductDescription
762
Production.ProductDocument
Production.ProductInventory
Production.ProductListPriceHistory
32
1069
395
Production.ProductModel
Production.ProductModelIllustration
128
7
Production.ProductModel
ProductDescriptionCulture
762
Production.ProductPhoto
101
Production.ProductProductPhoto
504
Production.ProductReview
4
Production.ProductSubcategory
37
Production.ScrapReason
16
Production.TransactionHistory
Production.TransactionHistoryArchive
Production.UnitMeasure
Production.WorkOrder
113443
89253
38
72591
Record Counts
Note different counts for ‘types’ and ‘instances’
•
Purchasing.WorkOrderRouting
Purchasing.ProductVendor
67131
460
Sales.SalesOrderHeader
31465
Sales.SalesOrderHeaderSalesReason
27647
Purchasing.PurchaseOrderDetail
8845
Sales.SalesPerson
Purchasing.PurchaseOrderHeader
4012
Sales.SalesPersonQuotaHistory
17
163
5
Sales.SalesReason
10
Purchasing.Vendor
104
Sales.SalesTaxRate
29
Sales.CountryRegionCurrency
109
Sales.SalesTerritory
10
Sales.SalesTerritoryHistory
17
Purchasing.ShipMethod
Sales.CreditCard
Sales.Currency
19118
105
Sales.ShoppingCartItem
3
Sales.CurrencyRate
13532
Sales.SpecialOffer
Sales.Customer
19820
Sales.SpecialOfferProduct
538
Sales.PersonCreditCard
19118
Sales.Store
701
Sales.SalesOrderDetail
121317
16
Processing Details
1. Import schema into ER/Studio
– File / New / Reverse engineer AdventureWorks2012
– All owners, user tables, no inferences
– Add FK Production.WorkOrderRouting.ProductID ->
Production.Product.ProductID
2. Successively delete all entity types with 0,1
connections
3. Delete entity types with few (<= 3) connections
Step 1: ER/Studio Model
Step 2: ER/Studio Model
Step 3: ER/Studio Model
Step 3: Final Tables
Remaining tables in black
dbo.AWBuildVersion
dbo.DatabaseLog
dbo.ErrorLog
HumanResources.Department
Person.BusinessEntity
20777
1597
Person.BusinessEntityAddress
19614
0
Person.BusinessEntityContact
909
1
16
Person.ContactType
20
HumanResources.Employee
290
Person.CountryRegion
HumanResources.
EmployeeDepartmentHistory
296
Person.EmailAddress
19972
Person.Password
19972
HumanResources.EmployeePayHistory
316
Person.Person
19972
Person.PersonPhone
19972
HumanResources.JobCandidate
HumanResources.Shift
Person.Address
Person.AddressType
13
3
19614
6
Person.PhoneNumberType
Person.StateProvince
238
3
181
Step 3: Final Tables
Deleted tables in red
Production.BillOfMaterials
Production.Culture
2679
8
Production.Document
13
Production.Illustration
5
Production.Location
14
Production.Product
504
Production.ProductCategory
4
Production.ProductCostHistory
395
Production.ProductDescription
762
Production.ProductDocument
Production.ProductInventory
Production.ProductListPriceHistory
32
1069
395
Production.ProductModel
Production.ProductModelIllustration
128
7
Production.ProductModel
ProductDescriptionCulture
762
Production.ProductPhoto
101
Production.ProductProductPhoto
504
Production.ProductReview
4
Production.ProductSubcategory
37
Production.ScrapReason
16
Production.TransactionHistory
Production.TransactionHistoryArchive
Production.UnitMeasure
Production.WorkOrder
113443
89253
38
72591
Step 3: Final Tables
Purchasing.WorkOrderRouting
Purchasing.ProductVendor
67131
460
Sales.SalesOrderHeader
31465
Sales.SalesOrderHeaderSalesReason
27647
Purchasing.PurchaseOrderDetail
8845
Sales.SalesPerson
Purchasing.PurchaseOrderHeader
4012
Sales.SalesPersonQuotaHistory
17
163
5
Sales.SalesReason
10
Purchasing.Vendor
104
Sales.SalesTaxRate
29
Sales.CountryRegionCurrency
109
Sales.SalesTerritory
10
Sales.SalesTerritoryHistory
17
Purchasing.ShipMethod
Sales.CreditCard
Sales.Currency
19118
105
Sales.ShoppingCartItem
3
Sales.CurrencyRate
13532
Sales.SpecialOffer
Sales.Customer
19820
Sales.SpecialOfferProduct
538
Sales.PersonCreditCard
19118
Sales.Store
701
Sales.SalesOrderDetail
121317
16
Commentary
• DBRE depends on having FKs
• The 0,1 connection deletions lose little info
• The “few” connections deletions are speculative
• Supertype/subtypes are troublesome
– From a separate manual DBRE
– BusinessEntity -> Employee, Vendor, Person, Store
– Employee -> SalesPerson
Supertype / Subtype
Case Study 3:
Core DBRE
Rationale
• A project building a very large data warehouse
– 100 facts
– 200 dimensions
• The primary operational feeder application has
8500 tables
• I was new to the project and there was a lot to
learn
• I wanted to reverse engineer the feeder
application so that I could understand it
Available Inputs
• We had the following inputs (paper printouts) for
the feeder application
– A thorough data dictionary
– Primary key definitions
– Foreign key definitions
The DBRE Problem
Reverse engineer a database with 8500 tables
• With smaller schema, we could type the database
structure into a modeling tool and then analyze it
• However, 8500 tables would take too long
• We decided to determine the tightly connected
tables and hope that would yield a much smaller
model
– We presume that the tightly connected tables are the
most important ones
DBRE Approach
Do a graph analysis
• Create a meta-table with FK to PK references
– The FK in the source table points to the PK in the
target table
• Using SQL, successively delete tables with 0,1
FK connections
• The final result is the multiply connected tables
Example
TableReferences
H
sourceTable
targetTable
A
B
D
E
E
G
H
F
I
B
A
C
C
D
B
A
G
G
A
B
C
E
TableReferences
G
I
sourceTable
targetTable
A
B
D
E
E
G
B
A
C
C
D
B
C
F
D
B
A
E
D
G
Example
TableReferences
sourceTable
targetTable
A
B
D
E
E
G
B
A
C
C
D
B
B
A
TableReferences
targetTable
A
B
D
E
E
B
A
C
C
D
C
G
A
B
E
C
E
sourceTable
D
D
Repeatedly subtract tables with one reference
until there is no change
Finding Core Tables
• DELETE FROM TableReferences AS T3
WHERE EXISTS (
SELECT T1.sourceTable
FROM TableReferences AS T1
WHERE NOT EXISTS (
SELECT *
FROM TableReferences AS T2
WHERE T1.sourceTable = T2.targetTable )
AND T3.sourceTable = T1.sourceTable
GROUP BY T1.sourceTable
HAVING COUNT(*)=1 );
• The middle query finds tables with one source reference
• The innermost query limits the one-source tables to those that
are not the target of any other sources
• The outer query does the deletion
Results
• Initial: 8500 tables
– Several thousand FK definitions
– 854 tables have FK columns
– 254 tables are referenced by FKs
• Final result: 553 core tables
Case Study 4:
Enterprise Data Model
Rationale
• Construct an enterprise data model
• My client – a financial software vendor – was a
fusion of five formerly separate companies
• The applications were greatly dissimilar because
they were built by separate organizations
• The purpose of the EDM was to provide a basis
for integrating the applications and help the new
company strengthen their brand
DBRE Approach
• We wanted to seed the EDM with application
content
• We tried full DBRE but it was not helpful because
the resulting models were so different
• We tried core DBRE but the models were still
confusing because they were so different
• Finally we decided to count the FK references to
each table
– We included the tables with the highest counts
– This worked
Example
•
•
•
•
•
•
•
•
•
•
•
•
•
•
AddressType
Applicant
ApplicantAddressHistory
ApplicantType
LenderApplicantDetails
Country
OverseasCorrespondence
ExistingInsuranceCover
Provider
PaymentFrequency
Fee
FeeDueType
FeeType
ProductFee
2
29
7
2
2
2
2
6
3
2
11
2
2
2
Results
Holding
Account
Customer
Product
Activity
FinancialScenario
D ocument
DBRE and Zachman FW
• The “What” column
–
–
–
–
–
Populated database – Technician perspective
Database schema – Engineer perspective
Physical data model – Architect perspective
Logical data model – Business mgmt. perspective
Conceptual data model – Executive perspective
Thank you for attending…
Any questions???