CS chit-chat - ERI people pages

Download Report

Transcript CS chit-chat - ERI people pages

Gazetteer database
• 4.5 million items, each having:
– 1+ names
• fair to good discriminator
– 1 geospatial footprint
• 99.9% points, 0.1% boxes & polygons
• anticipate getting more (overlapping) polygons
– 1+ types
• excellent to horrible discriminator
• No obvious way to partition them
Greg Janée • chit-chat with CS database folks • 10/26/01
Gazetteer database
• Informix Dynamic Server 2000
• Indexes
– names: Verity (external; blade interface)
– footprint: MapInfo (R-tree; blade interface)
– types: B-tree
Greg Janée • chit-chat with CS database folks • 10/26/01
Gazetteer database
• Queries
– dynamic
– general case: arbitrary boolean combinations
– in practice:
• [name] AND [footprint] AND [([type] OR [type] OR ...)]
• Desired behavior
– see some results immediately  queries
– after seeing some results, ability to kill query
Greg Janée • chit-chat with CS database folks • 10/26/01
Gazetteer database
• Observed behavior:
– Either: query answered quickly (< 2 minutes)
• optimizer picks “right” index
– Or: query takes forever (> 30 minutes)
• optimizer picks “wrong” index
• Complications:
– no way to kill or interrupt database thread,
through JDBC or otherwise
Greg Janée • chit-chat with CS database folks • 10/26/01
Challenge #1
• Solve ADEPT’s query problem
– Multiple, different data types
• spatial, text, traditional linear types
– Discriminability of any given index greatly
depends on both data & query
– Dynamic queries
Greg Janée • chit-chat with CS database folks • 10/26/01
Load balancing
• Why
– distribute those queries-from-hell
– increase reliability
• How
– multiple independent, identical databases
– middleware directs query to “best” database
– database executes query in its entirety
Greg Janée • chit-chat with CS database folks • 10/26/01
Load balancing
• What the middleware knows about a
database:
– current & maximum connection count
– current queries
– amount of time each query has been processing
(QPT)
• Idea:
– score each DB inversely proportional to QPT2
Greg Janée • chit-chat with CS database folks • 10/26/01
Challenge #2
• Define metric
– overall query processing time
– response time
• Do better
– incorporate connection counts into formula
– analyze queries
– keep history
Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation
• Gazetteer is an instance of a much more general
problem
• To wit:
– how to describe the automatic translation of dynamic
queries written in an abstract query language to SQL
– in an easy, powerful, flexible way
– making as few assumptions as possible about the
underlying schema
– and producing “reasonable” SQL
• not so bad as to preclude database’s optimizer from working
Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation
• Collection ( database) contains items having IDs
– query should return IDs; duplicates OK
– response time more important than overall QPT
• Search bucket
– abstract, typed thing against which constraints may be
placed;  standard buckets
• Each collection supports 1+ buckets in
idiosyncratic ways
• Query language
– arbitrary boolean combinations of bucket constraints
Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation
• Standard buckets
– subject-related text
• title
• assigned term
–
–
–
–
–
–
originator
geographic location
coverage date
object type, feature type, ...
format
identifier
Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation
• Buckets grouped by types
– spatial
• e.g., overlaps a given polygon
– temporal
• e.g., contains a given date range
– hierarchical
• e.g., is any kind of geographic work
– textual, qualified textual
• e.g., contains the phrase “luis obispo”
– numeric
• e.g., > 5.7 meters
Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation
• Example: translating a spatial bucket
– MapInfo datablade
• ST_Contains(table.column,
“HG_Box(coords)”)
– Geodetic datablade
• Inside(“GeoBox(coords)”, table.column)
– four bounding coordinate columns
• table.northcolumn >= ... and
table.southcolumn <= ... and ...
Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation
• Existing Python-based scripting system
– easy to configure & extend
– comes with library of standard translation techniques
– "geographic-location" :
Bucket(
"spatial",
standardSpatialOperators,
spatialToInformixMapInfo,
["j_holding", "footprint"])
• How to extend to boolean queries?
Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation
• Too-easy solution:
– given constraints C1, C2 that translate into SQL
constraints (T1, S1), (T2, S2) then constraint
C1 op C2 where op is AND, OR, or AND NOT becomes
• select id from T1 where S1 op id in (select id from T2 where C2)
• But: Informix appears to execute subqueries in
their entirety before considering the outer query
Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation
• The problems with JOINs
– handling ANDs
• tables may have {1, 1?, 1+, 0+} rows per item
– handling ORs
• may require UNION
• but UNION is not nestable
Greg Janée • chit-chat with CS database folks • 10/26/01
Query translation
• The problem with disjunctive normal form:
– may be inefficient
• select id from table where S1 and (S2 or S3)
– versus
• select id from table where S1 and S2
union
select id from table where S1 and S3
Greg Janée • chit-chat with CS database folks • 10/26/01
Challenge #3
• Design a translation description system
– easy to configure & extend
– makes as few assumptions as possible about the
underlying schema
– should produce reasonable SQL by default
– supports customization of translation process
– supports pattern-based overrides
Greg Janée • chit-chat with CS database folks • 10/26/01