XML Data Binding:
Encoding for High-Performance
Content-Based Event Routing
Gail Kaiser
Phil Gross
Columbia University
Programming Systems Lab
Overview

PSL Intro
MEET Project
Encoding Conversion Efficiency
Encoding Size Efficiency
Encoding Classification Efficiency
Programming Systems Lab


“PSL conducts research on Web
technologies, collaborative work, virtual
worlds, process/workflow, extended
transaction models, software development
environments and tools, software
engineering, information management, and
distributed programming systems”
Lately, lots of XML stuff
PSL XML-related Research

FlexML: Flexible XML
– Open-ended XML streams that may include “new” tags
– Dynamic schema and semantics discovery and composition
XUES: XML-based Universal Event Service
– Event Packager: Data mining over XML structured data
– Event Distiller: XML event poset pattern matching
– Learning new application-domain events to recognize
DISCUS: Decentralized Information Spaces for Composition and Unification of Services
– Rapid and secure application composition using Web Services
– Trust Evolution: PGP Trust + KeyNote + real-world business
MEET

Multiply Extensible Event Transport
Content-based multicast routing
Must be efficient enough for embedded and high-performance applications
MEET Motivations

Personal Life Recorder (sensor oriented)
GroupWork Recorder (computer/DB oriented)
Parallel/Grid computing
Distributed simulation
Battlefield C4I
Last, but not least:
– Dissertation submission
Relationship to Other Work

Communication is generally modeled like this:
[Diagram: Machine A (Relational), Machine B (XML)]
What actually goes over the line is an afterthought
But with N-Way Internet-scale communication
– Millions of publishers and subscribers
We can (must!) do better than ASCII text…
MEET Extensibility

Want to scale up, to millions of pubs and subs
Want to scale down, to embedded and wireless
No single solution satisfactory at all scales
Composed of hot-swappable subsystems
– Router, transports, clock/causality, types, etc.
Why Types

Event data is not just an opaque bag of bits
Subscriptions are Boolean functions over events
Type safety would be nice
What type system to use?
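As a minimal sketch of "subscriptions are Boolean functions over events", the Python below treats a subscription as a predicate over a typed event and composes predicates with and/or/not. The event type `TemperatureEvent` and the helper names are hypothetical, chosen only for illustration; the slides do not specify MEET's actual type system or subscription API.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical event type, used only for illustration.
@dataclass(frozen=True)
class TemperatureEvent:
    sensor_id: int
    celsius: float


# A subscription is any Boolean function over an event.
Subscription = Callable[[TemperatureEvent], bool]


def both(a: Subscription, b: Subscription) -> Subscription:
    """Logical AND of two subscriptions."""
    return lambda event: a(event) and b(event)


def either(a: Subscription, b: Subscription) -> Subscription:
    """Logical OR of two subscriptions."""
    return lambda event: a(event) or b(event)


def negate(a: Subscription) -> Subscription:
    """Logical NOT of a subscription."""
    return lambda event: not a(event)


def is_hot(event: TemperatureEvent) -> bool:
    return event.celsius > 30.0


def from_sensor_7(event: TemperatureEvent) -> bool:
    return event.sensor_id == 7


# "Hot readings from any sensor except number 7."
hot_not_7 = both(is_hot, negate(from_sensor_7))
```

Because the fields are typed, a subscription such as `is_hot` can be checked against the event definition before it is ever installed in a router, which is the kind of type safety the slide asks for.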
Initial MEET Type Design

Initial design calls for supporting Java, C#, and XML Schema defined objects “out of the box”
XML Schema used as ur-language/Esperanto for conversions
Subscriptions are arbitrary Boolean functions on datatypes
XML Schema is not an ideal ur-type
– Excessively complex, verbose, etc.
Encodings for Efficiency

Java, C#, XML, ASN.1 have well-defined but proprietary encodings for instances
Would be nice to have an independent encoding scheme with some desirable properties missing from the above
– Fast serialization/deserialization
– Elimination of redundant information from message sequences
– Data organized for rapid classification/routing
Conversion Efficiency

Need to get to and from wire format as fast as possible
Leverage homogeneity to eliminate unnecessary conversions, e.g., network byte order
ECho system from Eisenhauer et al., Georgia Tech
– Using “native data” for ultra-low latency
– Necessary for HPC
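One way to picture the "native data" idea is a sender that writes values in its own byte order and tags the message, so that a receiver on the same kind of machine has nothing to convert. The sketch below is an illustration under that assumption, not ECho's actual API; the one-byte order tag and the function names are hypothetical.

```python
import struct
import sys

# Hypothetical 1-byte tags marking the sender's byte order.
LITTLE, BIG = 0, 1
NATIVE = LITTLE if sys.byteorder == "little" else BIG


def _fmt(order, count):
    """struct format for `count` 32-bit signed ints in the given byte order."""
    return ("<" if order == LITTLE else ">") + f"{count}i"


def encode(values, order=NATIVE):
    """Write ints in the sender's byte order, prefixed by the order tag."""
    return bytes([order]) + struct.pack(_fmt(order, len(values)), *values)


def needs_swap(message):
    """True only when sender and receiver byte orders differ."""
    return message[0] != NATIVE


def decode(message):
    """Interpret the payload using the byte order named in the tag."""
    order, payload = message[0], message[1:]
    return list(struct.unpack(_fmt(order, len(payload) // 4), payload))
```

In Python, `struct` hides the byte swap, so `needs_swap` only reports whether one would be required. In a language with direct memory access, a receiver for which `needs_swap` is false could use the payload in place, which is where the latency saving comes from.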
Size Efficiency

Ideal for single message is self-describing data
With multiple messages of same type, one can pull out redundant type info, e.g., schema
Goal is to go further: If 90% of content of messages is the same, generate a new subtype with fixed values
From self-describing to all-schema is a continuum
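The "subtype with fixed values" idea can be sketched as follows: look at a run of messages of the same type, find the fields whose value never changes, move those into a derived subtype that is sent once, and afterwards transmit only the fields that vary. This is a simplified illustration that models messages as Python dicts with identical keys; the function names and sample fields are hypothetical.

```python
def derive_subtype(messages):
    """Return the fields whose value is identical in every message.

    Assumes all messages share the same set of fields (same base type).
    """
    first, rest = messages[0], messages[1:]
    return {k: v for k, v in first.items() if all(m[k] == v for m in rest)}


def compress(message, fixed):
    """Keep only the fields that the subtype does not already fix."""
    return {k: v for k, v in message.items() if k not in fixed}


def expand(residual, fixed):
    """Rebuild the full message from the subtype and the varying fields."""
    return {**fixed, **residual}


# Hypothetical sensor readings: "unit" and "site" never change.
readings = [
    {"unit": "C", "site": "lab", "sensor": 1, "value": 20.5},
    {"unit": "C", "site": "lab", "sensor": 2, "value": 21.0},
]
fixed_fields = derive_subtype(readings)
on_the_wire = [compress(m, fixed_fields) for m in readings]
```

With no fixed fields the messages stay fully self-describing, and with every field fixed nothing but the subtype identifier needs to be sent, which matches the continuum described on the slide.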
Classification Efficiency

When bits start arriving serially at the router, would like to begin cut-through routing as soon as possible
– Avoid the curse of IP/IPv6: source address first
Want key routing bits as close to the front as possible
Want data in fixed locations
Fast Classifying: First Things First

In the packet, type info first (after the magic number)
– Would like to represent type codes as a bit string with “most significant” info first, e.g., parent type, followed by subtype identifier, sub-subtype, etc.
– Need access to type hierarchy
Popular classification fields at the front
– Need to tag with popularity metadata
– “subscribers will want to select on me”
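A hierarchical type code of this kind can be sketched as a prefix code: each type's code is its parent's code followed by a few bits that distinguish it among its siblings, so a subscription to a parent type matches any packet whose leading bits start with the parent's code. The hierarchy, type names, and bit assignments below are hypothetical, and bit strings are written as text ("0"/"1") for readability rather than packed into bytes.

```python
# Hypothetical type hierarchy: name -> (parent, bits distinguishing it
# among its siblings). Sibling codes have equal length, so no sibling's
# code is a prefix of another's.
HIERARCHY = {
    "event": (None, ""),
    "sensor": ("event", "0"),
    "log": ("event", "1"),
    "sensor.temp": ("sensor", "00"),
    "sensor.gps": ("sensor", "01"),
}


def type_code(name):
    """Full code: the parent's code first, then this type's own bits."""
    parent, bits = HIERARCHY[name]
    return bits if parent is None else type_code(parent) + bits


def matches(subscribed_type, packet_bits):
    """A packet matches if it starts with the subscribed type's code.

    Only the leading bits are examined, so the decision can be made
    before the rest of the packet has arrived.
    """
    return packet_bits.startswith(type_code(subscribed_type))


# A "sensor.gps" packet: type code followed by arbitrary payload bits.
gps_packet = type_code("sensor.gps") + "10101100"
```

Here a subscription to `sensor` matches `gps_packet` after reading a single bit, while a subscription to `sensor.temp` is rejected after three. Computing the codes requires the type hierarchy, which is why the slide lists access to it as a need.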
Fast Classifying: Fixed Positions

Would like to avoid scanning through long or variable-length fields
Long/Variable data needs to be in a separate channel/section
Primitives and fixed-length references at the front
– References point into data section
– Classifier can jump over large, uninteresting data quickly
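The layout described above can be sketched with a fixed-size header that holds the primitives plus an (offset, length) reference, followed by a data section holding the variable-length content. The field choices, the `MEET` magic value, and the function names are assumptions made for this illustration, not a format defined in the slides.

```python
import struct

# Hypothetical fixed-size header: magic (4 bytes), type code (uint16),
# priority (uint16), then a fixed-length reference to the body given as
# (offset, length), both uint32. "=" means native byte order with
# standard sizes and no padding.
HEADER = struct.Struct("=4sHHII")
MAGIC = b"MEET"


def pack_event(type_code, priority, body):
    """Put primitives and the body reference up front, the body after."""
    header = HEADER.pack(MAGIC, type_code, priority, HEADER.size, len(body))
    return header + body


def classify(packet):
    """Read only the fixed-position fields; the body is never scanned."""
    magic, type_code, priority, _offset, _length = HEADER.unpack_from(packet, 0)
    if magic != MAGIC:
        raise ValueError("not a recognized packet")
    return type_code, priority


def body_of(packet):
    """Follow the reference into the data section."""
    *_, offset, length = HEADER.unpack_from(packet, 0)
    return packet[offset:offset + length]


packet = pack_event(type_code=7, priority=2, body=b"x" * 1000)
```

`classify` touches the same 16 bytes whether the body is ten bytes or ten megabytes, and a router that needs the next packet can skip ahead by `offset + length` without reading the body.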
Plus: Schema Format

We’d like the schema format to be amenable to programmatic manipulation and analysis
For instance, when negotiating formats, we’d like to be able to compute how our original format offer differs from the counter-offer
XML Schema is pretty good for this
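To illustrate what computing the difference between an offer and a counter-offer might look like, the sketch below models a format as a plain mapping from field name to type name and reports which fields were added, removed, or changed. This is a deliberate simplification: a real implementation would compare XML Schema documents, and the field names and type names here are hypothetical.

```python
def diff_formats(offer, counter):
    """Compare two field-name -> type-name mappings."""
    added = {k: counter[k] for k in counter.keys() - offer.keys()}
    removed = {k: offer[k] for k in offer.keys() - counter.keys()}
    changed = {
        k: (offer[k], counter[k])
        for k in offer.keys() & counter.keys()
        if offer[k] != counter[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical negotiation: the peer narrows "temp", drops "note",
# and asks for a new "unit" field.
our_offer = {"id": "int32", "temp": "float64", "note": "string"}
counter_offer = {"id": "int32", "temp": "float32", "unit": "string"}
delta = diff_formats(our_offer, counter_offer)
```

A negotiator could then accept the counter-offer automatically when `delta` contains only changes it knows how to convert, and fall back to a slower path otherwise.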
Conclusions

Efficient instance transfer is an interesting case for data-binding
Special needs for efficiency
But we can negotiate our own format among the communicating parties
Some explicit support for this in a general data-binding solution could help acceptance