Beyond Java - Tolerant Systems
Download
Report
Transcript Beyond Java - Tolerant Systems
UC Irvine – project transprose: transporting programs securely
New Approaches to Mobile Code:
Reconciling Execution Efficiency
with Provable Security
Michael Franz
University of California, Irvine
July 2001
Introduction
mobile code is an enabling technology
– download functionality as needed
– handheld, untethered devices, “information appliances”
– platform-independent identical code can run on PDAs,
desktop machines, even supercomputers
but, many unresolved issues with respect to
–
–
–
–
performance of the mobile program (on the target)
performance of the mobile code distribution mechanism
protecting the host against malicious mobile programs
[guarding a mobile program’s secrets against a malicious host]
Guiding Overall Objective
make mobile code practical, so that
eventually, native code will need to exist only
transiently, created on-the-fly and consumed on the
spot
while mobile code will be used as the storage and
distribution medium
Context
dynamic code-generation technology is approaching
maturity and processors are becoming fast enough to
sustain it (in real time)
this is rapidly diminishing the value of “binary
compatibility”
moreover, dynamic optimization techniques yield better
code than static compilation
– exploit actual processor parameters (caches, …)
– “live” profiling data may be available
=> “mobile code will define future platform(s)”
Mobile Code Security
most approaches are based on some type-safe
programming language
host systems publish their policies in terms of typesafe APIs
conformance to that interface is then guaranteed by the
mobile code transportation scheme
– semantically equivalent to transporting source code
– however, for efficiency and quality of dynamic code
generation, usually want to transport a format “closer to the
machine” while still preserving source-program type-safety
semantics
Existing Practice: Java
the Java Virtual Machine is the de-facto standard
format for distributing mobile programs
the JVM has an instruction set that has been designed
specifically for representing Java programs
– interestingly enough, there still are JVM programs for which
no legal equivalent Java source program exists
– there are also legal Java programs that are rejected by all
possible JVM bytecode verifiers [Staerk’00]
security
is obtained by verifying the JVM bytecode,
essentially a symbolic execution of the program
Security vs. Efficiency
the Java Virtual Machine's instruction format is not
very capable in transporting the results of program
analyses and optimizations
as a consequence, when Java byte-code is transmitted,
each recipient must repeat most of the analyses and
optimizations that could have been performed just
once at the origin
the main reason why Java byte-code has these
deficiencies is to allow verification by the recipient
Security vs. Efficiency
for example, a code producer often has information
about the redundancy of a type or index check
but this fact cannot be communicated safely to the
code consumer - not in a manner that the recipient can
be sure that this is not a false claim inserted by a
malicious third party
similar concerns inhibit common compiler
optimizations such as common subexpression
elimination at the code producer’s side
An Alternative Approach: PCC
instead of executing the program symbolically at the
receiver’s site (which is time consuming and
complex), the code producer attaches a “proof” that
the code is correct
the “proof” shortcuts the verification: checking a given
solution is often much simpler than finding it in the
first place
the Java KVM for embedded devices uses a kind of
PCC (“stack maps”) that may become a standard for
Java
A Third Approach
instead of verifying or checking, we have been been
investigating a class of mobile code representations that can
provably encode only “legal” programs
– security is obtained by construction
– the need for verification disappears
– our approach can provide the identical security guarantees as the Java
Virtual Machine, but it can express most of them statically as a wellformedness property of the encoding itself
– in our solution, an incoming mobile program may not do the intended
task, but it will not do anything “bad” - for any definition of “bad” that
can be cast into a type system
– interestingly enough, such “intrinsically secure” mobile code is also
denser than virtual machine code, and permits to generate better object
code, and faster
A Third Approach: Two Variants
we have in fact designed not just one, but two
alternative mobile-code representations, both of which
provide “security by construction”
they differ in the semantic level at which they describe
the mobile program
– “high-level”: close to the source language but with
supporting compiler-related information
– “low-level”: as close to what a modern code generator backend needs without being target-machine specific
Rationale for Multi-Track Approach
the relative trade-offs (encoding density vs.
decoding/dynamic compilation speed vs. code quality)
are completely unknown and can only be determined
by collecting experience with actual prototypes
by implementing both the “high-level” and the “lowlevel” solution, we are exploring the design space
rather than designing an ad-hoc solution
Low-Level Encoding [PLDI01]
SafeTSA preserves control and dataflow information
as well as full typing for each intermediate result
it is based on SSA form, a representation that is also
used internally by a number of important state-of-theart research compilers for Java, e.g.,
– IBM T.J. Watson Lab: Jalapeño
– Microsoft: Marmot
– Sun Microsystems: HotSpot Server
SafeTSA is far easier to parse into a form useful for
code optimization than JVM-code
Current Status and Results
based on Martin Odersky’s Pizza front-end
can compile all of Java to safeTSA
prototype run-time environment almost finished; will
provide full interoperability between safeTSA and
JVM-based class files
– can mix and match both formats with dynamic loading
– call-backs from JVM to safeTSA are ugly
safeTSA representation is surprisingly small
High-Level Encoding [Babel01]
ultra-compact representation using grammar-based
compression of abstract syntax trees
goal is to transport the source program along with as
much compiler-related support information as possible
Schematic Overview
Source
Parser
Code
Generator
“classic
Frontend”
AST
Encoder
PPM-Model
& Arithmetic
Encoder
“classic
Backend”
011000101010…
AST
Decoder
PPM-Model
& Arithmetic
Decoder
Compression / Decompression
Compression Overview
Parsing: get AST from source
Serialize: get stream of symbols from AST
Modeling: use context and abstract grammar to build
predictive statistical model
Coding: use arithmetic coding with model
Types of nodes in AST
String, Integer, Terminal
List : e.g. Block = BlockStatement*
Aggregate : e.g. IF = cond thenbranch elsebranch
Choice : e.g. BinOp = Plus | Minus | …
Information is in choice nodes
– want to guess which choice is taken
Transmitting an AST
any predefined serialization will do
we use depth first (pre-order)
when serialized, most info in AST is redundant, e.g.
– order and kind of kids of aggregate nodes known
– this is because we use knowledge of the grammar
must encode index of choice made at choice nodes
Prediction by Partial Match (PPM)
dynamically maintain counts of characters seen after
various contexts
contexts may be of various lengths
– eg. for “abcd”, contexts for “d” are :
» length 1 context : “c”
» length 2 context : “bc”
» length 3 context : “abc”
predict characters in current context by looking at what
occurred previously
Maintaining Contexts
a
b
*
c d
a
b
b
c
d
c
d
c
d
d
Adapting PPM To Work On Trees
each node is a symbol
the context is path from root to the current node in the
AST
problem: in DFS, what when we reach leaf node and
go back up to ancestor?
– pop context – all active nodes moved up one position to their
parents (in context tree)
Encoding
PPM is used to model the choices made at choice
nodes, i.e. associate a probability with every choice
these probabilities are used to drive an arithmetic
coder to output bits
Compressing Constants
constants (strings, integers, names) are a significant
fraction of source
to compress: make table of constants, and refer to them
by their index in this table
further compress: maintain different tables for strings,
names etc. – reduces number of bits in index
currently exploring more sophisticated context
modeling ideas for compressing constants
AST Compression: Example
AST for: i = i + 1
Choice nodes
Relevant grammar rules
Stmt = If|While|Assign|….
Assign = Lvalue Expr
Lvalue = Field|VarAccess|…
Expr = Unary|Binary|…
Binary = BinOp Expr Expr
Preorder traversal
Stmt Assign Lvalue VarAccess i Expr
Binary Expr VarAccess i Expr Literal
IntLiteral 1 BinOp +
AST Compression: Example
AST for: i = i + 1
Context tree
AST Compression: Example
AST for: i = i + 1
Context tree
AST Compression: Example
AST for: i = i + 1
Context tree
Model:
Prob(j) = 0.3
Prob(k) = 0.5
Prob(i) = 0.2
Send model and
choice “i” to
arithmetic coder
Status and Results
compressor/decompressor prototype written in Python
completely generic – can be used with any abstract
grammar
have implemented the Java abstract grammar
– works with single Java source files as well as entire
packages.
comparison for Java class-file compression with
Pugh’s results (best published Java compressor)
Results: Classes
Name
Class
file
Gzip
Bzip2
Pugh
CAST
CAST/
Pugh
ErrorMessage
305
256
270
209
105
50%
CompilerMember
1192
637
641
396
230
58%
BatchParser
4939
2037
2130
1226
1069
87%
Main
11363
5482
5607
3452
3295
95%
SourceMember
13809
5805
5705
3601
2988
83%
SourceClass
32157
13663
13157
8863
7849
89%
Classes from Sun’s javac package - all sizes in bytes
Results: Archives
Package
Jar file
Gzip
Bzip2
Pugh
CAST
CAST/
Pugh
Javac
36787
32615
30403
18021
14070
78%
Jess
232041
133146
97852
48331
31083
64%
compressed collections of classes - all sizes in bytes
• compressed ASTs are 5-50% smaller than Pugh’s
• 3-8 times smaller than uncompressed class files
or JAR files
Performance-Enhancing Information
now raise the semantic level of the grammar
e.g. “Escape Analysis”
– an object that doesn’t “escape” its defining scope can be allocated on the
stack rather than on the heap
– this optimization alone can often double performance
the analysis itself is very difficult to do, but the results of the
analysis are easy to verify
– augment the type system by “escaping/non-escaping”
– make this part of the encoding scheme itself
– e.g., => a non-escaping object cannot be assigned to a variable from an
enclosing scope
Insights So Far
abstract syntax trees viable as a mobile code format
can be highly compressed
– Java archives by factor of 3-8
– 5-50% better than Java bytecode specific compression by
Pugh
Overall Project Achievements
lead the way to a genuine improvement over virtual
machine transportation formats
– security without need for validation
– tamper-proof performance-improving information
innovative and generic program compression method
as a useful by-product of this effort
Task Schedule
Y1 Milestones:
•source-level representation
=> Java compression
•low-level representation
•core calculus representation
1999
investigate:
•multiple source languages
•graph-based encoding
schemes
•proof-carrying code
Y2 Milestones:
•system prototypes
•trade-off analysis
•encoding format
comprehensive definition
2000
2001
investigate:
•requirements of
optimizing code generators
•integration of security vs.
compiler-related data
End of Project:
•system deliverable
•comprehensive
documentation
2002
investigate:
•mutual interaction of
security, efficiency,
and compression density
•security of system
Mobile Code Security Revisited
provided through type-safe programming language and type-safe APIs
– semantically equivalent to transporting source code (everybody does it this way)
but many policies currently cannot be expressed in terms of a type system and
hence need to be implemented inside the library
–
–
–
–
–
“open only files in directory X”
“initiate connections only with IP addresses in range […]”
“execute no more than N instructions between OS calls”
“do not send on network after reading local files”
=> security automata
need to represent these properties directly and support them along the whole
pipeline from code producer to code consumer
=> some other PIs in Oasis are working on these themes and their work can be
directly beneficial to this project
Transition of Technology
our prototype implementation(s) will be made
available in source form
the idea is to create a “turnkey” replacement to current
Java compilers and JVM runtime systems
–
–
–
–
you simply take your code and recompile using our compiler
it will then run on our runtime
our runtime will also run your old JVM class files
you can even mix our stuff with JVM class files
=> we simply provide a new (better!) mobile code
transportation layer without changing anything else
Thank You