Code Gen of Expr Evaluation in Shark

Code Gen of Expr Eval in Shark
[email protected]
Outline
• CG examples
• Performance Comparison (CG Expr Eval vs. Hive Expr Eval)
• CG Design & Major Class Diagram
• Implemented UDFs/Generic UDFs
• Future Work
CG Examples
Set shark.expr.cg=true/false in hive-site.xml to enable/disable the feature; the default is true.
Performance Comparison (CG Expr Eval vs. Hive Expr Eval)
Test data: 747,747,840 records / 66,909,023,675 bytes, stored as RCFile (with LzoCodec), on 4 slave machines
Performance Comparison (CG Expr Eval vs. Hive Expr Eval) (2)
• Why is CG Expr Eval faster than Hive Expr Eval? In Hive Expr Eval:
A. Common sub-expressions are re-evaluated every time they appear
e.g. in the expression concat(year(date_add(visitDate,7)), '/', month(date_add(visitDate,7)), '/', day(date_add(visitDate,7))), “date_add(visitDate,7)” is evaluated 3 times.
B. Data types are checked repeatedly at runtime
The parameter types of the “evaluate” method in GenericUDFs are unknown until runtime, so Hive Expr Eval has to keep checking the value types inside “evaluate”, e.g. GenericUDFOPGreaterThan.evaluate, GenericUDFPrintf.evaluate, etc.
C. Unnecessary type conversion
e.g. in the expression (duration + 1.03), the variable “duration” is first converted into a new FloatWritable object in Hive Expr Eval, which creates many small temporary objects (GenericUDFBridge.conversionHelper).
D. A large number of virtual function calls at runtime
Hive Expr Eval always works through base-class references, particularly for the UDF objects and the field value objects.
E. Java reflection is used to call the UDF evaluate() method
Hive Expr Eval invokes the UDF (in class GenericUDFBridge) through the Java Reflection API, which adds further overhead (http://docs.oracle.com/javase/tutorial/reflect/index.html).
CG Expr Eval generates source code with concrete object types and concrete execution branches.
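To make points A, B and D concrete, here is a minimal sketch that contrasts the two styles on the date expression from point A. It is an illustration under stated assumptions, not Shark's or Hive's code: the class ExprEvalSketch, the Node interface, and all helper methods are hypothetical, and java.time.LocalDate stands in for Hive's date handling.

```java
import java.time.LocalDate;

// Hypothetical illustration only; none of these names come from Shark or Hive.
final class ExprEvalSketch {

    /** Counts dateAdd calls so the two evaluation styles can be compared. */
    static int dateAddCalls = 0;

    static LocalDate dateAdd(LocalDate date, int days) {
        dateAddCalls++;
        return date.plusDays(days);
    }

    /** Interpreter style: every node returns Object and is invoked virtually. */
    interface Node {
        Object eval(LocalDate visitDate);
    }

    static Node dateAddNode(int days) {
        return visitDate -> dateAdd(visitDate, days);
    }

    static Node datePartNode(String part, Node child) {
        return visitDate -> {
            Object value = child.eval(visitDate);
            // The argument type is only known at runtime, so it is checked per row.
            if (!(value instanceof LocalDate)) {
                throw new IllegalArgumentException("expected a date");
            }
            LocalDate date = (LocalDate) value;
            switch (part) {
                case "year":  return date.getYear();        // boxed to Integer
                case "month": return date.getMonthValue();
                default:      return date.getDayOfMonth();
            }
        };
    }

    static Node literalNode(String text) {
        return visitDate -> text;
    }

    static Node concatNode(Node... children) {
        return visitDate -> {
            StringBuilder out = new StringBuilder();
            for (Node child : children) {
                out.append(child.eval(visitDate));
            }
            return out.toString();
        };
    }

    /** concat(year(date_add(d,7)), '/', month(date_add(d,7)), '/', day(date_add(d,7))) */
    static Node interpretedTree() {
        return concatNode(
            datePartNode("year", dateAddNode(7)), literalNode("/"),
            datePartNode("month", dateAddNode(7)), literalNode("/"),
            datePartNode("day", dateAddNode(7)));
    }

    /** Roughly what generated code for the same expression could look like. */
    static String generatedEval(LocalDate visitDate) {
        LocalDate shifted = dateAdd(visitDate, 7); // common sub-expression, computed once
        return shifted.getYear() + "/" + shifted.getMonthValue() + "/" + shifted.getDayOfMonth();
    }

    public static void main(String[] args) {
        LocalDate visitDate = LocalDate.of(2013, 1, 1);

        dateAddCalls = 0;
        Object interpreted = interpretedTree().eval(visitDate);
        System.out.println(interpreted + " (dateAdd calls: " + dateAddCalls + ")");

        dateAddCalls = 0;
        String generated = generatedEval(visitDate);
        System.out.println(generated + " (dateAdd calls: " + dateAddCalls + ")");
    }
}
```

Both paths produce the same string, but the tree walk calls dateAdd three times per row and pays for an instanceof check and boxing at each node, while the straight-line version calls it once and works on concrete types.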
CG Design & Major Class Diagram
CG Design & Major Class Diagram (2)
• Why not generate the bytecode directly?
A. The generated content is quite complicated; source code is much easier to debug and troubleshoot.
B. The Java compiler can apply further optimizations when compiling the source code.
• Why generate the evaluation source code from the ExprNodeDesc tree rather than from the Hive ExprNodeEvaluator tree?
A. The ExprNodeEvaluator tree loses some information that may be helpful for further optimization (e.g. evaluating common sub-expressions once).
B. Extracting information from the ExprNodeEvaluator tree is difficult, as most of the fields in ExprNodeEvaluator are protected / private.
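The "generate source, then compile" approach can be sketched with the JDK's standard javax.tools API. This is a minimal sketch, not Shark's implementation: the class SourceCodeGenSketch, the generated class name GeneratedExpr, and the use of java.util.function.DoubleUnaryOperator as the evaluator interface are all assumptions made for illustration, and the expression is passed in as a plain string rather than derived from an ExprNodeDesc tree. It needs a JDK at runtime, since ToolProvider.getSystemJavaCompiler() returns null on a JRE-only installation.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;
import javax.tools.*;

// Hypothetical illustration only; not Shark's actual code generator.
final class SourceCodeGenSketch {

    /** Generated Java source held in memory. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;
        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + Kind.SOURCE.extension), Kind.SOURCE);
            this.code = code;
        }
        @Override public CharSequence getCharContent(boolean ignoreEncodingErrors) { return code; }
    }

    /** Receives the compiled bytecode in memory instead of on disk. */
    static final class ByteCodeSink extends SimpleJavaFileObject {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ByteCodeSink(String className) {
            super(URI.create("mem:///" + className.replace('.', '/') + Kind.CLASS.extension), Kind.CLASS);
        }
        @Override public OutputStream openOutputStream() { return bytes; }
    }

    /** Emits readable source for an expression over one double column named "duration". */
    static String generateSource(String className, String expr) {
        return "public final class " + className
             + " implements java.util.function.DoubleUnaryOperator {\n"
             + "    @Override public double applyAsDouble(double duration) {\n"
             + "        return " + expr + ";\n"
             + "    }\n"
             + "}\n";
    }

    /** Compiles the generated source in memory and returns a typed evaluator. */
    static DoubleUnaryOperator compileExpr(String expr) {
        String className = "GeneratedExpr";
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("no system Java compiler; a JDK is required");
        }
        Map<String, ByteCodeSink> outputs = new HashMap<>();
        JavaFileManager fileManager = new ForwardingJavaFileManager<StandardJavaFileManager>(
                compiler.getStandardFileManager(null, null, null)) {
            @Override
            public JavaFileObject getJavaFileForOutput(JavaFileManager.Location location,
                    String name, JavaFileObject.Kind kind, FileObject sibling) {
                ByteCodeSink sink = new ByteCodeSink(name);
                outputs.put(name, sink);
                return sink;
            }
        };
        StringSource source = new StringSource(className, generateSource(className, expr));
        boolean ok = compiler.getTask(null, fileManager, null, null, null,
                Collections.singletonList(source)).call();
        if (!ok) {
            throw new IllegalStateException("compilation of generated source failed");
        }
        ClassLoader loader = new ClassLoader(SourceCodeGenSketch.class.getClassLoader()) {
            @Override protected Class<?> findClass(String name) throws ClassNotFoundException {
                ByteCodeSink sink = outputs.get(name);
                if (sink == null) throw new ClassNotFoundException(name);
                byte[] code = sink.bytes.toByteArray();
                return defineClass(name, code, 0, code.length);
            }
        };
        try {
            // Reflection is used once to instantiate; per-row calls go through the interface.
            return (DoubleUnaryOperator) loader.loadClass(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not load generated class", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(generateSource("GeneratedExpr", "duration + 1.03"));
        DoubleUnaryOperator eval = compileExpr("duration + 1.03");
        System.out.println(eval.applyAsDouble(2.0));
    }
}
```

Because the intermediate artifact is ordinary Java source, it can be printed, read, and stepped through in a debugger, which is the debugging advantage point A describes.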
Implemented UDFs/Generic UDFs
• Supported Features:
o Relational Operators (=, !=, <, <=, etc.)
o Arithmetic Operators (+, -, *, /, %, etc.)
o Logical Operators (AND, OR, NOT, etc.)
o Built-in Functions (UDFs) and existing User-Defined Functions
o Some of the Generic UDFs:
  • GenericUDFBetween
  • GenericUDFPrintf
  • GenericUDFInstr
  • GenericUDFBridge
• Unsupported Features:
o Conditional Functions (if/case/when, etc.)
o Map/Array
o UDAF
o UDTF
o Misc. Functions (java_method/reflect/hash, etc.)
Future Work
• Compile the generated Java source once and distribute it across the cluster
• Reuse the generated .class files for identical queries
• Support more Generic UDFs (case/when/if, etc.)
• Support collection types (Array/Map, etc.)
• Code Gen in Aggregations