PPT Format - David Lau Shyh Tzer

Download Report

Transcript PPT Format - David Lau Shyh Tzer

Improving Andriod
App Development's
Efficiency and Quality
through Machine
Learning Techinque
刘世泽 Lau Shyh Tzer, David
Visiting Student, IIIS, Tsinghua University
Summer 2013
BSc. Computer Science, The Chinese University of Hong
Kong
Background
•
Problem: The Growing of Android API
•
Level 1-18, over 20k API
Methods
• Difficult for developers to
master the usage of Android
API, especially the
inexperienced developer
•
Aim: Adapt Machine Learning and Reverse
Engineering Technique to Analyze the Usage
Pattern
• Possibly developer a helper tool to suggest/fix
the Android API usage during development stage
Workflow
1
Adapt reverse engineering technique to
retrieve the needed data from packaged
Android App (.apk)
2
Perform data mining on the result
raw data to dig out interesting API
usage pattern, relationship.
1 Reverse Engineering on Android App
•
Need to retrieve information
from package (.apk) file not
Perform static analysis
source code
not dynamic analysis
• There’re basically three options:
Retrieve .dex file
1
Dalvik
bytecode
.apk
Disassembly
2
smali
code
.apk
Decompile
3
.apk
.jar (.class)
bytecode
Low level, Lack of analysis
tools (Few)
Good for hacking.
Take time to familiar
with smali code
Becomes a Java
problem
Lots of analysis tool
1 Reverse Engineering on Android App
Decompile
.apk
.jar (.class)
bytecode
•
Use dex2jar open source tool:
https://code.google.com/p/dex2jar/
•
Support directly decompile .apk to .jar file
linux sh dex2jar/d2j-dex2jar.sh someApk.apk
windows dex2jar\d2j-dex2jar.bat someApk.apk
•
We can then redirect it into a Java problem
and focus on the static analysis with Java
bytecode
1 Reverse Engineering on Android App
•
In order to understand the usage pattern of
Android API, we have to know the structure of the
code
• The easy and abstract approach to understand
the structure of the code is to look at its Abstract
Syntax Tree (AST)
Generate AST from Java bytecode
It’s obvious and easy to parse Java source code
into Abstract Syntax Tree, but parse the bytecode
not
• is
Bytecode
is a set of instructions that JVM
interpret to perform stack execution to run the
• program
The stack execution on the bytecode is different
than the common program flow that we observe at
source code
•
Example
method
=
i3
*
i1
return
*
i3
Bytecode
i2
AST
2
Bytecode Outline Plugin for Eclipse
http://asm.ow2.org/eclipse/index.ht
ml
Generate AST from Java bytecode
•
Intuitively, bytecode is interpreted by JVM as the stack
execution, so we can ‘recover’ the code structure and
construct the AST through simulating the JVM stack
operation
Example: Variable Assignment
Sourc
e
code
int i=1;
=
i
ICONST1
byte
code
ISTORE0
1
1
Thread Stack Abstract Syntax
Tree
Example 2: From Previous Bytecode Example
Source code
method
public int method(inti1,int i2){
int i3=i1*i2
return i3*2;
}
=
i3
bytecode
ILOAD1
ILOAD2
IMUL
ISTORE3
ILOAD3
ICONST2
IMUL
IRETURN
*
i1
i2
2
i1
i3
*
Thread Stack
return
i2
*
i3
2
Abstract Syntax
Tree
Generate AST from Java bytecode
•
There are various kinds of AST structure, such
as condition statement, goto statement,
compound statement, but they can all be
‘recovered’ from bytecode by using the
previous technique to simulate the stack
execution
• However, read directly on the
.class file result in binary
format that useless for our
parsing
•
So we need a systematic way
to parse the bytecode
ASM - Bytecode Engineering Library
•
ASM is an all purpose Java bytecode
manipulation and analysis framework
http://asm.ow2.org/
•
It provides two powerful APIs: Core API
and Tree API
•
Core API creates an interface of visiting
bytecode
• Tree API parses bytecode into Objects
Refer to
http://download.forge.objectweb.org/asm/as
m4-guide.pdf for complete usage
ASM - Bytecode Engineering Library
•
Tree API is particularly useful for generating
the AST from bytecode
• It provides two important interfaces:
ClassNode and MethodNode which enables
the developer to assess to the bytecode The bytecode (opcode) of
the respective method is
information directly
stored at InsnList
instructions
ASM - Bytecode Engineering Library
Usage:
ClassReader cr = new ClassReader(“App1.class”);
ClassNode cn = new ClassNode();
cr.accept(cn,0);
ACM parse the whole
class and store the
respective objects
over here
Assess Class Information:
All the information is stored at the properties of the object, so simply retrieve from
them
cn.name; <- the class name
cn.field; <- class field variables
cn.innerClasses <- class inner classes
Assess Method Information:
List<MethodNode> mnList = cn.methods;
for(MethodNode mn:mnList){
mn.name <- method name
mn.signature <- method signature (parameter and return value
type)
mn.instructions; <- the respective opcodes
}
Each class contains
several methods, so it’s
List<MethodNode> type
ASM - Bytecode Engineering Library
Assess Bytecode Intructions:
InsnList insn = mn.instructions;
Iterator itr=insn.iterator();
while(itr.hasNext()){
AbstractInsnNode ain=(AbstractInsnNode)itr.next();
int opcode = ain.getOpcode();
int type = ain.getType();
An abstract class to wrap
the instructions. ASM
separate instruction into
16 different kinds
//simulate the stack execution here to generate AST
}
The type of the
instructions is also
defined in integer
constant. For example:
4 = FIELD_INSN
5 = METHOD_INSN
Detail of the instruction, like
which store to which local
variable, the Field Variable ID,
invoked method’s signature
are stored in this object as
properties.
The bytecode instruction
is defined as an integer
constant in ASM. For
example,
3 = ICONST_0
21 = ILOAD
54 = ISTORE
My Design of AST Generator
•
With the support of ASM Tree API, we can parse
the bytecode and simulate each stack execution to
construct the respective Abstract Syntax Tree
• In
order to suit with the needed data for data
systematically
mining, I designed a customized Abstract Syntax
Tree structure
Specifically
designed class to
suit each code
structure
Abstract
Parent Class
ASTNode
Inherited
ASTArithmeti
cNode
ASTArray
Node
ASTArray
ValueNod
e
ASTConsta
ntNode
ASTFieldN
ode
ASTFuncti
onNode
ASTJump
Node
ASTLabel
Node
ASTLocalVa
riableNode
ASTMetho
dNode
ASTObject
Node
ASTRetur
nNode
ASTSwitc
hNode
ASTCastN
ode
ASTClass
Node
My Design of AST Generator
ASTNode
•
•
•
•
•
•
getASTKind
setName
getName
setSignature
getSignature
setCallBy
getCallBy
setUsedBy
getUsedBy
setUsedAsObject
getUsedAsObject
Example:
CallBy/UsedBy/UsedAsObje
ct are stored as
ArrayList<ASTNode> to
handle multiple connections
“Sample”
sb
setUsedAsObject
setUsedBy
text
StringBuilder sb = new StringBuilder;
String text = “Sample”;
sb.append(text);
String result = sb.toString();
setUsedBy
append
setCallBy
toString
setUsedBy
result
Object and its methods
have doubly connections
to ensure bidirectional
traverse
My Design of AST Generator
addParameter are stored as
ArrayList<ASTNode> to
handle multiple connections
ASTMethodNode
•
addParameter
ASTLocalVaria
bleNode
•
getPara
Trick:
Local variable are stored separately at
JVM Method Area by index. So in order
to track the changing of local variable
assignment (such as one variable can
be used multiple times), create a hash
table to record the pointers reference
to the local variable. So the update of
the variable assignment can be done
easily while parsing the bytecode
setIndex
getIndex
• setVariableType
getVariableType
• setVariableValue
getVariableValue
ASTFieldNode
•
setFieldValue
getFieldValue
Trick:
Same case with Local Variable, create
a hash table to have pointers
reference to the Field Variable. Be
careful that the hash table is clear
while accessing each method, but
Field Value is tracked through the
whole class
Data Flow Analysis on the AST
•
With complete Abstract Syntax Tree for an Android
App, it gives a very useful details to perform
various kinds of static analysis
• My research is mainly focused on its data flow
analysis:
1
Where does the return value
goes?
2
?? = Builder.fromParts(String scheme, String ssp, String fragment);
3
This methods is called by
what kind of object?
Where are these arguments
come from?
Data Flow Analysis on the AST
•
Performed depth-first-search on the AST to trace
the return value path, argument path and call by
path
• Trick: use hash
table/list to record
Collect data from Android API
Invocation
down the path to
avoid infinite loop
within the AST
QQ Android
ASTClassNode
ASTMethodNode
ASTClassNode
ASTMethodNode
2 Mining on the Analysis Data
Convert App into Jar
format
Generate AST from Jar file
Define To-Do Analysis
Format
Preparing Mining Raw Data
•
Coded a web crawler to download free Android
Apps from http://apk.gfan.com/ open Android
Market
• Successfully grabbed 10, 266 valid Android Apps
and generated respective AST, analysis data
through the self-developed ASTGenerator by using
Amazon Web Service High Memory cluster.
•
Result data structure:
1
1
1
0
0
1
1
0
appname-android/net/Uri buildUpon-0
1
1
1
0
0
0
0
1
appname-android/net/Uri parse-0
Mining on the Analysis Data
•
Adapted Weka 3 to perform data mining
task. http://www.cs.waikato.ac.nz/ml/weka/
•
It’s convenient to convert the
analysis raw data into Weka ARFF
input format, especially its support of
sparse matrix format
Mining on the Analysis Data
•
Adapted Hierarchical Agglomerative Clustering in
the result matrix to discover the apis’ relationship
and their usage pattern
Mining on the Analysis Data
•
The analysis result data from 10, 266 apps is huge
(~50GB text file), it’s time-consuming and
unnecessary to mine directly on them
• Designed a MapReduce task and
ran it at Hadoop to categorize the
result data into methods by
methods and compute the
statistics of their invocation
numbers
• Then perform hierarchical clustering on
methods that have enough data to discover
meaningful pattern (like the number of
invocation reached a threshold)
Mining on the Analysis Data
Total 19, 250 Android API methods
discovered to be invoked at least one
time among 10, 266 apps
Methods that have high numbers of
invocation don’t directly mean they’re
having higher possible to find
meaningful pattern. They’re more
likely to be really common usage like
UI elements, Log
Methods that have average number
of invocation, especially those
indicating specific feature like geolocation, network, database might be
the target of data mining
Clustered Result:
•
Weka’s hierarchical clustering result in Newick
Format, we can use software like Dendroscope to
visualize it
http://ab.inf.uni-tuebingen.de/software/dendroscope/
Some clustering result
has obvious clusters
which may implied an
obvious usage pattern
with this method
android/net/Uri buildUpon
Clustered Result:
•
Trick: Some analysis that shows many redundant on
the column key (related APIs), perform a filter to throw
away those no obvious relations (such as only related
one time) before sending the data for clustering
android/location/Location <init>
May perform
several trials and
observe the best
cut-off branch
Clustered Result:
•
The result Newick Format can be parsed back to get
the list of related APIs for each cluster
android/location/Geocoder <init>
There may have
many unique
usage pattern like
this which
probably an error
pattern or special
usage
Mining on the Analysis Data
•
By tracing back the result Newick Tree, we got
the related APIs for each methods
•
Interesting Results:
Package
Analysis
Identify Good and
Error Usage
Pattern
•
Complete APIs
Relationship
•
The clustering result shows the android/location
package classes are strongly inner-connected. Classes
like GpsSatellite, Location, Location Manager are highly
relied on each other.
•
By having App’s name as identifier, we can trace back to
the information of the nature of the app (its download
times, rating, popularity) to determine feature, possible
good/bad pattern of the result clusters.
Perform clustering on all the useful data and retrieve the
APIs relationship from the cluster after identify its usage
pattern. It can eventually conclude a ‘Good’ usage
pattern suggestion list when respective API methods are
called
Mining on the Analysis Data
•
Android API vs
Java Library
App’s Permissions
vs Method
invocation
There are many Android APIs have data flow relations
with Java Library methods, by digging into the methods’
clusters, we can discover some obvious usage pattern
of Android APIs and Java Library methods.
•
Android Permissions information is stored at
AndroidManifest.xml which is an binary-encoded XML
file in .apk package. It can be decoded by using tool
such as APK-tool (https://code.google.com/p/androidapktool/) to read it.
• Android Permissions declare at statically at compile
time and can’t grant dynamically (at run time)
•
So with the AST generated from the bytecode, we can
check/traverse the AST to determine the correctness of
one app’s declared permissions.
•
For example, one app may declared BroadcastReceiver
permissions and then there’re no related
methods/functions are found at the AST
Thank you very much