Poster - Computer Science at RPI - Rensselaer Polytechnic Institute

Download Report

Transcript Poster - Computer Science at RPI - Rensselaer Polytechnic Institute

PCFG Based
Synthetic Mobility Trace Generation
S. C. Geyik, E. Bulut, and B. K. Szymanski
Department of Computer Science, Rensselaer Polytechnic Institute
Motivation
Methodology
Need for long-term mobility data for initial testing of protocols,
before actual deployment
 It is difficult to collect realistic long-term traces
 Context free grammars with probabilities assigned to productions
START  a START (0.4) | a (0.6)
Probability of “ a a”  (P(STARTa START) x P(STARTa))
= 0.4 x 0.6 = 0.24
 Modeling behavior sequences by constructing PCFGs from training
data
Contributions
A Probabilistic Context Free Grammar (PCFG) based method to
generate synthetic mobility traces
 A PCFG is learned from a real mobility trace where each sentence that
can be produced by this PCFG represents a movement sequence in this
real trace
 We provide extensions to the PCFG model to capture spatial and
temporal mobility characteristics
 Once a PCFG is constructed, many sentences can be produced,
dictating the movements of an entity, hence providing the synthetic
mobility data
 The algorithm creates a sentence from the constructed
grammar which represents a movement sequence for a
single mobile entity. Once the sequence is over, another
sentence is generated. The sequences should be filtered
according to the current location of the entity.
Probabilistic Context Free Grammars (PCFG)
Ex:
 A method is needed to increase the amount of the collected mobility
traces, keeping the characteristics
Trace Generation Algorithm
Extensions to the PCFG Model
Only positive data is enough for Probabilistic CFGs !
(ii) Application of Operators: Chunk and Merge operators applied
at each step to generalize and miniaturize the grammar.
 Chunking: Generate a new non-terminal for a string and replace all
occurrences of this string with the non-terminal, updating its frequency
 Merging: Combine two non-terminals and replace all the
occurrences of each non-terminal with this non-terminal
(Generalization)
 Goodness of grammar: Bayesian posterior probability P(G|D)
of the grammar G given the training data D is defined as:
Automated Construction of PCFGs
P(G) x P(D|G)
 The algorithm is introduced in [1] .
P(G|D) =
 Two stages in Construction:
(i) Data Incorporation: All sentences are introduced to the initial
grammar as rules of the START non-terminal. One non-terminal for each
terminal with probability 1.0 .
P(D)
 P(G) is related to description length and P(D|G) is related to
sentence probabilities of training data. Time complexity of
O(D2log(D)) in (1) .
Spatial mobility properties are modeled by
representing each location with a terminal
symbol in the PCFG
Terminal mobility properties are modeled by the
time tokens (t) representing a preset amount of
time between the location symbols (please note
that an alternative is to have the time token
represent a distribution and always use a single
time token, but this is left for future work).
 Example: a node starts at location lA, then goes to
lB after 40 seconds and to lC after 25 seconds :
lA, 40 lB 25 lC
If t represents 25 seconds, this movement sequence
can be integrated into the grammar training as:
lA t t lB t lC
!!! Notice the approximation of 40 seconds with two
time tokens !!!
 Trade-off between the time interval of the time
token (resolution) and the complexity of the grammar
Evaluations
 We consider two metrics. One is how accurate are the synthetic traces in simulating the next movement
(next location or next mobile entity encountered) given the previous k movements (e.g. Cons 2 means the
next movement given just the previous movement (2-1=1 history)). The other one is how accurate are the
synthetic traces in simulating the time it takes for the next movement given the previous k movements (e.g.
Intern 2 is analogous to Cons 2).
 We used Euclidean distances and used weighting to calculate the scores
We compared the closeness of the PCFG and 2-Level Markov Model generated traces
to the actual trace
 Previous work suggests goodness of Markov Models on prediction and modeling, furthermore shows
that more than two levels does not perform better.
 Time complexity of building a 2-Level Markov Model is O(D log(D)) where D is the size of the trace. This
can be calculated by taking the worst-case number of entries in the state table to be D and updating the
statistics by each transition in the trace to be log(D) to find the necessary entry.
DieselNet Trace [2]
 Bus to bus meeting data collected in Amherst, MA. Evaluated based on how much time is between
consecutive meetings and how accurate is the set of buses met by a bus in a certain route.
SF Cab Mobility Trace [3]
 Taxi movements in San Francisco
 Evaluated based on how much the
synthetic traces coincide with actual traces
(i.e. how much time passes between
location change or how accurate the next
location is)
Conclusions and Future Work
 A PCFG-based method is provided to generate long-term
synthetic mobility traces, crucial for protocol testing.
 The method performs better than Markov models in
closely modeling the mobility characteristics of the real
world traces.
 Future work includes future movement prediction for
mobile entities and application of the PCFG-based method
to other domains.
References
[1] Geyik, S. C., Szymanski, B., Event Recognition in Sensor Networks by
Means of Grammatical Inference, IEEE INFOCOM 2009, Rio de Janeiro,
Brazil, March 2009, Page(s): 900-908.
[2] Zhang, X., Kurose, J., Levine, B., N., Towsley, D., Zhang, H., Study of a
bus-based disruption tolerant network: Mobility modeling and impact on
routing, In Proc. ACM Annual Intl. Conf. on Mobile Computing and Networking
(Mobicom), Page(s): 195-206, 2007.
[3] Piorkowski, M., Sarafijanovoc-Djukic, N., Grossglauser, M., A
Parsimonious Model of Mobile Partitioned Networks with Clustering, The First
International Conference on COMmunication Systems and NETworkS
(COMSNETS), 2009, Bangalore, India.
Acknowledgment
This research was sponsored by US Army Research laboratory and the UK Ministry of Defence
and was accomplished under Agreement Number W911NF-06-3-0001 and under Cooperative
Agreement Number W911NF-09-2-0053. The views and conclusions contained in this document
are those of the authors, and should not be interpreted as representing the official policies, either
expressed or implied, of the US Army Research Laboratory, the U.S. Government, the UK
Ministry of Defense, or the UK Government. The US and UK Governments are authorized to
reproduce and distribute reprints for Government purposes notwithstanding any copyright notation
hereon.