Feature - University of Notre Dame

Download Report

Transcript Feature - University of Notre Dame

Computational Discovery in
Evolving Complex Networks
Yongqin Gao
Advisor: Greg Madey
Yongqin Gao
December 2006
Dissertation Defense
Outline
•
•
•
•
•
•
•
•
•
Background
Methodology for Computational Discovery
Problem Domain – OSS Research
Process I: Data Mining
Process II: Network Analysis
Process III: Computer Simulation
Process IV: Research Collaboratory
Contributions
Conclusion and Future Work
Yongqin Gao
December 2006
Dissertation Defense
Background
• Network research gains more attentions
–
–
–
–
–
Internet
Communication network
Social network
Software developer network
Biological network
• Understanding the evolving complex network
– Goal I: Search
– Goal II: Prediction
• Computational scientific discovery
Yongqin Gao
December 2006
Dissertation Defense
Computational Discovery
Our Methodology
Network
Analysis
Discovery
Initialization
Assessment
Computer
Simulation
Data Mining
Researcher
Feedback
Revision
Research
Collaboratory
Contribution
Reference
Community Members
Yongqin Gao
December 2006
Dissertation Defense
Problem Domain
• Open Source Software Movement
– What is OSS
• Free to use, modify and distribute and source code available
and modifiable
• Potential advantages over commercial software: Potentially
high quality; Fast development; Low cost
– Why study OSS (Goal)
• Software engineering — new development and coordination
methods
• Open content — model for other forms of open, shared
collaboration
• Complexity — successful example of selforganization/emergence
Yongqin Gao
December 2006
Dissertation Defense
Glory of OSS
Number of Active Apache Hosts
Yongqin Gao
December 2006
Dissertation Defense
Problem Domain
• SourceForge.net community
– The biggest OSS development communities
– 134,751 registered projects
– 1,439,773 registered users
Yongqin Gao
December 2006
Dissertation Defense
Problem Domain
• Our Data Set
–
–
–
–
25 monthly dumps since January 2003.
Totally 460G and growing at 25G/month.
Every dump has about 100 tables.
Largest table has up to 30 million records.
• Experiment Environment
– Dual Xeon 3.06GHz, 4G memory, 2T storage
– Linux 2.4.21-40.ELsmp with PostgreSQL 8.1
Yongqin Gao
December 2006
Dissertation Defense
Related Research
• OSS research
– W. Scacchi, “Free/open source software development
practices in the computer game community”, IEEE
Software, 2004.
– C. Kevin, A. Hala and H. James, “Defining open source
software project success”, 24th International
Conference on Information Systems, Seattle, 2003.
• Complex networks
– L.A. Adamic and B.A. Huberman, “Scaling behavior of
the world wide web”, Science, 2000.
– M.E.J. Newman, “Clustering and preferential
attachment in growing networks”, Physics Review, 2001.
Yongqin Gao
December 2006
Dissertation Defense
Process I: Data Mining
• Related Research:
– S. Chawla, B. Arunasalam and J. Davis,
“Mining open source software (OSS) data using
association rules network”, PAKDD, 2003.
– D. Kempe, J. Kleinberg and E. Tardos,
“Maximizing the spread of influence through a
social network”, SIGKDD, 2003.
– C. Jensen and W. Scacchi, “Data mining for
software process discovery in open source
software development communities”, Workshop
on Mining Software Repositories, 2004.
Yongqin Gao
December 2006
Dissertation Defense
Process I: Data Mining
Algorithm Application
Feature Selection
Relevant data
Data Purging
Database
Data Preparation
Raw data
Yongqin Gao
December 2006
Dissertation Defense
Process I: Data Mining
• Data Preparation
– Data discovery
• Locating the information
– Data characterization
• Activity features: user categorization
• Network features
– Data assembly
• Data Purging
– Treatment about data inconsistency
• Unifying the date presentation by loading into single depository
– Treatment about data pollution
• Removing “inactive” projects
• Feature Selection
– This method is used to remove dependent or insignificant features.
– NMF (Non-negative Matrix Factorization)
Yongqin Gao
December 2006
Dissertation Defense
Process I: Data Mining
• Result I
– Significant features
• By feature selection, we can identify the significant
feature set describing the projects.
• Activity features: “file_releases”, “followup_msg”,
“support_assigned”, “feature_assigned” and task
related features
• Network features: “degrees”, “betweenness” and
“closeness”
Yongqin Gao
December 2006
Dissertation Defense
Process I: Data Mining
• Distribution-based clustering (Christley, 2005)
– Clustering according to the distribution of
features instead of values of individual feature
– We assume every entity (project) has an
underlying distribution of the feature set
(activity features)
– Using statistical hypothesis test
• Non-parametric test
• Fisher’s contingency-table test is used
– Joachim Krauth, “Distribution-free statistics: an
application-oriented approach”, Elsevier Science Publisher,
1988.
Yongqin Gao
December 2006
Dissertation Defense
Process I: Data Mining
• Procedure:
While (still unclustered entities)
Put all unclustered entities into one cluster
While (some entities not yet pairwise compared)
A = Pick entity from cluster
For each other entity, B, in cluster not yet
compared to A
Run statistical test on A and B
If significant result
Remove B from cluster
• Worst case complexity: O(n2)
Yongqin Gao
December 2006
Dissertation Defense
Process I: Data Mining
• Result II
• Unsupervised learning
– Distribution-based method
used to cluster the project
history using the activity
distribution
– We named the clusters using
ID and the results are shown
in the table
– High support and confidence
in evaluation
Yongqin Gao
December 2006
Cluster ID
Size
1
89709
2
9191
3
2060
Total
100960
Dissertation Defense
Process I: Data Mining
• Two sample
distributions from
different categories
• Unbalanced feature
distribution → could
be “unpopular”
• Balanced feature
distribution → could
be “popular”
Cluster 1
3488
3500
3000
2500
2000
1641
1510
1500
1000
736
534
500
312
20
0
1
3
4
82
0
22
2
229
5
6
7
8
9
10
11
121
12
0
28
13
14
4
15
Activity Category
Cluster 3
10000
9169
9000
8435
8000
7134
7000
6000
5000
3781
4000
3000
2179
2537
2411
2000
1651
1000
431
134
0
1
667
2
3
4
601
0
5
0
6
7
8
9
10
11
12
13
14
399
15
Activity Category
Yongqin Gao
December 2006
Dissertation Defense
Process I: Data Mining
• Discoveries in Process I
– Significant feature set selection
• Network features are important
• Further inspection in next process
– Distribution based predictor
• Based on the activity feature distribution
• Prediction of the “popularity” based on the balance of the
activity feature distribution
• Benefit of these discoveries
– For collaboration based communities, these discoveries
can help in resource allocation optimization.
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Why network analysis
– Assess the importance of the network measures
to the whole network and to individual entity in
the network
– Inspect the developing patterns of these
network measures
• Network analysis
– Structure analysis
– Centrality analysis
– Path analysis
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Related research:
– P. Erdös and A. Rényi, “On random graphs”,
Publicationes Mathematicae, 1959.
– D.J. Watts and S. H. Strogatz, “Collective
dynamics of small-world networks”, Nature,
1998.
– R. Albert and A.L. Barabάsi, “Emergence of
scaling in random networks”, Science, 1999.
– Y. Gao, “Topology and evolution of the open
source software community”, Master Thesis,
2003.
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Structure Analysis
– Understanding the influence of the network structure
to individual entities in the network
– Inspected measures
• Approximate diameter
D
log( N / z1 )
1
log( z2 / z1 )
• Approximate clustering coefficient
C
1
( 2  1 )( 2  1 ) 2
1
1 1 (2 1  3 2  3 )
• Component distribution
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Conversion among C-NET, P-NET and DNET
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Result I
– Approximate Diameters
• D-NET: between (5,7) while network size ranged
from 151,803 to 195,744.
• P-NET: between (6,8) while network size ranged
from 123,192 to 161,798.
– Approximate Clustering Coefficient
• D-NET: between (0.85, 0.95)
• P-NET: between (0.65, 0.75)
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Result I
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Centrality Analysis
– Understanding the importance of individual entities
to the global network structure
– Inspected measures:
• Average Degrees
• Degree Distributions
• Betweenness
B (v ) 
 st (v)

s  v  tV  st
• Closeness
C (v ) 
Yongqin Gao
1
tV dG (v, t )
December 2006
Dissertation Defense
Process II: Network Analysis
• Result II
– Average Degrees
•
•
•
•
Yongqin Gao
Developer degree in C-NET: 1.4525
Project degree in C-NET: 1.7572
Developer degree in D-NET: 12.3100
Project degree in P-NET: 3.8059
December 2006
Dissertation Defense
Process II: Network Analysis
• Result II (Degree distributions in C-NET)
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Result II (Degree distributions in D-NET
and P-NET)
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Result II
– Average Betweenness
• P-NET: 0.2669e-003
– Average Closeness
• P-NET: 0.4143e-005
– Normally these two measures yield very small
value in large networks (N>10,000).
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Path Analysis
– Understanding the developing patterns of the
network structure and individual entities in the
network
– Inspected measures:
•
•
•
•
•
•
Yongqin Gao
Active Developer Percentage
Average Degrees
Diameters
Clustering coefficients
Betweenness
Closeness
December 2006
Dissertation Defense
Process II: Network Analysis
• Result III (Active entities)
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Result III (Average degrees in C-NET)
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Result III (Average degrees in D-NET and
P-NET)
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Result III (Diameters in D-NET and P-NET)
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Result III (Clustering coefficients for DNET and P-NET)
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Result III (Average betweenness and
closeness for P-NET)
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
Measures
D-NET
P-NET
C-NET
Average Degree
Yes
Yes
Yes
Diameter
Yes
Yes
N/A
Clustering Coefficient
Yes
Yes
N/A
Degree Distribution
Yes
Yes
Yes
Component Distribution
N/A
Yes
N/A
Major Component
N/A
Yes
N/A
Average Betweenness
Yes
Yes
N/A
Average Closeness
Yes
Yes
N/A
Active Entity Size Development
Yes
Yes
Yes
Average Degree Development
Yes
Yes
Yes
Diameter Development
Yes
Yes
N/A
Clustering Coefficient Development
Yes
Yes
N/A
Average Betweenness Development
Yes
Yes
N/A
Average Closeness Development
Yes
Yes
N/A
Yongqin Gao
December 2006
Dissertation Defense
Process II: Network Analysis
• Discoveries in Process II:
– Measures of structure analysis and centrality analysis
all indicate very high connectivity of the network.
– Measures of path analysis reveal the developing
patterns of these measures (life cycle behavior).
• Benefits of these discoveries
– High connectivity in a network is an important feature
for information propagation, failure proof.
Understanding this discovery can help us improve our
practices in collaboration networks and communication
networks.
– Understanding the developing patterns of these network
measures provides us a method to monitor network
development and to improve the network if necessary.
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Related Research:
– P.J. Kiviat, “Simulation, technology, and the decision
process”, ACM Transactions on Modeling and
Computer Simulation,1991.
– R. Albert and A.L. Barabási, “Emergence of scaling in
random networks”, Science, 1999.
– J. Epstein R. Axtell, R. Axelrod and M. Cohen,
“Aligning simulation models: A case study and results”,
Computational and Mathematical Organization Theory,
1996.
– Y. Gao, “Topology and evolution of the open source
software community”, Master Thesis, 2003.
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Iterative simulation
method
ter
iza
tio
n
cri
pt
ion
De
s
n
tio
ra
c
era
Ch
a
n
Ge
December 2006
t
Yongqin Gao
en
tm
– More measures
– More methods
Empirical
Data
Collection
s
ju
• Verification and
validation
Ad
– Empirical dataset
– Model
– Simulation
Model
Verification
Simulation
Validation
Dissertation Defense
Process III: Computer Simulation
• Previous iterated models (master thesis):
–
–
–
–
Adapted ER Model
BA Model
BA Model with fitness
BA Model with dynamic fitness
• Iterated models in this study
– Improved Model Four (Model I)
– Constant user energy (Model II)
– Dynamic user energy (Model III)
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Model I
– Realistic stochastic procedures.
• New developer every time step based on Poisson
distribution
• Initial fitness based on log-normal distribution
– Updated procedure for the weighted project
pool (for preferential selection of projects).
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Average degrees
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Diameter and CC
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Betweenness and Closeness
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Degree Distributions
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Deficit in the measures
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Model II
– New addition: user energy.
– User energy
• the “fitness” parameter for the user
• Every time a new user is created, a energy level is
randomly generated for the user
• Energy level will be used to decide whether a user
will take a action or not during every time step.
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Degree distributions for Model II
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Deficit in the measures
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Model III
– New addition: dynamic user energy.
– Dynamic user energy
• Decaying with respect to time
• Self-adjustable according to the roles the user is
taking in various projects.
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Degree distributions (Model III)
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
Models
Model I
(more realistic
distributions)
Model II
(constant user energy)
Model III
(dynamic user energy)
Yongqin Gao
Measures
Patterns in Data
Simulated Patterns
Developer Distribution
Power Law (large tail)
Power Law (small tail)
Project Distribution
Power Law (small tail)
Power Law (large tail)
Average Degrees
Increasing
Increasing
Clustering Coefficient
Decreasing
Decreasing
Diameter
Decreasing
Decreasing
Average Betweenness
Decreasing
Decreasing
Average Closeness
Decreasing
Decreasing
Developer Distribution
Power Law (large tail)
Power Law (large tail)
Project Distribution
Power Law (small tail)
Power Law (reasonable tail)
Average Degrees
Increasing
Increasing
Clustering Coefficient
Decreasing
Decreasing
Diameter
Decreasing
Decreasing
Average Betweenness
Decreasing
Decreasing
Average Closeness
Decreasing
Decreasing
Developer Distribution
Power Law (large tail)
Power Law (large tail)
Project Distribution
Power Law (small tail)
Power Law (small tail)
Average Degrees
Increasing
Increasing
Clustering Coefficient
Decreasing
Decreasing
Diameter
Decreasing
Decreasing
Average Betweenness
Decreasing
Decreasing
Average Closeness
Decreasing
Decreasing
December 2006
Dissertation Defense
Process III: Computer Simulation
• Discoveries in Process III
– Expanding the network models for modeling
evolving complex networks (more parameters)
– Providing a validated model to simulate the
community network at SourceForge.net
• Benefits of these discoveries
– Expanded network models can benefit other
researchers in complex networks.
– Validated model for SourceForge.net can be
used to study other OSS communities or similar
collaboration networks.
Yongqin Gao
December 2006
Dissertation Defense
Process IV: Research
Collaboratory
• Related Research:
– G. Chin Jr. and C. Lansing, “The biological
sciences collaboratory”, Mathematics and
Engineering Techniques in Medicine and
Biological Sciences, 2004.
– L. Koukianakis, “A system for hybrid learning
and hybrid psychology”, Cybernetics and
Information Technologies, Systems and
Applications, 2003.
– NCBI, FlyBase, Ensembl, VectorBase
Yongqin Gao
December 2006
Dissertation Defense
Process IV: Research
Collaboratory
• What is Collaboratory?
– An elaborate collection of data, information,
analytical toolkits and communication
technologies
– A new networked organizational form that also
includes social processes, collaboration
techniques and agreements on norms, principles,
value, and rules
Yongqin Gao
December 2006
Dissertation Defense
Process IV: Research
Collaboratory
Researchers
Researchers
Presentation Tier
This top tier is the user interface.
The main function of the interface is
to translate tasks and results to
something the user can understand.
Wiki Interface
Logic Tier
This tier coordinates the web
interface and the data storage,
moves and processes data between
the two surrounding tiers.
Query
RPC
Browse
Data Tier
Here information is stored and
retrieved from a database. The
information will then be passed back
to user through the logic tier.
Yongqin Gao
December 2006
Data Repository
Dissertation Defense
Process IV: Research
Collaboratory
• Data tier - schema design
Timeline
Yongqin Gao
Every schema is a
database dump
from the
SourceForge.net
SF0205
SF0305
SF0405
SF0103
SF0605
SF0505
SF0805
SF0705
December 2006
Dissertation Defense
Process IV: Research
Collaboratory
• Data tier - connection pool
Logic
Tier
Persistent
Link
Connection
Request
Connection
Assigner
Persistent
Link
Persistent
Link
Connection Pool
Yongqin Gao
December 2006
Timeline
Dissertation Defense
Process IV: Research
Collaboratory
• Presentation Tier
– Various access
methods
– Documentation and
references
– Community support
– Wiki interface
Yongqin Gao
December 2006
Dissertation Defense
Process IV: Research
Collaboratory
• Logic Tier
– Interactive web query system
• Authorized user can submit query to the back end
repository through the web query
• Results are provided by files with various formats
– Dynamic web schema browser
• Authorized user can access the dynamic schema of
the repository through the schema browser
Yongqin Gao
December 2006
Dissertation Defense
Process IV: Research
Collaboratory
• Utilization reports
– Monthly statistics (June 2006)
• Total queries submitted: 16,947
• Total data files retrieved: 13,343
• Total bytes of query data downloaded: 26,684,556,278
• Programmable access method
– Programmable access method should be provided
for complicated access
– Web services planned
Yongqin Gao
December 2006
Dissertation Defense
Process IV: Research
Collaboratory
• Results in Process IV
– Designing, implementing and maintaining a
research collaboratory for OSS related research.
• Benefits of these results
– OSS researchers can access one of the most
complete data sets for a OSS community
development.
– By providing the community service to OSS
researchers, the collaboratory can help in
sparkling, improving and promoting research
ideas about OSS.
Yongqin Gao
December 2006
Dissertation Defense
Contributions
• Designed and demonstrated a computational discovery methodology to
study evolving complex networks using research on OSS as a
representative problem domain
• Understanding the OSS movement by applying the methods.
– Process I: data mining
• Identifying significant features to describe a project
• Using distribution based clustering to generate a distribution based predictor to
predict the “popularity” of a project
– Process II: network analysis
• Introducing more complete analysis to inspect more complete data set from
SourceForge.net.
• Discovering high connectivity and possible life cycle behaviors in both the
network structure and individuals in the network
– Process III: computer simulation
• Introducing more parameters in modeling evolving complex networks
• Generating a “fit” model to replicate the evolution of the SourceForge.net
community.
– Process IV: research collaboratory
• Designing, implementing and maintaining a research collaboratory to host the
SourceForge.net data set and provide community support for OSS related
researches.
Yongqin Gao
December 2006
Dissertation Defense
Publications to-date
•
•
•
•
•
•
•
•
•
Y. Gao; G. Madey and V. Freeh. “Modeling and simulation of the open
source software community”, ADSC, San Diego, 2005.
Y. Gao and G. Madey. “Project development analysis of the oss
community using st mining”, NAACSOS, Notre Dame, 2005.
S. Christley; Y. Gao; J: Xu and G. Madey. “Public goods theory of the
open source software development community”, Agent, Chicago, 2004.
Y. Gao, Y. Huang and G. Madey, “Data Mining Project History in Open
Source Software Communities”, NAACSOS, Pittsburgh, 2004.
J. Xu, Y. Gao, J. Goett and G. Madey, “A Multi-model Docking
Experiment of Dynamic Social Network Simulations”, Agent, Chicago,
2003.
Y. Gao, V. Freeh, and G. Madey, “Analysis and Modeling of the Open
Source Software Community”, NAACSOS, Pittsburgh, 2003.
Y. Gao, V. Freeh, and G. Madey, “Conceptual Framework for Agent-based
Modeling and Simulation”, NAACSOS, Pittsburgh, 2003.
G. Madey; V. Freeh; R: Tynan and Y. Gao. “Agent-based modeling and
simulation of collaborative social networks”, AMCIS, Tampa, 2003.
Y. Gao; V. Freeh and G. Madey. “Topology and evolution of the open
source software community”, SwarmFest, Notre Dame, 2003.
Yongqin Gao
December 2006
Dissertation Defense
Publication Plan
• Chapter III (data mining)
– Journal of Machine Learning Research
– Journal of Systems and Software
• Chapter IV (network analysis)
– Journal of Network and Systems Management
– Journal of Social Structure
• Chapter V (computer simulation)
– Spring Simulation Conference 2007 (under review)
– IEEE Computing in Science and Engineering
• Chapter VI (research collaboratory)
– CITSA 2007
– Journal of Computer Science and Applications
Yongqin Gao
December 2006
Dissertation Defense
Conclusion and Future Work
• Cyclic computational discovery method for
studying evolving complex networks
• Study of Open Source Software by applying this
method
• Future works:
– Maintaining and expanding the collaboratory
– Verifying the discoveries in the SourceForge.net against
further accumulated database dump from
SourceForge.net
– Applying our simulation model on other software
development communities
– Extending our methodology to other evolving complex
networks like Internet, communication network and
various social networks
Yongqin Gao
December 2006
Dissertation Defense
Acknowledgement
• My advisor: Dr. Madey
• My committee members:
– Dr. Flynn
– Dr. Striegel
– Dr. Wood
• My Colleagues:
– Scott Christley, Yingping Huang, Tim Schoenharl, Matt Van
Antwerp, Ryan Kennedy, Alec Pawling and Jin Xu
• SourceForge.net managers:
– Jeff Bates, VP of OSTG Inc.
– Jay Seirmarco, GM of SourceForge.net.
• US NSF CISE/IIS-Digital Society & Technology, under
Grant No. 0222829.
Yongqin Gao
December 2006
Dissertation Defense
Questions
Yongqin Gao
December 2006
Dissertation Defense
Case Study II
Project 6882
OSS Developer Network (Part)
Developers are nodes / Projects are links
24 Developers
5 Projects
2 hub Developers
1 Cluster
Project 7028
dev[52]
Project 7597
dev[64]
dev[72]
dev[67]
dev[65]
dev[70]
dev[57]
6882 dev[47]
dev[52]
6882 dev[47]
dev[55]
dev[47]
dev[99]
dev[55]
dev[51]
6882 dev[47]6882 dev[58]
dev[79]
dev[47]
7597 dev[46]
7597 dev[46]dev[64]
7597 dev[46]
dev[72]
dev[67]
7597 dev[46]
dev[55]
7597 dev[46]
7028 dev[46] dev[70]
7597 dev[46]
7028 dev[46]
dev[57]
dev[45]
dev[99]
7597
dev[46]
7028 dev[46]
dev[61]
dev[51]
7597 dev[46]
dev[58]
dev[45]
dev[61]
dev[58]
dev[46]
15850 dev[46]
dev[58]
dev[79]
dev[58]
9859 dev[46]
dev[54]
dev[54]
9859 dev[46]
9859 dev[46]
dev[49]
dev[53] 9859 dev[46]
15850 dev[46]
dev[59]
dev[56]
15850 dev[46]
dev[83] 15850 dev[46]
dev[48]
dev[49]
dev[53]
dev[56]
dev[83]
dev[59]
Project 9859
Project 15850
Yongqin Gao
dev[48]
December 2006
Dissertation Defense
Process I: Data Mining
• Characteristics of data set
– Massive
– Incomplete, noisy, redundant
– Complex structures, unstructured
• Classic analysis tools are often inadequate and
inefficient for analyzing these data, especially in
exploratory research
• What is DM (Data mining)
– Nontrivial extraction of implicit, previously unknown
and potentially useful information from data.
Yongqin Gao
December 2006
Dissertation Defense
Process I: Data Mining
• Feature Selection
– Given a non-negative n x m matrix V, find
factors W (n, r) and H (r, m) , such that
V ≈ W *H
– This is called the non-negative matrix
factorization (NMF) of the matrix V
– NMF can be used on multivariate data to reduce
the dimension of the data set
– By using NMF, we can reduce dimension from
m features to r features
Yongqin Gao
December 2006
Dissertation Defense
Why NMF?
• Feature extraction methods
– linear methods are simpler and more completely
understood.
– nonlinear methods are more general and more difficult
to analyze.
• Linear methods:
– ICA: Independent Component Analysis
– Matrix decomposition: PCA, SVD, NMF
• In practice, NMF is most popular and simple.
• Dimensionality reduction is effective if the loss of
information due to mapping to a lowerdimensional space is less than the gain due
simplifying the problem.
Yongqin Gao
December 2006
Dissertation Defense
Process I: Data Mining
• Feature-based Clustering
– Grouping data into K number of clusters based on
features.
– The distance metrics used is Euclidean distance
like
n
ED 
 (x  y )
i 0
i
2
i
– Hierarchical K-Means is used.
• The result is a binary tree.
• The root is the whole data set and the leaf clusters are
the fine-grained clusters, which are the resulting K
clusters.
Yongqin Gao
December 2006
Dissertation Defense
Process I: Data Mining
• Case Study Result II
• Unsupervised learning
– K-Means method used to
cluster the project history
using the features we
selected
– We named the clusters using
ID and the results are shown
in the table
– The result is not acceptable
by evaluation
Yongqin Gao
December 2006
Cluster ID
Size
1
6201
2
98
3
64824
4
2
5
4
6
29724
7
4
8
10
9
9
10
84
Total
100960
Dissertation Defense
Process I: Data Mining
Other
tables
artifact
table
Forum
table
People_job
table
User_group
table
Doc_data
table
Project_task
table
UNION
User_project_act
table
Admin_flags?
No
Grantcvs?
No
Assigned?
No
Yes
Activities?
Yes
No
Yes
Yes
Administrator
Yongqin Gao
Core developer
Co-developer
December 2006
Active user
lurker
Dissertation Defense
Process I: Data Mining
Yongqin Gao
December 2006
Dissertation Defense
Clustering Result Evaluation
• Evaluation test set generation
– Popular/unpopular projects
– Stratified sampling to make 500 projects
• Feature sets used
– Popular feature set
– Activity Feature set (Page 34, Table 3.2)
– Network Feature set (Page35, Table 3.3)
• Generating rules for the test sets
• Calculating the support and confidence value
Yongqin Gao
December 2006
Dissertation Defense
Popularity Definition
Yongqin Gao
Feature
Description
Developers
Number of core developers
Downloads
Number of downloads
Site_views
Number of views of the website
Subdomain_views
Number of views of the subdomain
Page_views
Number of views of the pages
December 2006
Dissertation Defense
Why K-MEAN?
• The algorithm has remained extremely popular because it
converges extremely quickly in practice. In fact, many
have observed that the number of iterations is typically
much less than the number of points.
• K-Means is most successful algorithm in large data set
(size>1000, dimension > 2) than GA and Evolution
• CLIQUE is sensitive to noise
• CURE is not scalable O(n2logn)
• CLARANS & BIRCH are not good for high dimension
data
• D. Arthur, S. Vassilvitskii (2006): "How Slow is the kmeans Method?," Proceedings of the 2006 Symposium on
Computational Geometry (SoCG).
Yongqin Gao
December 2006
Dissertation Defense
K-MEAN
• It maximizes inter-cluster (or minimizes
intra-cluster) variance, but does not ensure
that the result has a global minimum of
variance. Multiple run is needed.
• Elbow criterion
Yongqin Gao
December 2006
Dissertation Defense
Distribution Categories
Yongqin Gao
Category
Feature
1
File release
2
New message
3
Followup message
4
Artifact request
5
Todo request
6
Support request
7
Feature request
8
Patch request
9
Bug reports
10
Bug assigned
11
Patch assigned
12
Feature assigned
13
Support assigned
14
Todo assigned
15
Artifact assigned
December 2006
Dissertation Defense
Process III: Computer Simulation
Start
Simulation model
procedure
New Users
User List
User_Project
Links
Project List
Project Pool
Update
User Action
Idle
Drop
Create
End of Simu?
Join
Weighted
Project Pool
No
Yes
Stop
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Poisson Process:
– It expresses the probability of a number of events
occurring in a fixed period of time if these events
occur with a known average rate, and are independent
of the time since the last event.
– PDF:
e   k
F (k ;  ) 
Yongqin Gao
k!
December 2006
Dissertation Defense
Process III: Computer Simulation
• Log-normal distribution:
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
• Kolmogorov-Smirnov test
– Used to determine whether two underlying onedimensional distributions differ.
– Two one-sided K-S test statistics are given by

n
D  max( Fn ( x)  F ( x))
Dn  max( F ( x)  Fn ( x))
Yongqin Gao
December 2006
Dissertation Defense
Process III: Computer Simulation
Yongqin Gao
December 2006
Dissertation Defense
Similar Publications
• Chapter III (data mining)
– JMLR: G. Hamerly, E. Perelman..Using machine learning to guide
simulation (Feb. 2006)
– JSS: S. Kim, J. Yoon..Shape-based retrieval in time-series database (Feb.
2006)
• Chapter IV (network analysis)
– JNSM: Special Issue Self-Managing Systems and Networks
– JoSS: The Journal of Social Structure (JoSS) is an electronic journal of the
International Network for Social Network Analysis (INSNA)
• Chapter V (computer simulation)
– SSC 2007: simulation co
– IEEE/CSE: E. Luijten..Fluid simulation with monte carlo algorithm (2006
Vol. 8, Issue 2)
• Chapter VI (research collaboratory)
– CITSA 2007: L. Koukianakis..A system for hybrid learning and hybrid
psychology (2005)
– JCSA: S. Chen, K. Wen..An Integrated System for Cancer-Related Genes
Mining from Biomedical Literatures (2006)
Yongqin Gao
December 2006
Dissertation Defense