Developer Identification Methods
for Integrated Data from Various
Sources
Gregorio Robles
Jesus M. Gonzalez-Barahona
Presented by
Brian Chan
CISC 864
Table of Contents
Background Information
Problems Addressed
Motivation
Data Gathered
Conclusion
Personal Thoughts
Questions and Comments
Background Information
Data mining for a project usually draws on a
single source of data
Results can be applied to Libre Software
Sources are examined separately:
Mailing Lists
Bug Repositories
Background Information
Libre Software shows Pareto law for
commits:
For each major artifact, 20% of developers
are shown to contribute 80% of the activity
in it.
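The Pareto claim can be checked on any commit log with a few lines of code. The following is a minimal sketch with a hypothetical log; the function name and the sample data are assumptions, not figures from the paper.

```python
from collections import Counter

def top_share(commit_authors, top_fraction=0.2):
    """Fraction of all commits made by the most active `top_fraction` of authors."""
    counts = sorted(Counter(commit_authors).values(), reverse=True)
    if not counts:
        return 0.0
    # Always count at least one author in the top group.
    k = max(1, round(len(counts) * top_fraction))
    return sum(counts[:k]) / sum(counts)

# Hypothetical log: one entry per commit, labelled by committer.
log = ["alice"] * 80 + ["bob"] * 8 + ["carol"] * 6 + ["dave"] * 4 + ["erin"] * 2
print(top_share(log))
```

The result only means something if each author is one real person, which is the identity problem the paper addresses.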
Problems Addressed
Are the people who contribute most to
one artifact the same people who contribute
most to another artifact?
People use different identities in each
artifact
Current mining techniques focus on one
artifact, so they cannot tell who is who
across artifacts
Motivation
To gain insight into the social network
and structure of libre software projects
To find all the identities that correspond
to one person
Focus more on data analysis rather than
the extraction process
Data Gathered
Actor has access to
artifacts
Alternate rules for
each artifact
Figure 1.0
Data Gathered
Actor can post on more
than one mailing list:
[email protected]
[email protected]
Source Files can appear with many
identities:
Brian Chan
Brian
bchan
Interaction with the versioning repository occurs through an account on
the server machine
Bug tracking systems require an email address, e.g. Bugzilla
Data Gathered
Primary:
Information required for the transaction
Secondary:
Information not required for the transaction,
e.g. the name in an email
Figure 2.0
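The primary/secondary split can be illustrated with a mail header. The following is a minimal sketch, assuming the address counts as the primary identity and the display name as the secondary one; the function name and the example.org address are hypothetical and not taken from the paper.

```python
from email.utils import parseaddr

def split_identity(header_value):
    """Split a From-style header into primary and secondary identities."""
    # parseaddr returns (display_name, address); the name is "" when absent.
    name, address = parseaddr(header_value)
    return {
        "primary": address.lower(),   # required for the mail to be sent
        "secondary": name or None,    # optional, may be missing
    }

print(split_identity("Brian Chan <bchan@example.org>"))
print(split_identity("bchan@example.org"))
```

The second call shows the case where only the primary identity is available.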
Data Gathered (cont’d)
Automated process extracts data into
data repository
Figure 3.0
Data Gathered
Sources Table:
Lists where the identity information was originally
extracted, e.g. file1.C,
bugreport230
Identification Table:
Identity
Id key to Source table
Data Gathered
Persons
Identifications
Gender, Nationality, Hash
Pseudo identity: bchan
Match number with another identity
Matches
Tells which two identities belong to the same actor
Table 1.0
1 | bchan | [email protected] | Deduction | 80%
1 | Brian Chan | [email protected] | Same Email | 90%
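One way to picture how the tables relate is a small in-memory database. This is only a sketch under assumptions: the table and column names below are hypothetical, the Persons table is left out, and the slides do not give the actual schema.

```python
import sqlite3

# Hypothetical schema; all names are illustrative only.
SCHEMA = """
CREATE TABLE sources (
    source_id INTEGER PRIMARY KEY,
    location  TEXT NOT NULL          -- e.g. 'file1.C' or 'bugreport230'
);
CREATE TABLE identities (
    identity_id INTEGER PRIMARY KEY,
    identity    TEXT NOT NULL,       -- e.g. 'bchan' or 'Brian Chan'
    source_id   INTEGER NOT NULL REFERENCES sources(source_id)
);
CREATE TABLE matches (
    identity_a INTEGER NOT NULL REFERENCES identities(identity_id),
    identity_b INTEGER NOT NULL REFERENCES identities(identity_id),
    evidence   TEXT NOT NULL,        -- e.g. 'Same Email', 'Deduction'
    confidence REAL NOT NULL         -- between 0.0 and 1.0
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO sources VALUES (1, 'file1.C')")
conn.execute("INSERT INTO sources VALUES (2, 'bugreport230')")
conn.execute("INSERT INTO identities VALUES (1, 'bchan', 1)")
conn.execute("INSERT INTO identities VALUES (2, 'Brian Chan', 2)")
conn.execute("INSERT INTO matches VALUES (1, 2, 'Deduction', 0.8)")

# List each match with the two identities it links.
rows = conn.execute("""
    SELECT a.identity, b.identity, m.evidence, m.confidence
    FROM matches m
    JOIN identities a ON a.identity_id = m.identity_a
    JOIN identities b ON b.identity_id = m.identity_b
""").fetchall()
print(rows)
```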
Data Gathered
Matching during automated data
gathering process
Inference
Automatic Heuristics
Human Verification
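The step from pairwise matches to one actor per group can be sketched with a union-find structure. This is an illustration, not the authors' method: the 0.85 auto-accept threshold, the function name, and the sample addresses are assumptions, and low-confidence matches are simply set aside for human verification.

```python
def group_identities(matches, auto_accept=0.85):
    """Merge identities linked by confident matches; queue the rest for review."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    needs_review = []
    for a, b, confidence in matches:
        root_a, root_b = find(a), find(b)
        if confidence >= auto_accept:
            parent[root_a] = root_b        # same actor
        else:
            needs_review.append((a, b, confidence))

    groups = {}
    for identity in parent:
        groups.setdefault(find(identity), set()).add(identity)
    return list(groups.values()), needs_review

# Hypothetical matches: (identity, identity, confidence)
matches = [
    ("bchan", "bchan@example.org", 0.9),
    ("Brian Chan", "bchan@example.org", 0.9),
    ("Brian", "Brian Chan", 0.6),
]
groups, needs_review = group_identities(matches)
print(groups, needs_review)
```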
Data Gathered
Rule 1:
A primary identity may contain part of the real name:
Example User <[email protected]>
Rule 2:
One identity can be built from another:
[email protected], [email protected]
name [email protected]
Rule 3:
Some projects or repositories have the foresight to keep lists of
information that can be used for matching
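Rules 1 and 2 lend themselves to simple string heuristics. The sketch below is one possible reading of them, not the paper's implementation: the helper names, the three-character minimum, and the username patterns are all assumptions.

```python
import re

def local_part(email):
    """Return the part of an address before the '@', lowercased."""
    return email.split("@", 1)[0].lower()

def name_tokens(real_name):
    """Split a real name into lowercase alphabetic tokens."""
    return [t for t in re.split(r"[^a-z]+", real_name.lower()) if t]

def name_in_email(real_name, email):
    """Rule 1 (sketch): part of the real name appears in the address."""
    local = local_part(email)
    # Ignore very short tokens, which would match almost anything.
    return any(len(t) >= 3 and t in local for t in name_tokens(real_name))

def candidate_usernames(real_name):
    """Rule 2 (sketch): usernames that could be built from a real name."""
    tokens = name_tokens(real_name)
    if len(tokens) < 2:
        return set(tokens)
    first, last = tokens[0], tokens[-1]
    return {first, last, first + last, first[0] + last,
            first + "." + last, first + "_" + last}

print(name_in_email("Brian Chan", "bchan@example.org"))
print("bchan" in candidate_usernames("Brian Chan"))
```

Heuristics like these produce candidates only; the slides note that cleaning and human verification still follow.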
Data Gathered
Matching algorithms still make errors, but
for statistical purposes an error that is
small enough can be ignored.
Cleaning and verification are still applied.
Data Gathered
Privacy Issues:
Use a hash value (1st firewall) to reference
information; the Identifications table
cannot be referenced directly
Person ID (2nd firewall) is assigned in such a
way that the real identity cannot be inferred
without direct access to the Identifications table
Each ID is given to a unique person, so hackers
cannot find a specific identity from it
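The hash "firewall" could look roughly like the following. This is a sketch under assumptions: the slides do not name a hash function, so a keyed SHA-256 (HMAC) from the Python standard library is used here, and the secret key and function name are hypothetical.

```python
import hashlib
import hmac

def identity_hash(identity, secret_key):
    """Return an opaque reference for an identity (assumed keyed SHA-256)."""
    normalized = identity.strip().lower().encode("utf-8")
    # Without the key, a digest cannot be recomputed from a guessed identity.
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()

key = b"hypothetical-project-secret"
ref = identity_hash("bchan@example.org", key)
print(ref)
```

Other tables would then store only the digest, so the identity itself appears nowhere but the Identifications table.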
Conclusions
Actors in Libre Software may use many
different identities for development
The paper presents a design for accounting
for all the different identities and determining
who is actually doing what
It also discusses how privacy can be
handled
Personal Thoughts
Good Points:
Effective Solution
Good examination of all the different
identities in business
Unique interpretation of data mining
Personal Thoughts
Points for improvement:
No actual data presented to show results
References GNOME but never actually gives
statistical information from it
Some interpretation is left to the reader
Questions and Comments