Developer Identification Methods
for Integrated Data from Various
Sources
Gregorio Robles
Jesus M. Gonzalez-Barahona
Presented by
Brian Chan
Cisc 864
Table of Contents







Background Information
Problems Addressed
Motivation
Data Gathered
Conclusion
Personal Thoughts
Questions and Comments
Background Information



Data mining for a project usually draws on a
single source of data
Results can be applied to Libre Software
Sources are looked at separately:


Mailing Lists
Bug Repositories
Background Information

Libre Software follows the Pareto law for
commits:

For each major artifact, 20% of developers
are shown to contribute 80% of the activity
in it.
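As a rough illustration of the 80/20 claim above, the share of activity contributed by the most active developers can be computed as follows. The function name `top_share` and the commit counts are hypothetical and made up for this sketch; they are not data from the paper.

```python
def top_share(commits_per_dev, top_fraction=0.2):
    """Return the fraction of all commits made by the most active developers."""
    counts = sorted(commits_per_dev.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always include at least one developer in the "top" group.
    k = max(1, round(len(counts) * top_fraction))
    return sum(counts[:k]) / total


# Hypothetical commit counts for ten developers (illustration only).
commits = {"dev_a": 400, "dev_b": 400}
commits.update({f"dev_{i}": 25 for i in range(8)})

share = top_share(commits)  # share of commits made by the top 20% of developers
```

The same calculation would be run once per artifact (commits, mailing list posts, bug reports) to see whether each one shows a similar concentration.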
Problems Addressed



Are the people who are most active in
one artifact the same people who are most
active in the other artifacts?
People use different identities in each
artifact
Current mining techniques focus on one
artifact, so they cannot tell who is who
Motivation



To gain insight into the social network
and structure of libre software projects
To find all the identities that correspond
to one person
Focus on data analysis rather than
the extraction process
Data Gathered


Actor has access to
artifacts
Alternate rules for
each artifact
Figure 1.0
Data Gathered




Actor can post on more
than one mailing list:
[email protected]
[email protected]
Source Files can appear with many
identities:
Brian Chan
Brian
bchan
Interaction with the versioning repository occurs through an account on
the server machine
Bug tracking systems require an email address, e.g. Bugzilla
Data Gathered

Primary: required information
Secondary: not required
for the transaction,
e.g. the name in an email

Figure 2.0
Data Gathered (cont’d)
Automated process extracts data into
data repository
Figure 3.0

Data Gathered

Sources Table:

Lists where the identity information was originally
extracted from, e.g. file1.C,
bugreport230
Identification Table:

Identity

Id key to the Sources table
Data Gathered

Persons:
Gender, Nationality, Hash
Identifications:
Pseudo identity, e.g. bchan
Match number with another identity
Matches:
Tells which two identities belong to the same actor
Table 1.0
1 | bchan | [email protected] | Deduction | 80%
1 | Brian Chan | [email protected] | Same Email | 90%
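The tables described above could be laid out roughly as follows. This is only a sketch using Python's built-in sqlite3 module: the column names, types, and keys are assumptions chosen for illustration, since the slides name the tables but do not give their exact schema.

```python
import sqlite3

# Hypothetical schema; column names are assumptions, not taken from the paper.
SCHEMA = """
CREATE TABLE persons (
    person_id   INTEGER PRIMARY KEY,
    gender      TEXT,
    nationality TEXT,
    hash        TEXT
);
CREATE TABLE sources (
    source_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL          -- e.g. 'file1.C' or 'bugreport230'
);
CREATE TABLE identifications (
    identification_id INTEGER PRIMARY KEY,
    identity  TEXT NOT NULL,         -- e.g. 'bchan' or 'Brian Chan'
    source_id INTEGER REFERENCES sources(source_id),
    person_id INTEGER REFERENCES persons(person_id)
);
CREATE TABLE matches (
    identification_a INTEGER REFERENCES identifications(identification_id),
    identification_b INTEGER REFERENCES identifications(identification_id),
    evidence   TEXT,                 -- e.g. 'Deduction', 'Same Email'
    confidence REAL                  -- e.g. 0.80, 0.90
);
"""


def build_db():
    """Create an in-memory database holding the four tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def identities_for_person(conn, person_id):
    """Return every identity currently attributed to one person."""
    rows = conn.execute(
        "SELECT identity FROM identifications WHERE person_id = ? ORDER BY identity",
        (person_id,),
    )
    return [row[0] for row in rows]


conn = build_db()
conn.execute("INSERT INTO persons (person_id) VALUES (1)")
conn.execute("INSERT INTO sources (source_id, name) VALUES (1, 'file1.C')")
conn.execute("INSERT INTO sources (source_id, name) VALUES (2, 'bugreport230')")
conn.execute("INSERT INTO identifications VALUES (1, 'bchan', 1, 1)")
conn.execute("INSERT INTO identifications VALUES (2, 'Brian Chan', 2, 1)")
conn.execute("INSERT INTO matches VALUES (1, 2, 'Same Email', 0.90)")
```

Keeping the evidence and confidence on the match, rather than on the identity, lets a later human verification step accept or reject individual matches without touching the extracted data.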
Data Gathered

Matching during automated data
gathering process



Inference
Automatic Heuristics
Human Verification
Data Gathered

Rule 1:



Primary identities may contain part of the real name:
Example User <[email protected]>
Rule 2:

An identity can be built from another one:
[email protected], [email protected]
name [email protected]

Rule 3:

Some projects or repositories have the foresight to keep lists of
information that can be used for matching
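Rules 1 and 2 could be approximated with heuristics along these lines. This is a minimal sketch and not the authors' implementation: the function names, the set of username patterns, and the example.org address are all assumptions made for illustration.

```python
from email.utils import parseaddr


def split_identity(raw):
    """Rule 1: split 'Real Name <user@host>' into (name, user, host)."""
    name, addr = parseaddr(raw)
    user, _, host = addr.partition("@")
    return name, user.lower(), host.lower()


def candidate_usernames(real_name):
    """Rule 2: usernames that could plausibly be built from a real name.

    The patterns below are an assumed, incomplete list.
    """
    parts = real_name.lower().split()
    if len(parts) < 2:
        return set(parts)
    first, last = parts[0], parts[-1]
    return {
        first,
        last,
        first + last,
        first[0] + last,      # e.g. 'bchan' from 'Brian Chan'
        first + "." + last,
        first + "_" + last,
    }


def may_match(real_name, username):
    """Return True if the username could have been built from the real name."""
    return username.lower() in candidate_usernames(real_name)


# Hypothetical address used only for illustration.
name, user, host = split_identity("Brian Chan <bchan@example.org>")
same_person_candidate = may_match(name, user)
```

A hit from `may_match` is only a candidate match; in the process described here it would still be recorded with a confidence value and left for human verification.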
Data Gathered


The matching algorithms still produce errors, but in a
statistical gathering process an error rate that is
small enough can be ignored.
Cleaning and verification are still used.
Data Gathered

Privacy Issues:



Use a hash value (1st firewall) to reference
information, so the Identifications table is
never referenced directly
Person ID (2nd firewall): assigned in such a
way that the real identity cannot be inferred without
direct access to the Identifications table
Each id is given to a unique person, so hackers cannot
find a specific id
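One way the hash "firewall" could work is sketched below. The slides do not say which hash function is used; the keyed SHA-256 (HMAC), the normalization step, and the names `identity_hash` and `secret_key` are assumptions made for this example.

```python
import hashlib
import hmac


def identity_hash(identity, secret_key):
    """Return a stable, non-reversible reference for an identity string.

    Assumption: a keyed hash (HMAC-SHA256) is used so that someone without
    the key cannot confirm a guessed identity by hashing it themselves.
    """
    normalized = identity.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would be stored apart from the data.
key = b"example-secret-key"
reference = identity_hash("bchan", key)
```

Analysis tables would then store only `reference`, while the mapping back to the real identity stays in the restricted Identifications table.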
Conclusions



Actors in Libre Software may use many
different identities for development
The paper presents a design for
accounting for all the different identities and
working out who is actually doing what
It also discusses how privacy can be dealt
with
Personal Thoughts

Good Points:



Effective Solution
Good examination of all the different
identities in business
Unique interpretation of data mining
Personal Thoughts

Points for improvement:



No actual ‘data’ to view results
References GNOME but never actually gives
statistical information from it
Some interpretation is left to the reader
Questions and Comments