Gianluca Tarasconi & Francesco Lissoni - APE-INV

Name matching for PATSTAT data
Gianluca Tarasconi
KITeS Database Administrator
Website: rawpatentdata.blogspot.com
KITeS
Knowledge, Internationalization and Technology Studies
KITeS’s mission is to understand the relationship between
innovation, technology management, firms’ competitiveness
and economic growth in the global economy. KITeS research
aims to be rigorous, relevant and interdisciplinary. It focuses on three main areas:
innovation, technology management and trade.
KITeS - The centre

KITeS was founded in 2008, building upon the experience of research centres
such as CESPRI and CRITOM. It is hosted at Bocconi University.

KITeS is an inter-departmental research centre, integrating researchers from the
Economics Dpt., the Management Dpt. and the Institutional Analysis Dpt. KITeS
researchers hold doctoral degrees from Yale, Stanford, London School of
Economics, Bocconi, Manchester, Leuven, Sussex, Maastricht, and others.

Patent statistics have been widely used at KITeS for many years now, dating
back to CESPRI's early research in industrial dynamics.

This tradition has led to the cumulative creation and updating of a large database,
known as EP-CESPRI. Inventors' data used so far are organized in a sub-section
of such database, known as EP-INV.

… who’s who: www.kites.unibocconi.it
The EP-CESPRI Database (i)
The EP‐CESPRI database contains information on patents applied
for at the European Patent Office (EPO), from 1978 to October 2009.
The EP‐CESPRI database was first created by making use of
information downloaded regularly from EPO Bulletins. Since October
2007 it has been based upon applications published on a regular basis by
the EPO in PATSTAT; at present it contains about 2,090,000 patent
applications.
A beta version for USPTO data was released in 2009, and a version for SIPO (the Chinese
patent office) is planned for 2010.
The EP-CESPRI Database (ii)
EP-CESPRI data fall into three broad categories:
1. Patent data, such as the patent’s publication number, its
priority/application date, and main/secondary technological class
(IPC, 12 digits).
2. Applicant data, such as a unique code assigned by KITeS to each
applicant after cleaning the applicant’s data, plus the applicant’s
name and address.
3. Inventor data, such as name, surname, address and a unique
code (CODINV) assigned by KITeS to all inventors found to be the
same person. This section of EP-CESPRI is also known as EP-INV
and is the one of major interest for today’s seminar.
EP-INV: From raw data to structured data
• Data coming from PATSTAT are cleaned, standardized
and re-structured → CODINV2 code
• Eventually a similarity score is calculated for pairs of
inventors who have the same name and surname, but
different addresses → CODINV code
Standardization of inventors’ names and addresses
Original EPO data on inventors come from PATSTAT table
TLS206_ASCII, where data are only partially parsed for
names, address, city, zip codes.
Further steps are as follows:
1. Cleaning of address data → CODINV2 codes
2. Cleaning of names
3. Computation of similarity scores → CODINV codes
Cleaning of address data
Parsed data are given a unique code (CODINV2) and (iteratively)
cleaned by:
• shifting information found in the wrong field (such as zip code, county…);
• standardizing city names or parts of names (e.g.: “Saint” is turned
into “St.”);
• fixing mistakes in zip codes, according to national post office tables.
In the 10/2007 data there were 2,381,991 CODINV2 codes in the EP-INV DB, against
3,278,486 PATSTAT person_id values (about 27% fewer).
Example of city cleaning
Step         | CITY                    | ZIP
ORIGINAL     | DDR-4203 Bad Dürrenberg |
ZIP PARSED   | Bad Dürrenberg          | 4203
CITY CLEANED | BAD DURRENBERG          | 4203
ZIP LOOKUP   | BAD DURRENBERG          | 06231
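The three steps of the example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MySQL code used for EP-INV: the function names, the regular expression and the one-entry `ZIP_LOOKUP` dictionary (standing in for a national post office table) are all hypothetical.

```python
import re
import unicodedata

# Hypothetical stand-in for a national post office table (city -> current zip).
ZIP_LOOKUP = {"BAD DURRENBERG": "06231"}


def parse_zip(raw_city):
    """Move a leading 'CC-1234 ' style prefix out of the city field."""
    match = re.match(r"^(?:[A-Z]{1,3}-)?(\d{4,5})\s+(.+)$", raw_city)
    if match:
        return match.group(2), match.group(1)
    return raw_city, ""


def clean_city(city):
    """Strip diacritics and uppercase the city name."""
    decomposed = unicodedata.normalize("NFKD", city)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.upper().strip()


def clean_record(raw_city):
    """Run the steps in order: zip parsing, city cleaning, zip lookup."""
    city, zip_code = parse_zip(raw_city)
    city = clean_city(city)
    # Keep the parsed zip when the lookup table has no entry for the city.
    return city, ZIP_LOOKUP.get(city, zip_code)
```

Calling `clean_record("DDR-4203 Bad Dürrenberg")` walks through the same rows as the table: the zip is split off, the city is normalized, and the lookup replaces the outdated zip code.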
Cleaning of names
The “name+surname” field was parsed into the following fields:
first, second and third name, extension (e.g. Jr, Sr, III), surname, and
academic title (e.g. Dr., Prof., Ing.…).
This operation was mainly based on two iterative steps:
• Pairs of inventors with the same address and equal first
name, surname, extension and initial of second or third
name are corrected for the third name (e.g.: “Rossi Giovanni
Paolo” is turned into “Rossi Giovanni P.”);
• Pairs of inventors’ records where 2 out of the 3 fields city,
address and name are the same and the remaining one has
a low edit distance (Levenshtein/alfanum) are updated with the
data of the inventor with the higher number of patents.
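The second step (two fields equal, the third one close in edit distance) could look roughly like the Python sketch below. It is an assumption-laden illustration rather than the actual SQL: the record layout, the `n_patents` field and the `MAX_DISTANCE` cut-off are hypothetical, since the slides do not say what counts as a "low" distance.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


FIELDS = ("name", "address", "city")
MAX_DISTANCE = 3  # hypothetical cut-off for a "low" edit distance


def is_near_duplicate(rec_a, rec_b, max_distance=MAX_DISTANCE):
    """True when exactly one of the three fields differs, and only slightly."""
    differing = [f for f in FIELDS if rec_a[f] != rec_b[f]]
    if len(differing) != 1:
        return False
    field = differing[0]
    return levenshtein(rec_a[field], rec_b[field]) <= max_distance


def harmonize(rec_a, rec_b):
    """Copy the fields of the record with more patents onto the other one."""
    if not is_near_duplicate(rec_a, rec_b):
        return rec_a, rec_b
    master = rec_a if rec_a["n_patents"] >= rec_b["n_patents"] else rec_b
    fixed = {f: master[f] for f in FIELDS}
    return {**rec_a, **fixed}, {**rec_b, **fixed}
```

With this cut-off, "Via P. Maspero, 24" and "Via Maspero, 24" (edit distance 3) would be treated as the same address when name and city already agree.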
An example
Before cleaning:
Name                | Address                | City           | Zip   | codinv2
Tarasconi, Gianluca | Via P. Maspero, 24     | Milan          |       | 1
Tarasconi, Gianluca | Via Maspero, 24        | IT-20137 Milan |       | 2
Tarasconi, G.       | c/o university bocconi | Milano         | 20136 | 3
Tarasconi, Gianluca | c/o university bocconi | Milano         | 20136 | 4
Tarasconi, Gianluca | 35, Via Tertulliano    | Milan          |       | 5

After cleaning:
Name                | Address                | City   | Zip   | codinv2
Tarasconi, Gianluca | Via Maspero, 24        | Milano | 20137 | 1
Tarasconi, Gianluca | c/o university bocconi | Milano | 20136 | 3
Tarasconi, Gianluca | Via Tertulliano, 35    | Milano | 20135 | 5
Further info on cleaning names and addresses
• Cleaning of names and addresses was implemented in MySQL;
• The SQL code is based on 25 lookup tables and 950
recursive queries;
• The aggregation algorithm was quite conservative (to allow
‘new entries’ to be quickly linked).
Computation of similarity score
• Inventors’ data are restructured as person
(CODINV) vs person@location (CODINV2)
• All inventor records that share name and surname but differ in
anything else are compared in pairs, through the Massacrator
SQL routine
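How the candidate pairs might be enumerated can be sketched as follows. The Massacrator routine itself is SQL and is not shown in the slides, so this Python version is only an assumption about the pairing step: records are taken to be hypothetical `(codinv2, name)` tuples, and every two CODINV2 records with an identical cleaned name form one pair.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records):
    """Pair up all person@location records that share the same name.

    records: iterable of (codinv2, name) tuples (hypothetical layout).
    Returns a list of (codinv2_a, codinv2_b) tuples with codinv2_a < codinv2_b.
    """
    by_name = defaultdict(list)
    for codinv2, name in records:
        by_name[name].append(codinv2)

    pairs = []
    for name in sorted(by_name):
        # Every unordered pair within a name group is a candidate.
        pairs.extend(combinations(sorted(by_name[name]), 2))
    return pairs
```

A name group with n records yields n(n-1)/2 pairs, which is why the number of pairs can exceed the number of inventors.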
Introduction of CODINV
Name                | Address                | City   | Zip   | codinv2 | codinv
Tarasconi, Gianluca | Via Maspero, 24        | Milano | 20137 | 1       | 1
Tarasconi, Gianluca | c/o university bocconi | Milano | 20136 | 3       | 2
Tarasconi, Gianluca | Via Tertulliano, 35    | Milano | 20135 | 5       | 3
Computation of similarity score
The similarity score combines six kinds of evidence:
• Workplace: same applicant / company / group
• Toponymic permanence: same address, town, county…
• Social networks: coinventors in common, 3 degrees of distance in coinventorship
• IPC: patenting in the same tech fields
• Citation linkages: (self)citing or cited
• Time lag: how long since last patent?
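Scores by category
Category             | Criterion                                        | Points
Workplace            | Same applicant                                   |
Workplace            | Same applicant (the applicant has <50 inventors) |
Workplace            | Same group (if available)                        |
Toponymic permanence | Same city                                        |
Toponymic permanence | Same province                                    |
Toponymic permanence | Same region                                      |
Toponymic permanence | Same state (US)                                  |
Toponymic permanence | Same address [in different cities; it may indicate misspellings in the city field] |
Social networks      | Same coinventor                                  |
Social networks      | 3 degrees of separation                          | 5
IPC                  | Same IPC code (4 digits)                         | 5
IPC                  | Same IPC code (6 digits)                         | 5
IPC                  | Same IPC code (12 digits)                        | 10
Time lag             | Priority dates differ for >20 years              | -5
Citation linkages    | Inventor 1 cites inventor 2                      | 5
Citation linkages    | Inventor 1 is cited by inventor 2                | 5
Other                | Widespread surname                               | -5
For the nine criteria without a value, the slide gives seven scores of 5 and two of 10; the transcript does not preserve which score belongs to which criterion.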
Update of CODINV using similarity score
Intuitively, a high similarity score can be taken as an indication of a high
probability that the two inventors in the pair are the same person.
Whenever the two inventors in a pair are found to be the same person, the lowest
CODINV code is assigned to both.
Name                | Address                | City   | Zip   | codinv2 | codinv (old) | codinv (new)
Tarasconi, Gianluca | Via Maspero, 24        | Milano | 20137 | 1       | 1            | 1
Tarasconi, Gianluca | c/o university bocconi | Milano | 20136 | 3       | 2            | 1
Tarasconi, Gianluca | Via Tertulliano, 35    | Milano | 20135 | 5       | 3            | 1
The algorithm should be run recursively.
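The need for repeated runs can be illustrated with a small Python sketch. It assumes a hypothetical `{codinv2: codinv}` mapping in which every CODINV2 record starts with its own CODINV, plus a list of pairs already judged to be the same person; it is not the actual SQL update.

```python
def merge_codinv(codinv_by_record, matched_pairs):
    """Assign the lowest CODINV to both members of each matched pair,
    repeating until no code changes any more.

    codinv_by_record: {codinv2: codinv} (hypothetical layout)
    matched_pairs: iterable of (codinv2_a, codinv2_b)
    """
    codes = dict(codinv_by_record)
    pairs = list(matched_pairs)
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            lowest = min(codes[a], codes[b])
            if codes[a] != lowest or codes[b] != lowest:
                codes[a] = codes[b] = lowest
                changed = True
    return codes
```

With the pairs processed in the order (3, 5) then (1, 3), a single pass leaves record 5 with code 2; only a second pass brings all three records to CODINV 1, which is why one run is not enough.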
Finding a threshold value (I)
Manual checking of EP-INV records suggests that a large number of
paired inventors with a total score higher than 20 are indeed the same
person.
Percentages vary across countries, largely because of the different
distribution of frequent surnames. Therefore, no automatic reassignment
of CODINV codes has been performed so far.
In KEINS research, data have been extensively checked for IT, FR and
SE; the threshold value of the similarity score was set at 15 (median
value): inventors in pairs with score >= 15 are then presumed to be
the same person, and assigned the same CODINV code.
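As a rough sketch, the thresholding step amounts to summing the points of the criteria a pair satisfies and comparing the total with 15. The Python below assumes the criteria are simply additive. The criterion names are hypothetical and the weights in `ILLUSTRATIVE_WEIGHTS` cover only a handful of criteria as placeholders; the full scoring table should be substituted for real use.

```python
THRESHOLD = 15  # KEINS threshold quoted above

# Placeholder weights for a few criteria only (hypothetical names).
ILLUSTRATIVE_WEIGHTS = {
    "same_ipc_6_digits": 5,
    "same_ipc_12_digits": 10,
    "cites": 5,
    "cited_by": 5,
    "priority_gap_over_20_years": -5,
}


def similarity_score(evidence, weights=ILLUSTRATIVE_WEIGHTS):
    """Sum the weights of all criteria flagged True in `evidence`."""
    return sum(w for name, w in weights.items() if evidence.get(name, False))


def presumed_same_person(evidence, threshold=THRESHOLD,
                         weights=ILLUSTRATIVE_WEIGHTS):
    """Apply the rule 'score >= threshold means same person'."""
    return similarity_score(evidence, weights) >= threshold
```

For instance, a pair flagged with `same_ipc_6_digits` and `same_ipc_12_digits` reaches 15 under these placeholder weights and would be merged, while adding `priority_gap_over_20_years` would pull it back to 10.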
Finding a threshold value (II)
Manual checking suggests that:
• no false positives are introduced with this
choice, i.e. no pair of inventors is erroneously assigned
the same CODINV code;
• several false negatives remain, i.e. pairs of inventors who
are indeed the same person have scores <15 and are not
given the same CODINV code.
Applying Massacrator to all EPO (I)
• At 10/2007 we get 2,672,671 couples out of 2,363,501 inventors
• The mode is 0 pts (764,946 couples), but 758,471 couples have >= 15 pts
[Figure: distribution of score. x-axis: score, from -10 to 375; y-axis: n couples, logarithmic scale from 10 to 1000000]
Applying Massacrator to all EPO (II)
 16,78 % of couples are >= 20 pts
 22,72% of couples are >= 15 pts
[Figure: percentage of couples by score. x-axis: score, from -10 to 155; y-axis: percentage of couples, from 0% to 100%]
Applying Massacrator to all EPO (III)
A rough version of the algorithm, giving a proxy of the possible
reduction, may be:
same IPC (12 digits)
OR same applicant
OR same address
OR 3 degrees of distance
OR 1 coinventor in common
OR citation linkage
OR same IPC (6 digits) and same country
This compresses 571,970 CODINVs out of 2,363,501 (-24%)
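Read as a boolean rule, the rough version above could be written as in the Python sketch below. The flag names are hypothetical, and the sketch assumes that "and same country" qualifies only the 6-digit IPC condition.

```python
def would_merge(pair):
    """Rough rule: merge when any single condition holds.

    pair: dict of boolean flags (hypothetical names); missing flags count as False.
    """
    return bool(
        pair.get("same_ipc_12", False)
        or pair.get("same_applicant", False)
        or pair.get("same_address", False)
        or pair.get("three_degrees_of_distance", False)
        or pair.get("coinventor_in_common", False)
        or pair.get("citation_linkage", False)
        or (pair.get("same_ipc_6", False) and pair.get("same_country", False))
    )


def reduction_share(compressed, total):
    """Share of CODINV codes removed by the compression."""
    return compressed / total
```

With the figures quoted above, `reduction_share(571970, 2363501)` is roughly 0.24, matching the 24% reduction.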
Some publications using the EP-INV data
Lissoni, F., Llerena, P., McKelvey, M., and B. Sanditov "Academic Patenting in Europe: New
Evidence from the KEINS Database," Research Evaluation, 17(2): 87-102.
Bacchiocchi E., Montobbio F. (2009); Knowledge Diffusion from University and Public Research.
A Comparison between US Japan and Europe using Patent Citations. Journal of Technology
Transfer, vol.34 (2), pp.169-181.
Breschi S., Lissoni F., Montobbio F. (2008). University patenting and scientific productivity. A
quantitative study of Italian academic inventors. European Management Review. The Journal of the
European Academy of Management 5(2): 91-109
Corrocher N., Malerba F., Montobbio F. (2007); Schumpeterian Patterns of Innovative Activity in
the ICT Field. Research Policy. vol. 36, pp. 418-432
Breschi S., Lissoni F., Montobbio F. (2007). The Scientific Productivity Of Academic Inventors:
New Evidence From Italian Data. Economics of Innovation and New Technology, Vol. 16, Issue 2,
pp. 101-118
Della Malva A., Breschi S., Lissoni F., Montobbio F. (2007). L'attivita' brevettuale dei docenti
universitari: L'Italia in un confronto internazionale. Economia e Politica Industriale, v.2, pp. 43-70.
Montobbio F. (2008); Patenting Activity in Latin American and Caribbean Countries. In World
Intellectual Property Organization (WIPO) - Economic Commission for Latin America and the
Caribbean (ECLAC) - "Study on Intellectual Property Management in Open Economies: A Strategic
Vision for Latin America". Forthcoming
Future uses of the algorithm (I)
• Cross patent-office match:
Is J. Smith at the EPO the same person as J. Smith at the USPTO?
• Decompression:
Where toponymic data are scarce (USPTO data, for instance), mere
data cleaning would group inventors who are not the same person;
the algorithm could help to avoid such false positives
Future uses of the algorithm (II)
• Companies’ match:
Identify applicants with similar company names as
the same entity;
• NPL match:
Help to deduplicate authors / affiliations