Role of information in multiscale and ab

Data Mining Approaches
in Atomistic Modeling
H. Aourag
URMER, University of Tlemcen
AMASS – 7/25/03
Outline
• Introduction
• Ex 1: Intergranular Embrittlement of Fe
• Ex 2: Catalytic Activity - Hydrogenation
• Ex 3: Stainless Steel CrxNiyFe(1-x-y)
• Ex 4: Conductivity T7 7xxx Al Alloys
• Ex 5: Boiling Points
• Ex 6: Crystal Structure Prediction – open questions…
Predicting Properties with Atomistic Modeling
Atomistic modeling
• Atom positions
• Electronic structure
• Energies
Atomic scale descriptors
• Band gap
• Elastic constants
• Segregation energies
• Activation barriers
Macroscopic properties
• Elastic properties
• Conductivity
• Toxicity
Atomic scale descriptors → ? → Macroscopic properties
• Direct calculation (band gap, elastic constants)
• Physical laws, constitutive relations
• Data mining
Other properties labeled on the diagram: embrittlement, transport, weldability, toxicity
Power of Data Mining
Use known data to establish R
Calculated Atomistic Properties Database → R → Measured Macroscopic Properties Database
Use R to predict new data
Calculated Atomistic Properties Database → R → Predicted Macroscopic Properties Database
• Does not require complete and accurate multiscale theories
• New physics in relationships R
• Quick, cheap screening for desired properties, errors, etc. – can be qualitative
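As a minimal sketch of the two steps (establish R from known data, then use R on new data), the example below assumes a single descriptor and a straight-line form for R. Both are assumptions made here for illustration, and all numbers are made up rather than taken from any database.

```python
from statistics import mean

def fit_linear(x, y):
    """Least-squares fit y ~ a*x + b; returns (a, b)."""
    xm, ym = mean(x), mean(y)
    sxx = sum((xi - xm) ** 2 for xi in x)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, ym - a * xm

# Step 1: use known data to establish R (hypothetical values).
calculated_known = [1.0, 2.0, 3.0, 4.0]   # calculated atomistic descriptor
measured_known = [2.1, 3.9, 6.2, 7.8]     # measured macroscopic property
a, b = fit_linear(calculated_known, measured_known)

def R(descriptor):
    """The fitted descriptor -> property relationship."""
    return a * descriptor + b

# Step 2: use R to predict the property for newly calculated descriptors.
calculated_new = [5.0, 6.0]
predicted = [R(d) for d in calculated_new]
```

In practice R could be any of the model types discussed in the talk; the linear form is only the simplest choice to write down.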
Key Issues
Atomic scale descriptors → Data Mining → Macroscopic Properties
– Descriptors accessible to modeling
– Descriptors optimally chosen
• Use known relationships/physics
• Optimize from large set of possibilities
– Descriptors→Property relationship is robust
• Sensible choice of methods
• Tested with cross validation, test sets
– Data
• Large enough
• Clean enough
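One way to test robustness with cross validation is a leave-one-out scheme: refit the relationship with each sample held out in turn and measure the error on the held-out sample. The sketch below assumes a one-descriptor straight-line model and made-up data; the function names are hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares straight line; returns (slope, intercept)."""
    xm, ym = mean(x), mean(y)
    slope = (sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
             / sum((xi - xm) ** 2 for xi in x))
    return slope, ym - slope * xm

def rms(errors):
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

def fit_error(x, y):
    """RMS error of the model on the same data it was fitted to."""
    a, b = fit_line(x, y)
    return rms([a * xi + b - yi for xi, yi in zip(x, y)])

def loo_cv_error(x, y):
    """RMS error when each point is predicted by a model fitted without it."""
    errors = []
    for k in range(len(x)):
        a, b = fit_line(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
        errors.append(a * x[k] + b - y[k])
    return rms(errors)

# Hypothetical descriptor/property pairs.
descriptor = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
prop = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]
e_fit = fit_error(descriptor, prop)
e_cv = loo_cv_error(descriptor, prop)
```

A cross-validation error much larger than the fitting error is a warning sign of overfitting or of too little data.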
Ex 1: Intergranular Embrittlement of Fe
• Property: Fe embrittlement
• Descriptors→Property relationship:
Embrittlement ∝ [Grain boundary segregation E
− Free surface segregation E] = (E_GB − E_FS) (Rice ’89)
• Descriptors: (E_GB − E_FS) (calculated ab initio)
• Data: Embrittling potency for B, C, P, S.
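A minimal sketch of how this descriptor could be used for screening, assuming the slide's sign convention (a larger grain boundary minus free surface segregation energy difference means stronger embrittlement). The solute names and energies are placeholders, not calculated values.

```python
def embrittling_potency(e_gb, e_fs):
    """Descriptor from the slide: grain boundary minus free surface segregation energy."""
    return e_gb - e_fs

def classify(e_gb, e_fs):
    """Positive potency -> embrittler; otherwise cohesion enhancer (assumed convention)."""
    return "embrittler" if embrittling_potency(e_gb, e_fs) > 0 else "cohesion enhancer"

# Placeholder segregation energies in eV: (E_GB, E_FS).
solutes = {
    "solute_X": (-1.0, -1.8),
    "solute_Y": (-2.5, -1.6),
}
labels = {name: classify(e_gb, e_fs) for name, (e_gb, e_fs) in solutes.items()}
```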
Ex 1: Intergranular Embrittlement of Fe
(Wu, et al., Phys. Rev. B., ‘96)
Also correctly predicts effect of Mn and Mo on P embrittlement!
(Zhong, et al., Phys Rev B, ’97, Geng, et al., Solid State Comm., ’01)
Ex 2: Catalytic Activity - Hydrogenation
• Property: Reaction rates (Hydrogenation of ethene, benzene on 3d
transition metal M)
• Descriptors→Property relationship:
Adapted Brønsted-Evans-Polanyi free energy
+ Langmuir-Hinshelwood rate equations
→ Rate = R[EMC, 12 fitting “constants” independent of M]
• Descriptors:
– EMC = M-C bond strength in bulk NaCl structure (calculated ab
initio)
– 12 fitting “constants” (fit to experimental data for each reaction)
• Data: 10-20 reaction rates for each of ethene and benzene
Ex 2: Catalytic Activity - Hydrogenation
(Toulhoat, et al. ’02)
[Figure: two panels plotted against EMC. Ethene: C2H4+H2→C2H6 (cross-validation in black). Benzene: C6H6+3H2→C6H12 (cross-validation with alloys).]
Ex 3: Stainless Steel CrxNiyFe(1-x-y)
• Property: High hardness and ductility
• Descriptors→Property relationship:
Hardness ∝ shear modulus = G
Ductility ∝ bulk modulus/shear modulus = B/G
• Descriptors: B,G (from ab initio)
• Data: Not clearly defined
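Since both a high G and a high B/G are wanted, one possible way to screen candidates is to keep only those not beaten on both criteria by another candidate (a Pareto filter). This filter is an illustration added here, not a method named in the talk, and the alloy names and moduli are hypothetical.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated in both G and B/G.

    candidates maps a name to (B, G), both in GPa.
    """
    scored = {name: (g, b / g) for name, (b, g) in candidates.items()}
    front = []
    for name, (g, ratio) in scored.items():
        dominated = any(
            g2 >= g and r2 >= ratio and (g2 > g or r2 > ratio)
            for other, (g2, r2) in scored.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical (B, G) values in GPa.
candidates = {
    "alloy_1": (160.0, 80.0),
    "alloy_2": (170.0, 75.0),
    "alloy_3": (150.0, 70.0),
    "alloy_4": (165.0, 78.0),
}
front = pareto_front(candidates)
```

Here alloy_3 drops out because alloy_2 has both a higher G and a higher B/G.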
Hardness vs. Shear Modulus
(Teter, MRS Bulletin, ’98)
[Figure: Vickers hardness [GPa] vs. shear modulus [GPa]]
Ex 3: Stainless Steel CrxNiyFe(1-x-y)
(Vitos, et al., Nature Materials, ‘02)
[Figure: two maps over Cr (at%) and Ni (at%), one of shear modulus G and one of bulk modulus B, shaded from low to high; labeled regions: high G (hard), high B/G (ductile)]
• Optimal at ~Cr18Ni24Fe58 (multiple patents)
• Predict improved mechanical properties for Ir, Os doping
Ex 4: Conductivity T7 7xxx Al Alloys
• Property: Electrical conductivity s
• Descriptors→Property relationship:
– Linear: s = V*d (requires only fitting)
– Neurofuzzy: s = NF(d) (requires only fitting)
– Physical: s = P(d) (requires thermodynamic models of relevant
phases, Rayleigh–Maxwell equation for resistivity with dispersed
particles, Starink-Zahra equation for precipitation, 1D diffusion
equation, Matthiessen’s rule for resistivity with dissolved elements)
• Descriptors: Concentrations, ageing time → d = (xZn, xMg,
xCu, xZr, xFe, xSi, t)
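To illustrate one ingredient of the physical model, the sketch below applies Matthiessen's rule under the assumption that each dissolved element adds a resistivity contribution linear in its concentration. The coefficients and concentrations are placeholders in arbitrary units, not values from the cited work.

```python
def resistivity(rho_pure, solute_coeffs, concentrations):
    """Matthiessen's rule with linear solute terms: rho = rho_pure + sum_i r_i * x_i."""
    return rho_pure + sum(solute_coeffs[el] * x for el, x in concentrations.items())

def conductivity(rho):
    """Conductivity is the reciprocal of resistivity."""
    return 1.0 / rho

# Placeholder values in arbitrary units (not measured data).
RHO_PURE = 1.0
COEFFS = {"Zn": 0.1, "Mg": 0.5, "Cu": 0.3}     # resistivity added per at.% in solution
dissolved = {"Zn": 2.0, "Mg": 1.0, "Cu": 0.5}  # at.% remaining in solid solution

rho = resistivity(RHO_PURE, COEFFS, dissolved)
s = conductivity(rho)
```

As elements leave solid solution during ageing, their terms shrink and the conductivity rises, which is why ageing time appears among the descriptors.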
Ex 4: Conductivity T7 7xxx Al Alloys
s measured for 36 concentration/ageing time samples
R-Model      Fitting Params   RMS Error (%)   Cross Validation (%)
Linear       7                4.75            5.25
Neurofuzzy   5                1.35            1.525
Physical     6                0.97            1.05
(Starink, et al., ‘00)
Ex 5: Boiling Points
(Quantitative Structure-Property Relationships: QSPR)
• Property: Boiling Point TB
• Descriptors→Property relationship: Neural
Network (10:18:1, sigmoid, backpropagation)
• Descriptors: Electrostatic and structural
properties (calculated with semiempirical VAMP
– AM1)
• Data: TB for 6629 molecules containing
elements H, B, C, N, O, F, Al, Si, P, S, Cl, Zn,
Ge, Br, Sn, I, Hg
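A minimal sketch of a 10:18:1 feed-forward network trained by backpropagation, written with the standard library only. It assumes sigmoid hidden units and a linear output unit (the slide does not say how the output is scaled), and it trains on synthetic data, not on the boiling point set.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyNet:
    """Feed-forward net n_in:n_hid:1, sigmoid hidden layer, linear output (assumed)."""

    def __init__(self, n_in=10, n_hid=18, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
        self.b1 = [0.0] * n_hid
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hid)]
        self.b2 = 0.0

    def forward(self, x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2
        return h, y

    def train_step(self, x, target, lr=0.01):
        """One backpropagation step on the loss 0.5 * (y - target)^2."""
        h, y = self.forward(x)
        err = y - target
        for j, hj in enumerate(h):
            # Gradient reaching hidden unit j, computed with w2[j] before its update.
            grad_hidden = err * self.w2[j] * hj * (1.0 - hj)
            self.w2[j] -= lr * err * hj
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_hidden * xi
            self.b1[j] -= lr * grad_hidden
        self.b2 -= lr * err

def mean_loss(net, data):
    return sum(0.5 * (net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)

# Synthetic data: 10 random descriptors per sample, made-up target function.
rng = random.Random(1)
data = []
for _ in range(20):
    x = [rng.random() for _ in range(10)]
    data.append((x, 3.0 * x[0] - 2.0 * x[1] + 1.0))

net = TinyNet()
loss_before = mean_loss(net, data)
for _ in range(200):
    for x, t in data:
        net.train_step(x, t)
loss_after = mean_loss(net, data)
```

With real data the descriptors and targets would normally be rescaled first, and a held-out test set would be used to judge the fit.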
Data Mining Descriptors→Property Relationships
Many general approaches
• Graphical
• Linear Regressions (normal least squares, principal component
regression, partial least squares, …)
• Neural Networks (perceptrons, feed-forward, radial-basis, …)
• Clustering (k-means, nearest-neighbor, …)
Many choices in each approach
Neural Networks:
• Number of neurons/layers – 3:4:1
• Transfer functions: step, sigmoid, tansig, etc.
• Training method: backpropagation algorithms
[Figure: neural network diagram, In → Out]
Thousands of possible approaches!
• Many yield similar results
• Appropriate for different situations
• Problem dependent - much art!!
Descriptors
Charged partial surface areas descriptors, Accelrys QSAR module
1. Partial positive surface area (sum of the surface area of positive atoms)
2. Partial negative surface area (sum of the surface area of negative atoms)
3. Total charge weighted positive surface area (descriptor 1 multiplied by the total positive charge)
4. Total charge weighted negative surface area (descriptor 2 multiplied by the total negative charge)
5. Atomic charge weighted positive surface area (sum of sasa*charge for all positive atoms)
6. Atomic charge weighted negative surface area (sum of sasa*charge for all negative atoms)
7. Difference in charged surface areas (descriptor 1 - descriptor 2)
8. Difference in total charge weighted surface areas (descriptor 3 - descriptor 4)
9. Difference in atomic charge weighted surface areas (descriptor 5 - descriptor 6)
10-15. Fractional charged partial surface areas (6 descriptors divided by total surface area)
16-21. Surface weighted charged partial surface areas (6 descriptors multiplied by total surface area)
22. Relative positive charge (charge of most positive atom divided by total positive charge)
23. Relative negative charge (charge of most negative atom divided by total negative charge)
24. Relative positive charge surface area (surface area of most positive atom divided by descriptor 22)
25. Relative negative charge surface area (surface area of most negative atom divided by descriptor 23)
26. Total hydrophobic surface area (sum of surface areas of atoms with |charge| < 0.2)
27. Total polar surface area (sum of surface areas of atoms with |charge| > 0.2)
28. Relative hydrophobic surface area (descriptor 26 divided by total surface area)
29. Relative polar surface area (descriptor 27 divided by total surface area)
30. Total solvent-accessible surface area
(http://www.accelrys.com/cerius2/descriptor.html#list)
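Descriptors 1 to 9 and 30 follow directly from per-atom partial charges and solvent-accessible surface areas (sasa). The sketch below computes them from the definitions listed; the function name and the three-atom input are hypothetical, and this is not the Accelrys implementation.

```python
def cpsa_descriptors(atoms):
    """atoms: list of (partial_charge, solvent_accessible_surface_area) per atom.

    Returns a dict keyed by descriptor number, following the listed definitions.
    """
    pos = [(q, a) for q, a in atoms if q > 0]
    neg = [(q, a) for q, a in atoms if q < 0]
    d1 = sum(a for _, a in pos)            # partial positive surface area
    d2 = sum(a for _, a in neg)            # partial negative surface area
    d3 = d1 * sum(q for q, _ in pos)       # total charge weighted positive
    d4 = d2 * sum(q for q, _ in neg)       # total charge weighted negative
    d5 = sum(q * a for q, a in pos)        # atomic charge weighted positive
    d6 = sum(q * a for q, a in neg)        # atomic charge weighted negative
    return {
        1: d1, 2: d2, 3: d3, 4: d4, 5: d5, 6: d6,
        7: d1 - d2, 8: d3 - d4, 9: d5 - d6,
        30: sum(a for _, a in atoms),      # total solvent-accessible surface area
    }

# Hypothetical three-atom example: (charge, sasa).
atoms = [(0.4, 10.0), (-0.3, 20.0), (-0.1, 5.0)]
d = cpsa_descriptors(atoms)
```

Descriptors 10 to 21 would then follow by dividing or multiplying six of these values by descriptor 30.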
Descriptors
• Many broad categories: composition,
topological, electronic, physical-chemical
properties, …
• Thousands of possible descriptors
– Use physical knowledge to choose relevant
ones (e.g., QSAR principle)
– Use numerical methods to choose important
descriptors
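One simple numerical way to pick out important descriptors is to rank them by the absolute value of their correlation with the property. The sketch below uses the Pearson coefficient on made-up data; the descriptor names are hypothetical, and more careful selection schemes exist.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    xm, ym = mean(x), mean(y)
    cov = sum((a - xm) * (b - ym) for a, b in zip(x, y))
    sx = sum((a - xm) ** 2 for a in x) ** 0.5
    sy = sum((b - ym) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def rank_descriptors(table, prop):
    """Sort descriptor names by |correlation| with the property, strongest first."""
    return sorted(table, key=lambda name: abs(pearson(table[name], prop)), reverse=True)

# Hypothetical descriptor columns and property values for five samples.
table = {
    "d_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d_b": [2.0, 1.0, 4.0, 3.0, 5.0],
    "d_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
prop = [1.1, 2.0, 2.9, 4.2, 5.0]
ranking = rank_descriptors(table, prop)
```

Ranking one descriptor at a time ignores redundancy between descriptors, so it is best treated as a first filter.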
Ex 5: Boiling Point Descriptors
(Chalk, et al., J Chem. Inf. Comput. Sci, ‘01)
Ex 5: Atomistic Modeling Methods
Use VAMP – AM1 and PM3 Hamiltonians
– Semi-empirical molecular orbital based
– Quantum mechanical, but matrix elements are
fit to experimental data
– Can calculate optimized geometries,
electronic structure (charge properties)
– Fairly accurate (known failings) and fast
Ex 5: Boiling Points
(Chalk, et al., J Chem. Inf. Comput. Sci, ‘01)
[Figure: results for the training set (6000): 17 (max -119), and the test set (629): 19 (max -94)]
Large errors often due to
• Incorrect experimental measurements of TB (low pressure)
• Incorrect experimental structures (tautomer misidentification)
• Failure of atomistic modeling method (approximation errors)
Ex 6: Crystal Structure Prediction
• Property: Stable crystal structure
• Descriptors→Property relationship:
Neighbor Clustering algorithm (Euclidean
metric)
• Descriptors: Chemical scale (empirically
assigned value for each element) (Pettifor, J. Phys.
C, ’86)
• Data: All intermetallic binary alloys
(thousands)
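As a rough illustration of neighbor-based prediction with a Euclidean metric, the sketch below places each binary compound at a point given by the chemical scale values of its two elements and predicts the structure by a majority vote among the nearest known compounds. The k-nearest-neighbor vote is a stand-in chosen here, and the coordinates are made up, not Pettifor's values.

```python
import math
from collections import Counter

def predict_structure(known, query, k=3):
    """Majority vote among the k known compounds closest to `query`.

    known: list of ((chi_A, chi_B), structure_label); distances are Euclidean.
    """
    nearest = sorted(known, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical chemical-scale coordinates and structure labels.
known = [
    ((0.20, 0.30), "CsCl"), ((0.30, 0.20), "CsCl"), ((0.25, 0.35), "CsCl"),
    ((0.80, 0.90), "NaCl"), ((0.90, 0.80), "NaCl"), ((0.85, 0.75), "NaCl"),
]
guess_1 = predict_structure(known, (0.30, 0.30))
guess_2 = predict_structure(known, (0.80, 0.80))
```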
Structure Maps
(Rodgers, CRYSTMET, ‘03)
[Figure: structure maps with CsCl and NaCl regions labeled]
Ex 6: Crystal Structure Prediction
• Powerful: structure maps can give 90-95% predictive
accuracy
• Many Descriptors: ~50 have been tried based on size,
atomic number, cohesive energy, electrochemistry,
valence electrons
• Can’t be extended: accurate maps require ~40% of the
possible systems to be known (~80% binaries known,
~0.1% quaternaries)
• Can atomistic modeling help?
– Fill in data for multicomponent systems
– Provide optimal descriptors
(Villars, Intermetallic Compounds, ’94)
Conclusions
• Atomistic modeling and data mining can
provide valuable predictive ability when
physical theories are incomplete
• Key issues are data quality, descriptors,
and descriptor→properties relationship
• Dangers of overfitting and tuning
Bible Code
Are these words closer than by chance?
Can the Bible predict future events?
Some say yes (Witztum, et al., Stat. Sci., ’94)
Some say no (McKay, et al., Stat. Sci., ’99)
• Many articles
• >60 books on Bible Codes on Amazon
• 1 major motion picture (Omega Code)
Be careful with
your statistics!
The First and Greatest Example
of Atomic Level Data Mining
END