Feature selection for characterizing HLA class I peptide motif anchors.
Perry G. Ridge1, Hernando Escobar1, Peter E. Jensen1, Julio C. Delgado1, David K. Crockett1,2
1ARUP Laboratories, Department of Pathology, University of Utah School of Medicine, Salt Lake City, UT
2Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, UT 84108
INTRODUCTION
HLA class I peptide motifs have been described by dominant amino acid
residues located in primary anchor positions. For example, the reported motif
for HLA-A*0201 from the SYFPEITHI database is x-[LM]-x-x-x-x-x-x-[VL] [1].
Variations of this nomenclature are also seen in other HLA class I peptide motif
databases such as IMGT/HLA [2]. Patterns of anchor residues have led to the
development of software tools and algorithms for predicting peptide binding
and for screening target organisms or sequences for a given peptide motif.
However, the physical and chemical properties of peptide anchor position
residues that confer allele specificity have not been as well described. For this
study, supervised feature selection was used to identify the physical and
chemical properties that best distinguish A*0201 peptide binders from non-binders.
RESULTS
Features selected using the full training set for anchor 1 and anchor 2
are summarized in Table 1, and results using fivefold cross-validation
are reported below.
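As a rough illustration of how fivefold cross-validation can be combined with feature selection, the sketch below splits sample indices into five folds, runs a selection routine on each training portion, and counts how often each feature is chosen. The function names and the `select` callback are hypothetical and are not taken from the poster or from Weka; the folds are interleaved rather than shuffled or stratified, which is a simplifying assumption.

```python
from collections import Counter
from typing import Callable, List, Sequence


def k_fold_indices(n: int, k: int = 5) -> List[List[int]]:
    """Split range(n) into k interleaved folds of near-equal size."""
    return [list(range(start, n, k)) for start in range(k)]


def selection_counts(
    n: int,
    select: Callable[[List[int]], Sequence[str]],
    k: int = 5,
) -> Counter:
    """Run `select` on each training split and count how often each feature is chosen."""
    counts: Counter = Counter()
    for held_out in k_fold_indices(n, k):
        held = set(held_out)
        # Training indices are everything outside the held-out fold.
        training = [i for i in range(n) if i not in held]
        counts.update(select(training))
    return counts


# Toy selector (hypothetical): picks "f1" whenever sample 0 is in the training split.
toy_counts = selection_counts(10, lambda train: ["f1"] if 0 in train else ["f2"])
```

Features that are chosen in most or all folds would be the ones to report as stable under this scheme.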
Anchor 1
Using fivefold cross-validation, the amino acid properties of
normalized frequency of extended structure (Burgess et al., 1974),
parameter of charge transfer capability (Charton-Charton, 1983), and
relative preference value at C1 (Richardson-Richardson, 1988) best
characterized the residues in anchor 1 (P2).
Anchor 2
The anchor 2 position (Pω), again using fivefold cross-validation, was
best represented by the number of atoms in the side chain labeled
3+1 (Charton-Charton, 1983), parameter of charge transfer donor
capability (Charton-Charton, 1983), normalized frequency of C-terminal
non helical region (Chou-Suzuki, 1976), information measure
for middle turn (Robson-Suzuki, 1976), and amphiphilicity index
(Mitaku et al., 2002).
Table 1. Selected attributes for HLA-A*0201 anchor positions 1 and 2.
Anchor Position | AAIndex Property (a) | Original Reference
Anchor 1 | A parameter of charge transfer donor capability | Charton, 1983
Anchor 1 | Amino acid composition | Dayhoff, 1978
Anchor 1 | Atom based hydrophobic moment | Eisenberg, 1986
Anchor 1 | Partition coefficient | Garel, 1973
Anchor 1 | Polarity | Grantham, 1974
Anchor 1 | Hydrophilicity value | Hopp-Woods, 1981
Anchor 1 | Normalized frequency value of alpha-helix with weights | Levitt, 1978
Anchor 1 | AA composition of total proteins | Nakashima, 1990
Anchor 1 | Normalized frequency of beta-sheet in all-beta class | Palau, 1981
Anchor 1 | Weights for alpha-helix at the window position of 3 | Qian-Sejnowski, 1988
Anchor 1 | Average relative fractional occurrence in E0(i) | Rackovsky-Scheraga, 1982
Anchor 1 | Relative preference value at C-cap | Richardson, 1988
Anchor 1 | Normalized positional frequency at helix termini N4 | Aurora-Rose, 1998
Anchor 1 | Volumes including crystallographic waters using ProtOr | Tsai, 1999
Anchor 2 | The number of bonds in the longest chain | Charton, 1983
Anchor 2 | Average volume of buried residue | Chothia, 1975
Anchor 2 | Normalized frequency of N-terminal beta-sheet | Chou-Fasman, 1978
Anchor 2 | Conformational preference for parallel beta-strands | Lifson-Sander, 1979
Anchor 2 | AA composition of mt-proteins from fungi and plant | Nakashima, 1990
Anchor 2 | Information measure for C-terminal turn | Robson-Suzuki, 1976
Anchor 2 | Volumes including crystallographic waters using ProtOr | Tsai, 1999
Figure 1. Common HLA-A*0201 motif. Anchor 1 and Anchor 2 were characterized using
AAIndex Properties (v9.4).
References:
1. Rammensee, H.G., T. Friede, and S. Stevanović, MHC ligands and peptide motifs: first listing. Immunogenetics, 1995. 41(4): p. 178-228.
2. Robinson, J., et al., IMGT/HLA database--a sequence database for the human major histocompatibility complex. Tissue Antigens, 2000. 55(3): p. 280-7.
3. Peters, B., et al., The immune epitope database and analysis resource: from vision to blueprint. PLoS Biol, 2005. 3(3): p. e91.
4. Kawashima, S. and M. Kanehisa, AAindex: amino acid index database. Nucleic Acids Res, 2000. 28(1): p. 374.
5. Hall, M.A., Correlation-based feature selection for discrete and numeric class machine learning, in Computer Science Working Papers. 2000, University of Waikato, Department of Computer Science: Hamilton, New Zealand.
6. Witten, I.H. and E. Frank, Data Mining: Practical machine learning tools and techniques. 2nd ed. 2005, San Francisco: Morgan Kaufmann.
METHODS
A publicly available data set of A*0201 binding peptides (n=1181)
and non-binding peptides (n=1908) was downloaded from the
Immune Epitope Database (IEDB) [3]. Amino acid residues at the anchor
positions (P2 and Pω) were characterized using values of 544
physical, chemical, conformational, or energetic properties (AAindex
v9.4) [4].
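To make the anchor positions concrete, the following minimal sketch pulls P2 (the second residue) and Pω (the C-terminal residue) out of a peptide string, and checks a peptide against the SYFPEITHI-style motif x-[LM]-x-x-x-x-x-x-[VL] quoted in the Introduction. The function names and the two example peptides are hypothetical and are not drawn from the IEDB data set; the regular expression assumes 9-residue peptides.

```python
import re
from typing import Tuple

# x-[LM]-x-x-x-x-x-x-[VL] written as a regular expression for 9-residue peptides.
A0201_MOTIF = re.compile(r".[LM].{6}[VL]")


def anchor_residues(peptide: str) -> Tuple[str, str]:
    """Return (P2, P-omega): the second residue and the C-terminal residue."""
    if len(peptide) < 2:
        raise ValueError("peptide must contain at least two residues")
    return peptide[1], peptide[-1]


def matches_motif(peptide: str) -> bool:
    """True if the whole peptide fits the 9-mer anchor motif."""
    return A0201_MOTIF.fullmatch(peptide) is not None


# Hypothetical peptides, for illustration only.
peptides = ["ALAAAAAAV", "AKAAAAAAD"]
anchors = [anchor_residues(p) for p in peptides]
motif_hits = [matches_motif(p) for p in peptides]
```

Taking the last residue rather than a fixed position 9 keeps the Pω extraction valid for peptides of other lengths, although the motif check itself is limited to 9-mers.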
Properties downloaded from the AAindex
(http://www.genome.jp/aaindex/) were each represented
numerically (each amino acid had a numerical value for each
property). Where no value existed for a particular amino
acid/property combination, a value of zero was assigned. We created
input files for the next processing step using a simple Java
program. Each amino acid in the anchor positions was assigned the
numerical value given in the reported AAindex properties table.
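The encoding step described above can be sketched as follows. This is an illustrative Python version, not the authors' Java program: the property names and values are placeholders, missing values are represented as `None`, and the output is written in Weka's ARFF text format on the assumption that this is the kind of input file the next step would read.

```python
from typing import Dict, List, Optional

PropertyTable = Dict[str, Dict[str, Optional[float]]]


def encode_residue(residue: str, properties: PropertyTable) -> List[float]:
    """Map one residue to its property values, using 0.0 where no value exists."""
    vector = []
    for name in sorted(properties):
        value = properties[name].get(residue)
        vector.append(0.0 if value is None else float(value))
    return vector


def to_arff(relation: str, names: List[str], rows: List[List[float]], labels: List[str]) -> str:
    """Render numeric rows plus a nominal class column as ARFF text."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in names]
    lines.append("@attribute class {binder,nonbinder}")
    lines += ["", "@data"]
    for values, label in zip(rows, labels):
        lines.append(",".join(f"{v:g}" for v in values) + f",{label}")
    return "\n".join(lines) + "\n"


# Placeholder property values (hypothetical, not real AAindex entries).
properties: PropertyTable = {
    "hypothetical_prop_1": {"L": 1.5, "K": None},
    "hypothetical_prop_2": {"L": -0.3, "K": 2.0},
}
names = sorted(properties)
rows = [encode_residue("L", properties), encode_residue("K", properties)]
arff_text = to_arff("anchor1", names, rows, ["binder", "nonbinder"])
```

One consequence of the zero-fill rule is that a missing value cannot be told apart from a true value of zero, which is worth keeping in mind when interpreting the selected properties.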
For each anchor position, the Correlation-based Feature Subset
Selection algorithm [5], together with the Best First (greedy
hill-climbing) search method, was used to identify the subset of
properties that best distinguished binders from non-binders.
Attribute selection algorithms were implemented using the Weka
software package v3.6 [6].
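The idea behind correlation-based feature subset selection can be sketched in a few lines: a subset scores well when its features correlate with the class but not with each other. The code below is a simplified stand-in, not Weka's implementation. It assumes Pearson correlation as the correlation measure and uses plain greedy forward selection without the backtracking that a full best-first search allows; all names and the toy data are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)


def cfs_merit(subset: List[int], columns: List[List[float]], labels: List[float]) -> float:
    """Merit = k*r_cf / sqrt(k + k*(k-1)*r_ff), using mean absolute correlations."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[i], labels)) for i in subset) / k
    pairs = [(i, j) for pos, i in enumerate(subset) for j in subset[pos + 1:]]
    r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(columns: List[List[float]], labels: List[float]) -> List[int]:
    """Add the feature that most improves merit until no addition helps."""
    selected: List[int] = []
    best = 0.0
    while True:
        candidates = [i for i in range(len(columns)) if i not in selected]
        if not candidates:
            break
        scored = [(cfs_merit(selected + [i], columns, labels), i) for i in candidates]
        # Ties on merit are broken in favor of the lowest feature index.
        merit, index = max(scored, key=lambda s: (s[0], -s[1]))
        if merit <= best:
            break
        selected.append(index)
        best = merit
    return selected


# Toy data: feature 0 tracks the class, feature 1 duplicates it, feature 2 is unrelated.
toy_labels = [1.0, 1.0, 0.0, 0.0]
toy_columns = [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
toy_selected = greedy_cfs(toy_columns, toy_labels)
```

In the toy data the duplicated feature is not added because it raises feature-feature correlation without raising class correlation, which is the redundancy penalty the merit formula is designed to apply.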
(a) Accessed March 2010 from http://www.genome.jp/aaindex/
CONCLUSIONS
Supervised feature selection was used to characterize the prominent physical and chemical
properties of the anchor residues that underlie HLA-A*0201 allele specificity. Ongoing efforts
include allele representation and binding prediction algorithms for different HLA class I
subtypes.