Indiana University School of Informatics


Grids and the
School of Informatics at
Indiana University
Sun Yat-sen University
Guangzhou China
November 4 2006
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University Bloomington IN 47401
[email protected]
http://www.infomall.org
The Central Goal
of Informatics
data
information
knowledge
What is Informatics?


Informatics is the integration of the art,
science, and the human dimensions of
information technology to provide solutions
to discipline-specific problems
Informatics is a response to the
data/information/knowledge gaps (data
deluge) caused by “billions and billions of
bits”
• Grids are technology supporting this in
distributed research
Bioinformatics Data Deluge
Challenge and Opportunity
Year   Experiments    Genes          Data
1985   1 experiment   1 gene         10 data
2000   1 experiment   10,000 genes   10,000,000 data
Tech Centered Informatics
Computer &
Information Science
including Web, Text,
Data Mining
Domain Centered Informatics
Bio-, Health-, Chemical-, Music-, etc.
Informatics, e-Science, Complex systems,
Modeling, Simulation
Technology
Content
People
Human Centered Informatics
Human Computer Interaction,
New Media, Social/Organizational
Informatics, Security
School of Informatics Programs
B.S.

Computer Science (IUB)

Informatics (IUB/IUPUI/IUSB)

New Media: Media Arts and
Science (IUPUI)

Health Information Administration
(IUPUI)
M.S.

Computer Science (IUB)

New Media: Media Arts and
Science (IUPUI)

Human Computer Interaction
(IUB/IUPUI)

Bioinformatics (IUB/IUPUI)

Chemical Informatics (IUB/IUPUI)

Music Informatics (IUB)

Laboratory Informatics (IUPUI)

Health Informatics (IUPUI)

Cybersecurity (IUB)
• Indiana University has 8 separate campuses
• School currently at 3 of 8 campuses
• Largest Campuses:
– IUB Bloomington
– IUPUI Indianapolis
Ph.D.

Computer Science (IUB)

Informatics (IUB/IUPUI)
IUB Faculty with One or More of Degrees
Listed -- undergrad or grad -- of 65 total faculty
CS 40
Math 7
Chemistry 4
Hist.of Sci./Tech. 3
Philosophy 2
EE 3
Biology 2
Comp. Lit. 1
Anthropology 1
Music 2
Journalism 1
Library/Info Science 2
Linguistics 1
Physics 5
Psychology 2
Mathematics 5
Design 1
Cog. Sci. 2
Aero. Engineering 1
Public Policy 1
Undergraduate Profile– Bloomington








Informatics Majors (BS): 382 students
Computer Science (BS and BA): 135 students
Women: 13%
International Students: 8%
Number of Undergraduates Statewide: 1,250
Average Starting Salary: $42,000
Placement rate: 90%
Note BA in Computer Science administered by the College of Arts
and Sciences
e-moreorlessanything and the Grid







‘e-Science is about global collaboration in key areas of science,
and the next generation of infrastructure that will enable it.’ from
its inventor John Taylor, Director General of Research Councils
UK, Office of Science and Technology
e-Science is about developing tools and technologies that allow
scientists to do ‘faster, better or different’ research
Similarly e-Business captures an emerging view of corporations as
dynamic virtual organizations linking employees, customers and
stakeholders across the world.
• The growing use of outsourcing is one example
The Grid provides the information technology e-infrastructure for
e-moreorlessanything.
A deluge of data of unprecedented and inevitable size must be
managed and understood.
People, computers, data and instruments must be linked.
On demand assignment of experts, computers, networks and
storage resources must be supported
Why Grids/ Cyberinfrastructure Useful







Supports distributed science – data, people, computers
Exploits Internet technology (Web2.0) adding management,
security, supercomputers etc.
It has two aspects: parallel – low latency (microseconds)
between nodes and distributed – highish latency
(milliseconds) between nodes
Parallel needed to get high performance on individual 3D
simulations, data analysis etc.; must decompose problem
Distributed aspect integrates already distinct components
Cyberinfrastructure is in general a distributed collection of
parallel systems
Grids are made of services that are “just” programs or data
sources packaged for distributed access
TeraGrid: Integrating NSF Cyberinfrastructure
Buffalo
Wisc
UC/ANL
Utah
Cornell
Iowa
PU
NCAR
IU
NCSA
Caltech
PSC
ORNL
USC-ISI
UNC-RENCI
SDSC
TACC
TeraGrid is a facility that integrates computational, information, and analysis resources at the
San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of
Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications,
Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh
Supercomputing Center, and the National Center for Atmospheric Research.
Today 100 Teraflop; tomorrow a petaflop; Indiana 20 teraflop today.
Virtual Observatory Astronomy Grid
Integrate Experiments
Radio
Far-Infrared
Visible
Dust Map
Visible + X-ray
Galaxy Density Map
Grid Capabilities for Science

Open technologies for any large scale distributed system that is adopted by
industry, many sciences and many countries (including UK, EU, USA, Asia)
• Security, Reliability, Management and state standards
Service and messaging specifications
User interfaces via portals and portlets virtualizing to desktops, email,
PDAs etc.
• ~20 TeraGrid Science Gateways (their name for portals)
• OGCE Portal technology effort led by Indiana
Uniform approach to access distributed (super)computers supporting single
(large) jobs and spawning lots of related jobs
Data and meta-data architecture supporting real-time and archives as well
as federation
• Links to Semantic web and annotation
Grid (Web service) workflow with standards and several successful
instantiations (such as Taverna and MyLead)
Many grids including Bioinformatics Chemistry and Earth Science

http://www.nsf.gov/od/oci/ci-v7.pdf






APEC Cooperation for Earthquake Simulation

ACES is a seven-year-long collaboration among scientists
interested in earthquake and tsunami prediction
• iSERVO is Infrastructure to support
work of ACES
• SERVOGrid is (completed) US Grid that is
a prototype of iSERVO
• http://www.quakes.uq.edu.au/ACES/

Chartered under APEC –
the Asia Pacific Economic
Cooperation of 21 economies
Grid of Grids: Research Grid and Education Grid
[Figure: SERVOGrid architecture. Labels: Repositories and Federated Databases; Sensors and Streaming Data; Field Trip Data; Sensor Grid; Database Grid; Compute Grid; Data Filter Services; Research Simulations; GIS, Discovery Grid Services; Customization Services (From Research to Education); Analysis and Visualization; Portal; Education Grid; Computer Farm.]
SERVOGrid and Cyberinfrastructure


Grids are the technology based on Web services that implement
Cyberinfrastructure i.e. support eScience or science as a team
sport
• Internet scale managed services that link computers, data
repositories, sensors, instruments and people
There is a portal and services in SERVOGrid for
• Applications such as GeoFEST, RDAHMM, Pattern
Informatics, Virtual California (VC), Simplex, mesh
generating programs …..
• Job management and monitoring web services for running
the above codes.
• File management web services for moving files between
various machines.
• Geographical Information System services
• Quaketables earthquake specific database
• Sensors as well as databases
• Context (dynamic metadata) and UDDI system long term
metadata services
• Services support streaming real-time data
[Figure: Site-specific Irregular Scalar Measurements and Constellations for Plate Boundary-Scale Vector Measurements. Labels: Ice Sheets; Volcanoes; PBO; Greenland; Long Valley, CA; Topography; 1 km; Stress Change; Northridge, CA; Earthquakes; Hector Mine, CA.]
Some Grid Concepts I


Services are “just” (distributed) programs sending and
receiving messages with well defined syntax
Interfaces (input-output) must be open; innards can be
open source (allowing you to modify) or proprietary
• Services can be any language from Fortran, Shell scripts, C,
C#, C++, Java, Python, Perl – your choice!!
• Web Services supported by all vendors (IBM, Microsoft …)

Service overhead will be just a few milliseconds (more
now) which is < typical network transit time
• Any program that is distributed can be a Web service
• Any program taking execution time ≥ 20ms can be an
efficient Web service
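As a rough worked example of the overhead argument above: if each call carries a fixed service overhead, the useful fraction of the total time is execution time divided by execution time plus overhead. The 2 ms overhead used below is an illustrative assumption, not a measurement from the talk, and `service_efficiency` is a hypothetical helper name.

```python
def service_efficiency(execution_ms: float, overhead_ms: float) -> float:
    """Fraction of wall-clock time spent in the program itself,
    given a fixed per-call service overhead (both in milliseconds)."""
    if execution_ms < 0 or overhead_ms < 0:
        raise ValueError("times must be non-negative")
    total = execution_ms + overhead_ms
    return execution_ms / total if total else 0.0


# Assumed numbers: 2 ms overhead; programs running 1 ms, 20 ms and 1 s.
for run_ms in (1.0, 20.0, 1000.0):
    eff = service_efficiency(run_ms, overhead_ms=2.0)
    print(f"{run_ms:7.1f} ms program -> {eff:.1%} efficient")
```

Under this assumption a 20 ms program spends roughly nine tenths of its time on useful work, which is the sense in which it can be an efficient Web service.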
Web services

Programs
Computational resources
service logic
BPEL, Java, .NET
Databases
resources
Humans
<env:Envelope>
<env:Header>
...
</env:Header>
<env:Body>
...
</env:Body>
</env:Envelope>
SOAP messages
message processing

Web Services build loosely-coupled, distributed applications (wrapping existing codes and databases) based on SOA (service oriented architecture) principles.
Web Services interact by exchanging messages in SOAP format.
The contracts for the message exchanges that implement those interactions are described via WSDL interfaces.
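To make the envelope structure above concrete, here is a minimal sketch that builds a SOAP message with Python's standard library. It assumes the SOAP 1.2 envelope namespace; the operation name `getMolecularWeight` and the namespace `http://example.org/cheminfo` are hypothetical and stand in for whatever a real WSDL contract would define.

```python
import xml.etree.ElementTree as ET

# SOAP 1.2 envelope namespace; the "env" prefix matches the slide.
SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"
ET.register_namespace("env", SOAP_ENV)


def build_envelope(operation: str, params: dict, service_ns: str) -> bytes:
    """Build a SOAP envelope with an empty Header and one Body element."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    ET.SubElement(envelope, f"{{{SOAP_ENV}}}Header")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{service_ns}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical operation and namespace, for illustration only.
message = build_envelope(
    "getMolecularWeight", {"smiles": "CCC"}, "http://example.org/cheminfo"
)
print(message.decode("utf-8"))
```

In practice a client library would generate such messages from the WSDL description; building one by hand only shows that the message is ordinary XML.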
SOAP and WSDL

Devices
A typical Web Service


In principle, services can be in any language (Fortran .. Java ..
Perl .. Python) and the interfaces can be method calls, Java RMI
Messages, CGI Web invocations, totally compiled away (inlining)
The simplest implementations involve XML messages (SOAP) and
programs written in net friendly languages like Java and Python
[Figure: a typical Web Service. Labels: Portal; Service; Security; WSDL interfaces; Web Services; Payment; Credit Card; Catalog; Warehouse; Shipping control.]
Some Grid Concepts II

Systems are built from contributions from many different groups
– you do not need one “vendor” for all components as Web
services allow interoperability between components
• One reason DoD likes Grids (called Net-Centric computing)

Grids are distributed in services and data allowing anybody to
store their data and to produce “their” view
• Some think that University Library of future will curate/store data of
their faculty



“2 level programming model”: Classic programming of services
and services are composed using workflow consistent with
industry standards (BPEL)
Grid of Grids: (System of Systems) Realistically Grid-like
systems will be built using multiple technologies and “standards”
–integrate separate Grids for Sensors, GIS, Visualization,
computing etc. with OGSA (Open Grid Service Architecture
from OGF) system Grid (Security, registry) into a single Grid
Existing codes UNCHANGED; wrap as a service with metadata
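One way to picture "wrap as a service with metadata" is a thin wrapper that runs the existing executable untouched and returns its output together with descriptive metadata. The sketch below is an assumption about how such a wrapper might look, not the SERVOGrid implementation; `run_wrapped` is a hypothetical name and the "legacy code" is a one-line stand-in program.

```python
import json
import subprocess
import sys
import time


def run_wrapped(command: list, stdin_text: str = "") -> dict:
    """Run an existing program unchanged and return its output plus metadata."""
    started = time.time()
    result = subprocess.run(
        command, input=stdin_text, capture_output=True, text=True, check=False
    )
    return {
        "metadata": {
            "command": command,
            "started_at": started,
            "elapsed_s": time.time() - started,
            "exit_code": result.returncode,
        },
        "stdout": result.stdout,
        "stderr": result.stderr,
    }


# Stand-in for a legacy code: a tiny program that sums numbers read from stdin.
legacy = [
    sys.executable,
    "-c",
    "import sys; print(sum(map(float, sys.stdin.read().split())))",
]
response = run_wrapped(legacy, "1.5 2.5")
print(json.dumps(response["metadata"], indent=2))
print(response["stdout"].strip())
```

A real deployment would expose `run_wrapped` behind a message interface (for example SOAP) and let a workflow engine compose it with other services; the wrapped program itself never changes.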
TeraGrid User Portal
LEAD Gateway Portal
NSF Large ITR and Teragrid Gateway
- Adaptive Response to Mesoscale
weather events
- Supports Data exploration, Grid Workflow
Background: Earthquake Forecast – Published Feb 19, 2002, in PNAS.
(JB Rundle et al., PNAS, v99, Suppl. 1, 2514-2521, Feb 19, 2002; KF Tiampo et al., Europhys. Lett., 60, 481-487, 2002; JB Rundle et al., Rev. Geophys. Space Phys., 41(4), DOI 10.1029/2003RG000135, 2003. http://quakesim.jpl.nasa.gov)
D.T. => “false alarms” vs. “failures to predict”
6≤M
5≤M≤6
Plot of Log10 (Seismic Potential)
Increase in Potential for significant events, ~ 2000 to 2010
Eighteen significant earthquakes (M > 4.9;
blue circles) have occurred in Central or
Southern California. Margin of error of the
anomalies is +/- 11 km; Data from S. CA.
and N. CA catalogs:
After the work was completed
1. Big Bear I, M = 5.1, Feb 10, 2001
2. Coso, M = 5.1, July 17, 2001
After the paper was in press ( September 1, 2001 )
3. Anza I, M = 5.1, Oct 31, 2001
After the paper was published ( February 19, 2002 )
4. Baja, M = 5.7, Feb 22, 2002
5. Gilroy, M=4.9 - 5.1, May 13, 2002
6. Big Bear II, M=5.4, Feb 22, 2003
7. San Simeon, M = 6.5, Dec 22, 2003
8. San Clemente Island, M = 5.2, June 15, 2004
9. Bodie I, M=5.5, Sept. 18, 2004
10. Bodie II, M=5.4, Sept. 18, 2004
11. Parkfield I, M = 6.0, Sept. 28, 2004
12. Parkfield II, M = 5.2, Sept. 29, 2004
13. Arvin, M = 5.0, Sept. 29, 2004
14. Parkfield III, M = 5.0, Sept. 30, 2004
15. Wheeler Ridge, M = 5.2, April 16, 2005
16. Anza II, M = 5.2, June 12, 2005
17. Yucaipa, M = 4.9 - 5.2, June 16, 2005
18. Obsidian Butte, M = 5.1, Sept. 2, 2005
CL#03-2015
Color Scale  Decision Threshold
ACES Components
Columns: Country and/or Economies; Data (shared as part of a collaboration); Earthquake Forecast/Model; Wave Motion; Infrastructure; Institutions
Australia: Seismic data, fault database, GPS; Finley, LSM; PANDAS prototype; Access
Canada: Polaris Radarsat; Pattern Informatics
P.R. China: Seismic GPS; LURR; CAS; China National Grid
Japan: GPS, Seismic, Daichi (InSAR); GeoFEM; JST-CREST; Earth Simulator; Naregi
Chinese Taipei: FORMOSAT3/COSMIC (F/C)
U.S.A.: QuakeTables, Seismic, InSAR, PBO (GPS); Pattern Informatics; ALLCAL; GeoFEST, PARK, VirtualCalifornia; TeraShake; SERVOGrid; GEON; SCECGrid; Vlab
International: IMS; Pacific Rim Universities (APRU); PRAGMA
Grid Workflow Datamining in Earth Science

NASA GPS

Work with Scripps Institute
Grid services controlled by workflow process real time
data from ~70 GPS Sensors in Southern California
Earthquake
Streaming Data
Support
Transformations
Data Checking
Hidden Markov
Datamining (JPL)
Display (GIS)
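The stages named above (transformations, data checking, datamining) can be pictured as a chain of filters over a stream of readings. The sketch below is a simplified illustration with made-up data: the reading format, thresholds and function names are assumptions, and the Hidden Markov datamining stage is replaced by a plain mean and standard-deviation outlier test to keep the example short.

```python
from statistics import mean, pstdev
from typing import Iterable, Iterator, List, Tuple

Reading = Tuple[str, float]  # (station id, displacement in mm); hypothetical format


def transform(stream: Iterable[Reading], offset_mm: float) -> Iterator[Reading]:
    """Transformation stage: remove a constant offset from each reading."""
    for station, value in stream:
        yield station, value - offset_mm


def check(stream: Iterable[Reading], max_abs_mm: float) -> Iterator[Reading]:
    """Data-checking stage: drop readings outside a plausible range."""
    for station, value in stream:
        if abs(value) <= max_abs_mm:
            yield station, value


def flag_outliers(stream: Iterable[Reading], n_sigma: float) -> List[Reading]:
    """Stand-in for the datamining stage: flag values far from the mean."""
    readings = list(stream)
    if not readings:
        return []
    values = [v for _, v in readings]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [r for r in readings if abs(r[1] - mu) > n_sigma * sigma]


# Made-up readings; station "D" is a glitch, station "E" a genuine excursion.
raw = [("A", 10.0), ("B", 10.2), ("C", 9.8), ("D", 500.0), ("E", 14.0), ("F", 10.0)]
flagged = flag_outliers(
    check(transform(raw, offset_mm=10.0), max_abs_mm=50.0), n_sigma=1.5
)
print(flagged)
```

Each stage only consumes and produces a stream, so in a Grid setting each could be deployed as a separate service and connected by the workflow.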
Grid Workflow Data Assimilation in Earth Science

Grid services triggered by abnormal events and controlled by workflow process real
time data from radar and high resolution simulations for tornado forecasts
Use a Portlet-based user portal to access
and control services and workflow
China National Grid
Beijing
Xi’an
Shanghai
Hefei
Changsha
From Qian Depei Beihang
Hong Kong
New drug discovery grid


Undertaken by Shanghai Institute of
Materia Medica CAS
Compound screening for new drug
discovery




Speed up the process by computer
simulation
Higher accuracy
Using HPC in P2P mode
New drug for diabetes is under
development and will enter clinical
testing by the end of 2005
New Drug Discovery Grid Platform
[Figure: New Drug Discovery Grid platform. Labels: Shenwei PC cluster; Dawning 4000A; Shanghai SCC; main server; Shenwei cluster; Shanghai Institute of Materia Medica CAS; Shenwei PC cluster; Beijing Medical Institute; PC; DDG Portal.]
Bio-informatics Grid


Undertaken by Genomics &
Bioinformatics Institute, CAS
Provide computing, data, and
information grids for bioinformation research in the
country
ChinaGrid (from Hai Jin)
Huazhong University of Science and Technology, Wuhan, China
ChinaGrid in a Nutshell
• China Education and Research Grid
• Funded by Ministry of Education
• As the pilot grid application supported by China
National Grid (CNGrid)
• Based on CERNET (China Education and Research
Network)
• First Phase
– From 2003-2005
– 12 key universities initially
– 20 key universities now
Architecture of
Medical Image Processing Grid
Bioinformatics Grid
BioGrid Applications
• Protein target selection for rice genome
• Multi-sequence alignment for ganoderma family
• Gene joint for white mice
• Cardiovascular disease research
Chemical Informatics and Cyberinfrastructure
Collaboratory CICC Grid Vision







Drug Discovery and other academic chemistry and pharmacology
research will be aided by powerful modern information technology
ChemBioGrid set up as distributed cyberinfrastructure in eScience model
ChemBioGrid will provide portals (user interfaces) to distributed
databases, results of high throughput screening instruments, results of
computational chemical simulations and other analyses
ChemBioGrid will provide services to manipulate this data and combine in
workflows; it will have convenient ways to submit and manage multiple
jobs
ChemBioGrid will include access to PubChem, PubMed, PubMed Central,
the Internet and its derivatives like Microsoft Academic Live and Google
Scholar
The services include open-source software like CDK, commercial code from
vendors such as BCI, OpenEye, Gaussian and Google, and any user
contributed programs
ChemBioGrid will define open interfaces to use for a particular type of
service allowing plug and play choice between different implementations
http://www.chembiogrid.org
Formal Cheminformatics Courses

I571 Chemical Information Technology (3 cr.)
• Distance Ed section had 10 students in Fall 2005, from California to
Connecticut





I572 Computational Chemistry and Molecular Modeling (3 cr.)
I573 Programming Techniques for Chemical and Life Science
Informatics (3 cr.)
I553 Independent Study in Chemical Informatics (3 cr.)
Above courses required for the new Graduate Certificate
Program in Chemical Informatics
I533 Seminar in Chemical Informatics
• Spring 2006 Topic: Molecular Informatics, the Data Grid, and an
Introduction to eScience
• http://www.indiana.edu/~cheminfo/I533/533home.html

I647 Seminar in Chemical Informatics
• Fall 2006 Topic: Bridging Bioinformatics and Chemical Informatics
• http://www.indiana.edu/~cheminfo/I647/647home.html
Related Courses






L519 Bioinformatics: Theory and Application (3 cr.) (at
IUPUI: CSCI 548)
L529 Bioinformatics in Molecular Biology and
Genetics: Practical Applications (4 cr.) (not offered at
IUPUI)
I619 Structural Bioinformatics (3 cr.)
I617 Informatics in Life Sciences and Chemistry (3 cr.)
(for non-majors)
B649 Topics in Systems: Service Architectures and
Science (3 cr.)
I590 Topics in Informatics: Scientific Applications of
XML (IUPUI)
Total Grad Enrollment: Chem-, Lab,
Bio-, Health Informatics, Fall 2005
Red = Chem, Fall 2006
MS      Chem   Lab   Bio   Health
IUB     3/3    0     38    0
IUPUI   6/3    15    34    36
TOTAL   9/6    15    72    36

PhD     Chem   Lab   Bio   Health
IUB     1/3    0     3     0
IUPUI   0/1    0     4     3
TOTAL   1/4    0     7     3
CICC Prototype Web Services
Basic cheminformatics
Molecular weights
Molecular formulae
Tanimoto similarity
2D Structure diagrams
Molecular descriptors
3D structures
InChI generation/search
CMLRSS
Application based services
Compare (NIH)
Toxicity predictions (ToxTree)
Literature extraction (OSCAR3)
Clustering (BCI Toolkit)
Docking, filtering, ... (OpenEye)
Varuna simulation
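Of the basic services listed above, Tanimoto similarity is simple enough to sketch directly: for two fingerprints A and B it is the number of bits set in both divided by the number of bits set in either. The fingerprints below are made-up sets of bit positions, not output from CDK or any CICC service, and the handling of two empty fingerprints is a convention chosen for this example.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient |A & B| / |A | B| of two sets of 'on' bit positions."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


# Hypothetical fingerprints: positions of set bits, not real CDK output.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8, 9, 12}
print(tanimoto(fp1, fp2))  # 3 shared bits out of 6 distinct bits -> 0.5
```

A similarity-search service would apply this function between a query fingerprint and every fingerprint in a database, returning the compounds above a chosen threshold.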
Key Ideas
Add value to PubChem with additional distributed services
and databases
• Wrapping existing code in web services is not difficult
• Provide “core” (CDK) services and exemplars of typical tools
• Provide access to key databases via a web service interface
• Provide access to major Compute Grids

Next steps?
Define WSDL interfaces to enable global production of
compatible Web services; refine CML
• Ready to try “Prototype Production”
• Develop more training material
• Refine/go into production with key services including
tools, workflows and TeraGrid style simulations in capacity
and capability modes
• In-house algorithm work for new services in clustering,
diversity analysis, QSAR methodologies

Web Service Locations
Indiana University

Clustering

VOTables

OSCAR3

Toxicity classification

Database services
Cambridge University

InChI generation / search

CMLRSS

OpenBabel
SDSC
Typical
TeraGrid Site
InfoChem

SPRESI
database
NIH
PubChem …..
Compare …..
Penn State University
CDK based services

Fingerprints

Similarity calculations

2D structure diagrams

Molecular descriptors
Workflows Using Chemical Literature
Find similar
documents
Bulk download of
Pubmed abstracts
OSCAR3
program
All of PubMed “just” takes about a day to run through OSCAR3 on 2048 node Big Red
Extract chemical
structures
Find similar
molecules
PDBBind
OSCAR3
Service
PubChem
Local DTP
database
SMILES   NAME      Pubmed ID
CCC      propane   1425356
CC       ethane    3546453
.....    .......   .......
Clustering of documents linked to clustering of chemicals
Searchable
(structure/similarity)
Grid database
Large Scale Calculations on “All of PubChem/Med”



TeraGrid: 100 Teraflop now to 1000 Teraflop next year
IU 2048 node Big Red supercomputer: 20 Teraflop today
The CDK can currently calculate approx. 107 Descriptors

Whole of PubChem (6M compounds) – 276 hours, 1 CPU

On IU's Big Red, 2048 CPUs, 20 TF: < 7 minutes
Even increasing the descriptor count by 5 times gives us < 35 minutes
of compute time on Big Red


OSCAR3 takes a few seconds per abstract to text-mine all
compounds in it



All of PubMed would take < a day on Big Red
Cleanup and Iteration would take some time
Can pre-calculate properties of smaller compounds using CDK
(logP, BCUT, CPSA, …) and programs like GAMESS

100,000 compounds take < a week each on a single CPU and would be
a practical computation over next year
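The descriptor timing above can be checked with simple arithmetic. The sketch below assumes perfect linear speed-up on CPUs identical to the one used for the 276-hour measurement; that assumption is not stated in the talk. Under it the estimate is about 8 minutes rather than the quoted < 7 minutes, so the quoted figure presumably reflects somewhat faster CPUs on Big Red, which is a guess rather than something the source confirms.

```python
def ideal_parallel_minutes(cpu_hours: float, n_cpus: int, work_factor: float = 1.0) -> float:
    """Wall-clock minutes assuming perfect linear speed-up on identical CPUs."""
    return cpu_hours * work_factor * 60.0 / n_cpus


# 276 single-CPU hours for all of PubChem, spread over 2048 CPUs.
base = ideal_parallel_minutes(276, 2048)
# Same calculation with five times as many descriptors.
five_x = ideal_parallel_minutes(276, 2048, work_factor=5)
print(f"{base:.1f} min, {five_x:.1f} min")
```

Either way the conclusion of the slide holds: a calculation that takes more than eleven days on one CPU becomes a matter of minutes on a machine of this size.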
Prototype CICC Project: Controlling the TGFb pathway
Collaboration between Baik & Zhang at IU
in-house Molecules in Varuna
QM Database
Simulations
VARUNA
TeraGrid
Supercomputers
“Flocks”
Web Service to
generate custom
force fields
AutoGeFF
Can afford
few ms overhead!
Conceptual
Understanding of TGFb
Inhibition
Inactive TGFb
1IAS
Active TGFb
With inhibitor
PubChem
Questions:
- What molecular feature
controls inhibitor binding?
PDB
Experiments
in the Zhang
Lab
- How do mutations impact
binding?
MLSCN Post-HTS Biology Decision Support
Percent Inhibition or
IC50 data is retrieved
from HTS
Question: Was this
screen successful?
Workflows encoding plate
& control well statistics,
distribution analysis, etc
Question: What should the
active/inactive cutoffs be?
Workflows encoding
distribution analysis of
screening results
Question: What can we learn
about the target protein or cell
line from this screen?
Workflows encoding
statistical comparison of
results to similar screens,
docking of compounds
into proteins to correlate
binding with activity,
literature search of active
compounds, etc
Compounds submitted to
PubChem
PROCESS
CHEMINFORMATICS
Grids can link data
analysis (e.g. image
processing developed in
existing Grids),
traditional Cheminformatics tools, as well
as annotation tools
(Semantic Web,
del.icio.us) and enhance
lead ID and SAR analysis
A Grid of Grids linking
collections of services at
PubChem
ECCR centers
MLSCN centers
GRIDS
MLSCN Data - How services and workflows are used
Data is stored in
Pubchem
PubChem interfaces to
workflows via SOAP
MLSCN submits HTS data
to Pubchem and/or sends
directly to workflow for
real-time feedback
End-user applications and
interfaces utilize the
information streams from
the workflows for human
interaction with the data
and analysis
Workflows perform different kinds of
analysis on the MLSCN data, including
SAR, clustering, literature searching,
protein searching, toxicity testing, etc…
Example HTS workflow: finding cell-protein relationships
A protein implicated in tumor growth with known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex).
The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand.
Similar structures to the ligand can be browsed using client portlets.
Similar structures are filtered for druggability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.
Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet.
Docking results and activity patterns are fed into R services for building of activity models and correlations (Least Squares Regression, Random Forests, Neural Nets).
Protein Function
ubiquitination site
Automated functional annotation
• Prediction of global functional class
– molecular function
– biological process
– cellular localization
• Prediction of residue based annotation
– post-translational modifications
– binding sites
– active sites
– deleterious mutations (disease implications)
Molecular function:
transcription regulator activity
(GO:0030528)
• Inferences made from
– amino acid sequence
– protein 3D structure
– evolutionary data
– protein-interaction (network) data
Predrag Radivojac
www.informatics.indiana.edu/predrag
Proteomics
Approaches based on MS/MS
• Peptide identification
– using machine learning
– using novel scoring for database searching
– post-translationally modified peptides
– de novo
• Protein identification and quantification
– label-free methodology
– based on machine learning
• Glycomics and glycoproteomics
– Glycan sequencing
– Mapping of site-specific glycosylations
Predrag Radivojac and Haixu Tang
www.informatics.indiana.edu/predrag
www.informatics.indiana.edu/hatang
Comparative genomics
From bacterial to eukaryotic genomes
• Platcom: a comparative genomics platform
– a web based system for comparing genomes on the web at
– several systems are developed on top of Platcom:
• A pathway analysis system, ComPath,
• A comparative genome annotation system, CGAS
• Non-coding sequences in eukaryotic genomes
– segmental duplications in human genome
– LTR retrotransposons
– RNA regulatory elements
Sun Kim and Haixu Tang
bio.informatics.indiana.edu/sunkim
www.informatics.indiana.edu/hatang
Motif Discovery in Proteins
From unaligned and aligned sequences
• iGibbs: An improved Gibbs Motif Sampler for Proteins by
Sequence Clustering and Iterative Pattern Refinement.
• ADBG: Motif Discovery Using Approximate De Bruijn Graphs
• ARCS: An Aggregated Related Column Scoring Scheme for
For a list of current algorithms:
http://bio.informatics.indiana.edu/bioalgo/
Mehmet Dalkilic, Sun Kim and Haixu Tang
www.informatics.indiana.edu/dalkilic
bio.informatics.indiana.edu/sunkim
www.informatics.indiana.edu/hatang
Systems Biology & Disease
http://www.informatics.indiana.edu/dalkilic
Integrated Discovery in Gene
Networks
Using Drosophila m. data to discover
disease-related protein interactions in
humans
Mehmet Dalkilic, James Costello, John Colbourne, Brian Eads
In collaboration with Dept. Biology, CGB, & DGRC
Curation and Alignment Tool for
Protein Annotation: www.catpa.org
Provide a database system that allows annotation at the residue level of protein
families (multiple and/or non-contiguous) with images and text. Deletions and
insertions can be displayed too. Motifs can be automatically brought in from any
motif discovery system provided a simple XML format is used.
Annotate any collection of residues
High level search
Mehmet Dalkilic, Andrew Albrecht