History and Philosophy of Science

Discovering Descriptive Knowledge
Lecture 18
Descriptive Knowledge in Science
In an earlier lecture, we introduced the representation and
use of taxonomies and laws.
Informatics tools for working with taxonomies
• represent them as a collection of hypotheses about
categories and their is-a relationships;
• use them to organize knowledge and to classify new
observations.
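As a minimal sketch of the is-a representation, a taxonomy can be stored as links from each category to its parent. The category names and the helper functions `ancestors` and `is_a_kind_of` are hypothetical, chosen only for illustration.

```python
# Hypothetical is-a links: each category maps to its parent (None marks the root).
IS_A = {
    "animal": None,
    "bird": "animal",
    "mammal": "animal",
    "raven": "bird",
}

def ancestors(category, is_a=IS_A):
    """Return the categories above `category`, nearest first."""
    chain = []
    parent = is_a[category]
    while parent is not None:
        chain.append(parent)
        parent = is_a[parent]
    return chain

def is_a_kind_of(category, other, is_a=IS_A):
    """True if `category` equals `other` or lies below it in the hierarchy."""
    return category == other or other in ancestors(category, is_a)

print(ancestors("raven"))
print(is_a_kind_of("raven", "mammal"))
```

Storing only the parent link keeps each is-a hypothesis separate, so revising the taxonomy means changing a single entry.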
Informatics tools for working with laws
• represent them as hypotheses about quantitative and/or
qualitative relationships among an object’s properties;
• use them to predict the static or dynamic properties of
an entity or an interconnected system.
The Taxonomy Formation Task
Taxonomy formation consists of three tasks that may be
solved separately or simultaneously:
• the construction of categories;
• the organization of the categories into a hierarchy; and
• the explicit definition of the categories.
Informatics tools for taxonomy formation fall into two
general categories:
• those that analyze finite batches of observations and
create separate taxonomies for each batch; and
• those that incrementally construct and refine taxonomies
based on an effectively continuous stream of data.
Cluster 3.0
Cluster is designed to construct and organize categories
from a batch of gene expression data.
As input, Cluster takes gene expression levels from
multiple experiments.
The program clusters genes
based on their expression
patterns across experiments.
Scientists can select the
clustering method and set
the available parameters.
Cluster produces a text file
that contains the taxonomy.
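The kind of batch clustering described here can be sketched as bottom-up (agglomerative) clustering. This is not Cluster's implementation: the choice of Euclidean distance and average linkage is an assumption for illustration, and the gene names and expression levels are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def leaves(tree):
    """Flatten a nested-tuple tree into the list of gene names it contains."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def average_linkage(t1, t2, profiles):
    """Mean pairwise distance between the genes of two clusters."""
    pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
    return sum(euclidean(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

def cluster(profiles):
    """Merge the two closest clusters repeatedly until one tree remains."""
    trees = sorted(profiles)
    while len(trees) > 1:
        i, j = min(
            ((i, j) for i in range(len(trees)) for j in range(i + 1, len(trees))),
            key=lambda ij: average_linkage(trees[ij[0]], trees[ij[1]], profiles),
        )
        merged = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0]

# Made-up expression levels for four hypothetical genes across three experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [5.0, 0.5, 0.2],
    "geneD": [4.8, 0.4, 0.3],
}
tree = cluster(profiles)
print(tree)
```

The nested tuple plays the role of the taxonomy: each pair is a category whose members are the genes beneath it.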
Cluster 3.0: Results
Viewing the taxonomy produced by Cluster requires a
separate program, such as TreeView.
[Figure: screenshot labeled with the taxonomy, the data, a selected section of the taxonomy, and gene annotations.]
ReTAX
ReTAX is an interactive environment that helps scientists
revise taxonomies in response to new observations.
A taxonomy in ReTAX includes hierarchically organized
categories and their definitions.
Each datum for ReTAX consists of a set of features, such as
the size of a plant’s leaf or the type of its fruit, and a category.
As a scientist enters data, ReTAX ensures that the new
item’s features
• match or specialize the category’s defining features; and
• distinguish it from other categories in the taxonomy.
If the new item violates either of these rules, then ReTAX
attempts to revise its taxonomy.
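The two checks above can be sketched as follows. This is an illustration, not ReTAX's actual representation: the feature values attached to the two genera are invented, and `matches` and `check` are hypothetical helper names.

```python
# Invented category definitions for illustration: feature -> admissible values.
DEFINITIONS = {
    "Pernettya": {"fruit": {"berry"}},
    "Gaultheria": {"fruit": {"capsule"}},
}

def matches(item, definition):
    """True if every defining feature admits the item's value for it."""
    return all(item.get(f) in allowed for f, allowed in definition.items())

def check(item, category, definitions=DEFINITIONS):
    """Return the list of violated rules (empty if the item is consistent)."""
    problems = []
    # Rule 1: the item must match the defining features of its own category.
    if not matches(item, definitions[category]):
        problems.append("does not match " + category)
    # Rule 2: the item must not also fit the definition of another category.
    for other, definition in definitions.items():
        if other != category and matches(item, definition):
            problems.append("not distinguished from " + other)
    return problems

print(check({"fruit": "berry"}, "Pernettya"))
print(check({"fruit": "capsule"}, "Pernettya"))
```

A non-empty result is the signal that a revision is needed; how the taxonomy is then revised is outside this sketch.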
ReTAX
Ericaceae
  Andromeda
    A. uva-ursi
  Pernettya
    P. tasmanica
    …
  Gaultheria
    G. oppositifolia
    G. rupestris
    G. antipoda
Working in the context of a botanical taxonomy like this
one, ReTAX replicated historical revisions.
In the course of its use, ReTAX
• identified descriptive features that were insufficient for
distinguishing members of two taxa;
• searched for new features to refine the taxa; and
• eventually suggested that the genera Pernettya and
Gaultheria should be merged.
Qualitative Law Discovery
Qualitative laws fall into two primary categories:
• those involving categorical statements about objects,
such as “all ravens are black”; and
• those describing qualitative changes, such as
“temperature and pressure increase proportionately”.
Informatics tools that discover categorical relationships
have received the majority of the attention in this area.
These tools typically address a supervised learning task:
• data are described by multiple features (color = black,
wings = present);
• one of these features serves as a target for classification
(species = C. corax); and
• the tool relates the features to the target.
RL
RL addresses the supervised learning task to produce
qualitative laws that are expressed as logical rules.
Each rule states that if all of its conditions are true of a
datum, then the datum is assigned to the target class.
As input, RL takes a data set and information that controls
the characteristics of the rules, such as
• taxonomies of the values for features,
• constraints among features in each rule,
• a minimum accuracy, and
• a maximum number of features per rule.
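A minimal sketch of rule learning under two of these controls, minimum accuracy and maximum number of features, is shown below. It enumerates candidate rules exhaustively, which is an assumption made for brevity and no claim about how RL searches; the data and the function names are hypothetical, and the sketch assumes every datum has a string value for every feature.

```python
from itertools import combinations

def covers(rule, datum):
    """A rule covers a datum when all of its feature conditions hold."""
    return all(datum.get(f) == v for f, v in rule.items())

def accuracy(rule, data, target, target_value):
    """Return (fraction of covered data in the target class, number covered)."""
    covered = [d for d in data if covers(rule, d)]
    if not covered:
        return 0.0, 0
    correct = sum(1 for d in covered if d[target] == target_value)
    return correct / len(covered), len(covered)

def learn_rules(data, target, target_value, min_accuracy=0.9, max_features=2):
    """Keep every conjunction of feature values that meets the accuracy bound."""
    features = sorted({f for d in data for f in d if f != target})
    rules = []
    for size in range(1, max_features + 1):
        for subset in combinations(features, size):
            # Candidate value combinations are taken from the data themselves.
            seen = {tuple(d.get(f) for f in subset) for d in data}
            for values in sorted(seen):
                rule = dict(zip(subset, values))
                acc, n = accuracy(rule, data, target, target_value)
                if n > 0 and acc >= min_accuracy:
                    rules.append((rule, acc, n))
    return rules

# Made-up observations in the style of the raven example.
data = [
    {"color": "black", "wings": "present", "species": "C. corax"},
    {"color": "black", "wings": "present", "species": "C. corax"},
    {"color": "white", "wings": "present", "species": "other"},
    {"color": "black", "wings": "absent", "species": "other"},
]
rules = learn_rules(data, "species", "C. corax")
print(rules)
```

Each returned rule carries its accuracy and the number of data it covers, two simple numeric measures of support.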
RL
As an example, consider the task of finding law-like
relationships that link medical findings to a disease class.
The data are patient findings, and the target is a syndrome
that covers several ailments (lower respiratory syndrome).
RL produces rules that relate
the findings to the syndrome.
Each rule has numeric
measures of support.
RL has been applied
• to identify carcinogens, and
• to determine parameters for
crystallographic experiments.
Quantitative Law Discovery
Quantitative laws may describe:
• algebraic relationships such as Newton’s second law of
motion, a = F/m; and
• dynamic responses such as the unbounded growth of a
population, dP/dt = kP.
Informatics tools address both classes of laws through a
variety of techniques.
BACON discovers quantitative, algebraic laws through
problem space search guided by declarative heuristics.
Cubist discovers conditional, algebraic laws using
techniques for linear regression.
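Heuristic search for an algebraic law can be sketched with two simple rules: when two terms rise together, try their ratio; when one rises as the other falls, try their product; stop when a term is nearly constant. This is a much-simplified sketch in the spirit of BACON, not its actual rule set. The function names, the 2% tolerance, and the exponent-tuple encoding are assumptions, and the planetary distances (AU) and periods (years) are approximate.

```python
from itertools import combinations

def direction(values):
    """+1 if strictly increasing, -1 if strictly decreasing, else 0."""
    if all(a < b for a, b in zip(values, values[1:])):
        return 1
    if all(a > b for a, b in zip(values, values[1:])):
        return -1
    return 0

def is_constant(values, tol):
    """True if every value lies within a relative tolerance of the mean."""
    mean = sum(values) / len(values)
    return all(abs(v - mean) <= tol * abs(mean) for v in values)

def find_invariant(data, max_rounds=3, tol=0.02):
    """Search for a product of powers of the variables that stays constant."""
    names = list(data)
    # Each term is a tuple of integer exponents over the base variables.
    terms = {tuple(int(i == j) for j in range(len(names))): list(data[n])
             for i, n in enumerate(names)}
    for _ in range(max_rounds):
        new_terms = {}
        for ea, eb in combinations(list(terms), 2):
            da, db = direction(terms[ea]), direction(terms[eb])
            if da == 0 or db == 0:
                continue
            sign = -1 if da == db else 1  # ratio if same trend, product if opposite
            exps = tuple(x + sign * y for x, y in zip(ea, eb))
            if exps in terms or exps in new_terms or not any(exps):
                continue  # skip terms already tried and the trivial constant
            values = [x * y ** sign for x, y in zip(terms[ea], terms[eb])]
            if is_constant(values, tol):
                return exps, values
            new_terms[exps] = values
        terms.update(new_terms)
    return None

# Approximate values for Mercury, Venus, Earth, and Mars, ordered by distance.
planets = {"D": [0.387, 0.723, 1.000, 1.524], "P": [0.241, 0.615, 1.000, 1.881]}
result = find_invariant(planets)
print(result)
```

On these data the search passes through D/P and D²/P before reaching the exponents (3, -2), that is, D³/P² is nearly constant, which is Kepler's third law.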
LAGRAMGE
LAGRAMGE, and its precursor LAGRANGE, were the first
in a line of law discovery systems for differential equations.
LAGRAMGE takes as input
• time series for multiple variables,
• an indicator that identifies the dependent variable, and
• knowledge about the structure of plausible solutions.
As output, the system produces an algebraic or differential
equation for the dependent variable.
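As a rough illustration of fitting a differential equation to a time series, the sketch below fixes one candidate structure, dP/dt = kP, and estimates only its parameter k by least squares on finite differences. The search over candidate structures is omitted, so this is not LAGRAMGE's method; the function name and the synthetic data are assumptions.

```python
import math

def fit_growth_rate(times, values):
    """Least-squares estimate of k in dP/dt = k * P from a time series."""
    num = 0.0
    den = 0.0
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        dP_dt = (values[i + 1] - values[i]) / dt   # finite-difference derivative
        P_mid = 0.5 * (values[i] + values[i + 1])  # P at the interval midpoint
        num += dP_dt * P_mid
        den += P_mid * P_mid
    return num / den

# Synthetic series generated from P(t) = 2 * exp(0.5 * t), so the true k is 0.5.
times = [0.1 * i for i in range(51)]
values = [2.0 * math.exp(0.5 * t) for t in times]
k = fit_growth_rate(times, values)
print(round(k, 3))
```

With noisy measurements the finite-difference derivative becomes unreliable, which is one reason a real system needs more care than this sketch takes.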
LAGRAMGE has been applied in ecosystem dynamics,
fjord hydrodynamics, and other domains.
Discovering Descriptive Knowledge: Summary
Computational scientific discovery has a long history,
particularly in the context of descriptive knowledge.
Such systems have played a large role in exploring,
analyzing, and understanding data.
Work in this area laid the foundations for the field of data
mining both in terms of research and applications.
However, the discovery of descriptive knowledge
• can lead to a shallow interpretation of data;
• generally avoids statements of causality; and
• makes limited contact with the rich, theoretical content of
a scientific discipline.
Next we will discuss systems that address these concerns.