Presentation - The National Academies of Sciences, Engineering


Foundations of Data Science: Mathematics
Raphy Coifman
Depts. of Mathematics & of Computer Science, Yale University
Eric Kolaczyk
Dept. of Mathematics & Statistics, Boston University
View @ 10K ft
It is common to group the key players of data science into:

“Computational” Sciences – E.g., Computer science, engineering, & statistics
(previous talks)

“Domain” sciences – E.g., Genomics, neuroscience, text analysis, etc.
Mathematics plays a critical, but sometimes differentiated, role in support of
both (often in the sense of “infrastructure”).
Mathematical Infrastructure: General
The “computational” side has traditionally been supported by, e.g.,

Linear algebra

Numerical analysis

Graph theory
as well as supporting aspects of statistics, signal processing, etc.
Math Infrastructure: Domain

Support for the “domain” side frequently is domain/problem-specific.

The most representative examples come from the physical sciences.

Provides shortcuts and enabling tools for processing and information
extraction, based on mathematical physical models.

Ideas about representation and data features are mostly based on
mathematical analysis tools.
Math Infrastructure: Domain (cont)
Example: Linear(ized) inverse problems

Seismology/geophysics (Abel transform and related)

Medical imaging (Radon transform, etc.)

Radar

Weather
Note: These all historically had “big” data long before it became common in
other domains.
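
For reference, the transforms named above have the standard forms sketched below. The symbols (f, g, A, F, R, and the coordinates) are notation chosen here for illustration and do not come from the slides. Both transforms are linear in f, which is what makes the recovery of f from measured data a linear inverse problem.

```latex
% Generic linear inverse problem: recover f from noisy data g
g = A f + \varepsilon

% Abel transform of a radially symmetric function f(r)
F(y) = 2 \int_{y}^{\infty} \frac{f(r)\, r}{\sqrt{r^{2} - y^{2}}}\, dr

% Radon transform of f(x_1, x_2): line integrals indexed by angle \theta and offset s
(Rf)(\theta, s) = \int_{-\infty}^{\infty}
  f\bigl(s\cos\theta - t\sin\theta,\; s\sin\theta + t\cos\theta\bigr)\, dt
```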
Cartoon Version of Data Science Goals
[Diagram: Massive Data -> self-organization / clustering; dimension reduction / compression; modeling; etc. -> Knowledge]
Mathematics can contribute both theoretical models/structure
and a corresponding “calculus”.
Looking Back: Core Mathematical Activities
The following core activities were essential in the past to model our world and,
with some adaptation and modification, are expected to remain essential in
current and future environments.

Linear algebra: the basics, high-dimensional methods, and effective numerical
analysis.

Analysis: PDEs (both stochastic and deterministic), harmonic analysis and
approximation theory, functional analysis.
Core Mathematical Activities (cont)

Geometries: Riemannian, metric geometry, differential geometry, some
topology.

Optimization: Dynamic programming, convex optimization, relaxation
methods.

Etc.
Key Point
In the current and future environment for
data science we are lacking theoretical
models, as well as related calculus.
Illustration: Linear Algebra

Linear algebra -- more precisely, linear algebra in high dimensions -- is the
main computational toolbox for all data analysis.

Superficially, it has changed little in recent memory.
However, the reality is that computational numerical algebra for very large
dimensions requires the integration of all the classical fields mentioned (i.e.,
analysis, geometry, optimization, probability, etc.).
Extending Linear Algebra

The challenges of linear algebra in, e.g., high-dimensional discretized
function spaces require a good deal of understanding of analysis,
randomization, and stochastic processes, because very high-dimensional
linear algebra requires the reorganization of massive matrices.

Extensions to tensor or multilinear algebra are an essential ingredient of
combined source processing.
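
One small illustration of how randomization enters high-dimensional linear algebra is a Gaussian random projection. This is a choice of example made here, not a method taken from the slides. The sketch below, under the assumption of a dense Gaussian matrix with entries of variance 1/k, maps points from d dimensions down to k dimensions and checks that pairwise distances are roughly preserved. The function names, dimensions, and seeds are hypothetical.

```python
import math
import random

def random_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a Gaussian random matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    # Entries have variance 1/k, so squared lengths are preserved in expectation.
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix] for p in points]

def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Illustrative data: 4 random points in 1000 dimensions, projected to 256.
rng = random.Random(1)
d, k = 1000, 256
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(4)]
projected = random_projection(points, k)

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    distance(projected[i], projected[j]) / distance(points[i], points[j])
    for i in range(4)
    for j in range(i + 1, 4)
]
print(ratios)
```

The ratios should cluster near 1, with a spread that shrinks as k grows. Explaining why this works is a question of probability and analysis rather than of linear algebra alone.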
Extending Linear Algebra: Beyond Linear

The moment we exit the linear regime all expectations crumble.

The simplest bilinear transformations, such as the pointwise product of two
functions, become discontinuous in the weak topology, resulting potentially
in major disruptions due to noise, which in the linear regime usually cancels,
but builds up in nonlinear regimes.

This is not academic: it requires serious, deep mathematical analysis to
understand empirical observations and structures in nonlinear regimes.
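
A standard textbook illustration of this discontinuity, added here as a numerical sketch, uses the oscillation sin(nx) on [0, 2π]. Integrated against a fixed test function, sin(nx) averages out as n grows, but its pointwise product with itself, sin(nx)^2, converges to one half of the integral of the test function rather than to zero. The test function `phi` and the helper names are assumptions made for this example.

```python
import math

def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def phi(x):
    # Hypothetical test function used to probe weak convergence.
    return math.exp(-x)

def linear_pairing(n):
    # Integral of sin(n x) * phi(x): the oscillation averages out as n grows.
    return integrate(lambda x: math.sin(n * x) * phi(x), 0.0, 2.0 * math.pi)

def product_pairing(n):
    # Integral of sin(n x)^2 * phi(x): the pointwise product does not average out.
    return integrate(lambda x: math.sin(n * x) ** 2 * phi(x), 0.0, 2.0 * math.pi)

# Half the integral of phi: the limit of product_pairing(n) as n grows.
half_mass = 0.5 * integrate(phi, 0.0, 2.0 * math.pi)

for n in (1, 10, 100):
    print(n, linear_pairing(n), product_pairing(n), half_mass)
```

If sin(nx) is read as oscillatory noise, the first column shows it cancelling in a linear measurement, while the second shows it leaving a persistent bias after a single bilinear operation.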
Mathematical Conceptualization of Modern Data Science Challenge

A cloud of points in high dimensional Euclidean space (a database) that needs
to be characterized or modeled.

The density of the points could be estimated, the geometry or configuration
of the points described. (Manifold learning, graph/networks, probabilistic
generative models, metric spaces, etc.)

Various functions on the data, such as low-dimensional embeddings,
classifiers, or “features”, may need to be regressed or “learned”.
The Main Point wrt Data Science Education

This activity involves a blend of all the (sub)fields listed above.

Importantly, the corresponding mathematical abstractions, usually taught
in the mathematical curriculum as independent courses (e.g., differential
geometry, partial differential equations, probability and stochastic processes,
dynamical systems, optimization, Fourier analysis, etc.), all come to life and
blend in many instances of the modern setting, providing insights in all
directions.

This will require a blended integrative curriculum, in which all these tools
are explained and jointly motivated.
Illustration: Comparing Two Random Samples

A fundamental problem in statistics is to assess whether two sets of
observations are random samples (a) of the same probability distribution, or
(b) governed by different distributions.

A canonical solution is, e.g., the two-sample Kolmogorov-Smirnov test.

Fundamentally, this requires the creation of histogram bins.

The main challenge in high dimension is to match the bins to the geometry of
the data, to its precision, and to the reason for matching the two
populations.
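
As a minimal sketch of the one-dimensional case, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the two empirical cumulative distribution functions. Here the half-lines (-inf, x] play the role of the bins, and the real line supplies their geometry for free. The function name is hypothetical, and the sketch computes only the statistic, not a p-value.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two 1-D samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap occurs
    # at one of the observed values.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # identical samples
print(ks_statistic([1, 3], [2, 4]))              # interleaved samples
print(ks_statistic([1, 2, 3], [4, 5, 6]))        # fully separated samples
```

In higher dimensions there is no canonical ordering and hence no canonical family of half-lines, which is one way to read the bin-matching challenge stated above.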
Illustration (cont)

Currently soft bins are created (implicitly) through kernel functions, or use of
various smoothness classes (e.g., Lipschitz functions).

Unfortunately, these heuristic methods are not informed by the intrinsic data
geometry/statistics.

Attempts to use deep nets to define appropriate binning might work better
when driven by statistical analytic tools.

At this point, kernel methods or smoothness class methods assume
conventional geometries whether they fit or not. What is required is an
appropriate analytic approximation theory, governed by statistics as well as
by filtering needs.
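
To make the "soft bins through kernel functions" idea concrete, the sketch below computes a (biased) squared maximum mean discrepancy between two samples with a Gaussian kernel. This is one common kernel two-sample statistic, chosen here for illustration and not named in the slides. The function names and the default bandwidth are assumptions.

```python
import math

def gaussian_kernel(u, v, bandwidth):
    """Gaussian kernel on Euclidean distance: each point acts as a soft bin."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )

near_origin = [(0.0, 0.0), (0.1, 0.0)]
far_away = [(5.0, 5.0), (5.1, 5.0)]
print(mmd_squared(near_origin, near_origin))  # same sample
print(mmd_squared(near_origin, far_away))     # well-separated samples
```

Note that the kernel depends only on Euclidean distance and a bandwidth fixed in advance. Nothing in it adapts to the intrinsic geometry or statistics of the data, which is the limitation the slide points to.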
Summary / Discussion
Mathematics provides

Conceptual framework

Language

“Calculus”
for data science.
The traditional framework has been found to extend well beyond what might have
been expected during its development, but we appear to be reaching a point
where additional fundamental development will prove crucial.
Summary / Discussion (cont)

How best to evolve the mathematics curriculum, both foundational and interdisciplinary, to meet data science needs?

How can we better foster integrative teaching/learning of topics from across
traditionally diverse areas, both within mathematics (e.g., linear algebra,
geometry, analysis) and across mathematical sciences (e.g., mathematics,
statistics, probability)?

How best to facilitate a more integrated development of theoretical error
bounds / guarantees and computational / data analysis advances?