Transcript Slide 1

DATA SCIENCE IN EDUCATION
AND FOR DISCOVERY
Kirk D. Borne
School of Physics, Astronomy, & Computational Sciences
George Mason University
http://classweb.gmu.edu/kborne/
Abstract
I will discuss the rise of data science as a new academic and
research discipline. Data-intensive opportunities are growing
significantly across the spectrum of academic, government, and
business enterprises. In order to respond to this data-driven digital
transformation, it is imperative to train the next-generation workforce
in the data-science skill areas. Among these skills are knowledge
discovery and information extraction from massive data collections. I
will describe some of the techniques that we are applying both in
research (for scientific discovery) and in the classroom (to engage
students in inquiry-driven evidence-based learning). Specific
examples of surprise detection in big data will be presented.
Ever since humans began to explore the world…
… humans have asked questions and …
… have collected evidence (data) to help answer those questions.
Astronomy: the world’s second oldest profession!
Now, the Data Flood is everywhere
• Huge quantities of data are being generated, collected, and stored
within all business, government, research, and personal domains.
• Two significant challenges of this Data Flood will be addressed:
  • Training the next-generation workforce to manage and expertly use these data
    • “The Rise of the Data Scientist”
  • Discovering the knowledge and surprises hidden within the data
    • Transforming our repositories from a data representation to a knowledge representation
• So how do we address these challenges?
• First, we must face it – i.e., the students that we train as well as
knowledge workers (those who extract knowledge from data and
information) must recognize the need and face the challenge …
Visualize This: A sea of Data (sea of CDs)
This is the CD Sea in Kilmington, England (600,000 CDs ~ 300 TB)
More Data is Different
• The message should be clear: “more data is not simply more data,
but more data is different.”
• Numerous federal agencies (and others, of course) have addressed
this, including the August 9, 2010 announcement from the OMB and
White House OSTP:
  • Big Data is a national challenge and a national priority, along with healthcare and national security.
  • See http://www.aip.org/fyi (#87)
• International initiative by the CODATA organization to address this
challenge: ADMIRE = Advanced Data Methods and Information
technologies for Research and Education
• Many U.S. national study groups in the sciences have issued reports
on the urgency of establishing both research and educational
programs to face the Big Data challenges.
  • Each of these reports has issued a call to action …
Data Science: A National Imperative
1. National Academies report: Bits of Power: Issues in Global Access to Scientific Data, (1997) downloaded from
http://www.nap.edu/catalog.php?record_id=5504
2. NSF (National Science Foundation) report: Knowledge Lost in Information: Research Directions for Digital Libraries, (2003) downloaded from
http://www.sis.pitt.edu/~dlwkshop/report.pdf
3. NSF report: Cyberinfrastructure for Environmental Research and Education, (2003) downloaded from
http://www.ncar.ucar.edu/cyber/cyberreport.pdf
4. NSB (National Science Board) report: Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century, (2005)
downloaded from http://www.nsf.gov/nsb/documents/2005/LLDDC_report.pdf
5. NSF report with the Computing Research Association: Cyberinfrastructure for Education and Learning for the Future: A Vision and Research
Agenda, (2005) downloaded from http://www.cra.org/reports/cyberinfrastructure.pdf
6. NSF Atkins Report: Revolutionizing Science & Engineering Through Cyberinfrastructure: Report of the NSF Blue-Ribbon Advisory Panel on
Cyberinfrastructure, (2005) downloaded from http://www.nsf.gov/od/oci/reports/atkins.pdf
7. NSF report: The Role of Academic Libraries in the Digital Data Universe, (2006) downloaded from http://www.arl.org/bm~doc/digdatarpt.pdf
8. National Research Council, National Academies Press report: Learning to Think Spatially, (2006) downloaded from
http://www.nap.edu/catalog.php?record_id=11019
9. NSF report: Cyberinfrastructure Vision for 21st Century Discovery, (2007) downloaded from http://www.nsf.gov/od/oci/ci_v5.pdf
10. JISC/NSF Workshop report on Data-Driven Science & Repositories, (2007) http://www.sis.pitt.edu/~repwkshop/NSF-JISC-report.pdf
11. DOE report: Visualization and Knowledge Discovery: Report from the DOE/ASCR Workshop on Visual Analysis and Data Exploration at Extreme
Scale, (2007) downloaded from http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/DOE-Visualization-Report-2007.pdf
12. DOE report: Mathematics for Analysis of Petascale Data Workshop Report, (2008) downloaded from
http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/PetascaleDataWorkshopReport.pdf
13. NSTC Interagency Working Group on Digital Data report: Harnessing the Power of Digital Data for Science and Society, (2009) downloaded from
http://www.nitrd.gov/about/Harnessing_Power_Web.pdf
14. National Academies report: Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age, (2009) downloaded from
http://www.nap.edu/catalog.php?record_id=12615
15. NSF report: Data-Enabled Science in the Mathematical and Physical Sciences, (2010) http://www.cra.org/ccc/docs/reports/DES-report_final.pdf
Data Science Education: Two Perspectives
• Informatics in Education – working with data in all learning settings
  • Informatics (Data Science) enables transparent reuse and analysis of data in inquiry-based classroom learning.
  • Learning is enhanced when students work with real data and information (especially online data) that are related to the topic (any topic) being studied.
  • http://serc.carleton.edu/usingdata/ (“Using Data in the Classroom”)
  • Example: CSI The Cosmos
• An Education in Informatics – students are specifically trained:
  • … to access large distributed data repositories
  • … to conduct meaningful inquiries into the data
  • … to mine, visualize, and analyze the data
  • … to make objective data-driven inferences, discoveries, and decisions
• Numerous Data Science programs now exist at several universities
(GMU, Caltech, RPI, Michigan, Cornell, U. Illinois, and more)
  • http://cds.gmu.edu/ (Computational & Data Sciences @ GMU)
Data Science Education Goal
• Primary Goal: to increase students’ understanding of the role that
data and information play across all disciplines, and to increase
students’ ability to use the technologies and methodologies
associated with data acquisition, management, search, mining,
analysis, and visualization.
• Secondary goals:
  • To increase students’ abilities to use databases for inquiry
  • To increase students’ abilities to acquire, process, and explore data with the use of a computer
  • To increase students’ confidence and comfort in using data to address real-world problems (in their chosen scientific discipline, or in any endeavor)
  • To increase students’ awareness of ethical issues pertaining to data and information, including privacy, ownership, proper attribution, misuse and abuse of statistics and graphs, data falsification, and objective reasoning from data
  • To demonstrate and to share the joy of discovery from data
Knowledge Discovery from Data: Many names
• Data Mining
• Machine Learning (ML)
• Exploratory Data Analysis (EDA)
• Intelligent Data Analysis (IDA)
• Data Analytics
• Predictive Analytics
• Discovery Informatics
• On-Line Analytical Processing (OLAP)
• Business Intelligence (BI)
• Business Analytics
• Customer Relationship Management (CRM)
• Target Marketing
• Cross-Selling
• Market Basket Analysis
• Credit Scoring
• Case-Based Reasoning (CBR)
• Connecting the Dots
• Intrusion Detection Systems (IDS)
• Recommendation / Personalization Systems!
Data-driven Discovery (Unsupervised Learning)
• Class Discovery – Clustering
  • Distinguish different classes of behavior or different types of objects
  • Find new classes of behavior or new types of objects
  • Describe a large data collection by a small number of condensed representations
• Principal Component Analysis – Dimension Reduction
  • Find the dominant features among all of the data attributes
  • Enables low-dimensional descriptions of events and behaviors, while revealing correlations and dependencies among parameters
  • Addresses the Curse of Dimensionality
• Outlier Detection – Surprise / Anomaly / Novelty Discovery
  • Find objects and events that are outside the bounds of our expectations
  • These could be garbage (erroneous measurements) or true discoveries
  • Used for data quality assurance and/or for discovery of new / rare / interesting data items
• Link Analysis – Association Analysis – Network Analysis
  • Identify connections between different events (or objects)
  • Find unusual (improbable) co-occurring combinations of data attribute values
  • Find data items that have far fewer than “6 degrees of separation”
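To make the outlier (surprise) detection idea concrete, here is a minimal sketch, not the method used in the talk. It assumes a single numeric attribute and flags points with a large "modified z-score" based on the median and the median absolute deviation (MAD). The function name mad_outliers, the 3.5 threshold, and the sample measurements are all illustrative assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds `threshold`.

    Assumption: a robust, one-dimensional score built from the median and
    the median absolute deviation (MAD). The 0.6745 factor rescales the MAD
    so the score is comparable to a standard z-score for normal data.
    Returns a list of (index, value, score) tuples.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: this score cannot rank anything.
        return []
    flagged = []
    for i, v in enumerate(values):
        score = 0.6745 * (v - med) / mad
        if abs(score) > threshold:
            flagged.append((i, v, score))
    return flagged


# Hypothetical measurements: nine ordinary readings and one surprise.
measurements = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.7, 10.1, 9.7, 10.0]
surprises = mad_outliers(measurements)

for index, value, score in surprises:
    print(f"index {index}: value {value} (modified z-score {score:.1f})")
```

A flagged point is only a candidate: as noted above, it may be an erroneous measurement or a true discovery, and deciding which still requires inspection of the data.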
Addressing the D2K (Data-to-Knowledge) Challenge
• Complete end-to-end application of Informatics:
  • Data management, metadata management, data search, information extraction, data mining, knowledge discovery
  • All steps are necessary – skilled workforce needed to take data to knowledge
  • Applies to any discipline (not just science)
Characterize First, then Classify
• The Scientific Method does not begin with “hypothesis formulation.”
• Neither should any reasoning process jump to conclusions.
• We should teach by example: follow an evidence-based “forensics”
approach.
• “Big Data” provide an excellent framework and environment for this.
• Including Data Science in our education programs as well as in
our own business practice should lead to informed, objective,
data-driven decision-making.
• Isn’t this what we expect from all of our citizens?
• Example from scientific method:
  • Step 1: Data Collection – observe, describe, characterize
  • Step 2: Hypothesis Formulation – classify, diagnose, predict
Summary
• All enterprises are being inundated with data.
• The knowledge discovery potential from these data is enormous.
• Now is the time to implement data-oriented methodologies
(Informatics / Data Science) into the enterprise.
• This is especially important in training and degree programs –
training the next-generation workforce to use data for knowledge
discovery and decision support.
• We have before us a grand opportunity to establish dialogue and
information-sharing across diverse data-intensive research and
application communities.
• DATA SUMMIT 2011 has been a fantastic realization of that
opportunity.