Introduction to Data Science Section 2

Introduction to Data Science
Section 2
Data Matters 2015
Sponsored by the Odum Institute, RENCI, and
NCDS
Thomas M. Carsey
1
The Data Lifecycle
2
Data Science is More than Analysis
• Data analysis gets most of the attention in data
science.
• In that sense, many people struggle to distinguish
data science from applied statistics.
• Analysis is obviously important, but statistical
analysis skills are only useful if the data can be
collected and put in a usable form.
• Data Science is much broader than just data
analysis.
3
The Data Lifecycle
• Data science considers data at every stage of what is called
the data lifecycle.
• This lifecycle generally refers to everything from collecting
data to analyzing it to sharing it so others can re-analyze it.
– In fact, it includes the planning process that should be in place
before any other work begins.
• New visions of this process in particular focus on
integrating every action that creates, analyzes, or otherwise
touches data.
• These same new visions treat the process as dynamic –
data archives are not just digital shoe boxes under the bed.
• There are many representations of this lifecycle.
4
Lessons from the Lifecycle
• Data Science is more than just data analysis.
• Effective data science requires
– Planning
– Vision
– Storage
– Interoperability of systems
– A team approach
– Adaptability and Scalability
9
What is Missing?
• Most definitions of data science underplay or
leave out discussions of:
– Substantive theory
– Metadata
– Privacy and Ethics
– Greater consideration for missing data,
representativeness, and uncertainty
– More thinking about the proper null hypothesis
– Leadership on leveraging data science for the
public good
10
Substantive Theory
11
The Data Generating Process (DGP)
• Most of the time we don’t care about the data
itself.
• Most of the time we are trying to learn
something about an underlying process that
produces the data – a DGP.
• Technically trained folks might be good at
uncovering patterns in data, but you need
substantive expertise to:
– Know where to look in the first place
– Know what to look for
– Know what your findings might actually mean
12
What is the DGP?
• Good analysis starts with a question you want to
answer.
– Blind data mining can only get you so far, and really, there
is no such thing as completely blind mining
• Answering that question requires laying out
expectations of what you will find and explanations for
those expectations.
• Those expectations and explanations rest on
assumptions.
• If your data collection, data management, and data
analysis are not compatible with those assumptions,
you risk producing meaningless or misleading answers.
13
The DGP (cont.)
• Think of the world you are interested in as governed by
dynamic processes.
• Those processes produce observable bits of information
about themselves – data
• We can use data science to:
– Collect, catalog, and organize those bits of information
– Discover patterns in data and fit models to that data
– Make predictions outside of our data
– Inform explanations of both those patterns and those
predictions.
• Real discovery is NOT about modeling patterns in
observable data. It is about understanding the processes
that produced that data.
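A minimal sketch of the distinction between a process and the data it produces. The linear DGP here (intercept 1.0, slope 2.0, Gaussian noise) is a made-up assumption for illustration, not something from the slides. We simulate the process, look only at the resulting data, and fit a model to recover the parameters of the process.

```python
import random


def simulate_dgp(n, intercept=1.0, slope=2.0, noise_sd=0.5, seed=0):
    """Hypothetical DGP: y = intercept + slope * x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# The analyst sees only xs and ys, never the true parameters.
xs, ys = simulate_dgp(n=1000)
est_intercept, est_slope = fit_line(xs, ys)

# A prediction outside the observed range of x (0 to 10). It is only
# trustworthy if we believe the same process still operates there.
prediction_at_20 = est_intercept + est_slope * 20
print(est_intercept, est_slope, prediction_at_20)
```

The fitted slope is a statement about the process, not just the sample. Whether the prediction at x = 20 means anything depends on substantive knowledge about the process, which the data alone cannot supply.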
14
Theories and DGPs
• Theories provide explanations for the
processes we care about.
• They answer the question: why does
something work the way it does?
• Theories make predictions about what we
should see in data.
• We use data to test the predictions, but we
never completely test a theory.
15
Why do we need theory?
• Can’t we just find “truth” in the data if we have
enough of it? Especially if we have all of it?
• No!
– More data does not mean more representative data.
– Every method of analysis makes some assumptions, so
we are better off if we make them explicit.
– Patterns without understanding are at best
uninformative and at worst deeply misleading.
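The first point can be illustrated with a small simulation. Everything in it is a hypothetical assumption: a population in which 30% of people are "online", and online people score higher on the outcome we want to measure. A very large sample of online people only is compared with a small random sample of everyone.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical population of (online, outcome) pairs.
population = []
for _ in range(100_000):
    online = rng.random() < 0.3
    outcome = rng.gauss(60 if online else 40, 10)
    population.append((online, outcome))

true_mean = mean(outcome for _, outcome in population)

# Big but unrepresentative: every online person, and nobody else.
big_biased = [outcome for online, outcome in population if online]

# Small but representative: a simple random sample of 500 people.
small_random = [outcome for _, outcome in rng.sample(population, 500)]

biased_error = abs(mean(big_biased) - true_mean)
random_error = abs(mean(small_random) - true_mean)
print(len(big_biased), biased_error)
print(len(small_random), random_error)
```

Adding more online respondents would not shrink the first error. The bias comes from who is in the data, not from how many.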
16
Robert Matthews (Aston University), 2000. “Storks Deliver Babies (P=0.008).”
Teaching Statistics. Volume 22, Number 2, Summer 2000
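A sketch in the spirit of the stork example, using simulated data rather than the data from the article. The confounder (region size) and all coefficients are hypothetical assumptions. Two variables that never influence each other correlate strongly because both depend on a third. Once that third variable is accounted for, the correlation largely disappears.

```python
import random


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def residuals(y, x):
    """What is left of y after removing a least-squares line in x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return [b - (mean_y + slope * (a - mean_x)) for a, b in zip(x, y)]


rng = random.Random(1)

# Hypothetical confounder: larger regions have more storks AND more births.
size = [rng.uniform(1, 100) for _ in range(500)]
storks = [2.0 * s + rng.gauss(0, 20) for s in size]
births = [5.0 * s + rng.gauss(0, 50) for s in size]

raw_r = pearson_r(storks, births)

# Partial correlation: correlate what remains after removing size from both.
partial_r = pearson_r(residuals(storks, size), residuals(births, size))
print(round(raw_r, 2), round(partial_r, 2))
```

Nothing in the data says to control for region size. Knowing to look for it is exactly the kind of thing theory provides.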
17
New Behaviors Require New Theories
• The Target example illustrated how existing theories
about habit formation informed their data mining
efforts.
• However, whole new behaviors exist that are creating a
lot of the data that data scientists want to analyze:
– Online shopping
– Cell phone usage
– Crowd sourced recommendation systems
– Facebook, Google searching, etc.
– Online mobilization of social protests
• We need new theories for these new behaviors.
18
Metadata
19
What is Metadata?
• Metadata is data about data. It is frequently
ignored or misunderstood.
• Metadata is required to give data meaning.
• It includes:
– Variable names and labels, value labels, information
on who collected the data, when, by what methods, in
what locations, for what purpose, etc.
• Metadata is essential to use data effectively, to
reuse data, to share data, and to integrate data.
• Data without metadata is worthless.
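A minimal sketch of how metadata gives data meaning. The variable names, labels, and codes below are hypothetical. Without the codebook, the raw rows are just numbers. With it, each value can be read.

```python
# Raw data as it might arrive: opaque column names and numeric codes.
raw_rows = [
    {"v1": 1, "v2": 34},
    {"v1": 2, "v2": 51},
]

# Hypothetical metadata (a codebook) for the same file: a variable label
# for every column, and value labels where the values are codes.
codebook = {
    "v1": {"label": "Employment status",
           "values": {1: "Employed", 2: "Unemployed"}},
    "v2": {"label": "Age in years", "values": None},
}


def apply_codebook(row, codebook):
    """Return a copy of row keyed by variable label, with codes decoded."""
    decoded = {}
    for name, value in row.items():
        meta = codebook[name]
        value_labels = meta["values"]
        decoded[meta["label"]] = value_labels[value] if value_labels else value
    return decoded


readable_rows = [apply_codebook(row, codebook) for row in raw_rows]
print(readable_rows)
```

A fuller codebook would also record who collected the data, when, where, and by what method, as listed above.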
20
The Value of Metadata
• Data by itself is just a bunch of 0’s and 1’s.
• Metadata
– Provides meaning
– Allows for cataloging
– Facilitates search and discovery
– Enables linking data sets
21
Types of Metadata
• NISO defines three types:
– Structural: describes how the components of the
data are organized (columns, rows, chapters, etc.)
– Descriptive: provides titles, authors, keywords,
subjects, etc. that facilitate attribution and
search/discovery.
– Administrative: technical information on how the file
was created, software used, formats for storage,
etc.
• Includes rights and preservation metadata
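The three types can be sketched as one record with three parts. The class name, field contents, and helper function are hypothetical, chosen only to show which kind of information goes where.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical container grouping metadata by the three types."""
    structural: dict = field(default_factory=dict)      # how the data is organized
    descriptive: dict = field(default_factory=dict)     # supports attribution and search
    administrative: dict = field(default_factory=dict)  # technical, rights, preservation


meta = DatasetMetadata(
    structural={"rows": 1200, "columns": ["respondent_id", "age", "state"]},
    descriptive={"title": "Example survey", "keywords": ["survey", "example"]},
    administrative={"file_format": "CSV", "software": "example export tool",
                    "rights": "example license"},
)


def matches(meta, keyword):
    """Search and discovery rely on the descriptive part only."""
    keywords = meta.descriptive.get("keywords", [])
    return keyword.lower() in (k.lower() for k in keywords)


print(matches(meta, "Survey"), matches(meta, "census"))
```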
22
Metadata Standards
• There are emerging standards for metadata
– The American National Standards Institute
– The International Organization for Standardization
• Dublin Core – 15 classic metadata terms.
– Title, Creator, Subject, Description, Publisher,
Contributor, Date, Type, Format, Identifier, Source,
Language, Relation, Coverage, Rights
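A small sketch of what a shared standard buys you, using the 15 element names of the Dublin Core Metadata Element Set. The record and the checking function are hypothetical. Real Dublin Core records are usually serialized as XML or RDF, not as Python dictionaries.

```python
# The 15 Dublin Core element names, lowercased.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier", "source",
    "language", "relation", "coverage", "rights",
}


def unknown_elements(record):
    """Return the keys of record that are not Dublin Core elements, sorted."""
    return sorted(set(record) - DUBLIN_CORE_ELEMENTS)


# Hypothetical record; the values are illustrative only.
record = {
    "title": "Example survey data",
    "creator": "Example Research Team",
    "date": "2015",
    "format": "text/csv",
    "language": "en",
    "colour": "blue",  # not a Dublin Core element
}

print(unknown_elements(record))
```

Because every archive that follows the standard uses the same element names, records from different sources can be cataloged, searched, and linked without custom translation for each one.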
23
Privacy and Ethics
We will do this at the end
24