Toolbox of a data scientist: multiple
approaches to work with behavioural data
Philippe J. Giabbanelli, PhD
Data Insight Meetup, February 5th 2015
Outline
1 – What’s data science?
2 – What questions can we ask of behavioural data?
3 – How do we use data science tools to get answers?
Food behaviours
Drinking behaviours
Insurgencies
What’s data science?
Visualization
Simulation and modelling
Data mining
Tableau
Imagine that people have completed some kind of questionnaire. Typically you get an
Excel spreadsheet. And you’d like to understand what relates to the target behaviour.
Imagine that you have a very complex system, where tons of
variables interact… You may want to look at it as a network.
Gephi
What if you have a lot of text instead?
Here I am primarily concerned with visualization as
seen from a data scientist’s viewpoint. I would use…
Tool                                   Data
Tableau, Qlik, Spotfire                Relational (spreadsheet)
Gephi or Visone                        Network
Datawatch                              Streaming relational
Many-eyes                              A bit of everything
GeoTime                                Spatial data over time
Jigsaw, CZSaw, InSpire, Leximancer     Text

Viz as data scientist ≠ Making pretty pictures
If you’re producing a visual for an audience, you
show what you found. When you start with viz as a
data scientist, you want to find something!
Visual Capitalist
Abusing the tool
If you watch CSI, you’ll see that when
they search for a fingerprint match, the
software shows all fingerprints it has!
Wasting computer resources
for useless displays
If it looks like your data is normally
distributed, that must be it, right?
Proper statistical testing
Relying on visuals instead of doing proper statistics
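To make the point concrete, here is a minimal sketch of testing for normality instead of eyeballing a histogram. The slide does not name a test; the Jarque-Bera test is used here only as an illustration, and the function name and sample data are hypothetical. The p-value relies on the large-sample chi-square approximation with 2 degrees of freedom, so it should not be trusted on small samples.

```python
import math
import random


def jarque_bera(values):
    """Return (JB statistic, p-value) for the null hypothesis of normality.

    The p-value uses the asymptotic chi-square distribution with 2 degrees
    of freedom, whose survival function is exp(-x / 2).
    """
    n = len(values)
    mean = sum(values) / n
    # Central moments (population form, dividing by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return jb, math.exp(-jb / 2.0)


# Hypothetical data: one sample drawn from a normal distribution and one
# drawn from a clearly skewed (exponential) distribution.
rng = random.Random(42)
bell_shaped = [rng.gauss(0.0, 1.0) for _ in range(2000)]
skewed = [rng.expovariate(1.0) for _ in range(2000)]

jb_bell, p_bell = jarque_bera(bell_shaped)
jb_skew, p_skew = jarque_bera(skewed)
print(f"bell-shaped: JB={jb_bell:.2f}, p={p_bell:.3f}")
print(f"skewed:      JB={jb_skew:.2f}, p={p_skew:.3g}")
```

A small p-value is evidence against normality; a large one only means the test found no evidence against it.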
Abusing the tool
When all you have is a hammer, everything starts looking like a nail.
What’s data science?
Visualization
Simulation and modelling
Data mining
What’s data science?
Imagine that you’re working for CSI (again!) and you want to identify the dude in the picture.
When you know what you’re after, and it can be mathematically expressed, data mining helps.
What’s data science?
[Figure: decision tree separating binge drinkers from non-binge drinkers using two variables, communication and rules. Four decision nodes are labelled A: rules, B: comm., C: comm. and D: rules. Branch conditions shown: < daily / ≥ daily, < weekly / ≥ weekly, < often / ≥ often, < very often / ≥ very often. Scale values shown: never, weekly, daily, often, very often.]
Suggested tools: RapidMiner, Weka
What’s data science?
Data mining involves automatically testing lots
of hypotheses by searching for combinations of
variables that might show a correlation.
Which variables are in the winning combination?
You partly do data mining to answer this question…
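As a rough illustration of "searching for combinations of variables", here is a minimal sketch that scores every pair of candidate variables by how well it predicts a target. The scoring rule (majority class per combination of values), the function names and the toy survey rows are all assumptions made for this example; tools such as RapidMiner or Weka use far more careful procedures.

```python
from collections import Counter, defaultdict
from itertools import combinations


def score_combination(rows, variables, target):
    """Accuracy of predicting `target` with the majority class observed for
    each combination of values of `variables` (scored on the same rows)."""
    outcomes = defaultdict(Counter)
    for row in rows:
        key = tuple(row[v] for v in variables)
        outcomes[key][row[target]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in outcomes.values())
    return correct / len(rows)


def best_combination(rows, candidates, target, size=2):
    """Try every combination of `size` candidates and return (score, combo)."""
    scored = [
        (score_combination(rows, combo, target), combo)
        for combo in combinations(candidates, size)
    ]
    return max(scored)


# Hypothetical survey: "binge" is 1 only when both peers_drink and coping
# are 1; likes_sport is unrelated noise.
rows = [
    {"peers_drink": p, "coping": c, "likes_sport": s, "binge": int(p and c)}
    for p in (0, 1)
    for c in (0, 1)
    for s in (0, 1)
]
score, winner = best_combination(
    rows, ["peers_drink", "coping", "likes_sport"], "binge"
)
print(winner, score)
```

Scoring on the same rows that built the rule overstates accuracy, and the number of combinations grows quickly with the number of variables, which is one reason to justify each variable before collecting it.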
« For every variable that you seek to collect, provide a detailed rationale. »
A. Wood, Data Manager
V. Lo, Ethics Board
What’s data science?
Visualization
Simulation and modelling
Data mining
I offered coupons to some customers. Would
they spend more? Who should I target?
I raised prices of fast foods. Would it curb
obesity? Who would benefit the most?
I put people on antiretroviral therapy when they
don’t have AIDS. Would it help? For whom?
There are lots of big questions for which you don’t necessarily have all the data. Also, methods
that help you understand what happened may not be helpful to know what may happen if…
What’s data science?
Imagine that you want to change the urban environment to see if it helps people exercise more.
You hopefully won’t be doing that.
Rather you might want to create a virtual
environment that simplifies reality so you
can test your hypothesis safely.
What’s data science?
There are lots of ways to do
modelling, depending on desired
spatial & individual resolution.
The most common approaches
are agent-based modelling and
system dynamics.
Tool               Approach
Anylogic           ABM / SD
NetLogo            ABM
Vensim, iThink     SD
What’s data science?
Also: The emergence of Computational Sociology (J. of Math. Soc., ’95); Why model? (JASSS ’08)
Data Science as a Technique
Visualization
Modelling & Simulation
Data mining & Machine Learning
Applications
Defense
Health
Chronic diseases
Infectious diseases
Why?
Tell me what
people will do
in the future!
Applications of Data Science
How would climate change policies
impact the health of Canadians by 2030?
Inputs: simulated data for 2030 (dietary patterns, built environment, socio-economics)
Systems model
Outputs: expected health impacts (physical health, well-being)
Applications of Data Science
There are many reasons other than prediction to do data science.
Explaining
To simulate far into the future, you need to understand what you have now and how it changes.
1 - Explain
2 - Predict
Timeline: 2014, 2024, 2044
Applications of Data Science
There are many reasons other than prediction to do data science.
Explaining
“Electrostatics explains lightning, but we cannot predict when or where the next bolt will strike.”
“Plate tectonics explains earthquakes, but does not permit us to predict the time and place of their occurrence.”
Applications of Data Science
There are many reasons other than prediction to do data science.
Explaining
Schelling’s model of segregation
A preference that one's neighbors be of the same
color, or even a preference for a mixture "up to
some limit", could lead to total segregation.
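A minimal sketch of a Schelling-style simulation, to show how a mild preference can be turned into a testable model. The grid size, the share of empty cells, the 30% preference and the rule that unhappy agents jump to a random empty cell are all assumptions chosen for illustration; they are not taken from the talk.

```python
import random


def occupied_neighbours(grid, r, c):
    """Return the agents in the 8 cells around (r, c); the grid wraps around."""
    size = len(grid)
    around = [
        grid[(r + dr) % size][(c + dc) % size]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    ]
    return [agent for agent in around if agent is not None]


def same_share(grid, r, c):
    """Share of (r, c)'s neighbours in its own group, or None if isolated."""
    near = occupied_neighbours(grid, r, c)
    if not near:
        return None
    return sum(other == grid[r][c] for other in near) / len(near)


def mean_similarity(grid):
    """Average same-group share over all agents that have neighbours."""
    shares = [
        same_share(grid, r, c)
        for r in range(len(grid))
        for c in range(len(grid))
        if grid[r][c] is not None
    ]
    shares = [s for s in shares if s is not None]
    return sum(shares) / len(shares)


def run_schelling(size=20, n_empty=40, wanted=0.3, max_sweeps=50, seed=1):
    """Return mean similarity before and after unhappy agents relocate."""
    rng = random.Random(seed)
    n_agents = size * size - n_empty
    cells = ["A"] * (n_agents // 2) + ["B"] * (n_agents - n_agents // 2)
    cells += [None] * n_empty
    rng.shuffle(cells)
    grid = [cells[i * size:(i + 1) * size] for i in range(size)]
    before = mean_similarity(grid)
    for _ in range(max_sweeps):
        moved = False
        for r in range(size):
            for c in range(size):
                if grid[r][c] is None:
                    continue
                share = same_share(grid, r, c)
                # Unhappy agents move to a randomly chosen empty cell.
                if share is not None and share < wanted:
                    empties = [(i, j) for i in range(size)
                               for j in range(size) if grid[i][j] is None]
                    i, j = rng.choice(empties)
                    grid[i][j], grid[r][c] = grid[r][c], None
                    moved = True
        if not moved:
            break
    return before, mean_similarity(grid)


before, after = run_schelling()
print(f"mean share of same-group neighbours: {before:.2f} -> {after:.2f}")
```

Each agent only asks for 30% of like neighbours here, yet the average share of like neighbours ends up well above what anyone asked for, which is the explanatory point of the model.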
Applications of Data Science
There are many reasons other than prediction to do data science.
What are the core dynamics in my problem?
Where are the gaps? Where do I need to collect data?
What would happen if?
How can we best do monitoring and surveillance?
Illuminate core dynamics
“There is increasing evidence that social influence and social
network structures are significant factors in obesity.”
Eating
Exercising
Illuminate core dynamics
To what extent could social influences account for the dynamics of obesity?
Let’s tackle the question using modelling & simulation.
Illuminate core dynamics
Motivating question: to what extent is this model supported by interviewees?
Let’s tackle this question using interactive visualizations.
We measured the strength of a relationship between two factors as the number
of responses in the interviews that used words relevant to both factors.
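The measure described above can be sketched as a co-occurrence count. The factor names, word lists and interview responses below are invented for illustration, and matching is done on exact lower-case words (no stemming), which is a simplifying assumption.

```python
import re
from itertools import combinations


def relationship_strengths(responses, factor_words):
    """For each pair of factors, count the responses that use at least one
    word relevant to each of the two factors."""
    strengths = {pair: 0 for pair in combinations(sorted(factor_words), 2)}
    for response in responses:
        words = set(re.findall(r"[a-z']+", response.lower()))
        mentioned = sorted(
            factor for factor, vocabulary in factor_words.items()
            if words & vocabulary
        )
        for pair in combinations(mentioned, 2):
            strengths[pair] += 1
    return strengths


# Hypothetical vocabulary and responses.
factor_words = {
    "eating": {"eat", "food", "snack"},
    "exercise": {"gym", "run", "walk"},
    "friends": {"friend", "friends", "peers"},
}
responses = [
    "I eat more when my friends order food.",
    "My friends and I walk to the gym together.",
    "I snack after a run.",
    "Friends push me to eat out.",
    "Work keeps me busy.",
]
strengths = relationship_strengths(responses, factor_words)
for pair, count in strengths.items():
    print(pair, count)
```

The resulting counts can be used as edge weights when drawing the factors as a network.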
Explaining
Can we explain why people engage in binge drinking? Let’s start with
modelling and simulation, and make some hypotheses.
Structure
You select peers with
whom to drink…
Process
…and then, their drinking
habits influence yours.
Explaining
If we assume:
• that individuals select similar peers
• that individuals are prompted to drink if at least a given fraction of their peers drink
• that one’s context, as known from drinking motives, may deter or promote drinking
Then we can correctly infer the behaviour of half of the binge
drinkers and 4 out of 5 non-binge drinkers.
But if we made no assumptions ourselves and just used data mining, we would
get roughly the same accuracy. The computer would build an explanation for us.
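The three assumptions can be sketched as a small rule-based model. This is only an illustration of the idea, not the model behind the reported accuracy: the motive score in [-1, 1], the peer count `k`, the `fraction` threshold, the `context_weight` and the six individuals are all hypothetical.

```python
def select_peers(person, population, k=2):
    """Selection: the k other individuals with the closest motive score."""
    others = [p for p in population if p is not person]
    return sorted(others, key=lambda p: abs(p["motive"] - person["motive"]))[:k]


def infer_binge(person, population, fraction=0.5, context_weight=0.2, k=2):
    """Influence: drink if a large enough share of peers drink."""
    peers = select_peers(person, population, k)
    drinking_share = sum(p["binge"] for p in peers) / len(peers)
    # Context: a promoting motive (> 0) lowers the bar, a deterring one raises it.
    bar = fraction - context_weight * person["motive"]
    return drinking_share >= bar


# Hypothetical individuals with an observed behaviour to compare against.
population = [
    {"name": "a", "motive": 0.9, "binge": True},
    {"name": "b", "motive": 0.8, "binge": True},
    {"name": "c", "motive": 0.7, "binge": False},
    {"name": "d", "motive": -0.6, "binge": False},
    {"name": "e", "motive": -0.7, "binge": False},
    {"name": "f", "motive": -0.9, "binge": False},
]
inferred = [infer_binge(p, population) for p in population]
accuracy = sum(i == p["binge"] for i, p in zip(inferred, population)) / len(population)
print(inferred, round(accuracy, 2))
```

Comparing the inferred behaviour with the observed one, separately for binge and non-binge drinkers, gives the kind of accuracy figures quoted above.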
Monitoring
The situation might change as you are intervening.
How can you monitor changes and adapt?
March 2011: Emergence
Escalation
Early 2012: Militarisation
The model guides the analyst in the
exploration of the new data.
Visualizations allow the analyst to interactively
explore the data and improve the model.
There is a lot of potential in the tight coupling of techniques (e.g., modelling / interactive
visualizations) but currently you’d have to come up with a technical solution yourself for that.
Challenges
Needing to understand a very wide range of tools
Continuously needing to improve the tools
Visualization
Modelling & Simulation
Data mining & Machine Learning
Interdisciplinary: shock of cultures
Data science in the world
Getting good quality data
Defense
Health
Chronic diseases
Infectious diseases
Challenges – Need new tools
Challenges – Interdisciplinary
In my field, good papers are
published in conferences.
In my field, we just put data
on our website for others.
Why don’t I just pick a book
and learn your whole field?
In my field, good papers are
published in journals.
In my field, we own the
data and selectively share it.
Why don’t I just watch a couple
videos to learn your job?
We need to build mutual trust and accommodate
each other in a system that’s unsupportive.
Challenges – Getting good data
There is a lot of data out there. But
most is unstructured (text, video…)
and hard to deal with.
There are public repositories for data,
but much of what they hold is lists of junk,
locations, or population-level data
split at best by age and gender.
http://ukdataservice.ac.uk
http://data.gouv.fr
http://data.gov
http://adsfree.com
http://kaggle.com
Challenges – Getting good data
Kaggle
Get in touch? [email protected]
Investigator Scientist, University of Cambridge (@Addenbrooke’s)
Founder, Vancouver Computational Modelling
• PJ Giabbanelli. Modelling the spatial and social
dynamics of insurgency. Security Informatics ‘14
(Simulation & Modelling in Defense)
• Pratt, Giabbanelli & Mercier. Detecting unfolding
crises with visual analytics and conceptual maps:
emerging phenomena and big data. Proc of IEEE ISI ‘13
(Visual Analytics + Simulation & Modelling in Defense)
• Crutzen & Giabbanelli. Using classifiers to identify
binge drinkers based on drinking motives. Substance
use & misuse ‘14.
(Data mining in health)
• Giabbanelli et al. Modeling the influence of social
networks and environment on energy balance and
obesity. Journal of Computational Science ‘12.
(Simulation & Modelling in Health)