Data Analysis Overview


Data and Data Analysis
UNLOCKING THE SECRETS HIDDEN IN YOUR
DATA
PART 1
Data
What is Data?
 Data is information gathered from observation, experimentation or modeling
   Qualitative – not precise (usually descriptive)
   Quantitative – precise (usually numeric)
 The output of your model (i.e. number of healthy agents, number of infected agents, time…)
Data
How do we gather data?
 Data collection is the systematic recording of
information while changing Variables (a quantity
that may assume any given value or set of values).
 Collect the output (i.e. number of healthy agents,
number of infected agents, time…) while changing
the variables (number of devils, number initially
infected) of the model
Data
Why should we get data?
 To answer questions
 To develop understanding
 To validate experiments
What should we do with data?
 Display – usually graph it to make it easier to see
trends
 Analysis – use math skills to uncover patterns and
trends in data sets
 Interpretation – involves proposing possible explanations for those patterns and trends.
Extracting Data from StarlogoTNG
 There are three ways to extract data from StarlogoTNG
   Collect the data by hand
   Create a chart in Starlogo TNG and extract the data to Excel
   Create a table in Starlogo TNG and extract the data to Excel

Why Should We Display Data?
What did you see?
 Makes your data visible
 Helps find obvious patterns
 Does the data make sense?
   Are your assumptions correct?
   Did you collect enough data?
[Figure: Rabbit Population; Number of Rabbits vs. Time]
Why Should We Analyze Data?
What does it Mean?
 Is there more information in the data?
   emergent behavior
   unexpected patterns
 Was the hypothesis correct?
   More grass gives more rabbits
Why Does it Matter?
 Draw conclusions from data
 To help you answer questions
 Provide visible evidence and support for your conclusions to your audience (e.g. Challenge judges)
 Validity of model, experiment, theory, …
[Figure: Rabbit Population; Number of Rabbits vs. Time]
Ways to Analyze Data
 Plotting Data
   Ways to visually understand data
 Statistics
   Makes it easier to compare data
     Mean, Median, Mode
   Makes it clear if you have NOISY data
     Range, Variance, Standard Deviation
[Figure: Pink and Blue data series with Mean Pink and Mean Blue lines]
Ways to Analyze Data
 Derivatives (Slopes)
   Tell if changes in parameters affect data
   Parameter 2 has a greater effect than Parameter 1
   Get more information from data
[Figure: lines for Base Case, Parameter 1 and Parameter 2, labeled with slopes of 0.39, 0.16 and 0.08]
Collecting Data: Variable Sweeping
 Did you collect enough data?
   Did you vary the parameters throughout their ranges?
   If you have sliders (input variables) in your program, you need data for the full range of those sliders.
 Minimum of 3 runs for a single variable (low, medium, high)
 With more than one slider (variable), you must vary them separately.
 2 variables: perhaps 9 runs
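The "perhaps 9 runs" figure for two variables can be sketched in a few lines of Python: three settings (low, medium, high) for each of two sliders give 3 x 3 = 9 combinations. The slider names follow the Tasmanian Devils activity; the numeric values are assumptions chosen only for illustration.

```python
from itertools import product

# Hypothetical slider settings (assumed values, not taken from the model).
initial_population = [50, 100, 150]        # low, medium, high
initial_percent_infected = [10, 50, 90]    # low, medium, high

# Every combination of the two sliders: 3 x 3 = 9 settings.
sweep = list(product(initial_population, initial_percent_infected))

for number, (population, percent) in enumerate(sweep, start=1):
    print(f"setting {number}: population={population}, infected={percent}%")
```

Each printed line is one row to fill in on a data sheet; repeating each row several times gives the multiple runs needed for averaging.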
Collecting Data from Starlogo TNG
 Gathering Data by hand
   Tasmanian Devils
   Variable sweep
   More than one variable
   Multiple runs at each variable combination
   Average the data
Collecting Data from Starlogo TNG
 Let’s Do It
   Open Tasmanian Devil
   Run a section of the data sheet
   Do variable sweep
     Initial Population
     Initial Percent Infected
   Multiple runs at each set of variables
   Collect output in data sheet
     Number healthy after 200 ticks
Collecting Data from Starlogo TNG
 Put Data into Excel
 Calculate Averages
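For readers who want to check the Excel averages another way, here is a minimal Python sketch of the same step: group the runs by variable combination and average the output. The records below are made-up numbers, and the tuple layout is an assumption of this sketch.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hand-collected records (invented values):
# (initial population, initial percent infected, number healthy after 200 ticks)
runs = [
    (100, 10, 82), (100, 10, 78), (100, 10, 86),
    (100, 50, 41), (100, 50, 35), (100, 50, 44),
]

# Group the outputs by variable combination.
grouped = defaultdict(list)
for population, percent_infected, healthy in runs:
    grouped[(population, percent_infected)].append(healthy)

# One averaged value per combination, like a row of a summary table.
summary = {setting: mean(values) for setting, values in grouped.items()}

for (population, percent_infected), average in sorted(summary.items()):
    print(population, percent_infected, average)
```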
Collecting Data from Starlogo TNG
 Make a Summary Table
 Create XY Charts
Collecting Data from Starlogo TNG
 Make a 3D Chart
Plotting Data – Extracting from Starlogo TNG
LET’S DO IT – Tasmanian Devils !!

Data can be extracted from a graph or a table in Starlogo TNG
Create a graph using the line graph block

Put reset clock on Setup block to clear and reset graph

Plotting Data – Extracting from StarlogoTNG
LET’S DO IT – Tasmanian Devils !!

After program is run

Click on graph in Spaceland

Save File – Excel file
Data Analysis: Plotting Data – Types of Plots
All plots from http://www.statcan.ca
 Bar Charts – preferred snacks
 Pie Charts – music preference
Pets purchased at pet store
Data Analysis: Plotting Data – Types of Plots
All plots from http://www.statcan.ca
 XY Graphs – cell phone use
http://www.statcan.ca
 Scatter Plots
http://en.wikipedia.org/wiki/Scatterplot
Plotting Data – Activity in Excel
LET’S DO IT
 Open the Tasmanian Devil Export file (a csv file) by double clicking on the file
 In EXCEL - Insert Chart
 Select type of chart

XY Scatter
 Hit the Next button
Plotting Data – Activity in Excel
 Select Data Range
 Highlight data to be
plotted
Plotting Data – Activity in Excel
 Label each data series
 NEXT - Label Graph and Axis
Plotting Data – Activity in Excel
 Choose where you want the
graph to be
 Get your graph
[Figure: Tasmanian Devil Population; Population vs. Time, with Sick and Healthy series]
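The same export file can also be read outside Excel. The sketch below assumes a csv layout with the columns time, sick and healthy. These column names and the numbers are assumptions for illustration, so check the header row of your own export before using them.

```python
import csv
import io

# Stand-in for the exported file; column names and values are assumptions.
exported = io.StringIO(
    "time,sick,healthy\n"
    "0,10,290\n"
    "10,60,240\n"
    "20,150,140\n"
)

time, sick, healthy = [], [], []
for row in csv.DictReader(exported):
    time.append(float(row["time"]))
    sick.append(float(row["sick"]))
    healthy.append(float(row["healthy"]))

# Each (time, value) pair is one point of an XY scatter series.
sick_series = list(zip(time, sick))
healthy_series = list(zip(time, healthy))
print(sick_series)
print(healthy_series)
```

To read a real file, replace the `io.StringIO(...)` stand-in with `open("your_file.csv", newline="")`, where the file name is whatever you chose when saving.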
Plotting Data – Extracting from Netlogo
 Two ways
 1st Way: Write code to extract the data you want – see File Output Example in the Code Examples
   Open file in setup procedure
   Create a write-to-file procedure
Plotting Data – Extracting from Netlogo
 2nd way: Extract data from
Netlogo graphs

Have Netlogo generate graph on
Interface page (example on later
slide)
 Create a setup-plot procedure
and a do-plot procedure
 Call the setup-plot procedure
in setup procedure
 Call do-plot procedure in go
procedure
Plotting Data – Extracting from Netlogo
LET’S DO IT – Open Rabbits Grass Weeds
 Run model until sufficient data is obtained
 (PC) Right Click on Graph / (Mac) Control Click on Graph
 Select Export
 Choose location and file name, then select Save
 Excel file is created – Next Slide
   Contains all the information in the plot and input parameters used.
   Contains excess information about the plot (color, pen down, mode, interval…)
Plotting Data – Extracting from Netlogo
This is what
You need
Statistics
 Statistics help you
   Summarize data
   Describe data
   Analyze data
Hard to describe the difference between the two data sets
[Figure: Noisy and Noisier data series]
Now it is easy to summarize, describe and analyze the data….
The blue and the pink data have the same AVERAGE value (mean) but the blue data is “NOISIER” (greater standard deviation). Therefore…
[Figure: Noisy and Noisier data series with lines for Mean (both), Noisy + 2SD, Noisy - 2SD, Noisier + 2SD and Noisier - 2SD]
Statistics – How to Calculate in Excel
 +, -, *, / are used for addition, subtraction, multiplication and division.
 Each cell has a label based on the column and row.
 Use cells to perform calculations instead of numbers. Example: =(A4+B4)/C4
 Perform calculations on an entire column: copy and paste the equation. Warning: this changes the cell number for each line.
 Fix a specific cell: use the $ symbol, example =(A4+B4)/$C$1
 Excel has many built-in statistical functions
 Makes life easy!
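What copying a formula with a fixed cell does can be sketched in Python. The list positions stand in for spreadsheet rows 4, 5 and 6, and `c1` plays the role of $C$1. All values are assumptions chosen for illustration.

```python
# Hypothetical spreadsheet columns A and B (rows 4, 5, 6) and one fixed cell C1.
column_a = [2.0, 4.0, 6.0]
column_b = [8.0, 10.0, 12.0]
c1 = 10.0  # plays the role of $C$1: the same cell for every row

# Copying =(A4+B4)/$C$1 down the column:
# the A and B references move with the row, the fixed C1 reference does not.
results = [(a + b) / c1 for a, b in zip(column_a, column_b)]
print(results)
```

Without the $ symbols, the divisor would also move down one row per line (C4, C5, C6), which is the behavior the warning describes.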
Calculate in Excel Activity
 Open a blank spreadsheet in Excel
 Create 2 columns of numbers
 Then Add, Subtract, Multiply and Divide the first row
 Copy and paste the formulas
Statistics – Measurements of Central Tendency
Mean (Average), Median, and Mode
 Definitions
   Mean (Average) – Sum divided by the number of data points
   Median – Middle data point when arranged from highest to lowest
   Mode – Most frequent value
LET’S DO IT : StarlogoTNG : Fish and Plankton data
Netlogo : Rabbits and Grass data
 Use data set to calculate Mean (Average), Median, Mode, Max and Min
   Select Cell where you want the value of the function to appear
   Select Insert then Function
   Select Statistical
   Select function wanted (AVERAGE, MEDIAN, or MODE) then hit OK
   Select Range of data you want to analyze by clicking on range symbol and highlighting range. Hit enter or OK
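As a cross-check on the Excel results, the same five values can be computed with Python's standard `statistics` module. The data below is invented for illustration and is not output from either model.

```python
from statistics import mean, median, mode

# Hypothetical population counts (invented values).
counts = [10, 12, 12, 15, 21]

print("mean  ", mean(counts))    # sum divided by the number of data points
print("median", median(counts))  # middle value of the sorted data
print("mode  ", mode(counts))    # most frequent value
print("max   ", max(counts))
print("min   ", min(counts))
```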
Statistics – Measurements of Data Spread
Range, Variance and Standard Deviation
 Definitions
   Range = maximum – minimum
   Variance = measures noise of the data around the mean value (the average squared distance from the mean).
   Standard Deviation (S) is the square root of the variance. Most commonly used measure of spread (same units as the data). Another reason to use S, for data with a roughly bell-shaped spread:
     ~68% of the data are in the interval Mean – S to Mean + S
     ~95% of the data are in the interval Mean – 2 S to Mean + 2 S
     ~99.7% of the data are in the interval Mean – 3 S to Mean + 3 S
 EXCEL does it for you!!!
LET’S DO IT : StarlogoTNG : Fish and Plankton data
Netlogo : Rabbits and Grass data
[Figure: Rabbit Population; Number of Rabbits vs. Ticks, with lines for Rabbits, Mean, Mean – 2 S and Mean + 2 S]
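The interval percentages can be checked on a made-up data set. The sketch below generates bell-shaped scatter with a fixed seed (an assumption of this example; real model output may be distributed differently) and counts how much of it falls within 1, 2 and 3 standard deviations of the mean. `statistics.stdev` and `statistics.variance` divide by n - 1, as Excel's STDEV and VAR do.

```python
import random
from statistics import mean, stdev, variance

random.seed(1)  # fixed seed so the sketch is repeatable
# Hypothetical noisy data: bell-shaped scatter around 100 (invented values).
data = [random.gauss(100, 15) for _ in range(1000)]

m = mean(data)
s = stdev(data)  # sample standard deviation (divides by n - 1)

def fraction_within(k):
    """Fraction of data points between mean - k*S and mean + k*S."""
    inside = [x for x in data if m - k * s <= x <= m + k * s]
    return len(inside) / len(data)

print("range   ", max(data) - min(data))
print("variance", variance(data))
print("S       ", s)
for k in (1, 2, 3):
    print(f"within {k} S:", fraction_within(k))
```

The printed fractions should land near, not exactly on, 68%, 95% and 99.7%, because any finite sample has its own scatter.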
Derivatives
 What are Derivatives?
   A simple calculation using data
   Instantaneous rate of change = SLOPE
 Why use Derivatives?
   Get more information from data
   More ways to compare data
 Car moving down a road
   Data = the distance traveled
   Velocity = the 1st derivative of distance
   Acceleration = the 2nd derivative of distance = the 1st derivative of velocity
[Figure: three panels plotted against Time: Distance; Velocity (slope of distance); Acceleration (slope of velocity)]
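The car example can be worked through with the "simple calculation" the slide mentions: the slope between neighbouring data points. In this sketch the distance data is an assumption (distance = time squared), chosen so the answers are easy to check by hand, and `slopes` is a helper name introduced here.

```python
def slopes(times, values):
    """Slope between each pair of neighbouring data points."""
    return [
        (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        for i in range(len(values) - 1)
    ]

# Hypothetical car data: distance = time squared (assumed for illustration).
time = [0, 1, 2, 3, 4, 5]
distance = [t * t for t in time]

# Velocity = slope of distance (1st derivative of distance).
velocity = slopes(time, distance)

# Each velocity value belongs to the midpoint of its time interval.
mid_time = [(time[i] + time[i + 1]) / 2 for i in range(len(time) - 1)]

# Acceleration = slope of velocity (2nd derivative of distance).
acceleration = slopes(mid_time, velocity)

print("velocity    ", velocity)
print("acceleration", acceleration)
```

Note that each slope calculation returns one fewer value than it was given, so six distance points yield five velocities and four accelerations.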
A Note on Randomness
 This data is not RANDOM
 Random means that there is an equal probability of getting each outcome (like rolling a die)
 There is scatter in the data but it is not random
[Figure: Noisy and Noisier data series]
Other Things to Think About
 Is there “scatter” in your model?
   Evaluate how the “scatter” affects your results – repeat model runs
   Make sure you get enough data to get good statistics
 Did you collect enough data?
   Did you let the model run long enough? Has the model reached “equilibrium”?
[Figure: two Rabbit Population plots of Number of Rabbits vs. Ticks, one running to 60 ticks and one running to 1000 ticks]
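One rough way to ask whether a run has gone on long enough is to compare the average of the last stretch of data with the average of the stretch before it. This is only a heuristic sketch: the function name, window size and tolerance are assumptions, and a population that oscillates may need a window at least as long as one full cycle.

```python
from statistics import mean

def looks_settled(values, window=20, tolerance=0.05):
    """Heuristic (an assumption of this sketch): compare the averages of the
    last two windows and call the run settled if they differ by less than
    `tolerance`, measured as a fraction of the later average."""
    if len(values) < 2 * window:
        return False  # not enough data to judge
    earlier = mean(values[-2 * window:-window])
    later = mean(values[-window:])
    if later == 0:
        return earlier == 0
    return abs(later - earlier) / abs(later) < tolerance

# Hypothetical runs: one stopped while still growing, one that has leveled off.
short_run = [5 * t for t in range(60)]
long_run = [min(5 * t, 250) for t in range(1000)]

print("short run settled?", looks_settled(short_run))
print("long run settled? ", looks_settled(long_run))
```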