Feature Engineering Studio
September 23, 2013
Welcome to
Mucking Around Day
Sort into pairs
• Partner with the person next to you
• One group of 3 is allowed
Sort into pairs
• Do we have a group of 3?
• One of the 3 will work with me
Sort into pairs
• Go over your reports together
– A maximum of 5 minutes apiece
5 minutes for first person
5 minutes for second person
Re-assemble into one big group
Who here found something really cool
while mucking around?
• Show us, tell us
Who here found a histogram with a
normal distribution?
• Show us, tell us
Who here found a histogram with a
hypermode?
• Show us, tell us
Who here found a histogram with a
flat distribution?
• Show us, tell us
Who here found a histogram with a
skewed distribution?
• Show us, tell us
Who here found a histogram with a
bimodal distribution?
• Show us, tell us
Who here found a histogram with
something else interesting?
• Show us, tell us
Who here found something surprising
with their min, max, average, stdev?
Categorical variables
• Who here found something curious, weird, or
interesting in the distribution of their
categorical variables?
Who here hasn’t spoken yet?
(and analyzed data)
• Tell us something interesting you found in
your data
Who here played with pivot tables?
• What did you learn?
My turn to play with pivot tables
• Who wants to volunteer their data?
• (I might request a 2nd or 3rd data set,
depending on how the 1st one goes)
Who here played with vlookup?
• What did you learn?
My turn to play with vlookup
• Using the same volunteered data set(s)
Other cool things you can create with
a few simple formulas (plus demos!)
Identifying specific cases of interest
Did event of interest ever occur for
student?
Counts-so-far
(and total value for student)
Counts-last-N-actions
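A minimal Python sketch of the counts-so-far and counts-last-N-actions ideas, for anyone who wants to check their Excel formulas against code. The log layout (student, action type, already sorted by time), the function name `add_count_features`, and the choice to count only actions before the current row are assumptions made for illustration, not part of the slides.

```python
from collections import defaultdict, deque

def add_count_features(actions, event="help", n=3):
    """actions: list of (student_id, action_type) tuples in time order.

    Returns one dict per action with two features for that student:
    how many times `event` occurred before this action (counts-so-far)
    and how many times it occurred in the previous n actions.
    """
    so_far = defaultdict(int)                       # running count per student
    recent = defaultdict(lambda: deque(maxlen=n))   # last n 0/1 flags per student
    rows = []
    for student, action in actions:
        rows.append({
            "student": student,
            "action": action,
            "count_so_far": so_far[student],
            "count_last_n": sum(recent[student]),
        })
        # Update the running state only after the row is written,
        # so the current action is never counted in its own features.
        is_event = int(action == event)
        so_far[student] += is_event
        recent[student].append(is_event)
    return rows

# Hypothetical toy log.
log = [("s1", "attempt"), ("s1", "help"), ("s1", "help"), ("s2", "help"),
       ("s1", "attempt"), ("s1", "attempt"), ("s1", "help")]
rows = add_count_features(log, event="help", n=2)
for row in rows:
    print(row)
```

Excluding the current action is a design choice: it keeps a feature from trivially encoding the row it describes. Including it is equally easy if that matches your spreadsheet.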
First attempts
Ratios between events of interest
How many students had 3 (or 4, 5,
2,…) of an event
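The two student-level questions above ("did the event ever occur?" and "how many students had k of an event?") can be sketched in Python as follows. The log layout and the function name `students_per_event_count` are hypothetical, chosen only for this example.

```python
from collections import Counter

def students_per_event_count(actions, event):
    """actions: list of (student_id, action_type) tuples.

    Returns (per_student, distribution) where per_student maps each
    student to their number of `event` actions (0 if none), and
    distribution maps a count k to the number of students with exactly k.
    """
    per_student = Counter()
    for student, action in actions:
        # Adding 0 still registers the student, so students who never
        # produced the event appear with a count of 0.
        per_student[student] += int(action == event)
    distribution = Counter(per_student.values())
    return per_student, distribution

# Hypothetical toy log.
log = [("s1", "help"), ("s1", "help"), ("s2", "attempt"),
       ("s3", "help"), ("s3", "attempt"), ("s3", "help"), ("s4", "help")]
per_student, distribution = students_per_event_count(log, "help")

# "Did the event ever occur for this student?" as a 0/1 feature.
ever_occurred = {s: int(c > 0) for s, c in per_student.items()}
print(per_student, distribution, ever_occurred)
```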
Times-so-far
Cutoff-based features
Unitized actions (such as unitized time)
Last 3 or 5 unitized
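A rough Python sketch of unitized time. It assumes that "unitized" means expressing each action's time in standard deviations above or below the mean time for that same step across all students; the slides do not define the term, so treat that reading, the log layout, and the name `unitize_times` as assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

def unitize_times(actions):
    """actions: list of (student_id, step_id, seconds) tuples.

    Returns one value per action: (seconds - step mean) / step SD,
    using the population SD of all times recorded for that step.
    Steps with zero variance (e.g. a single observation) get 0.0.
    """
    by_step = defaultdict(list)
    for _, step, seconds in actions:
        by_step[step].append(seconds)
    stats = {step: (mean(v), pstdev(v)) for step, v in by_step.items()}

    unitized = []
    for _, step, seconds in actions:
        mu, sd = stats[step]
        unitized.append((seconds - mu) / sd if sd > 0 else 0.0)
    return unitized

# Hypothetical toy log.
log = [("s1", "step1", 10.0), ("s2", "step1", 20.0),
       ("s3", "step1", 30.0), ("s1", "step2", 5.0)]
z = unitize_times(log)
print(z)
```

A "last 3 or 5 unitized" feature would then be a rolling sum or mean over each student's most recent unitized values, built the same way as a counts-last-N feature.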
Comparing earlier behaviors to later
behaviors through caching
Counts-if
Percentages of action type
Percentages of time spent per
action/location/KC/etc.
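The two percentage features above can be sketched in Python like this. The log layout (student, action type, seconds) and the function name `percentages_by_action` are hypothetical; the same grouping works for location, KC, or any other category column.

```python
from collections import defaultdict

def percentages_by_action(actions):
    """actions: list of (student_id, action_type, seconds) tuples.

    Returns {student: {action_type: (share_of_actions, share_of_time)}},
    with both shares expressed as fractions between 0 and 1.
    """
    counts = defaultdict(lambda: defaultdict(int))
    times = defaultdict(lambda: defaultdict(float))
    for student, action, seconds in actions:
        counts[student][action] += 1
        times[student][action] += seconds

    result = {}
    for student in counts:
        n_total = sum(counts[student].values())
        t_total = sum(times[student].values())
        result[student] = {
            a: (counts[student][a] / n_total,
                times[student][a] / t_total if t_total else 0.0)
            for a in counts[student]
        }
    return result

# Hypothetical toy log.
log = [("s1", "attempt", 10.0), ("s1", "help", 30.0),
       ("s1", "attempt", 20.0), ("s1", "attempt", 20.0)]
pct = percentages_by_action(log)
print(pct)
```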
Questions? Comments?
Other cool ideas?
Assignment 3
• Feature Engineering 1
“Bring Me a Rock”
• Get your data set
• Open it in Excel
• Create as many features as you feel inspired to create
– Features should be created with the goal of predicting your ground
truth variable
– At least 12 separate features that are not just variations on a theme
(e.g. “time for last 3 actions” and “time for last 4 actions” are
variations on a theme; but “time for last 3 actions” and “total time
between help requests and next action” are two separate features)
• For each feature, write a 1-3 sentence “just so story” for why it
might work
• Test how good each feature is
Testing Feature Goodness
• For this assignment, there are a bunch of ways to test
feature goodness
• Single-feature prediction models in a data mining or stats
package, giving correlation or kappa (special session
this Wednesday)
• Compute correlation in Excel (want to see?)
– You can do this with binary variables too, although it’s
not really optimal
• Compute t-test in Excel (want to see?)
• Compute kappa in Excel (if you don’t know how, easier
to do in RapidMiner)
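For those who would rather check their Excel results in code, here is a minimal Python sketch of two of the measures above: Pearson correlation between a feature and the ground truth, and Cohen's kappa between a binarized prediction and the ground truth. The function names and the toy data are made up for illustration, and neither function guards against degenerate input (a constant column, or chance agreement of exactly 1).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def cohens_kappa(pred, truth):
    """Cohen's kappa: agreement between two label lists, corrected for
    the agreement expected by chance from each list's label frequencies."""
    n = len(pred)
    labels = set(pred) | set(truth)
    observed = sum(p == t for p, t in zip(pred, truth)) / n
    expected = sum((pred.count(l) / n) * (truth.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical toy data: one numeric feature, one binary ground truth,
# and one binary prediction derived from some cutoff on the feature.
feature = [1, 2, 3, 4, 5]
truth = [0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 0]

r = pearson_r(feature, truth)
k = cohens_kappa(pred, truth)
print(round(r, 3), round(k, 3))
```

Correlating a numeric feature with a 0/1 ground truth, as done here, is the binary-variable case mentioned above: it runs, but it is not the ideal measure.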
Were you right?
• Which of your “just so stories” seem to be
correct?
• Did any of your features correlate in the
opposite direction from what you expected?
Assignment 3
• Write a brief report for me
• Email me an Excel sheet with your features
• You don’t need to prepare a presentation
• But be ready to discuss your features in class
Next Classes
• 9/25 Special Session
– Using RapidMiner to Produce Prediction Models
– Come to this if you’ve never built a classifier or
regressor in RapidMiner (or a similar tool)
– Statistical significance tests using linear regression
don’t count…
• 9/30 Advanced Feature Distillation in Excel
– Assignment 3 due
– Online Equation Solver Tutorials should be in your
INBOX
Upcoming Classes
• 10/2 Special session on prediction models
– Come to this if you don’t know why student-level
cross-validation is important, or if you don’t know
what J48 is
• 10/7 Advanced Feature Distillation in Google
Refine
• 10/9 Special session? TBD.