Feature Engineering Studio

Feature Engineering Studio
September 9, 2013
Welcome to
Feature Engineering Studio
• Design studio-style course teaching how to
distill and engineer features for data mining
What We’ll Cover
• The process of feature engineering and
distillation
– brainstorming features
– deciding what features to create
– criteria for selecting features
– actually creating the features
– studying the impact of features on model
goodness
Why?
• Feature engineering is the most important,
and least well-studied part of the process of
developing prediction models
• It is an art, it is human-driven design
• It involves lore rather than well-known and
validated principles
• It is hard! (But fun, and important)
Why?
• It’s well known in data mining (and statistics
for that matter) that your model will never be
any good if your features (predictors) aren’t
very good
The Big Idea
• How can we take the voluminous, ill-formed,
and yet under-specified data that we now
have in education
• And shape it into a reasonable set of variables
• In an efficient, effective, and predictive way?
Tools We’ll Use
• Excel
• Java
• Google Refine
• EDM Workbench
• RapidMiner
• Other relevant tools (TBD/your choice)
Course times
• Monday 11am-12:40pm
• Wednesday 11am-12:40pm special sessions
Course Prerequisite
• Core Methods in Educational Data Mining
• Or instructor approval
• I will approve anyone who has at least a little
bit of background building prediction models
or similar statistical models
– Talk to me after class, during my office hours, or
by appointment
That said…
• If you haven’t had experience building prediction
models in RapidMiner or a similar tool, then
you’ll need to learn
• I will be using the first few Wednesday sessions to
help students catch up if they don’t have
experience with this paradigm or these tools
• You can definitely catch up
Who here?
• Took or audited my Core Methods course?
• Has built a prediction model using a
classification algorithm and cross-validation?
• Has built a regression model in a stats package
using stepwise regression?
• Has run a regression in a stats package?
• Has built any kind of mathematical model?
How this class works
• Lots of assignments (13)
– They can’t be late, because we will discuss them in
class
– 3 of 12 regular assignments can be missed
without penalty, but not the final presentation
(13)
• Not many required readings
• Essential to participate in critique and class
discussions
Who here?
• Has had a design studio style course before?
This is not…
• A lecture class
• A reading discussion seminar
This is…
• A class where you will be working on a project
of your own choosing the whole semester
• A class where you’ll get, and give, a lot of
constructive criticism
• A class where we will hopefully have fun too,
to keep the mind flowing
The semester project
• You will build a prediction model
• If you have your own data set, and research
question – perfect!
• If you don’t have your own data set, and
research question – no worries! I will help you
find one!
Assignments
1. Problem Proposal
2. Mucking Around
3. Bring Me a Rock
4. Bring Me Another Rock
5. Standing on the Shoulders of Giants
6. Ideation
Assignments
7. One Who Visions Must Be Steeped in Data
8. Keep Running!
9. The Fresh Mind
10. This One’s For Nikolai Ivanovich Lobachevsky!
11. The Slog
12. Son of Slog
13. Final Project Presentation
Upcoming Classes
• 9/11 Special session on data set finding
– Come to this if you don’t have a data set in mind
• 9/16 Problem proposal (Asgn. 1 due)
• 9/23 Feature distillation in Excel (Asgn. 2 due)
• 9/25 Special session on prediction models
– Come to this if you don’t know why student-level
cross-validation is important, or if you don’t know what J48 is
• 9/30 Advanced feature distillation in Excel (Asgn. 3 due)
• 10/2 Special session on RapidMiner
– Come to this if you’ve never built a classifier or regressor in
RapidMiner (or a similar tool)
– Statistical significance tests using linear regression don’t
count…
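A rough illustration of the student-level cross-validation mentioned for the 9/25 session: all rows from one student are kept in the same fold, so a model is never tested on a student it was trained on. This is only a sketch in plain Python. The function names `student_level_folds` and `train_test_splits` and the toy student IDs are made up for illustration and are not from the course materials.

```python
def student_level_folds(student_ids, n_folds):
    """Assign each row to a fold so that all rows from one student share a fold.

    student_ids: one student identifier per data row.
    Returns a list of fold indices (0..n_folds-1), one per row.
    """
    # Keep first-seen order of students so the assignment is deterministic.
    unique_students = list(dict.fromkeys(student_ids))
    if n_folds < 2 or n_folds > len(unique_students):
        raise ValueError("n_folds must be between 2 and the number of students")
    # Deal out students (not rows) across the folds in round-robin order.
    fold_of_student = {s: i % n_folds for i, s in enumerate(unique_students)}
    return [fold_of_student[s] for s in student_ids]


def train_test_splits(student_ids, n_folds):
    """Yield (train_rows, test_rows) lists of row indices, one pair per fold."""
    folds = student_level_folds(student_ids, n_folds)
    for k in range(n_folds):
        test = [i for i, f in enumerate(folds) if f == k]
        train = [i for i, f in enumerate(folds) if f != k]
        yield train, test


# Toy data: six rows from three hypothetical students.
rows = ["s1", "s1", "s2", "s3", "s2", "s3"]
for train, test in train_test_splits(rows, 3):
    train_students = {rows[i] for i in train}
    test_students = {rows[i] for i in test}
    # No student appears on both sides of the split.
    assert train_students.isdisjoint(test_students)
```

Splitting by row instead would let the same student's actions land in both training and test data, which tends to make a model look better than it will be on new students.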
Assignment One
• Problem Proposal
– Due next Monday
• Be ready to talk for 3 minutes on:
– A data set
• Give where it came from and how big it is
• You need to already have this data set, or be able to acquire
it in the next two weeks
– A prediction model you will build in this data set
– What variable will you predict?
– What kind of variables will you use to predict it?
– Why is this worth doing?
Example
(Pardos et al., 2013)
• Data set
– ASSISTments system, formative assessment and
learning software for math used by 60k students a
year (Razzaq et al., 2007)
– 810,000 data points from 229 students studied
– Student actions in the software have been overlaid
with synchronized field codes of student affect
(boredom, frustration, etc.)
• 3075 field codes
• Each field code connects to 20 seconds of log file actions
Example
(Pardos et al., 2013)
• We will predict whether a student is bored at a
specific time
– So that we can replicate the human judgments
without needing a field observer
• We will predict this from what was going on in
the log files at the time the field observation was
made
– We know every student action’s correctness, timing,
relevant skill, and probability they knew the skill
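To make the idea of predicting from "what was going on in the log files" concrete, here is a minimal sketch of distilling a few features from the 20 seconds of actions tied to one field observation. It is not the feature set from Pardos et al. (2013). The dictionary keys (`time`, `correct`, `duration`, `p_know`), the choice of features, and the assumption that the window is the 20 seconds ending at the observation time are all hypothetical.

```python
from statistics import mean

CLIP_SECONDS = 20  # from the slides: each field code connects to 20 seconds of actions


def clip_features(actions, obs_time, clip_seconds=CLIP_SECONDS):
    """Summarize the log actions in the window ending at obs_time.

    actions: list of dicts with hypothetical keys
        "time" (seconds), "correct" (bool), "duration" (seconds), "p_know" (0..1).
    Returns a dict of features, or None if the window holds no actions.
    """
    start = obs_time - clip_seconds
    # Assumption: the clip covers (start, obs_time]; the real alignment may differ.
    clip = [a for a in actions if start < a["time"] <= obs_time]
    if not clip:
        return None
    return {
        "n_actions": len(clip),
        "pct_correct": mean(1.0 if a["correct"] else 0.0 for a in clip),
        "mean_duration": mean(a["duration"] for a in clip),
        "min_p_know": min(a["p_know"] for a in clip),
    }


# Made-up log for one student; the observation happens at t = 35 seconds.
log = [
    {"time": 5, "correct": True, "duration": 4.0, "p_know": 0.6},
    {"time": 18, "correct": False, "duration": 10.0, "p_know": 0.3},
    {"time": 30, "correct": True, "duration": 6.0, "p_know": 0.5},
]
features = clip_features(log, obs_time=35)
```

Each field observation would yield one such row of features, paired with the observer's affect code as the label to predict.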
Example
(Pardos et al., 2013)
• This is worth doing because boredom is
known to predict student learning (Craig et al.,
2004; Rodrigo et al., 2009; Pekrun et al., 2010)
• And building a detector will help us study
boredom more thoroughly
• As well as enabling us to intervene on
boredom in real time
Important Considerations
• Is the problem genuinely important? (usable
or publishable)
• Is there a good measure of ground truth? (the
variable you want to predict)
• Do we have rich enough data to distill
meaningful features?
• Is there enough data to be able to take
advantage of data mining?
You don’t need to be able to answer
these questions in a week
• Think about them
• Think about your problem
• Email me or come to my office hours
(or set up an appointment)
• Bring it to class
• We’ll discuss it in class
• No idea is perfect right from the start!
Be ready to answer questions
• Be ready to ask questions too…
No data ready at hand?
• Come to this Wednesday’s session and we will
find you data!