Giga-Mining
Corinna Cortes and Daryl Pregibon
AT&T Labs-Research
Presented by:
Kevin R. Gee
28 October 1999
Case Study
Statistical modeling
Processing of multi-GB databases
Data warehousing
Prediction and classification
User interfaces
Three Goals
Perform meaningful mining on multiple GB of data daily
Classify telephone numbers as business or
residential (pattern deviation, etc.)
Maintain operational data for each phone
number.
Quantity of data
1997: 275 million phone calls per weekday
-- total of 76 billion for whole year
65M unique TNs per weekday
350M unique TNs over a 40-day period
“Universe list”: Set of all TNs observed on
network, each with a 7-byte profile
Contents of each profile
Inactivity -- number of days since TN used
Minutes of use -- average daily minutes TN
is observed on network
Frequency -- estimated number of days
between observing a TN
“Bizocity” -- Business-like behavior of TN
Stored for inbound/outbound, toll/toll-free
Calculation of each variable
Inactivity: Set to 0 if the TN is observed that day;
otherwise incremented by one (Inactivity++).
Other variables are calculated via an
exponential weighted average:
X(TN)_new = λ X(TN)_today + (1 - λ) X(TN)_old,
0 < λ < 1
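The two update rules above can be sketched in Python. This is a minimal sketch, not the authors' implementation: the `Profile` class, the `update_profile` function, and the value of `LAMBDA` are hypothetical names and numbers chosen for illustration. Leaving the smoothed value unchanged on days when the TN is not observed is an assumption, based on the later statement that inactive TNs are not updated.

```python
from dataclasses import dataclass

LAMBDA = 0.15  # hypothetical aging factor; the slides only state 0 < lambda < 1


@dataclass
class Profile:
    inactivity: int = 0   # days since the TN was last observed
    minutes: float = 0.0  # smoothed daily minutes of use


def update_profile(profile, minutes_today=None, lam=LAMBDA):
    """Apply one day's update; minutes_today is None if the TN was not observed."""
    if minutes_today is None:
        # Not observed: only the inactivity counter changes (assumption).
        profile.inactivity += 1
        return profile
    # Observed: reset inactivity and blend today's value into the old one.
    profile.inactivity = 0
    profile.minutes = lam * minutes_today + (1 - lam) * profile.minutes
    return profile
```

The same blending step would apply to the other smoothed variables (frequency, bizocity); only minutes of use is shown to keep the sketch short.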
Aging factor λ
Provides for estimate as a weighted sum of all
previous daily values, where weights decrease
smoothly over time.
The most recent day’s activity is weighted higher
than activity from 2 weeks ago.
Weight of a call k days ago is w_k = (1-λ)^k λ
Old data is “aged out” as new data is “blended
in”
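A short Python check may help show why the daily update and the weighted-sum view agree. The function names and the sample values are hypothetical; the starting value is taken as 0 so that no weight is left on history before the first day.

```python
def ewma_weights(lam, days):
    """Weight on the value observed k days ago, for k = 0 .. days-1."""
    return [(1 - lam) ** k * lam for k in range(days)]


def ewma_recursive(values, lam, start=0.0):
    """Run the daily update over values ordered oldest to newest."""
    x = start
    for v in values:
        x = lam * v + (1 - lam) * x
    return x


lam = 0.2                      # hypothetical aging factor
values = [3.0, 0.0, 5.0, 2.0]  # hypothetical daily values, oldest first
weights = ewma_weights(lam, len(values))

# The newest value pairs with weights[0], the oldest with weights[-1].
weighted_sum = sum(w * v for w, v in zip(weights, reversed(values)))
assert abs(weighted_sum - ewma_recursive(values, lam)) < 1e-12
```

The weights form a geometric series, so over a long history they sum to nearly 1, and a larger λ makes old data age out faster.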
“Bizocity”
Concerns over whether a TN is residential
or business.
Different operations for residences and
businesses for customer care, billing,
collections, fraud detection, etc.
“Bizocity” continued
AT&T has confirmed residential/business
status for 30% of 350M TNs.
Incomplete data is due to lack of
communication with local companies,
additional lines, out of date information.
Behavioral estimate is generated by
observing behavior of all 350M TNs,
generating a bizocity score, and combining
it with previous days’ totals.
Generating “Bizocity”
When a call completes, data such as
originating TN, dialed TN, connect time,
and call duration are recorded (note that
callers are not identified, just phone numbers).
Those with known biz/res status are
flagged, and training sets are generated.
Noise and outliers are usually eliminated by
the volume of data.
Generating “Bizocity” -- examples
Example: Long calls originating at night
are usually residential, not business.
Example: Residential calls peak in the evening;
business calls peak between 9am and 5pm.
Example: Business calls are generally
shorter, call other businesses, or call 800
services.
Processed every 24 hours
Provides better aggregate data for each TN
Reduces I/O by 75%
Have to store all call details and sort them.
Each call is reduced to a 32-byte binary
record, resulting in 8GB daily.
Sorting takes 30 min. (3GB RAM, 1
processor)
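The fixed-width record and the sort step can be sketched as follows. The slides give only the 32-byte size, so the field layout here (two 8-byte TNs, three 4-byte fields, 4 bytes of padding) is a hypothetical one, as are the function names. Sorting by originating TN and then connect time is also an assumption, chosen so that all calls for one TN end up adjacent.

```python
import struct

# Hypothetical layout: originating TN, dialed TN, connect time, duration, flags.
# '<' = little-endian with no implicit padding; q = 8 bytes, I = 4 bytes, 4x = 4 pad bytes.
CALL_RECORD = struct.Struct("<qqIII4x")
assert CALL_RECORD.size == 32


def pack_call(orig_tn, dialed_tn, connect_time, duration_s, flags=0):
    """Reduce one call to a 32-byte binary record."""
    return CALL_RECORD.pack(orig_tn, dialed_tn, connect_time, duration_s, flags)


def sort_calls(buffer):
    """Unpack fixed-width records and sort by originating TN, then connect time."""
    records = [
        CALL_RECORD.unpack_from(buffer, offset)
        for offset in range(0, len(buffer), CALL_RECORD.size)
    ]
    records.sort(key=lambda r: (r[0], r[2]))
    return records


# Size check against the slides: 275 million calls per weekday at 32 bytes each.
CALLS_PER_WEEKDAY = 275_000_000
daily_bytes = CALLS_PER_WEEKDAY * CALL_RECORD.size  # 8.8e9 bytes, in line with ~8GB daily
```

An in-memory list sort is used only for illustration; at the volumes quoted, a production sort would work on the packed records directly.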
Processing -- continued
A 4-D data cube is generated
Dimensions are day-of-week, time-of-day,
duration, and biz/res/800 status (7x6x5x3)
Have previously developed logistic
regression models for scoring TNs based on
each profile (to estimate “Bizocity”)
Biz(TN)_new = λ Biz(TN)_today + (1 - λ) Biz(TN)_old,
0 < λ < 1
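One way to picture the processing on this slide is the Python sketch below. Only the cube shape (7x6x5x3) and the blending formula come from the slides. The bin boundaries (six 4-hour time-of-day blocks, the duration cut points), the use of cell proportions as model inputs, and all function names are assumptions made for illustration; real model weights would come from the fitted logistic regression.

```python
import math

# Cube shape from the slides: day-of-week x time-of-day x duration x biz/res/800 status.
SHAPE = (7, 6, 5, 3)
DURATION_EDGES_S = (60, 300, 900, 1800)  # hypothetical cut points giving 5 duration bins
STATUS = {"biz": 0, "res": 1, "800": 2}


def cell_index(weekday, hour, duration_s, status):
    """Map one call to a flat index into the 7*6*5*3 = 630-cell cube."""
    tod = hour // 4  # assumed: six 4-hour blocks
    dur = sum(duration_s >= edge for edge in DURATION_EDGES_S)
    st = STATUS[status]
    return ((weekday * SHAPE[1] + tod) * SHAPE[2] + dur) * SHAPE[3] + st


def build_cube(calls):
    """Count calls per cell; each call is (weekday, hour, duration_s, status)."""
    cube = [0] * (SHAPE[0] * SHAPE[1] * SHAPE[2] * SHAPE[3])
    for call in calls:
        cube[cell_index(*call)] += 1
    return cube


def bizocity_today(cube, weights, bias=0.0):
    """Logistic score computed from cell proportions (assumed feature choice)."""
    total = sum(cube) or 1
    z = bias + sum(w * c / total for w, c in zip(weights, cube))
    return 1.0 / (1.0 + math.exp(-z))


def blend(biz_old, biz_today, lam):
    """Biz(TN)_new = lam * Biz(TN)_today + (1 - lam) * Biz(TN)_old."""
    return lam * biz_today + (1 - lam) * biz_old
```

With all weights at zero the score is 0.5, i.e. no evidence either way; a fitted model would push the score toward 1 for business-like cells and toward 0 for residential-like ones.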
Processing -- continued
Training set is used to classify TNs with
unknown status based on probabilities
Inactive TNs are not updated
“Bizocity” scores for unknown TNs are
generated using probabilities
Accuracy
Accuracy of prediction of status is 75%
Failures due to incorrectly provided status
or shifting status (e.g., home businesses, cell
phones)
Data Structures
Exploit the “exchange” concept (the first 6 digits
of a TN form an exchange)
Only about 150,000 of 1M exchanges are in
use
All 10,000 TNs for each exchange are
stored sequentially, whether used or not
Each data structure is 2GB for each variable
(lower bound is 1.5GB)
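The exchange-based layout can be sketched in Python as below. The `ExchangeStore` class and its methods are hypothetical names. Storing one byte per TN per variable is an inference from the stated 1.5GB lower bound (150,000 exchanges times 10,000 TNs), not something the slides state directly.

```python
TNS_PER_EXCHANGE = 10_000


class ExchangeStore:
    """One byte per TN for a single profile variable, grouped by exchange."""

    def __init__(self, exchanges):
        # Exchange (first 6 digits) -> block number; only exchanges in use get a block.
        self.block_of = {ex: i for i, ex in enumerate(sorted(exchanges))}
        self.data = bytearray(len(self.block_of) * TNS_PER_EXCHANGE)

    def _offset(self, tn):
        # Split a 10-digit TN into its exchange and its last 4 digits.
        exchange, line = divmod(tn, TNS_PER_EXCHANGE)
        return self.block_of[exchange] * TNS_PER_EXCHANGE + line

    def get(self, tn):
        return self.data[self._offset(tn)]

    def set(self, tn, value):
        self.data[self._offset(tn)] = value


# Size check: 150,000 exchanges in use * 10,000 TNs * 1 byte = 1.5e9 bytes.
lower_bound_bytes = 150_000 * TNS_PER_EXCHANGE
```

Because every TN in an exchange has a fixed slot, a lookup is one dictionary access plus an array index, and no per-TN key needs to be stored.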
Interface
Variety of visualization tools (start at top,
drill-down)
Web interface with password protection
Images are computed on the fly
C-code directly computes images in gif
format
Toll Fraud Detection
Same methodology, but event-driven
Only have to track about 15M TNs.
Profiles are about 512 bytes each (7.5GB)