Giga-Mining
Corinna Cortes and Daryl Pregibon
AT&T Labs-Research
Presented by:
Kevin R. Gee
28 October 1999

Case Study
Statistical modeling
 Processing of multi-GB databases
 Data warehousing
 Prediction and classification
 User interfaces

Three Goals
Perform meaningful mining on multiple GB of data daily
 Classify telephone numbers as business or
residential (pattern deviation, etc.)
 Maintain operational data for each phone
number.

Quantity of data
1997: 275 million phone calls per week day
-- total of 76 billion for whole year
 65M unique TNs per weekday
 350M unique TNs over a 40-day period
 “Universe list”: Set of all TNs observed on
network, each with a 7-byte profile

Contents of each profile
Inactivity -- number of days since TN used
 Minutes of use -- average daily minutes TN
is observed on network
 Frequency -- estimated number of days
between observing a TN
 “Bizocity” -- Business-like behavior of TN


Stored for inbound/outbound, toll/toll-free

Calculation of each variable
Inactivity: set to 0 on a day the TN is observed;
incremented by 1 (Inactivity++) on a day it is not.

Other variables are calculated via an
exponential weighted average:

X(TN)_new = λ X(TN)_today + (1-λ) X(TN)_old,
0 < λ < 1
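
The two update rules above can be sketched in a few lines. This is an illustrative sketch only, not AT&T's implementation: the `Profile` class, the `daily_update` function, and the default aging factor of 0.15 are all assumptions (the slides only require 0 < λ < 1). Leaving the smoothed value untouched on days the TN is not observed is also an assumption, based on the later bullet that inactive TNs are not updated.

```python
from dataclasses import dataclass
from typing import Optional

LAMBDA = 0.15  # assumed aging factor; the slides only state 0 < lambda < 1


@dataclass
class Profile:
    inactivity: int = 0    # days since the TN was last observed
    minutes: float = 0.0   # smoothed average daily minutes of use


def daily_update(profile: Profile, minutes_today: Optional[float],
                 lam: float = LAMBDA) -> Profile:
    """Apply one day's update; minutes_today is None if the TN was not observed."""
    if minutes_today is None:
        # Not observed: only the inactivity counter changes (assumption).
        profile.inactivity += 1
        return profile
    # Observed: reset inactivity and blend today's value into the old one.
    profile.inactivity = 0
    profile.minutes = lam * minutes_today + (1 - lam) * profile.minutes
    return profile
```

Calling `daily_update(p, 10.0)` blends ten minutes of use into the profile, while `daily_update(p, None)` only increments the inactivity counter.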
Aging factor λ
Provides an estimate as a weighted sum of all
previous daily values, where the weights decrease
smoothly over time.
 The most recent day’s activity is weighted more
heavily than activity from 2 weeks ago.
 The weight of a call k days ago is w_k = λ(1-λ)^k
 Old data is “aged out” as new data is “blended
in”
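
As a quick numerical check of the weighting described above, the sketch below folds a short series of daily values with the recursive update and compares the result with the explicit weighted sum using w_k = λ(1-λ)^k. The function names, the value λ = 0.2, and the daily values are made up for illustration.

```python
def ewma(values, lam, start=0.0):
    """Fold daily values (oldest first) with X_new = lam*X_today + (1-lam)*X_old."""
    x = start
    for v in values:
        x = lam * v + (1 - lam) * x
    return x


def weights(n, lam):
    """w_k = lam * (1 - lam)**k for k = 0 (today) up to n-1 days ago."""
    return [lam * (1 - lam) ** k for k in range(n)]


lam = 0.2                              # assumed aging factor
daily = [3.0, 0.0, 12.0, 5.0, 8.0]     # made-up daily values, oldest first

w = weights(len(daily), lam)
# Pair weight k with the value observed k days ago (newest value first).
explicit = sum(wk * v for wk, v in zip(w, reversed(daily)))
recursive = ewma(daily, lam)

# The two agree when the starting value is 0, and the n weights sum to
# 1 - (1 - lam)**n, which approaches 1 as more days are blended in.
print(abs(explicit - recursive) < 1e-12)
print(sum(w), 1 - (1 - lam) ** len(daily))
```

A larger λ ages old data out faster; a smaller λ gives a smoother but slower-reacting estimate.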

“Bizocity”
Concerns over whether a TN is residential
or business.
 Different operations for residences and
businesses for customer care, billing,
collections, fraud detection, etc.

“Bizocity” continued
AT&T has confirmed residential/business
status for 30% of 350M TNs.
 Incomplete data is due to lack of
communication with local companies,
additional lines, out of date information.
 Behavioral estimate is generated by
observing behavior of all 350M TNs,
generating a bizocity score, and combining
it with previous days’ totals.

Generating “Bizocity”
When a call completes, data such as the
originating TN, dialed TN, connect time,
and call duration are recorded (note that callers
are not identified, just phone numbers).
 Those with known biz/res status are
flagged, and training sets are generated.
 Noise and outliers are usually eliminated by
the volume of data.

Generating “Bizocity” -- examples
Example: Long calls originating at night
are usually residential, not business.
 Example: Residential calls peak in the evening,
business calls peak between 9am and 5pm.
 Example: Business calls are generally
shorter, call other businesses, or call 800
services.

Processed every 24 hours
Provides better aggregate data for each TN
 Reduces I/O by 75%
 Have to store all call details and sort them.
 Each call is reduced to a 32-byte binary
record, resulting in 8GB daily.
 Sorting takes 30 min. (3GB RAM, 1
processor)
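
To make the record size and daily volume concrete, here is a sketch of one possible 32-byte call record and a sort by originating TN. The field layout is hypothetical: the slides give only the record size and the fields listed earlier (originating TN, dialed TN, connect time, duration). The padding bytes and the helper names are assumptions.

```python
import struct

# Hypothetical 32-byte layout (little-endian, no alignment padding):
#   q   originating TN as an integer        (8 bytes)
#   q   dialed TN as an integer             (8 bytes)
#   I   connect time, seconds since epoch   (4 bytes)
#   I   call duration in seconds            (4 bytes)
#   8x  unused padding                      (8 bytes)
CALL = struct.Struct("<qqII8x")


def pack_call(orig_tn, dialed_tn, connect_ts, duration_s):
    """Reduce one call to a fixed-size binary record."""
    return CALL.pack(orig_tn, dialed_tn, connect_ts, duration_s)


def sort_by_originating_tn(buf):
    """Sort a buffer of concatenated records by originating TN (in memory)."""
    records = [buf[i:i + CALL.size] for i in range(0, len(buf), CALL.size)]
    records.sort(key=lambda r: CALL.unpack(r)[0])
    return b"".join(records)


# Back-of-the-envelope volume using the 1997 weekday figure from the slides.
calls_per_weekday = 275_000_000
daily_bytes = calls_per_weekday * CALL.size   # 8.8e9 bytes
```

At 32 bytes per call, 275 million calls come to about 8.8 x 10^9 bytes, in line with the roughly 8GB per day quoted above. Sorting by originating TN groups all of a TN's calls together, so each profile needs to be read and written only once per day.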

Processing -- continued
A 4-D data cube is generated
 Dimensions are day-of-week, time-of-day,
duration, and biz/res/800 status
(7 x 6 x 5 x 3 = 630 cells)
 Previously developed logistic regression
models score TNs based on each profile
(to estimate “Bizocity”)
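
The cube and the scoring step can be sketched as follows. Only the cube shape (7 x 6 x 5 x 3) and the use of logistic regression come from the slides. The bin edges, the reading of the status dimension as the status of the other party to the call, and every function name here are assumptions, and the coefficients would come from a model fitted on the TNs with known status.

```python
import math

SHAPE = (7, 6, 5, 3)   # day-of-week, time-of-day, duration, biz/res/800 status

# Assumed bin edges; the slides give only the number of bins per dimension.
DURATION_EDGES_S = (60, 300, 900, 3600)    # 4 edges -> 5 duration bins
STATUS = {"biz": 0, "res": 1, "800": 2}    # assumed: status of the other party


def cell(day_of_week, hour, duration_s, status):
    """Map one call to its (day, time-of-day, duration, status) cell."""
    tod = hour // 4                                      # six 4-hour blocks
    dur = sum(duration_s >= e for e in DURATION_EDGES_S)
    return (day_of_week, tod, dur, STATUS[status])


def flat_index(c):
    """Row-major index of a cell in the flattened 630-element cube."""
    d, t, u, s = c
    return ((d * SHAPE[1] + t) * SHAPE[2] + u) * SHAPE[3] + s


def histogram(calls):
    """Fraction of a TN's calls falling in each cell of the cube."""
    counts = [0.0] * (SHAPE[0] * SHAPE[1] * SHAPE[2] * SHAPE[3])
    for call in calls:
        counts[flat_index(cell(*call))] += 1
    total = sum(counts) or 1.0
    return [c / total for c in counts]


def logistic_score(features, coef, intercept=0.0):
    """Logistic regression score in (0, 1); coef is a fitted weight vector."""
    z = intercept + sum(w * x for w, x in zip(coef, features))
    return 1.0 / (1.0 + math.exp(-z))
```

For example, `histogram([(0, 10, 45, "biz"), (0, 11, 30, "800")])` yields a 630-element feature vector that `logistic_score` turns into that day's bizocity estimate for the TN.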


Biz(TN)_new = λ Biz(TN)_today + (1-λ) Biz(TN)_old,  0 < λ < 1

Processing -- continued
Training set is used to classify TNs with
unknown status based on probabilities
 Inactive TNs are not updated
 “Bizocity” scores for unknown TNs are
generated using probabilities

Accuracy
Accuracy of status prediction is 75%
 Failures are due to incorrectly provided status
or shifting status (e.g. home businesses, cell
phones)

Data Structures
Exploit the “exchange” concept (the first 6 digits
of a TN form an exchange)
 Only about 150,000 of 1M exchanges are in
use
 All 10,000 TNs for each exchange are
stored sequentially, whether used or not
 Each data structure is 2GB for each variable
(lower bound is 1.5GB)
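
The layout described above can be sketched as a flat byte array indexed by exchange and line number. This is a minimal sketch under stated assumptions: the class name `ExchangeTable` is hypothetical, TNs are taken to be 10-digit strings, and each variable is assumed to occupy one byte per TN, which is what makes 150,000 exchanges x 10,000 TNs come to the 1.5GB lower bound.

```python
class ExchangeTable:
    """One byte per TN, stored as contiguous 10,000-entry blocks per exchange."""

    BLOCK = 10_000   # TNs per exchange (last 4 digits: 0000-9999)

    def __init__(self, active_exchanges):
        # Only exchanges actually in use get a block.
        self.slot = {ex: i for i, ex in enumerate(sorted(active_exchanges))}
        self.data = bytearray(len(self.slot) * self.BLOCK)

    def _offset(self, tn):
        exchange, line = tn[:6], int(tn[6:])
        return self.slot[exchange] * self.BLOCK + line

    def get(self, tn):
        return self.data[self._offset(tn)]

    def set(self, tn, value):
        self.data[self._offset(tn)] = value   # value must fit in one byte


table = ExchangeTable(["973360", "212555"])   # made-up exchanges
table.set("9733601234", 7)
```

Because every TN in an exchange has a fixed offset, no per-TN key needs to be stored, and a lookup is one dictionary access plus one array index. The cost is space reserved for unused TNs within each active exchange.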

Interface
Variety of visualization tools (start at top,
drill-down)
 Web interface with password protection
 Images are computed on the fly
 C code directly generates images in GIF
format

Toll Fraud Detection
Same methodology, but event-driven
 Only have to track about 15M TNs.
 Profiles are about 512 bytes each (7.5GB)
