#### Transcript Data Mining Slides - San Diego Supercomputer Center

Intro to Data Mining 1 Natasha Balac, Ph.D. Predictive Analytics Center of Excellence, Director San Diego Supercomputer Center University of California, San Diego Necessity is the Mother of Invention Problem Data explosion Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories “We are drowning in data, but starving for knowledge!” (John Naisbitt, 1982) 2 Necessity is the Mother of Invention Solution Predictive Analytics or Data Mining Extraction or “mining” of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases Data -driven discovery and modeling of hidden patterns in large volumes of data Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data 3 Predictive Analytics Analytical Tools Top-Down Methodology Surface Shallow Data Bottom-Up Methodology 4 Hidden SQL tools for simple queries and reporting Statistical & OLAP tools for summaries and analysis Data Mining methods for knowledge discovery Query Reporting OLAP Data Mining Extraction of data - detailed and/or summarized Analysis, summaries, Trends Information Analysis Who purchased the product in the last 2 quarters? What is an average income of the buyers per quarter by district? Discovery of hidden patterns, information, predicting future trends Insight knowledge and prediction Which customers are likely to buy a similar product in the future and why? 5 DM Enables Predictive Analytics Role of Software Data mining Proactive Predictive Analytics Interactive OLAP Ad-hoc reporting Passive Canned reporting Presentatio n Exploration Discovery Business Insight What Is Data Mining? Data mining: 7 Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases Data Mining is NOT Data Warehousing (Deductive) query processing 8 SQL/ Reporting Software Agents Expert Systems Online Analytical Processing (OLAP) Statistical Analysis Tool Data visualization Machine Learning Techniques Technical basis for data mining: algorithms for acquiring structural descriptions from examples Methods originate from artificial intelligence, statistics, and research on databases 9 Multidisciplinary Field Database Technology Machine Learning Artificial Intelligence 10 Statistics Data Mining Visualization Other Disciplines Multidisciplinary Field Database technology Artificial Intelligence 11 Machine Learning including Neural Networks Statistics Pattern recognition Knowledge-based systems/acquisition High-performance computing Data visualization History of Data Mining 12 History Emerged late 1980s Flourished –1990s Roots traced back along three family lines 13 Classical Statistics Artificial Intelligence Machine Learning Statistics Foundation of most DM technologies 14 Regression analysis, standard distribution/deviation/variance, cluster analysis, confidence intervals Building blocks Significant role in today’s data mining – but alone is not powerful enough Artificial Intelligence 15 Heuristics vs. Statistics Human-thought-like processing Requires vast computer processing power Supercomputers Machine Learning Union of statistics and AI Blends AI heuristics with advanced statistical analysis Machine Learning – let computer programs 16 learn about data they study - make different decisions based on the quality of studied data using statistics for fundamental concepts and adding more advanced AI heuristics and algorithms Data Mining Adoption of the Machine learning techniques to the real world problems Union: Statistics, AI, Machine learning Used to find previously hidden trends or patterns Finding increasing acceptance in science and business areas which need to analyze large amount of data to discover trends which could not be found otherwise 17 Terminology 18 Gold Mining Knowledge mining from databases Knowledge extraction Data/pattern analysis Knowledge Discovery Databases or KDD Information harvesting Business intelligence What does Data Mining Do? Explores Your Data Finds Patterns Performs Predictions What can we do with Data Mining? Exploratory Data Analysis Predictive Modeling: Classification and Regression Descriptive Modeling Discovering Patterns and Rules 20 Cluster analysis/segmentation Association/Dependency rules Sequential patterns Temporal sequences Deviation detection CRISP-DM - Cross Industry Standard Process for Data Mining CRISP-DM Process Model Data Mining Applications Science: Chemistry, Physics, Medicine Biochemical analysis, remote sensors on a satellite, Telescopes – star galaxy classification, medical image analysis Bioscience Sequence-based analysis, protein structure and function prediction, protein family classification, microarray gene expression Pharmaceutical companies, Insurance and Health care, Medicine Drug development, identify successful medical therapies, claims analysis, fraudulent behavior, medical diagnostic tools, predict office visits Financial Industry, Banks, Businesses, E-commerce Stock and investment analysis, identify loyal customers vs. risky customer, predict customer spending, risk management, sales forecasting 22 Data Mining Applications Market analysis and management Risk analysis and management Forecasting, customer retention, improved underwriting, quality control, competitive analysis (Banking and Credit Card scoring) Fraud detection and management 23 Target marketing, customer relation management, market basket analysis, cross selling, market segmentation (Credit Card scoring, Personalization & Customer Profiling ) Fraud, waste and abuse including: illegal practices, waste, payment error, non-compliance, incorrect billing practices Data Mining Applications Sports and Entertainment Astronomy 24 IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat JPL and the Palomar Observatory discovered 22 quasars with the help of data mining Campaign Management and Database Marketing Data Mining Tasks Concept/Class description: Characterization and discrimination Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions; “normal” vs. fraudulent behavior Association (correlation and causality) Multi-dimensional interactions and associations age(X, “20-29”) ^ income(X, “60-90K”) buys(X, “TV”) Hospital(area code) ^ procedure(X) ->claim (type) ^ claim(cost) 25 Data Mining Tasks Classification and Prediction Finding models (functions) that describe and distinguish classes or concepts for future prediction Example: classify countries based on climate, or classify cars based on gas mileage, fraud based on claims information Presentation: 26 If-THEN rules, decision-tree, classification rule, neural network Prediction: Predict some unknown or missing numerical values Data Mining Tasks Cluster analysis Class label is unknown: Group data to form new classes 27 Example: cluster claims or providers to find distribution patterns of unusual behavior Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity Data Mining Tasks Outlier analysis Outlier: a data object that does not comply with the general behavior of the data Mostly considered as noise or exception, but is quite useful in fraud detection, rare events analysis Trend and evolution analysis Trend and deviation: regression analysis Sequential pattern mining, periodicity analysis 28 KDD Process Database Selection Transformation Data Preparation Training Data Data Mining Evaluation, Verification 29 Model, Patterns Learning and Modeling Methods Decision Tree Induction (C4.5, J48) Regression Tree Induction (CART, MP5) Multivariate Regression Tree (MARS) Clustering (K-means, EM, Cobweb) Artificial Neural Networks (Backpropagation, Recurrent) 30 Support Vector Machines (SVM) Various other models TAXONOMY Predictive Methods Use some variables to predict some unknown or future values of other variables Descriptive Methods Find human –interpretable patterns that describe the data Supervised vs. Unsupervised Labeled vs. unlabeled data Data Mining Challenges 32 Computationally expensive to investigate all possibilities Dealing with noise/missing information and errors in data Mining methodology and user interaction Mining different kinds of knowledge in databases Incorporation of background knowledge Handling noise and incomplete data Pattern evaluation: the interestingness problem Expression and visualization of data mining results Data Mining Heuristics and Guide Choosing appropriate attributes/input representation Finding the minimal attribute space Finding adequate evaluation function(s) Extracting meaningful information Model evaluation accuracy vs. overfitting 33 Available Data Mining Tools 34 COTs: IBM Intelligent Miner SAS Enterprise Miner Oracle ODM Microstrategy Microsoft DBMiner Pentaho Matlab Teradata Open Source: WEKA KNIME Orange RapidMiner NLTK R Rattle Data Mining Trends Many data mining problems involve large, complex databases, complicated modeling techniques and substantial computer processing Scalability is very important due to the rapid growth in the amount the data and the need to build and deploy the models at rapid rates Needs for Scalable High Performance Data Mining The size of a database as well as factors in building, testing, validation and deploying a data mining solution Taking advantage of parallel database/file system systems and additional CPUs Work with more data, build more models, and improve their accuracy by simply adding additional CPUs Build a good data mining model as quickly as possible! Scalable High Performance Data Mining One solution: run parallel data mining algorithms and parallel DBMSs on parallel hardware Once the model is build – still need for parallel computing to apply the model to a large amount of data Scoring: data mining model applied to a batch of data, record or events at the time; require that a scalable solution be deployed Data mining applications on Gordon DM Suites MathWorks Octave Computational Packages with DM tools Others as Requested Libraries for building tools 38 Summary Discovering interesting patterns from large amounts of data CRISP-DM Industry standard Learn from the past Predict for the future 39 High quality, evidence based decisions Prevent future instances of fraud, waste & abuse React to changing circumstances Current models, continuous learning Thank you! Questions: [email protected] 40 An Introduction to the Gordon Architecture San Diego Supercomputer Center Gordon Design Partnership Funding and Oversight from the Office of Cyberinfrastructure Design, Deployment, Support, Management System Integrator Sandy Bridge processors, flash memory, motherboards vSMP Foundation Memory Aggregation Software 3D Torus Subnet Manager, IB Switches, HCAs Gordon is not about FLOPS, but … A conservative estimate puts Gordon in the top 50 on the Top 500 list Access to Big Data Comes with a Latency Penalty 1.00E+00 Data Oasis Lustre 4PB PFS 1.00E-01 64 I/O nodes 300 TB Intel SSD (lower is better) Latency (seconds) 1.00E-02 1.00E-03 Quick Path Interconnect 10’s of GB 1.00E-04 1.00E-05 1.00E-06 1.00E-07 1.00E-08 QDR InfiniBand Interconnect 100’s of GB L3 Cache MB L1 Cache KB 1.00E-09 1.00E-05 L2 Cache KB 1.00E-03 1.00E-01 1.00E+01 DDR3 Memory 10’s of GB 1.00E+03 1.00E+05 1.00E+07 Data Capacity (GB) (higher is better) 1.00E+09 Access to Big Data Comes with a Latency Penalty 1.00E+00 Data Oasis Lustre 4PB PFS 1.00E-01 64 I/O nodes 300 TB Intel SSD (lower is better) Latency (seconds) 1.00E-02 1.00E-03 Quick Path Interconnect 10’s of GB 1.00E-04 1.00E-05 1.00E-06 1.00E-07 1.00E-08 QDR InfiniBand Interconnect 100’s of GB L3 Cache MB L1 Cache KB 1.00E-09 1.00E-05 L2 Cache KB 1.00E-03 1.00E-01 1.00E+01 DDR3 Memory 10’s of GB 1.00E+03 1.00E+05 1.00E+07 Data Capacity (GB) (higher is better) 1.00E+09 Flash Drives are a Good Fit for Data Intensive Computing Flash Drive Typical HDD Good for Data Intensive Apps < .1 ms 10 ms ✔ 250 /170 MB/s 100 MB/s ✔ 35,000/ 2000 100 ✔ 2-5 W 6-10 W ✔ 1M hours 1M hours - Price/GB $2/GB $.50/GB - Endurance 2-10PB N/A ✔ Latency Bandwidth (r/w) IOPS (r/w) Power consumption MTBF Total Cost of Ownership *** The jury is still out *** . Apart from the differences between HDD and SSD it is not common to find local storage “close” to the compute. We have found this to be attractive in our Trestles cluster, which has local flash on the compute, but is used for traditional HPC applications (not high IOPS). Flash Glossary of Terms Term Definition NAND Physical silicon MOSFET storage array. The flash storage media. SLC: Single-level cell Type of NAND storage device. Lower storage density for the same price. Higher endurance and performance than MLC. 1 bits of data per cell. MLC: Multi-level cell Type of NAND storage device. Higher storage density for the same price. Lower endurance and performance than SLC. 2 bits of data per cell. eMLC: Enterprise MLC Form of eMLC that uses high quality NAND, and additional controller software to achieve higher performance and endurance. IOPS: I/O operations per second Key storage benchmark that is relevant for data intensive applications. The ability to sustain high IOPS provides performance for applications that exhibit high random access data patterns. Wear Leveling The process of distributing write operations uniformly over all of the blocks in the flash memory chips to preserve the life of the NAND. Controlled by the SSD controller software. Write Amplification Factor Ratio of actual erase operations to the minimum required to erase the data. Greater than 1 is bad. Less than 1 is possible with compression (though most HPC data is not compressible) Overprovisioning Space on the flash drive not available to the user that is used for garbage collection, wear leveling, and containing bad blocks. Overprovisioning can increase endurance. Program/Erase (P/E) cycle A flash memory cell must be erased before it can be written to. This causes cell wear out and is what sets the endurance of an SSD. Minimum erase size for SSD’s is a block. Endurance How much data can be written to an SSD before it ceases to function properly. Set by the P/E cycle count. This is lower for MLC due to the number of programmable Gordon System Specification Intel Sandy Bridge Compute node Sockets & Cores 2 & 12* Clock speed 2.0 * DIMM slots per socket 4 DRAM capacity 64 GB Intel Flash I/O Node NAND flash SSD drives 16 or more SSD capacity per drive/Capacity per node 256 GB / 4 TB * Bandwidth per drive (r/w) 250 MB/s / 170 MB/s IOPS (r/w) 35,000 / 2000 SMP Super-Node (via vSMP) Compute nodes / I/O Nodes 32 / 2 Addressable DRAM 2 TB Addressable memory including flash 10 TB Gordon Compute Nodes 1,024 Total compute cores 12,288 Peak performance 200 TF * Aggregate memory 64 TB DRAM; 256 TB flash InfiniBand Interconnect Aggregate torus BW 9.2 TB/s Type Dual-Rail QDR InfiniBand Link Bandwidth 8 GB/s (bidirectional) Latency (min-max) 1.25 µs – 2.5 µs Lustre-based Disk I/O Subsystem Total storage 4 PB (raw) * May be revised I/O bandwidth 100 GB/s Gordon Applications Software chemistry adf amber gamess gaussian gromacs lammps namd nwchem distributed computing globus Hadoop MapReduce visualization idl NCL paraview tecplot visit VTK genomics abyss blast hmmer soapdenovo velvet compilers/languages gcc, intel, pgi MATLAB, Octave, R PGAS (UPC) DB2, PostgreSQL data mining IntelligentMiner RapidMiner RATTLE Weka libraries ATLAS BLACS fftw HDF5 Hypre SPRNG superLU MapReduce Paradigmatic Example: string counting Scheduler: manage threads, initiate data split and call Map Map: count strings, output key=string & value=count Scheduler: re-partitions keys & values Reduce: sum up counts User defines keys & values 50 User defined functions MR provides parallelization,concurrency, and intermediate data functions (by key&value) Matlab and Data Analysis Mathematical and matrix operations Interactive or scripts Comes with tools (scripts) for data analysis, e.g. clustering, neural networks, SVMs, … Uses MKL (Intels math kernel library) for Matrix calculations with threads 51 Matlab in HPC setting Distributed toolbox provides distribute/gather functions In a nutshell: Create job object: createMatlabPoolJob(scheduler information) Create tasks for that job createTask(job,@myfunction,#tasks,{parameters..}) In your code: spmd D=codistributed(X); or D=codistributed.build(X); < statements> end; 52 Matlab in vSMP setting vSMP submission indicates threads In submission script, set environment variables for MKL In matlab code: setenv('MKL_NUM_THREADS', getenv(number_of_procs); No programming changes necessary, but programming considerations exist 53 Matlab: matrix multiplication In vSMP: Y=X’NxN * XNxN 8 threads time(s) 32 threads N=10K Gb=2 54 20K 6.5 30K 14 40K 25 Matrix size 50K 40 Matlab: matrix inversion In vSMP: Y=inv(XNxN + I*0.0001) 32 threads time(s) time(s) 16 threads N=10K Gb=2 55 20K 6.5 30K 14 Matrix size 40K 25 50K 40 Matlab and Data Mining Case Study Kmeans clustering Assign each point to one of a few clusters so that total distance to center is minimized Options: distance function, number of clusters, initial cluster centers, number of iterations, stopping criteria 56 Matlab original Kmeans Script 1. Difference_by_col=X(:,1)-Cluster_Means(1,1) XNxP each row is a point in Rp 00100 0200 …00100100010000010001000100 0 00002 0000 …000030010001000001010000000 00000 0100 …10000000100010000 00100 0000 … … … … 0 0 0000 0400 0100 0100 0200100 0200 …00100 Cluster_MeansMxP 0 0 1 0 0 0 0 .5 0 0 … 0 0 1 0 0 .5 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 .5 0 0 0 … 1 0 0 0 .4 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 … … 0 0 0000 0400 0100 0100 0200100 0200 …00 2. square difference 3. sum as you loop across cols to get Distances to cluster center Works better for large N small P 57 Matlab Kmeans Script altered 1. Difference_by_row=X(1,:)-Cluster_Means(1,:) XNxP each row is a point in Rp 00100 0200 …00100100010000010001000100 00002 0000 …000030010001000001010000000 00000 0100 …10000000100010000 00100 0000 … … … … 0 0 0000 0400 0100 0100 0200100 0200 …00100 Cluster_MeansMxP 0 0 1 0 0 0 0 .5 0 0 … 0 0 1 0 0 .5 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 .5 0 0 0 … 1 0 0 0 .4 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 … … 0 0 0000 0400 0100 0100 0200100 0200 …00 2. dot(difference_by_row) 3. loop across rows to get Distances Works better for large P and dot( ) will use threads 58 Matlab Kmeans Benchmarks Kmeans on 10,000,000 entries from NYTimes articles (http://archive.ics.uci.edu/ml/datasets/Bag+of+Words) Running as full data matrix ~ 45K articles x102K words, Each cell holds word count (double float) about 37Gb in Matlab, total memory for script about 61Gb Kmeans (original) runtime ~ 50 hours Kmeans (altered) runtime ~ 10 hours, 8 threads 59 Matlab Kmeans Results cluster means shown with coordinates determining fontsize 7 viable clusters found 60 MapReduce Framework A library for distributed computing Started by Google, gaining popularity Various implementations: Hadoop (distributed), Phoenix (threaded) User outputs keys & values 61 User defined functions MR provides parallelization,concurrency, and intermediate data functions (by key&value) MapReduce Paradigmatic Example: string counting Scheduler: manage threads, initiate data split and call Map Map: count strings, output key=string & value=count Scheduler: re-partitions keys & values Reduce: sum up counts User defines keys & values 62 User defined functions MR provides parallelization,concurrency, and intermediate data functions (by key&value) MapReduce Kmeans clustering C-code for Kmeans (sample code with MapReduce) Use 10,000,000 entries from NYTimes articles Running as full data matrix ~ 45Kx102K ints, about 20 Gb total in vSMP Running time about 20 min, 32 threads (but not a completely fair comparison to Matlab kmeans) 63 Case Study: Matrix Factorization (work by R. Anil & C. Elkan, UCSD, CSE for KDD 2011 competition) Given large N sparse data matrix sparse missing data XNxP customer ratings items rated ***1** *2** **1**1***1* * 2 * * * * * * * * 3 * * 1 * * * ** * * * ****0*5**1** 1*******1* … … … … … … … * * **** *4** *1** *1** * Find vectors U,V such that X ~ ∑ f(U • V’) + penalty 64 Case Study: Matrix Factorization C code with pthreads For different f(U • V’) functions 65 Function Time (s) 1 iteration Memory Dash node Sigmoidal 74 29Gb non-vSMP Alternating Least Squares 673 15Gb non-vSMP Log-linear 1110 70Gb vSMP Additional Methods Slides 66 Decision Tree Induction Method for approximating discrete-valued functions 67 robust to noisy/missing data can learn non-linear relationships inductive bias towards shorter trees Decision Tree Induction Applications: 68 medical diagnosis – ex. heart disease analysis of complex chemical compounds classifying equipment malfunction risk of loan applicants Boston housing project – price prediction fraud detection DT for Medical Diagnosis and Prognosis Heart Disease <= 91 Class 2: Minimum systolic blood pressure over a 24-hour period following admission to the hospital > 91 Age of Patient <=62.5 >62.5 Early death Class 1: Was there sinus tachycardia? Survivors NO Beriman et. al, 1984 69 YES Class 1: Class 2: Survivors Early death Regression Tree Induction Why Regression tree? Ability to: 70 Predict continuous variable Model conditional effects Model uncertainty Regression Trees 71 Quinlan, 1992 Continuous goal variables Induction by means of an efficient recursive partitioning algorithm Uses linear regression to select internal nodes Clustering Basic idea: Group similar things together Unsupervised Learning – Useful when no other info is available K-means Partitioning instances into k disjoint clusters Measure of similarity 72 Clustering X X X 73 X Kmeans Results from 10 million NYTimes articles cluster means shown with coordinates determining fontsize 7 viable clusters found 74 Artificial Neural Networks (ANNs) Network of many simple units • Main Components • •Inputs •Hidden layers •Outputs Adjusting weights of connections • Backpropagation • 75 Support Vector Machines (SVM) Blend of linear modeling and instance based learning SVM select a small number of critical boundary instances called support vectors from each class and build a linear discriminant function that separates them as widely as possible They transcend the limitations of linear boundaries by making it practical to include extra nonlinear terms in the calculations 76 making it possible to form quadratic, cubic, higher-order decision boundaries Support vectors Resilient to overfitting because they learn a particular linear decision boundary The instances closest to the maximum margin hyperplane are called support vectors Important observation: the support vectors define the maximum margin hyperplane 77 The maximum margin hyperplane All other instances can be deleted without changing the position and orientation of the hyperplane SVM: finding the maximum margin hyperplane 78 Data Miners: Past and Present Traditional approaches have been for DM experts: “White-coat PhD statisticians” DM tools also fairly expensive Today: approach is designed for those with some Database/Analytics skills DM built into DB, easy to use GUI, Workflows Many jobs available from Statistical analyst to Data Scientist! Data scientist: The hot new gig in tech 80 Article in Fortune – “The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist”