#### Transcript Data Mining over the Deep Web - Computer Science and Engineering

Data Mining over the Deep Web Tantan Liu, Gagan Agrawal {liut,agrawal}@cse.ohio-state.edu Ohio State University April 12, 2011 1 Outline • Introduction – Deep Web – Data Mining on the deep web • Frequent itemset mining over the deep web – Bayesian network – Active learning based sampling method • Experiment Result • Conclusion 2 Deep Web • Data sources hidden from the Internet – Online query interface vs. Database – Database accessible through online Interface – Input attribute vs. Output attribute • An example of Deep Web 3 Data Mining over the Deep Web • High level summary of data – Scenario 1: A student wants to find a job as a software Engineer • • • • Will a master degree help? Which language to learn: Java, C, or C#? Try MSN careers – to much information! Frequent itemset mining! 4 Challenges • Databases cannot be accessed directly – Sampling method for Deep web mining • Obtaining data is time consuming – Efficient sampling method – High accuracy with low sampling cost 5 Roadmap • Introduction – Deep Web – Data Mining • Frequent Itemset mining over the deep web – Bayesian Network – Active learning based sampling method • Experiment Result • Conclusion 6 Frequent Itemset Mining • • • Itemset: a set of attributes with instantiations, e.g I={Brand=benz, Age>5} Support(Brand=Benz, Age>5)=2/8=0.25 Frequent Itemset: Support is larger than a threshold 7 Frequent Itemset Mining on Deep Web • Challenges – Support of itemsets is unavailable – The size of itemsets could be huge • Considering 1-itemsets – Simple random sample – Inefficient • Support of itemsets of input attributes is known – # of data records satisfying the query is provided 8 Main Idea • Task: Estimating the support of itemsets of output attributes • Questions – Can we use information about input attributes? • Bayesian Network – Relation between input attributes and output attributes – Compute support for itemsets of output attributes – How to quickly build the model • Active learning based sample method 9 Bayesian Network • Relation between input and output attributes • Graphical model – Random variables • Input and output attributes – Conditional dependencies • Output attributes depend on input attributes 10 Active Learning • In machine learning – Passive learning: data are randomly chosen – Active Learning • Certain data are selected, to help build a better model • Active Learning – Obtaining data is costly and/or time-consuming • Frequent Itemset Mining on Deep Web 11 An Example of Bayesian Network Age Brand <=5, >5 H, B known Mileage Estimate! Price Price Brand • Support of Itemsets depends on parameters in the Bayesian network H • Parameters are estimated based on Sample H ‒ Parameter: p(price<=5000|H,<=5) • 2 data records satisfying brand=H, Age<=5 • 1 data records satisfying brand=H, Age<=5, price<=5000 B B Age p <=5000 >5000 <=5 0.25 >5 0.25 0.5 0.5 0.5 0.5 <=5 0.25 >5 0.25 0.0 1.0 0.0 1.0 [0.125 0.125 0.0 0.0] ‒ p(price<=5000|H,<=5)=1/2=0.5 Support(Price<=5000)= 0.25 12 Example of Active learning on Deep Web Deep Web Data Source Qi, i=1,…, 4 Q1 B=H& Age<=5 Sampled Data Q2 Q3 B=H& Age>5 Q4 B=B& Age>5 B=B& Age<=5 Price <=5000 >5000 Price Q1 Q2 Q3 Q4 p11 p21 p31 p41 p12 p22 p32 p42 13 An Example of Active Learning Based Sampling • Hidden idea – Sampling heavily on query spaces with high impurity Deep Web Data Source Q1(B=H) Q1 0.5 0.5 Q2(B=B) Price Q2 0.01 0.99 14 Detailed Formulation • Support for output attributes ‒ – : an instantiation of input attributes, or a query : prior probability • Known – , • Conditional probability • Parameters in conditional table • Unknown, need to estimate 15 Parameters in Bayesian Network • are estimated based on a sample • Difference between estimated values and true values • Consider as statistical variables • Conjugate distribution – After observing data D, is in the same family with • Hyper parameter – Expectation: , where • Estimation for support of output attributes – Expectation on the distribution 16 Active Learning on Deep Web • Risk Function – Risk with the estimation for 1-itemsets composed of output attributes – Based on the hyper parameter in the Bayesian Network, • Data Selection – Data are obtained by queries : query selection – Data records are selected step by step – Choosing the query with most reduction on risk function • Updating Model – For where , and sample denotes the number of data records containing 17 Support for n-itemsets(n>1) • Estimation based on the Bayesian network – • Support value of in the query space 18 Roadmap • Introduction – Deep Web – Data Mining • Frequent Itemset mining over the deep web – Bayesian Network – Active learning based sampling method • Experiment Result • Conclusion 19 Experiment Result • Data set: US census – 2008 US Census on the income of US households – 40,000 data records • Three Methods – Dir: • Random Sample • Direct Computation – Bay • Random Sample • Computation Based on Bayesian Network – Act: our proposed method • Active Learning based Sample • Computation Based on Bayesian Network 20 US census • Square Error Rate: • Absolute Error Rate (AER): 21 Conclusion • • • • • Data mining on the deep web is challenging Frequent itemset mining over the deep web Bayesian network is used to model the deep web A active learning based sampling method The experiment results show the efficiency of our work 22 Questions? 23