1MileStone1PPTx - Cs Team Site | courses.cs.tau.ac.il


Milestone #1
PRIVACY IN A DEMOGRAPHIC DATABASE
Razi Mukatren, Golan Salman
MILESTONE #1
• We started the privacy analysis of the data.
• We manually generated more than 40 tables from the Israel Central Bureau of
Statistics (CBS) website.
• We studied the specific technique that the CBS uses on its website.
• From the pulled data, we studied the tables and manually looked for
intersections between them in order to understand more about the surveys.
• Next step: pulling the data/tables from the website using a script.
THE PRIVACY ANALYSIS OF THE SYSTEM
We ran manual tests and saw that it is possible to derive information about a
specific participant in the survey.
For example:
Taking the data of all 7,500 participants and filtering only those who:
1) Studied a subject related to education.
2) Have an income of more than 24,000 NIS per month.
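The two-filter query above can be sketched as follows. This is a minimal illustration, not the CBS system: the records, the field names (field_of_study, monthly_income_nis) and the helper filter_participants are all hypothetical.

```python
# Hypothetical survey records; field names and values are illustrative only.
records = [
    {"id": 1, "field_of_study": "education", "monthly_income_nis": 26000},
    {"id": 2, "field_of_study": "education", "monthly_income_nis": 9000},
    {"id": 3, "field_of_study": "engineering", "monthly_income_nis": 30000},
    {"id": 4, "field_of_study": "education", "monthly_income_nis": 7000},
]


def filter_participants(rows, field, min_income):
    """Keep rows with the given field of study and income above min_income."""
    return [
        r for r in rows
        if r["field_of_study"] == field and r["monthly_income_nis"] > min_income
    ]


matches = filter_participants(records, "education", 24000)
# A very small result set is what makes a participant identifiable.
print(len(matches), "participant(s) match both filters")
```

The smaller the group left after filtering, the more each additional table reveals about its members.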
FOR EXAMPLE:
• We generated 10 tables using the following filters: Arab villages, and
Religion: Muslims.
• The filters reduce the size of each table: we get only the information
related to the filters above.
• The survey has only 12 people who live in Arab villages and are Muslim (we
can learn this from Table #1). Six of them are men and six are women. We can
also see the ages of those 12 people in the tables below.
• Now we look at the tables that include 12 participants in total, since they
are certain to include all 12 participants from Table #1.
• Tables 5, 7, 9 and 10 include all 12 participants.
• From Table #5 we can learn, for example, that of the two participants aged
20-24, one has a height of 120-124 and the other 185-189.
• Going back to Table #1, we see that one of them is a man and one is a
woman. To see who is who, we generate a new table with the same filters and
add a second column for gender.
• We name it Table 11. From Table 11 we can see that the woman's height is
160-164 and the man's is 185-189.
• Let's focus on these 2 participants only, for example, because one of them
appears in all 10 tables (age 20-24 appears in all 10 tables).
• From Table #2 we can see that one of them is a hired worker. Let's generate
a new table (Table 12) and check whether the hired worker is the man or the
woman. Table 12 shows that the man is the hired worker.
• So far we know about the man: his age is 20-24, he is Muslim, from an Arab
village, his height is 185-189, and he is a hired worker.
• From Table #3 and Table #4 we can learn that he works in construction and
that he is about a 15-30 minute drive from his work.
• From Table #6, both the man and the woman studied for 11-12 years.
• From Table #7, one of them weighs 90-94 and the other 65-69. Let's generate
a new table (Table 13) and check which one is the man. Table 13 shows that
the man weighs 90-94 kg.
• From Table #8, he earns 5K-6K NIS gross.
• From Table #9, he is from the north.
• For Table #10 we need to generate a new table, Table 14. From Table 14 we
can see that his family includes more than 7 members.
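The refinement step used repeatedly above (regenerating a table with an extra gender column to tell the two participants apart) can be sketched as below. This is only an illustration under assumptions: the micro-data, the make_table and singletons helpers, and the 25-29 rows are hypothetical; only the two 20-24 rows mirror the values quoted for Table 11.

```python
from collections import Counter

# Hypothetical micro-data standing in for the filtered survey population.
people = [
    {"age": "20-24", "gender": "man", "height": "185-189"},
    {"age": "20-24", "gender": "woman", "height": "160-164"},
    {"age": "25-29", "gender": "man", "height": "170-174"},
    {"age": "25-29", "gender": "man", "height": "170-174"},
]


def make_table(rows, columns):
    """Count participants per combination of the chosen columns."""
    return Counter(tuple(r[c] for c in columns) for r in rows)


def singletons(table):
    """Return the cells that describe exactly one participant."""
    return sorted(cell for cell, count in table.items() if count == 1)


# Age x height shows two single-person cells, but not who is who.
coarse = make_table(people, ["age", "height"])
# Adding the gender column attaches each height to one specific person.
fine = make_table(people, ["age", "gender", "height"])

print(singletons(coarse))
print(singletons(fine))
```

Each regenerated table plays the role of Tables 11 to 14: the same filters plus one more column, until every cell of interest describes a single person.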
IN CONCLUSION:
• We know about the man:
  • His age is 20-24
  • He is Muslim
  • He is from an Arab village
  • His height is 185-189
  • He is a hired worker
  • He is a 15-30 minute drive from his work
  • He studied for 11-12 years
  • He weighs 90-94 kg
  • His salary is 5K-6K NIS gross per month
  • He is from the north
  • His family includes more than 7 members
WHERE ARE WE GOING FROM HERE
NEXT STEPS
• Two major points (the plan is to finish them by milestone 2):
  • Automatically extracting and generating the survey's tables from the CBS
  website (this will be the first script).
  • Starting work on the algorithm that searches the data for the "1" (cells
  that count a single participant) and tries to find intersections between
  this information (this will be the second script).
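One possible shape for the second script is sketched below; it is an assumption about how the search could work, not the planned algorithm. Tables are modelled as hypothetical dictionaries keyed by (age, gender, value), the function build_profiles is illustrative, and only the values for the two 20-24 participants come from the manual example; the 25-29 cells are invented.

```python
from collections import defaultdict

# Hypothetical tables: {(age, gender, value): count}.
tables = {
    "height": {
        ("20-24", "man", "185-189"): 1,
        ("20-24", "woman", "160-164"): 1,
        ("25-29", "man", "170-174"): 1,
        ("25-29", "man", "175-179"): 1,
    },
    "weight_kg": {
        ("20-24", "man", "90-94"): 1,
        ("20-24", "woman", "65-69"): 1,
    },
    "employment": {
        ("20-24", "man", "hired worker"): 1,
        ("25-29", "man", "hired worker"): 3,
    },
}


def build_profiles(tables):
    """Merge count-1 cells across tables into per-person profiles."""
    profiles = defaultdict(dict)
    for attribute, cells in tables.items():
        # Size of each (age, gender) group within this table.
        totals = defaultdict(int)
        for (age, gender, _value), count in cells.items():
            totals[(age, gender)] += count
        for (age, gender, value), count in cells.items():
            # Link only when the whole (age, gender) group is one person;
            # otherwise a "1" cell cannot be tied to the same individual.
            if count == 1 and totals[(age, gender)] == 1:
                profiles[(age, gender)][attribute] = value
    return dict(profiles)


profiles = build_profiles(tables)
for key, profile in sorted(profiles.items()):
    print(key, profile)
```

The group-size check matters: the two 25-29 men each sit in a "1" cell of the height table, but since the group holds two people their cells cannot be joined with cells from other tables.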
THE FIRST SCRIPT AND MAJOR ISSUES
• The website supports only IE.
• We thought that we could use a macro script in FF or Chrome, but since the
Israeli government sites support only IE, we can't use macro scripts.
• We are now testing alternatives:
  • Either Scrapy: http://scrapy.org/
  It is used to crawl websites and extract structured data from their pages.
  It can be used for a wide range of purposes, from data mining to monitoring
  and automated testing.
  • or curl in bash
  • or Java with http://jtidy.sourceforge.net/ - JTidy is a Java port of HTML Tidy
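Whichever tool fetches the page, the extraction step amounts to turning an HTML table into rows. The sketch below uses Python's standard html.parser module rather than Scrapy, curl or JTidy, purely to stay self-contained; the TableExtractor class and the sample markup are hypothetical, since the real CBS page structure is not described here, and fetching the page is left out.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment; the real CBS markup is not known here.
sample_html = """
<table>
  <tr><th>Age</th><th>Men</th><th>Women</th></tr>
  <tr><td>20-24</td><td>1</td><td>1</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(sample_html)
print(parser.rows)
```

The resulting list of rows is the kind of structure the second script could then scan.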