cs.joensuu.fi


World-Wide Location-based
Search Using OpenStreetMap
Andrei Tabarcea
Main idea
Input:
• user location (lat, lon)
• keyword(s)
Output: list of services containing:
• name/title
• website
• address (street, number, city)
• location (lat, lon)
• image
• other info (opening hours, telephone etc.)
Main idea:
• preprocess the search results of an external search engine (Google, Yahoo, Bing etc.) by detecting postal addresses in order to find the locations
Proposed steps
1. Convert user location (lat, lon) into user address = Geocoding step
2. Perform a search with the query "keyword+city" using an external search engine API (or using our own search API, which is not realistic) and download the first k results (web pages) = Web page retrieval step
3. Detect addresses and services from the downloaded web pages = Data mining step
4. Rank the results (distance, relevance etc.) = Ranking step
5. Display the search results to the user
Current implementation is in /usr/local/www_root/mopsi/mopsiMetaSearch
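The five steps above can be sketched as a single pipeline. This is an illustrative sketch, not the mopsiMetaSearch implementation; the helper callables (`geocode`, `search`, `mine`, `rank`) are hypothetical stand-ins for the components described in the following sections.

```python
def location_search(lat, lon, keyword, geocode, search, mine, rank, k=10):
    """Run the proposed 5-step pipeline and return a ranked result list.

    geocode(lat, lon) -> city name           (step 1, geocoding)
    search(query, k)  -> list of web pages   (step 2, web page retrieval)
    mine(pages)       -> list of services    (step 3, data mining)
    rank(services, (lat, lon)) -> ranked list (step 4; step 5 is the UI)
    """
    city = geocode(lat, lon)                 # 1. reverse geocoding
    pages = search(keyword + " " + city, k)  # 2. query "keyword+city"
    services = mine(pages)                   # 3. detect addresses/services
    return rank(services, (lat, lon))        # 4. rank by distance/relevance
```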
1. Geocoding
Main idea:
• We have (lat, lon) from the user and we need to determine what city the user is in, and maybe neighboring cities
What we have:
• For the mobile search, we use our own geocoded database (the Location table), which links all the street addresses in Finland to geographic coordinates (northing, easting in the ETRS system)
• For the Mopsi website (photo, user tracking etc.) we use geocoding services from Google
What we propose:
• Since we are going to use OpenStreetMap data for the data mining step, we could also use OpenStreetMap for geocoding, using the Nominatim project: http://wiki.openstreetmap.org/wiki/Nominatim and http://nominatim.openstreetmap.org/
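A minimal sketch of how the Nominatim reverse-geocoding endpoint could be used for step 1. The request URL and the `address` keys (`city`, `town`, `village`) follow the Nominatim API; the helper names are our own, and no network call is made here.

```python
from urllib.parse import urlencode

NOMINATIM_REVERSE = "http://nominatim.openstreetmap.org/reverse"

def reverse_geocode_url(lat, lon):
    """Build a Nominatim reverse-geocoding request URL (JSON output)."""
    return NOMINATIM_REVERSE + "?" + urlencode(
        {"format": "json", "lat": lat, "lon": lon, "zoom": 10})

def city_from_response(response):
    """Pick a city-like name from a parsed Nominatim JSON response.

    Nominatim uses different keys depending on the place type,
    so we probe them from most to least specific."""
    address = response.get("address", {})
    for key in ("city", "town", "village", "municipality"):
        if key in address:
            return address[key]
    return None
```

Fetching `reverse_geocode_url(lat, lon)` and feeding the parsed JSON to `city_from_response` yields the city name used to build the "keyword+city" query in step 2.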
2. Web page retrieval
Main idea:
• Download k webpages from the query <keyword, city>
What we have:
• We manually query Google by downloading this URL: "http://www.google.$country/search?as_q=$keyword&num=10&hl=$country&btnG=Googlehaku&as_epq=$city&as_oq=&as_eq=&lr=lang_$country&as_ft=i&as_filetype=&as_qdr=all&as_occt=any&as_dt=i&as_sitesearch="
• We parse the returned HTML page to find the website links using regular expressions
• We download the retrieved websites
• This method is not official and may violate Google's terms of use
What we propose:
• Use a search engine API to get the required links and download them
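The proposed retrieval step, sketched with the search API and HTTP client left as injectable placeholders, since the concrete API (Google, Yahoo, Bing etc.) has not been chosen yet. `search_api` and `fetch` are hypothetical names.

```python
def retrieve_pages(keyword, city, search_api, fetch, k=10):
    """Get the top-k result URLs for the query "keyword city" from a
    search API and download each page.

    search_api(query) -> list of result URLs (official API wrapper)
    fetch(url)        -> page HTML as a string (HTTP client)
    Returns a list of (url, html) pairs for the data mining step."""
    urls = search_api(keyword + " " + city)[:k]
    return [(url, fetch(url)) for url in urls]
```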
3. Data mining
Main idea:
• Find location information in HTML pages by finding postal addresses
Steps:
1. Parse and segment the HTML page
2. Identify addresses and locations (convert address to coordinates)
3. Identify the services the addresses are pointing to (name/title)
4. Find extra information (image, opening hours, telephone etc.)
3.1 Parse and segment the HTML page
Main idea:
• Prepare the HTML page for data mining
What we have:
• The HTML page is stripped of tags and just the text is kept. The output is plain text divided into paragraphs
• This is done either by using regular expressions to remove what is between <> or by using an HTML DOM parser (Html2Text.class.php and Html2TextDom.class.php)
What we propose:
• Convert the page into XHTML
• If needed, convert the XHTML to another data structure (a DOM tree), in order to segment the page into blocks of text
• Segmentation should be visual-based (check articles based on the VIPS algorithm), using features such as: text color, text size, background color, block position etc.
• We can later use the segmentation to find out relationships between the blocks of text which contain addresses and the blocks of text which contain descriptive information
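A parser-based alternative to stripping tags with regular expressions, sketched with Python's standard-library HTML parser (the real implementation uses PHP's Html2Text/Html2TextDom classes; this only illustrates the tag-stripping and paragraph-splitting idea, not the visual segmentation).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip tags and collect visible text, split into paragraphs."""
    SKIP = {"script", "style"}                     # invisible content
    BREAKS = {"p", "div", "br", "li", "tr",
              "h1", "h2", "h3", "h4"}              # paragraph boundaries

    def __init__(self):
        super().__init__()
        self.parts, self.skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BREAKS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def paragraphs(self):
        """Return the page text as a list of non-empty paragraphs."""
        text = "".join(self.parts)
        return [p.strip() for p in text.split("\n") if p.strip()]
```

Usage: feed the downloaded HTML with `extractor.feed(html)`, then pass `extractor.paragraphs()` to the address-identification step.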
3.2 Identifying addresses
Main idea:
• Identify postal addresses (street/number/city) in blocks of text in order to determine the geographical location of the content (lat, lon)
• This may be a special case of dictionary search or of pattern matching; there are several papers on this
What we have:
• We are building a fast data structure (prefix tree) from every city we have (we have data just for Finland; here we would use/need OpenStreetMap data)
• For each word in the text we check if it is a street name (if it exists in the prefix tree) and then we look around it to find a street number, postal code or city name
• After we have an address candidate, we validate it by converting it into coordinates (lat, lon)
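The prefix-tree lookup described above can be sketched as follows. This is a simplified illustration: the real system loads all street names of the city from the geocoded database (or, as proposed, from OpenStreetMap), and looks for postal codes and city names around the match as well, not only house numbers.

```python
import re

class StreetTrie:
    """Minimal prefix tree (trie) over known street names."""
    def __init__(self, names):
        self.root = {}
        for name in names:
            node = self.root
            for ch in name.lower():
                node = node.setdefault(ch, {})
            node["$"] = True                      # end-of-name marker

    def contains(self, word):
        node = self.root
        for ch in word.lower():
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

def address_candidates(paragraph, trie):
    """Find <street name> <number> pairs: for each word that is a known
    street name, check whether the next token looks like a house number."""
    tokens = paragraph.replace(",", " ").split()
    found = []
    for i, tok in enumerate(tokens[:-1]):
        if trie.contains(tok) and re.fullmatch(r"\d+[a-zA-Z]?", tokens[i + 1]):
            found.append((tok, tokens[i + 1]))
    return found
```

Each candidate would then be validated by geocoding it to (lat, lon), as described above.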
3.2 Identifying addresses
What we propose:
• This method works just for Finland, because we are using our geocoded database
• We need to use OpenStreetMap data to retrieve all the street names in a city/village and use the same method as before to identify street names
• We need better pattern matching to build address candidates
• We need to use the same geocoder as in step 1 in order to convert the address to coordinates
3.3 Identifying the services (name/title)
Main idea:
• We have identified the addresses, now we need to find out what the addresses are pointing to (e.g. we have Science Park, Länsikatu 15, Joensuu; we have already identified Länsikatu 15 as an address, and we need to identify Science Park as a service)
What we have:
• We are employing some heuristics: take 5 words before the address (from the same paragraph), take the first line above the address etc.
• We were working on a simple classifier-based solution by using a decision tree and some features, but we did not have good results
What we propose:
• Use the results of segmentation in order to find out which block relates to the address block
• Figure out a better solution than just taking what is before the address
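The current "5 words before the address" heuristic is simple enough to state in a few lines. A sketch under the assumption that the address's start offset within the paragraph is already known from the previous step:

```python
def title_candidate(paragraph, address_start):
    """Current heuristic: take up to 5 words immediately before the
    address, within the same paragraph, as the service-name candidate.

    paragraph     -- plain-text paragraph containing the address
    address_start -- character offset where the address begins
    """
    words = paragraph[:address_start].split()
    return " ".join(words[-5:])
```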
3.4 Find extra information
Main idea:
• Once we have a service, it would be good to have some other information such as image, opening hours, telephone number etc.
What we have:
• Nothing
What we propose:
• Use some simple heuristics:
o For image, take either the biggest image on the web page (as Facebook does) if there is just one result on the page, or take the closest image to the service using segmentation results and some simple rules
o For telephone and opening hours, use some simple rules on the blocks related to the address or the service block
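The "biggest image on the page" heuristic proposed above, as a sketch; the (url, width, height) input format is an assumption, since nothing is implemented yet.

```python
def pick_image(images):
    """Heuristic for single-result pages: pick the visually largest
    image (max width * height), as Facebook's link preview does.

    images -- list of (url, width, height) tuples scraped from the page
    Returns the URL of the chosen image, or None if there are no images."""
    if not images:
        return None
    url, _, _ = max(images, key=lambda img: img[1] * img[2])
    return url
```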
4. Ranking
Main idea:
• Once we have the results, we need to sort them by relevance
What we have:
• Results are ranked by the distance from the user's location
What we propose:
• We could use some rules from the recommendation system to rank the results, but this is the last problem we need to take care of
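The current distance-based ranking can be sketched with the standard haversine formula; the service-dict shape (`lat`/`lon` keys) is an assumption for illustration.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

def rank_by_distance(services, user_lat, user_lon):
    """Current ranking: sort services (dicts with 'lat'/'lon' keys)
    by distance from the user's location, nearest first."""
    return sorted(services,
                  key=lambda s: haversine_km(user_lat, user_lon,
                                             s["lat"], s["lon"]))
```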
Process diagram
[Flow: User supplies (lat, lon) and keywords → Geocoder → Search engine → web pages → Data mining → result list → Result ranking → ranked result list back to the User]
Data mining diagram
To follow soon