Mopsi Search - Itä

Download Report

Transcript Mopsi Search - Itä

Location-based search: services,
photos, web
Andrei Tabarcea
Mohammad Rezaei
4.12.2013
Introduction
keyword
Results on map
User location
The goal is to find services, photos and points of interest close
to the user’s location
We call this “location-based search”
We try to search our local database of photos and services and
to find location information in web-pages
MOPSI search
Mopsi search
Mopsi Services Database
User location
keyword
Mopsi Photo Collection
Mopsi Web Search
Combination
of
search
results
Web interface
Input: (keyword, user location)
Output: array of results
keyword
Search options
Results on map
Results list
User location
Mobile interface
User location
Search results
Mopsi search (server workflow)
Input: (keyword, flagS, flagP, flagW, user location)
Output: (g_markersData) array of results
keyword
Search options
Results on map
flagS
flagP
flagW
Results list
User location
Location based search
Input: keyword, flagS, flagP, user location (lat,lon)
Output: list of results
Note: A service has a list of keywords and a title
A photo has just a description
So, Keyword search is done according to this information
Notation:
S: service, P: photo
text(S): keywords and title of service
text(P): description of photo
flagS: search for services if true
flagP: search for photos if true
Start
Overall flow
Update keywords statistics
Update keywords history
N
flagS
Y
Local service search and display
results in list
When a keyword is searched:
flagP
N
Y
statistics: the count of it in
database is incremented,
keyword and city are stored
Stage 1:
Search mopsi
services
Photo search and add
results to the list
Stage 2:
Search mopsi
photos
Display all results on Map
history: keyword, location,
userid and time are stored
N
flagW
Y
Web search
Add results to the list and on map
End
Stage 3:
Search web
Local service search
Start
The list of results
Do search on server
nL=number of results
nL>0
N
Y
Cluster results with almost
same title and location
Sort the results
(distance to user location)
Display results in the list
End
Take and display one
of the similar results
as representative
Photo search
Start
Do search on server
nP=number of results
nP>0
N
Y
Cluster results with almost
same title and location
Cluster the results and
Local services with almost
same title and location
Sort the results
(distance to user location)
Add results to the list
End
Start
Web search
Do search on server
nW=number of results
nW>0
N
Y
Cluster results with almost
same title and location
Cluster the results and Local services
and photos with almost same title and
location
Sort the results
(distance to user location)
Add results to the list
Add results on the map
End
Filtering results: old solution
Fixed distance to user location: d
Find services where
text(S) ≈ keyword AND dist(S,User) < d
Find photos where
text(P) ≈ keyword AND dist(P,User) < d
Advantages:
Simple
Same time for any search
Disdvantages:
Parameter d (User can choose d, but still not automatic)
There are many cases with “no results”
d
Current solution: Binary search
K-nearest services
Show all the results in 10 km
If number of results is less than K, double the distance (until
whole earth), when number of results is bigger than K,
divide the distance
Example with k=5:
Number of results n in distance d: 1 < k
Double distance: in 2d, n=2 < k
In 4d, n=8 > k
Now dividing distance in colored area:
In 3d, n=4 < k
In 3.5d, n=5 (=k)
So, we have 5 nearest results to user location in
distance x
x
4d
d
2d
User location
A photo or service with required keyword
Algorithm
d=10000: initial distance
K=10: number of required results
delta_dist: minimum distance for dividing
ns: number of resulted services res_S
np: number of resulted photos res_P
Δ
res_S = services where text(S) ≈ keyword
res_P = photos where text(P) ≈ keyword
if ( ns+np > K )
(res_S res_P dist) = extend_distance();
(res_S res_P dist) = contract_distance();
display (ns+np) services and photos
extend_distance()
ns= 0; np=0;
While ( ns+np < K AND dist < earth_r*pi)
res_S = services where text(S) ≈ keyword AND dist(S,User) < dist
res_P = photos where text(P) ≈ keyword AND dist(P,User) < dist
dist = dist*2
dist = dist/2
4d
d
2d
Algorithm (cont.)
contract_distance(dist, K)
d1 = dist/2
d2 = dist
dist = (d1 + d2)/2
delta = dist – d1
ns=np=0
While ( ns+np != K AND delta > delta_dist AND dist > d )
res_S = services where text(S) ≈ keyword AND dist(S,User) < dist
res_P = photos where text(P) ≈ keyword AND dist(P,User) < dist
if ( ns+np > K )
d1 = d1; d2= dist
else
d1 = dist; d2 = d2
dist = (d1 + d2)/2
delta = dist-d1
Simplifying distance calculation
Since there is no spatial dist function in mysql:
Points with distance < d from user location
Simplified: |lat-lat1|< Δlat AND |lon-lon1|< Δlon
(lat1+ Δlat, lon1)
d
d
lat1, lon1
d (in meter)
(lat1, lon1+ Δlon)
Δlat and Δlon?
User location
(lat1, lon1)
Δlat and Δlon?
Distance d (in meter) between two points (lat1, lon1) and (lat2, lon2):
Haversine distance:
d  E  arcsin( sin 2 ( lat / 2)  cos( lat1 ) cos( lat2 ) sin 2 ( lon / 2)
Earth diameter (in meter)
(lat1, lon1) and (lat1, lon1+ Δlon)  Δlat=0
d  E  arcsin( cos2 (lat1 ) sin 2 ( lon / 2) )
sin( d / E )  cos(lat1 ) sin( lon / 2)
lon  2  a sin(
sin( d / E )
)
cos( lat1 )
(lat1, lon1) and (lat1+ Δlat, lon1)  Δlon=0
d  E  arcsin( sin 2 ( lat / 2)
lat 
2d
E
Note : if
sin( d / E )
 1  lon  
cos( lat1 )
Location-based web data mining
How to find location-information in web-pages?
Mopsi web search
Web
mining
Geo-referencing
Geo-referencing:
A geographic reference is an information entity that is
discovered from the context and can be mapped to a
geographic location
Strategies for geographic reference extraction:
–
–
–
–
–
Gazetteer-based text matching
Rule-based linguistic analysis
Regular-expression based text matching
Using host location
Geographic meta-tags
Hu, Y. H., Lim, S., & Rizos, C. Georeferencing of Web Pages based on Context-Aware Conceptual Relationship
Analysis. 2006
Ad-Hoc Georeferencing
<HTML>
<HEAD profile"="http://geotags.com/geo>
<META name="geo.position" content="62.35;29.44">
<META name="geo.region" content="FI">
<META name="geo.placename" content="Joensuu">
<META http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">
<link rel="stylesheet"
href="http://www.joensuu.fi/tkt/sivutyyli.css"
type="text/css">
<TITLE>Pages of Pasi Fränti</TITLE>
</HEAD>
VS.
The problem is how to extract and validate location data from semistructured text
Postal address is the most common location data found
Our goal is to give geographical coordinates to services mentioned in
web-pages
We call this method ad-hoc georeferencing
Location Information in Webpages
Site hosting information (owner address, server
address etc.)
HTML tags (geo-tags, address-tags, vcards for
Google Maps etc.)
Natural language descriptions
Addresses, postal codes, phone numbers
Site hosting information
domain: uef.fi
descr: ITÄ-SUOMEN YLIOPISTO (UNIV OF EASTERN FINLAND)
descr: 22857339
address: TIETOTEKNIIKKAKESKUS (IT-CENTRE)/Jarno Huuskonen
address: PL 1627
address: 70211
address: KUOPIO FINLAND
phone: +358 44 7162810
status: Granted
created: 26.5.2010
modified: 19.8.2011
expires: 26.5.2015
nserver: ns-secondary.funet.fi [Ok]
nserver: ns1.uef.fi [Ok]
nserver: ns2.uef.fi [Ok]
dnssec: no
HTML tags
geo-tags, address-tags, vcards for Google Maps etc.
<HTML>
<HEAD profile"="http://geotags.com/geo>
<META name="geo.position" content="62.35;29.44">
<META name="geo.region" content="FI">
<META name="geo.placename" content="Joensuu">
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="http://www.joensuu.fi/tkt/sivutyyli.css"
type="text/css">
<TITLE>Pages of Pasi Fränti</TITLE>
</HEAD>
Natural language descriptions
Scouts' Youth Hostel
Show map
(8.3 km from Joensuu Airport)
Good, 7.4 Latest booking: January 23
Scouts’ Youth Hostel is located at the
outfall of River Pielisjoki, 1.5 km from
Joensuu city centre. It offers free Wi-Fi and
rooms with shared bathroom and kitchen facilities.
Olga Saint-Petersburg, Russia "Great price for the nice
room. Friendly stuff, cozy atmosphere. But a bit
loud."
from
€ 46
Postal addresses
Mopsi search
Input:
user location (lat, lon)
keywords
•
•
Output: list of services containing:
name/title
website
address (street, number. city)
location (lat, lon)
image
other info (opening hours, telephone etc.)
•
•
•
•
•
•
Main idea:
preprocess the search results of an external search engine (Google, Yahoo, Bing
etc.) by detecting postal address in order to find the location
•
Problems
- How to evaluate relevance?
- Mixed keyword meanings
- No relation between keywords and addresses
Mopsi Web Search Workflow
Mobile
application
Web user
interface
Keyword
Coordinates
Search
results
Keyword
Coordinates
Search
results
Coordinates
Geo-referencing
module
Address
Geocoded
street-name
database
Georeferencing module
Geocoded
database
Keyword, Address,
Coordinates
Relevant
municipalities
detector
Keyword
Municipalities
Page
parser
Word
list
Address and
description
detector
Results
list
Address
validator
Georeferencing module
<keyword,
municipality>
query
Result
links
Sorted results
list
Proposed steps
1. Convert user location (lat, lon) into user address = Geocoding step
2. Search with the query "keyword+city" using an external search engine API
and download the first k results (web pages) = Web page retrieval step
3. Detect addresses and additional informatio from the downloaded web pages
= Data mining step
4. Ranking the results (distance, relevance etc.) = Ranking step
5. Display the search results to the user
lat, lon
User
1.
Geocoder
keywords
2.
Web page
retrieval
web
pages
5. ranked
result list
3.
Data
mining
result
list
4.
Result
ranking
1. Geocoding
Convert user location (lat, lon) into user address using:
lat, lon
User
Geocoder
keywords
Web page
retrieval
web
pages
ranked
result list
Data
mining
result
list
Result
ranking
2.Web page retrieval
Download k webpages from the query <keyword, city> using API of:
lat, lon
Geocoder
User
keywords
Web page
retrieval
web
pages
ranked
result list
Data
mining
result
list
Result
ranking
3.Data mining
Main idea:
Find location information in HTML pages by detecting postal addresses
Steps:
1. Parse and segment the HTML page
2. Identify addresses and locations
3. Identify the services the addresses are pointing to (name/title) and
retrieve extra information (photos, opening hours, telephone etc.)
lat, lon
Geocoder
User
keywords
Web page
retrieval
web
pages
ranked
result
list
Data
mining
result
list
Result
ranking
3.1 Parsing HTML pages
Joen Pizza Special Y-tunnus 2129577-6
Käyntiosoite Koskikatu 17 80100
JOENSUU Postiosoite Koskikatu 17
80100 JOENSUU Puhelin: 013-220246
Virallinen toimiala Kahvila-ravintolat
Current solution extracts an array of text from HTML pages
We don’t exploit the advantage that we extract data from
web pages
Proposed future solution:
-
Segmentation of web pages using DOM trees
Detection of the address block
Nearest-neighbor search considering text and visual characteristics
Web page example - Homepage
DOM tree
blue: links (the A tag)
red: tables (TABLE, TR and TD tags)
green: dividers (DIV tag)
violet: images (the IMG tag)
yellow: forms (FORM, INPUT, TEXTAREA,
SELECT and OPTION tags)
orange: linebreaks and blockquotes (BR, P,
and BLOCKQUOTE tags)
black: HTML tag, the root node
gray: all other tags
DOM subtree
<body>
<tr>
<div>
<tr>
<td>
PizzaPojat Niinivaara
<html>
<table>
<table align="center“>
<tr>
<td>
<div id="footerleft">
<h3>PizzaPojat Niinivaara</h3>
<p>Niinivaarantie 19</p>
<p>80200 Joensuu</p>
<br />
<p>013 - 137 017</p>
</div>
<td>
</tr>
</table>
<td>
<table>
<div>
Niinivaarantie 19
80200 Joensuu
013 - 137 017
<br/>
Web page example - Catalog
Miami
Bosbor kebab
Fiesta
Proposed implementation
1.
Convert HTML pages to xHTML for using xQuery
2.
Detect addresses and postal codes
3.
Break the DOM tree into subtrees
4.
Use heuristics and regular expressions to detect extra information
from the subtree (service name, telephone, opening hours etc.)
<body>
<tr>
<div>
<tr>
<td>
PizzaPojat Niinivaara
<html>
<table>
<td>
<table>
Niinivaarantie 19
80200 Joensuu
013 - 137 017
<br/>
3.2 Postal address detection
Rule-based pattern matching algorithm
Starting point: the detection of street-names
Prefix trees are used for fast text matching for street-names
An address-block candidate is constructed by detecting:
• street names and number
• postal codes
• municipal names
We will use OpenStreetMap database for global detection
Street names
Telephone
numbers
City names
Street
numbers
3.2 Postal address detection
AddressDetection(words)
i=0
streetName number postcode city
while i < count(words)
set street, number, postcode, city as empty
if word[i] is streetName
Joen Pizza Special Y-tunnus: 2129577-6
i++
street = words[i]
Käyntiosoite: Koskikatu 17 80100 JOENSUU
for j = i to i+5
Puhelin: 013-220246 Virallinen toimiala:
if words[j] is number
Kahvila-ravintolat
number = words[j]
break
for k = j+1 to j+5
if word[k] is postcode
postcode = words[k]
j=k
break
for k = j+1 to j+5
if words[k] is city
city = words[k]
i = k+1
break
if street is not empty AND number is not empty AND city is not empty
candidate = (street, number, postcode, city)
Prefix Trees
Invented by Friedkin (1960)
The prefix tree (or trie) is a fast ordered
tree data structure used for retrieval
Root is associated with an empty string
All the descendants of a node have a
common prefix of the string associated
with that node
Some nodes can have associated values
(usually they mark the end of a word)
Street-name prefix trees
Prefix Tree Statistics
Finland
Singapore
34
14
Average tree depth
12.7
7.4
Average tree width
105
167
Average number of nodes per
tree
2338
2335
Total size (MB)
74.4
0.18
Maximum tree depth
Our solution is to detect street-names using prefix trees constructed from
the gazetteer
A street-name prefix tree is build for each municipality used in the search
The user’s location and his area of interested are known, therefore prefixtrees can be limited to municipalities
3.3 Retrieve extra information
-
Title detection (or company detection) is a
Named Entity Recognition problem
Joen Pizza Special Y-tunnus: 2129577-6
Käyntiosoite: Koskikatu 17 80100 JOENSUU
Postiosoite Koskikatu 17 80100 JOENSUU
Puhelin: 013-220246 Virallinen toimiala:
Kahvila-ravintolat
words before
the address
address
Usually, the text before the address holds relevant information
There are other methods to investigate such as using classifiers or
using web page structure
4. Ranking
Main criterion:
distance from the user’s location
Future idea:
relevance to user’s profile and history
lat, lon
Geocoder
User
keywords
Web page
retrieval
web
pages
ranked
result list
Data
mining
result
list
Result
ranking
Future ideas recap
– Use freely available geographical sources for
extending the prototype to other regions
– Use geographical scope of a web page to improve
address detection and disambiguation
– Use the structure of the HTML page and DOM tree
semantic analysis for better data extraction
– Gather and tag a testing dataset for better
evaluation of the algorithms