Transcript Analyzers

Richa Arora








Tool Identified and Overview
Schema.xml
Tokenization, Stop words, and Synonym Handling
Indexing
Data Import Handler
Query format and Matching documents to query
Function Queries
Bibliography


SOLR - Open source enterprise search platform from the Apache Lucene project
Purpose
◦ To implement full-text search functionality in a web application

Commercial Websites using SOLR
◦ www.digg.com
◦ http://www.whitehouse.gov/ - Uses SOLR via Drupal for site
search w/highlighting & faceting
◦ http://beta.fcc.gov/
◦ http://www.netflix.com/
[Figure: architecture diagram. Labels: web server; database server; web application; document database; SOLR]

Features
◦ Full text search
◦ Rich document handling (including MS Word, PDF, RTF etc.)
◦ HTML administration interface
◦ Scalable
Technology
◦ Java programming language
◦ Lucene Java search library
◦ Runs as a search server within a servlet container such as
Tomcat or Jetty
[Figure: Solr server diagram. Labels: browser-based web interface; documents; search queries; documents for indexing; search results; Solr server (searching, indexing, schema.xml, solrconfig.xml); index]






Documents form the basic unit of SOLR
Documents are composed of fields
Examples:
◦ Document for Person: Fields – name, height, age, etc.
◦ Document for Recipes: Fields – origin, ingredients, etc.
Documents are fed to SOLR
SOLR extracts the information from the fields in the
documents and makes it searchable
Steps:
◦ Field Analysis
◦ Tokenization
◦ Filter application
◦ Indexing
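
The steps above can be sketched in a few lines of Python. This is only an illustration of the idea of an inverted index, not how SOLR or Lucene implement it: the analysis step here is a plain whitespace split plus lowercasing, and the documents and field values are hypothetical.

```python
from collections import defaultdict

def analyze(text):
    """Tokenize on whitespace and lowercase each token (a stand-in for real field analysis)."""
    return [token.lower() for token in text.split()]

def build_index(documents):
    """Map each term to the set of document ids whose fields contain it."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for value in fields.values():
            for term in analyze(str(value)):
                index[term].add(doc_id)
    return index

# Hypothetical documents, modeled on the Person and Recipes examples.
documents = {
    "person1": {"name": "Ada Lovelace", "age": 36},
    "recipe1": {"origin": "Italy", "ingredients": "tomato basil garlic"},
}

index = build_index(documents)
print(sorted(index["basil"]))  # ['recipe1']
```

Looking up a term in the index returns the ids of the documents that contain it, which is what makes the field content searchable.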



Governs how SOLR builds indexes from input documents
Defines the field types and the specific fields that documents can contain
Describes how SOLR should handle the fields when adding documents to the index or when querying those fields
<schema>
<types>
<fields>
<uniqueKey>
<defaultSearchField>
<solrQueryParser defaultOperator>
<copyField>
</schema>



Analyzers examine the text of fields and generate a token stream
Indexing Analyzers: The resulting token stream is added to the index and defines the set of terms (including positions, sizes, etc.) for the field
Querying Analyzers: The values being searched for are analyzed, and the resulting terms are matched against those stored in the field's index
<fieldType name="nametext" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>



A tokenizer splits a stream of text into tokens
Tokens are subsequences of the characters in the text
A token carries metadata in addition to its text value, such as the location at which the token occurs in the field
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>

Example
◦ Standard Tokenizer: Treats whitespace and punctuation as delimiters
  ▪ Input: “Email: [email protected]”
  ▪ Output: “Email”, “[email protected]”
◦ N-Gram Tokenizer: Reads the field text and generates n-gram tokens of sizes in the given range (default minimum is 1 and maximum is 2)
  ▪ Input: “hello world”
  ▪ Output: “h”, “e”, “l”, “l”, “o”, “ ”, “w”, “o”, “r”, “l”, “d”, “he”, “el”, “ll”, “lo”, “o ”, “ w”, “wo”, “or”, “rl”, “ld”
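
As a rough illustration of n-gram tokenization, the Python sketch below generates every character n-gram of each size in a range, shortest sizes first. The function name and the ordering are assumptions made for this example; it is not the SOLR tokenizer itself.

```python
def ngram_tokens(text, min_size=1, max_size=2):
    """Return all character n-grams, shortest sizes first, in order of appearance."""
    tokens = []
    for size in range(min_size, max_size + 1):
        for start in range(len(text) - size + 1):
            tokens.append(text[start:start + size])
    return tokens

tokens = ngram_tokens("hello world")
print(len(tokens))   # 21: 11 single characters plus 10 two-character grams
print(tokens[11:])   # the two-character grams, including 'o ' and ' w'
```

Note that the space between the two words is treated like any other character, so it appears both as a token of its own and inside two of the two-character grams.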



Filters take tokens as input from the Tokenizers and
produce another stream of tokens as output
Multiple filters can be used one after the other
Example:
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldType>


Stop Filter: Discards tokens that are on the given stop words list. A standard stop words list for English language text, named stopwords.txt, is included in the SOLR config directory
Example: Using the standard stopwords.txt
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
</analyzer>
Tokenizer Input : “welcome to the world of Solr”
Tokenizer Output/Filter Input: “welcome”(1), “to”(2), “the”(3),
“world”(4), “of”(5), “Solr”(6)
Filter Output: “welcome”(1), “world”(2), “Solr”(3)
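
The effect of a stop filter can be sketched in Python as follows. The stop word set here is an assumed three-word subset chosen for the example, not the contents of stopwords.txt, and the surviving tokens are renumbered from 1 purely for illustration.

```python
# Assumed subset of a stop words list, for illustration only.
STOP_WORDS = {"to", "the", "of"}

def stop_filter(tokens, stop_words=STOP_WORDS):
    """Drop tokens found in the stop word set and number the remaining tokens from 1."""
    kept = [token for token in tokens if token not in stop_words]
    return [(token, position) for position, token in enumerate(kept, start=1)]

tokens = "welcome to the world of Solr".split()
print(stop_filter(tokens))  # [('welcome', 1), ('world', 2), ('Solr', 3)]
```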


Synonym Filter: Maps tokens to synonyms at index time as well as at query time. Each token is looked up in the list of synonyms, and if a match is found, the synonyms are emitted in place of the token
Example: Define the synonyms in a file (test_synonyms.txt) and use it for matching the tokens
◦ home, dwelling, house
◦ shop => workshop, store
◦ teh => the
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="test_synonyms.txt"/>
</analyzer>
Tokenizer Input: “teh home shop”
Tokenizer Output/Filter Input: “teh”(1), “home”(2), “shop”(3)
Filter Output: “the”(1), “home”(2), “dwelling”(2), “house”(2), “workshop”(3), “store”(3)
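
A minimal Python sketch of synonym expansion, under two assumptions: a comma-separated rule means every word in the group expands to the whole group, and a rule with => replaces the left-hand words with the right-hand ones. The function names are made up for this example; synonyms are given the position of the token they came from.

```python
def parse_synonyms(lines):
    """Build a lookup table from synonym rules in the two formats used here."""
    table = {}
    for line in lines:
        if "=>" in line:
            left, right = line.split("=>")
            replacements = [word.strip() for word in right.split(",")]
            for word in left.split(","):
                table[word.strip()] = replacements
        else:
            group = [word.strip() for word in line.split(",")]
            for word in group:
                table[word] = group
    return table

def synonym_filter(tokens, table):
    """Replace each token with its synonyms; synonyms share the original token's position."""
    output = []
    for position, token in enumerate(tokens, start=1):
        for term in table.get(token, [token]):
            output.append((term, position))
    return output

rules = ["home, dwelling, house", "shop => workshop, store", "teh => the"]
result = synonym_filter("teh home shop".split(), parse_synonyms(rules))
print(result)
# [('the', 1), ('home', 2), ('dwelling', 2), ('house', 2), ('workshop', 3), ('store', 3)]
```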



Refers to adding content to a SOLR index to make the content searchable
Sources of data for indexing:
◦ XML
◦ CSV
◦ Rich text formats (PDF, MS Word, MS Excel, text etc.)
◦ Data extracted from tables in a database

Uploading Data with SOLR Cell
◦ Using ExtractingRequestHandler
◦ With a POST
◦ With SOLR Cell and SOLRJ

Uploading Data with Index Handlers
◦ XMLUpdateRequestHandler for XML-formatted Data
◦ Using the CSVRequestHandler for CSV Content
◦ Indexing Using SOLRJ


Uploading Structured Data Store Data with the Data Import Handler
Content Streams



curl posts and retrieves data over HTTP, FTP, and many other protocols
The example below calls the Extraction Request Handler, uploads the file tutorial.html, and assigns it the unique ID doc1
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "[email protected]"





literal.id provides a unique ID for the document uploaded to SOLR
commit=true makes the document searchable after indexing
The -F flag instructs curl to POST data using the Content-Type multipart/form-data, which supports the uploading of binary files
The @ symbol instructs curl to upload the attached file
The argument [email protected] needs a valid file path
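
The request URL for the extraction handler can also be assembled programmatically. The Python sketch below only builds the URL from its parameters and does not send anything; the helper name is hypothetical, and the host, port, and handler path are taken from the curl example.

```python
from urllib.parse import urlencode

def extract_url(doc_id, base="http://localhost:8983/solr/update/extract"):
    """Build the extraction request URL for a document with the given unique ID."""
    params = {
        "literal.id": doc_id,            # unique ID assigned to the uploaded document
        "uprefix": "attr_",
        "fmap.content": "attr_content",
        "commit": "true",                # make the document searchable after indexing
    }
    return base + "?" + urlencode(params)

url = extract_url("doc1")
print(url)
```

Building the query string with urlencode avoids quoting mistakes when an ID contains characters that are not safe in a URL.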
Order of operation:
1. Modify the schema.xml file to add any fields that do not already exist in it, for example: authors, dd, isbn, yearpub, publisher
2. Modify the schema.xml file to copy the newly created fields to the text field so that they are included in searches
3. Run the curl utility with the command for adding an XML document:
curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary \
"<add><doc>
<field name='id'>doc26</field>
<field name='authors'>Patrick Eagar</field>
<field name='subject'>Sports</field>
<field name='dd'>796.35</field>
<field name='isbn'>0002166313</field>
<field name='yearpub'>1982</field>
<field name='publisher'>Collins</field>
</doc><commit waitFlush='false' waitSearcher='false'/></add>"
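
Writing the XML by hand is error-prone once field values contain characters such as & or <. As a sketch, the add message can be generated with Python's standard xml.etree.ElementTree module, which escapes values automatically. The helper name is hypothetical, and the commit element is left out of this sketch.

```python
import xml.etree.ElementTree as ET

def build_add_message(fields):
    """Return an <add><doc>...</doc></add> message with one <field> per entry."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

message = build_add_message({
    "id": "doc26",
    "authors": "Patrick Eagar",
    "subject": "Sports",
    "dd": "796.35",
    "isbn": "0002166313",
    "yearpub": "1982",
    "publisher": "Collins",
})
print(message)
```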



Data is often stored in relational databases
The Data Import Handler (DIH) provides a mechanism to import data from a database and index it
DIH can also index content from RSS and ATOM feeds, e-mail repositories and structured XML

Handler to be registered in the solrconfig.xml file
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">${solr.config.dir:./solr/conf}/dataimporthandler/dataconfig.xml</str>
</lst>
</requestHandler>

There can be multiple configuration files
1. Create a database in SQL Server 2005
2. The tables and the relationships in the database are shown below
[Figure: database tables and their relationships]
3. Create an XML file called DIH_Test.xml for importing into SOLR
4. Modify the solrconfig.xml file to instruct SOLR to import data as per the file DIH_Test.xml
5. Do a full-import of the DIH from the browser using: http://localhost:8983/solr/dataimport?command=full-import
6. Run queries on the newly indexed data from the database
Example: http://localhost:8983/solr/select?q=ipad2
The above query returns the result. Executing queries on the original database returns similar results
[Figure: query processing flow through the Request Handler, Query Parser, Index, and Response Writer, annotated with the parameters below]
qt: selects a Request Handler for a query using /select
defType: selects a query parser for the query
qf: selects which field to query in the index
fq: filters the query by applying an additional query to the initial query’s results; caches the results
rows: specifies the number of rows to be displayed at a time
start: specifies an offset into the query results where the returned response should begin
wt: selects a response writer for formatting the query response
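
To show how these parameters combine in a single request, here is a small Python sketch that builds a /select URL. It only constructs the URL; the helper name and the filter field inStock are hypothetical and used for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def select_url(query, base="http://localhost:8983/solr/select", **params):
    """Build a select URL from the main query and any additional parameters."""
    return base + "?" + urlencode({"q": query, **params})

# inStock is a hypothetical field name used only for this example.
url = select_url("ipad2", fq="inStock:true", start=0, rows=10, wt="json")
print(url)

# Parsing the URL back shows the individual parameters.
print(parse_qs(urlsplit(url).query))
```

Paging through results then amounts to keeping rows fixed and increasing start by that amount for each page.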



Standard Query Parser
Advantage – Enables the user to specify very precise queries
Disadvantage – Is less tolerant of syntax errors than the DisMax query parser
Parameters Supported
◦ Terms – Use of wild card characters, Fuzzy Searches, Boosts and Ranges
◦ Fields – Identified by name followed by a colon
◦ Boolean Operators – AND, OR, NOT, &&, !, ||
◦ Common query parameters – debugQuery, defType, explainOther, fl, fq, omitHeader, rows, sort, start, timeAllowed
◦ Functions – abs, constant, div, fieldValue, log, linear, max, etc.
◦ Faceting
◦ Highlighting
◦ MoreLikeThis (mlt)

q – Defines a query using standard query syntax. This parameter is mandatory

q.op – Specifies the default operator for query expressions, overriding the default operator defined in schema.xml. Possible values are “AND” or “OR”

df – Specifies a default field, overriding the definition of a default field in schema.xml
Default parameter values are specified in solrconfig.xml

Query
http://localhost:8983/solr/select?q=id:6H500F0&popularity=6
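
The effect of the default operator can be illustrated with a simplified Python sketch: with OR a document matches if it contains any query term, with AND it must contain all of them. This ignores scoring and everything else a real query parser does, and the documents are hypothetical.

```python
def matches(doc_terms, query_terms, default_operator="OR"):
    """Return True if the document satisfies the query under the given default operator."""
    hits = [term in doc_terms for term in query_terms]
    return all(hits) if default_operator == "AND" else any(hits)

# Hypothetical documents, each reduced to its set of indexed terms.
docs = {
    "doc1": {"ipod", "video"},
    "doc2": {"ipod", "nano"},
}
query = ["ipod", "video"]

for operator in ("OR", "AND"):
    found = sorted(doc_id for doc_id, terms in docs.items()
                   if matches(terms, query, operator))
    print(operator, found)
# OR ['doc1', 'doc2']
# AND ['doc1']
```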




Fuzzy Searches – based on the Levenshtein Distance (Edit Distance)
E.g. tight~ will match terms like flight, slight etc.
An optional parameter specifies the required degree of similarity – e.g. tight~0.8 will match sight. The closer the value is to 1, the higher the similarity a term needs in order to match
If the numerical parameter is omitted, the default value is 0.5
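
For reference, the Levenshtein (edit) distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The Python sketch below computes that distance only; how SOLR converts the distance into a similarity value between 0 and 1 is not reproduced here.

```python
def edit_distance(a, b):
    """Return the Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

print(edit_distance("tight", "sight"))   # 1
print(edit_distance("tight", "flight"))  # 2
print(edit_distance("tight", "slight"))  # 2
```

A smaller distance relative to the term length means a closer match, which is why sight is nearer to tight than flight or slight.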

Range Searches
◦ Specify a range (with an upper and a lower bound) of values for a field
◦ Can be inclusive or exclusive of the lower and upper bounds
Query:
http://localhost:8983/solr/select?q=popularity:{5 TO 7}
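
In the query syntax, curly braces as in {5 TO 7} exclude the bounds, while square brackets as in [5 TO 7] include them. The difference can be sketched in Python over a hypothetical list of popularity values:

```python
def in_range(value, lower, upper, inclusive=True):
    """Check whether value lies in the range, with or without the bounds themselves."""
    if inclusive:
        return lower <= value <= upper
    return lower < value < upper

popularity_values = [4, 5, 6, 7, 8]  # hypothetical field values

exclusive = [v for v in popularity_values if in_range(v, 5, 7, inclusive=False)]
inclusive = [v for v in popularity_values if in_range(v, 5, 7, inclusive=True)]
print(exclusive)  # [6]        corresponds to popularity:{5 TO 7}
print(inclusive)  # [5, 6, 7]  corresponds to popularity:[5 TO 7]
```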
Parameter – Description
defType – Query parser to be used (DisMax or Standard Query Parser)
sort – Sorts the response to a query in asc or desc order based on the response’s score or another characteristic
start – Offset into the responses at which SOLR should begin displaying content
rows – Number of rows of responses displayed at a time
fq – Filter query for search results
fl – Limits responses to a listed set of fields
debugQuery – Includes debugging information
timeAllowed – Time allowed for a query to be processed. If the time elapses before the response is complete, partial information is returned
omitHeader – Excludes header information from returned results
wt – Specifies the response writer


Used to generate a relevancy score using the actual value of one or more numeric fields
Functions available for function queries
◦ abs – abs(x); abs(-5)
◦ constant – 1.5; _val_:1.5
◦ div – div(1,y); div(sum(x,100), max(y,1))
◦ linear – linear(x, m, c); linear(x, 2, 4) returns 2*x+4
◦ log – log(x); log(sum(x,100))
◦ …
Include a function query in a SOLR query
◦ With the _val_ keyword – e.g. _val_:myNumericField
◦ As a parameter with an explicit type of FunctionQuery (e.g. the DisMax query parser’s bf parameter)
http://localhost:8983/solr/select/?q=cat:electronics+_val_:"div(price,weight)"&fl=*,score
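
To make the arithmetic concrete, the Python sketch below mirrors a few of these functions as plain Python functions and evaluates them on a hypothetical document. It illustrates what the expressions compute, not how SOLR evaluates them; log is taken to be base 10.

```python
import math

def linear(x, m, c):
    """linear(x, m, c) returns m*x + c."""
    return m * x + c

def div(x, y):
    """div(x, y) divides x by y."""
    return x / y

def log(x):
    """log(x), taken here as the base-10 logarithm."""
    return math.log10(x)

# Hypothetical numeric field values for one document.
doc = {"price": 300.0, "weight": 4.0}

print(linear(3, 2, 4))                   # 10, i.e. 2*3 + 4
print(div(sum([50, 100]), max(0, 1)))    # 150.0, as in div(sum(x,100), max(y,1)) with x=50, y=0
print(div(doc["price"], doc["weight"]))  # 75.0, as in div(price,weight)
```

Wrapping the divisor in max(y,1), as in the example above, guards against division by zero when a field value is 0.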



Generates a formatted response to a search
The wt parameter sets the response writer
Response writers supported
◦ json
◦ php
◦ phps
◦ python
◦ ruby
◦ xml
◦ xslt
http://wiki.apache.org/solr/FrontPage (link last accessed on 04/25/2011)

Lucid Works SOLR Reference Guide 1.4, http://www.lucidimagination.com/user_download/certified/cdrg/lucidworks-solrrefguide-1.4.pdf (link last accessed on 04/25/2011)
