Practical Solr

Guide for Developers
First…some questions.
• How many of you in the room know what Solr is?
• How many have worked with Solr?
• How many will be using Solr or text search technology in
their upcoming projects?
Why am I here speaking to you about this?
• Several projects in 2011/2012 involving search technology
• One of the most visited recipe sites in the US, with 200,000 hits per
hour during peak times
• Resource portal for the world’s leading vendor of large format printers
• First encounter was with Lucene.NET, which led to Solr
• Second encounter with Solr on Azure
• Afterwards Jetty and Tomcat configurations
• Currently working on https://github.com/radekz2/SolrStarterKit
99bugs.com
Solr and Lucene
Solr is the popular, blazing fast open source enterprise search platform
from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database
integration, rich document (e.g., Word, PDF) handling, and geospatial
search. Solr is highly scalable, providing distributed search and index
replication, and it powers the search and navigation features of many of
the world's largest internet sites.
Apache Lucene(TM) is a high-performance, full-featured text search
engine library written entirely in Java.
Not Frictionless
• Java
• Complex configuration
• Still evolving documentation
• Too many brief tutorials
What we will talk about today.
• Getting up and running
• Setting up as a service
• Importing data
• Spelling
• Stopwords, Synonyms, Elevate
• Facets
• Replication, ZooKeeper (Cloud setup)
• Integration deep dives
• Etc.
Solr and Lucene
[Architecture diagram: a Web Server hosts the Solr web application (Solr.war), which holds the cores listed below. Other labels in the diagram: Web Clients, CMS, Bash/PowerShell etc., PHP, Document Repositories.]
• Core1 (recipes): data-config.xml, solrconfig.xml, schema.xml
• Core2 (food articles): data-config.xml, solrconfig.xml, schema.xml
• Core3 (etc.): data-config.xml, solrconfig.xml, schema.xml
Solr Terminology
Solr Core: Also referred to as just a "Core". This is a running instance of a Solr index
along with all of its configuration (SolrConfigXml, SchemaXml, etc.). A single Solr
application can contain 0 or more cores which are run largely in isolation but can
communicate with each other if necessary via the CoreContainer. From a historical
perspective: Solr initially only supported one index, and the SolrCore class was a
singleton for coordinating the low-level functionality at the "core" of Solr. When
support was added for creating and managing multiple Cores on the fly, the class was
refactored to no longer be a Singleton, but the name stuck.
Facet: A distinct feature or aspect of a set of objects; "a way in which a resource can
be classified" (*)
Request Handler: A Solr component that processes requests. For example,
the DisMaxRequestHandler processes search queries by calling the DisMax Query
Parser. Request Handlers can perform other functions, as well.
http://wiki.apache.org/solr/SolrTerminology
Solr Terminology
Solr Core: Searchable grouping of documents (index).
E.g.
Core 1 = Recipes
Core 2 = Articles about Food
Facet: categorisation
Request Handler: Functional grouping under a URL, a
lot like a route in PHP frameworks, e.g.
/core1/search -> searches recipes
/core1/importxml -> triggers importing from XML files
Starting Solr in under 1 minute
Requirements:
• Solr downloaded and unpacked
• JRE installed (http://java.com)
1. From the command line, navigate to /apache-solr-3.6.1/example
2. Run java -Dsolr.solr.home=multicore -jar start.jar
* Also see README.txt in /apache-solr-3.6.1/example
Solr With Tomcat
<?xml version="1.0" encoding="utf-8"?>
<Context docBase="C:/solr_tomcat/apache-solr-3.5.0.war"
         debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="C:/solr_tomcat" override="true"/>
</Context>
http://wiki.apache.org/solr/SolrTomcat
Save the context file above under:
C:\Program Files\Apache Software Foundation\Tomcat 6.0\conf\Catalina\localhost
Files and Directories
• solr
  • core0
    • conf
      • schema.xml
      • solrconfig.xml
      • data-config.xml
      • dataimport.properties
      • solrcore.properties
    • data
  • core1
  • solr.xml
solr.xml
Solr web application settings. Define your cores here along with a few global settings.
<solr persistent="false" sharedLib="global_libs">
  <!--
  adminPath: RequestHandler path to manage cores.
  If 'null' (or absent), cores will not be manageable via request handler
  -->
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>
Tip: use the sharedLib="global_libs" attribute to point at a directory of JARs shared by all cores
Other options: http://wiki.apache.org/solr/CoreAdmin
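A quick way to sanity-check a solr.xml like the one on this slide is to parse it and list the cores it declares. The sketch below is only an illustration using Python's standard library; the `list_cores` helper is a hypothetical name, not part of Solr. Parsing also fails loudly on typographic quotes or a malformed comment, which are easy to pick up when copying config from slides.

```python
import xml.etree.ElementTree as ET

# Legacy-style solr.xml as on the slide (straight quotes, well-formed XML).
SOLR_XML = """
<solr persistent="false" sharedLib="global_libs">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>
"""


def list_cores(solr_xml_text):
    """Return (name, instanceDir) pairs for every <core> element.

    Hypothetical helper for illustration; raises ET.ParseError if the
    text is not well-formed XML.
    """
    root = ET.fromstring(solr_xml_text)
    return [(core.get("name"), core.get("instanceDir")) for core in root.iter("core")]


if __name__ == "__main__":
    for name, instance_dir in list_cores(SOLR_XML):
        print(name, "->", instance_dir)
```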
schema.xml
schema.xml is where you describe your data.
• Lucene field definitions with analysis chain
• Column names and their respective Lucene types
• Unique key
• Default search field
• Default operator (AND/OR), due to be deprecated
http://lucidworks.lucidimagination.com/display/solr/Solr+Field+Types#SolrFieldTypes-FieldTypesIncludedwithSolr
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
http://wiki.apache.org/solr/LanguageAnalysis
Gotcha: Multivalued fields cannot be used for sorting
dataimport.properties and solrcore.properties
• dataimport.properties
  • Status file
  • Managed by Solr
  • Contains import information such as the time of the last import
• solrcore.properties
  • Contains core-specific settings assigned by the developer
  • Settings can be passed to the data import definition file
mycore.languagegroup=en
mycore.filenamefilter=.*(en|eew|enw|eez|eep)\.(xml)
In data config, these options can be retrieved as:
${mycore.languagegroup}
${mycore.filenamefilter}
Etc.
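Before wiring a file name filter such as mycore.filenamefilter into an import, it can help to check which files the regular expression actually accepts. The sketch below is an illustration only: the file names are made up, the `matching_files` helper is hypothetical, and it assumes the pattern must match the whole file name (Solr's own matching may differ, so treat the result as a guide).

```python
import re

# Value copied from the solrcore.properties example on this slide.
FILENAME_FILTER = r".*(en|eew|enw|eez|eep)\.(xml)"

# Hypothetical file names, for illustration only.
SAMPLE_NAMES = [
    "manual_en.xml",
    "manual_fr.xml",
    "brochure_eew.xml",
    "notes_en.txt",
    "garden.xml",
]


def matching_files(names, pattern=FILENAME_FILTER):
    """Return the names accepted by the pattern (whole-name match assumed)."""
    rx = re.compile(pattern)
    return [name for name in names if rx.fullmatch(name)]


if __name__ == "__main__":
    for name in matching_files(SAMPLE_NAMES):
        print(name)
```

Note that garden.xml is accepted as well: the pattern only requires the name to end in "en.xml", with no separator before the language code. If that matters for your files, tighten the expression, for example by requiring an underscore before the group.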
Importing
• From XML
  • XML can originate in a single file, multiple files (same schema) or HTTP
  • Solr will loop over common data nodes using its forEach mechanism
• From Database
  • You will need a JDBC driver for your database
  • Can run multiple queries with reference variables passed from one entity to another
Gotcha: The XPathEntityProcessor implements a streaming parser which supports a subset of XPath
syntax. Complete XPath syntax is not supported, but most of the common use cases are covered, as
follows:
xpath="/a/b/subject[@qualifier='fullTitle']"
xpath="/a/b/subject/@qualifier"
xpath="/a/b/c"
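To see what each of the three expression shapes selects, here is a small sketch against a made-up document. It uses Python's ElementTree as a stand-in, not Solr's streaming parser, so it only illustrates the intent of the expressions. ElementTree paths are relative to the root element <a>, and it has no attribute-selection step, so the /@qualifier case is emulated with get().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document shaped like the /a/b/... paths above.
SAMPLE = """
<a>
  <b>
    <subject qualifier="fullTitle">Practical Solr</subject>
    <subject qualifier="short">Solr</subject>
    <c>extra</c>
  </b>
</a>
"""


def select_examples(xml_text):
    """Return what the three example expressions would pick out of xml_text."""
    root = ET.fromstring(xml_text)  # root is the <a> element
    # /a/b/subject[@qualifier='fullTitle'] : text of subjects with that qualifier
    full_titles = [e.text for e in root.findall("./b/subject[@qualifier='fullTitle']")]
    # /a/b/subject/@qualifier : the qualifier attribute of every subject
    qualifiers = [e.get("qualifier") for e in root.findall("./b/subject")]
    # /a/b/c : text of every <c> under <b>
    c_values = [e.text for e in root.findall("./b/c")]
    return full_titles, qualifiers, c_values


if __name__ == "__main__":
    titles, qualifiers, c_values = select_examples(SAMPLE)
    print(titles, qualifiers, c_values)
```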
Gotcha: SQL Timeouts
JDBC Timeouts
<dataSource name="jdbc"
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
    type="JdbcDataSource"
    url="jdbc:sqlserver://dvtaoomb.database.windows.net:1433;database=DB_Infrastructure;user=ConsumerSitesDev@dvtaoomb;password=@SecurePwd;encrypt=true;hostNameInCertificate=data.ch1-1.database.windows.net"
    responseBuffering="full">
  <property name="testOnBorrow" value="true"/>
  <property name="validationQuery" value="SELECT 1"/>
</dataSource>
http://commons.apache.org/dbcp/configuration.html
Stop Words
• The stop words list is in
  • /apache-solr-3.6.1/example/example-DIH/solr/solr/conf
• You can find more stop words using the schema browser
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
s
such
t
that
the
their
then
there
these
they
this
to
was
will
with
Spellcheck
• Solr will build a spell index from the existing index
• The spell index is a separate set of index files, and its build
needs to be triggered
• Build the spell index once; do not trigger the build with
every query
http://localhost:8080/solr/Core_ImportXml/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on&spellcheck.build=true&spellcheck.q=stering&spellcheck=true
Note: spellcheck.build=true is needed only once, to build the spellcheck
index from the main Solr index. It takes time and should not be specified with each
request.
Note: Combine multiple fields into single spell field using
<copyField source="ProductDescription" dest="ProductSpellText"/>
Gotcha: solr.PorterStemFilterFactory (stemming in the spell field's analysis chain leads to stemmed suggestions)
http://wiki.apache.org/solr/SpellCheckComponent
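One way to keep spellcheck.build out of everyday requests is to make it an explicit opt-in wherever the URL is assembled. The sketch below is a minimal illustration using Python's standard library; the `spellcheck_url` helper is hypothetical, and the base URL is simply the localhost example from this slide.

```python
from urllib.parse import urlencode

# Base URL taken from the example request on this slide.
BASE = "http://localhost:8080/solr/Core_ImportXml/select/"


def spellcheck_url(term, build=False, base=BASE):
    """Build a spellcheck request; include spellcheck.build only when asked.

    Hypothetical helper for illustration. Pass build=True once (for example
    after an import), and leave it off for normal queries.
    """
    params = [
        ("q", "*:*"),
        ("version", "2.2"),
        ("start", 0),
        ("rows", 10),
        ("indent", "on"),
        ("spellcheck", "true"),
        ("spellcheck.q", term),
    ]
    if build:
        params.append(("spellcheck.build", "true"))
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(spellcheck_url("stering", build=True))  # one-off build request
    print(spellcheck_url("stering"))              # regular request
```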
Faceting
Just Facets:
http://localhost:8080/solr/Core_ScriptTransformer/select/?q=*%3A*&version=2.2&start=0&rows=5&indent=on&facet=true&facet.field=ProductScale&facet.field=ProductLine
For predictive search:
http://localhost:8080/solr/Core_ScriptTransformer/select/?q=*%3A*&version=2.2&start=0&rows=0&indent=on&facet=true&facet.field=Keywords&facet.prefix=a
More with Facets:
http://wiki.apache.org/solr/SimpleFacetParameters
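The facet requests above can be assembled and consumed from client code roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the sample response is made up, and it relies on the JSON response writer (wt=json) returning each facet field as a flat list of alternating terms and counts, which is its default layout.

```python
from urllib.parse import urlencode


def facet_url(base, fields, prefix=None, rows=0):
    """Build a facet request; facet.field is repeated once per field."""
    params = [("q", "*:*"), ("rows", rows), ("wt", "json"), ("facet", "true")]
    params += [("facet.field", field) for field in fields]
    if prefix is not None:
        # facet.prefix restricts facet terms, which is what drives predictive search
        params.append(("facet.prefix", prefix))
    return base + "?" + urlencode(params)


def pair_counts(flat):
    """Turn [term, count, term, count, ...] into [(term, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    base = "http://localhost:8080/solr/Core_ScriptTransformer/select/"
    print(facet_url(base, ["ProductScale", "ProductLine"], rows=5))
    print(facet_url(base, ["Keywords"], prefix="a"))

    # Hypothetical response fragment, for illustration only.
    response = {"facet_counts": {"facet_fields": {"Keywords": ["apple", 12, "apricot", 3]}}}
    for term, count in pair_counts(response["facet_counts"]["facet_fields"]["Keywords"]):
        print(term, count)
```

For predictive search, rows=0 keeps the response small, since only the facet terms are needed for the suggestion list.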
Transformers
• RegexTransformer
• ScriptTransformer
• DateFormatTransformer
• NumberFormatTransformer
• TemplateTransformer
• HTMLStripTransformer
• ClobTransformer
• LogTransformer
http://wiki.apache.org/solr/DataImportHandler#Transformer
Synonyms
beefstew = Beef stew
Query Elevate
Bring certain documents to the top based on the query
Documentation
• http://wiki.apache.org/solr/DataImportHandler
• http://wiki.apache.org/solr/SolrTerminology
• http://wiki.apache.org/solr/SpellCheckComponent
• http://lucidworks.lucidimagination.com/display/solr/Solr+Field+Types#SolrFieldTypes-FieldTypesIncludedwithSolr
• http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
• http://wiki.apache.org/solr/LanguageAnalysis
• http://commons.apache.org/dbcp/configuration.html
• http://wiki.apache.org/solr/SimpleFacetParameters
• http://wiki.apache.org/solr/SolrRequestHandler
• http://wiki.apache.org/solr/IntegratingSolr
Gotchas
• Form content type
  • http://stackoverflow.com/questions/2997014/can-you-use-post-to-run-a-query-in-solr-select
  • application/xml (not application/x-www-form-urlencoded)
• Multivalued fields cannot be used for sorting
• Dates (use date transformers)
• JDBC Timeouts
• Slow indexing with multiple database entities
• XPath Limitations
• Can you recreate your updates?
• Are you storing enough data?
Thank You!
Radek Zajkowski
https://joind.in/7458
www.99bugs.com