How to Use LucidWorks Search
Sagnik Ray Choudhury
Installation and Search Components
• Installation: http://www.lucidworks.com/download/
• Access control
• Crawling
  • Aperture crawler
  • Sources: web, filesystem, Amazon S3 bucket
• Information extraction
  • Aperture parser
• Indexing
  • Lucene
• Ranking
  • Lucene
• Result interface
  • Standard/Flare interface
lucidworks IST 441 PSU
Start Page
• http://ist441.ist.psu.edu:8988
Access Control: Admin Panel
• Admin screen: log in here (username: admin, password: admin).
Admin Dashboard
• User control
• Collections
Adding Users
• If you use a local installation:
  • Creating additional users is optional.
• If you use the server installation:
  • Create a new user with admin privileges.
  • Delete the default admin account.
  • Do not use your PSU/IST credentials.
[Screenshots: creating a new user; deleting the admin account]
Crawling: Step 1
• Add a new collection using the default template.
Crawling: Choosing a Data Source
• Click on the new collection.
• Note the index size and the number of documents.
• Add a new data source (web site).
Crawling: Parameter Selection
• Name, URL, crawl depth
• Constrain to
  • Allow crawling only within the site, or outside the site as well.
• Include paths
  • The particular set of pages you wish to crawl.
• Exclude paths
  • File types or pages you do not want to crawl.
• This is a small-scale, single-threaded crawler; for better performance, Nutch can be integrated.
• http://docs.lucidworks.com/display/help/Create+a+New+Web+Site+Data+Source
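To make the interaction of these parameters concrete, here is a minimal Python sketch of how a crawler might decide whether to follow a discovered link. It is not LucidWorks code: the function name `should_crawl`, its arguments, and the choice to treat include/exclude paths as regular expressions are assumptions made for illustration, and the example URLs are placeholders.

```python
import re
from urllib.parse import urlparse


def should_crawl(url, seed_url, depth, max_depth,
                 include_paths=(), exclude_paths=(), constrain_to_site=True):
    """Return True if a discovered link passes all data source filters.

    Hypothetical helper: mirrors the crawl depth, "constrain to",
    include paths, and exclude paths settings described above.
    """
    # Crawl depth: do not follow links deeper than the configured limit.
    if depth > max_depth:
        return False
    # Constrain to site: stay on the same host as the seed URL.
    if constrain_to_site and urlparse(url).netloc != urlparse(seed_url).netloc:
        return False
    # Exclude paths: any matching pattern rejects the URL.
    if any(re.search(pattern, url) for pattern in exclude_paths):
        return False
    # Include paths: if any are given, at least one must match.
    if include_paths and not any(re.search(p, url) for p in include_paths):
        return False
    return True


# Example settings (all values are placeholders).
rules = dict(
    seed_url="http://www.example.edu/",
    max_depth=2,
    include_paths=[r"/courses/"],
    exclude_paths=[r"\.pdf$"],
)

print(should_crawl("http://www.example.edu/courses/ist441.html", depth=1, **rules))
print(should_crawl("http://www.example.edu/courses/syllabus.pdf", depth=1, **rules))
```

In this sketch the exclude rules are checked before the include rules, so a PDF under an included path is still skipped; the real crawler may order these checks differently.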
Starting the Crawling Process
• Click "Create" to move to the crawl-job screen.
• Start crawling (you can also add a schedule to crawl periodically).
• You can add another website by going back to the collection page (slide 8, "Crawling: Choosing a Data Source").
Information Extraction and Indexing
• Information extraction from crawled web pages.
  • Default: Aperture parser.
  • Fallback: Apache Tika.
  • Extracted information: author, full text, date, etc.
  • http://docs.lucidworks.com/display/lweug/Overview+of+Crawling (field mapping section)
• Information extraction and indexing run simultaneously with crawling.
• You need to do a “hard commit” to ensure that the index is up to date.
• To learn more about the index, go to the Solr page for the collection.
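Since the index is a Solr index, one way to issue a hard commit is through Solr's standard update handler with `commit=true`. The sketch below assumes the collection's Solr endpoint is reachable over HTTP; the base URL, port, and collection name are placeholders rather than values from these slides, and the helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_commit_url(solr_base, collection):
    """Build the URL that asks Solr's update handler for a hard commit."""
    params = urlencode({"commit": "true"})
    return f"{solr_base.rstrip('/')}/{collection}/update?{params}"


def hard_commit(solr_base, collection, timeout=30):
    """Send the commit request and return the HTTP status code."""
    with urlopen(build_commit_url(solr_base, collection), timeout=timeout) as resp:
        return resp.status


# Placeholder endpoint; substitute the Solr URL of your own installation.
print(build_commit_url("http://localhost:8888/solr/", "collection1"))
```

Calling `hard_commit(...)` with the same arguments would send the request; it is left out of the example so the snippet runs without a live server.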
Searching
• Default interface: click the “Tools” link in the top panel.
Searching: Flare interface
• The “Apps” page links to the starting point for the Flare interface.
• For advanced searching and statistics, click on your collection.
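Outside the web interfaces, the same kind of search and per-field statistics can in principle be requested from the underlying Solr index through its standard `select` handler. The sketch below only builds such a request URL; the base URL, collection name, and the `author` field name are assumptions for illustration (the slides list author as extracted information, but the actual field name in your index may differ).

```python
from urllib.parse import urlencode


def build_search_url(solr_base, collection, query, rows=10, facet_field=None):
    """Build a Solr select URL, optionally asking for facet counts on one field."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Facet counts give simple statistics, e.g. documents per author.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{solr_base.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Placeholder endpoint and field name.
print(build_search_url("http://localhost:8888/solr", "collection1",
                       "information retrieval", rows=5, facet_field="author"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching documents as JSON if the endpoint and field exist.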
Conclusion
• Basic crawling, indexing and searching using LucidWorks.
• Simple to use, but does not offer much flexibility.
• Things to try:
  • Incorporating new crawlers.
  • Changing the information extraction process.
  • Changing the indexing schema and ranking functions.
• Questions?