Archive for the ‘Lucene’ category

Custom dictionaries for the Paoding analyzer (庖丁分词)

July 18th, 2013

Mind the dictionary file encoding

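Paoding supports custom dictionaries, but there is one thing to watch out for:
the dictionary file must be saved as UTF-8. A file created with Notepad on Windows is saved by default in the ANSI encoding (the system code page) rather than UTF-8, so the words in it will not be recognized.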
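Before blaming the analyzer, it can help to confirm that a dictionary file really is UTF-8. The following is a minimal sketch using only the JDK; the class name DicEncodingCheck is made up for this example and is not part of Paoding. It decodes the bytes strictly, so malformed sequences are reported instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Paoding.
final class DicEncodingCheck {

    /** Returns true if the bytes decode as strict UTF-8 (malformed input is rejected, not replaced). */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DicEncodingCheck <file.dic>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + " is valid UTF-8: " + isValidUtf8(bytes));
    }
}
```

Note that a file containing only ASCII characters also passes this check, because ASCII is a subset of UTF-8. The check is only informative for files that contain non-ASCII words, which is the usual case for a Chinese dictionary.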

Steps for adding a custom dictionary

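  1. Configure the location of the custom dictionary in the paoding-dic-home.properties file.
    • First set paoding.dic.home.config-fisrt. It has two options: system-env and this.
      • paoding.dic.home.config-fisrt=system-env means the system environment variable is used. With this option, you need to set the environment variable PAODING_DIC_HOME to the directory that contains the dictionaries.
      • paoding.dic.home.config-fisrt=this means the settings in this configuration file are used. With this option, you also need to set paoding.dic.home in the same file.
    • Set paoding.dic.home (only needed if you chose paoding.dic.home.config-fisrt=this above). There are again two ways: a relative path or an absolute path.
      • paoding.dic.home=classpath:dic uses a relative path. Place the dic folder inside any of the folders listed on the project's classpath. In an Eclipse project, the dic folder usually goes in the source folder (src).
      • paoding.dic.home=D:/somepath/dic uses an absolute path.
  2. Once the path is configured, you can create any number of dictionary files with the .dic suffix in the dic folder, one word per line. Again, note that they must be saved as UTF-8.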
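The lookup order in step 1 can be sketched in a few lines of Java. This is an illustration of the rules as described in this post, not Paoding's own code: the class DicHomeResolver and its error handling are assumptions made for the example, and only the property names and the PAODING_DIC_HOME variable come from the steps above.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical re-creation of the lookup described in step 1; not Paoding's implementation.
final class DicHomeResolver {

    /** Picks the dictionary home from the environment or from the properties, as selected by config-fisrt. */
    static String resolve(Properties props, Map<String, String> env) {
        String first = props.getProperty("paoding.dic.home.config-fisrt");
        String home;
        if ("system-env".equals(first)) {
            home = env.get("PAODING_DIC_HOME");           // taken from the environment variable
        } else if ("this".equals(first)) {
            home = props.getProperty("paoding.dic.home"); // taken from the properties file itself
        } else {
            throw new IllegalArgumentException("unexpected config-fisrt value: " + first);
        }
        if (home == null || home.trim().isEmpty()) {
            // Assumption: this sketch fails loudly when the selected source is empty.
            throw new IllegalStateException("dictionary home is not set for mode " + first);
        }
        return home.trim();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("paoding.dic.home.config-fisrt", "this");
        props.setProperty("paoding.dic.home", "classpath:dic");
        System.out.println(resolve(props, System.getenv()));
    }
}
```

Running a check like this against your own settings shows quickly which of the two sources is actually in effect.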

Install Lucene Solr with Tomcat on Windows

May 7th, 2013

Installing the example Solr web application is actually very easy. We assume you have already installed the Tomcat servlet container on your computer at TOMCAT_HOME (for example: D:\prog\apache-tomcat-7.0.35). If not, you can download it from http://tomcat.apache.org and install it (at TOMCAT_HOME).

The next step is to install Solr:

  1. download Solr at: http://lucene.apache.org/solr/
  2. extract the Solr package into, for example, solr-4.3.0
  3. copy the sample “Solr Home” directory solr-4.3.0\example\solr\ (note that this means the folder ‘example\solr\’ itself, not just its contents) into Tomcat's working directory: TOMCAT_HOME if you start Tomcat with Tomcat Monitor, or TOMCAT_HOME\bin if you start Tomcat with TOMCAT_HOME\bin\startup.bat. The Solr home directory must end up in the right place. This is very important! Otherwise you will get an exception like this: “HTTP Status 500 – {msg=SolrCore ‘collection1′ is not available due to init failure: Could not load config for solrconfig.xml …… ”
    In short, the Solr home directory should be placed in the Java current working directory. Alternatively, you can “Configure the servlet container such that a JNDI lookup of “java:comp/env/solr/home” by the Solr webapp will point to the Solr home”.
  4. copy the Solr war file (solr-N.N.N.war) from solr-4.3.0\dist\ into the Tomcat webapps directory TOMCAT_HOME/webapps, renaming it to solr.war. Tomcat will deploy it automatically.
  5. after solr.war has been extracted, copy the SLF4j logging jars from solr-4.3.0/example/lib/ext into TOMCAT_HOME/lib or TOMCAT_HOME/webapps/solr/WEB-INF/lib. If this step is skipped, the following exception is thrown: org.apache.catalina.core.StandardContext filterStart
    SEVERE: Exception starting filter SolrRequestFilter
    org.apache.solr.common.SolrException: Could not find necessary SLF4j logging jars. If using Jetty, the SLF4j logging jars need to go in the jetty lib/ext directory. For other containers, the corresponding directory should be used. For more information, see: http://wiki.apache.org/solr/SolrLogging

OK. That’s it! Now point your web browser to http://localhost:8080/solr (change the port if necessary) and you will see the Solr admin page.

If there are exceptions, check your Tomcat logs (such as localhost.YYYY-MM-DD.log) under TOMCAT_HOME\logs\.
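If the log shows the “Could not load config for solrconfig.xml” error from step 3, the usual cause is that the solr folder is not under the directory the JVM was started from. Here is a small sketch for checking this with the JDK only. The class SolrHomeCheck is made up for this example, and the collection1\conf\solrconfig.xml layout is an assumption based on the error message and the 4.3.0 example directory. It does not consider the JNDI option.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic helper; not part of Solr or Tomcat.
final class SolrHomeCheck {

    /** The folder a working-directory based lookup would use: "solr" under the given directory. */
    static Path defaultSolrHome(Path workingDir) {
        return workingDir.resolve("solr");
    }

    /** Assumed layout of the example Solr home: collection1/conf/solrconfig.xml must exist. */
    static boolean looksLikeExampleSolrHome(Path solrHome) {
        Path config = solrHome.resolve("collection1").resolve("conf").resolve("solrconfig.xml");
        return Files.isRegularFile(config);
    }

    public static void main(String[] args) {
        // user.dir is the JVM's current working directory.
        Path cwd = Paths.get(System.getProperty("user.dir"));
        Path home = defaultSolrHome(cwd);
        System.out.println("working directory: " + cwd);
        System.out.println("expected Solr home: " + home);
        System.out.println("solrconfig.xml found: " + looksLikeExampleSolrHome(home));
    }
}
```

Run it from the same directory you start Tomcat from (TOMCAT_HOME or TOMCAT_HOME\bin, depending on how you start it), since the result depends entirely on the working directory.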

PS. Installing Solr on Windows is actually not much different from installing it on any other OS. :)

More details can be found on the Solr Wiki: http://wiki.apache.org/solr/SolrInstall

DbSight: A full-text search platform for databases

April 11th, 2010

Many people use Lucene to index and search the full text of their databases, but there is actually a more convenient way: DbSight. Here are the advantages listed on its official website:

Quick to develop

Instead of weeks or even months to develop a full-text search for your data, if you know how to use DBSight, you can easily create the full-text search literally in minutes.

Feature-rich

Besides Google-like full-text search, you can have:

  • Configurable Ranking by combination of relevance and fields like product price, score, comments count, etc
  • Advanced Facet Search provides results counted for each category, and sub-category
  • Tag cloud for current search results
  • Order results by the field you choose
  • Summarized and highlighted results
  • Spelling check for existing content
  • Pagination of the results
  • Recent searches history
  • Multi-Server mode for Server Clustering
  • RSS feed for latest match

And you can have several database searches available in one central server. They can be different applications, or different databases like Oracle, DB2, MySQL, SQL Server, Postgres, or any JDBC supported databases. SQL-based content retrieval is flexible, versatile, and customizable.

Beside database, you can customize the crawler to search files in your own way, be it files on disk, or XML file via HTTP, or any other data sources.

Easy to Customize

You can choose one of existing templates to start with. You add your own logo, change the layout, and add/remove components — all by web UI. You can render the search results similar to Google or Yahoo style, or render them like product catalog, or results that fit for mobile phones, or directly jump to the most relevant match.

Easy to Integrate

With the result template based on simple Freemarker or Velocity macro language, you can send back the search results in HTML, XML, JSON, CSV (Comma Separated Values), or just a list of document IDs so you can process later.

You can also use JSP to render search results.

API to search via Protocol Buffer is also supported. You can use Python, Java, C++, or other languages to search. Default Java API is provided.

Easy to Manage

With Incremental Indexing, Hot Index Swapping, DBSight is maintenance free with 100% uptime and less than 0.5 seconds performance, and less than 0.2 seconds for small indexes. All of the operations including search index updates, spell check index updates and remote index replication are trouble free. Configurable log system can notify administrators about recent indexing processes.

DBSight has search statistics for performance analysis, including diagrams for search frequency, most popular searches, zero result search. And DBSight has index analyzing tools to examine the content, most frequent terms, etc.

Easy to Scale

DBSight is fairly efficient, and DBSight search clustering can provide more search throughput and search failover in case of unexpected machine errors.

Integrated Web UI

Configuration changes are all done via Web UI and a few clicks away. The operations include, but are not limited to:

  • Adjust ranking algorithms
  • Adjust SQLs to extract database content
  • Adjust SQL caching or batching mode.

For advanced users:

  • Change text Analyzer, visually compare different Analyzers
  • Change Similarity
  • Change searchable, filterable,
  • Change indexing threads, memory size.

Easy to package

DBSight configuration can be easily downloaded and uploaded. Search can be applied to different system instances.

The powerful scaffolding can be used to quickly create all kinds of search result templates. So you can easily create a search on any legacy contents, and maybe reuse the configuration or resell it!

DBSight is also OS independent: the same configuration can run well on Windows and Linux.

Unique Features

Many features are pioneered by DBSight, and not available anywhere else:

  • Advanced Facet Search for multi-valued facets, with much more efficient storage and fast search.
  • Facet Search caching for even better performance
  • Count, Sum, Average, Minimum, Maximum, and grouping for numeric facet searches.
  • Hierarchical Date facet search, date range search.
  • Time based ranking
  • Flexible Date Range Search

Use Lucene to index a database

April 11th, 2010

According to the Lucene FAQ:

How can I use Lucene to index a database?

Connect to the database using JDBC and use an SQL “SELECT” statement to query the database. Then create one Lucene Document object per row and add it to the index. You will probably want to store the ID column so you can later access the matching items. For other (text) columns it might make more sense to only index (not store) them, as the original data is still available in your database.

For a more high level approach you might want to have a look at LuSql (a specialized tool for moving data from JDBC-accessible databases into Lucene), Hibernate Search, Compass, DBSight, or Solr’s Data Import Handler which all use Lucene internally.

An example: Apache Lucene – Indexing a Database and Searching the Content
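
To make the FAQ's advice about storing the ID but only indexing the text columns more concrete, here is a JDK-only stand-in. It deliberately uses neither Lucene nor JDBC, so it runs without extra jars. The class RowIndexSketch and the sample rows are invented for this illustration. In a real setup the rows would come from a JDBC ResultSet and each one would become a Lucene Document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy inverted index standing in for Lucene; for illustration only.
final class RowIndexSketch {

    // term -> IDs of the rows whose text contains it; the text itself is not kept.
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Adds one "document" per database row: the ID is stored, the text is only tokenized. */
    void addRow(String id, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(id);
            }
        }
    }

    /** Returns the stored IDs of the rows containing the term, in insertion order. */
    List<String> search(String term) {
        Set<String> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        RowIndexSketch index = new RowIndexSketch();
        // Hypothetical rows; with JDBC these would come from iterating a ResultSet.
        index.addRow("1", "Lucene in Action");
        index.addRow("2", "Solr uses Lucene internally");
        // The returned IDs can be used to fetch the full rows from the database.
        System.out.println(index.search("lucene"));
    }
}
```

The point of the design is the one the FAQ makes: the index only has to return IDs, because the original text is still available in the database.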