Introduction to Lucene

2011-04-28 19:06:41 Article author: esx

An introduction to Lucene: what Apache Lucene is, the Apache Software Foundation, and project news.

What is Apache Lucene

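Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform applications.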

Apache Lucene is an open source project available for free download.

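The Apache Lucene™ project includes:
Apache Lucene Core™ (formerly named Lucene Java), our flagship sub-project, provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

Apache Solr™ is our high-performance enterprise search server, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, distributed search, database integration, and web admin and search interfaces.

Apache PyLucene™ is a Python port of the Lucene Core project.

Apache Open Relevance Project™ is a subproject that aims to collect and distribute free materials for relevance testing and performance evaluation.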
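
To make the "indexing and search" role of Lucene Core more concrete, here is a minimal sketch of an inverted index, the general data structure behind full-text search. It is a stdlib-only Python illustration, not Lucene's implementation or API; the names `TinyIndex`, `tokenize`, `add` and `search` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """A toy in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = []

    def add(self, text):
        """Index one document and return its id."""
        doc_id = len(self.docs)
        self.docs.append(text)
        for term in tokenize(text):
            self.postings[term].add(doc_id)
        return doc_id

    def search(self, query):
        """Return sorted ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


index = TinyIndex()
index.add("Lucene is a text search engine library")
index.add("Solr is a search server built on Lucene")
index.add("PyLucene is a Python port")
print(index.search("lucene search"))
```

A real engine adds analysis (stemming, stop words), relevance scoring and on-disk segment files on top of this basic term-to-documents mapping.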

The Apache Software Foundation

The Apache Software Foundation provides support for the Apache community of open-source software projects. The Apache projects are defined by collaborative consensus based processes, an open, pragmatic software license and a desire to create high quality software that leads the way in its field. Apache Lucene, Apache Solr, Apache PyLucene, Apache Open Relevance Project and their respective logos are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.

News


28 July 2011 - WARNING: Index corruption and crashes in Apache Lucene Core / Apache Solr with Java 7

Oracle released Java 7 today. Unfortunately it contains hotspot compiler optimizations, which miscompile some loops. This can affect code of several Apache projects. Sometimes JVMs only crash, but in several cases, results calculated can be incorrect, leading to bugs in applications (see Hotspot bugs 7070134, 7044738, 7068051).

Apache Lucene Core and Apache Solr are two Apache projects affected by these bugs; this applies to all versions released to date. Solr users with the default configuration will have Java crashing with SIGSEGV as soon as they start to index documents, as one affected part is the well-known Porter stemmer (see LUCENE-3335). Other loops in Lucene may be miscompiled, too, leading to index corruption (especially on Lucene trunk with the pulsing codec; see LUCENE-3346).

These problems were detected only 5 days before the official Java 7 release, so Oracle had no time to fix those bugs, affecting also many more applications. In response to our questions, they proposed to include the fixes into service release u2 (eventually into service release u1, see this mail). This means you cannot use Apache Lucene/Solr with Java 7 releases before Update 2! If you do, please don't open bug reports, it is not the committers' fault! At least disable loop optimizations using the -XX:-UseLoopPredicate JVM option to not risk index corruptions.

Please note: Java 6 users are also affected if they use one of the following JVM options, which are not enabled by default: -XX:+OptimizeStringConcat or -XX:+AggressiveOpts.

It is strongly recommended not to use any hotspot optimization switches in any Java version without extensive testing!
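
The rules in this warning can be turned into a small pre-flight check. The sketch below is a Python illustration under stated assumptions: it assumes the legacy `1.<major>.0_<update>` version string format, and the function names `parse_java_version` and `check_jvm` are hypothetical. Only the three JVM switches come from the announcement itself.

```python
import re

# JVM switches named in the warning above.
WORKAROUND_FLAG = "-XX:-UseLoopPredicate"
RISKY_JAVA6_FLAGS = {"-XX:+OptimizeStringConcat", "-XX:+AggressiveOpts"}


def parse_java_version(version):
    """Parse strings such as '1.7.0_01' into (major, update).

    Assumes the legacy '1.<major>.<minor>[_<update>]' naming scheme.
    """
    match = re.match(r"^1\.(\d+)\.\d+(?:_(\d+))?", version)
    if match is None:
        raise ValueError("unrecognized Java version: %r" % version)
    return int(match.group(1)), int(match.group(2) or 0)


def check_jvm(version, jvm_args):
    """Return a list of warnings for this Java version and argument list."""
    major, update = parse_java_version(version)
    warnings = []
    if major == 7 and update < 2:
        if WORKAROUND_FLAG in jvm_args:
            warnings.append("Java 7 before Update 2: unsupported, workaround flag present")
        else:
            warnings.append("Java 7 before Update 2: unsupported, add " + WORKAROUND_FLAG)
    if major == 6:
        for flag in sorted(RISKY_JAVA6_FLAGS.intersection(jvm_args)):
            warnings.append("Java 6 with risky flag: " + flag)
    return warnings


print(check_jvm("1.7.0", []))
print(check_jvm("1.6.0_26", ["-XX:+AggressiveOpts"]))
```

Note that the workaround flag only reduces the risk of index corruption; per the announcement, Java 7 releases before Update 2 remain unsupported either way.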

In case you upgrade to Java 7, remember that you may have to reindex, as the unicode version shipped with Java 7 changed and tokenization behaves differently (e.g. lowercasing). For more information, read JRE_VERSION_MIGRATION.txt in your distribution package!

1 July 2011 - Lucene Core 3.3 and Solr 3.3 Available

The Lucene PMC is pleased to announce the availability of Apache Lucene 3.3 and Apache Solr 3.3.

Lucene can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/ and Solr can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/

Highlights of the Lucene release include:
The spellchecker module now includes suggest/auto-complete functionality, with three implementations: Jaspell, Ternary Trie, and Finite State.
Support for merging results from multiple shards, for both "normal" search results (TopDocs.merge) as well as grouped results using the grouping module (SearchGroup.merge, TopGroups.merge).
An optimized implementation of KStem, a less aggressive stemmer for English
Single-pass grouping implementation based on block document indexing.
Improvements to MMapDirectory (now also the default implementation returned by FSDirectory.open on 64-bit Linux).
NRTManager simplifies handling near-real-time search with multiple search threads, allowing the application to control which indexing changes must be visible to which search requests.
TwoPhaseCommitTool facilitates performing a multi-resource two-phased commit, including IndexWriter.
The default merge policy, TieredMergePolicy, has a new method (set/getReclaimDeletesWeight) to control how aggressively it targets segments with deletions, and is now more aggressive than before by default.
PKIndexSplitter tool splits an index by a mid-point term.
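
The idea behind merging results from multiple shards can be sketched in a few lines. This is a Python illustration of a k-way merge of per-shard hit lists, not the actual TopDocs.merge code; the function name `merge_top_hits` and the `(score, doc_id)` tuple layout are assumptions made for the example.

```python
import heapq
from itertools import islice


def merge_top_hits(shard_hits, top_n):
    """Merge per-shard hit lists into one global top-N list.

    Each shard list holds (score, doc_id) pairs already sorted by
    descending score. Ties across shards are broken by shard index.
    Returns (shard_index, doc_id, score) tuples.
    """
    tagged = [
        # Negate the score so ascending merge order means descending score;
        # the position keeps each shard's own order stable on ties.
        [(-score, shard, pos, doc_id) for pos, (score, doc_id) in enumerate(hits)]
        for shard, hits in enumerate(shard_hits)
    ]
    merged = heapq.merge(*tagged)
    return [(shard, doc_id, -neg) for neg, shard, _, doc_id in islice(merged, top_n)]


shard_a = [(0.9, 3), (0.4, 7)]
shard_b = [(0.8, 1), (0.4, 2), (0.1, 5)]
print(merge_top_hits([shard_a, shard_b], 3))
```

Because each shard's list is already sorted, the merge only needs to look at the head of each list, so the cost grows with the number of hits requested and not with the total number of hits across shards.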

Highlights of the Solr release include:
Grouping / Field Collapsing
A new, automaton-based suggest/autocomplete implementation offering an order of magnitude smaller RAM consumption.
KStemFilterFactory, an optimized implementation of a less aggressive stemmer for English.
Solr defaults to a new, more efficient merge policy (TieredMergePolicy). See http://s.apache.org/merging for more information.
Important bugfixes, including extremely high RAM usage in spellchecking.
Bugfixes and improvements from Apache Lucene 3.3
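
As a rough picture of what grouping / field collapsing does to a result list, the following Python sketch groups scored hits by one field and keeps only the best hits per group. It is an illustration only, not Solr's implementation; `collapse_by_field`, `group_limit` and the sample documents are hypothetical.

```python
def collapse_by_field(hits, field, group_limit=1):
    """Group scored hits by a field value, keeping the best hits per group.

    `hits` is a list of (score, doc) pairs where each doc is a dict.
    Groups come back ordered by their best-scoring hit.
    """
    groups = {}
    for score, doc in sorted(hits, key=lambda hit: -hit[0]):
        members = groups.setdefault(doc[field], [])
        if len(members) < group_limit:
            members.append((score, doc))
    return list(groups.items())


hits = [
    (1.2, {"id": 1, "brand": "acme"}),
    (2.0, {"id": 2, "brand": "zen"}),
    (1.9, {"id": 3, "brand": "acme"}),
]
for brand, members in collapse_by_field(hits, "brand"):
    print(brand, [doc["id"] for _, doc in members])
```

With `group_limit=1` each brand appears once, represented by its highest-scoring document, which is the typical "one result per site or product" use of field collapsing.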

4 June 2011 - Lucene Core 3.2 and Solr 3.2 Available

The Lucene PMC is pleased to announce the availability of Apache Lucene 3.2 and Apache Solr 3.2.

Lucene can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/ and Solr can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/

Highlights of the Lucene release include:
A new grouping module, under lucene/contrib/grouping, enables search results to be grouped by a single-valued indexed field
A new IndexUpgrader tool fully converts an old index to the current format.
A new Directory implementation, NRTCachingDirectory, caches small segments in RAM, to reduce the I/O load for applications with fast NRT reopen rates.
A new Collector implementation, CachingCollector, is able to gather search hits (document IDs and optionally also scores) and then replay them. This is useful for Collectors that require two or more passes to produce results.
Index a document block using IndexWriter's new addDocuments or updateDocuments methods. These experimental APIs ensure that the block of documents will forever remain contiguous in the index, enabling interesting future features like grouping and joins.
A new default merge policy, TieredMergePolicy, which is more efficient due to being able to merge non-contiguous segments. See http://s.apache.org/merging for details.
NumericField is now returned correctly when you load a stored document (previously you received a normal Field back, with the numeric value converted to a string).
Deleted terms are now applied during flushing to the newly flushed segment, which is more efficient than having to later initialize a reader for that segment.
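
The "gather hits, then replay them" idea behind CachingCollector can be sketched as follows. This is a simplified Python illustration, not Lucene's class: the `collect(doc_id, score)` interface and the helper collectors `MaxScore` and `Normalizer` are assumptions made for the example.

```python
class CachingCollector:
    """Records hits as they are collected so they can be replayed later."""

    def __init__(self, cache_scores=True):
        self.cache_scores = cache_scores
        self.cached = []

    def collect(self, doc_id, score):
        # Scores are optional, mirroring "document IDs and optionally scores".
        self.cached.append((doc_id, score if self.cache_scores else None))

    def replay(self, other):
        """Feed every cached hit, in original order, to another collector."""
        for doc_id, score in self.cached:
            other.collect(doc_id, score)


class MaxScore:
    """First pass: find the highest score."""

    def __init__(self):
        self.value = 0.0

    def collect(self, doc_id, score):
        self.value = max(self.value, score)


class Normalizer:
    """Second pass: scale every score by the maximum found earlier."""

    def __init__(self, max_score):
        self.max_score = max_score
        self.hits = []

    def collect(self, doc_id, score):
        self.hits.append((doc_id, score / self.max_score))


cache = CachingCollector()
for doc_id, score in [(4, 2.0), (9, 0.5), (12, 1.0)]:
    cache.collect(doc_id, score)  # in a real search the query would drive this

first = MaxScore()
cache.replay(first)
second = Normalizer(first.value)
cache.replay(second)
print(second.hits)
```

The search itself runs only once; both passes work from the cached hits, which is the saving such a collector offers to multi-pass consumers.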

Highlights of the Solr release include:
Ability to specify overwrite and commitWithin as request parameters when using the JSON update format.
TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.
DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString.
Improvements to the UIMA and Carrot2 integrations.
Highlighting performance improvements.
A test-framework jar for easy testing of Solr extensions.
Bugfixes and improvements from Apache Lucene 3.2.

31 March 2011 - Lucene Core 3.1 and Solr 3.1 Available

The Lucene PMC is pleased to announce the availability of Apache Lucene 3.1 and Apache Solr 3.1. The version number for Solr 3.1 was chosen to reflect the merge of development with Lucene, which is currently also on 3.1. Going forward, we expect the Solr version to be the same as the Lucene version. Solr 3.1 contains Lucene 3.1 and is the release after Solr 1.4.1.

Lucene can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/java/ and Solr can be downloaded from http://www.apache.org/dyn/closer.cgi/lucene/solr/

Highlights of the Lucene release include:
Numerous performance improvements: faster exact PhraseQuery; merging favors segments with deletions; primary key lookup is faster; IndexWriter.addIndexes(Directory[]) uses file copy instead of merging; various Directory performance improvements; compound file is dynamically turned off for large segments; fully deleted segments are dropped on commit; faster snowball analyzers (in contrib); ConcurrentMergeScheduler is more careful about setting priority of merge threads.
ReusableAnalyzerBase makes it easier to reuse TokenStreams correctly.
Improved Analysis capabilities: Improved Unicode support, including Unicode 4, more friendly term handling (CharTermAttribute), easier object reuse and better support for protected words in lossy token filters (e.g. stemmers).
ConstantScoreQuery now allows directly wrapping a Query.
IndexWriter is now configured with a new separate builder API, IndexWriterConfig. You can now control IndexWriter's previously fixed internal thread limit by calling setMaxThreadStates.
IndexWriter.getReader is replaced by IndexReader.open(IndexWriter). In addition you can now specify whether deletes should be resolved when you open an NRT reader.
MultiSearcher is deprecated; ParallelMultiSearcher has been absorbed directly into IndexSearcher.
On 64bit Windows and Solaris JVMs, MMapDirectory is now the default implementation (returned by FSDirectory.open). MMapDirectory also enables unmapping if the JVM supports it.
New TotalHitCountCollector just counts total number of hits.
ReaderFinishedListener API enables external caches to evict entries once a segment is finished.
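
To illustrate why protected words matter for lossy token filters such as stemmers, here is a small Python sketch. The suffix-stripping rule is deliberately crude and is not the Porter or Snowball algorithm; `naive_stem`, `stem_tokens` and the `protected` parameter are hypothetical names.

```python
def naive_stem(token):
    """Strip a few common English suffixes; deliberately crude."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def stem_tokens(tokens, protected=frozenset()):
    """Stem every token except those listed as protected."""
    return [tok if tok in protected else naive_stem(tok) for tok in tokens]


tokens = ["indexing", "lucene", "searches", "news"]
print(stem_tokens(tokens))
print(stem_tokens(tokens, protected={"news"}))
```

Without protection, "news" is reduced to "new" and would match unrelated documents; marking it as protected keeps the original term in the index.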

Highlights of the Solr release include:
Numeric range facets (similar to date faceting).
New spatial search, including spatial filtering, boosting and sorting capabilities.
Example Velocity driven search UI at http://localhost:8983/solr/browse
A new termvector-based highlighter
Extended dismax (edismax) query parser, which addresses some missing features in the dismax query parser and adds some extensions.
Several more components now support distributed mode: TermsComponent, SpellCheckComponent.
A new Auto Suggest component.
Ability to sort by functions.
JSON document indexing.
CSV response format.
Apache UIMA integration for metadata extraction.
Leverages Lucene 3.1 and its inherent optimizations and bug fixes as well as new analysis capabilities.
Numerous improvements, bug fixes, and optimizations.
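
Numeric range faceting boils down to counting values in fixed-width buckets. The Python sketch below illustrates that computation for integer values; it is not Solr code, and the choice of an inclusive lower bound and exclusive upper bound per bucket is an assumption of this example. The parameter names `start`, `end` and `gap` are chosen for the illustration.

```python
def range_facet_counts(values, start, end, gap):
    """Count integer values per bucket [lower, lower + gap) from start to end.

    Values outside [start, end) are ignored.
    """
    if gap <= 0:
        raise ValueError("gap must be positive")
    counts = {lower: 0 for lower in range(start, end, gap)}
    for value in values:
        if start <= value < end:
            # Map the value to the lower bound of its bucket.
            counts[start + ((value - start) // gap) * gap] += 1
    return counts


prices = [5, 12, 19, 20, 35, 99]
print(range_facet_counts(prices, start=0, end=40, gap=10))
```

Each key is the lower bound of a bucket, so the result reads as "one item from 0 to 9, two from 10 to 19" and so on, which is the shape of data a faceted navigation UI needs.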

3 December 2010 - Lucene Java 3.0.3 and 2.9.4 available

Both releases fix bugs in the previous versions:
2.9.4 is a bugfix release for the Lucene Java 2.x series, based on Java 1.4.
3.0.3 has the same bug fix level but is for the Lucene Java 3.x series, based on Java 5.

New users of Lucene are advised to use version 3.0.3 for new developments, because it has a clean, type-safe API.

This release contains numerous bug fixes and improvements since 2.9.3 / 3.0.2, including:
a memory leak in IndexWriter exacerbated by frequent commits
a file handle leak in IndexWriter when near-real-time readers are opened with compound file format enabled
a rare index corruption case on disk full
NumericRangeQuery / NumericRangeFilter sometimes returning incorrect results with bounds near Long.MIN_VALUE and Long.MAX_VALUE
various thread safety issues
Lucene 2.9.4 can now also read indexes created by 3.0.x

Both releases are fully compatible with the corresponding previous versions. We strongly recommend upgrading to 2.9.4 if you are using 2.9.x; and to 3.0.3 if you are using 3.0.x.

See 3.0.3 CHANGES and 2.9.4 CHANGES for details. Binary and source distributions are available here. Maven artifacts are available here.

25 June 2010 - Solr 1.4.1 Released

Solr 1.4.1 has been released and is now available for public download! Solr 1.4.1 is a bug fix release for Solr 1.4 that includes many Solr bug fixes as well as Lucene bug fixes from Lucene 2.9.3.

See the release notes for more details.

18 June 2010 - Lucene Java 3.0.2 and 2.9.3 available

Both releases fix bugs in the previous versions:
2.9.3 is a bugfix release for the Lucene Java 2.x series, based on Java 1.4.
3.0.2 has the same bug fix level but is for the Lucene Java 3.x series, based on Java 5.

New users of Lucene are advised to use version 3.0.2 for new developments, because it has a clean, type-safe API.

Important improvements in these releases include:
Fixed memory leaks in IndexWriter when large documents are indexed. It now also uses shared memory pools for term vectors and stored fields. IndexWriter now releases Fieldables and Readers on close.
NativeFSLockFactory fixes and improvements. The write lock is now released if an exception occurs in the IndexWriter constructors.
Improve concurrency of IndexReader, especially in the context of near real-time readers.
Near real-time readers, opened while addIndexes* is running, no longer miss some segments.
Performance improvements in ParallelMultiSearcher (3.0.2 only).
IndexSearcher no longer throws NegativeArraySizeException if you pass Integer.MAX_VALUE as nDocs to search methods.

Both releases are fully compatible with the corresponding previous versions. We strongly recommend upgrading to 2.9.3 if you are using 2.9.x; and to 3.0.2 if you are using 3.0.x.

See 3.0.2 CHANGES and 2.9.3 CHANGES for details. Binary and source distributions are available here. Maven artifacts are available here.