Retrieving extracted text with Apache Solr

I'm new to Apache Solr and I want to use it to index PDF files. I've managed to get it up and running so far, and I can now search the PDF files I've added.

However, I also need to be able to retrieve the matched text from the search results.

I found this XML snippet in the default solrconfig.xml:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

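From what I gather from this article (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika), I think I have to add a new field to schema.xml (for example "content") with stored="true" and indexed="true". However, I'm not sure how exactly to accomplish this.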

Any help is appreciated, thx

Add a schema.xml like the following:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="whatever" version="1.2">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.MappingCharFilterFactory" mapping="../../mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.MappingCharFilterFactory" mapping="../../mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="internal_id" type="string" indexed="true" stored="true"/>
    <field name="cat" type="int" indexed="true" stored="true"/>
    <field name="desc" type="text" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>internal_id</uniqueKey>
  <defaultSearchField>desc</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
  <similarity class="org.apache.lucene.search.DefaultSimilarity"/>
</schema>

If a field is declared with stored="true", it is returned in the search results by default.
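Applied to the PDF case from the question, this could look roughly like the sketch below. It is untested and makes two assumptions: the field name "content" is only the name suggested in the question, and the "text" field type is the one defined in the schema above. The idea is to declare a stored field and point the extraction handler's fmap.content mapping at it instead of at "text".

```xml
<!-- schema.xml, inside <fields>: a stored field for the extracted body text.
     The name "content" is an assumption taken from the question. -->
<field name="content" type="text" indexed="true" stored="true"/>

<!-- solrconfig.xml, inside the /update/extract handler's "defaults" list:
     send Tika's extracted content to the stored field rather than "text". -->
<str name="fmap.content">content</str>
```

After changing the schema, the PDFs would need to be re-indexed, since documents indexed earlier do not gain stored text retroactively.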