I spent a lot of time this weekend indexing documents into Solr, and I was really unhappy with the community support on the Apache mailing list. I believe Apache has some really nice software packages and frameworks that are very useful for developers and enterprises, but whether a product gets preferred in the developer world ultimately boils down to community support.
Solr is a search platform from the Apache Lucene project. It is used for full-text search and indexing. Anyone familiar with search engines will know roughly how an engine like Google produces its results.
Initially fed with a seed file containing URLs, the search engine parses these files (web pages) and pushes every URL it finds onto a stack. After reading each file, it pops another URL from the stack, parses that page for more URLs, and so on. During this crawling phase, every URL linked from a page can be crawled. Apache Nutch is software that is useful for this purpose. Spiders, or bots, are what search engines use for crawling the web.
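To make the crawl loop concrete, here is a minimal sketch in Java. It is not how Nutch works internally; it only mirrors the stack-based description above. The class name CrawlSketch, the page names and the in-memory links map (standing in for actually fetching and parsing a page) are all made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlSketch {
    // Visits every page reachable from the seeds. The 'links' map is a
    // hypothetical stand-in for fetching a page and parsing out its URLs.
    static List<String> crawl(List<String> seeds, Map<String, List<String>> links) {
        Deque<String> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();      // URLs already pushed once
        List<String> visited = new ArrayList<>();
        for (String seed : seeds) {
            if (seen.add(seed)) stack.push(seed);
        }
        while (!stack.isEmpty()) {
            String url = stack.pop();
            visited.add(url);
            // Push every outgoing link that has not been seen before
            for (String out : links.getOrDefault(url, List.of())) {
                if (seen.add(out)) stack.push(out);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "a.html", List.of("b.html", "c.html"),
            "b.html", List.of("a.html", "d.html"));
        System.out.println(crawl(List.of("a.html"), links));
    }
}
```

The seen set is what keeps the crawler from looping forever when two pages link to each other.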
The next step for a search engine is indexing those documents, which is the functionality Solr provides. After indexing the web pages, which are stored in an inverted index data structure, the search engine's next step is displaying query results. These results are ordered on the basis of the search engine's proprietary algorithms, such as PageRank, or some biasing of the results (for which Google has recently been criticized a lot).
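For anyone who has not met an inverted index before, the idea is a map from each term to the documents that contain it. The sketch below is a toy version, assuming whitespace and punctuation tokenizing and integer document ids; Lucene's real index stores far more (positions, frequencies, norms). The class name InvertedIndexSketch and the sample sentences are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // Maps each lower-cased term to the sorted set of document ids containing it.
    static Map<String, Set<Integer>> build(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            // Split on runs of non-word characters; skip empty tokens
            for (String term : docs.get(id).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (term.isEmpty()) continue;
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = build(List.of(
            "Solr indexes documents",
            "Nutch crawls documents"));
        System.out.println(index.get("documents")); // prints [0, 1]
        System.out.println(index.get("solr"));      // prints [0]
    }
}
```

Answering a query then becomes a lookup of the term followed by ranking the matching ids, instead of scanning every document.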
So indexing is a very important step in web search engines, and documents are often indexed using tags. To index in Solr using custom fields, one may need to follow these steps:
1. Navigate to the conf folder of the Solr installation; in my case it was xampp\solr\conf\ . Here, in the schema.xml file, add the field to be indexed along with its properties. The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index or when querying those fields.
The <fields> section inside the schema.xml file is where you list the custom <field> declarations you wish to use in your documents, along with the various options that apply to each field.
Common field options that can be customized are:
1. default
The default value for this field if none is provided while adding documents
2. indexed=true|false
True if this field should be “indexed”. If a field is indexed, then it is searchable, sortable, and facetable.
3. stored=true|false
True if the value of the field should be retrievable during a search
4. compressed=true|false
True if this field should be stored using gzip compression.
5. compressThreshold=
6. multiValued=true|false
True if this field may contain multiple values per document, i.e. if it can appear multiple times in a document
7. omitNorms=true|false
Set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms.
8. termVectors=false|true
If set, include full term vector info.
If enabled, often also used with termPositions="true" and termOffsets="true".
9. omitTermFreqAndPositions=true|false
If set, omit term freq, positions and payloads from postings for this field. This can be a performance boost for fields that don’t require that information and reduces storage space required for the index. Queries that rely on position that are issued on a field with this option will silently fail to find documents.
10. omitPositions=true|false
If set, omits positions, but keeps term frequencies
I wanted to include two fields in the documents I am reading, so that they are indexed by Author Name and Maximum Frequency Term. To add these fields, I simply added the following to schema.xml under the <fields> section:
<field name="AuthorName" type="string" indexed="true" stored="true" required="true" omitNorms="false" default="defAuthor"/>
<field name="MaximumFrequencyTerm" type="string" indexed="true" stored="true" required="true" omitNorms="false" default="testLong"/>
Once done editing schema.xml, the code needed to index files using the SolrJ API for Solr is:
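import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.SolrInputDocument;
SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr/");
// Request for the extracting handler (not used further in this snippet)
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
SolrInputDocument doc = new SolrInputDocument();
// The third argument is an index-time boost for the field
doc.addField("MaximumFrequencyTerm", MaximumFrequencyTerm, 35);
doc.addField("AuthorName", AuthorName, 35);
// Collect the documents and send them to Solr in one batch
Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add(doc);
server.add(docs);
server.commit();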
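The snippet above assumes the MaximumFrequencyTerm variable already holds a value. One possible way to compute it from a document's text is sketched below. This is only an illustration: the class and method names are made up, the tokenizing is naive, and a real version would probably skip stop words such as "the" so they do not always win.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MaxFrequencyTermSketch {
    // Returns the most frequent term in the text, or null if it has no terms.
    // Ties go to the term that reaches the highest count first.
    static String maxFrequencyTerm(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) continue;
            // merge() increments the running count and returns the new value
            int count = counts.merge(term, 1, Integer::sum);
            if (count > bestCount) {
                best = term;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(maxFrequencyTerm("Solr indexes what Solr stores")); // prints solr
    }
}
```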
Make sure to add any unique key fields to the document before you commit. Once you run the code, all files are indexed with the custom fields and can be seen from the Solr admin page at http://localhost:8080/solr/admin using the search query "*:*".
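The same match-all query can also be sent without the admin page, since it is just an HTTP request. The sketch below only builds the URL. It assumes the stock /select request handler is enabled under the base URL used in this post; the class and method names are hypothetical. Note that the colon in the query has to be URL-encoded.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SelectUrlSketch {
    // Builds a query URL against the (assumed) /select handler of a Solr instance.
    static String selectUrl(String baseUrl, String query) {
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        return base + "select?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints http://localhost:8080/solr/select?q=*%3A*
        System.out.println(selectUrl("http://localhost:8080/solr/", "*:*"));
    }
}
```

Opening the printed URL in a browser should return the indexed documents, with the stored custom fields included in each result.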