Indexing Drupal Content with Apache Solr
In this post we will cover configuring Apache Solr to index the content of a Drupal site and retrieve items from that index. This post assumes that you have Apache Solr installed and running and can access the administration interface at http://<tomcat_host>:8080/solr/admin. If not, see my previous article Installing Apache Solr on Tomcat to get started.
Configuring the DataImportHandler
To index data with Apache Solr, we need to configure its DataImportHandler to query our Drupal database and extract the relevant information for indexing. The first step is to download and install the appropriate Java connector (JDBC driver) for our database, which in this case is MySQL. Other database types require different drivers; check the DataImportHandler documentation to find out more about your specific database.
Extract the download and copy the jar file into your Solr installation:
$ tar -xzf mysql-connector-java-5.1.18.tar.gz
$ mkdir -p $CATALINA_HOME/solr/lib/mysql-connector-java/lib
$ cp mysql-connector-java-5.1.18/mysql-connector-java-5.1.18-bin.jar $CATALINA_HOME/solr/lib/mysql-connector-java/lib
Next we will need to inform Solr of the new plugin by adding a line to the solrconfig.xml file that we modified in the first part of this series:
<lib dir="/path/to/tomcat/solr/lib/mysql-connector-java/lib" />
Note that /path/to/tomcat above is the same path as $CATALINA_HOME.
By default some versions of Apache Solr come without a preconfigured
URL by which we can access the indexer. Since this is the primary way
that we will index data from Drupal we are going to make sure the URL is
properly configured. This only requires making sure the following block
of XML exists in the solrconfig.xml:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
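Taken together, the two solrconfig.xml changes described above might sit in the file roughly as in the following sketch. It assumes the usual <config> root element and elides every unrelated setting; /path/to/tomcat is the same placeholder as before. The commented-out line is an assumption: some Solr releases ship the DataImportHandler as a separate jar, and the /path/to/solr/dist location shown for it is hypothetical.

```xml
<!-- solrconfig.xml (sketch): only the DataImportHandler-related parts are shown. -->
<config>
  <!-- ... existing settings ... -->

  <!-- Load the MySQL JDBC driver copied into place earlier. -->
  <lib dir="/path/to/tomcat/solr/lib/mysql-connector-java/lib" />

  <!-- Assumption: if your Solr release ships the DataImportHandler as a
       separate jar, it must be loadable too. The path below is hypothetical:
       <lib dir="/path/to/solr/dist" regex="apache-solr-dataimporthandler-.*\.jar" />
  -->

  <!-- Expose the importer at /dataimport and point it at data-config.xml. -->
  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>

  <!-- ... existing settings ... -->
</config>
```

Solr resolves the data-config.xml name relative to its configuration directory, so the file can live next to solrconfig.xml.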
Defining Drupal Content for Solr
First we need to create a file called data-config.xml in the Solr configuration directory ($CATALINA_HOME/solr/conf). This file contains three pieces of information: the database connection details, an SQL query, and a mapping of the returned columns to named Solr fields. Together these define how Solr connects to the database and collects the content to index. Below is an example data-config.xml that indexes all published Drupal nodes and assumes a MySQL database:
<dataConfig>
  <!-- Connection details for the Drupal database -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://<database_host>/<drupal_database>"
              user="<database_user>"
              password="<database_password>" />
  <document name="content">
    <!-- One Solr document per published node (status = 1) -->
    <entity name="node"
            query="
              SELECT n.nid, n.title, nr.body
              FROM node n
              LEFT JOIN node_revisions nr ON n.vid = nr.vid
              WHERE n.status = 1
            ">
      <!-- Map result columns to Solr field names -->
      <field column="nid" name="id" />
      <field column="title" name="title" />
      <field column="body" name="body" />
    </entity>
  </document>
</dataConfig>
Finally, we need to map the fields named in data-config.xml to data types that Solr understands. This is done in the <fields> section of the schema.xml file in Solr's configuration directory. Solr supports many different data types and lets you define your own. For the purpose of this article we will use the string data type, which indexes each field value verbatim as a single term, so searches match on the exact value. For word-level matching and more advanced configurations, see the comments and data type definitions in schema.xml and Solr's Tokenizer documentation. Here is what our example fields look like:
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="title" type="string" indexed="true" stored="true" />
  <field name="body" type="string" indexed="true" stored="false" />
</fields>
The only requirement is the 'id' field, which will become the unique id of the content in Solr's index. Fields that should be matched during a search need the "indexed" attribute set to true, and fields that we want returned in the search results need "stored" set to true.
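Two related pieces of schema.xml are worth checking alongside these fields. The sketch below is a variant of the example, not a replacement for it. It assumes your schema.xml defines a tokenized field type named text_general (older example schemas call it text; check the <types> section of your own file) and uses it for body so that individual words in the content can be matched. It also shows the <uniqueKey> element, which sits outside <fields> and tells Solr which field identifies a document. The example schema shipped with Solr often contains this line already.

```xml
<!-- schema.xml (sketch): a variant of the example fields. -->
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="title" type="string" indexed="true" stored="true" />
  <!-- Assumption: "text_general" is a tokenized type defined in <types>. -->
  <field name="body" type="text_general" indexed="true" stored="false" />
</fields>

<!-- Declares "id" as the unique document key: re-importing a node with the
     same nid replaces the earlier copy instead of adding a duplicate. -->
<uniqueKey>id</uniqueKey>
```

Keeping id as a string is deliberate in this sketch, since a key field is best left untokenized so it is matched as a single exact value.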
We are now ready to index our Drupal content. To do so, simply navigate to http://<tomcat_host>:8080/solr/dataimport?command=full-import. You should see some XML describing the status of the indexer. If an error is returned, see $CATALINA_HOME/logs/catalina.out for more information about the problem. If there are no problems, we can run some test searches using Solr's admin interface. Navigate to http://<tomcat_host>:8080/solr/admin and use the search box to enter a query. More information on Solr queries is available in the SolrQueryIndex documentation.