Getting Going with Solr in Alfresco 4

Martin Bergljung

So you have been using Alfresco for a while, and now you are about to start a project where the client wants to go with Alfresco 4.0 and use Solr. Then it’s time to dig into the Solr search engine and look at how it is integrated with Alfresco and how it is configured. Using Solr with Alfresco is actually straightforward: just install Alfresco and by default it will use Solr for searching. Everything is done for you and Solr is quite transparent in this case. But this is just the way you would use it for testing and during PoC projects.

In a production environment you would want Solr to run on its own server, in its own Apache Tomcat application server. It is also useful to know about the different configuration files that Solr uses, so that you can update the default schema with extra field types and fields, add synonym lists, etc. This article will cover these things to a level where you can install Solr on a different server, configure it, and understand how it is integrated with Alfresco.

For more information about the build project that I use and its directory structure see the following blog entry.

What is Solr?

Apache Solr is an open source enterprise search server that has been around long enough to be mature and to power search on sites such as CNET and Netflix. It uses Apache Lucene as its indexing and search engine. It is written in Java and provides plug-in interfaces for building extensions to the search server. It can be run in an application server such as Apache Tomcat, and you talk to Solr via HTTP and XML, with Solr responding with XML or, for example, JSON.

Solr can return a search result for a query (it would not be much of a search engine otherwise), but it also has other features: faceted search and navigation as seen on many e-commerce sites, result highlighting, “more like this” functionality for finding similar documents, query spell correction, query completion, and geospatial search for filtering and sorting by distance.

The HTTP and XML interface of Solr has two main access points: the update URL, which maintains the index, and the select URL, which is used for queries. In the default configuration, they are found at:

http://localhost:8983/solr/update

http://localhost:8983/solr/select

To add a document to the index, we POST an XML representation of the fields to index to the update URL. The XML looks like the example below, with a <field> element for each field to index. Such documents represent the metadata and content of the actual documents that we're indexing:

<add>
  <doc>
    <field name="id">CAN5DMARKIII</field>
    <field name="name">Canon EOS 5D Mark III</field>
    <field name="category">camera</field>
    <field name="features">22.3MP full-frame CMOS sensor</field>
    <field name="features">Canon DIGIC 5+ image processor</field>
    <field name="features">ISO 100 - 25,600</field>
    <field name="features">1080/30p Full HD movie recording</field>
    <field name="features">3.2in, 1040k-dot LCD monitor</field>
    <field name="features">Weather-sealed aluminium chassis</field>
    <field name="weight">950</field>
    <field name="price">3000.00</field>
  </doc>
</add>

The <add> element tells Solr that we want to add the document to the index (or replace it if it's already indexed), and with the default configuration, the id field is used as a unique identifier for the document. Posting another document with the same id replaces the existing document as a whole; fields that are missing from the new version are not carried over from the old one.

Note that the added document isn't yet visible in queries. To speed up the addition of multiple documents (an <add> element can contain multiple <doc> elements), changes aren't committed after each document, so we must POST an XML document containing a <commit> element to make our changes visible.
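As a rough sketch of the add and commit messages described above, the following Python snippet builds both XML payloads with the standard library. The function names build_add and build_commit are made up for this illustration, and actually POSTing the payloads to the update URL is left out.

```python
import xml.etree.ElementTree as ET


def build_add(doc_fields):
    """Build an <add> message for one document.

    doc_fields is a list of (name, value) pairs; a name may repeat
    to express a multi-valued field such as "features".
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_commit():
    """Build the <commit> message that makes pending changes visible."""
    return ET.tostring(ET.Element("commit"), encoding="unicode")


payload = build_add([
    ("id", "CAN5DMARKIII"),
    ("category", "camera"),
    ("features", "ISO 100 - 25,600"),
    ("features", "Weather-sealed aluminium chassis"),
])
print(payload)
print(build_commit())
```

Both strings would be sent as separate POST requests to the update URL, the commit last.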

Once we have indexed some data, an HTTP GET on the select URL does the querying. The example below searches for the word "video" in the default search field and asks for the name and id fields to be included in the response.


$ curl "http://localhost:8983/solr/select/?indent=on&q=video&fl=name,id"

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <responseHeader>
    <status>0</status><QTime>1</QTime>
  </responseHeader>

  <result numFound="2" start="0">
   <doc>
    <str name="id">MA147LL/A</str>
    <str name="name">Apple 60 GB iPod Black</str>
   </doc>
   <doc>
    <str name="id">EN7800GTX/2DHTV/256M</str>
    <str name="name">ASUS Extreme N7800GTX</str>
   </doc>
  </result>
</response>
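The same request and response can be handled from a script. The sketch below builds the select URL and pulls the documents out of a response like the one above; select_url and parse_docs are hypothetical helper names, and the sample response is a shortened copy of the output shown here, so no running Solr server is needed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def select_url(base, query, fields):
    """Build a select URL asking for the given stored fields."""
    params = {"indent": "on", "q": query, "fl": ",".join(fields)}
    return base + "/select/?" + urlencode(params)


def parse_docs(response_xml):
    """Return (numFound, list of {field name: value}) from a Solr XML response."""
    root = ET.fromstring(response_xml)
    result = root.find("result")
    docs = [{el.get("name"): el.text for el in doc} for doc in result.findall("doc")]
    return int(result.get("numFound")), docs


SAMPLE = """<response>
  <result numFound="2" start="0">
    <doc><str name="id">MA147LL/A</str><str name="name">Apple 60 GB iPod Black</str></doc>
    <doc><str name="id">EN7800GTX/2DHTV/256M</str><str name="name">ASUS Extreme N7800GTX</str></doc>
  </result>
</response>"""

print(select_url("http://localhost:8983/solr", "video", ["name", "id"]))
found, docs = parse_docs(SAMPLE)
print(found, [d["id"] for d in docs])
```

Note that urlencode percent-encodes the comma in the fl parameter, which Solr decodes back to name,id.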

The query language used by Solr is based on Lucene queries, with the addition of optional sort clauses in the query. Asking for video; inStock asc, price desc, for example, searches for the word "video" in the default search field and returns results sorted on the inStock field, ascending, and price field, descending.

The default search field is specified in Solr's schema.xml configuration file, as in this example:

<defaultSearchField>text</defaultSearchField>

A query can refer to several fields, like handheld AND category:camera which searches the category field in addition to the default search field.

Besides the <add> and <commit> operations, <delete> can be used to remove documents from the index, either by using the document's unique ID:

<delete><id>MA147LL/A</id></delete>

A query can also be used to remove a range of documents from the index:

<delete><query>category:camera</query></delete>

As with add/update operations, a <delete> must be followed by a <commit> to make the resulting changes visible in queries.
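The two delete variants can be generated the same way as the add message. This is only a small sketch; delete_by_id and delete_by_query are names invented for the example.

```python
import xml.etree.ElementTree as ET


def _delete(child_tag, text):
    """Build a <delete> message with a single <id> or <query> child."""
    delete = ET.Element("delete")
    ET.SubElement(delete, child_tag).text = text
    return ET.tostring(delete, encoding="unicode")


def delete_by_id(doc_id):
    return _delete("id", doc_id)


def delete_by_query(query):
    return _delete("query", query)


print(delete_by_id("MA147LL/A"))
print(delete_by_query("category:camera"))
```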

In Lucene indexes, fields are created as you go; adding a document to an empty index with a numeric field named "price," for example, makes the field instantly searchable, without prior configuration.

When indexing lots of data, however, it is often a good idea to predefine a set of fields and their characteristics to ensure consistent indexing. To allow this, Solr adds a data schema on top of Lucene, where fields, data types, and content analysis chains can be precisely defined.

Here is an example based on Solr's default schema.xml configuration file. It’s a simple text type that is only split into tokens on whitespace, without any further filtering:


<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
</fieldType>

The optional positionIncrementGap attribute puts space between multiple fields of this type on the same document, with the purpose of preventing false phrase matching across fields. For example, suppose a document has a multiValued "features" field as follows:

features: 1080/30p Full HD movie recording

features: 3.2in, 1040k-dot LCD monitor

With a positionIncrementGap of 0, the last token of the first value and the first token of the second value are adjacent, so a phrase query that spans the two values, such as "recording 3.2in,", would match. A match across different field values like this is often undesirable. The positionIncrementGap controls the virtual space between the last token of one field instance and the first token of the next instance. A gap of 100 prevents phrase queries from matching across instances.
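To make the effect of the gap concrete, here is a simplified model in Python. It is not Lucene's implementation: it just assigns consecutive positions to whitespace tokens, adds the gap between values, and treats a phrase as matching only when its words sit at consecutive positions. It uses the two features values above and a phrase that spans the boundary between them.

```python
def positions(values, gap):
    """Assign a position to every whitespace token of a multi-valued field."""
    indexed = []
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # virtual space between two values
        for token in value.split():
            indexed.append((token, pos))
            pos += 1
    return indexed


def phrase_matches(indexed, phrase):
    """True if the words of the phrase occur at consecutive positions."""
    words = phrase.split()
    by_pos = {pos: token for token, pos in indexed}
    return any(
        all(by_pos.get(start + i) == word for i, word in enumerate(words))
        for start in by_pos
    )


VALUES = ["1080/30p Full HD movie recording", "3.2in, 1040k-dot LCD monitor"]

print(phrase_matches(positions(VALUES, 0), "recording 3.2in,"))    # crosses values
print(phrase_matches(positions(VALUES, 100), "recording 3.2in,"))  # blocked by gap
print(phrase_matches(positions(VALUES, 100), "LCD monitor"))       # within one value
```

In this model the cross-value phrase matches only when the gap is 0, while a phrase inside a single value matches regardless of the gap.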

The actual fields can now be configured, mapping them to the field types that we have defined:

<field name="id" type="string" indexed="true" stored="true" required="true" />

<field name="category" type="text_ws" indexed="true" stored="true"/>

To avoid losing the free-form indexing provided by Lucene, dynamic fields can also be defined, using field name patterns with wildcards to specify ranges of field names with common properties. Here's an example where all field names ending in _s are treated as string fields:

<dynamicField name="*_s" type="string" indexed="true" stored="true"/>

The combination of strict schema-based data types with looser wildcard-based types helps in building consistent indexes, while allowing the addition of new strongly typed fields on the fly.
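The lookup order implied here (explicit fields first, then wildcard patterns) can be sketched as follows. This is an illustration only, not Solr's code: field_type is a made-up helper, the tables mirror the field definitions shown above, and fnmatchcase is more permissive than Solr's dynamic field patterns.

```python
from fnmatch import fnmatchcase

# Explicitly declared fields and their types, as in the <field> examples.
EXPLICIT = {"id": "string", "category": "text_ws"}

# Dynamic field patterns and their types, as in the <dynamicField> example.
DYNAMIC = [("*_s", "string")]


def field_type(name):
    """Return the type for a field name, or None if the schema rejects it."""
    if name in EXPLICIT:
        return EXPLICIT[name]
    for pattern, ftype in DYNAMIC:
        if fnmatchcase(name, pattern):
            return ftype
    return None


print(field_type("category"))  # explicit field
print(field_type("colour_s"))  # matched by the *_s pattern
print(field_type("colour"))    # no explicit field and no pattern
```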

Why is Alfresco using it?

The following are some of the reasons why Alfresco decided to move from an embedded Lucene search engine to the stand-alone Solr search server:

  • To be able to scale content search independently of authoring and publishing of content
  • To be able to make clustering easier (instead of having an index on each node, and index tracking from each node to the database, a dedicated search server or search cluster is used)
  • To improve performance: there is no longer any permission evaluation in a second pass or “in transaction” indexing, read permissions are evaluated at query time, and path search is much faster
  • To support cross-locale ordering for d:text and d:mltext properties

So what content in the Alfresco repository is searchable via Solr?

  • All content in the Workspace store (the main store used for most of the live content)
  • All content in the Archive store (content that has been soft deleted)
  • Content in other stores such as AVM (deprecated and being phased out) is not supported
  • Multi-tenant searches are not supported in version 4.0.0 (supposed to be supported in later versions)

There are also other advantages, not directly related to Alfresco, of moving to Solr. When Solr is deployed on its own server it can be used to index content from many different enterprise resources in the organization, giving the users a one-stop shop for all searches.

Solr Cores

When talking about Solr you will quite early on hear people mentioning something called Solr cores. It is important to know what a Solr core is before moving on. Previously when working with Alfresco and its embedded Lucene index engine you were talking about “the” index.

With Lucene there is one index and that’s it. Not so with Solr. A Solr core holds one Lucene index and the supporting Solr configuration for that index; sometimes the word "core" is used synonymously with "index".

The Solr server can manage multiple cores, meaning it can manage multiple Lucene indexes. Nearly all interactions with Solr are targeted at a specific core. Using multiple cores has some advantages in a production environment:

  • You can rebuild an index (i.e. core) offline with minimal impact on performance, and the offline index could then also be optimized for updating content
  • You can test configuration changes to an index without affecting the live environment by copying the live index to a test index
  • You can rename cores at runtime and keep multiple versions of the same core, deciding which one should be used by the end users.

How is Solr integrated with Alfresco?

Okay; so we now know what Solr is and why Alfresco has chosen to use it. So how is it integrated with Alfresco? The easiest way to put it is as follows: Alfresco uses HTTP GETs to talk to Solr and search for content. Solr updates the indexes by talking to Alfresco and looking at the number of transactions that have been committed since it last talked to Alfresco, a bit like index tracking in a cluster.

Solr will also keep track of new custom content models that have been deployed and download them to be able to index the properties in these models. Any permission settings will also be downloaded by Solr so it can do query time permission filtering.

There are 2 cores:

  • WorkspaceStore: for searching all live content (i.e. stuff in alf_data/contentstore)
  • ArchiveStore: for searching soft deleted content (i.e. content in alf_data/contentstore that has been marked as deleted)

The following picture gives you an overview of how it works:

As the picture shows, Solr version 1.4 is used and has been extended with code to talk to Alfresco. So if you are going to install Solr on a separate dedicated search box you need to download it from the Alfresco download site and not from Apache (more information about this later on).

If you wanted to use a newer version of Solr such as for example 3.4, then that would be a bit of a problem. It would not be a simple task to upgrade, and Alfresco might not support you after the upgrade.

So now that Solr is used to search for content, how is Alfresco Share affected if, for example, the Solr server stops working or goes down? Normal browsing and navigation in the Alfresco Share UI is not affected, as it does not use Solr but regular database queries.

Solr is used in the following situations:

  • Full Text Search (search field in top right corner)
  • Advanced Search
  • Filters
  • Tags
  • Categories (implemented as facets)
  • Dashlets such as the Recently Modified Documents
  • Wildcard searches for People, Groups, Sites (uses database search if not wildcard)

The following properties in alfresco-global.properties are related to Solr, and they are set up as follows by default:

### Solr indexing ###
index.subsystem.name=solr
dir.keystore=${dir.root}/keystore
solr.port.ssl=8443
### Solr indexing ###

As you can see, search has been moved into a sub-system with a solr and a lucene implementation. The Solr search sub-system supports the same query languages as the embedded Lucene sub-system. The same fields (e.g. ID, PARENT, TYPE, properties) are also available. The only minor difference is that Solr only supports the OpenCMIS-based CMIS query language. This is stricter in its adherence to the CMIS specification: type and aspect names are case sensitive.

Looking in the alfresco/tomcat/webapps/alfresco/WEB-INF/classes/alfresco/repository.properties configuration file, one can see that there are a few more Solr-related properties that can be configured (copy them to alfresco-global.properties if you need to change them):


# SOLR connection details (e.g. for JMX)
solr.host=localhost
solr.port=8080
solr.port.ssl=8443
solr.solrUser=solr
solr.solrPassword=solr
# none, https
solr.secureComms=https
solr.max.total.connections=40
solr.max.host.connections=40
# Solr connection timeouts
# solr connect timeout in ms
solr.solrConnectTimeout=5000

# cron expression defining how often the Solr Admin client (used by JMX) pings Solr if it goes away
solr.solrPingCronExpression=0 0/5 * * * ? *

These properties mostly control how Alfresco connects to the Solr server. As Solr by default runs in the same Tomcat instance as Alfresco, the connection properties are set up to connect to a locally running Solr server. Alfresco will also by default use HTTPS to connect to the Solr server.
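To see how the connection properties above fit together, the sketch below reads them from a properties snippet and derives a Solr base URL. This is an assumption about how the values combine, not Alfresco's own code: parse_properties and solr_base_url are invented names, and the /solr context path is taken from the URLs used elsewhere in this article.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def solr_base_url(props):
    """Pick scheme and port depending on solr.secureComms (none or https)."""
    secure = props.get("solr.secureComms", "https") == "https"
    scheme = "https" if secure else "http"
    port = props["solr.port.ssl"] if secure else props["solr.port"]
    return f"{scheme}://{props['solr.host']}:{port}/solr"


SAMPLE = """
# SOLR connection details (e.g. for JMX)
solr.host=localhost
solr.port=8080
solr.port.ssl=8443
# none, https
solr.secureComms=https
"""

props = parse_properties(SAMPLE)
print(solr_base_url(props))
```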

Logging into Solr Admin UI

The Solr web application comes with an administration UI that can be useful for finding out stuff about the Solr installation, such as deployed schemas, Solr configuration, indexed fields etc. The admin console can be accessed via the following URL:

https://localhost:8443/solr/admin/

However, when we have multiple cores this URL will not work; we need to refer to the core we want to administer. It is best to use the base Solr web application URL, which responds with a list of cores:

https://localhost:8443/solr

The first time you hit this URL it will respond with an error message as follows:

All URLs for the Solr web application (i.e. /solr) are protected by SSL. In order to use these from a browser you need to import a browser-compatible keystore to allow mutual authentication and decryption to work. Follow these steps to import the keystore into your browser (I’m using Firefox, other browsers will have a similar mechanism):

1) Open the Firefox Certificate Manager (Tools | Options | Advanced | Encryption | View Certificates | Your Certificates):

 

2) Import the browser keystore browser.p12 that is located in your tomcat\webapps\alfresco\WEB-INF\classes\alfresco\keystore directory or in the alfresco\alf_data\keystore directory. You will be prompted for a password; the password is alfresco. This should result in a dialog indicating that the keystore has been imported successfully.

3) The Certificate Manager should now show one imported certificate as follows:

 

 

4) Now close the Certificate Manager by clicking OK

5) Then hit https://localhost:8443/solr again, this will show the following dialog:

 

 

6) The Solr web application now wants you to identify (i.e. authenticate) yourself with a certificate for access to the Solr admin console. Select the Alfresco Repository cert and click OK. This means that we are using the same certificate as the Alfresco Repository is using when it talks to the Solr web application during for example a search.

7) We should now see a list of the cores:

 

 

8) The first one (i.e. Admin alfresco) is the workspace core for live content, click on it and the following page should display:

 

 

Testing some searches

Normally when we want to try out different queries with Solr we would use the Solr admin interface and the “Make a Query” component. This will not work as it will use the default request handler that does not know anything about the Alfresco Solr schema and how fields are indexed. If you for example specify a Query String “alfresco”, it will execute a query looking something like this:

https://localhost:8443/solr/alfresco/select/?q=alfresco&version=2.2&start=0&rows=10&indent=on

It will respond with 0 rows, which is incorrect, as it should give you lots of hits; you can verify this by running the same search from Alfresco Share.

We will have a look at the Alfresco schema and Alfresco related query handlers (i.e. AlfrescoLucene, AlfrescoFTS, CMIS) later on in this article. For now, you can try out searching in live content in the Alfresco repository via Solr as follows with the Alfresco FTS (Full Text Search) query handler:

https://localhost:8443/solr/alfresco/afts?q=@cm\:name:alfresco&indent=on

The first item of the /alfresco/afts URL path points to the core for live content and the second item points to the Alfresco FTS query handler, which is registered as a request handler with name="/afts".

This query should respond with a result such as follows:


{
 "responseHeader":{
  "status":0,
  "QTime":223},
 "response":{"numFound":20,"start":0,"docs":[
          {
           "INTXID":["6"],
           "ID":["LEAF-429"],
           "ASPECT":["{http://www.alfresco.org/model/system/1.0}localized","{http://www.alfresco.org/model/application/1.0}uifacets","{http://www.alfresco.org/model/content/1.0}auditable","{http://www.alfresco.org/model/system/1.0}referenceable","{http://www.alfresco.org/model/content/1.0}titled"],
           "ISNODE":["T"],
           "TYPE":["{http://www.alfresco.org/model/content/1.0}folder"],
           "DBID":["429"]},
          {
           "INTXID":["6"],
           "ID":["LEAF-476"],
           "ASPECT":["{http://www.alfresco.org/model/system/1.0}referenceable","{http://www.alfresco.org/model/system/1.0}localized"],
           "ISNODE":["T"],
           "TYPE":["{http://www.alfresco.org/model/content/1.0}authorityContainer"],
           "DBID":["476"]},
          {
. . .

As we can see not that many fields are stored for a document. The type of the node can be found out, and the applied aspects, but that is pretty much it. We would need to make a database call and get the rest of the properties via the DBID.

To get all nodes of a certain type use the following type of query:

https://localhost:8443/solr/alfresco/afts?q=TYPE:%22cm:folder%22&indent=true

To get all nodes with a certain ASPECT applied the following query can be used:

https://localhost:8443/solr/alfresco/afts?q=ASPECT:%22mc:published%22&indent=true

In this case I searched for nodes with the custom aspect mc:published applied, more about this custom aspect later in the article.
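If you run many of these test queries, a tiny helper that does the URL encoding is handy. The sketch below only builds the URL; afts_url is a made-up name, and sending the request would additionally need the client certificate described earlier.

```python
from urllib.parse import quote


def afts_url(base, core, query):
    """Build an AFTS query URL, percent-encoding quotes but keeping colons."""
    return f"{base}/{core}/afts?q={quote(query, safe=':')}&indent=true"


BASE = "https://localhost:8443/solr"

print(afts_url(BASE, "alfresco", 'TYPE:"cm:folder"'))
print(afts_url(BASE, "alfresco", 'ASPECT:"mc:published"'))
```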

Warning: When you search like this the result is not filtered by permission settings, so you should avoid it for anything other than testing.

How to turn on Logging during search

If you want to have a look at the queries that Alfresco is running against Solr when you click around in Alfresco Share then enable debug logging as follows in log4j.properties (located in tomcat\webapps\alfresco\WEB-INF\classes):

log4j.logger.org.alfresco.repo.search.impl.solr.SolrQueryHTTPClient=debug

A log for a full text search on “adamo” looks like this:

2012-05-17 08:21:15,696  DEBUG [impl.solr.SolrQueryHTTPClient] [http-8080-26] Sent :/solr/alfresco/afts?q=%28%28PATH%3A%22%2Fapp%3Acompany_home%2Fst%3Asites%2Fcm%3Atest2%2F*%2F%2F*%22+AND+%28adamo++AND+%28%2BTYPE%3A%22cm%3Acontent%22+%2BTYPE%3A%22cm%3Afolder%22%29%29+%29+AND+-TYPE%3A%22cm%3Athumbnail%22+AND+-TYPE%3A%22cm%3AfailedThumbnail%22+AND+-TYPE%3A%22cm%3Arating%22%29+AND+NOT+ASPECT%3A%22sys%3Ahidden%22&wt=json&fl=*%2Cscore&rows=502&df=keywords&start=0&locale=en_GB&fq=%7B%21afts%7DAUTHORITY_FILTER_FROM_JSON&fq=%7B%21afts%7DTENANT_FILTER_FROM_JSON
 2012-05-17 08:21:15,697  DEBUG [impl.solr.SolrQueryHTTPClient] [http-8080-26]    with: {"textAttributes":[],"allAttributes":[],"templates":[{"template":"%(cm:name cm:title cm:description ia:whatEvent ia:descriptionEvent lnk:title lnk:description TEXT TAG)","name":"keywords"}],"authorities":["GROUP_ALFRESCO_ADMINISTRATORS","GROUP_EMAIL_CONTRIBUTORS","GROUP_EVERYONE","GROUP_site_test","GROUP_site_test2","GROUP_site_test2_SiteManager","GROUP_site_test_SiteManager","ROLE_ADMINISTRATOR","ROLE_AUTHENTICATED","admin"],"tenants":[""],"query":"((PATH:\"/app:company_home/st:sites/cm:test2/*//*\" AND (adamo  AND (+TYPE:\"cm:content\" +TYPE:\"cm:folder\")) ) AND -TYPE:\"cm:thumbnail\" AND -TYPE:\"cm:failedThumbnail\" AND -TYPE:\"cm:rating\") AND NOT ASPECT:\"sys:hidden\"","locales":["en_GB"],"defaultNamespace":"http://www.alfresco.org/model/content/1.0","defaultFTSFieldOperator":"OR","defaultFTSOperator":"OR"}
 2012-05-17 08:21:15,697  DEBUG [impl.solr.SolrQueryHTTPClient] [http-8080-26] Got: 1 in 48 ms
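The "Sent" line is URL encoded and hard to read. A few lines of Python can decode it into the handler path and its parameters. The sample string below is a shortened, hand-made fragment in the same style as the log line above, not a verbatim copy, and decode_sent is a name chosen for this example.

```python
from urllib.parse import urlsplit, parse_qs


def decode_sent(path_and_query):
    """Split a logged request into handler path and decoded parameters."""
    parts = urlsplit(path_and_query)
    return parts.path, parse_qs(parts.query)


SENT = (
    "/solr/alfresco/afts?q=TYPE%3A%22cm%3Acontent%22+AND+NOT+ASPECT%3A%22sys%3Ahidden%22"
    "&wt=json&rows=502"
    "&fq=%7B%21afts%7DAUTHORITY_FILTER_FROM_JSON"
    "&fq=%7B%21afts%7DTENANT_FILTER_FROM_JSON"
)

path, params = decode_sent(SENT)
print(path)
for name, values in params.items():
    print(name, values)
```

Repeated parameters such as fq end up as a list with one entry per occurrence.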
 

Analysing the Index

Sometimes it is useful to be able to open up the index files and have a look at the fields that have been indexed etc. This can easily be done with Luke. Download the tool from here.

Start it up as follows:

C:\Users\mbergljung\Downloads\>java -jar lukeall-3.5.0.jar

 

First thing you need to do when Luke starts is to point to where the index files are. Select C:\alfresco\alf_data\solr\workspace\SpacesStore\index (this is for live content). The overview tab should display something like this:

In Luke you will be able to see all fields that have been indexed and you can browse through the documents in the index.

Solr related directory structure and files

After you have installed Alfresco 4.0 there will be several new directories and configuration files having to do with Solr. Let’s have a look at them.

Alfresco\tomcat\conf\Catalina\localhost\solr.xml (Solr Web App Context)

This file defines the Solr web application context, points to where the Solr WAR file is located, and sets up the Solr home directory. The contents of this file look like this:


<?xml version="1.0" encoding="utf-8"?>
<Context  docBase="C:/Alfresco/alf_data/solr/apache-solr-1.4.1.war"
    debug="0"
    crossContext="true">
  <Environment name="solr/home"
               type="java.lang.String"
        value="C:/Alfresco/alf_data/solr"
        override="true"/>
</Context>

Note. The tomcat/webapps directory does not contain a Solr WAR file.

Alfresco\alf_data\solr (Solr Home Directory)

This directory is the Solr home directory and it mainly lists Solr cores and contains a configuration file called solr.xml. In our case there are 2 cores, one for live/workspace content and one for deleted/archived content. As mentioned before, a Solr core holds one Lucene index and the supporting configuration for that index. Pretty much all interactions with Solr are targeted at a specific core.

 

The following picture shows the content of the directory:

The sub-directories and files in this directory have the following meaning:

Alfresco\alf_data\solr\solr.xml (Configuration of all Solr cores)

The solr.xml configuration file specifies the cores that will be handled by this Solr server. In our case this file has the following content:


<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" sharedLib="lib" >
  <cores  adminPath="/admin/cores"
   adminHandler="org.alfresco.solr.AlfrescoCoreAdminHandler">
    <core name="alfresco" instanceDir="workspace-SpacesStore" />
    <core name="archive"  instanceDir="archive-SpacesStore" />
  </cores>
</solr>

Here we can see that there are two cores, one called alfresco that has its instance directory under workspace-SpacesStore and another core called archive that has its instance directory under archive-SpacesStore.
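A quick way to check which cores a given solr.xml declares is to parse it. The sketch below embeds the file content shown above as a string so it runs on its own; list_cores is an invented helper name.

```python
import xml.etree.ElementTree as ET

SOLR_XML = """<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores"
         adminHandler="org.alfresco.solr.AlfrescoCoreAdminHandler">
    <core name="alfresco" instanceDir="workspace-SpacesStore" />
    <core name="archive" instanceDir="archive-SpacesStore" />
  </cores>
</solr>"""


def list_cores(xml_text):
    """Map each core name to its instance directory."""
    root = ET.fromstring(xml_text)
    return {core.get("name"): core.get("instanceDir") for core in root.iter("core")}


cores = list_cores(SOLR_XML)
for name, instance_dir in cores.items():
    print(name, "->", instance_dir)
```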

Each instance directory contains the configuration for that Solr core but not the index. The index is located in the workspace directory for live content and in the archive directory for archived content. The index location is configured in the solrcore.properties file in each instance directory.

Alfresco\alf_data\solr\workspace-SpacesStore (instance directory for a core)

This directory, as well as the archive-SpacesStore directory, is a so called instance directory for a Solr core/index. For live content it looks like this:

The sub-directories have the following meaning:

Alfresco\alf_data\solr\workspace-SpacesStore\conf (configuration directory for a core)

This directory, as well as the corresponding conf directory under archive-SpacesStore, is the configuration directory for a Solr core. For live content it looks like this:

The configuration files have the following meaning:

Alfresco\alf_data\solr\<core>\solrcore.properties

This is the property configuration file for a core, and as of Solr 1.4 all the properties that need to be substituted can be put into this properties file. It is usually expected to be located in the {solr.home}/conf directory when using a single core, but with multiple cores, as in our case, there will be one solrcore.properties file in each core’s configuration directory.

Opening the solrcore.properties file in the C:\Alfresco\alf_data\solr\workspace-SpacesStore directory shows the following:


data.dir.root=C:/Alfresco/alf_data/solr
data.dir.store=workspace/SpacesStore
enable.alfresco.tracking=true
cache.alfresco.size=100
max.field.length=2147483647

alfresco.host=localhost
alfresco.port=8080
alfresco.port.ssl=8443
alfresco.cron=0/15 * * * * ? *
alfresco.stores=workspace://SpacesStore
alfresco.lag=1000
alfresco.hole.retention=3600000
alfresco.batch.count=1000

alfresco.secureComms=https

alfresco.encryption.ssl.keystore.type=JCEKS
alfresco.encryption.ssl.keystore.provider=
alfresco.encryption.ssl.keystore.location=ssl.repo.client.keystore
alfresco.encryption.ssl.keystore.passwordFileLocation=ssl-keystore-passwords.properties
alfresco.encryption.ssl.truststore.type=JCEKS
alfresco.encryption.ssl.truststore.provider=
alfresco.encryption.ssl.truststore.location=ssl.repo.client.truststore
alfresco.encryption.ssl.truststore.passwordFileLocation=ssl-truststore-passwords.properties

So we can see from this configuration that the index files for the live content are stored in the C:\Alfresco\alf_data\solr\workspace\SpacesStore directory (i.e. ${data.dir.root}/${data.dir.store}). The default location for the index directory is /data under the Solr home directory.

The properties of the file have the following meaning:

Communications between the Alfresco Repository and Solr are protected by SSL with mutual authentication. Both the repository and Solr have their own public/private RSA key pair, signed by an Alfresco Certificate Authority, and stored in their own respective key stores. These key stores are bundled with Alfresco. You can also create your own key stores. It is assumed that the keystore files are stored in alfresco/alf_data/keystore directory.

The following properties are only relevant when Solr talks to Alfresco over a secure connection:

Alfresco\alf_data\solr\<core>\solrconfig.xml

This is the file that contains most of the parameters for configuring Solr itself. A big part of this file is made up of request handlers, which are defined with <requestHandler> elements. This is also the file where you will be adding Solr search components.

The configuration file starts off by configuring which libraries will be used. Library (lib) directives instruct Solr to load the identified JARs and use them to resolve any "plugins" specified in solrconfig.xml or schema.xml (i.e. analyzers, search components, request handlers, etc.). All directories and paths are resolved relative to the instanceDir directory (e.g. relative to alfresco\alf_data\solr\workspace-SpacesStore).

The following is an example of a library directive:

<lib dir="../../contrib/extraction/lib" />

 

If a /lib directory exists in the instanceDir, all files found in it are included as if you had used the following syntax:

<lib dir="./lib" />

 

The data directory (i.e. Lucene index directory) for the core is specified next in this file as follows:

<dataDir>${data.dir.root}/${data.dir.store}</dataDir>
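The ${...} placeholders in this value are filled in from solrcore.properties. The substitution can be mimicked in a few lines of Python; this is only an illustration of the idea, not how Solr implements it, and resolve is a hypothetical helper.

```python
import re


def resolve(template, props):
    """Replace every ${name} in the template with props[name]."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: props[m.group(1)], template)


# Values taken from the solrcore.properties example in this article.
PROPS = {
    "data.dir.root": "C:/Alfresco/alf_data/solr",
    "data.dir.store": "workspace/SpacesStore",
}

data_dir = resolve("${data.dir.root}/${data.dir.store}", PROPS)
print(data_dir)
```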

Next stuff of interest is the default update handler that will add, delete and update documents to the index:

<updateHandler class="org.alfresco.solr.AlfrescoUpdateHandler">

 

The AlfrescoUpdateHandler is a customized version of the solr.DirectUpdateHandler2 that comes with Apache Solr out of the box. This handler will be visible under the /update context. So, if your Solr is running on a local machine, the full URL would be as follows for live content:

https://localhost:8443/solr/alfresco/update

 

For example, if you wanted to delete the complete index for live content you could do this:

$ curl "https://localhost:8443/solr/alfresco/update?commit=true" -d '<delete><query>*:*</query></delete>'

 

The AlfrescoUpdateHandler will handle incoming POSTed index update requests such as the following XML:


<add allowDups="true">
  <doc boost="1.0">
    <field name="ISCATEGORY">F</field>   
    <field name="PARENT">Root</field>    
    <field name="QNAME">{http://www.alfresco.org/model/content/1.0}one</field>   
    <field name="PRIMARYPARENT">Root</field>   
    <field name="ASSOCTYPEQNAME">{http://www.alfresco.org/model/system/1.0}children</field>   
    <field name="FTSSTATUS">Clean</field>   
    <field name="PRIMARYASSOCTYPEQNAME">{http://www.alfresco.org/model/system/1.0}children</field>   
    <field name="ANCESTOR">Root</field>   
    <field name="ID">Doc-1</field>   
    <field name="TX">Tx-1</field>   
    <field name="ISROOT">F</field>   
    <field name="ISNODE">T</field>   
    <field name="TYPE">{http://www.alfresco.org/model/content/1.0}folder</field>   
    <field name="ASPECT">{http://www.alfresco.org/model/content/1.0}auditable</field>
    <field name="@{http://www.alfresco.org/model/system/1.0}locale">en</field>   
    <field name="@{http://www.alfresco.org/model/content/1.0}title">Document number 1</field>   
    <field name="@{http://www.alfresco.org/model/content/1.0}name">Doc 1</field>   
    <field name="@{http://www.alfresco.org/model/content/1.0}created">2010-07-21T10:52:00.000Z</field>   
    <field name="@{http://www.alfresco.org/model/content/1.0}creator">Andy</field>   
    <field name="@{http://www.alfresco.org/model/content/1.0}modified">2010-07-21T10:52:00.000Z</field>   
    <field name="@{http://www.alfresco.org/model/content/1.0}modifier">Andy</field>     
    <field name="DBID">1</field>   
  </doc>
</add>

After the update handler there will be query configurations such as different caches and query listeners. Following this section is the big section with the request handlers.

The request handlers are responsible for answering incoming query requests. All request handlers for a Solr core are configured in the solrconfig.xml file. Dispatching requests to the different request handlers is managed by the request dispatcher, which has its configuration inside the <requestDispatcher> element.

The <requestDispatcher> element has a handleSelect attribute which defaults to true (as of Solr 3.6 it defaults to false).

When handleSelect is true, an additional dispatch rule comes into play if the request uses "/select" and there is no request handler by that name. Instead of treating this as an error, Solr uses the qt parameter (qt = query type) to look up the handler by name. For example, to use the DisMax request handler (which handles user-entered phrase searches better) you would make the following call:

https://localhost:8443/solr/alfresco/select?qt=dismax&q=@cm\:name:alfresco&indent=on

 

Note. Unfortunately this does not work, as the DisMax search handler does not know anything about the Alfresco Solr schema. It assumes the out-of-the-box Solr example schema.

If there is no qt parameter then the default request handler is chosen.

Request handlers are defined with a name and the class that is responsible for handling the request. The default request handler also has the attribute default set to true:

<requestHandler name="standard" class="solr.SearchHandler" default="true">...

 

If the name starts with a "/" then you can reach the request handler by calling the correct URL path.

For example, the Alfresco Lucene request handler is configured like this:

<requestHandler name="/alfresco" class="solr.SearchHandler">...

 

That means you can reach this handler by calling https://localhost:8443/solr/alfresco/alfresco.
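
The dispatch rules described so far can be summarised in a few lines of code. This is a simplified model written for this article, not Solr's actual implementation; the function resolve_handler and the handler set are assumptions made for illustration.

```python
def resolve_handler(path, params, handlers, default_handler, handle_select=True):
    """Return the name of the request handler that would serve a request.

    Simplified model of the dispatch rules, not Solr's real code.
    """
    # 1. A handler whose name is the URL path itself (e.g. "/alfresco") wins.
    if path in handlers:
        return path
    # 2. With handleSelect enabled, "/select" falls back to the qt parameter,
    #    or to the default handler when qt is absent.
    if handle_select and path == "/select":
        name = params.get("qt", default_handler)
        if name in handlers:
            return name
        raise LookupError("unknown request handler: %s" % name)
    raise LookupError("no request handler for path: %s" % path)


# Handler names taken from the examples in this article.
HANDLERS = {"standard", "dismax", "/alfresco", "/cmis"}

print(resolve_handler("/select", {"qt": "dismax"}, HANDLERS, "standard"))
print(resolve_handler("/select", {}, HANDLERS, "standard"))
print(resolve_handler("/alfresco", {}, HANDLERS, "standard"))
```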

There are several other Alfresco specific request handlers such as the one that handles CMIS queries:


<requestHandler name="/cmis" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">cmis</str>
    </lst>
    <arr name="components">
      <str>setLocale</str>
      <str>query</str>
      <str>facet</str>
      <str>mlt</str>
      <str>highlight</str>
      <str>stats</str>
      <str>debug</str>
      <str>clearLocale</str>
    </arr>
</requestHandler>
  

The query parser for this custom CMIS request handler is specified with the defType parameter, which is set to cmis in the defaults of the above request handler. You will find the query parsers defined further down in the file; the cmis one looks like this:

<queryParser name="cmis" class="org.alfresco.solr.query.CmisQParserPlugin"/>

Among the request handlers you will also start to see different search components defined, such as the spell checker component:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

Search components enable a request handler to chain together reusable pieces of functionality to create custom search handlers without writing code. Search components can be reused by multiple instances of request handlers, either by pre-pending (first-components), appending (last-components), or replacing (components) the default list:


<requestHandler name="/spell" class="solr.SearchHandler" lazy="true">
    <lst name="defaults">
      <!-- omp = Only More Popular -->
      <str name="spellcheck.onlyMorePopular">false</str>
      <!-- exr = Extended Results -->
      <str name="spellcheck.extendedResults">false</str>
      <!-- The number of suggestions to return -->
      <str name="spellcheck.count">1</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
</requestHandler>
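
To make the three options concrete, here is a rough model of how the effective component list could be assembled. It is a sketch under assumptions: the default list below is taken from the components of the CMIS handler shown earlier (minus the Alfresco locale components), and real Solr may place the debug component differently.

```python
# Assumed default chain, mirroring the standard components listed in the
# CMIS request handler earlier in this article.
DEFAULT_COMPONENTS = ["query", "facet", "mlt", "highlight", "stats", "debug"]


def effective_components(components=None, first=(), last=()):
    """Model of how a search handler's component chain is put together."""
    if components is not None:
        # "components" replaces the default list entirely
        return list(components)
    # "first-components" are prepended, "last-components" are appended
    return list(first) + DEFAULT_COMPONENTS + list(last)


# The /spell handler only declares last-components:
print(effective_components(last=["spellcheck"]))
```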
  

Alfresco\alf_data\solr\<core>\schema.xml

A Solr schema defines the relationship between content in a document and a Solr core (i.e. index). The schema identifies the document properties to index in Solr and maps property names to Solr types. The schema for a core is defined in a schema.xml file. For an Alfresco & Solr integration you would think that the schema would be huge. Not so, looking at the schema.xml file for the live core we will see that it looks something like this:


<?xml version="1.0" encoding="UTF-8"?>
<schema name="alfresco" version="1.0">
   <types>
      <fieldType name="alfrescoDataType" class="org.alfresco.solr.AlfrescoDataType">
         <analyzer>
            <tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" />
            <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory"
                    generateWordParts="1"
                    generateNumberParts="1"
                    catenateWords="1"
                    catenateNumbers="1"
                    catenateAll="1"
                    splitOnCaseChange="1"
                    splitOnNumerics="1"
                    preserveOriginal="1"
                    stemEnglishPossessive="1"/>
            <filter class="org.apache.solr.analysis.LowerCaseFilterFactory" />
         </analyzer>
      </fieldType>
   </types>
   <fields>
      <field name="ID" type="alfrescoDataType" indexed="true" omitNorms="true" stored="true" multiValued="true"></field>
      <dynamicField name="*" type="alfrescoDataType" indexed="true" omitNorms="true" stored="true" multiValued="true"></dynamicField>
   </fields>
   <uniqueKey>ID</uniqueKey>
   <defaultSearchField>ID</defaultSearchField>
</schema>

First off, notice the name of the schema, which is set to alfresco in the <schema> tag. Solr supports one schema per core and index. In the future, it may support multiple schemas, but for now, only one schema is allowed. Different cores can have different schemas though.

The schema is organized into three sections:

  • Types
  • Fields
  • Other declarations

The <types> section contains common, reusable definitions of how fields should be processed by Solr (and Lucene). In the alfresco schema there is only one field type, called alfrescoDataType. The reason for this is that there are plenty of content models to support out of the box, and there are usually also custom content models deployed. It is not possible to know all fields (i.e. content model properties) that will exist after custom models have been deployed, so a more dynamic approach to field management was chosen.

If you remember the alf_data\solr\<core>\alfrescoModels directory you will recall that it contains information about all the content models deployed to Alfresco. The alfrescoDataType uses this information and loads all the content models from the alfrescoModels directory when Solr starts.

It then knows about all the fields (i.e. properties) that can possibly be indexed. This data type is then used in the field definitions section to define the unique identifier field for indexed documents:

<field name="ID" type="alfrescoDataType" indexed="true" omitNorms="true" stored="true" multiValued="true"></field>

This means that Solr will be able to handle incoming update XML with the following field:

<field name="ID">Doc-1</field>

There is only one more field defined in the alfresco schema and it is a dynamic field that also uses the alfresco data type and it looks like this:

<dynamicField name="*" type="alfrescoDataType" indexed="true" omitNorms="true" stored="true" multiValued="true"></dynamicField>

As we have seen with the ID field, fields can be defined with explicit field names in the schema, but you can also have some defined on the fly based on the name supplied for indexing. The Alfresco schema contains a dynamic field with the name set to *.

If at index time a document contains a field that isn't matched by an explicit field definition, such as for the ID field, but does have a name matching a dynamic field name pattern (in our case any field name as we have specified *), then it gets processed according to that dynamic field definition.
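
The lookup order can be pictured with a small sketch. This is an illustration of the rule only, not how Solr implements it; resolve_field_type is a made-up name and the two dictionaries simply mirror the Alfresco schema shown above.

```python
from fnmatch import fnmatchcase


def resolve_field_type(field_name, explicit_fields, dynamic_fields):
    """Pick the field type for an incoming field name.

    Explicit field definitions are checked first, then dynamic field patterns.
    """
    if field_name in explicit_fields:
        return explicit_fields[field_name]
    for pattern, field_type in dynamic_fields:
        if fnmatchcase(field_name, pattern):
            return field_type
    raise KeyError("undefined field: %s" % field_name)


# Mirrors the Alfresco schema.xml: one explicit field and one catch-all pattern.
EXPLICIT = {"ID": "alfrescoDataType"}
DYNAMIC = [("*", "alfrescoDataType")]

print(resolve_field_type("ID", EXPLICIT, DYNAMIC))
print(resolve_field_type("@{http://www.alfresco.org/model/content/1.0}name", EXPLICIT, DYNAMIC))
```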

Fields from the content model that match the dynamic field definition, which is pretty much all of them except ID, are indexed under several field names. You can find the field names in the Solr admin console at https://localhost:8443/solr/alfresco/admin/schema.jsp by clicking FIELDS in the left navigation menu, which also shows whether anything has been indexed with each field. Alternatively, use Luke to browse the index fields.

Take the cm:name field, POSTed as follows with an update:

<field name="@{http://www.alfresco.org/model/content/1.0}name">some.pdf</field>

It will be indexed in different ways with the following field names:


@{http://www.alfresco.org/model/content/1.0}name                  
@{http://www.alfresco.org/model/content/1.0}name.u                
@{http://www.alfresco.org/model/content/1.0}name.__               
@{http://www.alfresco.org/model/content/1.0}name.__.u             
@{http://www.alfresco.org/model/content/1.0}name.sort 
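
If you need to work with these names in scripts, a small helper can expand a prefixed property name into the index field names listed above. This is a sketch under assumptions: the suffix list is copied from the cm:name listing and may differ for other property types, and the helper and its namespace table are made up for illustration.

```python
# Prefix to namespace URI mapping, using the namespaces seen in this article.
NAMESPACES = {
    "cm": "http://www.alfresco.org/model/content/1.0",
    "sys": "http://www.alfresco.org/model/system/1.0",
}

# Suffixes copied from the cm:name listing above (assumed, may vary per type).
SUFFIXES = ["", ".u", ".__", ".__.u", ".sort"]


def index_field_names(prefixed_name):
    """Expand e.g. 'cm:name' into the field names seen in the index."""
    prefix, local_name = prefixed_name.split(":", 1)
    base = "@{%s}%s" % (NAMESPACES[prefix], local_name)
    return [base + suffix for suffix in SUFFIXES]


for field_name in index_field_names("cm:name"):
    print(field_name)
```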

The Alfresco schema.xml meets the requirement to have a unique key and no duplicate rows. The unique key maps to the document ID. This unique key is like a primary key in SQL. The uniqueKey element in the schema.xml designates that the unique key is ID.

The last element in the schema is called defaultSearchField and it sets the default field to use when searching if it was not specified in the q parameter such as in the following example:

https://localhost:8443/solr/alfresco/select?q=Doc-1

 

This is the same as the following query:

https://localhost:8443/solr/alfresco/select?q=ID:Doc-1

Now, is there any problem with having just one dynamic field representing all possible fields that we want to index? Yes, it could be if we wanted to do special processing for a field during index or query time. The field would then have to be added to the schema. Let’s say we have a date field that we are doing range queries based on but they are not performing very well. Then we could specify a new field definition for this date property and use the Trie (http://en.wikipedia.org/wiki/Trie) field types for better range query performance (can be up to as much as 40x faster than standard range queries).

Let’s assume we have a published aspect with a publishing date as follows:


<aspect name="mc:published">
   <title>Web Publishing information</title>
   <description>Information about where and when content was published to the Web</description>
   <properties>
      <property name="mc:publishedDate">
         <title>First published date</title>
         <type>d:date</type>
      </property>
      <property name="mc:publishedLocation">
         <title>Published to location (Test, Staging, Live)</title>
         <type>d:text</type>
      </property>
   </properties>
</aspect>
   

When this custom model is loaded Solr will index the new properties with field names as follows:


@{http://www.mycompany.com/model/content/1.0}publishedDate
...
@{http://www.mycompany.com/model/content/1.0}publishedLocation
...

We can then update the schema for live content to get better date range query performance for the publishedDate property:


<?xml version="1.0" encoding="UTF-8"?>
<schema name="alfresco" version="1.0">
   <types>
      <fieldType name="alfrescoDataType" class="org.alfresco.solr.AlfrescoDataType">
         <analyzer>
            <tokenizer class="org.apache.solr.analysis.WhitespaceTokenizerFactory" />
            <filter class="org.apache.solr.analysis.WordDelimiterFilterFactory"
                    generateWordParts="1"
                    generateNumberParts="1"
                    catenateWords="1"
                    catenateNumbers="1"
                    catenateAll="1"
                    splitOnCaseChange="1"
                    splitOnNumerics="1"
                    preserveOriginal="1"
                    stemEnglishPossessive="1"/>
            <filter class="org.apache.solr.analysis.LowerCaseFilterFactory" />
         </analyzer>
      </fieldType>
      <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6"
                 positionIncrementGap="0"/>
   </types>
   <fields>
      <field name="ID" type="alfrescoDataType" indexed="true" omitNorms="true" stored="true" multiValued="true"></field>
      <field name="@{http://www.mycompany.com/model/content/1.0}publishedDate" type="tdate" indexed="true" stored="true"/>
      <dynamicField name="*" type="alfrescoDataType" indexed="true" omitNorms="true" stored="true" multiValued="true"></dynamicField>
   </fields>
   <uniqueKey>ID</uniqueKey>
   <defaultSearchField>ID</defaultSearchField>
</schema>

When Alfresco is started up you will see a new file popping up in the alf_data\solr\workspace-SpacesStore\alfrescoModels folder and it is called something like mc.contentModel.937173980.xml. The name is built up of the content model namespace prefix (i.e. mc) and the content model name (i.e. contentModel) plus a randomly generated number. This means that Solr has talked to Alfresco and updated itself with the new custom content model.
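
If you want to check from a script which models Solr has cached, the file name can be split into its parts. This is only a sketch based on the single example name above; the exact naming rules (for instance whether the number can be negative) are an assumption.

```python
import re

# prefix.modelName.number.xml, as in mc.contentModel.937173980.xml
MODEL_FILE = re.compile(r"^(?P<prefix>[^.]+)\.(?P<model>.+)\.(?P<number>\d+)\.xml$")


def parse_model_filename(filename):
    """Split a cached model file name into (prefix, model name, number)."""
    match = MODEL_FILE.match(filename)
    if match is None:
        raise ValueError("unexpected model file name: %s" % filename)
    return match.group("prefix"), match.group("model"), int(match.group("number"))


print(parse_model_filename("mc.contentModel.937173980.xml"))
```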

If you enable logging in Tomcat (in alfresco/tomcat/conf/server.xml enable the AccessLogValve at the end of the file) then you will see something like this in the logs when Solr talks to Alfresco:


127.0.0.1 - CN=Alfresco Repository Client, OU=Unknown, O=Alfresco Software Ltd., L=Maidenhead, ST=UK, C=GB [09/May/2012:09:24:08 +0100] "POST /alfresco/service/api/solr/modelsdiff HTTP/1.1" 200 242
127.0.0.1 - CN=Alfresco Repository Client, OU=Unknown, O=Alfresco Software Ltd., L=Maidenhead, ST=UK, C=GB [09/May/2012:09:24:08 +0100] "GET /alfresco/service/api/solr/model?modelQName=%7Bhttp%3A%2F%2Fwww.mycompany.com%2Fmodel%2Fcontent%2F1.0%7DcontentModel HTTP/1.1" 200 1566

Solr will first ask Alfresco if there are any new or updated models (i.e. /modelsdiff) and then download any updated stuff.
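
The modelQName parameter in the second request is URL-encoded, which makes the log hard to read. A few lines of Python decode it; the helper name is made up for illustration and only the request path from the log line above is used.

```python
from urllib.parse import parse_qs, urlsplit


def model_qname_from_request(request_path):
    """Return the decoded modelQName parameter of a request path, or None."""
    query = urlsplit(request_path).query
    values = parse_qs(query).get("modelQName", [])
    return values[0] if values else None


path = ("/alfresco/service/api/solr/model?modelQName="
        "%7Bhttp%3A%2F%2Fwww.mycompany.com%2Fmodel%2Fcontent%2F1.0%7DcontentModel")
print(model_qname_from_request(path))
```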

If the publishedDate field has been indexed properly you should see something like this if you access the fields via the admin console (https://localhost:8443/solr/alfresco/admin/schema.jsp), and have one content item that has the mc:published aspect applied:

In this case with a custom field definition Solr will not create all these other fields ending in .__, .sort etc for publishedDate, it is just indexed with one field.

Running Solr on a separate tomcat/box

In a production environment you would want to run Solr in its own Apache Tomcat on its own box. This is good so you can scale and monitor it separately from the application server environment where Alfresco is running, and most of all it allows for easy clustering of the Alfresco application servers as indexes are kept on the Solr box (or by a Solr cluster if you need to scale). I will use a Windows 7 box to run Alfresco 4.0 and an Ubuntu 10.04 box to run Apache Solr 1.4 (patched by Alfresco).

To do this, start by downloading the Solr distribution from the support.alfresco.com portal. It should be called something like alfresco-enterprise-solr-4.0.0.zip (or, if using community, alfresco-community-solr-4.0.d.zip).

This ZIP file contains a file structure looking like this:

So basically we have everything that we saw under alf_data/solr in the full Alfresco installation directory structure, except the index directories (i.e. /archive and /workspace), as they are not created until Solr starts and talks to Alfresco about what should be indexed, what custom models exist etc.

Installing Solr on Ubuntu and configuring it to talk to Alfresco

The first thing we need to do is download and install Apache Tomcat 6.0.29 on the Linux box (I am assuming Java has already been installed, if not install OpenJDK 1.6.0.20 with sudo apt-get install openjdk-6-jdk):

$ sudo apt-get install tomcat6

This will install a Tomcat server with just a default ROOT webapp that displays a minimal "It works" page by default. Before moving on shut down Tomcat:

$ sudo /etc/init.d/tomcat6 stop

Next step is to configure Tomcat for the Apache Solr web application and the Alfresco cores/indexes. I have unpacked the alfresco-enterprise-solr-4.0.0.zip file in the /home/martin/alfresco-enterprise-solr-4.0.0 directory on the Ubuntu box.

Copy the unpacked files to a directory under /var/lib/tomcat6 and set them up as accessible by the tomcat6 user:


/var/lib/tomcat6$ sudo mkdir data
/var/lib/tomcat6$ sudo chown tomcat6 data/
/var/lib/tomcat6$ sudo chgrp tomcat6 data/
/var/lib/tomcat6$ cd data/
/var/lib/tomcat6/data$ sudo cp -r /home/martin/alfresco-enterprise-solr-4.0.0/* .
/var/lib/tomcat6/data$ sudo chown -R tomcat6 *
/var/lib/tomcat6/data$ sudo chgrp -R tomcat6 *

We also need to copy the Alfresco Repository keystore to the Ubuntu box as it will be used by Tomcat to manage HTTPS connections (this keystore is not part of the Alfresco Solr distribution). It is normally found under alfresco/alf_data/keystore or under alfresco/tomcat/webapps/alfresco/WEB-INF/classes/alfresco/keystore. I am going to copy it from alfresco/alf_data/keystore under the Alfresco 4.0 installation on the Windows box to my home directory on the Ubuntu box. Then I will copy it to the Tomcat installation:


/var/lib/tomcat6/data$ sudo cp -r /home/martin/keystore .
/var/lib/tomcat6/data$ sudo chown -R tomcat6 keystore/
/var/lib/tomcat6/data$ sudo chgrp -R tomcat6 keystore/

Now set up Tomcat to deploy the Solr web application. The Alfresco Solr distribution comes with a web application context file that we can use:


/var/lib/tomcat6$ sudo cp data/solr-tomcat-context.xml conf/Catalina/localhost/solr.xml
/var/lib/tomcat6$ cd conf/Catalina/localhost/
/var/lib/tomcat6/conf/Catalina/localhost$ sudo chown tomcat6 solr.xml
/var/lib/tomcat6/conf/Catalina/localhost$ sudo chgrp tomcat6 solr.xml

Update the solr.xml so paths match the installation, set the location of the Solr war file and the location of the Solr home directory:


<?xml version="1.0" encoding="utf-8"?>
<Context docBase="/var/lib/tomcat6/data/apache-solr-1.4.1.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/var/lib/tomcat6/data" override="true"/>
</Context>

Next thing we need to do is update each core’s configuration and tell it where the index data dir is and where Alfresco is running (the instance directory for each core has already been configured in the /var/lib/tomcat6/data/solr.xml file so no need to change):

/var/lib/tomcat6/data/workspace-SpacesStore/conf$ sudo vi solrcore.properties

Then set the following properties (the IP address is where my Alfresco 4.0 server is running):


data.dir.root=/var/lib/tomcat6/data
alfresco.host=192.168.0.2

Then set these properties to the same values for the archive core:

/var/lib/tomcat6/data/archive-SpacesStore/conf$ sudo vi solrcore.properties
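
Since the same two properties have to be set in both cores, the edit can be scripted instead of done by hand in vi. The following is a minimal sketch: set_properties is a made-up helper that works on the text of a properties file, and the sample content is hypothetical, not the real default solrcore.properties.

```python
def set_properties(text, updates):
    """Return properties text with the given keys set, keeping other lines."""
    remaining = dict(updates)
    lines = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            # replace the existing value of a key we want to change
            lines.append("%s=%s" % (key, remaining.pop(key)))
        else:
            lines.append(line)
    # append keys that were not present in the file
    for key, value in remaining.items():
        lines.append("%s=%s" % (key, value))
    return "\n".join(lines) + "\n"


# Hypothetical file content, for illustration only.
sample = "data.dir.root=/old/path\nalfresco.host=localhost\nalfresco.port=8080\n"
updated = set_properties(sample, {
    "data.dir.root": "/var/lib/tomcat6/data",
    "alfresco.host": "192.168.0.2",
})
print(updated)
```

To apply this for real you would read and write the solrcore.properties file under both workspace-SpacesStore/conf and archive-SpacesStore/conf, with sufficient permissions.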

Alfresco Repository <-> Solr communications are protected by SSL with mutual authentication out of the box. Both the repository and Solr have their own private/public RSA key pair, signed by an Alfresco Certificate Authority.

So for Alfresco to be able to talk over HTTPS with Solr we need to configure that in server.xml:

/var/lib/tomcat6/conf$ sudo vi server.xml

Define a new SSL HTTP Connector on port 8443 as follows, the keystore files are available in both cores and I am using the files in the workspace core:


<Connector port="8443" protocol="org.apache.coyote.http11.Http11Protocol" SSLEnabled="true"
maxThreads="150" scheme="https"
keystoreFile="/var/lib/tomcat6/data/keystore/ssl.keystore" keystorePass="kT9X6oe68t" keystoreType="JCEKS"
secure="true" connectionTimeout="240000" truststoreFile="/var/lib/tomcat6/data/keystore/ssl.truststore" truststorePass="kT9X6oe68t" truststoreType="JCEKS"
               clientAuth="false" sslProtocol="TLS" allowUnsafeLegacyRenegotiation="true" /> 
			   

Add the following user to the tomcat-users.xml file located in the /var/lib/tomcat6/conf directory, this will allow the Alfresco Repository to SSL authenticate with Solr:

<tomcat-users>

<user username="CN=Alfresco Repository, OU=Unknown, O=Alfresco Software Ltd., L=Maidenhead, ST=UK, C=GB" roles="repository" password="null"/>

</tomcat-users>

This should be all that is needed to set up Alfresco Solr in a separate Tomcat. Make sure Alfresco is running on the Windows box and that https://192.168.0.2:8443/alfresco is reachable from the Ubuntu box (192.168.0.2 is the IP address of my Windows box), for example by trying to telnet to port 8443 on that box.

If the connection attempt hangs, you need to open up the Windows firewall for incoming HTTPS connections (i.e. Windows Firewall | Inbound Rules | New Rule | Tomcat Secure Connections on port 8443).
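
The reachability check can also be scripted, which is handy if telnet is not installed. This is a minimal sketch using a plain TCP connection with a timeout; it only tells you that the port is open, not that the SSL handshake or the client certificate works. The function name can_connect is made up for illustration.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers connection refused, unreachable host and timeouts
        return False


# Example call from the Ubuntu box (IP address of my Windows box):
# can_connect("192.168.0.2", 8443)
```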

Then start Solr on the Ubuntu box:

$ sudo /etc/init.d/tomcat6 start

We should now see the two core/index directories being created:

/var/lib/tomcat6/data/archive

/var/lib/tomcat6/data/workspace

The custom content models should be loaded into the following directory, which should also be created:

/var/lib/tomcat6/data/workspace-SpacesStore/alfrescoModels

Configure Alfresco to use a stand-alone Solr server

We now have Solr on the separate box talking to Alfresco and indexing the content store. However, Alfresco is still using the local Solr installation, so we need to tell it about the new Solr server.

Open up alfresco-global.properties located in the C:\Alfresco\tomcat\shared\classes directory and add the solr.host property specifying the new Solr host:


### Solr indexing ###
index.subsystem.name=solr
dir.keystore=${dir.root}/keystore
solr.port.ssl=8443
solr.host=192.168.0.11
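
A quick way to double check these settings is to derive the Solr URL that they imply and open it in a browser. The sketch below is an illustration only: the helper names are made up, and the /solr context path and the alfresco core name are assumptions taken from the URLs used earlier in this article.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def solr_core_url(props, core="alfresco"):
    """Build the HTTPS URL of a Solr core from the Alfresco properties."""
    return "https://%s:%s/solr/%s" % (props["solr.host"], props["solr.port.ssl"], core)


settings = parse_properties(
    "### Solr indexing ###\n"
    "index.subsystem.name=solr\n"
    "dir.keystore=${dir.root}/keystore\n"
    "solr.port.ssl=8443\n"
    "solr.host=192.168.0.11\n"
)
print(solr_core_url(settings))
```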

Then disable the Solr web app running locally by renaming solr.xml in C:\Alfresco\tomcat\conf\Catalina\localhost to solr.xml.bak, and remove the alfresco\tomcat\webapps\solr directory.

Now restart Alfresco and it should start talking to the Solr installation on the Ubuntu box when it is doing searches etc.

For Solr to be able to call the repository over HTTPS, the Tomcat that runs Alfresco on the Windows box needs a corresponding SSL connector on port 8443 in its server.xml:


<Connector port="8443" protocol="org.apache.coyote.http11.Http11Protocol" SSLEnabled="true"
maxThreads="150" scheme="https" keystoreFile="C:\Alfresco/alf_data/keystore/ssl.keystore" keystorePass="kT9X6oe68t" keystoreType="JCEKS"
secure="true" connectionTimeout="240000" truststoreFile="C:\Alfresco/alf_data/keystore/ssl.truststore" truststorePass="kT9X6oe68t" truststoreType="JCEKS"
               clientAuth="false" sslProtocol="TLS" allowUnsafeLegacyRenegotiation="true" /> 
			   

Add a user to tomcat-users.xml so Solr can SSL authenticate with Alfresco Repository:

<tomcat-users>

<user username="CN=Alfresco Repository Client, OU=Unknown, O=Alfresco Software Ltd., L=Maidenhead, ST=UK, C=GB" roles="repoclient" password="null"/>

</tomcat-users>

How to rebuild the indexes

Last thing that might be useful is how to rebuild the indexes from scratch.

Do as follows:

  1. Stop Tomcat that runs Solr web application
  2. Remove index data of archive core at alf_data/solr/archive/SpacesStore
  3. Remove index data of workspace core at alf_data/solr/workspace/SpacesStore
  4. Remove cached content model info of archive core at alf_data/solr/archive-SpacesStore/alfrescoModels/*
  5. Remove cached content model info of workspace core at alf_data/solr/workspace-SpacesStore/alfrescoModels/*
  6. Restart Tomcat that runs Solr web application
  7. Wait a bit for Solr to start catching up...

Note. index.recovery.mode=FULL is not used by Solr - only by Lucene
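
Steps 2 to 5 can be scripted if you rebuild indexes often. The following is a minimal sketch, assuming the directory layout from the list above; the function names are made up, it defaults to a dry run that only reports what would be removed, and Tomcat must already be stopped before running it for real.

```python
import shutil
from pathlib import Path


def rebuild_targets(alf_data):
    """Return (index directories, cached model directories) for both cores."""
    solr = Path(alf_data) / "solr"
    index_dirs = [solr / "archive" / "SpacesStore",
                  solr / "workspace" / "SpacesStore"]
    model_dirs = [solr / "archive-SpacesStore" / "alfrescoModels",
                  solr / "workspace-SpacesStore" / "alfrescoModels"]
    return index_dirs, model_dirs


def clear_indexes(alf_data, dry_run=True):
    """Remove index data and cached model files; report what was (or would be) removed."""
    index_dirs, model_dirs = rebuild_targets(alf_data)
    removed = []
    for directory in index_dirs:
        if directory.exists():
            removed.append(directory)
            if not dry_run:
                shutil.rmtree(directory)
    for directory in model_dirs:
        if not directory.exists():
            continue
        for model_file in sorted(directory.iterdir()):
            if model_file.is_file():
                removed.append(model_file)
                if not dry_run:
                    model_file.unlink()
    return removed
```

Call clear_indexes with the path to your alf_data directory first and inspect the returned list, then call it again with dry_run=False.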
