Google Search Appliance - Connector Developer's Guide: Introduction

Google Search Appliance software version 6.2
Connector manager version 2.4.0
Posted December 2009

This section:

Describes a connector and the connector manager
Provides an overview of how you can create a connector
Describes Spring Framework and how connectors are instantiated
Provides an understanding of connector feed types, authentication and authorization, localizing, scheduling, and search results
Explains how the connector manager pushes documents to the Google Search Appliance

For connector terminology definitions, see the Google Enterprise Glossary. This document complements the Javadoc provided at the Google connector manager open source code site.


What is a Connector?

A connector is a Java application that you create. It contains the classes and methods that the connector manager calls to acquire documents from a content management system and to authenticate and authorize users to view search results.

The following illustration shows the relationship between a Google Search Appliance and a connector:

Connector architecture

Components:

The Google Search Appliance provides the Admin Console. You can use the Admin Console to identify a connector manager and configure a connector using a configuration form that the connector provides. The search appliance contains the feed interface, to which the connector manager sends documents for the search appliance to index.
The connector manager resides on a separate server that runs a servlet container (similar to a web server) and exchanges requests and responses with the search appliance over HTTP. The connector manager:
1. Communicates through its SPI (service provider interface) to the connector by calling interfaces or methods to perform tasks.
2. Creates feed sources from the documents, metadata, and URLs that a connector acquires from a content management system (CMS).
The SPI defines the Java interfaces and methods that a connector must contain. The connector manager makes SPI calls to request the connector to acquire documents, authenticate and authorize users to access search results, and to post the configuration form and validate its content.
A connector:
- Is a Java application that contains SPI functions that the connector manager calls.
- Resides on the same server as the connector manager.
- Interacts with a web interface on a content management system using the API for the content management system.
- Communicates through the content management system's API to perform the tasks requested by the connector manager's SPI.
- Provides a configuration form that displays in the Admin Console so that administrators can enter credentials for the access account of the content management system, specify the listening port, and indicate how often the connector should acquire documents from the content management system.
The CMS API consists of functions native to the content management system, which the connector calls to acquire documents, metadata on each document, and a URL that points to a document's location in the CMS.
A content management system consists of a content server that manages documents in a storage system. The storage system is known as a repository. A content management system needs to provide three components for each of its documents: the document itself, metadata that describes the document, and a URL that provides the document's location in the content management system. Some content management systems provide only metadata and a URL.

Why Create a Connector?

The reasons to create a connector are:

You have a content management system with a web interface from which you want to acquire documents for indexing by the Google Search Appliance.
You want to improve the search appliance's access to content, for example, when the content is obscured by too much JavaScript, Flash code, or other document coding schemes that restrict the access of the search appliance.
You want to associate the metadata for a document with its content.
You want to provide more efficient batch authorization instead of requiring the search appliance to issue an individual HEAD request to authorize each document in the search results.

What is the Connector Manager?

The connector manager is an open source Java application that manages communications between the Google Search Appliance and a connector.

The connector manager is provided in open source at http://google-enterprise-connector-manager.googlecode.com.

The connector manager is part of the Google Enterprise Connector Framework, which consists of the connector manager, SPI, Javadoc for the SPI, and Google support for the connector manager. Google provides open source code for the connector manager on a project code site with access to downloads, issue information, additional documentation, and discussion groups. You can return to the project code site regularly for software and documentation updates.

How the Connector Manager Pushes Documents to the Google Search Appliance

The connector manager provides the DocPusher implementation, which implements the Pusher interface in the connector manager. DocPusher processes documents from a connector and creates a feed source that is conveyed to the search appliance. (Documents in this sense can be any content received from a content management system, such as documents, metadata, and URLs.) The DocPusher enables connectors to add documents to or delete documents from the search appliance.

You can view the source code for DocPusher.java in the connector manager open source site.

The following picture shows how a document moves between a connector and a search appliance.

Connector pushes documents to the connector manager's DocPusher function, which uses the document information to create a feed to the search appliance

The components in this illustration are:

The connector provides documents from the content management system.
The connector manager pulls a document from the connector by calling the connector's DocumentList.nextDocument method. DocPusher wraps the document in an XML structure and pushes the document as a feed source to the search appliance. For more information, see DocumentList Interface in the SPI Overview.
The DocPusher implementation:
- Acquires documents from a connector and creates an XML structure that represents the documents.
- Opens a connection with the search appliance and provides the XML structure and the connector name as a feed source to the search appliance.
- If the connector has set the PROPNAME_SEARCHURL property, DocPusher expects the connector to provide a metadata-and-URL feed. If the property is not set, then DocPusher expects the connector to provide a content feed.
- Creates the googleconnector:// URL for the documents in the feed source.
- Sets the feed data source depending on the connector instance name.
The Google Search Appliance acknowledges that the feed was successfully received.
The search appliance sends back to DocPusher values for:
- Success
- Unauthorized response
- Internal error
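
The feed-type decision and URL construction in the list above can be pictured with a small sketch. This is not the DocPusher source: the class name FeedSketch, the map keys, the "content" label, and the exact layout of the googleconnector:// URL are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the decisions described above; not the DocPusher source.
class FeedSketch {
    // Map keys assumed for this sketch only.
    static final String DOC_ID = "docid";
    static final String SEARCH_URL = "searchurl";

    // A document that carries a search URL goes into a metadata-and-URL feed;
    // any other document goes into a content feed (labels are illustrative).
    static String feedType(Map<String, String> doc) {
        return doc.get(SEARCH_URL) != null ? "metadata-and-url" : "content";
    }

    // Uses the supplied search URL when present; otherwise builds a
    // googleconnector:// URL (layout assumed) from connector name and document ID.
    // A real implementation would also URL-encode the document ID.
    static String recordUrl(String connectorName, Map<String, String> doc) {
        String searchUrl = doc.get(SEARCH_URL);
        if (searchUrl != null) {
            return searchUrl;
        }
        return "googleconnector://" + connectorName + "/doc?docid=" + doc.get(DOC_ID);
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put(DOC_ID, "42");
        System.out.println(feedType(doc) + " " + recordUrl("helloworld", doc));

        doc.put(SEARCH_URL, "http://cms.example.com/docs/42");
        System.out.println(feedType(doc) + " " + recordUrl("helloworld", doc));
    }
}
```

The point of the sketch is that the connector does not choose the feed type explicitly; the presence or absence of a search URL on the document decides it.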

DocPusher Logging

DocPusher logs messages to the ${catalina.base}/logs/google-connectors.%g.log file, where ${catalina.base} is the Apache Tomcat installation folder. The connector manager replaces %g with the log generation number, so the log files match the pattern google-connectors.*.log in the Tomcat log directory.

Another useful DocPusher log file is the Feed log, written to the ${catalina.base}/logs/google-connectors.feed%g.log file, which contains the XML Feed records pushed to the search appliance (without the content). The connector manager substitutes %g with a numerical value.

For more information, see Logging.

You can use the warnings and errors that DocPusher writes to the log file to diagnose problems in a connector. For example, DocPusher generates the following messages on the state of a connector:

Property names set is empty
Encoding error
Document missing required property
Supplied search URL <url> is malformed
IO error. failed to create a default document stream
Skipped this document for feeding, continuing
Cannot write file
Client is not authorized to send feeds, make sure the Google Search Appliance is configured to trust feeds from your host.
Cannot close file

Note: If you see messages about not being able to send feeds, verify that the following in the Admin Console is correct:

On the Admin Console, ensure that the Crawl and Index > Crawl URLs > Follow and Crawl Only URLs with the Following Patterns field contains the ^googleconnector:// statement.
Ensure that the Crawl and Index > Feeds > List of Trusted IP Addresses field is set to "Only trust feeds from these IP addresses" and contains the IP address of the server on which the connector manager is running. You can also set the List of Trusted IP Addresses field to "Trust feeds from all IP addresses" if that setting agrees with your security policy.

Understanding Connector Development

A connector consists of classes and helper methods that you code to implement the SPI. You also need to create Spring Framework XML files to describe connector components.

The connector manager uses Spring Framework to create an instance of a connector from the information you supply in the XML files.

To develop a connector, you need access to the following components:

Content management system. The system that manages the documents to search.
Connector host system. A computer on which the connector manager and its connectors run.

When developing a connector, you can use a single computer for the connector host system and the development systems. When you install a connector in a production environment, the connector host is a distinct machine, such as a dedicated servlet container on your production network.

Before creating a connector, you need to understand the following topics:

The API for the content management system and how to use API functions to perform the following tasks:
- Traversing documents from the content management system:
  - Passing login credentials from the SPI functions to the API functions to start a session with a content management system.
  - Querying for documents.
  - Receiving documents, metadata for each document, and a URL for each document, and converting content results into a document list with associated properties.
  - Resuming a document query and acquiring more content.
- Handling search queries from end users:
  - Authenticating users by requesting their user name and password credentials from the API.
  - Authorizing users so that they can view secured access documents.
How to match the metadata that the content management system provides to the metadata properties that the connector manager requires. A connector works well when a content management system meets the following criteria:
- Manages documents and their associated metadata.
- Provides a unique document ID, a MIME type, and optionally, a last modified date.

Creating and deploying a connector consists of the following tasks:

Installing the content management system and ensuring that the system is serving documents.
Installing software and testing a connector with all components to ensure the connector framework setup works. For more information on running a test connector, see Getting Started.
Creating the connectorType.xml file to configure a connector type. For more information, see Configuration.
Creating a configuration form for the Admin Console specifying how to access the content management system. For more information, see Configuration.
Creating the connectorInstance.xml file to indicate to the connector manager how to instantiate your connector. For more information, see Configuration.
Creating the traversal classes and methods. For more information, see Traversing Documents.
Creating authentication and authorization classes and methods if the content management system contains controlled-access content. For more information, see Authentication and Authorization.
Deploying your connector and the connector manager in a servlet container such as Apache Tomcat. Connectors require JDK version 1.5 or later.

Spring Framework's Relation to a Connector

The Spring Framework is an open source software application framework that the connector manager uses to create connector instances.

The following illustration shows how Spring Framework communicates with each connector component:

Config interface components

Communication process:

The Spring Framework uses the parameters in the connectorType.xml file to instantiate the ConnectorType at the same time that Spring instantiates the context of the connector manager. When you develop your connector, you create the connector type that specifies the XHTML for a configuration form.
When the administrator adds a connector of that type, the ConnectorType object is consulted for the configuration form.
The connector manager sends the configuration form to the search appliance, the administrator fills in the form values, and the connector manager extracts the values from the form and stores the values in a configuration data map. The connector validates the configuration form data. The search appliance conveys the configuration form values to Spring Framework.
Spring Framework writes the configuration form information to the .properties file.
Spring Framework uses the data in the configuration form information to substitute the placeholders in the connectorDefaults.xml and connectorInstance.xml files. Spring Framework uses the bean definitions in the connectorDefaults.xml and connectorInstance.xml files to create a connector instance. The connector instance communicates with the content management system through its API.

For information on the Spring Framework DTD, see spring-beans.dtd at SpringFramework.org.

The Spring Framework requires files that contain special XML tags and attributes that Spring uses to locate a connector and its components.

The following is an example of a Spring configuration file that identifies a connector:

<beans>
  <bean id="helloworld-connector" 
        class="com.acme.connector.HelloWorldConnectorType">
  </bean>
</beans>

Where:

id - Names the bean. The name itself is unimportant; Spring uses it only internally.
class - Gives the fully qualified name of the HelloWorldConnectorType class.

Spring Framework provides Inversion of Control (IoC) to instantiate the connector manager and the connector.

Spring is available from http://www.springframework.org and the Spring version 2.5.6 .jar file that Google supports is available in the software distribution for the connector manager in the projects/connector-manager/third-party/prod directory.

SpringFramework.org provides resources for the enterprise Java community. See also the Introduction to the Spring Framework article, by Rod Johnson.

To work with Spring, you need Java J2SE version 1.5 or later.

XML Configuration Files

You can provide connector type information in the connectorType.xml file and provide connector instance information in the connectorInstance.xml file:

connectorType.xml, connectorInstance.xml, and connectorDefaults.xml are Java Bean definition files.
connectorType.xml and connectorDefaults.xml are invariant across all connector instances of that type, such that all connector instances derive information from a common connectorType.xml and connectorDefaults.xml.
The connectorInstance.xml and ConnectorName.properties files may contain configuration information that is unique to each connector instance. This information distinguishes one connector instance from another.

Spring Framework Instantiates a Connector

The Spring Framework instantiates a connector as follows:

1. Server software starts. The server application starts, followed by the connector manager's web application.
2. Spring Framework starts. The startup servlet sets the servlet context and creates an XML web application context that engages the Spring Framework.
3. Spring Framework instantiates the connector manager. Spring Framework looks for the applicationContext.xml file in the WEB-INF/ directory and uses that XML file to instantiate all the beans defined in that file. The Spring beans in the applicationContext.xml file indicate the location of the connector manager and its classes.
4. Spring Framework generates a connector type. The connector manager uses Spring to instantiate a connector type for each installed type of connector (for each connectorType.xml file found on the classpath).
5. For each installed connector type, the connector manager uses Spring Framework to instantiate each connector instance, using connectorInstance.xml, connectorDefaults.xml, and the connector's .properties file.

Connector Type and Implementation Processes

Connectors consist of the following separate component groups:

Connector Type - The ConnectorType object provides an XHTML configuration form that administrators use in the Admin Console of a search appliance to specify information about how to contact a content management system and how often to acquire documents.
Connector Instance - Spring Framework uses the information from the connector configuration form to provide properties to the connector manager.
The connectorInstance.xml file is required. The connectorDefaults.xml file is optional; you can create it to hold the default settings for properties in the configuration form. If you do not have a connectorDefaults.xml file, then the placeholders for each property on the configuration form have to go in connectorInstance.xml. For the Google connectors, connectorInstance.xml is empty, and is only used to override non-placeholder values in connectorDefaults.xml.

The Admin Console creates the properties file, which is a set of name and value pairs. For example, a properties file contains a user name and password, the name of a repository, the basis of the display URL, and other values. The properties file contains the values that correspond to each field in the configuration form, plus additional values used by the connector manager.

Spring Framework injects the properties, the defaults (if present), and the information in the connectorInstance.xml file into the bean, which Spring Framework uses to create a connector instance.

Spring Framework:
1. Gets a value that the administrator enters in the configuration form.
2. Substitutes the value for its placeholder, which is in connectorDefaults.xml if that file exists, or otherwise in connectorInstance.xml.
3. Injects the value into the bean.
4. Uses the value with the rest of the configuration form values to create the connector instance.

Note: When you implement a connector, define setters for all properties.

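For example, suppose that connector_name.properties supplies the value "London Bridge is falling down" for the content property. Spring Framework substitutes that value for the ${content} placeholder in the following bean definitions: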

In connectorInstance.xml:

  <bean id="helloworld-connector"
      class="com.example.connector.HelloWorldConnector"
      parent="helloworld-connector-defaults">
    <property name="repetitions" value="5"/>
  </bean>

In connectorDefaults.xml:

  <bean id="helloworld-connector-defaults">
    <property name="text" value="${content}"/>
    <property name="repetitions" value="1"/>
  </bean>
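
The effect of the two files above can be imitated in a few lines of plain Java. This is a simplified sketch, not Spring's implementation: the class name PlaceholderSketch and its methods are hypothetical, and it assumes only that child bean properties override the parent's and that ${name} placeholders are filled from the properties file.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical imitation of placeholder substitution and parent/child override.
class PlaceholderSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} with the matching value from the properties map.
    // Placeholders without a matching property are left untouched.
    static String resolve(String value, Map<String, String> props) {
        Matcher m = PLACEHOLDER.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = props.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Instance settings override defaults; placeholders are then resolved.
    static Map<String, String> effectiveProperties(Map<String, String> defaults,
            Map<String, String> instance, Map<String, String> props) {
        Map<String, String> merged = new LinkedHashMap<>(defaults);
        merged.putAll(instance);
        for (Map.Entry<String, String> e : merged.entrySet()) {
            e.setValue(resolve(e.getValue(), props));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new LinkedHashMap<>();
        defaults.put("text", "${content}");
        defaults.put("repetitions", "1");
        Map<String, String> instance = new LinkedHashMap<>();
        instance.put("repetitions", "5");
        Map<String, String> props = new LinkedHashMap<>();
        props.put("content", "London Bridge is falling down");
        System.out.println(effectiveProperties(defaults, instance, props));
    }
}
```

Under these assumptions the connector instance ends up with text set to the value from the properties file and repetitions set to 5, because the instance definition overrides the default of 1.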

Adding a Connector

The following illustration shows the steps that occur when an administrator adds a connector:

Call sequence for the getConfigForm method

The steps are as follows:

An administrator creates a new connector at the Admin Console by clicking Connector Administration > Connectors > Add New Connector.
The connector manager determines which connectorType is to receive the request.
The connector manager calls the ConnectorType.getConfigForm method to have the connector return an XHTML form in the ConfigureResponse class.
The connector manager packages the form in an XML wrapper and sends the XML to the search appliance.
The search appliance displays the form in the Admin Console.

Editing a Connector Form and Storing the Parameters

The following illustration shows how a completed form is validated and either returned for correction or stored:

Call sequence for validateConfig method

The steps are as follows:

The administrator fills in the connector configuration form in the Admin Console with information about how the connector manager can connect to the content management system, including a host name and port, and clicks Submit.
The connector manager routes the request to the proper connector type.
The connector manager calls the ConnectorType.validateConfig method to ensure that all required information is present. The validateConfig method can also test a connection to the content management system using the ConnectorFactory class to call Spring Framework to create a connector instance to verify that the connector can communicate with the content management system. A connector can also change properties as needed in the validateConfig method.
If the validateConfig method fails, the connector returns a ConfigureResponse object that contains the XHTML form and a message to inform the administrator what corrections to make in the form. After the administrator fills in the information, this sequence repeats until the administrator specifies correct information.
If the validateConfig method succeeds, the connector manager stores property information in the ConnectorName.properties file. The connector manager also creates a connector instance and sends acknowledgment to the search appliance.
The search appliance displays the connector entry in the Admin Console at Connector Administration > Connectors > List of Connectors.
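
The validation step above can be sketched without the SPI types. The class ValidateSketch, the field names, and the convention of returning a list of problem messages are assumptions for illustration; a real connector would implement ConnectorType.validateConfig and report problems through a ConfigureResponse object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical validation logic of the kind a validateConfig method might contain.
class ValidateSketch {
    // Required form fields, assumed for this sketch.
    static final List<String> REQUIRED = Arrays.asList("host", "port", "username");

    // Returns the problems found; an empty list means the configuration is acceptable.
    static List<String> validate(Map<String, String> config) {
        List<String> problems = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = config.get(key);
            if (value == null || value.trim().isEmpty()) {
                problems.add("Missing required field: " + key);
            }
        }
        String port = config.get("port");
        if (port != null && !port.trim().isEmpty()) {
            try {
                int number = Integer.parseInt(port.trim());
                if (number < 1 || number > 65535) {
                    problems.add("Port out of range: " + port);
                }
            } catch (NumberFormatException e) {
                problems.add("Port is not a number: " + port);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("host", "cms.example.com");
        config.put("port", "http");
        System.out.println(validate(config));
    }
}
```

Collecting every problem before returning lets the administrator correct the whole form in one pass instead of resubmitting once per error.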

Getting Stored Parameters and Displaying the Form

The following illustration shows what occurs when an administrator edits the configuration of an existing connector:

Call sequence for the getPopulatedConfigForm() method

The steps are as follows:

An administrator clicks Edit to change the information for an existing connector in the Connector Administration > Connectors > List of Connectors.
The connector manager routes the request to the appropriate connector instance. The connector manager locates the existing connector instance that is being modified, gets its configuration map, and hands that off to the appropriate connectorType as a parameter to the getPopulatedConfigForm method.
The connector manager calls the connector's ConnectorType.getPopulatedConfigForm method to construct the XHTML configuration form for the Admin Console. The getPopulatedConfigForm method should use the values in the supplied configuration Map to populate the XHTML form.
The connector manager puts the XHTML code in an XML wrapper and sends the XML to the search appliance.
The search appliance displays the form in XHTML on the Admin Console. After the administrator completes the changes, the connector manager handles the form submit as described in Editing a Connector Form and Storing the Parameters.
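
Populating the form from the stored configuration map might look like the following sketch. The class PopulatedFormSketch, the field names, and the table-row markup are assumptions; the real method is ConnectorType.getPopulatedConfigForm, whose exact markup requirements are described in the Configuration chapter.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical form builder; shows only how stored values flow back into the form.
class PopulatedFormSketch {
    // Configuration fields, assumed for this sketch.
    static final List<String> FIELDS = Arrays.asList("username", "repository");

    // Escapes characters that would break an XHTML attribute value.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Builds one table row per field, pre-filled from the configuration map.
    // Fields missing from the map are rendered with an empty value.
    static String populatedForm(Map<String, String> config) {
        StringBuilder form = new StringBuilder();
        for (String name : FIELDS) {
            String value = config.getOrDefault(name, "");
            form.append("<tr><td>").append(name).append("</td>")
                .append("<td><input type=\"text\" name=\"").append(name)
                .append("\" value=\"").append(escape(value)).append("\"/></td></tr>\n");
        }
        return form.toString();
    }

    public static void main(String[] args) {
        Map<String, String> stored = new HashMap<>();
        stored.put("username", "alice");
        stored.put("repository", "docs");
        System.out.print(populatedForm(stored));
    }
}
```

Passing an empty map yields the blank form used when a connector is first added, so one builder can serve both cases.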

Content Feed and Metadata-and-URL Feed

A connector that acquires a document, its metadata, and a URL for the location of the document from a content management system provides a content feed. Alternatively, a connector that acquires just the metadata for a document and its URL provides a metadata-and-URL feed. The connector manager packages this information into an XML wrapper and passes these components to the Google Search Appliance as a feed source.

The sections that follow provide more information about each feed type.

Content Feed Connector

A connector using content feed works with the content management system's API to acquire a document for the search appliance to index, metadata for the document, and a URL that points to the document's location. The connector then requests documents from the content management system.

A connector with content feed performs the following tasks:

Traverses the content management system to index documents.
Provides documents to the connector manager as a stream of bytes. The connector manager sends the content feed to the search appliance.
Provides authentication and authorization services except:
- If the documents that the connector supplies to the search appliance are public (world readable).
- If the search appliance delegates authentication to a single sign-on (SSO) provider; the connector may provide authorization.

More information: Understanding Authentication and Authorization.

Metadata-and-URL Feed Connector

A connector using metadata-and-URL feed works with the content management system's API to provide a URL for the location of the document in the content management system and metadata associated with the document. The search appliance then crawls the content management system using the URL to locate the document.

With a metadata-and-URL feed connector, tasks are divided as follows:

The search appliance crawls the content management system using the URL supplied by the connector. Note the distinction that a metadata-and-URL feed connector does not traverse the content management system. The metadata-and-URL feed connector does not fetch the content from the content management system. Instead, it provides a URL and the search appliance fetches the content. With a metadata-and-URL feed, the URLs discovered on the fetched page may then be fetched according to the search appliance's normal behavior, as if that same content had been discovered by crawl (rather than by feed).
The search appliance handles authentication and authorization with the content management system. The expectation of a metadata-and-URL feed connector is that the search appliance performs authentication and authorization using the HTTP authentication methods that the search appliance supports, which are the same methods that the search appliance uses for content crawling. A metadata-and-URL feed connector is never used for authentication and authorization; it is used only for traversal.
In the Admin Console, the feed type appears as metadata-and-URL.

Understanding Authentication and Authorization

A content feed connector provides authentication and authorization services to enable users to view controlled-access documents. This section explains how a content feed connector handles authentication and authorization. For a metadata-and-URL feed connector, the search appliance handles authentication and authorization.

For a connector to authenticate correctly, a user must have the same user name and password at all content management systems served by the connectors that are associated with a single search appliance. That is, if you have two document sources that map to a user who has different credentials at each source, the search appliance can't serve from both sources.

The search appliance sends the URL for each document to the web client on the content management system along with the credentials of the user making a search request for controlled-access documents. The web client passes the request to the content server, which returns the document or an error message.

The following illustration shows the request and response sequence between the search appliance and the connector manager when a user requests to view controlled-access documents:

Authorization Flow Diagram

For a detailed explanation of each component, see Batch Authorization.

When a user requests a search of controlled-access content, an authentication request from the search appliance is sent to the connector manager, which sends back an authentication response from the connector communicating with the content management system. If the user is correctly authenticated, the search appliance requests authorization and the connector manager sends back a response from the content management system to either allow or deny a user's access to the requested content.

Authentication and authorization in a connector operate as follows:

Authentication - Called no matter what type of connector you have (content feed or metadata-and-URL feed). Authentication depends on the configuration in the Admin Console and uses a priority-based algorithm. The search appliance gives all the other mechanisms (SSL, SSO, LDAP) a chance before it tries the connector mechanism. If the user's identity has been verified by the time the connector mechanism is attempted, the search appliance simply translates that identity for potential later use with connector authorization. If the user's identity is not verified, then the search appliance consults the connector to authenticate the user. If you set up LDAP, for example, and the user is authenticated using LDAP, then the connector is not consulted.
Authorization - If the URL of a document starts with googleconnector://, the connector is called. If the URL does not start with googleconnector://, authorization in the connector is not called. This is also a priority-based algorithm. The googleconnector:// URL condition exists only in connectors using content feed.
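
The two rules above can be summarized in a short sketch. It models only the ordering and the URL test as described in this section; the class SecuritySketch and the mechanism names are hypothetical, and the search appliance's actual logic is internal to the appliance.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of the priority rules described above.
class SecuritySketch {
    // Tries each mechanism in insertion order and returns the name of the first
    // one that verifies the user, or null if none does. Placing the connector
    // last means it is consulted only when every other mechanism fails.
    static String firstVerifier(String user, Map<String, Predicate<String>> mechanisms) {
        for (Map.Entry<String, Predicate<String>> m : mechanisms.entrySet()) {
            if (m.getValue().test(user)) {
                return m.getKey();
            }
        }
        return null;
    }

    // Connector authorization applies only to googleconnector:// URLs.
    static boolean connectorAuthorizes(String documentUrl) {
        return documentUrl.startsWith("googleconnector://");
    }

    public static void main(String[] args) {
        Map<String, Predicate<String>> mechanisms = new LinkedHashMap<>();
        mechanisms.put("ldap", user -> user.equals("alice"));
        mechanisms.put("connector", user -> user.equals("bob"));
        System.out.println(firstVerifier("alice", mechanisms));
        System.out.println(firstVerifier("bob", mechanisms));
        System.out.println(connectorAuthorizes("http://cms.example.com/docs/42"));
    }
}
```

A LinkedHashMap is used because it preserves insertion order, which stands in for the priority order of the mechanisms.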

Authentication and authorization depend on the connector feed type:

Content feed - Must provide authentication and authorization except when:
- The documents that the connector supplies to the search appliance are public (world readable).
- The search appliance delegates authentication to a single sign-on (SSO) provider. Authorization is still required in this situation.
Metadata-and-URL feed - Authentication can be called if the search appliance requires authentication and a user's identity has not been confirmed from any other security mechanism. However, the connector manager never requests authorization from a connector using metadata-and-URL feed.

For normal (non-connector-based) authorization, the search appliance combines authentication and authorization as follows:

Collects the authentication information at the start of a search query.
Runs a query.
Checks authorization with a HEAD request for each result.
Presents the authentication information and the name of the document to the content management system's web client.
The search appliance then receives either an HTTP 401 error message (authorization denied - do not display results) or a response authorizing the display of the document.

More information: Authentication and Authorization

Localizing a Connector

The methods in the ConnectorType interface support the Locale parameter. If you decide to localize your connector, you should add support for internationalization. The connector follows Java internationalization conventions. You need to add resource bundles as .properties files and build them into your connector.
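
As a rough illustration of locale-dependent form labels, the following sketch keeps its bundle text inline so that it runs on its own. The class LocalizedLabelsSketch, the keys, and the translations are assumptions; in a real connector the same key and value pairs would live in .properties files packaged with the connector and would typically be loaded with ResourceBundle.getBundle.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical label lookup keyed by the Locale passed to a ConnectorType method.
class LocalizedLabelsSketch {
    // Inline stand-ins for .properties files, keyed by language code.
    // The empty key holds the default bundle.
    private static final Map<String, String> BUNDLE_TEXT = new HashMap<>();
    static {
        BUNDLE_TEXT.put("", "username=User name\npassword=Password\n");
        BUNDLE_TEXT.put("fr", "username=Nom d'utilisateur\npassword=Mot de passe\n");
    }

    // Picks the bundle for the locale's language, falling back to the default.
    static ResourceBundle bundleFor(Locale locale) {
        String text = BUNDLE_TEXT.getOrDefault(locale.getLanguage(), BUNDLE_TEXT.get(""));
        try {
            return new PropertyResourceBundle(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String label(String key, Locale locale) {
        return bundleFor(locale).getString(key);
    }

    public static void main(String[] args) {
        System.out.println(label("password", Locale.FRENCH));
        System.out.println(label("password", Locale.GERMAN));
    }
}
```

A locale without its own bundle falls back to the default labels, which is the behavior administrators generally expect from a partially translated form.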

The service provider interface supports the use of UTF-8 character sets.

Scheduling Connector Access

The Admin Console provides scheduling for when a connector can traverse documents in a content management system with starting and ending times for traversal and the number of documents to traverse per minute.

If the connector manager does not receive content from a connector, the connector manager waits 5 minutes before asking a connector for updated content.

Because the connector manager can start and stop traversals for a connector while the connector manager manages multiple connectors, a connector must be interruptible. The connector manager monitors the state of a connector. The connector periodically saves a checkpoint string. The connector manager provides this string when restarting a connector. A connector may also store information in files on the file system.
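
One way to picture an interruptible traversal is a connector that hands out documents in batches and can resume from a saved checkpoint string. The sketch below is an assumption-laden illustration: CheckpointSketch is hypothetical, the repository is a plain list, and the checkpoint is simply the index of the next unread document. Real connectors usually checkpoint on something stable in the repository, such as a last-modified date and document ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical resumable traversal over a list that stands in for a repository.
class CheckpointSketch {
    private final List<String> repository;

    CheckpointSketch(List<String> repository) {
        this.repository = repository;
    }

    // Parses a checkpoint string; a null checkpoint means "start from the beginning".
    private static int position(String checkpoint) {
        return checkpoint == null ? 0 : Integer.parseInt(checkpoint);
    }

    // Returns up to batchSize document IDs starting at the checkpoint position.
    List<String> nextBatch(String checkpoint, int batchSize) {
        int start = Math.min(position(checkpoint), repository.size());
        int end = Math.min(start + batchSize, repository.size());
        return new ArrayList<>(repository.subList(start, end));
    }

    // Checkpoint to save once `delivered` documents from the batch were handed over.
    static String checkpointAfter(String previous, int delivered) {
        return Integer.toString(position(previous) + delivered);
    }

    public static void main(String[] args) {
        CheckpointSketch traversal =
            new CheckpointSketch(Arrays.asList("a", "b", "c", "d", "e"));
        String checkpoint = null;
        List<String> batch = traversal.nextBatch(checkpoint, 2);
        while (!batch.isEmpty()) {
            System.out.println(batch);
            checkpoint = checkpointAfter(checkpoint, batch.size());
            batch = traversal.nextBatch(checkpoint, 2);
        }
    }
}
```

Because all resume state lives in the checkpoint string, the traversal can be stopped after any batch and restarted later without re-sending documents.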

The important consideration is that you do not handle scheduling in your connector. The search appliance and connector manager handle scheduling for you.

More information: Traversing Documents, Checkpointing, File Access.

Displaying Search Results to End Users

A search appliance displays search results as a web page that contains hyperlinks to content that matches a search query. By default, each search result provides a snippet of document content, metadata, or other data to help a search user decide whether the result is relevant.

The goal of the search results page is to give a user enough information to decide to navigate to the document.

For documents that come from connectors, when a user clicks a URL in the search results page, a connector can cause users to view:

A web page. The connector displays the appropriate page in the web interface of the content management system, perhaps a page displaying metadata or content of the document. If a content management system provides its own search interface, then you may want to send users to the same place that they would have gone if they had used that search.
Document content. The connector displays the document content itself, taking advantage of the browser's ability to display a document in a proprietary format, such as PDF, using the appropriate application. As in the previous item, if a content management system provides its own search interface, then you may want to send users to the same place that they would have gone if they had used that search.
A customized servlet. The connector executes a customized servlet that you write. You can implement such a servlet separately from the connector implementation. For security reasons or if the content management system does not have a web client, this choice may be preferred.
No content. The connector does not display content and omits the URL. The user relies on the search engine's normal ability to serve a cached copy of the document.
Another location. The connector displays other content or executes an application as you determine.

You can control each target by how your connector sets the PROPNAME_DISPLAYURL property in the TraversalManager class.

More information: Metadata Properties and Traversing Documents.
