Google Search Appliance - Connector Developer's Guide: Introduction

Google Search Appliance software version 6.2
Connector manager version 2.4.0
Posted December 2009

This section:

For connector terminology definitions, see the Google Enterprise Glossary. This document complements the Javadoc provided at the Google connector manager open source code site.

Chapters: About This Guide, Introduction, Getting Started, SPI Overview, Traversing Documents,
Authentication, Authorization, Configuration, Appendix A: Building a Debug Connector Manager

Chapter Contents: Introduction

  1. What is a Connector?
    1. Why Create a Connector?
  2. What is the Connector Manager?
    1. How the Connector Manager Pushes Documents to the Google Search Appliance
    2. DocPusher Logging
  3. Understanding Connector Development
  4. Spring Framework's Relation to a Connector
    1. XML Configuration Files
    2. Spring Framework Instantiates a Connector
  5. Connector Type and Implementation Processes
    1. Adding a Connector
    2. Editing a Connector Form and Storing the Parameters
    3. Getting Stored Parameters and Displaying the Form
  6. Content Feed and Metadata-and-URL Feed
  7. Understanding Authentication and Authorization
  8. Localizing a Connector
  9. Scheduling Connector Access
  10. Displaying Search Results to End Users

What is a Connector?

A connector is a Java application that you create. It contains the classes and methods that the connector manager calls to acquire documents from a content management system and to authenticate and authorize users to view search results.

The following illustration shows the relationship between a Google Search Appliance and a connector:

Connector architecture

Components:

  1. The Google Search Appliance provides the Admin Console. You can use the Admin Console to identify a connector manager and configure a connector using a configuration form that the connector provides. The search appliance contains the feed interface, to which the connector manager sends documents for the search appliance to index.
  2. The connector manager resides on a separate server that runs a servlet container (similar to a web server) and exchanges requests and responses with the search appliance over HTTP. The connector manager:
    1. Communicates through its SPI (service provider interface) to the connector by calling interfaces or methods to perform tasks.
    2. Creates feed sources from the documents, metadata, and URLs that a connector acquires from a content management system (CMS).
  3. The SPI defines the Java interfaces and methods that a connector must implement. The connector manager makes SPI calls to ask the connector to acquire documents, to authenticate and authorize users to access search results, and to post the configuration form and validate its content.
  4. A connector:
  5. The CMS API consists of functions native to the content management system, which the connector calls to acquire documents, metadata on each document, and a URL that points to a document's location in the CMS.
  6. A content management system consists of a content server that manages documents in a storage system. The storage system is known as a repository. A content management system needs to provide three components for each of its documents: the document itself, metadata that describes the document, and a URL that provides the document's location in the content management system. Some content management systems provide only metadata and a URL.

Why Create a Connector?

The reasons to create a connector are:

What is the Connector Manager?

The connector manager is an open source Java application that manages communications between the Google Search Appliance and a connector.

The connector manager is provided in open source at http://google-enterprise-connector-manager.googlecode.com.

The connector manager is part of the Google Enterprise Connector Framework, which consists of the connector manager, SPI, Javadoc for the SPI, and Google support for the connector manager. Google provides open source code for the connector manager on a project code site with access to downloads, issue information, additional documentation, and discussion groups. You can return to the project code site regularly for software and documentation updates.

How the Connector Manager Pushes Documents to the Google Search Appliance

The connector manager provides the DocPusher implementation, which implements the Pusher interface in the connector manager. DocPusher processes documents from a connector and creates a feed source that is conveyed to the search appliance. (Documents in this sense can be any content received from a content management system, such as documents, metadata, and URLs.) The DocPusher enables connectors to add documents to or delete documents from the search appliance.

You can view the source code for DocPusher.java on the connector manager open source site.
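
The pull-and-wrap flow that DocPusher performs can be sketched in Java as follows. This is a minimal illustration and not the DocPusher source: SketchDocument, SketchDocumentList, and SketchPusher are hypothetical stand-ins for the SPI and connector manager types, the XML record is deliberately simplified and is not the actual feed format, and the convention that nextDocument returns null at the end of the list is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a document acquired from the CMS.
final class SketchDocument {
  final String url;
  final String content;

  SketchDocument(String url, String content) {
    this.url = url;
    this.content = content;
  }
}

// Hypothetical stand-in for the SPI's DocumentList: the caller pulls
// documents one at a time until nextDocument() returns null (assumed).
final class SketchDocumentList {
  private final Iterator<SketchDocument> docs;

  SketchDocumentList(List<SketchDocument> docs) {
    this.docs = docs.iterator();
  }

  SketchDocument nextDocument() {
    return docs.hasNext() ? docs.next() : null;
  }
}

// Hypothetical pusher: wraps each pulled document in a simplified XML record.
final class SketchPusher {
  // Escapes the characters that are special in XML text and attributes.
  static String escape(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;")
        .replace(">", "&gt;").replace("\"", "&quot;");
  }

  // Pulls every document from the list and returns one record per document.
  static List<String> pushAll(SketchDocumentList list) {
    List<String> records = new ArrayList<String>();
    SketchDocument doc;
    while ((doc = list.nextDocument()) != null) {
      records.add("<record url=\"" + escape(doc.url) + "\">"
          + escape(doc.content) + "</record>");
    }
    return records;
  }
}
```

The point of the sketch is the direction of control: the connector only supplies documents, and the caller decides when to pull the next one and how to package it.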

The following picture shows how a document moves between a connector and a search appliance.

Connector pushes documents to the connector manager's DocPusher function, which uses the document information to create a feed to the search appliance

The components in this illustration are:

  1. The connector provides documents from the content management system.
  2. The connector manager pulls a document from the connector by calling the connector's DocumentList.nextDocument method. DocPusher wraps the document in an XML structure and pushes the document as a feed source to the search appliance. For more information, see DocumentList Interface in the SPI Overview.
  3. The DocPusher implementation:
  4. The Google Search Appliance acknowledges that the feed was successfully received.
    The search appliance sends back to DocPusher values for:

DocPusher Logging

DocPusher logs messages to the ${catalina.base}/logs/google-connectors.%g.log file, where ${catalina.base} is the Apache Tomcat installation folder. The connector manager replaces %g with the log generation number, so the log files match google-connectors.*.log in the Tomcat log directory.

Another useful DocPusher log file is the Feed log, written to the ${catalina.base}/logs/google-connectors.feed%g.log file, which contains the XML Feed records pushed to the search appliance (without the content). The connector manager substitutes %g with a numerical value.
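
To make the file naming concrete, the following Java sketch expands a log pattern into a file path. It is only an illustration: LogPatternSketch and its resolve method are hypothetical names, the Tomcat path in the usage note is invented, and the idea that generations are small integers starting at 0 follows the %g convention of java.util.logging.FileHandler, which the connector manager is assumed (not confirmed here) to use.

```java
// Hypothetical helper that expands a log file pattern into a concrete path.
final class LogPatternSketch {
  // Replaces ${catalina.base} with the Tomcat folder and %g with the
  // generation number. Both replacements are literal, not regex-based.
  static String resolve(String pattern, String catalinaBase, int generation) {
    return pattern
        .replace("${catalina.base}", catalinaBase)
        .replace("%g", Integer.toString(generation));
  }
}
```

For example, with a Tomcat folder of /opt/tomcat and generation 0, the two patterns above resolve to /opt/tomcat/logs/google-connectors.0.log and /opt/tomcat/logs/google-connectors.feed0.log.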

For more information, see Logging.

You can use the log file warnings and errors that DocPusher generates to diagnose problems in a connector. For example, DocPusher generates the following messages about the state of a connector:

Note: If you see messages about not being able to send feeds, verify that the following in the Admin Console is correct:

Understanding Connector Development

A connector consists of classes and helper methods that you code to implement the SPI. You also need to create Spring Framework XML files to describe connector components.

The connector manager uses Spring Framework to create an instance of a connector from the information you supply in the XML files.

To develop a connector, you need access to the following components:

When developing a connector, you can use a single computer as both the connector host system and the development system. When you install a connector in a production environment, the connector host is a distinct machine, such as a dedicated servlet container on your production network.

Before creating a connector, you need to understand the following topics:

Creating and deploying a connector consists of the following tasks:

  1. Installing the content management system and ensuring that the system is serving documents.
  2. Installing software and testing a connector with all components to ensure the connector framework setup works. For more information on running a test connector, see Getting Started.
  3. Creating the connectorType.xml file to configure a connector type. For more information, see Configuration.
  4. Creating a configuration form for the Admin Console specifying how to access the content management system. For more information, see Configuration.
  5. Creating the connectorInstance.xml file to indicate to the connector manager how to instantiate your connector. For more information, see Configuration.
  6. Creating the traversal classes and methods. For more information, see Traversing Documents.
  7. Creating authentication and authorization classes and methods if the content management system contains controlled-access content. For more information, see Authentication and Authorization.
  8. Deploying your connector and the connector manager in a servlet container such as Apache Tomcat. Connectors require JDK version 1.5 or later.

Spring Framework's Relation to a Connector

The Spring Framework is an open source software application framework that the connector manager uses to create connector instances.

The following illustration shows how Spring Framework communicates with each connector component:

Config interface components

Communication process:

  1. The Spring Framework uses the parameters in the connectorType.xml file to instantiate the ConnectorType at the same time that Spring instantiates the context of the connector manager. When you develop your connector, you create the connector type that specifies the XHTML for a configuration form.
  2. When the administrator adds a connector of that type, the ConnectorType object is consulted for the configuration form.
  3. The connector manager sends the configuration form to the search appliance, the administrator fills in the form values, and the connector manager extracts the values from the form and stores the values in a configuration data map. The connector validates the configuration form data. The search appliance conveys the configuration form values to Spring Framework.
  4. Spring Framework writes the configuration form information to the .properties file.
  5. Spring Framework uses the configuration form data to replace the placeholders in the connectorDefaults.xml and connectorInstance.xml files. Spring Framework then uses the bean definitions in those two files to create a connector instance. The connector instance communicates with the content management system through its API.

For information on the Spring Framework DTD, see spring-beans.dtd at SpringFramework.org.

The Spring Framework requires files that contain special XML tags and attributes that Spring uses to locate a connector and its components.

The following is an example of a Spring configuration file that identifies a connector:

<beans>
  <bean id="helloworld-connector" 
        class="com.acme.connector.HelloWorldConnectorType">
  </bean>
</beans> 

Where:

Spring Framework provides Inversion of Control (IoC) to instantiate the connector manager and the connector.

Spring is available from http://www.springframework.org. The Spring version 2.5.6 .jar file that Google supports is included in the software distribution for the connector manager, in the projects/connector-manager/third-party/prod directory.

SpringFramework.org provides resources for the enterprise Java community. See also the Introduction to the Spring Framework article, by Rod Johnson.

To work with Spring, you need Java J2SE version 1.5 or later.

XML Configuration Files

You can provide connector type information in the connectorType.xml file and provide connector instance information in the connectorInstance.xml file:

Spring Framework Instantiates a Connector

The Spring Framework instantiates a connector as follows:

  1. Server software starts.

    The server application starts followed by the connector manager's web application.

  2. Spring Framework starts.

    The startup servlet sets the servlet context and creates an XML web application context that engages the Spring Framework.

  3. Spring Framework instantiates the connector manager.

    Spring Framework looks for the applicationContext.xml file in the WEB-INF/ directory and uses that XML file to instantiate all the beans defined in that file. The Spring beans in the applicationContext.xml file indicate the location of the connector manager and its classes.

  4. Spring Framework generates a connector type.

    The connector manager uses Spring to instantiate a connector type for each installed type of connector (for each connectorType.xml file found on the classpath).

  5. For each installed connector type, the connector manager uses Spring Framework to instantiate each connector instance, using connectorInstance.xml, connectorDefaults.xml, and the connector's .properties file.

Connector Type and Implementation Processes

Connectors consist of the following separate component groups that provide the

Note: When you implement a connector, define setters for all properties.

For example, if connector_name.properties contains "London Bridge is falling down":

In connectorInstance.xml:

  <bean id="helloworld-connector"
      class="com.example.connector.HelloWorldConnector"
      parent="helloworld-connector-defaults">
    <property name="repetitions" value="5"/>
  </bean>

In connectorDefaults.xml:

  <bean id="helloworld-connector-defaults">
    <property name="text" value="${content}"/>
    <property name="repetitions" value="1"/>
  </bean>
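
A connector class that matches these bean definitions might look like the following Java sketch. It is a hand-written imitation of what Spring does, not Spring itself: HelloWorldConnectorSketch and PlaceholderSketch are hypothetical names, and the sketch assumes that the .properties file defines a key named content whose value replaces the ${content} placeholder.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical connector bean: one setter per configurable property,
// so that a bean container can inject "text" and "repetitions".
final class HelloWorldConnectorSketch {
  private String text;
  private int repetitions;

  public void setText(String text) { this.text = text; }
  public void setRepetitions(int repetitions) { this.repetitions = repetitions; }

  public String getText() { return text; }
  public int getRepetitions() { return repetitions; }
}

// Imitates, by hand, the placeholder substitution and parent/child
// bean override described by the XML above.
final class PlaceholderSketch {
  // Replaces each ${key} in value with the matching entry from props.
  static String substitute(String value, Map<String, String> props) {
    String result = value;
    for (Map.Entry<String, String> e : props.entrySet()) {
      result = result.replace("${" + e.getKey() + "}", e.getValue());
    }
    return result;
  }

  static HelloWorldConnectorSketch build(Map<String, String> props) {
    HelloWorldConnectorSketch c = new HelloWorldConnectorSketch();
    // Values from the defaults bean (connectorDefaults.xml).
    c.setText(substitute("${content}", props));
    c.setRepetitions(1);
    // The instance bean (connectorInstance.xml) overrides its parent.
    c.setRepetitions(5);
    return c;
  }
}
```

With content set to "London Bridge is falling down", the resulting object has that text and a repetitions value of 5, because the instance definition overrides the default of 1. Without a setter for a property, a container has no way to inject its value, which is why the note above asks for setters on all properties.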

Adding a Connector

The following illustration shows the steps that occur when an administrator adds

Call sequence for the getConfigForm method

The steps are as follows:

  1. An administrator creates a new connector at the Admin Console by clicking Connector Administration > Connectors > Add New Connector.
  2. The connector manager determines which ConnectorType is to receive the request.
  3. The connector manager calls the ConnectorType.getConfigForm method to have the connector return an XHTML form in the ConfigureResponse class.
  4. The connector manager packages the form in an XML wrapper and sends the XML to the search appliance.
  5. The search appliance displays the form in the Admin Console.
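
Step 3 can be illustrated with a short Java sketch of a connector type that returns an empty form. The types here are hypothetical stand-ins rather than the SPI classes: FormResponseSketch plays the role of ConfigureResponse, HelloWorldTypeSketch plays the role of a ConnectorType implementation, the field names host and port are invented, and emitting the form as table rows is an assumption of this example.

```java
import java.util.Locale;

// Hypothetical stand-in for ConfigureResponse: an optional message
// plus an XHTML form snippet.
final class FormResponseSketch {
  final String message;
  final String formSnippet;

  FormResponseSketch(String message, String formSnippet) {
    this.message = message;
    this.formSnippet = formSnippet;
  }
}

// Hypothetical connector type that builds an empty configuration form.
final class HelloWorldTypeSketch {
  // Field names invented for this example.
  private static final String[] FIELDS = {"host", "port"};

  // The locale is unused here; a localized connector would use it
  // to choose the label text.
  FormResponseSketch getConfigForm(Locale locale) {
    StringBuilder form = new StringBuilder();
    for (String field : FIELDS) {
      form.append("<tr><td><label for=\"").append(field).append("\">")
          .append(field).append("</label></td>")
          .append("<td><input type=\"text\" name=\"").append(field)
          .append("\" id=\"").append(field).append("\"/></td></tr>\n");
    }
    // No message is needed when the form is first displayed.
    return new FormResponseSketch(null, form.toString());
  }
}
```

The connector only produces the form markup. Wrapping it in XML and delivering it to the search appliance remains the connector manager's job.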

Editing a Connector Form and Storing the Parameters

The following illustration shows how a completed form is validated and either returned for

Call sequence for validateConfig method

The steps are as follows:

  1. The administrator fills in the connector configuration form in the Admin Console with information about how the connector manager can connect to the content management system, including a host name and port, and clicks Submit.
  2. The connector manager routes the request to the proper connector type.
  3. The connector manager calls the ConnectorType.validateConfig method to ensure that all required information is present. The validateConfig method can also test the connection to the content management system: it can use the ConnectorFactory class to have Spring Framework create a connector instance, and then verify that the instance can communicate with the content management system. A connector can also change properties as needed in the validateConfig method.
  4. If the validateConfig method fails, the connector returns a ConfigureResponse object that contains the XHTML form and a message to inform the administrator what corrections to make in the form. After the administrator fills in the information, this sequence repeats until the administrator specifies correct information.
  5. If the validateConfig method succeeds, the connector manager stores property information in the ConnectorName.properties file. The connector manager also creates a connector instance and sends acknowledgment to the search appliance.
  6. The search appliance displays the connector entry in the Admin Console at Connector Administration > Connectors > List of Connectors.
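
The validation in steps 3 through 5 can be sketched as a check over the submitted configuration map. This Java example is a simplification under stated assumptions: ValidationSketch is a hypothetical class, the host and port keys are invented, and the method returns a list of problem messages instead of a ConfigureResponse object. An empty list corresponds to the success path in step 5, and a non-empty list corresponds to the messages returned with the form in step 4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical validator for a submitted configuration form.
final class ValidationSketch {
  // Returns the problems found; an empty list means the form is valid.
  static List<String> validate(Map<String, String> config) {
    List<String> problems = new ArrayList<String>();

    String host = config.get("host");
    if (host == null || host.trim().length() == 0) {
      problems.add("Host name is required.");
    }

    String port = config.get("port");
    try {
      // A missing port is parsed as "" and reported as not a number.
      int p = Integer.parseInt(port == null ? "" : port.trim());
      if (p < 1 || p > 65535) {
        problems.add("Port must be between 1 and 65535.");
      }
    } catch (NumberFormatException e) {
      problems.add("Port must be a number.");
    }
    return problems;
  }
}
```

A real validateConfig implementation could go further and attempt a connection to the content management system, as described in step 3. That part is omitted here because it depends on the CMS API.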

Getting Stored Parameters and Displaying the Form

The following illustration shows what occurs when an administrator edits the

Call sequence for the getPopulatedConfigForm() method

The steps are as follows:

  1. An administrator clicks Edit to change the information for an existing connector at Connector Administration > Connectors > List of Connectors.
  2. The connector manager routes the request to the appropriate connector instance. The connector manager locates the existing connector instance that is being modified, gets its configuration map, and hands that off to the appropriate ConnectorType as a parameter to the getPopulatedConfigForm method.
  3. The connector manager calls the connector's ConnectorType.getPopulatedConfigForm method to construct the XHTML configuration form for the Admin Console. The getPopulatedConfigForm method should use the values in the supplied configuration Map to populate the XHTML form.
  4. The connector manager puts the XHTML code in an XML wrapper and sends the XML to the search appliance.
  5. The search appliance displays the form in XHTML on the Admin Console. After the administrator completes the changes, the connector manager handles the form submission as described in Editing a Connector Form and Storing the Parameters.
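
Populating the form from the stored configuration map, as in step 3, might look like the following Java sketch. PopulatedFormSketch is a hypothetical class, the host and port field names are invented, and the table-row layout is an assumption. The one detail worth noting is that stored values are escaped before they are written into attribute values, so that a value containing a quote or an ampersand cannot break the XHTML.

```java
import java.util.Map;

// Hypothetical builder for a configuration form pre-filled with stored values.
final class PopulatedFormSketch {
  // Field names invented for this example.
  private static final String[] FIELDS = {"host", "port"};

  // Escapes characters that are unsafe inside an XHTML attribute value.
  static String escapeAttr(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;")
        .replace(">", "&gt;").replace("\"", "&quot;");
  }

  // Builds one table row per field, using "" when no value is stored.
  static String getPopulatedConfigForm(Map<String, String> configMap) {
    StringBuilder form = new StringBuilder();
    for (String field : FIELDS) {
      String value = configMap.get(field);
      form.append("<tr><td>").append(field).append("</td>")
          .append("<td><input type=\"text\" name=\"").append(field)
          .append("\" value=\"")
          .append(escapeAttr(value == null ? "" : value))
          .append("\"/></td></tr>\n");
    }
    return form.toString();
  }
}
```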

Content Feed and Metadata-and-URL Feed

A connector that can acquire a document, its metadata, and a URL for the document's location from a content management system is known as a content feed connector. Alternatively, a connector that can acquire just the metadata for a document and its URL is known as a metadata-and-URL feed connector. The connector manager packages this information into an XML wrapper and passes these components to the Google Search Appliance as a feed source.

The sections that follow provide more information about each feed type.

Content Feed Connector

A connector using content feed works with the content management system's API to acquire a document for the search appliance to index, metadata for the document, and a URL that points to the document's location. The connector then requests documents from the content management

A connector with content feed performs the following tasks:

More information: Understanding Authentication and Authorization.

Metadata-and-URL Feed Connector

A connector using metadata-and-URL feed works with the content management system's API to provide a URL for the location of the document in the content management system and metadata associated with the document. The search appliance then crawls the content management system using the URL to locate the document.

A connector with metadata-and-URL feed performs the following tasks:

Understanding Authentication and Authorization

A content feed connector provides authentication and authorization services to enable users to view controlled-access documents. This section explains how a content feed connector handles authentication and authorization. For a metadata-and-URL feed connector, the search appliance handles authentication and authorization.

For a connector to authenticate correctly, a user must have the same user name and password at all content management systems served by the connectors that are associated with a single search appliance. That is, if you have two document sources that map to a user who has different credentials at each source, the search appliance can't serve from both sources.

The search appliance sends the URL for each document to the web client on the content management system along with the credentials of the user making a search request for controlled-access documents. The web client passes the request to the content server, which returns the document or an error message.

The following illustration shows the request and response sequence that the search appliance initiates when a user requests to view controlled-access documents:

Authorization Flow Diagram

For a detailed explanation of each component, see Batch Authorization.

When a user requests a search of controlled-access content, an authentication request from the search appliance is sent to the connector manager, which sends back an authentication response from the connector communicating with the content management system. If the user is correctly authenticated, the search appliance requests authorization and the connector manager sends back a response from the content management system to either allow or deny a user's access to the requested content.

Authentication and authorization in a connector operate as follows:

Authentication and authorization depend on the connector feed type:

For normal (non-connector-based) authorization, the search appliance combines authentication and authorization as follows:

  1. Collects the authentication information at the start of a search query.
  2. Runs a query.
  3. Checks authorization with a HEAD request for each result.
  4. Presents the authentication information and the name of the document to the content management system's web client.
  5. Receives either an HTTP 401 error message (authorization denied - do not display results) or a response authorizing the display of the document.
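
The HEAD-based check in steps 3 through 5 can be approximated in Java as follows. This is an illustration of the decision logic only, not the search appliance's implementation: HeadAuthorizationSketch is a hypothetical class, passing the credentials as a ready-made Authorization header is an assumption, and treating any status other than 2xx as a denial is a conservative choice made for this sketch.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical per-result authorization check based on a HEAD request.
final class HeadAuthorizationSketch {
  // 401 means denied; 2xx means the result may be displayed.
  // Any other status is treated as denied (assumption of this sketch).
  static boolean isAuthorized(int status) {
    if (status == HttpURLConnection.HTTP_UNAUTHORIZED) {
      return false;
    }
    return status >= 200 && status < 300;
  }

  // Sends a HEAD request for the document with the user's credentials
  // and interprets the response status.
  static boolean check(URL documentUrl, String authorizationHeader)
      throws IOException {
    HttpURLConnection conn = (HttpURLConnection) documentUrl.openConnection();
    try {
      conn.setRequestMethod("HEAD");
      conn.setRequestProperty("Authorization", authorizationHeader);
      return isAuthorized(conn.getResponseCode());
    } finally {
      conn.disconnect();
    }
  }
}
```

A HEAD request is enough for this purpose because only the status matters and no document body needs to be transferred.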

More information: Authentication and Authorization

Localizing a Connector

The methods in the ConnectorType interface support the Locale parameter. If you decide to localize your connector, add support for internationalization. The connector follows Java internationalization conventions: add resource bundles as .properties files and build them into your connector.

The service provider interface supports the use of UTF-8 character sets.

Scheduling Connector Access

The Admin Console provides scheduling for when a connector can traverse documents in a content management system with starting and ending times for traversal and the number of documents to traverse per minute.

If the connector manager does not receive content from a connector, the connector manager waits 5 minutes before asking a connector for updated content.

Because the connector manager can start and stop traversals for a connector while the connector manager manages multiple connectors, a connector must be interruptible. The connector manager monitors the state of a connector. The connector periodically saves a checkpoint string. The connector manager provides this string when restarting a connector. A connector may also store information in files on the file system.
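
The checkpoint mechanism can be sketched in Java as a traversal that returns documents in batches and can resume from a saved string. The sketch rests on several assumptions: CheckpointTraversalSketch and Batch are hypothetical types whose method names are only modeled on the traversal concepts in this guide, the repository is a fixed list in a stable order, and the checkpoint is simply the index of the next document, whereas a real connector would more likely encode something like a last-modified time and document ID.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical traversal over a repository that lists documents in a stable order.
final class CheckpointTraversalSketch {
  // A batch of document IDs plus the checkpoint to resume from afterwards.
  static final class Batch {
    final List<String> docIds;
    final String checkpoint;

    Batch(List<String> docIds, String checkpoint) {
      this.docIds = docIds;
      this.checkpoint = checkpoint;
    }
  }

  private final List<String> repository;

  CheckpointTraversalSketch(List<String> repository) {
    this.repository = repository;
  }

  // Starts from the beginning of the repository.
  Batch startTraversal(int batchSize) {
    return resumeTraversal("0", batchSize);
  }

  // Resumes at the position encoded in the checkpoint string.
  Batch resumeTraversal(String checkpoint, int batchSize) {
    int start = Math.min(Integer.parseInt(checkpoint), repository.size());
    int end = Math.min(start + batchSize, repository.size());
    List<String> ids = new ArrayList<String>(repository.subList(start, end));
    return new Batch(ids, Integer.toString(end));
  }
}
```

Because all traversal state lives in the checkpoint string, the traversal can be stopped after any batch and restarted later, even by a fresh instance, without skipping or repeating documents.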

Do not handle scheduling in your connector. The search appliance and connector manager handle scheduling for you.

More information: Traversing Documents, Checkpointing, File Access.

Displaying Search Results to End Users

A search appliance displays search results as a web page that contains hyperlinks to content that matches a search query. By default, each search result provides a snippet of document content, metadata, or other data to help a search user decide whether the result is relevant.

The goal of the search results page is to give a user enough information to decide to navigate to the document.

For documents that come from connectors, when a user clicks a URL in the search results page, a connector can cause users to view:

You can control each target through how your connector sets the PROPNAME_DISPLAYURL property in the TraversalManager class.

More information: Metadata Properties and Traversing Documents.
