Google Search Appliance - Connector Developer's Guide: Traversing Documents

Google Search Appliance software version 6.2
Connector manager version 2.4.0
Posted December 2009

Traversing means acquiring documents from a content management system that the connector manager feeds to a Google Search Appliance. Traversing is analogous to the search appliance concept of crawling.

Chapters: Home, About This Guide, Introduction, Getting Started, SPI Overview, Traversing Documents,
Authentication, Authorization, Configuration, Appendix A: Building a Debug Connector Manager

Chapter Contents: Traversing Documents

Overview
1. Initial Traversal Calls
2. Subsequent Traversal Calls
Acquiring Documents
Iterating Over a Document List
Batch Hint
1. Batch Hint Example
Checkpointing
1. Checkpointing Example
Traversal Context
Resetting a Connector
Metadata Properties
1. Property Lookup
2. SpiConstants Properties
Deleting a Document From the Search Appliance Index
Logging
Count of Feed Files Awaiting Indexing
File Access
1. Scheduling and Checkpoint Data Files
2. Accessing Properties using Spring Framework Inversion of Control
Managing Work Queue Timeouts

Overview

Traversing is acquiring documents from a content management system for the purpose of indexing the documents at a Google Search Appliance. For connector terminology definitions, see the Google Enterprise Glossary.

A connector that uses content feed acquires documents, metadata about each document, and URLs that point to the location of each acquired document in the content management system. A connector that uses metadata-and-URL feed acquires metadata for each document and a URL to the location of the document.

Traversing a content management system consists of the following tasks:

Receiving a batch hint
Acquiring documents, metadata properties, and URLs
Checkpointing the last read document

The following illustration shows the traversal interface:

Call sequence of the Session.getTraversalManager, Session.login method, and the startTraversal and resumeTraversal methods

The steps in the illustration are:

Traversal starts with a call to the getTraversalManager method from the Session interface.
The Session interface is called from Connector interface's login method.
The connector manager determines when to start a traversal, resume traversal, and to provide a batch hint, which is the recommended number of documents to acquire from the content management system. The startTraversal and resumeTraversal methods return a DocumentList object. For more information, see getTraversalManager Method .

A connector controls traversal by setting the return value of the DocumentList interface:

null - Stop traversing, all documents are traversed. Wait 5 minutes before asking the content management system for updated documents.
DocumentList containing zero documents - Set a progress checkpoint and resume traversing.
DocumentList (non-empty) - Send documents in the DocumentList to the search appliance.

For more information on DocumentList return values and document traversing exceptions, see Iterating Over a Document List.

This section explains how to code the following interfaces:

TraversalManager. Starts and resumes acquiring documents from a content management system. Also sets the number of documents in a batch that a connector requests from a content management system.
DocumentList. Extracts a document from the list of documents. Also enables checkpointing the last traversed document.
Document. Extracts property names from a document.
Property. Extracts values from a property.
Value. Provides a wrapper class for all data items from a content management system.

The traversing section also provides related topics on metadata properties, logging, file access, persistent storage, and deleting documents from an index.

Initial Traversal Calls

The first time the connector manager calls a connector instance, the connector manager calls the following methods:

The Connector.login method of the connector instance to start a session.
The Session.getTraversalManager method to start the traversal manager.
The TraversalManager.setBatchHint method to inform the connector that the number of documents to traverse need not be higher than the number specified in the setBatchHint method. Implementations must provide the setBatchHint method to receive a preferred size for the number of documents in a batch to request from a content management system. Connectors can ignore this call or acquire approximately a recommended quantity of documents.
The TraversalManager.startTraversal method to iterate over documents from the logical beginning of the content management system. Traversal continues until the connector manager interrupts the connector instance to perform other tasks.

Subsequent Traversal Calls

After the connector manager is ready to resume acquiring documents, the connector manager calls the following methods:

The DocumentList.checkpoint method to bookmark the location in the content management system repository where traversal should resume.
The resumeTraversal method of the connector instance to resume traversing at the document indicated by the checkpoint method.

Acquiring Documents

The following code example provides a sample TraversalManager implementation. The code declares the class, specifies 500 documents for the fake content management system.

public class MockTraversalManager implements TraversalManager {
  // The maximum number of documents in our fake repository.
  private static final int MAX_DOCID = 500;

  // The recommended number of documents to return in each batch.
  private int batchHint;

  // Set the recommended number of documents to return in next batch.
  public void setBatchHint(int hint) throws RepositoryException {
    if (hint > 0) {
      this.batchHint = hint;
    } else {
      throw new RepositoryException("BatchHint must be a positive integer");
    }
  }

The next section lists the startTraversal method, which the connector manager calls to start acquiring documents from the content management system.

  // Start the traversal from the logical beginning of the repository.
  public DocumentList startTraversal() throws RepositoryException {
  // Start indexing at beginning of our fake repository.
  return collectDocumentsFromId(0);
  }

The next section lists the resumeTraversal method, which the connector manager calls when it's ready to acquire more documents.

  // Resume traversal from the saved checkpoint position in the repository.
  public DocumentList resumeTraversal(String checkpoint) throws RepositoryException {
    int checkpointId;
    try {
      // Extract the last docId returned from the checkpoint.
      checkpointId = Integer.parseInt(checkpoint);
    } catch (NumberFormatException e) {
      throw new RepositoryException("Invalid checkpoint");
    }
    // Resume traversing at the next document after the last one returned.
    return collectDocumentsFromId(checkpointId + 1);
  }

Iterating Over a Document List

After acquiring documents from the content management system, a connector iterates through the documents to process the documents. Thereafter, the connector manager passes the documents to the Google Search Appliance.

The DocumentList interface provides the checkpoint method that indicates the current position within the document list, that is where to start a resumeTraversal method. The nextDocument method gets the next document from the document list that the connector acquires from the content management system.

The return values for DocumentList are as follows.

Return Value	Description
`null`	Informs the connector manager that there are no new documents in the content management system to traverse. The connector manager then waits 5 minutes before calling the `resumeTraversal` method to check the content management system for new documents to traverse.
Zero documents (empty)	Traversing is not done, set a checkpoint to indicate the current progress in the content management system, and immediately resume traversing. This status enables a connector to handle very large processing tasks, such as finding just a few documents to which a user has access in a large database, or traversing a very large repository. If a connector tries to execute long processing tasks, the connector manager timeout mechanism assumes that the connector is hung and attempts to restart the process. This can cause an infinite loop of failing to reach documents, timing out, restarting, and again failing.
Documents	Documents are ready for the connector manager to send to the search appliance.

The checkpoint method should return information that allows the resumeTraversal method to resume on the document that was returned by nextDocument.

The connector manager may not acquire all documents from the content management system. The connector manager may interrupt the traversal of documents, call the checkpoint method, and start another task. When the connector manager is again ready to start traversing the content management system, the connector manager calls the resumeTraversal method and passes the checkpoint method to provide the current position.

Connectors can have these exceptions for documents:

RepositoryException. A transient error occurred while traversing a document, such as a lost connection to the content management system. The connector needs to reread the document. The connector manager waits 15 minutes and retries the same batch before resuming traversal.
RepositoryDocumentException. A document has an error and cannot be retrieved. Skip the document.
OutofMemoryError. The connector is out of memory. The connector manager waits 15 minutes and retries the same batch before resuming traversal. This exception occurs for documents with a length that exceeds the maxDocumentSize parameter specified in connectorInstance.xml. By default, this parameter is set to 30 MB, however you can set maxDocumentSize to any value up to 1 GB. If this exception occurs frequently, raising the maxDocumentSize may require adjusting the Java virtual machine (JVM) memory configuration options -Xms and -Xmx, which are set during Apache Tomcat startup. Tomcat deployments created by the Google Connector Installer (GCI) configure the JVM -Xms and -Xmx settings in $CATALINA_BASE/bin/catalina.sh (or .bat), rather than in the conventional location, $CATALINA_BASE/bin/setenv.sh (or .bat).

The code that follows is a helper method that both startTraversal and resumeTraversal call to build a list of documents. The batch hint provides a useful end point in the for loop for adding documents to the list. The value for baseDocId is passed from the caller.

  // Build a DocumentList of no more than batchHint documents starting from baseDocId.
  private DocumentList collectDocumentsFromId(int baseDocId) throws RepositoryException {
    if (baseDocId > MAX_DOCID)
      return null;  // No more documents available.

    MockDocumentList documentList = new MockDocumentList(batchHint);
    for (int i = 0; i < batchHint; i++) {
      int newDocId = baseDocId + i;
      if (newDocId > MAX_DOCID)
        break;  // End of our fake repository.
      documentList.add(new MockDocument(newDocId));
    }

    return documentList;
  }

This section provides a DocumentList object.

  // DocumentList returned from startTraversal() or resumeTraversal().
  private class MockDocumentList extends ArrayList 
    implements DocumentList {
    // The ID of the last document that was returned by nextDocument().
    private int lastDocId = -1;

    // The Iterator over the DocumentList.
    private Iterator iterator = null;

    public MockDocumentList(int size) {
      super(size);
    }

The next section gets the next document and stores the document ID to use as a checkpoint that the resumeTraversal method uses to get the next document to acquire from the content management system.

    // Return the next document in the DocumentList.
    public Document nextDocument() throws RepositoryException {
      if (iterator == null)
        iterator = super.iterator();

      if (iterator.hasNext()) {
        MockDocument document = (MockDocument) iterator.next();
        // Remember the id of the last document returned.
        lastDocId = document.docId;
        return document;
      } else {
        return null;  // No more documents in list.
      }
    }

The next section provides the checkpoint value when called, or null to indicate to use the previous stored checkpoint.

    // Return a checkpoint recording the last document returned 
    // from nextDocument().
    public String checkpoint() throws RepositoryException {
      if (lastDocId < 0)
        return null;  // No new checkpoint.
      else
        return String.valueOf(lastDocId);
    }
  }

The next section provides a Document object.

  // Trivial Document with no content and only a DOCID Property.
  private class MockDocument implements Document {
    public int docId;

    public MockDocument(int id) {
      this.docId = id;
    }

The connector manager uses the next section to find a property or to get all the properties for a document. Note the use of the String equalsIgnoreCase operator appended to the property to ensure the case is ignored in the property.

    public Property findProperty(String name) 
      throws RepositoryException {
      if (SpiConstants.PROPNAME_DOCID.equalsIgnoreCase(name))
        return new MockProperty(String.valueOf(docId));
      else
        return null;
    }

    public Set getPropertyNames() throws RepositoryException {
      // We only provide the mandatory DOCID Property.
      HashSet names = new HashSet();
      names.add(SpiConstants.PROPNAME_DOCID);
      return names;
    }
  }

The next section provides a Property object to return a string value.

  // Trivial Property that returns a single string value.
  private class MockProperty implements Property {
    String value = null;

    public MockProperty(String value) {
      this.value = value;
    }

The last section provides the nextValue method that packages a property value by type.

    public Value nextValue() throws RepositoryException {
      if (Value == null) {
        return null;
      } else {
        // Use the Value class to package up the String.
        Value returnValue = Value.getStringValue(value);
        value = null;
        return returnValue;
      }
    }
  }
}

Batch Hint

The connector manager provides the TraversalManager.setBatchHint method so that the connector manager can indicate to the connector to acquire only as many documents in the DocumentList as are specified in the hint value. This is a hint and the connector need not observe the number of results. For example, the traversal may be completely up to date, so perhaps there are no results to return.

The connector manager always calls setBatchHint before calling startTraversal or resumeTraversal. The setBatchHint is determined by the traversal rate. The first call to a traversal method in a particular minute has a batch hint matching the traversal rate. Subsequent calls in the same minute will have a batch hint of the traversal rate minus the number of documents traversed so far during that minute. If the batch hint becomes zero, the traversal method will not be called until the minute has elapsed.

Traversal methods should always pay attention to the batch hint, because this value specifies how many documents the connector manager processes in a DocumentList.

Batch Hint Example

In the following example, the connector manager passes the batch hint value and the setBatchHint method assigns the value.

private int batchHint = -1;
...
public void setBatchHint(int batchHint) throws RepositoryException {
    this.batchHint = batchHint;
}

The following example limits the batch hint to 100,000 documents. This example connector later references the batchSize parameter in a call to a database for storing values.

public void setBatchHint(int hint) {
  if (hint < 0)
    throw new RepositoryException();
  else if (hint == 0)
    batchSize = 100; 
  else if (hint > 100000)
    batchSize = 100000;
  else
    batchSize = hint;
}

Checkpointing

Checkpointing consists of recording how far a traversal has progressed through the documents in the content management system. The connector manager calls the resumeTraversal method to pass the saved checkpoint to the connector. The DocumentList.checkpoint method lets a connector indicate the next document to traverse.

A connector can also store the state of the traversal on its own and assume that the resumeTraversal method knows where to look for this state information. If a connector maintains its own state, the connector must wait until the connector manager calls the checkpoint method before committing changes to the state. The connector can't assume that a document returned from nextDocument has been successfully processed.

Return values for the checkpoint method:

Return Value	Description
`null`	Indicates that there is no new checkpoint
Next document ID	Indicates the next document to traverse. The connector manager saves the string returned by the `checkpoint` method and passes the saved checkpoint to the `resumeTraversal` method. This string must convey enough information for the `resumeTraversal` method to know where to continue.
Non-`null`	Indicates that the connector cannot return a checkpoint string. If a connector decides that it cannot return a checkpoint string, the connector must return a non-`null` string, which can be an empty string or a constant string.

For more information on the checkpoint method, see getTraversalManager Method.

Returning a checkpoint string may be impractical for performance, such as when taking too much work to get the state of a traversal into a string and then parsing out the string.

For information on writing information, see File Access and Scheduling and Checkpoint Data Files.

Checkpointing Example

The following code example iterates over a list of documents and returns a checkpoint. In the last section of this code block, the code maintains the last document ID (lastDocId) in NextDocument and incrementing as it goes. When the checkpoint method is called, it knows how far it has gotten through the document list.

    // Return the next document in the DocumentList.
    public Document nextDocument() throws RepositoryException {
      if (iterator == null)
        iterator = super.iterator();

      if (iterator.hasNext()) {
        MockDocument document = (MockDocument) iterator.next();
        // Remember the id of the last document returned.
        lastDocId = document.docId;
        return document;
      } else {
        return null;  // No more documents in list.
      }
    }

    // Return a checkpoint recording the last document returned from nextDocument().
    public String checkpoint() throws RepositoryException {
      if (lastDocId < 0)
        return null;  // No new checkpoint.
      else
        return String.valueOf(lastDocId);
    }
  }

If a connector needs to record the traversal state, it must only do so in the checkpoint method, and not in the nextDocument method. The recorded state must be retrieved in resumeTraversal, for example:

// Resume traversal from the saved checkpoint position in the repository. 
public DocumentList resumeTraversal(String dummy) throws RepositoryException { 
  int checkpointId; 
  try {
    // Read the checkpoint from disk. 
    BufferedReader in = new BufferedReader(new FileReader("checkpoint.txt"));
    try {
      String checkpoint = in.readLine(); 
      // Extract the last docId returned from the checkpoint.
      checkpointId = Integer.parseInt(checkpoint);
    } finally {
      in.close();
    }
  } catch (IOException e) {
    throw new RepositoryException(e); 
  } catch (NumberFormatException e) { 
    throw new RepositoryException("Invalid checkpoint"); 
  } 
  // Resume traversing at the next document after the last one returned. 
  return collectDocumentsFromId(checkpointId + 1); 
} 
...
  // Return a checkpoint recording the last document returned  
  // from nextDocument(). 
  public String checkpoint() throws RepositoryException { 
    if (lastDocId < 0) 
      return null;  // No new checkpoint. 
    else {
      try {
        // Save the checkpoint to disk.
        FileWriter out = new FileWriter("checkpoint.txt");
        try {
          out.write(String.valueOf(lastDocId));
        } finally {
          out.close();
        }
      } catch (IOException e) {
        throw new RepositoryException(e);
      }
      return ""; // Any non-null string is OK.
    }
  } 
}

For information on writing information to disk, see File Access.

Traversal Context

The connector manager provides the TraversalContext class implementation so that a connector can determine document types to provide during a traversal. Connectors may use the information provided by the TraversalContext class to limit content provided for indexing, based upon document size or MIME type.

The Spring Framework constructs the TraversalContext using a bean definition in the applicationContext.xml file. The bean definition also allows an administer to modify the TraversalContext constraints to suit their needs. The maximum document size, unknown MIME type support level, and the sets of MIME types to prefer, support, or not support can all be configured.

Note: A MIME type consists of two parts, a major class and a minor class, for example, application/text. If a major and minor MIME type is not explicitly defined in the applicationContext.xml file and a MIME type cannot be resolved, then the rule applies to the all uses of the major MIME type class. For example, if you have a corporate MIME type such as application/treefrog and this MIME type is not defined in the XML file, the connector manager applies the rules in the applicationContext.xml file to all MIME types with the application major class.

Note: The default settings for the supported MIME types are comprehensive so that a connector can continue to feed useful data and exclude only content that would be excluded normally by the default configuration of the Google Search Appliance--after enabling traversal context awareness, administrators need not change the applicationContext.xml unless a change is made applications to not crawl in the Admin Console.

This section provides the following topics

Making a Connector Aware of Traversal Context
Using TraversalContext for Feeding Documents
Traversal Context Methods
Tailoring the TraversalContext Settings to your Enterprise
Setting the Maximum Document Size
MIME Type Support Levels

Making a Connector Aware of Traversal Context

If a connector's TraversalManager implementation adds the com.google.enterprise.connector.spi.TraversalContextAware interface, the connector manager can call the setTraversalContext method, and supply context values for the connector to use before the connector manager calls the methods in the TraversalManager interface.

A connector needs to import access to the TraversalContext and TraversalContextAware packages.

The following example demonstrates this calling sequence:

import com.google.enterprise.connector.spi.TraversalContext;
import com.google.enterprise.connector.spi.TraversalContextAware;

class MyTraversalManager
    implements TraversalManager, TraversalContextAware {

  /** The TraversalContext from TraversalContextAware Interface */
  private TraversalContext traversalContext = null;

  public void setTraversalContext(TraversalContext traversalContext) {
    this.traversalContext = traversalContext;
  }
  ...
}

Using TraversalContext for Feeding Documents

If a connector contains an instance of the TraversalContext class, the connector's TraversalManager class can then use the context to tailor its document feed.

The mimeTypeSupportLevel method determines whether to send a document's content to the search appliance. The preferredMimeType method returns the most preferred (highest support level) MIME type from the supplied set.

For example, in the following code snippet, the TraversalContext class determines whether or not to supply a google:content property for a document.

private void addContentProperty() throws RepositoryException {
  File contentFile;
  String mimeType;
  ...

  long fileSize = contentFile.length();

  // Empty/NonExistent file has no content.
  if (fileSize <= 0L)
    return;

  // The TraversalContext interface provides additional
  // screening  depending on content size and MIME type.
  if (traversalContext != null) {
    // Is the content too large?
    if (fileSize > traversalContext.maxDocumentSize())
      return;

    // Is this MIME type supported?
    if (traversalContext.mimeTypeSupportLevel(mimeType) <= 0)
      return;
  }

  // If okay, create a content stream property
  // and add it to the property map.
  InputStream contentStream = new FileInputStream(contentFile);
  Value contentValue = Value.getBinaryValue(contentStream);
  props.addProperty(SpiConstants.PROPNAME_CONTENT, contentValue);
  return;
 }

Traversal Context Methods

The TraversalContext interface provides the following methods:

maxDocumentSize. Sets the maximum file size in bytes that that the connector can traverse. This value must be less than or equal to 30 MB (31457280 bytes), which is the maximum file size for indexing on the Google Search Appliance.
mimeTypeSupportLevel. Determines if a MIME type is supported in the applicationContext.xml file.
preferredMimeType. Returns the preferred MIME type for a set of MIME types.

The methods are as follows.

public interface TraversalContext {
 /**
  * Gets a size limit for contents passed through the connector
  * framework. If a developer has a way of asking the repository
  * for the size of a content file before fetching it, then a
  * comparison with this size would save the developer the cost
  * of fetching a content that is too big to be used.
  *
  * @return The size limit in bytes
  */
 long maxDocumentSize();

 /**
  * Gets information about whether a MIME type is supported.
  * Non-positive numbers mean that there is no support for this
  * MIME type. Positive values indicate possible support for this
  * MIME type, with larger values indicating better support or
  * preference.
  *
  * @return The support level - non-positive means no support
  */
 int mimeTypeSupportLevel(String mimeType);

 /**
  * Returns the most preferred MIME type from the supplied set.
  * This returns the MIME type from the set with the highest support
  * level. MIME types with "/vnd.*" subtypes are preferred over
  * others, and MIME types registered with IANA are preferred over
  * those with "/x-*" experimental subtypes.
  *
  * If a repository contains multiple renditions of a particular
  * item, it may use this to select the best rendition to supply
  * for indexing.
  *
  * The support level values for MIME types are defined in the
  *  MimeTypeMap bean definition in applicationContext.xml.
  *
  * @param mimeTypes a set of MIME types.
  * @returns the most preferred MIME type from the Set.
  */
 String preferredMimeType(Set mimeTypes);
}

Tailoring the TraversalContext Settings to your Enterprise

The default values for the TraversalContext settings reflect either common usage or Google Search Appliance limitations. However, a knowledgeable user or administrator can tailor some of the search appliance and TraversalContext configuration to better suit their enterprise content.

For example, the Google Search Appliance doesn't index a document whose content exceeds 30 megabytes in size and the default TraversalContext.maxDocumentSize value reflects that limit. However, perhaps you wish to impose a lower limit, only indexing documents that are 5 megabytes or less. Or perhaps you prefer to index Postscript or PDF pre-press versions of documents, rather than the original Office suite formats.

You may tune TraversalContext settings by editing one or more of the connector Manager web application configuration files. In some cases it may also be necessary to modify the Search Appliance configuration (using the search appliance Administration pages).

Setting the Maximum Document Size

The maximum document size limit may be configured by adjusting the maxDocumentSize property of the FileSizeLimitInfo bean definition in applicationContext.xml. Do not set maxDocumentSize larger than the 30 MB search appliance document limit.

If the document size exceeds the maxDocumentSize value or if the content is an unsupported MIME type, the connector should not feed the document content. The search appliance indexes all metadata about a document, but not the document's content.

The use of the maxDocumentSize parameter avoids the occurrence of throwing the OutOfMemoryErrors exception--if the maxDocumentSize parameter is set, then MIME types with large content size can be excluded from being fed to the Google Search Appliance. For more information on exceptions, see Iterating Over a Document List.

MIME Type Support Levels

The support level values for MIME types that are used by the mimeTypeSupportLevel and preferredMimeType methods are defined in the MimeTypeMap bean definition in the applicationContext.xml file. Support levels are integer values, with higher positive integers have a higher preference than lower values. Support levels less than or equal to zero indicate that MIME type is not supported. The default value is 1, which indicates that a document type is unknown. The value 1 sends unidentifiable content to the Google Search Appliance, 0 does not send unidentifiable content to the search appliance.

The TraversalContext.maxDocumentSize(String mimeType) method returns the support level of the specified MIME type. MIME type support levels are used to rank items supplied to TraversalContext.preferredMimeType(Set mimeTypes).

The MimeTypeMap bean definition in applicationContext.xml groups MIME types into the following categories:

Category	Description
`unsupported`	Document types that must not be sent to the Google Search Appliance because the URL pattern has been designated in the Admin Console for not crawling or a document type that has file sizes greater than 30 MB (the maximum file size that the search appliance can handle for indexing a document). If a document's MIME type is not supported, then its content is not sent in the feed to the search appliance, however the metadata for the document MIME type is sent in the feed to the search appliance for indexing.
`supported`	Document types that the search appliance can index.
`preferred`	MIME types with higher support levels than the unsupported MIME types
`ignored`	Skip a document and neither send the content or metadata in the feed to the search appliance. For example, if a company had a large number of JPG pictures from a company party in the content management system, you could use `ignored` to say that you don't want the company pictures to be indexed.
`unknown`	A document type that is not `unsupported`, `supported`, or `preferred`.

The entries or each of the unsupported, supported, or preferred sets are a list of content types that may or may not include subtypes. Exact (case-insensitive) matches are attempted first. If an exact match is not found, a match is attempted using the base type without the subtype.

For example:

preferredMimeTypes={} (empty), supportedMimeTypes={"foo/bar"},
unsupportedMimeTypes={"foo", "cat"}

Foo/Bar matches (case-insensitively) foo/bar, so it would be considered supported. Foo/baz does not have an exact match, but its content type, without a subtype, foo, does have a match in the unsupported table, so it should be considered unsupported. Similarly, cat/persian would be considered unsupported. Xyzzy/bar lacks an exact match, and its content type without a subtype, xyzzy, is also not present, so it would be assigned the unknownMimeTypeSupportLevel.

Nearly all of the IANA recognized content type classes are well represented in the preferred, supported, unsupported MIME types sets, so very few MIME types should end up as unknown. Removing a content type without a subtype entries from the preferred, supported, unsupported MIME types sets forces more MIME types to become unknown. For more information, see MIME Media Types at IANA.org.

Moving MIME types between preferred, supported, and unsupported adjusts the support level for that MIME type. Move MIME types you explicitly don't want to index to the unsupported set. Moving the MIME type (without a subtype) entries, like application, can have sweeping effects.

You could add MIME types for new, experimental, or custom content. Deleting entries with explicit subtypes forces an attempt to match an entry sans subtype. For example, deleting the application/postscript entry from the supported set would cause the support level for such content to drop to the application support level, currently unsupported.

Deleting an entry without a subtype forces all subtypes not explicitly listed to be considered unknown. For example, the application content type class is currently listed in the unsupported set. As such, the MIME type application/vnd.informix-visionary, not explicitly mentioned elsewhere, would have the support level of unsupported. Deleting the application entry from the unsupported set forces application/vnd.informix-visionary, lacking both an explicit entry and an application entry without a subtype, to be assigned the unknownMimeType support level.

Resetting a Connector

The Admin Console provides the capability to reset a connector. The connector manager uses its restartConnectorTraversal servlet. At the connector level, the connector manager performs the following actions:

Deletes the connector's saved traversal state
Cancels any traversal work that is in progress for the connector
Calls startTraversal to force a new traversal of the content management system from the beginning

Metadata Properties

A connector requests documents, metadata for a document and a URL for a document's location from the API for the content management system. A connector provides the search appliance with a URL for each document and metadata that describes document attributes, such as a document ID value, a MIME type, whether documents are publicly available or controlled-access, and an optional last modified date. The connector may also provide metadata specific to a content management system.

The last-modified date ordering is useful for a search appliance to acquire the latest documents for indexing. You need to create connector software that maps the metadata from the content management system into the properties of the service provider interface (SPI) of the connector manager. For information on the SPI, see the SpiConstants section in the Javadoc.

The connector manager uses the Document.findProperty method to find a property by name in a document or the Document.getPropertyNames method to get all the properties for a document.

Important: If the connector manager requests a property from the connector that a document does not contain, the connector must return null and not throw an exception.

Content management systems use their own names for each metadata property. A connector needs to map those names to the property names in the SpiConstants class. The connector manager requires the use of the PROPNAME_DOCID property and recommends the use of the PROPNAME_LASTMODIFIED property. Google Enterprise strongly suggests that you specify a MIME type in the PROPNAME_MIMETYPE property.

Properties With Multiple Values

Even though some content management systems allow properties with multiple values, they typically represent a small set of the metadata associated with documents. Because properties with multiple values can exist, the Property interface was designed to allow them. However, some properties are expected to have only one value and PROPNAME_DOCID is one of them. A document must have a single, unique document ID value.

The following example shows how to return a property with multiple values:

  public Value nextValue() throws RepositoryException {
    if (iterator != null) {
      // Return the values of a Property with multiple values.
      if (iterator.hasNext()) {
        return fromObject(iterator.next());
      }
      return null;
    } else {
      // Return the value of a Property with a single value, but only once;
      Value returnValue = singleValue;
      singleValue = null;
      return returnValue;
    }
  }

Property Names Outside SpiConstants

You can create property names outside of the SpiConstants class if required to indicate the metadata from an underlying content management system. These entries can be searchable and users can select the entries in advanced queries with search keywords such as using the inmeta filter. For more information, see "Using inmeta to Filter by Meta Tags" in the Search Protocol Reference. Property names like author, title, license, and tags are examples of metadata that the connector manager does not explicitly look for, but which you can pass to the search appliance for indexing.

Note: If a connector uses content feed, do not set the PROPNAME_SEARCHURL property. If a connector uses metadata-and-URL feed, set the PROPNAME_SEARCHURL property. For more information, see Understanding Content Feed and Metadata-and-URL Feed.

Property Lookup

The connector manager uses the Property.findProperty method to find a property in a document by name or the Property.getPropertyNames method to provide a list of all properties that a document contains. If the connector manager requests a property that a document does not contain, the connector returns null and must not throw an exception.

Two behaviors are required:

The findProperty method returns null and does not throw an exception when an unknown property is requested.
The same property may be looked up multiple times, and the same SimpleProperty instance cannot be returned each time. This is because the nextValue method on the SimpleProperty instance will return null the second time the property is retrieved.

The following code example code provides an implementation of the Property interface that is notable for its large number of overloaded constructors, its various primitive types, and its common Object types. These constructors make an informed choice of which Value class method to call based on the type of the parameter supplied to the constructor.

public class SmartProperty implements Property {
  // Used by single-value properties.
  Value singleValue = null;

  // Used by multi-value properties.
  Iterator iterator = null;

  // Constructors for single-value Properties of primitive types.
  public SmartProperty(int anInt) {
    // Use the Value class to construct a long integer value.
    singleValue = Value.getLongValue((long) anInt);
  }
  public SmartProperty(long aLong) {
    // Use the Value class to construct a long integer value.
    singleValue = Value.getLongValue(aLong);
  }
  public SmartProperty(boolean aBool) {
    // Use the Value class to construct a boolean value.
    singleValue = Value.getBooleanValue(aBool);
  }
  public SmartProperty(float aFloat) {
    // Use the Value class to construct a floating point value.
    singleValue = Value.getDoubleValue((double) aFloat);
  }
  public SmartProperty(double aDouble) {
    // Use the Value class to construct a floating point value.
    singleValue = Value.getDoubleValue(aDouble);
  }
  public SmartProperty(byte[] byteArray) {
    // Use the Value class to construct a byteArray binary value.
    singleValue = Value.getBinaryValue(byteArray);
  }

  // Constructors for single-value Properties of Java primitive wrapper Objects.
  public SmartProperty(Integer anInt) {
    // Use the Value class to construct a long integer value.
    singleValue = Value.getLongValue(anInt.longValue());
  }
  public SmartProperty(Long aLong) {
    // Use the Value class to construct a long integer value.
    singleValue = Value.getLongValue(aLong.longValue());
  }
  public SmartProperty(Short aShort) {
    // Use the Value class to construct a long integer value.
    singleValue = Value.getLongValue(aShort.longValue());
  }
  public SmartProperty(Boolean aBool) {
    // Use the Value class to construct a boolean value.
    singleValue = Value.getBooleanValue(aBool.booleanValue());
  }
  public SmartProperty(Float aFloat) {
    // Use the Value class to construct a floating point value.
    singleValue = Value.getDoubleValue(aFloat.doubleValue());
  }
  public SmartProperty(Double aDouble) {
    // Use the Value class to construct a floating point value.
    singleValue = Value.getDoubleValue(aDouble.doubleValue());
  }
  public SmartProperty(String aString) {
    // Use the Value class to construct a String value.
    singleValue = Value.getStringValue(aString);
  }
  public SmartProperty(InputStream aStream) {
    // Use the Value class to construct a binary InputStream value.
    singleValue = Value.getBinaryValue(aStream);
  }
  public SmartProperty(Calendar aDate) {
    // Use the Value class to construct a Date value.
    singleValue = Value.getDateValue(aDate);
  }
  public SmartProperty(Date aDate) {
    // Use the Value class to construct a Date value.
    Calendar calendar = new GregorianCalendar();
    calendar.setTime(aDate);
    singleValue = Value.getDateValue(calendar);
  }

  // Constructors for multi-valued Properties.
  public SmartProperty(Collection values) {
    this.iterator = values.iterator();
  }
  public SmartProperty(Object[] objectArray) {
    this.iterator = Arrays.asList(objectArray).iterator();
  }

  // Constructor for generic Objects.
  // Try to upcast to the appropriate type.
  public SmartProperty(Object anObject) {
    singleValue = fromObject(anObject);
  }

Helper method to get a value from an object.

  private static Value fromObject(Object obj) {
    if (obj == null)
      return null;
    if ((obj instanceof Integer) || (obj instanceof Long) || (obj instanceof Short))
      return Value.getLongValue(((Number) obj).longValue());
    else if ((obj instanceof Float) || (obj instanceof Double))
      return Value.getDoubleValue(((Number) obj).doubleValue());
    else if (obj instanceof Boolean)
      return Value.getBooleanValue(((Boolean) obj).booleanValue());
    else if (obj instanceof String)
      return Value.getStringValue((String) obj);
    else if (obj instanceof InputStream)
      return Value.getBinaryValue((InputStream) obj);
    else if (obj instanceof Calendar)
      return Value.getDateValue((Calendar) obj);
    else if (obj instanceof Date) {
      Calendar calendar = new GregorianCalendar();
      calendar.setTime((Date) obj);
      return Value.getDateValue(calendar);
    } else {
      // Fall through to providing a String representation of non-primitive objects.
      return Value.getStringValue(obj.toString());
    }
  }

Check to see if the property has multiple values or a single value and return the value.

  public Value nextValue() throws RepositoryException {
    if (iterator != null) {
      // Return the values of a Property with multiple values.
      if (iterator.hasNext()) {
        return fromObject(iterator.next());
      }
      return null;
    } else {
      // Return the value of a Property with a single value, but only once;
      Value returnValue = singleValue;
      singleValue = null;
      return returnValue;
    }
  }
}

SpiConstants Properties

The SpiConstants class is a non-instantiable class that holds constants used by the SPI and documents their meanings.

The SpiConstants class provides the ActionType class, which is a type safe enum for the action types defined by the PROPNAME_ACTION property. All PROPNAME properties are of type String.

The following sections list the properties in the SpiConstants class:

Property	Description
PROPNAME_ACLGROUPS	Identifies a multiple-valued String property that gives the list of group ACL Scope IDs that are permitted `RoleType.READER` access to this document.
PROPNAME_ACLUSERS	Identifies a multiple-valued String property that provides the list of users who are permitted to access this document.
PROPNAME_ACTION	Enables a connector to delete a document from the index of the search appliance.
PROPNAME_CONTENT	Indicates the content for a document.
PROPNAME_CONTENTURL	Indicates the URL for the content of a document. This property is reserved for future use.
PROPNAME_DISPLAYURL	Indicates the URL that appears in the search results.
PROPNAME_DOCID	Identifies a document in the content management system.
PROPNAME_ISPUBLIC	Indicates whether a document is publicly readable or is a controlled-access document.
PROPNAME_LASTMODIFIED	Identifies when a document in the content management system was last modified.
PROPNAME_MIMETYPE	Specifies the type of file that a connector acquires from a content management system.
PROPNAME_SEARCHURL	Identifies the URL location of a document in the content management system for metadata-and-URL feed. This property is not used by content feed.
PROPNAME_SECURITYTOKEN	Provides a token with the identity of the user who submits a search query, and the connector indicates whether the current user has permission to view a document in a class. This property is reserved for future use.
PROPNAME_TITLE	Specifies the title of the search results if there is no content.
DEFAULT_MIMETYPE	Specifies the default type of file that a connector acquires from a content management system.
GROUP_ROLES_PROPNAME_PREFIX	Identifies the prefix added to the front of the group ACL Scope ID when creating a group roles property name.
USER_ROLES_PROPNAME_PREFIX	Identifies the prefix added to the front of the user ACL Scope ID when creating a user roles property name.

PROPNAME_ACLGROUPS

Identifies a multiple-valued String property that gives the list of group ACL Scope IDs that are permitted RoleType.READER access to this document. If either of the PROPNAME_ACLGROUPS or PROPNAME_ACLUSERS properties are non-null, then the search appliance grants or denies access to this document for a given user on the basis of whether the user's name appears as one of the Scope IDs in the PROPNAME_ACLUSERS list or one of the user's groups appears as one of the Scope IDs in the PROPNAME_ACLGROUPS list. ACL Scope ID is a group or user name within the scope of the connector.

To specify more than just RoleType.READER access to the document, the connector must add additional multi-value role properties to the document. These entries are of the form:

Name = <GROUP_ROLES_PROPNAME_PREFIX> + <scopeId>
Value = [RoleType[, ...]]

Where: <GROUP_ROLES_PROPNAME_PREFIX> is the GROUP_ROLES_PROPNAME_PREFIX, <scopeId> is the group ACL Scope ID, and RoleType is one of the possible RoleType values. User ACL roles are of the form:

Name = <USER_ROLES_PROPNAME_PREFIX> + <scopeId>
Value = [RoleType[, ...]]

Where the <scopeId> is the user ACL scope ID. If the PROPNAME_ISPUBLIC is missing or is true, then this property is ignored, because the document is public.

If both the PROPNAME_ACLGROUPS and PROPNAME_ACLUSERS properties are null or empty, then the search appliance uses the authorization SPI to grant or deny access to this document.

The search appliance may be configured to bypass on-board authorization, in which case these properties are ignored, and the search appliance uses the authorization SPI to grant or deny access to this document.

Value: google:aclgroups

PROPNAME_ACLUSERS

Identifies a multiple-valued String property that gives the list of users who are permitted access to this document. For details, see PROPNAME_ACLGROUPS.

Value: google:aclusers

PROPNAME_ACTION

Enables a connector to communicate with the connector manager to add a document (the default) or to delete a document from the index on the search appliance. Possible values are ADD or DELETE.

The default is ADD, which adds documents to the search appliance index. To delete documents, set the property to DELETE. For more information and an example for using this property, see Deleting Documents From an Index.

If the action is determined to not be ADD or DELETE, then a WARNING is logged, and the property is ignored. The document is then sent without an action (which is effectively the same as an ADD).

Value: google:action

PROPNAME_CONTENT

Static variable in the SpiConstants class that identifies a property with a single value. This variable can be either a string or binary. This variable provides direct access to a document's content that a search appliance indexes. Don't use this property with PROPNAME_TITLE.

Important: Because the connector manager does not read an entire document into memory at the same time, the connector manager can treat the document as a stream and ensure that the memory that the connector manager uses stays constant. If possible, you should stream the documents through your connector as well, rather than realizing it in memory as a String or byte array.

Value: google:content

PROPNAME_CONTENTURL

Specifies the document's URL. This property is reserved for future use.

Value: google:contenturl

PROPNAME_DISPLAYURL

Identifies a property with a single value. The search appliance uses the display URL in a search results page as the primary user reference for a document. This URL may be different from the document's URL, if present, a document's URL should give direct access to the document file, whereas a display URL may point into the web client on the content management system. The search appliance only supports use of the display URL for content feed and not for metadata-and-URL feed.

Value: google:displayurl

PROPNAME_DOCID

Identifies a required string property with a single value that uniquely identifies a document to this connector. The internal structure of this string is opaque to the search appliance. The connector manager permits only printable, non-whitespace, ASCII characters in a document ID. You should implement this property using the natural ID in the content management system.

Important: PROPNAME_DOCID is required. For previous connector manager versions, this property was recommended.

Value: google:docid

PROPNAME_ISPUBLIC

Indicates whether a document is publicly readable or is a controlled-access document. A single-valued property.

Value: google:ispublic

PROPNAME_LASTMODIFIED

Identifies a single value property of Value.getDateValue(Calendar) or Value.getStringValue(String) using a RFC 822 formatted datetime string.

Note: Using PROPNAME_LASTMODIFIED is strongly recommended. For previous connector manager versions, this property was required.

Value: google:lastmodified

PROPNAME_MIMETYPE

Identifies a string property with a single value that gives the MIME type for the contents of this document. Use of PROPNAME_MIMETYPE is strongly recommended. If you do not supply this property, the connector manager uses the value of DEFAULT_MIMETYPE. The MIME type is used by the search appliance to convert files for indexing. The MIME type must agree with the types of files that are listed in the Indexable File Formats guide. For a list of MIME types by file type, see the Webmaster Toolkit site, Mime Types.

Value: google:mimetype

PROPNAME_SEARCHURL

Identifies an optional string property with a single value that, if present, the search appliance uses as the primary URL for this document - instead of the normal ^googleconnector:// URL that the connector manager designates.

Note: If you set this property, documents are sent to the search appliance as metadata and URL feed. If you do not set this property, documents are sent as a content feed. For more information, see Understanding Content Feed and Metadata-and-URL Feeds.

You can provide this property if you want the search appliance to do web-style authentication and authorization for this document. If you specify this property, the search appliance does not call back to the connector manager at serve time.

Value: google:searchurl

PROPNAME_SECURITYTOKEN

Identifies a string property with a single value that serves as a security token. This property is reserved for future use.

At serve time, the search appliance presents this token with the identity of the user who submits a search query, and the connector indicates whether the current user has permission to view a document in a class. You can implement this property as a textual representation of an ACL.

Value: google:securitytoken

PROPNAME_TITLE

Identifies an optional string property that is the title of the document. This value is useful for providing a title for documents that don't supply content, or for which a title cannot be automatically extracted from the supplied content. This property specifies the title of the search results when there is no content in a document. If there is content, the search appliance gets the title from the content and ignores this property. This property describes the name of the document object as it would appear in the content management system. If you specify PROPNAME_TITLE, don't specify PROPNAME_CONTENT.

Value: google:title

DEFAULT_MIMETYPE

A static single-valued property that is simply defined as text/html. It is the MIME type that is assumed if PROPNAME_MIMETYPE property is not supplied.

Value: text/html

GROUP_ROLES_PROPNAME_PREFIX

Prefix added to the front of the group ACL Scope ID when creating a group roles property name. If a connector defines specific roles associated with a group ACL Scope ID related to a document, store the roles in a multi-valued property named:

GROUP_ROLES_PROPNAME_PREFIX + <scopeId>

For example, given a group ACL entry of eng=reader,writer the roles for Eng would be stored in a property as follows:

Name = "google:group:roles:eng"  
Value = [reader, writer]

USER_ROLES_PROPNAME_PREFIX

Prefix added to the front of the user ACL Scope ID when creating a user roles property name. If a connector defines specific roles associated with a user ACL Scope ID related to a document, store the roles in a multi-valued property named:

USER_ROLES_PROPNAME_PREFIX + <scopeId>

For example, given a user ACL entry of yvette=reader,writer the roles for Yvette would be stored in a property as follows:

Name = "google:user:roles:yvette"  
Value = [reader, writer]

Deleting a Document From the Search Appliance Index

The PROPNAME_ACTION property and the SpiConstants.ActionType.DELETE field delete a document from an index on a Google Search appliance. For information on the properties, see SpiConstants Properties.

Note: Delete feeds are not allowed to send metadata or documents.

If a connector attempts to delete a document that is not in the search appliance index, the search appliance ignores the request and does not generate an error.

The following example sets the DELETE action property on Document objects that are returned from the DocumentList.nextDocument method:

// Create a Document object that requests that the
// specified document be removed from the search index.
public Document simpleDeleteDoc(String docId) {
  Map props = new HashMap();
  props.put(SpiConstants.PROPNAME_DOCID, docId);
  props.put(SpiConstants.PROPNAME_ACTION, SpiConstants.ActionType.DELETE.toString());
  return new SimpleDocument(props);
}

Logging

You can use logging to help in debugging and support. If you use the java.util.Logging package, log messages appear in $CATALINA_BASE/logs/google-connectors.*.log.

This section contains the following topics:

Establishing a Logger
Logging Levels
Logging Feed Data
Retrieving Log Files
Logging Byte Range Support
Using a logging.properties Configuration File

Establishing a Logger

The following example gets the name of the logging facility and establishes a logger:

import java.util.logging.Level;
import java.util.logging.Logger;
  
private static final Logger logger = Logger.getLogger(MyClass.class.getName());

Subsequent example method calls can be as follows:

if (logger.isLoggable(Level.INFO)) {
    logger.info("...");
}

You can maintain your own set of debug levels to use as a metric for message severity.

If you use the java.util.Logging package, log messages appear in the $CATALINA_HOME/logs/catalina.out file and exceptions appear in $CATALINA_HOME/logs/localhost.current_date.log file.

Logging Levels

Connectors use the following logging levels in the order of severity.

Level	Description
`SEVERE`	A severe error condition that prevents the proper function of the product, such as, a missing configuration file or an unreachable host. Prevents an operation from completing.
`WARNING`	Something unexpected happened but it may not be a problem, such as the `google:lastmodified` property is missing. Errors or misconfigurations that may require user intervention, but which don't prevent the operation from completing
`INFO`	Information. Indicates traversal started or disabled, `ConnectorType` and `Connector` configuration
`CONFIG`	Logs traversal messages in batches. Lists checkpoint events
`FINE`	Logs batch information. For example, FINE logs information such as pushing a document list or about a traversal. Indicates authentication and authorization requests.
`FINER`	Logs information per document. Log information for a document do not appear until this setting is enabled. Lists traversal messages per document.
`FINEST`	Logs everything -- the debug setting. Also lists traversal messages per property.

Configuration parameters:

Level	Description
`ALL`	Enables all levels.
`OFF`	Logging disabled.

Logging Feed Data

You can log the feed data sent to a search appliance. The logging facility uses the teedFeedFile mechanism that you can enable in the applicationContext.properties file. The logging configuration is not per connector type, it is global to the connector manager. The teedFeedFile logs the contents of all documents that go to a search appliance. The content in the log file is Base64-encoded and can grow into a large file. If the teedFeedFile doesn't exist and the teedFeedFile property is enabled in the applicationContext.properties file, the connector manager creates the log file.

The FEED_LOGGER logging mechanism enables you to log all metadata properties of the documents into a rolling log file. You can use the FEED_LOGGER functionality to observe the feed record and metadata information that the connector manager sends to the search appliance. The information is logged using a rolling FileHandler so the log file size can be controlled. FEED_LOGGER lists the properties, truncated documents, and an extra space as a separator between connector feed listings.

The data recorded in the teedFeedFile file is in the same format as web and content feeds. For more information on feeds, read the Feeds Protocol Developer's Guide.

If you enable the teedFeedFile feature, the size of the local file increases rapidly, depending on how much data is fed by the connector manager. Monitor the size of the file and truncate or edit the file as required.

To enable data feed to a local file:

Log on to the Apache Tomcat host with the user account under which Tomcat runs.
If you are running a connector manager earlier than r783, create a logging file.
1. Navigate to the directory where you want to create the local file.
2. Create an empty file and ensure that it is writable by the user account under which Tomcat runs. For example, you might call the file ConnectorFeedFile.txt.
3. Note the path to the empty file.
Shutdown the Tomcat instance that hosts the connector manager.
Navigate to the tomcat_home/connector-manager/WEB-INF/ directory.
Open the applicationContext.properties file in a text editor.
Add a property called teedFeedFile and set the value of the property to the desired file name or, if you are running a connector manager version earlier than r783, to the name of the file you created in step 2.
For example, if the file name is ConnectorFeedFile.txt, set the value of the property on either Windows or Linux as follows:

teedFeedFile=path_to_file/ConnectorFeedFile.txt

On Windows, the following format, which uses escaped backslashes, also works:

C:\\Program Files\\GoogleConnectors\\path_to_file\\ConnectorFeedFile.txt
Save the applicationContext.properties file.
Restart Tomcat.

To enable the feed log file:

Stop Apache Tomcat.
On a Windows system, use the Tomcat\bin\shutdown batch file command from a command prompt. On a Linux or Macintosh use the Tomcat/bin/shutdown.sh shell command from a Terminal command line interface.
Configure logging from the feedLoggingLevel property or the logging.properties configuration file -- you can configure logging with either the property or configuration file, but do not use both configuration procedures:
- From the feedLoggingLevel property:
  1. Edit the applicationContext.properties file deployed in the webapps/connector-manager/WEB-INF/ directory
  2. Set the feedLoggingLevel property to ALL: feedLoggingLevel=ALL. By default feedLoggingLevel is set to OFF so that this facility is disabled.
- From the logging.properties configuration file -- Note: If you configured logging using the feedLoggingLevel property, skip this step--do not use both configuration procedures:
  1. Edit the logging.properties file currently being used by the connector manager.
  2. Add the following: com.google.enterprise.connector.pusher.DocPusher.FEED_WRAPPER.FEED.level=FINER. For information on DocPusher, see Pushing Documents to a Search Appliance.
Change logging parameters.
You can change logging parameters from the FeedHandler bean in the applicationContext.xml file. The index="0" parameter specifies the path of the log files. The index="1" parameter specifies the maximum size in bytes that a log file can grow to before a new log file creates. The index="2" parameter specifies the number of log files that can be created. The default values of 50 MB for file size and 10 increments of log files are as follows:
```
<bean id="FeedHandler" class="java.util.logging.FileHandler">
  <constructor-arg index="0" value="${catalina.base}/logs/google-connectors.feed%g.log" />
  <constructor-arg index="1" value="52428800" />
  <constructor-arg index="2" value="10" />
  <property name="level" value="FINER" />
  <property name="encoding" value="UTF-8" />
  <property name="formatter" ref="FeedFormatter" />
</bean> 
```
Restart Tomcat.
On a Windows system, use the Tomcat\bin\startup batch file command from a command prompt. On a Linux or Macintosh use the Tomcat/bin/startup.sh shell command from a Terminal command line interface.

Feed log records appear in the following file when the records are set to the search appliance:

$CATALINA_BASE/logs/google-connectors.feed*.log

Retrieving Log Files

The connector manager enables you to list available log files, fetch individual log files, and fetch a ZIP archive of all the log files. This feature enables you to retrieve connector log files, feed log files, and teed feed files. Access to the servlet is limited to the local host.

To list the available connector log files:

http://CMHostAddress/connector-manager/getConnectorLogs

To view an individual connector log file:

http://CMHostAddress/connector-manager/getConnectorLogs/LogFileName

Where LogFileName is the name of a log file returned by the list, for example, google-connectors.otex0.log. As a convenience, the log name can be the log file generation number, 0 in the example, which automatically increments.

To retrieve a ZIP archive of all the connector log files:

http://CMHostAddress/connector-manager/getConnectorLogs/*

Or:

http://CMHostAddress/connector-manager/getConnectorLogs/ALL

To list the available feed log files:

http://CMHostAddress/connector-manager/getFeedLogs

To view an individual feed log file:

http://CMHostAddress/connector-manager/getFeedLogs/LogFileName

Where LogFileName is the name of a log file returned by the list, for example, google-connectors.feed0.log. As a convenience, the log name can be the log file generation number, 0 in the example, which automatically increments.

To retrieve a ZIP archive of all the feed log files:

http://CMHostAddress/connector-manager/getFeedLogs/*

Or:

http://CMHostAddress/connector-manager/getFeedLogs/ALL

To list the name and size of the teed feed file:

http://CMHostAddress/connector-manager/getTeedFeedFile

To view the teed feed file:

http://CMHostAddress/connector-manager/getTeedFeedFile/TeedFeedName

Where TeedFeedName is the base filename of the teed feed file.

Note: The teed feed file can be huge. Either request a manageable byte range (see Logging Byte Range Support) or fetch the ZIP archive file (which may still be huge).

To retrieve a ZIP archive of the teed feed file:

http://CMHostAddress/connector-manager/getTeedFeedFile/*

Or:

http://CMHostAddress/connector-manager/getTeedFeedFile/ALL

Or:

http://CMHostAddress/connector-manager/getTeedFeed/TeedFeedName.zip

Where TeedFeedName is the filename of the teed feed file.

Logging Byte Range Support

This servlet supports a subset of the RFC 2616 byte range specification to retrieve portions of the log files. Since the connector logs are 50 MB each and the teed feed file can be gigabytes, requesting a portion of the log may be prudent. This servlet supports specifying a byte range in either the HTTP range header or in the query fragment of the request.

To return the first 1001 bytes of the current feed log:

http://ConMgrHostAddress/connector-manager/getFeedLogs/0?bytes=0-1000

To return 1000 bytes (the tail) of the current feed log:

http://ConMgrHostAddress/connector-manager/getFeedLogs/0?bytes=-1000

To return everything after the first 1000 bytes of the current feed log:

http://ConMgrHostAddress/connector-manager/getFeedLogs/0?bytes=1000-

Note: Multi-part byte ranges are not supported (such as, bytes=0-100,1000-2000). Byte range requests for log listing pages and ZIP archive files are ignored.

Using a logging.properties Configuration File

Connectors can include a copy of the logging.properties file at the root level of the connector JAR file to provide logging levels specific to the connector and any third-party libraries that accompany the connector. The logging.properties file contains one line per package for third party libraries that the connector includes. For an example, see the Documentum connector logging.properties file. The copy of the logging.properties file in the WEB-INF/classes directory (which comes from the connector, either using the installer or manual installation) overrides the copy in the JAR file in the WEB-INF/lib directory.

To use a logging.properties configuration file to log URLs and metadata:

Log on to the Apache Tomcat host with the user account under which Tomcat runs.
Shut down the Tomcat instance that hosts the connector manager.
Navigate to the logging.properties file.
- If you installed the connector using the installer, the file is in the ConnectorName/Tomcat/webapps/connector-manager/WEB-INF/classes/ directory.
- If you installed the connector manually, navigate to the location where you created a logging.properties file. The logging.properties file is probably in the $CATALINA_HOME/webapps/connector-manager/WEB-INF/classes directory. If not, copy the logging.properties file from the $JAVA_HOME/lib/ directory to the $CATALINA_HOME/webapps/connector-manager/WEB-INF/classes directory. You might have to create the /classes directory manually.
Open the logging.properties file in a text editor.

Add the following line to the file:

com.google.enterprise.connector.pusher.DocPusher.FEED_WRAPPER.FEED.level=FINER

Save the logging.properties file.
Retart Tomcat.
The logging information is recorded in ConnectorName/Tomcat/logs/google-connectors.feed%g.log, where %g is the generation number of a rotated log.

Count of Feed Files Awaiting Indexing

To view a count of how many feed files (including connector feed files) remain for the search appliance to process into its index, add /getbacklogcount to a search appliance URL at port 19900.

The syntax for /getbacklogcount is as follows:

http://SearchApplianceHostname:19900/getbacklogcount

File Access

The connector manager creates a separate directory for each connector instance. The name of this directory is made available to the connector through Spring Framework.

After you create a connector instance, the connector manager creates a .properties file for a connector, for example, myconnector.properties. The properties file contains data for the configuration form. The properties file also contains the following additional keys:

googleConnectorWorkDir.
A work directory that is private to the connector instance. Because this directory is individual to a connector instance, a connector can store anything in this directory without concern for name clashes with another connector instance or type.
googleWorkDir.
A directory that all connectors and all instances share. A connector can store global configuration information if it is careful about its namespace; for example, if a connector uses its type name as a prefix. But for data private to a connector instance, such as state data, Google recommends that you use the googleConnectorWorkDir.
googlePropertiesVersion.
Describes the version number of the connector.properties file. For the current release, the value is 2.

Scheduling and Checkpoint Data Files

The connector manager creates two files for storing scheduling and the checkpoint traversal state:

ConnectorName_schedule.txt.
ConnectorName_state.txt

These files do not contain configurable information, but are described so that you can identify them when they appear.

Accessing Properties using Spring Framework Inversion of Control

To access a property, a connector supplies a setter to pass the value to the connector instance.

For example, if a connector uses two pieces of configuration, a user name and password, and the connector also wants to know the googleConnectorWorkDir, the connector would specify the following setters in its class:

public class MyConnectorImpl implements Connector {
   public void setUsername(String username) {
      // ...
   }
   public void setPassword(String password) {
      // ...
   }
   public void setGoogleConnectorWorkDir(String googleConnectorWorkDir {
      // ...
   }//...
}

The connectorInstance.xml entry is as follows:

<?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" 
  "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
  <bean id="MyConnectorInstance"
     class="com.mycompany.myconnector.MyConnectorImpl">
     <property name="username" value="${username}"/>
     <property name="password" value="${password}"/>
     <property name="googleConnectorWorkDir" value="${googleConnectorWorkDir}"/>
  </bean>
</beans>

You can specify the googleWorkDir in the same way using a property name= attribute.

Managing Work Queue Timeouts

The WorkQueue class provides a constructor that creates a WorkQueue with a given specification and sets the timeout of the WorkQueueItem. This feature enables a connector to extend that duration before killThreadTimeout triggers from the default value of 10 minutes and to change the workItemTimeout from its default of 20 minutes.

Note: Allowing long-running threads up to 30 minutes to complete, dramatically reduces the cases where slow traversals get killed by the connector manager.

To create a work queue, provide:

The number of threads on which to execute work.
The number of milliseconds until the work queue is killed.
The number of milliseconds that each work queue item can process before being interrupted.

To change the default work queue values:

Stop Apache Tomcat.
On a Windows system, use the Tomcat\bin\shutdown batch file command from a command prompt. On a Linux or Macintosh use the Tomcat/bin/shutdown.sh shell command from a terminal command line interface.

Edit the applicationContext.xml file and find the bean that creates the WorkQueue. If needed, change a thread time out duration to a new value in milliseconds.

For example:

<bean id="WorkQueue"
  class="com.google.enterprise.connector.common.WorkQueue">
  <!-- Number of worker threads processing in the queue.   -->
  <constructor-arg index="0" type="int" value="10"/>

  <!-- Number of milliseconds before a worker thread is    -->
  <!-- considered timed out. A value of 0 means it never   -->
  <!-- times out. Default: 20 minutes                      -->    
  <constructor-arg index="2" type="long" value="1200000"/>
    
  <!-- Number of milliseconds after a timed-out worker     -->
  <!-- thread is interrupted before it is killed outright. -->
  <!-- Default: 10 minutes                                 -->    
  <constructor-arg index="1" type="long" value="600000"/>
</bean>

Restart Tomcat.
On a Windows system, use the Tomcat\bin\startup batch file command from a command prompt. On a Linux or Macintosh use the Tomcat/bin/startup.sh shell command from a Terminal command line interface.

Previous Chapter: SPI Overview
Next Chapter: Configuration