TraversalManager

All Known Implementing Classes:

DiffingConnectorTraversalManager
```
public interface TraversalManager
```
Interface for implementing query-based traversal.
Query-based traversal is a scheme whereby a repository is traversed according to a query that visits each document in a natural order that is efficiently supported by the underlying repository and can be easily checkpointed and restarted.
A good use case is a repository that supports access to documents in last-modified-date order. In particular, suppose a repository supports a query analogous to the following SQL query (the repository need not support SQL; SQL is used here only as an example):
```
        select documentid, lastmodifydate from documents
        where  lastmodifydate > date-constant
        order by lastmodifydate
 
```
Such a repository can easily be traversed by lastmodifydate, and the state of the traversal is easily encapsulated in a single, small data item: the date of the last document processed. Increasing last-modified-date order is convenient because if a document is processed during traversal, but then later modified, then it will be picked up again later in the traversal process. Thus, this traversal is appropriate both for initial load and for incremental update.
For such a repository, the implementor is urged to let the Connector Manager (the caller) maintain the traversal state. This is achieved by implementing the interface methods as follows:
- startTraversal() Run a query that starts from the beginning, such as
```
   select documentid, lastmodifydate from documents order by lastmodifydate
 
```
- resumeTraversal(String checkpoint) Run a query that resumes traversal from the supplied checkpoint.
Checkpoints are supplied by the DocumentList.checkpoint() method.
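The scheme above can be sketched with an in-memory list standing in for the repository. This is a minimal illustration, not the SPI itself: DatedDoc, DatedBatch and DateOrderTraversal are hypothetical simplified stand-ins for Document, DocumentList and TraversalManager, and the checkpoint is assumed to be the last-modified time of the last document handed out, encoded as a decimal string.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical repository row: an id plus a last-modified time. */
final class DatedDoc {
    final String id;
    final long lastModified;

    DatedDoc(String id, long lastModified) {
        this.id = id;
        this.lastModified = lastModified;
    }
}

/** Simplified stand-in for DocumentList (assumption, not the real interface). */
final class DatedBatch {
    private final List<DatedDoc> docs;
    private final String startCheckpoint;
    private int next = 0;

    DatedBatch(List<DatedDoc> docs, String startCheckpoint) {
        this.docs = docs;
        this.startCheckpoint = startCheckpoint;
    }

    /** Returns the next document, or null when the batch is exhausted. */
    DatedDoc nextDocument() {
        return next < docs.size() ? docs.get(next++) : null;
    }

    /** Encodes the date of the last document handed out; never returns null. */
    String checkpoint() {
        return next == 0 ? startCheckpoint : Long.toString(docs.get(next - 1).lastModified);
    }
}

/** Simplified stand-in for a TraversalManager whose state lives in the checkpoint. */
final class DateOrderTraversal {
    private final List<DatedDoc> repository;
    private int batchHint = 10;

    DateOrderTraversal(List<DatedDoc> repository) {
        this.repository = new ArrayList<>(repository);
        // Mirrors "order by lastmodifydate".
        this.repository.sort(Comparator.comparingLong((DatedDoc d) -> d.lastModified));
    }

    void setBatchHint(int hint) {
        if (hint > 0) {
            batchHint = hint;
        }
    }

    DatedBatch startTraversal() {
        return query(Long.MIN_VALUE);
    }

    DatedBatch resumeTraversal(String checkpoint) {
        return query(Long.parseLong(checkpoint));
    }

    /** Mirrors "where lastmodifydate > date-constant", limited to batchHint rows. */
    private DatedBatch query(long after) {
        List<DatedDoc> out = new ArrayList<>();
        for (DatedDoc d : repository) {
            if (d.lastModified > after && out.size() < batchHint) {
                out.add(d);
            }
        }
        // Null signals that the traversal is completely up to date.
        return out.isEmpty() ? null : new DatedBatch(out, Long.toString(after));
    }
}
```

Because the checkpoint is derived from the last document actually handed out, a caller that stops partway through a batch still resumes at the right place. A real implementation would also need to handle documents that share the same timestamp, which this sketch ignores.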
Please observe that the Connector Manager (the caller) makes no guarantee to consume the entire DocumentList returned by either the startTraversal or resumeTraversal calls. The Connector Manager will consume as many documents as it chooses, depending on load, schedule and other factors. The Connector Manager guarantees to call checkpoint after handling the last document it has successfully processed from the DocumentList it was using. Thus, the implementor is free to use a query that only returns a small number of results, if that gets better performance.
For example, to continue the SQL analogy, a query like this could be used:
```
        select TOP 10 documentid, lastmodifydate from documents ...
 
```
The setBatchHint method is provided so that the Connector Manager can tell the implementation that it only wants that many results per call. This is a hint - the implementation need not observe it. The implementation is free to return a DocumentList with fewer or more results. For example, the traversal may be completely up to date, so perhaps there are no results to return. Or, for internal reasons, the implementation may not want to return the full batchHint number of results. When returning more results than the hint, some or all of the extra documents may be ignored.
The Connector Manager makes a distinction between the return of a null DocumentList and an empty DocumentList (a DocumentList with zero entries). Returning a null DocumentList will have an impact on scheduling - the Connector Manager may choose to wait longer after receiving a null result before it calls again. Also, if a null result is returned, the Connector Manager will not [indeed, cannot] call checkpoint before calling start or resume traversal again. Returning a null DocumentList is suitable when a traversal is completely up to date, with no new documents available and no new checkpoint state.
Returning an empty DocumentList will probably not have an impact on scheduling. The Connector Manager will call checkpoint, and will likely call resumeTraversal again immediately. Returning an empty DocumentList is not appropriate if a traversal is completely up to date, as it would effectively induce a spin, constantly calling resumeTraversal when it has no work to do. Returning an empty DocumentList is a convenient way to indicate to the Connector Manager that, although no documents were provided in this batch, the Connector wishes to continue searching the repository for suitable content. The call to checkpoint allows the Connector to record its progress through the repository. This mechanism is suitable for cases when the search for suitable content may exceed the Connector Manager's timeout.
If the Connector returns a non-null DocumentList, even one with zero entries, the Connector Manager will nearly always call checkpoint when it has finished processing the DocumentList.
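The distinction between a null and an empty result can be illustrated with a connector that scans the repository in bounded chunks and filters out unsuitable items. The sketch below rests on assumptions: ScanBatch and FilteringScanner are hypothetical stand-ins rather than SPI types, the ".doc" filter is invented for the example, and the checkpoint is simply the next scan position.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for DocumentList (assumption, not the real interface). */
final class ScanBatch {
    private final List<String> docIds;
    private final int nextPosition;

    ScanBatch(List<String> docIds, int nextPosition) {
        this.docIds = docIds;
        this.nextPosition = nextPosition;
    }

    List<String> docIds() {
        return docIds;
    }

    /**
     * Records how far the scan has progressed, even if nothing was found.
     * Simplification: assumes the caller consumed the whole batch.
     */
    String checkpoint() {
        return Integer.toString(nextPosition);
    }
}

final class FilteringScanner {
    private final List<String> items;
    private final int scanLimit; // maximum items examined per call

    FilteringScanner(List<String> items, int scanLimit) {
        this.items = items;
        this.scanLimit = scanLimit;
    }

    /** Hypothetical filter: only names ending in ".doc" are suitable content. */
    private static boolean suitable(String item) {
        return item.endsWith(".doc");
    }

    ScanBatch resumeTraversal(String checkpoint) {
        int start = Integer.parseInt(checkpoint);
        if (start >= items.size()) {
            // Completely up to date: null lets the caller back off before retrying.
            return null;
        }
        int end = Math.min(start + scanLimit, items.size());
        List<String> found = new ArrayList<>();
        for (int i = start; i < end; i++) {
            if (suitable(items.get(i))) {
                found.add(items.get(i));
            }
        }
        // Possibly empty: there is more to scan, so ask to be called again soon.
        return new ScanBatch(found, end);
    }
}
```

Bounding the work per call keeps each call short, and the empty batch still advances the checkpoint so that the next call continues from where the scan stopped.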
An implementation need not let the Connector Manager store the traversal state; it may choose to store the state itself. Implementors are discouraged from using this technique unless necessary, because it makes transactionality more difficult and it introduces resource dependencies of which the Connector Manager is unaware. However, there may be repositories which have a natural traversal order, but the state of this traversal is not easily expressed in a small data item. For example, a repository may consist of a large number of named sub-repositories, each of which can be traversed in modify date order, but for which there is no convenient way of traversing them all in one query. In this case, the implementation may choose to maintain state itself, as a table of pairs: (sub-repository-name, per-repository-date-stamp). In such a case, the implementor may implement the interface methods as follows:
- startTraversal() Clear the internal state. Return the first few documents.
- resumeTraversal(String checkpoint) Resume traversal according to the internal state of the implementation. The Connector Manager will pass in whatever checkpoint String was returned by the last call to DocumentList.checkpoint() but the implementation is free to ignore this and use its internal state. However, even in this case, checkpoint must not return a null String.
The implementation must be careful about when and how it commits its internal state to external storage. Remember again that the Connector Manager makes no guarantee to consume the entire result set returned by a traversal call. If the Connector Manager does not call checkpoint, the implementation should not assume that the documents returned by DocumentList.nextDocument() have been processed. The implementation should wait until the checkpoint call, and only commit the state up to the last document returned.
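A minimal sketch of this commit discipline follows, under stated assumptions: SubRepoDoc, SelfTrackingBatch and SelfTrackingTraversal are hypothetical stand-ins for the SPI types, the table of (sub-repository-name, date-stamp) pairs is an in-memory map rather than external storage, and documents within each sub-repository are assumed to arrive in date order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical document belonging to a named sub-repository. */
final class SubRepoDoc {
    final String subRepo;
    final String id;
    final long stamp;

    SubRepoDoc(String subRepo, String id, long stamp) {
        this.subRepo = subRepo;
        this.id = id;
        this.stamp = stamp;
    }
}

/** Simplified stand-in for DocumentList that commits state only on checkpoint(). */
final class SelfTrackingBatch {
    private final List<SubRepoDoc> docs;
    private final Map<String, Long> committed;
    private int handedOut = 0;

    SelfTrackingBatch(List<SubRepoDoc> docs, Map<String, Long> committed) {
        this.docs = docs;
        this.committed = committed;
    }

    SubRepoDoc nextDocument() {
        return handedOut < docs.size() ? docs.get(handedOut++) : null;
    }

    String checkpoint() {
        // Commit only up to the last document actually returned by nextDocument().
        for (int i = 0; i < handedOut; i++) {
            SubRepoDoc d = docs.get(i);
            committed.merge(d.subRepo, d.stamp, Long::max);
        }
        return "internal-state"; // placeholder value; must never be null
    }
}

final class SelfTrackingTraversal {
    /** Table of (sub-repository-name, date-stamp) pairs; in-memory for this sketch. */
    final Map<String, Long> committed = new HashMap<>();
    private final List<SubRepoDoc> repository;

    SelfTrackingTraversal(List<SubRepoDoc> repository) {
        this.repository = repository;
    }

    SelfTrackingBatch startTraversal() {
        committed.clear();
        return resumeTraversal("");
    }

    /** The supplied checkpoint is ignored; the committed table drives the query. */
    SelfTrackingBatch resumeTraversal(String ignoredCheckpoint) {
        List<SubRepoDoc> pending = new ArrayList<>();
        for (SubRepoDoc d : repository) {
            if (d.stamp > committed.getOrDefault(d.subRepo, Long.MIN_VALUE)) {
                pending.add(d);
            }
        }
        return pending.isEmpty() ? null : new SelfTrackingBatch(pending, committed);
    }
}
```

If the caller abandons a batch without calling checkpoint, the committed table is untouched and the same documents are offered again on the next call.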
Note on "Metadata and URL" feeds vs. Content feeds:
Some repositories are fully web-enabled but are difficult or impossible for the Search Appliance to crawl, because they make heavy use of ASP or JSP, or they have a metadata model that is not conveniently accessible with the content in a single page. Such repositories are good candidates for connectors. However, a developer may choose not to implement authentication and authorization through a connector. It may be sufficient to use standard web mechanisms for these tasks.
The developer can achieve this by following these steps. In the document list returned by the traversal methods, specify the SpiConstants.PROPNAME_SEARCHURL property. The value should be a URL. If this property is specified, the Connector Manager will use a "URL Feed" rather than a "Content Feed" for that document. In this case, the implementor should not supply the content of the document. The Search Appliance will fetch the content from the specified URL. Also, this URL will be used to trigger normal authentication and authorization for that document. For more details, see the documentation on Metadata and URL Feeds.
Note on Documents returned by traversal calls:
The Document objects returned by the queries defined here must contain special properties according to the following rules:
- SpiConstants.PROPNAME_DOCID This property must be present.
- SpiConstants.PROPNAME_SEARCHURL If present, this means that the Connector Manager will generate a Metadata and URL feed, with the specified URL. If this is present, then the SpiConstants.PROPNAME_CONTENT property should not be.
- SpiConstants.PROPNAME_CONTENT This property should hold the content of the document. If present, the connector framework will base-64 encode the value and present it to the Search Appliance as the primary content to be indexed. If this is present, then the SpiConstants.PROPNAME_SEARCHURL property should not be.
- SpiConstants.PROPNAME_DISPLAYURL If present, this will be used as the primary link on a results page. This should not be used with SpiConstants.PROPNAME_SEARCHURL.
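The rules above can be expressed as a small pre-flight check that a connector might run on each document before returning it. This is a sketch under assumptions: DocumentPropertyRules is a hypothetical helper, a document is modeled as a plain map of property names to values, and the key strings are placeholders standing in for the corresponding SpiConstants values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that checks a document's special properties. */
final class DocumentPropertyRules {
    // Placeholder keys; a real connector would use the SpiConstants.PROPNAME_* values.
    static final String DOCID = "docid";
    static final String SEARCHURL = "searchurl";
    static final String CONTENT = "content";
    static final String DISPLAYURL = "displayurl";

    private DocumentPropertyRules() {
    }

    /** Returns a description of each rule the property map breaks; empty if none. */
    static List<String> violations(Map<String, String> props) {
        List<String> problems = new ArrayList<>();
        if (!props.containsKey(DOCID)) {
            problems.add("docid must be present");
        }
        boolean hasSearchUrl = props.containsKey(SEARCHURL);
        if (hasSearchUrl && props.containsKey(CONTENT)) {
            problems.add("searchurl and content should not both be present");
        }
        if (hasSearchUrl && props.containsKey(DISPLAYURL)) {
            problems.add("displayurl should not be used with searchurl");
        }
        return problems;
    }
}
```

A document carrying a docid and content passes with no violations, while one carrying both a search URL and content but no docid is reported twice.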
Since:

1.0

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `DocumentList` | `resumeTraversal(java.lang.String checkPoint)` Continues traversal from a supplied checkpoint. |
| `void` | `setBatchHint(int batchHint)` Sets the preferred batch size. |
| `DocumentList` | `startTraversal()` Starts (or restarts) traversal from the beginning. |

- Method Detail
  - startTraversal
```
DocumentList startTraversal()
                     throws RepositoryException
```
    Starts (or restarts) traversal from the beginning. This action will return objects starting from the very oldest, or with the smallest IDs, or whatever natural order the implementation prefers. The caller may consume as many or as few of the results as it wants, but it guarantees to call DocumentList.checkpoint() passing in the last object it has successfully processed.
    
    Returns:
    
    A DocumentList of documents from the repository in natural order, or null if there are no documents.
    
    Throws:
    
    RepositoryException - if the repository is unreachable or a similar exceptional condition occurs.
  - resumeTraversal
```
DocumentList resumeTraversal(java.lang.String checkPoint)
                      throws RepositoryException
```
    Continues traversal from a supplied checkpoint. The checkPoint parameter will have been created by a call to the DocumentList.checkpoint() method. The DocumentList object returns objects from the repository in natural order starting just after the document that was used to create the checkpoint string.
    
    Parameters:
    
    checkPoint - String that indicates from where to resume traversal.
    
    Returns:
    
    DocumentList object that returns documents starting just after the checkpoint, or null if there are no documents.
    
    Throws:
    
    RepositoryException
  - setBatchHint
```
void setBatchHint(int batchHint)
           throws RepositoryException
```
    Sets the preferred batch size. The caller advises the implementation that the result sets returned by startTraversal or resumeTraversal should be as close to this number as is reasonable. The implementation may ignore this call or do its best to return approximately this number.
    
    Parameters:
    
    batchHint - the preferred number of documents to return in each batch.
    
    Throws:
    
    RepositoryException
