public interface TraversalManager
Query-based traversal is a scheme whereby a repository is traversed according to a query that visits each document in a natural order that is efficiently supported by the underlying repository and can be easily checkpointed and restarted.
A good use case is a repository that supports access to documents in last-modified-date order. In particular, suppose a repository supports a query analogous to the following SQL query (the repository need not support SQL; SQL is used here only as an example):
select documentid, lastmodifydate from documents
where lastmodifydate > date-constant
order by lastmodifydate
Such a repository can easily be traversed by lastmodifydate, and the state of the traversal is easily encapsulated in a single, small data item: the date of the last document processed. Increasing last-modified-date order is convenient because if a document is processed during traversal, but then later modified, then it will be picked up again later in the traversal process. Thus, this traversal is appropriate both for initial load and for incremental update.
For such a repository, the implementor is urged to let the Connector Manager (the caller) maintain the traversal state. This is achieved by implementing the interface methods as follows:
startTraversal() Run a query that starts from the
beginning, such as
select documentid, lastmodifydate from documents order by lastmodifydate
resumeTraversal(String checkpoint) Run a query that
resumes traversal from the supplied checkpoint. The checkpoint is the String
returned by the DocumentList.checkpoint() method.
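The two methods above can be sketched as follows. This is a minimal illustration, not the SPI itself: the nested Row and Batch classes are simplified stand-ins for Document and DocumentList, the repository is an in-memory list, and the checkpoint is assumed to be the last-modified time encoded as a decimal string.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch only: the nested types are simplified stand-ins, not the real SPI. */
class DateOrderedTraversalSketch {

  /** Hypothetical repository row: a document id and its last-modified time. */
  static final class Row {
    final String id;
    final long lastModified;

    Row(String id, long lastModified) {
      this.id = id;
      this.lastModified = lastModified;
    }
  }

  /** Stand-in for DocumentList: hands out rows and produces a checkpoint. */
  static final class Batch {
    private final List<Row> rows;
    private final String startCheckpoint;
    private int next = 0;

    Batch(List<Row> rows, String startCheckpoint) {
      this.rows = rows;
      this.startCheckpoint = startCheckpoint;
    }

    /** Returns the next row, or null when the batch is exhausted. */
    Row nextDocument() {
      return next < rows.size() ? rows.get(next++) : null;
    }

    /** Encodes the date of the last row handed out; no other state is kept. */
    String checkpoint() {
      return next == 0 ? startCheckpoint : Long.toString(rows.get(next - 1).lastModified);
    }
  }

  private final List<Row> repository; // in-memory stand-in for the repository

  DateOrderedTraversalSketch(List<Row> rows) {
    repository = new ArrayList<>(rows);
    repository.sort(Comparator.comparingLong((Row r) -> r.lastModified));
  }

  /** Like: select ... from documents order by lastmodifydate */
  Batch startTraversal() {
    return query(Long.MIN_VALUE);
  }

  /** Like the same query, restricted to rows modified after the checkpoint date. */
  Batch resumeTraversal(String checkpoint) {
    return query(Long.parseLong(checkpoint));
  }

  private Batch query(long after) {
    List<Row> result = new ArrayList<>();
    for (Row row : repository) {
      if (row.lastModified > after) {
        result.add(row);
      }
    }
    return result.isEmpty() ? null : new Batch(result, Long.toString(after));
  }
}
```

The sketch assumes that last-modified times are unique. If several documents can share a timestamp, a strict "greater than" comparison could skip some of them after a restart, so a real checkpoint would probably need a tie-breaker such as the document id.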
Please observe that the Connector Manager (the caller) makes no guarantee
to consume the entire DocumentList returned by either the
startTraversal or resumeTraversal calls.
The Connector Manager will consume as many as it chooses, depending on load,
schedule and other factors. The Connector Manager guarantees to call
checkpoint after handling the last document it has
successfully processed from the DocumentList it was using.
Thus, the implementor is free to use a query that only returns a small
number of results, if that gets better performance.
For example, to continue the SQL analogy, a query like this could be used:
select TOP 10 documentid, lastmodifydate from documents ...
The setBatchHint method is provided so that the Connector
Manager can tell the implementation that it only wants that many results per
call. This is a hint - the implementation need not observe it. The
implementation is free to return a DocumentList with fewer or more
results. For example, the traversal may be completely up to date, so perhaps
there are no results to return. Or, for internal reasons, the implementation
may not want to return the full batchHint number of results. When returning
more results than the hint, some or all of the extra documents may be
ignored.
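One way to honor the hint is to cap each batch at the hinted size, much as TOP does in the SQL analogy. The sketch below is illustrative only: the default size of 100, the decision to ignore non-positive hints, and the nextBatch helper are all assumptions rather than behavior required by the interface.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: shows one possible treatment of the batch hint. */
class BatchHintSketch {
  private static final int DEFAULT_BATCH = 100; // assumed default, not from the SPI
  private int batchHint = DEFAULT_BATCH;

  /** Remembers the hint; non-positive values are ignored (an assumption). */
  void setBatchHint(int batchHint) {
    if (batchHint > 0) {
      this.batchHint = batchHint;
    }
  }

  /**
   * Hypothetical helper: returns at most batchHint ids from the candidates,
   * or null when there is nothing to return.
   */
  List<String> nextBatch(List<String> candidates) {
    if (candidates.isEmpty()) {
      return null;
    }
    int size = Math.min(batchHint, candidates.size());
    return new ArrayList<>(candidates.subList(0, size));
  }
}
```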
The Connector Manager makes a distinction between the return of a
null DocumentList and an empty DocumentList (a DocumentList with
zero entries). Returning a null DocumentList will have an impact on
scheduling - the Connector Manager may choose to wait longer after receiving
a null result before it calls again. Also, if a null result
is returned, the Connector Manager will not (indeed, cannot) call
checkpoint before calling startTraversal or resumeTraversal again. Returning
a null DocumentList is suitable when a traversal is completely up to
date, with no new documents available and no new checkpoint state.
Returning an empty DocumentList will probably not have an impact on
scheduling. The Connector Manager will call checkpoint,
and will likely call resumeTraversal again immediately.
Returning an empty DocumentList is not appropriate if a traversal is
completely up to date, as it would effectively induce a spin, constantly
calling resumeTraversal when it has no work to do.
Returning an empty DocumentList is a convenient way to indicate to the
Connector Manager that, although no documents were provided in this
batch, the Connector wishes to continue searching the repository for
suitable content. The call to checkpoint allows the
Connector to record its progress through the repository. This mechanism
is suitable for cases when the search for suitable content may exceed
the Connector Manager's timeout.
If the Connector returns a non-null DocumentList, even
one with zero entries, the Connector Manager will nearly always call
checkpoint when it has finished processing the DocumentList.
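The difference between the two return values can be illustrated with a connector that scans its repository one slice at a time and filters out unsuitable items. In this sketch the Batch class is a simplified stand-in for DocumentList, the checkpoint is assumed to be a scan position, and the ".txt" filter is a hypothetical suitability test.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: null means "up to date", empty means "keep scanning". */
class NullVersusEmptySketch {

  /** Stand-in for DocumentList: possibly empty, but always has a checkpoint. */
  static final class Batch {
    final List<String> documents;
    private final int nextPosition;

    Batch(List<String> documents, int nextPosition) {
      this.documents = documents;
      this.nextPosition = nextPosition;
    }

    /** Records how far the scan has progressed, even if nothing was found. */
    String checkpoint() {
      return Integer.toString(nextPosition);
    }
  }

  private final List<String> repository;
  private final int sliceSize;

  NullVersusEmptySketch(List<String> repository, int sliceSize) {
    this.repository = repository;
    this.sliceSize = sliceSize;
  }

  /** Scans one slice of the repository, starting at the checkpointed position. */
  Batch resumeTraversal(String checkpoint) {
    int start = Integer.parseInt(checkpoint);
    if (start >= repository.size()) {
      return null; // up to date: no documents and no new checkpoint state
    }
    int end = Math.min(start + sliceSize, repository.size());
    List<String> suitable = new ArrayList<>();
    for (String name : repository.subList(start, end)) {
      if (name.endsWith(".txt")) { // hypothetical suitability filter
        suitable.add(name);
      }
    }
    return new Batch(suitable, end); // may be empty; the checkpoint still advances
  }
}
```

Because every non-null batch advances the position, a long run of unsuitable items is worked through in bounded steps rather than in one call that might exceed the timeout.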
An implementation need not let the Connector Manager store the traversal state; it may choose to store the state itself. Implementors are discouraged from using this technique unless necessary, because it makes transactionality more difficult and it introduces resource dependencies of which the Connector Manager is unaware. However, there may be repositories that have a natural traversal order, but for which the state of the traversal is not easily expressed in a small data item. For example, a repository may consist of a large number of named sub-repositories, each of which can be traversed in modify date order, but for which there is no convenient way of traversing them all in one query. In this case, the implementation may choose to maintain state itself, as a table of pairs: (sub-repository-name, per-repository-date-stamp). In such a case, the implementor may implement the interface methods as follows:
startTraversal() Clear the internal state. Return the
first few documents.
resumeTraversal(String checkpoint) Resume traversal
according to the internal state of the implementation. The Connector Manager
will pass in whatever checkpoint String was returned by the last call to
DocumentList.checkpoint(), but the implementation is free to ignore
this and use its internal state. However, even in this case,
checkpoint must not return a null String.
The implementation must be careful about when it commits its internal state:
the Connector Manager does not guarantee that all documents obtained through
DocumentList.nextDocument() have
been processed. The implementation should wait until the checkpoint call, and
only commit the state up to the last document returned.
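A rough sketch of this self-managed variant follows. Everything in it is an assumption made for illustration: the Doc and Batch classes stand in for Document and DocumentList, the repository is an in-memory list, and the returned checkpoint is a fixed placeholder string. The point it tries to show is that date stamps are only staged as documents are handed out and are committed when checkpoint is called.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: traversal state kept by the implementation itself. */
class InternalStateSketch {

  /** Hypothetical document: sub-repository name, id, last-modified time. */
  static final class Doc {
    final String subRepo;
    final String id;
    final long lastModified;

    Doc(String subRepo, String id, long lastModified) {
      this.subRepo = subRepo;
      this.id = id;
      this.lastModified = lastModified;
    }
  }

  private final List<Doc> repository;
  // Committed table of (sub-repository-name, per-repository-date-stamp) pairs.
  private final Map<String, Long> committed = new HashMap<>();

  InternalStateSketch(List<Doc> repository) {
    this.repository = repository;
  }

  /** Stand-in for DocumentList; stages progress until checkpoint is called. */
  final class Batch {
    private final List<Doc> docs;
    private final Map<String, Long> staged = new HashMap<>();
    private int next = 0;

    Batch(List<Doc> docs) {
      this.docs = docs;
    }

    Doc nextDocument() {
      if (next >= docs.size()) {
        return null;
      }
      Doc doc = docs.get(next++);
      staged.put(doc.subRepo, doc.lastModified); // not committed yet
      return doc;
    }

    /** Commits state up to the last document returned; never returns null. */
    String checkpoint() {
      committed.putAll(staged);
      staged.clear();
      return "internal"; // placeholder value; the real state lives in the table
    }
  }

  Batch startTraversal() {
    committed.clear();
    return query();
  }

  Batch resumeTraversal(String ignoredCheckpoint) {
    return query(); // the supplied checkpoint is ignored in favor of internal state
  }

  private Batch query() {
    List<Doc> result = new ArrayList<>();
    for (Doc doc : repository) {
      long after = committed.getOrDefault(doc.subRepo, Long.MIN_VALUE);
      if (doc.lastModified > after) {
        result.add(doc);
      }
    }
    // Ascending dates within each sub-repository keep the staged stamps monotonic.
    result.sort(Comparator.comparing((Doc d) -> d.subRepo)
        .thenComparingLong(d -> d.lastModified));
    return result.isEmpty() ? null : new Batch(result);
  }
}
```

A production implementation would persist the table durably; the in-memory map here only demonstrates the commit timing.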
Note on "Metadata and URL" feeds vs. Content feeds:
Some repositories are fully web-enabled but are difficult or impossible for the Search Appliance to crawl, because they make heavy use of ASP or JSP, or they have a metadata model that is not conveniently accessible with the content in a single page. Such repositories are good candidates for connectors. However, a developer may choose not to implement authentication and authorization through a connector. It may be sufficient to use standard web mechanisms for these tasks.
The developer can achieve this by following these steps. In the document list
returned by the traversal methods, specify the
SpiConstants.PROPNAME_SEARCHURL
property. The value should be a URL. If this property is specified, the
Connector Manager will use a "URL Feed" rather than a "Content Feed" for
that document. In this case, the implementor should not
supply the content of the document. The Search Appliance will fetch the
content from the specified URL. Also, this URL will be used to trigger
normal authentication and authorization for that document. For more details,
see the documentation on Metadata and URL Feeds.
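The choice between the two feed types comes down to which property a document carries. The sketch below models a document as a plain map to show the idea. The key strings are placeholders standing in for the SpiConstants property names (their literal values are not taken from the SPI), and the helper methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: a document modeled as a property map. */
class FeedPropertiesSketch {
  // Placeholder keys standing in for SpiConstants.PROPNAME_DOCID,
  // PROPNAME_SEARCHURL and PROPNAME_CONTENT; the literal values are assumptions.
  static final String DOCID = "docid";
  static final String SEARCHURL = "searchurl";
  static final String CONTENT = "content";

  /** URL feed: supply the URL and omit the content. */
  static Map<String, String> urlFeedDocument(String docId, String url) {
    Map<String, String> props = new LinkedHashMap<>();
    props.put(DOCID, docId);
    props.put(SEARCHURL, url);
    return props;
  }

  /** Content feed: supply the content and omit the search URL. */
  static Map<String, String> contentFeedDocument(String docId, String content) {
    Map<String, String> props = new LinkedHashMap<>();
    props.put(DOCID, docId);
    props.put(CONTENT, content);
    return props;
  }

  /** A document is treated as a URL feed when the search URL is present. */
  static boolean isUrlFeed(Map<String, String> props) {
    return props.containsKey(SEARCHURL);
  }
}
```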
Note on Documents returned by traversal calls:
The Document objects returned by the queries defined here
must contain special properties according to the following rules:
SpiConstants.PROPNAME_DOCID This property must be present.
SpiConstants.PROPNAME_SEARCHURL If present, this means that the
Connector Manager will generate a Metadata and URL feed, with the specified
URL. If this is present, then the SpiConstants.PROPNAME_CONTENT
property should not be.
SpiConstants.PROPNAME_CONTENT This property should hold the
content of the document. If present, the connector framework will base-64
encode the value and present it to the Search Appliance as the primary
content to be indexed. If this is present, then the
SpiConstants.PROPNAME_SEARCHURL property should not be.
SpiConstants.PROPNAME_DISPLAYURL If present, this will be used
as the primary link on a results page. This should not
be used with SpiConstants.PROPNAME_SEARCHURL.

| Modifier and Type | Method and Description |
|---|---|
| DocumentList | resumeTraversal(java.lang.String checkPoint) Continues traversal from a supplied checkpoint. |
| void | setBatchHint(int batchHint) Sets the preferred batch size. |
| DocumentList | startTraversal() Starts (or restarts) traversal from the beginning. |

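DocumentList startTraversal() throws RepositoryException
Starts (or restarts) traversal from the beginning. The Connector Manager may consume as many or as few of the results as it chooses, but it guarantees to call DocumentList.checkpoint() after handling the last document it has successfully processed.
Returns: a DocumentList, or null if there are no documents.
Throws: RepositoryException - if the repository is unreachable or a similar exceptional condition occurs.

DocumentList resumeTraversal(java.lang.String checkPoint) throws RepositoryException
Continues traversal from a supplied checkpoint. The checkpoint is a String previously returned by the DocumentList.checkpoint() method. The
DocumentList object returns objects from the repository in natural order,
starting just after the document that was used to create the checkpoint
string.
Parameters: checkPoint - String that indicates from where to resume traversal.
Returns: a DocumentList, or null if there are no documents.
Throws: RepositoryException

void setBatchHint(int batchHint) throws RepositoryException
Sets the preferred batch size.
Parameters: batchHint - the preferred number of results per call.
Throws: RepositoryException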