com.google.enterprise.adaptor
Class CommandStreamParser

java.lang.Object
  extended by com.google.enterprise.adaptor.CommandStreamParser

public class CommandStreamParser
extends Object

Parses the adaptor data format into individual commands with associated data. This format is used for communication between the adaptor library and various command line adaptor components (lister, retriever, transformer, authorizer, etc.). It supports responses coming back from the command line adaptor implementation. The format supports a mixture of character and binary data. All character data must be encoded in UTF-8.

Character data technically supports a 'modified UTF-8' encoding, which allows the line feed and null characters to be encoded as two bytes instead of one. Instead of byte 0x00, the null character \0 can be encoded as 0xC0 0x80. Instead of byte 0x0a, the line feed character \n can be encoded as 0xC0 0x8a.
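
The two-byte forms described above can be produced with a small helper. This is a minimal sketch, not part of the adaptor library: the class name ModifiedUtf8Sketch and the method encode are hypothetical. It relies on the fact that, in standard UTF-8, the bytes 0x00 and 0x0a only ever appear as the single-byte encodings of the null and line feed characters.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of com.google.enterprise.adaptor.
final class ModifiedUtf8Sketch {
  /** Encodes text as UTF-8, writing NUL and LF in their 2-byte forms. */
  static byte[] encode(String text) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
      if (b == 0x00 || b == 0x0a) {
        out.write(0xC0);      // lead byte of the 2-byte form
        out.write(0x80 | b);  // 0x80 for NUL, 0x8a for LF
      } else {
        out.write(b);         // every other byte is copied unchanged
      }
    }
    return out.toByteArray();
  }
}
```

A document ID encoded this way contains no raw 0x00 or 0x0a byte, so it cannot be mistaken for a null or newline delimiter.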

Header Format

Communications (via either file or stream) begin with the header:

GSA Adaptor Data Version 1 [<delimiter>]

The version number must be preceded by a single space and followed by a single space. The version number may increase in the future should the format be enhanced.

The string between the two square brackets will be used as the delimiter for the remainder of the file being read or for the duration of the communication session.
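
As an illustration of the header layout, the following sketch extracts the version number and the delimiter with a regular expression. The class HeaderSketch and its members are hypothetical names chosen for this example; the actual parsing inside CommandStreamParser may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of com.google.enterprise.adaptor.
final class HeaderSketch {
  // Exactly one space on each side of the version number; the delimiter is
  // whatever sits between '[' and ']' (it may be a newline or a null character).
  private static final Pattern HEADER =
      Pattern.compile("GSA Adaptor Data Version (\\d+) \\[([^\\]]+)\\]");

  final int version;
  final String delimiter;

  private HeaderSketch(int version, String delimiter) {
    this.version = version;
    this.delimiter = delimiter;
  }

  /** Parses the header found at the start of the given text. */
  static HeaderSketch parse(String text) {
    Matcher m = HEADER.matcher(text);
    if (!m.lookingAt()) {
      throw new IllegalArgumentException("not an adaptor data header");
    }
    return new HeaderSketch(Integer.parseInt(m.group(1)), m.group(2));
  }
}
```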

Care must be taken that the delimiter character string can never occur in a document ID, metadata name, metadata value, user name, or any other data that will be represented using the format, with the exception of document contents, which can contain the delimiter string. The safest delimiter is likely to be the null character (the character with a value of zero). This character is unlikely to be present in existing names, paths, metadata, etc. Another possible choice is the newline character, though in many systems it is possible for this character to be present in document names, document paths, etc. If in doubt, the null character is recommended. Because modified UTF-8 is supported, newlines or null characters in document IDs, metadata, and the like can be encoded in their 2-byte form, which will not be confused with the delimiter. A delimiter can be made up of more than one character, so it is possible to use a highly unique string (such as a GUID) that has an exceptionally low probability of occurring in the data.

The following characters may not be used in the delimiter:

'A'-'Z', 'a'-'z' and '0'-'9' the alphanumeric characters
':' colon
'/' slash
'-' hyphen
'_' underscore
' ' space
'=' equals
'+' plus
'[' left square bracket
']' right square bracket
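
The restrictions listed above can be checked before a delimiter is written into a header. The sketch below is illustrative only: DelimiterCheckSketch and isValidDelimiter are hypothetical names, and rejecting the empty string is an assumption of this example rather than something the format description states.

```java
// Hypothetical helper, not part of com.google.enterprise.adaptor.
final class DelimiterCheckSketch {
  // The non-alphanumeric characters from the list above.
  private static final String FORBIDDEN_PUNCTUATION = ":/-_ =+[]";

  /** Returns true if no character of the delimiter is on the forbidden list. */
  static boolean isValidDelimiter(String delimiter) {
    if (delimiter.isEmpty()) {
      return false;  // assumption: an empty delimiter cannot separate anything
    }
    for (int i = 0; i < delimiter.length(); i++) {
      char c = delimiter.charAt(i);
      boolean asciiAlphanumeric = (c >= 'A' && c <= 'Z')
          || (c >= 'a' && c <= 'z')
          || (c >= '0' && c <= '9');
      if (asciiAlphanumeric || FORBIDDEN_PUNCTUATION.indexOf(c) >= 0) {
        return false;
      }
    }
    return true;
  }
}
```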

Body Format

Elements in the file start with one of the following commands. Commands where data precedes the next delimiter include an equal sign. Commands that are immediately followed by a delimiter do not include an equal sign. The first command must specify a document ID ("id=" or "id-list"). Commands that don't specify a document ID are associated with the most recent previously specified document ID.

Common Commands:

"id=" -- specifies a document id

"id-list" -- Starts a list of document ids, each separated by the specified delimiter. The list is terminated by two consecutive delimiters or EOS (End-Of-Stream). The ids in an id-list cannot have any of the associated commands listed below.

"repository-unavailable=" -- the document repository is unavailable. The string following the "=" character includes additional information that will be logged with the error.

Lister Commands:

"result-link=" -- specifies an alternative link to be displayed in the search results. This must be a properly formed URL. A "result link" is sometimes referred to as a "display URL". If no result-link is specified then the URL used for crawling is also used in the search results.

"last-modified=" -- Specifies the last time the document or its metadata has changed. The argument is a number representing the number of seconds since the standard base time known as "the epoch", namely January 1, 1970, 00:00:00 GMT. If last-modified is specified and the document has never been crawled before or has been crawled prior to the last-modified time then the document will be marked as "crawl-immediately" by the GSA.

"crawl-immediately" -- Increases the crawling priority of the document such that the GSA will retrieve it sooner than normally crawled documents.

"crawl-once" -- specifies that the document will be crawled by the GSA one time but then never re-crawled.

"lock" -- Causes the document to remain in the index unless explicitly removed. Failure to retrieve the document during re-crawling will not result in removal of the document. If every document in the GSA is locked then locked documents may be forced out when maximum capacity is reached.

"delete" -- this document should be deleted from the GSA index.
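
To show how the lister commands combine for a single document, here is a minimal sketch that writes an "id=" entry followed by "last-modified=" and, optionally, "crawl-immediately". The class ListerEntrySketch and its method are hypothetical; the header is assumed to have been written separately, and the check that the ID does not contain the delimiter follows the delimiter rules given earlier.

```java
import java.time.Instant;

// Hypothetical helper, not part of com.google.enterprise.adaptor.
final class ListerEntrySketch {
  /** Formats the body commands for one document, each followed by the delimiter. */
  static String entry(String delimiter, String docId, Instant lastModified,
      boolean crawlImmediately) {
    if (docId.contains(delimiter)) {
      throw new IllegalArgumentException("document id contains the delimiter");
    }
    StringBuilder sb = new StringBuilder();
    sb.append("id=").append(docId).append(delimiter);
    // last-modified is expressed in whole seconds since the epoch.
    sb.append("last-modified=").append(lastModified.getEpochSecond()).append(delimiter);
    if (crawlImmediately) {
      sb.append("crawl-immediately").append(delimiter);  // no '=' because it takes no data
    }
    return sb.toString();
  }
}
```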

Retriever Commands:

"up-to-date" -- specifies that the document is up-to-date with respect to its last crawled time.

"not-found" -- the document does not exist in the repository

"mime-type=" -- specifies the document's mime-type. If unspecified then the GSA will automatically assign a type to the document.

"meta-name=" -- specifies a metadata key, to be followed by a meta-value

"meta-value=" -- specifies a metadata value associated with the immediately preceding meta-name

"content" -- signals the beginning of binary content which continues to the end of the file or stream

"last-modified=" -- specifies the last time the document or its metadata has changed. The argument is a number representing the number of seconds since the standard base time known as "the epoch", namely January 1, 1970, 00:00:00 GMT.

"secure=" -- specifies whether the document is non-public. The argument is either 'true' or 'false'.

"anchor-uri=" -- specifies an anchor URI, to be followed by anchor-text.

"anchor-text=" -- specifies the text associated with an anchor-uri.

"no-index=" -- specifies whether the document should be indexed by the GSA. The argument is either 'true' or 'false'.

"no-follow=" -- specifies whether the document's links should be followed by the GSA. The argument is either 'true' or 'false'.

"no-archive=" -- specifies whether the GSA will allow the user to see a cached version of the document. The argument is either 'true' or 'false'.

"display-url=" -- specifies an alternative link to be displayed in the search results. This must be a properly formed URL.

"crawl-once=" -- specifies that the document will be crawled by the GSA one time but then never re-crawled. The argument should be 'true' or 'false'.

"lock=" -- Causes the document to remain in the index unless explicitly removed. If every document in the GSA is locked then locked documents may be forced out when maximum capacity is reached.

"acl" -- when provided, an ACL is sent along with the document. The ACL is made of values provided for other commands starting with "acl-" and the "namespace" command. If no acl command is provided then all other ACL commands are ignored.

"namespace=" -- namespace used on all user and group principals until another namespace is provided. Defaults to the default namespace.

"acl-permit-user=" -- a user name, either with domain or without, that will be permitted to view the document being returned.

"acl-deny-user=" -- a user name, either with domain or without, that will be denied access to the document being returned.

"acl-permit-group=" -- a group name, either with domain or without, that will be permitted to view the document being returned.

"acl-deny-group=" -- a group name, either with domain or without, that will be denied access to the document being returned.

"acl-inherit-from=" -- document id that this document inherits permissions from.

"acl-inherit-fragment=" -- optional fragment supplementing acl-inherit-from. Together acl-inherit-from and acl-inherit-fragment are what is being inherited from.

"acl-inheritance-type=" -- the type of inheritance, as defined by com.google.enterprise.adaptor.Acl.InheritanceType. Valid values are: and-both-permit, child-overrides, leaf-node, and parent-overrides.

"acl-case-sensitive=" -- the principals of this document are case sensitive.

"acl-case-insensitive=" -- the principals of this document are case insensitive.
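
The retriever commands mix character data with binary content, so a response is most safely assembled as bytes. The sketch below is illustrative: RetrieverBodySketch and its method are hypothetical names, the header is assumed to have been written separately, and the "content" command is assumed to be followed by one delimiter before the raw bytes begin, as with other commands that carry no equal sign.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper, not part of com.google.enterprise.adaptor.
final class RetrieverBodySketch {
  /** Builds the body commands for one document, ending with its binary content. */
  static byte[] body(String delimiter, String docId, String mimeType,
      Map<String, String> metadata, byte[] content) {
    StringBuilder text = new StringBuilder();
    text.append("id=").append(docId).append(delimiter);
    text.append("mime-type=").append(mimeType).append(delimiter);
    for (Map.Entry<String, String> entry : metadata.entrySet()) {
      // Each meta-name is immediately followed by its meta-value.
      text.append("meta-name=").append(entry.getKey()).append(delimiter);
      text.append("meta-value=").append(entry.getValue()).append(delimiter);
    }
    text.append("content").append(delimiter);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] head = text.toString().getBytes(StandardCharsets.UTF_8);
    out.write(head, 0, head.length);
    // Content runs to the end of the stream, so it must come last and is
    // copied verbatim; it may contain the delimiter.
    out.write(content, 0, content.length);
    return out.toByteArray();
  }
}
```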

Authorizer Commands:

"authz-status=" -- specifies whether a document is visible to a specified user. The argument must be PERMIT, DENY or INDETERMINATE

"user=" -- specifies the user for whom the authorization check will be made

"password=" -- specifies the password for the user. (optional)

"group=" -- specifies a security group to which the user belongs.

End-of-stream terminates the data transmission. Multiple consecutive delimiters are collapsed into a single delimiter and terminate the current id-list should one exist.

Unrecognized commands generate a warning but are otherwise ignored.
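
The body rules above (split on the delimiter, collapse repeated delimiters, end an id-list on an empty element, and split each command at its first equal sign) can be sketched as a small tokenizer. This is a simplified illustration for character data only and does not handle binary "content"; BodyTokenizerSketch is a hypothetical name, and reporting each id-list entry as an "id" command is a choice made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of com.google.enterprise.adaptor.
final class BodyTokenizerSketch {
  /** Returns {command, argument} pairs; argument is null for commands without '='. */
  static List<String[]> tokenize(String body, String delimiter) {
    List<String[]> commands = new ArrayList<>();
    boolean inIdList = false;
    // Limit -1 keeps empty elements, which mark consecutive delimiters.
    for (String token : body.split(Pattern.quote(delimiter), -1)) {
      if (token.isEmpty()) {
        inIdList = false;  // consecutive delimiters end any open id-list
        continue;
      }
      if (inIdList) {
        commands.add(new String[] {"id", token});
      } else if (token.equals("id-list")) {
        inIdList = true;
      } else {
        int eq = token.indexOf('=');
        commands.add(eq < 0
            ? new String[] {token, null}
            : new String[] {token.substring(0, eq), token.substring(eq + 1)});
      }
    }
    return commands;
  }
}
```

Splitting at the first equal sign keeps arguments such as URLs with query strings intact.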

Examples

Example 1:

 GSA Adaptor Data Version 1 [<delimiter>]
 id-list
 /home/repository/docs/file1
 /home/repository/docs/file2
 /home/repository/docs/file3
 /home/repository/docs/file4
 /home/repository/docs/file5
 
Example 2:

 GSA Adaptor Data Version 1 [<delimiter>]
 id=/home/repository/docs/file1
 id=/home/repository/docs/file2
 crawl-immediately
 last-modified=1312387643

 meta-name=Department
 meta-value=Engineering

 meta-name=Creator
 meta-value=howardhawks

 id=/home/repository/docs/file3
 id=/home/repository/docs/file4
 id=/home/repository/docs/file5
 
Data passed to the command line authorizer via stdin for an authz check. Entries will always occur in this order: user, password, group, id. The password and group entries are optional. Any number of group and id entries can exist. Each of the documents with a listed id should be checked.
 GSA Adaptor Data Version 1 [<delimiter>]
 user=tim_smith
 password=abc123
 group=managers
 group=research
 id=/home/repository/docs/file1
 id=/home/repository/docs/file2
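
A command line authorizer has to read a request of this shape from stdin. The following sketch collects the user, password, groups and ids from the body that follows the header. AuthzRequestSketch is a hypothetical name, and the sketch simply skips empty elements and any command it does not recognize.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of com.google.enterprise.adaptor.
final class AuthzRequestSketch {
  String user;                                    // from "user="
  String password;                                // from "password=", stays null if absent
  final List<String> groups = new ArrayList<>();  // zero or more "group=" entries
  final List<String> docIds = new ArrayList<>();  // documents to check, from "id="

  /** Parses the body of an authz request (the part after the header). */
  static AuthzRequestSketch parse(String body, String delimiter) {
    AuthzRequestSketch request = new AuthzRequestSketch();
    for (String token : body.split(Pattern.quote(delimiter))) {
      if (token.startsWith("user=")) {
        request.user = token.substring("user=".length());
      } else if (token.startsWith("password=")) {
        request.password = token.substring("password=".length());
      } else if (token.startsWith("group=")) {
        request.groups.add(token.substring("group=".length()));
      } else if (token.startsWith("id=")) {
        request.docIds.add(token.substring("id=".length()));
      }
      // Empty elements and unrecognized commands are skipped in this sketch.
    }
    return request;
  }
}
```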
 
AuthZ response passed from the command line authorizer via stdout. Each doc id must include an authz-status entry.
 GSA Adaptor Data Version 1 [<delimiter>]
 id=/home/repository/docs/file1
 authz-status=PERMIT
 id=/home/repository/docs/file2
 authz-status=DENY
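
The response body can be generated by pairing each id with its status. In this sketch, AuthzResponseSketch is a hypothetical name and the decisions are modeled as a map from document id to Boolean, where true means PERMIT, false means DENY and null means INDETERMINATE; the header is assumed to be written separately.

```java
import java.util.Map;

// Hypothetical helper, not part of com.google.enterprise.adaptor.
final class AuthzResponseSketch {
  /** Writes an "id=" and "authz-status=" pair for every decision, in map order. */
  static String body(String delimiter, Map<String, Boolean> decisions) {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Boolean> entry : decisions.entrySet()) {
      Boolean permitted = entry.getValue();
      String status = permitted == null
          ? "INDETERMINATE"
          : (permitted ? "PERMIT" : "DENY");
      sb.append("id=").append(entry.getKey()).append(delimiter);
      sb.append("authz-status=").append(status).append(delimiter);
    }
    return sb.toString();
  }
}
```

Passing a java.util.LinkedHashMap keeps the ids in the order they were requested.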
 


Constructor Summary
CommandStreamParser(InputStream inputStream)
           
 
Method Summary
 int getVersionNumber()
           
 Map<DocId,AuthzStatus> readFromAuthorizer()
           
 DocIdPusher.Record readFromLister(DocIdPusher pusher, ExceptionHandler handler)
          Parse a listing response, sending results to pusher.
 void readFromRetriever(DocId docId, Response response)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CommandStreamParser

public CommandStreamParser(InputStream inputStream)
Method Detail

getVersionNumber

public int getVersionNumber()
                     throws IOException
Throws:
IOException

readFromAuthorizer

public Map<DocId,AuthzStatus> readFromAuthorizer()
                                          throws IOException
Throws:
IOException

readFromRetriever

public void readFromRetriever(DocId docId,
                              Response response)
                       throws IOException
Throws:
IOException

readFromLister

public DocIdPusher.Record readFromLister(DocIdPusher pusher,
                                         ExceptionHandler handler)
                                  throws IOException,
                                         InterruptedException
Parse a listing response, sending results to pusher. If handler is null, then pusher's default handler will be used. If the pusher fails to send a record, the rest of the input stream may not be read.

Returns:
null on success, otherwise the first Record to fail
Throws:
IOException
InterruptedException