public class CommandStreamParser extends Object
Character data may use a 'modified UTF-8' encoding. Modified UTF-8 allows the newline
and null characters to be encoded as two bytes instead of one. Instead of byte 0x00,
the null character \0 can be encoded as 0xC0 0x80. Instead of byte 0x0A, the line feed character
\n can be encoded as 0xC0 0x8A.
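The 2-byte forms described above can be produced with a small helper. The following is a minimal sketch; the class name `ModifiedUtf8` and its `encode` method are hypothetical and not part of the adaptor library. It relies on the fact that, in standard UTF-8, the bytes 0x00 and 0x0A only ever occur as the encodings of the null and line feed characters.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of the adaptor library. */
final class ModifiedUtf8 {
  private ModifiedUtf8() {}

  /**
   * Encodes text as UTF-8, except that NUL and LF are written in the
   * 2-byte forms 0xC0 0x80 and 0xC0 0x8A so they cannot be mistaken
   * for a null or newline delimiter.
   */
  static byte[] encode(String text) {
    byte[] plain = text.getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream out = new ByteArrayOutputStream(plain.length + 8);
    for (byte b : plain) {
      if (b == 0x00) {
        out.write(0xC0);
        out.write(0x80);
      } else if (b == 0x0A) {
        out.write(0xC0);
        out.write(0x8A);
      } else {
        out.write(b); // every other byte is copied unchanged
      }
    }
    return out.toByteArray();
  }
}
```

A document ID containing a newline could be passed through `ModifiedUtf8.encode` before being written to a stream that uses the newline as its delimiter.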
GSA Adaptor Data Version 1 [<delimiter>]
The version number must be preceded by a single space and followed by a single space. The version number may increase in the future should the format be enhanced.
The string between the two square brackets will be used as the delimiter for the remainder of the file being read or for the duration of the communication session.
Care must be taken that the delimiter character string can never occur in a document ID, metadata name, metadata value, user name, or any other data that will be represented using the format with the exception of document contents, which can contain the delimiter string. The safest delimiter is likely to be the null character (the character with a value of zero). This character is unlikely to be present in existing names, paths, metadata, etc. Another possible choice is the newline character, though in many systems it is possible for this character to be present in document names and document paths, etc. If in doubt, the null character is recommended. Because modified UTF-8 is supported, newlines or null characters in document IDs, metadata, and the like can be encoded in their 2-byte form, which will not be confused with the delimiter. A delimiter can be made up of more than one character, so it is possible to have a delimiter that is <CR><LF> or a highly unique string (such as a GUID) that has an exceptionally low probability of occurring in the data.
The following characters may not be used in the delimiter:
'A'-'Z', 'a'-'z' and '0'-'9' the alphanumeric characters
':' colon
'/' slash
'-' hyphen
'_' underscore
' ' space
'=' equals
'+' plus
'[' left square bracket
']' right square bracket
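The restrictions listed above can be checked before a delimiter is written into the header. This is a minimal sketch; `DelimiterCheck` is a hypothetical name, and rejecting an empty delimiter is an assumption that the text above does not state.

```java
/** Hypothetical validator for illustration; not part of the adaptor library. */
final class DelimiterCheck {
  private DelimiterCheck() {}

  // Punctuation forbidden by the list above, in addition to ASCII letters and digits.
  private static final String FORBIDDEN_PUNCTUATION = ":/-_ =+[]";

  /** Returns true if the delimiter is non-empty and contains no forbidden character. */
  static boolean isValid(String delimiter) {
    if (delimiter == null || delimiter.isEmpty()) {
      return false; // assumption: an empty delimiter is never usable
    }
    for (int i = 0; i < delimiter.length(); i++) {
      char c = delimiter.charAt(i);
      boolean asciiAlphanumeric =
          (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
      if (asciiAlphanumeric || FORBIDDEN_PUNCTUATION.indexOf(c) >= 0) {
        return false;
      }
    }
    return true;
  }
}
```

With this check, the null character and <CR><LF> are accepted, while any delimiter containing a letter, digit, or one of the listed punctuation characters is rejected.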
"id-list" -- Starts a list of document ids, each separated by the specified delimiter. The list is terminated by two consecutive delimiters or EOS (End-Of-Stream). Ids in an id-list cannot have any of the associated commands listed below.
"repository-unavailable=" -- the document repository is unavailable. The string following the "=" character includes additional information that will be logged with the error.
"last-modified=" -- Specifies the last time the document or its metadata has changed. The argument is a number representing the number of seconds since the standard base time known as "the epoch", namely January 1, 1970, 00:00:00 GMT. If last-modified is specified and the document has never been crawled before or has been crawled prior to the last-modified time then the document will be marked as "crawl-immediately" by the GSA.
"crawl-immediately" -- Increases the crawling priority of the document such that the GSA will retrieve it sooner than normally crawled documents.
"crawl-once" -- specifies that the document will be crawled by the GSA one time but then never re-crawled.
"lock" -- Causes the document to remain in the index unless explicitly removed. Failure to retrieve the document during re-crawling will not result in removal of the document. If every document in the GSA is locked then locked documents may be forced out when maximum capacity is reached.
"delete" -- this document should be deleted from the GSA index.
"not-found" -- the document does not exist in the repository
"mime-type=" -- specifies the document's mime-type. If unspecified then the GSA will automatically assign a type to the document.
"meta-name=" -- specifies a metadata key, to be followed by a meta-value
"meta-value=" -- specifies a metadata value associated with the immediately preceding meta-name
"param-name=" -- specifies a parameter key, to be followed by a param-value. Parameters are supplied to MetadataTransforms for use when making transforms or decisions.
"param-value=" -- specifies a parameter value associated with the immediately preceding param-name
"content" -- signals the beginning of binary content which continues to the end of the file or stream
"last-modified=" -- specifies the last time the document or its metadata has changed. The argument is a number representing the number of seconds since the standard base time known as "the epoch", namely January 1, 1970, 00:00:00 GMT.
"secure=" -- specifies whether the document is non-public. The argument is either 'true' or 'false'.
"anchor-uri=" -- specifies an anchor URI, to be followed by anchor-text.
"anchor-text=" -- specifies the text associated with an anchor-uri.
"no-index=" -- specifies whether the document should be indexed by the GSA. The argument is either 'true' or 'false'.
"no-follow=" -- specifies whether the document's links should be followed by the GSA. The argument is either 'true' or 'false'.
"no-archive=" -- specifies whether the GSA will allow the user to see a cached version of the document. The argument is either 'true' or 'false'.
"display-url=" -- specifies an alternative link to be displayed in the search results. This must be a properly formed URL.
"crawl-once=" -- specifies that the document will be crawled by the GSA one time but then never re-crawled. The argument should be 'true' or 'false'.
"lock=" -- Causes the document to remain in the index unless explicitly removed. If every document in the GSA is locked then locked documents may be forced out when maximum capacity is reached.
"acl" -- when provided, an ACL is sent along with the document. The ACL is made of values provided for other commands starting with "acl-" and the "namespace" command. If no acl command is provided then all other ACL commands are ignored.
"namespace=" -- namespace used on all user and group principals until another namespace is provided. Defaults to the default namespace.
"acl-permit-user=" -- a user name, either with domain or without, that will be permitted to view the document being returned.
"acl-deny-user=" -- a user name, either with domain or without, that will be denied access to the document being returned.
"acl-permit-group=" -- a group name, either with domain or without, that will be permitted to view the document being returned.
"acl-deny-group=" -- a group name, either with domain or without, that will be denied access to the document being returned.
"acl-inherit-from=" -- document id that this document inherits permissions from.
"acl-inherit-fragment=" -- optional fragment supplementing acl-inherit-from. Together acl-inherit-from and acl-inherit-fragment are what is being inherited from.
"acl-inheritance-type=" -- the type of inheritance (com.google.enterprise.adaptor.Acl.InheritanceType). Valid values are: and-both-permit, child-overrides, leaf-node, and parent-overrides
"acl-case-sensitive=" -- the principals of this document are case sensitive.
"acl-case-insensitive=" -- the principals of this document are case insensitive.
"user=" -- specifies the user for whom the authorization check will be made
"password=" -- specifies the password for the user. (optional)
"group=" -- specifies a security group to which the user belongs.
End-of-stream terminates the data transmission. Multiple consecutive delimiters are collapsed into a single delimiter and terminate the current id-list, should one exist.
Unrecognized commands generate a warning but are otherwise ignored.
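The body rules above (delimiter-separated entries, runs of delimiters collapsed, commands of the form name or name=argument) can be sketched as a small tokenizer. `CommandTokenizer` and its methods are hypothetical names and do not reflect how this class is actually implemented. The sketch assumes the header has already been consumed, treats the body as text, and handles neither the binary payload that follows "content" nor the way an empty entry terminates an id-list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tokenizer for illustration; not the library's implementation. */
final class CommandTokenizer {
  private CommandTokenizer() {}

  /** Splits body on delimiter, dropping empty tokens so runs of delimiters collapse. */
  static List<String> tokenize(String body, String delimiter) {
    if (delimiter.isEmpty()) {
      throw new IllegalArgumentException("delimiter must not be empty");
    }
    List<String> tokens = new ArrayList<>();
    int start = 0;
    while (start <= body.length()) {
      int end = body.indexOf(delimiter, start);
      if (end < 0) {
        end = body.length(); // end-of-stream closes the last token
      }
      if (end > start) {
        tokens.add(body.substring(start, end));
      }
      start = end + delimiter.length();
    }
    return tokens;
  }

  /** Splits "name=argument" at the first '='; the argument is null when there is no '='. */
  static String[] splitCommand(String token) {
    int eq = token.indexOf('=');
    if (eq < 0) {
      return new String[] {token, null};
    }
    return new String[] {token.substring(0, eq), token.substring(eq + 1)};
  }
}
```

Splitting at the first '=' only keeps any later '=' characters inside the argument, which matters for values such as URLs with query strings.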
Example 1:
GSA Adaptor Data Version 1 [<delimiter>]
id-list
/home/repository/docs/file1
/home/repository/docs/file2
/home/repository/docs/file3
/home/repository/docs/file4
/home/repository/docs/file5
Example 2:
GSA Adaptor Data Version 1 [<delimiter>]
id=/home/repository/docs/file1
id=/home/repository/docs/file2
crawl-immediately
last-modified=1312387643
meta-name=Department
meta-value=Engineering
meta-name=Creator
meta-value=howardhawks
id=/home/repository/docs/file3
id=/home/repository/docs/file4
id=/home/repository/docs/file5
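A lister program could assemble a response of this shape with a small builder. The sketch below is illustrative only: `ListerMessage` and its methods are hypothetical names, and it assumes that every entry, including the header line, is followed by one delimiter. The last-modified argument is written as seconds since the epoch, as the command description requires.

```java
import java.time.Instant;

/** Hypothetical builder for illustration; not part of the adaptor library. */
final class ListerMessage {
  private final String delimiter;
  private final StringBuilder out = new StringBuilder();

  ListerMessage(String delimiter) {
    this.delimiter = delimiter;
    // Header: the delimiter is declared between the square brackets.
    out.append("GSA Adaptor Data Version 1 [").append(delimiter).append("]").append(delimiter);
  }

  ListerMessage id(String docId) {
    return entry("id=" + docId);
  }

  ListerMessage crawlImmediately() {
    return entry("crawl-immediately");
  }

  /** Writes last-modified as whole seconds since January 1, 1970, 00:00:00 GMT. */
  ListerMessage lastModified(Instant when) {
    return entry("last-modified=" + when.getEpochSecond());
  }

  ListerMessage meta(String name, String value) {
    return entry("meta-name=" + name).entry("meta-value=" + value);
  }

  private ListerMessage entry(String text) {
    out.append(text).append(delimiter); // assumption: one delimiter after each entry
    return this;
  }

  String build() {
    return out.toString();
  }
}
```

For instance, `new ListerMessage("\n").id("/home/repository/docs/file2").crawlImmediately().meta("Department", "Engineering").build()` yields a header followed by one id record with two attached commands.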
Data passed to command line authorizer via stdin for authz check.
Entries will always occur in this order: user, password, group, id.
password and group information is optional. Any number of group and
id entries can exist. Each of the documents with a listed id should
be checked.
GSA Adaptor Data Version 1 [<delimiter>]
user=tim_smith
password=abc123
group=managers
group=research
id=/home/repository/docs/file1
id=/home/repository/docs/file2
AuthZ response passed from command line authorizer via stdout.
Each doc id must include an authz-status entry.
GSA Adaptor Data Version 1 [<delimiter>]
id=/home/repository/docs/file1
authz-status=PERMIT
id=/home/repository/docs/file2
authz-status=DENY
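The pairing rule above, in which every id is followed by its own authz-status, can be sketched as follows. This is not the implementation of readFromAuthorizer(), which returns a Map<DocId,AuthzStatus>; the hypothetical `AuthzResponseSketch` uses plain strings so that it stands alone, and it assumes the header line has already been removed from the body.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical parser for illustration; not the library's implementation. */
final class AuthzResponseSketch {
  private static final String ID = "id=";
  private static final String STATUS = "authz-status=";

  private AuthzResponseSketch() {}

  /** Maps each document id to the authz-status that follows it, in input order. */
  static Map<String, String> parse(String body, String delimiter) {
    Map<String, String> statusById = new LinkedHashMap<>();
    String pendingId = null;
    for (String token : body.split(Pattern.quote(delimiter))) {
      if (token.startsWith(ID)) {
        requireNoPending(pendingId);
        pendingId = token.substring(ID.length());
      } else if (token.startsWith(STATUS) && pendingId != null) {
        statusById.put(pendingId, token.substring(STATUS.length()));
        pendingId = null;
      }
      // Anything else is ignored here; the real parser logs a warning instead.
    }
    requireNoPending(pendingId);
    return statusById;
  }

  // Each doc id must include an authz-status entry.
  private static void requireNoPending(String pendingId) {
    if (pendingId != null) {
      throw new IllegalArgumentException("no authz-status for id " + pendingId);
    }
  }
}
```

Rejecting an id that has no status, rather than silently skipping it, is a design choice of this sketch; it makes an incomplete authorizer response fail loudly.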
Constructor and Description |
---|
CommandStreamParser(InputStream inputStream) |
Modifier and Type | Method and Description |
---|---|
int | getVersionNumber() |
Map<DocId,AuthzStatus> | readFromAuthorizer() |
DocIdPusher.Record | readFromLister(DocIdPusher pusher, ExceptionHandler handler) Parse a listing response, sending results to pusher. |
void | readFromRetriever(DocId docId, Response response) |
public CommandStreamParser(InputStream inputStream)
public int getVersionNumber() throws IOException
Throws:
IOException
public Map<DocId,AuthzStatus> readFromAuthorizer() throws IOException
Throws:
IOException
public void readFromRetriever(DocId docId, Response response) throws IOException
Throws:
IOException
public DocIdPusher.Record readFromLister(DocIdPusher pusher, ExceptionHandler handler) throws IOException, InterruptedException
Parse a listing response, sending results to pusher. If handler is null, then pusher's default handler will be used. In case of failure sending in pusher, the rest of the input stream may not be read.
Parameters:
pusher - doc id pusher
handler - exception handler
Returns:
null on success, otherwise the first Record to fail
Throws:
IOException - ioe
InterruptedException - if interrupted