or a highly unique string (such as a GUID) that has an exceptionally low probability of
occurring in the data.
The following characters may not be used in the delimiter:
'A'-'Z', 'a'-'z' and '0'-'9' alphanumeric characters
':' colon
'/' slash
'-' hyphen
'_' underscore
' ' space
'=' equals
'+' plus
'[' left square bracket
']' right square bracket
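The restriction above can be checked mechanically before a delimiter is chosen. The following is a minimal sketch, not part of the adaptor library: the class name DelimiterCheck and the method isValidDelimiter are hypothetical, and rejecting an empty delimiter is an assumption.

```java
final class DelimiterCheck {
    // Punctuation and whitespace that the list above excludes from delimiters.
    private static final String FORBIDDEN_PUNCTUATION = ":/-_ =+[]";

    /** Returns true if the candidate contains none of the forbidden characters. */
    static boolean isValidDelimiter(String candidate) {
        if (candidate == null || candidate.isEmpty()) {
            return false; // assumption: an empty delimiter is unusable
        }
        for (int i = 0; i < candidate.length(); i++) {
            char c = candidate.charAt(i);
            boolean asciiAlphanumeric = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9');
            if (asciiAlphanumeric || FORBIDDEN_PUNCTUATION.indexOf(c) >= 0) {
                return false;
            }
        }
        return true;
    }
}
```

With this check, a newline or a NUL character passes, while any candidate containing a letter, digit, or one of the listed punctuation characters is rejected.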
Body Format
Elements in the file start with one of the following commands. Commands
where data precedes the next delimiter include an equal sign. Commands that are immediately
followed by a delimiter do not include an equal sign. The first command must specify a document
ID ("id=" or "id-list"). Commands that don't specify a document ID are associated with the most
recent previously specified document ID.
Common Commands:
"id=" -- specifies a document id
"id-list" -- Starts a list of document ids each separated by
the specified delimiter. The list is terminated by two consecutive delimiters or EOS
(End-Of-Stream). Ids in an id-list cannot have any of the associated commands listed below.
"repository-unavailable=" -- the document repository is unavailable. The string following the "="
character includes additional information that will be logged with the error.
Lister Commands:
"result-link=" -- specifies an alternative link to be displayed in the search results.
This must be a properly formed URL. A "result link" is sometimes referred to as a "display URL".
If no result-link is specified then the URL used for crawling is also used in the
search results.
"last-modified=" -- Specifies the last time the document or its metadata has changed.
The argument is a number representing the number of seconds since the standard base
time known as "the epoch", namely January 1, 1970, 00:00:00 GMT. If last-modified is specified
and the document has never been crawled before or has been crawled prior to the last-modified
time then the document will be marked as "crawl-immediately" by the GSA.
"crawl-immediately" -- Increases the crawling priority of the document such
that the GSA will retrieve it sooner than normally crawled documents.
"crawl-once" -- specifies that the document will be crawled by the
GSA one time but then never re-crawled.
"lock" -- Causes the document to remain in the index unless explicitly removed.
Failure to retrieve the document during re-crawling will not result in
removal of the document. If every document in the GSA is
locked then locked documents may be forced out when maximum capacity is
reached.
"delete" -- this document should be deleted from the GSA index.
Retriever Commands:
"up-to-date" -- specifies that the document is up-to-date with respect to its last crawled
time.
"not-found" -- the document does not exist in the repository
"mime-type=" -- specifies the document's mime-type. If unspecified then the GSA will
automatically assign a type to the document.
"meta-name=" -- specifies a metadata key, to be followed by a meta-value
"meta-value=" -- specifies a metadata value associated with the
immediately preceding meta-name
"content" -- signals the beginning of binary content which
continues to the end of the file or stream
"last-modified=" -- specifies the last time the document or its metadata has changed.
The argument is a number representing the number of seconds since the standard base
time known as "the epoch", namely January 1, 1970, 00:00:00 GMT.
"secure=" -- specifies whether the document is non-public. The argument is either 'true' or
'false'.
"anchor-uri=" -- specifies an anchor URI, to be followed by anchor-text.
"anchor-text=" -- specifies the text associated with an anchor-uri.
"no-index=" -- specifies whether the document should be indexed by the GSA. The argument is
either 'true' or 'false'.
"no-follow=" -- specifies whether the document's links should be followed by the GSA. The
argument is either 'true' or 'false'.
"no-archive=" -- specifies whether the GSA will allow the user to see a cached version of
the document. The argument is either 'true' or 'false'.
"display-url=" -- specifies an alternative link to be displayed in the search results.
This must be a properly formed URL.
"crawl-once=" -- specifies that the document will be crawled by the
GSA one time but then never re-crawled. The argument should be 'true' or 'false'.
"lock=" -- Causes the document to remain in the index unless explicitly removed.
If every document in the GSA is locked then locked documents may be forced out when maximum
capacity is reached.
"acl" -- when provided, an ACL is sent along with the document. The ACL is made of
values provided for other commands starting with "acl-" and the "namespace"
command. If no acl command is provided then all other ACL commands are
ignored.
"namespace=" -- namespace used on all user and group principals until another
namespace is provided. Defaults to the default namespace.
"acl-permit-user=" -- a user name, either with domain or without, that will
be permitted to view the document being returned.
"acl-deny-user=" -- a user name, either with domain or without, that will
be denied access to the document being returned.
"acl-permit-group=" -- a group name, either with domain or without, that
will be permitted to view the document being returned.
"acl-deny-group=" -- a group name, either with domain or without, that
will be denied access to the document being returned.
"acl-inherit-from=" -- document id that this document inherits permissions
from.
"acl-inherit-fragment=" -- optional fragment supplementing acl-inherit-from.
Together acl-inherit-from and acl-inherit-fragment are what is being
inherited from.
"acl-inheritance-type=" -- the type of inheritance, as defined by
com.google.enterprise.adaptor.Acl.InheritanceType. Valid values are:
and-both-permit, child-overrides, leaf-node, and parent-overrides.
"acl-case-sensitive=" -- the principals of this document are case sensitive.
"acl-case-insensitive=" -- the principals of this document are case
insensitive.
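As an illustration of how the retriever commands above combine, here is a minimal sketch that assembles a retriever response in memory. It is not part of the adaptor library: the class RetrieverOutput and its build method are hypothetical. It also assumes that every element is preceded by the delimiter, that the text portion is UTF-8, and that the binary content starts right after the delimiter following "content".

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

final class RetrieverOutput {
    /** Builds header, id, mime-type, metadata pairs, then raw content bytes. */
    static byte[] build(String delimiter, String docId, String mimeType,
                        Map<String, String> metadata, byte[] content) {
        StringBuilder head = new StringBuilder();
        head.append("GSA Adaptor Data Version 1 [").append(delimiter).append("]");
        head.append(delimiter).append("id=").append(docId);
        head.append(delimiter).append("mime-type=").append(mimeType);
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            // Each meta-name is immediately followed by its meta-value.
            head.append(delimiter).append("meta-name=").append(entry.getKey());
            head.append(delimiter).append("meta-value=").append(entry.getValue());
        }
        // "content" takes no argument; everything after it is document data.
        head.append(delimiter).append("content").append(delimiter);

        byte[] headBytes = head.toString().getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headBytes, 0, headBytes.length);
        out.write(content, 0, content.length);
        return out.toByteArray();
    }
}
```

Because "content" runs to the end of the stream, it has to be the last command written. Any command emitted after it would be treated as document data.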
Authorizer Commands:
"authz-status=" -- specifies whether a document is visible to a
specified user. The argument must be PERMIT, DENY or INDETERMINATE
"user=" -- specifies the user for whom the authorization check will be made
"password=" -- specifies the password for the user. (optional)
"group=" -- specifies a security group to which the user belongs.
End-of-stream terminates the data transmission. Multiple consecutive delimiters are collapsed
into a single delimiter and terminate the current id-list, should one exist.
Unrecognized commands generate a warning but are otherwise ignored.
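The splitting rules described above can be sketched as follows. This is an illustrative simplification, not the parser's actual implementation: the CommandSplitter and Command names are hypothetical. The body is assumed to be text with the header already removed, the binary "content" command is not handled, and each id inside an id-list is reported as if it were an "id" command.

```java
import java.util.ArrayList;
import java.util.List;

final class CommandSplitter {
    /** One element of the body: a command name and its argument (null if none). */
    static final class Command {
        final String name;
        final String argument;

        Command(String name, String argument) {
            this.name = name;
            this.argument = argument;
        }
    }

    static List<Command> split(String body, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<Command> commands = new ArrayList<>();
        boolean inIdList = false;
        int start = 0;
        while (start <= body.length()) {
            int end = body.indexOf(delimiter, start);
            if (end < 0) {
                end = body.length(); // end-of-stream closes the last element
            }
            String token = body.substring(start, end);
            start = end + delimiter.length();
            if (token.isEmpty()) {
                // Consecutive delimiters: skipped, and they end any id-list.
                inIdList = false;
            } else if (inIdList) {
                commands.add(new Command("id", token));
            } else if (token.equals("id-list")) {
                inIdList = true;
            } else {
                int eq = token.indexOf('=');
                if (eq < 0) {
                    commands.add(new Command(token, null));
                } else {
                    // Split at the first '=' only; the argument may contain more.
                    commands.add(new Command(token.substring(0, eq),
                                             token.substring(eq + 1)));
                }
            }
        }
        return commands;
    }
}
```

A caller would then match each Command name against the command lists above and log a warning for any name it does not recognize.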
Examples
Example 1:
GSA Adaptor Data Version 1 [<delimiter>]
id-list
/home/repository/docs/file1
/home/repository/docs/file2
/home/repository/docs/file3
/home/repository/docs/file4
/home/repository/docs/file5
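A lister could produce output shaped like Example 1 with a few lines of code. The sketch below is hypothetical (the ListerOutput class and idList method are not part of the adaptor library) and assumes that each element after the header is preceded by the delimiter.

```java
import java.util.List;

final class ListerOutput {
    /** Builds the header followed by an id-list containing the given ids. */
    static String idList(String delimiter, List<String> ids) {
        StringBuilder out = new StringBuilder();
        out.append("GSA Adaptor Data Version 1 [").append(delimiter).append("]");
        out.append(delimiter).append("id-list");
        for (String id : ids) {
            out.append(delimiter).append(id);
        }
        return out.toString();
    }
}
```

Calling ListerOutput.idList("\n", ids) with the five paths from Example 1 yields the same layout as the example, with a newline as the delimiter.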
Example 2:
GSA Adaptor Data Version 1 [<delimiter>]
id=/home/repository/docs/file1
id=/home/repository/docs/file2
crawl-immediately
last-modified=1312387643
meta-name=Department
meta-value=Engineering
meta-name=Creator
meta-value=howardhawks
id=/home/repository/docs/file3
id=/home/repository/docs/file4
id=/home/repository/docs/file5
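The last-modified argument is described above as a count of seconds since the epoch. A minimal sketch of producing such a value with java.time follows. The ListerRecord class and its record method are hypothetical, and the timestamp used in the usage note (2011-08-03 16:07:23) is assumed to be GMT.

```java
import java.time.Instant;

final class ListerRecord {
    /** Builds one lister entry: id, crawl-immediately, and last-modified. */
    static String record(String delimiter, String docId, Instant lastModified) {
        return delimiter + "id=" + docId
                + delimiter + "crawl-immediately"
                // getEpochSecond() is seconds since 1970-01-01T00:00:00Z.
                + delimiter + "last-modified=" + lastModified.getEpochSecond();
    }
}
```

For example, ListerRecord.record("\n", "/home/repository/docs/file2", Instant.parse("2011-08-03T16:07:23Z")) ends with the element last-modified=1312387643.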
Data passed to command line authorizer via stdin for authz check.
Entries will always occur in this order: user, password, group, id.
password and group information is optional. Any number of group and
id entries can exist. Each of the documents with a listed id should
be checked.
GSA Adaptor Data Version 1 [<delimiter>]
user=tim_smith
password=abc123
group=managers
group=research
id=/home/repository/docs/file1
id=/home/repository/docs/file2
AuthZ response passed from command line authorizer via stdout.
Each doc id must include an authz-status entry.
GSA Adaptor Data Version 1 [<delimiter>]
id=/home/repository/docs/file1
authz-status=PERMIT
id=/home/repository/docs/file2
authz-status=DENY
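The response format above pairs every id with an authz-status entry. A command line authorizer might assemble it as in the following sketch. The AuthzResponse class, its build method, and the permitted predicate are hypothetical. The sketch only emits PERMIT or DENY, and it assumes each element is preceded by the delimiter.

```java
import java.util.List;
import java.util.function.Predicate;

final class AuthzResponse {
    /** Emits the header, then an id / authz-status pair for each document. */
    static String build(String delimiter, List<String> ids,
                        Predicate<String> permitted) {
        StringBuilder out = new StringBuilder();
        out.append("GSA Adaptor Data Version 1 [").append(delimiter).append("]");
        for (String id : ids) {
            out.append(delimiter).append("id=").append(id);
            out.append(delimiter).append("authz-status=")
               .append(permitted.test(id) ? "PERMIT" : "DENY");
        }
        return out.toString();
    }
}
```

A real authorizer would derive the predicate from the user, password, and group entries read from stdin. Here the decision logic is left to the caller.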
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CommandStreamParser
public CommandStreamParser(InputStream inputStream)
getVersionNumber
public int getVersionNumber()
throws IOException
- Throws:
IOException
readFromAuthorizer
public Map<DocId,AuthzStatus> readFromAuthorizer()
throws IOException
- Throws:
IOException
readFromRetriever
public void readFromRetriever(DocId docId,
Response response)
throws IOException
- Throws:
IOException
readFromLister
public DocIdPusher.Record readFromLister(DocIdPusher pusher,
ExceptionHandler handler)
throws IOException,
InterruptedException
- Parse a listing response, sending results to pusher. If handler is null, then pusher's default handler will be used. In case of failure sending in pusher, the rest of the input stream may not be read.
- Returns:
null on success, otherwise the first Record to fail
- Throws:
IOException
InterruptedException