A document can consist of an entire file or a portion of a file. Amberfish records ‘begin’ and ‘end’ byte offsets for each document as demarcation of the document within the file that contains it. By default the whole file is treated as a single document. For example, a file called sample.txt that is 12000 bytes in size will be indexed as a single document with ‘begin’ and ‘end’ byte offsets, 0 and 12000, respectively.
The af tool includes the --split option as a way of telling Amberfish that the files to be indexed contain multiple documents. The --split option specifies a string delimiter that marks the boundaries between documents in a file. For example:
$ af -i -d mydb -C --split '#####' -v *.txt
As the files, *.txt, are indexed, they are scanned for the string, ‘#####’. Each instance of ‘#####’ is interpreted as the beginning of a new document, and each new document is indexed individually. Note that each instance of ‘#####’ is considered to be part of the document that follows it, as opposed to the document that precedes it. If the string delimiter happens to include text, rather than merely ‘#####’, it will (normally) be indexed as text.
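The splitting rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not Amberfish's actual implementation: the function name split_offsets is hypothetical, and the sketch assumes that any text before the first delimiter forms a document of its own and that ‘end’ is the offset just past the last byte of the document (consistent with the 0 and 12000 example for a 12000-byte file).

```python
from typing import List, Tuple


def split_offsets(data: bytes, delimiter: bytes) -> List[Tuple[int, int]]:
    """Return (begin, end) byte offsets for each document found in data.

    Hypothetical helper for illustration; not part of Amberfish.
    """
    if not delimiter:
        raise ValueError("delimiter must not be empty")

    # Every occurrence of the delimiter marks the beginning of a new document.
    starts = []
    pos = data.find(delimiter)
    while pos != -1:
        starts.append(pos)
        pos = data.find(delimiter, pos + len(delimiter))

    # Assumption: text before the first delimiter is a document of its own.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)

    # Each document ends where the next one begins; the last ends at EOF.
    ends = starts[1:] + [len(data)]

    # The delimiter stays with the document that follows it, so each
    # 'begin' offset points at the delimiter itself. Empty spans are dropped.
    return [(b, e) for b, e in zip(starts, ends) if e > b]
```

For instance, applied to the bytes `intro\n#####first\n#####second\n` with the delimiter `#####`, the sketch yields three documents, and the second and third each begin with the delimiter string.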
The division of files into multiple documents can be verified with af -l after the files have been added to the database (see Listing database information).
The af --fetch command prints a portion of a file to standard output:
$ af --fetch filename begin end
where ‘filename’, ‘begin’, and ‘end’ are taken from the output of af -s (see Searching) or af -l (see Listing database information).
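The effect of fetching a document by its offsets can be approximated with a short Python sketch. It is not how af --fetch is implemented; the function name fetch is hypothetical, and the sketch assumes that ‘end’ is exclusive, that is, the offset just past the last byte of the document.

```python
def fetch(filename: str, begin: int, end: int) -> bytes:
    """Return the bytes of filename from offset begin up to, not including, end.

    Hypothetical helper for illustration; not part of Amberfish.
    """
    if begin < 0 or end < begin:
        raise ValueError("expected 0 <= begin <= end")
    with open(filename, "rb") as f:
        # Jump directly to the start of the document within the file.
        f.seek(begin)
        # Read only the document's bytes; a short read occurs at end of file.
        return f.read(end - begin)
```

Under these assumptions, calling fetch("sample.txt", 0, 12000) on the 12000-byte file from the earlier example would return the whole file.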
The --split option does not work with the xml document type, which uses a different method of dividing files into documents (see More about XML).