Section: User Commands (1)
sgmlrgn - XPAT multi-purpose SGML application
An SGML Application Conforming to International Standard ISO 8879 ---
Standard Generalized Markup Language
sgmlrgn -m mode [ -C ] [ -J ] [ -c ] [ -d ] [ -D dd_name ] [ -e ] [ -g ] [ -p ] [ -u ] [ -v ] [ -i name ] [ -o outfile_name ] [ -M meta_structure_file ] [ -G group_name ] filename ...
sgmlrgn parses and validates the SGML document entity in filename(s) and prints the results on the standard output according to the mode specification. Note that the document entity may be spread among several files. For example, the SGML document type definition (DTD) and document instance set could each be in a separate file. If the filename is a dash (-), sgmlrgn will process SGML text from standard input. This is especially useful in filter mode.
In MFS databases, the MFS system creates a ``virtual text'' from the text of all the files in the database. The portion of this virtual text that corresponds to each file consists of three pieces: the Meta-Header section, the Data section, and the Meta-Trailer section. This breakdown is illustrated in the following diagram:
<OTDoc><OTMeta>..</OTMeta><OTData>.............</OTData></OTDoc>
|--------- Meta-Header ----------|| SGML Data ||- Meta-Trailer -|
^                                 ^            ^                ^
start                             start        start            end
header                            data         trailer          pos
The data in the Meta-Header and Meta-Trailer sections is highly structured and is uniform across all the files in the MFS database. In contrast, the data in the Data sections may be untagged text, tagged text without a DTD, or tagged text with a DTD (SGML data).
The process of building region indices on such databases involves several steps. The first step involves running mfsmeta over the database to build a meta_structure_file. This file contains information about the positions of the Meta-Header, Data, and Meta-Trailer sections for each file in the database.
The second step involves building regions on the fields in the Meta-Header and Meta-Trailer sections that are common to all files. Refer to the multirgn(1) man page for further details.
The third step involves building regions for the Data sections. For the Data sections that contain tagged text without a DTD, this task is accomplished using multirgn. For SGML Data sections (that do have a DTD), this task is accomplished using sgmlrgn.
There are three types of SGML MFS databases. The first (Type 1) consists of a group of SGML files that all conform to the same DTD; each file is a complete document.
The second (Type 2) consists of a group of SGML files that conform to several different DTDs; each file is still a complete document.
The third (Type 3) consists of a group of SGML files that conform to one or more DTDs; the files may contain either complete documents or pieces of documents (i.e., the text for specific elements in the DTD). Each of the next three sections discusses how to build regions for one of the above database types.
For a Type 1 SGML database, the first step involves setting up the FilterChain section of the Data Dictionary, which specifies the SGML files to be included in the database. In particular, the DisplayFmt field should be set to the value `sgml'.
For example, the following FilterChain section might be appropriate for a Type 1 SGML database:
<FilterChain>
    <SearchView>meta</SearchView>
    <DisplayView>meta</DisplayView>
    <RawView>meta</RawView>
    <DisplayFmt>sgml</DisplayFmt>
    <FileGroup>
        <MfsDir>sgmldata</MfsDir>
        <MfsFile>*.sgm</MfsFile>
        <MfsExpand>tree</MfsExpand>
    </FileGroup>
</FilterChain>
Once the FilterChain section has been set up, the following command can be used to build the SGML regions (usually as a separate step after dbbuild). For this example, assume that the meta_structure_file generated by mfsmeta is called `data.str' and that the file `data.inp' contains the <!DOCTYPE> declaration for the SGML files in the database:
% sgmlrgn -v -m region -M data.str -D data.dd data.inp data.dd
sgmlrgn uses data.str to identify all the files in sgml format and builds SGML regions on them.
As with Type 1 SGML databases, the first step involves setting up the FilterChain sections of the Data Dictionary. However, because the files conform to more than one DTD, they must be separated into groups, where all the files in a group conform to a particular DTD. A FilterChain section is then set up for each group.
The DisplayFmt section of each FilterChain section is then set to contain two values, separated by a comma. The first value is the keyword `sgml' and the second value is a short group name that you pick, which uniquely identifies the group.
For example, the following FilterChain sections might be appropriate for a Type 2 SGML database that contains files from two DTDs (with group names `manual' and `news'):
<FilterChain>
    <SearchView>meta</SearchView>
    <DisplayView>meta</DisplayView>
    <RawView>meta</RawView>
    <DisplayFmt>sgml,manual</DisplayFmt>
    <FileGroup>
        <MfsDir>mandata</MfsDir>
        <MfsFile>*.sgm</MfsFile>
        <MfsExpand>tree</MfsExpand>
    </FileGroup>
</FilterChain>
<FilterChain>
    <SearchView>meta</SearchView>
    <DisplayView>meta</DisplayView>
    <RawView>meta</RawView>
    <DisplayFmt>sgml,news</DisplayFmt>
    <FileGroup>
        <MfsDir>newsdata</MfsDir>
        <MfsFile>*.sgm</MfsFile>
        <MfsExpand>tree</MfsExpand>
    </FileGroup>
</FilterChain>
Once the FilterChain sections have been set up, the following commands can be used to build the SGML regions (each DTD in the database requires one pass with sgmlrgn). For this example, assume the meta_structure_file generated by mfsmeta is called data.str. Assume that the file, `manual.inp' contains the <!DOCTYPE> declaration for the `manual' files. Finally, assume that the file, `news.inp' contains the <!DOCTYPE> declaration for the `news' files.
% sgmlrgn -v -m region -M data.str -G manual -D data.dd manual.inp data.dd
% sgmlrgn -v -m region -M data.str -G news -D data.dd news.inp data.dd
Note that the `-G' option is used to specify which group to build the regions on in each pass.
As with Type 2 SGML databases, the first step involves setting up the FilterChain sections of the Data Dictionary. Also, as in Type 2 SGML databases, the files must be separated into groups. What is different in Type 3 databases is that a group not only specifies the files that use a particular DTD, but may also be refined further to specify the files that contain text for a specific element of that DTD.
For example, assume the newspaper documents in the example above consisted of two elements, HEADLINE and TEXT. Further, assume that the text for all the HEADLINE parts was in files with the suffix `.hl' and that the text for the TEXT parts was in files with the suffix `.txt'. Then the following FilterChain sections could be used to define this database (which also includes the `manual' files in the other directory):
<FilterChain>
    <SearchView>meta</SearchView>
    <DisplayView>meta</DisplayView>
    <RawView>meta</RawView>
    <DisplayFmt>sgml,manual</DisplayFmt>
    <FileGroup>
        <MfsDir>mandata</MfsDir>
        <MfsFile>*.sgm</MfsFile>
        <MfsExpand>tree</MfsExpand>
    </FileGroup>
</FilterChain>
<FilterChain>
    <SearchView>meta</SearchView>
    <DisplayView>meta</DisplayView>
    <RawView>meta</RawView>
    <DisplayFmt>sgml,newshl,HEADLINE</DisplayFmt>
    <FileGroup>
        <MfsDir>newsdata</MfsDir>
        <MfsFile>*.hl</MfsFile>
        <MfsExpand>tree</MfsExpand>
    </FileGroup>
</FilterChain>
<FilterChain>
    <SearchView>meta</SearchView>
    <DisplayView>meta</DisplayView>
    <RawView>meta</RawView>
    <DisplayFmt>sgml,newstxt,TEXT</DisplayFmt>
    <FileGroup>
        <MfsDir>newsdata</MfsDir>
        <MfsFile>*.txt</MfsFile>
        <MfsExpand>tree</MfsExpand>
    </FileGroup>
</FilterChain>
Note that a third attribute has been added to the DisplayFmt fields of the `news' files, which identifies the element that the text in those files corresponds to. Also note that `HEADLINE' and `TEXT' groups have different group names (`newshl' and `newstxt'). Also note that there is no element attribute defined for the `manual' files because they are to be parsed using the entire `manual' DTD.
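The three shapes a DisplayFmt value can take (format only, format plus group, format plus group plus element) can be illustrated with a short Python sketch. This is not part of sgmlrgn: the names `DisplayFmt` and `parse_display_fmt` are made up for illustration, and the stripping of white space around each value is an assumption rather than documented behaviour.

```python
from typing import NamedTuple, Optional


class DisplayFmt(NamedTuple):
    fmt: str                 # display format keyword, e.g. "sgml"
    group: Optional[str]     # group name; absent in Type 1 databases
    element: Optional[str]   # element name; only for files holding part of a document


def parse_display_fmt(value: str) -> DisplayFmt:
    """Split a DisplayFmt value such as 'sgml,newshl,HEADLINE' into its parts."""
    parts = [p.strip() for p in value.split(",")]   # stripping is an assumption
    if not parts[0] or len(parts) > 3:
        raise ValueError(f"unexpected DisplayFmt value: {value!r}")
    parts += [None] * (3 - len(parts))              # pad a missing group/element
    return DisplayFmt(*parts)


for value in ("sgml", "sgml,manual", "sgml,newshl,HEADLINE"):
    print(parse_display_fmt(value))
```

A value with no element, such as `sgml,manual`, corresponds to files that are parsed against the entire DTD.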
Once the FilterChain sections have been set up, the following commands can be used to build the SGML regions. For this example, assume the meta_structure_file generated by mfsmeta is called data.str. Assume that the file, `manual.inp' contains the <!DOCTYPE> declaration for the `manual' files. Finally, assume that the file, `news.inp' contains the <!DOCTYPE> declaration for the `news' files.
% sgmlrgn -v -m region -M data.str -G manual -D data.dd manual.inp data.dd
% sgmlrgn -v -m region -M data.str -G newshl -D data.dd news.inp data.dd
% sgmlrgn -v -m region -M data.str -G newstxt -D data.dd news.inp data.dd
Note that the `-G' option is used to specify which group to build the regions on in each pass.
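Because each group needs its own pass, the command lines can be generated from a list of (group name, <!DOCTYPE> file) pairs. The Python sketch below only builds and prints the argument lists; it uses the same options and file names as the example above, and the helper name `region_commands` is hypothetical.

```python
def region_commands(groups, structure_file="data.str", dd_file="data.dd"):
    """Return one sgmlrgn argument list per (group_name, doctype_file) pair."""
    return [
        ["sgmlrgn", "-v", "-m", "region", "-M", structure_file,
         "-G", group, "-D", dd_file, doctype_file, dd_file]
        for group, doctype_file in groups
    ]


# The three passes of the Type 3 example: one per group.
passes = [
    ("manual", "manual.inp"),
    ("newshl", "news.inp"),
    ("newstxt", "news.inp"),
]
for argv in region_commands(passes):
    print(" ".join(argv))
```

Each argument list could then be handed to `subprocess.run(argv, check=True)` so that a failing pass stops the build.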
An external entity resides in one or more files. The entity manager component of sgmlrgn maps a sequence of files into an entity in three sequential stages:
A system identifier is interpreted as a list of filenames separated by colons. If no system identifier is supplied, the entity manager attempts to generate a filename from the public identifier. The system filename associated with a public identifier is found by table lookup in a file named sgmlentity.map on the system. The sgmlentity.map file has two white-space delimited fields per document type: the first field is the system filename and the second field is the PUBLIC ID. The following are sample entries for document types in the sgmlentity.map file:
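The lookup described above can be sketched in Python. This is an illustration only, not the entity manager's actual code: the function names are hypothetical, the map entries are invented for the sketch, and it assumes one entry per line with the public identifier taking up the rest of the line after the filename.

```python
def parse_entity_map(text):
    """Map each PUBLIC ID to its system filename.

    Assumes one entry per line: the filename, white space, then the
    public identifier occupying the rest of the line.
    """
    table = {}
    for line in text.splitlines():
        fields = line.split(None, 1)   # split on the first run of white space
        if len(fields) == 2:
            filename, public_id = fields
            table[public_id.strip()] = filename
    return table


def resolve_entity(system_id, public_id, table):
    """Return the filenames for an external entity, or [] if none is found."""
    if system_id:
        return system_id.split(":")    # colon-separated list of filenames
    filename = table.get(public_id)
    return [filename] if filename else []


# Hypothetical entries, invented for this sketch.
SAMPLE_MAP = """\
example.dtd   -//Example//DTD Example//EN
chars.ent     -//Example//ENTITIES Example Characters//EN
"""
table = parse_entity_map(SAMPLE_MAP)
print(resolve_entity("a.sgm:b.sgm", None, table))
print(resolve_entity(None, "-//Example//DTD Example//EN", table))
```

A system identifier, when present, takes precedence; the table is consulted only when it is absent.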
The mode examples that follow rely on three files: the document type definition, the document instance, and the input file. The following is a sample document type definition called example.dtd.
The following invokes sgmlrgn as a region file generator.
The filter mode can be invoked as follows:
The following invokes sgmlrgn to check the validity of the SGML document instance:
The following invokes sgmlrgn to expand the DTD.
The first form of the command retrieves the name of the DTD from the SGML document instance, whereas the second form retrieves the name of the DTD from the document type declaration file. Either of the above commands will generate the following output:
The following command invokes sgmlrgn in print mode:
The following invokes the root mode of sgmlrgn:
The following invokes the element list generation mode:
The following invokes sgmlrgn's document instance normalization mode:
The test mode is used to generate a test document instance and is invoked by:
Invoke the generation of a sample set of test-generation parameters by:
Tag Name | Description | Default |
<OmitStag> | allow omit start tag | 0 |
<OmitStagProb> | allow omit start tag probability | 0.3 |
<OmitEtag> | allow omit end tag | 1 |
<OmitEtagProb> | allow omit end tag probability | 0.3 |
<NetTag> | allow NET tag | 1 |
<NetTagProb> | allow NET tag probability | 0.1 |
<UncloseTag> | allow unclosed tag | 1 |
<UncloseProb> | allow unclosed tag probability | 0.1 |
<Orep> | range of * element | 2 |
<Rep> | range of + element | 3 |
<OptProb> | ? element probability | 0.3 |
<GrpOrep> | range of * group | 2 |
<GrpRep> | range of + group | 3 |
<GrpOptProb> | ? group probability | 0.3 |
<DataRange> | range of # of data chars | 12 |
<DataCharRange> | range of data used chars | 3 |
<DataCharStart> | start data used char | a |
<DataEoln> | allow end of line in data | 1 |
<DataEolnProb> | allow eoln in data probability | 0.3 |
<StagEoln> | allow end of line in start tag | 1 |
<StagEolnProb> | allow eoln in start tag probability | 0.3 |
<AttRange> | range of # of char-attrib. chars | 8 |
<AttCharRange> | range of char-attrib. used chars | 2 |
<AttCharStart> | start char-attrib. used char | x |
<AttNumRange> | range of # of number-attrib. chars | 8 |
<AttNumCharRange> | range of number-attrib. used chars | 2 |
<AttNumCharStart> | start number-attrib. used char | 0 |
<IDVal> | start ID number | 0 |
<OpenElement> | not used | 0 |
The global default values, which apply to any element, are enclosed by a pair of <Regions> tags.
Within the <Regions> description, each element can override values in the global set of parameters by:
This set of test-generation parameters can be stored in a generate_file which can be used in the test generation mode, -m test.
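The relationship between the global defaults and per-element overrides can be pictured as a simple merge. The Python sketch below uses a subset of the defaults from the table above; the element name `TITLE`, its override values, and the function name `effective_params` are hypothetical, and the sketch says nothing about the syntax of the generate_file itself.

```python
# A subset of the global defaults listed in the table above.
GLOBAL_DEFAULTS = {
    "OmitStag": 0, "OmitStagProb": 0.3,
    "OmitEtag": 1, "OmitEtagProb": 0.3,
    "NetTag": 1, "NetTagProb": 0.1,
    "Orep": 2, "Rep": 3, "OptProb": 0.3,
    "DataRange": 12,
}


def effective_params(element, overrides, defaults=GLOBAL_DEFAULTS):
    """Return the parameters in force for one element.

    Values given for the element replace the global ones; every other
    parameter keeps its global default.
    """
    element_values = overrides.get(element, {})
    unknown = set(element_values) - set(defaults)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    return {**defaults, **element_values}


# Hypothetical per-element overrides.
overrides = {"TITLE": {"OmitEtag": 0, "DataRange": 40}}

print(effective_params("TITLE", overrides))
print(effective_params("PARA", overrides))   # no overrides: global defaults
```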
SYSTEM "ISO 8879:1986"
       CHARSET
       BASESET  "ISO 646-1983//CHARSET
                 International Reference Version (IRV)//ESC 2/5 4/0"
       DESCSET  0 128 0
       CAPACITY PUBLIC  "ISO 8879:1986//CAPACITY Reference//EN"
       FEATURES
       MINIMIZE DATATAG NO   OMITTAG  YES  RANK     NO   SHORTTAG YES
       LINK     SIMPLE  NO   IMPLICIT NO   EXPLICIT NO
       OTHER    CONCUR  NO   SUBDOC   YES 1  FORMAL YES
       SCOPE    DOCUMENT
       SYNTAX   PUBLIC  "ISO 8879:1986//SYNTAX Reference//EN"
       SYNTAX   PUBLIC  "ISO 8879:1986//SYNTAX Core//EN"
       VALIDATE
                GENERAL YES  MODEL    YES  EXCLUDE  YES  CAPACITY YES
                NONSGML YES  SGML     YES  FORMAL   YES
       SDIF
                PACK    NO   UNPACK   NO
The memory usage of sgmlrgn is not a function of the capacity points used by a document; however, sgmlrgn can handle capacities significantly greater than the reference capacity set.
In some environments, higher values may be supported for the SUBDOC parameter.
Documents that do not use optional features are also supported. For example, if FORMAL NO is specified in the SGML declaration, public identifiers will not be required to be valid formal public identifiers.
Certain parts of the concrete syntax may be changed:
BASESET  "ISO Registration Number 100//CHARSET
          ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
DESCSET  0 256 0
Uppercase substitution can be performed or not performed both for entity names and for other names.
Either short reference delimiters assigned by the reference delimiter set or no short reference delimiters are supported.
The reserved names can be changed.
The quantity set can be increased within certain limits, subject to there being sufficient memory available. The upper limit on NAMELEN is 239. The upper limits on ATTCNT, ATTSPLEN, BSEQLEN, ENTLVL, LITLEN, PILEN, TAGLEN, and TAGLVL are more than thirty times greater than the reference limits. The upper limit on GRPCNT, GRPGTCNT, and GRPLVL is 253. NORMSEP cannot be changed. DTAGLEN and DTEMPLEN are irrelevant since sgmlrgn does not support the DATATAG feature.
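As a rough aid for checking a QUANTITY section against the limits stated above, the following Python sketch flags values that exceed them. Only the limits given explicitly in this page are encoded; quantities whose upper limits are stated only as "more than thirty times" the reference value are not checked. The function name `check_quantities` is hypothetical, and the NORMSEP value of 2 is the reference quantity set value from ISO 8879.

```python
# Upper limits stated explicitly above.
UPPER_LIMITS = {"NAMELEN": 239, "GRPCNT": 253, "GRPGTCNT": 253, "GRPLVL": 253}
NORMSEP_REFERENCE = 2                 # NORMSEP cannot be changed
IGNORED = {"DTAGLEN", "DTEMPLEN"}     # irrelevant: DATATAG is not supported


def check_quantities(requested):
    """Return a list of messages for quantities outside the stated limits."""
    problems = []
    for name, value in requested.items():
        if name in IGNORED:
            continue
        if name == "NORMSEP" and value != NORMSEP_REFERENCE:
            problems.append(f"NORMSEP must stay at {NORMSEP_REFERENCE}, got {value}")
        elif name in UPPER_LIMITS and value > UPPER_LIMITS[name]:
            problems.append(f"{name} {value} exceeds limit {UPPER_LIMITS[name]}")
    return problems


ok = {"NAMELEN": 80, "GRPCNT": 100, "GRPGTCNT": 96, "GRPLVL": 32, "NORMSEP": 2}
bad = {"NAMELEN": 240, "GRPLVL": 300, "DTAGLEN": 99999}

print(check_quantities(ok))
for message in check_quantities(bad):
    print(message)
```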
<!SGML "ISO 8879:1986"
    CHARSET
    BASESET  "ISO 646-1983//CHARSET
              International Reference Version (IRV)//ESC 2/5 4/0"
    DESCSET    0   9  UNUSED
               9   2  9
              11   2  UNUSED
              13   1  13
              14  18  UNUSED
              32  95  32
             127   1  UNUSED
    CAPACITY SGMLREF
             TOTALCAP  1000000
             ENTCAP    1000000
             ENTCHCAP  1000000
             ELEMCAP   1000000
             GRPCAP    1000000
             EXGRPCAP  1000000
             EXNMCAP   1000000
             ATTCAP    1000000
             ATTCHCAP  1000000
             AVGRPCAP  1000000
             NOTCAP    1000000
             NOTCHCAP  1000000
             IDCAP     1000000
             IDREFCAP  1000000
             MAPCAP    1000000
             LKSETCAP  1000000
             LKNMCAP   1000000
    SCOPE    DOCUMENT
    SYNTAX   PUBLIC  "ISO 8879:1986//SYNTAX Reference//EN"
    QUANTITY SGMLREF
             ATTCNT    100
             ATTSPLEN  960
             BSEQLEN   960
             DTAGLEN   32
             DTEMPLEN  32
             ENTLVL    32
             GRPCNT    100
             GRPGTCNT  96
             GRPLVL    32
             LITLEN    1024
             NAMELEN   80
             NORMSEP   2
             PILEN     1024
             TAGLEN    960
             TAGLVL    1000
    FEATURES
    MINIMIZE DATATAG NO   OMITTAG  YES  RANK     NO   SHORTTAG YES
    LINK     SIMPLE  NO   IMPLICIT NO   EXPLICIT NO
    OTHER    CONCUR  NO   SUBDOC   YES 99999  FORMAL YES
    APPINFO NONE>
BASESET  "ISO Registration Number 100//CHARSET
          ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
DESCSET  128  32  UNUSED
         160  95  32
         255   1  UNUSED
The reference capacity set is upgraded to a 1 MB limit and the reference quantity set is upgraded to a bigger limit.
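In ISO 8879, each DESCSET entry is a triple: the first described character number, the number of characters, and either the corresponding base set character number or UNUSED. The Python sketch below expands the ISO 646 portion of the declaration above (read as 0 9 UNUSED, 9 2 9, 11 2 UNUSED, 13 1 13, 14 18 UNUSED, 32 95 32, 127 1 UNUSED) into a per-character mapping. It is purely illustrative: the function name `expand_descset` is hypothetical, and UNUSED is represented as `None`.

```python
def expand_descset(entries):
    """Expand (start, count, base) triples into {char_number: base_number}.

    A base of None stands for UNUSED: the described character number
    has no meaning in the document character set.
    """
    mapping = {}
    for start, count, base in entries:
        for offset in range(count):
            mapping[start + offset] = None if base is None else base + offset
    return mapping


# The ISO 646 DESCSET entries from the declaration above.
REFERENCE_DESCSET = [
    (0, 9, None), (9, 2, 9), (11, 2, None), (13, 1, 13),
    (14, 18, None), (32, 95, 32), (127, 1, None),
]

charset = expand_descset(REFERENCE_DESCSET)
used = sorted(n for n, base in charset.items() if base is not None)
print(len(charset), "character numbers described,", len(used), "in use")
```

The entries tile the range 0 to 127 without gaps, which is why the counts of the triples add up to 128.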
xpat(1), xpatrgn(1), multirgn(1), mfsmeta(1), regions(5), data_dictionary(5), mfs(5)
The SGML Handbook, Charles F. Goldfarb
ISO 8879 (Standard Generalized Markup Language), International Organization for Standardization
CAN/CSA-Z243.210-89 (ISO 8879, 9069), Canadian Standards Association
ARCSGML was written by Charles F. Goldfarb.
Sgmls was derived from ARCSGML by James Clark.
sgmlrgn was derived from Sgmls.