<!DOCTYPE TEI.2  PUBLIC "-//TEI//DTD TEI Lite 1.0//EN" --
                   SYSTEM  "teilite.dtd" --
[
<!ENTITY null "">
<!ENTITY perc CDATA "\%">
]>

<TEI.2>
<TEIHEADER>
<FILEDESC>
<TITLESTMT><TITLE>Text Encoding for Information Interchange: An
Introduction to the Text Encoding Initiative.</TITLE>
</TITLESTMT>
<PUBLICATIONSTMT><P>For presentation at the LEC, London, Nov 1995.</P>
</PUBLICATIONSTMT>
<SOURCEDESC><P>Revised and expanded version of article published as
<TITLE>The Wider Relevance of the Text Encoding Initiative  </TITLE> in
OII Spectrum, Nov 1994.
</P>
</SOURCEDESC></FILEDESC>
</TEIHEADER>
<TEXT>
<FRONT>
<TITLEPAGE>
<DOCTITLE>
<TITLEPART>Text Encoding for Information Interchange : An Introduction
to the Text Encoding Initiative
</TITLEPART></DOCTITLE>
<DOCAUTHOR>Lou Burnard</DOCAUTHOR>
<!-- DOCIMPRINT>Document No:  TEI ED 99</DOCIMPRINT -->
<DOCDATE>July 1995</DOCDATE>
</TITLEPAGE>
<!-- divgen type='toc' -->
<!-- ==================================cut to save space===========
<DIV TYPE="abstract">
<HEAD>Abstract</HEAD>
<P>The Text Encoding Initiative (TEI) is an international cooperative
research effort, the goal of which is to define a set of generic
Guidelines for the representation of textual materials in electronic
form, in such a way as to enable researchers in any discipline to
interchange and re-use resources, independently of software, hardware,
and application area. The first full version of the TEI Guidelines was
published in May 1994, after six years of development in Europe and the
US, with  major funding from the American National Endowment for the
Humanities and from the EU.  
</P>
<P>This publication takes the form of a substantial reference manual,
documenting a modular and extensible SGML document type definition
(DTD), capable of describing all kinds of texts, of all times and in all
languages.  Modern computer-aided research crosses many political,
linguistic, temporal, and disciplinary boundaries;  the TEI scheme
reflects this fact. As far as possible, the Guidelines eschew
controversy; where consensus has not been established, only very general
recommendations are made.  The object is to help the researcher make his
or her position explicit, not to dictate what that position should be.
</P>
<P>Consequently, the TEI <Q>standard</Q> offers neither a single
all-embracing encoding scheme, nor an unstructured collection of tag
sets.  Rather it offers an extensible framework containing a common core
of features, a choice of frameworks, and a wide variety of optional
additions for specific application areas.  
</P>
<P>The value of this approach has been demonstrated by the diversity of
fields in which the Guidelines have been applied, ranging from
traditional humanities disciplines such as textual editing and critical
editions, to linguistic engineering applications such as  spoken and
written language corpora. The Guidelines include detailed provision for
the encoding of meta-textual cataloguing information; they also include
an extensive set of recommendations for the representation of arbitrary
interpretive structures and for the encoding of complex hypertextual
links and alignments, such as those needed for multilingual corpora, and
multimedia applications.
</P></DIV>
 ==================================end of cut to save space===========  -->
</FRONT> 
<BODY>
<DIV1 TYPE="foo"><HEAD>Standardization and the TEI
</HEAD> 
<P>Standards come into being for a variety of reasons, and in a variety
of ways, not always entirely explicable. They may be entirely
market-defined; for example by manufacturers' attempts singly or as a
group to control market share, or by consumers' desires to simplify
purchase decisions. Standards also  result from pressure applied by
well-intentioned groups of experts, or as a consequence of legislation
in the public interest. And finally, standards come about as the
expression of some emergent consensus within some large community. This
last method is the most likely to endure, but the most difficult to
achieve. 
</P>
<P>In creating such a consensus, there is an inevitable tension between
the need to transform what is simply tried and tested into something
normative and binding on the one hand, and the reluctance to
straitjacket or constrain unanticipated development on the other. This
is particularly true of the research and development arena, which
depends for its survival on innovation, and thus the ability to provide
answers to as yet unformulated questions, while at the same time being
as concerned as any other community to codify existing practice. The
research community is populated by experts, who need maximal flexibility
and who distrust constraint, but also by novices, who need access to
that accumulated expertise in a consistent and codified form, if only in
order to rebel against it and thus become experts in their turn. 
</P>
<P>Standardization of the way in which information is stored and
represented (rather than processed) is the key to a number of closely
related problems, all of central concern to users
of modern Information Technology, be they academic or commercial. 
For creators of language resources in
particular, it addresses the difficulty of ensuring that information is
<EMPH>reusable</EMPH>; the difficulty of ensuring that information
represented in different ways can be seamlessly
<EMPH>integrated</EMPH>; and the difficulty of facilitating loss-free
information <EMPH>interchange</EMPH> between the widest choice of
different platforms, different application systems and different
languages.  
</P>
<P>By standardizing at the level of text representation, we can
hope to retain the flexibility needed to develop new applications, while
ensuring that old ones continue to function. By attempting a
theory-neutral standardization, at the level where consensus exists, we
avoid the need to reinvent the wheel, without requiring that everyone
drive a particular brand of bicycle. 
</P>
<P>In this spirit, the TEI <TITLE>Guidelines</TITLE>, which form the
topic of this paper, aim to provide not a set of normative rules for
particular applications, but rather a modular and extensible framework,
within which particular application-specific norms can be defined. The
development of such TEI-aware norms is already underway in a number of
contexts, most significantly for the present audience, within EAGLES and
related EU projects such as Multext, but also in a wide variety of
corpus building, scholarly editing and digital library projects. Such
projects have in common the need to customize and make less generic the
framework defined by the TEI, retaining as they do so the capacity for
interchange for which it was developed. The general principles, and
many of the specific mechanisms, underlying this approach are of clear
relevance to all large scale users of information technology.
</P>
<P>This paper<NOTE>An earlier version of which was published as
<TITLE>The Wider Relevance of the Text Encoding Initiative </TITLE> in
OII <title>Spectrum</title>, Nov 1994.</NOTE> describes the origins and organization of
the TEI scheme, including some technical details of how it may be customized for
multiple application areas, and an overview of its coverage. 
</P></DIV1>
<DIV1><HEAD>What is the TEI?</HEAD> 
<P>The Text Encoding Initiative (TEI) is an international cooperative
research effort, the goal of which is to define a set of generic
Guidelines for the representation of textual materials in electronic
form. The project was sponsored and organized by three leading
professional associations in the field: the Association for
Computational Linguistics (ACL), the Association for Literary and
Linguistic Computing (ALLC) and the Association for Computing and the
Humanities (ACH). It has been funded throughout its five years of
activities on both sides of the Atlantic: primarily by the US National
Endowment for the Humanities and by the European Union 3rd framework
Programme for Linguistic Research and Engineering, but also with grants
from the Mellon Foundation and from the Canadian Social Sciences and
Humanities Research Council. Of equal significance has been the donation
of time and expertise by the many members of the wider research
community who have served on the TEI's Working Committees and Working
Groups. 
</P>
<P>As its title suggests, the TEI is strongly interested in text. But
this interest is by no means confined to the use of electronic text as a
stage in the production of paper documents, and the word
<MENTIONED>text</MENTIONED> should not be read too literally.  The TEI
is equally concerned with both textual and non-textual resources in
electronic form, whether as constituents of a research database or
components of non-paper publications.  
</P>
<P>Like the publishing industry, the research community has long
realized that its stock in hand is not words on the page, but
information, independent of any particular physical realization. As
technology emerges which is genuinely adequate to the task of
integrating text, graphics and audio into a seamless information-bearing
vehicle, so the importance of that integrated vision becomes more
apparent. By providing a description of information which is independent
of realization or media, the TEI scheme, like other SGML-based
approaches, enormously facilitates the construction and exploitation of
multimedia technology. 
</P>
<P>In the same way, the texts with which language researchers are
concerned are likely to be very heterogeneous. In the construction of
language corpora such as the British National Corpus <NOTE>See
<CODE>http://info.ox.ac.uk/bnc</CODE> for details of this 100 million
word TEI-conformant corpus of modern British English.</NOTE>, material
as diverse as newspapers, books, office memoranda, playscripts, publicity
brochures, letters and diaries, transcribed lectures and interviews, TV
and radio broadcasts, and unscripted conversations is integrated into
a single body of material. Research needs require that this integration
be carried out with minimal loss of information, and at the same time
with minimal complexity: in any case, the resulting 
<socalled>text</socalled> is far removed from the conventional notion of a printed
work.  
</P>
<P>Electronic texts are most obviously different from printed ones in
that the former contain <TERM>markup</TERM> or encoding, which makes
explicit various features of the text, so that they can be  efficiently
processed. Printed texts adopt a variety of similarly-motivated
conventions (use of typeface, organization of the carrier medium etc),
but these are not so readily processable as the tags of a formal markup
scheme.  
</P>
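<P>As a minimal sketch of what making features explicit means in
practice, the fragment below tags a title, a technical term and an
emphasized word, using element names of the kind used to encode this
paper itself; the sentence encoded is invented purely for illustration.
<EG><![ CDATA [<p>The <title>Guidelines</title> treat <term>markup</term> as
something that <emph>must</emph> be processable by machine.</p>
]]></EG>  A printed page might well signal all three features with the
same italic typeface; the tags allow software to tell them apart.
</P>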
<P>The goals of the TEI project initially had a dual focus, being concerned with both
<EMPH>what</EMPH> textual features should be encoded (i.e. made
explicit) in an electronic text, and <EMPH>how</EMPH> that encoding
should be represented for loss-free, platform-independent, interchange. 
<!-- The approach taken was a two stage one: 
<LIST TYPE="gloss">
<LABEL>firstly</LABEL>
<ITEM> the identification of those distinctions concerning which there
is common agreement; </ITEM>
<LABEL>secondly</LABEL>
<ITEM> the creation of a uniform encoding system within which those
distinctions can be expressed for interchange.</ITEM> </LIST> 
-->
</P>
<P>Early on in the project, the Standard Generalized Markup Language
(SGML; ISO 8879) was chosen as the most appropriate vehicle for the
Guidelines, initially on the purely pragmatic grounds that to create a
comparably expressive and versatile formal language would be a major
research project in itself. In the event, despite some frequently
rehearsed inelegancies, SGML has proved entirely adequate to the needs
of researchers, and after five years, is still increasing its domination
of the software industry, with new product announcements coming every
year. The TEI was thus able to focus its efforts on the expression,
using SGML, of the set of textual features indicated as its first
goal above. 
</P>
<P>The prime deliverable of the TEI project is a very large number
(over 400) of textual feature definitions, expressed as 
SGML elements and attributes, with associated documentation and examples. 
These elements are grouped into <TERM>tag
sets</TERM> of various kinds, as further discussed below, and together
constitute a modular scheme which can be configured to provide
hardware-, software-, and application-independent support for the
encoding of all kinds of text in all languages and of all times.   
</P>
<P>The TEI tag sets are necessarily based on, but not limited by,
existing encoding practices; they are designed to be both comprehensive
and extensible. They are collectively documented in a substantial
reference manual, the <TITLE>Guidelines for Electronic Text Encoding and
Interchange</TITLE>, which appeared in May 1994 after five years of
extensive development work. 
<!-- A first draft of this publication appeared in
November 1990. Between 1992 and 1994, chapters of a revised draft were
circulated electronically. A fully revised and completed version, known
internally as P3, was first published in May 1994. 
A revised and corrected edition is tentatively planned  for first
quarter 1996. -->
This 1400-page manual is published in both paper and
electronic hypertext form and is also available over the Internet, in a
variety of formats. <NOTE><TITLE>Guidelines for Electronic Text Encoding
and Interchange</TITLE> edited by
C.M.Sperberg-McQueen and Lou Burnard (Chicago and Oxford, ALLC-ACH-ACL
Text Encoding Initiative, 1994).  For details of current availability
and locations, see the official TEI Web page at 
<CODE>http://www-tei.uic.edu/orgs/tei</CODE>.</NOTE> 
</P></DIV1>
<DIV1><HEAD>Organization of the TEI scheme </HEAD> 
<P>As an SGML application, the TEI scheme necessarily requires the
existence of some kind of <TERM>document type definition</TERM> (DTD). 
Current approaches to DTD design may be caricatured as falling into one
of three camps, depending on their answer to the question <Q>How many
DTDs does the world need?</Q>.   
</P>
<P>For many of the first users of SGML, the appropriate answer was
<Q>One</Q>: the whole purpose of the exercise being to define a
template against which all texts could be checked rigorously and
consistently. This approach, which might be characterized by the phrase
<Q>we know what's best for you</Q>, has an obvious place in
applications such as technical documentation, but is equally obviously
inappropriate where the object of the exercise is to describe texts
produced before the blessings of structured document design were
revealed to the world. 
</P>
<P>At the opposite extreme are those whose answer would be
<Q>none</Q>, for whom no DTD can ever be adequate to the full
complexity of the texts to be described: this attitude might be
caricatured as <Q>No-one will ever understand my problem</Q>. Again, it
is not impossible to imagine applications for which a DTD consisting
only of elements with the content model ANY would be entirely
appropriate (the first electronic edition of the <TITLE>Oxford English
Dictionary</TITLE> provides one obvious example), although its
usefulness in the general case is less clear. 
</P>
<P>Perhaps most numerous are those who shrug their shoulders and say
<Q>as many as it takes</Q>: the world will always need new DTDs, in the
boundary case, one per document. In the name of pragmatism, this
attitude risks crowding the fledgeling possibility of information
interchange out of the nest entirely; nevertheless, its popularity
reminds us that sometimes the document must drive
the DTD, rather than the reverse. 
</P>
<P>The approach taken by the TEI attempts to combine virtues of all
three of these approaches. It defines not one, but many possible DTDs,
which may be tailored to the needs of a particular application in a way
difficult or impossible with most other general purpose DTDs so far
developed. The user of the TEI scheme is offered the opportunity of
building a DTD which matches his or her requirements, but constrained to
do so in a way that facilitates interchange. 
</P>
<P>We refer to this somewhat jocularly as the Chicago Pizza model. All
pizzas have some ingredients in common (cheese and tomato sauce); in
Chicago, at least, they may have entirely different forms of pastry
base, to which (universally) the consumer is expected to add his or
her own selection of toppings. Using SGML syntax this might be
summarized as follows:
<EG><![CDATA [<!ENTITY % base    "(deepDish | thinCrust | stuffed)" >
<!ENTITY % topping "(sausage | mushroom | pepper | anchovy ...)"> 
<!ELEMENT pizza - - (%base;, (cheese & tomato), (%topping;)* )>
]]></EG>  In the same way, the user of the TEI scheme constructs a view
of the TEI DTD by combining the core tag sets (which are always
present), exactly one <SOCALLED>base</SOCALLED> tag set and his or her
own selection of <SOCALLED>additional</SOCALLED> tag sets or toppings. 
</P>
<P>We use the term <TERM>tag set</TERM> to denote simply a collection
of definitions for SGML elements and their attributes. These tag sets
are the basic organizing principles of the TEI scheme, and are divided
into four groups:
<LIST type=gloss>
<label>core </label><ITEM>tag sets defining elements likely to be needed by all
documents, and therefore available by default in all cases.
</ITEM>
<label>base </label><ITEM>tag sets defining particular classes of document whose
gross structure may vary; in general, only one base tag set is
appropriate for a given document.
</ITEM>
<label>additional </label><ITEM>tag sets defining sets of elements which may be
found in any class of document but which are typically associated with
some specialized application or detailed subject area.
</ITEM>
<label>auxiliary </label><ITEM>tag sets comprising elements with highly
specialized roles, typically for description of some part of the encoding
scheme, and which make up a DTD independent of the main one.
</ITEM>
</LIST>In general, elements appear in only one tag set, though the
current model allows for the redefinition of elements within different
base tag sets. Elements may not be defined in more than one additional
tag set. 
</P>
<P>This modularization is achieved by the use of parameter entities in
the TEI DTD, which is further discussed below. To illustrate the basic
mechanism we present here the start of a minimal TEI-conformant document
in which the base tag set for prose has been selected together with the
additional tag set for linking: 
<EG><![CDATA [<!DOCTYPE tei.2 [
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.linking "INCLUDE">
]>
<tei.2>
 <!-- content of document here -->
</tei.2>
]]></EG>  Because this selection of tag sets is effected explicitly by
declarations within the DTD subset, as shown above, any recipient of the
document can tell which TEI tag sets are required to process it. Any
deviations or modifications of the TEI definitions (for example, the
renaming of elements, or the addition of new ones) may be made in a
similar declarative manner. Once a given view of the TEI DTD has been
defined in this way, it can be fixed or <SOCALLED>compiled</SOCALLED>
to preclude further modification and also to remove the complexity
necessarily introduced by the extensive use of indirection in the TEI
DTD.  
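</P>
<P>By way of illustration only, a modification might be declared along
the following lines. The sketch assumes the conventions of the
published Guidelines, in which extensions are supplied through
parameter entities named <ident>TEI.extensions.ent</ident> and
<ident>TEI.extensions.dtd</ident>, and an element is renamed by
redefining an entity whose name begins <ident>n.</ident>; the file
names, the new element and the new name chosen are all hypothetical.
<EG><![ CDATA [<!DOCTYPE tei.2 [
<!ENTITY % TEI.prose "INCLUDE">
<!-- hypothetical local files holding the modifications -->
<!ENTITY % TEI.extensions.ent SYSTEM "myproject.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "myproject.dtd">
]>

<!-- in myproject.ent: rename the paragraph element (assumed convention) -->
<!ENTITY % n.p "para">

<!-- in myproject.dtd: declare a new, project-specific element -->
<!ELEMENT recipe - - (#PCDATA)>
]]></EG>  Since everything is stated in declarations, a recipient can
see where a document departs from the unmodified scheme.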
</P></DIV1>
<DIV1><HEAD>The TEI core </HEAD> 
<P>Two core tag sets are available to all TEI documents without
formality. The first defines a large number of elements which may appear
in almost any kind of document, whatever kind of base tag set is in use.
The second defines the <TERM>header</TERM>, providing something
analogous to an electronic title page for the electronic text.
</P>
<DIV2 TYPE="splot"><HEAD>Elements available to all bases</HEAD> 
<P>The core tag set common to all TEI documents provides means of
encoding with a reasonable degree of sophistication such textual
features as typographically highlighted phrases (optionally
distinguishing highlighting used for emphasis, technical terms,
foreign words, titles etc.); quoted phrases (optionally distinguishing
amongst direct speech, quotation, glosses, cited phrases etc.);
<Q>data-like</Q> phrases such as
names, numbers and measures, dates and times, etc.;
lists of all kinds;
basic editorial changes (e.g. correction of apparent errors;
regularization and normalization; additions, deletions and omissions);
simple links and cross references, providing basic hypertextual
features; facilities for annotation, indexing, bibliographic citations
and referencing systems.
<!-- == deleted to save space ======================================
the following
textual features: 
<LIST>
<ITEM>typographically highlighted phrases, (optionally distinguishing
amongst highlighting for emphasis, technical terms, foreign words,
titles etc.)
</ITEM>
<ITEM>quoted phrases, optionally distinguishing amongst direct speech,
quotation, glosses, cited phrases etc.
</ITEM>
<ITEM>names, numbers and measures, dates and times, and similar
<Q>data-like</Q> phrases.
</ITEM>
<ITEM>lists of all kinds
</ITEM>
<ITEM>basic editorial changes (e.g. correction of apparent errors;
regularization and normalization; additions, deletions and omissions)
</ITEM>
<ITEM>simple links and cross references, providing basic hypertextual
features.
</ITEM>
<ITEM>pre-existing or generated annotation and indexing
</ITEM>
<ITEM>bibliographic citations, adequate for most commonly used
bibliographic packages, in either a free or a tightly structured format
</ITEM>
<ITEM>simple or complex referencing systems, not necessarily dependent
on the existing SGML structure.
</ITEM> 
</LIST>
== end of deleted to save space ====================================== -->
There are few documents which do not exhibit some of these
features; and none of these features is particularly restricted to any
one kind of document. In some cases, an additional tag set is also
available, providing more specialized elements for those wishing to
encode aspects of these features in greater detail (for example, for
verse and drama, and for names), but the elements defined in this core
are believed to be adequate for most applications most of the time.   
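</P>
<P>The short passage below suggests how several of these core features
might be combined. It is a sketch only: the sentence is invented, and
the element and attribute names (<GI>date</GI>, <GI>name</GI>,
<GI>foreign</GI>, <GI>corr</GI>, <GI>ptr</GI> and their attributes) are
assumed to be those of the published Guidelines rather than taken from
the discussion above.
<EG><![ CDATA [<p>In <date value="1994-05">May 1994</date> the
<name type="org">TEI</name> published its <title>Guidelines</title>,
which describe, <foreign lang="la">inter alia</foreign>, a
<term>header</term> for <corr sic="eletronic">electronic</corr> texts;
see <ptr target="hdr">.</p>
]]></EG>  The original misspelling is kept in the <IDENT>sic</IDENT>
attribute, so that the correction remains visible and reversible.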
</P></DIV2>
<DIV2 ID="hdr"><HEAD>The header</HEAD> 
<P>The TEI scheme attaches particular importance to the provision of
documentary or bibliographic information about electronic texts. Such
information is essential for any satisfactory interchange of texts
coming from multiple sources, or for which long term uses are envisaged.
<!-- As with software, leaving the documentation of an electronic text to the
last moment is a recipe for disaster all too commonly followed.  -->
</P>
<P>The TEI header is one of the few mandatory elements in a TEI
document. It has four major divisions which together provide a detailed
syntax for the documentation of:
<LIST>
<ITEM>the electronic document itself and  the sources from which it was
derived;
</ITEM>
<ITEM>the encoding system which has been applied;
</ITEM>
<ITEM>descriptive information categorizing the document and its subject
matter;
</ITEM>
<ITEM>its revision history.</ITEM> 
</LIST></P>
<P>The first of these, the <TERM>file description</TERM>, contains
traditional bibliographic material, detailing title, intellectual
responsibility and publication or distribution information relating to
an electronic text, which can readily be translated into a conventional
catalogue record for use by the growing number of forward-thinking
academic and public libraries now coming to terms with their new role as
curators of non-print electronic materials. 
</P>
<P>Several commentators, noticing how the day to day information
processing of all sectors of the economy now takes place in electronic
form only, have expressed concern at the difficulties faced by
librarians and archivists in handling these new forms of historical
records. Others, trying to come to terms with the wealth of information
in <Q>cyberspace</Q>, have lamented the absence of any effective
cataloguing standards for networked resources and other forms of
electronic publication. For creators of language corpora, the provision
of such meta-descriptive information is essential, since without it
analysis of the full complexity of language use is all but impossible.
The TEI Header represents a major contribution to overcoming all these
problems. 
</P>
<P>Many electronic texts are essentially derivative works, created
either by keying or scanning previously existing print materials,
combining or modifying previously existing electronic materials, or
both.  The <TERM>source description</TERM> part of the TEI header
allows an encoder to specify the source or sources from which a text has
been derived, using traditional bibliographic concepts. The pedigree of
a TEI-conformant text can thus be specified, in the same way as a
conventional book will generally document its publishing history. A
detailed formal description of changes made in producing a text can be
recorded as a distinct <TERM>revision history </TERM>; this is
particularly useful for highly dynamic texts. 
</P>
<P>As noted above, the TEI is not a fixed encoding scheme, but offers a
variety of options appropriate to different situations. Consequently,
the <TERM>encoding description</TERM> within a TEI Header is of
particular importance to users of an electronic document. It provides,
in structured or unstructured form, vital information about editorial
conventions or policies, design decisions and even the selection of tags
actually used within the document.   
</P>
<P>The <TERM>profile description</TERM> is used to group together a
wide range of additional descriptive information ranging from
specifications of the languages used within it, the situation or social
context in which it was produced, its topics or classification, to
demographic or social characteristics of its authors or participants.
No-one is likely to need all of these categories of information, but 
<!-- the working groups involved in defining the header agreed that -->
all of them are likely to be essential to some users. 
</P>
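<P>The four divisions might be laid out as in the following skeleton.
The element names are assumed to be those of the published Guidelines
(only the file description appears in the header of this paper itself),
and all content shown is a placeholder.
<EG><![ CDATA [<teiHeader>
 <fileDesc>  <!-- bibliographic description of the file and its source -->
  <titleStmt><title>An example text</title></titleStmt>
  <publicationStmt><p>Unpublished.</p></publicationStmt>
  <sourceDesc><p>Keyed from a printed original.</p></sourceDesc>
 </fileDesc>
 <encodingDesc>  <!-- editorial conventions and design decisions -->
  <editorialDecl><p>Apparent errors corrected silently.</p></editorialDecl>
 </encodingDesc>
 <profileDesc>  <!-- languages, context, classification -->
  <langUsage><language id="en">English</language></langUsage>
 </profileDesc>
 <revisionDesc>  <!-- record of changes made to the file -->
  <list><item>First draft keyed and proofread.</item></list>
 </revisionDesc>
</teiHeader>
]]></EG>  In this sketch only the file description is treated as
indispensable; the other three divisions would be added as the
application demands.
</P>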
<!--
<P>At one extreme, an encoder may provide only a bibliographic
identification of the text.  At the other, encoders wishing to ensure
that their texts can be used for the widest range of applications, will
want to provide a level of detailed documentation approximating to the
kind most often supplied in the form of a manual. Most texts will lie
somewhere between these extremes; textual corpora in particular will
tend more to the latter extreme. 
</P>
-->
<P>A collection of TEI headers can also be regarded as a distinct
document, and an auxiliary DTD is provided to support interchange of
headers alone, for example between libraries or archives. 
</P></DIV2></DIV1>
<DIV1 ID="bas" TYPE="frrpo"><HEAD>The TEI base tag sets</HEAD> 
<P>To construct a view of the TEI DTD, the user must always choose one
base tag set. Six of these are currently defined, for documents which
are predominantly one of prose, verse, drama, transcribed speech,
dictionaries, or terminological databases. Another two are
provided for use with texts which combine these basic tag sets.
<!-- <LIST>
<ITEM>prose
</ITEM>
<ITEM>verse
</ITEM>
<ITEM>drama
</ITEM>
<ITEM>transcribed speech
</ITEM>
<ITEM>letters and memoranda
</ITEM>
<ITEM>dictionary entries
</ITEM>
<ITEM>terminological entries
</ITEM> 
</LIST>
-->
</P>
<P>The choice of a base tag set determines the basic structure of all the
documents with which it is to be used, reflecting the fact that 
subelements likely to appear within a
dictionary (for example) will be entirely different in kind from those
likely to appear within a letter or a novel, and even
more so from those likely to be found in a transcription of spoken
language.  To cater for this variety, the constituents of all divisions
of a TEI <GI>text</GI> element are not defined explicitly, but in terms
of <TERM>parameter entities</TERM>. The mechanism used is to provide
definitions like the following within the DTD, one of which the
user must over-ride by supplying an appropriate declaration 
in the DTD subset:
<eg><![ CDATA [
<!ENTITY % TEI.prose "IGNORE">
<!ENTITY % TEI.dictionary "IGNORE">
]]></eg>
The body of the main DTD contains a series of alternative definitions,
each enclosed within an SGML <term>marked section</term> named after the
base which it defines, as in this simplified example:
<eg><![ CDATA [
<![ %TEI.prose [
<!-- This definition is in force when the prose base is selected -->
<!-- Its effect is to define component as either paragraph or list -->
<!ENTITY % component "p|list" >
]&null;]>

<![ %TEI.dictionary [
<!--This definition is in force when the dictionary base is selected -->
<!-- Its effect is to define component as entry alone -->
<!ENTITY % component "entry" >
]&null;]>

<!-- This definition is always in force -->
<!-- Its effect is to define component.seq as one or more of -->
<!-- whatever definition of component is currently in force -->
<!ENTITY % component.seq "(%component)+">
]]></eg>
Within the body of the DTD, elements are defined using these
parameter entities only, for example:
<eg><![ CDATA [
<!ELEMENT div - - ((%component.seq)+)>
]]></eg>

To select a base tag set a declaration such as the following should
be supplied within the DTD subset for the document:

<eg><![ CDATA [
<!ENTITY % TEI.prose "INCLUDE">
]]></eg>
This will over-ride the declaration within the TEI DTD itself, because
the DTD subset is read first and SGML honours only the first declaration
of any entity. If no base is declared, the DTD will not compile.
</P>

<P>The value of the parameter entity called
<ident>component.seq</ident> will thus  differ in different bases. In
this way it is possible for the divisions of a text using the drama
base (for example) to consist of speeches and stage directions, while
those of a text using the dictionary base will consist of lexical
entries.  </P>
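<P>To make the indirection concrete, the declarations above can be
expanded by hand. The following sketch uses only the simplified
definitions already given, and shows what a parser would in effect see
for the <GI>div</GI> element under each base:
<eg><![ CDATA [
<!-- prose base selected: %component; is "p|list" -->
<!-- so %component.seq; is "(p|list)+" -->
<!ELEMENT div - - (((p|list)+)+)>

<!-- dictionary base selected: %component; is "entry" -->
<!-- so %component.seq; is "(entry)+" -->
<!ELEMENT div - - (((entry)+)+)>
]]></eg>
The declaration of <GI>div</GI> in the DTD is written once; only the
replacement text of the entities changes.
</P>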

<DIV2><HEAD>Textual Divisions</HEAD> 

<P>Although the actual components may differ, textual components
may be grouped into higher level
<socalled>division</socalled>s in almost any kind of text. These
higher level units may be called variously
<SOCALLED>chapters</SOCALLED>, <SOCALLED>sections</SOCALLED>,
<SOCALLED>subdivisions</SOCALLED>, <SOCALLED>acts</SOCALLED> or
<soCalled>parts</socalled> but all seem to behave in more or less the
same way: they are incomplete in themselves, and nested hierarchically. In
the TEI scheme all such objects are therefore regarded as the same
kind of element, called here a <TERM>division</TERM>. <!--
; though a
distinction is made between divisions whose hierarchic position is
regarded as inseparable from their semantics (these are encoded as
<GI>div1</GI>, <GI>div2</GI> etc. down to <GI>div7</GI> elements) and
those for which their position in the document tree is regarded as of
lesser importance (these are known as <SOCALLED>vanilla
</SOCALLED>
<GI>div</GI>s). Numbered and unnumbered division elements may not be
mixed in the same <GI>front</GI>, <GI>body</GI>, or <GI>back</GI>
element.-->
</P>

<P>A <IDENT>type</IDENT> attribute may be used to distinguish amongst
divisions in some respect other than their hierarchic position: the
values for this attribute (as for several others in the TEI scheme) are
not standardized, precisely because no consensus exists, or is likely to
exist, as to a generic typology. A set of legal values should however be
defined for a given application, either in the TEI Header or by a
user-defined modification. 
</P>
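<P>For instance, a novel might be encoded along the following lines;
the values given for <IDENT>type</IDENT> are illustrative choices made
for this sketch, not values prescribed by the scheme.
<eg><![ CDATA [
<body>
 <div1 type="part"><head>Part One</head>
  <div2 type="chapter"><head>Chapter 1</head>
   <p>First paragraph of the first chapter.</p>
  </div2>
  <div2 type="chapter"><head>Chapter 2</head>
   <p>First paragraph of the second chapter.</p>
  </div2>
 </div1>
</body>
]]></eg>
Both levels are divisions in the sense just described; the
<IDENT>type</IDENT> values record what the encoder chooses to call them.
</P>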
<P>In the normal case, the components of all divisions in a particular
base are homogeneous --- they all use the same value for
<ident>component.seq</ident>. However, the scheme also allows for two
kinds of heterogeneity.  If the <TERM>general</TERM> base is selected,
together with two or more other bases, then different divisions of a
text may have different constituents, though each division must itself
be homogeneous. A <TERM>mixed</TERM> base is also defined, in which
components from any selection of bases may be combined promiscuously
across division boundaries.  
</P>
<P>This approach applies equally to the encoding of smaller units: 
rather than attempt to enumerate all the different analytic units which
particular disciplines might find necessary, the TEI proposes two
generic segmentation elements: one (<GI>s</GI>) for simple end-to-end
segmentation, such as that commonly used in language corpora, roughly
corresponding to the notion of orthographic sentence; the other (<GI>seg</GI>)
for segments which can potentially self-nest. In either case, a <IDENT>type</IDENT>
attribute may be used to distinguish different kinds of segment.  
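</P>
<P>By way of illustration only, a sentence segmented end-to-end with
<GI>s</GI> and further divided by means of nested <GI>seg</GI> elements
might be encoded as follows; the <IDENT>type</IDENT> values shown are
hypothetical:
<eg><![ CDATA [
<s n="1"><seg type="clause">When a text is segmented
<seg type="phrase">end to end</seg>,</seg>
<seg type="clause">no part of it is left untagged.</seg></s>
]]></eg>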
</P></DIV2>
<DIV2><HEAD>The TEI Class System and Modification Mechanisms</HEAD> 
<P>Textual features, and hence the elements which encode them,  may be 
categorized or classified in a number of ways. The TEI scheme identifies
two kinds of classification scheme: <TERM>attribute classes</TERM> and
<TERM>model classes</TERM>; both are used for broadly similar purposes. 
</P>
<P>Members of an attribute class share the same set of attributes. For
example, all elements which represent links or associations between one
element and another do so using a common set of attributes, defined
by the <ident>pointer</ident> attribute class.
<!-- All elements are members of at least one
attribute class, the class <Q>global</Q>, which is further discussed
below (section <PTR TARGET="glob">).  -->
</P>
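<P>Simplifying considerably, an attribute class may be pictured as a
parameter entity holding a set of attribute definitions, which is then
referenced in the attribute list of each member element. In the
following sketch the class name <ident>a.example</ident>, the element
names, and the attributes are all invented for illustration; they are
not the actual TEI declarations:
<eg><![ CDATA [
<!ENTITY % a.example '
      type     CDATA   #IMPLIED
      resp     CDATA   #IMPLIED'>
<!ATTLIST pointerA  %a.example;
      target   IDREFS  #REQUIRED>
<!ATTLIST pointerB  %a.example;
      target   IDREFS  #REQUIRED>
]]></eg>
Both hypothetical elements thus acquire the same two attributes from a
single definition, which need be maintained in only one place.
</P>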
<P>Members of a model class share the same structural properties: that
is, they may appear at the same position within the SGML document
structure. For example, the 
class <TERM>divtop</TERM> contains all elements (headings, epigraphs
etc.)  which can appear at the start of a textual division; all elements used to mark editorial corrections or
omissions are members of the class <TERM>edit</TERM>; elements marking
bibliographic citations etc. are all members of the class
<TERM>bibl</TERM> and so on. 
</P>
<P>Elements may of course be members of more than one class. Classes
may have super- and sub-classes, and properties (notably associated
attributes) may be inherited. Classes are defined in the TEI DTD
by means of parameter entities, and used extensively for DTD maintenance,
documentation, and extension.
<P>The TEI scheme supports three kinds of user modification: new elements may be added into
existing classes, and existing elements renamed or undefined. These operations are carried out in a controlled manner, using the class
system and without
any need for extensive revision of the TEI DTD itself.
<p>The process of adding a new element to a class may be illustrated 
as follows. Consider
the model class <term>divtop</term> mentioned above. Simplifying somewhat, this element class is defined as follows:
<eg><![ CDATA [
<!ENTITY % x.divtop "">
<!ENTITY % m.divtop "%x.divtop head | byline | epigraph">
]]></eg> To add a new element (say, <gi>keywords</gi>) to this class,
enabling it to appear anywhere in the content model that other members
of the class do, all that is needed is to re-define the
<socalled>x-entity</socalled> within the document type subset:

<eg><![ CDATA [
<!ENTITY % x.divtop "keywords |">
]]></eg>
Note the trailing vertical bar, which is required. As it happens, the
element <gi>keywords</gi> is already defined in the TEI scheme (within the
header); if it were not, an element declaration would also be
necessary.
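<p>The following sketch shows where such a declaration might sit and what
it achieves. The system identifier <ident>tei2.dtd</ident> is an
assumption about the local installation:
<eg><![ CDATA [
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!ENTITY % x.divtop "keywords |">
]>
<!-- With this declaration in force, the simplified definition of
     m.divtop given above expands to:
         keywords | head | byline | epigraph              -->
]]></eg>
The mechanism relies on the SGML rule that, where an entity is declared
more than once, the first declaration encountered is the one which
applies; since the document type subset is read before the external DTD,
the declaration given there takes precedence over the empty default.
</p>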

<p>Parameter entities are also used to effect the two other kinds of 
modification mentioned above:
the ability to undefine elements, and the ability to rename them. 

<P>Within the main TEI DTD, each element declaration and its associated
attribute list declaration is enclosed by a marked section controlled by a
parameter entity with the same name as the element, the default value of
which is "INCLUDE". Thus, to
undefine the element <gi>mentioned</gi>, all that is needed is a declaration
like the following in the DTD subset:
<eg><![ CDATA [
<!ENTITY % mentioned "IGNORE">
]]></eg>
</p>
<p>A similar declaration may be used to rename any element; for example, to 
rename <gi>p</gi> as <gi>para</gi>:
<eg><![ CDATA [
<!ENTITY % n.p "para">
]]></eg>
This works because all references to the <gi>p</gi> element throughout
the TEI DTD are made indirectly, using the <ident>n.p</ident> entity. 
Furthermore, the original name for an element is recoverable by an SGML 
application, because it forms the value of a global attribute 
<ident>teiform</ident> of declared type FIXED.
</p>
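<p>The three kinds of modification might be combined in a single document
type subset along the following lines. This is a sketch only, and the
system identifier <ident>tei2.dtd</ident> is an assumption about the
local installation:
<eg><![ CDATA [
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!ENTITY % x.divtop  "keywords |">  <!-- add keywords to class divtop -->
<!ENTITY % mentioned "IGNORE">      <!-- undefine the element mentioned -->
<!ENTITY % n.p       "para">        <!-- rename p as para -->
]>
]]></eg>
In a document using this subset, paragraphs would be tagged
<gi>para</gi> rather than <gi>p</gi>, and the element
<gi>mentioned</gi> would no longer be available.
</p>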
<p>All user-defined modifications of this kind are regarded as
forming an additional tag set, which is embedded within the DTD in the
same way as any other tag set, i.e. by enabling the 
<TERM>TEI.extensions</TERM>
parameter entities. In this way a TEI document can make explicit the
extent and nature of any modification required in the base TEI scheme
for its processing. An auxiliary tag set is also provided for the
documentation of additional SGML elements in a way compatible with that
used for the rest of the scheme. 
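</P>
<P>For example, assuming that the modifications are held in two local
files (the file names given here are hypothetical), the document type
subset might contain:
<eg><![ CDATA [
<!ENTITY % TEI.extensions.ent SYSTEM "myproject.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "myproject.dtd">
]]></eg>
The first file would typically hold entity declarations of the kind
described above, and the second the declarations for any new elements.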
</P></DIV2>
<DIV2 ID="glob"><HEAD>The global attributes</HEAD> 
<P>One particularly important class is the <TERM>global</TERM>
attribute class. By default the following attributes are members of
this class and may therefore be supplied for all elements in the 
TEI scheme:
<LIST TYPE="gloss">
<LABEL>id</LABEL>
<ITEM>provides an SGML identifier for an element</ITEM>
<LABEL>n</LABEL>
<ITEM>provides a possibly non-unique  name or number for an element</ITEM>
<LABEL>lang</LABEL>
<ITEM>specifies the language and hence the writing system used for an
element</ITEM>
<LABEL>rend</LABEL>
<ITEM>provides information about the rendering of an element where this
is not otherwise specified</ITEM>
</LIST> 
</P>
<P>This list may be extended: for example,
selecting the additional tag set for analysis will add analytic attributes to
the above list. The <IDENT>id</IDENT> and <IDENT>n</IDENT> attributes allow for
the identification of any element occurrence within a TEI-conformant
text.  Elements carrying an <IDENT>id</IDENT> attribute value may be
the object of a link or cross-reference, or any of the other
re-structuring mechanisms proposed by the TEI for circumventing the
rigidly hierarchic structure of a simple SGML DTD. The fact that the
requirement for such links is usually unpredictable is one reason for
making this attribute global. 
</P>
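<P>A brief sketch of these attributes in use follows; the identifier
values and the <IDENT>rend</IDENT> value are invented for illustration:
<eg><![ CDATA [
<div1 id="ch3" n="3">
<head rend="centered">Chapter Three</head>
<p id="ch3p1" n="3.1">...</p>
</div1>
...
<p>As argued earlier (see <ref target="ch3">chapter three</ref>) ...</p>
]]></eg>
Here the <GI>ref</GI> element points to the division by means of its
<IDENT>id</IDENT> value, while the <IDENT>n</IDENT> values simply record
a conventional numbering.
</P>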
<P>Values on <IDENT>id</IDENT> attributes must be unique (their
declared value is ID). Values on the <IDENT>n</IDENT> attribute however
need not be; they may be used to carry a TEI canonical reference. A
method for defining the structure of such canonical reference schemes is
also provided, so that documents using it can be processed
automatically. 
</P>
<P>The <IDENT>lang</IDENT> attribute indicates the language, and
hence the writing system, applicable to the element's content, thus
providing explicit support for polyglot or multiscript texts. If no
value is given, that of the element's direct parent is assumed. (A
number of TEI attributes have this characteristic, which is catered for
by a TEI-defined keyword.) The value of this attribute identifies a
special purpose <GI>language</GI> element which documents the language
in use, optionally associating it with an external entity in which a
formal <TERM>writing system declaration</TERM> (WSD) may be given.  
</P>
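<P>As an illustration, the header might document a language as shown
below, and an element in the text then refer to it. The identifier
<Q>grc</Q> and the entity name <Q>wsd.greek</Q> are hypothetical, and the
attribute names shown on <GI>language</GI> are given from memory and
should be checked against the Guidelines:
<eg><![ CDATA [
<langUsage>
<language id="grc" wsd="wsd.greek">Koine Greek, in TLG Beta Code</language>
</langUsage>
...
<p lang="grc">...</p>
]]></eg>
Any element within this paragraph which carries no <IDENT>lang</IDENT>
value of its own would be assumed to be in the same language.
</P>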
<P>A WSD  defines a language/writing system pair
(for example, <Q>Koine Greek, using TLG Beta Code</Q>),
and is formally defined by an auxiliary DTD  which allows each 
character to be systematically defined and documented, in terms of
existing international or other standards, public or private entity
sets, ad hoc transliteration schemes or explicit definitions, as well as
combinations of all four. 
</P>
<P>Finally, the global <IDENT>rend</IDENT> attribute may be used to give
information about the physical presentation of the text in the source,
where this is not otherwise given. A default rendition may be specified
for all elements of a given type. No specific set of values is defined
for this attribute in the current draft, though it is probable that some
suitable set of DSSSL primitives will be proposed in a later version. 
</P>
<P>It should be stressed that the <IDENT>rend</IDENT> attribute is
<EMPH>not</EMPH> intended for use as a means of specifying the desired
formatting of an element, except insofar as this may be determined by a
desire to mimic the approximate appearance of the original text. Like
other SGML applications, the TEI scheme attempts to provide elements for
the encoding of those textual features deemed essential to a productive
use of the encoded text; however, unlike most other SGML applications,
the TEI scheme recognizes that for some, it is precisely the appearance
of a text which is the object of research.  
</P></div2></DIV1>
<DIV1 ID="adds"><HEAD>The TEI additional tag sets</HEAD> 
<P>Ten additional tag sets are defined by the current TEI proposals.
These include tag sets for special application areas such as the
orthographic transcription of speech, the detailed physical description
of manuscript or print material, and the recording of an
<SOCALLED>electronic variorum</SOCALLED> modelled on the traditional
critical apparatus. A tag set is defined for the detailed documentation
of contextual information needed by language corpora, as well  as for
the detailed encoding of names and dates; abstractions such as networks,
graphs or trees; mathematical formulae and tables etc.   
</P>
<P>In addition to these application-specific additional tag sets, some
more general purpose additional tag sets are defined for
<LIST>
<ITEM>linking and alignment
</ITEM>
<ITEM>analysis and interpretation
</ITEM>
<ITEM>feature structure analysis
</ITEM> 
</LIST></P>
<P>The tag set for linking and alignment extends the set of linking and
pointing elements already defined in the TEI core to provide facilities
for linking to arbitrary locations or spans of texts, whether or not
these are in the current document, and whether or not the target is an
SGML document. Mechanisms are included for recording the alignment or
correspondence of parts of a text, for example in multilingual corpora,
or for marking the alignment of audio or video with a transcription of
it. As such, this tag set provides a usefully large subset of the
facilities offered by the HyTime standard, but with a considerably
simpler and more efficient interface. <NOTE>Witness the fact that, as
of May 1995, support for the  TEI <TERM>extended pointer mechanism</TERM>
 has already been implemented in SoftQuad's Panorama Pro, and Electronic
Book Technologies' DynaText --- the two market leaders amongst commercial
SGML browsing software.</NOTE> 
</P>
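<P>For instance, the alignment of sentences in a text with those of its
translation might be recorded out-of-line as follows. This is a sketch
only: the identifiers and the <IDENT>type</IDENT> value are invented, and
the sentences concerned are assumed to carry the corresponding
<IDENT>id</IDENT> values:
<eg><![ CDATA [
<div1 id="en">
<p><s id="en1">...</s> <s id="en2">...</s></p>
</div1>
<div1 id="fr">
<p><s id="fr1">...</s> <s id="fr2">...</s></p>
</div1>
...
<linkGrp type="translation">
<link targets="en1 fr1">
<link targets="en2 fr2">
</linkGrp>
]]></eg>
Because the links are held apart from the texts they align, several
competing alignments of the same material can coexist.
</P>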
<P>As noted above, a generic segmentation element is defined for the
identification of textual spans appropriate to any analytic scheme. An
out-of-line generic  <GI>interp</GI> element may be used to link
arbitrary text segments (which may be nested or discontinuous) with any
user-defined set of attribute/value pair interpretations. Specific tags
are also defined for the most common  requirements of linguistic
analysis such as identification and typing of morphemes, words, phrases,
and sentences. 
</P>
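<P>A sketch of this mechanism, with invented identifiers and values, might
run as follows:
<eg><![ CDATA [
<s id="s1" ana="i1">...</s>
<s id="s2" ana="i1 i2">...</s>
...
<interp id="i1" type="theme" value="exile">
<interp id="i2" type="theme" value="return">
]]></eg>
Each value of the <IDENT>ana</IDENT> attribute points to an
<GI>interp</GI> element stored elsewhere, so that one interpretation may
be shared by any number of segments.
</P>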
<P>A specialized tagset is also provided  for the encoding of abstract
interpretations of a text, either in parallel with it or embedded within
it. This is based on the <TERM>feature structure</TERM> notation
employed in theoretical linguistics, but has applications beyond
linguistic theory. <NOTE>An introduction to this tag set is provided by
D. T. Langendoen and G. F. Simons, <Q>A rationale for the TEI
recommendations for feature-structure markup</Q>, in <TITLE>Computers and
the Humanities</TITLE> (forthcoming, 1995); for an extended discussion
of an application of the feature structure scheme to the problems of
encoding historical source materials, see D. I. Greenstein and L.
Burnard, <Q>Speaking with one voice</Q> (ibid.).</NOTE> 
</P>
<P>Using this mechanism, encoders can define arbitrarily complex
bundles or sets of features  identified in a text, according to their
own methodological bias. They may thus embed a whole range of
interpretations of a text, linguistic, literary, or thematic, within a
text in a controlled manner. The syntax defined by the Guidelines not
only formalizes the way in which such features are encoded, but also
provides for a detailed specification of legal feature/value pair
combinations and rules determining, for example, the implication of
under-specified or defaulted features. This is known as a <TERM>feature
system declaration</TERM> and is defined by an auxiliary tag set.  
</P>
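<P>By way of illustration, a simple bundle of features describing a word
might be encoded as follows; the feature names and values are invented
for the purpose and imply no particular linguistic theory:
<eg><![ CDATA [
<fs type="word">
<f name="category"><sym value="noun"></f>
<f name="number"><sym value="plural"></f>
<f name="proper"><minus></f>
</fs>
]]></eg>
A feature system declaration for this example would state, for instance,
which values the feature <Q>number</Q> may take, and what is to be
understood when it is not specified.
</P>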
<P>An additional tag set is also provided for the encoding of degrees
of uncertainty or ambiguity in the encoding of a text. These
tag sets exhibit in a particularly noticeable form one of the chief
strengths of the TEI approach to encoding: it provides the encoder with
a well-defined set of tools which can be used to make explicit his or
her reading of a text.  No claim to absolute authority is made by any
encoder, nor ever should be; the TEI scheme merely allows encoders to
<Q>come clean</Q> about what they have perceived in a text, to whatever
degree of detail seems appropriate. 
</P>
<P>A user of the TEI scheme may combine as many or as few additional
tag sets as suit his or her needs. The existence of tag sets for
particular application areas in the current draft reflects, to some
extent, accidents of history: no claim to systematic or encyclopaedic
coverage is implied. Indeed, it is confidently expected that new tag
sets will be added, and that their definition will form an important
part of the continued work of this and successor projects.   
</P></DIV1>
<DIV1><HEAD>From General to Specific</HEAD> 
<P>The TEI Guidelines have taken more than five years to reach their
present state, the first at which they can be said to be reasonably
complete. In retrospect, it is doubtless true that they could have been
created much more quickly with less involvement from the research
community, or a clearer statement from it of a set of particular goals.
But that statement would have inevitably limited the scope of the
resulting scheme, providing exactly the kind of strait-jacket which we
wished to avoid. Moreover, by prioritizing any one research agenda
however well-articulated, we would have effectively disenfranchised and
alienated all others. A little like the early Church fathers then, the
TEI chose to provide as broad and as catholic a means of salvation as
possible. 
</P>
<P>At the same time, the TEI scheme applies rigorously the principle
<Q>essentia non sunt multiplicanda praeter necessitatem</Q><NOTE>Generally
attributed to William of Occam (1300-1349), this recommendation is known
as <TERM>Occam's Razor</TERM>; it may be translated as <Q>Essences
should not be unnecessarily multiplied</Q> and refers properly to the
distinction made by the Scholastics between <Q>essence</Q> (those
properties of an entity which define its type) and <Q>accidents</Q>
(those properties specific only to one instance of an entity).</NOTE>.
Rather than defining discrete elements for different kinds of list
(bulleted, glossary, enumerated etc.), the TEI scheme defines a single
<GI>list</GI> element which bears a <IDENT>type</IDENT> attribute to
distinguish amongst these various kinds. In the same way, all kinds of
links between document elements, whatever their semantics, are encoded
using the same tags. To cope with the indefinite number of elements
potentially needed for all kinds of analysis and interpretation, a
small number of generic tags are proposed which (in the case of the
feature structure tag set referred to above) are sufficiently abstract
and general to cater for almost any kind of interpretative judgment. 
</P>
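<P>The treatment of lists may be sketched as follows; the
<IDENT>type</IDENT> values shown are among those conventionally used, but
an application is free to define others:
<eg><![ CDATA [
<list type="bulleted">
<item>linking and alignment</item>
<item>analysis and interpretation</item>
</list>

<list type="gloss">
<label>id</label>
<item>provides an SGML identifier for an element</item>
<label>n</label>
<item>provides a name or number for an element</item>
</list>
]]></eg>
The same <GI>list</GI> and <GI>item</GI> elements serve in both cases;
only the attribute value distinguishes them.
</P>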
<P>At the same time, there remain many situations in which the TEI's
desire to exclude no-one has led to a multiplication of distinctions at
first sight rather bewildering. It seems, to say the least, unlikely that
anyone will ever encode a document using every possible element defined
by the union of every TEI tag set, though such a monster DTD is indeed
possible.  
</P>
<!-- deleted to save space ========================================
<P>Even in a relatively small area such as the definition of text
classification schemes, the TEI proposes three parallel (and mutually
incompatible)  methods. In the matter of hypertextual addressing the TEI
syntax permits of 14 different <Q>location methods</Q>. Names of
persons, places, and organizations may be left unmarked, tagged simply
as referring strings, or analyzed into subcomponents specific to them.
Bibliographic citations may be presented as simple prose, or as
assemblages of specific elements, either highly structured or loosely
assembled. The Guidelines are even, seemingly, unable to make up their
mind whether to organize text into numbered or unnumbered divisions!   
</P>
<P>It is probable that many people confronted by the 1400 pages of the
current printed version are likely to derive less comfort from knowing
that somewhere in it exists precisely the general-purpose solution they
need than they would from a demonstration of the application of that
general mechanism to the specific problem currently facing them.
 ===end deleted to save space ========================================-->
<p> As published, the Guidelines constitute a substantial document unsuitable 
for casual browsing, even in electronic form. The TEI therefore plans to
make available a number of smaller introductory tutorials focused on
particular application areas. Two such have already appeared: one
dealing with terminological systems, <NOTE>Melby, Alan et al <TITLE>Terminology
Interchange Format (TIF): a tutorial</TITLE> (Vienna, Infoterm, 1993)
</NOTE> and the other on encoding of manuscript transcriptions <NOTE>Robinson,
Peter
<TITLE>Encoding of Primary Sources Using SGML</TITLE>, Oxford, Office
for Humanities Communication, 1994</NOTE>. 
</P>
<P>A third tutorial has also recently been completed, 
documenting a special pedagogically-motivated subset of some
200 elements, selected from the whole TEI scheme (not just the core). 
Known as TEI Lite, this DTD has already been used in two electronic
publishing projects and is in use at electronic text repositories at the
Universities of Oxford, Virginia and Michigan, and elsewhere.  <NOTE>At
the time of writing, the document defining this scheme is only available
in electronic form, as  (Sperberg-McQueen, C.M. and Lou Burnard <TITLE>TEI
Lite: An Introduction to the TEI encoding scheme</TITLE> (Chicago and
Oxford, May 1995)) from the URLs
<CODE>http://www-tei.uic.edu/orgs/tei</CODE> or
<CODE>http://info.ox.ac.uk/~archive/teilite</CODE></NOTE>  
</P>
<P>The real proof of the effectiveness of the TEI design will come only
with its wide-spread adoption, tailored to the particular needs of
individual projects. As far as can be judged from the long list of early
implementors, such evidence will soon be forthcoming. 
</P></DIV1>
<DIV1><HEAD>Conclusions</HEAD> 
<P>This article has focussed chiefly on the complexity and generality
of the TEI scheme, with a view to demonstrating its intellectual
adequacy and its potential as  a model for many SGML applications.  
</P>
<P>It has also attempted to demonstrate how a simple modular scheme can
be implemented in such a way as to maximize the <Q>interchange space</Q>
within which information interchange takes place.  
</P>
<P>The origins of the TEI scheme in the academic world mean that it has
been designed with the widest possible set of applications in mind.
Optimizing it for particular sets of users will be a new challenge.  
</P>
</DIV1></BODY></TEXT>
</TEI.2>    
