The British National Corpus
Lead partner in consortium
Oxford University Press
Text selection for miscellaneous and unpublished written
materials
W R Chambers
Text selection, data capture and transcription for
spoken texts and for 14% of published written texts
Longman ELT
Text selection for 86% of published written texts
Oxford University Press
Data capture and transcription for all miscellaneous and
unpublished written texts and for 86% of published
written texts
Oxford University Press
Encoding, storage and distribution
Oxford University Computing Services
Text enrichment
Unit for Computer Research into the English
Language, University of Lancaster
Corpus header automatically generated by mkcorphdr 0.20
Archive site
Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN U.K.
Telephone: +44 491 273280
Facsimile: +44 491 273275
Internet mail: natcorp@ox.ac.uk
THE BRITISH NATIONAL CORPUS IS AVAILABLE AT NOMINAL CHARGE
FOR ACADEMIC RESEARCH PURPOSES THROUGHOUT THE EUROPEAN UNION
(EU), SUBJECT TO A SIGNED END USER LICENCE HAVING BEEN
RECEIVED BY OXFORD UNIVERSITY COMPUTING SERVICES, from whom
blank forms and supporting materials are available.
Apply to Oxford University Computing Services for
information on availability in whole or part for academic
research purposes outside the EU.
The BNC is available for commercial research and
exploitation only where terms have first been agreed with
the BNC Consortium Exploitation Committee. Apply in the
first instance to Oxford University Computing Services.
Specific conditions and restrictions relating to individual
texts in the corpus are set out in their respective text
headers, and are binding upon users.
Distribution of any part of the corpus must include a copy
of this corpus header.
For information, the conditions of the End User Licence for
academic use are as follows:
(a) The BNC Consortium grants according to the terms and
conditions set out herein and in consideration of the
payments specified herein a non-exclusive, non-transferable
Licence to the Licensee to use the BNC Processed Material
for the purposes of linguistic research and/or the
development of language products.
(b) Distribution of the BNC Processed Material is
restricted to the Licensee or in the event of the Licensee
being an organisation, to the Licensee's research group.
This group is defined as consisting only of those Licensee's
employees whom the Licensee authorises to perform the work
using the BNC Processed Material for the purposes described
in paragraph (a).
(c) Members of the said research group must not, except as
herein provided, copy, publish or otherwise give to any
third party access to the whole or any part of the BNC
Processed Material. It is the responsibility of the
Licensee to ensure that the members of the said research
group understand and abide by this restriction, and to
supervise their activities with respect to the BNC Processed
Material. Neither the Licensee nor members of the
Licensee's said research group may assign, transfer, lease,
sell, rent, charge or otherwise encumber the BNC Processed
Material.
(d) The BNC Processed Material may be installed at the
place or places of work of the said research group. The
place of work is defined as the computing systems that the
members of a research group normally use to conduct their
research activities. It can include both work and home
computers, and is not restricted to a particular machine or
building.
(e) Copies of the BNC Processed Material may be made for
backup purposes, or for the purposes of making data
available to members of the research group but the Licensee
shall ensure that the BNC Consortium's copyright notice is
reproduced on all copies or parts thereof of the BNC
Processed Material. Any such copies will be deemed to be
part of the BNC Processed Material.
(f) There is no restriction on the use of the Licensee's
Results except that the Licensee may not publish in print or
electronic form or exploit commercially in any form
whatsoever any extracts from the BNC Processed Material
other than those permitted under the fair dealings provision
of copyright law.
(g) The BNC Consortium does not grant to the Licensee any
rights whatsoever to reproduce the BNC Texts or use all or
any part of the BNC Texts in commercial products or services
in any way other than would be permitted under the fair
dealings provision of copyright law.
1994-11-30
[The British National Corpus]
As a corpus of electronic texts, the British National
Corpus has no source document as such. For details of
the source or sources used in the creation of each
electronic text, see the text headers.
This bibliographic record is simply a place-holder in
the corpus header, and has no other significance.
GOALS
The British National Corpus (BNC) Consortium was formed in 1990,
and started work in 1991 on the three-year task of producing a
hundred-million word corpus of modern British English for use in
commercial and academic research.
A second, smaller, project resulted in the production of SARA,
freely-distributable search and retrieval software based on a
client-server model.
THE CONSORTIUM PARTICIPANTS
The BNC is part of a unique collaboration between three major
U.K. dictionary publishers, two universities, and the British
Library. The dictionary publishers are Chambers Harrap, Longman,
and Oxford University Press; the academic partners are the Unit for
Computer Research into the English Language (UCREL) at Lancaster
University and Oxford University Computing Services (OUCS).
FUNDING
The development of the BNC was funded by the commercial partners
in the consortium with assistance from the U.K. government's
Department of Trade and Industry (DTI) and Science and Engineering
Research Council (SERC) under the Joint Framework for Information
Technology (JFIT).
The development of SARA was funded by the British Library.
DESIGN
The British National Corpus is
+ A large corpus: its hundred million words are made up of
ninety million from written and ten million from spoken sources.
+ A sample corpus: it is composed of text samples, generally of
no more than 40,000 words, rather than of complete works.
+ A synchronic corpus: it includes imaginative texts dating from
the 1960s to 1994; informative texts dating from 1975 to 1994;
and spoken texts gathered primarily between 1990 and 1994.
+ A general corpus: it is not specifically restricted to any
particular subject field, register or genre. It includes language
from all age and social groups and a broad spread of U.K. regions.
+ A monolingual British English Corpus: text samples are
substantially the product of British English speakers. A small
proportion of the words in the corpus are in a foreign language or
non-British English.
+ A TEI-conformant Corpus: texts in the corpus are uniformly
marked up according to the recommendations of the Text Encoding
Initiative (TEI), an international consortium concerned
with the mark-up of texts for use in academic research. These
recommendations are an application of Standard Generalized Markup
Language (SGML), defined by International Standard ISO 8879:1986.
USES
+ Lexicography: The corpus provides a body of new data on word
meaning, grammar and usage. It yields empirical data on word
frequencies, word classes and spelling preferences, among other
things. It also reveals hitherto undocumented evidence about the
spoken language, with consequences that go far beyond the
immediate impact on dictionary-writing.
+ Linguistic research: The corpus provides a standard basis for
investigating phenomena and testing competing linguistic theories.
+ Language technology: Statistical techniques, requiring very
large samples of text, are increasingly used in machine
translation, speech recognition, speech synthesizers, spelling and
grammar checkers for word-processing and desk-top publishing,
hand-held electronic books and other developments in information
technology.
+ Teaching: The corpus provides a rich source of examples of
current usage for English Language Teaching, allowing more
frequent patterns of use to be distinguished from less frequent.
In addition, the corpus provides a valuable didactic resource for
use in many areas of higher education.
+ As a model: Future TEI-conformant corpora, in English and other
languages, may base their designs on the experience gained in the
production of the BNC.
Published: chosen selectively from candidate population
Published: chosen at random from candidate population
Unpublished: chosen according to relevant design criteria
Spoken: obtained from demographic sample of UK population
Spoken: obtained in context determined by design criteria
Errors tagged with <sic> when seen; no normalization
Errors tagged with <sic> if seen; normalization with <reg>
Normalized to standard British English or control list member
Corrections and normalizations applied silently
Smart elision of line-end hyphens; &rehy used for remainder
Dumb elision of line-end hyphens; true hyphens hand-reinstated
Line-end hyphens removed by hand where appropriate
Source material contains no line-end hyphens
Open, close quote normalized to &bquo, &equo
Open and close quote normalized to &quo
Quotation may be represented using <shift>
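The first quotation practice listed above can be illustrated with a
minimal sketch. It assumes the source text uses typographic double
quotation marks and that the entity references are written with a
terminating semicolon; the function name normalize_quotes is
hypothetical, and the sketch is not the procedure actually used to
prepare the corpus.

```python
# Illustrative only: maps typographic double quotes to the entity
# references named in the header. The input convention (curly quotes)
# and the trailing semicolons are assumptions.
OPEN_QUOTE = "\u201c"   # left double quotation mark
CLOSE_QUOTE = "\u201d"  # right double quotation mark


def normalize_quotes(text: str) -> str:
    """Replace opening and closing double quotes with &bquo; and &equo;."""
    return text.replace(OPEN_QUOTE, "&bquo;").replace(CLOSE_QUOTE, "&equo;")


if __name__ == "__main__":
    print(normalize_quotes("\u201cYes,\u201d she said."))
```

Single quotation marks are left untouched in this sketch, because a
closing single quote cannot be told apart from an apostrophe without
further context.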
Segmentation and word-class marking by CLAWS 5
Segmentation and word-class marking by CLAWS 6
Segmentation, word-class by CLAWS 6, augmented by hand
Copy-typed from hard-copy into OUP format; transduced to CDIF
Copy-typed from hard-copy into Longman format; transduced to CDIF
Scanned from hard-copy into OUP format; transduced to CDIF
Scanned from hard-copy into Longman format; transduced to CDIF
Transduced from M-R into OUP format; transduced to CDIF
Transduced from M-R into Longman format; transduced to CDIF
Recording transcribed into Longman format; transduced to CDIF
Canonical references in the British National Corpus
are to text segment (<s>) elements, and
are constructed by taking the value of the n attribute
of the <cdif> element containing the target text,
and concatenating a dot separator, followed by the value
of the n attribute of the target <s> element.
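The construction described above can be sketched as follows. The
function names and the sample values "A00" and "12" are hypothetical
and serve only to illustrate the concatenation rule; they are not
taken from the corpus.

```python
def canonical_ref(text_n: str, segment_n: str) -> str:
    """Build a canonical reference.

    text_n    -- value of the n attribute of the enclosing <cdif> element
    segment_n -- value of the n attribute of the target <s> element
    """
    return f"{text_n}.{segment_n}"


def split_canonical_ref(ref: str) -> tuple[str, str]:
    """Split a canonical reference at its last dot separator."""
    text_n, _, segment_n = ref.rpartition(".")
    if not text_n or not segment_n:
        raise ValueError(f"not a canonical reference: {ref!r}")
    return text_n, segment_n


if __name__ == "__main__":
    ref = canonical_ref("A00", "12")   # hypothetical identifiers
    print(ref)
    print(split_canonical_ref(ref))
```

Splitting at the last dot is a design choice of this sketch: it
assumes the n value of an <s> element never contains a dot itself.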
Alignment point for overlapped speech
Free format bibliographic citation
A single character, typically punctuation
Floating caption in written material
Transcriber's or recordist's note in spoken material
Spoken text division
Written text division, level 1
Written text division, level 2
Written text division, level 3
Written text division, level 4
Non-verbal event in spoken text
Extract(s) from corpus text(s) produced by searching
Point where source material omitted from electronic text
Header or headline on written text <div>
Written text highlight indicator
Fragment of corpus text matching search condition
List item
Poem or verse line
List item's label
Line break indicator
A list
Anchor indicating point at which moved element occurs
An editorial or original note pertaining to a text
Written text paragraph
Pause indicator in spoken text
Written text page break
Poetic or verse material
Pointer from one part of a text to another
Search condition specification
Written text quoted material indicator
Regularizes questionable or incorrectly-spelled material
Text segment
A salutation (as in a letter etc.)
Segment for purposes of clausal analysis
Indicates a change of register etc. in spoken material
Marks questionable spelling or usage
Dramatic written material speech marker
Dramatic written material speaker indicator
Dramatic written material stage direction
Spoken text
Written text
Title of a work cited in a text
Indicates truncated word in spoken material
Spoken text utterance
Indicates untranscribable material in spoken text
Vocalized non-word in spoken material
CLAWS-defined word
No category information available. (Should not happen.)
Text availability
free, world: Freely available worldwide
restricted, world: Available worldwide
restricted, Not-NA: Not available in North America
restricted, Not-US: Not available in U.S.A.
restricted, EU: Not available outside the European Union
restricted, Not-USP: Not available in U.S.A. & Philippines
restricted, Not-NAP: Not available in N America & Philippines
Text type
Spoken demographic
Spoken context-governed
Written books and periodicals
Written-to-be-spoken
Written miscellaneous
Domain for context-governed spoken material
Educational/Informative
Business
Public/Institutional
Leisure
Age band for demographic respondent
0-14
15-24
25-34
35-44
45-59
60+
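As an illustration of how the age bands above partition respondents,
the following minimal sketch maps an age in years to its band label.
The names AGE_BANDS and age_band are hypothetical; the header defines
only the band labels, not any lookup procedure.

```python
# Upper bound of each band (inclusive) with its label, in ascending
# order. None marks the open-ended final band.
AGE_BANDS = [
    (14, "0-14"),
    (24, "15-24"),
    (34, "25-34"),
    (44, "35-44"),
    (59, "45-59"),
    (None, "60+"),
]


def age_band(age: int) -> str:
    """Return the label of the age band containing the given age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for upper, label in AGE_BANDS:
        if upper is None or age <= upper:
            return label
    raise AssertionError("unreachable: final band is open-ended")


if __name__ == "__main__":
    for years in (14, 15, 59, 60):
        print(years, age_band(years))
```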
Social class for demographic respondent
AB
C1
C2
DE
Sex of demographic respondent
Male
Female
Interaction type for spoken text
Monologue
Dialogue
Region where spoken text captured
South
Midlands
North
Written books & periodicals: selection method
Selective
Random
Written miscellaneous materials: publication status
Published
Unpublished
Author age band for written material
0-14
15-24
25-34
35-44
45-59
60+
Author domicile
Australia
Canada
France
Germany
Ireland
Italy
Lebanon
Monaco
New Zealand
Portugal
Singapore
Switzerland
United Kingdom
United States
UK -- North (north of Mersey-Humber line)
UK -- Midlands (north of Bristol Channel-Wash line)
UK -- South (south of Bristol Channel-Wash line)
Author ethnic background
Written: author sex
Male
Female
Mixed
Unknown
Written: type of author
Corporate
Multiple
Sole
Unknown
Written: audience age
Child
Teenager
Adult
Any
Domain for written corpus texts
Imaginative
Informative -- natural & pure science
Informative -- applied science
Informative -- social science
Informative -- world affairs
Informative -- commerce & finance
Informative -- arts
Informative -- belief & thought
Informative -- leisure
Written text audience level
Low
Medium
High
Medium for written corpus texts
Book
Periodical
Miscellaneous -- published
Miscellaneous -- unpublished
To-be-spoken
Place of publication
Ireland
United Kingdom
United States
UK -- North (north of Mersey-Humber line)
UK -- Midlands (north of Bristol Channel-Wash line)
UK -- South (south of Bristol Channel-Wash line)
Written text sample type
Whole text
Beginning sample
Middle sample
End sample
Composite
Written text reception status
Low
Medium
High
Written: target audience sex
Male
Female
Mixed
Unknown
Written text time period
1960-1974
1975-1993
This version of the corpus contains texts
accessioned on or before 1994-11-04.
The language of the British National Corpus is modern
British English. Words, fragments, and passages from many
other languages, both ancient and modern, occur within the
corpus where these may be represented using a Latin
alphabet. Long passages in these languages, and material
in other languages, are generally silently deleted. In no
case is the lang attribute used to indicate the language
of a word, phrase or passage, nor are alternate writing
system definitions used.
Name: Unknown speaker (singular)
Name: Unknown speaker (multiple)
1994-11-30
Corpus revision.
Oxford University Computing Services