The British National Corpus
Lead partner in consortium: Oxford University Press
Text selection for miscellaneous and unpublished written materials: W R Chambers
Text selection, data capture and transcription for spoken texts and for 14% of published written texts: Longman ELT
Text selection for 86% published written texts: Oxford University Press
Data capture and transcription for all miscellaneous and unpublished written texts and for 86% of published written texts: Oxford University Press
Encoding, storage and distribution: Oxford University Computing Services
Text enrichment: Unit for Computer Research into the English Language, University of Lancaster
Corpus header automatically generated by mkcorphdr 0.20
Archive site: Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN U.K. Telephone: +44 491 273280 Facsimile: +44 491 273275 Internet mail: natcorp@ox.ac.uk
THE BRITISH NATIONAL CORPUS IS AVAILABLE AT NOMINAL CHARGE FOR ACADEMIC RESEARCH PURPOSES THROUGHOUT THE EUROPEAN UNION (EU), SUBJECT TO A SIGNED END USER LICENCE HAVING BEEN RECEIVED BY OXFORD UNIVERSITY COMPUTING SERVICES, from whom blank forms and supporting materials are available. Apply to Oxford University Computing Services for information on availability in whole or part for academic research purposes outside the EU.
The BNC is available for commercial research and exploitation only where terms have first been agreed with the BNC Consortium Exploitation Committee. Apply in the first instance to Oxford University Computing Services.
Specific conditions and restrictions relating to individual texts in the corpus are set out in their respective text headers, and are binding upon users. Distribution of any part of the corpus must include a copy of this corpus header.
For information, the conditions of the End User Licence for academic use are as follows:
(a) The BNC Consortium grants according to the terms and conditions set out herein and in consideration of the payments specified herein a non-exclusive, non-transferable Licence to the Licensee to use the BNC Processed Material for the purposes of linguistic research and/or the development of language products.
(b) Distribution of the BNC Processed Material is restricted to the Licensee or in the event of the Licensee being an organisation, to the Licensee's research group. This group is defined as consisting only of those Licensee's employees whom the Licensee authorises to perform the work using the BNC Processed Material for the purposes described in paragraph (a).
(c) Members of the said research group must not, except as herein provided, copy, publish or otherwise give to any third party access to the whole or any part of the BNC Processed Material. It is the responsibility of the Licensee to ensure that the members of the said research group understand and abide by this restriction, and to supervise their activities with respect to the BNC Processed Material. Neither the Licensee nor members of the Licensee's said research group may assign, transfer, lease, sell, rent, charge or otherwise encumber the BNC Processed Material.
(d) The BNC Processed Material may be installed at the place or places of work of the said research group. The place of work is defined as the computing systems that the members of a research group normally use to conduct their research activities. It can include both work and home computers, and is not restricted to a particular machine or building.
(e) Copies of the BNC Processed Material may be made for backup purposes, or for the purposes of making data available to members of the research group but the Licensee shall ensure that the BNC Consortium's copyright notice is reproduced on all copies or parts thereof of the BNC Processed Material. Any such copies will be deemed to be part of the BNC Processed Material.
(f) There is no restriction on the use of the Licensee's Results except that the Licensee may not publish in print or electronic form or exploit commercially in any form whatsoever any extracts from the BNC Processed Material other than those permitted under the fair dealings provision of copyright law.
(g) The BNC Consortium does not grant to the Licensee any rights whatsoever to reproduce the BNC Texts or use all or any part of the BNC Texts in commercial products or services in any way other than would be permitted under the fair dealings provision of copyright law.
1994-11-30
[The British National Corpus] As a corpus of electronic texts, the British National Corpus has no source document as such. For details of the source or sources used in the creation of each electronic text, see the text headers. This bibliographic record is simply a place-holder in the corpus header, and has no other significance.
GOALS
The British National Corpus (BNC) Consortium was formed in 1990, and started work in 1991 on the three-year task of producing a hundred-million word corpus of modern British English for use in commercial and academic research. A second, smaller, project resulted in the production of SARA, freely-distributable search and retrieval software based on a client-server model.
THE CONSORTIUM PARTICIPANTS
The BNC is part of a unique collaboration between three major U.K. dictionary publishers, two universities, and the British Library. The dictionary publishers are Chambers Harrap, Longman, and Oxford University Press; the universities are represented by the Unit for Computer Research into the English Language (UCREL) at Lancaster University and by Oxford University Computing Services (OUCS).
FUNDING
The development of the BNC was funded by the commercial partners in the consortium with assistance from the U.K. government's Department of Trade and Industry (DTI) and Science and Engineering Research Council (SERC) under the Joint Framework for Information Technology (JFIT). The development of SARA was funded by the British Library.
DESIGN
The British National Corpus is
+ A large corpus: its hundred million words are made up of ninety million from written and ten million from spoken sources.
+ A sample corpus: it is composed of text samples, generally of no more than 40,000 words, rather than of complete works.
+ A synchronic corpus: it includes imaginative texts dating from the 1960s to 1994; informative texts dating from 1975 to 1994; and spoken texts gathered primarily between 1990 and 1994.
+ A general corpus: it is not specifically restricted to any particular subject field, register or genre. It includes language from all age and social groups and a broad spread of U.K. regions.
+ A monolingual British English Corpus: text samples are substantially the product of British English speakers. A small proportion of the words in the corpus are in a foreign language or non-British English.
+ A TEI-conformant Corpus: texts in the corpus are uniformly marked up according to the recommendations of the Text Encoding Initiative (TEI), an international consortium concerned with the mark-up of texts for use in academic research. These recommendations are an application of Standard Generalized Markup Language (SGML), defined by International Standard ISO 8879:1986.
USES
+ Lexicography: The corpus provides a body of new data on word meaning, grammar and usage. It yields empirical data on word frequencies, word classes and spelling preferences, among other things. It also reveals hitherto undocumented evidence about the spoken language, with consequences that go far beyond the immediate impact on dictionary-writing.
+ Linguistic research: The corpus provides a standard basis for investigating phenomena and testing competing linguistic theories.
+ Language technology: Statistical techniques, requiring very large samples of text, are increasingly used in machine translation, speech recognition, speech synthesizers, spelling and grammar checkers for word-processing and desk-top publishing, hand-held electronic books and other developments in information technology.
+ Teaching: The corpus provides a rich source of examples of current usage for English Language Teaching, allowing more frequent patterns of use to be distinguished from less frequent. In addition, the corpus provides a valuable didactic resource for use in many areas of higher education.
+ As a model: Future TEI-conformant corpora, in English and other languages, may base their designs on the experience gained in the production of the BNC.
Published: chosen selectively from candidate population Boilerplate yet to be written Published: chosen at random from candidate population Boilerplate yet to be written Unpublished: chosen according to relevant design criteria Boilerplate yet to be written Spoken: obtained from demographic sample of UK population Boilerplate yet to be written Spoken: obtained in context determined by design criteria Boilerplate yet to be written Errors tagged with <sic> when seen; no normalization Boilerplate yet to be written Errors tagged with <sic> if seen; norm'n with <reg> Boilerplate yet to be written Normalized to standard British English or control list member Boilerplate yet to be written Corrections and normalizations applied silently Boilerplate yet to be written Smart elision of line-end hyphens; &rehy used for remainder Boilerplate yet to be written Dumb elision of line-end hyphens; true hyphens hand-reinstated Boilerplate yet to be written Line-end hyphens removed by hand where appropriate Boilerplate yet to be written Source material contains no line-end hyphens Boilerplate yet to be written Open, close quote normalized to &bquo, &equo Boilerplate yet to be written Open and close quote normalized to &quo Boilerplate yet to be written Quotation may be represented using <shift> Boilerplate yet to be written Segmentation and word-class marking by CLAWS 5 Boilerplate yet to be written Segmentation and word-class marking by CLAWS 6 Boilerplate yet to be written Segmentation, word-class by CLAWS 6, augmented by hand Boilerplate yet to be written Copy-typed from hard-copy into OUP format; transduced to CDIF Boilerplate yet to be written Copy-typed from hard-copy into L'man format; transduced to CDIF Boilerplate yet to be written Scanned from hard-copy into OUP format; transduced to CDIF Boilerplate yet to be written Scanned from hard-copy into L'man format; transduced to CDIF Boilerplate yet to be written Transduced from M-R into OUP format; transduced to CDIF Boilerplate 
yet to be written Transduced from M-R into L'man format; transduced to CDIF Boilerplate yet to be written Recording transcribed into L'man format; transduced to CDIF Boilerplate yet to be written
Canonical references in the British National Corpus are to text segment (<s>) elements, and are constructed by taking the value of the n attribute of the <cdif> element containing the target text, and concatenating a dot separator, followed by the value of the n attribute of the target <s> element.
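The construction of a canonical reference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample markup and the text identifier "A00" are hypothetical, and the fragment is written as well-formed XML so that the standard library parser can read it, whereas real corpus texts are SGML and may need an SGML-aware parser.

```python
import xml.etree.ElementTree as ET


def canonical_ref(text_n, segment_n):
    """Join the <cdif> n value and the <s> n value with a dot separator."""
    return f"{text_n}.{segment_n}"


def canonical_refs(cdif_markup):
    """Return the canonical reference of every <s> element in one <cdif> text.

    Assumes (for illustration only) that cdif_markup is well-formed XML
    whose root element is <cdif> carrying an n attribute.
    """
    root = ET.fromstring(cdif_markup)
    text_n = root.get("n")
    return [canonical_ref(text_n, s.get("n")) for s in root.iter("s")]


# Hypothetical sample text; "A00" is an invented identifier.
sample = (
    '<cdif n="A00"><p>'
    '<s n="1">First segment.</s>'
    '<s n="2">Second segment.</s>'
    '</p></cdif>'
)

print(canonical_refs(sample))  # ['A00.1', 'A00.2']
```

Keeping the dot-joining step in its own small function means the same rule can be reused when the two n values come from some other source, such as a search result.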
Alignment point for overlapped speech Boilerplate yet to be written Free format bibliographic citation Boilerplate yet to be written A single character, typically punctuation Boilerplate yet to be written Floating caption in written material Boilerplate yet to be written Transcriber's or recordist's note in spoken material Boilerplate yet to be written Spoken text division Boilerplate yet to be written Written text division, level 1 Boilerplate yet to be written Written text division, level 2 Boilerplate yet to be written Written text division, level 3 Boilerplate yet to be written Written text division, level 4 Boilerplate yet to be written Non-verbal event in spoken text Boilerplate yet to be written Extract(s) from corpus text(s) produced by searching Boilerplate yet to be written Point where source material omitted from electronic text Boilerplate yet to be written Header or headline on written text <div> Boilerplate yet to be written Written text highlight indicator Boilerplate yet to be written Fragment of corpus text matching search condition Boilerplate yet to be written List item Boilerplate yet to be written Poem or verse line Boilerplate yet to be written List item's label Boilerplate yet to be written Line break indicator Boilerplate yet to be written A list Boilerplate yet to be written Anchor indicating point at which moved element occurs Boilerplate yet to be written An editorial or original note pertaining to a text Boilerplate yet to be written Written text paragraph Boilerplate yet to be written Pause indicator in spoken text Boilerplate yet to be written Written text page break Boilerplate yet to be written Poetic or verse material Boilerplate yet to be written Pointer from one part of a text to another Boilerplate yet to be written Search condition specification Boilerplate yet to be written Written text quoted material indicator Boilerplate yet to be written Regularizes questionable or incorrectly-spelled material Boilerplate yet to be 
written Text segment Boilerplate yet to be written A salutation (as in a letter etc.) Boilerplate yet to be written Segment for purposes of clausal analysis Boilerplate yet to be written Indicates a change of register etc. in spoken material Boilerplate yet to be written Marks questionable spelling or usage Boilerplate yet to be written Dramatic written material speech marker Boilerplate yet to be written Dramatic written material speaker indicator Boilerplate yet to be written Dramatic written material stage direction Boilerplate yet to be written Spoken text Boilerplate yet to be written Written text Boilerplate yet to be written Title of a work cited in a text Boilerplate yet to be written Indicates truncated word in spoken material Boilerplate yet to be written Spoken text utterance Boilerplate yet to be written Indicates untranscribable material in spoken text Boilerplate yet to be written Vocalized non-word in spoken material Boilerplate yet to be written CLAWS-defined word Boilerplate yet to be written No category information available. (Should not happen.) Text availability free, world: Freely available worldwide restricted, world: Available worldwide restricted, Not-NA: Not available in North America restricted, Not-US: Not available in U.S.A. restricted, EU: Not available outside the European Union restricted, Not-USP: Not available in U.S.A. 
& Philippines restricted, Not-NAP: Not available in N America & Philippines
Text type
  Spoken demographic; Spoken context-governed; Written books and periodicals; Written-to-be-spoken; Written miscellaneous
Domain for context-governed spoken material
  Educational/Informative; Business; Public/Institutional; Leisure
Age band for demographic respondent
  0-14; 15-24; 25-34; 35-44; 45-59; 60+
Social class for demographic respondent
  AB; C1; C2; DE
Sex of demographic respondent
  Male; Female
Interaction type for spoken text
  Monologue; Dialogue
Region where spoken text captured
  South; Midlands; North
Written books & periodicals: selection method
  Selective; Random
Written miscellaneous materials: publication status
  Published; Unpublished
Author age band for written material
  0-14; 15-24; 25-34; 35-44; 45-59; 60+
Author domicile
  Australia; Canada; France; Germany; Ireland; Italy; Lebanon; Monaco; New Zealand; Portugal; Singapore; Switzerland; United Kingdom; United States; UK -- North (north of Mersey-Humber line); UK -- Midlands (north of Bristol Channel-Wash line); UK -- South (south of Bristol Channel-Wash line)
Author ethnic background
Written: author sex
  Male; Female; Mixed; Unknown
Written: type of author
  Corporate; Multiple; Sole; Unknown
Written: audience age
  Child; Teenager; Adult; Any
Domain for written corpus texts
  Imaginative; Informative -- natural & pure science; Informative -- applied science; Informative -- social science; Informative -- world affairs; Informative -- commerce & finance; Informative -- arts; Informative -- belief & thought; Informative -- leisure
Written text audience level
  Low; Medium; High
Medium for written corpus texts
  Book; Periodical; Miscellaneous -- published; Miscellaneous -- unpublished; To-be-spoken
Place of publication
  Ireland; United Kingdom; United States; UK -- North (north of Mersey-Humber line); UK -- Midlands (north of Bristol Channel-Wash line); UK -- South (south of Bristol Channel-Wash line)
Written text sample type
  Whole text; Beginning sample; Middle sample; End sample; Composite
Written text reception status
  Low; Medium; High
Written: target audience sex
  Male; Female; Mixed; Unknown
Written text time period
  1960-1974; 1975-1993
This version of the corpus contains texts accessioned on or before 1994-11-04.
The language of the British National Corpus is modern British English. Words, fragments, and passages from many other languages, both ancient and modern, occur within the corpus where these may be represented using a Latin alphabet. Long passages in these languages, and material in other languages, are generally silently deleted. In no case is the lang attribute used to indicate the language of a word, phrase or passage, nor are alternate writing system definitions used.
Name: Unknown speaker (singular)
Name: Unknown speaker (multiple)
1994-11-30 Corpus revision. Oxford University Computing Services