[version of 27 Feb]

Feasibility study into establishing an Arts and Humanities Datacentre: a proposal to the Joint Information Systems Committee

Lou Burnard (Oxford University Computing Services)
Harold Short (King's College London)

-------------------------------------------------------------------------

INTRODUCTION

1. This document sets out proposals for a feasibility study into the creation of an Arts and Humanities Datacentre. The paper considers the need to support more widespread access to electronic resources in these subject areas, and suggests a framework within which such access could be provided. The proposed feasibility study is designed to assess the viability and implications, both practical and financial, of such a venture.

ACCESS TO ELECTRONIC DATA RESOURCES IN THE ARTS AND HUMANITIES

2. A large and growing range of electronic information resources is now available to scholars working in the arts and humanities. These resources, which are international in scope, are not only creating an increasingly rich information environment but are also opening up the potential for new and highly sophisticated forms of research.

3. A recent report jointly published by the British Academy and the British Library has documented the growth of electronic material in the humanities, and demonstrated a growing interest in the use of this material in research.(1) The electronic resources available are of different types. Primary source materials exist as text, data, image and sound, while multimedia products are also becoming available. At the same time, new forms of academic communication are emerging, such as discussion groups and bulletin boards. This information revolution offers large and exciting possibilities.

4. However, the diversity of information resources is matched by the diversity of the information providers.
Scholars attempting to use electronic resources are confronted by products and services with an extreme variety of organisational characteristics and access requirements, and they are obliged to learn many different connection procedures and access languages. This can involve a considerable personal investment of time and effort in the acquisition of otherwise irrelevant skills. Finally, there is a lack of comprehensive information about the resources which exist, and hence a lack of awareness of them.

5. In response to this situation, the joint British Academy and British Library report referenced earlier has called for the development of a national strategy for 'the holding, maintenance and provision of access' to electronic materials. The desirability of this has also been recognised in the Report of the Joint Funding Councils' Libraries Review Group (the Follett report), which has recommended a feasibility study to determine how proposals for an Arts and Humanities Datacentre could be taken forward.(2)

6. An initiative of this kind would be particularly timely in view of the widespread appreciation which now exists of the need for internationally accepted standards for data creation, storage and access. In some cases funding has been forthcoming to promote these standards; the Text Encoding Initiative (TEI) is an important example.

7. Current technological developments also favour a more coordinated approach to the provision of access to electronic materials. In this connection we may cite the advent of high-speed, wide-band communications channels, making possible the transfer of very large volumes of information, and the emergence of increasingly flexible access tools, such as Gopher and the World Wide Web.
These and related developments make it timely and important to focus on the needs of humanities scholars, in order to promote an appropriate environment for the creation of new resources, and to encourage the development of access and manipulation tools which will provide the greatest possible benefit to the scholarly community.

CORE AND SECONDARY FUNCTIONS OF AN ARTS AND HUMANITIES DATACENTRE

8. There are a number of models on which to base the proposed Arts and Humanities Datacentre, ranging from the centralised to the distributed. These options are discussed later in this paper. But whatever model is chosen, the datacentre, real or virtual, is likely to have the following core functions.

* To maintain specialised data collections, and to provide the specialised knowledge needed to support them.

* To maintain and disseminate an electronic catalogue of the data sets held by the Arts and Humanities Datacentre, containing information about each holding, and to provide advice on access and manipulation.

* To maintain and disseminate information about data sets available both nationally and internationally, and to provide an electronic gateway to such data sets.

* To preserve old or 'orphaned' data sets, so as to ensure continuing scholarly access.

9. In addition to these core functions, the datacentre might usefully develop certain supplementary or secondary functions. These include the following.

* To gather and disseminate information about appropriate tools for the cataloguing and indexing of electronic data sets, as well as tools for their retrieval and manipulation, and where appropriate to encourage the development of such tools.

* To develop technical and introductory guidelines for new projects and for the benefit of funding bodies, and to develop an advisory service for scholars.

* To investigate and report on the particular problems and challenges posed by intellectual property rights in electronic resources.

MODELS OF PROVISION

10.
There are a number of organisational and functional models which might be considered. One such would distinguish between primary sources, such as edited texts, on the one hand, and, on the other, secondary material extracted by scholarship from source material, formalised, and organised for analysis in databases. Another model would have a centre for each of the different information types: text, data, images, sound, and multimedia. In practice, questions of overlap would almost certainly make this impracticable.

11. There is, however, a more complex model which is more in keeping with the ways in which resources are actually created. This distributed model would involve the establishment of an infrastructure with specified standards for access to and technical compatibility of datasets. In this model the Arts and Humanities Data Centre (AHDC) might initially be based at a small number of collaborating institutions, to allow for the controlled development of standards and organisational practices which would facilitate the later addition of other institutions to the same network, as technological developments permit. The base institutions could perform the additional function of taking responsibility for the conversion or maintenance of existing datasets to conform to the requisite standards. The standards themselves would be formally agreed and monitored by an appropriate body representative of the relevant research, library, information technology and funding interests.

12. In this distributed model, the proposed datacentre would therefore have a number of additional functions to fulfil. These would include the following.

* To specify and develop appropriate standards, procedures and documentation, so as to enable uniform access to electronic data sets, thereby ensuring that both completed data sets and work in progress are available to the scholarly community.
It is expensive to create an electronic data set, so the more widely and frequently it is used, the greater the return on the investment. Alongside this is the qualitative benefit to scholars of having access to a wider range of more up-to-date resources, and the ability to validate work carried out by others.

* To collaborate with similar bodies elsewhere in Europe and in North America. Within the UK, institutions such as the Essex Archive, the Office for Humanities Communication and the various TLTP/CTI-sponsored projects are obvious candidates. Close liaison with the libraries community through CURL, BUBL, RARE etc. would also be essential. Within Europe, contact should be sought with institutions such as the Dutch Historical Data Archive at Leiden, the Danish Data Archive at Odense, and the Istituto di Linguistica Computazionale at Pisa. In North America, the newly founded Center for Electronic Texts in the Humanities at Princeton and Rutgers provides an obvious model for much of what is proposed here.

* To monitor the development and publication of internationally agreed standards for the encoding, storage and transmission of electronic information, and to contribute to the formation of such standards. The prime example is the Text Encoding Initiative, though close attention must also be paid to emerging de facto interface standards such as the World Wide Web, MIME or Gopher.

* To provide practical advice on the methods and costs involved both in the creation of new data sets, and in the conversion of existing or old data sets to conform to new standards, in the light of experience gained at the Oxford Text Archive over the last twenty years.

FEASIBILITY STUDY

13. It is proposed that a feasibility study should be undertaken by Lou Burnard, Director of the Oxford Text Archive, and Harold Short, Assistant Director, Humanities and Information Management in the Computing Centre at King's College London.
The study would have the distributed model outlined above as its main focus, but this would not mean discounting alternative models. The purposes of the feasibility study will be:

* in conjunction with the OHC, to report on electronic resources for the humanities currently available in the UK and elsewhere;

* to research and report on similar organisations elsewhere in Europe and in North America, and to identify areas of common interest;

* to identify the practical, organisational and technical issues involved in creating and operating an Arts and Humanities Datacentre;

* to propose a strategy for integrating existing electronic data resources at the participating institutions;

* to identify the specialised services which may be needed;

* to identify the levels of funding required to set up and run the Datacentre, and a timescale for achieving this.

14. This study can only be successful if it is undertaken with the widest possible consultation. To that end, it is proposed to convene an initial workshop, to which representatives of as many interested parties as possible will be invited. These will include agencies such as the British Academy, the British Library, the Office for Humanities Communication and the Joint Information Systems Committee, as well as other clearly relevant institutions, together with a wide spectrum of research scholars from the old and new universities. Delegates will be encouraged to discuss critically the proposal sketched out here, and to propose areas in need of further, more detailed investigation. They will be asked to provide information about relevant resources, both those which they might be able to contribute and those which they would hope to obtain from such an archive. Their input will provide the first step in the production of a detailed report on the feasibility of the scheme.

15. Further information will be obtained by consulting available networked or printed resources.
A substantial amount of travel to relevant sites in the UK, Europe and the US will also be needed, with the purpose of gathering information about current and planned datasets, in particular the ways in which access is managed, the organisational structures, and the resources committed and required. First-hand information gathering is considered particularly important in the case of the centres at Leiden, Odense, and Pisa mentioned earlier in the proposal; visits are also proposed to those at Bergen and Paris. Of the North American centres, it is considered important to visit CETH (also mentioned above), and the Universities of Georgetown and Toronto. Because of the costs of international travel, at least two of these visits would be combined in a single journey.

16. The process of collecting information on the data resources already available, and those needed by UK scholars, will benefit greatly from research already under way at the Office for Humanities Communication and the British Library. At least part of the remaining research required may be most effectively carried out by commissioning reports from consultants. It is also envisaged that it may prove necessary to obtain dedicated secretarial support, particularly during the report-writing phase of the project.

17. It is proposed that a Steering Committee should be formed to guide the progress of the feasibility study, its membership drawn from the British Academy and the Office for Humanities Communication, along with appropriate representatives of the relevant research, library, and information technology interests.

PERSONAL PROFILES

Lou Burnard

Director of the Oxford Text Archive since its founding in 1976, and European Editor of the Text Encoding Initiative since 1989, Lou Burnard has worked at Oxford University Computing Services since 1974. He was educated at Balliol College, Oxford, where he took a first in English in 1968, and an MPhil in Modern English Studies in 1973.
After a brief spell teaching English at the University of Malawi, he returned to Oxford at a period when computing support for the humanities was a novelty. He was closely involved in the development of the popular Oxford Concordance Program, and has remained active in the development of humanities computing facilities at Oxford and in humanities applications nationally. In addition to his work for the Archive and the TEI, he is currently responsible for OUCS participation in the British National Corpus project. His publications include papers on Snobol, database design, CAFS, text retrieval systems and research applications for IT, as well as book reviews. His areas of expertise are database design and construction, free-text retrieval systems, and the encoding of text for research purposes.

Harold Short

Assistant Director, Humanities and Information Management in the Computing Centre, and Director, Research Unit in Humanities Computing. First degree in English and French; second bachelor's degree in Systems and Computing. Fourteen years' commercial computing experience, including eleven at the BBC as programmer, systems analyst, database designer, and project manager. In present post at King's for the past five years. Besides management duties, involved in a number of database and hypermedia projects for research and teaching in the humanities, and in organising and teaching undergraduate and postgraduate courses in humanities computing. Member of the School of Humanities Research Committee.

REFERENCES

1. The British Library Research and Development Department and the British Academy, Information Technology in Humanities Scholarship, British Library R&D Report 6097, London, 1993.

2. Joint Funding Councils' Libraries Review Group: Report (HEFCE, 1993), para. 299.