3.2 Corpora Used in this Book
3.2.1 The English Comparable Corpus: JDEST
Two monolingual corpora are used in this book. The English comparable corpus is JiaoDa English of Science and Technology (JDEST), one of the most widely used academic corpora in the world. When the corpus was first built at Shanghai Jiaotong University in 1985, it consisted of only one million words. It has grown steadily since then and now contains about 6.5 million running words. The texts collected in JDEST cover a wide range of subject areas, including physics, nuclear energy, computer science, metallurgy, aeronautics, electrical engineering, mechanics, chemical engineering and architectural engineering. These texts were selected at random from journal articles, popular science, textbooks, digests, theses and other text types published in several major English-speaking countries.
Variation across disciplines is not the central focus of the present study, and issues such as why a given phrase occurs more often in Philosophy than in Astronomy will not be addressed in detail. However, there is a strong tendency for some TSSs to appear in certain disciplines but not in others. This is most apparent when we consider the broad academic division into hard sciences (i.e., natural, physical and computing sciences) and soft sciences (i.e., social sciences). These register differences exert considerable influence over discourse conventions, and can be seen as a source of variation for certain TSSs. Table 3.1 gives basic information about JDEST.
Table 3.1 Disciplines and running words in JDEST
As Table 3.1 shows, the soft and the hard sciences include 15 and 26 disciplines respectively, and the soft sciences contain slightly more running words than the hard sciences, with a ratio of approximately 6:5. The data are broadly representative and provide an authoritative body of linguistic resources that can be used to make generalizations and to test hypotheses about the features of academic English.