
International Corpus of English


The International Corpus of English (ICE) is a set of text corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.

History

Sidney Greenbaum's goal of compiling corpora that would allow the syntax of world English to be compared became the ICE project, which was realized by Professor Charles F. Meyer. Greenbaum envisaged international teams of researchers collecting comparable national varieties of English, both written and spoken. Varieties such as British English, American English, and Indian English would each be represented by a computer corpus. Researchers use the corpora to compare the syntax of the varieties of English. Once complete, the ICE corpora would allow comprehensive linguistic analysis of the varieties of English that have emerged. Ongoing research for ICE is carried out by international teams in a range of regions. The project began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-three research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation.

Description

Each corpus contains one million words in 500 texts of 2,000 words, following the sampling methodology used for the Brown Corpus. Unlike Brown or the Lancaster-Oslo-Bergen (LOB) Corpus (or indeed mega-corpora such as the British National Corpus), however, the majority of texts are derived from spoken data.

With only one million words per corpus, ICE corpora are considered very small by modern standards. Each ICE corpus contains 60% (600,000 words) of orthographically transcribed spoken English. The father of the project, Sidney Greenbaum, insisted on the primacy of the spoken word, following Randolph Quirk and Jan Svartvik's collaboration on the original London-Lund Corpus (LLC). This emphasis on word-for-word transcription sets ICE apart from many other corpora, including those containing, e.g., parliamentary or legal paraphrases.

The corpora consist entirely of data from 1990 or later. The subjects from whom the data was collected are all adults who were educated in English and were either born in, or moved at an early age to, the country to which their data is attributed. There are speech and text samples from both men and women of many age groups, but the corpus website makes a point of noting that "The proportions, however, are not representative of the proportions in the population as a whole: women are not equally represented in professions such as politics and law, and so do not produce equal amounts of discourse in these fields."

The British Component of ICE, ICE-GB, is fully parsed with a detailed Quirk et al. phrase structure grammar, and the analyses have been thoroughly checked and completed. This analysis includes part-of-speech tagging and parsing of the entire corpus. The treebank can be thoroughly searched and explored with the ICE Corpus Utility Program (ICECUP) software. More information is in the handbook.

Many corpora are currently available for download on the official ICE webpage, though some require a license. Others, however, are not ready for publication.

Textual and Grammatical Annotation

Researchers and linguists follow specific guidelines when annotating data for the corpus, which can be found in the International Corpus of English Manuals and Documentation. The three levels of annotation are Text Markup, Wordclass Tagging, and Syntactic Parsing.

Textual Markup

Original markup and layout, such as sentence and paragraph parsing, is preserved, with special markers indicating it as original. Spoken data is transcribed orthographically, with indicators for hesitations, false starts, and pauses.

Word Class Tagging

Word classes, also called parts of speech, are grammatical categories for words based upon their function in a sentence.

British texts are automatically tagged for wordclass by the ICE tagger, developed at University College London, which uses a comprehensive grammar of the English language.

All other corpora are tagged automatically using the Penn Treebank and CLAWS tagsets. While the tags are not corrected manually, they are checked regularly for quality.

Syntactic Parsing

Sentences are parsed automatically and, if necessary, manually corrected with ICECUP, a syntax tree editor created specifically for the corpus.

Dependency parsing is also done automatically, with the dependency parser Pro3Gres. The results are not manually verified.

Pragmatic Parsing

Ireland is currently the only participant country that includes pragmatic annotation in its data.

Design of the Corpora

Below are the subsections of an ICE corpus, with the number of texts for each category and sub-category in parentheses.

Spoken (300)
  Dialogues (180)
    Private (100)
      Face-to-face conversations (90)
      Phonecalls (10)
    Public (80)
      Classroom Lessons (20)
      Broadcast Discussions (20)
      Broadcast Interviews (10)
      Parliamentary Debates (10)
      Legal cross-examinations (10)
      Business Transactions (10)
  Monologues (120)
    Unscripted (70)
      Spontaneous commentaries (20)
      Unscripted Speeches (30)
      Demonstrations (10)
      Legal Presentations (10)
    Scripted (50)
      Broadcast News (20)
      Broadcast Talks (20)
      Non-broadcast Talks (10)

Written (200)
  Non-Printed (50)
    Student Writing (20)
      Student Essays (10)
      Exam Scripts (10)
    Letters (30)
      Social Letters (15)
      Business Letters (15)
  Printed (150)
    Academic Writing (40)
      Humanities (10)
      Social Sciences (10)
      Natural Sciences (10)
      Technology (10)
    Popular Writing (40)
      Humanities (10)
      Social Sciences (10)
      Natural Sciences (10)
      Technology (10)
    Reportage (20)
      Press news reports (20)
    Instructional Writing (20)
      Administrative Writing (10)
      Skills/hobbies (10)
    Persuasive Writing (10)
      Press editorials (10)
    Creative Writing (20)
      Novels & short stories (20)

Publications

There are a number of books published about the International Corpus of English, as well as books based in part on the corpora.

English in the Caribbean: Variation, Style and Standards in Jamaica and Trinidad (2014) by Dagmar Deuber
The Present Perfect in World Englishes: Charting Unity and Diversity (2014) by Valentin Werner
Mapping Unity and Diversity Worldwide: Corpus-based Studies of New Englishes (2012) by Marianne Hundt and Ulrike Gut
The Syntax of Spoken Indian English (2012) by Claudia Lange
Oxford Modern English Grammar (2011) by Bas Aarts
Adjunct Adverbials in English (2010) by Hilde Hasselgård
ICAME Journal No 34 (2010)
An Introduction to English Grammar (2009) by Sidney Greenbaum and Gerald Nelson
Word-Formation in New Englishes: A corpus-based Analysis (2008) by Thomas Biermeier
Special issue of World Englishes Volume 23 Number 2 (2004)
Exploring Natural Language: Working with the British component of the International Corpus of English (2002) by Gerald Nelson, Sean Wallis, and Bas Aarts
Comparing English Worldwide: The International Corpus of English (1996) by Sidney Greenbaum
Oxford English Grammar (1996) by Sidney Greenbaum

Participants

The current participant countries are (* = available):

Australia
Cameroon
Canada*
East Africa (Kenya, Malawi, Tanzania)*
Fiji
Ghana
Great Britain* (parsed)
Hong Kong*
India*
Ireland*
Jamaica*
Malta
Malaysia
New Zealand*
Nigeria* (tagged)
Pakistan
The Philippines*
Sierra Leone
Singapore*
South Africa
Sri Lanka
Trinidad and Tobago
USA*

See also

Corpus linguistics
British National Corpus
BYU Corpus of American English

References

"The ICE Project" (PDF).
Nelson, Gerald (May 2004). "Introduction". World Englishes. 23 (2): 225–226. doi:10.1111/j.0883-2919.2004.00347.x. ISSN 0883-2919.
"International Corpus of English (ICE) Homepage @ ICE-corpora.net". ice-corpora.net. Retrieved 2018-03-03.
"Corpus Design @ ICE-corpora.net". ice-corpora.net. Retrieved 2018-03-03.
Nelson, Gerald (2017). "The ICE project and world Englishes". World Englishes. 36 (3): 367–370. doi:10.1111/weng.12276.
Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. London: Longman.
Nelson, Gerald, Wallis, Sean, and Aarts, Bas (2002). Exploring Natural Language. Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.
"The International Corpus of English website". Archived from the original on 2009-02-04. Retrieved 2008-01-13.
"Annotation". www.ice-corpora.uzh.ch. Retrieved 2018-03-29.
"Publications @ ICE-corpora.net". ice-corpora.net. Retrieved 2018-04-22.

External links

The International Corpus of English website
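
The text counts listed under Design of the Corpora can be cross-checked against the figures given in the Description (500 texts, one million words, 60% spoken). The following is only a minimal sketch in Python: the nested-dictionary layout and the names ICE_DESIGN, WORDS_PER_TEXT and count_texts are assumptions made for illustration and are not part of ICECUP or any other ICE software. The numbers themselves are copied from the design listing.

```python
# Text counts per category, copied from the "Design of the Corpora" listing.
# The data layout and every name here are illustrative assumptions,
# not part of any official ICE tooling.
ICE_DESIGN = {
    "Spoken": {
        "Dialogues": {
            "Private": {"Face-to-face conversations": 90, "Phonecalls": 10},
            "Public": {
                "Classroom Lessons": 20,
                "Broadcast Discussions": 20,
                "Broadcast Interviews": 10,
                "Parliamentary Debates": 10,
                "Legal cross-examinations": 10,
                "Business Transactions": 10,
            },
        },
        "Monologues": {
            "Unscripted": {
                "Spontaneous commentaries": 20,
                "Unscripted Speeches": 30,
                "Demonstrations": 10,
                "Legal Presentations": 10,
            },
            "Scripted": {
                "Broadcast News": 20,
                "Broadcast Talks": 20,
                "Non-broadcast Talks": 10,
            },
        },
    },
    "Written": {
        "Non-Printed": {
            "Student Writing": {"Student Essays": 10, "Exam Scripts": 10},
            "Letters": {"Social Letters": 15, "Business Letters": 15},
        },
        "Printed": {
            "Academic Writing": {"Humanities": 10, "Social Sciences": 10,
                                 "Natural Sciences": 10, "Technology": 10},
            "Popular Writing": {"Humanities": 10, "Social Sciences": 10,
                                "Natural Sciences": 10, "Technology": 10},
            "Reportage": {"Press news reports": 20},
            "Instructional Writing": {"Administrative Writing": 10,
                                      "Skills/hobbies": 10},
            "Persuasive Writing": {"Press editorials": 10},
            "Creative Writing": {"Novels & short stories": 20},
        },
    },
}

WORDS_PER_TEXT = 2000  # nominal length of each text, as stated in the Description


def count_texts(node):
    """Recursively sum the leaf text counts under a category."""
    if isinstance(node, int):
        return node
    return sum(count_texts(child) for child in node.values())


spoken = count_texts(ICE_DESIGN["Spoken"])
written = count_texts(ICE_DESIGN["Written"])
total = spoken + written
spoken_share = spoken / total
total_words = total * WORDS_PER_TEXT

print(f"Spoken texts:  {spoken}")
print(f"Written texts: {written}")
print(f"Total texts:   {total} (about {total_words:,} words)")
print(f"Spoken share:  {spoken_share:.0%}")
```

Keeping the counts only at the leaves and deriving every subtotal by recursion means a category total such as Public (80) or Printed (150) is never typed in twice, so a mismatch with the published subtotals would point to a transcription error in the listing.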


Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.