46:'s goal of compiling corpora that would allow the syntax of world Englishes to be compared became the ICE project, which was realized by Professor Charles F. Meyer. Greenbaum envisaged international teams of researchers collecting comparable national varieties of English, both written and spoken. Comparable varieties would include British English, American English, and Indian English, each represented by a computer corpus. Researchers use the corpora to compare the syntax of the varieties of English. Once complete, the ICE corpora would support comprehensive linguistic analysis of the varieties of English that have emerged. Ongoing research for ICE is carried out by international teams in different regions. The project began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-three research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation.
82:
women of many age groups, but the corpus website notes that "The proportions, however, are not representative of the proportions in the population as a whole: women are not equally represented in professions such as politics and law, and so do not produce equal amounts of discourse in these fields."
115:
To ensure compatibility between the individual corpora in ICE, each team is following a common corpus design, as well as a common scheme for grammatical annotation. Many corpora are currently available for download on the ICE official webpage, though some require a license. Others, however, are not
81:
The corpora consist entirely of data from 1990 or later. The subjects from which the data was collected are all adults who were educated in
English and were either born in, or moved at an early age to, the country to which their data is attributed. There are speech and text samples from both men and
77:
English. The father of the project, Sidney
Greenbaum, insisted on the primacy of the spoken word, following Randolph Quirk and Jan Svartvik's collaboration on the original London-Lund Corpus (LLC). This emphasis on word-for-word transcription sets ICE apart from many other corpora, including those
136:
Original markup and layout, such as sentence and paragraph parsing, are preserved, with special markers indicating that they are original. Spoken data is transcribed orthographically, with indicators for hesitations, false starts, and pauses.
155:
All other varieties are tagged automatically using the Penn Treebank and CLAWS tagsets. While the tags are not corrected manually, they are checked regularly for quality.
152:
British texts are automatically tagged for wordclass by the ICE tagger, developed at
University College London, which uses a comprehensive grammar of the English language.
73:
With only one million words per corpus, ICE corpora are considered very small by modern standards. ICE corpora contain 60% (600,000 words) of orthographically transcribed
128:, in the International Corpus of English Manuals and Documentation. The three levels of annotation are Text Markup, Wordclass Tagging, and Syntactic Parsing.
163:
The sentences are parsed automatically and, where necessary, manually corrected with ICECUP, a syntax tree editor created specifically for the corpus.
35:
from around the world. Over twenty countries or groups of countries where
English is the first language or an official second language are included.
1006:
891:
210:
Broadcast
Discussions (20) Broadcast Interviews (10) Parliamentary Debates (10) Legal cross-examinations (10) Business Transactions (10)
926:
951:
357:
There are a number of books published about the
International Corpus of English, as well as books based in part on the corpora.
166:
Dependency parsing is also done automatically with the
dependency parser Pro3Gres. The results are not manually verified.
1094:
991:
757:
182:
Below are the subsections of each ICE corpus, with the number of texts for each category and sub-category in parentheses.
1074:
884:
54:
Each corpus contains one million words in 500 texts of 2000 words, following the sampling methodology used for the
536:
124:
Researchers and
linguists follow specific guidelines when annotating data for the corpus, which can be found
1196:
1191:
1186:
1176:
971:
1145:
877:
642:
625:
573:
556:
1129:
1114:
1099:
1069:
745:
Exploring
Natural Language. Working with the British Component of the International Corpus of English
423:
Exploring
Natural Language: Working with the British component of the International Corpus of English
1181:
1044:
1039:
946:
916:
174:
Ireland is currently the only participant country that includes pragmatic annotation in its data.
1089:
1059:
931:
531:
63:
93:
92:
grammar, and the analyses have been thoroughly checked and completed. This analysis includes a
89:
1119:
1084:
1079:
1049:
986:
976:
1124:
961:
8:
900:
526:
607:
603:
1201:
1064:
1024:
713:
599:
43:
32:
1054:
921:
362:
English in the Caribbean: Variation, Style and Standards in Jamaica and Trinidad
149:, are grammatical categories for words based upon their function in a sentence.
941:
146:
730:
Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985).
1170:
1155:
611:
813:
374:
Mapping Unity and Diversity Worldwide: Corpus-based Studies of New Englishes
85:
The British Component of ICE, ICE-GB, is fully parsed with a detailed Quirk
956:
936:
55:
761:
864:
837:
125:
28:
680:
717:
59:
16:
Set of text corpora representing varieties of English around the world
869:
368:
The Present Perfect in World Englishes: Charting Unity and Diversity
1104:
1034:
981:
101:
783:"International Corpus of English (ICE) Homepage @ ICE-corpora.net"
660:"International Corpus of English (ICE) Homepage @ ICE-corpora.net"
1150:
1109:
1029:
1001:
97:
429:
Comparing English Worldwide: The International Corpus of English
782:
704:
Nelson, Gerald (2017). "The ICE project and world Englishes".
659:
446:
The current participant countries are (* = available):
996:
410:
Word-Formation in New Englishes: A corpus-based Analysis
743:
Nelson, Gerald, Wallis, Sean, and Aarts, Bas (2002).
119:
78:
containing, e.g. parliamentary or legal paraphrases.
425:(2002) by Gerald Nelson, Sean Wallis, and Bas Aarts
104:can be thoroughly searched and explored with the
1168:
1007:Wellington Corpus of Spoken New Zealand English
732:A Comprehensive Grammar of the English Language
112:software. More information is in the handbook.
1035:CorCenCC National Corpus of Contemporary Welsh
241:Broadcast Talks (20) Non-broadcast Talks (10)
885:
758:"The International Corpus of English website"
406:(2009) by Sidney Greenbaum and Gerald Nelson
865:The International Corpus of English website
590:Nelson, Gerald (May 2004). "Introduction".
892:
878:
927:Bergen Corpus of London Teenage Language
177:
952:Corpus of Contemporary American English
376:(2012) by Marianne Hundt and Ulrike Gut
70:of texts are derived from spoken data.
1169:
899:
703:
589:
459:East Africa (Kenya, Malawi, Tanzania)*
873:
808:
806:
804:
802:
140:
675:
673:
671:
669:
169:
158:
62:(or indeed mega-corpora such as the
1095:Scottish Corpus of Texts and Speech
992:Switchboard Telephone Speech Corpus
380:The Syntax of Spoken Indian English
13:
799:
404:An Introduction to English Grammar
120:Textual and Grammatical Annotation
60:Lancaster-Oslo-Bergen (LOB) Corpus
14:
1213:
858:
681:"Corpus Design @ ICE-corpora.net"
666:
131:
1075:Neo-Assyrian Text Corpus Project
838:"Publications @ ICE-corpora.net"
604:10.1111/j.0883-2919.2004.00347.x
346:Novels & short stories (20)
197:Face-to-face conversations (90)
967:International Corpus of English
830:
775:
750:
737:
441:
352:
21:International Corpus of English
724:
697:
652:
635:
618:
583:
566:
549:
537:BYU Corpus of American English
222:Spontaneous commentaries (20)
49:
1:
542:
392:Adjunct Adverbials in English
386:Oxford Modern English Grammar
972:Lancaster-Oslo-Bergen Corpus
327:Administrative Writing (10)
7:
520:
324:Instructional Writing (20)
10:
1218:
437:(1996) by Sidney Greenbaum
431:(1996) by Sidney Greenbaum
412:(2008) by Thomas Biermeier
394:(2010) by Hilde Hasselgård
145:Word Classes, also called
106:ICE Corpus Utility Program
100:of the entire corpus. The
38:
31:representing varieties of
1138:
1130:Thesaurus Linguae Graecae
1115:Tehran Monolingual Corpus
1100:Slovenian National Corpus
1070:National Corpus of Polish
1015:
907:
747:Amsterdam: John Benjamins
419:Volume 23 Number 2 (2004)
370:(2014) by Valentin Werner
278:
253:
230:Legal Presentations (10)
224:Unscripted Speeches (30)
215:
190:
1045:Croatian National Corpus
1040:Croatian Language Corpus
947:Cambridge English Corpus
917:American National Corpus
335:Persuasive Writing (10)
319:Press news reports (20)
1090:Russian National Corpus
1060:German Reference Corpus
932:British National Corpus
532:British National Corpus
468:Great Britain* (parsed)
382:(2012) by Claudia Lange
364:(2014) by Dagmar Deuber
208:Classroom Lessons (20)
116:ready for publication.
64:British National Corpus
818:www.ice-corpora.uzh.ch
435:Oxford English Grammar
343:Creative Writing (20)
338:Press editorials (10)
307:Natural Sciences (10)
290:Natural Sciences (10)
282:Academic Writing (40)
273:Business Letters (15)
94:part-of-speech tagging
58:. Unlike Brown or the
1120:Tekstaro de Esperanto
1085:Quranic Arabic Corpus
1080:Persian Speech Corpus
1050:Czech National Corpus
987:Spoken English Corpus
977:Oxford English Corpus
304:Social Sciences (10)
299:Popular Writing (40)
287:Social Sciences (10)
257:Student Writing (20)
178:Design of the Corpora
1125:TenTen Corpus Family
329:Skills/hobbies (10)
271:Social Letters (15)
260:Student Essays (10)
239:Broadcast News (20)
227:Demonstrations (10)
513:Trinidad and Tobago
388:(2011) by Bas Aarts
250:
187:
901:Corpus linguistics
718:10.1111/weng.12276
527:Corpus linguistics
262:Exam Scripts (10)
248:
185:
141:Word Class Tagging
1164:
1163:
643:"The ICE Project"
626:"The ICE Project"
574:"The ICE Project"
557:"The ICE Project"
492:Nigeria* (tagged)
415:Special issue of
350:
349:
254:Non-Printed (50)
246:
245:
216:Monologues (120)
170:Pragmatic Parsing
159:Syntactic Parsing
1209:
1065:Hamshahri Corpus
1025:Bijankhan Corpus
894:
887:
880:
871:
870:
852:
851:
849:
848:
834:
828:
827:
825:
824:
810:
797:
796:
794:
793:
779:
773:
772:
770:
769:
760:. Archived from
754:
748:
741:
735:
728:
722:
721:
701:
695:
694:
692:
691:
677:
664:
663:
656:
650:
649:
647:
639:
633:
632:
630:
622:
616:
615:
587:
581:
580:
578:
570:
564:
563:
561:
553:
498:The Philippines*
310:Technology (10)
302:Humanities (10)
293:Technology (10)
285:Humanities (10)
251:
247:
219:Unscripted (70)
199:Phonecalls (10)
191:Dialogues (180)
188:
184:
90:phrase structure
66:), however, the
44:Sidney Greenbaum
1217:
1216:
1212:
1211:
1210:
1208:
1207:
1206:
1182:English corpora
1167:
1166:
1165:
1160:
1134:
1055:Europarl Corpus
1017:
1011:
922:Bank of English
909:
903:
898:
861:
856:
855:
846:
844:
842:ice-corpora.net
836:
835:
831:
822:
820:
812:
811:
800:
791:
789:
787:ice-corpora.net
781:
780:
776:
767:
765:
756:
755:
751:
742:
738:
734:London: Longman
729:
725:
706:World Englishes
702:
698:
689:
687:
685:ice-corpora.net
679:
678:
667:
658:
657:
653:
645:
641:
640:
636:
628:
624:
623:
619:
592:World Englishes
588:
584:
576:
572:
571:
567:
559:
555:
554:
550:
545:
523:
444:
417:World Englishes
355:
316:Reportage (20)
180:
172:
161:
147:Parts of Speech
143:
134:
122:
52:
41:
17:
12:
11:
5:
1215:
1205:
1204:
1199:
1194:
1189:
1184:
1179:
1162:
1161:
1159:
1158:
1153:
1148:
1146:BNC consortium
1142:
1140:
1136:
1135:
1133:
1132:
1127:
1122:
1117:
1112:
1107:
1102:
1097:
1092:
1087:
1082:
1077:
1072:
1067:
1062:
1057:
1052:
1047:
1042:
1037:
1032:
1027:
1021:
1019:
1013:
1012:
1010:
1009:
1004:
999:
994:
989:
984:
979:
974:
969:
964:
959:
954:
949:
944:
942:Buckeye Corpus
939:
934:
929:
924:
919:
913:
911:
905:
904:
897:
896:
889:
882:
874:
868:
867:
860:
859:External links
857:
854:
853:
829:
798:
774:
749:
736:
723:
712:(3): 367–370.
696:
665:
651:
634:
617:
598:(2): 225–226.
582:
565:
547:
546:
544:
541:
540:
539:
534:
529:
522:
519:
518:
517:
514:
511:
508:
505:
502:
499:
496:
493:
490:
487:
484:
481:
478:
475:
472:
469:
466:
463:
460:
457:
454:
451:
443:
440:
439:
438:
432:
426:
420:
413:
407:
401:
395:
389:
383:
377:
371:
365:
354:
351:
348:
347:
344:
340:
339:
336:
332:
331:
325:
321:
320:
317:
313:
312:
300:
296:
295:
283:
280:
279:Printed (150)
276:
275:
269:
265:
264:
258:
255:
249:Written (200)
244:
243:
237:
236:Scripted (50)
233:
232:
220:
217:
213:
212:
206:
202:
201:
195:
194:Private (100)
192:
179:
176:
171:
168:
160:
157:
142:
139:
133:
132:Textual Markup
130:
121:
118:
51:
48:
40:
37:
27:) is a set of
15:
9:
6:
4:
3:
2:
1214:
1203:
1200:
1198:
1195:
1193:
1190:
1188:
1185:
1183:
1180:
1178:
1175:
1174:
1172:
1157:
1156:Sketch Engine
1154:
1152:
1149:
1147:
1144:
1143:
1141:
1139:Organizations
1137:
1131:
1128:
1126:
1123:
1121:
1118:
1116:
1113:
1111:
1108:
1106:
1103:
1101:
1098:
1096:
1093:
1091:
1088:
1086:
1083:
1081:
1078:
1076:
1073:
1071:
1068:
1066:
1063:
1061:
1058:
1056:
1053:
1051:
1048:
1046:
1043:
1041:
1038:
1036:
1033:
1031:
1028:
1026:
1023:
1022:
1020:
1016:Text corpora,
1014:
1008:
1005:
1003:
1000:
998:
995:
993:
990:
988:
985:
983:
980:
978:
975:
973:
970:
968:
965:
963:
960:
958:
955:
953:
950:
948:
945:
943:
940:
938:
935:
933:
930:
928:
925:
923:
920:
918:
915:
914:
912:
908:Text corpora,
906:
902:
895:
890:
888:
883:
881:
876:
875:
872:
866:
863:
862:
843:
839:
833:
819:
815:
809:
807:
805:
803:
788:
784:
778:
764:on 2009-02-04
763:
759:
753:
746:
740:
733:
727:
719:
715:
711:
707:
700:
686:
682:
676:
674:
672:
670:
661:
655:
644:
638:
627:
621:
613:
609:
605:
601:
597:
593:
586:
575:
569:
558:
552:
548:
538:
535:
533:
530:
528:
525:
524:
515:
512:
509:
506:
503:
500:
497:
494:
491:
488:
485:
482:
479:
476:
473:
470:
467:
464:
461:
458:
455:
452:
449:
448:
447:
436:
433:
430:
427:
424:
421:
418:
414:
411:
408:
405:
402:
399:
398:ICAME Journal
396:
393:
390:
387:
384:
381:
378:
375:
372:
369:
366:
363:
360:
359:
358:
345:
342:
341:
337:
334:
333:
330:
326:
323:
322:
318:
315:
314:
311:
308:
305:
301:
298:
297:
294:
291:
288:
284:
281:
277:
274:
270:
268:Letters (30)
267:
266:
263:
259:
256:
252:
242:
238:
235:
234:
231:
228:
225:
221:
218:
214:
211:
207:
204:
203:
200:
196:
193:
189:
186:Spoken (300)
183:
175:
167:
164:
156:
153:
150:
148:
138:
129:
127:
117:
113:
111:
107:
103:
99:
95:
91:
88:
83:
79:
76:
71:
69:
65:
61:
57:
47:
45:
36:
34:
30:
26:
22:
966:
957:Enron Corpus
937:Brown Corpus
845:. Retrieved
841:
832:
821:. Retrieved
817:
814:"Annotation"
790:. Retrieved
786:
777:
766:. Retrieved
762:the original
752:
744:
739:
731:
726:
709:
705:
699:
688:. Retrieved
684:
654:
637:
620:
595:
591:
585:
568:
551:
507:South Africa
501:Sierra Leone
489:New Zealand*
445:
442:Participants
434:
428:
422:
416:
409:
403:
400:No 34 (2010)
397:
391:
385:
379:
373:
367:
361:
356:
353:Publications
328:
309:
306:
303:
292:
289:
286:
272:
261:
240:
229:
226:
223:
209:
205:Public (80)
198:
181:
173:
165:
162:
154:
151:
144:
135:
123:
114:
109:
105:
86:
84:
80:
74:
72:
67:
56:Brown Corpus
53:
42:
29:text corpora
24:
20:
18:
1018:non-English
50:Description
847:2018-04-22
823:2018-03-29
792:2018-03-03
768:2008-01-13
690:2018-03-03
543:References
504:Singapore*
471:Hong Kong*
612:0883-2919
510:Sri Lanka
450:Australia
1105:TalkBank
982:PropBank
962:EnTenTen
521:See also
495:Pakistan
486:Malaysia
480:Jamaica*
477:Ireland*
453:Cameroon
102:treebank
68:majority
1202:Corpora
1151:COBUILD
1110:Tatoeba
1030:CHILDES
1002:VerbNet
910:English
456:Canada*
98:parsing
39:History
33:English
610:
474:India*
110:ICECUP
87:et al.
75:spoken
997:TIMIT
646:(PDF)
629:(PDF)
577:(PDF)
560:(PDF)
483:Malta
465:Ghana
608:ISSN
516:USA*
462:Fiji
126:here
96:and
19:The
714:doi
600:doi
108:or
25:ICE
1173::
840:.
816:.
801:^
785:.
710:36
708:.
683:.
668:^
606:.
596:23
594:.
850:.
826:.
795:.
771:.
720:.
716::
693:.
662:.
648:.
631:.
614:.
602::
579:.
562:.
23:(