Parallel text


Filtering out monolingual elements represented in only one corpus makes it possible to extract cleaner parallel fragments of bilingual elements. Comparable corpora are used to directly obtain knowledge for translation purposes. High-quality parallel data is difficult to obtain, however, especially for under-resourced languages.


Bitexts have some similarities with translation memories. The most salient difference is that a translation memory loses the original context, while a bitext retains the original sentence order. That said, some implementations of translation memory, such as Translation Memory eXchange (TMX), allow preserving the original order of sentences.
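As an illustration of how aligned segments can be stored so that their order survives, the following is a minimal sketch that writes sentence pairs into a TMX-style XML document using Python's standard library. The function name `build_tmx`, the header values (such as the `creationtool` name) and the sample sentences are assumptions made for this example, and the output has not been validated against the full TMX specification.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree expects it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="fr"):
    """Serialize (source, target) sentence pairs as a TMX-style string.

    Translation units are written in the order given, so the original
    sentence order of the text is preserved in the file.
    """
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical tool name
        "creationtoolversion": "0.1",      # hypothetical version
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(build_tmx([("Hello.", "Bonjour."), ("Thank you.", "Merci.")]))
```

Each `tu` element holds one aligned pair and each `tuv` holds one language variant, so a consumer that reads the units in document order recovers the sentence order of the source text.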


In his original 1988 article, Harris also posited that bitext represents how translators hold their source and target texts together in their mental working memories as they progress. However, this hypothesis has not been followed up.


Abdallah, A. (2021). Impact of using parallel text strategy on teaching reading to intermediate II level students. International Journal on Social and Education Sciences (IJonSES), 3(1), 95-108. https://doi.org/10.46328/ijonses.48

The alignment tool automatically aligns the original and translated versions of the same text. The tool generally matches these two texts sentence by sentence. A collection of bitexts is called a bitext database or a bilingual corpus, and can be consulted with a search tool.

A noisy parallel corpus contains bilingual sentences that are not perfectly aligned or have poor-quality translations. Nevertheless, most of its contents are bilingual translations of a specific document.


Large corpora used as training sets for machine translation algorithms are usually extracted from large bodies of similar sources, such as databases of news articles written in the first and second languages describing similar events.


A parallel corpus contains translations of the same document in two or more languages, aligned at least at the sentence level. These tend to be rarer than less-comparable corpora.


During translation, sentences can be split, merged, deleted, inserted or reordered by the translator. This makes alignment a non-trivial task.
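Why splits, merges, deletions and insertions complicate alignment can be seen in a minimal sketch of a length-based aligner. It is loosely in the spirit of length-based methods such as the Gale and Church algorithm listed among the alignment tools, but it is not that algorithm: the function names, the set of allowed patterns and the penalty values are assumptions chosen for illustration, and real aligners use a probabilistic cost model. Reordering is not handled at all.

```python
from typing import List, Tuple

# Allowed alignment patterns (source sentences, target sentences) with an
# illustrative, hand-picked penalty. 1-1 is free; merges, splits, deletions
# and insertions cost extra. These values are assumptions, not tuned.
PATTERNS = {(1, 1): 0.0, (2, 1): 1.0, (1, 2): 1.0, (1, 0): 3.0, (0, 1): 3.0}

Bead = Tuple[List[str], List[str]]


def length_cost(src_len: int, tgt_len: int) -> float:
    """Penalty in [0, 1] that grows as the two character lengths diverge."""
    total = src_len + tgt_len
    return 0.0 if total == 0 else abs(src_len - tgt_len) / total


def align(src: List[str], tgt: List[str]) -> List[Bead]:
    """Return the cheapest monotone alignment as a list of beads."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward dynamic programming over prefixes of both texts.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in PATTERNS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + penalty + length_cost(s_len, t_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (di, dj)
    # Walk the back-pointers from the end to recover the beads.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    beads.reverse()
    return beads


if __name__ == "__main__":
    english = ["The cat sat on the mat.", "It was warm.", "Then it left."]
    french = ["Le chat s'est assis sur le tapis et il faisait chaud.",
              "Puis il est parti."]
    for source_group, target_group in align(english, french):
        print(source_group, "<->", target_group)
```

In the sample, the translator has merged the first two English sentences into one French sentence, so a purely one-to-one pairing would misalign everything that follows; the 2-1 pattern lets the sketch absorb the merge.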

Bitexts are designed to be consulted by a human translator, not by a machine. As such, small alignment errors or minor discrepancies that would cause a translation memory to fail are of no importance.

Online bitexts and translation memories may also be called online bilingual concordances. Several are available on the public Web, including Linguee, Reverso, and Tradooit.
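The basic lookup behind a bilingual concordance can be sketched in a few lines, assuming the bitext is already available as a list of aligned sentence pairs. The function name `concordance` and the sample data are hypothetical, and public services are considerably more elaborate (indexing, ranking, morphological matching).

```python
import re
from typing import List, Tuple

Bitext = List[Tuple[str, str]]  # (source sentence, target sentence) pairs


def concordance(bitext: Bitext, query: str, side: str = "source") -> Bitext:
    """Return every aligned pair whose chosen side contains the query as a whole word."""
    if side not in ("source", "target"):
        raise ValueError("side must be 'source' or 'target'")
    index = 0 if side == "source" else 1
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return [pair for pair in bitext if pattern.search(pair[index])]


# Hypothetical sample bitext, invented for this example.
SAMPLE: Bitext = [
    ("The treaty was signed in 1992.", "Le traité a été signé en 1992."),
    ("Each member state ratified the treaty.", "Chaque État membre a ratifié le traité."),
    ("The session is closed.", "La séance est levée."),
]

if __name__ == "__main__":
    for source, target in concordance(SAMPLE, "treaty"):
        print(source, "|", target)
```

Because each hit is returned together with its aligned counterpart, the user sees the query in context alongside how it was translated, which is the point of consulting a bitext rather than a dictionary.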


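However, extracted fragments may be noisy, with extra elements inserted in each corpus. Extraction techniques can differentiate between bilingual elements represented in both corpora and monolingual elements represented in only one corpus.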

Wołk, Krzysztof (2015). "Noisy-Parallel and Comparable Corpora Filtering Methodology for the Extraction of Bi-Lingual Equivalent Data at Sentence Level". Computer Science. 16 (2): 169–184. arXiv:1510.04500. doi:10.7494/csci.2015.16.2.169.

Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study.


Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24–26 May 2006


Steinberger, Ralf; Pouliquen, Bruno; Widiger, Anna; Ignat, Camelia; Erjavec, Tomaž; Tufiş, Dan; Varga, Dániel (2006).


TERMSEARCH – English/Russian/French parallel corpora (major international treaties, conventions, agreements, etc.)


A comparable corpus is built from non-sentence-aligned and untranslated bilingual documents, but the documents are topic-aligned.


A quasi-comparable corpus includes very heterogeneous and non-parallel bilingual documents that may or may not be topic-aligned.

Origen's Hexapla (Greek for "sixfold") placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery allowed the Ancient Egyptian language to begin being deciphered.

and Their Reliability Through a Contrastive Analysis of Complex Prepositions from French to English


Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text.


In the field of translation studies, a bitext is a merged document composed of both source- and target-language versions of a given text.

Large collections of parallel texts are called parallel corpora (see text corpus). Alignments of parallel corpora at sentence level are a prerequisite for many areas of linguistic research.

WeBiText: Building Large Heterogeneous Translation Memories from Parallel Web Content

466: 439: 189: 39: 656: 543: 1490: 1375: 1350: 1151: 1054: 754: 613: 529: 799:

Language Grid – Multilingual service platform that includes parallel text services


Désilets, Alain; Farley, Benoît; Stojanović, Marta; Patenaude, Geneviève (2008).

460: 433: 402: 1310: 1290: 1014: 793: 682: 534: 499: 407: 20: 899: 1687: 1573: 1385: 1146: 781: 651:. Proceedings of Translating and the Computer. Vol. 30. pp. 27–28. 185: 43: 31: 843:

An implementation of the Gale and Church sentence alignment algorithm (2005)

(M.A. thesis). Université catholique de Louvain & Universitetet i Oslo. hdl:10852/51577.

myCAT – Olanto, concordancer (open source AGPL) with online search on JCR and UNO corpus

The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages


Proceedings of the 2005 Workshop on Building and Using Parallel Texts


Proceedings of the 2003 Workshop on Building and Using Parallel Texts


Parallel text processing bibliography by J. Veronis and M.-D. Mahimon


The Opus project aims at collecting freely available parallel corpora


Genette, Marie (2016). How Reliable Are Online Bilingual Concordancers? An investigation of Linguee, TradooIT, WeBiText and ReversoContext


Williams, Philip; Sennrich, Rico; Post, Matt; Koehn, Philipp (2016). Syntax-based Statistical Machine Translation. Morgan & Claypool. ISBN 978-1-62705-502-4.

Translation Memory eXchange (TMX) is a standard XML format for exchanging translation memories between computer-assisted translation (CAT) programs.

Bleualign – machine translation based sentence alignment (2010)


Japanese-English Bilingual Corpus of Knowledge's Kyoto Articles


Parallel corpora can be classified into four main categories:


The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts.


A parallel text is a text placed alongside its translation or translations.


ParaSol – A parallel corpus of Slavic and other languages


European Parliament Proceedings Parallel Corpus 1996–2011


linguatools – multilingual parallel corpora, online search interface.


Bitexts are generated by a piece of software called an alignment tool, or a bitext tool.


TradooIT – English/French/Spanish – Free Online tools


Text placed alongside its translation or translations


Uplug – tools for processing parallel corpora (2003)


Nunavut Hansard – English/Inuktitut parallel corpus

Not to be confused with Parallel novel.

The Rosetta Stone, a stele engraved with the same decree in both of the Ancient Egyptian scripts as well as Ancient Greek. Its discovery was key to deciphering the Ancient Egyptian language.

Parallel texts may be used in language education.

See also

Bilingual inscription
Computer-assisted reviewing
Example-based machine translation
Natural language processing
Polyglot (book)
Ruby character
Statistical machine translation

References

Chan, Sin-Wai (2015). Routledge Encyclopedia of Translation Technology. London: Routledge. ISBN 978-1-315-74912-9.
Harris, B. (March 1988). "Bi-Text, A New Concept in Translation Theory" (PDF). Language Monthly. 54: 8–10. Archived from the original (PDF) on 2018-03-02.
"TradooIT – Concordancier bilingue".

External links

Parallel corpora

The JRC-Acquis Multilingual Parallel Corpus of the total body of European Union (EU) law: Acquis Communautaire with 231 language pairs.
COMPARA – Portuguese/English parallel corpora
Glosbe: Multilanguage parallel corpora with online search interface (archived 2013-05-27 at the Wayback Machine)
InterCorp: A multilingual parallel corpus, 40 languages aligned with Czech, online search interface
TAUS, with online search interface.
EUR-Lex Corpus – corpus built from the EUR-Lex database, which consists of European Union law and other public documents of the European Union

Alignment tools

GIZA++ alignment tool (1999)
The Hunalign sentence aligner (2005)
Champollion (2006)
mALIGNa (2008–2020)
Gargantua sentence aligner (2010)
YASA (2013)
Hierarchical alignment tool (HAT) (2018) (archived 2020-07-05 at the Wayback Machine)
Vecalign sentence alignment algorithm (2019)
Web Alignment Tool at University of Grenoble

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.
