Knowledge

Longest common substring

Source 📝

972: 59: 986:
for the strings, and then finding the deepest internal nodes which have leaf nodes from all the strings in the subtree below it. The figure on the right is the suffix tree for the strings "ABAB", "BABA" and "ABBA", padded with unique string terminators, to become "ABAB$ 0", "BABA$ 1" and "ABBA$ 2".
678: 1019:
time (if the size of the alphabet is constant). If the tree is traversed from the bottom up with a bit vector telling which strings are seen below each node, the k-common substring problem can be solved in
758: 594: 66:
The picture shows two strings where the problem has multiple solutions. Although the substring occurrences always overlap, it is impossible to obtain a longer common substring by "uniting" them.
957: 849: 898: 267: 385: 317: 1285: 1439: 1050: 794: 1083: 1017: 69:
The strings "ABABC", "BABCA" and "ABCBA" have only one longest common substring, viz. "ABC" of length 3. Other common substrings are "A", "AB", "B", "BA", "BC" and "C".
353: 534: 474: 1317: 1179: 506: 453: 433: 405: 198: 178: 158: 138: 118: 98: 1369:
Charalampopoulos, Panagiotis; Kociumaka, Tomasz; Pissis, Solon P.; Radoszewski, Jakub (Aug 2021). Mutzel, Petra; Pagh, Rasmus; Herman, Grzegorz (eds.).
1714: 1609: 599: 1444: 1648: 1574: 1327:
Store only non-zero values in the rows. This can be done using hash-tables instead of arrays. This is useful for large alphabets.
1373:. European Symposium on Algorithms. Leibniz International Proceedings in Informatics (LIPIcs). Vol. 204. Schloss Dagstuhl. 1579: 683: 1584: 539: 987:
The nodes representing "A", "B", "AB" and "BA" all have descendant leaves from all of the strings, numbered 0, 1 and 2.
1795: 1569: 1396: 1663: 1604: 1476: 1449: 1406: 1511: 456: 1093:
The following pseudocode finds the set of longest common substrings between two strings with dynamic programming:
907: 799: 854: 1691: 214: 1337: 1831: 1826: 1696: 1551: 1775: 1501: 1780: 1658: 1625: 1561: 20: 1790: 1686: 1630: 1531: 1485: 35: 1589: 1521: 358: 272: 1322:
The last and current row can be stored on the same 1D array by traversing the inner loop backwards
1246: 1734: 1023: 983: 975: 960: 767: 509: 1059: 993: 1053: 322: 1739: 1681: 1541: 1469: 1131:
L > z z := L ret := {S}
519: 459: 1546: 1290: 1152: 479: 47: 42:
of all of them. There may be more than one longest common substring. Applications include
8: 1744: 901: 761: 1424: 62:
The strings "BADANAT" and "CANADAS" share the maximal-length substrings "ADA" and "ANA".
1673: 1640: 438: 418: 390: 183: 163: 143: 123: 103: 83: 43: 1398:
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
1805: 1402: 1224:, which is the last character of the longest common substring (of size z) instead of 415:
One can find the lengths and starting positions of the longest common substrings of
1800: 1770: 1724: 1526: 1462: 1374: 27: 1239:
The following tricks can be used to reduce the memory usage of an implementation:
1653: 1594: 1506: 1208:
is used to hold the length of the longest common substring found so far. The set
1706: 1599: 1379: 1450:
Suffix Tree based C implementation of Longest common substring for two strings
1820: 1516: 1493: 1429: 982:
The longest common substrings of a set of strings can be found by building a
971: 1719: 1536: 1434: 1368: 1729: 58: 1425:
Dictionary of Algorithms and Data Structures: longest common substring
1440:
Dynamic programming implementations in various languages on wikibooks
39: 1243:
Keep only the last and current row of the DP table to save memory (
513: 1765: 1228:. Thus all the longest common substrings would be, for each i in 673:{\textstyle O\left((n+m)\log \sigma /{\sqrt {\log(n+m)}}\right)} 1445:
working AS3 implementation of the dynamic programming algorithm
1342: 1185:
stores the length of the longest common suffix of the prefixes
978:
for the strings "ABAB", "BABA" and "ABBA", numbered 0, 1 and 2.
1138:L = z ret := ret âˆȘ {S} 1749: 1430:
Perl/XS implementation of the dynamic programming algorithm
1454: 753:{\displaystyle O\left((n+m)\log \sigma /\log(n+m)\right)} 387:, a longest string which occurs as substring of at least 1212:
is used to hold the set of strings which are of length
1052:
time. If the suffix tree is prepared for constant time
1123:j = 1 L := 1 602: 325: 1293: 1249: 1155: 1062: 1026: 996: 910: 857: 802: 770: 686: 589:{\displaystyle 2^{o\left({\sqrt {\log(n+m)}}\right)}} 542: 522: 482: 462: 441: 421: 393: 361: 275: 217: 186: 166: 146: 126: 106: 86: 1103:(1..r, 1..n) z := 0 ret := {} 1435:
Perl/XS implementation of the suffix tree algorithm
1220:can be saved efficiently by just storing the index 160:, find a longest string which is substring of both 1311: 1279: 1173: 1077: 1044: 1011: 951: 892: 843: 788: 752: 672: 588: 528: 500: 468: 447: 427: 399: 379: 347: 311: 261: 192: 172: 152: 132: 112: 92: 1818: 1256: 796:. The solutions to the generalized problem take 1371:Faster Algorithms for Longest Common Substring 1470: 952:{\displaystyle \Theta (n_{1}+\cdots +n_{K})} 844:{\displaystyle \Theta (n_{1}+\cdots +n_{K})} 512:. A faster algorithm can be achieved in the 256: 224: 1099:LongestCommonSubstring(S, T) L := 1477: 1463: 893:{\displaystyle \Theta (n_{1}\cdots n_{K})} 1378: 262:{\displaystyle S=\{S_{1},\ldots ,S_{K}\}} 72:ABABC ||| BABCA ||| ABCBA 1649:Comparison of regular-expression engines 1394: 1348:, all the possible substrings of length 970: 596:. In particular, this algorithm runs in 57: 1819: 1088: 1610:Zhu–Takaoka string matching algorithm 1458: 1362: 75: 1388: 34:of two or more strings is a longest 1575:Boyer–Moore string-search algorithm 1401:. USA: Cambridge University Press. 13: 1063: 1027: 997: 911: 858: 803: 771: 463: 14: 1843: 1664:Nondeterministic finite automaton 1605:Two-way string-matching algorithm 1418: 516:model of computation if the size 1127:L := L + 1 1056:retrieval, it can be solved in 990:Building the suffix tree takes 1580:Boyer–Moore–Horspool algorithm 1570:Apostolico–Giancarlo algorithm 1352:that are contained in a string 1306: 1297: 1274: 1271: 1259: 1253: 1168: 1159: 1072: 1066: 1039: 1030: 1006: 1000: 966: 946: 914: 887: 861: 838: 806: 783: 774: 760:space. Solving the problem by 742: 730: 707: 695: 660: 648: 623: 611: 575: 563: 495: 483: 292: 277: 1: 1356: 1338:Longest palindromic substring 1204:, respectively. The variable 410: 380:{\displaystyle 2\leq k\leq K} 312:{\displaystyle |S_{i}|=n_{i}} 1585:Knuth–Morris–Pratt algorithm 1512:Damerau–Levenshtein distance 1280:{\displaystyle O(\min(r,n))} 536:of the input alphabet is in 7: 1776:Compressed pattern matching 1502:Approximate string matching 1484: 1331: 1111:j := 1..n 1045:{\displaystyle \Theta (NK)} 789:{\displaystyle \Theta (nm)} 211:. Given the set of strings 53: 10: 1848: 1781:Longest common subsequence 1692:Needleman–Wunsch algorithm 1562:String-searching algorithm 1380:10.4230/LIPIcs.ESA.2021.30 1078:{\displaystyle \Theta (N)} 1012:{\displaystyle \Theta (N)} 21:longest common subsequence 18: 1791:Sequential pattern mining 1758: 1705: 1672: 1639: 1631:Commentz-Walter algorithm 1619:Multiple string searching 1618: 1560: 1552:Wagner–Fischer algorithm 1492: 348:{\textstyle \sum n_{i}=N} 1801:String rewriting systems 1786:Longest common substring 1697:Smith–Waterman algorithm 1522:Gestalt pattern matching 1385:Here: Theorem 1, p.30:2. 508:time with the help of a 209:common substring problem 203:A generalization is the 32:longest common substring 19:Not to be confused with 16:Computer science problem 1735:Generalized suffix tree 1659:Thompson's construction 1395:Gusfield, Dan (1999) . 1149:This algorithm runs in 1107:i := 1..r 984:generalized suffix tree 976:Generalized suffix tree 961:generalized suffix tree 529:{\displaystyle \sigma } 510:generalized suffix tree 469:{\displaystyle \Theta } 1687:Hirschberg's algorithm 1313: 1281: 1175: 1115:S = T 1079: 1054:lowest common ancestor 1046: 1013: 979: 953: 894: 845: 790: 754: 674: 590: 530: 502: 470: 449: 429: 401: 381: 349: 313: 263: 194: 174: 154: 134: 114: 94: 63: 1542:Levenshtein automaton 1532:Jaro–Winkler distance 1314: 1312:{\displaystyle O(nr)} 1282: 1176: 1174:{\displaystyle O(nr)} 1080: 1047: 1014: 974: 954: 895: 846: 791: 755: 675: 591: 531: 503: 501:{\displaystyle (n+m)} 471: 450: 430: 402: 382: 350: 314: 264: 195: 175: 155: 135: 115: 95: 61: 1590:Rabin–Karp algorithm 1547:Levenshtein distance 1291: 1247: 1153: 1060: 1024: 994: 908: 855: 800: 768: 684: 600: 540: 520: 480: 460: 439: 419: 391: 359: 323: 273: 215: 184: 164: 144: 124: 104: 84: 48:plagiarism detection 1832:Dynamic programming 1827:Problems on strings 1745:Ternary search tree 1089:Dynamic programming 902:dynamic programming 762:dynamic programming 80:Given two strings, 1674:Sequence alignment 1641:Regular expression 1309: 1277: 1171: 1075: 1042: 1009: 980: 949: 890: 841: 786: 750: 670: 586: 526: 498: 466: 445: 425: 397: 377: 345: 309: 259: 190: 170: 150: 130: 110: 90: 76:Problem definition 64: 44:data deduplication 1814: 1813: 1806:String operations 663: 578: 448:{\displaystyle T} 428:{\displaystyle S} 400:{\displaystyle k} 193:{\displaystyle T} 173:{\displaystyle S} 153:{\displaystyle n} 133:{\displaystyle T} 113:{\displaystyle m} 93:{\displaystyle S} 1839: 1771:Pattern matching 1725:Suffix automaton 1527:Hamming distance 1479: 1472: 1465: 1456: 1455: 1413: 1412: 1392: 1386: 1384: 1382: 1366: 1318: 1316: 1315: 1310: 1286: 1284: 1283: 1278: 1235: 1231: 1227: 1223: 1219: 1215: 1211: 1207: 1203: 1199: 1192: 1188: 1184: 1181:time. The array 1180: 1178: 1177: 1172: 1142:L := 0 1084: 1082: 1081: 1076: 1051: 1049: 1048: 1043: 1018: 1016: 1015: 1010: 958: 956: 955: 950: 945: 944: 926: 925: 899: 897: 896: 891: 886: 885: 873: 872: 850: 848: 847: 842: 837: 836: 818: 817: 795: 793: 792: 787: 759: 757: 756: 751: 749: 745: 723: 679: 677: 676: 671: 669: 665: 664: 641: 639: 595: 593: 592: 587: 585: 584: 583: 579: 556: 535: 533: 532: 527: 507: 505: 504: 499: 475: 473: 472: 467: 454: 452: 451: 446: 434: 432: 431: 426: 406: 404: 403: 398: 386: 384: 383: 378: 355:. Find for each 354: 352: 351: 346: 338: 337: 318: 316: 315: 310: 308: 307: 295: 290: 289: 280: 268: 266: 265: 260: 255: 254: 236: 235: 199: 197: 196: 191: 179: 177: 176: 171: 159: 157: 156: 151: 139: 137: 136: 131: 119: 117: 116: 111: 99: 97: 96: 91: 28:computer science 1847: 1846: 1842: 1841: 1840: 1838: 1837: 1836: 1817: 1816: 1815: 1810: 1754: 1701: 1668: 1654:Regular grammar 1635: 1614: 1595:Raita algorithm 1556: 1507:Bitap algorithm 1488: 1483: 1421: 1416: 1409: 1393: 1389: 1367: 1363: 1359: 1334: 1292: 1289: 1288: 1248: 1245: 1244: 1233: 1229: 1225: 1221: 1217: 1213: 1209: 1205: 1201: 1197: 1195:end at position 1190: 1186: 1182: 1154: 1151: 1150: 1147: 1091: 1061: 1058: 1057: 1025: 1022: 1021: 995: 992: 991: 969: 940: 936: 921: 917: 909: 906: 905: 881: 877: 868: 864: 856: 853: 852: 832: 828: 813: 809: 801: 798: 797: 769: 766: 765: 719: 694: 690: 685: 682: 681: 640: 635: 610: 606: 601: 598: 597: 555: 551: 547: 543: 541: 538: 537: 521: 518: 517: 481: 478: 477: 461: 458: 457: 440: 437: 436: 420: 417: 416: 413: 392: 389: 388: 360: 357: 356: 333: 329: 324: 321: 320: 303: 299: 291: 285: 281: 276: 274: 271: 270: 250: 246: 231: 227: 216: 213: 212: 185: 182: 181: 165: 162: 161: 145: 142: 141: 125: 122: 121: 105: 102: 101: 85: 82: 81: 78: 73: 56: 24: 17: 12: 11: 5: 1845: 1835: 1834: 1829: 1812: 1811: 1809: 1808: 1803: 1798: 1793: 1788: 1783: 1778: 1773: 1768: 1762: 1760: 1756: 1755: 1753: 1752: 1747: 1742: 1737: 1732: 1727: 1722: 1717: 1711: 1709: 1707:Data structure 1703: 1702: 1700: 1699: 1694: 1689: 1684: 1678: 1676: 1670: 1669: 1667: 1666: 1661: 1656: 1651: 1645: 1643: 1637: 1636: 1634: 1633: 1628: 1622: 1620: 1616: 1615: 1613: 1612: 1607: 1602: 1600:Trigram search 1597: 1592: 1587: 1582: 1577: 1572: 1566: 1564: 1558: 1557: 1555: 1554: 1549: 1544: 1539: 1534: 1529: 1524: 1519: 1514: 1509: 1504: 1498: 1496: 1490: 1489: 1482: 1481: 1474: 1467: 1459: 1453: 1452: 1447: 1442: 1437: 1432: 1427: 1420: 1419:External links 1417: 1415: 1414: 1407: 1387: 1360: 1358: 1355: 1354: 1353: 1340: 1333: 1330: 1329: 1328: 1325: 1324: 1323: 1308: 1305: 1302: 1299: 1296: 1276: 1273: 1270: 1267: 1264: 1261: 1258: 1255: 1252: 1170: 1167: 1164: 1161: 1158: 1095: 1090: 1087: 1074: 1071: 1068: 1065: 1041: 1038: 1035: 1032: 1029: 1008: 1005: 1002: 999: 968: 965: 948: 943: 939: 935: 932: 929: 924: 920: 916: 913: 889: 884: 880: 876: 871: 867: 863: 860: 840: 835: 831: 827: 824: 821: 816: 812: 808: 805: 785: 782: 779: 776: 773: 748: 744: 741: 738: 735: 732: 729: 726: 722: 718: 715: 712: 709: 706: 703: 700: 697: 693: 689: 668: 662: 659: 656: 653: 650: 647: 644: 638: 634: 631: 628: 625: 622: 619: 616: 613: 609: 605: 582: 577: 574: 571: 568: 565: 562: 559: 554: 550: 546: 525: 497: 494: 491: 488: 485: 465: 444: 424: 412: 409: 396: 376: 373: 370: 367: 364: 344: 341: 336: 332: 328: 306: 302: 298: 294: 288: 284: 279: 258: 253: 249: 245: 242: 239: 234: 230: 226: 223: 220: 189: 169: 149: 129: 109: 89: 77: 74: 71: 55: 52: 15: 9: 6: 4: 3: 2: 1844: 1833: 1830: 1828: 1825: 1824: 1822: 1807: 1804: 1802: 1799: 1797: 1794: 1792: 1789: 1787: 1784: 1782: 1779: 1777: 1774: 1772: 1769: 1767: 1764: 1763: 1761: 1757: 1751: 1748: 1746: 1743: 1741: 1738: 1736: 1733: 1731: 1728: 1726: 1723: 1721: 1718: 1716: 1713: 1712: 1710: 1708: 1704: 1698: 1695: 1693: 1690: 1688: 1685: 1683: 1680: 1679: 1677: 1675: 1671: 1665: 1662: 1660: 1657: 1655: 1652: 1650: 1647: 1646: 1644: 1642: 1638: 1632: 1629: 1627: 1624: 1623: 1621: 1617: 1611: 1608: 1606: 1603: 1601: 1598: 1596: 1593: 1591: 1588: 1586: 1583: 1581: 1578: 1576: 1573: 1571: 1568: 1567: 1565: 1563: 1559: 1553: 1550: 1548: 1545: 1543: 1540: 1538: 1535: 1533: 1530: 1528: 1525: 1523: 1520: 1518: 1517:Edit distance 1515: 1513: 1510: 1508: 1505: 1503: 1500: 1499: 1497: 1495: 1494:String metric 1491: 1487: 1480: 1475: 1473: 1468: 1466: 1461: 1460: 1457: 1451: 1448: 1446: 1443: 1441: 1438: 1436: 1433: 1431: 1428: 1426: 1423: 1422: 1410: 1408:0-521-58519-8 1404: 1400: 1399: 1391: 1381: 1376: 1372: 1365: 1361: 1351: 1347: 1345: 1341: 1339: 1336: 1335: 1326: 1321: 1320: 1303: 1300: 1294: 1268: 1265: 1262: 1250: 1242: 1241: 1240: 1237: 1196: 1165: 1162: 1156: 1145: 1141: 1137: 1134: 1130: 1126: 1122: 1118: 1114: 1110: 1106: 1102: 1098: 1094: 1086: 1069: 1055: 1036: 1033: 1003: 988: 985: 977: 973: 964: 962: 941: 937: 933: 930: 927: 922: 918: 903: 882: 878: 874: 869: 865: 833: 829: 825: 822: 819: 814: 810: 780: 777: 763: 746: 739: 736: 733: 727: 724: 720: 716: 713: 710: 704: 701: 698: 691: 687: 666: 657: 654: 651: 645: 642: 636: 632: 629: 626: 620: 617: 614: 607: 603: 580: 572: 569: 566: 560: 557: 552: 548: 544: 523: 515: 511: 492: 489: 486: 476: 442: 422: 408: 394: 374: 371: 368: 365: 362: 342: 339: 334: 330: 326: 304: 300: 296: 286: 282: 251: 247: 243: 240: 237: 232: 228: 221: 218: 210: 206: 201: 187: 167: 147: 127: 107: 87: 70: 67: 60: 51: 49: 45: 41: 37: 33: 29: 22: 1785: 1720:Suffix array 1626:Aho–Corasick 1537:Lee distance 1397: 1390: 1370: 1364: 1349: 1343: 1238: 1234:S-z)..(ret)] 1194: 1148: 1143: 1139: 1135: 1132: 1128: 1124: 1120: 1116: 1112: 1108: 1104: 1100: 1096: 1092: 989: 981: 959:time with a 414: 208: 204: 202: 79: 68: 65: 31: 25: 1730:Suffix tree 1287:instead of 967:Suffix tree 680:time using 1821:Categories 1357:References 1216:. The set 900:time with 851:space and 411:Algorithms 140:of length 100:of length 38:that is a 1064:Θ 1028:Θ 998:Θ 931:⋯ 912:Θ 904:and take 875:⋯ 859:Θ 823:⋯ 804:Θ 772:Θ 728:⁡ 717:σ 714:⁡ 646:⁡ 633:σ 630:⁡ 561:⁡ 524:σ 464:Θ 407:strings. 372:≤ 366:≤ 327:∑ 241:… 40:substring 1332:See also 1097:function 514:word RAM 269:, where 54:Examples 1796:Sorting 1766:Parsing 1486:Strings 1405:  1193:which 1144:return 1119:i = 1 1085:time. 764:costs 36:string 1759:Other 1715:DAFSA 1682:BLAST 1346:-gram 1101:array 1750:Trie 1740:Rope 1403:ISBN 1200:and 1189:and 1146:ret 1140:else 1133:else 1125:else 435:and 319:and 180:and 120:and 46:and 30:, a 1375:doi 1257:min 1230:ret 1218:ret 1210:ret 1109:for 1105:for 725:log 711:log 643:log 627:log 558:log 455:in 26:In 1823:: 1319:) 1236:. 1232:, 1136:if 1129:if 1121:or 1117:if 1113:if 963:. 200:. 50:. 1478:e 1471:t 1464:v 1411:. 1383:. 1377:: 1350:n 1344:n 1307:) 1304:r 1301:n 1298:( 1295:O 1275:) 1272:) 1269:n 1266:, 1263:r 1260:( 1254:( 1251:O 1226:S 1222:i 1214:z 1206:z 1202:j 1198:i 1191:T 1187:S 1183:L 1169:) 1166:r 1163:n 1160:( 1157:O 1073:) 1070:N 1067:( 1040:) 1037:K 1034:N 1031:( 1007:) 1004:N 1001:( 947:) 942:K 938:n 934:+ 928:+ 923:1 919:n 915:( 888:) 883:K 879:n 870:1 866:n 862:( 839:) 834:K 830:n 826:+ 820:+ 815:1 811:n 807:( 784:) 781:m 778:n 775:( 747:) 743:) 740:m 737:+ 734:n 731:( 721:/ 708:) 705:m 702:+ 699:n 696:( 692:( 688:O 667:) 661:) 658:m 655:+ 652:n 649:( 637:/ 624:) 621:m 618:+ 615:n 612:( 608:( 604:O 581:) 576:) 573:m 570:+ 567:n 564:( 553:( 549:o 545:2 496:) 493:m 490:+ 487:n 484:( 443:T 423:S 395:k 375:K 369:k 363:2 343:N 340:= 335:i 331:n 305:i 301:n 297:= 293:| 287:i 283:S 278:| 257:} 252:K 248:S 244:, 238:, 233:1 229:S 225:{ 222:= 219:S 207:- 205:k 188:T 168:S 148:n 128:T 108:m 88:S 23:.

Index

longest common subsequence
computer science
string
substring
data deduplication
plagiarism detection

Θ {\displaystyle \Theta }
generalized suffix tree
word RAM
dynamic programming
dynamic programming
generalized suffix tree

Generalized suffix tree
generalized suffix tree
lowest common ancestor
Longest palindromic substring
n-gram
doi
10.4230/LIPIcs.ESA.2021.30
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
ISBN
0-521-58519-8
Dictionary of Algorithms and Data Structures: longest common substring
Perl/XS implementation of the dynamic programming algorithm
Perl/XS implementation of the suffix tree algorithm
Dynamic programming implementations in various languages on wikibooks
working AS3 implementation of the dynamic programming algorithm
Suffix Tree based C implementation of Longest common substring for two strings

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.

↑