Knowledge

Longest common substring

Source 📝

983: 70: 997:
for the strings, and then finding the deepest internal nodes which have leaf nodes from all the strings in the subtree below it. The figure on the right is the suffix tree for the strings "ABAB", "BABA" and "ABBA", padded with unique string terminators, to become "ABAB$ 0", "BABA$ 1" and "ABBA$ 2".
689: 1030:
time (if the size of the alphabet is constant). If the tree is traversed from the bottom up with a bit vector telling which strings are seen below each node, the k-common substring problem can be solved in
769: 605: 77:
The picture shows two strings where the problem has multiple solutions. Although the substring occurrences always overlap, it is impossible to obtain a longer common substring by "uniting" them.
968: 860: 909: 278: 396: 328: 1296: 1450: 1061: 805: 1094: 1028: 80:
The strings "ABABC", "BABCA" and "ABCBA" have only one longest common substring, viz. "ABC" of length 3. Other common substrings are "A", "AB", "B", "BA", "BC" and "C".
364: 545: 485: 1328: 1190: 517: 464: 444: 416: 209: 189: 169: 149: 129: 109: 1380:
Charalampopoulos, Panagiotis; Kociumaka, Tomasz; Pissis, Solon P.; Radoszewski, Jakub (Aug 2021). Mutzel, Petra; Pagh, Rasmus; Herman, Grzegorz (eds.).
17: 1725: 1620: 610: 1455: 1659: 1585: 1338:
Store only non-zero values in the rows. This can be done using hash-tables instead of arrays. This is useful for large alphabets.
1384:. European Symposium on Algorithms. Leibniz International Proceedings in Informatics (LIPIcs). Vol. 204. Schloss Dagstuhl. 1590: 694: 1595: 550: 998:
The nodes representing "A", "B", "AB" and "BA" all have descendant leaves from all of the strings, numbered 0, 1 and 2.
1806: 1580: 1407: 1674: 1615: 1487: 1460: 1417: 1522: 467: 1104:
The following pseudocode finds the set of longest common substrings between two strings with dynamic programming:
918: 810: 865: 1702: 225: 1348: 1842: 1837: 1707: 1562: 1786: 1512: 1791: 1669: 1636: 1572: 31: 1801: 1697: 1641: 1542: 1496: 46: 1600: 1532: 369: 283: 1333:
The last and current row can be stored on the same 1D array by traversing the inner loop backwards
1257: 1745: 1034: 994: 986: 971: 778: 520: 1070: 1004: 1064: 333: 1750: 1692: 1552: 1480: 1142:
L > z z := L ret := {S}
530: 470: 1557: 1301: 1163: 490: 58: 53:
of all of them. There may be more than one longest common substring. Applications include
8: 1755: 912: 772: 1435: 73:
The strings "BADANAT" and "CANADAS" share the maximal-length substrings "ADA" and "ANA".
1684: 1651: 449: 429: 401: 194: 174: 154: 134: 114: 94: 54: 1409:
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
1816: 1413: 1235:, which is the last character of the longest common substring (of size z) instead of 426:
One can find the lengths and starting positions of the longest common substrings of
1811: 1781: 1735: 1537: 1473: 1385: 38: 1250:
The following tricks can be used to reduce the memory usage of an implementation:
1664: 1605: 1517: 1219:
is used to hold the length of the longest common substring found so far. The set
1717: 1610: 1390: 1461:
Suffix Tree based C implementation of Longest common substring for two strings
1831: 1527: 1504: 1440: 993:
The longest common substrings of a set of strings can be found by building a
982: 1730: 1547: 1445: 1379: 1740: 69: 1436:
Dictionary of Algorithms and Data Structures: longest common substring
1451:
Dynamic programming implementations in various languages on wikibooks
50: 1254:
Keep only the last and current row of the DP table to save memory (
524: 1776: 1239:. Thus all the longest common substrings would be, for each i in 684:{\textstyle O\left((n+m)\log \sigma /{\sqrt {\log(n+m)}}\right)} 1456:
working AS3 implementation of the dynamic programming algorithm
1353: 1196:
stores the length of the longest common suffix of the prefixes
989:
for the strings "ABAB", "BABA" and "ABBA", numbered 0, 1 and 2.
1149:L = z ret := ret âˆȘ {S} 1760: 1441:
Perl/XS implementation of the dynamic programming algorithm
1465: 764:{\displaystyle O\left((n+m)\log \sigma /\log(n+m)\right)} 398:, a longest string which occurs as substring of at least 1223:
is used to hold the set of strings which are of length
1063:
time. If the suffix tree is prepared for constant time
1134:j = 1 L := 1 613: 336: 1304: 1260: 1166: 1073: 1037: 1007: 921: 868: 813: 781: 697: 600:{\displaystyle 2^{o\left({\sqrt {\log(n+m)}}\right)}} 553: 533: 493: 473: 452: 432: 404: 372: 286: 228: 197: 177: 157: 137: 117: 97: 1114:(1..r, 1..n) z := 0 ret := {} 1446:
Perl/XS implementation of the suffix tree algorithm
1231:can be saved efficiently by just storing the index 171:, find a longest string which is substring of both 1322: 1290: 1184: 1088: 1055: 1022: 962: 903: 854: 799: 763: 683: 599: 539: 511: 479: 458: 438: 410: 390: 358: 322: 272: 203: 183: 163: 143: 123: 103: 1829: 1267: 807:. The solutions to the generalized problem take 1382:Faster Algorithms for Longest Common Substring 1481: 963:{\displaystyle \Theta (n_{1}+\cdots +n_{K})} 855:{\displaystyle \Theta (n_{1}+\cdots +n_{K})} 523:. A faster algorithm can be achieved in the 267: 235: 1110:LongestCommonSubstring(S, T) L := 1488: 1474: 904:{\displaystyle \Theta (n_{1}\cdots n_{K})} 1389: 273:{\displaystyle S=\{S_{1},\ldots ,S_{K}\}} 83:ABABC ||| BABCA ||| ABCBA 1660:Comparison of regular-expression engines 1405: 1359:, all the possible substrings of length 981: 607:. In particular, this algorithm runs in 68: 14: 1830: 1099: 1621:Zhu–Takaoka string matching algorithm 1469: 1373: 86: 1399: 45:of two or more strings is a longest 1586:Boyer–Moore string-search algorithm 1412:. USA: Cambridge University Press. 24: 1074: 1038: 1008: 922: 869: 814: 782: 474: 25: 1854: 1675:Nondeterministic finite automaton 1616:Two-way string-matching algorithm 1429: 527:model of computation if the size 1138:L := L + 1 18:Longest common substring problem 1067:retrieval, it can be solved in 1001:Building the suffix tree takes 1591:Boyer–Moore–Horspool algorithm 1581:Apostolico–Giancarlo algorithm 1363:that are contained in a string 1317: 1308: 1285: 1282: 1270: 1264: 1179: 1170: 1083: 1077: 1050: 1041: 1017: 1011: 977: 957: 925: 898: 872: 849: 817: 794: 785: 771:space. Solving the problem by 753: 741: 718: 706: 671: 659: 634: 622: 586: 574: 506: 494: 303: 288: 13: 1: 1367: 1349:Longest palindromic substring 1215:, respectively. The variable 421: 391:{\displaystyle 2\leq k\leq K} 323:{\displaystyle |S_{i}|=n_{i}} 1596:Knuth–Morris–Pratt algorithm 1523:Damerau–Levenshtein distance 1291:{\displaystyle O(\min(r,n))} 547:of the input alphabet is in 7: 1787:Compressed pattern matching 1513:Approximate string matching 1495: 1342: 1122:j := 1..n 1056:{\displaystyle \Theta (NK)} 800:{\displaystyle \Theta (nm)} 222:. Given the set of strings 64: 10: 1859: 1792:Longest common subsequence 1703:Needleman–Wunsch algorithm 1573:String-searching algorithm 1391:10.4230/LIPIcs.ESA.2021.30 1089:{\displaystyle \Theta (N)} 1023:{\displaystyle \Theta (N)} 32:longest common subsequence 29: 1802:Sequential pattern mining 1769: 1716: 1683: 1650: 1642:Commentz-Walter algorithm 1630:Multiple string searching 1629: 1571: 1563:Wagner–Fischer algorithm 1503: 359:{\textstyle \sum n_{i}=N} 1812:String rewriting systems 1797:Longest common substring 1708:Smith–Waterman algorithm 1533:Gestalt pattern matching 1396:Here: Theorem 1, p.30:2. 519:time with the help of a 220:common substring problem 214:A generalization is the 43:longest common substring 30:Not to be confused with 27:Computer science problem 1746:Generalized suffix tree 1670:Thompson's construction 1406:Gusfield, Dan (1999) . 1160:This algorithm runs in 1118:i := 1..r 995:generalized suffix tree 987:Generalized suffix tree 972:generalized suffix tree 540:{\displaystyle \sigma } 521:generalized suffix tree 480:{\displaystyle \Theta } 1698:Hirschberg's algorithm 1324: 1292: 1186: 1126:S = T 1090: 1065:lowest common ancestor 1057: 1024: 990: 964: 905: 856: 801: 765: 685: 601: 541: 513: 481: 460: 440: 412: 392: 360: 324: 274: 205: 185: 165: 145: 125: 105: 74: 1553:Levenshtein automaton 1543:Jaro–Winkler distance 1325: 1323:{\displaystyle O(nr)} 1293: 1187: 1185:{\displaystyle O(nr)} 1091: 1058: 1025: 985: 965: 906: 857: 802: 766: 686: 602: 542: 514: 512:{\displaystyle (n+m)} 482: 461: 441: 413: 393: 361: 325: 275: 206: 186: 166: 146: 126: 106: 72: 1601:Rabin–Karp algorithm 1558:Levenshtein distance 1302: 1258: 1164: 1071: 1035: 1005: 919: 866: 811: 779: 695: 611: 551: 531: 491: 471: 450: 430: 402: 370: 334: 284: 226: 195: 175: 155: 135: 115: 95: 59:plagiarism detection 1843:Dynamic programming 1838:Problems on strings 1756:Ternary search tree 1100:Dynamic programming 913:dynamic programming 773:dynamic programming 91:Given two strings, 1685:Sequence alignment 1652:Regular expression 1320: 1288: 1182: 1086: 1053: 1020: 991: 960: 901: 852: 797: 761: 681: 597: 537: 509: 477: 456: 436: 408: 388: 356: 320: 270: 201: 181: 161: 141: 121: 101: 87:Problem definition 75: 55:data deduplication 1825: 1824: 1817:String operations 674: 589: 459:{\displaystyle T} 439:{\displaystyle S} 411:{\displaystyle k} 204:{\displaystyle T} 184:{\displaystyle S} 164:{\displaystyle n} 144:{\displaystyle T} 124:{\displaystyle m} 104:{\displaystyle S} 16:(Redirected from 1850: 1782:Pattern matching 1736:Suffix automaton 1538:Hamming distance 1490: 1483: 1476: 1467: 1466: 1424: 1423: 1403: 1397: 1395: 1393: 1377: 1329: 1327: 1326: 1321: 1297: 1295: 1294: 1289: 1246: 1242: 1238: 1234: 1230: 1226: 1222: 1218: 1214: 1210: 1203: 1199: 1195: 1192:time. The array 1191: 1189: 1188: 1183: 1153:L := 0 1095: 1093: 1092: 1087: 1062: 1060: 1059: 1054: 1029: 1027: 1026: 1021: 969: 967: 966: 961: 956: 955: 937: 936: 910: 908: 907: 902: 897: 896: 884: 883: 861: 859: 858: 853: 848: 847: 829: 828: 806: 804: 803: 798: 770: 768: 767: 762: 760: 756: 734: 690: 688: 687: 682: 680: 676: 675: 652: 650: 606: 604: 603: 598: 596: 595: 594: 590: 567: 546: 544: 543: 538: 518: 516: 515: 510: 486: 484: 483: 478: 465: 463: 462: 457: 445: 443: 442: 437: 417: 415: 414: 409: 397: 395: 394: 389: 366:. Find for each 365: 363: 362: 357: 349: 348: 329: 327: 326: 321: 319: 318: 306: 301: 300: 291: 279: 277: 276: 271: 266: 265: 247: 246: 210: 208: 207: 202: 190: 188: 187: 182: 170: 168: 167: 162: 150: 148: 147: 142: 130: 128: 127: 122: 110: 108: 107: 102: 39:computer science 21: 1858: 1857: 1853: 1852: 1851: 1849: 1848: 1847: 1828: 1827: 1826: 1821: 1765: 1712: 1679: 1665:Regular grammar 1646: 1625: 1606:Raita algorithm 1567: 1518:Bitap algorithm 1499: 1494: 1432: 1427: 1420: 1404: 1400: 1378: 1374: 1370: 1345: 1303: 1300: 1299: 1259: 1256: 1255: 1244: 1240: 1236: 1232: 1228: 1224: 1220: 1216: 1212: 1208: 1206:end at position 1201: 1197: 1193: 1165: 1162: 1161: 1158: 1102: 1072: 1069: 1068: 1036: 1033: 1032: 1006: 1003: 1002: 980: 951: 947: 932: 928: 920: 917: 916: 892: 888: 879: 875: 867: 864: 863: 843: 839: 824: 820: 812: 809: 808: 780: 777: 776: 730: 705: 701: 696: 693: 692: 651: 646: 621: 617: 612: 609: 608: 566: 562: 558: 554: 552: 549: 548: 532: 529: 528: 492: 489: 488: 472: 469: 468: 451: 448: 447: 431: 428: 427: 424: 403: 400: 399: 371: 368: 367: 344: 340: 335: 332: 331: 314: 310: 302: 296: 292: 287: 285: 282: 281: 261: 257: 242: 238: 227: 224: 223: 196: 193: 192: 176: 173: 172: 156: 153: 152: 136: 133: 132: 116: 113: 112: 96: 93: 92: 89: 84: 67: 35: 28: 23: 22: 15: 12: 11: 5: 1856: 1846: 1845: 1840: 1823: 1822: 1820: 1819: 1814: 1809: 1804: 1799: 1794: 1789: 1784: 1779: 1773: 1771: 1767: 1766: 1764: 1763: 1758: 1753: 1748: 1743: 1738: 1733: 1728: 1722: 1720: 1718:Data structure 1714: 1713: 1711: 1710: 1705: 1700: 1695: 1689: 1687: 1681: 1680: 1678: 1677: 1672: 1667: 1662: 1656: 1654: 1648: 1647: 1645: 1644: 1639: 1633: 1631: 1627: 1626: 1624: 1623: 1618: 1613: 1611:Trigram search 1608: 1603: 1598: 1593: 1588: 1583: 1577: 1575: 1569: 1568: 1566: 1565: 1560: 1555: 1550: 1545: 1540: 1535: 1530: 1525: 1520: 1515: 1509: 1507: 1501: 1500: 1493: 1492: 1485: 1478: 1470: 1464: 1463: 1458: 1453: 1448: 1443: 1438: 1431: 1430:External links 1428: 1426: 1425: 1418: 1398: 1371: 1369: 1366: 1365: 1364: 1351: 1344: 1341: 1340: 1339: 1336: 1335: 1334: 1319: 1316: 1313: 1310: 1307: 1287: 1284: 1281: 1278: 1275: 1272: 1269: 1266: 1263: 1181: 1178: 1175: 1172: 1169: 1106: 1101: 1098: 1085: 1082: 1079: 1076: 1052: 1049: 1046: 1043: 1040: 1019: 1016: 1013: 1010: 979: 976: 959: 954: 950: 946: 943: 940: 935: 931: 927: 924: 900: 895: 891: 887: 882: 878: 874: 871: 851: 846: 842: 838: 835: 832: 827: 823: 819: 816: 796: 793: 790: 787: 784: 759: 755: 752: 749: 746: 743: 740: 737: 733: 729: 726: 723: 720: 717: 714: 711: 708: 704: 700: 679: 673: 670: 667: 664: 661: 658: 655: 649: 645: 642: 639: 636: 633: 630: 627: 624: 620: 616: 593: 588: 585: 582: 579: 576: 573: 570: 565: 561: 557: 536: 508: 505: 502: 499: 496: 476: 455: 435: 423: 420: 407: 387: 384: 381: 378: 375: 355: 352: 347: 343: 339: 317: 313: 309: 305: 299: 295: 290: 269: 264: 260: 256: 253: 250: 245: 241: 237: 234: 231: 200: 180: 160: 140: 120: 100: 88: 85: 82: 66: 63: 26: 9: 6: 4: 3: 2: 1855: 1844: 1841: 1839: 1836: 1835: 1833: 1818: 1815: 1813: 1810: 1808: 1805: 1803: 1800: 1798: 1795: 1793: 1790: 1788: 1785: 1783: 1780: 1778: 1775: 1774: 1772: 1768: 1762: 1759: 1757: 1754: 1752: 1749: 1747: 1744: 1742: 1739: 1737: 1734: 1732: 1729: 1727: 1724: 1723: 1721: 1719: 1715: 1709: 1706: 1704: 1701: 1699: 1696: 1694: 1691: 1690: 1688: 1686: 1682: 1676: 1673: 1671: 1668: 1666: 1663: 1661: 1658: 1657: 1655: 1653: 1649: 1643: 1640: 1638: 1635: 1634: 1632: 1628: 1622: 1619: 1617: 1614: 1612: 1609: 1607: 1604: 1602: 1599: 1597: 1594: 1592: 1589: 1587: 1584: 1582: 1579: 1578: 1576: 1574: 1570: 1564: 1561: 1559: 1556: 1554: 1551: 1549: 1546: 1544: 1541: 1539: 1536: 1534: 1531: 1529: 1528:Edit distance 1526: 1524: 1521: 1519: 1516: 1514: 1511: 1510: 1508: 1506: 1505:String metric 1502: 1498: 1491: 1486: 1484: 1479: 1477: 1472: 1471: 1468: 1462: 1459: 1457: 1454: 1452: 1449: 1447: 1444: 1442: 1439: 1437: 1434: 1433: 1421: 1419:0-521-58519-8 1415: 1411: 1410: 1402: 1392: 1387: 1383: 1376: 1372: 1362: 1358: 1356: 1352: 1350: 1347: 1346: 1337: 1332: 1331: 1314: 1311: 1305: 1279: 1276: 1273: 1261: 1253: 1252: 1251: 1248: 1207: 1176: 1173: 1167: 1156: 1152: 1148: 1145: 1141: 1137: 1133: 1129: 1125: 1121: 1117: 1113: 1109: 1105: 1097: 1080: 1066: 1047: 1044: 1014: 999: 996: 988: 984: 975: 973: 952: 948: 944: 941: 938: 933: 929: 914: 893: 889: 885: 880: 876: 844: 840: 836: 833: 830: 825: 821: 791: 788: 774: 757: 750: 747: 744: 738: 735: 731: 727: 724: 721: 715: 712: 709: 702: 698: 677: 668: 665: 662: 656: 653: 647: 643: 640: 637: 631: 628: 625: 618: 614: 591: 583: 580: 577: 571: 568: 563: 559: 555: 534: 526: 522: 503: 500: 497: 487: 453: 433: 419: 405: 385: 382: 379: 376: 373: 353: 350: 345: 341: 337: 315: 311: 307: 297: 293: 262: 258: 254: 251: 248: 243: 239: 232: 229: 221: 217: 212: 198: 178: 158: 138: 118: 98: 81: 78: 71: 62: 60: 56: 52: 48: 44: 40: 33: 19: 1796: 1731:Suffix array 1637:Aho–Corasick 1548:Lee distance 1408: 1401: 1381: 1375: 1360: 1354: 1249: 1245:S-z)..(ret)] 1205: 1159: 1154: 1150: 1146: 1143: 1139: 1135: 1131: 1127: 1123: 1119: 1115: 1111: 1107: 1103: 1000: 992: 970:time with a 425: 219: 215: 213: 90: 79: 76: 42: 36: 1741:Suffix tree 1298:instead of 978:Suffix tree 691:time using 1832:Categories 1368:References 1227:. The set 911:time with 862:space and 422:Algorithms 151:of length 111:of length 49:that is a 1075:Θ 1039:Θ 1009:Θ 942:⋯ 923:Θ 915:and take 886:⋯ 870:Θ 834:⋯ 815:Θ 783:Θ 739:⁡ 728:σ 725:⁡ 657:⁡ 644:σ 641:⁡ 572:⁡ 535:σ 475:Θ 418:strings. 383:≤ 377:≤ 338:∑ 252:… 51:substring 1343:See also 1108:function 525:word RAM 280:, where 65:Examples 1807:Sorting 1777:Parsing 1497:Strings 1416:  1204:which 1155:return 1130:i = 1 1096:time. 775:costs 47:string 1770:Other 1726:DAFSA 1693:BLAST 1357:-gram 1112:array 1761:Trie 1751:Rope 1414:ISBN 1211:and 1200:and 1157:ret 1151:else 1144:else 1136:else 446:and 330:and 191:and 131:and 57:and 41:, a 1386:doi 1268:min 1241:ret 1229:ret 1221:ret 1120:for 1116:for 736:log 722:log 654:log 638:log 569:log 466:in 37:In 1834:: 1330:) 1247:. 1243:, 1147:if 1140:if 1132:or 1128:if 1124:if 974:. 211:. 61:. 1489:e 1482:t 1475:v 1422:. 1394:. 1388:: 1361:n 1355:n 1318:) 1315:r 1312:n 1309:( 1306:O 1286:) 1283:) 1280:n 1277:, 1274:r 1271:( 1265:( 1262:O 1237:S 1233:i 1225:z 1217:z 1213:j 1209:i 1202:T 1198:S 1194:L 1180:) 1177:r 1174:n 1171:( 1168:O 1084:) 1081:N 1078:( 1051:) 1048:K 1045:N 1042:( 1018:) 1015:N 1012:( 958:) 953:K 949:n 945:+ 939:+ 934:1 930:n 926:( 899:) 894:K 890:n 881:1 877:n 873:( 850:) 845:K 841:n 837:+ 831:+ 826:1 822:n 818:( 795:) 792:m 789:n 786:( 758:) 754:) 751:m 748:+ 745:n 742:( 732:/ 719:) 716:m 713:+ 710:n 707:( 703:( 699:O 678:) 672:) 669:m 666:+ 663:n 660:( 648:/ 635:) 632:m 629:+ 626:n 623:( 619:( 615:O 592:) 587:) 584:m 581:+ 578:n 575:( 564:( 560:o 556:2 507:) 504:m 501:+ 498:n 495:( 454:T 434:S 406:k 386:K 380:k 374:2 354:N 351:= 346:i 342:n 316:i 312:n 308:= 304:| 298:i 294:S 289:| 268:} 263:K 259:S 255:, 249:, 244:1 240:S 236:{ 233:= 230:S 218:- 216:k 199:T 179:S 159:n 139:T 119:m 99:S 34:. 20:)

Index

Longest common substring problem
longest common subsequence
computer science
string
substring
data deduplication
plagiarism detection

Θ {\displaystyle \Theta }
generalized suffix tree
word RAM
dynamic programming
dynamic programming
generalized suffix tree

Generalized suffix tree
generalized suffix tree
lowest common ancestor
Longest palindromic substring
n-gram
doi
10.4230/LIPIcs.ESA.2021.30
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
ISBN
0-521-58519-8
Dictionary of Algorithms and Data Structures: longest common substring
Perl/XS implementation of the dynamic programming algorithm
Perl/XS implementation of the suffix tree algorithm
Dynamic programming implementations in various languages on wikibooks
working AS3 implementation of the dynamic programming algorithm

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.

↑