Knowledge

Count–min sketch

Source 📝

863:, the Count–min sketch is a linear sketch. That is, given two streams, constructing a sketch on each stream and summing the sketches yields the same result as concatenating the streams and constructing a sketch on the concatenated streams. This makes the sketch mergeable and appropriate for use in distributed settings in addition to streaming ones. 890:(MLE) was derived in Ting. By using the MLE, the estimator is always able to match or better the min estimator and works well even if the distribution is not skewed. This paper also showed the hCount* debiasing operation is a bootstrapping procedure that can be efficiently computed without random sampling and can be generalized to any estimator. 893:
Since errors arise from hash collisions with unknown items from the universe, several approaches correct for the collisions when multiple elements of the universe are known or queried for simultaneously . For each of these, a large proportion of the universe must be known to observe a significant
875:
of the true frequency of events: they may overestimate, but never underestimate the true count in a point query. Furthermore, while the min estimator works well when the distribution is highly skewed, other sketches such as the Count sketch based on means are more accurate when the distribution is
218:. However, they are used differently and therefore sized differently: a count–min sketch typically has a sublinear number of cells, related to the desired approximation quality of the sketch, while a counting Bloom filter is more typically sized to match the number of elements in the set. 226:
The goal of the basic version of the count–min sketch is to consume a stream of events, one at a time, and count the frequency of the different types of events in the stream. At any time, the sketch can be queried for the frequency of a particular event type
1200:
The following discussion assumes that only "positive" events occur, i.e., the frequency of the various types cannot decrease over time. Modifications of the following algorithms exist for the more general case where frequencies are allowed to
1156: 1003: 494: 677: 756: 585: 850: 809: 528: 253: 703: 612: 883:
estimator repeatedly randomly selects d random entries in the sketch and takes the minimum to obtain an unbiased estimate of the bias and subtracts it off.
1008: 215: 274:
are fixed when the sketch is created, and determine the time and space needs and the probability of error when the sketch is queried for a frequency or
192: 207: 1248: 255:, and will return an estimate of this frequency that is within a certain distance of the true frequency, with a certain probability. 139: 1547: 23: 911: 402: 624: 876:
not sufficiently skewed. Several variations on the sketch have been proposed to reduce error and reduce or eliminate bias.
1552: 708: 1404: 1444:
Lu, Yi; Montanari, Andrea; Prabhakar, Balaji; Dharmapurikar, Sarang; Kabbani, Abdul (2008). "Counter braids".
541: 887: 92: 814: 773: 132: 1176: 499: 1512: 1493: 1430: 1387:
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
1368: 234: 1310: 856:
Small modifications to the data structure can be used to sketch other different stream statistics.
1504:
Popularity is everything: A new approach to protecting passwords from statistical-guessing attacks
1507: 1488: 1425: 1363: 1305: 682: 196: 125: 283: 211: 1271: 1226: 1162:. While this update procedure makes the sketch not a linear sketch, it is still mergeable. 872: 871:
One potential problem with the usual min estimator for count–min sketches is that they are
590: 165: 108: 31: 8: 1483:
Dwork, Cynthia; Naor, Moni; Pitassi, Toniann; Rothblum, Guy N.; Yekhanin, Sergey (2010).
172: 1527: 1323: 1289: 82: 1461: 1400: 338: 1151:{\displaystyle \mathrm {count} \leftarrow \max\{\mathrm {count} ,{\hat {a_{i}}}+c\}} 1542: 1453: 1390: 1327: 1315: 1263: 184: 1267: 1171: 59: 188: 168: 35: 1536: 1465: 1293: 767: 352:
of the table, apply the corresponding hash function to obtain a column index
275: 176: 1457: 1395: 1249:"An Improved Data Stream Summary: The Count-Min Sketch and its Applications" 1342: 860: 203: 49: 44: 1296:(2000), "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol", 113: 73: 758:
is the stream size, i.e. the total number of items seen by the sketch.
320:, where the error in answering a query is within an additive factor of 180: 1319: 153: 64: 1422:
New estimation algorithms for streaming data: Count-min can do more
395:. The estimated count is given by the least value in the table for 1502:
Schechter, Stuart; Herley, Cormac; Mitzenmacher, Michael (2010).
1443: 1181: 258:
The actual sketch data structure is a two-dimensional array of
770:
between the histograms represented by two count–min sketches,
282:
rows is a separate hash function; the hash functions must be
87: 1501: 900:
changes the update, but not the query algorithms. To count
1360:
Dynamically maintaining frequent items over a data stream
1482: 998:{\displaystyle {\hat {a}}_{i}=\min _{j}\mathrm {count} } 489:{\displaystyle {\hat {a}}_{i}=\min _{j}\mathrm {count} } 1246: 672:{\displaystyle {\hat {a}}_{i}\leq a_{i}+\varepsilon N} 1344:
Sketch algorithms for estimating point queries in NLP
1341:
Goyal, Amit; Daumé, Hal III; Cormode, Graham (2012).
1011: 914: 817: 776: 711: 685: 627: 593: 544: 502: 405: 383:
Several types of queries are possible on the stream.
237: 187:, at the expense of overcounting some events due to 1287: 621:Additionally, this estimate has the guarantee that 1150: 997: 844: 803: 751:{\displaystyle N=\sum _{i\in {\mathcal {U}}}a_{i}} 750: 697: 671: 606: 579: 522: 488: 247: 1340: 1534: 1331:. A preliminary version appeared at SIGCOMM '98. 1063: 938: 429: 171:that serves as a frequency table of events in a 16:Probabilistic data structure in computer science 191:. The count–min sketch was invented in 2003 by 1437: 1506:. USENIX Workshop on Hot Topics in Security. 1357: 210:and can be considered an implementation of a 133: 1446:ACM SIGMETRICS Performance Evaluation Review 1358:Jin, C.; Qian, W.; Xu, X.; Zhou, A. (2003), 1145: 1066: 391:, which asks for the count of an event type 348:arrives we update as follows: for each row 179:to map events to frequencies, but unlike a 1419: 1380: 1378: 1247:Cormode, Graham; S. Muthukrishnan (2005). 866: 140: 126: 1511: 1492: 1429: 1394: 1367: 1309: 580:{\displaystyle a_{i}\leq {\hat {a}}_{i}} 1375: 1224: 199:and described by them in a 2005 paper. 1535: 1220: 1218: 202:Count–min sketch is an alternative to 1384: 1334: 845:{\displaystyle \mathrm {count} _{b}} 804:{\displaystyle \mathrm {count} _{a}} 1298:IEEE/ACM Transactions on Networking 1215: 13: 1476: 1420:Deng, Fan; Rafiei, Davood (2007), 1385:Ting, Daniel (2018). "Count-Min". 1082: 1079: 1076: 1073: 1070: 1025: 1022: 1019: 1016: 1013: 960: 957: 954: 951: 948: 832: 829: 826: 823: 820: 791: 788: 785: 782: 779: 731: 516: 513: 510: 507: 504: 451: 448: 445: 442: 439: 372:. Then increment the value in row 240: 14: 1564: 1521: 908:, one first computes an estimate 614:is the true frequency with which 221: 1485:Pan-private streaming algorithms 1234:Encyclopedia of Database Systems 523:{\displaystyle \mathrm {count} } 231:from a universe of event types 1413: 1351: 1281: 1240: 1194: 1133: 1114: 1111: 1105: 1086: 1060: 1057: 1054: 1048: 1029: 992: 989: 983: 964: 922: 635: 565: 483: 480: 474: 455: 413: 278:. Associated with each of the 248:{\displaystyle {\mathcal {U}}} 1: 1548:Probabilistic data structures 1236:. Springer. pp. 511–516. 1208: 93:Rapidly exploring random tree 1268:10.1016/j.jalgor.2003.12.001 888:maximum likelihood estimator 7: 1165: 10: 1569: 1553:Hash-based data structures 1177:Locality-sensitive hashing 880: 698:{\displaystyle 1-\delta } 344:When a new event of type 294:can be chosen by setting 1225:Cormode, Graham (2009). 1187: 904:instances of event type 618:occurred in the stream. 1458:10.1145/1384529.1375472 1396:10.1145/3219819.3219975 867:Reducing bias and error 1389:. pp. 2319–2328. 1152: 999: 846: 805: 752: 699: 673: 608: 581: 524: 490: 249: 214:(Fan et al., 1998) or 197:S. Muthu Muthukrishnan 1153: 1000: 898:Conservative updating 847: 806: 753: 700: 674: 609: 607:{\displaystyle a_{i}} 582: 525: 491: 266:rows. The parameters 250: 212:counting Bloom filter 1347:. Proc. EMNLP/CoNLL. 1009: 912: 879:To remove bias, the 815: 774: 709: 683: 625: 591: 542: 534:Obviously, for each 500: 403: 387:The simplest is the 284:pairwise independent 235: 109:Randomized algorithm 1288:Fan, Li; Cao, Pei; 764:inner product query 1227:"Count-min sketch" 1148: 995: 946: 842: 801: 748: 737: 695: 669: 604: 577: 520: 486: 437: 245: 83:Random binary tree 1320:10.1109/90.851975 1136: 937: 925: 873:biased estimators 718: 679:with probability 638: 568: 428: 416: 331:(see below), and 324:with probability 286:. The parameters 216:multistage-filter 150: 149: 1560: 1517: 1515: 1498: 1496: 1470: 1469: 1441: 1435: 1434: 1433: 1417: 1411: 1410: 1398: 1382: 1373: 1372: 1371: 1355: 1349: 1348: 1338: 1332: 1330: 1313: 1290:Almeida, Jussara 1285: 1279: 1278: 1276: 1270:. Archived from 1253: 1244: 1238: 1237: 1231: 1222: 1202: 1198: 1161: 1157: 1155: 1154: 1149: 1138: 1137: 1132: 1131: 1122: 1104: 1103: 1085: 1047: 1046: 1028: 1004: 1002: 1001: 996: 982: 981: 963: 945: 933: 932: 927: 926: 918: 907: 903: 882: 851: 849: 848: 843: 841: 840: 835: 810: 808: 807: 802: 800: 799: 794: 757: 755: 754: 749: 747: 746: 736: 735: 734: 704: 702: 701: 696: 678: 676: 675: 670: 659: 658: 646: 645: 640: 639: 631: 617: 613: 611: 610: 605: 603: 602: 586: 584: 583: 578: 576: 575: 570: 569: 561: 554: 553: 537: 529: 527: 526: 521: 519: 495: 493: 492: 487: 473: 472: 454: 436: 424: 423: 418: 417: 409: 398: 394: 379: 375: 371: 351: 347: 336: 330: 323: 319: 308: 293: 289: 281: 273: 269: 265: 261: 254: 252: 251: 246: 244: 243: 230: 185:sub-linear space 158:count–min sketch 142: 135: 128: 55:Count–min sketch 19: 18: 1568: 1567: 1563: 1562: 1561: 1559: 1558: 1557: 1533: 1532: 1524: 1513:10.1.1.170.9356 1494:10.1.1.165.5923 1479: 1477:Further reading 1474: 1473: 1442: 1438: 1431:10.1.1.552.1283 1418: 1414: 1407: 1383: 1376: 1369:10.1.1.151.5909 1356: 1352: 1339: 1335: 1286: 1282: 1274: 1251: 1245: 1241: 1229: 1223: 1216: 1211: 1206: 1205: 1199: 1195: 1190: 1172:Feature hashing 1168: 1159: 1127: 1123: 1121: 1120: 1099: 1095: 1069: 1042: 1038: 1012: 1010: 1007: 1006: 1005:, then updates 977: 973: 947: 941: 928: 917: 916: 915: 913: 910: 909: 905: 901: 869: 836: 819: 818: 816: 813: 812: 795: 778: 777: 775: 772: 771: 742: 738: 730: 729: 722: 710: 707: 706: 684: 681: 680: 654: 650: 641: 630: 629: 628: 626: 623: 622: 615: 598: 594: 592: 589: 588: 571: 560: 559: 558: 549: 545: 543: 540: 539: 535: 503: 501: 498: 497: 468: 464: 438: 432: 419: 408: 407: 406: 404: 401: 400: 396: 392: 377: 373: 365: 353: 349: 345: 332: 325: 321: 310: 295: 291: 287: 279: 271: 267: 263: 259: 239: 238: 236: 233: 232: 228: 224: 146: 60:Quotient filter 36:data structures 34: 17: 12: 11: 5: 1566: 1556: 1555: 1550: 1545: 1531: 1530: 1523: 1522:External links 1520: 1519: 1518: 1499: 1478: 1475: 1472: 1471: 1452:(1): 121–132. 1436: 1412: 1405: 1374: 1350: 1333: 1311:10.1.1.41.1487 1304:(3): 281–293, 1294:Broder, Andrei 1280: 1277:on 2023-05-25. 1239: 1213: 1212: 1210: 1207: 1204: 1203: 1192: 1191: 1189: 1186: 1185: 1184: 1179: 1174: 1167: 1164: 1147: 1144: 1141: 1135: 1130: 1126: 1119: 1116: 1113: 1110: 1107: 1102: 1098: 1094: 1091: 1088: 1084: 1081: 1078: 1075: 1072: 1068: 1065: 1062: 1059: 1056: 1053: 1050: 1045: 1041: 1037: 1034: 1031: 1027: 1024: 1021: 1018: 1015: 994: 991: 988: 985: 980: 976: 972: 969: 966: 962: 959: 956: 953: 950: 944: 940: 936: 931: 924: 921: 868: 865: 854: 853: 839: 834: 831: 828: 825: 822: 798: 793: 790: 787: 784: 781: 745: 741: 733: 728: 725: 721: 717: 714: 694: 691: 688: 668: 665: 662: 657: 653: 649: 644: 637: 634: 601: 597: 574: 567: 564: 557: 552: 548: 532: 531: 518: 515: 512: 509: 506: 485: 482: 479: 476: 471: 467: 463: 460: 457: 453: 450: 447: 444: 441: 435: 431: 427: 422: 415: 412: 361: 339:Euler's number 242: 223: 222:Data structure 220: 193:Graham Cormode 177:hash functions 173:stream of data 169:data structure 148: 147: 145: 144: 137: 130: 122: 119: 118: 117: 116: 111: 103: 102: 98: 97: 96: 95: 90: 85: 77: 76: 70: 69: 68: 67: 62: 57: 52: 47: 39: 38: 28: 27: 15: 9: 6: 4: 3: 2: 1565: 1554: 1551: 1549: 1546: 1544: 1541: 1540: 1538: 1529: 1528:Count–min FAQ 1526: 1525: 1514: 1509: 1505: 1500: 1495: 1490: 1487:. Proc. ICS. 1486: 1481: 1480: 1467: 1463: 1459: 1455: 1451: 1447: 1440: 1432: 1427: 1423: 1416: 1408: 1406:9781450355520 1402: 1397: 1392: 1388: 1381: 1379: 1370: 1365: 1361: 1354: 1346: 1345: 1337: 1329: 1325: 1321: 1317: 1312: 1307: 1303: 1299: 1295: 1291: 1284: 1273: 1269: 1265: 1261: 1257: 1256:J. Algorithms 1250: 1243: 1235: 1228: 1221: 1219: 1214: 1197: 1193: 1183: 1180: 1178: 1175: 1173: 1170: 1169: 1163: 1158:for each row 1142: 1139: 1128: 1124: 1117: 1108: 1100: 1096: 1092: 1089: 1051: 1043: 1039: 1035: 1032: 986: 978: 974: 970: 967: 942: 934: 929: 919: 899: 895: 891: 889: 884: 877: 874: 864: 862: 857: 837: 796: 769: 768:inner product 766:asks for the 765: 761: 760: 759: 743: 739: 726: 723: 719: 715: 712: 692: 689: 686: 666: 663: 660: 655: 651: 647: 642: 632: 619: 599: 595: 572: 562: 555: 550: 546: 530:is the table. 477: 469: 465: 461: 458: 433: 425: 420: 410: 390: 386: 385: 384: 381: 369: 364: 360: 356: 342: 340: 335: 329: 317: 313: 306: 302: 298: 285: 277: 276:inner product 256: 219: 217: 213: 209: 205: 200: 198: 194: 190: 186: 182: 178: 174: 170: 167: 166:probabilistic 163: 159: 155: 143: 138: 136: 131: 129: 124: 123: 121: 120: 115: 112: 110: 107: 106: 105: 104: 100: 99: 94: 91: 89: 86: 84: 81: 80: 79: 78: 75: 72: 71: 66: 63: 61: 58: 56: 53: 51: 48: 46: 43: 42: 41: 40: 37: 33: 32:Probabilistic 30: 29: 25: 21: 20: 1503: 1484: 1449: 1445: 1439: 1421: 1415: 1386: 1359: 1353: 1343: 1336: 1301: 1297: 1283: 1272:the original 1259: 1255: 1242: 1233: 1196: 897: 896: 892: 885: 878: 870: 861:count sketch 858: 855: 763: 620: 533: 388: 382: 367: 362: 358: 354: 343: 333: 327: 315: 311: 304: 300: 296: 262:columns and 257: 225: 204:count sketch 201: 161: 157: 151: 74:Random trees 54: 50:Count sketch 45:Bloom filter 389:point query 114:HyperLogLog 1537:Categories 1209:References 538:, one has 326:1 − 208:AMS sketch 189:collisions 183:uses only 181:hash table 175:. It uses 1508:CiteSeerX 1489:CiteSeerX 1466:0163-5999 1426:CiteSeerX 1364:CiteSeerX 1306:CiteSeerX 1262:: 29–38. 1201:decrease. 1134:^ 1061:← 923:^ 894:benefit. 859:Like the 727:∈ 720:∑ 693:δ 690:− 664:ε 648:≤ 636:^ 566:^ 556:≤ 414:^ 399:, namely 376:, column 162:CM sketch 154:computing 65:Skip list 1166:See also 705:, where 587:, where 496:, where 380:by one. 314:= ⌈ln 1/ 24:a series 22:Part of 1543:Hashing 1328:4779754 1182:MinHash 881:hCount* 164:) is a 101:Related 1510:  1491:  1464:  1428:  1403:  1366:  1326:  1308:  328:δ 156:, the 1324:S2CID 1275:(PDF) 1252:(PDF) 1230:(PDF) 1188:Notes 88:Treap 1462:ISSN 1401:ISBN 811:and 309:and 290:and 270:and 206:and 195:and 1454:doi 1391:doi 1316:doi 1264:doi 1064:max 939:min 762:An 430:min 337:is 299:= ⌈ 152:In 1539:: 1460:. 1450:36 1448:. 1424:, 1399:. 1377:^ 1362:, 1322:, 1314:, 1300:, 1292:; 1260:55 1258:. 1254:. 1232:. 1217:^ 886:A 357:= 341:. 26:on 1516:. 1497:. 1468:. 1456:: 1409:. 1393:: 1318:: 1302:8 1266:: 1160:j 1146:} 1143:c 1140:+ 1129:i 1125:a 1118:, 1115:] 1112:) 1109:i 1106:( 1101:j 1097:h 1093:, 1090:j 1087:[ 1083:t 1080:n 1077:u 1074:o 1071:c 1067:{ 1058:] 1055:) 1052:i 1049:( 1044:j 1040:h 1036:, 1033:j 1030:[ 1026:t 1023:n 1020:u 1017:o 1014:c 993:] 990:) 987:i 984:( 979:j 975:h 971:, 968:j 965:[ 961:t 958:n 955:u 952:o 949:c 943:j 935:= 930:i 920:a 906:i 902:c 852:. 838:b 833:t 830:n 827:u 824:o 821:c 797:a 792:t 789:n 786:u 783:o 780:c 744:i 740:a 732:U 724:i 716:= 713:N 687:1 667:N 661:+ 656:i 652:a 643:i 633:a 616:i 600:i 596:a 573:i 563:a 551:i 547:a 536:i 517:t 514:n 511:u 508:o 505:c 484:] 481:) 478:i 475:( 470:j 466:h 462:, 459:j 456:[ 452:t 449:n 446:u 443:o 440:c 434:j 426:= 421:i 411:a 397:i 393:i 378:k 374:j 370:) 368:i 366:( 363:j 359:h 355:k 350:j 346:i 334:e 322:ε 318:⌉ 316:δ 312:d 307:⌉ 305:ε 303:/ 301:e 297:w 292:d 288:w 280:d 272:d 268:w 264:d 260:w 241:U 229:i 160:( 141:e 134:t 127:v

Index

a series
Probabilistic
data structures
Bloom filter
Count sketch
Count–min sketch
Quotient filter
Skip list
Random trees
Random binary tree
Treap
Rapidly exploring random tree
Randomized algorithm
HyperLogLog
v
t
e
computing
probabilistic
data structure
stream of data
hash functions
hash table
sub-linear space
collisions
Graham Cormode
S. Muthu Muthukrishnan
count sketch
AMS sketch
counting Bloom filter

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.