Knowledge

Variable-width encoding

scan the text from the beginning of all definitive sequences in order to identify the various units and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if the hexadecimal values DE, DF, E0, and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the sequence DE DF E0 E1, which consists of two consecutive two-unit sequences. There is also the danger that a single corrupted or lost unit may render the whole interpretation of a large run of multiunit sequences incorrect. In a variable-width encoding where all three types of units are disjunct, string searching always works without false positives, and (provided the decoder is well written) the corruption or loss of one unit corrupts only one character.
marked by having the most significant bit set, that is, being in the range 80–FF (hexadecimal), while the singletons were in the range 00–7F alone. The lead units and trail units were in the range A1–FE (hexadecimal), that is, the same as their range in the ISO 2022 encodings, but with the high bit set to 1. These encodings were reasonably easy to work with provided all your delimiters were ASCII
In the original version of UTF-8, from its 1992 publication until its code space was restricted to that of UTF-16 in 2003, the range of lead units encoding three-unit trailing sequences was larger (F0–F7); additionally, the lead units F8–FB were followed by four trail units, and FC–FD by five. FE–FF
used the range 21–7E (hexadecimal) for both lead units and trail units, and marked them off from the singletons by using ISO 2022 escape sequences to switch between single-byte and multibyte mode. A total of 8,836 (94×94) characters could be encoded at first, and further sets of 94×94 characters with
operating system, the first to implement Unicode throughout, abandoned it and replaced it with a much better designed variable-width encoding for Unicode: UTF-8, in which singletons have the range 00–7F, lead units have the range C0–FD (now actually C2–F4, to avoid overlong sequences and to maintain
On Unix platforms, the ISO 2022 7-bit encodings were replaced by a set of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were
with an existing constraint. For example, with one byte (8 bits) per character, one can encode 256 possible characters; in order to encode more than 256 characters, the obvious choice would be to use two or more bytes per encoding unit. Two bytes (16 bits) would allow 65,536 possible characters, but
UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000–D7FF (55,296 code points) and E000–FFFF (8,192 code points, 63,488 in total), lead units the range D800–DBFF
UTF-8 makes it easy for a program to identify the three sorts of units, since they fall into separate value ranges. Older variable-width encodings are typically not as well designed, since the ranges may overlap. A text processing application that deals with the variable-width encoding must then
respectively. In Shift-JIS, lead units had the range 81–9F and E0–FC, trail units had the range 40–7E and 80–FC, and singletons had the range 21–7E and A1–DF. In Big5, lead units had the range A1–FE, trail units had the range 40–7E and A1–FE, and singletons had the range 21–7E (all values in
Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: singletons, which consist of a single unit, lead units, which come first in a multiunit sequence, and trail units, which come afterwards in a multiunit sequence. Input and display software obviously needs to know about the structure of the multibyte encoding scheme, but other software generally doesn't need to know if a pair of bytes represent two separate characters or just one character.
As a real-life example of this, UTF-16, which represents the most common characters in exactly the manner just described (and uses pairs of 16-bit code units for less-common characters) never gained traction as an encoding for text intended for interchange due to its incompatibility with the ubiquitous 7-/8-bit
, respectively, in Unicode terminology, map 1024×1024 or 1,048,576 supplementary characters, making 1,112,064 (63,488 BMP code points + 1,048,576 code points represented by high and low surrogate pairs) encodable code points, or
The first use of multibyte encodings was for the encoding of Chinese, Japanese and Korean, which have large character sets well in excess of 256 characters. At first the encoding was constrained to the limit of 7 bits. The ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings
switching. The ISO 2022 encoding schemes for CJK are still in use on the Internet. The stateful nature of these encodings and the large overlap make them very awkward to process.
hexadecimal). This overlap again made processing tricky, though at least most of the symbols had unique byte values (though strangely the backslash does not).
article), and trail units have the range 80–BF. The lead unit also tells how many trail units follow: one after C2–DF, two after E0–EF and three after F0–F4.
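Because the UTF-8 unit ranges do not overlap, a single unit can be classified without looking at its neighbours. A minimal sketch follows, assuming the current ranges quoted in this article (singletons 00–7F, trail units 80–BF, lead units C2–F4); the function names are illustrative only and do not come from any library.

```python
def classify_utf8_unit(unit: int) -> str:
    """Classify one byte using the disjoint UTF-8 ranges quoted in the text."""
    if 0x00 <= unit <= 0x7F:
        return "singleton"
    if 0x80 <= unit <= 0xBF:
        return "trail"
    if 0xC2 <= unit <= 0xF4:
        return "lead"
    return "invalid"  # C0, C1 and F5-FF do not occur in current UTF-8


def trail_units_after(lead: int) -> int:
    """Number of trail units announced by a lead unit."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    raise ValueError(f"{lead:#04x} is not a UTF-8 lead unit")
```

A decoder that loses its place can skip forward until `classify_utf8_unit` stops returning `"trail"`. This is why the loss of one unit damages only one character.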
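The Unicode standard has two variable-width encodings: UTF-8 and UTF-16 (it also has a fixed-width encoding, UTF-32). Originally, both the Unicode and ISO 10646 standards were meant to be fixed-width, with Unicode being 16-bit and ISO 10646 being 32-bit. ISO 10646 provided a variable-width encoding called UTF-1, in which singletons had the range 00–9F, lead units the range A0–FF and trail units the ranges A0–FF and 21–7E. Because of this bad design, similar to Shift JIS and Big5 in its overlap of values, the inventors of the Plan 9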
characters and you avoided truncating strings to fixed lengths, but a break in the middle of a multibyte character could still cause major corruption.
On the PC (DOS and Microsoft Windows platforms), two encodings became established for Japanese and Traditional Chinese in which all of singletons, lead units and trail units overlapped: Shift-JIS and Big5
Early variable-width encodings using less than a byte per character were sometimes used to pack English text into fewer bytes in adventure games for early microcomputers. However, disks (which, unlike tapes, allowed random access, so that text could be loaded on demand), increases in computer memory and general-purpose compression algorithms have rendered such tricks largely obsolete.
Multibyte encodings are usually the result of a need to increase the number of characters which can be encoded without breaking backward compatibility
(1024 code points) and trail units the range DC00–DFFF (1024 code points, 2048 in total). The lead and trail units, called high surrogates and low surrogates
This article is about the storage of text in computers. For the transmission of data across noisy channels, see variable-length code.
2075: 668: 395: 265:
are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.
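The unit breakdown can be checked with Python's built-in UTF-8 codec. This is a small sketch, assuming the sample string I♥NY with the heart written as U+2665; the variable names and the `kind` helper are illustrative.

```python
text = "I\u2665NY"                 # four characters: I, heart (U+2665), N, Y
encoded = text.encode("utf-8")     # six units (bytes) in UTF-8

# Hexadecimal view of the encoded units.
hex_units = " ".join(f"{unit:02X}" for unit in encoded)


def kind(unit: int) -> str:
    """Label a unit by UTF-8 value range (00-7F, 80-BF, otherwise lead)."""
    if unit <= 0x7F:
        return "singleton"
    if unit <= 0xBF:
        return "trail"
    return "lead"


kinds = [kind(unit) for unit in encoded]
```

Here `hex_units` is `49 E2 99 A5 4E 59`. Four characters occupy six units because the heart alone takes one lead unit and two trail units.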
such a change would break compatibility with existing systems and therefore might not be feasible at all.
A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set (a repertoire of symbols) for representation, usually in a computer. Most common variable-width encodings are multibyte encodings, which use varying numbers of bytes (octets) to encode different characters. (Some authors, notably in Microsoft documentation, use the term multibyte character set, which is a misnomer, because representation size is an attribute of the encoding, not of the character set.)
The concept long precedes the advent of the electronic computer, however, as seen with Morse code.
were never valid lead or trail units in any version of UTF-8.
Crispin, M. (1 April 2005). UTF-9 and UTF-18 Efficient Transformation Formats of Unicode. doi:10.17487/rfc4042.
synchronism with the encoding capacity of UTF-16; see the UTF-8
ASCII encoding, with its intended role instead being taken by UTF-8, which does preserve ASCII compatibility.
scalar values in Unicode parlance (surrogates are not encodable).
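The UTF-16 code point counts quoted in this article can be reproduced directly from the unit ranges. The sketch below also splits a supplementary code point into a surrogate pair. The article does not spell out the pair arithmetic (subtract 10000 hexadecimal, then split into two 10-bit halves), so that part is supplied here as the standard UTF-16 rule. The function name is illustrative.

```python
# Sizes of the UTF-16 unit ranges given in the text.
singletons = (0xD7FF - 0x0000 + 1) + (0xFFFF - 0xE000 + 1)   # 55,296 + 8,192
lead_units = 0xDBFF - 0xD800 + 1                             # high surrogates
trail_units = 0xDFFF - 0xDC00 + 1                            # low surrogates

supplementary = lead_units * trail_units                     # 1024 x 1024 pairs
scalar_values = singletons + supplementary                   # encodable code points


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (lead unit, trail unit)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000            # 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)
```

For instance, `to_surrogate_pair(0x1F600)` gives the units D83D and DE00, which matches the big-endian output of Python's own `utf-16-be` codec for that character.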
See also

wchar_t wide characters
Lotus Multi-Byte Character Set (LMBCS)
Triple-Byte Character Set (TBCS)
Double-byte character set (DBCS)
Single-Byte Character Set (SBCS)


Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.
