Canonicalization

Not to be confused with Canonical link element or Canonization.

In computer science, canonicalization (sometimes standardization or normalization) is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

Usage cases

Filenames

Files in file systems may in most cases be accessed through multiple filenames. For instance, in Unix-like systems, the string "/./" can be replaced by "/". On POSIX systems, the C library function realpath() performs this task. Other operations performed by this function to canonicalize filenames are the handling of /.. components referring to parent directories, simplification of sequences of multiple slashes, removal of trailing slashes, and the resolution of symbolic links.

Canonicalization of filenames is important for computer security. For example, a web server may have a restriction that only files under the cgi directory C:\inetpub\wwwroot\cgi-bin may be executed. This rule is enforced by checking that the path starts with C:\inetpub\wwwroot\cgi-bin\ and only then executing it. While the file C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe initially appears to be in the cgi directory, it exploits the .. path specifier to traverse back up the directory hierarchy in an attempt to execute a file outside of cgi-bin. Permitting cmd.exe to execute would be an error caused by a failure to canonicalize the filename to the simplest representation, C:\Windows\System32\cmd.exe, and is called a directory traversal vulnerability. With the path canonicalized, it is clear the file should not be executed.

Unicode

In Unicode, many accented letters can be represented in more than one way. For example, é can be represented in Unicode as the Unicode character U+0065 (LATIN SMALL LETTER E) followed by the character U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE). This makes string comparison more complicated, since every possible representation of a string containing such glyphs must be considered. To deal with this, Unicode provides the mechanism of canonical equivalence. In this context, canonicalization is Unicode normalization.

Variable-width encodings in the Unicode standard, in particular UTF-8, may cause an additional need for canonicalization in some situations. Namely, by the standard, in UTF-8 there is only one valid byte sequence for any Unicode character, but some byte sequences are invalid, i.e., they cannot be obtained by encoding any string of Unicode characters into UTF-8. Some sloppy decoder implementations may accept invalid byte sequences as input and produce a valid Unicode character as output for such a sequence. If one uses such a decoder, some Unicode characters effectively have more than one corresponding byte sequence: the valid one and some invalid ones. This could lead to security issues similar to the one described in the previous section. Therefore, if one wants to apply some filter (e.g., a regular expression written in UTF-8) to UTF-8 strings that will later be passed to a decoder that allows invalid byte sequences, one should canonicalize the strings before passing them to the filter. In this context, canonicalization is the process of translating every string character to its single valid byte sequence. An alternative to canonicalization is to reject any strings containing invalid byte sequences.

URL

A canonical URL is a URL for defining the single source of truth for duplicate content.

Use by Google

A canonical URL is the URL of the page that Google thinks is most representative from a set of duplicate pages on your site. For example, if you have URLs for the same page (for example https://example.com/?dress=1234 and https://example.com/dresses/1234), Google chooses one as canonical. Note that the pages do not need to be absolutely identical; minor changes in sorting or filtering of list pages do not make the page unique (for example, sorting by price or filtering by item color).

The canonical can be in a different domain than a duplicate.

Internet

With the help of canonical URLs, a search engine knows which link should be provided in a query result.

A canonical link element can be used to define a canonical URL.

Intranet

In intranets, manual searching for information is predominant. In this case, canonical URLs can also be defined in a non-machine-readable form, for example in a guideline.

Misc

Canonical URLs are usually the URLs that are used for the share action.

Since the canonical URL is used in the results of search engines, it is in most cases a landing page.

Search engines and SEO

In web search and search engine optimization (SEO), URL canonicalization deals with web content that has more than one possible URL. Having multiple URLs for the same web content can cause problems for search engines, specifically in determining which URL should be shown in search results. Most search engines support the Canonical link element as a hint to which URL should be treated as the true version. As indicated by John Mueller of Google, having other directives in a page, like the robots noindex element, can give search engines conflicting signals about how to handle canonicalization.

Example:

http://wikipedia.com
http://www.wikipedia.com
http://www.wikipedia.com/
http://www.wikipedia.com/?source=asdf

All of these URLs point to the homepage of Wikipedia, but a search engine will only consider one of them to be the canonical form of the URL.

XML

A Canonical XML document is by definition an XML document that is in XML Canonical form, defined by The Canonical XML specification. Briefly, canonicalization removes whitespace within tags, uses particular character encodings, sorts namespace references and eliminates redundant ones, removes XML and DOCTYPE declarations, and transforms relative URIs into absolute URIs.

A simple example would be the following two snippets of XML:

<node1 x='1' a="2">Data</node1    > <node2>Data</node2>

<node1 a="2" x="1">Data</node1> <node2>Data</node2>

The first example contains extra spaces in the closing tag of the first node. The second example, which has been canonicalized, has had these spaces removed; its attributes are also double-quoted and sorted. Note that only the spaces within the tags are removed under W3C canonicalization, not those between tags.

A full summary of canonicalization changes is listed below:

The document is encoded in UTF-8
Line breaks are normalized to #xA on input, before parsing
Attribute values are normalized, as if by a validating processor
Character and parsed entity references are replaced
CDATA sections are replaced with their character content
The XML declaration and document type declaration are removed
Empty elements are converted to start-end tag pairs
Whitespace outside of the document element and within start and end tags is normalized
All whitespace in character content is retained (excluding characters removed during line feed normalization)
Attribute value delimiters are set to quotation marks (double quotes)
Special characters in attribute values and character content are replaced by character references
Superfluous namespace declarations are removed from each element
Default attributes are added to each element
Fixup of xml:base attributes is performed
Lexicographic order is imposed on the namespace declarations and attributes of each element

Computational linguistics

In morphology and lexicography, a lemma is the canonical form of a set of words. In English, for example, run, runs, ran, and running are forms of the same lexeme, so we can select one of them, for example run, to represent all the forms. Lexical databases such as Unitex use this kind of representation.

Lemmatisation is the process of converting a word to its canonical form.

See also

Canonical form
Graph canonization
Lemmatisation
Text normalization
Type species

References

RFC 2279: UTF-8, a transformation format of ISO 10646
"Consolidate Duplicate URLs with Canonicals | Google Search Central".
Cutts, Matt (4 January 2006). "SEO advice: url canonicalization". Matt Cutts: Gadgets, Google, and SEO. Retrieved 3 September 2013.
"Canonicalized URL is noindex, nofollow". Retrieved 20 April 2020.
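As an illustration of the cgi-bin scenario described under Filenames, the following Python sketch contrasts a raw prefix check with a check made after canonicalization. It is only a sketch under stated assumptions: the function names naive_is_allowed and is_allowed and the file name script.exe are hypothetical, and ntpath.normpath canonicalizes purely lexically, so unlike realpath() it does not resolve symbolic links.

```python
import ntpath  # Windows path rules; importable on any platform

CGI_ROOT = r"C:\inetpub\wwwroot\cgi-bin"
ATTACK = r"C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe"


def naive_is_allowed(path: str) -> bool:
    """Flawed check: compares the raw, non-canonical string prefix."""
    return path.startswith(CGI_ROOT + "\\")


def is_allowed(path: str) -> bool:
    """Canonicalize lexically first, then test containment in CGI_ROOT."""
    root = ntpath.normcase(ntpath.normpath(CGI_ROOT))
    target = ntpath.normcase(ntpath.normpath(path))
    try:
        # The target must lie strictly below the root directory.
        return target != root and ntpath.commonpath([root, target]) == root
    except ValueError:
        # Raised for different drives or a mix of absolute and relative paths.
        return False
```

With these definitions, naive_is_allowed(ATTACK) accepts the traversal path because the raw string starts with the cgi-bin prefix, while is_allowed(ATTACK) rejects it because the canonical form is C:\Windows\System32\cmd.exe.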

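The two Unicode issues described above can be sketched in Python with the standard unicodedata module and the built-in strict UTF-8 decoder. The helper names canonically_equal and is_valid_utf8 are hypothetical, and choosing NFC rather than NFD as the normalization form is an assumption; either form works for equivalence testing as long as both sides use the same one.

```python
import unicodedata

DECOMPOSED = "e\u0301"   # U+0065 followed by U+0301 (combining acute accent)
PRECOMPOSED = "\u00e9"   # U+00E9 (precomposed e with acute)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def is_valid_utf8(data: bytes) -> bool:
    """Reject byte strings that are not valid UTF-8 instead of repairing them."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

Here DECOMPOSED == PRECOMPOSED is false as a plain comparison, but canonically_equal treats the two as the same text. The byte string b"\xc0\xaf", an overlong two-byte form of "/", is rejected by is_valid_utf8, which corresponds to the "reject invalid byte sequences" alternative rather than to canonicalizing them.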

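The XML example can be reproduced approximately with Python's standard library, which provides xml.etree.ElementTree.canonicalize in Python 3.8 and later. This is a sketch under assumptions: that function implements C14N 2.0, which may differ in details from the Canonical XML version the text refers to, and the <root> wrapper element and the helper name canonical_form are hypothetical additions needed because a well-formed document has a single root element.

```python
import xml.etree.ElementTree as ET

# The two sibling nodes from the example, wrapped in a hypothetical <root>.
RAW = "<root><node1 x='1' a=\"2\">Data</node1    > <node2>Data</node2></root>"


def canonical_form(xml_text: str) -> str:
    """Return the C14N 2.0 canonical serialization of an XML string."""
    return ET.canonicalize(xml_text)
```

Calling canonical_form(RAW) should yield double-quoted, lexicographically ordered attributes and no spaces inside the closing tag, while the space between the two nodes is kept because it is character content.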
Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.