Canonicalization

Variable-width encodings in the Unicode standard, in particular UTF-8, may cause an additional need for canonicalization in some situations. By the standard, UTF-8 has only one valid byte sequence for any Unicode character, but some byte sequences are invalid, i.e., they cannot be obtained by encoding any string of Unicode characters into UTF-8. Some sloppy decoder implementations accept invalid byte sequences as input and produce a valid Unicode character as output for such a sequence. With such a decoder, some Unicode characters effectively have more than one corresponding byte sequence: the valid one and some invalid ones. This could lead to security issues similar to the one described in the previous section. Therefore, if one wants to apply a filter (e.g., a regular expression written in UTF-8) to UTF-8 strings that will later be passed to a decoder that allows invalid byte sequences, one should canonicalize the strings before passing them to the filter. In this context, canonicalization is the process of translating every string character to its single valid byte sequence. An alternative to canonicalization is to reject any strings containing invalid byte sequences.
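The "reject invalid sequences" alternative can be sketched in Python, whose built-in UTF-8 decoder is strict by default. This is a minimal illustration, not a complete input filter; the function name `is_valid_utf8` and the sample bytes are assumptions made for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if `data` is a well-formed UTF-8 byte sequence."""
    try:
        # errors="strict" makes the decoder raise on any invalid sequence
        # instead of silently mapping it to a character.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical inputs: 0xC0 0xAF is an invalid ("overlong") two-byte form
# that a sloppy decoder might turn into "/", whose only valid encoding is 0x2F.
overlong_slash = b"\xc0\xaf"
valid_slash = b"/"

if __name__ == "__main__":
    for sample in (valid_slash, overlong_slash):
        print(sample, "accepted" if is_valid_utf8(sample) else "rejected")
```

Rejecting the input before any filter runs means the filter and the later decoder can never disagree about which characters the bytes represent.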

The letter é can be represented in Unicode as the character U+0065 (LATIN SMALL LETTER E) followed by the character U+0301 (COMBINING ACUTE ACCENT), but it can also be represented as the precomposed character U+00E9 (LATIN SMALL LETTER E WITH ACUTE). This makes string comparison more complicated, since


The first example contains extra spaces in the closing tag of the first node. The second example, which has been canonicalized, has had these spaces removed. Note that only the spaces within the tags are removed under W3C canonicalization, not those between tags.
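The effect described above can be reproduced with the `canonicalize` function in Python's standard `xml.etree.ElementTree` module (available since Python 3.8). Note the assumptions: that function implements the C14N 2.0 variant of the W3C specification, and the input below is a hypothetical document that wraps the two nodes in a single `root` element so that it is well-formed.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input: single-quoted and unsorted attributes, plus extra
# whitespace inside the closing tag of node1.
raw = "<root><node1 x='1' a=\"2\">Data</node1 > <node2>Data</node2></root>"

# With no `out` argument, canonicalize() returns the canonical form as a string.
canonical = canonicalize(raw)

if __name__ == "__main__":
    print(canonical)
```

The whitespace inside the tag disappears and the attributes are sorted and double-quoted, while the single space between the two nodes is character content and is kept.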


In web search and search engine optimization (SEO), URL canonicalization deals with web content that has more than one possible URL. Having multiple URLs for the same web content can cause problems for search engines, specifically in determining which URL should be shown in search results. Most search engines support the canonical link element

Briefly, canonicalization removes whitespace within tags, uses particular character encodings, sorts namespace references and eliminates redundant ones, removes XML and DOCTYPE declarations, and transforms relative URIs into absolute URIs.

A canonical URL is the URL of the page that Google thinks is most representative from a set of duplicate pages on your site. For example, if you have URLs for the same page (for example https://example.com/?dress=1234 and https://example.com/dresses/1234), Google chooses one as canonical. Note that the pages do not need to be absolutely identical; minor changes in sorting or filtering of list pages do not make the page unique (for example, sorting by price or filtering by item color).

This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms

Canonicalization of filenames is important for computer security. For example, a web server may have a restriction that only files under the cgi directory C:\inetpub\wwwroot\cgi-bin

In intranets, manual searching for information is predominant. In this case, canonical URLs can be defined in a non-machine-readable form, too, for example in a guideline.

as a hint to which URL should be treated as the true version. As indicated by John Mueller of Google, having other directives in a page, like the robots noindex


components referring to parent directories, simplification of sequences of multiple slashes, removal of trailing slashes, and the resolution of symbolic links.


All of these URLs point to the homepage of Wikipedia, but a search engine will only consider one of them to be the canonical form of the URL.
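How several URLs collapse into one canonical form can be sketched with Python's `urllib.parse`. The rules below (lowercase the scheme and host, drop a leading `www.`, drop the query string and fragment, use `/` for an empty path) are assumptions chosen for this illustration; real search engines apply their own, more involved heuristics, and `canonical_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Map a URL to one canonical form under the illustrative rules above."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # note: any port is dropped here
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path or "/"  # treat an empty path like "/"
    # Query string and fragment are discarded (an aggressive assumption).
    return urlunsplit((parts.scheme.lower(), host, path, "", ""))


urls = [
    "http://wikipedia.com",
    "http://www.wikipedia.com",
    "http://www.wikipedia.com/",
    "http://www.wikipedia.com/?source=asdf",
]

if __name__ == "__main__":
    for u in urls:
        print(u, "->", canonical_url(u))
```

Dropping every query parameter is only safe when the parameters do not change the content, as with a tracking parameter; a site where the query selects the page would need a more careful rule.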


every possible representation of a string containing such glyphs must be considered. To deal with this, Unicode provides the mechanism of canonical equivalence. In this context, canonicalization is Unicode normalization.
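Unicode normalization is available in Python through the standard `unicodedata` module. The sketch below compares the two representations of é mentioned in this article; the helper name `canonically_equal` is an assumption made for the example.

```python
import unicodedata

decomposed = "e\u0301"   # U+0065 followed by U+0301 (combining acute accent)
precomposed = "\u00e9"   # U+00E9 (precomposed e with acute)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(decomposed == precomposed)                   # raw comparison
    print(canonically_equal(decomposed, precomposed))  # after normalization
```

Either NFC (composed) or NFD (decomposed) works for comparison, as long as both strings are normalized to the same form.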


In the C standard library, the function realpath() performs this task. Other operations performed by this function to canonicalize filenames are the handling of /..

<node1 x='1' a="2">Data</node1 > <node2>Data</node2>

Permitting cmd.exe to execute would be an error caused by a failure to canonicalize the filename to the simplest representation, C:\Windows\System32\cmd.exe, and is called a directory traversal


All whitespace in character content is retained (excluding characters removed during line feed normalization)


With the help of canonical URLs, a search engine knows which link should be provided in a query result.


path specifier to traverse back up the directory hierarchy in an attempt to execute a file outside of cgi-bin.


Since the canonical URL is used in the search results of search engines, it is in most cases a landing page.


Special characters in attribute values and character content are replaced by character references


Lexicographic order is imposed on the namespace declarations and attributes of each element


element can give search engines conflicting signals about how to handle canonicalization.


vulnerability. With the path canonicalized, it is clear the file should not be executed.
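The check described in this example can be sketched with Python's `ntpath` module, which handles Windows-style paths on any platform. This is an illustration under stated assumptions: `ntpath.normpath` only simplifies the path lexically and does not resolve symbolic links (unlike `realpath()`), and the names `CGI_ROOT` and `is_under_cgi_root` as well as the file `script.exe` are hypothetical.

```python
import ntpath

# Directory from which execution is allowed (taken from the example in the text).
CGI_ROOT = r"C:\inetpub\wwwroot\cgi-bin"


def is_under_cgi_root(path: str) -> bool:
    """Canonicalize `path` first, then test the directory prefix."""
    # normpath collapses "." and ".." components and repeated separators;
    # normcase lowercases and unifies separators for comparison.
    canonical = ntpath.normcase(ntpath.normpath(path))
    root = ntpath.normcase(CGI_ROOT)
    return canonical.startswith(root + "\\")


attack = r"C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe"

if __name__ == "__main__":
    print(ntpath.normpath(attack))
    print(is_under_cgi_root(attack))
    print(is_under_cgi_root(r"C:\inetpub\wwwroot\cgi-bin\script.exe"))
```

A naive prefix test on the raw string would accept the attack path; running the same test on the canonicalized path rejects it.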


Whitespace outside of the document element and within start and end tags is normalized


A Canonical XML document is by definition an XML document that is in XML Canonical form, defined by the Canonical XML specification.


by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

<node1 a="2" x="1">Data</node1> <node2>Data</node2>

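In computer science, canonicalization (sometimes standardization or normalization) is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form.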

In Unicode, many accented letters can be represented in more than one way. For example,

may be executed. This rule is enforced by checking that the path starts with C:\inetpub\wwwroot\cgi-bin\ and only then executing it.


Process for converting data into a "standard", "normal", or canonical form

"Consolidate Duplicate URLs with Canonicals | Google Search Central"

Attribute value delimiters are set to quotation marks (double quotes)


Superfluous namespace declarations are removed from each element


Attribute values are normalized, as if by a validating processor


While the file C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe initially appears to be in the cgi directory, it exploits the ..


The XML declaration and document type declaration are removed


A simple example would be the following two snippets of XML:


The canonical can be in a different domain than a duplicate.


A full summary of canonicalization changes is listed below:


C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe


Canonical URLs are usually the URLs that get used for the share action.


CDATA sections are replaced with their character content


Line breaks normalized to #xA on input, before parsing


RFC 2279: UTF-8, a transformation format of ISO 10646


Empty elements are converted to start-end tag pairs


Character and parsed entity references are replaced

Not to be confused with Canonical link element or Canonization.

Files in file systems may in most cases be accessed through multiple filenames. For instance, in Unix-like systems, the string "/./" can be replaced by "/".

A canonical URL is a URL for defining the single source of truth for duplicate content.

A canonical link element can be used to define a canonical URL.

The document is encoded in UTF-8

Default attributes are added to each element

Fixup of xml:base attributes is performed

In morphology and lexicography, a lemma is the canonical form of a set of words. In English, for example, run, runs, ran, and running are forms of the same lexeme, so we can select one of them, e.g. run, to represent all the forms. Lexical databases such as Unitex use this kind of representation. Lemmatisation is the process of converting a word to its canonical form.

See also: Canonical form, Graph canonization, Lemmatisation, Text normalization, Type species

Cutts, Matt (4 January 2006). "SEO advice: url canonicalization". Matt Cutts: Gadgets, Google, and SEO. Retrieved 3 September 2013.

"Canonicalized URL is noindex, nofollow". Retrieved 20 April 2020.
