Text normalization

Not to be confused with word normalization or Unicode normalization.

Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure.

Applications

Text normalization is frequently used when converting text to speech. Numbers, dates, acronyms, and abbreviations are non-standard "words" that need to be pronounced differently depending on context. For example:

- "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan.
- "vi" could be pronounced as "vie," "vee," or "the sixth" depending on the surrounding words.

Text can also be normalized for storing and searching in a database. For instance, if a search for "resume" is to match the word "résumé," then the text would be normalized by removing diacritical marks; and if "john" is to match "John", the text would be converted to a single case. To prepare text for searching, it might also be stemmed (e.g. converting "flew" and "flying" both into "fly"), canonicalized (e.g. consistently using American or British English spelling), or have stop words removed.
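
The search-oriented normalization described in this section (removing diacritical marks and converting to a single case) can be sketched in Python. This is a minimal illustration, not a reference implementation: the function names `strip_diacritics` and `normalize_for_search` are hypothetical, and the sketch assumes that decomposing characters with the standard `unicodedata` module and dropping combining marks is an acceptable way to remove diacritics for the text at hand.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks such as accents (hypothetical helper)."""
    # NFD splits a character like "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_for_search(text: str) -> str:
    """Strip diacritics and fold case so that variants compare equal."""
    return strip_diacritics(text).casefold()


if __name__ == "__main__":
    print(normalize_for_search("Résumé"))
    print(normalize_for_search("John") == normalize_for_search("john"))
```

Applying the same function to both the stored text and the query is what makes "resume" match "résumé". Letters that have no decomposition (for example "ø" or "ł") pass through this sketch unchanged, so a real system may need an additional mapping for them.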
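
Techniques

For simple, context-independent normalization, such as removing non-alphanumeric characters or diacritical marks, regular expressions would suffice. For example, the sed script

    sed -E "s/[[:space:]]+/ /g" inputfile

would normalize runs of whitespace characters within each line into a single space. (The -E flag selects extended regular expressions, in which "+" means "one or more"; in sed's default basic syntax an unescaped "+" is a literal plus sign.) More complex normalization requires correspondingly complicated algorithms, including domain knowledge of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text and as a special case of machine translation.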
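
The same kind of simple, context-independent normalization can be sketched with Python's standard `re` module. The helper names below (`collapse_whitespace`, `drop_non_alphanumeric`, `normalize_simple`) are hypothetical and chosen only for illustration; the sketch assumes that "alphanumeric" means Unicode letters and digits, and that whitespace should be kept as a word separator.

```python
import re

# One or more whitespace characters (spaces, tabs, newlines, ...).
_WHITESPACE_RUN = re.compile(r"\s+")
# Anything that is neither a word character nor whitespace, plus "_",
# which \w would otherwise treat as a word character.
_NON_ALNUM = re.compile(r"[^\w\s]|_")


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with a single space."""
    return _WHITESPACE_RUN.sub(" ", text)


def drop_non_alphanumeric(text: str) -> str:
    """Delete punctuation and symbols, keeping letters, digits and whitespace."""
    return _NON_ALNUM.sub("", text)


def normalize_simple(text: str) -> str:
    """Drop non-alphanumeric characters, collapse whitespace, trim the ends."""
    return collapse_whitespace(drop_non_alphanumeric(text)).strip()


if __name__ == "__main__":
    print(normalize_simple("  Hello,   world!  "))
```

Unlike sed, which processes its input one line at a time, `collapse_whitespace` also folds newlines into the single space, because `\s` matches them. Whether that is desirable depends on how the text is to be processed afterwards.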

Textual scholarship

In the field of textual scholarship and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization, for example in the expansion of scribal abbreviations and the transliteration of the archaic glyphs typically found in manuscript and early printed sources. A normalized edition is therefore distinguished from a diplomatic edition (or semi-diplomatic edition), in which some attempt is made to preserve these features. The aim is to strike an appropriate balance between, on the one hand, rigorous fidelity to the source text (including, for example, the preservation of enigmatic and ambiguous elements); and, on the other, producing a new text that will be comprehensible and accessible to the modern reader. The extent of normalization is therefore at the discretion of the editor, and will vary. Some editors, for example, choose to modernize archaic spellings and punctuation, but others do not.

See also

- Automated paraphrasing
- Canonicalization
- Text simplification
- Unicode equivalence

References

- Sproat, Richard; Bedrick, Steven (September 2011). "CS506/606: Txt Nrmlztn". Retrieved October 2, 2012.
- "Samoan Numbers". MyLanguages.org. Retrieved October 2, 2012.
- "Text-to-Speech Engines Text Normalization". MSDN. Retrieved October 2, 2012.
- Harvey, P. D. A. (2001). Editing Historical Records. London: British Library. pp. 40–46. ISBN 0-7123-4684-8.
- Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words." Computer Speech and Language 15; 287–333. doi:10.1006/csla.2001.0169.
- Zhu, C.; Tang, J.; Li, H.; Ng, H.; Zhao, T. (2007). "A Unified Tagging Approach to Text Normalization." Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics; 688–695. doi:10.1.1.72.8138.
- Filip, G.; Krzysztof, J.; Agnieszka, W.; Mikołaj, W. (2006). "Text Normalization as a Special Case of Machine Translation." Proceedings of the International Multiconference on Computer Science and Information Technology 1; 51–56.
- Mosquera, A.; Lloret, E.; Moreda, P. (2012). "Towards Facilitating the Accessibility of Web 2.0 Texts through Text Normalisation". Proceedings of the LREC workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA); 9–14.

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.