Winograd schema challenge

Turing's original proposal was what he called the imitation game, which involves free-flowing, unrestricted conversations in English between human judges and computer programs over a text-only channel (such as teletype). In general, the machine passes the test if interrogators are not able to tell the difference between it and a human in a five-minute conversation.


The 2016 Winograd Schema Challenge was run on July 11, 2016, at IJCAI-16. There were four contestants. The first round of the contest was to solve PDPs (pronoun disambiguation problems), which were adapted from literary sources rather than constructed as pairs of sentences. The highest score achieved was 58% correct.

The schema challenge question is, "Does the pronoun 'they' refer to the city councilmen or the demonstrators?" Switching between the two instances of the schema changes the answer. The answer is immediate for a human reader but proves difficult to emulate in machines. Levesque argues that knowledge plays a central role in these problems.

In 2017, a neural association model designed for commonsense knowledge acquisition achieved 70% accuracy on 70 manually selected problems from the original dataset of 273 Winograd schemas. In June 2018, a score of 63.7% accuracy was achieved on the full dataset using an ensemble of recurrent neural network language models.

In 2016 and 2018, Nuance Communications sponsored a competition, offering a grand prize of $25,000 for the top scorer above 90% (for comparison, humans correctly answer 92–96% of WSC questions). However, nobody came close to winning the prize in 2016, and the 2018 competition was cancelled for lack of prospects; the prize is no longer offered.

The key factor in the WSC is the special format of its questions, which are derived from Winograd schemas. Questions of this form may be tailored to require knowledge and commonsense reasoning in a variety of domains. They must also be carefully written not to betray their answers.

Turing proposed that, instead of debating whether a machine can think, the science of AI should be concerned with demonstrating intelligent behavior, which can be tested. But the exact nature of the test Turing proposed has come under scrutiny, especially since the emergence of an AI chatbot named Eugene Goostman.

One difficulty with the Winograd schema challenge is the development of the questions. They need to be carefully tailored to ensure that they require commonsense reasoning to solve. For example, Levesque gives the following example of a so-called Winograd schema that is "too easy":

In any situation, pills do not get pregnant, women do; women cannot be carcinogenic, but pills can. Thus this answer could be derived without the use of reasoning or any understanding of the sentences' meaning; all that is necessary is data on the selectional restrictions of "pregnant" and "carcinogenic".
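To illustrate why such a schema is too easy, the following minimal sketch resolves the pronoun with nothing more than a lookup table. The lexicon entries, class labels, and function name are hypothetical and invented for illustration; they do not come from any real linguistic resource.

```python
# Toy lexicon (an assumption for illustration): the semantic class each
# predicate can sensibly apply to.
SELECTIONAL_RESTRICTIONS = {
    "pregnant": "person",
    "carcinogenic": "substance",
    "feared": "person",
    "advocated": "person",
}

# Toy semantic classes for the candidate noun phrases (also an assumption).
SEMANTIC_CLASS = {
    "the women": "person",
    "the pills": "substance",
    "the city councilmen": "person",
    "the demonstrators": "person",
}


def resolve_by_restriction(predicate, candidates):
    """Return the only candidate whose class fits the predicate, else None."""
    required = SELECTIONAL_RESTRICTIONS[predicate]
    matches = [c for c in candidates if SEMANTIC_CLASS[c] == required]
    return matches[0] if len(matches) == 1 else None


# The "too easy" schema is settled by the lookup alone.
easy = resolve_by_restriction("pregnant", ["the women", "the pills"])
# Both candidates are people, so the lookup cannot decide.
hard = resolve_by_restriction("feared", ["the city councilmen", "the demonstrators"])
print(easy)
print(hard)
```

In this sketch the pills example is answered without any understanding of the sentence, whereas the councilmen example leaves the lookup undecided because both candidates belong to the same semantic class.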

The top score, by Quan Liu et al. of the University of Science and Technology of China, fell well short of the 90% prize threshold. Hence, by the rules of that challenge, no prizes were awarded, and the challenge did not proceed to the second round. The organizing committee in 2016 was Leora Morgenstern, Ernest Davis, and Charles Ortiz.


The prize could not be awarded to anybody. Most of the participants showed a result close to random choice or even worse. The second competition, scheduled for 2018, was canceled due to the lack of prospective participants.

A more challenging, adversarial "WinoGrande" dataset of 44,000 problems was designed in 2019. This dataset consists of fill-in-the-blank style sentences, as opposed to the pronoun format of previous datasets.


Nuance Communications announced in July 2014 that it would sponsor an annual WSC competition, with a prize of $25,000 for the best system that could match human performance. However, the prize is no longer offered.

The chatbot Eugene Goostman claimed to pass the test in 2014. One of the major concerns with the Turing test is that a machine could easily pass the test with brute force and/or trickery, rather than true intelligence.

The answer to this schema has to do with our understanding of the typical relationships between and behavior of councilmen and demonstrators.

The Winograd schema challenge was proposed in 2012 in part to ameliorate the problems that came to light with the nature of the programs that performed well on the Turing test.

Ernest Davis, a professor at New York University, has compiled a list of over 140 Winograd schemas from various sources as examples of the kinds of questions that should appear on the Winograd schema challenge.

The June 2018 result marked the first use of deep neural networks that learn from independent corpora to acquire commonsense knowledge. In 2019, a score of 90.1% was achieved on the original Winograd schema dataset by fine-tuning the BERT language model.

The 2015 symposium had a special focus on the Winograd schema challenge. The organizing committee included Leora Morgenstern (Leidos), Theodore Patkos (The Foundation for Research & Technology Hellas), and Robert Sloan (University of Illinois at Chicago).


Conversation: A lot of interaction may qualify as "legitimate conversation"—jokes, clever asides, points of order—without requiring intelligent reasoning.

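The Winograd schema challenge (WSC) is a test of machine intelligence proposed in 2012 by Hector Levesque, a computer scientist at the University of Toronto. Designed to be an improvement on the Turing test, it is a multiple-choice test that employs questions of a very specific structure: they are instances of what are called Winograd schemas, named after Terry Winograd, professor of computer science at Stanford University.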

Winograd schemas of varying difficulty may be designed, involving anything from simple cause-and-effect relationships to complex narratives of events.


A special word and alternate word, such that if the special word is replaced with the alternate word, the natural resolution of the pronoun changes.
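The role of the special and alternate words can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `WinogradSchema`, its fields, and the `{word}` slot syntax are hypothetical and not part of any official WSC format; the sentence is the city councilmen example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WinogradSchema:
    """Hypothetical container for one schema and its two instances."""
    template: str   # sentence with a {word} slot for the special/alternate word
    question: str   # asks for the identity of the ambiguous pronoun
    choices: tuple  # the two candidate noun phrases
    answers: dict   # maps each slot word to the correct choice

    def instances(self):
        """Yield (sentence, question, choices, answer) for each slot word."""
        for word, answer in self.answers.items():
            yield self.template.format(word=word), self.question, self.choices, answer


councilmen = WinogradSchema(
    template="The city councilmen refused the demonstrators a permit "
             "because they {word} violence.",
    question="Does the pronoun 'they' refer to the city councilmen or the demonstrators?",
    choices=("the city councilmen", "the demonstrators"),
    # Swapping the special word for the alternate word flips the answer.
    answers={"feared": "the city councilmen", "advocated": "the demonstrators"},
)

instances = list(councilmen.instances())
for sentence, _question, _choices, answer in instances:
    print(sentence, "->", answer)
```

Keeping both instances in one record makes the defining property explicit: the same sentence frame yields opposite answers depending on a single word.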


The performance of Eugene Goostman exhibited some of the Turing test's problems. Levesque identifies several major issues, summarized as follows:


The BERT fine-tuning used appropriate WSC-like training data to avoid having to learn commonsense reasoning.

The Twelfth International Symposium on the Logical Formalizations of Commonsense Reasoning was held on March 23–25, 2015, at the AAAI Spring Symposium Series at Stanford University.


They may be constructed to test reasoning ability in specific domains (e.g., social/psychological or spatial reasoning).


A machine will be given the problem in a standardized form which includes the answer choices, thus making it a binary decision problem.
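Because every question offers exactly two answer choices, scoring a system reduces to computing accuracy and comparing it with the 50% expected from random guessing. The sketch below shows one way this could be done; the function name, the answer labels, and the sample predictions are made up for illustration and are not real contest data.

```python
CHANCE_BASELINE = 0.5  # expected accuracy of random guessing on two choices


def accuracy(predictions, gold):
    """Fraction of questions where the predicted choice matches the gold choice."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical answers to four questions, labelled "A" or "B".
gold = ["A", "B", "A", "B"]
predictions = ["A", "B", "B", "B"]

score = accuracy(predictions, gold)
print(f"accuracy={score:.2f}, above chance: {score > CHANCE_BASELINE}")
```

No human judge is involved at any point: the gold answers are fixed in advance and the comparison is mechanical.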


Deception: The machine is forced to construct a false identity, which is not part of intelligence.


Davis compiled this collection in the period since the original proposal of the Winograd schema challenge.


The city councilmen refused the demonstrators a permit because they advocated violence.

The first cited example of a Winograd schema (and the reason for their name) is due to Terry Winograd:

1011: 895: 858: 645: 487: 475: 188:

The city councilmen refused the demonstrators a permit because they feared violence.


The choices of "feared" and "advocated" turn the schema into its two instances:

Evaluation: Humans make mistakes, and judges would often disagree on the results.


The city councilmen refused the demonstrators a permit because they [feared/advocated] violence.

The women stopped taking pills because they were [pregnant/carcinogenic]. Which individuals were [pregnant/carcinogenic]?

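On the surface, Winograd schema questions simply require the resolution of anaphora: the machine must identify the antecedent of an ambiguous pronoun in a statement. This makes it a task of natural language processing, but Levesque argues that for Winograd schemas, the task requires the use of knowledge and commonsense reasoning.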

504: 219: 105: 101: 47: 611: 380:"Can Winograd Schemas Replace Turing Test for Defining Human-level AI" 260:

The Winograd schema challenge has the following purported advantages:

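A version of the Winograd schema challenge is one part of the GLUE (General Language Understanding Evaluation) benchmark collection of challenges in automated natural-language understanding.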

Proposed by Alan Turing in 1950, the Turing test plays a central role in the philosophy of artificial intelligence.


Two answer choices corresponding to the noun phrases in question.

The general language model GPT-3 achieved a score of 88.3% without specific fine-tuning in 2020.

The Winograd Schema Challenge was proposed in the spirit of the Turing test.


Knowledge and commonsense reasoning are required to solve them.

The challenge has been considered defeated since 2019, when a number of transformer-based language models achieved accuracies of over 90%.

The answer to this question can be determined on the basis of selectional restrictions.


A Winograd schema challenge question consists of three parts:

958: 443: 343: 555:

Levesque, Hector; Davis, Ernest; Morgenstern, Leora (2012).

241:

A question asking the identity of the ambiguous pronoun, and

In particular, the answers must not be betrayed by selectional restrictions or statistical information about the words in the sentence.


A sentence or brief discourse that contains the following:


Two noun phrases of the same semantic class (male, female, inanimate, or group of objects or people),

An ambiguous pronoun that may refer to either of the above noun phrases, and

A further purported advantage is that there is no need for human judges.
