Poesio, Massimo ; Chamberlain, Jon ; Paun, Silviu ; Yu, Juntao ; Uma, Alexandra ; Kruschwitz, Udo

A Crowdsourced Corpus of Multiple Judgments and Disagreement on Anaphoric Interpretation

Poesio, Massimo, Chamberlain, Jon, Paun, Silviu, Yu, Juntao, Uma, Alexandra and Kruschwitz, Udo (2019) A Crowdsourced Corpus of Multiple Judgments and Disagreement on Anaphoric Interpretation. In: NAACL 2019 - Conference of the North American Chapter of the Association for Computational Linguistics, June, 2019, Minneapolis, Minnesota.

Date of publication of this fulltext: 29 Jun 2020 13:34
Conference or workshop item
DOI to cite this document: 10.5283/epub.43420


Abstract

We present a corpus of anaphoric information (coreference) crowdsourced through a game-with-a-purpose. The corpus, containing annotations for about 108,000 markables, is one of the largest corpora for coreference for English, and one of the largest crowdsourced NLP corpora, but its main feature is the large number of judgments per markable: 20 on average, and over 2.2M in total. This characteristic makes the corpus a unique resource for the study of disagreements on anaphoric interpretation. A second distinctive feature is its rich annotation scheme, covering singletons, expletives, and split-antecedent plurals. Finally, the corpus also comes with labels inferred using a recently proposed probabilistic model of annotation for coreference. The labels are of high quality and make it possible to successfully train a state-of-the-art coreference resolver, including training on singletons and non-referring expressions. The annotation model can also propose more than one label, or no label, for a markable, thus serving as a baseline method for automatically identifying ambiguous markables. A preliminary analysis of the results is presented.



Details

Item type: Conference or workshop item (UNSPECIFIED)
Title of Book: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Publisher: Association for Computational Linguistics
Place of Publication: Minneapolis, Minnesota
Page Range: pp. 1778-1789
Date: June 2019
Institutions: Languages and Literatures > Institut für Information und Medien, Sprache und Kultur (I:IMSK) > Lehrstuhl für Informationswissenschaft (Prof. Dr. Udo Kruschwitz); Informatics and Data Science > Department Human-Centered Computing > Lehrstuhl für Informationswissenschaft (Prof. Dr. Udo Kruschwitz)
Dewey Decimal Classification: 000 Computer science, information & general works > 020 Library & information sciences
Status: Published
Refereed: Yes, this version has been refereed
Created at the University of Regensburg: Yes
URN of the UB Regensburg: urn:nbn:de:bvb:355-epub-434200
Item ID: 43420
