
Applica at CoNLL 2020

Applica continues its efforts to publish at top NLP conferences. This time, the research team presented the paper From Dataset Recycling to Multi-Property Extraction and Beyond at CoNLL 2020. The paper focuses on extracting the information contained in Wikipedia articles. The contributions are threefold. We:

  • Investigate the use of dual-source Transformer models on information extraction tasks,
  • Set new state-of-the-art results on the WikiReading dataset,
  • Introduce a new dataset called WikiReading Recycled.

WikiReading

In 2016, Daniel Hewlett and colleagues proposed a dataset called WikiReading that combines Wikipedia articles with information from the Wikidata knowledge base. The information from Wikidata is organized into properties, each consisting of a property name and a property value. The goal is to extract the property value from the associated Wikipedia article, given the property name. For example, given an article about the EMNLP conference, the model has to predict that the conference's main subject is natural language processing.

Caption: Visualization of an example: a Wikipedia article (left) and its associated properties (right). The property names are short name and main subject; the model should predict the answers EMNLP and natural language processing, respectively.
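To make the setup concrete, here is a minimal sketch of what a single data point looks like. The field names are our own illustration, not the dataset's exact schema:

```python
# A hypothetical WikiReading-style data point (field names are illustrative):
# the input is a Wikipedia article together with one property name, and the
# target is the corresponding property value taken from Wikidata.
example = {
    "article": "EMNLP is a conference on empirical methods in natural ...",
    "property_name": "main subject",
    "property_value": "natural language processing",  # the extraction target
}
```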

Despite the usefulness of WikiReading, we discovered some drawbacks in its construction process. Fixing these problems required modifying the dataset, which led us to introduce WikiReading Recycled.

WikiReading Recycled and Multi-property Extraction

WikiReading Recycled uses the same data as WikiReading, but we reorganized it to meet the following criteria: no information leakage between training and evaluation, a carefully constructed, human-annotated test set, and a multi-property extraction paradigm. These requirements improve the dataset in several ways. First, the evaluation process is fair. Second, the model is encouraged to discover relationships between properties; such intertwining between data points is widespread in business data. Properties may intertwine in many ways: some are collocated with each other, while some require information provided by others. The table below summarizes the differences, and a sketch of the multi-property format follows it.

| Feature                  | WikiReading | WikiReading Recycled |
| ------------------------ | ----------- | -------------------- |
| Base unit                | property    | article              |
| Properties per example   | 1           | 4.5                  |
| An article appears in    | a few splits | one split           |
| Dataset split            | random      | controlled           |
| Human-annotated test set | no          | yes                  |
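To illustrate the multi-property paradigm, here is a hedged sketch of how one WikiReading Recycled example might be organized. The field names and the query encoding are our assumptions for illustration, not necessarily the dataset's exact format:

```python
# A hypothetical multi-property example: one article paired with all of its
# queried properties, so a single pass must answer several queries at once.
example = {
    "article": "EMNLP is a leading conference in the area of ...",
    "properties": {
        "short name": "EMNLP",
        "main subject": "natural language processing",
    },
}

# One simple encoding (an assumption, not necessarily the paper's choice):
# join the property names into a single query and expect the values back
# in the same order.
query = "; ".join(example["properties"].keys())
target = "; ".join(example["properties"].values())
print(query)   # short name; main subject
print(target)  # EMNLP; natural language processing
```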

Results

Transformer-based models have shown superior performance on many NLP tasks. We evaluated a vanilla Transformer model and a dual-source Transformer. The latter was originally proposed for the Automatic Post-Editing task and takes two inputs: a query (e.g., the property names) and a key (e.g., the article).
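The sketch below is our own minimal PyTorch illustration of the dual-source idea, not the paper's exact architecture: two separate encoders process the query and the article, and the decoder cross-attends over both encoder memories concatenated along the sequence dimension. All hyperparameters are placeholders; positional encodings and attention masks are omitted for brevity.

```python
# A minimal dual-source Transformer sketch (an assumption, not the paper's
# exact model): one encoder for the property-name query, one for the article,
# and a decoder attending jointly over both encoder outputs.
import torch
import torch.nn as nn

class DualSourceTransformer(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead)
        # nn.TransformerEncoder deep-copies the layer, so the two encoders
        # below hold separate parameters.
        self.query_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.article_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, query_ids, article_ids, target_ids):
        # Shapes: (seq_len, batch) -- PyTorch's default for these modules.
        q_mem = self.query_encoder(self.embed(query_ids))
        a_mem = self.article_encoder(self.embed(article_ids))
        # Combine both sources by concatenating memories along the sequence
        # dimension; the decoder then cross-attends over query and article.
        memory = torch.cat([q_mem, a_mem], dim=0)
        dec = self.decoder(self.embed(target_ids), memory)
        return self.out(dec)

model = DualSourceTransformer()
query = torch.randint(0, 32000, (8, 2))      # tokenized property names
article = torch.randint(0, 32000, (128, 2))  # tokenized article text
target = torch.randint(0, 32000, (16, 2))    # property values (teacher forcing)
logits = model(query, article, target)       # (16, 2, 32000)
```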

Both models outperformed previous approaches on WikiReading by a large margin. We repeated the experiments on WikiReading Recycled and set strong baselines. Moreover, we achieved slightly higher scores when transfer learning was applied.

In addition, we performed a detailed analysis of the results, checking how the models perform depending on property frequency and property type. We encourage you to read the paper for more details.

The dataset and the models are available on GitHub.