WAT API Documentation

WAT is an entity linker, namely a tool that identifies meaningful substrings (called "spots") in an unstructured English text and links each of them to an unambiguous entity (an item in a knowledge base). Entities are Wikipedia/Wikidata items. This has applications in a range of NLP/NLU problems such as question answering, knowledge base population, text classification, etc. You can annotate a text by issuing a query to the RESTful API documented on this page.

WAT supersedes TagMe: it has similar runtime performance but produces more accurate results (see the paper for details), though it can currently process English documents only.

Parameters can be passed to the API endpoints either as URL-encoded parameters or as fields of a multipart request. All endpoints accept HTTP GET requests only.

Registering to the service

The service is hosted by the D4Science Infrastructure. To obtain access you need to register to the TagMe VRE and get your authorization token by clicking on the Show button in the left panel. You now have everything in place to issue a query to the WAT API. For example, you can point your browser to:

https://wat.d4science.org/wat/tag/tag?lang=en&gcube-token=XXXX&text=Obama+visited+U.K.+in+March

(Replace XXXX with your actual Service Authorization Token)

Congratulations! You have made your first request to WAT.

How to annotate

Annotating a text is the main service provided by WAT. This is the so-called Sa2KB problem. An annotation is a pair (spot, entity), where "spot" is a substring of the input text and "entity" is a reference to a Wikipedia item representing the meaning of that spot, in that context.

The response includes all annotations found in the input text. WAT associates with each annotation an attribute called ρ (rho), which estimates the confidence of the annotation. (Note that ρ does not indicate the relevance of the entity in the input text.) You can use the ρ value to discard annotations that are below a given threshold. The threshold should be chosen in the interval [0,1]; a reasonable threshold is between 0.1 and 0.3.
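Here is a minimal sketch of such a filtering step in Python (the function name and the 0.2 default are illustrative, not part of the API):

def filter_by_rho(annotations, threshold=0.2):
    # Keep only the annotations whose confidence (rho) meets the threshold.
    # `annotations` is the parsed "annotations" array of an API response.
    return [a for a in annotations if a['rho'] >= threshold]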

Endpoint URL

https://wat.d4science.org/wat/tag/tag 

Parameters

  • text - required - the text to be annotated
  • gcube-token - required - the D4Science Service Authorization Token.
  • lang - optional - The language of the text to be annotated. Currently only en for English is accepted.
  • tokenizer - optional - Tokenizer to use. Accepted values: opennlp (default, for well-formed text), lucene (for non-well-formed text).

Advanced optional parameters

  • debug - optional - Include debugging information. This value is interpreted as the bitwise OR of one or more flag values. Each flag enables specific debugging information: 1 (document processing), 2 (spotted mentions), 4 (pipeline), 8 (explanation of disambiguation modules). E.g. to get debug information about document processing and the pipeline, provide debug=5, as in the sketch below.
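In code, the flags can be composed with a bitwise OR; a minimal sketch (the constant names are illustrative, only the numeric values come from the list above):

# Debug flag values as documented above; the names are illustrative.
DOC_PROCESSING = 1
SPOTTED_MENTIONS = 2
PIPELINE = 4
DISAMBIGUATION_EXPLANATION = 8

debug = DOC_PROCESSING | PIPELINE  # 1 | 4 == 5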

Python Running Example

Here is a toy example written in Python that queries WAT with the best configuration settings described in the thesis referenced below (see Credits and Reference). Just copy and paste this example into your code to see it working!

import json
import requests

MY_GCUBE_TOKEN = 'copy your gcube-token here!'

class WATAnnotation:
    # An entity annotated by WAT

    def __init__(self, d):

        # start character offset (inclusive)
        self.start = d['start']
        # end character offset (exclusive)
        self.end = d['end']

        # annotation confidence (rho)
        self.rho = d['rho']
        # spot-entity prior probability (available because the request below
        # sets debug=9, which includes the explanation information)
        self.prior_prob = d['explanation']['prior_explanation']['entity_mention_probability']

        # annotated text
        self.spot = d['spot']

        # Wikipedia entity info
        self.wiki_id = d['id']
        self.wiki_title = d['title']


    def json_dict(self):
        # Simple dictionary representation
        return {'wiki_title': self.wiki_title,
                'wiki_id': self.wiki_id,
                'start': self.start,
                'end': self.end,
                'rho': self.rho,
                'prior_prob': self.prior_prob
                }


def wat_entity_linking(text):
    # Main method, text annotation with WAT entity linking system
    wat_url = 'https://wat.d4science.org/wat/tag/tag'
    payload = [("gcube-token", MY_GCUBE_TOKEN),
               ("text", text),
               ("lang", 'en'),
               ("tokenizer", "nlp4j"),
               ('debug', 9),
               ("method",
                "spotter:includeUserHint=true:includeNamedEntity=true:includeNounPhrase=true,prior:k=50,filter-valid,centroid:rescore=true,topk:k=5,voting:relatedness=lm,ranker:model=0046.model,confidence:model=pruner-wiki.linear")]

    response = requests.get(wat_url, params=payload)
    return [WATAnnotation(a) for a in response.json()['annotations']]


def print_wat_annotations(wat_annotations):
    json_list = [w.json_dict() for w in wat_annotations]
    print(json.dumps(json_list, indent=4))


wat_annotations = wat_entity_linking('Barack Obama was in Pisa for a flying visit.')
print_wat_annotations(wat_annotations)


URL GET Example

  • gcube-token=<your Service Authorization Token>
  • text=Schumacher won the race in Indianapolis
  • lang=en

This corresponds to the GET request:

https://wat.d4science.org/wat/tag/tag?lang=en&gcube-token=<your Service Authorization Token>&text=Schumacher+won+the+race+in+Indianapolis


How to compute entity relatedness

This service computes the relatedness between two entities by returning a value in the range [0,1], which expresses how much the two entities are semantically related to each other, where 0 = not related, 1 = related.

We point out that this service could be used to relate two texts: first annotate them (e.g. with WAT or TagMe), then estimate the pairwise relatedness among all pairs of their annotated entities. These values can be combined in some way (e.g. avg, max, etc.) to derive a value expressing the relatedness between the two input texts.

This endpoint accepts a list of Wikipedia page IDs and will return the relatedness values among all provided pairs. This means it will return N² values, where N is the number of entities provided, so be careful!

Endpoint URL

https://wat.d4science.org/wat/relatedness/graph

Parameters

  • gcube-token - required - the D4Science Service Authorization Token.
  • lang - optional - The language of the text to be annotated. Currently only en for English is accepted.
  • ids - required, repeated - The Wikipedia ID (a numeric identifier) of an entity. Repeat this parameter to pass multiple IDs.
  • relatedness - optional - Relatedness function to compute. Accepted values are: mw (Milne-Witten), jaccard (Jaccard measure of page outlinks), lm (language model), w2v (Word2Vec), conditionalprobability (Conditional Probability), barabasialbert (Barabasi-Albert on the Wikipedia graph), pmi (Pointwise Mutual Information).

Example

Compute the relatedness between the two entities Barack Obama (Wikipedia ID 534366) and Presidency of Barack Obama (Wikipedia ID 20082093):

  • gcube-token=<your Service Authorization Token>
  • ids=534366
  • ids=20082093

This corresponds to the GET request:

https://wat.d4science.org/wat/relatedness/graph?gcube-token=<your Service Authorization Token>&ids=534366&ids=20082093
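The same request can also be issued from Python; a minimal sketch (the choice of the lm relatedness function is illustrative):

import requests

MY_GCUBE_TOKEN = 'copy your gcube-token here!'

# Repeated parameters (ids) are passed as a list of tuples.
r = requests.get('https://wat.d4science.org/wat/relatedness/graph',
                 params=[('gcube-token', MY_GCUBE_TOKEN),
                         ('ids', 534366),
                         ('ids', 20082093),
                         ('relatedness', 'lm')])
print(r.json())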

How to get surface forms information

Surface forms are portions of text that might be mentions of an entity. WAT provides two endpoints to retrieve information about a surface form, such as the entities it might refer to, how many times it has been seen in Wikipedia, how many times it appeared as a link, etc. The two endpoints accept the same parameters. The first provides information about the frequency of a surface form, while the second also provides information about the entities it may refer to.

Endpoint URL - frequency information

https://wat.d4science.org/wat/sf/frequency

Endpoint URL - full information

https://wat.d4science.org/wat/sf/sf

Parameters

  • lang - optional - The language of the text to be annotated. Currently only en for English is accepted.
  • text - required - The surface form.

Example

Get information about the surface form obama:

  • gcube-token=<your Service Authorization Token>
  • text=obama

This corresponds to the GET request:

https://wat.d4science.org/wat/sf/sf?gcube-token=<your Service Authorization Token>&text=obama

Brief explanation of the most important fields in the response:

  • link_probability: Ratio of the number of times this surface form occurs in Wikipedia pages as a link to the number of times it occurs as plain text.
  • term_frequency: Number of occurrences of this surface form in all Wikipedia pages.
  • term_probability: Ratio of occurrences of this surface form to occurrences of all surface forms across Wikipedia pages.
  • document_frequency: Number of distinct Wikipedia pages that contain this surface form.
  • entities: The Wikipedia pages this surface form links to. For each entity:
    • wiki_id: Wikipedia page ID
    • num_links: How many times this surface form links to this Wikipedia page
    • probability: Ratio of how many times this surface form links to this Wikipedia page as opposed to other pages.
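A minimal Python sketch of the same request (the printed response contains the fields explained above):

import requests

MY_GCUBE_TOKEN = 'copy your gcube-token here!'

r = requests.get('https://wat.d4science.org/wat/sf/sf',
                 params={'gcube-token': MY_GCUBE_TOKEN,
                         'lang': 'en',
                         'text': 'obama'})
print(r.json())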

Wikipedia Title resolution

To resolve a Wikipedia page title into its ID, WAT provides the following API.

API Endpoint

https://wat.d4science.org/wat/title

Parameters

  • lang - optional - The language of the text to be annotated. Currently only en for English is accepted.
  • title - required - The Wikipedia page title.

Example

Get the ID of page Barack Obama:

  • gcube-token=<your Service Authorization Token>
  • title=Barack_Obama

This corresponds to the GET request:

https://wat.d4science.org/wat/title?gcube-token=<your Service Authorization Token>&title=Barack_Obama
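A minimal Python sketch of the same request:

import requests

MY_GCUBE_TOKEN = 'copy your gcube-token here!'

r = requests.get('https://wat.d4science.org/wat/title',
                 params={'gcube-token': MY_GCUBE_TOKEN,
                         'lang': 'en',
                         'title': 'Barack_Obama'})
print(r.json())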

Advanced use: Annotation with partially fed data

The first API endpoint described in this document takes care of the whole pipeline of parsing the text, searching the text for possible mentions, and finally linking them to the entities they refer to. WAT also provides an API endpoint that skips some of these steps by letting the user provide the parsed text or indicate which surface forms are mentions of an entity (the D2KB problem). In this case, the document must be passed as a JSON object.

API Endpoint

https://wat.d4science.org/wat/tag/json

Parameters

  • document - required - A JSON object representing the text to annotate. This object has the following key-value pairs:
    • "text" - required - The document to process
    • "sentences" - optional - An array in the format returned by Stanford CoreNLP parser.
    • "spans" - optional - The spans to annotate (D2KB problem): an array of objects having "start" and "end" fields, i.e. the index of the first (inclusive) and last (exclusive) character of the span
  • gcube-token - required - the D4Science Service Authorization Token.
  • lang - optional - The language of the text to be annotated. Currently only en for English is accepted.
  • tokenizer - optional - Tokenizer to use. Accepted values: opennlp (default, for well-formed text), lucene (for non well-formed text).

Example: Providing the parsed sentence

Building a request to this endpoint by hand is not convenient, mainly because the JSON must be URL-encoded. Here's a Python script that issues a call providing the parsed text:

import json
import requests

document_json = json.loads("""{
  "text": "Barack Obama was in Pisa.",
  "sentences": [{
      "tokens": [
        { "position": { "start": 0, "end": 6 },
          "ner": { "type": "PERSON", "label": "Inside" },
          "id": 0,
          "word": { "word": "Barack" }
        }, {
          "position": { "start": 7, "end": 12 },
          "ner": { "type": "PERSON", "label": "Inside" },
          "id": 1,
          "word": { "word": "Obama" }
        }, {
          "position": { "start": 13, "end": 16 },
          "ner": { "type": "O", "label": "Outside" },
          "id": 2,
          "word": { "word": "was" }
        }, {
          "position": {"start": 17, "end": 19},
          "ner": {"type": "O", "label": "Outside"},
          "id": 3,
          "word": { "word": "in" }
        }, {
          "position": { "start": 20, "end": 24 },
          "ner": { "type": "LOCATION", "label": "Inside" },
          "id": 4,
          "word": { "word": "Pisa" }
        }, {
          "position": { "start": 24,  "end": 25 },
          "ner": { "type": "O",  "label": "Outside" },
          "id": 5,
          "word": { "word": "." }
        }
      ],
      "position": { "start": 0,  "end": 25 },
      "id": 0
    }]
}""")

MY_GCUBE_TOKEN = 'copy your gcube-token here!'

# gcube-token is required (see the parameter list above).
r = requests.get('https://wat.d4science.org/wat/tag/json',
                 params={'document': json.dumps(document_json),
                         'gcube-token': MY_GCUBE_TOKEN})

print(r.text)

Example: Providing the mentions to disambiguate (D2KB problem)

You can request WAT to annotate specific spans by adding the "suggested_spans" key.

import json
import requests

document_json = json.loads("""{

  "text": "Barack Obama was in Pisa.",
  "suggested_spans": [
    { "start":0, "end":6 },
    { "start":20, "end":24 }
  ]
}""")

MY_GCUBE_TOKEN = 'copy your gcube-token here!'

# gcube-token is required (see the parameter list above).
r = requests.get('https://wat.d4science.org/wat/tag/json',
                 params={'document': json.dumps(document_json),
                         'gcube-token': MY_GCUBE_TOKEN})

print(r.text)

Credits and Reference

To learn more about the inner workings of WAT, check out the paper that appeared at ERD 2014 and the PhD thesis by Francesco Piccinno.