TagMe API Documentation
TagMe is a powerful tool that identifies meaningful substrings (called "spots") in unstructured text on the fly and links each of them to a pertinent Wikipedia page, efficiently and effectively. You can annotate a text by issuing a query to the RESTful API documented on this page.
TagMe has proven to be one of the best entity linking tools in the scientific community, and it performs especially well when annotating short texts (namely, those composed of a few dozen terms).
Registering for the service
The service is hosted by the D4Science Infrastructure. To obtain access, you need to register with the TagMe VRE and get your authorization token by clicking on the Show button in the left panel. You then have everything you need to issue a query to the TagMe RESTful API. For example, you can point your browser to:
https://tagme.d4science.org/tagme/tag?lang=en&gcube-token=XXXX&text=obama visited uk
(Replace XXXX with your actual Service Authorization Token)
Congratulations! You have made your first request to TagMe.
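A browser encodes the spaces in the text parameter for you, but a script has to do it explicitly. The following is a minimal Python sketch that builds the same request URL using only the standard library. The function name build_tag_url is illustrative and not part of TagMe; XXXX again stands for your own token.

```python
from urllib.parse import urlencode

TAG_ENDPOINT = "https://tagme.d4science.org/tagme/tag"

def build_tag_url(text, token, lang="en"):
    """Return a GET URL for the tag endpoint with percent-encoded parameters."""
    params = {"lang": lang, "gcube-token": token, "text": text}
    # urlencode escapes reserved characters and turns spaces into "+".
    return TAG_ENDPOINT + "?" + urlencode(params)

url = build_tag_url("obama visited uk", "XXXX")
print(url)
# https://tagme.d4science.org/tagme/tag?lang=en&gcube-token=XXXX&text=obama+visited+uk
```

The resulting URL can be passed to any HTTP client, for instance urllib.request.urlopen.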
How to annotate
Annotating a text is the main service provided by TagMe. An annotation is a pair (spot,entity), where "spot" is a substring of the input text and "entity" is a reference to a Wikipedia page representing the meaning of that spot, in that context.
The response includes all annotations found in the input text. TagMe associates an attribute with each annotation, called ρ (rho), which estimates the "goodness" of the annotation with respect to the other entities of the input text. We stress here that ρ does not indicate the relevance of the entity in the input text; it is rather a confidence score assigned by TagMe to that annotation. You can use the ρ value to discard annotations that fall below a given threshold. The threshold should be chosen in the interval [0,1]; a reasonable value is between 0.1 and 0.3.
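Discarding low-confidence annotations can be sketched in a few lines of Python. The sketch assumes that the response is a JSON object with an "annotations" array whose items carry a numeric "rho" field; these field names, like the sample values, are assumptions made for illustration and should be checked against an actual response.

```python
import json

def filter_annotations(response_text, min_rho=0.2):
    """Keep only the annotations whose rho is at least min_rho."""
    if not 0.0 <= min_rho <= 1.0:
        raise ValueError("min_rho must lie in the interval [0, 1]")
    data = json.loads(response_text)
    # "annotations" and "rho" are assumed field names (see the note above).
    return [a for a in data.get("annotations", []) if a.get("rho", 0.0) >= min_rho]

# Hypothetical response body with made-up rho values.
sample = json.dumps({"annotations": [
    {"spot": "obama", "title": "Barack Obama", "rho": 0.42},
    {"spot": "visited", "title": "Visit", "rho": 0.03},
    {"spot": "uk", "title": "United Kingdom", "rho": 0.35},
]})

kept = filter_annotations(sample, min_rho=0.2)
print([a["title"] for a in kept])
```

Raising min_rho trades recall for precision; a value in the suggested range of 0.1 to 0.3 is a sensible starting point.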
Using the optional parameters described below, the response can include additional information, such as the DBpedia categories associated with the annotated entity. TagMe can also be customized to work on Twitter messages.
Endpoint URL
https://tagme.d4science.org/tagme/tag
Parameters
- text - required - The text to be annotated, using UTF-8 encoding. We recommend using POST if the text is very long, as the limit for GET requests is set to 8 KB, while the limit for POST requests is set to 2 MB. Keep in mind, however, that TagMe's strength lies in its ability to annotate short texts, and that it should be tuned differently when dealing with long texts.
- gcube-token - required - the D4Science Service Authorization Token.
- lang - optional - The language of the text to be annotated. Accepted values are de for German, en for English and it for Italian. Default is en.
- tweet - optional - Enable the special parser for Twitter messages. This parser has been designed to better handle the usual entities found in tweets, such as URLs, user mentions and hashtags. When this option is enabled, the text parameter can contain the JSON dump of the tweet as directly retrieved from Twitter. Refer to the Twitter API for further details. Supported values are true and false, default is false.
- include_abstract - optional - If this option is enabled, the response also includes, for each disambiguated spot, the abstract of the related Wikipedia page. Supported values are true and false, default is false.
- include_categories - optional - If this option is enabled, the response also includes, for each disambiguated spot, the list of categories to which the related Wikipedia page belongs. The list of categories is provided by DBpedia (currently this feature is based on DBpedia version 3.8). Supported values are true and false, default is false.
- include_all_spots - optional - If this option is enabled, the response will contain information about all spots found in the input text, including those that TagMe was not able to annotate with an entity. In such cases, the JSON objects for untagged spots do not contain details about the topic, such as id and title. Supported values are true and false, default is false.
Advanced optional parameters
- long_text - TagMe is designed to annotate short texts, but it has proven competitive on long texts too. When annotating long texts, TagMe processes just a limited portion of the input text at once, namely a window of spots, and annotates a spot using only the surrounding spots in that window. This parameter lets you specify the size of the shifting window for long texts. If you want to disable this mechanism and force TagMe to always process the whole text, set this parameter to zero. In that case, please be aware that taking into account all spots in a long text might induce dangerous topic drifts, which could jeopardise the effectiveness of the annotation. Supported values are integers starting from 0.
- epsilon - This parameter can be used to fine-tune the disambiguation process: a higher value favors the most common topics for a spot, whereas a lower value gives more weight to the context. This parameter can be useful when annotating particularly fragmented text, like tweets, where it is better to favor the most common topics because the context is less reliable for disambiguation. Supported values are floats in the range [0,0.5], default is 0.3.
Example
- gcube-token=<your Service Authorization Token>
- text=Schumacher won the race in Indianapolis
- lang=en
- include_abstract=true
- include_categories=true
This corresponds to the GET request:
https://tagme.d4science.org/tagme/tag?lang=en&include_abstract=true&include_categories=true&gcube-token=<your Service Authorization Token>&text=Schumacher won the race in Indianapolis
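The choice between GET and POST recommended for the text parameter can be automated. The Python sketch below builds a urllib request object without sending it. It makes two assumptions that are not confirmed by this page: that the 8 KB GET limit applies to the whole URL, and that POST requests accept the parameters as a form-encoded body. The function name build_tag_request is illustrative.

```python
from urllib.parse import urlencode
from urllib.request import Request

TAG_ENDPOINT = "https://tagme.d4science.org/tagme/tag"
GET_LIMIT_BYTES = 8 * 1024           # documented limit for GET requests
POST_LIMIT_BYTES = 2 * 1024 * 1024   # documented limit for POST requests

def build_tag_request(text, token, **options):
    """Build a GET request when it fits in 8 KB, otherwise a POST request."""
    params = {"gcube-token": token, "text": text}
    for name, value in options.items():
        # Booleans become the "true"/"false" strings the API expects.
        params[name] = str(value).lower() if isinstance(value, bool) else str(value)
    query = urlencode(params)
    url = TAG_ENDPOINT + "?" + query
    if len(url.encode("utf-8")) <= GET_LIMIT_BYTES:  # assumption: limit covers the URL
        return Request(url, method="GET")
    body = query.encode("utf-8")
    if len(body) > POST_LIMIT_BYTES:
        raise ValueError("text exceeds the 2 MB limit for POST requests")
    # Assumption: the service accepts a form-encoded POST body.
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(TAG_ENDPOINT, data=body, headers=headers, method="POST")

request = build_tag_request(
    "Schumacher won the race in Indianapolis", "TOKEN",
    lang="en", include_abstract=True, include_categories=True,
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen sends the request; replace TOKEN with your Service Authorization Token first.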
HTTP Errors
- 501 (NOT IMPLEMENTED) - The resource you requested is not a valid TagMe service.
- 401 (UNAUTHORIZED) - You haven't provided a Service Authorization Token or it is not valid.
- 400 (BAD REQUEST) - There are issues with the parameters you have sent (or not sent). Check the response message for details.
- 500 (INTERNAL SERVER ERROR) - We have experienced an issue with your request. Please report this error to tagme [at] di [dot] unipi [dot] it.
How to get spots (mentions) only
This service can be used to identify spots in a text (parts of the text that mention Wikipedia entities) without their linked entities.
Each spot is weighted using a factor, called link probability, that measures the reliability of that substring as a significant mention, and this value can be used to refine the returned spots via a post-processing phase.
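The post-processing phase can be as simple as a threshold on the link probability. The Python sketch below assumes that each returned spot is a JSON object with a "spot" string and a numeric "lp" field holding the link probability; both field names and the sample values are assumptions for illustration.

```python
def select_spots(spots, min_lp=0.1):
    """Return spots with link probability >= min_lp, most reliable first."""
    # "lp" is an assumed field name for the link probability.
    reliable = [s for s in spots if s.get("lp", 0.0) >= min_lp]
    return sorted(reliable, key=lambda s: s["lp"], reverse=True)

# Hypothetical spots with made-up link probabilities.
spots = [
    {"spot": "President Obama", "lp": 0.6},
    {"spot": "poll", "lp": 0.05},
    {"spot": "Mitt Romney", "lp": 0.9},
]

selected = select_spots(spots, min_lp=0.1)
print([s["spot"] for s in selected])
```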
Endpoint URL
https://tagme.d4science.org/tagme/spot
Parameters
- text - required - the text where mentions have to be identified, using UTF-8 encoding. We recommend using POST if the text is very long, as the limit for GET requests is set to 8 KB, while the limit for POST requests is set to 2 MB.
- lang - optional - The language of the text to be annotated. Accepted values are de for German, en for English and it for Italian. Default is en.
- tweet - optional - Enable the special parser for Twitter messages. This parser has been designed to better handle the usual entities found in tweets, such as URLs, user mentions and hashtags. When this option is enabled, the text parameter can contain the JSON dump of the tweet as directly retrieved from Twitter. Refer to the Twitter API for further details. Supported values are true and false, default is false.
Example
- gcube-token=<your Service Authorization Token>
- text=Recent poll show President Obama opening up a small lead over GOP rival Mitt Romney
- lang=en
- tweet=true
This corresponds to the GET request:
https://tagme.d4science.org/tagme/spot?lang=en&gcube-token=<your Service Authorization Token>&tweet=true&text=Recent poll show President Obama opening up a small lead over GOP rival Mitt Romney
HTTP Errors
- 501 (NOT IMPLEMENTED) - The resource you requested is not a valid TagMe service.
- 401 (UNAUTHORIZED) - You haven't provided a Service Authorization Token or it is not valid.
- 400 (BAD REQUEST) - There are issues with the parameters you have sent (or not sent). Check the response message for details.
- 500 (INTERNAL SERVER ERROR) - We have experienced an issue with your request. Please report this error to tagme [at] di [dot] unipi [dot] it.
How to compute entity relatedness
This service computes the relatedness between two entities by returning a value in the range [0,1], which expresses how much the two entities are semantically related to each other, where 0 = not related, 1 = related.
We point out that this service can be used to relate two texts by first annotating them with TagMe and then estimating the relatedness of every pair of their annotated entities. These values can then be combined in some way (e.g. average or maximum) to derive a single value that expresses the relatedness between the two input texts.
Such a measure can be very powerful, particularly when dealing with short and poorly composed texts that do not share any term (a case in which the classic Tf-Idf scheme would fail).
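As an illustration of the combination step, the Python sketch below aggregates pairwise entity scores into one text-level score. It assumes the pairwise values have already been obtained from the relatedness service and stored in a dictionary keyed by entity pair; the function name, the dictionary layout and the scores are all hypothetical.

```python
from itertools import product
from statistics import mean

def text_relatedness(entities_a, entities_b, pair_scores, combine=mean):
    """Combine the relatedness of all cross-text entity pairs into one value."""
    scores = [pair_scores[(a, b)] for a, b in product(entities_a, entities_b)]
    # With no entities on one side there is nothing to compare.
    return combine(scores) if scores else 0.0

# Hypothetical relatedness values in [0, 1] for the entity pairs.
pair_scores = {
    ("Academy_Award", "James_Cameron"): 0.5,
    ("Academy_Award", "David_Cameron"): 0.25,
}

text_a = ["Academy_Award"]
text_b = ["James_Cameron", "David_Cameron"]
average_score = text_relatedness(text_a, text_b, pair_scores)
best_score = text_relatedness(text_a, text_b, pair_scores, combine=max)
print(average_score, best_score)  # 0.375 0.5
```

The average rewards texts whose entities are related across the board, while the maximum only requires one strongly related pair; which one is preferable depends on the application.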
A single request to this API can contain up to 100 entity pairs. Entities are identified by means of a numeric value: namely, the Wikipedia internal identifier of the page corresponding to an entity.
Endpoint URL
https://tagme.d4science.org/tagme/rel
Parameters
- lang - optional - The language of the text to be annotated. Accepted values are de for German, en for English and it for Italian. Default is en.
- id - optional/required, repeated - A pair of numeric entity identifiers, like the ones returned by the tagging service above. The pair is encoded as a string in which the two page IDs are separated by a space character. Either this parameter or the tt parameter must be specified in the request. To request multiple relatedness computations, repeat this parameter once for each requested pair. If one occurrence of the tt parameter is found in the request, any value provided using this parameter will be ignored.
- tt - optional/required, repeated - A pair of entity titles, like the ones returned by the tagging service above (namely, the titles of the corresponding Wikipedia pages). The pair is encoded as a string in which the space characters inside each title are replaced by underscores and the two titles are separated by a space character. Either this parameter or the id parameter must be specified in the request. To request multiple relatedness computations, repeat this parameter once for each requested pair. If one occurrence of the id parameter is found in the request, any value provided using this parameter will be ignored.
Example
- gcube-token=<your Service Authorization Token>
- lang=en
- tt=Linked_data Semantic_Web
- tt=University_of_Pisa Massachusetts_Institute_of_Technology
- tt=Academy_Award James_Cameron
- tt=Downing_Street David_Cameron
- tt=Academy_Award David_Cameron
- tt=Downing_Street James_Cameron
- tt=a_wrong_page_title Univeristy_of_Pisa
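The encoding of title pairs and the limit of 100 pairs per request can be handled as in the following Python sketch. It builds one GET URL per batch of at most 100 pairs, repeating the tt parameter for each pair. The helper names are illustrative and not part of TagMe, and TOKEN stands for your Service Authorization Token.

```python
from urllib.parse import urlencode

REL_ENDPOINT = "https://tagme.d4science.org/tagme/rel"
MAX_PAIRS = 100  # documented maximum number of entity pairs per request

def encode_title_pair(title_a, title_b):
    """Replace spaces inside each title with underscores, then join with a space."""
    return title_a.replace(" ", "_") + " " + title_b.replace(" ", "_")

def build_rel_urls(title_pairs, token, lang="en"):
    """Return one GET URL per batch of at most MAX_PAIRS title pairs."""
    urls = []
    for start in range(0, len(title_pairs), MAX_PAIRS):
        batch = title_pairs[start:start + MAX_PAIRS]
        params = [("lang", lang), ("gcube-token", token)]
        # A list of tuples lets the same parameter name appear several times.
        params += [("tt", encode_title_pair(a, b)) for a, b in batch]
        urls.append(REL_ENDPOINT + "?" + urlencode(params))
    return urls

pairs = [("Linked data", "Semantic Web"), ("Academy Award", "James Cameron")]
urls = build_rel_urls(pairs, "TOKEN")
print(urls[0])
```

Splitting into batches on the client side avoids the 413 error returned when a single request contains too many pairs.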
HTTP Errors
- 501 (NOT IMPLEMENTED) - The resource you requested is not a valid TagMe service.
- 401 (UNAUTHORIZED) - You haven't provided a Service Authorization Token or it is not valid.
- 400 (BAD REQUEST) - There are issues with parameters you have sent (or not sent). Check the response message for details.
- 500 (INTERNAL SERVER ERROR) - There was an issue with the request. Please report this error to tagme [at] di [dot] unipi [dot] it.
- 413 (REQUEST ENTITY TOO LARGE) - The request asks for the relatedness of too many entity pairs.
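A client can translate these status codes into readable messages. The Python sketch below shows one possible way to do so with the standard library; the messages paraphrase the lists on this page, and the names explain_status and fetch are illustrative rather than part of any TagMe client.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Status codes documented on this page, with short explanations.
TAGME_ERRORS = {
    400: "Bad request: check the parameters and the response message.",
    401: "Unauthorized: the Service Authorization Token is missing or invalid.",
    413: "Request entity too large: too many entity pairs in one request.",
    500: "Internal server error: report the problem to the TagMe team.",
    501: "Not implemented: the requested resource is not a valid TagMe service.",
}

def explain_status(code):
    """Map an HTTP status code to the explanation given in the documentation."""
    return TAGME_ERRORS.get(code, f"Unexpected HTTP status {code}")

def fetch(url_or_request):
    """Send a request and return the body, raising a readable error on failure."""
    try:
        with urlopen(url_or_request) as response:
            return response.read().decode("utf-8")
    except HTTPError as error:
        raise RuntimeError(explain_status(error.code)) from error

print(explain_status(401))
```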
Credits and References
The first version of TagMe was released in 2010 and is described in two papers by Ugo Scaiella and Paolo Ferragina, which appeared in the proceedings of ACM CIKM 2010 and in IEEE Software. We have since improved that version and devised several successful applications of the annotation tool, publishing our results in the proceedings of three major international conferences: text classification (ECIR 2012), text/snippet clustering (WSDM 2012), and hashtag classification and disambiguation (AAAI ICWSM 2015). TagMe and its applications have also received two Google Faculty Awards, in 2010 and 2013.
In August 2012, we introduced major enhancements to the annotation engine and made new services available, improving the flexibility, precision and speed of TagMe. The current version of TagMe annotates text in three languages: English, Italian and German. Other languages will be added in the future.