Entity Extraction

Entity Extraction, also known as Named Entity Recognition, is the process of identifying information of interest, such as people and places.

Text is structured into a Text Graph, Tokens and Nodes so that it can be analysed and marked up with Annotations. Entities, Text References and Links are marked up within a Document as Annotations on the Text Graph.
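
As a rough illustration of marking up text with Annotations, the Python sketch below splits a sentence into tokens and records each Entity as a labelled span of tokens. It is not Sintelix code and does not reflect the actual Text Graph format; the Token and Annotation classes, the tokenize and covered_text functions and the simple regular-expression tokeniser are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    start: int  # character offset where the token begins
    end: int    # character offset just past the token


@dataclass
class Annotation:
    label: str        # e.g. "Person", "Organisation", "Location"
    first_token: int  # index of the first token covered
    last_token: int   # index of the last token covered (inclusive)


def tokenize(text):
    """Split text into word and punctuation tokens, keeping character offsets."""
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def covered_text(text, tokens, annotation):
    """Return the original text covered by an annotation."""
    return text[tokens[annotation.first_token].start:
                tokens[annotation.last_token].end]


text = "John London works for Caltex in Sydney."
tokens = tokenize(text)

# Hand-written annotations for illustration; a real extractor would produce these.
annotations = [
    Annotation("Person", 0, 1),
    Annotation("Organisation", 4, 4),
    Annotation("Location", 6, 6),
]

for a in annotations:
    print(a.label, "->", covered_text(text, tokens, a))
```

Storing annotations as token spans rather than editing the text keeps the original Document unchanged, so several layers of markup can refer to the same tokens.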

Entity Extraction Models

There are different Entity Extraction models available. These models can vary in the speed and accuracy with which they perform Entity Extraction. Further, different models may be capable of performing Entity Extraction for a different range of languages.

Entity Extraction Modes

Sintelix provides two Entity Extraction modes:

  • Baseline: The Baseline Entity Extraction model is built into Sintelix. It can run on a standard computer using CPU processing.

  • Enhanced: The Enhanced Entity Extraction model is more accurate than the Baseline model, but slower. It must be installed separately and requires a computer with access to compatible GPUs.

Configuration

The Entity Extraction mode used during Ingestion is set in the Document Processing configuration (see Enhanced Extraction).

Language Pipelines

Every language has its own vocabulary and grammatical rules. Therefore, each language requires its own Entity Extraction method, called a pipeline. See Languages.

Language pipelines need to be licensed.

Select the Status tab to view which language pipelines are supported, installed and licensed. (Enhanced) displayed after the language name indicates that the Enhanced Entity Extraction mode is installed (see below).

See System Status Information.

The pipeline applied to a Document is displayed in the Document Properties section as the pipeline_type.

If Enhanced Entity Extraction has been applied, the pipeline_type will show the value enhanced.

By examining the context of the words, you can identify Connections between Entities.

For example:

  • an Organisation is based in a Location

  • a Person works for an Organisation

  • a Person is the uncle of another Person, etc.

These Connections are captured in Documents and become Links in Networks.
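
One way to picture a Connection is as a typed relationship between two Entities. The Python sketch below is illustrative only and is not Sintelix code or its API; the Entity and Connection classes, the describe function and the relation names (based_in, works_for) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kind: str  # e.g. "Person", "Organisation", "Location"
    name: str


@dataclass(frozen=True)
class Connection:
    source: Entity
    relation: str  # hypothetical relation name, e.g. "works_for"
    target: Entity


def describe(connection):
    """Render a connection as 'source --relation--> target'."""
    return (f"{connection.source.name} --{connection.relation}--> "
            f"{connection.target.name}")


person = Entity("Person", "John London")
organisation = Entity("Organisation", "Caltex")
location = Entity("Location", "Sydney")

connections = [
    Connection(organisation, "based_in", location),
    Connection(person, "works_for", organisation),
]

for c in connections:
    print(describe(c))
```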

Clustering

Clustering is a way of grouping similar information. Imagine analysing thousands of news reports and finding hundreds of references to "John London" who may or may not be the same person. How do you tell them apart?

Clustering looks for patterns, similarities and differentiators to distinguish between references and the connections between them.

For example:

  • John London from Sydney works for Caltex and is not married.

  • John London from Washington works for Microsoft and is married to Mary.

  • John London from Manchester is an accountant married to Maria, but it's not clear where he works.

Clustering the entities and links found in Documents results in a Network.
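
The idea of using differentiators can be sketched with a toy example. The Python below is not the clustering method Sintelix uses; it is a deliberately simple greedy rule, assumed for illustration, that puts two references in the same cluster only when none of the attributes known for both disagree. The compatible and cluster_references functions and the attribute names are hypothetical, and the fourth reference is invented to show a merge.

```python
def compatible(cluster_attrs, ref):
    """True if no attribute known to both the cluster and the reference differs."""
    return all(cluster_attrs[key] == value
               for key, value in ref.items() if key in cluster_attrs)


def cluster_references(refs):
    """Greedily assign each reference to the first cluster it does not contradict."""
    clusters = []  # each cluster: {"attrs": merged attributes, "members": [indices]}
    for index, ref in enumerate(refs):
        for cluster in clusters:
            if compatible(cluster["attrs"], ref):
                cluster["attrs"].update(ref)      # the cluster learns new attributes
                cluster["members"].append(index)
                break
        else:
            clusters.append({"attrs": dict(ref), "members": [index]})
    return clusters


refs = [
    {"name": "John London", "city": "Sydney", "employer": "Caltex", "spouse": None},
    {"name": "John London", "city": "Washington", "employer": "Microsoft", "spouse": "Mary"},
    {"name": "John London", "city": "Manchester", "spouse": "Maria"},
    {"name": "John London", "employer": "Microsoft"},  # invented extra reference
]

clusters = cluster_references(refs)
for cluster in clusters:
    print(cluster["members"], cluster["attrs"])
```

Here the three references from the list above stay apart because their cities differ, while the invented fourth reference joins the Washington cluster because its employer matches and nothing else conflicts. A greedy rule like this depends on the order of the references, so it only illustrates the principle.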

Network

A Network is a collection of Nodes and Links.

  • Nodes include Entities and Documents.

  • Links are connections between nodes, for example, from Entity to Entity or Entity to Document.

A Network allows users to analyse and visualise information extracted from a vast volume of data contained in Documents.
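
As a minimal sketch of the structure described above, a Network can be modelled as a set of typed Nodes plus a list of Links between them. The Python below is illustrative only and does not represent how Sintelix stores Networks; the Network class, its method names and the link types are hypothetical.

```python
class Network:
    """A toy network: typed nodes plus typed links between them."""

    def __init__(self):
        self.nodes = {}  # node id -> node type ("Entity" or "Document")
        self.links = []  # (source id, link type, target id)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_link(self, source, link_type, target):
        # A link may only join nodes that already exist in the network.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends of a link must be existing nodes")
        self.links.append((source, link_type, target))

    def neighbours(self, node_id):
        """Return the ids of all nodes linked to node_id, in sorted order."""
        found = set()
        for source, _, target in self.links:
            if source == node_id:
                found.add(target)
            elif target == node_id:
                found.add(source)
        return sorted(found)


network = Network()
network.add_node("John London", "Entity")
network.add_node("Caltex", "Entity")
network.add_node("report-1", "Document")

network.add_link("John London", "works_for", "Caltex")       # Entity to Entity
network.add_link("John London", "mentioned_in", "report-1")  # Entity to Document

print(network.neighbours("John London"))
```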