Document Processing

Document processing is an integral part of the Ingestion configuration. It is used to define what information is extracted from your documents and how it is marked up in the document output.

Dictionaries, Entity Extraction Scripts and Document Processing Scripts can be run at different stages of the process:

Early Stage runs the added dictionaries and entity extraction scripts before Sintelix’s Learned Entity Extractor.
Mid Stage runs Document Processing Scripts immediately after the Learned Entity Extractor.
Late Stage runs after everything else has been run.

As a general guide, Document Processing steps run in the order in which they are listed on the configuration page.
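The stage ordering described above can be pictured with a small sketch. This is only an illustrative model, not Sintelix code: the stage names are taken from the headings on the configuration page, while `STAGE_ORDER`, `run_order` and the item names are hypothetical.

```python
from typing import Dict, List

# Assumed ordering, based on the Early / Mid / Late Stage description.
# These names mirror the page headings; they are not a Sintelix API.
LEARNED = "Learned Entity Extractor"
STAGE_ORDER = [
    "Dictionaries (Early Stage)",
    "Entity Extraction Scripts (Early Stage)",
    LEARNED,
    "Document Processing Scripts (Mid Stage)",
    "Dictionaries (Late Stage)",
    "Entity Extraction Scripts (Late Stage)",
]


def run_order(configured: Dict[str, List[str]]) -> List[str]:
    """Return the steps in the order they would run under this model."""
    order: List[str] = []
    for stage in STAGE_ORDER:
        if stage == LEARNED:
            # The extractor is a single step with no user-added items.
            order.append(stage)
            continue
        # Items within a stage keep the order in which they were listed.
        order.extend(f"{stage}: {name}" for name in configured.get(stage, []))
    return order


if __name__ == "__main__":
    for step in run_order({
        "Dictionaries (Late Stage)": ["places"],
        "Dictionaries (Early Stage)": ["people"],
    }):
        print(step)
```

Whatever order the stages are supplied in, the early-stage dictionary is listed before the Learned Entity Extractor and the late-stage dictionary after it.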

To configure Ingestion:

Select Configurations > Document Processing
Select the configuration you want to modify.

See Manage Configurations for information on creating, copying, renaming, importing, exporting and deleting configurations.
Complete or modify each section's settings, as described below.
Select the Save button.

In its default state, document processing has built-in entity extraction that extracts common entities such as people, organisations and locations from your documents.

You can unselect the Enable Built-in Entity Extraction checkbox to disable the built-in entity extraction.

When disabled, the Dictionaries, Entity Extraction Scripts and Document Processing Scripts added below are used to apply entity extraction.

You can choose to use Enhanced entity extraction, if it is available.

Enhanced entity extraction uses a more sophisticated and accurate entity extraction process. However, it requires GPUs for processing. It may not be available, depending on the system running Sintelix. It may also run more slowly, depending on the processing power available.

You can unselect the Prefer Enhanced entity extraction checkbox to disable the enhanced entity extraction.

When disabled, the default entity extraction engine is used.

The Phrase Chunker is an advanced feature for dividing a sentence into sequences of semantically-related words. Selecting the Phrase Chunker checkbox will generate a new annotation type on the Text Graph.

When editing a document, you can manually modify marked-up text references and connections.

You can save these edits so they can be applied to future documents processed by this configuration, by selecting the Enable Machine Learning checkbox.

Sintelix will save the text references in a Machine Learning dictionary and the connections in a Machine Learning entity extraction script.

How Machine Learning Works

Machine Learning applies when you make edits to the markups in a document (entities, text references and links). See Edit a Document.

A message at the top of the Document pane confirms that Machine Learning is active.

Sintelix does the following:

When you create a new text reference, the exact text will be added to the ‘learned’ namespace in the Machine Learning dictionary (unless an identical entry already exists).
When you delete a text reference:

if it was in the ‘learned’ namespace of the Machine Learning dictionary, it is deleted from the dictionary
if it was not in the ‘learned’ namespace, it is added to the ‘removed’ namespace

When Sintelix applies the dictionary during document processing, any matching ‘removed’ namespace text references are removed first, and then any matching ‘learned’ text references are added (if they don't overlap existing markup). This removes any bad text references before new ones are added and classes are changed.
The dictionary will be applied immediately before Late Stage entity extraction scripts, so a custom script can be used to interact with and refine the result.
When you create a new connection between text references, Machine Learning will update the Machine Learning script (which is an entity extraction script). For each connection that is created this way, a very simple rule is added that matches the exact text of the connection.
If you remove a connection, Machine Learning will remove the matching rule from the script (if it was created by Machine Learning).
When Sintelix applies the script during document processing, the rules will add the Machine Learning connections but not remove any that you have deleted.
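The dictionary behaviour listed above can be sketched as follows. This is a minimal model under stated assumptions, not Sintelix's implementation: the class `LearningDictionary` and its methods are hypothetical, text references are reduced to `(start, end)` character spans, and ontology classes and connection rules are left out.

```python
class LearningDictionary:
    """Toy model of the 'learned' and 'removed' namespaces."""

    def __init__(self):
        self.learned = set()  # exact texts to add as text references
        self.removed = set()  # exact texts to strip from existing markup

    def on_create(self, text):
        # A set ignores duplicates, matching "unless an identical entry
        # already exists". Empty strings are skipped.
        if text:
            self.learned.add(text)

    def on_delete(self, text):
        if text in self.learned:
            self.learned.discard(text)  # undo something learned earlier
        elif text:
            self.removed.add(text)      # suppress it in future documents

    def apply(self, document, references):
        """Apply the dictionary to a list of (start, end) spans."""
        # Step 1: drop references whose exact text is in 'removed'.
        kept = [(s, e) for (s, e) in references
                if document[s:e] not in self.removed]
        # Step 2: add 'learned' matches that do not overlap what is left.
        for text in sorted(self.learned, key=len, reverse=True):
            start = document.find(text)
            while start != -1:
                end = start + len(text)
                if all(end <= s or start >= e for (s, e) in kept):
                    kept.append((start, end))
                start = document.find(text, end)
        return sorted(kept)


if __name__ == "__main__":
    d = LearningDictionary()
    d.on_create("Jane Roe")   # user marked up a new text reference
    d.on_delete("Paris")      # user deleted an unwanted text reference
    doc = "Acme Corp hired Jane Roe in Paris."
    paris = doc.index("Paris")
    print(d.apply(doc, [(paris, paris + len("Paris"))]))
```

Removing first and adding second means a span freed by the ‘removed’ namespace is available for a ‘learned’ entry in the same pass.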

Script rules created by Machine Learning are very basic. They simply consist of existing text references and the tokens between them. They will work as they are but it is recommended that you use them as a template for developing custom entity extraction scripts.

Dictionaries added here will run before Sintelix's Learned Entity Extractor. Text References that have been created by the Dictionary in this stage will not be overridden by the Learned Entity Extractor.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

Entity Extraction Scripts added here will run before Sintelix's Learned Entity Extractor. Text References that have been created by the Entity Extraction Script in this stage will not be overridden by the Learned Entity Extractor.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

This section enables you to exclude specific Text References from the document output. You may either enter the name of the Text Reference class, or select it from an Ontology to add to the exclusion list.

Document Processing Scripts added here will run after Sintelix's Learned Entity Extractor.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

Dictionaries added here will run after Sintelix's Learned Entity Extractor. Text References created here can overlap Text References created by the Learned Entity Extractor.

Entity Extraction Scripts added here will run after Sintelix's Learned Entity Extractor. Text References created here can overlap Text References created by the Learned Entity Extractor.

Scripts added here can refer to Text References created by the Learned Entity Extractor. Scripts added here can modify or delete existing Text References.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

To configure Sentiment Analysis:

Select the Enable Sentiment Analysis option in the Sentiment Analysis section.
Select the Merge consecutive sentiments with the same polarity option, if required.

Example

For example, if the following text was in a document:

Sample Text:

Australia is an exquisite, beautiful and affordable country.

The words "exquisite", "beautiful" and "affordable" are all positive sentiment terms. When the checkbox is selected, Sintelix combines the three words, including the separators between them, into a single merged term, and the merged term is linked to the target.

If the checkbox is not enabled, the result is:

Result:

Australia is an exquisite, beautiful and affordable country.

If the checkbox is enabled, the result is:

Result:

Australia is an exquisite, beautiful and affordable country.

This applies to both positive and negative sentiments.
Select the Create tags only if the attributor exists option if required.

Example

With this option selected, Sintelix only generates sentiment text references if it can detect both a source (the attributor) and a target for the sentiment.

For example, if the following text was in a document:

Sample Text:

Australia is poor.

If the checkbox is enabled, "poor" would not be marked up as a sentiment as there is no attributor.

However, if the text in the document was:

Sample Text:

Putin said Australia is poor.

Then the sentiment text reference would be marked up, because an attributor could be identified.

Result:

Putin said Australia is poor.
Specify the ontology classes to which sentiments can be linked and attributed:

To prevent sentiments from being linked or attributed to an ontology class, remove the class from the relevant list by clicking the delete symbol beside its name.
To enable sentiments to be linked or attributed to an ontology class, enter the name of the class in the field under the relevant list then click Add Tag.

Select the Save button.

See Perform Sentiment Analysis for more information.
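The two sentiment options described in this section can be pictured with a short sketch. This is only an illustration under stated assumptions, not how Sintelix performs sentiment analysis: the `Sentiment` record, the `SEPARATORS` set and both functions are hypothetical, and the sentiment terms are supplied by hand rather than detected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sentiment:
    start: int                        # character offsets into the text
    end: int
    polarity: str                     # "positive" or "negative"
    attributor: Optional[str] = None  # who expressed the sentiment, if known


# Assumption: the tokens allowed between two terms that are merged.
SEPARATORS = {",", "and", "or"}


def merge_consecutive(text: str, sentiments: List[Sentiment]) -> List[Sentiment]:
    """Merge neighbouring sentiments that share the same polarity."""
    merged: List[Sentiment] = []
    for s in sorted(sentiments, key=lambda x: x.start):
        if merged:
            prev = merged[-1]
            gap = text[prev.end:s.start].replace(",", " , ").split()
            if prev.polarity == s.polarity and all(t in SEPARATORS for t in gap):
                # Extend the previous span so it covers the separators too.
                merged[-1] = Sentiment(prev.start, s.end, prev.polarity,
                                       prev.attributor)
                continue
        merged.append(s)
    return merged


def require_attributor(sentiments: List[Sentiment]) -> List[Sentiment]:
    """Keep only sentiments for which an attributor was identified."""
    return [s for s in sentiments if s.attributor is not None]


if __name__ == "__main__":
    text = "Australia is an exquisite, beautiful and affordable country."
    terms = [Sentiment(text.index(w), text.index(w) + len(w), "positive")
             for w in ("exquisite", "beautiful", "affordable")]
    for s in merge_consecutive(text, terms):
        print(text[s.start:s.end])
```

In this model the three positive terms collapse into the single span "exquisite, beautiful and affordable", while a sentiment such as "poor" in "Australia is poor." is dropped by `require_attributor` because no attributor is attached.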

This is an advanced feature that rarely needs to be used. It exists to cover any marginal use cases that may require a modification of the standard Document Processing workflow.

Select the Save button.

Document Processing

Introduction

Early, Mid and Late Stage

Process

Enable built-in Entity Extraction

Enhanced Extraction

Phrase Chunker

Machine Learning

Dictionaries (Early Stage)

Entity Extraction Scripts (Early Stage)

Learned Entity Extraction Configuration

Document Processing Scripts (Mid Stage)

Dictionaries (Late Stage)

Entity Extraction Scripts (Late Stage)

Sentiment Analysis

Document Processing Scripts
