Document Processing

Introduction

Document processing is an integral part of the Ingestion configuration. It is used to define what information is extracted from your documents and how it is marked up in the document output.

Early, Mid and Late Stage

Dictionaries, Entity Extraction Scripts and Document Processing Scripts can be run at different stages of the process:

  • Early stage runs added dictionaries and entity extraction scripts before Sintelix’s Learned Entity Extractor.

  • Mid stage runs Document Processing Scripts immediately after the Learned Entity Extractor.

  • Late stage runs after everything else has been run.

As a general guide, Document Processing runs its components in the order in which they are listed on the configuration page.
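The stage order described above can be pictured as a simple pipeline. The following is a minimal sketch only: the function names (`named_step`, `run_stages`) and the dict-based document are hypothetical illustrations, not part of Sintelix, and each step does nothing except record that it ran.

```python
from typing import Callable, Dict, List

# Hypothetical model: a document is a dict and each step records its name.
Step = Callable[[Dict], None]


def named_step(name: str) -> Step:
    """Build a placeholder step that only records that it ran."""
    def step(doc: Dict) -> None:
        doc.setdefault("trace", []).append(name)
    return step


def run_stages(doc: Dict, early: List[Step], mid: List[Step], late: List[Step]) -> Dict:
    """Run early steps, then the learned extractor, then mid and late steps."""
    learned_entity_extractor = named_step("learned_entity_extractor")
    for step in early:
        step(doc)
    learned_entity_extractor(doc)
    for step in mid:
        step(doc)
    for step in late:
        step(doc)
    return doc


doc = run_stages(
    {},
    early=[named_step("early_dictionary"), named_step("early_extraction_script")],
    mid=[named_step("mid_processing_script")],
    late=[named_step("late_dictionary"), named_step("late_extraction_script")],
)
print(doc["trace"])
```

Within each stage, the list order in the sketch stands in for the order you set on the configuration page.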

Process

To configure Document Processing:

  1. Select Configurations > Document Processing

  2. Select the configuration you want to modify.

    See Manage Configurations for information on creating, copying, renaming, importing, exporting and deleting configurations.

  3. Complete or modify each section's settings, as described below.

  4. Select the Save button.

Enable built-in Entity Extraction

By default, document processing uses built-in entity extraction, which extracts common entities such as people, organisations and locations from your documents.

To disable the built-in entity extraction, clear the Enable Built-in Entity Extraction checkbox.

When disabled, the Dictionaries, Entity Extraction Scripts and Document Processing Scripts added below are used to apply entity extraction.

Enhanced Extraction

You can choose to use Enhanced entity extraction, if it is available.

Enhanced entity extraction uses a more sophisticated and more accurate extraction process. However, it requires GPUs, so it may not be available on the system running Sintelix. It may also run more slowly, depending on the processing power available.

To disable enhanced entity extraction, clear the Prefer Enhanced entity extraction checkbox.

When disabled, the default entity extraction engine is used.

Phrase Chunker

The Phrase Chunker is an advanced feature for dividing a sentence into sequences of semantically related words. Selecting the Phrase Chunker checkbox generates a new annotation type on the Text Graph.

Machine Learning

When editing a document, you can manually modify marked-up text references and connections.

To save these edits so that they are applied to future documents processed by this configuration, select the Enable Machine Learning checkbox.

Sintelix will save the text references in a Machine Learning dictionary and the connections in a Machine Learning entity extraction script.

Dictionaries (Early Stage)

Dictionaries added here will run before Sintelix's Learned Entity Extractor. Text References that have been created by the Dictionary in this stage will not be overridden by the Learned Entity Extractor.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

Entity Extraction Scripts (Early Stage)

Entity Extraction Scripts added here will run before Sintelix's Learned Entity Extractor. Text References that have been created by the Entity Extraction Script in this stage will not be overridden by the Learned Entity Extractor.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.
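The early-stage precedence rule for Dictionaries and Entity Extraction Scripts can be illustrated with a short sketch. It is not Sintelix code: the `TextRef` class and `apply_learned` function are hypothetical, and it assumes that "not overridden" means a learned candidate is dropped when it overlaps an early-stage Text Reference.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextRef:
    """Hypothetical text reference: a character span with a class and a source."""
    start: int
    end: int  # exclusive
    cls: str
    source: str


def overlaps(a: TextRef, b: TextRef) -> bool:
    """True when the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def apply_learned(early: List[TextRef], learned: List[TextRef]) -> List[TextRef]:
    """Keep every early-stage reference; add learned ones that do not overlap them."""
    kept = list(early)
    for candidate in learned:
        if not any(overlaps(candidate, ref) for ref in early):
            kept.append(candidate)
    return sorted(kept, key=lambda r: (r.start, r.end))


early_refs = [TextRef(0, 10, "Organisation", "dictionary")]
learned_refs = [
    TextRef(0, 4, "Person", "learned"),      # overlaps the dictionary match
    TextRef(20, 26, "Location", "learned"),  # no overlap, so it is kept
]
final_refs = apply_learned(early_refs, learned_refs)
```

In this sketch the dictionary match over characters 0 to 10 survives unchanged, and only the non-overlapping learned reference is added.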

Learned Entity Extraction Configuration

This section enables you to exclude specific Text References from the document output. You may either enter the name of the Text Reference class, or select it from an Ontology to add to the exclusion list.
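As a rough illustration of what an exclusion list does, the sketch below filters a list of Text References by class name. The dict layout and the `filter_excluded` function are hypothetical, and exact, case-sensitive matching of class names is an assumption.

```python
from typing import Dict, Iterable, List


def filter_excluded(refs: List[Dict[str, str]], excluded_classes: Iterable[str]) -> List[Dict[str, str]]:
    """Drop every reference whose class is on the exclusion list (exact match assumed)."""
    excluded = set(excluded_classes)
    return [ref for ref in refs if ref["class"] not in excluded]


refs = [
    {"text": "Jane Smith", "class": "Person"},
    {"text": "Canberra", "class": "Location"},
    {"text": "Acme Pty Ltd", "class": "Organisation"},
]
output = filter_excluded(refs, excluded_classes=["Location"])
```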

Document Processing Scripts (Mid Stage)

Document Processing Scripts added here will run after Sintelix's Learned Entity Extractor.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.

Dictionaries (Late Stage)

Dictionaries added here will run after Sintelix's Learned Entity Extractor. Text References created here can overlap Text References created by the Learned Entity Extractor.

Entity Extraction Scripts (Late Stage)

Entity Extraction Scripts added here will run after Sintelix's Learned Entity Extractor. Text References created here can overlap Text References created by the Learned Entity Extractor.

Scripts added here can refer to Text References created by the Learned Entity Extractor. They can also modify or delete existing Text References.

Click and drag to change the order.

Open configuration in new tab.

Click to remove item.
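The late-stage behaviour (overlapping additions, plus modifying or deleting existing Text References) can be sketched as follows. This is an illustration under assumptions, not the Sintelix scripting language: references are modelled as hypothetical `(start, end, class_name)` tuples, and a script is modelled as a function that returns a replacement reference, or `None` to delete it.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: (start, end, class_name), with end exclusive.
Ref = Tuple[int, int, str]


def apply_late_script(
    existing: List[Ref],
    script: Callable[[Ref], Optional[Ref]],
    new_refs: List[Ref],
) -> List[Ref]:
    """Let the script modify or delete existing references, then add new ones.

    New references are added even when they overlap existing ones.
    """
    result = []
    for ref in existing:
        updated = script(ref)
        if updated is not None:  # None means the script deleted the reference
            result.append(updated)
    result.extend(new_refs)
    return sorted(result)


def example_script(ref: Ref) -> Optional[Ref]:
    """Illustrative rules only: delete very short locations, rename one class."""
    start, end, cls = ref
    if cls == "Location" and end - start < 3:
        return None
    if cls == "Organisation":
        return (start, end, "Company")
    return ref


existing = [(0, 9, "Organisation"), (15, 17, "Location"), (20, 30, "Person")]
late = apply_late_script(existing, example_script, new_refs=[(0, 4, "Brand")])
```

Here the new `Brand` reference overlaps the reclassified span at characters 0 to 9, and both are kept.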

Sentiment Analysis

To configure Sentiment Analysis:

  1. Select the Enable Sentiment Analysis option in the Sentiment Analysis section.

  2. Select the Merge consecutive sentiments with the same polarity option, if required.

  3. Select the Create tags only if the attributor exists option, if required.

  4. Specify the ontology classes to which sentiments can be linked and attributed:

    • To prevent sentiments from being linked or attributed to an ontology class, remove the class from the relevant list by clicking the delete symbol beside its name.

    • To enable sentiments to be linked or attributed to an ontology class, enter the name of the class in the field under the relevant list, then click Add Tag.

  5. Select the Save button.

See Perform Sentiment Analysis for more information.
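To give a sense of what merging consecutive sentiments with the same polarity could mean, here is a minimal sketch. The tuple layout `(start, end, polarity)` and the `merge_consecutive` function are hypothetical, and the sketch assumes the sentiments are already sorted by position; how Sintelix performs the merge internally is not described here.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical representation: (start, end, polarity), sorted by position.
Sentiment = Tuple[int, int, str]


def merge_consecutive(sentiments: List[Sentiment]) -> List[Sentiment]:
    """Collapse each run of adjacent sentiments with the same polarity into one span."""
    merged = []
    for polarity, run in groupby(sentiments, key=lambda s: s[2]):
        run = list(run)
        merged.append((run[0][0], run[-1][1], polarity))
    return merged


sentiments = [
    (0, 5, "positive"),
    (6, 12, "positive"),
    (13, 20, "negative"),
    (21, 30, "positive"),
]
merged = merge_consecutive(sentiments)
```

Only adjacent sentiments are merged, so the final positive sentiment stays separate from the first two because a negative one sits between them.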

Document Processing Scripts

This is an advanced feature that is rarely needed. It exists to cover marginal use cases that require a modification of the standard Document Processing workflow.

Select the Save button.