Harvest Documents

What is the Harvester?

The Sintelix Harvester can collect content from the web.

The content is added to a collection and processed as a Sintelix document.

Features
  • Sintelix Extension: You can harvest from the Harvester tab or by using the Sintelix Extension. See Harvest via Sintelix Extension.
  • Adblocker: Blocks ads to reduce unnecessary downloads.
  • Dark Web: The Harvester can be used to extract text from .onion sites using Tor. See Harvest the Dark Web.

Requirements

The Harvester requires a connection to a Sintelix Agent and internet connectivity.

To check the Harvester status, select the Harvester > Agent tab.

See Sintelix Agent Connections.

What you can do

Sintelix offers several methods for harvesting from web pages.

Harvester pane

When you select the Harvester tab, the following panes are displayed.

Harvester tabs

The table below describes the function of each tab.

Query (default)

Define a Harvest Job based on a search or URL list.

See Create a Harvest Query.

Saved 

List of saved queries, which can be quickly opened in the Query tab.

See Create a Harvest Query.

Batch 

Schedule queries to run automatically.

See Schedule a Harvest.

Personas 

Save login credentials to access protected sites.

See Manage Personas.

Agent

Displays the status of the Agent.

Welcome (default)

Displays the status of the Harvester.

Harvest Jobs 

Opens when a Harvest Job is started. Displays the progress of each page being harvested.

See View Harvest Jobs.

Preview & Refine 

Opens when you select Preview & Refine from the Query tab. Allows you to preview the query results without running a full Harvest Job. You can select URLs to exclude from the Harvest for:

  • the current query, or

  • all queries in the project, by adding the URLs to a Blacklist.

Harvest Page Load Log 

Displays a log record of each page harvested, including its status, the Rule Set applied and a link to the web page.

See View the Harvester Logs.