Create a Harvest Query

From the Harvester > Query tab, you can harvest using either a:

  • Search query (default), or
  • List of URLs.

Quick Summary

To harvest from the internet:

  1. Enter either a search query or a list of URLs. See Search Query and URL List.

  2. Select a Collection to save the results. See Select a Collection.

  3. Select Harvest.

    Result: The Harvest Jobs tab is displayed showing the progress of the harvest job. See View Harvest Jobs.

Load a Previous Query

You can quickly load previously used queries from the:

  • Saved tab, select the Open Query button for the query you want to open.

  • Batch tab, select the Edit button to view the scheduled queries and select the Open Query button.

  • Harvest Jobs tab, in the Query Input column, select the Open this search facet button.

Search Query

Enter the terms or phrase you want to search for in the Text Queries field.

Choose Search Engines

Select the search engines to use.

Search Parameters (optional)

Expand the Advanced section to view and modify the search parameters for each selected search engine.

Ignore URLs: This query only (optional)

You can identify URLs to exclude from the search results for this query by listing them in the Ignore URLs field.

You can add URLs:

  • manually by copying and pasting URLs into the field, or

  • automatically by:

    1. selecting Preview & Refine to view the search results.

      See Preview & Refine (optional).

    2. unselecting any unwanted URLs.
      Result: Unselected URLs are added to the Ignore URLs field automatically.

You can also create a Blacklist to exclude URLs from all queries run in the project. The Blacklist allows the use of wildcards. See Blacklist: Project-wide (optional).

URL List

To harvest from a list of URLs, select the URL List option.

You can:

  • copy and paste URLs into the field, or

  • select Add... to load URLs saved using the Sintelix Extension Add URL to Store button.

    The Add... option is only displayed if URLs have been saved.

    See Add URL to Store.

  • load a saved query from the Saved tab by selecting the Open Query button for the query you want to open.

The example below lists five news sites about the 2024 presidential election.

  • Each URL is on a separate line.

  • Invalid URLs will be ignored.
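
The two rules above (one URL per line, invalid URLs ignored) can be illustrated with a minimal Python sketch. This is not how Sintelix validates URLs internally. The function name `parse_url_list` and the validity rule used here (an `http` or `https` scheme plus a host name) are assumptions made for illustration only.

```python
from urllib.parse import urlparse


def parse_url_list(text):
    """Return the usable URLs from a block of text with one URL per line.

    Assumption: a line counts as valid when it parses with an http or
    https scheme and a host name. The product's own rule may differ.
    """
    valid = []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        try:
            parts = urlparse(candidate)
        except ValueError:
            continue  # malformed input is ignored, not reported
        if parts.scheme in ("http", "https") and parts.netloc:
            valid.append(candidate)
    return valid
```

A check like this can be handy for cleaning a list before pasting it into the field, so that you know in advance which lines are likely to be skipped.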

Query Options

Select Persona (optional)

You need a Persona to harvest content from sites that require login credentials, such as social media sites.

Select the Persona from the dropdown list, if required.

To create or update Personas, see Manage Personas.

Rule Set Options (optional)

By default, all the Harvester Rule Sets configured for the project are enabled.

Expand the Rule Set Options section to view a list of all Rule Sets.

You can:

  • disable all or enable all by selecting the checkbox at the top of the Rule Set list (this can be useful when you only want to have one Rule Set enabled).

  • individually disable or enable a Rule Set by selecting the checkbox next to the Rule Set name.

  • modify the depth (the number of links followed from one page to the next before stopping). The higher the depth, the more pages are harvested. See Concept: Harvest Depth. The Rule Set default is displayed; you can override it by typing a number in the field.
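
To show why a higher depth harvests more pages, here is a minimal Python sketch of a depth-limited traversal over a small in-memory link graph. It is an illustration only, not the Harvester's implementation. The function name `pages_within_depth`, the `links` dictionary, and the convention that the starting page is at depth 0 are all assumptions; see Concept: Harvest Depth for how the product counts depth.

```python
from collections import deque


def pages_within_depth(links, start, depth):
    """Return the pages reachable from `start` by following at most `depth` links.

    `links` maps each page to the pages it links to (hypothetical data).
    Assumption: the starting page itself is at depth 0.
    """
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, level = queue.popleft()
        order.append(page)
        if level == depth:
            continue  # depth limit reached: do not follow links from this page
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return order


# Hypothetical site: a links to b and c, b links to d, d links to e.
site = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
```

With this sample graph, a depth of 1 collects three pages and a depth of 2 collects four, which mirrors the growth you can expect when raising the depth on a real site.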

Search Engine Rule Sets are only shown in the Search Engines section.

Save Query (optional)

Select Save to save the Query so it can be easily used again in the future.

Result: The query is saved and listed under the Saved tab.

Copy into Batch (optional)

Select Copy into Batch to copy the query into a new scheduled harvest or add it to an existing scheduled harvest.

Result: The query is saved and listed under the Batch tab.

See Schedule a Harvest.

Preview & Refine (optional)

Select the Preview & Refine button to see the query results without harvesting the content.

Result: The query results are displayed in the Preview & Refine tab.

The results displayed in the Preview & Refine tab will expire after 8 hours.

Any results that have been ignored because the URL is blacklisted will be greyed out and marked with black numbers.

Actions

From the Preview & Refine tab, you can add URLs to ignore from the results for:

  • this query only, by unselecting the checkbox next to the results. As each result is unselected, the URL is added to the Ignore URLs field (see Ignore URLs: This query only (optional)).

  • all queries for this Project by unselecting the checkbox next to the results and then selecting Add Unselected to Project Blacklist.

    Result: The unselected URLs are added to the Blacklist and the results are greyed out with black numbers.

    You can view and modify the Blacklist by selecting the Blacklist button. In the Blacklist dialog, you can edit the URLs using wildcards to make them more generic. See Blacklist: Project-wide (optional).
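
As a rough illustration of how a wildcard makes a blacklisted URL more generic, the Python sketch below matches URLs against patterns in which `*` stands for any run of characters. The exact wildcard syntax supported by the Blacklist dialog is not described here, so the `*` semantics, the function names, and the sample patterns are assumptions.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a pattern where `*` matches any run of characters.

    Every other character, including `?` and `.`, is matched literally.
    """
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(pieces))


def is_blacklisted(url, patterns):
    """Return True when the whole URL matches at least one pattern."""
    return any(wildcard_to_regex(p).fullmatch(url) for p in patterns)


# Hypothetical patterns: one exact section of a site, one family of hosts.
patterns = ["https://example.com/ads/*", "*.tracker.example/*"]
```

Replacing the tail of a specific URL with `*`, as in the first pattern, excludes every page under that path instead of a single page.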

Select a Collection

Select the Collection in which to store the results.

Create a New Collection

To create a new Collection:

  1. Select the Add Collection button.

    Result: The Create a new Document Collection dialog is displayed.

  2. Select the Ingestion Configuration to apply to the Collection.

  3. Enter a name for the Collection.

  4. Apply security for the Collection, if required.

    See Collection/Network Security.

  5. Select Create.

    Result: The Collection is created and selected in the Collection for results dropdown.

Harvest Options

Harvest Parameters

Enable the required harvest options:

  • Harvest Full Page (content and boiler plate elements) - this is useful when you are testing Rule Sets or setting up a Gold Standard Collection.

  • Harvest All IMGs - this will harvest all images within the elements being harvested within all Harvester Rule Sets. Images in elements not being harvested will still be excluded.

  • Capture Screenshots - this creates screenshots of the websites that are harvested. This is useful if you need to keep a record of the page at the time it was harvested as evidence.

  • Disable Adblocker - this disables the installed adblock capability, which can help the harvest run faster but may capture ads within the harvested content, depending on the Rule Set settings applied.

Blacklist: Project-wide (optional)

You can create a list of URLs to exclude from all harvest jobs in the project.

You can also create a list of URLs to exclude for just this query. See Ignore URLs: This query only (optional).

View/Modify the Blacklist

You can view and modify the project-wide Blacklist by selecting the Blacklist button.

Add URLs to the Blacklist

You can enter URLs:

  • manually by copying and pasting URLs into the blacklist, or

  • automatically by:

    1. selecting Preview & Refine to view the search results,

    2. unselecting any unwanted URLs, and

    3. selecting Add Unselected to Project Blacklist.

      Result: The unselected URLs are added to the Blacklist and the results in the Preview & Refine tab are greyed out with black numbers.

Random Wait (optional)

Setting a Random Wait time adds a delay between the page requests to the same domain.

Why add a Random Wait time

Many websites have measures to detect and block web scraping activities, such as measuring the frequency of requests from a single IP address, which could result in an IP address being temporarily or permanently blocked. Random Wait times can mimic human behaviour and potentially avoid detection.

Random Wait Time In Harvester Configuration

A default Random Wait time is set in the Harvester Configuration and can be modified by the Administrator. A Random Wait time set using this feature overrides the Administrator's Harvester Configuration settings.

Set a Random Wait Time

To set a Random Wait Time:

  1. Select Random Wait.

    Result: The Random Wait Time dialog is displayed.

  2. Enter the domains.

  3. Set the wait time range, in seconds.

  4. To add a separate group of domains, select Add New Group.

  5. To remove a separate group of domains, select Delete.

  6. Select Save.
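
The idea behind the dialog (groups of domains, each with its own delay range) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Harvester's code. The `WAIT_GROUPS` structure, the function names, the sample domains and ranges, and the choice to treat subdomains as part of their parent domain are all hypothetical.

```python
import random
from urllib.parse import urlparse

# Hypothetical groups: each lists its domains and a (min, max) delay in seconds.
WAIT_GROUPS = [
    {"domains": ["example.com", "example.org"], "range": (2.0, 5.0)},
    {"domains": ["example.net"], "range": (10.0, 20.0)},
]
DEFAULT_RANGE = (0.0, 0.0)  # assumed fallback when no group matches


def wait_range_for(url, groups=WAIT_GROUPS, default=DEFAULT_RANGE):
    """Return the (min, max) delay range that applies to the URL's host."""
    host = urlparse(url).hostname or ""
    for group in groups:
        for domain in group["domains"]:
            # Assumption: www.example.com is covered by the example.com entry.
            if host == domain or host.endswith("." + domain):
                return group["range"]
    return default


def random_wait_seconds(url, rng=random):
    """Pick a random delay, in seconds, within the range for the URL's domain."""
    low, high = wait_range_for(url)
    return rng.uniform(low, high)
```

A crawler would call something like `time.sleep(random_wait_seconds(url))` before each request to the same domain. Drawing a fresh value every time, rather than using a fixed delay, is what makes the request pattern less regular.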

Set as Default

Set as Default saves the following settings for your current project:

  • Search Engine selections and parameters
  • Rule Set Options
  • Collection
  • Harvest Parameters

This default is applied when you access Harvester.

The default is required for the Harvest Now feature, which is available from the Documents Pane and the Node Table. See View the Node Table.