The generated API clients are a work in progress. You can also find our stable clients in the Algolia documentation.

Crawler API (1.0.0)

The Crawler API lets you manage and run your crawlers.

Base URL

The base URL for making requests to the Crawler API is:

  • https://crawler.algolia.com/api

All requests must use HTTPS.

Availability and authentication

Access to the Crawler API is available with the Crawler add-on.

To authenticate your API requests, use the basic authentication header:

  • Authorization: Basic <credentials>

where <credentials> is the base64-encoded string <user-id>:<api-key>.

  • <user-id>. The Crawler user ID.
  • <api-key>. The Crawler API key.

You can find both in the Crawler dashboard. The Crawler dashboard and API key are different from the regular Algolia dashboard and API keys.
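
For example, here's a minimal sketch in Python using the requests library. It builds the Basic Authorization header from the Crawler user ID and API key; the credential values are placeholders.

import base64
import requests

CRAWLER_USER_ID = "YOUR_CRAWLER_USER_ID"  # placeholder: from the Crawler dashboard
CRAWLER_API_KEY = "YOUR_CRAWLER_API_KEY"  # placeholder: from the Crawler dashboard
BASE_URL = "https://crawler.algolia.com/api"

# Option 1: build the Basic authentication header yourself.
credentials = base64.b64encode(
    f"{CRAWLER_USER_ID}:{CRAWLER_API_KEY}".encode()
).decode()
headers = {"Authorization": f"Basic {credentials}"}

# Option 2: let requests build the same header via HTTP Basic authentication.
session = requests.Session()
session.auth = (CRAWLER_USER_ID, CRAWLER_API_KEY)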

Request format

Request bodies must be JSON objects.

Parameters

Parameters are passed as query parameters for GET requests, and in the request body for POST and PATCH requests.

Query parameters must be URL-encoded. Non-ASCII characters must be UTF-8 encoded.
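
For example, here's a sketch of a GET request with query parameters, again in Python with requests, which URL-encodes parameter values (including non-ASCII characters as UTF-8) for you. The /1/crawlers path for listing crawlers isn't shown in this extract and is an assumption.

import requests

BASE_URL = "https://crawler.algolia.com/api"
AUTH = ("YOUR_CRAWLER_USER_ID", "YOUR_CRAWLER_API_KEY")  # placeholders

# Query parameters go in the URL for GET requests; requests URL-encodes them.
response = requests.get(
    f"{BASE_URL}/1/crawlers",  # assumed path for "List crawlers"
    auth=AUTH,
    params={"name": "test crawler", "itemsPerPage": 20},
)
print(response.json())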

Response status and errors

The Crawler API returns JSON responses. Since JSON doesn't guarantee any specific ordering, don't rely on the order of attributes in the API response.

Successful responses return a 2xx status. Client errors return a 4xx status. Server errors are indicated by a 5xx status. Error responses have a message property with more information.
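
For example, here's a small error-handling sketch that checks the status code and reads the message property on failure. The exact error shape beyond message is an assumption.

import requests

def call_crawler_api(method: str, url: str, auth, **kwargs) -> dict:
    """Send a request and raise a readable error for 4xx/5xx responses."""
    response = requests.request(method, url, auth=auth, **kwargs)
    if response.status_code >= 400:
        # Error responses include a `message` property with more information.
        try:
            message = response.json().get("message", "")
        except ValueError:
            message = response.text
        raise RuntimeError(f"Crawler API error {response.status_code}: {message}")
    return response.json()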

Version

The current version of the Crawler API is version 1, as indicated by the /1/ in each endpoint's URL.

Actions

Actions operate on your crawlers: pause and unpause crawl schedules, start crawls, or test the crawler with specific URLs.

Unpause a crawler

Unpauses the specified crawler. Previously ongoing crawls will be resumed. Otherwise, the crawler waits for its next scheduled run.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Responses

Response samples

Content type
application/json
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}

Pause a crawler

Pauses the specified crawler.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Responses

Response samples

Content type
application/json
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}

Start a crawl

Starts or resumes a crawl.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Responses

Response samples

Content type
application/json
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}

Test crawling a URL

Tests a URL with the crawler's configuration and shows the extracted records.

You can override parts of the configuration to test your changes before updating the configuration.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Request Body schema: application/json
url
required
string

URL to test.

object

Crawler configuration to update. You can only update top-level configuration properties. To update a nested configuration, such as actions.recordExtractor, you must provide the complete top-level object such as actions.

Responses

Response samples

Content type
application/json
{}
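
For example, here's a sketch that tests a single URL while overriding one top-level configuration property. The /1/crawlers/{id}/test path and the config property name for the override are assumptions; the URL and override values are placeholders.

import requests

BASE_URL = "https://crawler.algolia.com/api"
AUTH = ("YOUR_CRAWLER_USER_ID", "YOUR_CRAWLER_API_KEY")  # placeholders
CRAWLER_ID = "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"

# Test one URL with a partial configuration override; assumed path.
response = requests.post(
    f"{BASE_URL}/1/crawlers/{CRAWLER_ID}/test",
    auth=AUTH,
    json={
        "url": "https://www.example.com/docs/getting-started",  # URL to test
        "config": {          # assumed property name for the override
            "maxDepth": 5,   # top-level properties are replaced as a whole
        },
    },
)
print(response.json())  # records extracted from the tested URL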

Crawl URLs

Crawls the specified URLs, extracts records from them, and adds them to the index. If a crawl is currently running (the crawler's reindexing property is true), the records are added to a temporary index.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Request Body schema: application/json
urls
required
Array of strings

URLs to crawl.

save
boolean

Whether to add the specified URLs to the extraUrls property of the crawler configuration. If unspecified, the URLs are added to extraUrls only if they haven't been indexed during the last reindex.

Responses

Response samples

Content type
application/json
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}
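
For example, here's a sketch that crawls two specific URLs and saves them to extraUrls. The /1/crawlers/{id}/urls/crawl path is an assumption; the URLs are placeholders.

import requests

BASE_URL = "https://crawler.algolia.com/api"
AUTH = ("YOUR_CRAWLER_USER_ID", "YOUR_CRAWLER_API_KEY")  # placeholders
CRAWLER_ID = "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"

# Crawl specific URLs and add them to the crawler's extraUrls; assumed path.
task = requests.post(
    f"{BASE_URL}/1/crawlers/{CRAWLER_ID}/urls/crawl",
    auth=AUTH,
    json={
        "urls": [
            "https://www.example.com/blog/new-post",
            "https://www.example.com/blog/updated-post",
        ],
        "save": True,  # add the URLs to the extraUrls configuration property
    },
).json()
print(task["taskId"])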

Configuration

In the Crawler configuration, you specify which URLs to crawl, when to crawl them, how to extract records from the crawled pages, and where to index the extracted records. The configuration is versioned, so you can always restore a previous version. It's easiest to make configuration changes in the Crawler dashboard. The editor has autocomplete and built-in validation so you can test your configuration changes before committing them.

Update crawler configuration

Updates the configuration of the specified crawler. Every time you update the configuration, a new version is created.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Request Body schema: application/json
required
Array of objects [ 1 .. 30 ] items

Instructions for processing crawled URLs.

Each action defines:

  • The targeted subset of URLs it processes.
  • What information to extract from the web pages.
  • The Algolia indices where the extracted records will be stored.

A single web page can match multiple actions. In this case, the crawler produces one record for each matched action.

appId
required
string

Algolia application ID where the crawler creates and updates indices. The Crawler add-on must be enabled for this application.

rateLimit
required
number [ 1 .. 100 ]

Number of concurrent tasks per second.

If processing each URL takes n seconds, your crawler can process rateLimit / n URLs per second.

Higher numbers mean faster crawls, but they also increase your bandwidth and server load.

apiKey
string

Algolia API key for indexing the records.

The API key must have the following access control list (ACL) permissions: search, browse, listIndexes, addObject, deleteObject, deleteIndex, settings, editSettings. The API key must not be the admin API key of the application. The API key must also have access to the indices the crawler is supposed to create. For example, if indexPrefix is crawler_, the API key must have access to all crawler_* indices.

exclusionPatterns
Array of strings <= 100 items

URLs to exclude from crawling.

externalData
Array of strings <= 10 items

References to external data sources for enriching the extracted records.

For more information, see Enrich extracted records with external data.

extraUrls
Array of strings <= 9999 items

URLs from where to start crawling.

These work like startUrls. URLs you crawl manually can be added to extraUrls.

boolean or Array of strings
ignoreNoFollowTo
boolean

Whether to ignore the nofollow meta tag or link attribute. If true, links with the rel="nofollow" attribute or links on pages with the nofollow robots meta tag will be crawled.

ignoreNoIndex
boolean

Whether to ignore the noindex robots meta tag. If true, pages with this meta tag will be crawled.

ignoreQueryParams
Array of strings <= 9999 items

Query parameters to ignore while crawling.

All URLs with the matching query parameters are treated as identical. This prevents indexing duplicate URLs that differ only in their query parameters.

ignoreRobotsTxtRules
boolean

Whether to ignore rules defined in your robots.txt file.

indexPrefix
string <= 64 characters

A prefix for all indices created by this crawler. It's combined with the indexName for each action to form the complete index name.

object

Initial index settings, one settings object per index.

This setting is only applied when the index is first created. Settings are not re-applied. This prevents overriding any settings changes after the index was created.

object

Function for extracting URLs for links found on crawled pages.

fetchRequest (object) or browserRequest (object) or oauthRequest (object)

Authorization method and credentials for crawling protected content.

maxDepth
number [ 1 .. 100 ]

Maximum path depth of crawled URLs. For example, if maxDepth is 2, https://example.com/foo/bar is crawled, but https://example.com/foo/bar/baz isn't. Trailing slashes increase the URL depth.

maxUrls
number [ 1 .. 15000000 ]

Maximum number of crawled URLs.

Setting maxUrls doesn't guarantee consistency between crawls because the crawler processes URLs in parallel.

boolean or Array of strings or object

Crawl JavaScript-rendered pages by rendering them with a headless browser.

Rendering JavaScript-based pages is slower than crawling regular HTML pages.

object

Options to add to all HTTP requests made by the crawler.

object

Safety checks for ensuring data integrity between crawls.

saveBackup
boolean

Whether to back up your index before the crawler overwrites it with new records.

schedule
string

Schedule for running the crawl, expressed in Later.js syntax. If omitted, you must start crawls manually.

  • The interval between two scheduled crawls must be at least 24 hours.
  • Times are in UTC.
  • Specify minutes explicitly: at 3:00 pm, not at 3 pm.
  • Use every 1 day instead of everyday.
  • To specify midnight, use at 12:00 pm.
  • If you omit the time, a crawl might start any time after midnight UTC.
sitemaps
Array of strings <= 9999 items

Sitemaps with URLs from where to start crawling.

startUrls
Array of strings <= 9999 items

URLs from where to start crawling.

Responses

Response samples

Content type
application/json
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}
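
For example, here's a sketch that updates the configuration with the required top-level properties. The /1/crawlers/{id}/config path is an assumption, the appId and apiKey values are placeholders, and the item schema for actions is collapsed in this extract, so its contents are only indicated by a comment.

import requests

BASE_URL = "https://crawler.algolia.com/api"
AUTH = ("YOUR_CRAWLER_USER_ID", "YOUR_CRAWLER_API_KEY")  # placeholders
CRAWLER_ID = "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"

config = {
    "appId": "YOUR_ALGOLIA_APP_ID",          # placeholder
    "apiKey": "YOUR_CRAWLER_INDEXING_KEY",   # key with the ACLs listed above, not the admin key
    "rateLimit": 8,
    "indexPrefix": "crawler_",
    "startUrls": ["https://www.example.com"],
    "schedule": "every 1 day at 3:00 pm",
    "actions": [
        # One or more action objects (1..30 items). Their schema is collapsed
        # in this extract; define URL matching and record extraction here.
    ],
}

# Update the configuration; this creates a new configuration version. Assumed path.
requests.patch(f"{BASE_URL}/1/crawlers/{CRAWLER_ID}/config", auth=AUTH, json=config)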

List configuration versions

Lists previous versions of the specified crawler's configuration, including who authored the change. Every time you update the configuration of a crawler, a new version is added.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

query Parameters
itemsPerPage
integer [ 1 .. 100 ]
Default: 20

Number of items per page to retrieve.

page
integer [ 1 .. 100 ]
Default: 1

Page to retrieve.

Responses

Response samples

Content type
application/json
{
  "itemsPerPage": 20,
  "page": 1,
  "total": 100,
  "items": []
}

Retrieve a configuration version

Retrieves the specified version of the crawler configuration.

You can use this to restore a previous version of the configuration.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

version
required
integer

Version of the crawler configuration to retrieve.

Responses

Response samples

Content type
application/json
{
  "version": 1,
  "config": {},
  "createdAt": "2023-07-04T12:49:15Z",
  "authorId": "7d79f0dd-2dab-4296-8098-957a1fdc0637"
}
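
For example, here's a sketch that fetches version 1 of the configuration and reapplies it, which is one way to restore a previous version. The paths under /1/crawlers/{id}/config are assumptions.

import requests

BASE_URL = "https://crawler.algolia.com/api"
AUTH = ("YOUR_CRAWLER_USER_ID", "YOUR_CRAWLER_API_KEY")  # placeholders
CRAWLER_ID = "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"

# Fetch a previous configuration version; assumed path.
old = requests.get(
    f"{BASE_URL}/1/crawlers/{CRAWLER_ID}/config/versions/1", auth=AUTH
).json()

# Reapply it as the current configuration (this creates a new version); assumed path.
requests.patch(
    f"{BASE_URL}/1/crawlers/{CRAWLER_ID}/config", auth=AUTH, json=old["config"]
)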

Crawler

A crawler is an object with a name and a configuration. Use these endpoints to create, rename, and delete crawlers.

List crawlers

Lists all your crawlers.

query Parameters
appID
string

Algolia application ID for filtering the API response.

itemsPerPage
integer [ 1 .. 100 ]
Default: 20

Number of items per page to retrieve.

name
string <= 64 characters
Example: name=test-crawler

Name of the crawler for filtering the API response.

page
integer [ 1 .. 100 ]
Default: 1

Page to retrieve.

Responses

Response samples

Content type
application/json
{
  "itemsPerPage": 20,
  "page": 1,
  "total": 100,
  "items": []
}

Create a crawler

Creates a new crawler with the provided configuration.

Request Body schema: application/json
required
object

Crawler configuration.

name
required
string <= 64 characters

Name of the crawler.

Responses

Response samples

Content type
application/json
{
  "id": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}

Retrieve crawler details

Retrieves details about the specified crawler, optionally with its configuration.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

query Parameters
withConfig
boolean

Whether the response should include the crawler's configuration.

Responses

Response samples

Content type
application/json
Example
{
  "name": "test-crawler",
  "createdAt": "2023-07-04T12:49:15Z",
  "updatedAt": "2023-07-04T12:49:15Z",
  "running": true,
  "reindexing": true,
  "blocked": true,
  "blockingError": "Error: Failed to fetch external data for source 'testCSV': 404\n",
  "blockingTaskId": "string",
  "lastReindexStartAt": null,
  "lastReindexEndedAt": null
}
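
For example, here's a sketch that fetches the crawler's details, including its configuration, via the withConfig query parameter. The /1/crawlers/{id} path is an assumption.

import requests

BASE_URL = "https://crawler.algolia.com/api"
AUTH = ("YOUR_CRAWLER_USER_ID", "YOUR_CRAWLER_API_KEY")  # placeholders
CRAWLER_ID = "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"

details = requests.get(
    f"{BASE_URL}/1/crawlers/{CRAWLER_ID}",  # assumed path
    auth=AUTH,
    params={"withConfig": "true"},          # include the configuration in the response
).json()
print(details["name"], details["running"], details["blocked"])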

Update crawler

Updates the crawler, either its name or its configuration.

Use this endpoint to rename the crawler. While you can also use it to completely replace the crawler's configuration, you should use the Update crawler configuration endpoint instead.

If you replace the configuration, you must provide the full configuration, including the settings you want to keep. Configuration changes from this endpoint aren't versioned.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Request Body schema: application/json
object

Crawler configuration.

name
string <= 64 characters

Name of the crawler.

Responses

Response samples

Content type
application/json
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}

Retrieve crawler stats

Retrieves information about the number of crawled, skipped, and failed URLs.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Responses

Response samples

Content type
application/json
{
  "count": 0,
  "data": []
}

Domains

List registered domains.

List registered domains

Lists registered domains.

Crawlers only run if their URLs match one of the registered domains.

query Parameters
appID
string

Algolia application ID for filtering the API response.

itemsPerPage
integer [ 1 .. 100 ]
Default: 20

Number of items per page to retrieve.

page
integer [ 1 .. 100 ]
Default: 1

Page to retrieve.

Responses

Response samples

Content type
application/json
{
  "itemsPerPage": 20,
  "page": 1,
  "total": 100,
  "items": []
}

Tasks

Retrieve task status

Retrieves the status of the specified task, whether it's pending or completed.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

taskID
required
string
Example: 98458796-b7bb-4703-8b1b-785c1080b110

Task ID.

Responses

Response samples

Content type
application/json
{
  "pending": true
}

Cancel a blocking task

Cancels a blocking task.

Tasks that ran into an error block further scheduled runs of your crawler. To unblock the crawler, cancel the blocking task.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

taskID
required
string
Example: 98458796-b7bb-4703-8b1b-785c1080b110

Task ID.

Responses

Response samples

Content type
application/json
{
  "error": {}
}