The generated API clients are a work in progress. You can also find our stable clients in the Algolia documentation.

Crawler API (1.0.0)

The Crawler API lets you manage and run your crawlers.

Base URL

The base URL for making requests to the Crawler API is:

  • https://crawler.algolia.com/api

All requests must use HTTPS.

Availability and authentication

Access to the Crawler API is available with the Crawler add-on.

To authenticate your API requests, use the basic authentication header:

  • Authorization: Basic <credentials>

where <credentials> is the base64-encoded string <user-id>:<api-key>.

  • <user-id>. The Crawler user ID.
  • <api-key>. The Crawler API key.

You can find both in the Crawler dashboard. The Crawler dashboard and API key are different from the regular Algolia dashboard and API keys.
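
For example, here's a minimal sketch in Python using the requests library. It builds the Basic Authorization header from the Crawler user ID and API key; the credential values are placeholders.

import base64
import requests

CRAWLER_USER_ID = "YOUR_CRAWLER_USER_ID"  # placeholder: from the Crawler dashboard
CRAWLER_API_KEY = "YOUR_CRAWLER_API_KEY"  # placeholder: from the Crawler dashboard
BASE_URL = "https://crawler.algolia.com/api"

# Option 1: build the Basic authentication header yourself.
credentials = base64.b64encode(
    f"{CRAWLER_USER_ID}:{CRAWLER_API_KEY}".encode()
).decode()
headers = {"Authorization": f"Basic {credentials}"}

# Option 2: let requests build the same header via HTTP Basic authentication.
session = requests.Session()
session.auth = (CRAWLER_USER_ID, CRAWLER_API_KEY)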

Request format

Request bodies must be JSON objects.

Parameters

Parameters are passed as query parameters for GET requests, and in the request body for POST and PATCH requests.

Query parameters must be URL-encoded. Non-ASCII characters must be UTF-8 encoded.
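
For example, here's a sketch of a GET request with query parameters, again in Python with requests, which URL-encodes parameter values (including non-ASCII characters as UTF-8) for you. The /1/crawlers path for listing crawlers isn't shown in this extract and is an assumption.

import requests

BASE_URL = "https://crawler.algolia.com/api"
AUTH = ("YOUR_CRAWLER_USER_ID", "YOUR_CRAWLER_API_KEY")  # placeholders

# Query parameters go in the URL for GET requests; requests URL-encodes them.
response = requests.get(
    f"{BASE_URL}/1/crawlers",  # assumed path for "List crawlers"
    auth=AUTH,
    params={"name": "test crawler", "itemsPerPage": 20},
)
print(response.json())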

Response status and errors

The Crawler API returns JSON responses. Since JSON doesn't guarantee any specific ordering, don't rely on the order of attributes in the API response.

Successful responses return a 2xx status. Client errors return a 4xx status. Server errors are indicated by a 5xx status. Error responses have a message property with more information.
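
For example, here's a small error-handling sketch that checks the status code and reads the message property on failure. The exact error shape beyond message is an assumption.

import requests

def call_crawler_api(method: str, url: str, auth, **kwargs) -> dict:
    """Send a request and raise a readable error for 4xx/5xx responses."""
    response = requests.request(method, url, auth=auth, **kwargs)
    if response.status_code >= 400:
        # Error responses include a `message` property with more information.
        try:
            message = response.json().get("message", "")
        except ValueError:
            message = response.text
        raise RuntimeError(f"Crawler API error {response.status_code}: {message}")
    return response.json()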

Version

The current version of the Crawler API is version 1, as indicated by the /1/ in each endpoint's URL.

Actions

Actions operate on your crawlers: pause and unpause crawl schedules, start crawls, or test the crawler with specific URLs.

Unpause a crawler

Unpauses the specified crawler. Previously ongoing crawls will be resumed. Otherwise, the crawler waits for its next scheduled run.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Responses

Response samples

Content type
application/json
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}

Pause a crawler

Pauses the specified crawler.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Responses

Response samples

Content type
application/json
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}

Start a crawl

Starts or resumes a crawl.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Responses

Response samples

Content type
application/json
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}

Test crawling a URL

Tests a URL with the crawler's configuration and shows the extracted records.

You can override parts of the configuration to test your changes before updating the configuration.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Request Body schema: application/json
url
required
string

URL to test.

object

Crawler configuration to update. You can only update top-level configuration properties. To update a nested configuration, such as actions.recordExtractor, you must provide the complete top-level object such as actions.

Responses

Response samples

Content type
application/json
{}
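
For example, here's a sketch that tests a single URL while overriding one top-level configuration property. The /1/crawlers/{id}/test path and the config property name for the override are assumptions; the URL and override values are placeholders.

import requests

BASE_URL = "https://crawler.algolia.com/api"
AUTH = ("YOUR_CRAWLER_USER_ID", "YOUR_CRAWLER_API_KEY")  # placeholders
CRAWLER_ID = "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"

# Test one URL with a partial configuration override; assumed path.
response = requests.post(
    f"{BASE_URL}/1/crawlers/{CRAWLER_ID}/test",
    auth=AUTH,
    json={
        "url": "https://www.example.com/docs/getting-started",  # URL to test
        "config": {          # assumed property name for the override
            "maxDepth": 5,   # top-level properties are replaced as a whole
        },
    },
)
print(response.json())  # records extracted from the tested URL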

Crawl URLs

Crawls the specified URLs, extracts records from them, and adds them to the index. If a crawl is currently running (the crawler's reindexing property is true), the records are added to a temporary index.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Request Body schema: application/json
urls
required
Array of strings

URLs to crawl.

save
boolean

Whether to add the specified URLs to the extraUrls property of the crawler configuration. If unspecified, the URLs are added to extraUrls only if they haven't been indexed during the last reindex.

Responses

Response samples

Content type
application/json
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}
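
For example, here's a sketch that crawls two specific URLs and saves them to extraUrls. The /1/crawlers/{id}/urls/crawl path is an assumption; the URLs are placeholders.

import requests

BASE_URL = "https://crawler.algolia.com/api"
AUTH = ("YOUR_CRAWLER_USER_ID", "YOUR_CRAWLER_API_KEY")  # placeholders
CRAWLER_ID = "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"

# Crawl specific URLs and add them to the crawler's extraUrls; assumed path.
task = requests.post(
    f"{BASE_URL}/1/crawlers/{CRAWLER_ID}/urls/crawl",
    auth=AUTH,
    json={
        "urls": [
            "https://www.example.com/blog/new-post",
            "https://www.example.com/blog/updated-post",
        ],
        "save": True,  # add the URLs to the extraUrls configuration property
    },
).json()
print(task["taskId"])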

Configuration

In the Crawler configuration, you specify which URLs to crawl, when to crawl them, how to extract records from the crawled pages, and where to index the extracted records. The configuration is versioned, so you can always restore a previous version. It's easiest to make configuration changes in the Crawler dashboard. The editor has autocomplete and built-in validation so you can test your configuration changes before committing them.

Update crawler configuration

Updates the configuration of the specified crawler. Every time you update the configuration, a new version is created.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Request Body schema: application/json
required
Array of objects [ 1 .. 30 ] items

Instructions for processing crawled URLs.

Each action defines:

  • The targeted subset of URLs it processes.
  • What information to extract from the web pages.
  • The Algolia indices where the extracted records will be stored.

A single web page can match multiple actions. In this case, the crawler produces one record for each matched action.

appId
required
string

Algolia application ID where the crawler creates and updates indices. The Crawler add-on must be enabled for this application.

rateLimit
required
number [ 1 .. 100 ]

Number of concurrent tasks per second.

If processing each URL takes n seconds, your crawler can process rateLimit / n URLs per second.

Higher numbers mean faster crawls, but they also increase your bandwidth and server load.

apiKey
string

Algolia API key for indexing the records.

The API key must have the following access control list (ACL) permissions: search, browse, listIndexes, addObject, deleteObject, deleteIndex, settings, editSettings. The API key must not be the admin API key of the application. The API key must also have access to the indices the crawler is supposed to create. For example, if indexPrefix is crawler_, the API key must have access to all crawler_* indices.

exclusionPatterns
Array of strings <= 100 items

URLs to exclude from crawling.

externalData
Array of strings <= 10 items

References to external data sources for enriching the extracted records.

For more information, see Enrich extracted records with external data.

extraUrls
Array of strings <= 9999 items

URLs from where to start crawling.

These work like startUrls. URLs you crawl manually can be added to extraUrls.

boolean or Array of strings
ignoreNoFollowTo
boolean

Whether to ignore the nofollow meta tag or link attribute. If true, links with the rel="nofollow" attribute or links on pages with the nofollow robots meta tag will be crawled.

ignoreNoIndex
boolean

Whether to ignore the noindex robots meta tag. If true, pages with this meta tag will be crawled.

ignoreQueryParams
Array of strings <= 9999 items

Query parameters to ignore while crawling.

All URLs with the matching query parameters are treated as identical. This prevents indexing duplicate URLs that differ only in their query parameters.

ignoreRobotsTxtRules
boolean

Whether to ignore rules defined in your robots.txt file.

indexPrefix
string <= 64 characters

A prefix for all indices created by this crawler. It's combined with the indexName for each action to form the complete index name.

object

Initial index settings, one settings object per index.

This setting is only applied when the index is first created. Settings are not re-applied. This prevents overriding any settings changes after the index was created.

object

Function for extracting URLs for links found on crawled pages.

fetchRequest (object) or browserRequest (object) or oauthRequest (object)

Authorization method and credentials for crawling protected content.

maxDepth
number [ 1 .. 100 ]

Maximum path depth of crawled URLs. For example, if maxDepth is 2, https://example.com/foo/bar is crawled, but https://example.com/foo/bar/baz isn't. Trailing slashes increase the URL depth.

maxUrls
number [ 1 .. 15000000 ]

Maximum number of crawled URLs.

Setting maxUrls doesn't guarantee consistency between crawls because the crawler processes URLs in parallel.

boolean or Array of strings or object

Crawl JavaScript-rendered pages by rendering them with a headless browser.

Rendering JavaScript-based pages is slower than crawling regular HTML pages.

object

Options to add to all HTTP requests made by the crawler.

object

Safety checks for ensuring data integrity between crawls.

saveBackup
boolean

Whether to back up your index before the crawler overwrites it with new records.

schedule
string

Schedule for running the crawl, expressed in Later.js syntax. If omitted, you must start crawls manually.

  • The interval between two scheduled crawls must be at least 24 hours.
  • Times are in UTC.
  • Specify minutes explicitly: at 3:00 pm, not at 3 pm.
  • Use every 1 day instead of everyday.
  • To specify midnight, use at 12:00 pm.
  • If you omit the time, a crawl might start any time after midnight UTC.
sitemaps
Array of strings <= 9999 items

Sitemaps with URLs from where to start crawling.

startUrls
Array of strings <= 9999 items

URLs from where to start crawling.

Responses

Response samples

Content type
application/json
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}
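
For example, here's a sketch that updates the configuration with the required top-level properties. The /1/crawlers/{id}/config path is an assumption, the appId and apiKey values are placeholders, and the item schema for actions is collapsed in this extract, so its contents are only indicated by a comment.

import requests

BASE_URL = "https://crawler.algolia.com/api"
AUTH = ("YOUR_CRAWLER_USER_ID", "YOUR_CRAWLER_API_KEY")  # placeholders
CRAWLER_ID = "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"

config = {
    "appId": "YOUR_ALGOLIA_APP_ID",          # placeholder
    "apiKey": "YOUR_CRAWLER_INDEXING_KEY",   # key with the ACLs listed above, not the admin key
    "rateLimit": 8,
    "indexPrefix": "crawler_",
    "startUrls": ["https://www.example.com"],
    "schedule": "every 1 day at 3:00 pm",
    "actions": [
        # One or more action objects (1..30 items). Their schema is collapsed
        # in this extract; define URL matching and record extraction here.
    ],
}

# Update the configuration; this creates a new configuration version. Assumed path.
requests.patch(f"{BASE_URL}/1/crawlers/{CRAWLER_ID}/config", auth=AUTH, json=config)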

List configuration versions

Lists previous versions of the specified crawler's configuration, including who authored the change. Every time you update the configuration of a crawler, a new version is added.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

query Parameters
itemsPerPage
integer [ 1 .. 100 ]
Default: 20

Number of items per page to retrieve.

page
integer [ 1 .. 100 ]
Default: 1

Page to retrieve.

Responses

Response samples

Content type
application/json
{
  "itemsPerPage": 20,
  "page": 1,
  "total": 100,
  "items": []
}

Retrieve a configuration version

Retrieves the specified version of the crawler configuration.

You can use this to restore a previous version of the configuration.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

version
required
integer

Version of the crawler configuration to retrieve.

Responses

Response samples

Content type
application/json
{
  "version": 1,
  "config": {},
  "createdAt": "2023-07-04T12:49:15Z",
  "authorId": "7d79f0dd-2dab-4296-8098-957a1fdc0637"
}
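
For example, here's a sketch that fetches version 1 of the configuration and reapplies it, which is one way to restore a previous version. The paths under /1/crawlers/{id}/config are assumptions.

import requests

BASE_URL = "https://crawler.algolia.com/api"
AUTH = ("YOUR_CRAWLER_USER_ID", "YOUR_CRAWLER_API_KEY")  # placeholders
CRAWLER_ID = "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"

# Fetch a previous configuration version; assumed path.
old = requests.get(
    f"{BASE_URL}/1/crawlers/{CRAWLER_ID}/config/versions/1", auth=AUTH
).json()

# Reapply it as the current configuration (this creates a new version); assumed path.
requests.patch(
    f"{BASE_URL}/1/crawlers/{CRAWLER_ID}/config", auth=AUTH, json=old["config"]
)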

Crawler

A crawler is an object with a name and a configuration. Use these endpoints to create, rename, and delete crawlers.

List crawlers

Lists all your crawlers.

query Parameters
appID
string

Algolia application ID for filtering the API response.

itemsPerPage
integer [ 1 .. 100 ]
Default: 20

Number of items per page to retrieve.

name
string <= 64 characters
Example: name=test-crawler

Name of the crawler for filtering the API response.

page
integer [ 1 .. 100 ]
Default: 1

Page to retrieve.

Responses

Response samples

Content type
application/json
{
  "itemsPerPage": 20,
  "page": 1,
  "total": 100,
  "items": []
}

Create a crawler

Creates a new crawler with the provided configuration.

Request Body schema: application/json
required
object

Crawler configuration.

name
required
string <= 64 characters

Name of the crawler.

Responses

Response samples

Content type
application/json
{
  "id": "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"
}

Retrieve crawler details

Retrieves details about the specified crawler, optionally with its configuration.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

query Parameters
withConfig
boolean

Whether the response should include the crawler's configuration.

Responses

Response samples

Content type
application/json
Example
{
  "name": "test-crawler",
  "createdAt": "2023-07-04T12:49:15Z",
  "updatedAt": "2023-07-04T12:49:15Z",
  "running": true,
  "reindexing": true,
  "blocked": true,
  "blockingError": "Error: Failed to fetch external data for source 'testCSV': 404\n",
  "blockingTaskId": "string",
  "lastReindexStartAt": null,
  "lastReindexEndedAt": null
}
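
For example, here's a sketch that fetches the crawler's details, including its configuration, via the withConfig query parameter. The /1/crawlers/{id} path is an assumption.

import requests

BASE_URL = "https://crawler.algolia.com/api"
AUTH = ("YOUR_CRAWLER_USER_ID", "YOUR_CRAWLER_API_KEY")  # placeholders
CRAWLER_ID = "e0f6db8a-24f5-4092-83a4-1b2c6cb6d809"

details = requests.get(
    f"{BASE_URL}/1/crawlers/{CRAWLER_ID}",  # assumed path
    auth=AUTH,
    params={"withConfig": "true"},          # include the configuration in the response
).json()
print(details["name"], details["running"], details["blocked"])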

Update crawler

Updates the crawler, either its name or its configuration.

Use this endpoint to rename the crawler. While you can also use it to completely replace the crawler's configuration, you should use the Update crawler configuration endpoint instead.

If you replace the configuration, you must provide the full configuration, including the settings you want to keep. Configuration changes from this endpoint aren't versioned.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Request Body schema: application/json
object

Crawler configuration.

name
string <= 64 characters

Name of the crawler.

Responses

Response samples

Content type
application/json
{
  "taskId": "98458796-b7bb-4703-8b1b-785c1080b110"
}

Retrieve crawler stats

Retrieves information about the number of crawled, skipped, and failed URLs.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

Responses

Response samples

Content type
application/json
{
  "count": 0,
  "data": []
}

Domains

List registered domains.

List registered domains

Lists registered domains.

Crawlers only run if their URLs match one of the registered domains.

query Parameters
appID
string

Algolia application ID for filtering the API response.

itemsPerPage
integer [ 1 .. 100 ]
Default: 20

Number of items per page to retrieve.

page
integer [ 1 .. 100 ]
Default: 1

Page to retrieve.

Responses

Response samples

Content type
application/json
{
  "itemsPerPage": 20,
  "page": 1,
  "total": 100,
  "items": []
}

Tasks

Retrieve task status

Retrieves the status of the specified task, whether it's pending or completed.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

taskID
required
string
Example: 98458796-b7bb-4703-8b1b-785c1080b110

Task ID.

Responses

Response samples

Content type
application/json
{
  "pending": true
}

Cancel a blocking task

Cancels a blocking task.

Tasks that ran into an error block further scheduled runs of your crawler. To unblock the crawler, cancel the blocking task.

path Parameters
id
required
string
Example: e0f6db8a-24f5-4092-83a4-1b2c6cb6d809

Crawler ID.

taskID
required
string
Example: 98458796-b7bb-4703-8b1b-785c1080b110

Task ID.

Responses

Response samples

Content type
application/json
{
  "error": {}
}