Full-Site Crawl

Overview

Single-page audits catch problems on the page you ask about. Full-site crawls catch problems you didn't know existed — the printer-friendly URL variant ranking instead of the canonical, the eight-hop redirect chain from a 2018 site migration, the 200 product pages with identical meta descriptions. These tools wrap DataForSEO's On-Page crawler with JavaScript rendering enabled, so single-page apps and React-rendered content crawl correctly.

The crawl is asynchronous: submit-onpage-crawl kicks off the job and returns a task_id; every other tool in this section reads from that task. Poll get-onpage-crawl-summary until it reports the crawl as complete, then drill in.
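
A minimal sketch of that lifecycle in Python, assuming a hypothetical call_tool() helper that invokes the named tool through your MCP client and returns its JSON payload as a dict — the transport and the response field names (task_id, crawl_progress) are assumptions, not confirmed API shapes:

```python
import time
from typing import Any

def call_tool(name: str, **params: Any) -> dict:
    """Stand-in for however your MCP client invokes a tool by name.
    Wire this to the real transport; it is illustrative only."""
    raise NotImplementedError

# Kick off the crawl and hold on to the task_id -- every other tool
# in this section reads from it.
task = call_tool("submit-onpage-crawl", domain="example.com", max_crawl_pages=100)
task_id = task["task_id"]

# Poll the summary until the crawl reports completion.
while True:
    summary = call_tool("get-onpage-crawl-summary", task_id=task_id)
    if summary and summary.get("crawl_progress") == "finished":
        break
    time.sleep(15)  # crawls typically finish in 1-5 minutes for <200 pages
```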

submit-onpage-crawl

Submits a full-site crawl for any domain. JavaScript and resource loading are on by default, so React, Vue, and Webflow sites render the way Google would see them. Each crawled page consumes credits — raise max_crawl_pages only when full-site coverage is required.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| domain | string | Yes | Domain to crawl ("example.com" or full URL) |
| max_crawl_pages | integer | No | Pages to crawl. Default 100, max 500 |

Example

"Crawl up to 300 pages of elysiumpools.com so I can audit the full site."

get-onpage-crawl-summary

Polls the crawl task for progress and overall health scores. Returns empty while the crawl is still running. Once populated, you get pages_crawled, an onpage_score across tech / content / links, and issue counts by category — the at-a-glance executive view of how the site is doing.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| task_id | string | Yes | The task_id returned by submit-onpage-crawl |

get-onpage-crawl-pages

The per-page table: status code, on-page score, title, meta description, word count, and the list of issues each page tripped. Paginated — pull totals from the summary first, then walk the full set with limit and offset.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| task_id | string | Yes | The task_id from the crawl |
| limit | integer | No | Pages per response. Default 100, max 100 |
| offset | integer | No | Zero-based offset for pagination |
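
A short pagination walk, for example — it reuses the hypothetical call_tool() helper from the first sketch and assumes the response carries an items list (an assumed field name); sorting ascending by onpage_score surfaces the worst pages first:

```python
# Walk the full per-page table in 100-row batches (limit is capped at 100).
pages: list[dict] = []
offset = 0
while True:
    batch = call_tool("get-onpage-crawl-pages",
                      task_id=task_id, limit=100, offset=offset)
    items = batch.get("items", [])  # `items` is an assumed field name
    pages.extend(items)
    if len(items) < 100:            # a short batch means we've hit the end
        break
    offset += 100

# Triage: the twenty lowest-scoring pages first.
worst = sorted(pages, key=lambda p: p.get("onpage_score", 100))[:20]
```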

get-onpage-duplicate-tags

Finds pages with duplicate title or meta description tags. Classic SEO cleanup target — duplicate tags signal weak on-page targeting and confuse search engines about which page should rank for which query. Most ecommerce and CMS sites have dozens of these.

"Show me all the pages with duplicate title tags from the elysiumpools.com crawl."

get-onpage-duplicate-content

Finds pages with substantially identical body content. Duplicate content dilutes ranking signals across URLs — typical culprits are faceted navigation (color/size filter URLs), session IDs in query strings, printer-friendly variants, and staging environments accidentally indexed.

Tips

  • Pair with get-onpage-non-indexable — many duplicates are already excluded via canonical, and that's fine (see the sketch after these tips).
  • Treat staging bleed-through as urgent; canonical tags don't always rescue an accidentally-indexed staging domain.
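
A sketch of that pairing — filter the duplicate-content list down to URLs that are still indexable, since those are the ones that need action. The items/url field names are assumptions, and call_tool() is the hypothetical helper from the first sketch:

```python
dup = call_tool("get-onpage-duplicate-content", task_id=task_id)
nonidx = call_tool("get-onpage-non-indexable", task_id=task_id)

# URLs already excluded from the index (canonical, noindex, robots.txt).
excluded = {p["url"] for p in nonidx.get("items", [])}

# Duplicates that are still indexable are the real cleanup targets.
actionable = [d for d in dup.get("items", []) if d.get("url") not in excluded]
```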

get-onpage-links

Every internal and external link discovered during the crawl: source page, target URL, anchor text, direction, and link attributes (rel, nofollow). Use it for internal linking analysis, broken-link detection, and outbound risk auditing.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| task_id | string | Yes | The task_id from the crawl |
| limit | integer | No | Links per response. Default 100, max 100 |
| offset | integer | No | Zero-based offset for pagination |
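
One common use — counting inbound internal links per target to surface near-orphan pages. Pagination is elided here, and the direction/target field names are assumptions on top of the hypothetical call_tool() helper:

```python
from collections import Counter

links = call_tool("get-onpage-links", task_id=task_id, limit=100, offset=0)

# Tally inbound internal links per target URL.
inbound = Counter(
    link["target"]
    for link in links.get("items", [])
    if link.get("direction") == "internal"
)

# Pages with one or zero inbound links are the cheapest linking wins.
near_orphans = [url for url, n in inbound.items() if n <= 1]
```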

get-onpage-redirect-chains

Multi-hop redirect chains and redirect loops discovered during the crawl. Each extra hop bleeds PageRank and slows mobile users — flatten anything two or more hops deep. Loops are urgent: they stop the page from rendering at all.

"List every redirect chain on elysiumpools.com that's two or more hops deep."

get-onpage-non-indexable

Pages blocked from indexing — robots.txt disallow, noindex meta, canonical pointing elsewhere, status 4xx/5xx. Use it for two opposite jobs: verify the exclusions you meant to make are working, and catch high-value pages accidentally blocked (most commonly a noindex left on a template by a developer).
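
A sketch of the "catch accidental blocks" half of that job — flag noindexed URLs under paths you expect to rank. The reason field and the path prefixes are illustrative assumptions:

```python
nonidx = call_tool("get-onpage-non-indexable", task_id=task_id)

# Paths that should normally be indexable -- tune these per site.
VALUABLE_PREFIXES = ("/products/", "/blog/", "/services/")

suspicious = [
    p for p in nonidx.get("items", [])
    if p.get("reason") == "noindex"                      # assumed field name
    and any(prefix in p["url"] for prefix in VALUABLE_PREFIXES)
]
```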

get-onpage-keyword-density

Keyword frequency and density for crawled pages. Set keyword_length to 1 for single words, 2 for bigrams, or 3 for trigrams. Two practical uses: confirm a page actually targets what you think it does, and catch over-optimization before Google does.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| task_id | string | Yes | The task_id from the crawl |
| keyword_length | integer | No | 1 = words, 2 = bigrams, 3 = trigrams. Default 1 |
| limit | integer | No | Entries per response. Default 100, max 100 |
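
For the over-optimization check, a sketch that flags bigrams above a density threshold — the ~3% cutoff is a rule of thumb, and the keyword/density field names are assumptions:

```python
terms = call_tool("get-onpage-keyword-density",
                  task_id=task_id, keyword_length=2, limit=100)

# Bigrams past ~3% density deserve a manual read for keyword stuffing.
stuffed = [t for t in terms.get("items", []) if t.get("density", 0) > 3.0]
for t in stuffed:
    print(f'{t["keyword"]}: {t["density"]:.1f}%')  # assumed field names
```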

get-onpage-raw-html

The raw HTML of a single crawled page. Use it when only the actual markup can confirm a signal — verifying meta tags, canonicals, hreflang declarations, JSON-LD schema, or whether a SPA actually rendered server-side. Cheaper than re-fetching the page yourself because the crawler already has it cached against the task_id.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| task_id | string | Yes | The task_id from the crawl |
| url | string | Yes | URL of the crawled page whose HTML you want |
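
A spot-check sketch using only the standard library — pull the cached HTML and read off the canonical and any JSON-LD types. The html response key is an assumption, and regex is fine for a one-off spot check; use a real HTML parser for anything systematic:

```python
import json
import re

page = call_tool("get-onpage-raw-html",
                 task_id=task_id, url="https://example.com/some-page")
html = page["html"]  # assumed response key

# Canonical: quick pattern match, assumes rel precedes href in the tag.
canonical = re.search(r'<link[^>]*rel="canonical"[^>]*href="([^"]+)"', html)
print("canonical:", canonical.group(1) if canonical else "MISSING")

# List every JSON-LD @type declared on the page.
for block in re.findall(
        r'<script[^>]*application/ld\+json[^>]*>(.*?)</script>', html, re.S):
    data = json.loads(block)
    if isinstance(data, dict):
        print("schema:", data.get("@type"))
```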

Full-Site Crawl Workflow

  1. submit-onpage-crawl with the domain and the page budget you can afford. Save the task_id.
  2. get-onpage-crawl-summary on a loop until it reports the crawl complete (typically 1–5 minutes for <200 pages).
  3. get-onpage-crawl-pages to walk the full set and identify the worst per-page scores.
  4. get-onpage-duplicate-tags and get-onpage-duplicate-content for the cleanup hit list.
  5. get-onpage-redirect-chains and get-onpage-non-indexable to catch architectural problems.
  6. get-onpage-links for internal linking analysis — the cheapest ranking lift on most sites.
  7. get-onpage-raw-html on any page where you need to verify markup-level signals (schema, hreflang, canonical chains).

Tips

  • The crawl is credit-priced per page. A 100-page baseline catches the obvious wins; 500 pages is for migration audits and pre-launch reviews, not weekly checks.
  • Combine with sync-site-audit for ongoing tracking — the on-page crawler is point-in-time, the site audit history shows trend.
  • If the crawl returns far fewer pages than the site has, the crawler is hitting noindex, robots.txt, or auth walls. get-onpage-non-indexable tells you which.