Website to LLM-Ready Doc

This automation crawls a website starting from a given URL, uses AI to extract the main content from each page, and compiles everything into a single, well-structured Google Doc.
In effect, it is an intelligent web scraper that organizes its findings into one convenient document. Here's how it works:
  1. Initiation: You provide a starting URL (seedUrl) and set your crawling preferences, such as the maximum number of pages to visit and how many levels of links to follow.
  2. Crawling: The automation begins at the seedUrl and systematically navigates from page to page. It intelligently keeps track of visited pages to avoid redundant work.
  3. Content Extraction: On each page, it uses AI to first dismiss common annoyances like cookie banners and popups. Then, it reads the page and extracts only the most important information—the page title and the main text content—while ignoring boilerplate like navigation menus, headers, and footers.
  4. Link Discovery: It scans each page for new links to add to its crawling queue, allowing it to discover and process more of the website.
  5. Compilation: All the extracted content is progressively added to a new Google Doc. Each entry is clearly marked with its title and original URL.
  6. Finalization: Once the crawl is complete, a summary section is added to the top of the document, detailing the parameters of the run and the total number of pages successfully scraped. The final output is a link to this newly created Google Doc.
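The crawl described in steps 1-6 can be sketched as a breadth-first traversal with a visited set, a depth limit, and a page limit. The sketch below is illustrative only: it uses a hypothetical in-memory link graph (`SITE`) in place of live pages and AI extraction, and the function names are assumptions, not the template's actual internals.

```python
from collections import deque

def crawl(seed_url, get_links, max_depth=2, max_pages=10):
    """Breadth-first crawl: visit pages level by level, skip URLs
    already seen, and stop at max_pages or max_depth (steps 2-4)."""
    visited = set()
    order = []                      # pages processed, in crawl order
    queue = deque([(seed_url, 0)])  # (url, depth) pairs
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        order.append(url)           # in the real run: extract content here
        for link in get_links(url): # step 4: link discovery
            if link not in visited:
                queue.append((link, depth + 1))
    return order

# Hypothetical site: a tiny in-memory link graph standing in for real pages.
SITE = {
    "https://example.com/":  ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": [],
}

pages = crawl("https://example.com/", lambda u: SITE.get(u, []),
              max_depth=1, max_pages=10)
# With max_depth=1, /c sits at depth 2 and is never visited.
```

Because the queue is first-in first-out, shallow pages are always processed before deeper ones, so the `maxPages` budget is spent near the seed URL first.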
Usage Ideas
  • Content Aggregation: Compile all articles from your favorite blog or news site into a single document for offline reading or analysis.
  • Market Research: Scrape competitor websites to gather information on their products, services, and marketing content.
  • Knowledge Base Creation: Crawl an internal company wiki or documentation site to create a comprehensive, searchable knowledge base.
  • Academic Research: Aggregate research papers or articles on a specific topic from university websites or online journals.
  • Lead Generation: Extract key information from company "About Us" or "Team" pages to build a list of contacts.
Customization Ideas
This template is a powerful starting point for your own custom web scraping tasks. You have the flexibility to:
  • Target Any Website: Simply change the starting URL to crawl any website you need information from.
  • Control the Scope: Easily adjust how many pages the automation should process and how deep it should crawl into the site's link structure.
  • Customize the Output: Change the title and formatting of the final Google Doc. You can also modify the structure of the content, such as how page titles and text are presented.
  • Refine Data Extraction: Tailor the AI's instructions to extract different or additional pieces of information from each page, like an author's name, publication date, or specific product details.
  • Change the Destination: Instead of saving to Google Docs, you can have the results sent to other services, like a Slack channel for team notifications or a Notion database for project management.
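The output structure mentioned above (each entry marked with its title and original URL) is also the natural place to start customizing. A minimal sketch of one such entry, with an assumed layout rather than the template's exact formatting:

```python
def format_entry(title, url, text):
    """Render one scraped page as a document section: title line,
    source URL line, blank line, then the extracted body text."""
    return f"{title}\nSource: {url}\n\n{text.strip()}\n"

entry = format_entry("Pricing", "https://example.com/pricing",
                     "Plans and feature comparisons for each tier.")
```

Swapping this function's body is all it takes to change how titles and text are presented, or to add extra fields such as an author or publication date.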
Agent Inputs

Required Parameters

Name     Type    Default  Description
seedUrl  string  None     Starting URL for the web crawl

Optional Parameters

Name           Type     Default              Description
documentTitle  string   Web Scraper Results  Title prefix for the Google Doc
maxDepth       number   2                    How many link levels deep to crawl (1-5)
maxPages       number   10                   Maximum number of pages to process (1-100)
stayInDomain   boolean  true                 Whether to stay within the same domain when following links
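The stayInDomain parameter amounts to a host check on each discovered link before it is queued. A minimal sketch of that check, assuming a simple exact-host comparison (the template's actual matching rules may differ):

```python
from urllib.parse import urlparse

def allow_link(link, seed_url, stay_in_domain=True):
    """Apply the stayInDomain preference: when enabled, only follow
    links whose host matches the seed URL's host."""
    if not stay_in_domain:
        return True
    return urlparse(link).netloc == urlparse(seed_url).netloc

allow_link("https://example.com/pricing", "https://example.com/")  # True
allow_link("https://other.org/page", "https://example.com/")       # False
```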