
Website Crawling

Learn how to add content to your chatbot by crawling websites. ChatMaven's website crawler can automatically extract and process content from your website or documentation pages.

Getting Started

Prerequisites

  • The URL of the website you want to crawl
  • Access permissions for the pages you want indexed
  • Sitemap (optional; speeds up page discovery)
  • A robots.txt that permits crawling (the crawler respects it)
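Robots.txt compliance can be checked before a crawl starts. A minimal sketch using Python's standard-library `urllib.robotparser` (the `ChatMavenBot` user-agent string is an illustrative placeholder, not the crawler's actual identifier):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, user_agent: str = "ChatMavenBot") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Example rules: everything is crawlable except /private/
rules = """User-agent: *
Disallow: /private/
"""
```

With these rules, `is_allowed(rules, "https://example.com/docs/intro")` permits the page while anything under `/private/` is blocked.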

Basic Setup

  1. Go to "Data Sources" → "Website"
  2. Click "Add Website"
  3. Enter the website URL
  4. Configure basic settings:
    • Crawl depth
    • Update frequency
    • Language detection

Configuration Options

URL Settings

  1. Include Patterns

    https://example.com/docs/*
    https://example.com/blog/*
  2. Exclude Patterns

    https://example.com/private/*
    https://example.com/admin/*
  3. Parameters

    • Follow redirects
    • Handle dynamic content
    • Respect nofollow
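Include and exclude patterns like those above are plain wildcard globs. A minimal sketch of how such patterns can be evaluated, using the example lists from this section (`should_crawl` is an illustrative name, not a ChatMaven API; exclude rules take precedence over include rules in this sketch):

```python
import fnmatch

INCLUDE = ["https://example.com/docs/*", "https://example.com/blog/*"]
EXCLUDE = ["https://example.com/private/*", "https://example.com/admin/*"]

def should_crawl(url: str) -> bool:
    """Exclude patterns win; otherwise the URL must match an include pattern."""
    if any(fnmatch.fnmatch(url, pat) for pat in EXCLUDE):
        return False
    return any(fnmatch.fnmatch(url, pat) for pat in INCLUDE)
```

A URL that matches neither list (for example a pricing page) is simply skipped.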

Authentication

  1. Basic Auth

    • Username/password
    • API key
    • Bearer token
  2. Cookie-based

    • Session cookies
    • Authentication tokens
    • Custom headers
  3. OAuth

    • OAuth 2.0 support
    • Token management
    • Refresh handling
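The three authentication styles above mostly differ in which request header they populate. A minimal sketch of building those headers (the function name and keyword arguments are illustrative, not part of ChatMaven's API):

```python
import base64

def auth_headers(method: str, **creds) -> dict:
    """Build request headers for a chosen auth method (sketch)."""
    if method == "basic":
        # Basic auth: base64-encode "username:password"
        raw = f"{creds['username']}:{creds['password']}".encode()
        return {"Authorization": f"Basic {base64.b64encode(raw).decode()}"}
    if method == "bearer":
        # Bearer/OAuth access tokens go in the same header
        return {"Authorization": f"Bearer {creds['token']}"}
    if method == "cookie":
        # Cookie-based sessions send the session cookie verbatim
        return {"Cookie": creds["cookie"]}
    raise ValueError(f"unknown auth method: {method}")
```

OAuth token refresh is not shown here; once a valid access token is obtained, it is sent exactly like the bearer case.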

Crawling Settings

Depth and Scope

  1. Crawl Depth

    • Surface (1 level)
    • Medium (3 levels)
    • Deep (unlimited)
  2. Content Selection

    • Main content
    • Navigation
    • Footers
    • Sidebars
  3. Media Handling

    • Images
    • PDFs
    • Downloads
    • Embedded content
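Crawl depth controls how many link levels are followed from the start URL. A minimal sketch of a depth-limited breadth-first crawl, with a dictionary standing in for fetching a page and extracting its links:

```python
from collections import deque

def crawl(start: str, links: dict, max_depth) -> list:
    """Breadth-first crawl up to max_depth link levels from start.

    `links` maps each URL to the URLs found on that page, standing in
    for a real fetch-and-parse step. Pass max_depth=1 for "Surface",
    3 for "Medium", or None for "Deep" (unlimited).
    """
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)           # process this page
        if depth == max_depth:      # never true when max_depth is None
            continue
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

# Toy link graph: start links to a and b; a links to c
site = {"start": ["a", "b"], "a": ["c"]}
```

With `max_depth=1` only the start page and its direct links are visited; raising it to 2 also reaches pages linked from those.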

Rate Limiting

  • Requests per second
  • Concurrent connections
  • Bandwidth limits
  • Crawl window
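The requests-per-second limit above amounts to enforcing a minimum interval between fetches. A minimal sketch of such a limiter (illustrative, not ChatMaven's implementation):

```python
import time

class RateLimiter:
    """Cap request rate with a fixed minimum interval between calls."""

    def __init__(self, requests_per_second: float):
        self.min_interval = 1.0 / requests_per_second
        self.last = 0.0

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = time.monotonic()
        sleep_for = self.min_interval - (now - self.last)
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last = time.monotonic()
```

Before each request the crawler would call `wait()`; concurrent-connection and bandwidth limits would layer on top of this.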

Content Processing

Extraction Rules

  1. Content Selectors

    article.content
    div.documentation
    section.main
  2. Ignore Elements

    .navigation
    .footer
    .ads
  3. Custom Rules

    • XPath queries
    • CSS selectors
    • Regular expressions
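The CSS selectors above can be applied with any HTML parser. A minimal sketch using the third-party BeautifulSoup library (`pip install beautifulsoup4`), with the content and ignore selectors copied from the examples in this section (`extract_main` is an illustrative name):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

CONTENT_SELECTORS = ["article.content", "div.documentation", "section.main"]
IGNORE_SELECTORS = [".navigation", ".footer", ".ads"]

def extract_main(html: str) -> str:
    """Strip ignored elements, then return text from the first content match."""
    soup = BeautifulSoup(html, "html.parser")
    for sel in IGNORE_SELECTORS:            # drop navigation, footers, ads
        for node in soup.select(sel):
            node.decompose()
    for sel in CONTENT_SELECTORS:           # first matching content block wins
        node = soup.select_one(sel)
        if node:
            return node.get_text(" ", strip=True)
    return soup.get_text(" ", strip=True)   # fall back to the whole page
```

XPath queries and regular expressions can serve the same role; CSS selectors are shown because the examples above use them.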

Content Cleaning

  • Remove ads
  • Clean formatting
  • Extract main content
  • Preserve structure

Scheduling

Automatic Updates

  1. Frequency Options

    • Hourly
    • Daily
    • Weekly
    • Monthly
  2. Update Types

    • Full crawl
    • Incremental
    • Changed pages only
  3. Notifications

    • Completion
    • Errors
    • Changes detected

Monitoring

Performance Metrics

  • Pages crawled
  • Success rate
  • Processing time
  • Error count

Content Changes

  • New pages
  • Modified content
  • Deleted pages
  • Structure changes
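New, modified, and deleted pages can be detected by comparing per-page content fingerprints between crawls. A minimal sketch using content hashes (function names are illustrative):

```python
import hashlib

def page_hash(content: str) -> str:
    """Fingerprint a page's extracted text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def diff_crawls(previous: dict, current: dict) -> dict:
    """Compare {url: hash} maps from two crawls and classify each page."""
    return {
        "new":      sorted(u for u in current if u not in previous),
        "deleted":  sorted(u for u in previous if u not in current),
        "modified": sorted(u for u in current
                           if u in previous and previous[u] != current[u]),
    }
```

Only pages in the "new" and "modified" buckets need re-processing, which is what makes incremental updates cheaper than a full crawl.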

Error Tracking

  • Connection issues
  • Authentication failures
  • Processing errors
  • Rate limiting

Best Practices

Optimization

  1. Performance

    • Set appropriate delays
    • Use incremental updates
    • Optimize selectors
  2. Resource Usage

    • Limit concurrent requests
    • Schedule during off-peak
    • Monitor bandwidth
  3. Content Quality

    • Verify extracted content
    • Check formatting
    • Test in chatbot

Maintenance

  1. Regular Tasks

    • Review crawl logs
    • Update patterns
    • Check authentication
    • Verify content
  2. Troubleshooting

    • Monitor errors
    • Check access
    • Verify settings
    • Test selectors

Next Steps