Website Crawling
Learn how to add content to your chatbot by crawling websites. ChatMaven's website crawler can automatically extract and process content from your website or documentation pages.
Getting Started
Prerequisites
- Website URL
- Access permissions
- Sitemap (optional)
- Robots.txt compliance
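Before adding a site, it can help to confirm that the pages you want crawled are not disallowed by its robots.txt. The following is a minimal sketch using Python's standard `urllib.robotparser`; it is not ChatMaven's implementation, and the robots.txt content, the `ExampleCrawler` user agent, and the `is_allowed` helper are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /admin/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "ExampleCrawler") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file over the network instead of parsing a string.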
Basic Setup
- Go to "Data Sources" → "Website"
- Click "Add Website"
- Enter the website URL
- Configure basic settings:
  - Crawl depth
  - Update frequency
  - Language detection
Configuration Options
URL Settings
- Include Patterns
  https://example.com/docs/*
  https://example.com/blog/*
- Exclude Patterns
  https://example.com/private/*
  https://example.com/admin/*
- Parameters
  - Follow redirects
  - Handle dynamic content
  - Respect nofollow
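To preview which URLs a set of include and exclude patterns would select, the patterns can be tested locally. This sketch assumes the `*` wildcard behaves like a shell-style glob and that exclude patterns take precedence over include patterns; ChatMaven's exact matching rules are not documented here, so treat both as assumptions. `should_crawl` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

INCLUDE = ["https://example.com/docs/*", "https://example.com/blog/*"]
EXCLUDE = ["https://example.com/private/*", "https://example.com/admin/*"]


def should_crawl(url: str, include=INCLUDE, exclude=EXCLUDE) -> bool:
    """Return True if url matches an include pattern and no exclude pattern."""
    # Assumption: an exclude match always wins over an include match.
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(url, pattern) for pattern in include)
```

Under these glob semantics, `https://example.com/docs` (no trailing slash) does not match `https://example.com/docs/*`, so a section's landing page may need its own pattern.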
Authentication
- Basic Auth
  - Username/password
  - API key
  - Bearer token
- Cookie-based
  - Session cookies
  - Authentication tokens
  - Custom headers
- OAuth
  - OAuth 2.0 support
  - Token management
  - Refresh handling
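Most of the authentication options above end up as HTTP request headers. The sketch below shows how username/password, bearer-token, and cookie credentials are conventionally encoded; the helper names are hypothetical, and how ChatMaven stores or sends credentials is not covered here.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """HTTP Basic auth: base64 of 'username:password' in the Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def bearer_header(token: str) -> dict:
    """Bearer token, as used with OAuth 2.0 access tokens."""
    return {"Authorization": f"Bearer {token}"}


def cookie_header(cookies: dict) -> dict:
    """Session cookies joined into a single Cookie header."""
    return {"Cookie": "; ".join(f"{name}={value}" for name, value in cookies.items())}
```

These dictionaries can be passed as request headers to any HTTP client when verifying that a protected page is reachable with the credentials you plan to configure.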
Crawling Settings
Depth and Scope
- Crawl Depth
  - Surface (1 level)
  - Medium (3 levels)
  - Deep (unlimited)
- Content Selection
  - Main content
  - Navigation
  - Footers
  - Sidebars
- Media Handling
  - Images
  - PDFs
  - Downloads
  - Embedded content
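The effect of the crawl depth setting can be illustrated with a breadth-first traversal over a small, made-up link graph. This sketch assumes that depth counts link hops from the start URL (so depth 1 covers the start page and the pages it links to) and that "unlimited" means no cap; the `LINKS` graph and `crawl` function are hypothetical and the product may count levels differently.

```python
from collections import deque
from typing import Optional

# Hypothetical site structure: page -> pages it links to.
LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/setup", "/docs/api"],
    "/docs/setup": ["/docs/setup/linux"],
    "/docs/setup/linux": ["/docs/setup/linux/arm"],
    "/blog": ["/"],
}


def crawl(start: str, max_depth: Optional[int]) -> list:
    """Breadth-first crawl; max_depth=None means unlimited depth."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        # Stop following links once the depth limit is reached.
        if max_depth is not None and depth >= max_depth:
            continue
        for link in LINKS.get(url, []):
            if link not in seen:  # avoid revisiting pages and link cycles
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

In this example graph, a depth of 1 visits three pages, a depth of 3 visits six, and an unlimited crawl visits all seven.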
Rate Limiting
- Requests per second
- Concurrent connections
- Bandwidth limits
- Crawl window
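A requests-per-second limit amounts to enforcing a minimum interval between consecutive requests. The class below is a minimal sketch of that idea, not the crawler's actual scheduler; `RateLimiter` and its injectable `clock` and `sleep` parameters are hypothetical and exist only to make the behavior easy to test.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between calls to wait()."""

    def __init__(self, requests_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next request may start

    def wait(self) -> float:
        """Block until the next request is allowed; return the delay applied."""
        now = self.clock()
        delay = 0.0
        if self.next_allowed is not None and now < self.next_allowed:
            delay = self.next_allowed - now
            self.sleep(delay)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
        return delay
```

Calling `limiter.wait()` immediately before each request keeps a single worker at or below the configured rate; limiting concurrent connections or bandwidth would need separate mechanisms.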
Content Processing
Extraction Rules
- Content Selectors
  article.content
  div.documentation
  section.main
- Ignore Elements
  .navigation
  .footer
  .ads
- Custom Rules
  - XPath queries
  - CSS selectors
  - Regular expressions
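To see how content selectors and ignore rules interact, the sketch below keeps text inside elements matching the example content selectors and drops text inside elements matching the ignore classes. It uses only Python's standard `html.parser`, supports just the simple `tag.class` and `.class` forms shown above, and assumes well-formed HTML in which every non-void element is closed; `SelectorExtractor` and `extract_text` are hypothetical names, and the real extraction engine is likely more capable.

```python
from html.parser import HTMLParser

INCLUDE = {("article", "content"), ("div", "documentation"), ("section", "main")}
IGNORE = {"navigation", "footer", "ads"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements without end tags


class SelectorExtractor(HTMLParser):
    def __init__(self, include, ignore):
        super().__init__()
        self.include = include  # set of (tag, class) pairs
        self.ignore = ignore    # set of class names
        self.stack = []         # one marker per open element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if classes & self.ignore:
            self.stack.append("ignore")
        elif any((tag, name) in self.include for name in classes):
            self.stack.append("include")
        else:
            self.stack.append(None)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Keep text only inside an included element and outside any ignored one.
        if text and "include" in self.stack and "ignore" not in self.stack:
            self.chunks.append(text)


def extract_text(html: str, include=INCLUDE, ignore=IGNORE) -> str:
    parser = SelectorExtractor(include, ignore)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Running candidate selectors against a saved copy of a page in this way is a quick check that they capture the main content and nothing else.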
Content Cleaning
- Remove ads
- Clean formatting
- Extract main content
- Preserve structure
Scheduling
Automatic Updates
- Frequency Options
  - Hourly
  - Daily
  - Weekly
  - Monthly
- Update Types
  - Full crawl
  - Incremental
  - Changed pages only
- Notifications
  - Completion
  - Errors
  - Changes detected
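One common way to implement a "changed pages only" update is to store a fingerprint of each page and compare it on the next crawl. The sketch below assumes SHA-256 hashes of the extracted text; whether ChatMaven detects changes this way is not stated, and `fingerprint` and `diff_crawl` are hypothetical helpers.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable hash of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_crawl(previous: dict, current_pages: dict) -> dict:
    """Compare stored fingerprints (url -> hash) with freshly crawled text (url -> text)."""
    current = {url: fingerprint(body) for url, body in current_pages.items()}
    return {
        "new": sorted(url for url in current if url not in previous),
        "modified": sorted(
            url for url in current if url in previous and current[url] != previous[url]
        ),
        "deleted": sorted(url for url in previous if url not in current),
    }
```

Only the URLs listed under "new" and "modified" would need reprocessing, which is what makes incremental updates cheaper than a full crawl.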
Monitoring
Performance Metrics
- Pages crawled
- Success rate
- Processing time
- Error count
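The four metrics above can all be derived from per-page crawl results. The following sketch shows one plausible set of definitions, with success rate taken as successful pages divided by pages crawled; the `PageResult` record and `summarize` function are hypothetical and the dashboard may compute these values differently.

```python
from dataclasses import dataclass


@dataclass
class PageResult:
    url: str
    ok: bool        # True if the page was fetched and processed
    seconds: float  # time spent on this page


def summarize(results: list) -> dict:
    """Aggregate per-page results into crawl-level metrics."""
    total = len(results)
    errors = sum(1 for result in results if not result.ok)
    return {
        "pages_crawled": total,
        "success_rate": (total - errors) / total if total else 0.0,
        "processing_time": sum(result.seconds for result in results),
        "error_count": errors,
    }
```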
Content Changes
- New pages
- Modified content
- Deleted pages
- Structure changes
Error Tracking
- Connection issues
- Authentication failures
- Processing errors
- Rate limiting
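When reviewing crawl logs, failures can be grouped into the four categories above. The sketch below is a simplified mapping based on standard HTTP status codes (401 and 403 for authentication failures, 429 for rate limiting) and Python's built-in connection exceptions; treating everything else as a processing error is an assumption, and `classify_error` is a hypothetical helper.

```python
from collections import Counter
from typing import Optional


def classify_error(status: Optional[int] = None, exc: Optional[BaseException] = None) -> str:
    """Map a failed request to one of the error-tracking categories."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return "connection"
    if status in (401, 403):
        return "authentication"
    if status == 429:
        return "rate_limiting"
    # Assumption: any other failure is counted as a processing error.
    return "processing"


def tally(failures: list) -> Counter:
    """Count failures per category; each failure is a (status, exception) pair."""
    return Counter(classify_error(status, exc) for status, exc in failures)
```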
Best Practices
Optimization
- Performance
  - Set appropriate delays
  - Use incremental updates
  - Optimize selectors
- Resource Usage
  - Limit concurrent requests
  - Schedule during off-peak
  - Monitor bandwidth
- Content Quality
  - Verify extracted content
  - Check formatting
  - Test in chatbot
Maintenance
- Regular Tasks
  - Review crawl logs
  - Update patterns
  - Check authentication
  - Verify content
- Troubleshooting
  - Monitor errors
  - Check access
  - Verify settings
  - Test selectors