The Model Context Protocol (MCP) gives AI assistants a standard way to reach live web data. This guide covers MCP web scraping for developers, from foundational concepts to advanced techniques.
Part 1: Understanding MCP
What is the Model Context Protocol?
MCP (Model Context Protocol) is an open standard developed by Anthropic that allows AI assistants like Claude to connect to external tools and data sources. Think of it as a universal adapter that lets AI models use specialized tools.
┌─────────────┐      ┌───────────────┐      ┌──────────────┐
│   Claude    │ ←──→ │  MCP Server   │ ←──→ │   External   │
│ (AI Model)  │      │ (CrawlForge)  │      │  Resources   │
└─────────────┘      └───────────────┘      └──────────────┘
                 ↑
            MCP Protocol
       (JSON-RPC over stdio)
Why MCP Matters for Web Scraping
Before MCP, AI assistants couldn't reliably access web data:
| Approach | Problems |
|---|---|
| Training data | Outdated, knowledge cutoff |
| RAG (Retrieval) | Limited to indexed documents |
| Function calling | Requires custom implementation |
| Browser plugins | Inconsistent, security concerns |
MCP solves these by providing:
- Standardized interface - One protocol for all tools
- Real-time data - Fresh information from any source
- Tool composability - Combine multiple tools seamlessly
- Security model - Controlled access to external resources
How MCP Works
MCP uses a client-server architecture with JSON-RPC:
1. Server Discovery
{
"mcpServers": {
"crawlforge": {
"command": "npx",
"args": ["crawlforge-mcp-server"]
}
}
}
2. Tool Registration
{
"tools": [
{
"name": "fetch_url",
"description": "Fetch content from a URL",
"inputSchema": {
"type": "object",
"properties": {
"url": { "type": "string", "format": "uri" }
},
"required": ["url"]
}
}
]
}
3. Tool Invocation
{
"method": "tools/call",
"params": {
"name": "fetch_url",
"arguments": {
"url": "https://example.com"
}
}
}
4. Response
{
"result": {
"content": "<html>...",
"status": 200,
"headers": {...}
}
}
Part 2: The MCP Web Scraping Ecosystem
MCP Scraping Servers
Several MCP servers provide web scraping capabilities:
| Server | Tools | Focus |
|---|---|---|
| CrawlForge | 20 | Comprehensive scraping, research, stealth |
| Firecrawl | ~5 | Basic scraping and crawling |
| Browser MCP | ~3 | Browser automation |
| Fetch MCP | 1 | Simple HTTP requests |
Why CrawlForge Leads
CrawlForge was built specifically for MCP and has the widest tool coverage of the servers listed above:
CrawlForge: ████████████████████ 20 tools
Firecrawl: █████ 5 tools
Browser: ███ 3 tools
Fetch: █ 1 tool
Part 3: CrawlForge's 20 Tools Explained
Basic Scraping (1-2 credits)
1. fetch_url (1 credit)
The foundation of web scraping - fetches raw HTML from any URL.
// Usage:
"Fetch https://example.com"
// Returns:
{
"html": "<html>...",
"status": 200,
"headers": {...},
"timing": { "total": 523 }
}
When to use: Starting point for any scraping task. Always try this first.
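Behind a prompt like "Fetch https://example.com", the client sends a `tools/call` message of the kind shown in Part 1. As a rough sketch, the following builds that message with the JSON-RPC 2.0 envelope fields (`jsonrpc`, `id`) that the abbreviated example leaves out. `buildToolCall` is a hypothetical helper written for illustration; in practice an MCP client library assembles the request for you.

```typescript
// Hypothetical helper: wraps a tool name and its arguments in a JSON-RPC 2.0 request.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// The request a client might send for the fetch_url example above.
const request = buildToolCall(1, "fetch_url", { url: "https://example.com" });
const wire = JSON.stringify(request); // serialized form written to the transport
```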
2. extract_text (1 credit)
Extracts clean text content, removing HTML tags, scripts, and styles.
// Usage:
"Extract text from https://example.com/article"
// Returns:
{
"text": "Article headline\n\nFirst paragraph...",
"wordCount": 1247,
"readingTime": 5
}
When to use: Blog posts, articles, documentation where you need readable text.
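To make the idea concrete, here is a minimal sketch of tag stripping and word counting. It is not CrawlForge's implementation: the regex approach is fragile on malformed HTML and does not decode entities, and the 250 words-per-minute reading speed is an assumption, chosen because it turns the 1,247 words in the example above into 5 minutes.

```typescript
interface ExtractedText {
  text: string;
  wordCount: number;
  readingTime: number; // whole minutes, rounded up
}

const WORDS_PER_MINUTE = 250; // assumed reading speed

function extractText(html: string): ExtractedText {
  const text = html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ") // drop script/style blocks and their contents
    .replace(/<[^>]+>/g, " ") // drop the remaining tags
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
  const wordCount = text === "" ? 0 : text.split(" ").length;
  return { text, wordCount, readingTime: Math.ceil(wordCount / WORDS_PER_MINUTE) };
}

const sample =
  "<html><style>p{color:red}</style><h1>Headline</h1><p>First paragraph.</p><script>track()</script></html>";
const extracted = extractText(sample);
// extracted.text === "Headline First paragraph."
```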
3. extract_links (1 credit)
Discovers all links on a page with optional filtering.
// Usage:
"Extract all links from https://example.com"
// With filtering:
{
"url": "https://example.com",
"filter_external": true // Internal links only
}
// Returns:
{
"links": [
{ "href": "/about", "text": "About Us" },
{ "href": "/products", "text": "Products" }
],
"total": 45
}
When to use: Site exploration, finding pages to scrape, building sitemaps.
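What "internal links only" means can be sketched with the standard `URL` class: resolve each `href` against the page URL and keep those on the same host. This illustrates the filtering idea only. How `filter_external` treats subdomains or non-HTTP schemes is not stated here, so the same-host rule below is an assumption.

```typescript
interface Link {
  href: string;
  text: string;
}

// Keep only links whose resolved host matches the host of the page they came from.
function internalLinks(pageUrl: string, links: Link[]): Link[] {
  const pageHost = new URL(pageUrl).host;
  return links.filter((link) => {
    try {
      return new URL(link.href, pageUrl).host === pageHost;
    } catch {
      return false; // href could not be parsed as a URL
    }
  });
}

const found: Link[] = [
  { href: "/about", text: "About Us" },
  { href: "https://example.com/products", text: "Products" },
  { href: "https://other.example.org/", text: "Partner" },
];
const internal = internalLinks("https://example.com", found);
// keeps "/about" and "https://example.com/products"
```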
4. extract_metadata (1 credit)
Pulls SEO metadata: title, description, Open Graph, JSON-LD.
// Usage:
"Get metadata from https://example.com"
// Returns:
{
"title": "Example Site - Homepage",
"description": "Welcome to Example...",
"openGraph": {
"title": "Example Site",
"image": "https://example.com/og.png"
},
"jsonLd": [...]
}
When to use: SEO analysis, content previews, structured data extraction.
Structured Extraction (2-3 credits)
5. scrape_structured (2 credits)
Extracts specific data using CSS selectors.
// Usage:
{
"url": "https://example.com/products",
"selectors": {
"title": "h1.product-title",
"price": "span.price",
"description": ".product-description"
}
}
// Returns:
{
"data": {
"title": "Product Name",
"price": "$99.99",
"description": "Product description..."
}
}
When to use: E-commerce scraping, structured data, known page layouts.
6. extract_content (2 credits)
Intelligent article extraction (like Readability).
// Usage:
"Extract the main content from https://blog.example.com/post"
// Returns:
{
"title": "Blog Post Title",
"author": "John Smith",
"publishedDate": "2026-01-15",
"content": "Clean article text...",
"images": ["..."],
"readingTime": 7
}
When to use: News articles, blog posts, editorial content.
7. map_site (2 credits)
Discovers site structure and generates sitemaps.
// Usage:
{
"url": "https://example.com",
"max_urls": 1000,
"include_sitemap": true
}
// Returns:
{
"pages": [
{ "url": "/", "title": "Home", "depth": 0 },
{ "url": "/about", "title": "About", "depth": 1 },
...
],
"structure": {
"/": ["/about", "/products", "/blog"],
"/products": ["/products/1", "/products/2"]
}
}
When to use: Site audits, crawl planning, content discovery.
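The `depth` values in the response follow from the `structure` map: a page's depth is the length of the shortest link path from the root. A small breadth-first search, shown here as an illustrative sketch (`computeDepths` is a hypothetical helper, not a CrawlForge API), recovers them from the sample output.

```typescript
type SiteStructure = Record<string, string[]>; // page -> pages it links to

function computeDepths(structure: SiteStructure, root: string): Map<string, number> {
  const depths = new Map<string, number>([[root, 0]]);
  const queue: string[] = [root];
  while (queue.length > 0) {
    const page = queue.shift() as string;
    const depth = depths.get(page) as number;
    for (const child of structure[page] ?? []) {
      if (!depths.has(child)) {
        // In a breadth-first search the first visit is along a shortest path.
        depths.set(child, depth + 1);
        queue.push(child);
      }
    }
  }
  return depths;
}

const siteStructure: SiteStructure = {
  "/": ["/about", "/products", "/blog"],
  "/products": ["/products/1", "/products/2"],
};
const depths = computeDepths(siteStructure, "/");
// "/" is at depth 0, "/about" at depth 1, "/products/1" at depth 2
```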
8. analyze_content (3 credits)
NLP analysis: language, sentiment, topics, entities.
// Usage:
"Analyze this content: [text]"
// Returns:
{
"language": "en",
"sentiment": { "score": 0.7, "label": "positive" },
"topics": ["technology", "AI", "automation"],
"entities": [
{ "text": "OpenAI", "type": "organization" },
{ "text": "GPT-4", "type": "product" }
],
"readability": { "grade": 12, "score": 45 }
}
When to use: Content analysis, sentiment tracking, topic extraction.
Advanced Scraping (2-5 credits)
9. process_document (2 credits)
Handles PDFs and documents.
// Usage:
{
"source": "https://example.com/report.pdf",
"sourceType": "pdf_url"
}
// Returns:
{
"text": "Extracted PDF text...",
"pages": 15,
"metadata": {
"author": "...",
"created": "..."
}
}
When to use: Research papers, reports, documentation PDFs.
10. summarize_content (4 credits)
AI-powered summarization.
// Usage:
"Summarize this article: [long text]"
// Returns:
{
"summary": "Concise summary...",
"keyPoints": [
"Point 1",
"Point 2",
"Point 3"
],
"wordReduction": "85%"
}
When to use: Long documents, research synthesis, content digests.
11. crawl_deep (4 credits)
Multi-page crawling with configurable depth.
// Usage:
{
"url": "https://example.com",
"max_depth": 3,
"max_pages": 100,
"include_patterns": ["/blog/*"],
"exclude_patterns": ["/admin/*"]
}
// Returns:
{
"pages": [...], // All crawled pages
"stats": {
"total": 87,
"successful": 85,
"failed": 2
}
}
When to use: Full site scraping, content aggregation, archiving.
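A plausible reading of `include_patterns` and `exclude_patterns` is simple glob matching on the URL path. The sketch below assumes `*` matches any run of characters, that patterns are tested against the path, and that exclusions take precedence. The actual matching rules may differ, so treat `shouldCrawl` as a hypothetical illustration.

```typescript
// Convert a simple glob ("*" = any run of characters) into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters except "*"
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

function shouldCrawl(path: string, include: string[], exclude: string[]): boolean {
  const matches = (globs: string[]): boolean =>
    globs.some((glob) => globToRegExp(glob).test(path));
  if (matches(exclude)) return false; // assumed: exclusions win over inclusions
  return include.length === 0 || matches(include);
}

const include = ["/blog/*"];
const exclude = ["/admin/*"];
const crawlBlog = shouldCrawl("/blog/post-1", include, exclude); // true
const crawlAdmin = shouldCrawl("/admin/users", include, exclude); // false
const crawlAbout = shouldCrawl("/about", include, exclude); // false: not under /blog/
```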
12. batch_scrape (5 credits)
Parallel scraping of multiple URLs.
// Usage:
{
"urls": [
"https://example1.com",
"https://example2.com",
// ... up to 50 URLs
],
"maxConcurrency": 10
}
// Returns:
{
"results": [
{ "url": "...", "success": true, "data": {...} },
...
],
"stats": { "successful": 48, "failed": 2 }
}
When to use: Multiple known URLs, competitor monitoring, price tracking.
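If you have more URLs than a single call accepts, you can split the list client-side before calling the tool. This sketch assumes the 50-URL ceiling from the comment above; `chunk` is an ordinary helper, not part of CrawlForge.

```typescript
const MAX_BATCH_SIZE = 50; // assumed from the "up to 50 URLs" note above

// Split a list into consecutive slices of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError("size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const allUrls = Array.from({ length: 120 }, (_, i) => `https://example.com/p/${i + 1}`);
const batches = chunk(allUrls, MAX_BATCH_SIZE);
// 120 URLs become three batches of 50, 50, and 20
```

Each inner array can then be passed as the `urls` argument of one `batch_scrape` call.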
13. scrape_with_actions (5 credits)
Browser automation with actions.
// Usage:
{
"url": "https://example.com/app",
"actions": [
{ "type": "wait", "selector": ".content" },
{ "type": "click", "selector": "#load-more" },
{ "type": "wait", "timeout": 2000 },
{ "type": "scroll", "selector": "body" },
{ "type": "screenshot" }
]
}
// Returns:
{
"finalContent": "...",
"screenshots": ["base64..."],
"actionsExecuted": 5
}
When to use: SPAs, infinite scroll, dynamic content, login required.
14. search_web (5 credits)
Google search integration.
// Usage:
{
"query": "web scraping best practices 2026",
"limit": 10,
"site": "github.com" // Optional site filter
}
// Returns:
{
"results": [
{
"title": "...",
"url": "...",
"snippet": "..."
},
...
]
}
When to use: Discovery, finding sources, research starting point.
Specialized Tools (2-10 credits)
15. stealth_mode (5 credits)
Anti-detection bypass (detailed in Stealth Mode Guide).
// Usage:
{
"operation": "create_context",
"stealthConfig": {
"level": "advanced",
"hideWebDriver": true,
"randomizeFingerprint": true,
"simulateHumanBehavior": true
}
}
When to use: Protected sites, Cloudflare bypass, anti-bot evasion.
16. track_changes (3 credits)
Content monitoring and change detection.
// Usage:
{
"url": "https://example.com/pricing",
"operation": "create_baseline",
"monitoringOptions": {
"interval": 86400000, // Daily
"notificationThreshold": "moderate"
}
}
// Returns (on change):
{
"changes": [
{
"type": "text_change",
"path": ".pricing-tier-1",
"before": "$19/mo",
"after": "$29/mo",
"significance": "major"
}
]
}
When to use: Price monitoring, competitor tracking, content updates.
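Conceptually, change detection compares a stored baseline with a fresh capture. The sketch below diffs two snapshots keyed by selector and produces entries shaped like the `changes` array above. The `Snapshot` type and `diffSnapshots` function are hypothetical, and the `significance` scoring is left out because its rules are not described here.

```typescript
type Snapshot = Record<string, string>; // selector -> text captured at that selector

interface Change {
  type: "text_change" | "added" | "removed";
  path: string;
  before?: string;
  after?: string;
}

const DAILY_MS = 24 * 60 * 60 * 1000; // 86400000, the interval used above

function diffSnapshots(baseline: Snapshot, current: Snapshot): Change[] {
  const changes: Change[] = [];
  // Union of selectors seen in either snapshot.
  for (const path of Object.keys({ ...baseline, ...current })) {
    const before = baseline[path];
    const after = current[path];
    if (before === undefined) changes.push({ type: "added", path, after });
    else if (after === undefined) changes.push({ type: "removed", path, before });
    else if (before !== after) changes.push({ type: "text_change", path, before, after });
  }
  return changes;
}

const changes = diffSnapshots(
  { ".pricing-tier-1": "$19/mo", ".pricing-tier-2": "$49/mo" },
  { ".pricing-tier-1": "$29/mo", ".pricing-tier-2": "$49/mo" }
);
// one text_change for ".pricing-tier-1": "$19/mo" became "$29/mo"
```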
17. localization (2 credits)
Geo-targeted scraping.
// Usage:
{
"operation": "configure_country",
"countryCode": "GB",
"language": "en-GB"
}
// Then scrape to get UK-specific content/pricing
When to use: Regional pricing, localized content, geo-restricted data.
18. extract_structured (3 credits)
LLM-powered schema-driven extraction with CSS selector fallback.
// Usage:
{
"url": "https://example.com/product/123",
"schema": {
"type": "object",
"properties": {
"title": { "type": "string" },
"price": { "type": "number" }
},
"required": ["title"]
},
"prompt": "Extract the product name and price"
}
When to use: When you want typed output matching a schema without writing selectors.
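Whatever produces the output, it is worth checking the result against the schema before using it. The following is a deliberately small validator covering only the flat subset of JSON Schema used above (`type`, `properties`, `required`). It is a sketch, and a full JSON Schema library would be the usual choice for real schemas.

```typescript
interface ObjectSchema {
  type: "object";
  properties: Record<string, { type: "string" | "number" | "boolean" }>;
  required?: string[];
}

// Returns a list of human-readable problems; an empty list means the value conforms.
function validate(value: Record<string, unknown>, schema: ObjectSchema): string[] {
  const errors: string[] = [];
  for (const key of schema.required ?? []) {
    if (!(key in value)) errors.push(`missing required property "${key}"`);
  }
  for (const key of Object.keys(schema.properties)) {
    const expected = schema.properties[key].type;
    if (key in value && typeof value[key] !== expected) {
      errors.push(`"${key}" should be a ${expected}`);
    }
  }
  return errors;
}

const productSchema: ObjectSchema = {
  type: "object",
  properties: { title: { type: "string" }, price: { type: "number" } },
  required: ["title"],
};

const ok = validate({ title: "Widget", price: 99.99 }, productSchema); // no errors
const bad = validate({ price: "$99.99" }, productSchema); // missing title, price is a string
```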
19. generate_llms_txt (5 credits)
Analyze a site and emit standard-compliant llms.txt and llms-full.txt files.
// Usage:
{
"url": "https://example.com",
"format": "both",
"complianceLevel": "standard",
"outputOptions": {
"organizationName": "Example Inc.",
"contactEmail": "ai@example.com"
}
}
When to use: Publishing AI interaction guidelines for your website.
20. deep_research (10 credits)
Comprehensive multi-source research (detailed in Deep Research Guide).
// Usage:
{
"topic": "quantum computing commercialization",
"maxUrls": 50,
"enableSourceVerification": true,
"enableConflictDetection": true
}
// Returns:
{
"synthesis": "Comprehensive analysis...",
"sources": [...],
"conflicts": [...],
"citations": [...]
}
When to use: Research projects, due diligence, market analysis.
Part 4: Integration Guide
Claude Code Setup
# 1. Install CrawlForge MCP server
npm install -g crawlforge-mcp-server
# 2. Run setup wizard
npx crawlforge-setup
# 3. Add to Claude Code (run from your shell)
claude mcp add crawlforge -- npx crawlforge-mcp-server
# 4. Verify
claude mcp list
# Should show: crawlforge (20 tools)
Claude Desktop Setup
Edit your Claude Desktop config file:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
{
"mcpServers": {
"crawlforge": {
"command": "npx",
"args": ["crawlforge-mcp-server"],
"env": {
"CRAWLFORGE_API_KEY": "cf_live_your_key_here"
}
}
}
}
Custom Application Integration
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";
const transport = new StdioClientTransport({
command: "npx",
args: ["crawlforge-mcp-server"],
env: {
CRAWLFORGE_API_KEY: process.env.CRAWLFORGE_API_KEY ?? "" // env values must be strings
}
});
const client = new Client({
name: "my-app",
version: "1.0.0"
});
await client.connect(transport);
// List available tools
const tools = await client.listTools();
console.log(`Available tools: ${tools.tools.length}`);
// Call a tool
const result = await client.callTool({
name: "fetch_url",
arguments: {
url: "https://example.com"
}
});
Part 5: Best Practices
Credit Optimization
| Goal | Expensive | Efficient |
|---|---|---|
| Check if page exists | deep_research (10) | fetch_url (1) |
| Get article text | scrape_with_actions (5) | extract_content (2) |
| Find competitor URLs | search_web × 10 (50) | extract_links (1) |
| Scrape 20 product pages | fetch_url × 20 (20) | batch_scrape (5) |
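The comparisons in the table are simple multiplications of per-call cost by call count. As a sketch, using only the credit prices listed in this guide and assuming credits are charged per tool call (`planCost` is a hypothetical helper):

```typescript
// Per-call credit prices as listed in this guide.
const CREDIT_COST: Record<string, number> = {
  fetch_url: 1,
  extract_links: 1,
  extract_content: 2,
  batch_scrape: 5,
  scrape_with_actions: 5,
  search_web: 5,
  deep_research: 10,
};

interface Step {
  tool: string;
  calls: number;
}

// Total credits for a sequence of tool calls.
function planCost(steps: Step[]): number {
  return steps.reduce((total, step) => {
    const cost = CREDIT_COST[step.tool];
    if (cost === undefined) throw new Error(`unknown tool: ${step.tool}`);
    return total + cost * step.calls;
  }, 0);
}

// "Scrape 20 product pages" from the table above.
const oneByOne = planCost([{ tool: "fetch_url", calls: 20 }]); // 20 credits
const batched = planCost([{ tool: "batch_scrape", calls: 1 }]); // 5 credits
```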
Error Handling
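// Always handle failures gracefully.
// fetchUrl, stealthMode, and sleep stand for your own wrappers around the MCP tools.
const maxAttempts = 3;
const retryDelay = 1000; // base delay in ms
async function fetchWithRetry(url, attempt = 0) {
  try {
    const result = await fetchUrl(url);
    if (result.status >= 400) {
      // Try with stealth mode
      return await stealthMode(url);
    }
    return result;
  } catch (error) {
    if (attempt + 1 >= maxAttempts) throw error; // out of retries
    // Log and retry with exponential backoff: the delay doubles on each attempt
    console.warn(`Attempt ${attempt + 1} failed for ${url}:`, error);
    await sleep(retryDelay * 2 ** attempt);
    return fetchWithRetry(url, attempt + 1);
  }
}
Rate Limiting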
Respect target sites:
// Good: Reasonable delays
for (const url of urls) {
await scrape(url);
await sleep(1000 + Math.random() * 2000); // 1-3s delay
}
// Better: Use batch_scrape with built-in rate limiting
await batchScrape(urls, { delayBetweenRequests: 1500 });
Caching
Don't scrape the same URL twice:
const cache = new Map<string, ScrapedContent>();
async function smartScrape(url: string) {
if (cache.has(url)) {
return cache.get(url);
}
const result = await fetchUrl(url);
cache.set(url, result);
return result;
}
Part 6: The Future of MCP Scraping
Emerging Trends
- AI-Native Extraction - LLMs directly parsing unstructured HTML
- Self-Healing Scrapers - AI adapts to site changes automatically
- Semantic Search - Natural language queries across scraped data
- Cross-Site Analysis - AI connecting information across sources
CrawlForge Roadmap
Coming in 2026:
- Real-time monitoring - Instant change notifications
- AI schema generation - Automatic extraction templates
- Cross-tool workflows - Chain tools intelligently
- Enhanced privacy - Zero-knowledge scraping options
Getting Started
Ready to start MCP web scraping? Here's your path:
Free Tier (Perfect for Getting Started)
- 1,000 one-time trial credits
- All 20 tools available
- No credit card required
# Quick start
npm install -g crawlforge-mcp-server
npx crawlforge-setup
# Visit: https://crawlforge.dev/signup
What You Can Do with 1,000 Credits
| Use Case | Tool | Credits per Call | Capacity (1,000 credits) |
|---|---|---|---|
| Basic scraping | fetch_url | 1 | 1,000 pages |
| Article extraction | extract_content | 2 | 500 articles |
| Site mapping | map_site | 2 | 500 sites |
| Batch jobs | batch_scrape | 5 | 200 batches (10K URLs) |
| Research projects | deep_research | 10 | 100 topics |
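Each capacity figure is the credit budget divided by the per-call cost, rounded down. A short worked check, assuming per-call billing and the 50-URL batch limit mentioned earlier:

```typescript
const BUDGET = 1000; // trial credits
const URLS_PER_BATCH = 50; // assumed batch_scrape limit

// Number of calls a budget affords at a given per-call price.
function capacity(budget: number, creditsPerCall: number): number {
  return Math.floor(budget / creditsPerCall);
}

const pages = capacity(BUDGET, 1); // fetch_url: 1000
const articles = capacity(BUDGET, 2); // extract_content: 500
const batchCalls = capacity(BUDGET, 5); // batch_scrape: 200
const batchUrls = batchCalls * URLS_PER_BATCH; // 10000 URLs at full batches
const topics = capacity(BUDGET, 10); // deep_research: 100
```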
Summary
MCP has revolutionized web scraping for AI applications. Key takeaways:
- MCP is becoming the standard - A growing number of AI assistants and developer tools support it
- CrawlForge leads with 20 tools - 4x as many as the next-largest server in the comparison above
- Start simple - Use fetch_url (1 credit) before advanced tools
- Combine tools - Chain operations for powerful workflows
- Be ethical - Respect robots.txt and rate limits