SpiderForce4AI Documentation
Run Your Own Web Crawler
No limits. No subscriptions. Just about $10/month for any basic VPS (1 GB RAM and 1 core are enough).
Installation
SpiderForce4AI can be installed and run in multiple ways:
Quick start with Docker (recommended):
docker run -d --restart unless-stopped -p 3004:3004 --name spiderforce4ai petertamai/spiderforce4ai:latest
Container started successfully!
Install from GitHub repository:
git clone https://github.com/petertamai/spiderforce4ai.git
cd spiderforce4ai
npm install
mkdir logs
cp .env.example .env
npm install -g pm2
npm run start:pm2
Service started on port 3004
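Whichever route you choose, a quick request confirms the service is up (the /convert endpoint is described under Basic Conversion below); for the PM2 install you can also check the process list:
pm2 status
curl -s "http://localhost:3004/convert?url=https://petertam.pro" | head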
Configuration
SpiderForce4AI can be configured through environment variables:
# Server Configuration
PORT=3004
NODE_ENV=production
MAX_RETRIES=2
PAGE_TIMEOUT=30000
MAX_CONCURRENT_PAGES=10
# Cleaning Configuration
AGGRESSIVE_CLEANING=true
REMOVE_IMAGES=false
# Dynamic Content Handling
MIN_CONTENT_LENGTH=500
SCROLL_WAIT_TIME=200
| Variable | Default | Description |
|---|---|---|
| PORT | 3004 | Server port |
| MIN_CONTENT_LENGTH | 500 | Minimum content length threshold (characters) |
| AGGRESSIVE_CLEANING | true | Enable aggressive content cleaning |
| MAX_CONCURRENT_PAGES | 10 | Maximum number of concurrent pages |
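If you run the Docker image, these variables can be passed with -e flags when starting the container, for example (values are illustrative, and this assumes the image reads them at startup):
docker run -d --restart unless-stopped -p 3004:3004 \
  -e MAX_CONCURRENT_PAGES=5 \
  -e MIN_CONTENT_LENGTH=1000 \
  -e AGGRESSIVE_CLEANING=true \
  --name spiderforce4ai petertamai/spiderforce4ai:latest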
Basic Conversion
Convert a URL to clean Markdown format.
curl "http://localhost:3004/convert?url=https://petertam.pro"
URL: https://petertam.pro
Title: Example Domain
Description: Example website for illustrative purposes
---
# Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
[More information...](https://www.iana.org/domains/example)
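The GET endpoint also accepts comma-separated targetSelectors and removeSelectors query parameters (see the OpenAPI description in the DIFY Integration section below), for example:
curl "http://localhost:3004/convert?url=https://petertam.pro&targetSelectors=article,.main-content&removeSelectors=.ads,.footer"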
Advanced Targeting
Extract specific content using CSS selectors.
curl -X POST "http://localhost:3004/convert" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://petertam.pro",
    "targetSelectors": ["article", ".main-content"],
    "removeSelectors": [".ads", ".nav", ".footer"]
  }'
| Parameter | Type | Description |
|---|---|---|
| url | string | The URL to convert (required) |
| targetSelectors | array | CSS selectors to target specific content elements |
| removeSelectors | array | CSS selectors to remove unwanted elements |
Dynamic Content Handling
Configure dynamic content handling parameters.
curl -X POST "http://localhost:3004/convert" \
-H "Content-Type: application/json" \
-d '{
"url": "https://petertam.pro",
"min_content_length": 1000,
"scroll_wait_time": 300,
"aggressive_cleaning": true
}'
Extraction falls back through three stages until enough content is captured:

- STAGE 0 (Default): Fast extraction with aggressive cleaning, optimized for speed.
- STAGE 1 (First Fallback): If content is insufficient, re-run with a scroll to the bottom, wait 200 ms, then retry extraction with aggressive cleaning.
- STAGE 2 (Last Resort): If content is still insufficient, re-run with a scroll to the bottom, wait 200 ms, and disable aggressive cleaning.
| Parameter | Type | Default | Description |
|---|---|---|---|
| min_content_length | number | 500 | Minimum acceptable content length in characters |
| scroll_wait_time | number | 200 | Time to wait after scrolling in milliseconds |
| aggressive_cleaning | boolean | true | Enable/disable aggressive cleaning |
Batch Processing
Process an entire website using its sitemap.
curl -X POST "http://localhost:3004/crawl_sitemap" \
-H "Content-Type: application/json" \
-d '{
"sitemapUrl": "https://petertam.pro/sitemap.xml",
"webhook": {
"url": "https://your-webhook.com/endpoint",
"headers": {
"Authorization": "Bearer your-token"
},
"progressUpdates": true,
"extraFields": {
"project": "blog-crawler",
"source": "sitemap"
}
}
}'
{
"jobId": "job_1234567890",
"status": "started",
"config": {
"sitemapUrl": "https://petertam.pro/sitemap.xml",
"webhook": {
"url": "https://your-webhook.com/endpoint",
"hasHeaders": true
}
}
}
Process multiple URLs in a single batch.
curl -X POST "http://localhost:3004/crawl_urls" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://petertam.pro/page1",
"https://petertam.pro/page2",
"https://petertam.pro/page3"
]
}'
Check the status of a batch processing job.
curl "http://localhost:3004/job/job_1234567890"
Webhook Integration
Convert a URL and send results to a webhook.
curl -X POST "http://localhost:3004/convert" \
-H "Content-Type: application/json" \
-d '{
"url": "https://petertam.pro",
"custom_webhook": {
"url": "https://your-vectordb-api.com/ingest",
"method": "POST",
"headers": {
"Authorization": "Bearer your-token"
},
"data": {
"content": "##markdown##",
"metadata": {
"url": "##url##",
"timestamp": "##timestamp##",
"title": "##metadata##"
}
}
}
}'
| Variable | Description |
|---|---|
| ##markdown## | The converted markdown content |
| ##url## | The original URL |
| ##timestamp## | The current timestamp |
| ##metadata## | The extracted metadata |
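While wiring this up, it can help to point custom_webhook.url at a capture service (webhook.site is used in the Python examples below) so you can inspect exactly what gets delivered; the data template and its keys are freely chosen:
curl -X POST "http://localhost:3004/convert" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://petertam.pro",
    "custom_webhook": {
      "url": "https://webhook.site/your-test-id",
      "method": "POST",
      "data": {
        "content": "##markdown##",
        "source_url": "##url##",
        "received_at": "##timestamp##"
      }
    }
  }'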
Firecrawl API Compatibility (beta)
SpiderForce4AI provides full compatibility with Firecrawl's API endpoints, allowing for easy migration:
Scrape a single URL (Firecrawl-compatible).
curl -X POST "http://localhost:3004/v1/scrape" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"url": "https://petertam.pro",
"formats": ["markdown"]
}'
{
"success": true,
"data": {
"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
"metadata": {
"title": "Example Domain",
"description": "Example website for illustrative purposes",
"language": "en",
"sourceURL": "https://petertam.pro"
}
}
}
Crawl a sitemap (Firecrawl-compatible).
curl -X POST "http://localhost:3004/v1/crawl" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"url": "https://petertam.pro/sitemap.xml",
"limit": 100,
"webhook": "https://your-webhook.com/endpoint",
"scrapeOptions": {
"formats": ["markdown"]
}
}'
n8n Integration
Use SpiderForce4AI with the n8n automation platform:
// SpiderForce4AI Tool for n8n
// This tool processes URLs through SpiderForce4AI and returns markdown content.
// The input comes as 'query' parameter containing a URL

const SPIDERFORCE_BASE_URL = 'http://localhost:3004'; // or your cloud deployment instance URL

try {
  // Validate input
  if (!query || !query.startsWith('http')) {
    return 'Error: Invalid URL provided. URL must start with http:// or https://';
  }

  const options = {
    url: `${SPIDERFORCE_BASE_URL}/convert`,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json'
    },
    body: {
      url: query,
      // targetSelectors: ['.main-content', 'article', '#content'],
      // removeSelectors: ['.ads', '.nav', '.footer', '.header']
    }
  };

  const response = await this.helpers.httpRequest(options);

  // Return markdown content from response
  return response.markdown || response.toString();
} catch (error) {
  return `Error processing URL: ${error.message}`;
}
DIFY Integration
Use SpiderForce4AI with the DIFY AI platform:
openapi: 3.0.0
info:
  title: SpiderForce4AI Web to Markdown Converter
  description: Convert a web page to Markdown using Dify
  author: Piotr Tamulewicz
  version: 1.0.0
servers:
  - url: http://localhost:3004
paths:
  /convert:
    get:
      summary: Convert a web page to Markdown
      description: Retrieves the content of a web page and converts it to Markdown format
      parameters:
        - name: url
          in: query
          description: The URL of the web page to convert
          required: true
          schema:
            type: string
            format: uri
        - name: targetSelectors
          in: query
          description: CSS selectors to target specific content (comma-separated)
          required: false
          schema:
            type: string
        - name: removeSelectors
          in: query
          description: CSS selectors to remove unwanted content (comma-separated)
          required: false
          schema:
            type: string
      responses:
        '200':
          description: Successful conversion
          content:
            text/markdown:
              schema:
                type: string
        '400':
          description: Bad request (missing or invalid URL)
        '500':
          description: Internal server error
Python Wrapper
Python developers can use our official Python wrapper:
from spiderforce4ai import SpiderForce4AI, CrawlConfig
import requests
from pathlib import Path

# Define extraction template structure
extraction_template = """
{
    "Title": "Extract the main title of the content",
    "MetaDescription": "Extract a search-friendly meta description (under 160 characters)",
    "KeyPoints": ["Extract 3-5 key points from the content"],
    "Categories": ["Extract relevant categories for this content"],
    "ReadingTimeMinutes": "Estimate reading time in minutes"
}
"""

# Define a custom webhook function
def post_extraction_webhook(extraction_result):
    """Send extraction results to a webhook and return transformed data."""
    # Add custom fields or transform the data as needed
    payload = {
        "url": extraction_result.get("url", ""),
        "title": extraction_result.get("Title", ""),
        "description": extraction_result.get("MetaDescription", ""),
        "key_points": extraction_result.get("KeyPoints", []),
        "categories": extraction_result.get("Categories", []),
        "reading_time": extraction_result.get("ReadingTimeMinutes", ""),
        "processed_at": extraction_result.get("timestamp", "")
    }

    # Send to webhook (example using a different webhook than the main one)
    try:
        response = requests.post(
            "https://webhook.site/your-extraction-webhook-id",
            json=payload,
            headers={
                "Authorization": "Bearer extraction-token",
                "Content-Type": "application/json"
            },
            timeout=10
        )
        print(f"Extraction webhook sent: Status {response.status_code}")
    except Exception as e:
        print(f"Extraction webhook error: {str(e)}")

    # Return the transformed data (will be stored in result.extraction_result)
    return payload

# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")

# Configure with post-extraction and webhooks
config = CrawlConfig(
    # Basic crawling settings
    target_selector="article",
    remove_selectors=[".ads", ".navigation", ".comments"],
    max_concurrent_requests=5,

    # Regular webhook for crawl results
    webhook_url="https://webhook.site/your-crawl-webhook-id",
    webhook_headers={
        "Authorization": "Bearer crawl-token",
        "Content-Type": "application/json"
    },

    # Post-extraction LLM processing
    post_extraction_agent={
        "model": "gpt-4-turbo",  # Or another compatible model
        "api_key": "your-api-key-here",
        "max_tokens": 1000,
        "temperature": 0.3,
        "response_format": "json_object",  # Request JSON response format
        "messages": [
            {
                "role": "system",
                "content": f"Extract the following information from the content. Return ONLY valid JSON, no explanations:\n\n{extraction_template}"
            },
            {
                "role": "user",
                "content": "{here_markdown_content}"  # Will be replaced with actual content
            }
        ]
    },

    # Save combined extraction results
    post_extraction_agent_save_to_file="extraction_results.json",

    # Custom function to transform and send extraction webhook
    post_agent_transformer_function=post_extraction_webhook
)

# Crawl a sitemap with both regular and post-extraction webhooks
results = spider.crawl_sitemap_server_parallel("https://petertam.pro/sitemap.xml", config)
Benchmarks
SpiderForce4AI has been benchmarked against other popular web crawling solutions:
Performance Comparison
(Charts: Crawling Time Comparison and Average Processing Time.)
Performance Optimization
Tips for optimizing SpiderForce4AI performance:
Server Requirements
Run SpiderForce4AI on virtually any server with these minimal requirements:
- 1 GB RAM
- 1 CPU core
- Docker, or Node.js with PM2
Configuration Tips
Fine-tune your setup for the optimal balance of speed and thoroughness:
- Adjust MAX_CONCURRENT_PAGES based on available server resources
- Tune MIN_CONTENT_LENGTH to balance between speed and thoroughness
- Set MIN_CONTENT_LENGTH=0 to disable fallbacks

Batch Processing
Scale up your content extraction with the /crawl_sitemap and /crawl_urls endpoints described in the Batch Processing section above; a batching sketch follows.
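As a minimal sketch, the following posts a local URL list to /crawl_urls in fixed-size batches (urls.txt is a hypothetical one-URL-per-line file; jq is assumed to be available):
split -l 50 urls.txt chunk_
for f in chunk_*; do
  jq -R -s 'split("\n") | map(select(length > 0)) | {urls: .}' "$f" \
    | curl -s -X POST "http://localhost:3004/crawl_urls" \
        -H "Content-Type: application/json" -d @-
  echo
done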
Advanced Performance Tuning
For high-volume processing and enterprise use cases, consider these additional optimizations:
Memory Management
- Enable browser recycling via RECYCLE_BROWSER
- Set MAX_MEMORY_USAGE to trigger garbage collection
- Use selectors that target only text-heavy elements

Network Optimization

- Use a proxy rotation service for high-volume crawling
- Set appropriate delays between requests with REQUEST_DELAY
- Distribute crawling across multiple instances for large datasets
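Putting the tuning variables together, a high-volume instance might use settings along these lines (values are illustrative; the accepted formats and units for RECYCLE_BROWSER, MAX_MEMORY_USAGE, and REQUEST_DELAY are assumptions):
# Example high-volume tuning (illustrative values)
MAX_CONCURRENT_PAGES=20
RECYCLE_BROWSER=true
MAX_MEMORY_USAGE=512
REQUEST_DELAY=250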