SpiderForce4AI Documentation

SpiderForce4AI is a high-performance HTML to Markdown converter optimized for AI training data collection and RAG systems.
Example request: http://localhost:3004/convert?url=https://petertam.pro

Run Your Own Web Crawler

No limits. No subscriptions. Just $10/month on any basic VPS (1GB RAM, 1 core is enough).

SpiderForce4AI: 1.69s average conversion time, the fastest in our benchmarks.
Competitors: 3.49s and up, more than twice the time per page.

Installation

SpiderForce4AI can be installed and run in multiple ways:

Quick start with Docker (recommended):

docker run -d --restart unless-stopped -p 3004:3004 --name spiderforce4ai petertamai/spiderforce4ai:latest
Once the container is running, the API is available at http://localhost:3004.

Install from GitHub repository:

git clone https://github.com/petertamai/spiderforce4ai.git
cd spiderforce4ai
npm install
mkdir logs
cp .env.example .env
npm install -g pm2
npm run start:pm2
The service runs under PM2 and listens on port 3004.

Alternatively, deploy to DigitalOcean with a single click.

Configuration

SpiderForce4AI can be configured through environment variables:

# Server Configuration
PORT=3004
NODE_ENV=production
MAX_RETRIES=2
PAGE_TIMEOUT=30000
MAX_CONCURRENT_PAGES=10

# Cleaning Configuration
AGGRESSIVE_CLEANING=true
REMOVE_IMAGES=false

# Dynamic Content Handling
MIN_CONTENT_LENGTH=500
SCROLL_WAIT_TIME=200
Variable              Default  Description
PORT                  3004     Server port
MIN_CONTENT_LENGTH    500      Minimum content length threshold (characters)
AGGRESSIVE_CLEANING   true     Enable aggressive content cleaning
MAX_CONCURRENT_PAGES  10       Maximum number of concurrent pages

Basic Conversion

GET /convert?url=https://petertam.pro

Convert a URL to clean Markdown format.

Example Request
curl "http://localhost:3004/convert?url=https://petertam.pro"
Response
URL: https://petertam.pro
Title: Example Domain
Description: Example website for illustrative purposes

---

# Example Domain

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

[More information...](https://www.iana.org/domains/example)
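The same conversion can be scripted. Below is a minimal Python sketch, assuming the endpoint returns the converted Markdown (with the metadata header shown above) as the plain-text response body.

import requests

# Convert a single page to Markdown via GET /convert
resp = requests.get(
    "http://localhost:3004/convert",
    params={"url": "https://petertam.pro"},
    timeout=60,
)
resp.raise_for_status()
markdown = resp.text  # assumption: the body is the Markdown shown in the example above
print(markdown)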

Advanced Targeting

POST /convert

Extract specific content using CSS selectors.

Example Request
curl -X POST "http://localhost:3004/convert" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://petertam.pro",
    "targetSelectors": ["article", ".main-content"],
    "removeSelectors": [".ads", ".nav", ".footer"]
  }'
Request Parameters
Parameter        Type    Description
url              string  The URL to convert (required)
targetSelectors  array   CSS selectors to target specific content elements
removeSelectors  array   CSS selectors to remove unwanted elements

Dynamic Content Handling

POST /convert

Configure dynamic content handling parameters.

Example Request
curl -X POST "http://localhost:3004/convert" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://petertam.pro",
    "min_content_length": 1000,
    "scroll_wait_time": 300,
    "aggressive_cleaning": true
  }'
Dynamic Content Strategy
  • STAGE 0 (default): fast extraction with aggressive cleaning, optimized for speed
  • STAGE 1 (first fallback): if the extracted content is too short, reload the page, scroll to the bottom, wait 200ms, then extract again with aggressive cleaning
  • STAGE 2 (last resort): if the content is still too short, repeat the scroll-and-wait pass with aggressive cleaning disabled
Auto-Adapting Extraction
SpiderForce4AI automatically determines when to apply these fallback strategies based on content length.
Parameter            Type     Default  Description
min_content_length   number   500      Minimum acceptable content length in characters
scroll_wait_time     number   200      Time to wait after scrolling, in milliseconds
aggressive_cleaning  boolean  true     Enable/disable aggressive cleaning
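These overrides are tuned per request. The sketch below sends the documented parameters from Python; setting min_content_length to 0 disables the fallback stages entirely, as noted in the performance tips later in this document.

import requests

# Per-request overrides for the staged fallback behaviour described above
payload = {
    "url": "https://petertam.pro",
    "min_content_length": 1000,   # require more content before accepting Stage 0 output
    "scroll_wait_time": 300,      # wait a little longer after scrolling (ms)
    "aggressive_cleaning": True,
}
resp = requests.post("http://localhost:3004/convert", json=payload, timeout=120)
resp.raise_for_status()
print(resp.text)

# Speed mode: min_content_length = 0 skips the fallback stages entirely
fast = {"url": "https://petertam.pro", "min_content_length": 0}
print(requests.post("http://localhost:3004/convert", json=fast, timeout=60).text[:200])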

Batch Processing

POST /crawl_sitemap

Process an entire website using its sitemap.

Example Request
curl -X POST "http://localhost:3004/crawl_sitemap" \
  -H "Content-Type: application/json" \
  -d '{
    "sitemapUrl": "https://petertam.pro/sitemap.xml",
    "webhook": {
      "url": "https://your-webhook.com/endpoint",
      "headers": {
        "Authorization": "Bearer your-token"
      },
      "progressUpdates": true,
      "extraFields": {
        "project": "blog-crawler",
        "source": "sitemap"
      }
    }
  }'
Response
{
  "jobId": "job_1234567890",
  "status": "started",
  "config": {
    "sitemapUrl": "https://petertam.pro/sitemap.xml",
    "webhook": {
      "url": "https://your-webhook.com/endpoint",
      "hasHeaders": true
    }
  }
}
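From Python, the same job can be started and its identifier captured for later status checks. This is a minimal sketch based on the request and response shown above; the webhook URL and token are placeholders.

import requests

# Start a sitemap crawl; results are delivered to the webhook as pages complete
job = requests.post(
    "http://localhost:3004/crawl_sitemap",
    json={
        "sitemapUrl": "https://petertam.pro/sitemap.xml",
        "webhook": {
            "url": "https://your-webhook.com/endpoint",          # placeholder
            "headers": {"Authorization": "Bearer your-token"},   # placeholder
            "progressUpdates": True,
        },
    },
    timeout=30,
).json()
print(job["jobId"], job["status"])  # e.g. job_1234567890 started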
POST /crawl_urls

Process multiple URLs in batch.

Example Request
curl -X POST "http://localhost:3004/crawl_urls" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://petertam.pro/page1",
      "https://petertam.pro/page2",
      "https://petertam.pro/page3"
    ]
  }'
GET /job/:jobId

Check status of a batch processing job.

Example Request
curl "http://localhost:3004/job/job_1234567890"

Webhook Integration

POST /convert

Convert a URL and send results to a webhook.

Example Request
curl -X POST "http://localhost:3004/convert" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://petertam.pro",
    "custom_webhook": {
      "url": "https://your-vectordb-api.com/ingest",
      "method": "POST",
      "headers": {
        "Authorization": "Bearer your-token"
      },
      "data": {
        "content": "##markdown##",
        "url": "##url##",
        "timestamp": "##timestamp##",
        "metadata": "##metadata##"
      }
    }
  }'
Webhook Variables
Variable       Description
##markdown##   The converted Markdown content
##url##        The original URL
##timestamp##  The current timestamp
##metadata##   The extracted page metadata
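On the receiving end, the webhook body is simply your data template with the ## variables substituted. Below is a minimal sketch of a receiver using Flask (not part of SpiderForce4AI), with field names matching the example payload above; adapt them to your own template.

from flask import Flask, request

app = Flask(__name__)

@app.route("/ingest", methods=["POST"])
def ingest():
    payload = request.get_json(force=True)
    # Field names follow the custom_webhook "data" template in the example above
    markdown = payload.get("content", "")   # ##markdown##
    source_url = payload.get("url")         # ##url##
    metadata = payload.get("metadata")      # ##metadata##
    print(f"Received {len(markdown)} chars from {source_url} ({metadata})")
    # ... hand off to your vector database / indexing pipeline here ...
    return {"ok": True}

if __name__ == "__main__":
    app.run(port=8000)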

Firecrawl API Compatibility (beta)

SpiderForce4AI provides full compatibility with Firecrawl's API endpoints, allowing for easy migration:

POST /v1/scrape

Scrape a single URL (Firecrawl-compatible).

Example Request
curl -X POST "http://localhost:3004/v1/scrape" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -d '{
      "url": "https://petertam.pro",
      "formats": ["markdown"]
    }'
Response
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
    "metadata": {
      "title": "Example Domain",
      "description": "Example website for illustrative purposes",
      "language": "en",
      "sourceURL": "https://petertam.pro"
    }
  }
}
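Because the response shape matches Firecrawl's, client code that reads data.markdown and data.metadata should keep working. Here is a minimal Python call against the compatibility endpoint; the API key is a placeholder, and whether it is enforced depends on your deployment.

import requests

resp = requests.post(
    "http://localhost:3004/v1/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder
    json={"url": "https://petertam.pro", "formats": ["markdown"]},
    timeout=60,
)
data = resp.json()["data"]
print(data["metadata"]["title"])
print(data["markdown"][:200])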
POST /v1/crawl

Crawl a sitemap (Firecrawl-compatible).

Example Request
curl -X POST "http://localhost:3004/v1/crawl" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -d '{
      "url": "https://petertam.pro/sitemap.xml",
      "limit": 100,
      "webhook": "https://your-webhook.com/endpoint",
      "scrapeOptions": {
        "formats": ["markdown"]
      }
    }'

n8n Integration

Use SpiderForce4AI with n8n automation platform:

// SpiderForce4AI Tool for n8n
// This tool processes URLs through SpiderForce4AI and returns markdown content.
// The input comes in as the 'query' parameter containing a URL.

const SPIDERFORCE_BASE_URL = 'http://localhost:3004'; // or your cloud deployment instance URL

try {
    // Validate input
    if (!query || !query.startsWith('http')) {
        return 'Error: Invalid URL provided. URL must start with http:// or https://';
    }

    const options = {
        url: `${SPIDERFORCE_BASE_URL}/convert`,
        method: 'POST',
        headers: {
            'Content-Type': 'application/json'
        },
        body: {
            url: query,
            // targetSelectors: ['.main-content', 'article', '#content'],
            // removeSelectors: ['.ads', '.nav', '.footer', '.header']
        }
    };

    const response = await this.helpers.httpRequest(options);

    // Return markdown content from the response
    return response.markdown || response.toString();

} catch (error) {
    return `Error processing URL: ${error.message}`;
}
n8n integration example: https://n8n.io/workflow

DIFY Integration

Use SpiderForce4AI with DIFY AI platform:

openapi: 3.0.0
info:
  title: SpiderForce4AI Web to Markdown Converter
  description: Convert a web page to Markdown using Dify
  contact:
    name: Piotr Tamulewicz
  version: 1.0.0

servers:
  - url: http://localhost:3004

paths:
  /convert:
    get:
      summary: Convert a web page to Markdown
      description: Retrieves the content of a web page and converts it to Markdown format
      parameters:
        - name: url
          in: query
          description: The URL of the web page to convert
          required: true
          schema:
            type: string
            format: uri
        - name: targetSelectors
          in: query
          description: CSS selectors to target specific content (comma-separated)
          required: false
          schema:
            type: string
        - name: removeSelectors
          in: query
          description: CSS selectors to remove unwanted content (comma-separated)
          required: false
          schema:
            type: string
      responses:
        '200':
          description: Successful conversion
          content:
            text/markdown:
              schema:
                type: string
        '400':
          description: Bad request (missing or invalid URL)
        '500':
          description: Internal server error

Python Wrapper

For Python developers, you can use our official Python wrapper:

Install from PyPI: pip install spiderforce4ai

from spiderforce4ai import SpiderForce4AI, CrawlConfig
import requests
from pathlib import Path

# Define extraction template structure
extraction_template = """
{
    "Title": "Extract the main title of the content",
    "MetaDescription": "Extract a search-friendly meta description (under 160 characters)",
    "KeyPoints": ["Extract 3-5 key points from the content"],
    "Categories": ["Extract relevant categories for this content"],
    "ReadingTimeMinutes": "Estimate reading time in minutes"
}
"""

# Define a custom webhook function
def post_extraction_webhook(extraction_result):
    """Send extraction results to a webhook and return transformed data."""
    # Add custom fields or transform the data as needed
    payload = {
        "url": extraction_result.get("url", ""),
        "title": extraction_result.get("Title", ""),
        "description": extraction_result.get("MetaDescription", ""),
        "key_points": extraction_result.get("KeyPoints", []),
        "categories": extraction_result.get("Categories", []),
        "reading_time": extraction_result.get("ReadingTimeMinutes", ""),
        "processed_at": extraction_result.get("timestamp", "")
    }
    
    # Send to webhook (example using a different webhook than the main one)
    try:
        response = requests.post(
            "https://webhook.site/your-extraction-webhook-id",
            json=payload,
            headers={
                "Authorization": "Bearer extraction-token",
                "Content-Type": "application/json"
            },
            timeout=10
        )
        print(f"Extraction webhook sent: Status {response.status_code}")
    except Exception as e:
        print(f"Extraction webhook error: {str(e)}")
    
    # Return the transformed data (will be stored in result.extraction_result)
    return payload

# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")

# Configure with post-extraction and webhooks
config = CrawlConfig(
    # Basic crawling settings
    target_selector="article",
    remove_selectors=[".ads", ".navigation", ".comments"],
    max_concurrent_requests=5,
    
    # Regular webhook for crawl results
    webhook_url="https://webhook.site/your-crawl-webhook-id",
    webhook_headers={
        "Authorization": "Bearer crawl-token",
        "Content-Type": "application/json"
    },
    
    # Post-extraction LLM processing
    post_extraction_agent={
        "model": "gpt-4-turbo",  # Or another compatible model
        "api_key": "your-api-key-here",
        "max_tokens": 1000,
        "temperature": 0.3,
        "response_format": "json_object",  # Request JSON response format
        "messages": [
            {
                "role": "system",
                "content": f"Extract the following information from the content. Return ONLY valid JSON, no explanations:\n\n{extraction_template}"
            },
            {
                "role": "user",
                "content": "{here_markdown_content}"  # Will be replaced with actual content
            }
        ]
    },
    # Save combined extraction results
    post_extraction_agent_save_to_file="extraction_results.json",
    # Custom function to transform and send extraction webhook
    post_agent_transformer_function=post_extraction_webhook
)

# Crawl a sitemap with both regular and post-extraction webhooks
results = spider.crawl_sitemap_server_parallel("https://petertam.pro/sitemap.xml", config)
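
# Once the crawl returns, each result should expose the payload produced by
# post_extraction_webhook via result.extraction_result (see the comment above).
# NOTE: the exact shape of the per-URL result objects is an assumption here;
# check it against the spiderforce4ai package version you have installed.
for result in results:
    extracted = getattr(result, "extraction_result", None)
    if extracted:
        print(extracted["title"], "-", extracted["reading_time"])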
              

Benchmarks

SpiderForce4AI has been benchmarked against other popular web crawling solutions:

Performance Comparison

Tool            Avg. time  Relative time
SpiderForce4AI  1.69s      100% (baseline)
Jina AI         3.49s      206% of baseline (about 2.1x the time)
Firecrawl       5.83s      345% of baseline (about 3.5x the time)
Benchmark Charts

[Chart: Crawling Time Comparison]
[Chart: Average Processing Time]
SpiderForce4AI consistently outperforms other solutions, especially on dynamic content-heavy sites.

Performance Optimization

Tips for optimizing SpiderForce4AI performance:

Server Requirements

Run SpiderForce4AI on virtually any server with these minimal requirements:

  • Minimum: 1GB RAM, 1 CPU core (suitable for basic usage)
  • Recommended: 2GB RAM, 2 CPU cores (comfortable for higher volume)
  • Cost: ~$10/month on any VPS, no premium subscriptions needed

Configuration Tips

Fine-tune your setup for the optimal balance of speed and thoroughness:

  • Concurrent Pages: set MAX_CONCURRENT_PAGES based on available server resources
  • Content Length: adjust MIN_CONTENT_LENGTH to balance speed and thoroughness
  • Speed Mode: for maximum speed, set MIN_CONTENT_LENGTH=0 to disable fallbacks
  • Targeted Extraction: use specific CSS selectors to extract only the content you need

Batch Processing

Scale up your content extraction with these batch processing tips:

  • Optimal Batch Size: process URLs in batches of 50-100 for maximum throughput (see the sketch below)
  • Asynchronous Processing: use webhooks to handle large volumes asynchronously
  • Rate Limiting: implement rate limiting when crawling multiple sites to avoid blocks
  • Sitemap API: use the sitemap API for efficient large-scale content extraction
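A minimal Python sketch of this pattern: split a URL list into batches of 50, submit each batch to /crawl_urls, and pause between submissions as simple rate limiting. The batch size and delay are illustrative values.

import time
import requests

BASE = "http://localhost:3004"
BATCH_SIZE = 50        # the tips above suggest 50-100 URLs per batch
DELAY_SECONDS = 2.0    # illustrative pause between submissions

def crawl_in_batches(urls):
    """Submit URLs to /crawl_urls in fixed-size batches, pausing between batches."""
    job_ids = []
    for i in range(0, len(urls), BATCH_SIZE):
        batch = urls[i:i + BATCH_SIZE]
        job = requests.post(f"{BASE}/crawl_urls", json={"urls": batch}, timeout=30).json()
        job_ids.append(job.get("jobId"))
        time.sleep(DELAY_SECONDS)
    return job_ids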

Advanced Performance Tuning

For high-volume processing and enterprise use cases, consider these additional optimizations:

Memory Management

  • Enable browser recycling via RECYCLE_BROWSER
  • Set MAX_MEMORY_USAGE to trigger garbage collection
  • Use selectors that target only text-heavy elements

Network Optimization

  • Use a proxy rotation service for high-volume crawling
  • Set appropriate delays between requests with REQUEST_DELAY
  • Distribute crawling across multiple instances for large datasets
For enterprise support and custom optimization, contact [email protected]