SpiderForce4AI Documentation
Run Your Own Web Crawler
No limits. No subscriptions. Just about $10/month for any basic VPS (1 GB RAM and 1 core are enough).
Installation
SpiderForce4AI can be installed and run in multiple ways:
Quick start with Docker (recommended):
docker run -d --restart unless-stopped -p 3004:3004 --name spiderforce4ai petertamai/spiderforce4ai:latest
Container started successfully!
Install from GitHub repository:
git clone https://github.com/petertamai/spiderforce4ai.git
cd spiderforce4ai
npm install
mkdir logs
cp .env.example .env
npm install -g pm2
npm run start:pm2
Service started on port 3004
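Whichever route you choose, a quick request confirms the service is up (the /convert endpoint is described under Basic Conversion below); for the PM2 install you can also check the process list:
pm2 status
curl -s "http://localhost:3004/convert?url=https://petertam.pro" | head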
Configuration
SpiderForce4AI can be configured through environment variables:
# Server Configuration
PORT=3004
NODE_ENV=production
MAX_RETRIES=2
PAGE_TIMEOUT=30000
MAX_CONCURRENT_PAGES=10
# Cleaning Configuration
AGGRESSIVE_CLEANING=true
REMOVE_IMAGES=false
# Dynamic Content Handling
MIN_CONTENT_LENGTH=500
SCROLL_WAIT_TIME=200
| Variable | Default | Description |
|---|---|---|
| PORT | 3004 | Server port |
| MIN_CONTENT_LENGTH | 500 | Minimum content length threshold (characters) |
| AGGRESSIVE_CLEANING | true | Enable aggressive content cleaning |
| MAX_CONCURRENT_PAGES | 10 | Maximum number of concurrent pages |
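If you run the Docker image, these variables can be passed with -e flags when starting the container, for example (values are illustrative, and this assumes the image reads them at startup):
docker run -d --restart unless-stopped -p 3004:3004 \
  -e MAX_CONCURRENT_PAGES=5 \
  -e MIN_CONTENT_LENGTH=1000 \
  -e AGGRESSIVE_CLEANING=true \
  --name spiderforce4ai petertamai/spiderforce4ai:latest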
Basic Conversion
Convert a URL to clean Markdown format.
curl "http://localhost:3004/convert?url=https://petertam.pro"
URL: https://petertam.pro
Title: Example Domain
Description: Example website for illustrative purposes
---
# Example Domain
This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.
[More information...](https://www.iana.org/domains/example)
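The GET endpoint also accepts comma-separated targetSelectors and removeSelectors query parameters (see the OpenAPI description in the DIFY Integration section below), for example:
curl "http://localhost:3004/convert?url=https://petertam.pro&targetSelectors=article,.main-content&removeSelectors=.ads,.footer"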
Advanced Targeting
Extract specific content using CSS selectors.
curl -X POST "http://localhost:3004/convert" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://petertam.pro",
    "targetSelectors": ["article", ".main-content"],
    "removeSelectors": [".ads", ".nav", ".footer"]
  }'
| Parameter | Type | Description |
|---|---|---|
| url | string | The URL to convert (required) |
| targetSelectors | array | CSS selectors to target specific content elements |
| removeSelectors | array | CSS selectors to remove unwanted elements |
Dynamic Content Handling
Configure dynamic content handling parameters.
curl -X POST "http://localhost:3004/convert" \
-H "Content-Type: application/json" \
-d '{
"url": "https://petertam.pro",
"min_content_length": 1000,
"scroll_wait_time": 300,
"aggressive_cleaning": true
}'
Extraction falls back through three stages until enough content is captured:

- STAGE 0 (Default): Fast extraction with aggressive cleaning, optimized for speed.
- STAGE 1 (First Fallback): If content is insufficient, re-run with a scroll to the bottom, wait 200 ms, then retry extraction with aggressive cleaning.
- STAGE 2 (Last Resort): If content is still insufficient, re-run with a scroll to the bottom, wait 200 ms, and disable aggressive cleaning.
| Parameter | Type | Default | Description |
|---|---|---|---|
| min_content_length | number | 500 | Minimum acceptable content length in characters |
| scroll_wait_time | number | 200 | Time to wait after scrolling in milliseconds |
| aggressive_cleaning | boolean | true | Enable/disable aggressive cleaning |
Batch Processing
Process an entire website using its sitemap.
curl -X POST "http://localhost:3004/crawl_sitemap" \
-H "Content-Type: application/json" \
-d '{
"sitemapUrl": "https://petertam.pro/sitemap.xml",
"webhook": {
"url": "https://your-webhook.com/endpoint",
"headers": {
"Authorization": "Bearer your-token"
},
"progressUpdates": true,
"extraFields": {
"project": "blog-crawler",
"source": "sitemap"
}
}
}'
{
"jobId": "job_1234567890",
"status": "started",
"config": {
"sitemapUrl": "https://petertam.pro/sitemap.xml",
"webhook": {
"url": "https://your-webhook.com/endpoint",
"hasHeaders": true
}
}
}
Process multiple URLs in a single batch.
curl -X POST "http://localhost:3004/crawl_urls" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://petertam.pro/page1",
"https://petertam.pro/page2",
"https://petertam.pro/page3"
]
}'
Check the status of a batch processing job.
curl "http://localhost:3004/job/job_1234567890"
Webhook Integration
Convert a URL and send results to a webhook.
curl -X POST "http://localhost:3004/convert" \
-H "Content-Type: application/json" \
-d '{
"url": "https://petertam.pro",
"custom_webhook": {
"url": "https://your-vectordb-api.com/ingest",
"method": "POST",
"headers": {
"Authorization": "Bearer your-token"
},
"data": {
"content": "##markdown##",
"metadata": {
"url": "##url##",
"timestamp": "##timestamp##",
"title": "##metadata##"
}
}
}
}'
| Variable | Description |
|---|---|
| ##markdown## | The converted markdown content |
| ##url## | The original URL |
| ##timestamp## | The current timestamp |
| ##metadata## | The extracted metadata |
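While wiring this up, it can help to point custom_webhook.url at a capture service (webhook.site is used in the Python examples below) so you can inspect exactly what gets delivered; the data template and its keys are freely chosen:
curl -X POST "http://localhost:3004/convert" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://petertam.pro",
    "custom_webhook": {
      "url": "https://webhook.site/your-test-id",
      "method": "POST",
      "data": {
        "content": "##markdown##",
        "source_url": "##url##",
        "received_at": "##timestamp##"
      }
    }
  }'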
Firecrawl API Compatibility (beta)
SpiderForce4AI provides full compatibility with Firecrawl's API endpoints, allowing for easy migration:
Scrape a single URL (Firecrawl-compatible).
curl -X POST "http://localhost:3004/v1/scrape" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"url": "https://petertam.pro",
"formats": ["markdown"]
}'
{
"success": true,
"data": {
"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
"metadata": {
"title": "Example Domain",
"description": "Example website for illustrative purposes",
"language": "en",
"sourceURL": "https://petertam.pro"
}
}
}
Crawl a sitemap (Firecrawl-compatible).
curl -X POST "http://localhost:3004/v1/crawl" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"url": "https://petertam.pro/sitemap.xml",
"limit": 100,
"webhook": "https://your-webhook.com/endpoint",
"scrapeOptions": {
"formats": ["markdown"]
}
}'
n8n Integration
Use SpiderForce4AI with the n8n automation platform:
// SpiderForce4AI Tool for n8n
// This tool processes URLs through SpiderForce4AI and returns markdown content.
// The input comes as 'query' parameter containing a URL

const SPIDERFORCE_BASE_URL = 'http://localhost:3004'; // or your cloud deployment instance URL

try {
  // Validate input
  if (!query || !query.startsWith('http')) {
    return 'Error: Invalid URL provided. URL must start with http:// or https://';
  }

  const options = {
    url: `${SPIDERFORCE_BASE_URL}/convert`,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json'
    },
    body: {
      url: query,
      // targetSelectors: ['.main-content', 'article', '#content'],
      // removeSelectors: ['.ads', '.nav', '.footer', '.header']
    }
  };

  const response = await this.helpers.httpRequest(options);

  // Return markdown content from response
  return response.markdown || response.toString();
} catch (error) {
  return `Error processing URL: ${error.message}`;
}
DIFY Integration
Use SpiderForce4AI with the DIFY AI platform:
openapi: 3.0.0
info:
  title: SpiderForce4AI Web to Markdown Converter
  description: Convert a web page to Markdown using Dify
  author: Piotr Tamulewicz
  version: 1.0.0
servers:
  - url: http://localhost:3004
paths:
  /convert:
    get:
      summary: Convert a web page to Markdown
      description: Retrieves the content of a web page and converts it to Markdown format
      parameters:
        - name: url
          in: query
          description: The URL of the web page to convert
          required: true
          schema:
            type: string
            format: uri
        - name: targetSelectors
          in: query
          description: CSS selectors to target specific content (comma-separated)
          required: false
          schema:
            type: string
        - name: removeSelectors
          in: query
          description: CSS selectors to remove unwanted content (comma-separated)
          required: false
          schema:
            type: string
      responses:
        '200':
          description: Successful conversion
          content:
            text/markdown:
              schema:
                type: string
        '400':
          description: Bad request (missing or invalid URL)
        '500':
          description: Internal server error
Python Wrapper
Python developers can use our official Python wrapper:
from spiderforce4ai import SpiderForce4AI, CrawlConfig
import requests
from pathlib import Path

# Define extraction template structure
extraction_template = """
{
    "Title": "Extract the main title of the content",
    "MetaDescription": "Extract a search-friendly meta description (under 160 characters)",
    "KeyPoints": ["Extract 3-5 key points from the content"],
    "Categories": ["Extract relevant categories for this content"],
    "ReadingTimeMinutes": "Estimate reading time in minutes"
}
"""

# Define a custom webhook function
def post_extraction_webhook(extraction_result):
    """Send extraction results to a webhook and return transformed data."""
    # Add custom fields or transform the data as needed
    payload = {
        "url": extraction_result.get("url", ""),
        "title": extraction_result.get("Title", ""),
        "description": extraction_result.get("MetaDescription", ""),
        "key_points": extraction_result.get("KeyPoints", []),
        "categories": extraction_result.get("Categories", []),
        "reading_time": extraction_result.get("ReadingTimeMinutes", ""),
        "processed_at": extraction_result.get("timestamp", "")
    }

    # Send to webhook (example using a different webhook than the main one)
    try:
        response = requests.post(
            "https://webhook.site/your-extraction-webhook-id",
            json=payload,
            headers={
                "Authorization": "Bearer extraction-token",
                "Content-Type": "application/json"
            },
            timeout=10
        )
        print(f"Extraction webhook sent: Status {response.status_code}")
    except Exception as e:
        print(f"Extraction webhook error: {str(e)}")

    # Return the transformed data (will be stored in result.extraction_result)
    return payload

# Initialize crawler
spider = SpiderForce4AI("http://localhost:3004")

# Configure with post-extraction and webhooks
config = CrawlConfig(
    # Basic crawling settings
    target_selector="article",
    remove_selectors=[".ads", ".navigation", ".comments"],
    max_concurrent_requests=5,

    # Regular webhook for crawl results
    webhook_url="https://webhook.site/your-crawl-webhook-id",
    webhook_headers={
        "Authorization": "Bearer crawl-token",
        "Content-Type": "application/json"
    },

    # Post-extraction LLM processing
    post_extraction_agent={
        "model": "gpt-4-turbo",  # Or another compatible model
        "api_key": "your-api-key-here",
        "max_tokens": 1000,
        "temperature": 0.3,
        "response_format": "json_object",  # Request JSON response format
        "messages": [
            {
                "role": "system",
                "content": f"Extract the following information from the content. Return ONLY valid JSON, no explanations:\n\n{extraction_template}"
            },
            {
                "role": "user",
                "content": "{here_markdown_content}"  # Will be replaced with actual content
            }
        ]
    },

    # Save combined extraction results
    post_extraction_agent_save_to_file="extraction_results.json",

    # Custom function to transform and send extraction webhook
    post_agent_transformer_function=post_extraction_webhook
)

# Crawl a sitemap with both regular and post-extraction webhooks
results = spider.crawl_sitemap_server_parallel("https://petertam.pro/sitemap.xml", config)
Benchmarks
SpiderForce4AI has been benchmarked against other popular web crawling solutions:
Performance Comparison
(Charts: Crawling Time Comparison and Average Processing Time.)
Performance Optimization
Tips for optimizing SpiderForce4AI performance:
Server Requirements
Run SpiderForce4AI on virtually any server with these minimal requirements:
- 1 GB RAM
- 1 CPU core
- Docker, or Node.js with PM2
Configuration Tips
Fine-tune your setup for the optimal balance of speed and thoroughness:
- Adjust MAX_CONCURRENT_PAGES based on available server resources
- Tune MIN_CONTENT_LENGTH to balance between speed and thoroughness
- Set MIN_CONTENT_LENGTH=0 to disable fallbacks

Batch Processing
Scale up your content extraction with the /crawl_sitemap and /crawl_urls endpoints described in the Batch Processing section above; a batching sketch follows.
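As a minimal sketch, the following posts a local URL list to /crawl_urls in fixed-size batches (urls.txt is a hypothetical one-URL-per-line file; jq is assumed to be available):
split -l 50 urls.txt chunk_
for f in chunk_*; do
  jq -R -s 'split("\n") | map(select(length > 0)) | {urls: .}' "$f" \
    | curl -s -X POST "http://localhost:3004/crawl_urls" \
        -H "Content-Type: application/json" -d @-
  echo
done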
Advanced Performance Tuning
For high-volume processing and enterprise use cases, consider these additional optimizations:
Memory Management
- Enable browser recycling via RECYCLE_BROWSER
- Set MAX_MEMORY_USAGE to trigger garbage collection
- Use selectors that target only text-heavy elements

Network Optimization

- Use a proxy rotation service for high-volume crawling
- Set appropriate delays between requests with REQUEST_DELAY
- Distribute crawling across multiple instances for large datasets
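Putting the tuning variables together, a high-volume instance might use settings along these lines (values are illustrative; the accepted formats and units for RECYCLE_BROWSER, MAX_MEMORY_USAGE, and REQUEST_DELAY are assumptions):
# Example high-volume tuning (illustrative values)
MAX_CONCURRENT_PAGES=20
RECYCLE_BROWSER=true
MAX_MEMORY_USAGE=512
REQUEST_DELAY=250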