firecrawl/SELF_HOST.md

# Self-hosting Firecrawl

#### Contributor?

Welcome to [Firecrawl](https://firecrawl.dev) 🔥! Here are some instructions on how to get the project locally so you can run it on your own and contribute.

If you're contributing, note that the process is similar to other open-source repos, i.e., fork Firecrawl, make changes, run tests, PR.

If you have any questions or would like help getting on board, reach out to hello@mendable.ai for more information or submit an issue!

## Why?

Self-hosting Firecrawl is particularly beneficial for organizations with stringent security policies that require data to remain within controlled environments. Here are some key reasons to consider self-hosting:

- **Enhanced Security and Compliance:** By self-hosting, you ensure that all data handling and processing complies with internal and external regulations, keeping sensitive information within your secure infrastructure. Note that Firecrawl is a Mendable product and relies on SOC2 Type2 certification, which means that the platform adheres to high industry standards for managing data security.
- **Customizable Services:** Self-hosting allows you to tailor the services, such as the Playwright service, to meet specific needs or handle particular use cases that may not be supported by the standard cloud offering.
- **Learning and Community Contribution:** By setting up and maintaining your own instance, you gain a deeper understanding of how Firecrawl works, which can also lead to more meaningful contributions to the project.

### Considerations

However, there are some limitations and additional responsibilities to be aware of:

1. **Limited Access to Fire-engine:** Currently, self-hosted instances of Firecrawl do not have access to Fire-engine, which includes advanced features for handling IP blocks, robot detection mechanisms, and more. This means that while you can manage basic scraping tasks, more complex scenarios might require additional configuration or might not be supported.
2. **Manual Configuration Required:** If you need to use scraping methods beyond the basic fetch and Playwright options, you will need to manually configure these in the `.env` file. This requires a deeper understanding of the technologies and might involve more setup time.

Self-hosting Firecrawl is ideal for those who need full control over their scraping and data processing environments but comes with the trade-off of additional maintenance and configuration efforts.

## Steps

1. First, start by installing the dependencies

- Docker [instructions](https://docs.docker.com/get-docker/)


2. Set environment variables

Create an `.env` in the root directory you can copy over the template in `apps/api/.env.example`

To start, we wont set up authentication, or any optional sub services (pdf parsing, JS blocking support, AI features)

`.env:`
```
# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379

## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false

# ===== Optional ENVS ======

# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=

# Other Optionals
TEST_API_KEY= # use if you've set up authentication and want to test with a real API key
SCRAPING_BEE_API_KEY= #Set if you'd like to use scraping Be to handle JS blocking
OPENAI_API_KEY= # add for LLM dependednt features (image alt generation, etc.)
BULL_AUTH_KEY= @
LOGTAIL_KEY= # Use if you're configuring basic logging with logtail
PLAYWRIGHT_MICROSERVICE_URL=  # set if you'd like to run a playwright fallback
LLAMAPARSE_API_KEY= #Set if you have a llamaparse key you'd like to use to parse pdfs
SERPER_API_KEY= #Set if you have a serper key you'd like to use as a search api
SLACK_WEBHOOK_URL= # set if you'd like to send slack server health status messages
POSTHOG_API_KEY= # set if you'd like to send posthog events like job logs
POSTHOG_HOST= # set if you'd like to send posthog events like job logs
```

3.  *(Optional) Running with TypeScript Playwright Service*
    
    *   Update the `docker-compose.yml` file to change the Playwright service:
        
        ```plaintext
            build: apps/playwright-service
        ```
        TO
        ```plaintext
            build: apps/playwright-service-ts
        ```
        
    *   Set the `PLAYWRIGHT_MICROSERVICE_URL` in your `.env` file:
        
        ```plaintext
        PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3000/scrape
        ```
        
    *   Don't forget to set the proxy server in your `.env` file as needed.

4.  Build and run the Docker containers:
    
    ```bash
    docker compose build
    docker compose up
    ```

This will run a local instance of Firecrawl which can be accessed at `http://localhost:3002`.

You should be able to see the Bull Queue Manager UI on `http://localhost:3002/admin/@/queues`.

5. *(Optional)* Test the API

If you’d like to test the crawl endpoint, you can run this:

  ```bash
  curl -X POST http://localhost:3002/v0/crawl \
      -H 'Content-Type: application/json' \
      -d '{
        "url": "https://mendable.ai"
      }'
  ```   

## Troubleshooting

This section provides solutions to common issues you might encounter while setting up or running your self-hosted instance of Firecrawl.

### Supabase client is not configured

**Symptom:**
```bash
[YYYY-MM-DDTHH:MM:SS.SSSz]ERROR - Attempted to access Supabase client when it's not configured.
[YYYY-MM-DDTHH:MM:SS.SSSz]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
```

**Explanation:**
This error occurs because the Supabase client setup is not completed. You should be able to scrape and crawl with no problems. Right now it's not possible to configure Supabase in self-hosted instances.

### You're bypassing authentication

**Symptom:**
```bash
[YYYY-MM-DDTHH:MM:SS.SSSz]WARN - You're bypassing authentication
```

**Explanation:**
This error occurs because the Supabase client setup is not completed. You should be able to scrape and crawl with no problems. Right now it's not possible to configure Supabase in self-hosted instances.

### Docker containers fail to start

**Symptom:**
Docker containers exit unexpectedly or fail to start.

**Solution:**
Check the Docker logs for any error messages using the command:
```bash
docker logs [container_name]
```

- Ensure all required environment variables are set correctly in the .env file.
- Verify that all Docker services defined in docker-compose.yml are correctly configured and the necessary images are available.

### Connection issues with Redis

**Symptom:**
Errors related to connecting to Redis, such as timeouts or "Connection refused".

**Solution:**
- Ensure that the Redis service is up and running in your Docker environment.
- Verify that the REDIS_URL and REDIS_RATE_LIMIT_URL in your .env file point to the correct Redis instance.
- Check network settings and firewall rules that may block the connection to the Redis port.

### API endpoint does not respond

**Symptom:**
API requests to the Firecrawl instance timeout or return no response.

**Solution:**
- Ensure that the Firecrawl service is running by checking the Docker container status.
- Verify that the PORT and HOST settings in your .env file are correct and that no other service is using the same port.
- Check the network configuration to ensure that the host is accessible from the client making the API request.

By addressing these common issues, you can ensure a smoother setup and operation of your self-hosted Firecrawl instance.

## Install Firecrawl on a Kubernetes Cluster (Simple Version)

Read the [examples/kubernetes-cluster-install/README.md](https://github.com/mendableai/firecrawl/blob/main/examples/kubernetes-cluster-install/README.md) for instructions on how to install Firecrawl on a Kubernetes Cluster.
-												Update SELF_HOST.md

											
										
										
											2024-07-31 02:59:42 +08:00
+								# Self-hosting Firecrawl
-												Initial commit

											
										
										
											2024-04-16 05:01:47 +08:00
-												Update SELF_HOST.md

											
										
										
											2024-07-31 02:59:42 +08:00
+								#### Contributor?
-												Removed .env.example, improved docs and docker compose envs

											
										
										
											2024-05-10 22:38:17 +08:00
-												Update SELF_HOST.md

											
										
										
											2024-07-31 02:59:42 +08:00
+								Welcome to [Firecrawl](https://firecrawl.dev) 🔥! Here are some instructions on how to get the project locally so you can run it on your own and contribute.
-												Initial commit

											
										
										
											2024-04-16 05:01:47 +08:00
-												Update SELF_HOST.md

											
										
										
											2024-07-31 02:59:42 +08:00
+								If you're contributing, note that the process is similar to other open-source repos, i.e., fork Firecrawl, make changes, run tests, PR.
-												Update SELF_HOST.md

											
										
										
											2024-05-06 00:03:42 +08:00
-												Update SELF_HOST.md

											
										
										
											2024-07-31 02:59:42 +08:00
+								If you have any questions or would like help getting on board, reach out to hello@mendable.ai for more information or submit an issue!
-												Added missing instruction

											
										
										
											2024-05-14 00:47:49 +08:00
-												Update SELF_HOST.md

											
										
										
											2024-07-31 02:59:42 +08:00
+								## Why?
-												Update SELF_HOST.md

											
										
										
											2024-05-06 00:03:42 +08:00
-												Update SELF_HOST.md

											
										
										
											2024-07-31 02:59:42 +08:00
+								Self-hosting Firecrawl is particularly beneficial for organizations with stringent security policies that require data to remain within controlled environments. Here are some key reasons to consider self-hosting:
 								- **Enhanced Security and Compliance:** By self-hosting, you ensure that all data handling and processing complies with internal and external regulations, keeping sensitive information within your secure infrastructure. Note that Firecrawl is a Mendable product and relies on SOC2 Type2 certification, which means that the platform adheres to high industry standards for managing data security.
 								- **Customizable Services:** Self-hosting allows you to tailor the services, such as the Playwright service, to meet specific needs or handle particular use cases that may not be supported by the standard cloud offering.
 								- **Learning and Community Contribution:** By setting up and maintaining your own instance, you gain a deeper understanding of how Firecrawl works, which can also lead to more meaningful contributions to the project.
 								### Considerations
 								However, there are some limitations and additional responsibilities to be aware of:
 . **Limited Access to Fire-engine:** Currently, self-hosted instances of Firecrawl do not have access to Fire-engine, which includes advanced features for handling IP blocks, robot detection mechanisms, and more. This means that while you can manage basic scraping tasks, more complex scenarios might require additional configuration or might not be supported.
 . **Manual Configuration Required:** If you need to use scraping methods beyond the basic fetch and Playwright options, you will need to manually configure these in the `.env` file. This requires a deeper understanding of the technologies and might involve more setup time.
 								Self-hosting Firecrawl is ideal for those who need full control over their scraping and data processing environments but comes with the trade-off of additional maintenance and configuration efforts.
 								## Steps
 . First, start by installing the dependencies
 								- Docker [instructions](https://docs.docker.com/get-docker/)
 . Set environment variables
 								Create an `.env` in the root directory you can copy over the template in `apps/api/.env.example`
 								To start, we wont set up authentication, or any optional sub services (pdf parsing, JS blocking support, AI features)
 								`.env:`
 								```
 								# ===== Required ENVS ======
 								NUM_WORKERS_PER_QUEUE=8
 								PORT=3002
 								HOST=0.0.0.0
 								REDIS_URL=redis://redis:6379
 								REDIS_RATE_LIMIT_URL=redis://redis:6379
 								## To turn on DB authentication, you need to set up supabase.
 								USE_DB_AUTHENTICATION=false
 								# ===== Optional ENVS ======
 								# Supabase Setup (used to support DB authentication, advanced logging, etc.)
 								SUPABASE_ANON_TOKEN=
 								SUPABASE_URL=
 								SUPABASE_SERVICE_TOKEN=
 								# Other Optionals
 								TEST_API_KEY= # use if you've set up authentication and want to test with a real API key
 								SCRAPING_BEE_API_KEY= #Set if you'd like to use scraping Be to handle JS blocking
 								OPENAI_API_KEY= # add for LLM dependednt features (image alt generation, etc.)
 								BULL_AUTH_KEY= @
 								LOGTAIL_KEY= # Use if you're configuring basic logging with logtail
 								PLAYWRIGHT_MICROSERVICE_URL=  # set if you'd like to run a playwright fallback
 								LLAMAPARSE_API_KEY= #Set if you have a llamaparse key you'd like to use to parse pdfs
 								SERPER_API_KEY= #Set if you have a serper key you'd like to use as a search api
 								SLACK_WEBHOOK_URL= # set if you'd like to send slack server health status messages
 								POSTHOG_API_KEY= # set if you'd like to send posthog events like job logs
 								POSTHOG_HOST= # set if you'd like to send posthog events like job logs
 								```
 .  *(Optional) Running with TypeScript Playwright Service*
-												(Docs) Self Host added new ts playwright service instructions

											
										
										
											2024-07-04 03:00:44 +08:00
 								    *   Update the `docker-compose.yml` file to change the Playwright service:
 								        ```plaintext
 								            build: apps/playwright-service
 								        ```
 								        TO
 								        ```plaintext
 								            build: apps/playwright-service-ts
 								        ```
 								    *   Set the `PLAYWRIGHT_MICROSERVICE_URL` in your `.env` file:
 								        ```plaintext
 								        PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3000/scrape
 								        ```
 								    *   Don't forget to set the proxy server in your `.env` file as needed.
-												Update SELF_HOST.md

											
										
										
											2024-07-31 02:59:42 +08:00
 .  Build and run the Docker containers:
-												(Docs) Self Host added new ts playwright service instructions

											
										
										
											2024-07-04 03:00:44 +08:00
 								    ```bash
 								    docker compose build
 								    docker compose up
 								    ```
-												Update SELF_HOST.md
											
										
										
											2024-05-18 09:46:59 +08:00
-												Update SELF_HOST.md

											
										
										
											2024-05-06 00:03:42 +08:00
+								This will run a local instance of Firecrawl which can be accessed at `http://localhost:3002`.
-												Add Kubernetes configuration for Firecrawl deployment

Added new files for setting up Firecrawl on a Kubernetes Cluster. The files include Kubernetes manifests for deploying API, worker, playwright service, and Redis with associated ConfigMap and Secret associated resources. Also, updated the self-host documentation to include instructions for Kubernetes deployment.

											
										
										
											2024-06-05 02:52:08 +08:00
-												Update SELF_HOST.md

											
										
										
											2024-07-31 02:59:42 +08:00
+								You should be able to see the Bull Queue Manager UI on `http://localhost:3002/admin/@/queues`.
 . *(Optional)* Test the API
 								If you’d like to test the crawl endpoint, you can run this:
 								  ```bash
 								  curl -X POST http://localhost:3002/v0/crawl \
 								      -H 'Content-Type: application/json' \
 								      -d '{
 								        "url": "https://mendable.ai"
 								      }'
 								  ```
 								## Troubleshooting
 								This section provides solutions to common issues you might encounter while setting up or running your self-hosted instance of Firecrawl.
 								### Supabase client is not configured
 								**Symptom:**
 								```bash
 								[YYYY-MM-DDTHH:MM:SS.SSSz]ERROR - Attempted to access Supabase client when it's not configured.
 								[YYYY-MM-DDTHH:MM:SS.SSSz]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
 								```
 								**Explanation:**
 								This error occurs because the Supabase client setup is not completed. You should be able to scrape and crawl with no problems. Right now it's not possible to configure Supabase in self-hosted instances.
 								### You're bypassing authentication
 								**Symptom:**
 								```bash
 								[YYYY-MM-DDTHH:MM:SS.SSSz]WARN - You're bypassing authentication
 								```
 								**Explanation:**
 								This error occurs because the Supabase client setup is not completed. You should be able to scrape and crawl with no problems. Right now it's not possible to configure Supabase in self-hosted instances.
 								### Docker containers fail to start
 								**Symptom:**
 								Docker containers exit unexpectedly or fail to start.
 								**Solution:**
 								Check the Docker logs for any error messages using the command:
 								```bash
 								docker logs [container_name]
 								```
 								- Ensure all required environment variables are set correctly in the .env file.
 								- Verify that all Docker services defined in docker-compose.yml are correctly configured and the necessary images are available.
 								### Connection issues with Redis
 								**Symptom:**
 								Errors related to connecting to Redis, such as timeouts or "Connection refused".
 								**Solution:**
 								- Ensure that the Redis service is up and running in your Docker environment.
 								- Verify that the REDIS_URL and REDIS_RATE_LIMIT_URL in your .env file point to the correct Redis instance.
 								- Check network settings and firewall rules that may block the connection to the Redis port.
 								### API endpoint does not respond
 								**Symptom:**
 								API requests to the Firecrawl instance timeout or return no response.
 								**Solution:**
 								- Ensure that the Firecrawl service is running by checking the Docker container status.
 								- Verify that the PORT and HOST settings in your .env file are correct and that no other service is using the same port.
 								- Check the network configuration to ensure that the host is accessible from the client making the API request.
 								By addressing these common issues, you can ensure a smoother setup and operation of your self-hosted Firecrawl instance.
-												(Docs) Self Host added new ts playwright service instructions

											
										
										
											2024-07-04 03:00:44 +08:00
+								## Install Firecrawl on a Kubernetes Cluster (Simple Version)
-												Update SELF_HOST.md

											
										
										
											2024-07-31 02:59:42 +08:00
+								Read the [examples/kubernetes-cluster-install/README.md](https://github.com/mendableai/firecrawl/blob/main/examples/kubernetes-cluster-install/README.md) for instructions on how to install Firecrawl on a Kubernetes Cluster.