Merge branch 'main' into mog/js-sdk-cjs

This commit is contained in:
Gergő Móricz 2024-08-07 01:12:36 +02:00 committed by GitHub
commit 7380d7799f
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
129 changed files with 8941 additions and 1000 deletions

View File

@ -1,20 +0,0 @@
name: Check Redis
on:
schedule:
- cron: '*/5 * * * *'
env:
BULL_AUTH_KEY: ${{ secrets.BULL_AUTH_KEY }}
jobs:
clean-jobs:
runs-on: ubuntu-latest
steps:
- name: Send GET request to check queues
run: |
response=$(curl --write-out '%{http_code}' --silent --output /dev/null --max-time 180 https://api.firecrawl.dev/admin/${{ secrets.BULL_AUTH_KEY }}/redis-health)
if [ "$response" -ne 200 ]; then
echo "Failed to check queues. Response: $response"
exit 1
fi
echo "Successfully checked queues. Response: $response"

View File

@ -1,7 +1,7 @@
name: Fly Deploy Direct
on:
schedule:
-- cron: '0 */2 * * *'
+- cron: '0 */6 * * *'
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
@ -30,7 +30,7 @@ jobs:
steps:
- uses: actions/checkout@v3
- uses: superfly/flyctl-actions/setup-flyctl@master
-- run: flyctl deploy --remote-only -a firecrawl-scraper-js && curl -X POST https://api.firecrawl.dev/admin/$BULL_AUTH_KEY/unpause
+- run: flyctl deploy --remote-only -a firecrawl-scraper-js
working-directory: ./apps/api
env:
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

View File

@ -176,7 +176,7 @@ jobs:
steps:
- uses: actions/checkout@v3
- uses: superfly/flyctl-actions/setup-flyctl@master
-- run: flyctl deploy --remote-only -a firecrawl-scraper-js && curl -X POST https://api.firecrawl.dev/admin/$BULL_AUTH_KEY/unpause
+- run: flyctl deploy --remote-only -a firecrawl-scraper-js
working-directory: ./apps/api
env:
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}

3
.gitignore vendored
View File

@ -17,4 +17,5 @@ apps/test-suite/logs
apps/test-suite/load-test-results/test-run-report.json
apps/playwright-service-ts/node_modules/
apps/playwright-service-ts/package-lock.json

View File

@ -24,6 +24,7 @@ NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://localhost:6379
REDIS_RATE_LIMIT_URL=redis://localhost:6379
## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false

View File

@ -405,12 +405,12 @@ _It is the sole responsibility of the end users to respect websites' policies wh
## License Disclaimer
-This project is primarily licensed under the GNU Affero General Public License v3.0 (AGPL-3.0), as specified in the LICENSE file in the root directory of this repository. However, certain components of this project, specifically the SDKs located in the `/apps/js-sdk` and `/apps/python-sdk` directories, are licensed under the MIT License.
+This project is primarily licensed under the GNU Affero General Public License v3.0 (AGPL-3.0), as specified in the LICENSE file in the root directory of this repository. However, certain components of this project are licensed under the MIT License. Refer to the LICENSE files in these specific directories for details.
Please note:
- The AGPL-3.0 license applies to all parts of the project unless otherwise specified.
-- The SDKs in `/apps/js-sdk` and `/apps/python-sdk` are licensed under the MIT License. Refer to the LICENSE files in these specific directories for details.
+- The SDKs and some UI components are licensed under the MIT License. Refer to the LICENSE files in these specific directories for details.
- When using or contributing to this project, ensure you comply with the appropriate license terms for the specific component you are working with.
For more details on the licensing of specific components, please refer to the LICENSE files in the respective directories or contact the project maintainers.

View File

@ -1,36 +1,77 @@
-## Self-hosting Firecrawl
-_We're currently working on a more in-depth guide on how to self-host, but in the meantime, here is a simplified version._
-Refer to [CONTRIBUTING.md](https://github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md) for instructions on how to run it locally.
-## Getting Started
-First, clone this repository and copy the example env file from the API folder `.env.example` to `.env`.
-### Steps
-1. Clone the repository:
-```bash
-git clone https://github.com/mendableai/firecrawl.git
-cd firecrawl
-cp ./apps/api/.env.example ./.env
-```
-2. For running the simplest version of FireCrawl, edit the `USE_DB_AUTHENTICATION` in `.env` to not use the database authentication:
-```plaintext
-USE_DB_AUTHENTICATION=false
-```
-3. Update the Redis URL in the .env file to align with the Docker configuration:
-```plaintext
-REDIS_URL=redis://redis:6379
-```
-4. #### Option: Running with TypeScript Playwright Service
# Self-hosting Firecrawl
#### Contributor?
Welcome to [Firecrawl](https://firecrawl.dev) 🔥! Here are some instructions on how to get the project locally so you can run it on your own and contribute.
If you're contributing, note that the process is similar to other open-source repos, i.e., fork Firecrawl, make changes, run tests, PR.
If you have any questions or would like help getting on board, join our Discord community [here](https://discord.gg/gSmWdAkdwd) for more information or submit an issue on Github [here](https://github.com/mendableai/firecrawl/issues/new/choose)!
## Why?
Self-hosting Firecrawl is particularly beneficial for organizations with stringent security policies that require data to remain within controlled environments. Here are some key reasons to consider self-hosting:
- **Enhanced Security and Compliance:** By self-hosting, you ensure that all data handling and processing complies with internal and external regulations, keeping sensitive information within your secure infrastructure. Note that Firecrawl is a Mendable product and relies on SOC2 Type2 certification, which means that the platform adheres to high industry standards for managing data security.
- **Customizable Services:** Self-hosting allows you to tailor the services, such as the Playwright service, to meet specific needs or handle particular use cases that may not be supported by the standard cloud offering.
- **Learning and Community Contribution:** By setting up and maintaining your own instance, you gain a deeper understanding of how Firecrawl works, which can also lead to more meaningful contributions to the project.
### Considerations
However, there are some limitations and additional responsibilities to be aware of:
1. **Limited Access to Fire-engine:** Currently, self-hosted instances of Firecrawl do not have access to Fire-engine, which includes advanced features for handling IP blocks, robot detection mechanisms, and more. This means that while you can manage basic scraping tasks, more complex scenarios might require additional configuration or might not be supported.
2. **Manual Configuration Required:** If you need to use scraping methods beyond the basic fetch and Playwright options, you will need to manually configure these in the `.env` file. This requires a deeper understanding of the technologies and might involve more setup time.
Self-hosting Firecrawl is ideal for those who need full control over their scraping and data processing environments but comes with the trade-off of additional maintenance and configuration efforts.
## Steps
1. First, start by installing the dependencies
- Docker [instructions](https://docs.docker.com/get-docker/)
2. Set environment variables
Create an `.env` in the root directory; you can copy over the template in `apps/api/.env.example`.
To start, we won't set up authentication or any optional sub-services (pdf parsing, JS blocking support, AI features).
`.env:`
```
# ===== Required ENVS ======
NUM_WORKERS_PER_QUEUE=8
PORT=3002
HOST=0.0.0.0
REDIS_URL=redis://redis:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379
## To turn on DB authentication, you need to set up supabase.
USE_DB_AUTHENTICATION=false
# ===== Optional ENVS ======
# Supabase Setup (used to support DB authentication, advanced logging, etc.)
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=
# Other Optionals
TEST_API_KEY= # use if you've set up authentication and want to test with a real API key
SCRAPING_BEE_API_KEY= # set if you'd like to use ScrapingBee to handle JS blocking
OPENAI_API_KEY= # add for LLM-dependent features (image alt generation, etc.)
BULL_AUTH_KEY= @
LOGTAIL_KEY= # use if you're configuring basic logging with logtail
PLAYWRIGHT_MICROSERVICE_URL= # set if you'd like to run a playwright fallback
LLAMAPARSE_API_KEY= # set if you have a llamaparse key you'd like to use to parse pdfs
SERPER_API_KEY= # set if you have a serper key you'd like to use as a search api
SLACK_WEBHOOK_URL= # set if you'd like to send slack server health status messages
POSTHOG_API_KEY= # set if you'd like to send posthog events like job logs
POSTHOG_HOST= # set if you'd like to send posthog events like job logs
```
3. *(Optional) Running with TypeScript Playwright Service*
* Update the `docker-compose.yml` file to change the Playwright service:
@ -49,16 +90,91 @@ First, clone this repository and copy the example env file from the API folder `
```
* Don't forget to set the proxy server in your `.env` file as needed.
-5. Build and run the Docker containers:
+4. Build and run the Docker containers:
```bash
docker compose build
docker compose up
```
This will run a local instance of Firecrawl which can be accessed at `http://localhost:3002`.
You should be able to see the Bull Queue Manager UI on `http://localhost:3002/admin/@/queues`.
5. *(Optional)* Test the API
If you'd like to test the crawl endpoint, you can run this:
```bash
curl -X POST http://localhost:3002/v0/crawl \
-H 'Content-Type: application/json' \
-d '{
"url": "https://mendable.ai"
}'
```
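The crawl request responds with a `jobId`. As a quick sketch (assuming the default v0 routes described in the API spec above), you can poll the job status with it:

```bash
# Replace <jobId> with the id returned by the crawl request above
curl -X GET http://localhost:3002/v0/crawl/status/<jobId> \
  -H 'Content-Type: application/json'
```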
## Troubleshooting
This section provides solutions to common issues you might encounter while setting up or running your self-hosted instance of Firecrawl.
### Supabase client is not configured
**Symptom:**
```bash
[YYYY-MM-DDTHH:MM:SS.SSSz]ERROR - Attempted to access Supabase client when it's not configured.
[YYYY-MM-DDTHH:MM:SS.SSSz]ERROR - Error inserting scrape event: Error: Supabase client is not configured.
```
**Explanation:**
This error occurs because the Supabase client setup is not completed. You should be able to scrape and crawl with no problems. Right now it's not possible to configure Supabase in self-hosted instances.
### You're bypassing authentication
**Symptom:**
```bash
[YYYY-MM-DDTHH:MM:SS.SSSz]WARN - You're bypassing authentication
```
**Explanation:**
This error occurs because the Supabase client setup is not completed. You should be able to scrape and crawl with no problems. Right now it's not possible to configure Supabase in self-hosted instances.
### Docker containers fail to start
**Symptom:**
Docker containers exit unexpectedly or fail to start.
**Solution:**
Check the Docker logs for any error messages using the command:
```bash
docker logs [container_name]
```
- Ensure all required environment variables are set correctly in the .env file.
- Verify that all Docker services defined in docker-compose.yml are correctly configured and the necessary images are available.
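For example, with Docker Compose you can list the containers and follow a single service's logs (the service name `api` below is an assumption; use the names defined in your `docker-compose.yml`):

```bash
# Show the state of all services defined in docker-compose.yml
docker compose ps

# Follow the logs of the API service (adjust the service name if yours differs)
docker compose logs -f api
```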
### Connection issues with Redis
**Symptom:**
Errors related to connecting to Redis, such as timeouts or "Connection refused".
**Solution:**
- Ensure that the Redis service is up and running in your Docker environment.
- Verify that the REDIS_URL and REDIS_RATE_LIMIT_URL in your .env file point to the correct Redis instance.
- Check network settings and firewall rules that may block the connection to the Redis port.
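A quick way to confirm Redis is reachable, assuming the Redis service in your `docker-compose.yml` is named `redis`:

```bash
# Should print PONG if the Redis container is up and accepting connections
docker compose exec redis redis-cli ping
```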
### API endpoint does not respond
**Symptom:**
API requests to the Firecrawl instance timeout or return no response.
**Solution:**
- Ensure that the Firecrawl service is running by checking the Docker container status.
- Verify that the PORT and HOST settings in your .env file are correct and that no other service is using the same port.
- Check the network configuration to ensure that the host is accessible from the client making the API request.
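As a minimal sanity check from the host machine (assuming the default `PORT=3002` and `HOST=0.0.0.0` from the `.env` above):

```bash
# Confirms the API container is listening on the expected port at all
curl -v http://localhost:3002/
```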
By addressing these common issues, you can ensure a smoother setup and operation of your self-hosted Firecrawl instance.
## Install Firecrawl on a Kubernetes Cluster (Simple Version)
Read the [examples/kubernetes-cluster-install/README.md](https://github.com/mendableai/firecrawl/blob/main/examples/kubernetes-cluster-install/README.md) for instructions on how to install Firecrawl on a Kubernetes Cluster.

View File

@ -57,3 +57,14 @@ SELF_HOSTED_WEBHOOK_URL=
# Resend API Key for transactional emails
RESEND_API_KEY=
# LOGGING_LEVEL determines the verbosity of logs that the system will output.
# Available levels are:
# NONE - No logs will be output.
# ERROR - For logging error messages that indicate a failure in a specific operation.
# WARN - For logging potentially harmful situations that are not necessarily errors.
# INFO - For logging informational messages that highlight the progress of the application.
# DEBUG - For logging detailed information on the flow through the system, primarily used for debugging.
# TRACE - For logging more detailed information than the DEBUG level.
# Set LOGGING_LEVEL to one of the above options to control logging output.
LOGGING_LEVEL=INFO

4
apps/api/.gitignore vendored
View File

@ -3,4 +3,6 @@
.env
*.csv
dump.rdb
/mongo-data
/.next/

View File

@ -8,9 +8,6 @@ primary_region = 'mia'
kill_signal = 'SIGINT'
kill_timeout = '30s'
-[deploy]
-release_command = 'node dist/src/trigger-shutdown.js https://staging-firecrawl-scraper-js.fly.dev'
[build]
[processes]

View File

@ -4,18 +4,15 @@
#
app = 'firecrawl-scraper-js'
-primary_region = 'mia'
+primary_region = 'iad'
kill_signal = 'SIGINT'
kill_timeout = '30s'
-[deploy]
-release_command = 'node dist/src/trigger-shutdown.js https://api.firecrawl.dev'
[build]
[processes]
-app = 'node dist/src/index.js'
+app = 'node --max-old-space-size=8192 dist/src/index.js'
-worker = 'node dist/src/services/queue-worker.js'
+worker = 'node --max-old-space-size=8192 dist/src/services/queue-worker.js'
[http_service]
internal_port = 8080
@ -27,8 +24,8 @@ kill_timeout = '30s'
[http_service.concurrency]
type = "requests"
-hard_limit = 100
+hard_limit = 200
-soft_limit = 50
+soft_limit = 75
[[http_service.checks]]
grace_period = "20s"

View File

@ -41,33 +41,20 @@
"pageOptions": { "pageOptions": {
"type": "object", "type": "object",
"properties": { "properties": {
"onlyMainContent": { "headers": {
"type": "boolean", "type": "object",
"description": "Only return the main content of the page excluding headers, navs, footers, etc.", "description": "Headers to send with the request. Can be used to send cookies, user-agent, etc."
"default": false
}, },
"includeHtml": { "includeHtml": {
"type": "boolean", "type": "boolean",
"description": "Include the raw HTML content of the page. Will output a html key in the response.", "description": "Include the HTML version of the content on page. Will output a html key in the response.",
"default": false "default": false
}, },
"screenshot": { "includeRawHtml": {
"type": "boolean", "type": "boolean",
"description": "Include a screenshot of the top of the page that you are scraping.", "description": "Include the raw HTML content of the page. Will output a rawHtml key in the response.",
"default": false "default": false
}, },
"waitFor": {
"type": "integer",
"description": "Wait x amount of milliseconds for the page to load to fetch content",
"default": 0
},
"removeTags": {
"type": "array",
"items": {
"type": "string"
},
"description": "Tags, classes and ids to remove from the page. Use comma separated values. Example: 'script, .ad, #footer'"
},
"onlyIncludeTags": { "onlyIncludeTags": {
"type": "array", "type": "array",
"items": { "items": {
@ -75,34 +62,58 @@
}, },
"description": "Only include tags, classes and ids from the page in the final output. Use comma separated values. Example: 'script, .ad, #footer'" "description": "Only include tags, classes and ids from the page in the final output. Use comma separated values. Example: 'script, .ad, #footer'"
}, },
"headers": { "onlyMainContent": {
"type": "object", "type": "boolean",
"description": "Headers to send with the request. Can be used to send cookies, user-agent, etc." "description": "Only return the main content of the page excluding headers, navs, footers, etc.",
"default": false
},
"removeTags": {
"type": "array",
"items": {
"type": "string"
},
"description": "Tags, classes and ids to remove from the page. Use comma separated values. Example: 'script, .ad, #footer'"
},
"replaceAllPathsWithAbsolutePaths": {
"type": "boolean",
"description": "Replace all relative paths with absolute paths for images and links",
"default": false
},
"screenshot": {
"type": "boolean",
"description": "Include a screenshot of the top of the page that you are scraping.",
"default": false
},
"fullPageScreenshot": {
"type": "boolean",
"description": "Include a full page screenshot of the page that you are scraping.",
"default": false
},
"waitFor": {
"type": "integer",
"description": "Wait x amount of milliseconds for the page to load to fetch content",
"default": 0
}
}
},
"extractorOptions": {
"type": "object",
-"description": "Options for LLM-based extraction of structured information from the page content",
+"description": "Options for extraction of structured information from the page content. Note: LLM-based extraction is not performed by default and only occurs when explicitly configured. The 'markdown' mode simply returns the scraped markdown and is the default mode for scraping.",
"default": {},
"properties": { "properties": {
"mode": { "mode": {
"type": "string", "type": "string",
"enum": ["llm-extraction", "llm-extraction-from-raw-html"], "enum": ["markdown", "llm-extraction", "llm-extraction-from-raw-html", "llm-extraction-from-markdown"],
"description": "The extraction mode to use. llm-extraction: Extracts information from the cleaned and parsed content. llm-extraction-from-raw-html: Extracts information directly from the raw HTML." "description": "The extraction mode to use. 'markdown': Returns the scraped markdown content, does not perform LLM extraction. 'llm-extraction': Extracts information from the cleaned and parsed content using LLM. 'llm-extraction-from-raw-html': Extracts information directly from the raw HTML using LLM. 'llm-extraction-from-markdown': Extracts information from the markdown content using LLM."
}, },
"extractionPrompt": { "extractionPrompt": {
"type": "string", "type": "string",
"description": "A prompt describing what information to extract from the page" "description": "A prompt describing what information to extract from the page, applicable for LLM extraction modes."
}, },
"extractionSchema": { "extractionSchema": {
"type": "object", "type": "object",
"additionalProperties": true, "additionalProperties": true,
"description": "The schema for the data to be extracted", "description": "The schema for the data to be extracted, required only for LLM extraction modes.",
"required": [ "required": [
"company_mission", "company_mission",
"supports_sso", "supports_sso",
@ -134,13 +145,52 @@
} }
}, },
"402": { "402": {
"description": "Payment required" "description": "Payment required",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "Payment required to access this resource."
}
}
}
}
}
}, },
"429": { "429": {
"description": "Too many requests" "description": "Too many requests",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "Request rate limit exceeded. Please wait and try again later."
}
}
}
}
}
}, },
"500": { "500": {
"description": "Server error" "description": "Server error",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "An unexpected error occurred on the server."
}
}
}
}
}
} }
} }
} }
@ -216,7 +266,12 @@
},
"allowBackwardCrawling": {
"type": "boolean",
-"description": "Allow backward crawling (crawl from the base URL to the previous URLs)",
+"description": "Enables the crawler to navigate from a specific URL to previously linked pages. For instance, from 'example.com/product/123' back to 'example.com/product'",
"default": false
},
"allowExternalContentLinks": {
"type": "boolean",
"description": "Allows the crawler to follow links to external websites.",
"default": false "default": false
} }
} }
@ -224,25 +279,32 @@
"pageOptions": { "pageOptions": {
"type": "object", "type": "object",
"properties": { "properties": {
"headers": {
"type": "object",
"description": "Headers to send with the request. Can be used to send cookies, user-agent, etc."
},
"includeHtml": {
"type": "boolean",
"description": "Include the HTML version of the content on page. Will output a html key in the response.",
"default": false
},
"includeRawHtml": {
"type": "boolean",
"description": "Include the raw HTML content of the page. Will output a rawHtml key in the response.",
"default": false
},
"onlyIncludeTags": {
"type": "array",
"items": {
"type": "string"
},
"description": "Only include tags, classes and ids from the page in the final output. Use comma separated values. Example: 'script, .ad, #footer'"
},
"onlyMainContent": { "onlyMainContent": {
"type": "boolean", "type": "boolean",
"description": "Only return the main content of the page excluding headers, navs, footers, etc.", "description": "Only return the main content of the page excluding headers, navs, footers, etc.",
"default": false "default": false
}, },
"includeHtml": {
"type": "boolean",
"description": "Include the raw HTML content of the page. Will output a html key in the response.",
"default": false
},
"screenshot": {
"type": "boolean",
"description": "Include a screenshot of the top of the page that you are scraping.",
"default": false
},
"headers": {
"type": "object",
"description": "Headers to send with the request when scraping. Can be used to send cookies, user-agent, etc."
},
"removeTags": { "removeTags": {
"type": "array", "type": "array",
"items": { "items": {
@ -254,6 +316,21 @@
"type": "boolean", "type": "boolean",
"description": "Replace all relative paths with absolute paths for images and links", "description": "Replace all relative paths with absolute paths for images and links",
"default": false "default": false
},
"screenshot": {
"type": "boolean",
"description": "Include a screenshot of the top of the page that you are scraping.",
"default": false
},
"fullPageScreenshot": {
"type": "boolean",
"description": "Include a full page screenshot of the page that you are scraping.",
"default": false
},
"waitFor": {
"type": "integer",
"description": "Wait x amount of milliseconds for the page to load to fetch content",
"default": 0
}
}
}
@ -275,13 +352,52 @@
} }
}, },
"402": { "402": {
"description": "Payment required" "description": "Payment required",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "Payment required to access this resource."
}
}
}
}
}
}, },
"429": { "429": {
"description": "Too many requests" "description": "Too many requests",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "Request rate limit exceeded. Please wait and try again later."
}
}
}
}
}
}, },
"500": { "500": {
"description": "Server error" "description": "Server error",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "An unexpected error occurred on the server."
}
}
}
}
}
} }
} }
} }
@ -323,7 +439,12 @@
},
"includeHtml": {
"type": "boolean",
-"description": "Include the raw HTML content of the page. Will output a html key in the response.",
+"description": "Include the HTML version of the content on page. Will output a html key in the response.",
"default": false
},
"includeRawHtml": {
"type": "boolean",
"description": "Include the raw HTML content of the page. Will output a rawHtml key in the response.",
"default": false "default": false
} }
} }
@ -355,13 +476,52 @@
} }
}, },
"402": { "402": {
"description": "Payment required" "description": "Payment required",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "Payment required to access this resource."
}
}
}
}
}
}, },
"429": { "429": {
"description": "Too many requests" "description": "Too many requests",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "Request rate limit exceeded. Please wait and try again later."
}
}
}
}
}
}, },
"500": { "500": {
"description": "Server error" "description": "Server error",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "An unexpected error occurred on the server."
}
}
}
}
}
} }
} }
} }
@ -403,14 +563,6 @@
"type": "integer", "type": "integer",
"description": "Current page number" "description": "Current page number"
}, },
"current_url": {
"type": "string",
"description": "Current URL being scraped"
},
"current_step": {
"type": "string",
"description": "Current step in the process"
},
"total": { "total": {
"type": "integer", "type": "integer",
"description": "Total number of pages" "description": "Total number of pages"
@ -427,7 +579,7 @@
"items": { "items": {
"$ref": "#/components/schemas/CrawlStatusResponseObj" "$ref": "#/components/schemas/CrawlStatusResponseObj"
}, },
"description": "Partial documents returned as it is being crawled (streaming). **This feature is currently in alpha - expect breaking changes** When a page is ready, it will append to the partial_data array, so there is no need to wait for the entire website to be crawled. There is a max of 50 items in the array response. The oldest item (top of the array) will be removed when the new item is added to the array." "description": "Partial documents returned as it is being crawled (streaming). **This feature is currently in alpha - expect breaking changes** When a page is ready, it will append to the partial_data array, so there is no need to wait for the entire website to be crawled. When the crawl is done, partial_data will become empty and the result will be available in `data`. There is a max of 50 items in the array response. The oldest item (top of the array) will be removed when the new item is added to the array."
} }
} }
} }
@ -435,13 +587,52 @@
} }
}, },
"402": { "402": {
"description": "Payment required" "description": "Payment required",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "Payment required to access this resource."
}
}
}
}
}
}, },
"429": { "429": {
"description": "Too many requests" "description": "Too many requests",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "Request rate limit exceeded. Please wait and try again later."
}
}
}
}
}
}, },
"500": { "500": {
"description": "Server error" "description": "Server error",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "An unexpected error occurred on the server."
}
}
}
}
}
} }
} }
} }
@ -485,13 +676,52 @@
} }
}, },
"402": { "402": {
"description": "Payment required" "description": "Payment required",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "Payment required to access this resource."
}
}
}
}
}
}, },
"429": { "429": {
"description": "Too many requests" "description": "Too many requests",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "Request rate limit exceeded. Please wait and try again later."
}
}
}
}
}
}, },
"500": { "500": {
"description": "Server error" "description": "Server error",
"content": {
"application/json": {
"schema": {
"type": "object",
"properties": {
"error": {
"type": "string",
"example": "An unexpected error occurred on the server."
}
}
}
}
}
} }
} }
} }
@ -523,7 +753,12 @@
"html": { "html": {
"type": "string", "type": "string",
"nullable": true, "nullable": true,
"description": "Raw HTML content of the page if `includeHtml` is true" "description": "HTML version of the content on page if `includeHtml` is true"
},
"rawHtml": {
"type": "string",
"nullable": true,
"description": "Raw HTML content of the page if `includeRawHtml` is true"
},
"metadata": {
"type": "object",
@ -583,7 +818,12 @@
"html": { "html": {
"type": "string", "type": "string",
"nullable": true, "nullable": true,
"description": "Raw HTML content of the page if `includeHtml` is true" "description": "HTML version of the content on page if `includeHtml` is true"
},
"rawHtml": {
"type": "string",
"nullable": true,
"description": "Raw HTML content of the page if `includeRawHtml` is true"
},
"index": {
"type": "integer",

View File

@ -19,13 +19,14 @@
"mongo-docker": "docker run -d -p 2717:27017 -v ./mongo-data:/data/db --name mongodb mongo:latest", "mongo-docker": "docker run -d -p 2717:27017 -v ./mongo-data:/data/db --name mongodb mongo:latest",
"mongo-docker-console": "docker exec -it mongodb mongosh", "mongo-docker-console": "docker exec -it mongodb mongosh",
"run-example": "npx ts-node src/example.ts", "run-example": "npx ts-node src/example.ts",
"deploy:fly": "flyctl deploy && node postdeploy.js https://api.firecrawl.dev", "deploy:fly": "flyctl deploy",
"deploy:fly:staging": "fly deploy -c fly.staging.toml && node postdeploy.js https://staging-firecrawl-scraper-js.fly.dev" "deploy:fly:staging": "fly deploy -c fly.staging.toml"
}, },
"author": "", "author": "",
"license": "ISC", "license": "ISC",
"devDependencies": { "devDependencies": {
"@flydotio/dockerfile": "^0.4.10", "@flydotio/dockerfile": "^0.4.10",
"@jest/globals": "^29.7.0",
"@tsconfig/recommended": "^1.0.3", "@tsconfig/recommended": "^1.0.3",
"@types/body-parser": "^1.19.2", "@types/body-parser": "^1.19.2",
"@types/bull": "^4.10.0", "@types/bull": "^4.10.0",
@ -63,6 +64,7 @@
"axios": "^1.3.4", "axios": "^1.3.4",
"bottleneck": "^2.19.5", "bottleneck": "^2.19.5",
"bull": "^4.15.0", "bull": "^4.15.0",
"cacheable-lookup": "^6.1.0",
"cheerio": "^1.0.0-rc.12", "cheerio": "^1.0.0-rc.12",
"cohere": "^1.1.1", "cohere": "^1.1.1",
"cors": "^2.8.5", "cors": "^2.8.5",
@ -92,6 +94,7 @@
"promptable": "^0.0.10", "promptable": "^0.0.10",
"puppeteer": "^22.12.1", "puppeteer": "^22.12.1",
"rate-limiter-flexible": "2.4.2", "rate-limiter-flexible": "2.4.2",
"redlock": "5.0.0-beta.2",
"resend": "^3.4.0", "resend": "^3.4.0",
"robots-parser": "^3.0.1", "robots-parser": "^3.0.1",
"scrapingbee": "^1.7.4", "scrapingbee": "^1.7.4",

View File

@ -59,6 +59,9 @@ importers:
bull: bull:
specifier: ^4.15.0 specifier: ^4.15.0
version: 4.15.0 version: 4.15.0
cacheable-lookup:
specifier: ^6.1.0
version: 6.1.0
cheerio: cheerio:
specifier: ^1.0.0-rc.12 specifier: ^1.0.0-rc.12
version: 1.0.0-rc.12 version: 1.0.0-rc.12
@ -146,6 +149,9 @@ importers:
rate-limiter-flexible: rate-limiter-flexible:
specifier: 2.4.2 specifier: 2.4.2
version: 2.4.2 version: 2.4.2
redlock:
specifier: 5.0.0-beta.2
version: 5.0.0-beta.2
resend: resend:
specifier: ^3.4.0 specifier: ^3.4.0
version: 3.4.0 version: 3.4.0
@ -189,6 +195,9 @@ importers:
'@flydotio/dockerfile': '@flydotio/dockerfile':
specifier: ^0.4.10 specifier: ^0.4.10
version: 0.4.11 version: 0.4.11
'@jest/globals':
specifier: ^29.7.0
version: 29.7.0
'@tsconfig/recommended': '@tsconfig/recommended':
specifier: ^1.0.3 specifier: ^1.0.3
version: 1.0.6 version: 1.0.6
@ -1937,6 +1946,10 @@ packages:
resolution: {integrity: sha512-/Nf7TyzTx6S3yRJObOAV7956r8cr2+Oj8AC5dt8wSP3BQAoeX58NoHyCU8P8zGkNXStjTSi6fzO6F0pBdcYbEg==} resolution: {integrity: sha512-/Nf7TyzTx6S3yRJObOAV7956r8cr2+Oj8AC5dt8wSP3BQAoeX58NoHyCU8P8zGkNXStjTSi6fzO6F0pBdcYbEg==}
engines: {node: '>= 0.8'} engines: {node: '>= 0.8'}
cacheable-lookup@6.1.0:
resolution: {integrity: sha512-KJ/Dmo1lDDhmW2XDPMo+9oiy/CeqosPguPCrgcVzKyZrL6pM1gU2GmPY/xo6OQPTUaA/c0kwHuywB4E6nmT9ww==}
engines: {node: '>=10.6.0'}
call-bind@1.0.7: call-bind@1.0.7:
resolution: {integrity: sha512-GHTSNSYICQ7scH7sZ+M2rFopRoLh8t2bLSW6BbgrtLsahOIB5iyAVJf9GjWK3cYTDaMj4XdBpM1cA6pIS0Kv2w==} resolution: {integrity: sha512-GHTSNSYICQ7scH7sZ+M2rFopRoLh8t2bLSW6BbgrtLsahOIB5iyAVJf9GjWK3cYTDaMj4XdBpM1cA6pIS0Kv2w==}
engines: {node: '>= 0.4'} engines: {node: '>= 0.4'}
@ -3523,6 +3536,9 @@ packages:
resolution: {integrity: sha512-dBpDMdxv9Irdq66304OLfEmQ9tbNRFnFTuZiLo+bD+r332bBmMJ8GBLXklIXXgxd3+v9+KUnZaUR5PJMa75Gsg==} resolution: {integrity: sha512-dBpDMdxv9Irdq66304OLfEmQ9tbNRFnFTuZiLo+bD+r332bBmMJ8GBLXklIXXgxd3+v9+KUnZaUR5PJMa75Gsg==}
engines: {node: '>= 0.4.0'} engines: {node: '>= 0.4.0'}
node-abort-controller@3.1.1:
resolution: {integrity: sha512-AGK2yQKIjRuqnc6VkX2Xj5d+QW8xZ87pa1UK6yA6ouUyuxfHuMP6umE5QK7UmTeOAymo+Zx1Fxiuw9rVx8taHQ==}
node-domexception@1.0.0: node-domexception@1.0.0:
resolution: {integrity: sha512-/jKZoMpw0F8GRwl4/eLROPA3cfcXtLApP0QzLmUT/HuPCZWyB7IY9ZrMeKw2O/nFIqPQB3PVM9aYm0F312AXDQ==} resolution: {integrity: sha512-/jKZoMpw0F8GRwl4/eLROPA3cfcXtLApP0QzLmUT/HuPCZWyB7IY9ZrMeKw2O/nFIqPQB3PVM9aYm0F312AXDQ==}
engines: {node: '>=10.5.0'} engines: {node: '>=10.5.0'}
@ -3946,6 +3962,10 @@ packages:
redis@4.6.14: redis@4.6.14:
resolution: {integrity: sha512-GrNg/e33HtsQwNXL7kJT+iNFPSwE1IPmd7wzV3j4f2z0EYxZfZE7FVTmUysgAtqQQtg5NXF5SNLR9OdO/UHOfw==} resolution: {integrity: sha512-GrNg/e33HtsQwNXL7kJT+iNFPSwE1IPmd7wzV3j4f2z0EYxZfZE7FVTmUysgAtqQQtg5NXF5SNLR9OdO/UHOfw==}
redlock@5.0.0-beta.2:
resolution: {integrity: sha512-2RDWXg5jgRptDrB1w9O/JgSZC0j7y4SlaXnor93H/UJm/QyDiFgBKNtrh0TI6oCXqYSaSoXxFh6Sd3VtYfhRXw==}
engines: {node: '>=12'}
regenerator-runtime@0.14.1: regenerator-runtime@0.14.1:
resolution: {integrity: sha512-dYnhHh0nJoMfnkZs6GmmhFknAGRrLznOu5nc9ML+EJxGvrx6H7teuevqVqCuPcPK//3eDrrjQhehXVx9cnkGdw==} resolution: {integrity: sha512-dYnhHh0nJoMfnkZs6GmmhFknAGRrLznOu5nc9ML+EJxGvrx6H7teuevqVqCuPcPK//3eDrrjQhehXVx9cnkGdw==}
@ -4369,8 +4389,8 @@ packages:
engines: {node: '>=14.17'}
hasBin: true
-typescript@5.5.3:
-resolution: {integrity: sha512-/hreyEujaB0w76zKo6717l3L0o/qEUtRgdvUBvlkhoWeOVMjMuHNHk0BRBzikzuGDqNmPQbg5ifMEqsHLiIUcQ==}
+typescript@5.5.4:
+resolution: {integrity: sha512-Mtq29sKDAEYP7aljRgtPOpTvOfbwRWlS6dPRzwjdE+C0R4brX/GUyhHSecbHMFLNBLcJIPt9nl9yG5TZ1weH+Q==}
engines: {node: '>=14.17'}
hasBin: true
@ -6917,6 +6937,8 @@ snapshots:
bytes@3.1.2: {} bytes@3.1.2: {}
cacheable-lookup@6.1.0: {}
call-bind@1.0.7: call-bind@1.0.7:
dependencies: dependencies:
es-define-property: 1.0.0 es-define-property: 1.0.0
@ -8593,6 +8615,8 @@ snapshots:
netmask@2.0.2: {} netmask@2.0.2: {}
node-abort-controller@3.1.1: {}
node-domexception@1.0.0: {} node-domexception@1.0.0: {}
node-ensure@0.0.0: {} node-ensure@0.0.0: {}
@ -8927,7 +8951,7 @@ snapshots:
csv-parse: 5.5.6
gpt3-tokenizer: 1.1.5
openai: 3.3.0
-typescript: 5.5.3
+typescript: 5.5.4
uuid: 9.0.1
zod: 3.23.8
transitivePeerDependencies:
@ -9096,6 +9120,10 @@ snapshots:
'@redis/search': 1.1.6(@redis/client@1.5.16) '@redis/search': 1.1.6(@redis/client@1.5.16)
'@redis/time-series': 1.0.5(@redis/client@1.5.16) '@redis/time-series': 1.0.5(@redis/client@1.5.16)
redlock@5.0.0-beta.2:
dependencies:
node-abort-controller: 3.1.1
regenerator-runtime@0.14.1: {} regenerator-runtime@0.14.1: {}
require-directory@2.1.1: {} require-directory@2.1.1: {}
@ -9519,7 +9547,7 @@ snapshots:
typescript@5.4.5: {}
-typescript@5.5.3: {}
+typescript@5.5.4: {}
typesense@1.8.2(@babel/runtime@7.24.6):
dependencies:

View File

@ -1,11 +0,0 @@
require("dotenv").config();
fetch(process.argv[2] + "/admin/" + process.env.BULL_AUTH_KEY + "/unpause", {
method: "POST"
}).then(async x => {
console.log(await x.text());
process.exit(0);
}).catch(e => {
console.error(e);
process.exit(1);
});

View File

@ -858,7 +858,6 @@ describe("E2E Tests for API Routes", () => {
await new Promise((resolve) => setTimeout(resolve, 1000)); // Wait for 1 second before checking again
}
}
-console.log(crawlData)
expect(crawlData.length).toBeGreaterThan(0);
expect(crawlData).toEqual(expect.arrayContaining([
expect.objectContaining({ url: expect.stringContaining("https://firecrawl.dev/?ref=mendable+banner") }),

View File

@ -0,0 +1,87 @@
import { Request, Response } from "express";
import { Job } from "bull";
import { Logger } from "../../lib/logger";
import { getWebScraperQueue } from "../../services/queue-service";
import { checkAlerts } from "../../services/alerts";
export async function cleanBefore24hCompleteJobsController(
req: Request,
res: Response
) {
Logger.info("🐂 Cleaning jobs older than 24h");
try {
const webScraperQueue = getWebScraperQueue();
const batchSize = 10;
const numberOfBatches = 9; // Adjust based on your needs
const completedJobsPromises: Promise<Job[]>[] = [];
for (let i = 0; i < numberOfBatches; i++) {
completedJobsPromises.push(
webScraperQueue.getJobs(
["completed"],
i * batchSize,
i * batchSize + batchSize,
true
)
);
}
const completedJobs: Job[] = (
await Promise.all(completedJobsPromises)
).flat();
const before24hJobs =
completedJobs.filter(
(job) => job.finishedOn < Date.now() - 24 * 60 * 60 * 1000
) || [];
let count = 0;
if (!before24hJobs) {
return res.status(200).send(`No jobs to remove.`);
}
for (const job of before24hJobs) {
try {
await job.remove();
count++;
} catch (jobError) {
Logger.error(`🐂 Failed to remove job with ID ${job.id}: ${jobError}`);
}
}
return res.status(200).send(`Removed ${count} completed jobs.`);
} catch (error) {
Logger.error(`🐂 Failed to clean last 24h complete jobs: ${error}`);
return res.status(500).send("Failed to clean jobs");
}
}
export async function checkQueuesController(req: Request, res: Response) {
try {
await checkAlerts();
return res.status(200).send("Alerts initialized");
} catch (error) {
Logger.debug(`Failed to initialize alerts: ${error}`);
return res.status(500).send("Failed to initialize alerts");
}
}
// Use this as a "health check" so that we don't destroy the server
export async function queuesController(req: Request, res: Response) {
try {
const webScraperQueue = getWebScraperQueue();
const [webScraperActive] = await Promise.all([
webScraperQueue.getActiveCount(),
]);
const noActiveJobs = webScraperActive === 0;
// 200 if no active jobs, 503 if there are active jobs
return res.status(noActiveJobs ? 200 : 500).json({
webScraperActive,
noActiveJobs,
});
} catch (error) {
Logger.error(error);
return res.status(500).json({ error: error.message });
}
}

View File

@ -0,0 +1,85 @@
import { Request, Response } from "express";
import Redis from "ioredis";
import { Logger } from "../../lib/logger";
import { redisRateLimitClient } from "../../services/rate-limiter";
export async function redisHealthController(req: Request, res: Response) {
const retryOperation = async (operation, retries = 3) => {
for (let attempt = 1; attempt <= retries; attempt++) {
try {
return await operation();
} catch (error) {
if (attempt === retries) throw error;
Logger.warn(`Attempt ${attempt} failed: ${error.message}. Retrying...`);
await new Promise((resolve) => setTimeout(resolve, 2000)); // Wait 2 seconds before retrying
}
}
};
try {
const queueRedis = new Redis(process.env.REDIS_URL);
const testKey = "test";
const testValue = "test";
// Test queueRedis
let queueRedisHealth;
try {
await retryOperation(() => queueRedis.set(testKey, testValue));
queueRedisHealth = await retryOperation(() => queueRedis.get(testKey));
await retryOperation(() => queueRedis.del(testKey));
} catch (error) {
Logger.error(`queueRedis health check failed: ${error}`);
queueRedisHealth = null;
}
// Test redisRateLimitClient
let redisRateLimitHealth;
try {
await retryOperation(() => redisRateLimitClient.set(testKey, testValue));
redisRateLimitHealth = await retryOperation(() =>
redisRateLimitClient.get(testKey)
);
await retryOperation(() => redisRateLimitClient.del(testKey));
} catch (error) {
Logger.error(`redisRateLimitClient health check failed: ${error}`);
redisRateLimitHealth = null;
}
const healthStatus = {
queueRedis: queueRedisHealth === testValue ? "healthy" : "unhealthy",
redisRateLimitClient:
redisRateLimitHealth === testValue ? "healthy" : "unhealthy",
};
if (
healthStatus.queueRedis === "healthy" &&
healthStatus.redisRateLimitClient === "healthy"
) {
Logger.info("Both Redis instances are healthy");
return res.status(200).json({ status: "healthy", details: healthStatus });
} else {
Logger.info(
`Redis instances health check: ${JSON.stringify(healthStatus)}`
);
// await sendSlackWebhook(
// `[REDIS DOWN] Redis instances health check: ${JSON.stringify(
// healthStatus
// )}`,
// true
// );
return res
.status(500)
.json({ status: "unhealthy", details: healthStatus });
}
} catch (error) {
Logger.error(`Redis health check failed: ${error}`);
// await sendSlackWebhook(
// `[REDIS DOWN] Redis instances health check: ${error.message}`,
// true
// );
return res
.status(500)
.json({ status: "unhealthy", message: error.message });
}
}
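The `check-redis` workflow removed at the top of this diff called this controller through the admin route. A local equivalent, assuming the controller is mounted at `/admin/<BULL_AUTH_KEY>/redis-health` as in that workflow, would be:

```bash
# BULL_AUTH_KEY must match the value set in the API's .env
curl --silent --max-time 180 "http://localhost:3002/admin/$BULL_AUTH_KEY/redis-health"
```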

View File

@ -6,6 +6,7 @@ import { withAuth } from "../../src/lib/withAuth";
import { RateLimiterRedis } from "rate-limiter-flexible";
import { setTraceAttributes } from '@hyperdx/node-opentelemetry';
import { sendNotification } from "../services/notification/email_notification";
import { Logger } from "../lib/logger";
export async function authenticateUser(req, res, mode?: RateLimiterMode): Promise<AuthResponse> {
return withAuth(supaAuthenticateUser)(req, res, mode);
@ -17,7 +18,7 @@ function setTrace(team_id: string, api_key: string) {
api_key
});
} catch (error) {
-console.error('Error setting trace attributes:', error);
+Logger.error(`Error setting trace attributes: ${error.message}`);
}
}
@ -82,12 +83,15 @@ export async function supaAuthenticateUser(
// $$ language plpgsql;
if (error) {
-console.error('Error fetching key and price_id:', error);
+Logger.warn(`Error fetching key and price_id: ${error.message}`);
} else {
// console.log('Key and Price ID:', data);
}
if (error || !data || data.length === 0) {
Logger.warn(`Error fetching api key: ${error.message} or data is empty`);
return {
success: false,
error: "Unauthorized: Invalid token",
@ -135,7 +139,7 @@ export async function supaAuthenticateUser(
try {
await rateLimiter.consume(team_endpoint_token);
} catch (rateLimiterRes) {
-console.error(rateLimiterRes);
+Logger.error(`Rate limit exceeded: ${rateLimiterRes}`);
const secs = Math.round(rateLimiterRes.msBeforeNext / 1000) || 1;
const retryDate = new Date(Date.now() + rateLimiterRes.msBeforeNext);
@ -177,7 +181,10 @@ export async function supaAuthenticateUser(
.select("*")
.eq("key", normalizedApi);
if (error || !data || data.length === 0) {
Logger.warn(`Error fetching api key: ${error.message} or data is empty`);
return {
success: false,
error: "Unauthorized: Invalid token",
@ -190,7 +197,6 @@ export async function supaAuthenticateUser(
return { success: true, team_id: subscriptionData.team_id, plan: subscriptionData.plan ?? ""};
}
function getPlanByPriceId(price_id: string) {
switch (price_id) {
case process.env.STRIPE_PRICE_ID_STARTER:
@ -199,11 +205,14 @@ function getPlanByPriceId(price_id: string) {
return 'standard';
case process.env.STRIPE_PRICE_ID_SCALE:
return 'scale';
-case process.env.STRIPE_PRICE_ID_HOBBY || process.env.STRIPE_PRICE_ID_HOBBY_YEARLY:
+case process.env.STRIPE_PRICE_ID_HOBBY:
case process.env.STRIPE_PRICE_ID_HOBBY_YEARLY:
return 'hobby';
-case process.env.STRIPE_PRICE_ID_STANDARD_NEW || process.env.STRIPE_PRICE_ID_STANDARD_NEW_YEARLY:
+case process.env.STRIPE_PRICE_ID_STANDARD_NEW:
case process.env.STRIPE_PRICE_ID_STANDARD_NEW_YEARLY:
return 'standardnew';
-case process.env.STRIPE_PRICE_ID_GROWTH || process.env.STRIPE_PRICE_ID_GROWTH_YEARLY:
+case process.env.STRIPE_PRICE_ID_GROWTH:
case process.env.STRIPE_PRICE_ID_GROWTH_YEARLY:
return 'growth';
default:
return 'free';
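The `getPlanByPriceId` change above splits each `case X || Y:` into two separate `case` labels. In TypeScript (and JavaScript), `case X || Y:` compares the switch value against the single value that `X || Y` evaluates to (the first truthy operand), so the yearly price ids were never matched. A minimal sketch with hypothetical ids (not the real Stripe values):

```typescript
const HOBBY = "price_hobby";              // hypothetical ids, for illustration only
const HOBBY_YEARLY = "price_hobby_yearly";

function planBuggy(priceId: string): string {
  switch (priceId) {
    // BUG: "HOBBY || HOBBY_YEARLY" evaluates to "price_hobby",
    // so "price_hobby_yearly" never matches and falls through to default
    case HOBBY || HOBBY_YEARLY:
      return "hobby";
    default:
      return "free";
  }
}

function planFixed(priceId: string): string {
  switch (priceId) {
    case HOBBY:          // separate labels fall through to the same return
    case HOBBY_YEARLY:
      return "hobby";
    default:
      return "free";
  }
}

console.log(planBuggy("price_hobby_yearly")); // "free"  (wrong)
console.log(planFixed("price_hobby_yearly")); // "hobby" (correct)
```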

View File

@ -5,6 +5,7 @@ import { addWebScraperJob } from "../../src/services/queue-jobs";
import { getWebScraperQueue } from "../../src/services/queue-service";
import { supabase_service } from "../../src/services/supabase";
import { billTeam } from "../../src/services/billing/credit_billing";
import { Logger } from "../../src/lib/logger";
export async function crawlCancelController(req: Request, res: Response) {
try {
@ -43,25 +44,28 @@ export async function crawlCancelController(req: Request, res: Response) {
const { partialDocs } = await job.progress();
if (partialDocs && partialDocs.length > 0 && jobState === "active") {
-console.log("Billing team for partial docs...");
+Logger.info("Billing team for partial docs...");
// Note: the credits that we will bill them here might be lower than the actual
// due to promises that are not yet resolved
await billTeam(team_id, partialDocs.length);
}
try {
await getWebScraperQueue().client.del(job.lockKey());
await job.takeLock();
await job.discard();
await job.moveToFailed(Error("Job cancelled by user"), true);
} catch (error) {
-console.error(error);
+Logger.error(error);
}
const newJobState = await job.getState();
res.json({
-status: newJobState === "failed" ? "cancelled" : "Cancelling...",
+status: "cancelled"
});
} catch (error) {
-console.error(error);
+Logger.error(error);
return res.status(500).json({ error: error.message });
}
}

View File

@ -4,6 +4,7 @@ import { RateLimiterMode } from "../../src/types";
import { addWebScraperJob } from "../../src/services/queue-jobs";
import { getWebScraperQueue } from "../../src/services/queue-service";
import { supabaseGetJobById } from "../../src/lib/supabase-jobs";
import { Logger } from "../../src/lib/logger";
export async function crawlStatusController(req: Request, res: Response) {
try {
@ -44,7 +45,7 @@ export async function crawlStatusController(req: Request, res: Response) {
partial_data: jobStatus == 'completed' ? [] : partialDocs,
});
} catch (error) {
-console.error(error);
+Logger.error(error);
return res.status(500).json({ error: error.message });
}
}

View File

@ -10,6 +10,8 @@ import { logCrawl } from "../../src/services/logging/crawl_log";
import { validateIdempotencyKey } from "../../src/services/idempotency/validate";
import { createIdempotencyKey } from "../../src/services/idempotency/create";
import { defaultCrawlPageOptions, defaultCrawlerOptions, defaultOrigin } from "../../src/lib/default-values";
import { v4 as uuidv4 } from "uuid";
import { Logger } from "../../src/lib/logger";
export async function crawlController(req: Request, res: Response) {
try {
@ -30,7 +32,7 @@ export async function crawlController(req: Request, res: Response) {
try {
createIdempotencyKey(req);
} catch (error) {
-console.error(error);
+Logger.error(error);
return res.status(500).json({ error: error.message });
}
}
@ -60,10 +62,11 @@ export async function crawlController(req: Request, res: Response) {
const crawlerOptions = { ...defaultCrawlerOptions, ...req.body.crawlerOptions };
const pageOptions = { ...defaultCrawlPageOptions, ...req.body.pageOptions };
-if (mode === "single_urls" && !url.includes(",")) {
+if (mode === "single_urls" && !url.includes(",")) { // NOTE: do we need this?
try {
const a = new WebScraperDataProvider();
await a.setOptions({
jobId: uuidv4(),
mode: "single_urls",
urls: [url],
crawlerOptions: { ...crawlerOptions, returnOnlyUrls: true },
@ -83,7 +86,7 @@ export async function crawlController(req: Request, res: Response) {
documents: docs,
});
} catch (error) {
-console.error(error);
+Logger.error(error);
return res.status(500).json({ error: error.message });
}
}
@ -101,7 +104,7 @@ export async function crawlController(req: Request, res: Response) {
res.json({ jobId: job.id });
} catch (error) {
-console.error(error);
+Logger.error(error);
return res.status(500).json({ error: error.message });
}
}

View File

@ -3,6 +3,7 @@ import { authenticateUser } from "./auth";
import { RateLimiterMode } from "../../src/types";
import { addWebScraperJob } from "../../src/services/queue-jobs";
import { isUrlBlocked } from "../../src/scraper/WebScraper/utils/blocklist";
import { Logger } from "../../src/lib/logger";
export async function crawlPreviewController(req: Request, res: Response) {
try {
@ -39,7 +40,7 @@ export async function crawlPreviewController(req: Request, res: Response) {
res.json({ jobId: job.id });
} catch (error) {
-console.error(error);
+Logger.error(error);
return res.status(500).json({ error: error.message });
}
}

View File

@ -0,0 +1,6 @@
import { Request, Response } from "express";
export async function livenessController(req: Request, res: Response) {
//TODO: add checks if the application is live and healthy like checking the redis connection
res.status(200).json({ status: "ok" });
}

View File

@ -0,0 +1,6 @@
import { Request, Response } from "express";
export async function readinessController(req: Request, res: Response) {
// TODO: add checks when the application is ready to serve traffic
res.status(200).json({ status: "ok" });
}
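These controllers are meant to back liveness/readiness probes. A sketch of checking them locally, assuming they are mounted at `/liveness` and `/readiness` (the actual route paths are not part of this diff and are hypothetical here):

```bash
# Hypothetical paths; check the Express router for the real ones
curl -s http://localhost:3002/liveness    # expected: {"status":"ok"}
curl -s http://localhost:3002/readiness   # expected: {"status":"ok"}
```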

View File

@ -9,8 +9,11 @@ import { Document } from "../lib/entities";
import { isUrlBlocked } from "../scraper/WebScraper/utils/blocklist"; // Import the isUrlBlocked function
import { numTokensFromString } from '../lib/LLM-extraction/helpers';
import { defaultPageOptions, defaultExtractorOptions, defaultTimeout, defaultOrigin } from '../lib/default-values';
import { v4 as uuidv4 } from "uuid";
import { Logger } from '../lib/logger';
export async function scrapeHelper(
jobId: string,
req: Request,
team_id: string,
crawlerOptions: any,
@ -35,6 +38,7 @@ export async function scrapeHelper(
const a = new WebScraperDataProvider();
await a.setOptions({
jobId,
mode: "single_urls",
urls: [url],
crawlerOptions: {
@ -73,28 +77,6 @@ export async function scrapeHelper(
});
}
-let creditsToBeBilled = filteredDocs.length;
-const creditsPerLLMExtract = 50;
-if (extractorOptions.mode === "llm-extraction" || extractorOptions.mode === "llm-extraction-from-raw-html" || extractorOptions.mode === "llm-extraction-from-markdown") {
-creditsToBeBilled = creditsToBeBilled + (creditsPerLLMExtract * filteredDocs.length);
-}
-const billingResult = await billTeam(
-team_id,
-creditsToBeBilled
-);
-if (!billingResult.success) {
-return {
-success: false,
-error:
-"Failed to bill team. Insufficient credits or subscription not found.",
-returnCode: 402,
-};
-}
return {
success: true,
data: filteredDocs[0],
@ -104,6 +86,7 @@ export async function scrapeHelper(
export async function scrapeController(req: Request, res: Response) { export async function scrapeController(req: Request, res: Response) {
try { try {
let earlyReturn = false;
// make sure to authenticate user first, Bearer <token> // make sure to authenticate user first, Bearer <token>
const { success, team_id, error, status, plan } = await authenticateUser( const { success, team_id, error, status, plan } = await authenticateUser(
req, req,
@ -113,30 +96,40 @@ export async function scrapeController(req: Request, res: Response) {
if (!success) { if (!success) {
return res.status(status).json({ error }); return res.status(status).json({ error });
} }
const crawlerOptions = req.body.crawlerOptions ?? {}; const crawlerOptions = req.body.crawlerOptions ?? {};
const pageOptions = { ...defaultPageOptions, ...req.body.pageOptions }; const pageOptions = { ...defaultPageOptions, ...req.body.pageOptions };
const extractorOptions = { ...defaultExtractorOptions, ...req.body.extractorOptions }; const extractorOptions = { ...defaultExtractorOptions, ...req.body.extractorOptions };
const origin = req.body.origin ?? defaultOrigin; const origin = req.body.origin ?? defaultOrigin;
let timeout = req.body.timeout ?? defaultTimeout; let timeout = req.body.timeout ?? defaultTimeout;
if (extractorOptions.mode === "llm-extraction") { if (extractorOptions.mode.includes("llm-extraction")) {
pageOptions.onlyMainContent = true; pageOptions.onlyMainContent = true;
timeout = req.body.timeout ?? 90000; timeout = req.body.timeout ?? 90000;
} }
try { const checkCredits = async () => {
const { success: creditsCheckSuccess, message: creditsCheckMessage } = try {
await checkTeamCredits(team_id, 1); const { success: creditsCheckSuccess, message: creditsCheckMessage } = await checkTeamCredits(team_id, 1);
if (!creditsCheckSuccess) { if (!creditsCheckSuccess) {
return res.status(402).json({ error: "Insufficient credits" }); earlyReturn = true;
return res.status(402).json({ error: "Insufficient credits" });
}
} catch (error) {
Logger.error(error);
earlyReturn = true;
return res.status(500).json({ error: "Error checking team credits. Please contact hello@firecrawl.com for help." });
} }
} catch (error) { };
console.error(error);
return res.status(500).json({ error: "Internal server error" });
} await checkCredits();
const jobId = uuidv4();
const startTime = new Date().getTime(); const startTime = new Date().getTime();
const result = await scrapeHelper( const result = await scrapeHelper(
jobId,
req, req,
team_id, team_id,
crawlerOptions, crawlerOptions,
@ -149,7 +142,35 @@ export async function scrapeController(req: Request, res: Response) {
const timeTakenInSeconds = (endTime - startTime) / 1000; const timeTakenInSeconds = (endTime - startTime) / 1000;
const numTokens = (result.data && result.data.markdown) ? numTokensFromString(result.data.markdown, "gpt-3.5-turbo") : 0; const numTokens = (result.data && result.data.markdown) ? numTokensFromString(result.data.markdown, "gpt-3.5-turbo") : 0;
if (result.success) {
let creditsToBeBilled = 1; // Assuming 1 credit per document
const creditsPerLLMExtract = 50;
if (extractorOptions.mode.includes("llm-extraction")) {
// creditsToBeBilled = creditsToBeBilled + (creditsPerLLMExtract * filteredDocs.length);
creditsToBeBilled += creditsPerLLMExtract;
}
let startTimeBilling = new Date().getTime();
if (earlyReturn) {
// Don't bill if we're early returning
return;
}
const billingResult = await billTeam(
team_id,
creditsToBeBilled
);
if (!billingResult.success) {
return res.status(402).json({
success: false,
error: "Failed to bill team. Insufficient credits or subscription not found.",
});
}
}
logJob({ logJob({
job_id: jobId,
success: result.success, success: result.success,
message: result.error, message: result.error,
num_docs: 1, num_docs: 1,
@ -164,9 +185,12 @@ export async function scrapeController(req: Request, res: Response) {
extractor_options: extractorOptions, extractor_options: extractorOptions,
num_tokens: numTokens, num_tokens: numTokens,
}); });
return res.status(result.returnCode).json(result); return res.status(result.returnCode).json(result);
} catch (error) { } catch (error) {
console.error(error); Logger.error(error);
return res.status(500).json({ error: error.message }); return res.status(500).json({ error: error.message });
} }
} }

View File

@ -7,8 +7,11 @@ import { logJob } from "../services/logging/log_job";
import { PageOptions, SearchOptions } from "../lib/entities"; import { PageOptions, SearchOptions } from "../lib/entities";
import { search } from "../search"; import { search } from "../search";
import { isUrlBlocked } from "../scraper/WebScraper/utils/blocklist"; import { isUrlBlocked } from "../scraper/WebScraper/utils/blocklist";
import { v4 as uuidv4 } from "uuid";
import { Logger } from "../lib/logger";
export async function searchHelper( export async function searchHelper(
jobId: string,
req: Request, req: Request,
team_id: string, team_id: string,
crawlerOptions: any, crawlerOptions: any,
@ -75,6 +78,7 @@ export async function searchHelper(
const a = new WebScraperDataProvider(); const a = new WebScraperDataProvider();
await a.setOptions({ await a.setOptions({
jobId,
mode: "single_urls", mode: "single_urls",
urls: res.map((r) => r.url).slice(0, searchOptions.limit ?? 7), urls: res.map((r) => r.url).slice(0, searchOptions.limit ?? 7),
crawlerOptions: { crawlerOptions: {
@ -148,6 +152,8 @@ export async function searchController(req: Request, res: Response) {
const searchOptions = req.body.searchOptions ?? { limit: 7 }; const searchOptions = req.body.searchOptions ?? { limit: 7 };
const jobId = uuidv4();
try { try {
const { success: creditsCheckSuccess, message: creditsCheckMessage } = const { success: creditsCheckSuccess, message: creditsCheckMessage } =
await checkTeamCredits(team_id, 1); await checkTeamCredits(team_id, 1);
@ -155,11 +161,12 @@ export async function searchController(req: Request, res: Response) {
return res.status(402).json({ error: "Insufficient credits" }); return res.status(402).json({ error: "Insufficient credits" });
} }
} catch (error) { } catch (error) {
console.error(error); Logger.error(error);
return res.status(500).json({ error: "Internal server error" }); return res.status(500).json({ error: "Internal server error" });
} }
const startTime = new Date().getTime(); const startTime = new Date().getTime();
const result = await searchHelper( const result = await searchHelper(
jobId,
req, req,
team_id, team_id,
crawlerOptions, crawlerOptions,
@ -169,6 +176,7 @@ export async function searchController(req: Request, res: Response) {
const endTime = new Date().getTime(); const endTime = new Date().getTime();
const timeTakenInSeconds = (endTime - startTime) / 1000; const timeTakenInSeconds = (endTime - startTime) / 1000;
logJob({ logJob({
job_id: jobId,
success: result.success, success: result.success,
message: result.error, message: result.error,
num_docs: result.data ? result.data.length : 0, num_docs: result.data ? result.data.length : 0,
@ -183,7 +191,7 @@ export async function searchController(req: Request, res: Response) {
}); });
return res.status(result.returnCode).json(result); return res.status(result.returnCode).json(result);
} catch (error) { } catch (error) {
console.error(error); Logger.error(error);
return res.status(500).json({ error: error.message }); return res.status(500).json({ error: error.message });
} }
} }

View File

@ -1,6 +1,7 @@
import { Request, Response } from "express"; import { Request, Response } from "express";
import { getWebScraperQueue } from "../../src/services/queue-service"; import { getWebScraperQueue } from "../../src/services/queue-service";
import { supabaseGetJobById } from "../../src/lib/supabase-jobs"; import { supabaseGetJobById } from "../../src/lib/supabase-jobs";
import { Logger } from "../../src/lib/logger";
export async function crawlJobStatusPreviewController(req: Request, res: Response) { export async function crawlJobStatusPreviewController(req: Request, res: Response) {
try { try {
@ -19,7 +20,10 @@ export async function crawlJobStatusPreviewController(req: Request, res: Respons
} }
} }
const jobStatus = await job.getState(); let jobStatus = await job.getState();
if (jobStatus === 'waiting' || jobStatus === 'stuck') {
jobStatus = 'active';
}
res.json({ res.json({
status: jobStatus, status: jobStatus,
@ -32,7 +36,7 @@ export async function crawlJobStatusPreviewController(req: Request, res: Respons
partial_data: jobStatus == 'completed' ? [] : partialDocs, partial_data: jobStatus == 'completed' ? [] : partialDocs,
}); });
} catch (error) { } catch (error) {
console.error(error); Logger.error(error);
return res.status(500).json({ error: error.message }); return res.status(500).json({ error: error.message });
} }
} }

View File

@ -4,6 +4,7 @@ async function example() {
const example = new WebScraperDataProvider(); const example = new WebScraperDataProvider();
await example.setOptions({ await example.setOptions({
jobId: "TEST",
mode: "crawl", mode: "crawl",
urls: ["https://mendable.ai"], urls: ["https://mendable.ai"],
crawlerOptions: {}, crawlerOptions: {},

View File

@ -7,21 +7,30 @@ import { v0Router } from "./routes/v0";
import { initSDK } from "@hyperdx/node-opentelemetry"; import { initSDK } from "@hyperdx/node-opentelemetry";
import cluster from "cluster"; import cluster from "cluster";
import os from "os"; import os from "os";
import { Job } from "bull"; import { Logger } from "./lib/logger";
import { sendSlackWebhook } from "./services/alerts/slack"; import { adminRouter } from "./routes/admin";
import { checkAlerts } from "./services/alerts"; import { ScrapeEvents } from "./lib/scrape-events";
import Redis from "ioredis"; import http from 'node:http';
import { redisRateLimitClient } from "./services/rate-limiter"; import https from 'node:https';
import CacheableLookup from 'cacheable-lookup';
const { createBullBoard } = require("@bull-board/api"); const { createBullBoard } = require("@bull-board/api");
const { BullAdapter } = require("@bull-board/api/bullAdapter"); const { BullAdapter } = require("@bull-board/api/bullAdapter");
const { ExpressAdapter } = require("@bull-board/express"); const { ExpressAdapter } = require("@bull-board/express");
const numCPUs = process.env.ENV === "local" ? 2 : os.cpus().length; const numCPUs = process.env.ENV === "local" ? 2 : os.cpus().length;
console.log(`Number of CPUs: ${numCPUs} available`); Logger.info(`Number of CPUs: ${numCPUs} available`);
const cacheable = new CacheableLookup({
// important to avoid querying local hostnames; see the cacheable-lookup README: https://github.com/szmarczak/cacheable-lookup
lookup: false
});
cacheable.install(http.globalAgent);
cacheable.install(https.globalAgent)
if (cluster.isMaster) { if (cluster.isMaster) {
console.log(`Master ${process.pid} is running`); Logger.info(`Master ${process.pid} is running`);
// Fork workers. // Fork workers.
for (let i = 0; i < numCPUs; i++) { for (let i = 0; i < numCPUs; i++) {
@ -30,8 +39,8 @@ if (cluster.isMaster) {
cluster.on("exit", (worker, code, signal) => { cluster.on("exit", (worker, code, signal) => {
if (code !== null) { if (code !== null) {
console.log(`Worker ${worker.process.pid} exited`); Logger.info(`Worker ${worker.process.pid} exited`);
console.log("Starting a new worker"); Logger.info("Starting a new worker");
cluster.fork(); cluster.fork();
} }
}); });
@ -45,7 +54,6 @@ if (cluster.isMaster) {
app.use(cors()); // Add this line to enable CORS app.use(cors()); // Add this line to enable CORS
const serverAdapter = new ExpressAdapter(); const serverAdapter = new ExpressAdapter();
serverAdapter.setBasePath(`/admin/${process.env.BULL_AUTH_KEY}/queues`); serverAdapter.setBasePath(`/admin/${process.env.BULL_AUTH_KEY}/queues`);
@ -70,6 +78,7 @@ if (cluster.isMaster) {
// register router // register router
app.use(v0Router); app.use(v0Router);
app.use(adminRouter);
const DEFAULT_PORT = process.env.PORT ?? 3002; const DEFAULT_PORT = process.env.PORT ?? 3002;
const HOST = process.env.HOST ?? "localhost"; const HOST = process.env.HOST ?? "localhost";
@ -81,14 +90,9 @@ if (cluster.isMaster) {
function startServer(port = DEFAULT_PORT) { function startServer(port = DEFAULT_PORT) {
const server = app.listen(Number(port), HOST, () => { const server = app.listen(Number(port), HOST, () => {
console.log(`Worker ${process.pid} listening on port ${port}`); Logger.info(`Worker ${process.pid} listening on port ${port}`);
console.log( Logger.info(
`For the UI, open http://${HOST}:${port}/admin/${process.env.BULL_AUTH_KEY}/queues` `For the Queue UI, open: http://${HOST}:${port}/admin/${process.env.BULL_AUTH_KEY}/queues`
);
console.log("");
console.log("1. Make sure Redis is running on port 6379 by default");
console.log(
"2. If you want to run nango, make sure you do port forwarding in 3002 using ngrok http 3002 "
); );
}); });
return server; return server;
@ -98,84 +102,6 @@ if (cluster.isMaster) {
startServer(); startServer();
} }
// Use this as a "health check" that way we dont destroy the server
app.get(`/admin/${process.env.BULL_AUTH_KEY}/queues`, async (req, res) => {
try {
const webScraperQueue = getWebScraperQueue();
const [webScraperActive] = await Promise.all([
webScraperQueue.getActiveCount(),
]);
const noActiveJobs = webScraperActive === 0;
// 200 if no active jobs, 503 if there are active jobs
return res.status(noActiveJobs ? 200 : 500).json({
webScraperActive,
noActiveJobs,
});
} catch (error) {
console.error(error);
return res.status(500).json({ error: error.message });
}
});
app.post(`/admin/${process.env.BULL_AUTH_KEY}/shutdown`, async (req, res) => {
// return res.status(200).json({ ok: true });
try {
console.log("Gracefully shutting down...");
await getWebScraperQueue().pause(false, true);
res.json({ ok: true });
} catch (error) {
console.error(error);
return res.status(500).json({ error: error.message });
}
});
app.post(`/admin/${process.env.BULL_AUTH_KEY}/unpause`, async (req, res) => {
try {
const wsq = getWebScraperQueue();
const jobs = await wsq.getActive();
console.log("Requeueing", jobs.length, "jobs...");
if (jobs.length > 0) {
console.log(" Removing", jobs.length, "jobs...");
await Promise.all(
jobs.map(async (x) => {
try {
await wsq.client.del(await x.lockKey());
await x.takeLock();
await x.moveToFailed({ message: "interrupted" });
await x.remove();
} catch (e) {
console.warn("Failed to remove job", x.id, e);
}
})
);
console.log(" Re-adding", jobs.length, "jobs...");
await wsq.addBulk(
jobs.map((x) => ({
data: x.data,
opts: {
jobId: x.id,
},
}))
);
console.log(" Done!");
}
await getWebScraperQueue().resume(false);
res.json({ ok: true });
} catch (error) {
console.error(error);
return res.status(500).json({ error: error.message });
}
});
app.get(`/serverHealthCheck`, async (req, res) => { app.get(`/serverHealthCheck`, async (req, res) => {
try { try {
const webScraperQueue = getWebScraperQueue(); const webScraperQueue = getWebScraperQueue();
@ -189,7 +115,7 @@ if (cluster.isMaster) {
waitingJobs, waitingJobs,
}); });
} catch (error) { } catch (error) {
console.error(error); Logger.error(error);
return res.status(500).json({ error: error.message }); return res.status(500).json({ error: error.message });
} }
}); });
@ -234,13 +160,13 @@ if (cluster.isMaster) {
}); });
if (!response.ok) { if (!response.ok) {
console.error("Failed to send Slack notification"); Logger.error("Failed to send Slack notification");
} }
} }
}, timeout); }, timeout);
} }
} catch (error) { } catch (error) {
console.error(error); Logger.debug(error);
} }
}; };
@ -248,140 +174,18 @@ if (cluster.isMaster) {
} }
}); });
app.get(
`/admin/${process.env.BULL_AUTH_KEY}/check-queues`,
async (req, res) => {
try {
await checkAlerts();
return res.status(200).send("Alerts initialized");
} catch (error) {
console.error("Failed to initialize alerts:", error);
return res.status(500).send("Failed to initialize alerts");
}
}
);
app.get(
`/admin/${process.env.BULL_AUTH_KEY}/clean-before-24h-complete-jobs`,
async (req, res) => {
try {
const webScraperQueue = getWebScraperQueue();
const batchSize = 10;
const numberOfBatches = 9; // Adjust based on your needs
const completedJobsPromises: Promise<Job[]>[] = [];
for (let i = 0; i < numberOfBatches; i++) {
completedJobsPromises.push(
webScraperQueue.getJobs(
["completed"],
i * batchSize,
i * batchSize + batchSize,
true
)
);
}
const completedJobs: Job[] = (
await Promise.all(completedJobsPromises)
).flat();
const before24hJobs =
completedJobs.filter(
(job) => job.finishedOn < Date.now() - 24 * 60 * 60 * 1000
) || [];
let count = 0;
if (!before24hJobs) {
return res.status(200).send(`No jobs to remove.`);
}
for (const job of before24hJobs) {
try {
await job.remove();
count++;
} catch (jobError) {
console.error(`Failed to remove job with ID ${job.id}:`, jobError);
}
}
return res.status(200).send(`Removed ${count} completed jobs.`);
} catch (error) {
console.error("Failed to clean last 24h complete jobs:", error);
return res.status(500).send("Failed to clean jobs");
}
}
);
app.get("/is-production", (req, res) => { app.get("/is-production", (req, res) => {
res.send({ isProduction: global.isProduction }); res.send({ isProduction: global.isProduction });
}); });
app.get( Logger.info(`Worker ${process.pid} started`);
`/admin/${process.env.BULL_AUTH_KEY}/redis-health`,
async (req, res) => {
try {
const queueRedis = new Redis(process.env.REDIS_URL);
const testKey = "test";
const testValue = "test";
// Test queueRedis
let queueRedisHealth;
try {
await queueRedis.set(testKey, testValue);
queueRedisHealth = await queueRedis.get(testKey);
await queueRedis.del(testKey);
} catch (error) {
console.error("queueRedis health check failed:", error);
queueRedisHealth = null;
}
// Test redisRateLimitClient
let redisRateLimitHealth;
try {
await redisRateLimitClient.set(testKey, testValue);
redisRateLimitHealth = await redisRateLimitClient.get(testKey);
await redisRateLimitClient.del(testKey);
} catch (error) {
console.error("redisRateLimitClient health check failed:", error);
redisRateLimitHealth = null;
}
const healthStatus = {
queueRedis: queueRedisHealth === testValue ? "healthy" : "unhealthy",
redisRateLimitClient:
redisRateLimitHealth === testValue ? "healthy" : "unhealthy",
};
if (
healthStatus.queueRedis === "healthy" &&
healthStatus.redisRateLimitClient === "healthy"
) {
console.log("Both Redis instances are healthy");
return res
.status(200)
.json({ status: "healthy", details: healthStatus });
} else {
console.log("Redis instances health check:", healthStatus);
await sendSlackWebhook(
`[REDIS DOWN] Redis instances health check: ${JSON.stringify(
healthStatus
)}`,
true
);
return res
.status(500)
.json({ status: "unhealthy", details: healthStatus });
}
} catch (error) {
console.error("Redis health check failed:", error);
await sendSlackWebhook(
`[REDIS DOWN] Redis instances health check: ${error.message}`,
true
);
return res
.status(500)
.json({ status: "unhealthy", message: error.message });
}
}
);
console.log(`Worker ${process.pid} started`);
} }
const wsq = getWebScraperQueue();
wsq.on("waiting", j => ScrapeEvents.logJobEvent(j, "waiting"));
wsq.on("active", j => ScrapeEvents.logJobEvent(j, "active"));
wsq.on("completed", j => ScrapeEvents.logJobEvent(j, "completed"));
wsq.on("paused", j => ScrapeEvents.logJobEvent(j, "paused"));
wsq.on("resumed", j => ScrapeEvents.logJobEvent(j, "resumed"));
wsq.on("removed", j => ScrapeEvents.logJobEvent(j, "removed"));

View File

@ -4,6 +4,7 @@ const ajv = new Ajv(); // Initialize AJV for JSON schema validation
import { generateOpenAICompletions } from "./models"; import { generateOpenAICompletions } from "./models";
import { Document, ExtractorOptions } from "../entities"; import { Document, ExtractorOptions } from "../entities";
import { Logger } from "../logger";
// Generate completion using OpenAI // Generate completion using OpenAI
export async function generateCompletions( export async function generateCompletions(
@ -44,7 +45,7 @@ export async function generateCompletions(
return completionResult; return completionResult;
} catch (error) { } catch (error) {
console.error(`Error generating completions: ${error}`); Logger.error(`Error generating completions: ${error}`);
throw new Error(`Error generating completions: ${error.message}`); throw new Error(`Error generating completions: ${error.message}`);
} }
default: default:

View File

@ -48,7 +48,7 @@ function prepareOpenAIDoc(
export async function generateOpenAICompletions({ export async function generateOpenAICompletions({
client, client,
model = "gpt-4o", model = process.env.MODEL_NAME || "gpt-4o",
document, document,
schema, //TODO - add zod dynamic type checking schema, //TODO - add zod dynamic type checking
prompt = defaultPrompt, prompt = defaultPrompt,

View File

@ -1,12 +1,13 @@
export const defaultOrigin = "api"; export const defaultOrigin = "api";
export const defaultTimeout = 30000; // 30 seconds export const defaultTimeout = 45000; // 45 seconds
export const defaultPageOptions = { export const defaultPageOptions = {
onlyMainContent: false, onlyMainContent: false,
includeHtml: false, includeHtml: false,
waitFor: 0, waitFor: 0,
screenshot: false, screenshot: false,
fullPageScreenshot: false,
parsePDF: true parsePDF: true
}; };

View File

@ -18,6 +18,7 @@ export type PageOptions = {
fetchPageContent?: boolean; fetchPageContent?: boolean;
waitFor?: number; waitFor?: number;
screenshot?: boolean; screenshot?: boolean;
fullPageScreenshot?: boolean;
headers?: Record<string, string>; headers?: Record<string, string>;
replaceAllPathsWithAbsolutePaths?: boolean; replaceAllPathsWithAbsolutePaths?: boolean;
parsePDF?: boolean; parsePDF?: boolean;
@ -42,8 +43,8 @@ export type SearchOptions = {
export type CrawlerOptions = { export type CrawlerOptions = {
returnOnlyUrls?: boolean; returnOnlyUrls?: boolean;
includes?: string[]; includes?: string | string[];
excludes?: string[]; excludes?: string | string[];
maxCrawledLinks?: number; maxCrawledLinks?: number;
maxDepth?: number; maxDepth?: number;
limit?: number; limit?: number;
@ -56,6 +57,7 @@ export type CrawlerOptions = {
} }
export type WebScraperOptions = { export type WebScraperOptions = {
jobId: string;
urls: string[]; urls: string[];
mode: "single_urls" | "sitemap" | "crawl"; mode: "single_urls" | "sitemap" | "crawl";
crawlerOptions?: CrawlerOptions; crawlerOptions?: CrawlerOptions;
@ -138,4 +140,5 @@ export interface FireEngineOptions{
engine?: string; engine?: string;
blockMedia?: boolean; blockMedia?: boolean;
blockAds?: boolean; blockAds?: boolean;
disableJsDom?: boolean;
} }
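
Since includes and excludes now accept either an array or a comma-separated string, the two forms below are equivalent once the data provider normalizes them (see the split(',') handling added in WebScraperDataProvider later in this diff); the patterns and import path are illustrative:

import { CrawlerOptions } from "../lib/entities";

// Both are accepted after this change; a comma-separated string is split
// into an array before link filtering.
const viaArray: CrawlerOptions = { includes: ["blog/*", "docs/*"] };
const viaString: CrawlerOptions = { includes: "blog/*,docs/*" };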

View File

@ -0,0 +1,53 @@
enum LogLevel {
NONE = 'NONE', // No logs will be output.
ERROR = 'ERROR', // For logging error messages that indicate a failure in a specific operation.
WARN = 'WARN', // For logging potentially harmful situations that are not necessarily errors.
INFO = 'INFO', // For logging informational messages that highlight the progress of the application.
DEBUG = 'DEBUG', // For logging detailed information on the flow through the system, primarily used for debugging.
TRACE = 'TRACE' // For logging more detailed information than the DEBUG level.
}
export class Logger {
static colors = {
ERROR: '\x1b[31m%s\x1b[0m', // Red
WARN: '\x1b[33m%s\x1b[0m', // Yellow
INFO: '\x1b[34m%s\x1b[0m', // Blue
DEBUG: '\x1b[36m%s\x1b[0m', // Cyan
TRACE: '\x1b[35m%s\x1b[0m' // Magenta
};
static log (message: string, level: LogLevel) {
const logLevel: LogLevel = LogLevel[process.env.LOGGING_LEVEL as keyof typeof LogLevel] || LogLevel.INFO;
const levels = [LogLevel.NONE, LogLevel.ERROR, LogLevel.WARN, LogLevel.INFO, LogLevel.DEBUG, LogLevel.TRACE];
const currentLevelIndex = levels.indexOf(logLevel);
const messageLevelIndex = levels.indexOf(level);
if (currentLevelIndex >= messageLevelIndex) {
const color = Logger.colors[level];
console[level.toLowerCase()](color, `[${new Date().toISOString()}]${level} - ${message}`);
// if (process.env.USE_DB_AUTH) {
// save to supabase? another place?
// supabase.from('logs').insert({ level: level, message: message, timestamp: new Date().toISOString(), success: boolean });
// }
}
}
static error(message: string | any) {
Logger.log(message, LogLevel.ERROR);
}
static warn(message: string) {
Logger.log(message, LogLevel.WARN);
}
static info(message: string) {
Logger.log(message, LogLevel.INFO);
}
static debug(message: string) {
Logger.log(message, LogLevel.DEBUG);
}
static trace(message: string) {
Logger.log(message, LogLevel.TRACE);
}
}
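
A usage sketch for the logger above; LOGGING_LEVEL is read from the environment, so the value shown here is only an example:

// With LOGGING_LEVEL=INFO, messages more verbose than INFO (DEBUG, TRACE) are suppressed.
process.env.LOGGING_LEVEL = "INFO";

Logger.error("Failed to bill team");            // printed in red
Logger.info(`Worker ${process.pid} started`);   // printed in blue
Logger.debug("Crawling https://example.com");   // suppressed at INFO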

View File

@ -0,0 +1,84 @@
import { Job, JobId } from "bull";
import type { baseScrapers } from "../scraper/WebScraper/single_url";
import { supabase_service as supabase } from "../services/supabase";
import { Logger } from "./logger";
export type ScrapeErrorEvent = {
type: "error",
message: string,
stack?: string,
}
export type ScrapeScrapeEvent = {
type: "scrape",
url: string,
worker?: string,
method: (typeof baseScrapers)[number],
result: null | {
success: boolean,
response_code?: number,
response_size?: number,
error?: string | object,
// proxy?: string,
time_taken: number,
},
}
export type ScrapeQueueEvent = {
type: "queue",
event: "waiting" | "active" | "completed" | "paused" | "resumed" | "removed" | "failed",
worker?: string,
}
export type ScrapeEvent = ScrapeErrorEvent | ScrapeScrapeEvent | ScrapeQueueEvent;
export class ScrapeEvents {
static async insert(jobId: string, content: ScrapeEvent) {
if (jobId === "TEST") return null;
if (process.env.USE_DB_AUTHENTICATION) {
try {
const result = await supabase.from("scrape_events").insert({
job_id: jobId,
type: content.type,
content: content,
// created_at
}).select().single();
return (result.data as any).id;
} catch (error) {
// Logger.error(`Error inserting scrape event: ${error}`);
return null;
}
}
return null;
}
static async updateScrapeResult(logId: number | null, result: ScrapeScrapeEvent["result"]) {
if (logId === null) return;
try {
const previousLog = (await supabase.from("scrape_events").select().eq("id", logId).single()).data as any;
await supabase.from("scrape_events").update({
content: {
...previousLog.content,
result,
}
}).eq("id", logId);
} catch (error) {
Logger.error(`Error updating scrape result: ${error}`);
}
}
static async logJobEvent(job: Job | JobId, event: ScrapeQueueEvent["event"]) {
try {
await this.insert(((job as any).id ? (job as any).id : job) as string, {
type: "queue",
event,
worker: process.env.FLY_MACHINE_ID,
});
} catch (error) {
Logger.error(`Error logging job event: ${error}`);
}
}
}
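
A sketch of the intended call pattern around a single scrape, mirroring the timing-and-update flow this commit adds to fetchPdfDocuments further down; the helper name and result values are placeholders:

import { ScrapeEvents } from "./scrape-events";

async function recordPdfScrape(jobId: string, url: string) {
  const timer = Date.now();
  // Insert a pending event first; result stays null until the scrape finishes.
  const logId = await ScrapeEvents.insert(jobId, {
    type: "scrape",
    url,
    worker: process.env.FLY_MACHINE_ID,
    method: "pdf-scrape",
    result: null,
  });

  // ... fetch and process the PDF here ...

  await ScrapeEvents.updateScrapeResult(logId, {
    success: true,
    response_code: 200,
    response_size: 1024,
    time_taken: Date.now() - timer,
  });
}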

View File

@ -1,4 +1,5 @@
import { AuthResponse } from "../../src/types"; import { AuthResponse } from "../../src/types";
import { Logger } from "./logger";
let warningCount = 0; let warningCount = 0;
@ -8,7 +9,7 @@ export function withAuth<T extends AuthResponse, U extends any[]>(
return async function (...args: U): Promise<T> { return async function (...args: U): Promise<T> {
if (process.env.USE_DB_AUTHENTICATION === "false") { if (process.env.USE_DB_AUTHENTICATION === "false") {
if (warningCount < 5) { if (warningCount < 5) {
console.warn("WARNING - You're bypassing authentication"); Logger.warn("You're bypassing authentication");
warningCount++; warningCount++;
} }
return { success: true } as T; return { success: true } as T;
@ -16,7 +17,7 @@ export function withAuth<T extends AuthResponse, U extends any[]>(
try { try {
return await originalFunction(...args); return await originalFunction(...args);
} catch (error) { } catch (error) {
console.error("Error in withAuth function: ", error); Logger.error(`Error in withAuth function: ${error}`);
return { success: false, error: error.message } as T; return { success: false, error: error.message } as T;
} }
} }

View File

@ -10,6 +10,8 @@ import { DocumentUrl, Progress } from "../lib/entities";
import { billTeam } from "../services/billing/credit_billing"; import { billTeam } from "../services/billing/credit_billing";
import { Document } from "../lib/entities"; import { Document } from "../lib/entities";
import { supabase_service } from "../services/supabase"; import { supabase_service } from "../services/supabase";
import { Logger } from "../lib/logger";
import { ScrapeEvents } from "../lib/scrape-events";
export async function startWebScraperPipeline({ export async function startWebScraperPipeline({
job, job,
@ -23,6 +25,7 @@ export async function startWebScraperPipeline({
crawlerOptions: job.data.crawlerOptions, crawlerOptions: job.data.crawlerOptions,
pageOptions: job.data.pageOptions, pageOptions: job.data.pageOptions,
inProgress: (progress) => { inProgress: (progress) => {
Logger.debug(`🐂 Job in progress ${job.id}`);
if (progress.currentDocument) { if (progress.currentDocument) {
partialDocs.push(progress.currentDocument); partialDocs.push(progress.currentDocument);
if (partialDocs.length > 50) { if (partialDocs.length > 50) {
@ -32,9 +35,12 @@ export async function startWebScraperPipeline({
} }
}, },
onSuccess: (result) => { onSuccess: (result) => {
Logger.debug(`🐂 Job completed ${job.id}`);
saveJob(job, result); saveJob(job, result);
}, },
onError: (error) => { onError: (error) => {
Logger.error(`🐂 Job failed ${job.id}`);
ScrapeEvents.logJobEvent(job, "failed");
job.moveToFailed(error); job.moveToFailed(error);
}, },
team_id: job.data.team_id, team_id: job.data.team_id,
@ -56,6 +62,7 @@ export async function runWebScraper({
const provider = new WebScraperDataProvider(); const provider = new WebScraperDataProvider();
if (mode === "crawl") { if (mode === "crawl") {
await provider.setOptions({ await provider.setOptions({
jobId: bull_job_id,
mode: mode, mode: mode,
urls: [url], urls: [url],
crawlerOptions: crawlerOptions, crawlerOptions: crawlerOptions,
@ -64,6 +71,7 @@ export async function runWebScraper({
}); });
} else { } else {
await provider.setOptions({ await provider.setOptions({
jobId: bull_job_id,
mode: mode, mode: mode,
urls: url.split(","), urls: url.split(","),
crawlerOptions: crawlerOptions, crawlerOptions: crawlerOptions,
@ -108,7 +116,6 @@ export async function runWebScraper({
// this return doesn't matter too much for the job completion result // this return doesn't matter too much for the job completion result
return { success: true, message: "", docs: filteredDocs }; return { success: true, message: "", docs: filteredDocs };
} catch (error) { } catch (error) {
console.error("Error running web scraper", error);
onError(error); onError(error);
return { success: false, message: error.message, docs: [] }; return { success: false, message: error.message, docs: [] };
} }
@ -135,7 +142,8 @@ const saveJob = async (job: Job, result: any) => {
// I think the job won't exist here anymore // I think the job won't exist here anymore
} }
} }
ScrapeEvents.logJobEvent(job, "completed");
} catch (error) { } catch (error) {
console.error("Failed to update job status:", error); Logger.error(`🐂 Failed to update job status: ${error}`);
} }
}; };

View File

@ -0,0 +1,29 @@
import express from "express";
import { redisHealthController } from "../controllers/admin/redis-health";
import {
checkQueuesController,
cleanBefore24hCompleteJobsController,
queuesController,
} from "../controllers/admin/queue";
export const adminRouter = express.Router();
adminRouter.get(
`/admin/${process.env.BULL_AUTH_KEY}/redis-health`,
redisHealthController
);
adminRouter.get(
`/admin/${process.env.BULL_AUTH_KEY}/clean-before-24h-complete-jobs`,
cleanBefore24hCompleteJobsController
);
adminRouter.get(
`/admin/${process.env.BULL_AUTH_KEY}/check-queues`,
checkQueuesController
);
adminRouter.get(
`/admin/${process.env.BULL_AUTH_KEY}/queues`,
queuesController
);

View File

@ -7,6 +7,8 @@ import { crawlJobStatusPreviewController } from "../../src/controllers/status";
import { searchController } from "../../src/controllers/search"; import { searchController } from "../../src/controllers/search";
import { crawlCancelController } from "../../src/controllers/crawl-cancel"; import { crawlCancelController } from "../../src/controllers/crawl-cancel";
import { keyAuthController } from "../../src/controllers/keyAuth"; import { keyAuthController } from "../../src/controllers/keyAuth";
import { livenessController } from "../controllers/liveness";
import { readinessController } from "../controllers/readiness";
export const v0Router = express.Router(); export const v0Router = express.Router();
@ -23,3 +25,6 @@ v0Router.get("/v0/keyAuth", keyAuthController);
// Search routes // Search routes
v0Router.post("/v0/search", searchController); v0Router.post("/v0/search", searchController);
// Health/Probe routes
v0Router.get("/v0/health/liveness", livenessController);
v0Router.get("/v0/health/readiness", readinessController);

View File

@ -42,6 +42,7 @@ describe('WebCrawler', () => {
crawler = new WebCrawler({ crawler = new WebCrawler({
jobId: "TEST",
initialUrl: initialUrl, initialUrl: initialUrl,
includes: [], includes: [],
excludes: [], excludes: [],
@ -76,6 +77,7 @@ describe('WebCrawler', () => {
crawler = new WebCrawler({ crawler = new WebCrawler({
jobId: "TEST",
initialUrl: initialUrl, initialUrl: initialUrl,
includes: [], includes: [],
excludes: [], excludes: [],
@ -104,6 +106,7 @@ describe('WebCrawler', () => {
crawler = new WebCrawler({ crawler = new WebCrawler({
jobId: "TEST",
initialUrl: initialUrl, initialUrl: initialUrl,
includes: [], includes: [],
excludes: [], excludes: [],
@ -133,6 +136,7 @@ describe('WebCrawler', () => {
crawler = new WebCrawler({ crawler = new WebCrawler({
jobId: "TEST",
initialUrl: initialUrl, initialUrl: initialUrl,
includes: [], includes: [],
excludes: [], excludes: [],
@ -161,6 +165,7 @@ describe('WebCrawler', () => {
// Setup the crawler with the specific test case options // Setup the crawler with the specific test case options
const crawler = new WebCrawler({ const crawler = new WebCrawler({
jobId: "TEST",
initialUrl: initialUrl, initialUrl: initialUrl,
includes: [], includes: [],
excludes: [], excludes: [],
@ -194,6 +199,7 @@ describe('WebCrawler', () => {
const limit = 2; // Set a limit for the number of links const limit = 2; // Set a limit for the number of links
crawler = new WebCrawler({ crawler = new WebCrawler({
jobId: "TEST",
initialUrl: initialUrl, initialUrl: initialUrl,
includes: [], includes: [],
excludes: [], excludes: [],

View File

@ -0,0 +1,15 @@
import CacheableLookup from 'cacheable-lookup';
import https from 'node:https';
import axios from "axios";
describe("DNS", () => {
it("cached dns", async () => {
const cachedDns = new CacheableLookup();
cachedDns.install(https.globalAgent);
jest.spyOn(cachedDns, "lookupAsync");
const res = await axios.get("https://example.com");
expect(res.status).toBe(200);
expect(cachedDns.lookupAsync).toHaveBeenCalled();
});
});

View File

@ -15,23 +15,23 @@ describe('scrapSingleUrl', () => {
const pageOptionsWithHtml: PageOptions = { includeHtml: true }; const pageOptionsWithHtml: PageOptions = { includeHtml: true };
const pageOptionsWithoutHtml: PageOptions = { includeHtml: false }; const pageOptionsWithoutHtml: PageOptions = { includeHtml: false };
const resultWithHtml = await scrapSingleUrl(url, pageOptionsWithHtml); const resultWithHtml = await scrapSingleUrl("TEST", url, pageOptionsWithHtml);
const resultWithoutHtml = await scrapSingleUrl(url, pageOptionsWithoutHtml); const resultWithoutHtml = await scrapSingleUrl("TEST", url, pageOptionsWithoutHtml);
expect(resultWithHtml.html).toBeDefined(); expect(resultWithHtml.html).toBeDefined();
expect(resultWithoutHtml.html).toBeUndefined(); expect(resultWithoutHtml.html).toBeUndefined();
}, 10000); }, 10000);
}); });
it('should return a list of links on the mendable.ai page', async () => { it('should return a list of links on the example.com page', async () => {
const url = 'https://mendable.ai'; const url = 'https://example.com';
const pageOptions: PageOptions = { includeHtml: true }; const pageOptions: PageOptions = { includeHtml: true };
const result = await scrapSingleUrl(url, pageOptions); const result = await scrapSingleUrl("TEST", url, pageOptions);
// Check if the result contains a list of links // Check if the result contains a list of links
expect(result.linksOnPage).toBeDefined(); expect(result.linksOnPage).toBeDefined();
expect(Array.isArray(result.linksOnPage)).toBe(true); expect(Array.isArray(result.linksOnPage)).toBe(true);
expect(result.linksOnPage.length).toBeGreaterThan(0); expect(result.linksOnPage.length).toBeGreaterThan(0);
expect(result.linksOnPage).toContain('https://mendable.ai/blog') expect(result.linksOnPage).toContain('https://www.iana.org/domains/example')
}, 10000); }, 10000);

View File

@ -8,8 +8,10 @@ import { scrapSingleUrl } from "./single_url";
import robotsParser from "robots-parser"; import robotsParser from "robots-parser";
import { getURLDepth } from "./utils/maxDepthUtils"; import { getURLDepth } from "./utils/maxDepthUtils";
import { axiosTimeout } from "../../../src/lib/timeout"; import { axiosTimeout } from "../../../src/lib/timeout";
import { Logger } from "../../../src/lib/logger";
export class WebCrawler { export class WebCrawler {
private jobId: string;
private initialUrl: string; private initialUrl: string;
private baseUrl: string; private baseUrl: string;
private includes: string[]; private includes: string[];
@ -26,6 +28,7 @@ export class WebCrawler {
private allowExternalContentLinks: boolean; private allowExternalContentLinks: boolean;
constructor({ constructor({
jobId,
initialUrl, initialUrl,
includes, includes,
excludes, excludes,
@ -36,6 +39,7 @@ export class WebCrawler {
allowBackwardCrawling = false, allowBackwardCrawling = false,
allowExternalContentLinks = false allowExternalContentLinks = false
}: { }: {
jobId: string;
initialUrl: string; initialUrl: string;
includes?: string[]; includes?: string[];
excludes?: string[]; excludes?: string[];
@ -46,6 +50,7 @@ export class WebCrawler {
allowBackwardCrawling?: boolean; allowBackwardCrawling?: boolean;
allowExternalContentLinks?: boolean; allowExternalContentLinks?: boolean;
}) { }) {
this.jobId = jobId;
this.initialUrl = initialUrl; this.initialUrl = initialUrl;
this.baseUrl = new URL(initialUrl).origin; this.baseUrl = new URL(initialUrl).origin;
this.includes = includes ?? []; this.includes = includes ?? [];
@ -64,7 +69,7 @@ export class WebCrawler {
private filterLinks(sitemapLinks: string[], limit: number, maxDepth: number): string[] { private filterLinks(sitemapLinks: string[], limit: number, maxDepth: number): string[] {
return sitemapLinks return sitemapLinks
.filter((link) => { .filter((link) => {
const url = new URL(link); const url = new URL(link.trim(), this.baseUrl);
const path = url.pathname; const path = url.pathname;
const depth = getURLDepth(url.toString()); const depth = getURLDepth(url.toString());
@ -116,7 +121,7 @@ export class WebCrawler {
const isAllowed = this.robots.isAllowed(link, "FireCrawlAgent") ?? true; const isAllowed = this.robots.isAllowed(link, "FireCrawlAgent") ?? true;
// Check if the link is disallowed by robots.txt // Check if the link is disallowed by robots.txt
if (!isAllowed) { if (!isAllowed) {
console.log(`Link disallowed by robots.txt: ${link}`); Logger.debug(`Link disallowed by robots.txt: ${link}`);
return false; return false;
} }
@ -133,15 +138,19 @@ export class WebCrawler {
limit: number = 10000, limit: number = 10000,
maxDepth: number = 10 maxDepth: number = 10
): Promise<{ url: string, html: string }[]> { ): Promise<{ url: string, html: string }[]> {
Logger.debug(`Crawler starting with ${this.initialUrl}`);
// Fetch and parse robots.txt // Fetch and parse robots.txt
try { try {
const response = await axios.get(this.robotsTxtUrl, { timeout: axiosTimeout }); const response = await axios.get(this.robotsTxtUrl, { timeout: axiosTimeout });
this.robots = robotsParser(this.robotsTxtUrl, response.data); this.robots = robotsParser(this.robotsTxtUrl, response.data);
Logger.debug(`Crawler robots.txt fetched with ${this.robotsTxtUrl}`);
} catch (error) { } catch (error) {
console.log(`Failed to fetch robots.txt from ${this.robotsTxtUrl}`); Logger.debug(`Failed to fetch robots.txt from ${this.robotsTxtUrl}`);
} }
if(!crawlerOptions?.ignoreSitemap){ if (!crawlerOptions?.ignoreSitemap){
Logger.debug(`Fetching sitemap links from ${this.initialUrl}`);
const sitemapLinks = await this.tryFetchSitemapLinks(this.initialUrl); const sitemapLinks = await this.tryFetchSitemapLinks(this.initialUrl);
if (sitemapLinks.length > 0) { if (sitemapLinks.length > 0) {
let filteredLinks = this.filterLinks(sitemapLinks, limit, maxDepth); let filteredLinks = this.filterLinks(sitemapLinks, limit, maxDepth);
@ -155,7 +164,7 @@ export class WebCrawler {
concurrencyLimit, concurrencyLimit,
inProgress inProgress
); );
if ( if (
urls.length === 0 && urls.length === 0 &&
this.filterLinks([this.initialUrl], limit, this.maxCrawledDepth).length > 0 this.filterLinks([this.initialUrl], limit, this.maxCrawledDepth).length > 0
@ -175,6 +184,7 @@ export class WebCrawler {
inProgress?: (progress: Progress) => void, inProgress?: (progress: Progress) => void,
): Promise<{ url: string, html: string }[]> { ): Promise<{ url: string, html: string }[]> {
const queue = async.queue(async (task: string, callback) => { const queue = async.queue(async (task: string, callback) => {
Logger.debug(`Crawling ${task}`);
if (this.crawledUrls.size >= Math.min(this.maxCrawledLinks, this.limit)) { if (this.crawledUrls.size >= Math.min(this.maxCrawledLinks, this.limit)) {
if (callback && typeof callback === "function") { if (callback && typeof callback === "function") {
callback(); callback();
@ -216,16 +226,18 @@ export class WebCrawler {
} }
}, concurrencyLimit); }, concurrencyLimit);
Logger.debug(`🐂 Pushing ${urls.length} URLs to the queue`);
queue.push( queue.push(
urls.filter( urls.filter(
(url) => (url) =>
!this.visited.has(url) && this.robots.isAllowed(url, "FireCrawlAgent") !this.visited.has(url) && this.robots.isAllowed(url, "FireCrawlAgent")
), ),
(err) => { (err) => {
if (err) console.error(err); if (err) Logger.error(`🐂 Error pushing URLs to the queue: ${err}`);
} }
); );
await queue.drain(); await queue.drain();
Logger.debug(`🐂 Crawled ${this.crawledUrls.size} URLs, Queue drained.`);
return Array.from(this.crawledUrls.entries()).map(([url, html]) => ({ url, html })); return Array.from(this.crawledUrls.entries()).map(([url, html]) => ({ url, html }));
} }
@ -253,7 +265,7 @@ export class WebCrawler {
// If it is the first link, fetch with single url // If it is the first link, fetch with single url
if (this.visited.size === 1) { if (this.visited.size === 1) {
const page = await scrapSingleUrl(url, { ...pageOptions, includeHtml: true }); const page = await scrapSingleUrl(this.jobId, url, { ...pageOptions, includeHtml: true });
content = page.html ?? ""; content = page.html ?? "";
pageStatusCode = page.metadata?.pageStatusCode; pageStatusCode = page.metadata?.pageStatusCode;
pageError = page.metadata?.pageError || undefined; pageError = page.metadata?.pageError || undefined;
@ -282,7 +294,6 @@ export class WebCrawler {
const urlObj = new URL(fullUrl); const urlObj = new URL(fullUrl);
const path = urlObj.pathname; const path = urlObj.pathname;
if (this.isInternalLink(fullUrl)) { // INTERNAL LINKS if (this.isInternalLink(fullUrl)) { // INTERNAL LINKS
if (this.isInternalLink(fullUrl) && if (this.isInternalLink(fullUrl) &&
this.noSections(fullUrl) && this.noSections(fullUrl) &&
@ -383,7 +394,7 @@ export class WebCrawler {
return linkDomain === baseDomain; return linkDomain === baseDomain;
} }
private isFile(url: string): boolean { public isFile(url: string): boolean {
const fileExtensions = [ const fileExtensions = [
".png", ".png",
".jpg", ".jpg",
@ -393,6 +404,7 @@ export class WebCrawler {
".js", ".js",
".ico", ".ico",
".svg", ".svg",
".tiff",
// ".pdf", // ".pdf",
".zip", ".zip",
".exe", ".exe",
@ -408,9 +420,10 @@ export class WebCrawler {
".woff", ".woff",
".ttf", ".ttf",
".woff2", ".woff2",
".webp" ".webp",
".inc"
]; ];
return fileExtensions.some((ext) => url.endsWith(ext)); return fileExtensions.some((ext) => url.toLowerCase().endsWith(ext));
} }
private isSocialMediaOrEmail(url: string): boolean { private isSocialMediaOrEmail(url: string): boolean {
@ -451,7 +464,7 @@ export class WebCrawler {
sitemapLinks = await getLinksFromSitemap({ sitemapUrl }); sitemapLinks = await getLinksFromSitemap({ sitemapUrl });
} }
} catch (error) { } catch (error) {
console.error(`Failed to fetch sitemap with axios from ${sitemapUrl}: ${error}`); Logger.debug(`Failed to fetch sitemap with axios from ${sitemapUrl}: ${error}`);
const response = await getLinksFromSitemap({ sitemapUrl, mode: 'fire-engine' }); const response = await getLinksFromSitemap({ sitemapUrl, mode: 'fire-engine' });
if (response) { if (response) {
sitemapLinks = response; sitemapLinks = response;
@ -463,10 +476,10 @@ export class WebCrawler {
try { try {
const response = await axios.get(baseUrlSitemap, { timeout: axiosTimeout }); const response = await axios.get(baseUrlSitemap, { timeout: axiosTimeout });
if (response.status === 200) { if (response.status === 200) {
sitemapLinks = await getLinksFromSitemap({ sitemapUrl: baseUrlSitemap }); sitemapLinks = await getLinksFromSitemap({ sitemapUrl: baseUrlSitemap, mode: 'fire-engine' });
} }
} catch (error) { } catch (error) {
console.error(`Failed to fetch sitemap from ${baseUrlSitemap}: ${error}`); Logger.debug(`Failed to fetch sitemap from ${baseUrlSitemap}: ${error}`);
sitemapLinks = await getLinksFromSitemap({ sitemapUrl: baseUrlSitemap, mode: 'fire-engine' }); sitemapLinks = await getLinksFromSitemap({ sitemapUrl: baseUrlSitemap, mode: 'fire-engine' });
} }
} }

View File

@ -1,10 +1,12 @@
import { Logger } from "../../../lib/logger";
export async function handleCustomScraping( export async function handleCustomScraping(
text: string, text: string,
url: string url: string
): Promise<{ scraper: string; url: string; waitAfterLoad?: number, pageOptions?: { scrollXPaths?: string[] } } | null> { ): Promise<{ scraper: string; url: string; waitAfterLoad?: number, pageOptions?: { scrollXPaths?: string[] } } | null> {
// Check for Readme Docs special case // Check for Readme Docs special case
if (text.includes('<meta name="readme-deploy"')) { if (text.includes('<meta name="readme-deploy"')) {
console.log( Logger.debug(
`Special use case detected for ${url}, using Fire Engine with wait time 1000ms` `Special use case detected for ${url}, using Fire Engine with wait time 1000ms`
); );
return { return {
@ -19,7 +21,7 @@ export async function handleCustomScraping(
// Check for Vanta security portals // Check for Vanta security portals
if (text.includes('<link href="https://static.vanta.com')) { if (text.includes('<link href="https://static.vanta.com')) {
console.log( Logger.debug(
`Vanta link detected for ${url}, using Fire Engine with wait time 3000ms` `Vanta link detected for ${url}, using Fire Engine with wait time 3000ms`
); );
return { return {
@ -34,7 +36,7 @@ export async function handleCustomScraping(
const googleDriveMetaMatch = text.match(googleDriveMetaPattern); const googleDriveMetaMatch = text.match(googleDriveMetaPattern);
if (googleDriveMetaMatch) { if (googleDriveMetaMatch) {
const url = googleDriveMetaMatch[1]; const url = googleDriveMetaMatch[1];
console.log(`Google Drive PDF link detected: ${url}`); Logger.debug(`Google Drive PDF link detected: ${url}`);
const fileIdMatch = url.match(/https:\/\/drive\.google\.com\/file\/d\/([^\/]+)\/view/); const fileIdMatch = url.match(/https:\/\/drive\.google\.com\/file\/d\/([^\/]+)\/view/);
if (fileIdMatch) { if (fileIdMatch) {

View File

@ -19,13 +19,16 @@ import { generateCompletions } from "../../lib/LLM-extraction";
import { getWebScraperQueue } from "../../../src/services/queue-service"; import { getWebScraperQueue } from "../../../src/services/queue-service";
import { fetchAndProcessDocx } from "./utils/docxProcessor"; import { fetchAndProcessDocx } from "./utils/docxProcessor";
import { getAdjustedMaxDepth, getURLDepth } from "./utils/maxDepthUtils"; import { getAdjustedMaxDepth, getURLDepth } from "./utils/maxDepthUtils";
import { Logger } from "../../lib/logger";
import { ScrapeEvents } from "../../lib/scrape-events";
export class WebScraperDataProvider { export class WebScraperDataProvider {
private jobId: string;
private bullJobId: string; private bullJobId: string;
private urls: string[] = [""]; private urls: string[] = [""];
private mode: "single_urls" | "sitemap" | "crawl" = "single_urls"; private mode: "single_urls" | "sitemap" | "crawl" = "single_urls";
private includes: string[]; private includes: string | string[];
private excludes: string[]; private excludes: string | string[];
private maxCrawledLinks: number; private maxCrawledLinks: number;
private maxCrawledDepth: number = 10; private maxCrawledDepth: number = 10;
private returnOnlyUrls: boolean; private returnOnlyUrls: boolean;
@ -65,6 +68,7 @@ export class WebScraperDataProvider {
batchUrls.map(async (url, index) => { batchUrls.map(async (url, index) => {
const existingHTML = allHtmls ? allHtmls[i + index] : ""; const existingHTML = allHtmls ? allHtmls[i + index] : "";
const result = await scrapSingleUrl( const result = await scrapSingleUrl(
this.jobId,
url, url,
this.pageOptions, this.pageOptions,
this.extractorOptions, this.extractorOptions,
@ -89,14 +93,14 @@ export class WebScraperDataProvider {
const job = await getWebScraperQueue().getJob(this.bullJobId); const job = await getWebScraperQueue().getJob(this.bullJobId);
const jobStatus = await job.getState(); const jobStatus = await job.getState();
if (jobStatus === "failed") { if (jobStatus === "failed") {
console.error( Logger.info(
"Job has failed or has been cancelled by the user. Stopping the job..." "Job has failed or has been cancelled by the user. Stopping the job..."
); );
return [] as Document[]; return [] as Document[];
} }
} }
} catch (error) { } catch (error) {
console.error(error); Logger.error(error.message);
return [] as Document[]; return [] as Document[];
} }
} }
@ -164,11 +168,11 @@ export class WebScraperDataProvider {
private async handleCrawlMode( private async handleCrawlMode(
inProgress?: (progress: Progress) => void inProgress?: (progress: Progress) => void
): Promise<Document[]> { ): Promise<Document[]> {
const crawler = new WebCrawler({ const crawler = new WebCrawler({
jobId: this.jobId,
initialUrl: this.urls[0], initialUrl: this.urls[0],
includes: this.includes, includes: Array.isArray(this.includes) ? this.includes : this.includes.split(','),
excludes: this.excludes, excludes: Array.isArray(this.excludes) ? this.excludes : this.excludes.split(','),
maxCrawledLinks: this.maxCrawledLinks, maxCrawledLinks: this.maxCrawledLinks,
maxCrawledDepth: getAdjustedMaxDepth(this.urls[0], this.maxCrawledDepth), maxCrawledDepth: getAdjustedMaxDepth(this.urls[0], this.maxCrawledDepth),
limit: this.limit, limit: this.limit,
@ -225,7 +229,6 @@ export class WebScraperDataProvider {
return this.returnOnlyUrlsResponse(links, inProgress); return this.returnOnlyUrlsResponse(links, inProgress);
} }
let documents = await this.processLinks(links, inProgress); let documents = await this.processLinks(links, inProgress);
return this.cacheAndFinalizeDocuments(documents, links); return this.cacheAndFinalizeDocuments(documents, links);
} }
@ -253,35 +256,60 @@ export class WebScraperDataProvider {
inProgress?: (progress: Progress) => void, inProgress?: (progress: Progress) => void,
allHtmls?: string[] allHtmls?: string[]
): Promise<Document[]> { ): Promise<Document[]> {
const pdfLinks = links.filter(link => link.endsWith(".pdf")); const pdfLinks = links.filter((link) => link.endsWith(".pdf"));
const docLinks = links.filter(link => link.endsWith(".doc") || link.endsWith(".docx")); const docLinks = links.filter(
(link) => link.endsWith(".doc") || link.endsWith(".docx")
const pdfDocuments = await this.fetchPdfDocuments(pdfLinks);
const docxDocuments = await this.fetchDocxDocuments(docLinks);
links = links.filter(link => !pdfLinks.includes(link) && !docLinks.includes(link));
let documents = await this.convertUrlsToDocuments(
links,
inProgress,
allHtmls
); );
documents = await this.getSitemapData(this.urls[0], documents); const [pdfDocuments, docxDocuments] = await Promise.all([
this.fetchPdfDocuments(pdfLinks),
this.fetchDocxDocuments(docLinks),
]);
links = links.filter(
(link) => !pdfLinks.includes(link) && !docLinks.includes(link)
);
let [documents, sitemapData] = await Promise.all([
this.convertUrlsToDocuments(links, inProgress, allHtmls),
this.mode === "single_urls" && links.length > 0
? this.getSitemapDataForSingleUrl(this.urls[0], links[0], 1500).catch(
(error) => {
Logger.debug(`Failed to fetch sitemap data: ${error}`);
return null;
}
)
: Promise.resolve(null),
]);
if (this.mode === "single_urls" && documents.length > 0) {
documents[0].metadata.sitemap = sitemapData ?? undefined;
} else {
documents = await this.getSitemapData(this.urls[0], documents);
}
documents = this.applyPathReplacements(documents); documents = this.applyPathReplacements(documents);
// documents = await this.applyImgAltText(documents); // documents = await this.applyImgAltText(documents);
if ( if (
(this.extractorOptions.mode === "llm-extraction" || this.extractorOptions.mode === "llm-extraction-from-markdown") && (this.extractorOptions.mode === "llm-extraction" ||
this.extractorOptions.mode === "llm-extraction-from-markdown") &&
this.mode === "single_urls" this.mode === "single_urls"
) { ) {
documents = await generateCompletions(documents, this.extractorOptions, "markdown"); documents = await generateCompletions(
documents,
this.extractorOptions,
"markdown"
);
} }
if ( if (
(this.extractorOptions.mode === "llm-extraction-from-raw-html") && this.extractorOptions.mode === "llm-extraction-from-raw-html" &&
this.mode === "single_urls" this.mode === "single_urls"
) { ) {
documents = await generateCompletions(documents, this.extractorOptions, "raw-html"); documents = await generateCompletions(
documents,
this.extractorOptions,
"raw-html"
);
} }
return documents.concat(pdfDocuments).concat(docxDocuments); return documents.concat(pdfDocuments).concat(docxDocuments);
} }
@ -289,7 +317,28 @@ export class WebScraperDataProvider {
private async fetchPdfDocuments(pdfLinks: string[]): Promise<Document[]> { private async fetchPdfDocuments(pdfLinks: string[]): Promise<Document[]> {
return Promise.all( return Promise.all(
pdfLinks.map(async (pdfLink) => { pdfLinks.map(async (pdfLink) => {
const { content, pageStatusCode, pageError } = await fetchAndProcessPdf(pdfLink, this.pageOptions.parsePDF); const timer = Date.now();
const logInsertPromise = ScrapeEvents.insert(this.jobId, {
type: "scrape",
url: pdfLink,
worker: process.env.FLY_MACHINE_ID,
method: "pdf-scrape",
result: null,
});
const { content, pageStatusCode, pageError } = await fetchAndProcessPdf(
pdfLink,
this.pageOptions.parsePDF
);
const insertedLogId = await logInsertPromise;
ScrapeEvents.updateScrapeResult(insertedLogId, {
response_size: content.length,
success: !(pageStatusCode && pageStatusCode >= 400) && !!content && (content.trim().length >= 100),
error: pageError,
response_code: pageStatusCode,
time_taken: Date.now() - timer,
});
return { return {
content: content, content: content,
metadata: { sourceURL: pdfLink, pageStatusCode, pageError }, metadata: { sourceURL: pdfLink, pageStatusCode, pageError },
@ -300,11 +349,32 @@ export class WebScraperDataProvider {
  }

  private async fetchDocxDocuments(docxLinks: string[]): Promise<Document[]> {
    return Promise.all(
      docxLinks.map(async (docxLink) => {
        const timer = Date.now();
        const logInsertPromise = ScrapeEvents.insert(this.jobId, {
          type: "scrape",
          url: docxLink,
          worker: process.env.FLY_MACHINE_ID,
          method: "docx-scrape",
          result: null,
        });

        const { content, pageStatusCode, pageError } = await fetchAndProcessDocx(
          docxLink
        );
        const insertedLogId = await logInsertPromise;
        ScrapeEvents.updateScrapeResult(insertedLogId, {
          response_size: content.length,
          success: !(pageStatusCode && pageStatusCode >= 400) && !!content && (content.trim().length >= 100),
          error: pageError,
          response_code: pageStatusCode,
          time_taken: Date.now() - timer,
        });
        return {
          content,
          metadata: { sourceURL: docxLink, pageStatusCode, pageError },
          provider: "web-scraper",
        };
      })
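Both helpers above wrap the underlying fetch in the same instrumentation pattern. A minimal sketch of that shared pattern, assuming only the ScrapeEvents interface imported elsewhere in this commit; the `timedScrape` wrapper itself is hypothetical:

```ts
import { ScrapeEvents } from "../../lib/scrape-events"; // path as imported in this commit

// Sketch: log the scrape before it starts, then attach size/status/timing to that
// same log entry once the scrape resolves.
async function timedScrape(
  jobId: string,
  url: string,
  method: string,
  doScrape: () => Promise<{ content: string; pageStatusCode?: number; pageError?: string }>
) {
  const timer = Date.now();
  const logInsertPromise = ScrapeEvents.insert(jobId, {
    type: "scrape",
    url,
    worker: process.env.FLY_MACHINE_ID,
    method,
    result: null,
  });
  const { content, pageStatusCode, pageError } = await doScrape();
  const insertedLogId = await logInsertPromise;
  ScrapeEvents.updateScrapeResult(insertedLogId, {
    response_size: content.length,
    success: !(pageStatusCode && pageStatusCode >= 400) && !!content && content.trim().length >= 100,
    error: pageError,
    response_code: pageStatusCode,
    time_taken: Date.now() - timer,
  });
  return { content, pageStatusCode, pageError };
}
```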
@ -328,7 +398,7 @@ export class WebScraperDataProvider {
    documents: Document[],
    links: string[]
  ): Promise<Document[]> {
    // await this.setCachedDocuments(documents, links);
    documents = this.removeChildLinks(documents);
    return documents.splice(0, this.limit);
  }
@ -375,6 +445,10 @@ export class WebScraperDataProvider {
      const url = new URL(document.metadata.sourceURL);
      const path = url.pathname;

      if (!Array.isArray(this.excludes)) {
        this.excludes = this.excludes.split(',');
      }

      if (this.excludes.length > 0 && this.excludes[0] !== "") {
        // Check if the link should be excluded
        if (
@ -386,6 +460,10 @@ export class WebScraperDataProvider {
        }
      }

      if (!Array.isArray(this.includes)) {
        this.includes = this.includes.split(',');
      }

      if (this.includes.length > 0 && this.includes[0] !== "") {
        // Check if the link matches the include patterns, if any are specified
        if (this.includes.length > 0) {
@ -424,7 +502,7 @@ export class WebScraperDataProvider {
          ...document,
          childrenLinks: childrenLinks || [],
        }),
        60 * 60
      ); // 10 days
    }
  }
@ -433,7 +511,7 @@ export class WebScraperDataProvider {
    let documents: Document[] = [];
    for (const url of urls) {
      const normalizedUrl = this.normalizeUrl(url);
      Logger.debug(
        "Getting cached document for web-scraper-cache:" + normalizedUrl
      );
      const cachedDocumentString = await getValue(
@ -472,6 +550,7 @@ export class WebScraperDataProvider {
      throw new Error("Urls are required");
    }

    this.jobId = options.jobId;
    this.bullJobId = options.bullJobId;
    this.urls = options.urls;
    this.mode = options.mode;
@ -489,16 +568,28 @@ export class WebScraperDataProvider {
      includeHtml: false,
      replaceAllPathsWithAbsolutePaths: false,
      parsePDF: true,
      removeTags: [],
    };
    this.extractorOptions = options.extractorOptions ?? { mode: "markdown" };
    this.replaceAllPathsWithAbsolutePaths =
      options.crawlerOptions?.replaceAllPathsWithAbsolutePaths ??
      options.pageOptions?.replaceAllPathsWithAbsolutePaths ??
      false;

    if (typeof options.crawlerOptions?.excludes === 'string') {
      this.excludes = options.crawlerOptions?.excludes.split(',').filter((item) => item.trim() !== "");
    }

    if (typeof options.crawlerOptions?.includes === 'string') {
      this.includes = options.crawlerOptions?.includes.split(',').filter((item) => item.trim() !== "");
    }

    this.crawlerMode = options.crawlerOptions?.mode ?? "default";
    this.ignoreSitemap = options.crawlerOptions?.ignoreSitemap ?? false;
    this.allowBackwardCrawling =
      options.crawlerOptions?.allowBackwardCrawling ?? false;
    this.allowExternalContentLinks =
      options.crawlerOptions?.allowExternalContentLinks ?? false;

    // make sure all urls start with https://
    this.urls = this.urls.map((url) => {
@ -537,6 +628,34 @@ export class WebScraperDataProvider {
} }
return documents; return documents;
} }
private async getSitemapDataForSingleUrl(
baseUrl: string,
url: string,
timeout?: number
) {
const sitemapData = await fetchSitemapData(baseUrl, timeout);
if (sitemapData) {
const docInSitemapData = sitemapData.find(
(data) => this.normalizeUrl(data.loc) === this.normalizeUrl(url)
);
if (docInSitemapData) {
let sitemapDocData: Partial<SitemapEntry> = {};
if (docInSitemapData.changefreq) {
sitemapDocData.changefreq = docInSitemapData.changefreq;
}
if (docInSitemapData.priority) {
sitemapDocData.priority = Number(docInSitemapData.priority);
}
if (docInSitemapData.lastmod) {
sitemapDocData.lastmod = docInSitemapData.lastmod;
}
if (Object.keys(sitemapDocData).length !== 0) {
return sitemapDocData;
}
}
}
return null;
}
  generatesImgAltText = async (documents: Document[]): Promise<Document[]> => {
    await Promise.all(
      documents.map(async (document) => {
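With the change to setOptions above, `excludes` and `includes` may now arrive as comma-separated strings instead of arrays. A small standalone sketch of the normalization, using made-up pattern values:

```ts
// Sketch only: shows how a comma-separated excludes string becomes the array
// form used by the filtering logic. Entries are not trimmed, only empty or
// whitespace-only entries are dropped.
const rawExcludes: string | string[] = "blog/*, /admin/*, ";
const excludes = typeof rawExcludes === "string"
  ? rawExcludes.split(",").filter((item) => item.trim() !== "")
  : rawExcludes;
console.log(excludes); // ["blog/*", " /admin/*"]
```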

View File

@ -2,6 +2,7 @@ import axios from "axios";
import { logScrape } from "../../../services/logging/scrape_log"; import { logScrape } from "../../../services/logging/scrape_log";
import { fetchAndProcessPdf } from "../utils/pdfProcessor"; import { fetchAndProcessPdf } from "../utils/pdfProcessor";
import { universalTimeout } from "../global"; import { universalTimeout } from "../global";
import { Logger } from "../../../lib/logger";
/** /**
* Scrapes a URL with Axios * Scrapes a URL with Axios
@ -34,9 +35,7 @@ export async function scrapWithFetch(
    });

    if (response.status !== 200) {
      Logger.debug(`⛏️ Axios: Failed to fetch url: ${url} with status: ${response.status}`);
      logParams.error_message = response.statusText;
      logParams.response_code = response.status;
      return {
@ -63,10 +62,10 @@ export async function scrapWithFetch(
  } catch (error) {
    if (error.code === "ECONNABORTED") {
      logParams.error_message = "Request timed out";
      Logger.debug(`⛏️ Axios: Request timed out for ${url}`);
    } else {
      logParams.error_message = error.message || error;
      Logger.debug(`⛏️ Axios: Failed to fetch url: ${url} | Error: ${error}`);
    }
    return { content: "", pageStatusCode: null, pageError: logParams.error_message };
  } finally {

View File

@ -4,12 +4,14 @@ import { logScrape } from "../../../services/logging/scrape_log";
import { generateRequestParams } from "../single_url"; import { generateRequestParams } from "../single_url";
import { fetchAndProcessPdf } from "../utils/pdfProcessor"; import { fetchAndProcessPdf } from "../utils/pdfProcessor";
import { universalTimeout } from "../global"; import { universalTimeout } from "../global";
import { Logger } from "../../../lib/logger";
/** /**
* Scrapes a URL with Fire-Engine * Scrapes a URL with Fire-Engine
* @param url The URL to scrape * @param url The URL to scrape
* @param waitFor The time to wait for the page to load * @param waitFor The time to wait for the page to load
* @param screenshot Whether to take a screenshot * @param screenshot Whether to take a screenshot
* @param fullPageScreenshot Whether to take a full page screenshot
* @param pageOptions The options for the page * @param pageOptions The options for the page
* @param headers The headers to send with the request * @param headers The headers to send with the request
* @param options The options for the request * @param options The options for the request
@ -19,6 +21,7 @@ export async function scrapWithFireEngine({
url, url,
waitFor = 0, waitFor = 0,
screenshot = false, screenshot = false,
fullPageScreenshot = false,
pageOptions = { parsePDF: true }, pageOptions = { parsePDF: true },
fireEngineOptions = {}, fireEngineOptions = {},
headers, headers,
@ -27,6 +30,7 @@ export async function scrapWithFireEngine({
url: string; url: string;
waitFor?: number; waitFor?: number;
screenshot?: boolean; screenshot?: boolean;
fullPageScreenshot?: boolean;
pageOptions?: { scrollXPaths?: string[]; parsePDF?: boolean }; pageOptions?: { scrollXPaths?: string[]; parsePDF?: boolean };
fireEngineOptions?: FireEngineOptions; fireEngineOptions?: FireEngineOptions;
headers?: Record<string, string>; headers?: Record<string, string>;
@ -46,16 +50,24 @@ export async function scrapWithFireEngine({
  try {
    const reqParams = await generateRequestParams(url);
    const waitParam = reqParams["params"]?.wait ?? waitFor;
    const engineParam = reqParams["params"]?.engine ?? reqParams["params"]?.fireEngineOptions?.engine ?? fireEngineOptions?.engine ?? "playwright";
    const screenshotParam = reqParams["params"]?.screenshot ?? screenshot;
    const fullPageScreenshotParam = reqParams["params"]?.fullPageScreenshot ?? fullPageScreenshot;
    const fireEngineOptionsParam : FireEngineOptions = reqParams["params"]?.fireEngineOptions ?? fireEngineOptions;

    let endpoint = "/scrape";

    if(options?.endpoint === "request") {
      endpoint = "/request";
    }

    let engine = engineParam; // do we want fireEngineOptions as first choice?

    Logger.info(
      `⛏️ Fire-Engine (${engine}): Scraping ${url} | params: { wait: ${waitParam}, screenshot: ${screenshotParam}, fullPageScreenshot: ${fullPageScreenshot}, method: ${fireEngineOptionsParam?.method ?? "null"} }`
    );

    const response = await axios.post(
      process.env.FIRE_ENGINE_BETA_URL + endpoint,
@ -63,6 +75,7 @@ export async function scrapWithFireEngine({
url: url, url: url,
wait: waitParam, wait: waitParam,
screenshot: screenshotParam, screenshot: screenshotParam,
fullPageScreenshot: fullPageScreenshotParam,
headers: headers, headers: headers,
pageOptions: pageOptions, pageOptions: pageOptions,
...fireEngineOptionsParam, ...fireEngineOptionsParam,
@ -76,15 +89,15 @@ export async function scrapWithFireEngine({
    );

    if (response.status !== 200) {
      Logger.debug(
        `⛏️ Fire-Engine (${engine}): Failed to fetch url: ${url} \t status: ${response.status}`
      );

      logParams.error_message = response.data?.pageError;
      logParams.response_code = response.data?.pageStatusCode;

      if(response.data && response.data?.pageStatusCode !== 200) {
        Logger.debug(`⛏️ Fire-Engine (${engine}): Failed to fetch url: ${url} \t status: ${response.status}`);
      }

      return {
@ -122,10 +135,10 @@ export async function scrapWithFireEngine({
    }
  } catch (error) {
    if (error.code === "ECONNABORTED") {
      Logger.debug(`⛏️ Fire-Engine: Request timed out for ${url}`);
      logParams.error_message = "Request timed out";
    } else {
      Logger.debug(`⛏️ Fire-Engine: Failed to fetch url: ${url} | Error: ${error}`);
      logParams.error_message = error.message || error;
    }
    return { html: "", screenshot: "", pageStatusCode: null, pageError: logParams.error_message };
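The signature above now accepts `fullPageScreenshot` and an explicit `fireEngineOptions.engine`. A hedged usage sketch; option names mirror this diff, while the URL and values are illustrative:

```ts
import { scrapWithFireEngine } from "./scrapers/fireEngine"; // path as used elsewhere in this commit

// Sketch of a call site; assumes FIRE_ENGINE_BETA_URL is configured.
async function exampleFireEngineScrape() {
  const { html, screenshot, pageStatusCode, pageError } = await scrapWithFireEngine({
    url: "https://example.com",            // hypothetical target
    waitFor: 1000,
    screenshot: false,
    fullPageScreenshot: true,              // new option introduced in this change
    pageOptions: { parsePDF: true },
    fireEngineOptions: { engine: "chrome-cdp" },
  });
  return { html, screenshot, pageStatusCode, pageError };
}
```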

View File

@ -3,6 +3,7 @@ import { logScrape } from "../../../services/logging/scrape_log";
import { generateRequestParams } from "../single_url"; import { generateRequestParams } from "../single_url";
import { fetchAndProcessPdf } from "../utils/pdfProcessor"; import { fetchAndProcessPdf } from "../utils/pdfProcessor";
import { universalTimeout } from "../global"; import { universalTimeout } from "../global";
import { Logger } from "../../../lib/logger";
/** /**
* Scrapes a URL with Playwright * Scrapes a URL with Playwright
@ -51,8 +52,8 @@ export async function scrapWithPlaywright(
); );
if (response.status !== 200) { if (response.status !== 200) {
console.error( Logger.debug(
`[Playwright] Error fetching url: ${url} with status: ${response.status}` `⛏️ Playwright: Failed to fetch url: ${url} | status: ${response.status}, error: ${response.data?.pageError}`
); );
logParams.error_message = response.data?.pageError; logParams.error_message = response.data?.pageError;
logParams.response_code = response.data?.pageStatusCode; logParams.response_code = response.data?.pageStatusCode;
@ -86,8 +87,8 @@ export async function scrapWithPlaywright(
}; };
} catch (jsonError) { } catch (jsonError) {
logParams.error_message = jsonError.message || jsonError; logParams.error_message = jsonError.message || jsonError;
console.error( Logger.debug(
`[Playwright] Error parsing JSON response for url: ${url} -> ${jsonError}` `⛏️ Playwright: Error parsing JSON response for url: ${url} | Error: ${jsonError}`
); );
return { content: "", pageStatusCode: null, pageError: logParams.error_message }; return { content: "", pageStatusCode: null, pageError: logParams.error_message };
} }
@ -95,10 +96,10 @@ export async function scrapWithPlaywright(
} catch (error) { } catch (error) {
if (error.code === "ECONNABORTED") { if (error.code === "ECONNABORTED") {
logParams.error_message = "Request timed out"; logParams.error_message = "Request timed out";
console.log(`[Playwright] Request timed out for ${url}`); Logger.debug(`⛏️ Playwright: Request timed out for ${url}`);
} else { } else {
logParams.error_message = error.message || error; logParams.error_message = error.message || error;
console.error(`[Playwright] Error fetching url: ${url} -> ${error}`); Logger.debug(`⛏️ Playwright: Failed to fetch url: ${url} | Error: ${error}`);
} }
return { content: "", pageStatusCode: null, pageError: logParams.error_message }; return { content: "", pageStatusCode: null, pageError: logParams.error_message };
} finally { } finally {

View File

@ -3,6 +3,7 @@ import { generateRequestParams } from "../single_url";
import { fetchAndProcessPdf } from "../utils/pdfProcessor"; import { fetchAndProcessPdf } from "../utils/pdfProcessor";
import { universalTimeout } from "../global"; import { universalTimeout } from "../global";
import { ScrapingBeeClient } from "scrapingbee"; import { ScrapingBeeClient } from "scrapingbee";
import { Logger } from "../../../lib/logger";
/** /**
* Scrapes a URL with ScrapingBee * Scrapes a URL with ScrapingBee
@ -56,8 +57,8 @@ export async function scrapWithScrapingBee(
text = decoder.decode(response.data); text = decoder.decode(response.data);
logParams.success = true; logParams.success = true;
} catch (decodeError) { } catch (decodeError) {
console.error( Logger.debug(
`[ScrapingBee][c] Error decoding response data for url: ${url} -> ${decodeError}` `⛏️ ScrapingBee: Error decoding response data for url: ${url} | Error: ${decodeError}`
); );
logParams.error_message = decodeError.message || decodeError; logParams.error_message = decodeError.message || decodeError;
} }
@ -72,7 +73,7 @@ export async function scrapWithScrapingBee(
}; };
} }
} catch (error) { } catch (error) {
console.error(`[ScrapingBee][c] Error fetching url: ${url} -> ${error}`); Logger.debug(`⛏️ ScrapingBee: Error fetching url: ${url} | Error: ${error}`);
logParams.error_message = error.message || error; logParams.error_message = error.message || error;
logParams.response_code = error.response?.status; logParams.response_code = error.response?.status;
return { return {

View File

@ -17,16 +17,20 @@ import { scrapWithFireEngine } from "./scrapers/fireEngine";
import { scrapWithPlaywright } from "./scrapers/playwright"; import { scrapWithPlaywright } from "./scrapers/playwright";
import { scrapWithScrapingBee } from "./scrapers/scrapingBee"; import { scrapWithScrapingBee } from "./scrapers/scrapingBee";
import { extractLinks } from "./utils/utils"; import { extractLinks } from "./utils/utils";
import { Logger } from "../../lib/logger";
import { ScrapeEvents } from "../../lib/scrape-events";
import { clientSideError } from "../../strings";
dotenv.config(); dotenv.config();
export const baseScrapers = [
  "fire-engine",
  "fire-engine;chrome-cdp",
  "scrapingBee",
  process.env.USE_DB_AUTHENTICATION ? undefined : "playwright",
  "scrapingBeeLoad",
  "fetch",
].filter(Boolean);
export async function generateRequestParams( export async function generateRequestParams(
url: string, url: string,
@ -47,7 +51,7 @@ export async function generateRequestParams(
return defaultParams; return defaultParams;
} }
} catch (error) { } catch (error) {
console.error(`Error generating URL key: ${error}`); Logger.error(`Error generating URL key: ${error}`);
return defaultParams; return defaultParams;
} }
} }
@ -71,6 +75,8 @@ function getScrapingFallbackOrder(
return !!process.env.SCRAPING_BEE_API_KEY; return !!process.env.SCRAPING_BEE_API_KEY;
case "fire-engine": case "fire-engine":
return !!process.env.FIRE_ENGINE_BETA_URL; return !!process.env.FIRE_ENGINE_BETA_URL;
case "fire-engine;chrome-cdp":
return !!process.env.FIRE_ENGINE_BETA_URL;
case "playwright": case "playwright":
return !!process.env.PLAYWRIGHT_MICROSERVICE_URL; return !!process.env.PLAYWRIGHT_MICROSERVICE_URL;
default: default:
@ -79,21 +85,22 @@ function getScrapingFallbackOrder(
}); });
  let defaultOrder = [
    !process.env.USE_DB_AUTHENTICATION ? undefined : "fire-engine",
    !process.env.USE_DB_AUTHENTICATION ? undefined : "fire-engine;chrome-cdp",
    "scrapingBee",
    process.env.USE_DB_AUTHENTICATION ? undefined : "playwright",
    "scrapingBeeLoad",
    "fetch",
  ].filter(Boolean);

  if (isWaitPresent || isScreenshotPresent || isHeadersPresent) {
    defaultOrder = [
      "fire-engine",
      process.env.USE_DB_AUTHENTICATION ? undefined : "playwright",
      ...defaultOrder.filter(
        (scraper) => scraper !== "fire-engine" && scraper !== "playwright"
      ),
    ].filter(Boolean);
  }

  const filteredDefaultOrder = defaultOrder.filter(
@ -113,6 +120,7 @@ function getScrapingFallbackOrder(
export async function scrapSingleUrl( export async function scrapSingleUrl(
jobId: string,
urlToScrap: string, urlToScrap: string,
pageOptions: PageOptions = { pageOptions: PageOptions = {
onlyMainContent: true, onlyMainContent: true,
@ -120,6 +128,7 @@ export async function scrapSingleUrl(
includeRawHtml: false, includeRawHtml: false,
waitFor: 0, waitFor: 0,
screenshot: false, screenshot: false,
fullPageScreenshot: false,
headers: undefined, headers: undefined,
}, },
extractorOptions: ExtractorOptions = { extractorOptions: ExtractorOptions = {
@ -139,16 +148,36 @@ export async function scrapSingleUrl(
    metadata: { pageStatusCode?: number; pageError?: string | null };
  } = { text: "", screenshot: "", metadata: {} };
  let screenshot = "";

  const timer = Date.now();
  const logInsertPromise = ScrapeEvents.insert(jobId, {
    type: "scrape",
    url,
    worker: process.env.FLY_MACHINE_ID,
    method,
    result: null,
  });

  switch (method) {
    case "fire-engine":
    case "fire-engine;chrome-cdp":

      let engine: "playwright" | "chrome-cdp" | "tlsclient" = "playwright";
      if(method === "fire-engine;chrome-cdp"){
        engine = "chrome-cdp";
      }

      if (process.env.FIRE_ENGINE_BETA_URL) {
        const response = await scrapWithFireEngine({
          url,
          waitFor: pageOptions.waitFor,
          screenshot: pageOptions.screenshot,
          fullPageScreenshot: pageOptions.fullPageScreenshot,
          pageOptions: pageOptions,
          headers: pageOptions.headers,
          fireEngineOptions: {
            engine: engine,
          }
        });
        scraperResponse.text = response.html;
        scraperResponse.screenshot = response.screenshot;
@ -239,8 +268,19 @@ export async function scrapSingleUrl(
  }

  //* TODO: add an optional to return markdown or structured/extracted content
  let cleanedHtml = removeUnwantedElements(scraperResponse.text, pageOptions);
  const text = await parseMarkdown(cleanedHtml);

  const insertedLogId = await logInsertPromise;
  ScrapeEvents.updateScrapeResult(insertedLogId, {
    response_size: scraperResponse.text.length,
    success: !(scraperResponse.metadata.pageStatusCode && scraperResponse.metadata.pageStatusCode >= 400) && !!text && (text.trim().length >= 100),
    error: scraperResponse.metadata.pageError,
    response_code: scraperResponse.metadata.pageStatusCode,
    time_taken: Date.now() - timer,
  });

  return {
    text,
    html: cleanedHtml,
    rawHtml: scraperResponse.text,
    screenshot: scraperResponse.screenshot,
@ -262,19 +302,19 @@ export async function scrapSingleUrl(
  try {
    urlKey = new URL(urlToScrap).hostname.replace(/^www\./, "");
  } catch (error) {
    Logger.error(`Invalid URL key, trying: ${urlToScrap}`);
  }
  const defaultScraper = urlSpecificParams[urlKey]?.defaultScraper ?? "";
  const scrapersInOrder = getScrapingFallbackOrder(
    defaultScraper,
    pageOptions && pageOptions.waitFor && pageOptions.waitFor > 0,
    pageOptions && (pageOptions.screenshot || pageOptions.fullPageScreenshot) && (pageOptions.screenshot === true || pageOptions.fullPageScreenshot === true),
    pageOptions && pageOptions.headers && pageOptions.headers !== undefined
  );

  for (const scraper of scrapersInOrder) {
    // If exists text coming from crawler, use it
    if (existingHtml && existingHtml.trim().length >= 100 && !existingHtml.includes(clientSideError)) {
      let cleanedHtml = removeUnwantedElements(existingHtml, pageOptions);
      text = await parseMarkdown(cleanedHtml);
      html = cleanedHtml;
@ -296,12 +336,18 @@ export async function scrapSingleUrl(
      pageError = undefined;
    }

    if (text && text.trim().length >= 100) {
      Logger.debug(`⛏️ ${scraper}: Successfully scraped ${urlToScrap} with text length >= 100, breaking`);
      break;
    }
    if (pageStatusCode && pageStatusCode == 404) {
      Logger.debug(`⛏️ ${scraper}: Successfully scraped ${urlToScrap} with status code 404, breaking`);
      break;
    }
    // const nextScraperIndex = scrapersInOrder.indexOf(scraper) + 1;
    // if (nextScraperIndex < scrapersInOrder.length) {
    //   Logger.debug(`⛏️ ${scraper} Failed to fetch URL: ${urlToScrap} with status: ${pageStatusCode}, error: ${pageError} | Falling back to ${scrapersInOrder[nextScraperIndex]}`);
    // }
  }
if (!text) { if (!text) {
@ -357,7 +403,12 @@ export async function scrapSingleUrl(
    return document;
  } catch (error) {
    Logger.debug(`⛏️ Error: ${error.message} - Failed to fetch URL: ${urlToScrap}`);
    ScrapeEvents.insert(jobId, {
      type: "error",
      message: typeof error === "string" ? error : typeof error.message === "string" ? error.message : JSON.stringify(error),
      stack: error.stack,
    });
    return {
      content: "",
      markdown: "",

View File

@ -2,6 +2,8 @@ import axios from "axios";
import { axiosTimeout } from "../../lib/timeout"; import { axiosTimeout } from "../../lib/timeout";
import { parseStringPromise } from "xml2js"; import { parseStringPromise } from "xml2js";
import { scrapWithFireEngine } from "./scrapers/fireEngine"; import { scrapWithFireEngine } from "./scrapers/fireEngine";
import { WebCrawler } from "./crawler";
import { Logger } from "../../lib/logger";
export async function getLinksFromSitemap( export async function getLinksFromSitemap(
{ {
@ -17,15 +19,15 @@ export async function getLinksFromSitemap(
  try {
    let content: string;
    try {
      if (mode === 'axios' || process.env.FIRE_ENGINE_BETA_URL === '') {
        const response = await axios.get(sitemapUrl, { timeout: axiosTimeout });
        content = response.data;
      } else if (mode === 'fire-engine') {
        const response = await scrapWithFireEngine({ url: sitemapUrl, fireEngineOptions: { engine:"tlsclient", disableJsDom: true, mobileProxy: true } });
        content = response.html;
      }
    } catch (error) {
      Logger.error(`Request failed for ${sitemapUrl}: ${error.message}`);

      return allUrls;
    }
@ -41,22 +43,22 @@ export async function getLinksFromSitemap(
} }
    } else if (root && root.url) {
      for (const url of root.url) {
        if (url.loc && url.loc.length > 0 && !WebCrawler.prototype.isFile(url.loc[0])) {
          allUrls.push(url.loc[0]);
        }
      }
    }
  } catch (error) {
    Logger.debug(`Error processing sitemapUrl: ${sitemapUrl} | Error: ${error.message}`);
  }

  return allUrls;
}

export const fetchSitemapData = async (url: string, timeout?: number): Promise<SitemapEntry[] | null> => {
  const sitemapUrl = url.endsWith("/sitemap.xml") ? url : `${url}/sitemap.xml`;
  try {
    const response = await axios.get(sitemapUrl, { timeout: timeout || axiosTimeout });
    if (response.status === 200) {
      const xml = response.data;
      const parsedXml = await parseStringPromise(xml);
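fetchSitemapData now accepts an optional timeout in milliseconds and falls back to axiosTimeout when it is omitted. A brief usage sketch; the URL and path are illustrative:

```ts
// Sketch: look up sitemap metadata for one page, as getSitemapDataForSingleUrl does
// with a 1500 ms timeout elsewhere in this commit.
async function exampleSitemapLookup() {
  const entries = await fetchSitemapData("https://example.com", 1500);
  const entry = entries?.find((e) => e.loc === "https://example.com/pricing");
  return { lastmod: entry?.lastmod, changefreq: entry?.changefreq, priority: entry?.priority };
}
```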

View File

@ -1,3 +1,4 @@
import { Logger } from '../../../../lib/logger';
import { isUrlBlocked } from '../blocklist'; import { isUrlBlocked } from '../blocklist';
describe('isUrlBlocked', () => { describe('isUrlBlocked', () => {
@ -19,7 +20,7 @@ describe('isUrlBlocked', () => {
blockedUrls.forEach(url => { blockedUrls.forEach(url => {
if (!isUrlBlocked(url)) { if (!isUrlBlocked(url)) {
console.log(`URL not blocked: ${url}`); Logger.debug(`URL not blocked: ${url}`);
} }
expect(isUrlBlocked(url)).toBe(true); expect(isUrlBlocked(url)).toBe(true);
}); });

View File

@ -1,3 +1,5 @@
import { Logger } from "../../../lib/logger";
const socialMediaBlocklist = [ const socialMediaBlocklist = [
'facebook.com', 'facebook.com',
'x.com', 'x.com',
@ -59,7 +61,7 @@ export function isUrlBlocked(url: string): boolean {
return isBlocked; return isBlocked;
} catch (e) { } catch (e) {
// If an error occurs (e.g., invalid URL), return false // If an error occurs (e.g., invalid URL), return false
console.error(`Error processing URL: ${url}`, e); Logger.error(`Error parsing the following URL: ${url}`);
return false; return false;
} }
} }

View File

@ -22,7 +22,7 @@ export const urlSpecificParams = {
}, },
}, },
"support.greenpay.me":{ "support.greenpay.me":{
defaultScraper: "playwright", defaultScraper: "fire-engine",
params: { params: {
wait_browser: "networkidle2", wait_browser: "networkidle2",
block_resources: false, block_resources: false,
@ -43,7 +43,7 @@ export const urlSpecificParams = {
}, },
}, },
"docs.pdw.co":{ "docs.pdw.co":{
defaultScraper: "playwright", defaultScraper: "fire-engine",
params: { params: {
wait_browser: "networkidle2", wait_browser: "networkidle2",
block_resources: false, block_resources: false,
@ -83,7 +83,7 @@ export const urlSpecificParams = {
}, },
}, },
"developers.notion.com":{ "developers.notion.com":{
defaultScraper: "playwright", defaultScraper: "fire-engine",
params: { params: {
wait_browser: "networkidle2", wait_browser: "networkidle2",
block_resources: false, block_resources: false,
@ -103,7 +103,7 @@ export const urlSpecificParams = {
}, },
}, },
"docs2.hubitat.com":{ "docs2.hubitat.com":{
defaultScraper: "playwright", defaultScraper: "fire-engine",
params: { params: {
wait_browser: "networkidle2", wait_browser: "networkidle2",
block_resources: false, block_resources: false,
@ -153,7 +153,7 @@ export const urlSpecificParams = {
}, },
}, },
"help.salesforce.com":{ "help.salesforce.com":{
defaultScraper: "playwright", defaultScraper: "fire-engine",
params: { params: {
wait_browser: "networkidle2", wait_browser: "networkidle2",
block_resources: false, block_resources: false,
@ -175,6 +175,7 @@ export const urlSpecificParams = {
"firecrawl.dev":{ "firecrawl.dev":{
defaultScraper: "fire-engine", defaultScraper: "fire-engine",
params: { params: {
engine: "playwright",
headers: { headers: {
"User-Agent": "User-Agent":
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
@ -202,4 +203,41 @@ export const urlSpecificParams = {
}, },
}, },
}, },
"notion.com":{
defaultScraper: "fire-engine",
params: {
wait_browser: "networkidle2",
block_resources: false,
wait: 2000,
engine: "playwright",
}
},
"mendable.ai":{
defaultScraper: "fire-engine",
params:{
fireEngineOptions:{
mobileProxy: true,
method: "get",
engine: "chrome-cdp",
},
},
},
"developer.apple.com":{
defaultScraper: "fire-engine",
params:{
engine: "playwright",
wait: 2000,
fireEngineOptions: {
blockMedia: false,
}
},
},
"amazon.com":{
defaultScraper: "fire-engine",
params:{
fireEngineOptions:{
engine: "chrome-cdp",
},
},
},
};
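These per-site overrides are resolved by hostname. A short sketch of the lookup, mirroring the logic in single_url.ts shown earlier in this diff; the sample URL is illustrative:

```ts
// Sketch: the hostname, stripped of a leading "www.", is used as the key into urlSpecificParams.
const urlKey = new URL("https://www.notion.com/product").hostname.replace(/^www\./, "");
const defaultScraper = urlSpecificParams[urlKey]?.defaultScraper ?? "";
// urlKey === "notion.com", defaultScraper === "fire-engine" with the entries above
```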

View File

@ -4,38 +4,76 @@ import { createWriteStream } from "node:fs";
import path from "path"; import path from "path";
import os from "os"; import os from "os";
import mammoth from "mammoth"; import mammoth from "mammoth";
import { Logger } from "../../../lib/logger";
export async function fetchAndProcessDocx(url: string): Promise<{ content: string; pageStatusCode: number; pageError: string }> {
  let tempFilePath = '';
  let pageStatusCode = 200;
  let pageError = '';
  let content = '';

  try {
    const downloadResult = await downloadDocx(url);
    tempFilePath = downloadResult.tempFilePath;
    pageStatusCode = downloadResult.pageStatusCode;
    pageError = downloadResult.pageError;
    content = await processDocxToText(tempFilePath);
  } catch (error) {
    Logger.error(`Failed to fetch and process DOCX: ${error.message}`);
    pageStatusCode = 500;
    pageError = error.message;
    content = '';
  } finally {
    if (tempFilePath) {
      fs.unlinkSync(tempFilePath); // Clean up the temporary file
    }
  }

  return { content, pageStatusCode, pageError };
}

async function downloadDocx(url: string): Promise<{ tempFilePath: string; pageStatusCode: number; pageError: string }> {
  try {
    const response = await axios({
      url,
      method: "GET",
      responseType: "stream",
    });

    const tempFilePath = path.join(os.tmpdir(), `tempDocx-${Date.now()}.docx`);
    const writer = createWriteStream(tempFilePath);

    response.data.pipe(writer);

    return new Promise((resolve, reject) => {
      writer.on("finish", () => resolve({ tempFilePath, pageStatusCode: response.status, pageError: response.statusText != "OK" ? response.statusText : undefined }));
      writer.on("error", () => {
        Logger.error('Failed to write DOCX file to disk');
        reject(new Error('Failed to write DOCX file to disk'));
      });
    });
  } catch (error) {
    Logger.error(`Failed to download DOCX: ${error.message}`);
    return { tempFilePath: "", pageStatusCode: 500, pageError: error.message };
  }
}

export async function processDocxToText(filePath: string): Promise<string> {
  try {
    const content = await extractTextFromDocx(filePath);
    return content;
  } catch (error) {
    Logger.error(`Failed to process DOCX to text: ${error.message}`);
    return "";
  }
}

async function extractTextFromDocx(filePath: string): Promise<string> {
  try {
    const result = await mammoth.extractRawText({ path: filePath });
    return result.value;
  } catch (error) {
    Logger.error(`Failed to extract text from DOCX: ${error.message}`);
    return "";
  }
}

View File

@ -1,5 +1,6 @@
import Anthropic from '@anthropic-ai/sdk'; import Anthropic from '@anthropic-ai/sdk';
import axios from 'axios'; import axios from 'axios';
import { Logger } from '../../../lib/logger';
export async function getImageDescription( export async function getImageDescription(
imageUrl: string, imageUrl: string,
@ -82,7 +83,7 @@ export async function getImageDescription(
} }
} }
} catch (error) { } catch (error) {
console.error("Error generating image alt text:", error?.message); Logger.error(`Error generating image alt text: ${error}`);
return ""; return "";
} }
} }

View File

@ -1,4 +1,6 @@
import { CheerioAPI } from "cheerio"; import { CheerioAPI } from "cheerio";
import { Logger } from "../../../lib/logger";
interface Metadata { interface Metadata {
title?: string; title?: string;
description?: string; description?: string;
@ -105,7 +107,7 @@ export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
dctermsCreated = soup('meta[name="dcterms.created"]').attr("content") || null; dctermsCreated = soup('meta[name="dcterms.created"]').attr("content") || null;
} catch (error) { } catch (error) {
console.error("Error extracting metadata:", error); Logger.error(`Error extracting metadata: ${error}`);
} }
return { return {

View File

@ -7,14 +7,20 @@ import pdf from "pdf-parse";
import path from "path"; import path from "path";
import os from "os"; import os from "os";
import { axiosTimeout } from "../../../lib/timeout"; import { axiosTimeout } from "../../../lib/timeout";
import { Logger } from "../../../lib/logger";
dotenv.config(); dotenv.config();
export async function fetchAndProcessPdf(url: string, parsePDF: boolean): Promise<{ content: string, pageStatusCode?: number, pageError?: string }> {
  try {
    const { tempFilePath, pageStatusCode, pageError } = await downloadPdf(url);
    const content = await processPdfToText(tempFilePath, parsePDF);
    fs.unlinkSync(tempFilePath); // Clean up the temporary file
    return { content, pageStatusCode, pageError };
  } catch (error) {
    Logger.error(`Failed to fetch and process PDF: ${error.message}`);
    return { content: "", pageStatusCode: 500, pageError: error.message };
  }
}
async function downloadPdf(url: string): Promise<{ tempFilePath: string, pageStatusCode?: number, pageError?: string }> { async function downloadPdf(url: string): Promise<{ tempFilePath: string, pageStatusCode?: number, pageError?: string }> {
@ -39,6 +45,7 @@ export async function processPdfToText(filePath: string, parsePDF: boolean): Pro
let content = ""; let content = "";
if (process.env.LLAMAPARSE_API_KEY && parsePDF) { if (process.env.LLAMAPARSE_API_KEY && parsePDF) {
Logger.debug("Processing pdf document w/ LlamaIndex");
const apiKey = process.env.LLAMAPARSE_API_KEY; const apiKey = process.env.LLAMAPARSE_API_KEY;
const headers = { const headers = {
Authorization: `Bearer ${apiKey}`, Authorization: `Bearer ${apiKey}`,
@ -69,7 +76,6 @@ export async function processPdfToText(filePath: string, parsePDF: boolean): Pro
let attempt = 0; let attempt = 0;
const maxAttempts = 10; // Maximum number of attempts const maxAttempts = 10; // Maximum number of attempts
let resultAvailable = false; let resultAvailable = false;
while (attempt < maxAttempts && !resultAvailable) { while (attempt < maxAttempts && !resultAvailable) {
try { try {
resultResponse = await axios.get(resultUrl, { headers, timeout: (axiosTimeout * 2) }); resultResponse = await axios.get(resultUrl, { headers, timeout: (axiosTimeout * 2) });
@ -81,31 +87,54 @@ export async function processPdfToText(filePath: string, parsePDF: boolean): Pro
await new Promise((resolve) => setTimeout(resolve, 500)); // Wait for 0.5 seconds await new Promise((resolve) => setTimeout(resolve, 500)); // Wait for 0.5 seconds
} }
} catch (error) { } catch (error) {
console.error("Error fetching result w/ LlamaIndex"); Logger.debug("Error fetching result w/ LlamaIndex");
attempt++; attempt++;
if (attempt >= maxAttempts) {
Logger.error("Max attempts reached, unable to fetch result.");
break; // Exit the loop if max attempts are reached
}
await new Promise((resolve) => setTimeout(resolve, 500)); // Wait for 0.5 seconds before retrying await new Promise((resolve) => setTimeout(resolve, 500)); // Wait for 0.5 seconds before retrying
// You may want to handle specific errors differently // You may want to handle specific errors differently
} }
} }
      if (!resultAvailable) {
        try {
          content = await processPdf(filePath);
        } catch (error) {
          Logger.error(`Failed to process PDF: ${error}`);
          content = "";
        }
      }

      content = resultResponse.data[resultType];
    } catch (error) {
      Logger.debug("Error processing pdf document w/ LlamaIndex(2)");
      content = await processPdf(filePath);
    }
  } else if (parsePDF) {
    try {
      content = await processPdf(filePath);
    } catch (error) {
      Logger.error(`Failed to process PDF: ${error}`);
      content = "";
    }
  } else {
    try {
      content = fs.readFileSync(filePath, "utf-8");
    } catch (error) {
      Logger.error(`Failed to read PDF file: ${error}`);
      content = "";
    }
  }
  return content;
}
async function processPdf(file: string) {
  try {
    const fileContent = fs.readFileSync(file);
    const data = await pdf(fileContent);
    return data.text;
  } catch (error) {
    throw error;
  }
}
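With the added try/catch in fetchAndProcessPdf, a failed download or parse now resolves to an empty-content result with a 500 status instead of throwing. A brief usage sketch; the URL is illustrative and Logger is the class imported in this file:

```ts
// Sketch of a call site after this change.
async function examplePdfScrape() {
  const { content, pageStatusCode, pageError } = await fetchAndProcessPdf(
    "https://example.com/whitepaper.pdf",
    true // parsePDF: LlamaParse when LLAMAPARSE_API_KEY is set, pdf-parse otherwise
  );
  if (pageError) {
    Logger.error(`PDF scrape failed (${pageStatusCode}): ${pageError}`);
  }
  return content;
}
```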

View File

@ -1,3 +1,4 @@
import { Logger } from "../../../lib/logger";
import { Document } from "../../../lib/entities"; import { Document } from "../../../lib/entities";
export const replacePathsWithAbsolutePaths = (documents: Document[]): Document[] => { export const replacePathsWithAbsolutePaths = (documents: Document[]): Document[] => {
@ -6,13 +7,13 @@ export const replacePathsWithAbsolutePaths = (documents: Document[]): Document[]
const baseUrl = new URL(document.metadata.sourceURL).origin; const baseUrl = new URL(document.metadata.sourceURL).origin;
const paths = const paths =
document.content.match( document.content.match(
/(!?\[.*?\])\(((?:[^()]+|\((?:[^()]+|\([^()]*\))*\))*)\)|href="([^"]+)"/g /!?\[.*?\]\(.*?\)|href=".+?"/g
) || []; ) || [];
paths.forEach((path: string) => { paths.forEach((path: string) => {
try { try {
const isImage = path.startsWith("!"); const isImage = path.startsWith("!");
let matchedUrl = path.match(/\(([^)]+)\)/) || path.match(/href="([^"]+)"/); let matchedUrl = path.match(/\((.*?)\)/) || path.match(/href="([^"]+)"/);
let url = matchedUrl[1]; let url = matchedUrl[1];
if (!url.startsWith("data:") && !url.startsWith("http")) { if (!url.startsWith("data:") && !url.startsWith("http")) {
@ -39,7 +40,7 @@ export const replacePathsWithAbsolutePaths = (documents: Document[]): Document[]
return documents; return documents;
} catch (error) { } catch (error) {
console.error("Error replacing paths with absolute paths", error); Logger.debug(`Error replacing paths with absolute paths: ${error}`);
return documents; return documents;
} }
}; };
@ -50,11 +51,11 @@ export const replaceImgPathsWithAbsolutePaths = (documents: Document[]): Documen
const baseUrl = new URL(document.metadata.sourceURL).origin; const baseUrl = new URL(document.metadata.sourceURL).origin;
const images = const images =
document.content.match( document.content.match(
/!\[.*?\]\(((?:[^()]+|\((?:[^()]+|\([^()]*\))*\))*)\)/g /!\[.*?\]\(.*?\)/g
) || []; ) || [];
images.forEach((image: string) => { images.forEach((image: string) => {
let imageUrl = image.match(/\(([^)]+)\)/)[1]; let imageUrl = image.match(/\((.*?)\)/)[1];
let altText = image.match(/\[(.*?)\]/)[1]; let altText = image.match(/\[(.*?)\]/)[1];
if (!imageUrl.startsWith("data:image")) { if (!imageUrl.startsWith("data:image")) {
@ -78,7 +79,7 @@ export const replaceImgPathsWithAbsolutePaths = (documents: Document[]): Documen
return documents; return documents;
} catch (error) { } catch (error) {
console.error("Error replacing img paths with absolute paths", error); Logger.error(`Error replacing img paths with absolute paths: ${error}`);
return documents; return documents;
} }
}; };

View File

@ -1,5 +1,6 @@
import axios from "axios"; import axios from "axios";
import * as cheerio from "cheerio"; import * as cheerio from "cheerio";
import { Logger } from "../../../lib/logger";
export async function attemptScrapWithRequests( export async function attemptScrapWithRequests(
@ -9,13 +10,13 @@ export async function attemptScrapWithRequests(
const response = await axios.get(urlToScrap, { timeout: 15000 }); const response = await axios.get(urlToScrap, { timeout: 15000 });
if (!response.data) { if (!response.data) {
console.log("Failed normal requests as well"); Logger.debug("Failed normal requests as well");
return null; return null;
} }
return response.data; return response.data;
} catch (error) { } catch (error) {
console.error(`Error in attemptScrapWithRequests: ${error}`); Logger.debug(`Error in attemptScrapWithRequests: ${error}`);
return null; return null;
} }
} }

View File

@ -2,6 +2,7 @@ import axios from 'axios';
import * as cheerio from 'cheerio'; import * as cheerio from 'cheerio';
import * as querystring from 'querystring'; import * as querystring from 'querystring';
import { SearchResult } from '../../src/lib/entities'; import { SearchResult } from '../../src/lib/entities';
import { Logger } from '../../src/lib/logger';
const _useragent_list = [ const _useragent_list = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0',
@ -96,7 +97,7 @@ export async function google_search(term: string, advanced = false, num_results
await new Promise(resolve => setTimeout(resolve, sleep_interval * 1000)); await new Promise(resolve => setTimeout(resolve, sleep_interval * 1000));
} catch (error) { } catch (error) {
if (error.message === 'Too many requests') { if (error.message === 'Too many requests') {
console.warn('Too many requests, breaking the loop'); Logger.warn('Too many requests, breaking the loop');
break; break;
} }
throw error; throw error;
@ -107,7 +108,7 @@ export async function google_search(term: string, advanced = false, num_results
} }
} }
if (attempts >= maxAttempts) { if (attempts >= maxAttempts) {
console.warn('Max attempts reached, breaking the loop'); Logger.warn('Max attempts reached, breaking the loop');
} }
return results return results
} }

View File

@ -1,3 +1,4 @@
import { Logger } from "../../src/lib/logger";
import { SearchResult } from "../../src/lib/entities"; import { SearchResult } from "../../src/lib/entities";
import { google_search } from "./googlesearch"; import { google_search } from "./googlesearch";
import { serper_search } from "./serper"; import { serper_search } from "./serper";
@ -47,7 +48,7 @@ export async function search({
timeout timeout
); );
} catch (error) { } catch (error) {
console.error("Error in search function: ", error); Logger.error(`Error in search function: ${error}`);
return [] return []
} }
// if process.env.SERPER_API_KEY is set, use serper // if process.env.SERPER_API_KEY is set, use serper

View File

@ -1,3 +1,4 @@
import { Logger } from "../../../src/lib/logger";
import { getWebScraperQueue } from "../queue-service"; import { getWebScraperQueue } from "../queue-service";
import { sendSlackWebhook } from "./slack"; import { sendSlackWebhook } from "./slack";
@ -9,13 +10,13 @@ export async function checkAlerts() {
process.env.ALERT_NUM_ACTIVE_JOBS && process.env.ALERT_NUM_ACTIVE_JOBS &&
process.env.ALERT_NUM_WAITING_JOBS process.env.ALERT_NUM_WAITING_JOBS
) { ) {
console.info("Initializing alerts"); Logger.info("Initializing alerts");
const checkActiveJobs = async () => { const checkActiveJobs = async () => {
try { try {
const webScraperQueue = getWebScraperQueue(); const webScraperQueue = getWebScraperQueue();
const activeJobs = await webScraperQueue.getActiveCount(); const activeJobs = await webScraperQueue.getActiveCount();
if (activeJobs > Number(process.env.ALERT_NUM_ACTIVE_JOBS)) { if (activeJobs > Number(process.env.ALERT_NUM_ACTIVE_JOBS)) {
console.warn( Logger.warn(
`Alert: Number of active jobs is over ${process.env.ALERT_NUM_ACTIVE_JOBS}. Current active jobs: ${activeJobs}.` `Alert: Number of active jobs is over ${process.env.ALERT_NUM_ACTIVE_JOBS}. Current active jobs: ${activeJobs}.`
); );
sendSlackWebhook( sendSlackWebhook(
@ -23,12 +24,12 @@ export async function checkAlerts() {
true true
); );
} else { } else {
console.info( Logger.info(
`Number of active jobs is under ${process.env.ALERT_NUM_ACTIVE_JOBS}. Current active jobs: ${activeJobs}` `Number of active jobs is under ${process.env.ALERT_NUM_ACTIVE_JOBS}. Current active jobs: ${activeJobs}`
); );
} }
} catch (error) { } catch (error) {
console.error("Failed to check active jobs:", error); Logger.error(`Failed to check active jobs: ${error}`);
} }
}; };
@ -38,7 +39,7 @@ export async function checkAlerts() {
const paused = await webScraperQueue.getPausedCount(); const paused = await webScraperQueue.getPausedCount();
if (waitingJobs !== paused && waitingJobs > Number(process.env.ALERT_NUM_WAITING_JOBS)) { if (waitingJobs !== paused && waitingJobs > Number(process.env.ALERT_NUM_WAITING_JOBS)) {
console.warn( Logger.warn(
`Alert: Number of waiting jobs is over ${process.env.ALERT_NUM_WAITING_JOBS}. Current waiting jobs: ${waitingJobs}.` `Alert: Number of waiting jobs is over ${process.env.ALERT_NUM_WAITING_JOBS}. Current waiting jobs: ${waitingJobs}.`
); );
sendSlackWebhook( sendSlackWebhook(
@ -57,6 +58,6 @@ export async function checkAlerts() {
// setInterval(checkAll, 10000); // Run every // setInterval(checkAll, 10000); // Run every
} }
} catch (error) { } catch (error) {
console.error("Failed to initialize alerts:", error); Logger.error(`Failed to initialize alerts: ${error}`);
} }
} }

View File

@ -1,4 +1,5 @@
import axios from "axios"; import axios from "axios";
import { Logger } from "../../../src/lib/logger";
export async function sendSlackWebhook( export async function sendSlackWebhook(
message: string, message: string,
@ -16,8 +17,8 @@ export async function sendSlackWebhook(
"Content-Type": "application/json", "Content-Type": "application/json",
}, },
}); });
console.log("Webhook sent successfully:", response.data); Logger.log("Webhook sent successfully:", response.data);
} catch (error) { } catch (error) {
console.error("Error sending webhook:", error); Logger.debug(`Error sending webhook: ${error}`);
} }
} }

View File

@ -2,9 +2,39 @@ import { NotificationType } from "../../types";
import { withAuth } from "../../lib/withAuth"; import { withAuth } from "../../lib/withAuth";
import { sendNotification } from "../notification/email_notification"; import { sendNotification } from "../notification/email_notification";
import { supabase_service } from "../supabase"; import { supabase_service } from "../supabase";
import { Logger } from "../../lib/logger";
import { getValue, setValue } from "../redis";
import Redlock from "redlock";
import Client from "ioredis";
const FREE_CREDITS = 500; const FREE_CREDITS = 500;
const redlock = new Redlock(
// You should have one client for each independent redis node
// or cluster.
[new Client(process.env.REDIS_RATE_LIMIT_URL)],
{
// The expected clock drift; for more details see:
// http://redis.io/topics/distlock
driftFactor: 0.01, // multiplied by lock ttl to determine drift time
// The max number of times Redlock will attempt to lock a resource
// before erroring.
retryCount: 5,
// the time in ms between attempts
retryDelay: 100, // time in ms
// the max time in ms randomly added to retries
// to improve performance under high contention
// see https://www.awsarchitectureblog.com/2015/03/backoff.html
retryJitter: 200, // time in ms
// The minimum remaining time on a lock before an extension is automatically
// attempted with the `using` API.
automaticExtensionThreshold: 500, // time in ms
}
);
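A minimal sketch of how this redlock instance is used further down in this file: acquire a short-lived lock around the cached credit-usage lookup, then always release it. The key and TTL here are illustrative:

```ts
// Sketch only: acquire/release pattern; real keys and TTLs appear later in this diff.
async function exampleLockedRead() {
  const lock = await redlock.acquire(["lock_credit_usage_example"], 10000);
  try {
    // read-through cache work guarded by the lock goes here
  } finally {
    await lock.release();
  }
}
```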
export async function billTeam(team_id: string, credits: number) { export async function billTeam(team_id: string, credits: number) {
return withAuth(supaBillTeam)(team_id, credits); return withAuth(supaBillTeam)(team_id, credits);
} }
@ -12,27 +42,27 @@ export async function supaBillTeam(team_id: string, credits: number) {
if (team_id === "preview") { if (team_id === "preview") {
return { success: true, message: "Preview team, no credits used" }; return { success: true, message: "Preview team, no credits used" };
} }
console.log(`Billing team ${team_id} for ${credits} credits`); Logger.info(`Billing team ${team_id} for ${credits} credits`);
// When the API is used, you can log the credit usage in the credit_usage table: // When the API is used, you can log the credit usage in the credit_usage table:
// team_id: The ID of the team using the API. // team_id: The ID of the team using the API.
// subscription_id: The ID of the team's active subscription. // subscription_id: The ID of the team's active subscription.
// credits_used: The number of credits consumed by the API call. // credits_used: The number of credits consumed by the API call.
// created_at: The timestamp of the API usage. // created_at: The timestamp of the API usage.
  // 1. get the subscription and check for available coupons concurrently
  const [{ data: subscription }, { data: coupons }] = await Promise.all([
    supabase_service
      .from("subscriptions")
      .select("*")
      .eq("team_id", team_id)
      .eq("status", "active")
      .single(),
    supabase_service
      .from("coupons")
      .select("id, credits")
      .eq("team_id", team_id)
      .eq("status", "active"),
  ]);
let couponCredits = 0; let couponCredits = 0;
if (coupons && coupons.length > 0) { if (coupons && coupons.length > 0) {
@ -169,21 +199,21 @@ export async function supaCheckTeamCredits(team_id: string, credits: number) {
return { success: true, message: "Preview team, no credits used" }; return { success: true, message: "Preview team, no credits used" };
} }
  // Retrieve the team's active subscription and check for available coupons concurrently
  const [{ data: subscription, error: subscriptionError }, { data: coupons }] =
    await Promise.all([
      supabase_service
        .from("subscriptions")
        .select("id, price_id, current_period_start, current_period_end")
        .eq("team_id", team_id)
        .eq("status", "active")
        .single(),
      supabase_service
        .from("coupons")
        .select("credits")
        .eq("team_id", team_id)
        .eq("status", "active"),
    ]);
let couponCredits = 0; let couponCredits = 0;
if (coupons && coupons.length > 0) { if (coupons && coupons.length > 0) {
@ -218,7 +248,7 @@ export async function supaCheckTeamCredits(team_id: string, credits: number) {
0 0
); );
console.log("totalCreditsUsed", totalCreditsUsed); Logger.info(`totalCreditsUsed: ${totalCreditsUsed}`);
const end = new Date(); const end = new Date();
end.setDate(end.getDate() + 30); end.setDate(end.getDate() + 30);
@ -238,7 +268,6 @@ export async function supaCheckTeamCredits(team_id: string, credits: number) {
// 5. Compare the total credits used with the credits allowed by the plan. // 5. Compare the total credits used with the credits allowed by the plan.
if (totalCreditsUsed + credits > FREE_CREDITS) { if (totalCreditsUsed + credits > FREE_CREDITS) {
// Send email notification for insufficient credits // Send email notification for insufficient credits
await sendNotification( await sendNotification(
team_id, team_id,
NotificationType.LIMIT_REACHED, NotificationType.LIMIT_REACHED,
@ -254,28 +283,45 @@ export async function supaCheckTeamCredits(team_id: string, credits: number) {
} }
let totalCreditsUsed = 0; let totalCreditsUsed = 0;
const cacheKey = `credit_usage_${subscription.id}_${subscription.current_period_start}_${subscription.current_period_end}_lc`;
const redLockKey = `lock_${cacheKey}`;
const lockTTL = 10000; // 10 seconds
try { try {
const { data: creditUsages, error: creditUsageError } = const lock = await redlock.acquire([redLockKey], lockTTL);
await supabase_service.rpc("get_credit_usage_2", {
sub_id: subscription.id,
start_time: subscription.current_period_start,
end_time: subscription.current_period_end,
});
if (creditUsageError) { try {
console.error("Error calculating credit usage:", creditUsageError); const cachedCreditUsage = await getValue(cacheKey);
}
if (creditUsages && creditUsages.length > 0) { if (cachedCreditUsage) {
totalCreditsUsed = creditUsages[0].total_credits_used; totalCreditsUsed = parseInt(cachedCreditUsage);
} else {
const { data: creditUsages, error: creditUsageError } =
await supabase_service.rpc("get_credit_usage_2", {
sub_id: subscription.id,
start_time: subscription.current_period_start,
end_time: subscription.current_period_end,
});
if (creditUsageError) {
Logger.error(`Error calculating credit usage: ${creditUsageError}`);
}
if (creditUsages && creditUsages.length > 0) {
totalCreditsUsed = creditUsages[0].total_credits_used;
await setValue(cacheKey, totalCreditsUsed.toString(), 1800); // Cache for 30 minutes
// Logger.info(`Cache set for credit usage: ${totalCreditsUsed}`);
}
}
} finally {
await lock.release();
} }
} catch (error) { } catch (error) {
console.error("Error calculating credit usage:", error); Logger.error(`Error acquiring lock or calculating credit usage: ${error}`);
} }
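
Taken together, the block above is a cache-aside read guarded by a distributed lock: check Redis first, fall back to the get_credit_usage_2 RPC only on a miss, and cache the result with a 30-minute TTL so concurrent credit checks don't all hit the database. A condensed sketch of the same flow, assuming getValue/setValue are thin wrappers over ioredis GET and SET ... EX:

import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const getValue = (key: string) => redis.get(key);
const setValue = (key: string, value: string, ttlSeconds: number) =>
  redis.set(key, value, "EX", ttlSeconds);

async function getCachedCreditUsage(cacheKey: string, compute: () => Promise<number>): Promise<number> {
  const cached = await getValue(cacheKey);           // cache hit: skip the RPC entirely
  if (cached) return parseInt(cached, 10);
  const value = await compute();                     // cache miss: run the expensive query
  await setValue(cacheKey, value.toString(), 1800);  // 30-minute TTL, matching the diff
  return value;
}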
// Adjust total credits used by subtracting coupon value // Adjust total credits used by subtracting coupon value
const adjustedCreditsUsed = Math.max(0, totalCreditsUsed - couponCredits); const adjustedCreditsUsed = Math.max(0, totalCreditsUsed - couponCredits);
// Get the price details // Get the price details
const { data: price, error: priceError } = await supabase_service const { data: price, error: priceError } = await supabase_service
.from("prices") .from("prices")

View File

@ -1,5 +1,6 @@
import { Request } from "express"; import { Request } from "express";
import { supabase_service } from "../supabase"; import { supabase_service } from "../supabase";
import { Logger } from "../../../src/lib/logger";
export async function createIdempotencyKey( export async function createIdempotencyKey(
req: Request, req: Request,
@ -14,7 +15,7 @@ export async function createIdempotencyKey(
.insert({ key: idempotencyKey }); .insert({ key: idempotencyKey });
if (error) { if (error) {
console.error("Failed to create idempotency key:", error); Logger.error(`Failed to create idempotency key: ${error}`);
throw error; throw error;
} }
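
The two idempotency helpers in this diff persist an x-idempotency-key header in Supabase and reject keys that are not UUIDs. A rough sketch of how such helpers are typically wired into an Express route; the middleware name and response shape are assumptions, only the header name and UUID check come from this diff.

import { Request, Response, NextFunction } from "express";
import { validate as isUuid } from "uuid";

export async function idempotencyGuard(req: Request, res: Response, next: NextFunction) {
  const header = req.headers["x-idempotency-key"];
  if (header === undefined) return next(); // the header is optional

  const key = Array.isArray(header) ? header[0] : header;
  if (!isUuid(key)) {
    return res.status(400).json({ error: "Invalid idempotency key" });
  }
  // validateIdempotencyKey / createIdempotencyKey would be called here to
  // reject keys that were already used and to persist new ones.
  next();
}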

View File

@ -1,6 +1,7 @@
import { Request } from "express"; import { Request } from "express";
import { supabase_service } from "../supabase"; import { supabase_service } from "../supabase";
import { validate as isUuid } from 'uuid'; import { validate as isUuid } from 'uuid';
import { Logger } from "../../../src/lib/logger";
export async function validateIdempotencyKey( export async function validateIdempotencyKey(
req: Request, req: Request,
@ -13,7 +14,7 @@ export async function validateIdempotencyKey(
// Ensure idempotencyKey is treated as a string // Ensure idempotencyKey is treated as a string
const key = Array.isArray(idempotencyKey) ? idempotencyKey[0] : idempotencyKey; const key = Array.isArray(idempotencyKey) ? idempotencyKey[0] : idempotencyKey;
if (!isUuid(key)) { if (!isUuid(key)) {
console.error("Invalid idempotency key provided in the request headers."); Logger.debug("Invalid idempotency key provided in the request headers.");
return false; return false;
} }
@ -23,7 +24,7 @@ export async function validateIdempotencyKey(
.eq("key", idempotencyKey); .eq("key", idempotencyKey);
if (error) { if (error) {
console.error(error); Logger.error(`Error validating idempotency key: ${error}`);
} }
if (!data || data.length === 0) { if (!data || data.length === 0) {

View File

@ -1,4 +1,5 @@
import { supabase_service } from "../supabase"; import { supabase_service } from "../supabase";
import { Logger } from "../../../src/lib/logger";
import "dotenv/config"; import "dotenv/config";
export async function logCrawl(job_id: string, team_id: string) { export async function logCrawl(job_id: string, team_id: string) {
@ -13,7 +14,7 @@ export async function logCrawl(job_id: string, team_id: string) {
}, },
]); ]);
} catch (error) { } catch (error) {
console.error("Error logging crawl job:\n", error); Logger.error(`Error logging crawl job to supabase:\n${error}`);
} }
} }
} }

View File

@ -3,6 +3,7 @@ import { supabase_service } from "../supabase";
import { FirecrawlJob } from "../../types"; import { FirecrawlJob } from "../../types";
import { posthog } from "../posthog"; import { posthog } from "../posthog";
import "dotenv/config"; import "dotenv/config";
import { Logger } from "../../lib/logger";
export async function logJob(job: FirecrawlJob) { export async function logJob(job: FirecrawlJob) {
try { try {
@ -68,9 +69,9 @@ export async function logJob(job: FirecrawlJob) {
posthog.capture(phLog); posthog.capture(phLog);
} }
if (error) { if (error) {
console.error("Error logging job:\n", error); Logger.error(`Error logging job: ${error.message}`);
} }
} catch (error) { } catch (error) {
console.error("Error logging job:\n", error); Logger.error(`Error logging job: ${error.message}`);
} }
} }

View File

@ -2,11 +2,16 @@ import "dotenv/config";
import { ScrapeLog } from "../../types"; import { ScrapeLog } from "../../types";
import { supabase_service } from "../supabase"; import { supabase_service } from "../supabase";
import { PageOptions } from "../../lib/entities"; import { PageOptions } from "../../lib/entities";
import { Logger } from "../../lib/logger";
export async function logScrape( export async function logScrape(
scrapeLog: ScrapeLog, scrapeLog: ScrapeLog,
pageOptions?: PageOptions pageOptions?: PageOptions
) { ) {
if (process.env.USE_DB_AUTHENTICATION === "false") {
Logger.debug("Skipping logging scrape to Supabase");
return;
}
try { try {
// Only log jobs in production // Only log jobs in production
// if (process.env.ENV !== "production") { // if (process.env.ENV !== "production") {
@ -32,16 +37,16 @@ export async function logScrape(
retried: scrapeLog.retried, retried: scrapeLog.retried,
error_message: scrapeLog.error_message, error_message: scrapeLog.error_message,
date_added: new Date().toISOString(), date_added: new Date().toISOString(),
html: scrapeLog.html, html: "Removed to save db space",
ipv4_support: scrapeLog.ipv4_support, ipv4_support: scrapeLog.ipv4_support,
ipv6_support: scrapeLog.ipv6_support, ipv6_support: scrapeLog.ipv6_support,
}, },
]); ]);
if (error) { if (error) {
console.error("Error logging proxy:\n", error); Logger.error(`Error logging proxy:\n${error}`);
} }
} catch (error) { } catch (error) {
console.error("Error logging proxy:\n", error); Logger.error(`Error logging proxy:\n${error}`);
} }
} }

View File

@ -1,19 +1,20 @@
import { Logtail } from "@logtail/node"; import { Logtail } from "@logtail/node";
import "dotenv/config"; import "dotenv/config";
import { Logger } from "../lib/logger";
// A mock Logtail class to handle cases where LOGTAIL_KEY is not provided // A mock Logtail class to handle cases where LOGTAIL_KEY is not provided
class MockLogtail { class MockLogtail {
info(message: string, context?: Record<string, any>): void { info(message: string, context?: Record<string, any>): void {
console.log(message, context); Logger.debug(`${message} - ${context}`);
} }
error(message: string, context: Record<string, any> = {}): void { error(message: string, context: Record<string, any> = {}): void {
console.error(message, context); Logger.error(`${message} - ${context}`);
} }
} }
// Using the actual Logtail class if LOGTAIL_KEY exists, otherwise using the mock class // Using the actual Logtail class if LOGTAIL_KEY exists, otherwise using the mock class
// Additionally, print a warning to the terminal if LOGTAIL_KEY is not provided // Additionally, print a warning to the terminal if LOGTAIL_KEY is not provided
export const logtail = process.env.LOGTAIL_KEY ? new Logtail(process.env.LOGTAIL_KEY) : (() => { export const logtail = process.env.LOGTAIL_KEY ? new Logtail(process.env.LOGTAIL_KEY) : (() => {
console.warn("LOGTAIL_KEY is not provided - your events will not be logged. Using MockLogtail as a fallback. see logtail.ts for more."); Logger.warn("LOGTAIL_KEY is not provided - your events will not be logged. Using MockLogtail as a fallback. see logtail.ts for more.");
return new MockLogtail(); return new MockLogtail();
})(); })();

View File

@ -2,6 +2,7 @@ import { supabase_service } from "../supabase";
import { withAuth } from "../../lib/withAuth"; import { withAuth } from "../../lib/withAuth";
import { Resend } from "resend"; import { Resend } from "resend";
import { NotificationType } from "../../types"; import { NotificationType } from "../../types";
import { Logger } from "../../../src/lib/logger";
const emailTemplates: Record< const emailTemplates: Record<
NotificationType, NotificationType,
@ -52,11 +53,11 @@ async function sendEmailNotification(
}); });
if (error) { if (error) {
console.error("Error sending email: ", error); Logger.debug(`Error sending email: ${error}`);
return { success: false }; return { success: false };
} }
} catch (error) { } catch (error) {
console.error("Error sending email (2): ", error); Logger.debug(`Error sending email (2): ${error}`);
return { success: false }; return { success: false };
} }
} }
@ -70,7 +71,28 @@ export async function sendNotificationInternal(
if (team_id === "preview") { if (team_id === "preview") {
return { success: true }; return { success: true };
} }
const fifteenDaysAgo = new Date();
fifteenDaysAgo.setDate(fifteenDaysAgo.getDate() - 15);
const { data, error } = await supabase_service const { data, error } = await supabase_service
.from("user_notifications")
.select("*")
.eq("team_id", team_id)
.eq("notification_type", notificationType)
.gte("sent_date", fifteenDaysAgo.toISOString());
if (error) {
Logger.debug(`Error fetching notifications: ${error}`);
return { success: false };
}
if (data.length !== 0) {
// Logger.debug(`Notification already sent for team_id: ${team_id} and notificationType: ${notificationType} in the last 15 days`);
return { success: false };
}
const { data: recentData, error: recentError } = await supabase_service
.from("user_notifications") .from("user_notifications")
.select("*") .select("*")
.eq("team_id", team_id) .eq("team_id", team_id)
@ -78,14 +100,16 @@ export async function sendNotificationInternal(
.gte("sent_date", startDateString) .gte("sent_date", startDateString)
.lte("sent_date", endDateString); .lte("sent_date", endDateString);
if (error) { if (recentError) {
console.error("Error fetching notifications: ", error); Logger.debug(`Error fetching recent notifications: ${recentError}`);
return { success: false }; return { success: false };
} }
if (data.length !== 0) { if (recentData.length !== 0) {
// Logger.debug(`Notification already sent for team_id: ${team_id} and notificationType: ${notificationType} within the specified date range`);
return { success: false }; return { success: false };
} else { } else {
console.log(`Sending notification for team_id: ${team_id} and notificationType: ${notificationType}`);
// get the emails from the user with the team_id // get the emails from the user with the team_id
const { data: emails, error: emailsError } = await supabase_service const { data: emails, error: emailsError } = await supabase_service
.from("users") .from("users")
@ -93,7 +117,7 @@ export async function sendNotificationInternal(
.eq("team_id", team_id); .eq("team_id", team_id);
if (emailsError) { if (emailsError) {
console.error("Error fetching emails: ", emailsError); Logger.debug(`Error fetching emails: ${emailsError}`);
return { success: false }; return { success: false };
} }
@ -112,7 +136,7 @@ export async function sendNotificationInternal(
]); ]);
if (insertError) { if (insertError) {
console.error("Error inserting notification record: ", insertError); Logger.debug(`Error inserting notification record: ${insertError}`);
return { success: false }; return { success: false };
} }
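
The new block adds a second throttle: besides the caller-supplied date range, a notification of the same type is suppressed if one was already sent to the team in the last 15 days. A simplified sketch of that check (table and column names as in the diff; the fallback behaviour on a lookup error differs from the real code, which aborts the send):

import { supabase_service } from "../supabase";

async function wasNotifiedRecently(teamId: string, notificationType: string, days = 15): Promise<boolean> {
  const since = new Date();
  since.setDate(since.getDate() - days);

  const { data, error } = await supabase_service
    .from("user_notifications")
    .select("*")
    .eq("team_id", teamId)
    .eq("notification_type", notificationType)
    .gte("sent_date", since.toISOString());

  if (error) return false;        // sketch only: treat lookup failures as "not recently notified"
  return (data ?? []).length > 0; // any row means a notification already went out
}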

View File

@ -1,5 +1,6 @@
import { PostHog } from 'posthog-node'; import { PostHog } from 'posthog-node';
import "dotenv/config"; import "dotenv/config";
import { Logger } from '../../src/lib/logger';
export default function PostHogClient() { export default function PostHogClient() {
const posthogClient = new PostHog(process.env.POSTHOG_API_KEY, { const posthogClient = new PostHog(process.env.POSTHOG_API_KEY, {
@ -19,7 +20,7 @@ class MockPostHog {
export const posthog = process.env.POSTHOG_API_KEY export const posthog = process.env.POSTHOG_API_KEY
? PostHogClient() ? PostHogClient()
: (() => { : (() => {
console.warn( Logger.warn(
"POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more." "POSTHOG_API_KEY is not provided - your events will not be logged. Using MockPostHog as a fallback. See posthog.ts for more."
); );
return new MockPostHog(); return new MockPostHog();

View File

@ -1,5 +1,6 @@
import Queue from "bull"; import Queue from "bull";
import { Queue as BullQueue } from "bull"; import { Queue as BullQueue } from "bull";
import { Logger } from "../lib/logger";
let webScraperQueue: BullQueue; let webScraperQueue: BullQueue;
@ -7,11 +8,16 @@ export function getWebScraperQueue() {
if (!webScraperQueue) { if (!webScraperQueue) {
webScraperQueue = new Queue("web-scraper", process.env.REDIS_URL, { webScraperQueue = new Queue("web-scraper", process.env.REDIS_URL, {
settings: { settings: {
lockDuration: 2 * 60 * 60 * 1000, // 2 hours in milliseconds, lockDuration: 1 * 60 * 1000, // 1 minute in milliseconds,
lockRenewTime: 30 * 60 * 1000, // 30 minutes in milliseconds lockRenewTime: 15 * 1000, // 15 seconds in milliseconds
stalledInterval: 30 * 1000,
maxStalledCount: 10,
}, },
defaultJobOptions:{
attempts: 5
}
}); });
console.log("Web scraper queue created"); Logger.info("Web scraper queue created");
} }
return webScraperQueue; return webScraperQueue;
} }
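
The queue settings above trade the old 2-hour lock for a much shorter one: jobs renew their lock every 15 seconds, are considered stalled after 30 seconds without a renewal, tolerate up to 10 stalls, and are retried up to 5 times by default. A minimal sketch of a Bull queue configured the same way; the queue name and Redis URL fallback are illustrative:

import Queue from "bull";

const queue = new Queue("web-scraper", process.env.REDIS_URL ?? "redis://localhost:6379", {
  settings: {
    lockDuration: 60 * 1000,    // a worker holds a job lock for 1 minute
    lockRenewTime: 15 * 1000,   // and renews it every 15 seconds while processing
    stalledInterval: 30 * 1000, // check for stalled jobs every 30 seconds
    maxStalledCount: 10,        // a job may stall this many times before failing
  },
  defaultJobOptions: {
    attempts: 5,                // failed jobs are retried up to 5 times
  },
});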

View File

@ -6,8 +6,11 @@ import { startWebScraperPipeline } from "../main/runWebScraper";
import { callWebhook } from "./webhook"; import { callWebhook } from "./webhook";
import { logJob } from "./logging/log_job"; import { logJob } from "./logging/log_job";
import { initSDK } from '@hyperdx/node-opentelemetry'; import { initSDK } from '@hyperdx/node-opentelemetry';
import { Job } from "bull";
import { Logger } from "../lib/logger";
import { ScrapeEvents } from "../lib/scrape-events";
if(process.env.ENV === 'production') { if (process.env.ENV === 'production') {
initSDK({ initSDK({
consoleCapture: true, consoleCapture: true,
additionalInstrumentations: [], additionalInstrumentations: [],
@ -16,93 +19,107 @@ if(process.env.ENV === 'production') {
const wsq = getWebScraperQueue(); const wsq = getWebScraperQueue();
wsq.process( async function processJob(job: Job, done) {
Math.floor(Number(process.env.NUM_WORKERS_PER_QUEUE ?? 8)), Logger.debug(`🐂 Worker taking job ${job.id}`);
async function (job, done) {
try {
job.progress({
current: 1,
total: 100,
current_step: "SCRAPING",
current_url: "",
});
const start = Date.now();
const { success, message, docs } = await startWebScraperPipeline({ job });
const end = Date.now();
const timeTakenInSeconds = (end - start) / 1000;
const data = { try {
success: success, job.progress({
result: { current: 1,
links: docs.map((doc) => { total: 100,
return { content: doc, source: doc?.metadata?.sourceURL ?? doc?.url ?? "" }; current_step: "SCRAPING",
}), current_url: "",
}, });
project_id: job.data.project_id, const start = Date.now();
error: message /* etc... */, const { success, message, docs } = await startWebScraperPipeline({ job });
}; const end = Date.now();
const timeTakenInSeconds = (end - start) / 1000;
await callWebhook(job.data.team_id, job.id as string, data); const data = {
success: success,
result: {
links: docs.map((doc) => {
return { content: doc, source: doc?.metadata?.sourceURL ?? doc?.url ?? "" };
}),
},
project_id: job.data.project_id,
error: message /* etc... */,
};
await logJob({ await callWebhook(job.data.team_id, job.id as string, data);
job_id: job.id as string,
success: success,
message: message,
num_docs: docs.length,
docs: docs,
time_taken: timeTakenInSeconds,
team_id: job.data.team_id,
mode: "crawl",
url: job.data.url,
crawlerOptions: job.data.crawlerOptions,
pageOptions: job.data.pageOptions,
origin: job.data.origin,
});
done(null, data);
} catch (error) {
if (await getWebScraperQueue().isPaused(false)) {
return;
}
if (error instanceof CustomError) { await logJob({
// Here we handle the error, then save the failed job job_id: job.id as string,
console.error(error.message); // or any other error handling success: success,
message: message,
num_docs: docs.length,
docs: docs,
time_taken: timeTakenInSeconds,
team_id: job.data.team_id,
mode: "crawl",
url: job.data.url,
crawlerOptions: job.data.crawlerOptions,
pageOptions: job.data.pageOptions,
origin: job.data.origin,
});
Logger.debug(`🐂 Job done ${job.id}`);
done(null, data);
} catch (error) {
Logger.error(`🐂 Job errored ${job.id} - ${error}`);
if (await getWebScraperQueue().isPaused(false)) {
Logger.debug("🐂Queue is paused, ignoring");
return;
}
logtail.error("Custom error while ingesting", { if (error instanceof CustomError) {
job_id: job.id, // Here we handle the error, then save the failed job
error: error.message, Logger.error(error.message); // or any other error handling
dataIngestionJob: error.dataIngestionJob,
});
}
console.log(error);
logtail.error("Overall error ingesting", { logtail.error("Custom error while ingesting", {
job_id: job.id, job_id: job.id,
error: error.message, error: error.message,
dataIngestionJob: error.dataIngestionJob,
}); });
const data = {
success: false,
project_id: job.data.project_id,
error:
"Something went wrong... Contact help@mendable.ai or try again." /* etc... */,
};
await callWebhook(job.data.team_id, job.id as string, data);
await logJob({
job_id: job.id as string,
success: false,
message: typeof error === 'string' ? error : (error.message ?? "Something went wrong... Contact help@mendable.ai"),
num_docs: 0,
docs: [],
time_taken: 0,
team_id: job.data.team_id,
mode: "crawl",
url: job.data.url,
crawlerOptions: job.data.crawlerOptions,
pageOptions: job.data.pageOptions,
origin: job.data.origin,
});
done(null, data);
} }
Logger.error(error);
logtail.error("Overall error ingesting", {
job_id: job.id,
error: error.message,
});
const data = {
success: false,
project_id: job.data.project_id,
error:
"Something went wrong... Contact help@mendable.ai or try again." /* etc... */,
};
await callWebhook(job.data.team_id, job.id as string, data);
await logJob({
job_id: job.id as string,
success: false,
message: typeof error === 'string' ? error : (error.message ?? "Something went wrong... Contact help@mendable.ai"),
num_docs: 0,
docs: [],
time_taken: 0,
team_id: job.data.team_id,
mode: "crawl",
url: job.data.url,
crawlerOptions: job.data.crawlerOptions,
pageOptions: job.data.pageOptions,
origin: job.data.origin,
});
done(null, data);
} }
}
wsq.process(
Math.floor(Number(process.env.NUM_WORKERS_PER_QUEUE ?? 8)),
processJob
); );
wsq.on("waiting", j => ScrapeEvents.logJobEvent(j, "waiting"));
wsq.on("active", j => ScrapeEvents.logJobEvent(j, "active"));
wsq.on("completed", j => ScrapeEvents.logJobEvent(j, "completed"));
wsq.on("paused", j => ScrapeEvents.logJobEvent(j, "paused"));
wsq.on("resumed", j => ScrapeEvents.logJobEvent(j, "resumed"));
wsq.on("removed", j => ScrapeEvents.logJobEvent(j, "removed"));

View File

@ -9,7 +9,7 @@ const RATE_LIMITS = {
starter: 3, starter: 3,
standard: 5, standard: 5,
standardOld: 40, standardOld: 40,
scale: 20, scale: 50,
hobby: 3, hobby: 3,
standardNew: 10, standardNew: 10,
standardnew: 10, standardnew: 10,
@ -21,7 +21,7 @@ const RATE_LIMITS = {
starter: 20, starter: 20,
standard: 50, standard: 50,
standardOld: 40, standardOld: 40,
scale: 50, scale: 500,
hobby: 10, hobby: 10,
standardNew: 50, standardNew: 50,
standardnew: 50, standardnew: 50,
@ -33,7 +33,7 @@ const RATE_LIMITS = {
starter: 20, starter: 20,
standard: 40, standard: 40,
standardOld: 40, standardOld: 40,
scale: 50, scale: 500,
hobby: 10, hobby: 10,
standardNew: 50, standardNew: 50,
standardnew: 50, standardnew: 50,

View File

@ -1,14 +1,15 @@
import Redis from "ioredis"; import Redis from "ioredis";
import { redisRateLimitClient } from "./rate-limiter"; import { redisRateLimitClient } from "./rate-limiter";
import { Logger } from "../lib/logger";
// Listen to 'error' events to the Redis connection // Listen to 'error' events to the Redis connection
redisRateLimitClient.on("error", (error) => { redisRateLimitClient.on("error", (error) => {
try { try {
if (error.message === "ECONNRESET") { if (error.message === "ECONNRESET") {
console.log("Connection to Redis Session Store timed out."); Logger.error("Connection to Redis Session Rate Limit Store timed out.");
} else if (error.message === "ECONNREFUSED") { } else if (error.message === "ECONNREFUSED") {
console.log("Connection to Redis Session Store refused!"); Logger.error("Connection to Redis Session Rate Limit Store refused!");
} else console.log(error); } else Logger.error(error);
} catch (error) {} } catch (error) {}
}); });
@ -16,15 +17,15 @@ redisRateLimitClient.on("error", (error) => {
redisRateLimitClient.on("reconnecting", (err) => { redisRateLimitClient.on("reconnecting", (err) => {
try { try {
if (redisRateLimitClient.status === "reconnecting") if (redisRateLimitClient.status === "reconnecting")
console.log("Reconnecting to Redis Session Store..."); Logger.info("Reconnecting to Redis Session Rate Limit Store...");
else console.log("Error reconnecting to Redis Session Store."); else Logger.error("Error reconnecting to Redis Session Rate Limit Store.");
} catch (error) {} } catch (error) {}
}); });
// Listen to the 'connect' event to Redis // Listen to the 'connect' event to Redis
redisRateLimitClient.on("connect", (err) => { redisRateLimitClient.on("connect", (err) => {
try { try {
if (!err) console.log("Connected to Redis Session Store!"); if (!err) Logger.info("Connected to Redis Session Rate Limit Store!");
} catch (error) {} } catch (error) {}
}); });

View File

@ -1,4 +1,5 @@
import { createClient, SupabaseClient } from "@supabase/supabase-js"; import { createClient, SupabaseClient } from "@supabase/supabase-js";
import { Logger } from "../lib/logger";
// SupabaseService class initializes the Supabase client conditionally based on environment variables. // SupabaseService class initializes the Supabase client conditionally based on environment variables.
class SupabaseService { class SupabaseService {
@ -10,13 +11,13 @@ class SupabaseService {
// Only initialize the Supabase client if both URL and Service Token are provided. // Only initialize the Supabase client if both URL and Service Token are provided.
if (process.env.USE_DB_AUTHENTICATION === "false") { if (process.env.USE_DB_AUTHENTICATION === "false") {
// Warn the user that Authentication is disabled by setting the client to null // Warn the user that Authentication is disabled by setting the client to null
console.warn( Logger.warn(
"\x1b[33mAuthentication is disabled. Supabase client will not be initialized.\x1b[0m" "Authentication is disabled. Supabase client will not be initialized."
); );
this.client = null; this.client = null;
} else if (!supabaseUrl || !supabaseServiceToken) { } else if (!supabaseUrl || !supabaseServiceToken) {
console.error( Logger.error(
"\x1b[31mSupabase environment variables aren't configured correctly. Supabase client will not be initialized. Fix ENV configuration or disable DB authentication with USE_DB_AUTHENTICATION env variable\x1b[0m" "Supabase environment variables aren't configured correctly. Supabase client will not be initialized. Fix ENV configuration or disable DB authentication with USE_DB_AUTHENTICATION env variable"
); );
} else { } else {
this.client = createClient(supabaseUrl, supabaseServiceToken); this.client = createClient(supabaseUrl, supabaseServiceToken);
@ -35,10 +36,15 @@ export const supabase_service: SupabaseClient = new Proxy(
new SupabaseService(), new SupabaseService(),
{ {
get: function (target, prop, receiver) { get: function (target, prop, receiver) {
if (process.env.USE_DB_AUTHENTICATION === "false") {
Logger.debug(
"Attempted to access Supabase client when it's not configured."
);
}
const client = target.getClient(); const client = target.getClient();
// If the Supabase client is not initialized, intercept property access to provide meaningful error feedback. // If the Supabase client is not initialized, intercept property access to provide meaningful error feedback.
if (client === null) { if (client === null) {
console.error( Logger.error(
"Attempted to access Supabase client when it's not configured." "Attempted to access Supabase client when it's not configured."
); );
return () => { return () => {

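The Proxy wrapper above exists so that code paths which accidentally touch Supabase while USE_DB_AUTHENTICATION=false fail with a readable log line instead of a null-pointer crash. A condensed sketch of the idea; env var names are illustrative, and the real class also distinguishes missing env vars from intentionally disabled auth:

import { createClient, SupabaseClient } from "@supabase/supabase-js";

function buildSupabaseProxy(): SupabaseClient {
  const url = process.env.SUPABASE_URL;
  const key = process.env.SUPABASE_SERVICE_TOKEN;
  const client = url && key ? createClient(url, key) : null;

  return new Proxy({} as SupabaseClient, {
    get(_target, prop) {
      if (client === null) {
        console.error("Attempted to access Supabase client when it's not configured.");
        return () => ({ data: null, error: new Error("Supabase not configured") });
      }
      // Forward the property access to the real client.
      return Reflect.get(client, prop, client);
    },
  });
}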
View File

@ -1,3 +1,4 @@
import { Logger } from "../../src/lib/logger";
import { supabase_service } from "./supabase"; import { supabase_service } from "./supabase";
export const callWebhook = async (teamId: string, jobId: string,data: any) => { export const callWebhook = async (teamId: string, jobId: string,data: any) => {
@ -15,10 +16,7 @@ export const callWebhook = async (teamId: string, jobId: string,data: any) => {
.eq("team_id", teamId) .eq("team_id", teamId)
.limit(1); .limit(1);
if (error) { if (error) {
console.error( Logger.error(`Error fetching webhook URL for team ID: ${teamId}, error: ${error.message}`);
`Error fetching webhook URL for team ID: ${teamId}`,
error.message
);
return null; return null;
} }
@ -53,9 +51,6 @@ export const callWebhook = async (teamId: string, jobId: string,data: any) => {
}), }),
}); });
} catch (error) { } catch (error) {
console.error( Logger.debug(`Error sending webhook for team ID: ${teamId}, error: ${error.message}`);
`Error sending webhook for team ID: ${teamId}`,
error.message
);
} }
}; };
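
callWebhook looks up the team's webhook URL and POSTs the job result; errors are only logged, never thrown, so a broken webhook cannot fail the crawl itself. A simplified sketch of the delivery step, assuming Node 18+ with a global fetch and reducing the URL lookup to a parameter:

async function deliverWebhook(webhookUrl: string | null, jobId: string, data: unknown): Promise<void> {
  if (!webhookUrl) return; // team has no webhook configured

  try {
    await fetch(webhookUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ jobId, data }),
    });
  } catch (error) {
    // Log and swallow: webhook delivery problems must not fail the job.
    console.debug(`Error sending webhook for job ${jobId}: ${(error as Error).message}`);
  }
}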

View File

@ -1,2 +1,4 @@
export const errorNoResults = export const errorNoResults =
"No results found, please check the URL or contact us at help@mendable.ai to file a ticket."; "No results found, please check the URL or contact us at help@mendable.ai to file a ticket.";
export const clientSideError = "client-side exception has occurred"

View File

@ -1,9 +0,0 @@
fetch(process.argv[2] + "/admin/" + process.env.BULL_AUTH_KEY + "/shutdown", {
method: "POST"
}).then(async x => {
console.log(await x.text());
process.exit(0);
}).catch(e => {
console.error(e);
process.exit(1);
});

View File

@ -0,0 +1,271 @@
"use strict";
var __awaiter = (this && this.__awaiter) || function (thisArg, _arguments, P, generator) {
function adopt(value) { return value instanceof P ? value : new P(function (resolve) { resolve(value); }); }
return new (P || (P = Promise))(function (resolve, reject) {
function fulfilled(value) { try { step(generator.next(value)); } catch (e) { reject(e); } }
function rejected(value) { try { step(generator["throw"](value)); } catch (e) { reject(e); } }
function step(result) { result.done ? resolve(result.value) : adopt(result.value).then(fulfilled, rejected); }
step((generator = generator.apply(thisArg, _arguments || [])).next());
});
};
var __importDefault = (this && this.__importDefault) || function (mod) {
return (mod && mod.__esModule) ? mod : { "default": mod };
};
Object.defineProperty(exports, "__esModule", { value: true });
const axios_1 = __importDefault(require("axios"));
const zod_1 = require("zod");
const zod_to_json_schema_1 = require("zod-to-json-schema");
/**
* Main class for interacting with the Firecrawl API.
*/
class FirecrawlApp {
/**
* Initializes a new instance of the FirecrawlApp class.
* @param {FirecrawlAppConfig} config - Configuration options for the FirecrawlApp instance.
*/
constructor({ apiKey = null, apiUrl = null }) {
this.apiKey = apiKey || "";
this.apiUrl = apiUrl || "https://api.firecrawl.dev";
if (!this.apiKey) {
throw new Error("No API key provided");
}
}
/**
* Scrapes a URL using the Firecrawl API.
* @param {string} url - The URL to scrape.
* @param {Params | null} params - Additional parameters for the scrape request.
* @returns {Promise<ScrapeResponse>} The response from the scrape operation.
*/
scrapeUrl(url, params = null) {
var _a;
return __awaiter(this, void 0, void 0, function* () {
const headers = {
"Content-Type": "application/json",
Authorization: `Bearer ${this.apiKey}`,
};
let jsonData = Object.assign({ url }, params);
if ((_a = params === null || params === void 0 ? void 0 : params.extractorOptions) === null || _a === void 0 ? void 0 : _a.extractionSchema) {
let schema = params.extractorOptions.extractionSchema;
// Check if schema is an instance of ZodSchema to correctly identify Zod schemas
if (schema instanceof zod_1.z.ZodSchema) {
schema = (0, zod_to_json_schema_1.zodToJsonSchema)(schema);
}
jsonData = Object.assign(Object.assign({}, jsonData), { extractorOptions: Object.assign(Object.assign({}, params.extractorOptions), { extractionSchema: schema, mode: params.extractorOptions.mode || "llm-extraction" }) });
}
try {
const response = yield axios_1.default.post(this.apiUrl + "/v0/scrape", jsonData, { headers });
if (response.status === 200) {
const responseData = response.data;
if (responseData.success) {
return responseData;
}
else {
throw new Error(`Failed to scrape URL. Error: ${responseData.error}`);
}
}
else {
this.handleError(response, "scrape URL");
}
}
catch (error) {
throw new Error(error.message);
}
return { success: false, error: "Internal server error." };
});
}
/**
* Searches for a query using the Firecrawl API.
* @param {string} query - The query to search for.
* @param {Params | null} params - Additional parameters for the search request.
* @returns {Promise<SearchResponse>} The response from the search operation.
*/
search(query, params = null) {
return __awaiter(this, void 0, void 0, function* () {
const headers = {
"Content-Type": "application/json",
Authorization: `Bearer ${this.apiKey}`,
};
let jsonData = { query };
if (params) {
jsonData = Object.assign(Object.assign({}, jsonData), params);
}
try {
const response = yield axios_1.default.post(this.apiUrl + "/v0/search", jsonData, { headers });
if (response.status === 200) {
const responseData = response.data;
if (responseData.success) {
return responseData;
}
else {
throw new Error(`Failed to search. Error: ${responseData.error}`);
}
}
else {
this.handleError(response, "search");
}
}
catch (error) {
throw new Error(error.message);
}
return { success: false, error: "Internal server error." };
});
}
/**
* Initiates a crawl job for a URL using the Firecrawl API.
* @param {string} url - The URL to crawl.
* @param {Params | null} params - Additional parameters for the crawl request.
* @param {boolean} waitUntilDone - Whether to wait for the crawl job to complete.
* @param {number} pollInterval - Time in seconds for job status checks.
* @param {string} idempotencyKey - Optional idempotency key for the request.
* @returns {Promise<CrawlResponse | any>} The response from the crawl operation.
*/
crawlUrl(url, params = null, waitUntilDone = true, pollInterval = 2, idempotencyKey) {
return __awaiter(this, void 0, void 0, function* () {
const headers = this.prepareHeaders(idempotencyKey);
let jsonData = { url };
if (params) {
jsonData = Object.assign(Object.assign({}, jsonData), params);
}
try {
const response = yield this.postRequest(this.apiUrl + "/v0/crawl", jsonData, headers);
if (response.status === 200) {
const jobId = response.data.jobId;
if (waitUntilDone) {
return this.monitorJobStatus(jobId, headers, pollInterval);
}
else {
return { success: true, jobId };
}
}
else {
this.handleError(response, "start crawl job");
}
}
catch (error) {
console.log(error);
throw new Error(error.message);
}
return { success: false, error: "Internal server error." };
});
}
/**
* Checks the status of a crawl job using the Firecrawl API.
* @param {string} jobId - The job ID of the crawl operation.
* @returns {Promise<JobStatusResponse>} The response containing the job status.
*/
checkCrawlStatus(jobId) {
return __awaiter(this, void 0, void 0, function* () {
const headers = this.prepareHeaders();
try {
const response = yield this.getRequest(this.apiUrl + `/v0/crawl/status/${jobId}`, headers);
if (response.status === 200) {
return {
success: true,
status: response.data.status,
current: response.data.current,
current_url: response.data.current_url,
current_step: response.data.current_step,
total: response.data.total,
data: response.data.data,
partial_data: !response.data.data
? response.data.partial_data
: undefined,
};
}
else {
this.handleError(response, "check crawl status");
}
}
catch (error) {
throw new Error(error.message);
}
return {
success: false,
status: "unknown",
current: 0,
current_url: "",
current_step: "",
total: 0,
error: "Internal server error.",
};
});
}
/**
* Prepares the headers for an API request.
* @returns {AxiosRequestHeaders} The prepared headers.
*/
prepareHeaders(idempotencyKey) {
return Object.assign({ "Content-Type": "application/json", Authorization: `Bearer ${this.apiKey}` }, (idempotencyKey ? { "x-idempotency-key": idempotencyKey } : {}));
}
/**
* Sends a POST request to the specified URL.
* @param {string} url - The URL to send the request to.
* @param {Params} data - The data to send in the request.
* @param {AxiosRequestHeaders} headers - The headers for the request.
* @returns {Promise<AxiosResponse>} The response from the POST request.
*/
postRequest(url, data, headers) {
return axios_1.default.post(url, data, { headers });
}
/**
* Sends a GET request to the specified URL.
* @param {string} url - The URL to send the request to.
* @param {AxiosRequestHeaders} headers - The headers for the request.
* @returns {Promise<AxiosResponse>} The response from the GET request.
*/
getRequest(url, headers) {
return axios_1.default.get(url, { headers });
}
/**
* Monitors the status of a crawl job until completion or failure.
* @param {string} jobId - The job ID of the crawl operation.
* @param {AxiosRequestHeaders} headers - The headers for the request.
* @param {number} checkInterval - Interval in seconds between job status checks (values below 2 are rounded up to 2).
* @returns {Promise<any>} The final job status or data.
*/
monitorJobStatus(jobId, headers, checkInterval) {
return __awaiter(this, void 0, void 0, function* () {
while (true) {
const statusResponse = yield this.getRequest(this.apiUrl + `/v0/crawl/status/${jobId}`, headers);
if (statusResponse.status === 200) {
const statusData = statusResponse.data;
if (statusData.status === "completed") {
if ("data" in statusData) {
return statusData.data;
}
else {
throw new Error("Crawl job completed but no data was returned");
}
}
else if (["active", "paused", "pending", "queued"].includes(statusData.status)) {
if (checkInterval < 2) {
checkInterval = 2;
}
yield new Promise((resolve) => setTimeout(resolve, checkInterval * 1000)); // Wait for the check interval before polling again
}
else {
throw new Error(`Crawl job failed or was stopped. Status: ${statusData.status}`);
}
}
else {
this.handleError(statusResponse, "check crawl status");
}
}
});
}
/**
* Handles errors from API responses.
* @param {AxiosResponse} response - The response from the API.
* @param {string} action - The action being performed when the error occurred.
*/
handleError(response, action) {
if ([402, 408, 409, 500].includes(response.status)) {
const errorMessage = response.data.error || "Unknown error occurred";
throw new Error(`Failed to ${action}. Status code: ${response.status}. Error: ${errorMessage}`);
}
else {
throw new Error(`Unexpected error occurred while trying to ${action}. Status code: ${response.status}`);
}
}
}
exports.default = FirecrawlApp;

View File

@ -0,0 +1 @@
{"type": "commonjs"}

View File

@ -0,0 +1,265 @@
var __awaiter = (this && this.__awaiter) || function (thisArg, _arguments, P, generator) {
function adopt(value) { return value instanceof P ? value : new P(function (resolve) { resolve(value); }); }
return new (P || (P = Promise))(function (resolve, reject) {
function fulfilled(value) { try { step(generator.next(value)); } catch (e) { reject(e); } }
function rejected(value) { try { step(generator["throw"](value)); } catch (e) { reject(e); } }
function step(result) { result.done ? resolve(result.value) : adopt(result.value).then(fulfilled, rejected); }
step((generator = generator.apply(thisArg, _arguments || [])).next());
});
};
import axios from "axios";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";
/**
* Main class for interacting with the Firecrawl API.
*/
export default class FirecrawlApp {
/**
* Initializes a new instance of the FirecrawlApp class.
* @param {FirecrawlAppConfig} config - Configuration options for the FirecrawlApp instance.
*/
constructor({ apiKey = null, apiUrl = null }) {
this.apiKey = apiKey || "";
this.apiUrl = apiUrl || "https://api.firecrawl.dev";
if (!this.apiKey) {
throw new Error("No API key provided");
}
}
/**
* Scrapes a URL using the Firecrawl API.
* @param {string} url - The URL to scrape.
* @param {Params | null} params - Additional parameters for the scrape request.
* @returns {Promise<ScrapeResponse>} The response from the scrape operation.
*/
scrapeUrl(url, params = null) {
var _a;
return __awaiter(this, void 0, void 0, function* () {
const headers = {
"Content-Type": "application/json",
Authorization: `Bearer ${this.apiKey}`,
};
let jsonData = Object.assign({ url }, params);
if ((_a = params === null || params === void 0 ? void 0 : params.extractorOptions) === null || _a === void 0 ? void 0 : _a.extractionSchema) {
let schema = params.extractorOptions.extractionSchema;
// Check if schema is an instance of ZodSchema to correctly identify Zod schemas
if (schema instanceof z.ZodSchema) {
schema = zodToJsonSchema(schema);
}
jsonData = Object.assign(Object.assign({}, jsonData), { extractorOptions: Object.assign(Object.assign({}, params.extractorOptions), { extractionSchema: schema, mode: params.extractorOptions.mode || "llm-extraction" }) });
}
try {
const response = yield axios.post(this.apiUrl + "/v0/scrape", jsonData, { headers });
if (response.status === 200) {
const responseData = response.data;
if (responseData.success) {
return responseData;
}
else {
throw new Error(`Failed to scrape URL. Error: ${responseData.error}`);
}
}
else {
this.handleError(response, "scrape URL");
}
}
catch (error) {
throw new Error(error.message);
}
return { success: false, error: "Internal server error." };
});
}
/**
* Searches for a query using the Firecrawl API.
* @param {string} query - The query to search for.
* @param {Params | null} params - Additional parameters for the search request.
* @returns {Promise<SearchResponse>} The response from the search operation.
*/
search(query, params = null) {
return __awaiter(this, void 0, void 0, function* () {
const headers = {
"Content-Type": "application/json",
Authorization: `Bearer ${this.apiKey}`,
};
let jsonData = { query };
if (params) {
jsonData = Object.assign(Object.assign({}, jsonData), params);
}
try {
const response = yield axios.post(this.apiUrl + "/v0/search", jsonData, { headers });
if (response.status === 200) {
const responseData = response.data;
if (responseData.success) {
return responseData;
}
else {
throw new Error(`Failed to search. Error: ${responseData.error}`);
}
}
else {
this.handleError(response, "search");
}
}
catch (error) {
throw new Error(error.message);
}
return { success: false, error: "Internal server error." };
});
}
/**
* Initiates a crawl job for a URL using the Firecrawl API.
* @param {string} url - The URL to crawl.
* @param {Params | null} params - Additional parameters for the crawl request.
* @param {boolean} waitUntilDone - Whether to wait for the crawl job to complete.
* @param {number} pollInterval - Time in seconds for job status checks.
* @param {string} idempotencyKey - Optional idempotency key for the request.
* @returns {Promise<CrawlResponse | any>} The response from the crawl operation.
*/
crawlUrl(url, params = null, waitUntilDone = true, pollInterval = 2, idempotencyKey) {
return __awaiter(this, void 0, void 0, function* () {
const headers = this.prepareHeaders(idempotencyKey);
let jsonData = { url };
if (params) {
jsonData = Object.assign(Object.assign({}, jsonData), params);
}
try {
const response = yield this.postRequest(this.apiUrl + "/v0/crawl", jsonData, headers);
if (response.status === 200) {
const jobId = response.data.jobId;
if (waitUntilDone) {
return this.monitorJobStatus(jobId, headers, pollInterval);
}
else {
return { success: true, jobId };
}
}
else {
this.handleError(response, "start crawl job");
}
}
catch (error) {
console.log(error);
throw new Error(error.message);
}
return { success: false, error: "Internal server error." };
});
}
/**
* Checks the status of a crawl job using the Firecrawl API.
* @param {string} jobId - The job ID of the crawl operation.
* @returns {Promise<JobStatusResponse>} The response containing the job status.
*/
checkCrawlStatus(jobId) {
return __awaiter(this, void 0, void 0, function* () {
const headers = this.prepareHeaders();
try {
const response = yield this.getRequest(this.apiUrl + `/v0/crawl/status/${jobId}`, headers);
if (response.status === 200) {
return {
success: true,
status: response.data.status,
current: response.data.current,
current_url: response.data.current_url,
current_step: response.data.current_step,
total: response.data.total,
data: response.data.data,
partial_data: !response.data.data
? response.data.partial_data
: undefined,
};
}
else {
this.handleError(response, "check crawl status");
}
}
catch (error) {
throw new Error(error.message);
}
return {
success: false,
status: "unknown",
current: 0,
current_url: "",
current_step: "",
total: 0,
error: "Internal server error.",
};
});
}
/**
* Prepares the headers for an API request.
* @returns {AxiosRequestHeaders} The prepared headers.
*/
prepareHeaders(idempotencyKey) {
return Object.assign({ "Content-Type": "application/json", Authorization: `Bearer ${this.apiKey}` }, (idempotencyKey ? { "x-idempotency-key": idempotencyKey } : {}));
}
/**
* Sends a POST request to the specified URL.
* @param {string} url - The URL to send the request to.
* @param {Params} data - The data to send in the request.
* @param {AxiosRequestHeaders} headers - The headers for the request.
* @returns {Promise<AxiosResponse>} The response from the POST request.
*/
postRequest(url, data, headers) {
return axios.post(url, data, { headers });
}
/**
* Sends a GET request to the specified URL.
* @param {string} url - The URL to send the request to.
* @param {AxiosRequestHeaders} headers - The headers for the request.
* @returns {Promise<AxiosResponse>} The response from the GET request.
*/
getRequest(url, headers) {
return axios.get(url, { headers });
}
/**
* Monitors the status of a crawl job until completion or failure.
* @param {string} jobId - The job ID of the crawl operation.
* @param {AxiosRequestHeaders} headers - The headers for the request.
* @param {number} checkInterval - Interval in seconds between job status checks (values below 2 are rounded up to 2).
* @returns {Promise<any>} The final job status or data.
*/
monitorJobStatus(jobId, headers, checkInterval) {
return __awaiter(this, void 0, void 0, function* () {
while (true) {
const statusResponse = yield this.getRequest(this.apiUrl + `/v0/crawl/status/${jobId}`, headers);
if (statusResponse.status === 200) {
const statusData = statusResponse.data;
if (statusData.status === "completed") {
if ("data" in statusData) {
return statusData.data;
}
else {
throw new Error("Crawl job completed but no data was returned");
}
}
else if (["active", "paused", "pending", "queued"].includes(statusData.status)) {
if (checkInterval < 2) {
checkInterval = 2;
}
yield new Promise((resolve) => setTimeout(resolve, checkInterval * 1000)); // Wait for the check interval before polling again
}
else {
throw new Error(`Crawl job failed or was stopped. Status: ${statusData.status}`);
}
}
else {
this.handleError(statusResponse, "check crawl status");
}
}
});
}
/**
* Handles errors from API responses.
* @param {AxiosResponse} response - The response from the API.
* @param {string} action - The action being performed when the error occurred.
*/
handleError(response, action) {
if ([402, 408, 409, 500].includes(response.status)) {
const errorMessage = response.data.error || "Unknown error occurred";
throw new Error(`Failed to ${action}. Status code: ${response.status}. Error: ${errorMessage}`);
}
else {
throw new Error(`Unexpected error occurred while trying to ${action}. Status code: ${response.status}`);
}
}
}
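
And the same client from the ESM build, loaded with a bare import. The package name and the crawlerOptions parameter shape are assumptions based on the v0 API; top-level await requires running this as an ES module on a reasonably recent Node:

// ESM consumer (ESM build).
import FirecrawlApp from "@mendable/firecrawl-js";

const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

// Start a crawl and poll until it finishes (waitUntilDone = true, 2 s interval).
const documents = await app.crawlUrl("https://example.com", { crawlerOptions: { limit: 5 } }, true, 2);
console.log(documents.length);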

View File

@ -0,0 +1 @@
{"type": "module"}

View File

@ -0,0 +1,265 @@
var __awaiter = (this && this.__awaiter) || function (thisArg, _arguments, P, generator) {
function adopt(value) { return value instanceof P ? value : new P(function (resolve) { resolve(value); }); }
return new (P || (P = Promise))(function (resolve, reject) {
function fulfilled(value) { try { step(generator.next(value)); } catch (e) { reject(e); } }
function rejected(value) { try { step(generator["throw"](value)); } catch (e) { reject(e); } }
function step(result) { result.done ? resolve(result.value) : adopt(result.value).then(fulfilled, rejected); }
step((generator = generator.apply(thisArg, _arguments || [])).next());
});
};
import axios from "axios";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";
/**
* Main class for interacting with the Firecrawl API.
*/
export default class FirecrawlApp {
/**
* Initializes a new instance of the FirecrawlApp class.
* @param {FirecrawlAppConfig} config - Configuration options for the FirecrawlApp instance.
*/
constructor({ apiKey = null, apiUrl = null }) {
this.apiKey = apiKey || "";
this.apiUrl = apiUrl || "https://api.firecrawl.dev";
if (!this.apiKey) {
throw new Error("No API key provided");
}
}
/**
* Scrapes a URL using the Firecrawl API.
* @param {string} url - The URL to scrape.
* @param {Params | null} params - Additional parameters for the scrape request.
* @returns {Promise<ScrapeResponse>} The response from the scrape operation.
*/
scrapeUrl(url, params = null) {
var _a;
return __awaiter(this, void 0, void 0, function* () {
const headers = {
"Content-Type": "application/json",
Authorization: `Bearer ${this.apiKey}`,
};
let jsonData = Object.assign({ url }, params);
if ((_a = params === null || params === void 0 ? void 0 : params.extractorOptions) === null || _a === void 0 ? void 0 : _a.extractionSchema) {
let schema = params.extractorOptions.extractionSchema;
// Check if schema is an instance of ZodSchema to correctly identify Zod schemas
if (schema instanceof z.ZodSchema) {
schema = zodToJsonSchema(schema);
}
jsonData = Object.assign(Object.assign({}, jsonData), { extractorOptions: Object.assign(Object.assign({}, params.extractorOptions), { extractionSchema: schema, mode: params.extractorOptions.mode || "llm-extraction" }) });
}
try {
const response = yield axios.post(this.apiUrl + "/v0/scrape", jsonData, { headers });
if (response.status === 200) {
const responseData = response.data;
if (responseData.success) {
return responseData;
}
else {
throw new Error(`Failed to scrape URL. Error: ${responseData.error}`);
}
}
else {
this.handleError(response, "scrape URL");
}
}
catch (error) {
throw new Error(error.message);
}
return { success: false, error: "Internal server error." };
});
}
/**
* Searches for a query using the Firecrawl API.
* @param {string} query - The query to search for.
* @param {Params | null} params - Additional parameters for the search request.
* @returns {Promise<SearchResponse>} The response from the search operation.
*/
search(query, params = null) {
return __awaiter(this, void 0, void 0, function* () {
const headers = {
"Content-Type": "application/json",
Authorization: `Bearer ${this.apiKey}`,
};
let jsonData = { query };
if (params) {
jsonData = Object.assign(Object.assign({}, jsonData), params);
}
try {
const response = yield axios.post(this.apiUrl + "/v0/search", jsonData, { headers });
if (response.status === 200) {
const responseData = response.data;
if (responseData.success) {
return responseData;
}
else {
throw new Error(`Failed to search. Error: ${responseData.error}`);
}
}
else {
this.handleError(response, "search");
}
}
catch (error) {
throw new Error(error.message);
}
return { success: false, error: "Internal server error." };
});
}
/**
* Initiates a crawl job for a URL using the Firecrawl API.
* @param {string} url - The URL to crawl.
* @param {Params | null} params - Additional parameters for the crawl request.
* @param {boolean} waitUntilDone - Whether to wait for the crawl job to complete.
* @param {number} pollInterval - Time in seconds for job status checks.
* @param {string} idempotencyKey - Optional idempotency key for the request.
* @returns {Promise<CrawlResponse | any>} The response from the crawl operation.
*/
crawlUrl(url, params = null, waitUntilDone = true, pollInterval = 2, idempotencyKey) {
return __awaiter(this, void 0, void 0, function* () {
const headers = this.prepareHeaders(idempotencyKey);
let jsonData = { url };
if (params) {
jsonData = Object.assign(Object.assign({}, jsonData), params);
}
try {
const response = yield this.postRequest(this.apiUrl + "/v0/crawl", jsonData, headers);
if (response.status === 200) {
const jobId = response.data.jobId;
if (waitUntilDone) {
return this.monitorJobStatus(jobId, headers, pollInterval);
}
else {
return { success: true, jobId };
}
}
else {
this.handleError(response, "start crawl job");
}
}
catch (error) {
console.log(error);
throw new Error(error.message);
}
return { success: false, error: "Internal server error." };
});
}
/**
* Checks the status of a crawl job using the Firecrawl API.
* @param {string} jobId - The job ID of the crawl operation.
* @returns {Promise<JobStatusResponse>} The response containing the job status.
*/
checkCrawlStatus(jobId) {
return __awaiter(this, void 0, void 0, function* () {
const headers = this.prepareHeaders();
try {
const response = yield this.getRequest(this.apiUrl + `/v0/crawl/status/${jobId}`, headers);
if (response.status === 200) {
return {
success: true,
status: response.data.status,
current: response.data.current,
current_url: response.data.current_url,
current_step: response.data.current_step,
total: response.data.total,
data: response.data.data,
partial_data: !response.data.data
? response.data.partial_data
: undefined,
};
}
else {
this.handleError(response, "check crawl status");
}
}
catch (error) {
throw new Error(error.message);
}
return {
success: false,
status: "unknown",
current: 0,
current_url: "",
current_step: "",
total: 0,
error: "Internal server error.",
};
});
}
/**
* Prepares the headers for an API request.
* @returns {AxiosRequestHeaders} The prepared headers.
*/
prepareHeaders(idempotencyKey) {
return Object.assign({ "Content-Type": "application/json", Authorization: `Bearer ${this.apiKey}` }, (idempotencyKey ? { "x-idempotency-key": idempotencyKey } : {}));
}
/**
* Sends a POST request to the specified URL.
* @param {string} url - The URL to send the request to.
* @param {Params} data - The data to send in the request.
* @param {AxiosRequestHeaders} headers - The headers for the request.
* @returns {Promise<AxiosResponse>} The response from the POST request.
*/
postRequest(url, data, headers) {
return axios.post(url, data, { headers });
}
/**
* Sends a GET request to the specified URL.
* @param {string} url - The URL to send the request to.
* @param {AxiosRequestHeaders} headers - The headers for the request.
* @returns {Promise<AxiosResponse>} The response from the GET request.
*/
getRequest(url, headers) {
return axios.get(url, { headers });
}
/**
* Monitors the status of a crawl job until completion or failure.
* @param {string} jobId - The job ID of the crawl operation.
* @param {AxiosRequestHeaders} headers - The headers for the request.
* @param {number} checkInterval - Interval in seconds between job status checks (values below 2 are rounded up to 2).
* @returns {Promise<any>} The final job status or data.
*/
monitorJobStatus(jobId, headers, checkInterval) {
return __awaiter(this, void 0, void 0, function* () {
while (true) {
const statusResponse = yield this.getRequest(this.apiUrl + `/v0/crawl/status/${jobId}`, headers);
if (statusResponse.status === 200) {
const statusData = statusResponse.data;
if (statusData.status === "completed") {
if ("data" in statusData) {
return statusData.data;
}
else {
throw new Error("Crawl job completed but no data was returned");
}
}
else if (["active", "paused", "pending", "queued"].includes(statusData.status)) {
if (checkInterval < 2) {
checkInterval = 2;
}
yield new Promise((resolve) => setTimeout(resolve, checkInterval * 1000)); // Wait for the check interval before polling again
}
else {
throw new Error(`Crawl job failed or was stopped. Status: ${statusData.status}`);
}
}
else {
this.handleError(statusResponse, "check crawl status");
}
}
});
}
/**
* Handles errors from API responses.
* @param {AxiosResponse} response - The response from the API.
* @param {string} action - The action being performed when the error occurred.
*/
handleError(response, action) {
if ([402, 408, 409, 500].includes(response.status)) {
const errorMessage = response.data.error || "Unknown error occurred";
throw new Error(`Failed to ${action}. Status code: ${response.status}. Error: ${errorMessage}`);
}
else {
throw new Error(`Unexpected error occurred while trying to ${action}. Status code: ${response.status}`);
}
}
}

View File

@ -1,5 +0,0 @@
/** @type {import('ts-jest').JestConfigWithTsJest} */
module.exports = {
preset: 'ts-jest',
testEnvironment: 'node',
};

View File

@ -0,0 +1,16 @@
/** @type {import('ts-jest').JestConfigWithTsJest} **/
export default {
testEnvironment: "node",
"moduleNameMapper": {
"^(\\.{1,2}/.*)\\.js$": "$1",
},
"extensionsToTreatAsEsm": [".ts"],
"transform": {
"^.+\\.(mt|t|cj|j)s$": [
"ts-jest",
{
"useESM": true
}
]
},
};
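
With extensionsToTreatAsEsm and useESM: true, ts-jest compiles the TypeScript tests as ES modules; Jest itself, however, can only execute them when Node's VM modules flag is enabled. A typical way to wire that up in package.json (the script name and placement are assumptions):

{
  "scripts": {
    "test": "NODE_OPTIONS=--experimental-vm-modules jest"
  }
}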

View File

@ -24,7 +24,7 @@
"@types/node": "^20.12.12", "@types/node": "^20.12.12",
"@types/uuid": "^9.0.8", "@types/uuid": "^9.0.8",
"jest": "^29.7.0", "jest": "^29.7.0",
"ts-jest": "^29.1.2", "ts-jest": "^29.2.2",
"typescript": "^5.4.5" "typescript": "^5.4.5"
} }
}, },
@ -42,12 +42,12 @@
} }
}, },
"node_modules/@babel/code-frame": { "node_modules/@babel/code-frame": {
"version": "7.24.2", "version": "7.24.7",
"resolved": "https://registry.npmjs.org/@babel/code-frame/-/code-frame-7.24.2.tgz", "resolved": "https://registry.npmjs.org/@babel/code-frame/-/code-frame-7.24.7.tgz",
"integrity": "sha512-y5+tLQyV8pg3fsiln67BVLD1P13Eg4lh5RW9mF0zUuvLrv9uIQ4MCL+CRT+FTsBlBjcIan6PGsLcBN0m3ClUyQ==", "integrity": "sha512-BcYH1CVJBO9tvyIZ2jVeXgSIMvGZ2FDRvDdOIVQyuklNKSsx+eppDEBq/g47Ayw+RqNFE+URvOShmf+f/qwAlA==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/highlight": "^7.24.2", "@babel/highlight": "^7.24.7",
"picocolors": "^1.0.0" "picocolors": "^1.0.0"
}, },
"engines": { "engines": {
@ -55,9 +55,9 @@
} }
}, },
"node_modules/@babel/compat-data": { "node_modules/@babel/compat-data": {
"version": "7.24.4", "version": "7.24.9",
"resolved": "https://registry.npmjs.org/@babel/compat-data/-/compat-data-7.24.4.tgz", "resolved": "https://registry.npmjs.org/@babel/compat-data/-/compat-data-7.24.9.tgz",
"integrity": "sha512-vg8Gih2MLK+kOkHJp4gBEIkyaIi00jgWot2D9QOmmfLC8jINSOzmCLta6Bvz/JSBCqnegV0L80jhxkol5GWNfQ==", "integrity": "sha512-e701mcfApCJqMMueQI0Fb68Amflj83+dvAvHawoBpAz+GDjCIyGHzNwnefjsWJ3xiYAqqiQFoWbspGYBdb2/ng==",
"dev": true, "dev": true,
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
@ -94,12 +94,12 @@
} }
}, },
"node_modules/@babel/generator": { "node_modules/@babel/generator": {
"version": "7.24.4", "version": "7.24.10",
"resolved": "https://registry.npmjs.org/@babel/generator/-/generator-7.24.4.tgz", "resolved": "https://registry.npmjs.org/@babel/generator/-/generator-7.24.10.tgz",
"integrity": "sha512-Xd6+v6SnjWVx/nus+y0l1sxMOTOMBkyL4+BIdbALyatQnAe/SRVjANeDPSCYaX+i1iJmuGSKf3Z+E+V/va1Hvw==", "integrity": "sha512-o9HBZL1G2129luEUlG1hB4N/nlYNWHnpwlND9eOMclRqqu1YDy2sSYVCFUZwl8I1Gxh+QSRrP2vD7EpUmFVXxg==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/types": "^7.24.0", "@babel/types": "^7.24.9",
"@jridgewell/gen-mapping": "^0.3.5", "@jridgewell/gen-mapping": "^0.3.5",
"@jridgewell/trace-mapping": "^0.3.25", "@jridgewell/trace-mapping": "^0.3.25",
"jsesc": "^2.5.1" "jsesc": "^2.5.1"
@ -109,14 +109,14 @@
} }
}, },
"node_modules/@babel/helper-compilation-targets": { "node_modules/@babel/helper-compilation-targets": {
"version": "7.23.6", "version": "7.24.8",
"resolved": "https://registry.npmjs.org/@babel/helper-compilation-targets/-/helper-compilation-targets-7.23.6.tgz", "resolved": "https://registry.npmjs.org/@babel/helper-compilation-targets/-/helper-compilation-targets-7.24.8.tgz",
"integrity": "sha512-9JB548GZoQVmzrFgp8o7KxdgkTGm6xs9DW0o/Pim72UDjzr5ObUQ6ZzYPqA+g9OTS2bBQoctLJrky0RDCAWRgQ==", "integrity": "sha512-oU+UoqCHdp+nWVDkpldqIQL/i/bvAv53tRqLG/s+cOXxe66zOYLU7ar/Xs3LdmBihrUMEUhwu6dMZwbNOYDwvw==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/compat-data": "^7.23.5", "@babel/compat-data": "^7.24.8",
"@babel/helper-validator-option": "^7.23.5", "@babel/helper-validator-option": "^7.24.8",
"browserslist": "^4.22.2", "browserslist": "^4.23.1",
"lru-cache": "^5.1.1", "lru-cache": "^5.1.1",
"semver": "^6.3.1" "semver": "^6.3.1"
}, },
@ -125,62 +125,66 @@
} }
}, },
"node_modules/@babel/helper-environment-visitor": { "node_modules/@babel/helper-environment-visitor": {
"version": "7.22.20", "version": "7.24.7",
"resolved": "https://registry.npmjs.org/@babel/helper-environment-visitor/-/helper-environment-visitor-7.22.20.tgz", "resolved": "https://registry.npmjs.org/@babel/helper-environment-visitor/-/helper-environment-visitor-7.24.7.tgz",
"integrity": "sha512-zfedSIzFhat/gFhWfHtgWvlec0nqB9YEIVrpuwjruLlXfUSnA8cJB0miHKwqDnQ7d32aKo2xt88/xZptwxbfhA==", "integrity": "sha512-DoiN84+4Gnd0ncbBOM9AZENV4a5ZiL39HYMyZJGZ/AZEykHYdJw0wW3kdcsh9/Kn+BRXHLkkklZ51ecPKmI1CQ==",
"dev": true, "dev": true,
"dependencies": {
"@babel/types": "^7.24.7"
},
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
} }
}, },
"node_modules/@babel/helper-function-name": { "node_modules/@babel/helper-function-name": {
"version": "7.23.0", "version": "7.24.7",
"resolved": "https://registry.npmjs.org/@babel/helper-function-name/-/helper-function-name-7.23.0.tgz", "resolved": "https://registry.npmjs.org/@babel/helper-function-name/-/helper-function-name-7.24.7.tgz",
"integrity": "sha512-OErEqsrxjZTJciZ4Oo+eoZqeW9UIiOcuYKRJA4ZAgV9myA+pOXhhmpfNCKjEH/auVfEYVFJ6y1Tc4r0eIApqiw==", "integrity": "sha512-FyoJTsj/PEUWu1/TYRiXTIHc8lbw+TDYkZuoE43opPS5TrI7MyONBE1oNvfguEXAD9yhQRrVBnXdXzSLQl9XnA==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/template": "^7.22.15", "@babel/template": "^7.24.7",
"@babel/types": "^7.23.0" "@babel/types": "^7.24.7"
}, },
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
} }
}, },
"node_modules/@babel/helper-hoist-variables": { "node_modules/@babel/helper-hoist-variables": {
"version": "7.22.5", "version": "7.24.7",
"resolved": "https://registry.npmjs.org/@babel/helper-hoist-variables/-/helper-hoist-variables-7.22.5.tgz", "resolved": "https://registry.npmjs.org/@babel/helper-hoist-variables/-/helper-hoist-variables-7.24.7.tgz",
"integrity": "sha512-wGjk9QZVzvknA6yKIUURb8zY3grXCcOZt+/7Wcy8O2uctxhplmUPkOdlgoNhmdVee2c92JXbf1xpMtVNbfoxRw==", "integrity": "sha512-MJJwhkoGy5c4ehfoRyrJ/owKeMl19U54h27YYftT0o2teQ3FJ3nQUf/I3LlJsX4l3qlw7WRXUmiyajvHXoTubQ==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/types": "^7.22.5" "@babel/types": "^7.24.7"
}, },
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
} }
}, },
"node_modules/@babel/helper-module-imports": { "node_modules/@babel/helper-module-imports": {
"version": "7.24.3", "version": "7.24.7",
"resolved": "https://registry.npmjs.org/@babel/helper-module-imports/-/helper-module-imports-7.24.3.tgz", "resolved": "https://registry.npmjs.org/@babel/helper-module-imports/-/helper-module-imports-7.24.7.tgz",
"integrity": "sha512-viKb0F9f2s0BCS22QSF308z/+1YWKV/76mwt61NBzS5izMzDPwdq1pTrzf+Li3npBWX9KdQbkeCt1jSAM7lZqg==", "integrity": "sha512-8AyH3C+74cgCVVXow/myrynrAGv+nTVg5vKu2nZph9x7RcRwzmh0VFallJuFTZ9mx6u4eSdXZfcOzSqTUm0HCA==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/types": "^7.24.0" "@babel/traverse": "^7.24.7",
"@babel/types": "^7.24.7"
}, },
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
} }
}, },
"node_modules/@babel/helper-module-transforms": { "node_modules/@babel/helper-module-transforms": {
"version": "7.23.3", "version": "7.24.9",
"resolved": "https://registry.npmjs.org/@babel/helper-module-transforms/-/helper-module-transforms-7.23.3.tgz", "resolved": "https://registry.npmjs.org/@babel/helper-module-transforms/-/helper-module-transforms-7.24.9.tgz",
"integrity": "sha512-7bBs4ED9OmswdfDzpz4MpWgSrV7FXlc3zIagvLFjS5H+Mk7Snr21vQ6QwrsoCGMfNC4e4LQPdoULEt4ykz0SRQ==", "integrity": "sha512-oYbh+rtFKj/HwBQkFlUzvcybzklmVdVV3UU+mN7n2t/q3yGHbuVdNxyFvSBO1tfvjyArpHNcWMAzsSPdyI46hw==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/helper-environment-visitor": "^7.22.20", "@babel/helper-environment-visitor": "^7.24.7",
"@babel/helper-module-imports": "^7.22.15", "@babel/helper-module-imports": "^7.24.7",
"@babel/helper-simple-access": "^7.22.5", "@babel/helper-simple-access": "^7.24.7",
"@babel/helper-split-export-declaration": "^7.22.6", "@babel/helper-split-export-declaration": "^7.24.7",
"@babel/helper-validator-identifier": "^7.22.20" "@babel/helper-validator-identifier": "^7.24.7"
}, },
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
@ -190,60 +194,61 @@
} }
}, },
"node_modules/@babel/helper-plugin-utils": { "node_modules/@babel/helper-plugin-utils": {
"version": "7.24.0", "version": "7.24.8",
"resolved": "https://registry.npmjs.org/@babel/helper-plugin-utils/-/helper-plugin-utils-7.24.0.tgz", "resolved": "https://registry.npmjs.org/@babel/helper-plugin-utils/-/helper-plugin-utils-7.24.8.tgz",
"integrity": "sha512-9cUznXMG0+FxRuJfvL82QlTqIzhVW9sL0KjMPHhAOOvpQGL8QtdxnBKILjBqxlHyliz0yCa1G903ZXI/FuHy2w==", "integrity": "sha512-FFWx5142D8h2Mgr/iPVGH5G7w6jDn4jUSpZTyDnQO0Yn7Ks2Kuz6Pci8H6MPCoUJegd/UZQ3tAvfLCxQSnWWwg==",
"dev": true, "dev": true,
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
} }
}, },
"node_modules/@babel/helper-simple-access": { "node_modules/@babel/helper-simple-access": {
"version": "7.22.5", "version": "7.24.7",
"resolved": "https://registry.npmjs.org/@babel/helper-simple-access/-/helper-simple-access-7.22.5.tgz", "resolved": "https://registry.npmjs.org/@babel/helper-simple-access/-/helper-simple-access-7.24.7.tgz",
"integrity": "sha512-n0H99E/K+Bika3++WNL17POvo4rKWZ7lZEp1Q+fStVbUi8nxPQEBOlTmCOxW/0JsS56SKKQ+ojAe2pHKJHN35w==", "integrity": "sha512-zBAIvbCMh5Ts+b86r/CjU+4XGYIs+R1j951gxI3KmmxBMhCg4oQMsv6ZXQ64XOm/cvzfU1FmoCyt6+owc5QMYg==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/types": "^7.22.5" "@babel/traverse": "^7.24.7",
"@babel/types": "^7.24.7"
}, },
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
} }
}, },
"node_modules/@babel/helper-split-export-declaration": { "node_modules/@babel/helper-split-export-declaration": {
"version": "7.22.6", "version": "7.24.7",
"resolved": "https://registry.npmjs.org/@babel/helper-split-export-declaration/-/helper-split-export-declaration-7.22.6.tgz", "resolved": "https://registry.npmjs.org/@babel/helper-split-export-declaration/-/helper-split-export-declaration-7.24.7.tgz",
"integrity": "sha512-AsUnxuLhRYsisFiaJwvp1QF+I3KjD5FOxut14q/GzovUe6orHLesW2C7d754kRm53h5gqrz6sFl6sxc4BVtE/g==", "integrity": "sha512-oy5V7pD+UvfkEATUKvIjvIAH/xCzfsFVw7ygW2SI6NClZzquT+mwdTfgfdbUiceh6iQO0CHtCPsyze/MZ2YbAA==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/types": "^7.22.5" "@babel/types": "^7.24.7"
}, },
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
} }
}, },
"node_modules/@babel/helper-string-parser": { "node_modules/@babel/helper-string-parser": {
"version": "7.24.1", "version": "7.24.8",
"resolved": "https://registry.npmjs.org/@babel/helper-string-parser/-/helper-string-parser-7.24.1.tgz", "resolved": "https://registry.npmjs.org/@babel/helper-string-parser/-/helper-string-parser-7.24.8.tgz",
"integrity": "sha512-2ofRCjnnA9y+wk8b9IAREroeUP02KHp431N2mhKniy2yKIDKpbrHv9eXwm8cBeWQYcJmzv5qKCu65P47eCF7CQ==", "integrity": "sha512-pO9KhhRcuUyGnJWwyEgnRJTSIZHiT+vMD0kPeD+so0l7mxkMT19g3pjY9GTnHySck/hDzq+dtW/4VgnMkippsQ==",
"dev": true, "dev": true,
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
} }
}, },
"node_modules/@babel/helper-validator-identifier": { "node_modules/@babel/helper-validator-identifier": {
"version": "7.22.20", "version": "7.24.7",
"resolved": "https://registry.npmjs.org/@babel/helper-validator-identifier/-/helper-validator-identifier-7.22.20.tgz", "resolved": "https://registry.npmjs.org/@babel/helper-validator-identifier/-/helper-validator-identifier-7.24.7.tgz",
"integrity": "sha512-Y4OZ+ytlatR8AI+8KZfKuL5urKp7qey08ha31L8b3BwewJAoJamTzyvxPR/5D+KkdJCGPq/+8TukHBlY10FX9A==", "integrity": "sha512-rR+PBcQ1SMQDDyF6X0wxtG8QyLCgUB0eRAGguqRLfkCA87l7yAP7ehq8SNj96OOGTO8OBV70KhuFYcIkHXOg0w==",
"dev": true, "dev": true,
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
} }
}, },
"node_modules/@babel/helper-validator-option": { "node_modules/@babel/helper-validator-option": {
"version": "7.23.5", "version": "7.24.8",
"resolved": "https://registry.npmjs.org/@babel/helper-validator-option/-/helper-validator-option-7.23.5.tgz", "resolved": "https://registry.npmjs.org/@babel/helper-validator-option/-/helper-validator-option-7.24.8.tgz",
"integrity": "sha512-85ttAOMLsr53VgXkTbkx8oA6YTfT4q7/HzXSLEYmjcSTJPMPQtvq1BD79Byep5xMUYbGRzEpDsjUf3dyp54IKw==", "integrity": "sha512-xb8t9tD1MHLungh/AIoWYN+gVHaB9kwlu8gffXGSt3FFEIT7RjS+xWbc2vUD1UTZdIpKj/ab3rdqJ7ufngyi2Q==",
"dev": true, "dev": true,
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
@ -264,12 +269,12 @@
} }
}, },
"node_modules/@babel/highlight": { "node_modules/@babel/highlight": {
"version": "7.24.2", "version": "7.24.7",
"resolved": "https://registry.npmjs.org/@babel/highlight/-/highlight-7.24.2.tgz", "resolved": "https://registry.npmjs.org/@babel/highlight/-/highlight-7.24.7.tgz",
"integrity": "sha512-Yac1ao4flkTxTteCDZLEvdxg2fZfz1v8M4QpaGypq/WPDqg3ijHYbDfs+LG5hvzSoqaSZ9/Z9lKSP3CjZjv+pA==", "integrity": "sha512-EStJpq4OuY8xYfhGVXngigBJRWxftKX9ksiGDnmlY3o7B/V7KIAc9X4oiK87uPJSc/vs5L869bem5fhZa8caZw==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/helper-validator-identifier": "^7.22.20", "@babel/helper-validator-identifier": "^7.24.7",
"chalk": "^2.4.2", "chalk": "^2.4.2",
"js-tokens": "^4.0.0", "js-tokens": "^4.0.0",
"picocolors": "^1.0.0" "picocolors": "^1.0.0"
@ -350,9 +355,9 @@
} }
}, },
"node_modules/@babel/parser": { "node_modules/@babel/parser": {
"version": "7.24.4", "version": "7.24.8",
"resolved": "https://registry.npmjs.org/@babel/parser/-/parser-7.24.4.tgz", "resolved": "https://registry.npmjs.org/@babel/parser/-/parser-7.24.8.tgz",
"integrity": "sha512-zTvEBcghmeBma9QIGunWevvBAp4/Qu9Bdq+2k0Ot4fVMD6v3dsC9WOcRSKk7tRRyBM/53yKMJko9xOatGQAwSg==", "integrity": "sha512-WzfbgXOkGzZiXXCqk43kKwZjzwx4oulxZi3nq2TYL9mOjQv6kYwul9mz6ID36njuL7Xkp6nJEfok848Zj10j/w==",
"dev": true, "dev": true,
"bin": { "bin": {
"parser": "bin/babel-parser.js" "parser": "bin/babel-parser.js"
@ -539,33 +544,33 @@
} }
}, },
"node_modules/@babel/template": { "node_modules/@babel/template": {
"version": "7.24.0", "version": "7.24.7",
"resolved": "https://registry.npmjs.org/@babel/template/-/template-7.24.0.tgz", "resolved": "https://registry.npmjs.org/@babel/template/-/template-7.24.7.tgz",
"integrity": "sha512-Bkf2q8lMB0AFpX0NFEqSbx1OkTHf0f+0j82mkw+ZpzBnkk7e9Ql0891vlfgi+kHwOk8tQjiQHpqh4LaSa0fKEA==", "integrity": "sha512-jYqfPrU9JTF0PmPy1tLYHW4Mp4KlgxJD9l2nP9fD6yT/ICi554DmrWBAEYpIelzjHf1msDP3PxJIRt/nFNfBig==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/code-frame": "^7.23.5", "@babel/code-frame": "^7.24.7",
"@babel/parser": "^7.24.0", "@babel/parser": "^7.24.7",
"@babel/types": "^7.24.0" "@babel/types": "^7.24.7"
}, },
"engines": { "engines": {
"node": ">=6.9.0" "node": ">=6.9.0"
} }
}, },
"node_modules/@babel/traverse": { "node_modules/@babel/traverse": {
"version": "7.24.1", "version": "7.24.8",
"resolved": "https://registry.npmjs.org/@babel/traverse/-/traverse-7.24.1.tgz", "resolved": "https://registry.npmjs.org/@babel/traverse/-/traverse-7.24.8.tgz",
"integrity": "sha512-xuU6o9m68KeqZbQuDt2TcKSxUw/mrsvavlEqQ1leZ/B+C9tk6E4sRWy97WaXgvq5E+nU3cXMxv3WKOCanVMCmQ==", "integrity": "sha512-t0P1xxAPzEDcEPmjprAQq19NWum4K0EQPjMwZQZbHt+GiZqvjCHjj755Weq1YRPVzBI+3zSfvScfpnuIecVFJQ==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/code-frame": "^7.24.1", "@babel/code-frame": "^7.24.7",
"@babel/generator": "^7.24.1", "@babel/generator": "^7.24.8",
"@babel/helper-environment-visitor": "^7.22.20", "@babel/helper-environment-visitor": "^7.24.7",
"@babel/helper-function-name": "^7.23.0", "@babel/helper-function-name": "^7.24.7",
"@babel/helper-hoist-variables": "^7.22.5", "@babel/helper-hoist-variables": "^7.24.7",
"@babel/helper-split-export-declaration": "^7.22.6", "@babel/helper-split-export-declaration": "^7.24.7",
"@babel/parser": "^7.24.1", "@babel/parser": "^7.24.8",
"@babel/types": "^7.24.0", "@babel/types": "^7.24.8",
"debug": "^4.3.1", "debug": "^4.3.1",
"globals": "^11.1.0" "globals": "^11.1.0"
}, },
@ -574,13 +579,13 @@
} }
}, },
"node_modules/@babel/types": { "node_modules/@babel/types": {
"version": "7.24.0", "version": "7.24.9",
"resolved": "https://registry.npmjs.org/@babel/types/-/types-7.24.0.tgz", "resolved": "https://registry.npmjs.org/@babel/types/-/types-7.24.9.tgz",
"integrity": "sha512-+j7a5c253RfKh8iABBhywc8NSfP5LURe7Uh4qpsh6jc+aLJguvmIUBdjSdEMQv2bENrCR5MfRdjGo7vzS/ob7w==", "integrity": "sha512-xm8XrMKz0IlUdocVbYJe0Z9xEgidU7msskG8BbhnTPK/HZ2z/7FP7ykqPgrUH+C+r414mNfNWam1f2vqOjqjYQ==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"@babel/helper-string-parser": "^7.23.4", "@babel/helper-string-parser": "^7.24.8",
"@babel/helper-validator-identifier": "^7.22.20", "@babel/helper-validator-identifier": "^7.24.7",
"to-fast-properties": "^2.0.0" "to-fast-properties": "^2.0.0"
}, },
"engines": { "engines": {
@ -1175,6 +1180,12 @@
"sprintf-js": "~1.0.2" "sprintf-js": "~1.0.2"
} }
}, },
"node_modules/async": {
"version": "3.2.5",
"resolved": "https://registry.npmjs.org/async/-/async-3.2.5.tgz",
"integrity": "sha512-baNZyqaaLhyLVKm/DlvdW051MSgO6b8eVfIezl9E5PqWxFgzLm/wQntEW4zOytVburDEr0JlALEpdOFwvErLsg==",
"dev": true
},
"node_modules/asynckit": { "node_modules/asynckit": {
"version": "0.4.0", "version": "0.4.0",
"resolved": "https://registry.npmjs.org/asynckit/-/asynckit-0.4.0.tgz", "resolved": "https://registry.npmjs.org/asynckit/-/asynckit-0.4.0.tgz",
@ -1326,9 +1337,9 @@
} }
}, },
"node_modules/browserslist": { "node_modules/browserslist": {
"version": "4.23.0", "version": "4.23.2",
"resolved": "https://registry.npmjs.org/browserslist/-/browserslist-4.23.0.tgz", "resolved": "https://registry.npmjs.org/browserslist/-/browserslist-4.23.2.tgz",
"integrity": "sha512-QW8HiM1shhT2GuzkvklfjcKDiWFXHOeFCIA/huJPwHsslwcydgk7X+z2zXpEijP98UCY7HbubZt5J2Zgvf0CaQ==", "integrity": "sha512-qkqSyistMYdxAcw+CzbZwlBy8AGmS/eEWs+sEV5TnLRGDOL+C5M2EnH6tlZyg0YoAxGJAFKh61En9BR941GnHA==",
"dev": true, "dev": true,
"funding": [ "funding": [
{ {
@ -1345,10 +1356,10 @@
} }
], ],
"dependencies": { "dependencies": {
"caniuse-lite": "^1.0.30001587", "caniuse-lite": "^1.0.30001640",
"electron-to-chromium": "^1.4.668", "electron-to-chromium": "^1.4.820",
"node-releases": "^2.0.14", "node-releases": "^2.0.14",
"update-browserslist-db": "^1.0.13" "update-browserslist-db": "^1.1.0"
}, },
"bin": { "bin": {
"browserslist": "cli.js" "browserslist": "cli.js"
@ -1403,9 +1414,9 @@
} }
}, },
"node_modules/caniuse-lite": { "node_modules/caniuse-lite": {
"version": "1.0.30001612", "version": "1.0.30001642",
"resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001612.tgz", "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001642.tgz",
"integrity": "sha512-lFgnZ07UhaCcsSZgWW0K5j4e69dK1u/ltrL9lTUiFOwNHs12S3UMIEYgBV0Z6C6hRDev7iRnMzzYmKabYdXF9g==", "integrity": "sha512-3XQ0DoRgLijXJErLSl+bLnJ+Et4KqV1PY6JJBGAFlsNsz31zeAIncyeZfLCabHK/jtSh+671RM9YMldxjUPZtA==",
"dev": true, "dev": true,
"funding": [ "funding": [
{ {
@ -1651,10 +1662,25 @@
"url": "https://dotenvx.com" "url": "https://dotenvx.com"
} }
}, },
"node_modules/ejs": {
"version": "3.1.10",
"resolved": "https://registry.npmjs.org/ejs/-/ejs-3.1.10.tgz",
"integrity": "sha512-UeJmFfOrAQS8OJWPZ4qtgHyWExa088/MtK5UEyoJGFH67cDEXkZSviOiKRCZ4Xij0zxI3JECgYs3oKx+AizQBA==",
"dev": true,
"dependencies": {
"jake": "^10.8.5"
},
"bin": {
"ejs": "bin/cli.js"
},
"engines": {
"node": ">=0.10.0"
}
},
"node_modules/electron-to-chromium": { "node_modules/electron-to-chromium": {
"version": "1.4.748", "version": "1.4.829",
"resolved": "https://registry.npmjs.org/electron-to-chromium/-/electron-to-chromium-1.4.748.tgz", "resolved": "https://registry.npmjs.org/electron-to-chromium/-/electron-to-chromium-1.4.829.tgz",
"integrity": "sha512-VWqjOlPZn70UZ8FTKUOkUvBLeTQ0xpty66qV0yJcAGY2/CthI4xyW9aEozRVtuwv3Kpf5xTesmJUcPwuJmgP4A==", "integrity": "sha512-5qp1N2POAfW0u1qGAxXEtz6P7bO1m6gpZr5hdf5ve6lxpLM7MpiM4jIPz7xcrNlClQMafbyUDDWjlIQZ1Mw0Rw==",
"dev": true "dev": true
}, },
"node_modules/emittery": { "node_modules/emittery": {
@ -1778,6 +1804,36 @@
"bser": "2.1.1" "bser": "2.1.1"
} }
}, },
"node_modules/filelist": {
"version": "1.0.4",
"resolved": "https://registry.npmjs.org/filelist/-/filelist-1.0.4.tgz",
"integrity": "sha512-w1cEuf3S+DrLCQL7ET6kz+gmlJdbq9J7yXCSjK/OZCPA+qEN1WyF4ZAf0YYJa4/shHJra2t/d/r8SV4Ji+x+8Q==",
"dev": true,
"dependencies": {
"minimatch": "^5.0.1"
}
},
"node_modules/filelist/node_modules/brace-expansion": {
"version": "2.0.1",
"resolved": "https://registry.npmjs.org/brace-expansion/-/brace-expansion-2.0.1.tgz",
"integrity": "sha512-XnAIvQ8eM+kC6aULx6wuQiwVsnzsi9d3WxzV3FpWTGA19F621kwdbsAcFKXgKUHZWsy+mY6iL1sHTxWEFCytDA==",
"dev": true,
"dependencies": {
"balanced-match": "^1.0.0"
}
},
"node_modules/filelist/node_modules/minimatch": {
"version": "5.1.6",
"resolved": "https://registry.npmjs.org/minimatch/-/minimatch-5.1.6.tgz",
"integrity": "sha512-lKwV/1brpG6mBUFHtb7NUmtABCb2WZZmm2wNiOA5hAb8VdCS4B3dtMWyvcoViccwAW/COERjXLt0zP1zXUN26g==",
"dev": true,
"dependencies": {
"brace-expansion": "^2.0.1"
},
"engines": {
"node": ">=10"
}
},
"node_modules/fill-range": { "node_modules/fill-range": {
"version": "7.0.1", "version": "7.0.1",
"resolved": "https://registry.npmjs.org/fill-range/-/fill-range-7.0.1.tgz", "resolved": "https://registry.npmjs.org/fill-range/-/fill-range-7.0.1.tgz",
@ -2180,6 +2236,24 @@
"node": ">=8" "node": ">=8"
} }
}, },
"node_modules/jake": {
"version": "10.9.1",
"resolved": "https://registry.npmjs.org/jake/-/jake-10.9.1.tgz",
"integrity": "sha512-61btcOHNnLnsOdtLgA5efqQWjnSi/vow5HbI7HMdKKWqvrKR1bLK3BPlJn9gcSaP2ewuamUSMB5XEy76KUIS2w==",
"dev": true,
"dependencies": {
"async": "^3.2.3",
"chalk": "^4.0.2",
"filelist": "^1.0.4",
"minimatch": "^3.1.2"
},
"bin": {
"jake": "bin/cli.js"
},
"engines": {
"node": ">=10"
}
},
"node_modules/jest": { "node_modules/jest": {
"version": "29.7.0", "version": "29.7.0",
"resolved": "https://registry.npmjs.org/jest/-/jest-29.7.0.tgz", "resolved": "https://registry.npmjs.org/jest/-/jest-29.7.0.tgz",
@ -3009,9 +3083,9 @@
"dev": true "dev": true
}, },
"node_modules/node-releases": { "node_modules/node-releases": {
"version": "2.0.14", "version": "2.0.17",
"resolved": "https://registry.npmjs.org/node-releases/-/node-releases-2.0.14.tgz", "resolved": "https://registry.npmjs.org/node-releases/-/node-releases-2.0.17.tgz",
"integrity": "sha512-y10wOWt8yZpqXmOgRo77WaHEmhYQYGNA6y421PKsKYWEK8aW+cqAphborZDhqfyKrbZEN92CN1X2KbafY2s7Yw==", "integrity": "sha512-Ww6ZlOiEQfPfXM45v17oabk77Z7mg5bOt7AjDyzy7RjK9OrLrLC8dyZQoAPEOtFX9SaNf1Tdvr5gRJWdTJj7GA==",
"dev": true "dev": true
}, },
"node_modules/normalize-path": { "node_modules/normalize-path": {
@ -3162,9 +3236,9 @@
"dev": true "dev": true
}, },
"node_modules/picocolors": { "node_modules/picocolors": {
"version": "1.0.0", "version": "1.0.1",
"resolved": "https://registry.npmjs.org/picocolors/-/picocolors-1.0.0.tgz", "resolved": "https://registry.npmjs.org/picocolors/-/picocolors-1.0.1.tgz",
"integrity": "sha512-1fygroTLlHu66zi26VoTDv8yRgm0Fccecssto+MhsZ0D/DGW2sm8E8AjW7NU5VVTRt5GxbeZ5qBuJr+HyLYkjQ==", "integrity": "sha512-anP1Z8qwhkbmu7MFP5iTt+wQKXgwzf7zTyGlcdzabySa9vd0Xt392U0rVmz9poOaBj0uHJKyyo9/upk0HrEQew==",
"dev": true "dev": true
}, },
"node_modules/picomatch": { "node_modules/picomatch": {
@ -3545,12 +3619,13 @@
} }
}, },
"node_modules/ts-jest": { "node_modules/ts-jest": {
"version": "29.1.2", "version": "29.2.2",
"resolved": "https://registry.npmjs.org/ts-jest/-/ts-jest-29.1.2.tgz", "resolved": "https://registry.npmjs.org/ts-jest/-/ts-jest-29.2.2.tgz",
"integrity": "sha512-br6GJoH/WUX4pu7FbZXuWGKGNDuU7b8Uj77g/Sp7puZV6EXzuByl6JrECvm0MzVzSTkSHWTihsXt+5XYER5b+g==", "integrity": "sha512-sSW7OooaKT34AAngP6k1VS669a0HdLxkQZnlC7T76sckGCokXFnvJ3yRlQZGRTAoV5K19HfSgCiSwWOSIfcYlg==",
"dev": true, "dev": true,
"dependencies": { "dependencies": {
"bs-logger": "0.x", "bs-logger": "0.x",
"ejs": "^3.0.0",
"fast-json-stable-stringify": "2.x", "fast-json-stable-stringify": "2.x",
"jest-util": "^29.0.0", "jest-util": "^29.0.0",
"json5": "^2.2.3", "json5": "^2.2.3",
@ -3563,10 +3638,11 @@
"ts-jest": "cli.js" "ts-jest": "cli.js"
}, },
"engines": { "engines": {
"node": "^16.10.0 || ^18.0.0 || >=20.0.0" "node": "^14.15.0 || ^16.10.0 || ^18.0.0 || >=20.0.0"
}, },
"peerDependencies": { "peerDependencies": {
"@babel/core": ">=7.0.0-beta.0 <8", "@babel/core": ">=7.0.0-beta.0 <8",
"@jest/transform": "^29.0.0",
"@jest/types": "^29.0.0", "@jest/types": "^29.0.0",
"babel-jest": "^29.0.0", "babel-jest": "^29.0.0",
"jest": "^29.0.0", "jest": "^29.0.0",
@ -3576,6 +3652,9 @@
"@babel/core": { "@babel/core": {
"optional": true "optional": true
}, },
"@jest/transform": {
"optional": true
},
"@jest/types": { "@jest/types": {
"optional": true "optional": true
}, },
@ -3661,9 +3740,9 @@
"dev": true "dev": true
}, },
"node_modules/update-browserslist-db": { "node_modules/update-browserslist-db": {
"version": "1.0.13", "version": "1.1.0",
"resolved": "https://registry.npmjs.org/update-browserslist-db/-/update-browserslist-db-1.0.13.tgz", "resolved": "https://registry.npmjs.org/update-browserslist-db/-/update-browserslist-db-1.1.0.tgz",
"integrity": "sha512-xebP81SNcPuNpPP3uzeW1NYXxI3rxyJzF3pD6sH4jE7o/IX+WtSpwnVU+qIsDPyk0d3hmFQ7mjqc6AtV604hbg==", "integrity": "sha512-EdRAaAyk2cUE1wOf2DkEhzxqOQvFOoRJFNS6NeyJ01Gp2beMRpBAINjM2iDXE3KCuKhwnvHIQCJm6ThL2Z+HzQ==",
"dev": true, "dev": true,
"funding": [ "funding": [
{ {
@ -3680,8 +3759,8 @@
} }
], ],
"dependencies": { "dependencies": {
"escalade": "^3.1.1", "escalade": "^3.1.2",
"picocolors": "^1.0.0" "picocolors": "^1.0.1"
}, },
"bin": { "bin": {
"update-browserslist-db": "cli.js" "update-browserslist-db": "cli.js"

View File

@ -1,19 +1,15 @@
{ {
"name": "@mendable/firecrawl-js", "name": "@mendable/firecrawl-js",
"version": "0.0.29", "version": "0.0.34",
"description": "JavaScript SDK for Firecrawl API", "description": "JavaScript SDK for Firecrawl API",
"main": "build/cjs/index.js", "main": "build/index.js",
"types": "types/index.d.ts", "types": "types/index.d.ts",
"type": "module", "type": "module",
"exports": {
"require": "./build/cjs/index.js",
"import": "./build/esm/index.js"
},
"scripts": { "scripts": {
"build": "tsc --module commonjs --moduleResolution node10 --outDir build/cjs/ && echo '{\"type\": \"commonjs\"}' > build/cjs/package.json && npx tsc --module NodeNext --moduleResolution NodeNext --outDir build/esm/ && echo '{\"type\": \"module\"}' > build/esm/package.json", "build": "tsc",
"build-and-publish": "npm run build && npm publish --access public", "build-and-publish": "npm run build && npm publish --access public",
"publish-beta": "npm run build && npm publish --access public --tag beta", "publish-beta": "npm run build && npm publish --access public --tag beta",
"test": "jest src/__tests__/**/*.test.ts" "test": "NODE_OPTIONS=--experimental-vm-modules jest --verbose src/__tests__/**/*.test.ts"
}, },
"repository": { "repository": {
"type": "git", "type": "git",
@ -41,7 +37,7 @@
"@types/node": "^20.12.12", "@types/node": "^20.12.12",
"@types/uuid": "^9.0.8", "@types/uuid": "^9.0.8",
"jest": "^29.7.0", "jest": "^29.7.0",
"ts-jest": "^29.1.2", "ts-jest": "^29.2.2",
"typescript": "^5.4.5" "typescript": "^5.4.5"
}, },
"keywords": [ "keywords": [

View File

@ -13,7 +13,6 @@
"axios": "^1.6.8", "axios": "^1.6.8",
"ts-node": "^10.9.2", "ts-node": "^10.9.2",
"typescript": "^5.4.5", "typescript": "^5.4.5",
"uuid": "^10.0.0",
"zod": "^3.23.8" "zod": "^3.23.8"
}, },
"devDependencies": { "devDependencies": {
@ -451,15 +450,6 @@
"resolved": "https://registry.npmjs.org/@tsconfig/node16/-/node16-1.0.4.tgz", "resolved": "https://registry.npmjs.org/@tsconfig/node16/-/node16-1.0.4.tgz",
"integrity": "sha512-vxhUy4J8lyeyinH7Azl1pdd43GJhZH/tP2weN8TntQblOY+A0XbT8DJk1/oCPuOOyg/Ja757rG0CgHcWC8OfMA==" "integrity": "sha512-vxhUy4J8lyeyinH7Azl1pdd43GJhZH/tP2weN8TntQblOY+A0XbT8DJk1/oCPuOOyg/Ja757rG0CgHcWC8OfMA=="
}, },
"node_modules/@types/node": {
"version": "20.14.11",
"resolved": "https://registry.npmjs.org/@types/node/-/node-20.14.11.tgz",
"integrity": "sha512-kprQpL8MMeszbz6ojB5/tU8PLN4kesnN8Gjzw349rDlNgsSzg90lAVj3llK99Dh7JON+t9AuscPPFW6mPbTnSA==",
"peer": true,
"dependencies": {
"undici-types": "~5.26.4"
}
},
"node_modules/acorn": { "node_modules/acorn": {
"version": "8.11.3", "version": "8.11.3",
"resolved": "https://registry.npmjs.org/acorn/-/acorn-8.11.3.tgz", "resolved": "https://registry.npmjs.org/acorn/-/acorn-8.11.3.tgz",
@ -738,24 +728,6 @@
"node": ">=14.17" "node": ">=14.17"
} }
}, },
"node_modules/undici-types": {
"version": "5.26.5",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-5.26.5.tgz",
"integrity": "sha512-JlCMO+ehdEIKqlFxk6IfVoAUVmgz7cU7zD/h9XZ0qzeosSHmUJVOzSQvvYSYWXkFXC+IfLKSIffhv0sVZup6pA==",
"peer": true
},
"node_modules/uuid": {
"version": "10.0.0",
"resolved": "https://registry.npmjs.org/uuid/-/uuid-10.0.0.tgz",
"integrity": "sha512-8XkAphELsDnEGrDxUOHB3RGvXz6TeuYSGEZBOjtTtPm2lwhGBjLgOzLHB63IUWfBpNucQjND6d3AOudO+H3RWQ==",
"funding": [
"https://github.com/sponsors/broofa",
"https://github.com/sponsors/ctavan"
],
"bin": {
"uuid": "dist/bin/uuid"
}
},
"node_modules/v8-compile-cache-lib": { "node_modules/v8-compile-cache-lib": {
"version": "3.0.1", "version": "3.0.1",
"resolved": "https://registry.npmjs.org/v8-compile-cache-lib/-/v8-compile-cache-lib-3.0.1.tgz", "resolved": "https://registry.npmjs.org/v8-compile-cache-lib/-/v8-compile-cache-lib-3.0.1.tgz",
@ -778,9 +750,9 @@
} }
}, },
"node_modules/zod-to-json-schema": { "node_modules/zod-to-json-schema": {
"version": "3.23.1", "version": "3.23.0",
"resolved": "https://registry.npmjs.org/zod-to-json-schema/-/zod-to-json-schema-3.23.1.tgz", "resolved": "https://registry.npmjs.org/zod-to-json-schema/-/zod-to-json-schema-3.23.0.tgz",
"integrity": "sha512-oT9INvydob1XV0v1d2IadrR74rLtDInLvDFfAa1CG0Pmg/vxATk7I2gSelfj271mbzeM4Da0uuDQE/Nkj3DWNw==", "integrity": "sha512-az0uJ243PxsRIa2x1WmNE/pnuA05gUq/JB8Lwe1EDCCL/Fz9MgjYQ0fPlyc2Tcv6aF2ZA7WM5TWaRZVEFaAIag==",
"peerDependencies": { "peerDependencies": {
"zod": "^3.23.3" "zod": "^3.23.3"
} }

View File

@ -15,7 +15,6 @@
"axios": "^1.6.8", "axios": "^1.6.8",
"ts-node": "^10.9.2", "ts-node": "^10.9.2",
"typescript": "^5.4.5", "typescript": "^5.4.5",
"uuid": "^10.0.0",
"zod": "^3.23.8" "zod": "^3.23.8"
}, },
"devDependencies": { "devDependencies": {

View File

@ -5,7 +5,7 @@ the HTML content of a specified URL. It supports optional proxy settings and med
from os import environ from os import environ
from fastapi import FastAPI from fastapi import FastAPI, Response
from fastapi.responses import JSONResponse from fastapi.responses import JSONResponse
from playwright.async_api import Browser, async_playwright from playwright.async_api import Browser, async_playwright
from pydantic import BaseModel from pydantic import BaseModel
@ -39,14 +39,28 @@ async def shutdown_event():
"""Event handler for application shutdown to close the browser.""" """Event handler for application shutdown to close the browser."""
await browser.close() await browser.close()
@app.get("/health/liveness")
def liveness_probe():
"""Endpoint for liveness probe."""
return JSONResponse(content={"status": "ok"}, status_code=200)
@app.get("/health/readiness")
async def readiness_probe():
"""Endpoint for readiness probe. Checks if the browser instance is ready."""
if browser:
return JSONResponse(content={"status": "ok"}, status_code=200)
return JSONResponse(content={"status": "Service Unavailable"}, status_code=503)
@app.post("/html") @app.post("/html")
async def root(body: UrlModel): async def root(body: UrlModel):
""" """
Endpoint to fetch and return HTML content of a given URL. Endpoint to fetch and return HTML content of a given URL.
Args: Args:
body (UrlModel): The URL model containing the target URL, wait time, and timeout. body (UrlModel): The URL model containing the target URL, wait time, and timeout.
Returns: Returns:
JSONResponse: The HTML content of the page. JSONResponse: The HTML content of the page.
""" """

View File

@ -1,5 +1,6 @@
import { createClient, SupabaseClient } from "@supabase/supabase-js"; import { createClient, SupabaseClient } from "@supabase/supabase-js";
import "dotenv/config"; import "dotenv/config";
// SupabaseService class initializes the Supabase client conditionally based on environment variables. // SupabaseService class initializes the Supabase client conditionally based on environment variables.
class SupabaseService { class SupabaseService {
private client: SupabaseClient | null = null; private client: SupabaseClient | null = null;
@ -11,12 +12,12 @@ class SupabaseService {
if (process.env.USE_DB_AUTHENTICATION === "false") { if (process.env.USE_DB_AUTHENTICATION === "false") {
// Warn the user that Authentication is disabled by setting the client to null // Warn the user that Authentication is disabled by setting the client to null
console.warn( console.warn(
"\x1b[33mAuthentication is disabled. Supabase client will not be initialized.\x1b[0m" "Authentication is disabled. Supabase client will not be initialized."
); );
this.client = null; this.client = null;
} else if (!supabaseUrl || !supabaseServiceToken) { } else if (!supabaseUrl || !supabaseServiceToken) {
console.error( console.error(
"\x1b[31mSupabase environment variables aren't configured correctly. Supabase client will not be initialized. Fix ENV configuration or disable DB authentication with USE_DB_AUTHENTICATION env variable\x1b[0m" "Supabase environment variables aren't configured correctly. Supabase client will not be initialized. Fix ENV configuration or disable DB authentication with USE_DB_AUTHENTICATION env variable"
); );
} else { } else {
this.client = createClient(supabaseUrl, supabaseServiceToken); this.client = createClient(supabaseUrl, supabaseServiceToken);
@ -35,6 +36,11 @@ export const supabase_service: SupabaseClient = new Proxy(
new SupabaseService(), new SupabaseService(),
{ {
get: function (target, prop, receiver) { get: function (target, prop, receiver) {
if (process.env.USE_DB_AUTHENTICATION === "false") {
console.debug(
"Attempted to access Supabase client when it's not configured."
);
}
const client = target.getClient(); const client = target.getClient();
// If the Supabase client is not initialized, intercept property access to provide meaningful error feedback. // If the Supabase client is not initialized, intercept property access to provide meaningful error feedback.
if (client === null) { if (client === null) {
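
The interception above is a standard use of `Proxy` to fail loudly when an unconfigured client is touched. A self-contained sketch of the same idea (the class name and error wording are illustrative, not the project's exact code):

```typescript
// Sketch: a Proxy that surfaces a meaningful error when the wrapped client was never configured.
class LazyService<T extends object> {
  constructor(private client: T | null) {}
  getClient(): T | null {
    return this.client;
  }
}

function guarded<T extends object>(service: LazyService<T>): T {
  return new Proxy({} as T, {
    get(_target, prop) {
      const client = service.getClient();
      if (client === null) {
        // Fail with a descriptive message instead of "cannot read properties of null".
        throw new Error(`Client is not configured; cannot access "${String(prop)}".`);
      }
      return Reflect.get(client, prop);
    },
  });
}
```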

View File

@ -0,0 +1,18 @@
module.exports = {
root: true,
env: { browser: true, es2020: true },
extends: [
'eslint:recommended',
'plugin:@typescript-eslint/recommended',
'plugin:react-hooks/recommended',
],
ignorePatterns: ['dist', '.eslintrc.cjs'],
parser: '@typescript-eslint/parser',
plugins: ['react-refresh'],
rules: {
'react-refresh/only-export-components': [
'warn',
{ allowConstantExport: true },
],
},
}

apps/ui/ingestion-ui/.gitignore
View File

@ -0,0 +1,24 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*
node_modules
dist
dist-ssr
*.local
# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?

View File

@ -0,0 +1,21 @@
MIT License
Copyright (c) 2024 Sideguide Technologies Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

View File

@ -0,0 +1,65 @@
# Firecrawl UI Template
This template provides an easy way to spin up a UI for Firecrawl using React. It includes a pre-built component that interacts with the Firecrawl API, allowing you to quickly set up a web crawling and scraping interface.
## ⚠️ Important Security Notice
**This template exposes Firecrawl API keys in the client-side code. For production use, it is strongly recommended to move API interactions to a server-side implementation to protect your API keys.**
## Prerequisites
- Node.js (v14 or later recommended)
- npm
## Getting Started
1. Install dependencies:
```
npm install
```
2. Set up your Firecrawl API key:
Open `src/components/FirecrawlComponent.tsx` and replace the placeholder API key:
```typescript
const FIRECRAWL_API_KEY = "your-api-key-here";
```
3. Start the development server:
```
npm run dev
```
4. Open your browser and navigate to the port specified in your terminal
## Customization
The main Firecrawl component is located in `src/components/FirecrawlComponent.tsx`. You can modify this file to customize the UI or add additional features.
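If you want to embed it elsewhere, a hypothetical `App.tsx` might look like this (the default export name is an assumption):
```tsx
// App.tsx — hypothetical usage; FirecrawlComponent being the default export is assumed.
import FirecrawlComponent from "./components/FirecrawlComponent";

export default function App() {
  return (
    <main>
      <FirecrawlComponent />
    </main>
  );
}
```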
## Security Considerations
For production use, consider the following security measures:
1. Move API interactions to a server-side implementation to protect your Firecrawl API key (see the sketch after this list).
2. Implement proper authentication and authorization for your application.
3. Set up CORS policies to restrict access to your API endpoints.
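As a minimal sketch of the first recommendation (the endpoint path, request body, and port are assumptions based on the v0 API, not a drop-in implementation; Node 18+ global `fetch` is assumed):
```tsx
// server.ts — hypothetical Express proxy that keeps the Firecrawl API key on the server.
import express from "express";

const app = express();
app.use(express.json());

const FIRECRAWL_API_KEY = process.env.FIRECRAWL_API_KEY; // never shipped to the browser

app.post("/api/crawl", async (req, res) => {
  try {
    const upstream = await fetch("https://api.firecrawl.dev/v0/crawl", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${FIRECRAWL_API_KEY}`,
      },
      body: JSON.stringify({ url: req.body.url }),
    });
    res.status(upstream.status).json(await upstream.json());
  } catch (err) {
    res.status(500).json({ error: "Upstream request failed" });
  }
});

app.listen(4000, () => console.log("Proxy listening on :4000"));
```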
## Learn More
For more information about Firecrawl and its API, visit the [Firecrawl documentation](https://docs.firecrawl.dev/).
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
The Firecrawl Ingestion UI Template is licensed under the MIT License. This means you are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software, subject to the following conditions:
- The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Please note that while this template is MIT licensed, it is part of a larger project which may be under different licensing terms. Always refer to the license information in the root directory of the main project for overall licensing details.

View File

@ -0,0 +1,17 @@
{
"$schema": "https://ui.shadcn.com/schema.json",
"style": "default",
"rsc": false,
"tsx": true,
"tailwind": {
"config": "tailwind.config.js",
"css": "src/index.css",
"baseColor": "slate",
"cssVariables": true,
"prefix": ""
},
"aliases": {
"components": "@/components",
"utils": "@/lib/utils"
}
}

Some files were not shown because too many files have changed in this diff.