mirror of
https://github.com/mendableai/firecrawl.git
synced 2024-11-15 19:22:19 +08:00
Merge branch 'main' into bugfix/lowercase-all-urls
This commit is contained in:
commit
68cb9e60c0
58
.github/workflows/ci.yml
vendored
Normal file
58
.github/workflows/ci.yml
vendored
Normal file
|
@ -0,0 +1,58 @@
|
|||
name: CI/CD
|
||||
on:
|
||||
pull_request:
|
||||
branches:
|
||||
- main
|
||||
# schedule:
|
||||
# - cron: '0 */4 * * *'
|
||||
|
||||
env:
|
||||
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
|
||||
BULL_AUTH_KEY: ${{ secrets.BULL_AUTH_KEY }}
|
||||
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
|
||||
HOST: ${{ secrets.HOST }}
|
||||
LLAMAPARSE_API_KEY: ${{ secrets.LLAMAPARSE_API_KEY }}
|
||||
LOGTAIL_KEY: ${{ secrets.LOGTAIL_KEY }}
|
||||
NUM_WORKERS_PER_QUEUE: ${{ secrets.NUM_WORKERS_PER_QUEUE }}
|
||||
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
|
||||
PLAYWRIGHT_MICROSERVICE_URL: ${{ secrets.PLAYWRIGHT_MICROSERVICE_URL }}
|
||||
PORT: ${{ secrets.PORT }}
|
||||
REDIS_URL: ${{ secrets.REDIS_URL }}
|
||||
SCRAPING_BEE_API_KEY: ${{ secrets.SCRAPING_BEE_API_KEY }}
|
||||
SUPABASE_ANON_TOKEN: ${{ secrets.SUPABASE_ANON_TOKEN }}
|
||||
SUPABASE_SERVICE_TOKEN: ${{ secrets.SUPABASE_SERVICE_TOKEN }}
|
||||
SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
|
||||
TEST_API_KEY: ${{ secrets.TEST_API_KEY }}
|
||||
|
||||
jobs:
|
||||
pre-deploy:
|
||||
name: Pre-deploy checks
|
||||
runs-on: ubuntu-latest
|
||||
services:
|
||||
redis:
|
||||
image: redis
|
||||
ports:
|
||||
- 6379:6379
|
||||
steps:
|
||||
- uses: actions/checkout@v3
|
||||
- name: Set up Node.js
|
||||
uses: actions/setup-node@v3
|
||||
with:
|
||||
node-version: '20'
|
||||
- name: Install pnpm
|
||||
run: npm install -g pnpm
|
||||
- name: Install dependencies
|
||||
run: pnpm install
|
||||
working-directory: ./apps/api
|
||||
- name: Start the application
|
||||
run: npm start &
|
||||
working-directory: ./apps/api
|
||||
id: start_app
|
||||
- name: Start workers
|
||||
run: npm run workers &
|
||||
working-directory: ./apps/api
|
||||
id: start_workers
|
||||
- name: Run E2E tests
|
||||
run: |
|
||||
npm run test:prod
|
||||
working-directory: ./apps/api
|
50
.github/workflows/fly.yml
vendored
50
.github/workflows/fly.yml
vendored
|
@ -6,10 +6,60 @@ on:
|
|||
# schedule:
|
||||
# - cron: '0 */4 * * *'
|
||||
|
||||
env:
|
||||
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
|
||||
BULL_AUTH_KEY: ${{ secrets.BULL_AUTH_KEY }}
|
||||
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
|
||||
HOST: ${{ secrets.HOST }}
|
||||
LLAMAPARSE_API_KEY: ${{ secrets.LLAMAPARSE_API_KEY }}
|
||||
LOGTAIL_KEY: ${{ secrets.LOGTAIL_KEY }}
|
||||
NUM_WORKERS_PER_QUEUE: ${{ secrets.NUM_WORKERS_PER_QUEUE }}
|
||||
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
|
||||
PLAYWRIGHT_MICROSERVICE_URL: ${{ secrets.PLAYWRIGHT_MICROSERVICE_URL }}
|
||||
PORT: ${{ secrets.PORT }}
|
||||
REDIS_URL: ${{ secrets.REDIS_URL }}
|
||||
SCRAPING_BEE_API_KEY: ${{ secrets.SCRAPING_BEE_API_KEY }}
|
||||
SUPABASE_ANON_TOKEN: ${{ secrets.SUPABASE_ANON_TOKEN }}
|
||||
SUPABASE_SERVICE_TOKEN: ${{ secrets.SUPABASE_SERVICE_TOKEN }}
|
||||
SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
|
||||
TEST_API_KEY: ${{ secrets.TEST_API_KEY }}
|
||||
|
||||
jobs:
|
||||
pre-deploy:
|
||||
name: Pre-deploy checks
|
||||
runs-on: ubuntu-latest
|
||||
services:
|
||||
redis:
|
||||
image: redis
|
||||
ports:
|
||||
- 6379:6379
|
||||
steps:
|
||||
- uses: actions/checkout@v3
|
||||
- name: Set up Node.js
|
||||
uses: actions/setup-node@v3
|
||||
with:
|
||||
node-version: '20'
|
||||
- name: Install pnpm
|
||||
run: npm install -g pnpm
|
||||
- name: Install dependencies
|
||||
run: pnpm install
|
||||
working-directory: ./apps/api
|
||||
- name: Start the application
|
||||
run: npm start &
|
||||
working-directory: ./apps/api
|
||||
id: start_app
|
||||
- name: Start workers
|
||||
run: npm run workers &
|
||||
working-directory: ./apps/api
|
||||
id: start_workers
|
||||
- name: Run E2E tests
|
||||
run: |
|
||||
npm run test:prod
|
||||
working-directory: ./apps/api
|
||||
deploy:
|
||||
name: Deploy app
|
||||
runs-on: ubuntu-latest
|
||||
needs: pre-deploy
|
||||
steps:
|
||||
- uses: actions/checkout@v3
|
||||
- name: Change directory
|
||||
|
|
5
.gitignore
vendored
5
.gitignore
vendored
|
@ -1,7 +1,10 @@
|
|||
.DS_Store
|
||||
/node_modules/
|
||||
/dist/
|
||||
.env
|
||||
*.csv
|
||||
dump.rdb
|
||||
/mongo-data
|
||||
apps/js-sdk/node_modules/
|
||||
apps/js-sdk/node_modules/
|
||||
|
||||
apps/api/.env.local
|
||||
|
|
114
CONTRIBUTING.md
114
CONTRIBUTING.md
|
@ -1,4 +1,114 @@
|
|||
# Contributing
|
||||
# Contributors guide:
|
||||
|
||||
Welcome to [Firecrawl](https://firecrawl.dev) 🔥! Here are some instructions on how to get the project locally, so you can run it on your own (and contribute)
|
||||
|
||||
If you're contributing, note that the process is similar to other open source repos i.e. (fork firecrawl, make changes, run tests, PR). If you have any questions, and would like help gettin on board, reach out to hello@mendable.ai for more or submit an issue!
|
||||
|
||||
|
||||
## Running the project locally
|
||||
|
||||
First, start by installing dependencies
|
||||
1. node.js [instructions](https://nodejs.org/en/learn/getting-started/how-to-install-nodejs)
|
||||
2. pnpm [instructions](https://pnpm.io/installation)
|
||||
3. redis [instructions](https://redis.io/docs/latest/operate/oss_and_stack/install/install-redis/)
|
||||
|
||||
|
||||
Set environment variables in a .env in the /apps/api/ directoryyou can copy over the template in .env.example.
|
||||
|
||||
To start, we wont set up authentication, or any optional sub services (pdf parsing, JS blocking support, AI features )
|
||||
|
||||
.env:
|
||||
```
|
||||
# ===== Required ENVS ======
|
||||
NUM_WORKERS_PER_QUEUE=8
|
||||
PORT=3002
|
||||
HOST=0.0.0.0
|
||||
REDIS_URL=redis://localhost:6379
|
||||
|
||||
## To turn on DB authentication, you need to set up supabase.
|
||||
USE_DB_AUTHENTICATION=false
|
||||
|
||||
# ===== Optional ENVS ======
|
||||
|
||||
# Supabase Setup (used to support DB authentication, advanced logging, etc.)
|
||||
SUPABASE_ANON_TOKEN=
|
||||
SUPABASE_URL=
|
||||
SUPABASE_SERVICE_TOKEN=
|
||||
|
||||
# Other Optionals
|
||||
TEST_API_KEY= # use if you've set up authentication and want to test with a real API key
|
||||
SCRAPING_BEE_API_KEY= #Set if you'd like to use scraping Be to handle JS blocking
|
||||
OPENAI_API_KEY= # add for LLM dependednt features (image alt generation, etc.)
|
||||
BULL_AUTH_KEY= #
|
||||
LOGTAIL_KEY= # Use if you're configuring basic logging with logtail
|
||||
PLAYWRIGHT_MICROSERVICE_URL= # set if you'd like to run a playwright fallback
|
||||
LLAMAPARSE_API_KEY= #Set if you have a llamaparse key you'd like to use to parse pdfs
|
||||
|
||||
```
|
||||
|
||||
### Installing dependencies
|
||||
|
||||
First, install the dependencies using pnpm.
|
||||
|
||||
```bash
|
||||
pnpm install
|
||||
```
|
||||
|
||||
### Running the project
|
||||
|
||||
You're going to need to open 3 terminals.
|
||||
|
||||
### Terminal 1 - setting up redis
|
||||
|
||||
Run the command anywhere within your project
|
||||
|
||||
```bash
|
||||
redis-server
|
||||
```
|
||||
|
||||
### Terminal 2 - setting up workers
|
||||
|
||||
Now, navigate to the apps/api/ directory and run:
|
||||
```bash
|
||||
pnpm run workers
|
||||
```
|
||||
|
||||
This will start the workers who are responsible for processing crawl jobs.
|
||||
|
||||
### Terminal 3 - setting up the main server
|
||||
|
||||
|
||||
To do this, navigate to the apps/api/ directory and run if you don’t have this already, install pnpm here: https://pnpm.io/installation
|
||||
Next, run your server with:
|
||||
|
||||
```bash
|
||||
pnpm run start
|
||||
```
|
||||
|
||||
### Terminal 3 - sending our first request.
|
||||
|
||||
Alright: now let’s send our first request.
|
||||
|
||||
```curl
|
||||
curl -X GET http://localhost:3002/test
|
||||
```
|
||||
This should return the response Hello, world!
|
||||
|
||||
|
||||
If you’d like to test the crawl endpoint, you can run this
|
||||
|
||||
```curl
|
||||
curl -X POST http://localhost:3002/v0/crawl \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{
|
||||
"url": "https://mendable.ai"
|
||||
}'
|
||||
```
|
||||
|
||||
## Tests:
|
||||
|
||||
The best way to do this is run the test with `npm run test:local-no-auth` if you'd like to run the tests without authentication.
|
||||
|
||||
If you'd like to run the tests with authentication, run `npm run test:prod`
|
||||
|
||||
We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request.
|
||||
|
||||
|
|
31
README.md
31
README.md
|
@ -2,26 +2,30 @@
|
|||
|
||||
Crawl and convert any website into LLM-ready markdown. Build by [Mendable.ai](https://mendable.ai?ref=gfirecrawl)
|
||||
|
||||
|
||||
*This repository is currently in its early stages of development. We are in the process of merging custom modules into this mono repository. The primary objective is to enhance the accuracy of LLM responses by utilizing clean data. It is not ready for full self-host yet - we're working on it*
|
||||
|
||||
## What is Firecrawl?
|
||||
|
||||
[Firecrawl](https://firecrawl.dev?ref=github) is an API service that takes a URL, crawls it, and converts it into clean markdown. We crawl all accessible subpages and give you clean markdown for each. No sitemap required.
|
||||
|
||||
_Pst. hey, you, join our stargazers :)_
|
||||
|
||||
<img src="https://github.com/mendableai/firecrawl/assets/44934913/53c4483a-0f0e-40c6-bd84-153a07f94d29" width="200">
|
||||
|
||||
|
||||
## How to use it?
|
||||
|
||||
We provide an easy to use API with our hosted version. You can find the playground and documentation [here](https://firecrawl.dev/playground). You can also self host the backend if you'd like.
|
||||
|
||||
- [x] [API](https://firecrawl.dev/playground)
|
||||
- [x] [Python SDK](https://github.com/mendableai/firecrawl/tree/main/apps/python-sdk)
|
||||
- [X] [Node SDK](https://github.com/mendableai/firecrawl/tree/main/apps/js-sdk)
|
||||
- [x] [Langchain Integration 🦜🔗](https://python.langchain.com/docs/integrations/document_loaders/firecrawl/)
|
||||
- [x] [Llama Index Integration 🦙](https://docs.llamaindex.ai/en/stable/)
|
||||
- [X] [JS SDK](https://github.com/mendableai/firecrawl/tree/main/apps/js-sdk)
|
||||
- [x] [Llama Index Integration 🦙](https://docs.llamaindex.ai/en/latest/examples/data_connectors/WebPageDemo/#using-firecrawl-reader)
|
||||
- [ ] LangchainJS - Coming Soon
|
||||
|
||||
|
||||
Self-host. To self-host refer to guide [here](https://github.com/mendableai/firecrawl/blob/main/SELF_HOST.md).
|
||||
To run locally, refer to guide [here](https://github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md).
|
||||
|
||||
### API Key
|
||||
|
||||
|
@ -63,15 +67,16 @@ curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
|
|||
"total": 22,
|
||||
"data": [
|
||||
{
|
||||
"content": "Raw Content ",
|
||||
"markdown": "# Markdown Content",
|
||||
"provider": "web-scraper",
|
||||
"metadata": {
|
||||
"title": "Mendable | AI for CX and Sales",
|
||||
"description": "AI for CX and Sales",
|
||||
"language": null,
|
||||
"sourceURL": "https://www.mendable.ai/",
|
||||
}
|
||||
"content": "Raw Content ",
|
||||
"markdown": "# Markdown Content",
|
||||
"provider": "web-scraper",
|
||||
"metadata": {
|
||||
"title": "Mendable | AI for CX and Sales",
|
||||
"description": "AI for CX and Sales",
|
||||
"language": null,
|
||||
"sourceURL": "https://www.mendable.ai/",
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
# Self-hosting Firecrawl
|
||||
|
||||
Guide coming soon.
|
||||
Refer to [CONTRIBUTING.md](https://github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md) for instructions on how to run it locally.
|
||||
|
||||
*This repository is currently in its early stages of development. We are in the process of merging custom modules into this mono repository. The primary objective is to enhance the accuracy of LLM responses by utilizing clean data. It is not ready for full self-host yet - we're working on it*
|
||||
|
||||
|
|
BIN
apps/.DS_Store
vendored
BIN
apps/.DS_Store
vendored
Binary file not shown.
26
apps/api/.env.example
Normal file
26
apps/api/.env.example
Normal file
|
@ -0,0 +1,26 @@
|
|||
# ===== Required ENVS ======
|
||||
NUM_WORKERS_PER_QUEUE=8
|
||||
PORT=3002
|
||||
HOST=0.0.0.0
|
||||
REDIS_URL=redis://localhost:6379
|
||||
|
||||
## To turn on DB authentication, you need to set up supabase.
|
||||
USE_DB_AUTHENTICATION=true
|
||||
|
||||
# ===== Optional ENVS ======
|
||||
|
||||
# Supabase Setup (used to support DB authentication, advanced logging, etc.)
|
||||
SUPABASE_ANON_TOKEN=
|
||||
SUPABASE_URL=
|
||||
SUPABASE_SERVICE_TOKEN=
|
||||
|
||||
# Other Optionals
|
||||
TEST_API_KEY= # use if you've set up authentication and want to test with a real API key
|
||||
SCRAPING_BEE_API_KEY= #Set if you'd like to use scraping Be to handle JS blocking
|
||||
OPENAI_API_KEY= # add for LLM dependednt features (image alt generation, etc.)
|
||||
BULL_AUTH_KEY= #
|
||||
LOGTAIL_KEY= # Use if you're configuring basic logging with logtail
|
||||
PLAYWRIGHT_MICROSERVICE_URL= # set if you'd like to run a playwright fallback
|
||||
LLAMAPARSE_API_KEY= #Set if you have a llamaparse key you'd like to use to parse pdfs
|
||||
SERPER_API_KEY= #Set if you have a serper key you'd like to use as a search api
|
||||
SLACK_WEBHOOK_URL= # set if you'd like to send slack server health status messages
|
|
@ -7,6 +7,7 @@ SUPABASE_SERVICE_TOKEN=
|
|||
REDIS_URL=
|
||||
SCRAPING_BEE_API_KEY=
|
||||
OPENAI_API_KEY=
|
||||
ANTHROPIC_API_KEY=
|
||||
BULL_AUTH_KEY=
|
||||
LOGTAIL_KEY=
|
||||
PLAYWRIGHT_MICROSERVICE_URL=
|
||||
|
|
|
@ -2,4 +2,7 @@ module.exports = {
|
|||
preset: "ts-jest",
|
||||
testEnvironment: "node",
|
||||
setupFiles: ["./jest.setup.js"],
|
||||
// ignore dist folder root dir
|
||||
modulePathIgnorePatterns: ["<rootDir>/dist/"],
|
||||
|
||||
};
|
||||
|
|
309
apps/api/openapi.json
Normal file
309
apps/api/openapi.json
Normal file
|
@ -0,0 +1,309 @@
|
|||
{
|
||||
"openapi": "3.0.0",
|
||||
"info": {
|
||||
"title": "Firecrawl API",
|
||||
"version": "1.0.0",
|
||||
"description": "API for interacting with Firecrawl services to perform web scraping and crawling tasks.",
|
||||
"contact": {
|
||||
"name": "Firecrawl Support",
|
||||
"url": "https://firecrawl.dev/support",
|
||||
"email": "support@firecrawl.dev"
|
||||
}
|
||||
},
|
||||
"servers": [
|
||||
{
|
||||
"url": "https://api.firecrawl.dev/v0"
|
||||
}
|
||||
],
|
||||
"paths": {
|
||||
"/scrape": {
|
||||
"post": {
|
||||
"summary": "Scrape a single URL",
|
||||
"operationId": "scrapeSingleUrl",
|
||||
"tags": ["Scraping"],
|
||||
"security": [
|
||||
{
|
||||
"bearerAuth": []
|
||||
}
|
||||
],
|
||||
"requestBody": {
|
||||
"required": true,
|
||||
"content": {
|
||||
"application/json": {
|
||||
"schema": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"url": {
|
||||
"type": "string",
|
||||
"format": "uri",
|
||||
"description": "The URL to scrape"
|
||||
},
|
||||
"pageOptions": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"onlyMainContent": {
|
||||
"type": "boolean",
|
||||
"description": "Only return the main content of the page excluding headers, navs, footers, etc.",
|
||||
"default": false
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"required": ["url"]
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"responses": {
|
||||
"200": {
|
||||
"description": "Successful response",
|
||||
"content": {
|
||||
"application/json": {
|
||||
"schema": {
|
||||
"$ref": "#/components/schemas/ScrapeResponse"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"402": {
|
||||
"description": "Payment required"
|
||||
},
|
||||
"429": {
|
||||
"description": "Too many requests"
|
||||
},
|
||||
"500": {
|
||||
"description": "Server error"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"/crawl": {
|
||||
"post": {
|
||||
"summary": "Crawl multiple URLs based on options",
|
||||
"operationId": "crawlUrls",
|
||||
"tags": ["Crawling"],
|
||||
"security": [
|
||||
{
|
||||
"bearerAuth": []
|
||||
}
|
||||
],
|
||||
"requestBody": {
|
||||
"required": true,
|
||||
"content": {
|
||||
"application/json": {
|
||||
"schema": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"url": {
|
||||
"type": "string",
|
||||
"format": "uri",
|
||||
"description": "The base URL to start crawling from"
|
||||
},
|
||||
"crawlerOptions": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"includes": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "string"
|
||||
},
|
||||
"description": "URL patterns to include"
|
||||
},
|
||||
"excludes": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "string"
|
||||
},
|
||||
"description": "URL patterns to exclude"
|
||||
},
|
||||
"generateImgAltText": {
|
||||
"type": "boolean",
|
||||
"description": "Generate alt text for images using LLMs (must have a paid plan)",
|
||||
"default": false
|
||||
},
|
||||
"returnOnlyUrls": {
|
||||
"type": "boolean",
|
||||
"description": "If true, returns only the URLs as a list on the crawl status. Attention: the return response will be a list of URLs inside the data, not a list of documents.",
|
||||
"default": false
|
||||
},
|
||||
"limit": {
|
||||
"type": "integer",
|
||||
"description": "Maximum number of pages to crawl"
|
||||
}
|
||||
}
|
||||
},
|
||||
"pageOptions": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"onlyMainContent": {
|
||||
"type": "boolean",
|
||||
"description": "Only return the main content of the page excluding headers, navs, footers, etc.",
|
||||
"default": false
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"required": ["url"]
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"responses": {
|
||||
"200": {
|
||||
"description": "Successful response",
|
||||
"content": {
|
||||
"application/json": {
|
||||
"schema": {
|
||||
"$ref": "#/components/schemas/CrawlResponse"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"402": {
|
||||
"description": "Payment required"
|
||||
},
|
||||
"429": {
|
||||
"description": "Too many requests"
|
||||
},
|
||||
"500": {
|
||||
"description": "Server error"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"/crawl/status/{jobId}": {
|
||||
"get": {
|
||||
"tags": ["Crawl"],
|
||||
"summary": "Get the status of a crawl job",
|
||||
"operationId": "getCrawlStatus",
|
||||
"security": [
|
||||
{
|
||||
"bearerAuth": []
|
||||
}
|
||||
],
|
||||
"parameters": [
|
||||
{
|
||||
"name": "jobId",
|
||||
"in": "path",
|
||||
"description": "ID of the crawl job",
|
||||
"required": true,
|
||||
"schema": {
|
||||
"type": "string"
|
||||
}
|
||||
}
|
||||
],
|
||||
"responses": {
|
||||
"200": {
|
||||
"description": "Successful response",
|
||||
"content": {
|
||||
"application/json": {
|
||||
"schema": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"status": {
|
||||
"type": "string",
|
||||
"description": "Status of the job (completed, active, failed, paused)"
|
||||
},
|
||||
"current": {
|
||||
"type": "integer",
|
||||
"description": "Current page number"
|
||||
},
|
||||
"current_url": {
|
||||
"type": "string",
|
||||
"description": "Current URL being scraped"
|
||||
},
|
||||
"current_step": {
|
||||
"type": "string",
|
||||
"description": "Current step in the process"
|
||||
},
|
||||
"total": {
|
||||
"type": "integer",
|
||||
"description": "Total number of pages"
|
||||
},
|
||||
"data": {
|
||||
"type": "array",
|
||||
"items": {
|
||||
"$ref": "#/components/schemas/ScrapeResponse"
|
||||
},
|
||||
"description": "Data returned from the job (null when it is in progress)"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"402": {
|
||||
"description": "Payment required"
|
||||
},
|
||||
"429": {
|
||||
"description": "Too many requests"
|
||||
},
|
||||
"500": {
|
||||
"description": "Server error"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"components": {
|
||||
"securitySchemes": {
|
||||
"bearerAuth": {
|
||||
"type": "http",
|
||||
"scheme": "bearer"
|
||||
}
|
||||
},
|
||||
"schemas": {
|
||||
"ScrapeResponse": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"success": {
|
||||
"type": "boolean"
|
||||
},
|
||||
"data": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"content": {
|
||||
"type": "string"
|
||||
},
|
||||
"markdown": {
|
||||
"type": "string"
|
||||
},
|
||||
"metadata": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"title": {
|
||||
"type": "string"
|
||||
},
|
||||
"description": {
|
||||
"type": "string"
|
||||
},
|
||||
"language": {
|
||||
"type": "string",
|
||||
"nullable": true
|
||||
},
|
||||
"sourceURL": {
|
||||
"type": "string",
|
||||
"format": "uri"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"CrawlResponse": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"jobId": {
|
||||
"type": "string"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"security": [
|
||||
{
|
||||
"bearerAuth": []
|
||||
}
|
||||
]
|
||||
}
|
|
@ -10,7 +10,9 @@
|
|||
"flyio": "node dist/src/index.js",
|
||||
"start:dev": "nodemon --exec ts-node src/index.ts",
|
||||
"build": "tsc",
|
||||
"test": "jest --verbose",
|
||||
"test": "npx jest --detectOpenHandles --forceExit --openHandlesTimeout=120000 --watchAll=false --testPathIgnorePatterns='src/__tests__/e2e_noAuth/*'",
|
||||
"test:local-no-auth": "npx jest --detectOpenHandles --forceExit --openHandlesTimeout=120000 --watchAll=false --testPathIgnorePatterns='src/__tests__/e2e_withAuth/*'",
|
||||
"test:prod": "npx jest --detectOpenHandles --forceExit --openHandlesTimeout=120000 --watchAll=false --testPathIgnorePatterns='src/__tests__/e2e_noAuth/*'",
|
||||
"workers": "nodemon --exec ts-node src/services/queue-worker.ts",
|
||||
"worker:production": "node dist/src/services/queue-worker.js",
|
||||
"mongo-docker": "docker run -d -p 2717:27017 -v ./mongo-data:/data/db --name mongodb mongo:latest",
|
||||
|
@ -26,7 +28,7 @@
|
|||
"@types/bull": "^4.10.0",
|
||||
"@types/cors": "^2.8.13",
|
||||
"@types/express": "^4.17.17",
|
||||
"@types/jest": "^29.5.6",
|
||||
"@types/jest": "^29.5.12",
|
||||
"body-parser": "^1.20.1",
|
||||
"express": "^4.18.2",
|
||||
"jest": "^29.6.3",
|
||||
|
@ -39,6 +41,7 @@
|
|||
"typescript": "^5.4.2"
|
||||
},
|
||||
"dependencies": {
|
||||
"@anthropic-ai/sdk": "^0.20.5",
|
||||
"@brillout/import": "^0.2.2",
|
||||
"@bull-board/api": "^5.14.2",
|
||||
"@bull-board/express": "^5.8.0",
|
||||
|
@ -60,9 +63,11 @@
|
|||
"date-fns": "^2.29.3",
|
||||
"dotenv": "^16.3.1",
|
||||
"express-rate-limit": "^6.7.0",
|
||||
"form-data": "^4.0.0",
|
||||
"glob": "^10.3.12",
|
||||
"gpt3-tokenizer": "^1.1.5",
|
||||
"ioredis": "^5.3.2",
|
||||
"joplin-turndown-plugin-gfm": "^1.0.12",
|
||||
"keyword-extractor": "^0.0.25",
|
||||
"langchain": "^0.1.25",
|
||||
"languagedetect": "^2.0.0",
|
||||
|
@ -73,6 +78,7 @@
|
|||
"mongoose": "^8.0.3",
|
||||
"natural": "^6.3.0",
|
||||
"openai": "^4.28.4",
|
||||
"pdf-parse": "^1.1.1",
|
||||
"pos": "^0.4.2",
|
||||
"promptable": "^0.0.9",
|
||||
"puppeteer": "^22.6.3",
|
||||
|
@ -82,6 +88,7 @@
|
|||
"scrapingbee": "^1.7.4",
|
||||
"stripe": "^12.2.0",
|
||||
"turndown": "^7.1.3",
|
||||
"turndown-plugin-gfm": "^1.0.2",
|
||||
"typesense": "^1.5.4",
|
||||
"unstructured-client": "^0.9.4",
|
||||
"uuid": "^9.0.1",
|
||||
|
|
|
@ -5,6 +5,9 @@ settings:
|
|||
excludeLinksFromLockfile: false
|
||||
|
||||
dependencies:
|
||||
'@anthropic-ai/sdk':
|
||||
specifier: ^0.20.5
|
||||
version: 0.20.5
|
||||
'@brillout/import':
|
||||
specifier: ^0.2.2
|
||||
version: 0.2.3
|
||||
|
@ -68,6 +71,9 @@ dependencies:
|
|||
express-rate-limit:
|
||||
specifier: ^6.7.0
|
||||
version: 6.11.2(express@4.18.3)
|
||||
form-data:
|
||||
specifier: ^4.0.0
|
||||
version: 4.0.0
|
||||
glob:
|
||||
specifier: ^10.3.12
|
||||
version: 10.3.12
|
||||
|
@ -77,12 +83,15 @@ dependencies:
|
|||
ioredis:
|
||||
specifier: ^5.3.2
|
||||
version: 5.3.2
|
||||
joplin-turndown-plugin-gfm:
|
||||
specifier: ^1.0.12
|
||||
version: 1.0.12
|
||||
keyword-extractor:
|
||||
specifier: ^0.0.25
|
||||
version: 0.0.25
|
||||
langchain:
|
||||
specifier: ^0.1.25
|
||||
version: 0.1.25(@supabase/supabase-js@2.39.7)(axios@1.6.7)(cheerio@1.0.0-rc.12)(ioredis@5.3.2)(puppeteer@22.6.3)(redis@4.6.13)(typesense@1.7.2)
|
||||
version: 0.1.25(@supabase/supabase-js@2.39.7)(axios@1.6.7)(cheerio@1.0.0-rc.12)(ioredis@5.3.2)(pdf-parse@1.1.1)(puppeteer@22.6.3)(redis@4.6.13)(typesense@1.7.2)
|
||||
languagedetect:
|
||||
specifier: ^2.0.0
|
||||
version: 2.0.0
|
||||
|
@ -107,6 +116,9 @@ dependencies:
|
|||
openai:
|
||||
specifier: ^4.28.4
|
||||
version: 4.28.4
|
||||
pdf-parse:
|
||||
specifier: ^1.1.1
|
||||
version: 1.1.1
|
||||
pos:
|
||||
specifier: ^0.4.2
|
||||
version: 0.4.2
|
||||
|
@ -134,6 +146,9 @@ dependencies:
|
|||
turndown:
|
||||
specifier: ^7.1.3
|
||||
version: 7.1.3
|
||||
turndown-plugin-gfm:
|
||||
specifier: ^1.0.2
|
||||
version: 1.0.2
|
||||
typesense:
|
||||
specifier: ^1.5.4
|
||||
version: 1.7.2(@babel/runtime@7.24.0)
|
||||
|
@ -170,7 +185,7 @@ devDependencies:
|
|||
specifier: ^4.17.17
|
||||
version: 4.17.21
|
||||
'@types/jest':
|
||||
specifier: ^29.5.6
|
||||
specifier: ^29.5.12
|
||||
version: 29.5.12
|
||||
body-parser:
|
||||
specifier: ^1.20.1
|
||||
|
@ -213,6 +228,21 @@ packages:
|
|||
'@jridgewell/trace-mapping': 0.3.25
|
||||
dev: true
|
||||
|
||||
/@anthropic-ai/sdk@0.20.5:
|
||||
resolution: {integrity: sha512-d0ch+zp6/gHR4+2wqWV7JU1EJ7PpHc3r3F6hebovJTouY+pkaId1FuYYaVsG3l/gyqhOZUwKCMSMqcFNf+ZmWg==}
|
||||
dependencies:
|
||||
'@types/node': 18.19.22
|
||||
'@types/node-fetch': 2.6.11
|
||||
abort-controller: 3.0.0
|
||||
agentkeepalive: 4.5.0
|
||||
form-data-encoder: 1.7.2
|
||||
formdata-node: 4.4.1
|
||||
node-fetch: 2.7.0
|
||||
web-streams-polyfill: 3.3.3
|
||||
transitivePeerDependencies:
|
||||
- encoding
|
||||
dev: false
|
||||
|
||||
/@anthropic-ai/sdk@0.9.1:
|
||||
resolution: {integrity: sha512-wa1meQ2WSfoY8Uor3EdrJq0jTiZJoKoSii2ZVWRY1oN4Tlr5s59pADg9T79FTbPe1/se5c3pBeZgJL63wmuoBA==}
|
||||
dependencies:
|
||||
|
@ -2495,7 +2525,6 @@ packages:
|
|||
dependencies:
|
||||
ms: 2.1.3
|
||||
supports-color: 5.5.0
|
||||
dev: true
|
||||
|
||||
/debug@4.3.4:
|
||||
resolution: {integrity: sha512-PRWFHuSU3eDtQJPvnNY7Jcket1j0t5OuOsFzPPzsekD52Zl8qUfFIPEiswXqIvHWGVHOgX+7G/vCNNhehwxfkQ==}
|
||||
|
@ -3915,6 +3944,10 @@ packages:
|
|||
- ts-node
|
||||
dev: true
|
||||
|
||||
/joplin-turndown-plugin-gfm@1.0.12:
|
||||
resolution: {integrity: sha512-qL4+1iycQjZ1fs8zk3jSRk7cg3ROBUHk7GKtiLAQLFzLPKErnILUvz5DLszSQvz3s1sTjPbywLDISVUtBY6HaA==}
|
||||
dev: false
|
||||
|
||||
/js-tiktoken@1.0.10:
|
||||
resolution: {integrity: sha512-ZoSxbGjvGyMT13x6ACo9ebhDha/0FHdKA+OsQcMOWcm1Zs7r90Rhk5lhERLzji+3rA7EKpXCgwXcM5fF3DMpdA==}
|
||||
dependencies:
|
||||
|
@ -3994,7 +4027,7 @@ packages:
|
|||
engines: {node: '>=6'}
|
||||
dev: true
|
||||
|
||||
/langchain@0.1.25(@supabase/supabase-js@2.39.7)(axios@1.6.7)(cheerio@1.0.0-rc.12)(ioredis@5.3.2)(puppeteer@22.6.3)(redis@4.6.13)(typesense@1.7.2):
|
||||
/langchain@0.1.25(@supabase/supabase-js@2.39.7)(axios@1.6.7)(cheerio@1.0.0-rc.12)(ioredis@5.3.2)(pdf-parse@1.1.1)(puppeteer@22.6.3)(redis@4.6.13)(typesense@1.7.2):
|
||||
resolution: {integrity: sha512-sfEChvr4H2CklHdSByNBbytwBrFhgtA5kPOnwcBrxuXGg1iOaTzhVxQA0QcNcQucI3hZrsNbZjxGp+Can1ooZQ==}
|
||||
engines: {node: '>=18'}
|
||||
peerDependencies:
|
||||
|
@ -4171,6 +4204,7 @@ packages:
|
|||
ml-distance: 4.0.1
|
||||
openapi-types: 12.1.3
|
||||
p-retry: 4.6.2
|
||||
pdf-parse: 1.1.1
|
||||
puppeteer: 22.6.3(typescript@5.4.2)
|
||||
redis: 4.6.13
|
||||
uuid: 9.0.1
|
||||
|
@ -4650,6 +4684,10 @@ packages:
|
|||
resolution: {integrity: sha512-/jKZoMpw0F8GRwl4/eLROPA3cfcXtLApP0QzLmUT/HuPCZWyB7IY9ZrMeKw2O/nFIqPQB3PVM9aYm0F312AXDQ==}
|
||||
engines: {node: '>=10.5.0'}
|
||||
|
||||
/node-ensure@0.0.0:
|
||||
resolution: {integrity: sha512-DRI60hzo2oKN1ma0ckc6nQWlHU69RH6xN0sjQTjMpChPfTYvKZdcQFfdYK2RWbJcKyUizSIy/l8OTGxMAM1QDw==}
|
||||
dev: false
|
||||
|
||||
/node-fetch@2.7.0:
|
||||
resolution: {integrity: sha512-c4FRfUm/dbcWZ7U+1Wq0AwCyFL+3nt2bEw05wfxSz+DWpWsitgmSgYmy2dQdWyKC1694ELPqMs/YzUSNozLt8A==}
|
||||
engines: {node: 4.x || >=6.0.0}
|
||||
|
@ -4948,6 +4986,16 @@ packages:
|
|||
/path-to-regexp@0.1.7:
|
||||
resolution: {integrity: sha512-5DFkuoqlv1uYQKxy8omFBeJPQcdoE07Kv2sferDCrAq1ohOU+MSDswDIbnx3YAM60qIOnYa53wBhXW0EbMonrQ==}
|
||||
|
||||
/pdf-parse@1.1.1:
|
||||
resolution: {integrity: sha512-v6ZJ/efsBpGrGGknjtq9J/oC8tZWq0KWL5vQrk2GlzLEQPUDB1ex+13Rmidl1neNN358Jn9EHZw5y07FFtaC7A==}
|
||||
engines: {node: '>=6.8.1'}
|
||||
dependencies:
|
||||
debug: 3.2.7(supports-color@5.5.0)
|
||||
node-ensure: 0.0.0
|
||||
transitivePeerDependencies:
|
||||
- supports-color
|
||||
dev: false
|
||||
|
||||
/pend@1.2.0:
|
||||
resolution: {integrity: sha512-F3asv42UuXchdzt+xXqfW1OGlVBe+mxa2mqI0pg5yAHZPvFmY3Y6drSf/GQ1A86WgWEN9Kzh/WrgKa6iGcHXLg==}
|
||||
dev: false
|
||||
|
@ -5783,6 +5831,10 @@ packages:
|
|||
resolution: {integrity: sha512-AEYxH93jGFPn/a2iVAwW87VuUIkR1FVUKB77NwMF7nBTDkDrrT/Hpt/IrCJ0QXhW27jTBDcf5ZY7w6RiqTMw2Q==}
|
||||
dev: false
|
||||
|
||||
/turndown-plugin-gfm@1.0.2:
|
||||
resolution: {integrity: sha512-vwz9tfvF7XN/jE0dGoBei3FXWuvll78ohzCZQuOb+ZjWrs3a0XhQVomJEb2Qh4VHTPNRO4GPZh0V7VRbiWwkRg==}
|
||||
dev: false
|
||||
|
||||
/turndown@7.1.3:
|
||||
resolution: {integrity: sha512-Z3/iJ6IWh8VBiACWQJaA5ulPQE5E1QwvBHj00uGzdQxdRnd8fh1DPqNOJqzQDu6DkOstORrtXzf/9adB+vMtEA==}
|
||||
dependencies:
|
||||
|
|
|
@ -49,4 +49,13 @@ content-type: application/json
|
|||
|
||||
### Check Job Status
|
||||
GET https://api.firecrawl.dev/v0/crawl/status/cfcb71ac-23a3-4da5-bd85-d4e58b871d66
|
||||
Authorization: Bearer
|
||||
Authorization: Bearer
|
||||
|
||||
### Get Active Jobs Count
|
||||
GET http://localhost:3002/serverHealthCheck
|
||||
content-type: application/json
|
||||
|
||||
### Notify Server Health Check
|
||||
GET http://localhost:3002/serverHealthCheck/notify
|
||||
content-type: application/json
|
||||
|
||||
|
|
BIN
apps/api/src/.DS_Store
vendored
BIN
apps/api/src/.DS_Store
vendored
Binary file not shown.
213
apps/api/src/__tests__/e2e_noAuth/index.test.ts
Normal file
213
apps/api/src/__tests__/e2e_noAuth/index.test.ts
Normal file
|
@ -0,0 +1,213 @@
|
|||
import request from "supertest";
|
||||
import { app } from "../../index";
|
||||
import dotenv from "dotenv";
|
||||
const fs = require("fs");
|
||||
const path = require("path");
|
||||
|
||||
dotenv.config();
|
||||
|
||||
const TEST_URL = "http://127.0.0.1:3002";
|
||||
|
||||
describe("E2E Tests for API Routes with No Authentication", () => {
|
||||
let originalEnv: NodeJS.ProcessEnv;
|
||||
|
||||
// save original process.env
|
||||
beforeAll(() => {
|
||||
originalEnv = { ...process.env };
|
||||
process.env.USE_DB_AUTHENTICATION = "false";
|
||||
process.env.SUPABASE_ANON_TOKEN = "";
|
||||
process.env.SUPABASE_URL = "";
|
||||
process.env.SUPABASE_SERVICE_TOKEN = "";
|
||||
process.env.SCRAPING_BEE_API_KEY = "";
|
||||
process.env.OPENAI_API_KEY = "";
|
||||
process.env.BULL_AUTH_KEY = "";
|
||||
process.env.LOGTAIL_KEY = "";
|
||||
process.env.PLAYWRIGHT_MICROSERVICE_URL = "";
|
||||
process.env.LLAMAPARSE_API_KEY = "";
|
||||
process.env.TEST_API_KEY = "";
|
||||
});
|
||||
|
||||
// restore original process.env
|
||||
afterAll(() => {
|
||||
process.env = originalEnv;
|
||||
});
|
||||
|
||||
|
||||
describe("GET /", () => {
|
||||
it("should return Hello, world! message", async () => {
|
||||
const response = await request(TEST_URL).get("/");
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.text).toContain("SCRAPERS-JS: Hello, world! Fly.io");
|
||||
});
|
||||
});
|
||||
|
||||
describe("GET /test", () => {
|
||||
it("should return Hello, world! message", async () => {
|
||||
const response = await request(TEST_URL).get("/test");
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.text).toContain("Hello, world!");
|
||||
});
|
||||
});
|
||||
|
||||
describe("POST /v0/scrape", () => {
|
||||
it("should not require authorization", async () => {
|
||||
const response = await request(TEST_URL).post("/v0/scrape");
|
||||
expect(response.statusCode).not.toBe(401);
|
||||
});
|
||||
|
||||
it("should return an error for a blocklisted URL without requiring authorization", async () => {
|
||||
const blocklistedUrl = "https://facebook.com/fake-test";
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/scrape")
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: blocklistedUrl });
|
||||
expect(response.statusCode).toBe(403);
|
||||
expect(response.body.error).toContain("Firecrawl currently does not support social media scraping due to policy restrictions. We're actively working on building support for it.");
|
||||
});
|
||||
|
||||
it("should return a successful response", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/scrape")
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: "https://firecrawl.dev" });
|
||||
expect(response.statusCode).toBe(200);
|
||||
}, 10000); // 10 seconds timeout
|
||||
});
|
||||
|
||||
describe("POST /v0/crawl", () => {
|
||||
it("should not require authorization", async () => {
|
||||
const response = await request(TEST_URL).post("/v0/crawl");
|
||||
expect(response.statusCode).not.toBe(401);
|
||||
});
|
||||
|
||||
it("should return an error for a blocklisted URL", async () => {
|
||||
const blocklistedUrl = "https://twitter.com/fake-test";
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/crawl")
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: blocklistedUrl });
|
||||
expect(response.statusCode).toBe(403);
|
||||
expect(response.body.error).toContain("Firecrawl currently does not support social media scraping due to policy restrictions. We're actively working on building support for it.");
|
||||
});
|
||||
|
||||
it("should return a successful response", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/crawl")
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: "https://firecrawl.dev" });
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.body).toHaveProperty("jobId");
|
||||
expect(response.body.jobId).toMatch(
|
||||
/^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$/
|
||||
);
|
||||
});
|
||||
});
|
||||
|
||||
describe("POST /v0/crawlWebsitePreview", () => {
|
||||
it("should not require authorization", async () => {
|
||||
const response = await request(TEST_URL).post("/v0/crawlWebsitePreview");
|
||||
expect(response.statusCode).not.toBe(401);
|
||||
});
|
||||
|
||||
it("should return an error for a blocklisted URL", async () => {
|
||||
const blocklistedUrl = "https://instagram.com/fake-test";
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/crawlWebsitePreview")
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: blocklistedUrl });
|
||||
expect(response.statusCode).toBe(403);
|
||||
expect(response.body.error).toContain("Firecrawl currently does not support social media scraping due to policy restrictions. We're actively working on building support for it.");
|
||||
});
|
||||
|
||||
it("should return a successful response", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/crawlWebsitePreview")
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: "https://firecrawl.dev" });
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.body).toHaveProperty("jobId");
|
||||
expect(response.body.jobId).toMatch(
|
||||
/^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$/
|
||||
);
|
||||
});
|
||||
});
|
||||
|
||||
describe("POST /v0/search", () => {
|
||||
it("should require not authorization", async () => {
|
||||
const response = await request(TEST_URL).post("/v0/search");
|
||||
expect(response.statusCode).not.toBe(401);
|
||||
});
|
||||
|
||||
it("should return no error response with an invalid API key", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/search")
|
||||
.set("Authorization", `Bearer invalid-api-key`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ query: "test" });
|
||||
expect(response.statusCode).not.toBe(401);
|
||||
});
|
||||
|
||||
it("should return a successful response without a valid API key", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/search")
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ query: "test" });
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.body).toHaveProperty("success");
|
||||
expect(response.body.success).toBe(true);
|
||||
expect(response.body).toHaveProperty("data");
|
||||
}, 20000);
|
||||
});
|
||||
|
||||
describe("GET /v0/crawl/status/:jobId", () => {
|
||||
it("should not require authorization", async () => {
|
||||
const response = await request(TEST_URL).get("/v0/crawl/status/123");
|
||||
expect(response.statusCode).not.toBe(401);
|
||||
});
|
||||
|
||||
it("should return Job not found for invalid job ID", async () => {
|
||||
const response = await request(TEST_URL).get(
|
||||
"/v0/crawl/status/invalidJobId"
|
||||
);
|
||||
expect(response.statusCode).toBe(404);
|
||||
});
|
||||
|
||||
it("should return a successful response for a valid crawl job", async () => {
|
||||
const crawlResponse = await request(TEST_URL)
|
||||
.post("/v0/crawl")
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: "https://firecrawl.dev" });
|
||||
expect(crawlResponse.statusCode).toBe(200);
|
||||
|
||||
const response = await request(TEST_URL).get(
|
||||
`/v0/crawl/status/${crawlResponse.body.jobId}`
|
||||
);
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.body).toHaveProperty("status");
|
||||
expect(response.body.status).toBe("active");
|
||||
|
||||
// wait for 30 seconds
|
||||
await new Promise((r) => setTimeout(r, 30000));
|
||||
|
||||
const completedResponse = await request(TEST_URL).get(
|
||||
`/v0/crawl/status/${crawlResponse.body.jobId}`
|
||||
);
|
||||
expect(completedResponse.statusCode).toBe(200);
|
||||
expect(completedResponse.body).toHaveProperty("status");
|
||||
expect(completedResponse.body.status).toBe("completed");
|
||||
expect(completedResponse.body).toHaveProperty("data");
|
||||
expect(completedResponse.body.data[0]).toHaveProperty("content");
|
||||
expect(completedResponse.body.data[0]).toHaveProperty("markdown");
|
||||
expect(completedResponse.body.data[0]).toHaveProperty("metadata");
|
||||
expect(completedResponse.body.data[0].content).toContain("🔥 FireCrawl");
|
||||
}, 60000); // 60 seconds
|
||||
});
|
||||
|
||||
describe("GET /is-production", () => {
|
||||
it("should return the production status", async () => {
|
||||
const response = await request(TEST_URL).get("/is-production");
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.body).toHaveProperty("isProduction");
|
||||
});
|
||||
});
|
||||
});
|
260
apps/api/src/__tests__/e2e_withAuth/index.test.ts
Normal file
260
apps/api/src/__tests__/e2e_withAuth/index.test.ts
Normal file
|
@ -0,0 +1,260 @@
|
|||
import request from "supertest";
|
||||
import { app } from "../../index";
|
||||
import dotenv from "dotenv";
|
||||
|
||||
dotenv.config();
|
||||
|
||||
// const TEST_URL = 'http://localhost:3002'
|
||||
const TEST_URL = "http://127.0.0.1:3002";
|
||||
|
||||
|
||||
describe("E2E Tests for API Routes", () => {
|
||||
beforeAll(() => {
|
||||
process.env.USE_DB_AUTHENTICATION = "true";
|
||||
});
|
||||
|
||||
afterAll(() => {
|
||||
delete process.env.USE_DB_AUTHENTICATION;
|
||||
});
|
||||
describe("GET /", () => {
|
||||
it("should return Hello, world! message", async () => {
|
||||
const response = await request(TEST_URL).get("/");
|
||||
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.text).toContain("SCRAPERS-JS: Hello, world! Fly.io");
|
||||
});
|
||||
});
|
||||
|
||||
describe("GET /test", () => {
|
||||
it("should return Hello, world! message", async () => {
|
||||
const response = await request(TEST_URL).get("/test");
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.text).toContain("Hello, world!");
|
||||
});
|
||||
});
|
||||
|
||||
describe("POST /v0/scrape", () => {
|
||||
it("should require authorization", async () => {
|
||||
const response = await request(app).post("/v0/scrape");
|
||||
expect(response.statusCode).toBe(401);
|
||||
});
|
||||
|
||||
it("should return an error response with an invalid API key", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/scrape")
|
||||
.set("Authorization", `Bearer invalid-api-key`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: "https://firecrawl.dev" });
|
||||
expect(response.statusCode).toBe(401);
|
||||
});
|
||||
|
||||
it("should return an error for a blocklisted URL", async () => {
|
||||
const blocklistedUrl = "https://facebook.com/fake-test";
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/scrape")
|
||||
.set("Authorization", `Bearer ${process.env.TEST_API_KEY}`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: blocklistedUrl });
|
||||
expect(response.statusCode).toBe(403);
|
||||
expect(response.body.error).toContain("Firecrawl currently does not support social media scraping due to policy restrictions. We're actively working on building support for it.");
|
||||
});
|
||||
|
||||
it("should return a successful response with a valid preview token", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/scrape")
|
||||
.set("Authorization", `Bearer this_is_just_a_preview_token`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: "https://firecrawl.dev" });
|
||||
expect(response.statusCode).toBe(200);
|
||||
}, 10000); // 10 seconds timeout
|
||||
|
||||
it("should return a successful response with a valid API key", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/scrape")
|
||||
.set("Authorization", `Bearer ${process.env.TEST_API_KEY}`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: "https://firecrawl.dev" });
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.body).toHaveProperty("data");
|
||||
expect(response.body.data).toHaveProperty("content");
|
||||
expect(response.body.data).toHaveProperty("markdown");
|
||||
expect(response.body.data).toHaveProperty("metadata");
|
||||
expect(response.body.data.content).toContain("🔥 FireCrawl");
|
||||
}, 30000); // 30 seconds timeout
|
||||
});
|
||||
|
||||
describe("POST /v0/crawl", () => {
|
||||
it("should require authorization", async () => {
|
||||
const response = await request(TEST_URL).post("/v0/crawl");
|
||||
expect(response.statusCode).toBe(401);
|
||||
});
|
||||
|
||||
it("should return an error response with an invalid API key", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/crawl")
|
||||
.set("Authorization", `Bearer invalid-api-key`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: "https://firecrawl.dev" });
|
||||
expect(response.statusCode).toBe(401);
|
||||
});
|
||||
|
||||
it("should return an error for a blocklisted URL", async () => {
|
||||
const blocklistedUrl = "https://twitter.com/fake-test";
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/crawl")
|
||||
.set("Authorization", `Bearer ${process.env.TEST_API_KEY}`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: blocklistedUrl });
|
||||
expect(response.statusCode).toBe(403);
|
||||
expect(response.body.error).toContain("Firecrawl currently does not support social media scraping due to policy restrictions. We're actively working on building support for it.");
|
||||
});
|
||||
|
||||
it("should return a successful response with a valid API key", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/crawl")
|
||||
.set("Authorization", `Bearer ${process.env.TEST_API_KEY}`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: "https://firecrawl.dev" });
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.body).toHaveProperty("jobId");
|
||||
expect(response.body.jobId).toMatch(
|
||||
/^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$/
|
||||
);
|
||||
});
|
||||
|
||||
|
||||
// Additional tests for insufficient credits?
|
||||
});
|
||||
|
||||
describe("POST /v0/crawlWebsitePreview", () => {
|
||||
it("should require authorization", async () => {
|
||||
const response = await request(TEST_URL).post(
|
||||
"/v0/crawlWebsitePreview"
|
||||
);
|
||||
expect(response.statusCode).toBe(401);
|
||||
});
|
||||
|
||||
it("should return an error response with an invalid API key", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/crawlWebsitePreview")
|
||||
.set("Authorization", `Bearer invalid-api-key`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: "https://firecrawl.dev" });
|
||||
expect(response.statusCode).toBe(401);
|
||||
});
|
||||
|
||||
it("should return an error for a blocklisted URL", async () => {
|
||||
const blocklistedUrl = "https://instagram.com/fake-test";
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/crawlWebsitePreview")
|
||||
.set("Authorization", `Bearer ${process.env.TEST_API_KEY}`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: blocklistedUrl });
|
||||
expect(response.statusCode).toBe(403);
|
||||
expect(response.body.error).toContain("Firecrawl currently does not support social media scraping due to policy restrictions. We're actively working on building support for it.");
|
||||
});
|
||||
|
||||
it("should return a successful response with a valid API key", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/crawlWebsitePreview")
|
||||
.set("Authorization", `Bearer this_is_just_a_preview_token`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: "https://firecrawl.dev" });
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.body).toHaveProperty("jobId");
|
||||
expect(response.body.jobId).toMatch(
|
||||
/^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$/
|
||||
);
|
||||
});
|
||||
});
|
||||
|
||||
describe("POST /v0/search", () => {
|
||||
it("should require authorization", async () => {
|
||||
const response = await request(TEST_URL).post("/v0/search");
|
||||
expect(response.statusCode).toBe(401);
|
||||
});
|
||||
|
||||
it("should return an error response with an invalid API key", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/search")
|
||||
.set("Authorization", `Bearer invalid-api-key`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ query: "test" });
|
||||
expect(response.statusCode).toBe(401);
|
||||
});
|
||||
|
||||
it("should return a successful response with a valid API key", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.post("/v0/search")
|
||||
.set("Authorization", `Bearer ${process.env.TEST_API_KEY}`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ query: "test" });
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.body).toHaveProperty("success");
|
||||
expect(response.body.success).toBe(true);
|
||||
expect(response.body).toHaveProperty("data");
|
||||
}, 30000); // 30 seconds timeout
|
||||
});
|
||||
|
||||
describe("GET /v0/crawl/status/:jobId", () => {
|
||||
it("should require authorization", async () => {
|
||||
const response = await request(TEST_URL).get("/v0/crawl/status/123");
|
||||
expect(response.statusCode).toBe(401);
|
||||
});
|
||||
|
||||
it("should return an error response with an invalid API key", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.get("/v0/crawl/status/123")
|
||||
.set("Authorization", `Bearer invalid-api-key`);
|
||||
expect(response.statusCode).toBe(401);
|
||||
});
|
||||
|
||||
it("should return Job not found for invalid job ID", async () => {
|
||||
const response = await request(TEST_URL)
|
||||
.get("/v0/crawl/status/invalidJobId")
|
||||
.set("Authorization", `Bearer ${process.env.TEST_API_KEY}`);
|
||||
expect(response.statusCode).toBe(404);
|
||||
});
|
||||
|
||||
it("should return a successful response for a valid crawl job", async () => {
|
||||
const crawlResponse = await request(TEST_URL)
|
||||
.post("/v0/crawl")
|
||||
.set("Authorization", `Bearer ${process.env.TEST_API_KEY}`)
|
||||
.set("Content-Type", "application/json")
|
||||
.send({ url: "https://firecrawl.dev" });
|
||||
expect(crawlResponse.statusCode).toBe(200);
|
||||
|
||||
const response = await request(TEST_URL)
|
||||
.get(`/v0/crawl/status/${crawlResponse.body.jobId}`)
|
||||
.set("Authorization", `Bearer ${process.env.TEST_API_KEY}`);
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.body).toHaveProperty("status");
|
||||
expect(response.body.status).toBe("active");
|
||||
|
||||
// wait for 30 seconds
|
||||
await new Promise((r) => setTimeout(r, 30000));
|
||||
|
||||
const completedResponse = await request(TEST_URL)
|
||||
.get(`/v0/crawl/status/${crawlResponse.body.jobId}`)
|
||||
.set("Authorization", `Bearer ${process.env.TEST_API_KEY}`);
|
||||
expect(completedResponse.statusCode).toBe(200);
|
||||
expect(completedResponse.body).toHaveProperty("status");
|
||||
expect(completedResponse.body.status).toBe("completed");
|
||||
expect(completedResponse.body).toHaveProperty("data");
|
||||
expect(completedResponse.body.data[0]).toHaveProperty("content");
|
||||
expect(completedResponse.body.data[0]).toHaveProperty("markdown");
|
||||
expect(completedResponse.body.data[0]).toHaveProperty("metadata");
|
||||
expect(completedResponse.body.data[0].content).toContain(
|
||||
"🔥 FireCrawl"
|
||||
);
|
||||
}, 60000); // 60 seconds
|
||||
});
|
||||
|
||||
describe("GET /is-production", () => {
|
||||
it("should return the production status", async () => {
|
||||
const response = await request(TEST_URL).get("/is-production");
|
||||
expect(response.statusCode).toBe(200);
|
||||
expect(response.body).toHaveProperty("isProduction");
|
||||
});
|
||||
});
|
||||
});
|
74
apps/api/src/controllers/auth.ts
Normal file
74
apps/api/src/controllers/auth.ts
Normal file
|
@ -0,0 +1,74 @@
|
|||
import { parseApi } from "../../src/lib/parseApi";
|
||||
import { getRateLimiter } from "../../src/services/rate-limiter";
|
||||
import { AuthResponse, RateLimiterMode } from "../../src/types";
|
||||
import { supabase_service } from "../../src/services/supabase";
|
||||
import { withAuth } from "../../src/lib/withAuth";
|
||||
|
||||
|
||||
export async function authenticateUser(req, res, mode?: RateLimiterMode) : Promise<AuthResponse> {
|
||||
return withAuth(supaAuthenticateUser)(req, res, mode);
|
||||
}
|
||||
|
||||
export async function supaAuthenticateUser(
|
||||
req,
|
||||
res,
|
||||
mode?: RateLimiterMode
|
||||
): Promise<{
|
||||
success: boolean;
|
||||
team_id?: string;
|
||||
error?: string;
|
||||
status?: number;
|
||||
}> {
|
||||
|
||||
const authHeader = req.headers.authorization;
|
||||
if (!authHeader) {
|
||||
return { success: false, error: "Unauthorized", status: 401 };
|
||||
}
|
||||
const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
|
||||
if (!token) {
|
||||
return {
|
||||
success: false,
|
||||
error: "Unauthorized: Token missing",
|
||||
status: 401,
|
||||
};
|
||||
}
|
||||
|
||||
try {
|
||||
const incomingIP = (req.headers["x-forwarded-for"] ||
|
||||
req.socket.remoteAddress) as string;
|
||||
const iptoken = incomingIP + token;
|
||||
await getRateLimiter(
|
||||
token === "this_is_just_a_preview_token" ? RateLimiterMode.Preview : mode
|
||||
).consume(iptoken);
|
||||
} catch (rateLimiterRes) {
|
||||
console.error(rateLimiterRes);
|
||||
return {
|
||||
success: false,
|
||||
error: "Rate limit exceeded. Too many requests, try again in 1 minute.",
|
||||
status: 429,
|
||||
};
|
||||
}
|
||||
|
||||
if (
|
||||
token === "this_is_just_a_preview_token" &&
|
||||
(mode === RateLimiterMode.Scrape || mode === RateLimiterMode.Preview)
|
||||
) {
|
||||
return { success: true, team_id: "preview" };
|
||||
}
|
||||
|
||||
const normalizedApi = parseApi(token);
|
||||
// make sure api key is valid, based on the api_keys table in supabase
|
||||
const { data, error } = await supabase_service
|
||||
.from("api_keys")
|
||||
.select("*")
|
||||
.eq("key", normalizedApi);
|
||||
if (error || !data || data.length === 0) {
|
||||
return {
|
||||
success: false,
|
||||
error: "Unauthorized: Invalid token",
|
||||
status: 401,
|
||||
};
|
||||
}
|
||||
|
||||
return { success: true, team_id: data[0].team_id };
|
||||
}
|
36
apps/api/src/controllers/crawl-status.ts
Normal file
36
apps/api/src/controllers/crawl-status.ts
Normal file
|
@ -0,0 +1,36 @@
|
|||
import { Request, Response } from "express";
|
||||
import { authenticateUser } from "./auth";
|
||||
import { RateLimiterMode } from "../../src/types";
|
||||
import { addWebScraperJob } from "../../src/services/queue-jobs";
|
||||
import { getWebScraperQueue } from "../../src/services/queue-service";
|
||||
|
||||
export async function crawlStatusController(req: Request, res: Response) {
|
||||
try {
|
||||
const { success, team_id, error, status } = await authenticateUser(
|
||||
req,
|
||||
res,
|
||||
RateLimiterMode.CrawlStatus
|
||||
);
|
||||
if (!success) {
|
||||
return res.status(status).json({ error });
|
||||
}
|
||||
const job = await getWebScraperQueue().getJob(req.params.jobId);
|
||||
if (!job) {
|
||||
return res.status(404).json({ error: "Job not found" });
|
||||
}
|
||||
|
||||
const { current, current_url, total, current_step } = await job.progress();
|
||||
res.json({
|
||||
status: await job.getState(),
|
||||
// progress: job.progress(),
|
||||
current: current,
|
||||
current_url: current_url,
|
||||
current_step: current_step,
|
||||
total: total,
|
||||
data: job.returnvalue,
|
||||
});
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
}
|
83
apps/api/src/controllers/crawl.ts
Normal file
83
apps/api/src/controllers/crawl.ts
Normal file
|
@ -0,0 +1,83 @@
|
|||
import { Request, Response } from "express";
|
||||
import { WebScraperDataProvider } from "../../src/scraper/WebScraper";
|
||||
import { billTeam } from "../../src/services/billing/credit_billing";
|
||||
import { checkTeamCredits } from "../../src/services/billing/credit_billing";
|
||||
import { authenticateUser } from "./auth";
|
||||
import { RateLimiterMode } from "../../src/types";
|
||||
import { addWebScraperJob } from "../../src/services/queue-jobs";
|
||||
import { isUrlBlocked } from "../../src/scraper/WebScraper/utils/blocklist";
|
||||
|
||||
export async function crawlController(req: Request, res: Response) {
|
||||
try {
|
||||
const { success, team_id, error, status } = await authenticateUser(
|
||||
req,
|
||||
res,
|
||||
RateLimiterMode.Crawl
|
||||
);
|
||||
if (!success) {
|
||||
return res.status(status).json({ error });
|
||||
}
|
||||
|
||||
const { success: creditsCheckSuccess, message: creditsCheckMessage } =
|
||||
await checkTeamCredits(team_id, 1);
|
||||
if (!creditsCheckSuccess) {
|
||||
return res.status(402).json({ error: "Insufficient credits" });
|
||||
}
|
||||
|
||||
const url = req.body.url;
|
||||
if (!url) {
|
||||
return res.status(400).json({ error: "Url is required" });
|
||||
}
|
||||
|
||||
if (isUrlBlocked(url)) {
|
||||
return res.status(403).json({ error: "Firecrawl currently does not support social media scraping due to policy restrictions. We're actively working on building support for it." });
|
||||
}
|
||||
|
||||
const mode = req.body.mode ?? "crawl";
|
||||
const crawlerOptions = req.body.crawlerOptions ?? {};
|
||||
const pageOptions = req.body.pageOptions ?? { onlyMainContent: false };
|
||||
|
||||
if (mode === "single_urls" && !url.includes(",")) {
|
||||
try {
|
||||
const a = new WebScraperDataProvider();
|
||||
await a.setOptions({
|
||||
mode: "single_urls",
|
||||
urls: [url],
|
||||
crawlerOptions: {
|
||||
returnOnlyUrls: true,
|
||||
},
|
||||
pageOptions: pageOptions,
|
||||
});
|
||||
|
||||
const docs = await a.getDocuments(false, (progress) => {
|
||||
job.progress({
|
||||
current: progress.current,
|
||||
total: progress.total,
|
||||
current_step: "SCRAPING",
|
||||
current_url: progress.currentDocumentUrl,
|
||||
});
|
||||
});
|
||||
return res.json({
|
||||
success: true,
|
||||
documents: docs,
|
||||
});
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
}
|
||||
const job = await addWebScraperJob({
|
||||
url: url,
|
||||
mode: mode ?? "crawl", // fix for single urls not working
|
||||
crawlerOptions: { ...crawlerOptions },
|
||||
team_id: team_id,
|
||||
pageOptions: pageOptions,
|
||||
origin: req.body.origin ?? "api",
|
||||
});
|
||||
|
||||
res.json({ jobId: job.id });
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
}
|
45
apps/api/src/controllers/crawlPreview.ts
Normal file
45
apps/api/src/controllers/crawlPreview.ts
Normal file
|
@ -0,0 +1,45 @@
|
|||
import { Request, Response } from "express";
|
||||
import { authenticateUser } from "./auth";
|
||||
import { RateLimiterMode } from "../../src/types";
|
||||
import { addWebScraperJob } from "../../src/services/queue-jobs";
|
||||
import { isUrlBlocked } from "../../src/scraper/WebScraper/utils/blocklist";
|
||||
|
||||
export async function crawlPreviewController(req: Request, res: Response) {
|
||||
try {
|
||||
const { success, team_id, error, status } = await authenticateUser(
|
||||
req,
|
||||
res,
|
||||
RateLimiterMode.Preview
|
||||
);
|
||||
if (!success) {
|
||||
return res.status(status).json({ error });
|
||||
}
|
||||
// authenticate on supabase
|
||||
const url = req.body.url;
|
||||
if (!url) {
|
||||
return res.status(400).json({ error: "Url is required" });
|
||||
}
|
||||
|
||||
if (isUrlBlocked(url)) {
|
||||
return res.status(403).json({ error: "Firecrawl currently does not support social media scraping due to policy restrictions. We're actively working on building support for it." });
|
||||
}
|
||||
|
||||
const mode = req.body.mode ?? "crawl";
|
||||
const crawlerOptions = req.body.crawlerOptions ?? {};
|
||||
const pageOptions = req.body.pageOptions ?? { onlyMainContent: false };
|
||||
|
||||
const job = await addWebScraperJob({
|
||||
url: url,
|
||||
mode: mode ?? "crawl", // fix for single urls not working
|
||||
crawlerOptions: { ...crawlerOptions, limit: 5, maxCrawledLinks: 5 },
|
||||
team_id: "preview",
|
||||
pageOptions: pageOptions,
|
||||
origin: "website-preview",
|
||||
});
|
||||
|
||||
res.json({ jobId: job.id });
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
}
|
121
apps/api/src/controllers/scrape.ts
Normal file
121
apps/api/src/controllers/scrape.ts
Normal file
|
@ -0,0 +1,121 @@
|
|||
import { Request, Response } from "express";
|
||||
import { WebScraperDataProvider } from "../scraper/WebScraper";
|
||||
import { billTeam, checkTeamCredits } from "../services/billing/credit_billing";
|
||||
import { authenticateUser } from "./auth";
|
||||
import { RateLimiterMode } from "../types";
|
||||
import { logJob } from "../services/logging/log_job";
|
||||
import { Document } from "../lib/entities";
|
||||
import { isUrlBlocked } from "../scraper/WebScraper/utils/blocklist"; // Import the isUrlBlocked function
|
||||
|
||||
export async function scrapeHelper(
|
||||
req: Request,
|
||||
team_id: string,
|
||||
crawlerOptions: any,
|
||||
pageOptions: any
|
||||
): Promise<{
|
||||
success: boolean;
|
||||
error?: string;
|
||||
data?: Document;
|
||||
returnCode: number;
|
||||
}> {
|
||||
const url = req.body.url;
|
||||
if (!url) {
|
||||
return { success: false, error: "Url is required", returnCode: 400 };
|
||||
}
|
||||
|
||||
if (isUrlBlocked(url)) {
|
||||
return { success: false, error: "Firecrawl currently does not support social media scraping due to policy restrictions. We're actively working on building support for it.", returnCode: 403 };
|
||||
}
|
||||
|
||||
const a = new WebScraperDataProvider();
|
||||
await a.setOptions({
|
||||
mode: "single_urls",
|
||||
urls: [url],
|
||||
crawlerOptions: {
|
||||
...crawlerOptions,
|
||||
},
|
||||
pageOptions: pageOptions,
|
||||
});
|
||||
|
||||
const docs = await a.getDocuments(false);
|
||||
// make sure doc.content is not empty
|
||||
const filteredDocs = docs.filter(
|
||||
(doc: { content?: string }) => doc.content && doc.content.trim().length > 0
|
||||
);
|
||||
if (filteredDocs.length === 0) {
|
||||
return { success: true, error: "No page found", returnCode: 200 };
|
||||
}
|
||||
|
||||
const { success, credit_usage } = await billTeam(
|
||||
team_id,
|
||||
filteredDocs.length
|
||||
);
|
||||
if (!success) {
|
||||
return {
|
||||
success: false,
|
||||
error:
|
||||
"Failed to bill team. Insufficient credits or subscription not found.",
|
||||
returnCode: 402,
|
||||
};
|
||||
}
|
||||
|
||||
return {
|
||||
success: true,
|
||||
data: filteredDocs[0],
|
||||
returnCode: 200,
|
||||
};
|
||||
}
|
||||
|
||||
export async function scrapeController(req: Request, res: Response) {
|
||||
try {
|
||||
// make sure to authenticate user first, Bearer <token>
|
||||
const { success, team_id, error, status } = await authenticateUser(
|
||||
req,
|
||||
res,
|
||||
RateLimiterMode.Scrape
|
||||
);
|
||||
if (!success) {
|
||||
return res.status(status).json({ error });
|
||||
}
|
||||
const crawlerOptions = req.body.crawlerOptions ?? {};
|
||||
const pageOptions = req.body.pageOptions ?? { onlyMainContent: false };
|
||||
const origin = req.body.origin ?? "api";
|
||||
|
||||
try {
|
||||
const { success: creditsCheckSuccess, message: creditsCheckMessage } =
|
||||
await checkTeamCredits(team_id, 1);
|
||||
if (!creditsCheckSuccess) {
|
||||
return res.status(402).json({ error: "Insufficient credits" });
|
||||
}
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: "Internal server error" });
|
||||
}
|
||||
const startTime = new Date().getTime();
|
||||
const result = await scrapeHelper(
|
||||
req,
|
||||
team_id,
|
||||
crawlerOptions,
|
||||
pageOptions
|
||||
);
|
||||
const endTime = new Date().getTime();
|
||||
const timeTakenInSeconds = (endTime - startTime) / 1000;
|
||||
logJob({
|
||||
success: result.success,
|
||||
message: result.error,
|
||||
num_docs: 1,
|
||||
docs: [result.data],
|
||||
time_taken: timeTakenInSeconds,
|
||||
team_id: team_id,
|
||||
mode: "scrape",
|
||||
url: req.body.url,
|
||||
crawlerOptions: crawlerOptions,
|
||||
pageOptions: pageOptions,
|
||||
origin: origin,
|
||||
});
|
||||
return res.status(result.returnCode).json(result);
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
}
|
156
apps/api/src/controllers/search.ts
Normal file
156
apps/api/src/controllers/search.ts
Normal file
|
@ -0,0 +1,156 @@
|
|||
import { Request, Response } from "express";
|
||||
import { WebScraperDataProvider } from "../scraper/WebScraper";
|
||||
import { billTeam, checkTeamCredits } from "../services/billing/credit_billing";
|
||||
import { authenticateUser } from "./auth";
|
||||
import { RateLimiterMode } from "../types";
|
||||
import { logJob } from "../services/logging/log_job";
|
||||
import { PageOptions, SearchOptions } from "../lib/entities";
|
||||
import { search } from "../search";
|
||||
import { isUrlBlocked } from "../scraper/WebScraper/utils/blocklist";
|
||||
|
||||
export async function searchHelper(
|
||||
req: Request,
|
||||
team_id: string,
|
||||
crawlerOptions: any,
|
||||
pageOptions: PageOptions,
|
||||
searchOptions: SearchOptions
|
||||
): Promise<{
|
||||
success: boolean;
|
||||
error?: string;
|
||||
data?: any;
|
||||
returnCode: number;
|
||||
}> {
|
||||
const query = req.body.query;
|
||||
const advanced = false;
|
||||
if (!query) {
|
||||
return { success: false, error: "Query is required", returnCode: 400 };
|
||||
}
|
||||
|
||||
const tbs = searchOptions.tbs ?? null;
|
||||
const filter = searchOptions.filter ?? null;
|
||||
|
||||
let res = await search({query: query, advanced: advanced, num_results: searchOptions.limit ?? 7, tbs: tbs, filter: filter});
|
||||
|
||||
let justSearch = pageOptions.fetchPageContent === false;
|
||||
|
||||
if (justSearch) {
|
||||
return { success: true, data: res, returnCode: 200 };
|
||||
}
|
||||
|
||||
res = res.filter((r) => !isUrlBlocked(r));
|
||||
|
||||
if (res.length === 0) {
|
||||
return { success: true, error: "No search results found", returnCode: 200 };
|
||||
}
|
||||
|
||||
// filter out social media links
|
||||
|
||||
const a = new WebScraperDataProvider();
|
||||
await a.setOptions({
|
||||
mode: "single_urls",
|
||||
urls: res.map((r) => r),
|
||||
crawlerOptions: {
|
||||
...crawlerOptions,
|
||||
},
|
||||
pageOptions: {
|
||||
...pageOptions,
|
||||
onlyMainContent: pageOptions?.onlyMainContent ?? true,
|
||||
fetchPageContent: pageOptions?.fetchPageContent ?? true,
|
||||
fallback: false,
|
||||
},
|
||||
});
|
||||
|
||||
const docs = await a.getDocuments(true);
|
||||
if (docs.length === 0) {
|
||||
return { success: true, error: "No search results found", returnCode: 200 };
|
||||
}
|
||||
|
||||
// make sure doc.content is not empty
|
||||
const filteredDocs = docs.filter(
|
||||
(doc: { content?: string }) => doc.content && doc.content.trim().length > 0
|
||||
);
|
||||
|
||||
if (filteredDocs.length === 0) {
|
||||
return { success: true, error: "No page found", returnCode: 200 };
|
||||
}
|
||||
|
||||
const { success, credit_usage } = await billTeam(
|
||||
team_id,
|
||||
filteredDocs.length
|
||||
);
|
||||
if (!success) {
|
||||
return {
|
||||
success: false,
|
||||
error:
|
||||
"Failed to bill team. Insufficient credits or subscription not found.",
|
||||
returnCode: 402,
|
||||
};
|
||||
}
|
||||
|
||||
return {
|
||||
success: true,
|
||||
data: filteredDocs,
|
||||
returnCode: 200,
|
||||
};
|
||||
}
|
||||
|
||||
export async function searchController(req: Request, res: Response) {
|
||||
try {
|
||||
// make sure to authenticate user first, Bearer <token>
|
||||
const { success, team_id, error, status } = await authenticateUser(
|
||||
req,
|
||||
res,
|
||||
RateLimiterMode.Search
|
||||
);
|
||||
if (!success) {
|
||||
return res.status(status).json({ error });
|
||||
}
|
||||
const crawlerOptions = req.body.crawlerOptions ?? {};
|
||||
const pageOptions = req.body.pageOptions ?? {
|
||||
onlyMainContent: true,
|
||||
fetchPageContent: true,
|
||||
fallback: false,
|
||||
};
|
||||
const origin = req.body.origin ?? "api";
|
||||
|
||||
const searchOptions = req.body.searchOptions ?? { limit: 7 };
|
||||
|
||||
try {
|
||||
const { success: creditsCheckSuccess, message: creditsCheckMessage } =
|
||||
await checkTeamCredits(team_id, 1);
|
||||
if (!creditsCheckSuccess) {
|
||||
return res.status(402).json({ error: "Insufficient credits" });
|
||||
}
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: "Internal server error" });
|
||||
}
|
||||
const startTime = new Date().getTime();
|
||||
const result = await searchHelper(
|
||||
req,
|
||||
team_id,
|
||||
crawlerOptions,
|
||||
pageOptions,
|
||||
searchOptions
|
||||
);
|
||||
const endTime = new Date().getTime();
|
||||
const timeTakenInSeconds = (endTime - startTime) / 1000;
|
||||
logJob({
|
||||
success: result.success,
|
||||
message: result.error,
|
||||
num_docs: 1,
|
||||
docs: [result.data],
|
||||
time_taken: timeTakenInSeconds,
|
||||
team_id: team_id,
|
||||
mode: "search",
|
||||
url: req.body.url,
|
||||
crawlerOptions: crawlerOptions,
|
||||
pageOptions: pageOptions,
|
||||
origin: origin,
|
||||
});
|
||||
return res.status(result.returnCode).json(result);
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
}
|
25
apps/api/src/controllers/status.ts
Normal file
25
apps/api/src/controllers/status.ts
Normal file
|
@ -0,0 +1,25 @@
|
|||
import { Request, Response } from "express";
|
||||
import { getWebScraperQueue } from "../../src/services/queue-service";
|
||||
|
||||
export async function crawlJobStatusPreviewController(req: Request, res: Response) {
|
||||
try {
|
||||
const job = await getWebScraperQueue().getJob(req.params.jobId);
|
||||
if (!job) {
|
||||
return res.status(404).json({ error: "Job not found" });
|
||||
}
|
||||
|
||||
const { current, current_url, total, current_step } = await job.progress();
|
||||
res.json({
|
||||
status: await job.getState(),
|
||||
// progress: job.progress(),
|
||||
current: current,
|
||||
current_url: current_url,
|
||||
current_step: current_step,
|
||||
total: total,
|
||||
data: job.returnvalue,
|
||||
});
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
}
|
|
@ -3,13 +3,8 @@ import bodyParser from "body-parser";
|
|||
import cors from "cors";
|
||||
import "dotenv/config";
|
||||
import { getWebScraperQueue } from "./services/queue-service";
|
||||
import { addWebScraperJob } from "./services/queue-jobs";
|
||||
import { supabase_service } from "./services/supabase";
|
||||
import { WebScraperDataProvider } from "./scraper/WebScraper";
|
||||
import { billTeam, checkTeamCredits } from "./services/billing/credit_billing";
|
||||
import { getRateLimiter, redisClient } from "./services/rate-limiter";
|
||||
import { parseApi } from "./lib/parseApi";
|
||||
|
||||
import { redisClient } from "./services/rate-limiter";
|
||||
import { v0Router } from "./routes/v0";
|
||||
const { createBullBoard } = require("@bull-board/api");
|
||||
const { BullAdapter } = require("@bull-board/api/bullAdapter");
|
||||
const { ExpressAdapter } = require("@bull-board/express");
|
||||
|
@ -45,281 +40,20 @@ app.get("/test", async (req, res) => {
|
|||
res.send("Hello, world!");
|
||||
});
|
||||
|
||||
async function authenticateUser(req, res, mode?: string): Promise<string> {
|
||||
const authHeader = req.headers.authorization;
|
||||
if (!authHeader) {
|
||||
return res.status(401).json({ error: "Unauthorized" });
|
||||
}
|
||||
const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
|
||||
if (!token) {
|
||||
return res.status(401).json({ error: "Unauthorized: Token missing" });
|
||||
}
|
||||
|
||||
try {
|
||||
const incomingIP = (req.headers["x-forwarded-for"] ||
|
||||
req.socket.remoteAddress) as string;
|
||||
const iptoken = incomingIP + token;
|
||||
await getRateLimiter(
|
||||
token === "this_is_just_a_preview_token" ? true : false
|
||||
).consume(iptoken);
|
||||
} catch (rateLimiterRes) {
|
||||
console.error(rateLimiterRes);
|
||||
return res.status(429).json({
|
||||
error: "Rate limit exceeded. Too many requests, try again in 1 minute.",
|
||||
});
|
||||
}
|
||||
|
||||
if (token === "this_is_just_a_preview_token" && mode === "scrape") {
|
||||
return "preview";
|
||||
}
|
||||
|
||||
const normalizedApi = parseApi(token);
|
||||
// make sure api key is valid, based on the api_keys table in supabase
|
||||
const { data, error } = await supabase_service
|
||||
.from("api_keys")
|
||||
.select("*")
|
||||
.eq("key", normalizedApi);
|
||||
if (error || !data || data.length === 0) {
|
||||
return res.status(401).json({ error: "Unauthorized: Invalid token" });
|
||||
}
|
||||
|
||||
return data[0].team_id;
|
||||
}
|
||||
|
||||
app.post("/v0/scrape", async (req, res) => {
|
||||
try {
|
||||
// make sure to authenticate user first, Bearer <token>
|
||||
const team_id = await authenticateUser(req, res, "scrape");
|
||||
|
||||
try {
|
||||
const { success: creditsCheckSuccess, message: creditsCheckMessage } =
|
||||
await checkTeamCredits(team_id, 1);
|
||||
if (!creditsCheckSuccess) {
|
||||
return res.status(402).json({ error: "Insufficient credits" });
|
||||
}
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: "Internal server error" });
|
||||
}
|
||||
|
||||
// authenticate on supabase
|
||||
let url = req.body.url;
|
||||
if (!url) {
|
||||
return res.status(400).json({ error: "Url is required" });
|
||||
}
|
||||
url = url.trim().toLowerCase();
|
||||
|
||||
try {
|
||||
const a = new WebScraperDataProvider();
|
||||
await a.setOptions({
|
||||
mode: "single_urls",
|
||||
urls: [url],
|
||||
});
|
||||
|
||||
const docs = await a.getDocuments(false);
|
||||
// make sure doc.content is not empty
|
||||
const filteredDocs = docs.filter(
|
||||
(doc: { content?: string }) =>
|
||||
doc.content && doc.content.trim().length > 0
|
||||
);
|
||||
if (filteredDocs.length === 0) {
|
||||
return res.status(200).json({ success: true, data: [] });
|
||||
}
|
||||
const { success, credit_usage } = await billTeam(
|
||||
team_id,
|
||||
filteredDocs.length
|
||||
);
|
||||
if (!success) {
|
||||
// throw new Error("Failed to bill team, no subscribtion was found");
|
||||
// return {
|
||||
// success: false,
|
||||
// message: "Failed to bill team, no subscribtion was found",
|
||||
// docs: [],
|
||||
// };
|
||||
return res
|
||||
.status(402)
|
||||
.json({ error: "Failed to bill, no subscribtion was found" });
|
||||
}
|
||||
return res.json({
|
||||
success: true,
|
||||
data: filteredDocs[0],
|
||||
});
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
});
|
||||
|
||||
app.post("/v0/crawl", async (req, res) => {
|
||||
try {
|
||||
const team_id = await authenticateUser(req, res);
|
||||
|
||||
const { success: creditsCheckSuccess, message: creditsCheckMessage } =
|
||||
await checkTeamCredits(team_id, 1);
|
||||
if (!creditsCheckSuccess) {
|
||||
return res.status(402).json({ error: "Insufficient credits" });
|
||||
}
|
||||
|
||||
// authenticate on supabase
|
||||
let url = req.body.url;
|
||||
if (!url) {
|
||||
return res.status(400).json({ error: "Url is required" });
|
||||
}
|
||||
|
||||
url = url.trim().toLowerCase();
|
||||
const mode = req.body.mode ?? "crawl";
|
||||
const crawlerOptions = req.body.crawlerOptions ?? {};
|
||||
|
||||
if (mode === "single_urls" && !url.includes(",")) {
|
||||
try {
|
||||
const a = new WebScraperDataProvider();
|
||||
await a.setOptions({
|
||||
mode: "single_urls",
|
||||
urls: [url],
|
||||
crawlerOptions: {
|
||||
returnOnlyUrls: true,
|
||||
},
|
||||
});
|
||||
|
||||
const docs = await a.getDocuments(false, (progress) => {
|
||||
job.progress({
|
||||
current: progress.current,
|
||||
total: progress.total,
|
||||
current_step: "SCRAPING",
|
||||
current_url: progress.currentDocumentUrl,
|
||||
});
|
||||
});
|
||||
return res.json({
|
||||
success: true,
|
||||
documents: docs,
|
||||
});
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
}
|
||||
const job = await addWebScraperJob({
|
||||
url: url,
|
||||
mode: mode ?? "crawl", // fix for single urls not working
|
||||
crawlerOptions: { ...crawlerOptions },
|
||||
team_id: team_id,
|
||||
});
|
||||
|
||||
res.json({ jobId: job.id });
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
});
|
||||
app.post("/v0/crawlWebsitePreview", async (req, res) => {
|
||||
try {
|
||||
// make sure to authenticate user first, Bearer <token>
|
||||
const authHeader = req.headers.authorization;
|
||||
if (!authHeader) {
|
||||
return res.status(401).json({ error: "Unauthorized" });
|
||||
}
|
||||
const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
|
||||
if (!token) {
|
||||
return res.status(401).json({ error: "Unauthorized: Token missing" });
|
||||
}
|
||||
|
||||
// authenticate on supabase
|
||||
let url = req.body.url;
|
||||
if (!url) {
|
||||
return res.status(400).json({ error: "Url is required" });
|
||||
}
|
||||
url = url.trim().toLowerCase();
|
||||
const mode = req.body.mode ?? "crawl";
|
||||
const crawlerOptions = req.body.crawlerOptions ?? {};
|
||||
const job = await addWebScraperJob({
|
||||
url: url,
|
||||
mode: mode ?? "crawl", // fix for single urls not working
|
||||
crawlerOptions: { ...crawlerOptions, limit: 5, maxCrawledLinks: 5 },
|
||||
team_id: "preview",
|
||||
});
|
||||
|
||||
res.json({ jobId: job.id });
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
});
|
||||
|
||||
app.get("/v0/crawl/status/:jobId", async (req, res) => {
|
||||
try {
|
||||
const authHeader = req.headers.authorization;
|
||||
if (!authHeader) {
|
||||
return res.status(401).json({ error: "Unauthorized" });
|
||||
}
|
||||
const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
|
||||
if (!token) {
|
||||
return res.status(401).json({ error: "Unauthorized: Token missing" });
|
||||
}
|
||||
|
||||
// make sure api key is valid, based on the api_keys table in supabase
|
||||
const { data, error } = await supabase_service
|
||||
.from("api_keys")
|
||||
.select("*")
|
||||
.eq("key", token);
|
||||
if (error || !data || data.length === 0) {
|
||||
return res.status(401).json({ error: "Unauthorized: Invalid token" });
|
||||
}
|
||||
const job = await getWebScraperQueue().getJob(req.params.jobId);
|
||||
if (!job) {
|
||||
return res.status(404).json({ error: "Job not found" });
|
||||
}
|
||||
|
||||
const { current, current_url, total, current_step } = await job.progress();
|
||||
res.json({
|
||||
status: await job.getState(),
|
||||
// progress: job.progress(),
|
||||
current: current,
|
||||
current_url: current_url,
|
||||
current_step: current_step,
|
||||
total: total,
|
||||
data: job.returnvalue,
|
||||
});
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
});
|
||||
|
||||
app.get("/v0/checkJobStatus/:jobId", async (req, res) => {
|
||||
try {
|
||||
const job = await getWebScraperQueue().getJob(req.params.jobId);
|
||||
if (!job) {
|
||||
return res.status(404).json({ error: "Job not found" });
|
||||
}
|
||||
|
||||
const { current, current_url, total, current_step } = await job.progress();
|
||||
res.json({
|
||||
status: await job.getState(),
|
||||
// progress: job.progress(),
|
||||
current: current,
|
||||
current_url: current_url,
|
||||
current_step: current_step,
|
||||
total: total,
|
||||
data: job.returnvalue,
|
||||
});
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
});
|
||||
// register router
|
||||
app.use(v0Router);
|
||||
|
||||
const DEFAULT_PORT = process.env.PORT ?? 3002;
|
||||
const HOST = process.env.HOST ?? "localhost";
|
||||
redisClient.connect();
|
||||
|
||||
|
||||
export function startServer(port = DEFAULT_PORT) {
|
||||
const server = app.listen(Number(port), HOST, () => {
|
||||
console.log(`Server listening on port ${port}`);
|
||||
console.log(`For the UI, open http://${HOST}:${port}/admin/${process.env.BULL_AUTH_KEY}/queues`);
|
||||
console.log(
|
||||
`For the UI, open http://${HOST}:${port}/admin/${process.env.BULL_AUTH_KEY}/queues`
|
||||
);
|
||||
console.log("");
|
||||
console.log("1. Make sure Redis is running on port 6379 by default");
|
||||
console.log(
|
||||
|
@ -353,7 +87,77 @@ app.get(`/admin/${process.env.BULL_AUTH_KEY}/queues`, async (req, res) => {
|
|||
}
|
||||
});
|
||||
|
||||
app.get(`/serverHealthCheck`, async (req, res) => {
|
||||
try {
|
||||
const webScraperQueue = getWebScraperQueue();
|
||||
const [waitingJobs] = await Promise.all([
|
||||
webScraperQueue.getWaitingCount(),
|
||||
]);
|
||||
|
||||
const noWaitingJobs = waitingJobs === 0;
|
||||
// 200 if no active jobs, 503 if there are active jobs
|
||||
return res.status(noWaitingJobs ? 200 : 500).json({
|
||||
waitingJobs,
|
||||
});
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
});
|
||||
|
||||
app.get('/serverHealthCheck/notify', async (req, res) => {
|
||||
if (process.env.SLACK_WEBHOOK_URL) {
|
||||
const treshold = 1; // The treshold value for the active jobs
|
||||
const timeout = 60000; // 1 minute // The timeout value for the check in milliseconds
|
||||
|
||||
const getWaitingJobsCount = async () => {
|
||||
const webScraperQueue = getWebScraperQueue();
|
||||
const [waitingJobsCount] = await Promise.all([
|
||||
webScraperQueue.getWaitingCount(),
|
||||
]);
|
||||
|
||||
return waitingJobsCount;
|
||||
};
|
||||
|
||||
res.status(200).json({ message: "Check initiated" });
|
||||
|
||||
const checkWaitingJobs = async () => {
|
||||
try {
|
||||
let waitingJobsCount = await getWaitingJobsCount();
|
||||
if (waitingJobsCount >= treshold) {
|
||||
setTimeout(async () => {
|
||||
// Re-check the waiting jobs count after the timeout
|
||||
waitingJobsCount = await getWaitingJobsCount();
|
||||
if (waitingJobsCount >= treshold) {
|
||||
const slackWebhookUrl = process.env.SLACK_WEBHOOK_URL;
|
||||
const message = {
|
||||
text: `⚠️ Warning: The number of active jobs (${waitingJobsCount}) has exceeded the threshold (${treshold}) for more than ${timeout/60000} minute(s).`,
|
||||
};
|
||||
|
||||
const response = await fetch(slackWebhookUrl, {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json',
|
||||
},
|
||||
body: JSON.stringify(message),
|
||||
})
|
||||
|
||||
if (!response.ok) {
|
||||
console.error('Failed to send Slack notification')
|
||||
}
|
||||
}
|
||||
}, timeout);
|
||||
}
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
}
|
||||
};
|
||||
|
||||
checkWaitingJobs();
|
||||
}
|
||||
});
|
||||
|
||||
|
||||
app.get("/is-production", (req, res) => {
|
||||
res.send({ isProduction: global.isProduction });
|
||||
});
|
||||
|
||||
|
|
|
@ -9,8 +9,38 @@ export interface Progress {
|
|||
currentDocumentUrl?: string;
|
||||
}
|
||||
|
||||
export type PageOptions = {
|
||||
onlyMainContent?: boolean;
|
||||
fallback?: boolean;
|
||||
fetchPageContent?: boolean;
|
||||
|
||||
};
|
||||
|
||||
export type SearchOptions = {
|
||||
limit?: number;
|
||||
tbs?: string;
|
||||
filter?: string;
|
||||
};
|
||||
|
||||
export type WebScraperOptions = {
|
||||
urls: string[];
|
||||
mode: "single_urls" | "sitemap" | "crawl";
|
||||
crawlerOptions?: {
|
||||
returnOnlyUrls?: boolean;
|
||||
includes?: string[];
|
||||
excludes?: string[];
|
||||
maxCrawledLinks?: number;
|
||||
limit?: number;
|
||||
generateImgAltText?: boolean;
|
||||
replaceAllPathsWithAbsolutePaths?: boolean;
|
||||
};
|
||||
pageOptions?: PageOptions;
|
||||
concurrentRequests?: number;
|
||||
};
|
||||
|
||||
export class Document {
|
||||
id?: string;
|
||||
url?: string; // Used only in /search for now
|
||||
content: string;
|
||||
markdown?: string;
|
||||
createdAt?: Date;
|
||||
|
@ -21,6 +51,7 @@ export class Document {
|
|||
[key: string]: any;
|
||||
};
|
||||
childrenLinks?: string[];
|
||||
provider?: string;
|
||||
|
||||
constructor(data: Partial<Document>) {
|
||||
if (!data.content) {
|
||||
|
@ -33,5 +64,6 @@ export class Document {
|
|||
this.metadata = data.metadata || { sourceURL: "" };
|
||||
this.markdown = data.markdown || "";
|
||||
this.childrenLinks = data.childrenLinks || undefined;
|
||||
this.provider = data.provider || undefined;
|
||||
}
|
||||
}
|
||||
|
|
|
@ -1,5 +1,8 @@
|
|||
|
||||
export function parseMarkdown(html: string) {
|
||||
var TurndownService = require("turndown");
|
||||
var turndownPluginGfm = require('joplin-turndown-plugin-gfm')
|
||||
|
||||
|
||||
const turndownService = new TurndownService();
|
||||
turndownService.addRule("inlineLink", {
|
||||
|
@ -16,7 +19,8 @@ export function parseMarkdown(html: string) {
|
|||
return "[" + content.trim() + "](" + href + title + ")\n";
|
||||
},
|
||||
});
|
||||
|
||||
var gfm = turndownPluginGfm.gfm;
|
||||
turndownService.use(gfm);
|
||||
let markdownContent = turndownService.turndown(html);
|
||||
|
||||
// multiple line links
|
||||
|
|
24
apps/api/src/lib/withAuth.ts
Normal file
24
apps/api/src/lib/withAuth.ts
Normal file
|
@ -0,0 +1,24 @@
|
|||
import { AuthResponse } from "../../src/types";
|
||||
|
||||
let warningCount = 0;
|
||||
|
||||
export function withAuth<T extends AuthResponse, U extends any[]>(
|
||||
originalFunction: (...args: U) => Promise<T>
|
||||
) {
|
||||
return async function (...args: U): Promise<T> {
|
||||
if (process.env.USE_DB_AUTHENTICATION === "false") {
|
||||
if (warningCount < 5) {
|
||||
console.warn("WARNING - You're bypassing authentication");
|
||||
warningCount++;
|
||||
}
|
||||
return { success: true } as T;
|
||||
} else {
|
||||
try {
|
||||
return await originalFunction(...args);
|
||||
} catch (error) {
|
||||
console.error("Error in withAuth function: ", error);
|
||||
return { success: false, error: error.message } as T;
|
||||
}
|
||||
}
|
||||
};
|
||||
}
|
|
@ -3,7 +3,7 @@ import { CrawlResult, WebScraperOptions } from "../types";
|
|||
import { WebScraperDataProvider } from "../scraper/WebScraper";
|
||||
import { Progress } from "../lib/entities";
|
||||
import { billTeam } from "../services/billing/credit_billing";
|
||||
|
||||
import { Document } from "../lib/entities";
|
||||
export async function startWebScraperPipeline({
|
||||
job,
|
||||
}: {
|
||||
|
@ -13,6 +13,7 @@ export async function startWebScraperPipeline({
|
|||
url: job.data.url,
|
||||
mode: job.data.mode,
|
||||
crawlerOptions: job.data.crawlerOptions,
|
||||
pageOptions: job.data.pageOptions,
|
||||
inProgress: (progress) => {
|
||||
job.progress(progress);
|
||||
},
|
||||
|
@ -23,12 +24,13 @@ export async function startWebScraperPipeline({
|
|||
job.moveToFailed(error);
|
||||
},
|
||||
team_id: job.data.team_id,
|
||||
})) as { success: boolean; message: string; docs: CrawlResult[] };
|
||||
})) as { success: boolean; message: string; docs: Document[] };
|
||||
}
|
||||
export async function runWebScraper({
|
||||
url,
|
||||
mode,
|
||||
crawlerOptions,
|
||||
pageOptions,
|
||||
inProgress,
|
||||
onSuccess,
|
||||
onError,
|
||||
|
@ -37,25 +39,31 @@ export async function runWebScraper({
|
|||
url: string;
|
||||
mode: "crawl" | "single_urls" | "sitemap";
|
||||
crawlerOptions: any;
|
||||
pageOptions?: any;
|
||||
inProgress: (progress: any) => void;
|
||||
onSuccess: (result: any) => void;
|
||||
onError: (error: any) => void;
|
||||
team_id: string;
|
||||
}): Promise<{ success: boolean; message: string; docs: CrawlResult[] }> {
|
||||
}): Promise<{
|
||||
success: boolean;
|
||||
message: string;
|
||||
docs: CrawlResult[];
|
||||
}> {
|
||||
try {
|
||||
const provider = new WebScraperDataProvider();
|
||||
|
||||
if (mode === "crawl") {
|
||||
await provider.setOptions({
|
||||
mode: mode,
|
||||
urls: [url],
|
||||
crawlerOptions: crawlerOptions,
|
||||
pageOptions: pageOptions,
|
||||
});
|
||||
} else {
|
||||
await provider.setOptions({
|
||||
mode: mode,
|
||||
urls: url.split(","),
|
||||
crawlerOptions: crawlerOptions,
|
||||
pageOptions: pageOptions,
|
||||
});
|
||||
}
|
||||
const docs = (await provider.getDocuments(false, (progress: Progress) => {
|
||||
|
@ -66,28 +74,32 @@ export async function runWebScraper({
|
|||
return {
|
||||
success: true,
|
||||
message: "No pages found",
|
||||
docs: [],
|
||||
docs: []
|
||||
};
|
||||
}
|
||||
|
||||
// remove docs with empty content
|
||||
const filteredDocs = docs.filter((doc) => doc.content.trim().length > 0);
|
||||
onSuccess(filteredDocs);
|
||||
|
||||
const { success, credit_usage } = await billTeam(
|
||||
team_id,
|
||||
filteredDocs.length
|
||||
);
|
||||
|
||||
if (!success) {
|
||||
// throw new Error("Failed to bill team, no subscribtion was found");
|
||||
// throw new Error("Failed to bill team, no subscription was found");
|
||||
return {
|
||||
success: false,
|
||||
message: "Failed to bill team, no subscribtion was found",
|
||||
docs: [],
|
||||
message: "Failed to bill team, no subscription was found",
|
||||
docs: []
|
||||
};
|
||||
}
|
||||
|
||||
return { success: true, message: "", docs: filteredDocs as CrawlResult[] };
|
||||
// This is where the returnvalue from the job is set
|
||||
onSuccess(filteredDocs);
|
||||
|
||||
// this return doesn't matter too much for the job completion result
|
||||
return { success: true, message: "", docs: filteredDocs };
|
||||
} catch (error) {
|
||||
console.error("Error running web scraper", error);
|
||||
onError(error);
|
||||
|
|
19
apps/api/src/routes/v0.ts
Normal file
19
apps/api/src/routes/v0.ts
Normal file
|
@ -0,0 +1,19 @@
|
|||
import express from "express";
|
||||
import { crawlController } from "../../src/controllers/crawl";
|
||||
import { crawlStatusController } from "../../src/controllers/crawl-status";
|
||||
import { scrapeController } from "../../src/controllers/scrape";
|
||||
import { crawlPreviewController } from "../../src/controllers/crawlPreview";
|
||||
import { crawlJobStatusPreviewController } from "../../src/controllers/status";
|
||||
import { searchController } from "../../src/controllers/search";
|
||||
|
||||
export const v0Router = express.Router();
|
||||
|
||||
v0Router.post("/v0/scrape", scrapeController);
|
||||
v0Router.post("/v0/crawl", crawlController);
|
||||
v0Router.post("/v0/crawlWebsitePreview", crawlPreviewController);
|
||||
v0Router.get("/v0/crawl/status/:jobId", crawlStatusController);
|
||||
v0Router.get("/v0/checkJobStatus/:jobId", crawlJobStatusPreviewController);
|
||||
|
||||
// Search routes
|
||||
v0Router.post("/v0/search", searchController);
|
||||
|
|
@ -257,7 +257,7 @@ export class WebCrawler {
|
|||
".js",
|
||||
".ico",
|
||||
".svg",
|
||||
".pdf",
|
||||
// ".pdf",
|
||||
".zip",
|
||||
".exe",
|
||||
".dmg",
|
||||
|
|
|
@ -1,24 +1,13 @@
|
|||
import { Document } from "../../lib/entities";
|
||||
import { Document, PageOptions, WebScraperOptions } from "../../lib/entities";
|
||||
import { Progress } from "../../lib/entities";
|
||||
import { scrapSingleUrl } from "./single_url";
|
||||
import { SitemapEntry, fetchSitemapData, getLinksFromSitemap } from "./sitemap";
|
||||
import { WebCrawler } from "./crawler";
|
||||
import { getValue, setValue } from "../../services/redis";
|
||||
import { getImageDescription } from "./utils/gptVision";
|
||||
import { getImageDescription } from "./utils/imageDescription";
|
||||
import { fetchAndProcessPdf } from "./utils/pdfProcessor";
|
||||
import { replaceImgPathsWithAbsolutePaths, replacePathsWithAbsolutePaths } from "./utils/replacePaths";
|
||||
|
||||
export type WebScraperOptions = {
|
||||
urls: string[];
|
||||
mode: "single_urls" | "sitemap" | "crawl";
|
||||
crawlerOptions?: {
|
||||
returnOnlyUrls?: boolean;
|
||||
includes?: string[];
|
||||
excludes?: string[];
|
||||
maxCrawledLinks?: number;
|
||||
limit?: number;
|
||||
generateImgAltText?: boolean;
|
||||
};
|
||||
concurrentRequests?: number;
|
||||
};
|
||||
export class WebScraperDataProvider {
|
||||
private urls: string[] = [""];
|
||||
private mode: "single_urls" | "sitemap" | "crawl" = "single_urls";
|
||||
|
@ -29,6 +18,9 @@ export class WebScraperDataProvider {
|
|||
private limit: number = 10000;
|
||||
private concurrentRequests: number = 20;
|
||||
private generateImgAltText: boolean = false;
|
||||
private pageOptions?: PageOptions;
|
||||
private replaceAllPathsWithAbsolutePaths?: boolean = false;
|
||||
private generateImgAltTextModel: "gpt-4-turbo" | "claude-3-opus" = "gpt-4-turbo";
|
||||
|
||||
authorize(): void {
|
||||
throw new Error("Method not implemented.");
|
||||
|
@ -49,19 +41,21 @@ export class WebScraperDataProvider {
|
|||
const results: (Document | null)[] = new Array(urls.length).fill(null);
|
||||
for (let i = 0; i < urls.length; i += this.concurrentRequests) {
|
||||
const batchUrls = urls.slice(i, i + this.concurrentRequests);
|
||||
await Promise.all(batchUrls.map(async (url, index) => {
|
||||
const result = await scrapSingleUrl(url, true);
|
||||
processedUrls++;
|
||||
if (inProgress) {
|
||||
inProgress({
|
||||
current: processedUrls,
|
||||
total: totalUrls,
|
||||
status: "SCRAPING",
|
||||
currentDocumentUrl: url,
|
||||
});
|
||||
}
|
||||
results[i + index] = result;
|
||||
}));
|
||||
await Promise.all(
|
||||
batchUrls.map(async (url, index) => {
|
||||
const result = await scrapSingleUrl(url, true, this.pageOptions);
|
||||
processedUrls++;
|
||||
if (inProgress) {
|
||||
inProgress({
|
||||
current: processedUrls,
|
||||
total: totalUrls,
|
||||
status: "SCRAPING",
|
||||
currentDocumentUrl: url,
|
||||
});
|
||||
}
|
||||
results[i + index] = result;
|
||||
})
|
||||
);
|
||||
}
|
||||
return results.filter((result) => result !== null) as Document[];
|
||||
}
|
||||
|
@ -84,7 +78,7 @@ export class WebScraperDataProvider {
|
|||
limit: this.limit,
|
||||
generateImgAltText: this.generateImgAltText,
|
||||
});
|
||||
const links = await crawler.start(inProgress, 5, this.limit);
|
||||
let links = await crawler.start(inProgress, 5, this.limit);
|
||||
if (this.returnOnlyUrls) {
|
||||
return links.map((url) => ({
|
||||
content: "",
|
||||
|
@ -93,55 +87,142 @@ export class WebScraperDataProvider {
|
|||
type: "text",
|
||||
}));
|
||||
}
|
||||
|
||||
let pdfLinks = links.filter((link) => link.endsWith(".pdf"));
|
||||
let pdfDocuments: Document[] = [];
|
||||
for (let pdfLink of pdfLinks) {
|
||||
const pdfContent = await fetchAndProcessPdf(pdfLink);
|
||||
pdfDocuments.push({
|
||||
content: pdfContent,
|
||||
metadata: { sourceURL: pdfLink },
|
||||
provider: "web-scraper"
|
||||
});
|
||||
}
|
||||
links = links.filter((link) => !link.endsWith(".pdf"));
|
||||
|
||||
let documents = await this.convertUrlsToDocuments(links, inProgress);
|
||||
documents = await this.getSitemapData(this.urls[0], documents);
|
||||
console.log("documents", documents)
|
||||
|
||||
if (this.replaceAllPathsWithAbsolutePaths) {
|
||||
documents = replacePathsWithAbsolutePaths(documents);
|
||||
} else {
|
||||
documents = replaceImgPathsWithAbsolutePaths(documents);
|
||||
}
|
||||
|
||||
if (this.generateImgAltText) {
|
||||
documents = await this.generatesImgAltText(documents);
|
||||
}
|
||||
documents = documents.concat(pdfDocuments);
|
||||
|
||||
// CACHING DOCUMENTS
|
||||
// - parent document
|
||||
const cachedParentDocumentString = await getValue('web-scraper-cache:' + this.normalizeUrl(this.urls[0]));
|
||||
const cachedParentDocumentString = await getValue(
|
||||
"web-scraper-cache:" + this.normalizeUrl(this.urls[0])
|
||||
);
|
||||
if (cachedParentDocumentString != null) {
|
||||
let cachedParentDocument = JSON.parse(cachedParentDocumentString);
|
||||
if (!cachedParentDocument.childrenLinks || cachedParentDocument.childrenLinks.length < links.length - 1) {
|
||||
cachedParentDocument.childrenLinks = links.filter((link) => link !== this.urls[0]);
|
||||
await setValue('web-scraper-cache:' + this.normalizeUrl(this.urls[0]), JSON.stringify(cachedParentDocument), 60 * 60 * 24 * 10); // 10 days
|
||||
if (
|
||||
!cachedParentDocument.childrenLinks ||
|
||||
cachedParentDocument.childrenLinks.length < links.length - 1
|
||||
) {
|
||||
cachedParentDocument.childrenLinks = links.filter(
|
||||
(link) => link !== this.urls[0]
|
||||
);
|
||||
await setValue(
|
||||
"web-scraper-cache:" + this.normalizeUrl(this.urls[0]),
|
||||
JSON.stringify(cachedParentDocument),
|
||||
60 * 60 * 24 * 10
|
||||
); // 10 days
|
||||
}
|
||||
} else {
|
||||
let parentDocument = documents.filter((document) => this.normalizeUrl(document.metadata.sourceURL) === this.normalizeUrl(this.urls[0]))
|
||||
let parentDocument = documents.filter(
|
||||
(document) =>
|
||||
this.normalizeUrl(document.metadata.sourceURL) ===
|
||||
this.normalizeUrl(this.urls[0])
|
||||
);
|
||||
await this.setCachedDocuments(parentDocument, links);
|
||||
}
|
||||
|
||||
await this.setCachedDocuments(documents.filter((document) => this.normalizeUrl(document.metadata.sourceURL) !== this.normalizeUrl(this.urls[0])), []);
|
||||
await this.setCachedDocuments(
|
||||
documents.filter(
|
||||
(document) =>
|
||||
this.normalizeUrl(document.metadata.sourceURL) !==
|
||||
this.normalizeUrl(this.urls[0])
|
||||
),
|
||||
[]
|
||||
);
|
||||
documents = this.removeChildLinks(documents);
|
||||
documents = documents.splice(0, this.limit);
|
||||
return documents;
|
||||
}
|
||||
|
||||
if (this.mode === "single_urls") {
|
||||
let documents = await this.convertUrlsToDocuments(this.urls, inProgress);
|
||||
let pdfLinks = this.urls.filter((link) => link.endsWith(".pdf"));
|
||||
let pdfDocuments: Document[] = [];
|
||||
for (let pdfLink of pdfLinks) {
|
||||
const pdfContent = await fetchAndProcessPdf(pdfLink);
|
||||
pdfDocuments.push({
|
||||
content: pdfContent,
|
||||
metadata: { sourceURL: pdfLink },
|
||||
provider: "web-scraper"
|
||||
});
|
||||
}
|
||||
|
||||
let documents = await this.convertUrlsToDocuments(
|
||||
this.urls.filter((link) => !link.endsWith(".pdf")),
|
||||
inProgress
|
||||
);
|
||||
|
||||
if (this.replaceAllPathsWithAbsolutePaths) {
|
||||
documents = replacePathsWithAbsolutePaths(documents);
|
||||
} else {
|
||||
documents = replaceImgPathsWithAbsolutePaths(documents);
|
||||
}
|
||||
|
||||
if (this.generateImgAltText) {
|
||||
documents = await this.generatesImgAltText(documents);
|
||||
}
|
||||
const baseUrl = new URL(this.urls[0]).origin;
|
||||
documents = await this.getSitemapData(baseUrl, documents);
|
||||
|
||||
documents = documents.concat(pdfDocuments);
|
||||
|
||||
await this.setCachedDocuments(documents);
|
||||
documents = this.removeChildLinks(documents);
|
||||
documents = documents.splice(0, this.limit);
|
||||
return documents;
|
||||
}
|
||||
if (this.mode === "sitemap") {
|
||||
const links = await getLinksFromSitemap(this.urls[0]);
|
||||
let documents = await this.convertUrlsToDocuments(links.slice(0, this.limit), inProgress);
|
||||
let links = await getLinksFromSitemap(this.urls[0]);
|
||||
let pdfLinks = links.filter((link) => link.endsWith(".pdf"));
|
||||
let pdfDocuments: Document[] = [];
|
||||
for (let pdfLink of pdfLinks) {
|
||||
const pdfContent = await fetchAndProcessPdf(pdfLink);
|
||||
pdfDocuments.push({
|
||||
content: pdfContent,
|
||||
metadata: { sourceURL: pdfLink },
|
||||
provider: "web-scraper"
|
||||
});
|
||||
}
|
||||
links = links.filter((link) => !link.endsWith(".pdf"));
|
||||
|
||||
let documents = await this.convertUrlsToDocuments(
|
||||
links.slice(0, this.limit),
|
||||
inProgress
|
||||
);
|
||||
|
||||
documents = await this.getSitemapData(this.urls[0], documents);
|
||||
|
||||
if (this.replaceAllPathsWithAbsolutePaths) {
|
||||
documents = replacePathsWithAbsolutePaths(documents);
|
||||
} else {
|
||||
documents = replaceImgPathsWithAbsolutePaths(documents);
|
||||
}
|
||||
|
||||
if (this.generateImgAltText) {
|
||||
documents = await this.generatesImgAltText(documents);
|
||||
}
|
||||
|
||||
documents = documents.concat(pdfDocuments);
|
||||
|
||||
await this.setCachedDocuments(documents);
|
||||
documents = this.removeChildLinks(documents);
|
||||
documents = documents.splice(0, this.limit);
|
||||
|
@ -151,11 +232,22 @@ export class WebScraperDataProvider {
|
|||
return [];
|
||||
}
|
||||
|
||||
let documents = await this.getCachedDocuments(this.urls.slice(0, this.limit));
|
||||
let documents = await this.getCachedDocuments(
|
||||
this.urls.slice(0, this.limit)
|
||||
);
|
||||
if (documents.length < this.limit) {
|
||||
const newDocuments: Document[] = await this.getDocuments(false, inProgress);
|
||||
newDocuments.forEach(doc => {
|
||||
if (!documents.some(d => this.normalizeUrl(d.metadata.sourceURL) === this.normalizeUrl(doc.metadata?.sourceURL))) {
|
||||
const newDocuments: Document[] = await this.getDocuments(
|
||||
false,
|
||||
inProgress
|
||||
);
|
||||
newDocuments.forEach((doc) => {
|
||||
if (
|
||||
!documents.some(
|
||||
(d) =>
|
||||
this.normalizeUrl(d.metadata.sourceURL) ===
|
||||
this.normalizeUrl(doc.metadata?.sourceURL)
|
||||
)
|
||||
) {
|
||||
documents.push(doc);
|
||||
}
|
||||
});
|
||||
|
@ -171,17 +263,23 @@ export class WebScraperDataProvider {
|
|||
const url = new URL(document.metadata.sourceURL);
|
||||
const path = url.pathname;
|
||||
|
||||
if (this.excludes.length > 0 && this.excludes[0] !== '') {
|
||||
if (this.excludes.length > 0 && this.excludes[0] !== "") {
|
||||
// Check if the link should be excluded
|
||||
if (this.excludes.some(excludePattern => new RegExp(excludePattern).test(path))) {
|
||||
if (
|
||||
this.excludes.some((excludePattern) =>
|
||||
new RegExp(excludePattern).test(path)
|
||||
)
|
||||
) {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
if (this.includes.length > 0 && this.includes[0] !== '') {
|
||||
|
||||
if (this.includes.length > 0 && this.includes[0] !== "") {
|
||||
// Check if the link matches the include patterns, if any are specified
|
||||
if (this.includes.length > 0) {
|
||||
return this.includes.some(includePattern => new RegExp(includePattern).test(path));
|
||||
return this.includes.some((includePattern) =>
|
||||
new RegExp(includePattern).test(path)
|
||||
);
|
||||
}
|
||||
}
|
||||
return true;
|
||||
|
@ -198,7 +296,7 @@ export class WebScraperDataProvider {
|
|||
private removeChildLinks(documents: Document[]): Document[] {
|
||||
for (let document of documents) {
|
||||
if (document?.childrenLinks) delete document.childrenLinks;
|
||||
};
|
||||
}
|
||||
return documents;
|
||||
}
|
||||
|
||||
|
@ -208,10 +306,14 @@ export class WebScraperDataProvider {
|
|||
continue;
|
||||
}
|
||||
const normalizedUrl = this.normalizeUrl(document.metadata.sourceURL);
|
||||
await setValue('web-scraper-cache:' + normalizedUrl, JSON.stringify({
|
||||
...document,
|
||||
childrenLinks: childrenLinks || []
|
||||
}), 60 * 60 * 24 * 10); // 10 days
|
||||
await setValue(
|
||||
"web-scraper-cache:" + normalizedUrl,
|
||||
JSON.stringify({
|
||||
...document,
|
||||
childrenLinks: childrenLinks || [],
|
||||
}),
|
||||
60 * 60 * 24 * 10
|
||||
); // 10 days
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -219,8 +321,12 @@ export class WebScraperDataProvider {
|
|||
let documents: Document[] = [];
|
||||
for (const url of urls) {
|
||||
const normalizedUrl = this.normalizeUrl(url);
|
||||
console.log("Getting cached document for web-scraper-cache:" + normalizedUrl)
|
||||
const cachedDocumentString = await getValue('web-scraper-cache:' + normalizedUrl);
|
||||
console.log(
|
||||
"Getting cached document for web-scraper-cache:" + normalizedUrl
|
||||
);
|
||||
const cachedDocumentString = await getValue(
|
||||
"web-scraper-cache:" + normalizedUrl
|
||||
);
|
||||
if (cachedDocumentString) {
|
||||
const cachedDocument = JSON.parse(cachedDocumentString);
|
||||
documents.push(cachedDocument);
|
||||
|
@ -228,10 +334,18 @@ export class WebScraperDataProvider {
|
|||
// get children documents
|
||||
for (const childUrl of cachedDocument.childrenLinks) {
|
||||
const normalizedChildUrl = this.normalizeUrl(childUrl);
|
||||
const childCachedDocumentString = await getValue('web-scraper-cache:' + normalizedChildUrl);
|
||||
const childCachedDocumentString = await getValue(
|
||||
"web-scraper-cache:" + normalizedChildUrl
|
||||
);
|
||||
if (childCachedDocumentString) {
|
||||
const childCachedDocument = JSON.parse(childCachedDocumentString);
|
||||
if (!documents.find((doc) => doc.metadata.sourceURL === childCachedDocument.metadata.sourceURL)) {
|
||||
if (
|
||||
!documents.find(
|
||||
(doc) =>
|
||||
doc.metadata.sourceURL ===
|
||||
childCachedDocument.metadata.sourceURL
|
||||
)
|
||||
) {
|
||||
documents.push(childCachedDocument);
|
||||
}
|
||||
}
|
||||
|
@ -246,7 +360,6 @@ export class WebScraperDataProvider {
|
|||
throw new Error("Urls are required");
|
||||
}
|
||||
|
||||
console.log("options", options.crawlerOptions?.excludes)
|
||||
this.urls = options.urls;
|
||||
this.mode = options.mode;
|
||||
this.concurrentRequests = options.concurrentRequests ?? 20;
|
||||
|
@ -255,13 +368,14 @@ export class WebScraperDataProvider {
|
|||
this.maxCrawledLinks = options.crawlerOptions?.maxCrawledLinks ?? 1000;
|
||||
this.returnOnlyUrls = options.crawlerOptions?.returnOnlyUrls ?? false;
|
||||
this.limit = options.crawlerOptions?.limit ?? 10000;
|
||||
this.generateImgAltText = options.crawlerOptions?.generateImgAltText ?? false;
|
||||
|
||||
this.generateImgAltText =
|
||||
options.crawlerOptions?.generateImgAltText ?? false;
|
||||
this.pageOptions = options.pageOptions ?? {onlyMainContent: false};
|
||||
this.replaceAllPathsWithAbsolutePaths = options.crawlerOptions?.replaceAllPathsWithAbsolutePaths ?? false;
|
||||
|
||||
//! @nicolas, for some reason this was being injected and breakign everything. Don't have time to find source of the issue so adding this check
|
||||
this.excludes = this.excludes.filter(item => item !== '');
|
||||
|
||||
|
||||
this.excludes = this.excludes.filter((item) => item !== "");
|
||||
|
||||
// make sure all urls start with https://
|
||||
this.urls = this.urls.map((url) => {
|
||||
if (!url.trim().startsWith("http")) {
|
||||
|
@ -272,10 +386,14 @@ export class WebScraperDataProvider {
|
|||
}
|
||||
|
||||
private async getSitemapData(baseUrl: string, documents: Document[]) {
|
||||
const sitemapData = await fetchSitemapData(baseUrl)
|
||||
const sitemapData = await fetchSitemapData(baseUrl);
|
||||
if (sitemapData) {
|
||||
for (let i = 0; i < documents.length; i++) {
|
||||
const docInSitemapData = sitemapData.find((data) => this.normalizeUrl(data.loc) === this.normalizeUrl(documents[i].metadata.sourceURL))
|
||||
const docInSitemapData = sitemapData.find(
|
||||
(data) =>
|
||||
this.normalizeUrl(data.loc) ===
|
||||
this.normalizeUrl(documents[i].metadata.sourceURL)
|
||||
);
|
||||
if (docInSitemapData) {
|
||||
let sitemapDocData: Partial<SitemapEntry> = {};
|
||||
if (docInSitemapData.changefreq) {
|
||||
|
@ -296,30 +414,47 @@ export class WebScraperDataProvider {
|
|||
return documents;
|
||||
}
|
||||
generatesImgAltText = async (documents: Document[]): Promise<Document[]> => {
|
||||
await Promise.all(documents.map(async (document) => {
|
||||
const baseUrl = new URL(document.metadata.sourceURL).origin;
|
||||
const images = document.content.match(/!\[.*?\]\(((?:[^()]+|\((?:[^()]+|\([^()]*\))*\))*)\)/g) || [];
|
||||
await Promise.all(
|
||||
documents.map(async (document) => {
|
||||
const images = document.content.match(/!\[.*?\]\((.*?)\)/g) || [];
|
||||
|
||||
await Promise.all(images.map(async (image) => {
|
||||
let imageUrl = image.match(/\(([^)]+)\)/)[1];
|
||||
let altText = image.match(/\[(.*?)\]/)[1];
|
||||
let newImageUrl = '';
|
||||
await Promise.all(
|
||||
images.map(async (image: string) => {
|
||||
let imageUrl = image.match(/\(([^)]+)\)/)[1];
|
||||
let altText = image.match(/\[(.*?)\]/)[1];
|
||||
|
||||
if (!altText && !imageUrl.startsWith("data:image") && /\.(png|jpeg|gif|webp)$/.test(imageUrl)) {
|
||||
newImageUrl = baseUrl + imageUrl;
|
||||
const imageIndex = document.content.indexOf(image);
|
||||
const contentLength = document.content.length;
|
||||
let backText = document.content.substring(imageIndex + image.length, Math.min(imageIndex + image.length + 1000, contentLength));
|
||||
let frontTextStartIndex = Math.max(imageIndex - 1000, 0);
|
||||
let frontText = document.content.substring(frontTextStartIndex, imageIndex);
|
||||
altText = await getImageDescription(newImageUrl, backText, frontText);
|
||||
}
|
||||
if (
|
||||
!altText &&
|
||||
!imageUrl.startsWith("data:image") &&
|
||||
/\.(png|jpeg|gif|webp)$/.test(imageUrl)
|
||||
) {
|
||||
const imageIndex = document.content.indexOf(image);
|
||||
const contentLength = document.content.length;
|
||||
let backText = document.content.substring(
|
||||
imageIndex + image.length,
|
||||
Math.min(imageIndex + image.length + 1000, contentLength)
|
||||
);
|
||||
let frontTextStartIndex = Math.max(imageIndex - 1000, 0);
|
||||
let frontText = document.content.substring(
|
||||
frontTextStartIndex,
|
||||
imageIndex
|
||||
);
|
||||
altText = await getImageDescription(
|
||||
imageUrl,
|
||||
backText,
|
||||
frontText
|
||||
, this.generateImgAltTextModel);
|
||||
}
|
||||
|
||||
document.content = document.content.replace(image, `![${altText}](${newImageUrl})`);
|
||||
}));
|
||||
}));
|
||||
document.content = document.content.replace(
|
||||
image,
|
||||
`![${altText}](${imageUrl})`
|
||||
);
|
||||
})
|
||||
);
|
||||
})
|
||||
);
|
||||
|
||||
return documents;
|
||||
}
|
||||
};
|
||||
}
|
||||
|
||||
|
|
|
@ -2,10 +2,9 @@ import * as cheerio from "cheerio";
|
|||
import { ScrapingBeeClient } from "scrapingbee";
|
||||
import { extractMetadata } from "./utils/metadata";
|
||||
import dotenv from "dotenv";
|
||||
import { Document } from "../../lib/entities";
|
||||
import { Document, PageOptions } from "../../lib/entities";
|
||||
import { parseMarkdown } from "../../lib/html-to-markdown";
|
||||
import { parseTablesToMarkdown } from "./utils/parseTable";
|
||||
// import puppeteer from "puppeteer";
|
||||
import { excludeNonMainTags } from "./utils/excludeTags";
|
||||
|
||||
dotenv.config();
|
||||
|
||||
|
@ -24,13 +23,14 @@ export async function scrapWithCustomFirecrawl(
|
|||
|
||||
export async function scrapWithScrapingBee(
|
||||
url: string,
|
||||
wait_browser: string = "domcontentloaded"
|
||||
wait_browser: string = "domcontentloaded",
|
||||
timeout: number = 15000
|
||||
): Promise<string> {
|
||||
try {
|
||||
const client = new ScrapingBeeClient(process.env.SCRAPING_BEE_API_KEY);
|
||||
const response = await client.get({
|
||||
url: url,
|
||||
params: { timeout: 15000, wait_browser: wait_browser },
|
||||
params: { timeout: timeout, wait_browser: wait_browser },
|
||||
headers: { "ScrapingService-Request": "TRUE" },
|
||||
});
|
||||
|
||||
|
@ -77,14 +77,21 @@ export async function scrapWithPlaywright(url: string): Promise<string> {
|
|||
|
||||
export async function scrapSingleUrl(
|
||||
urlToScrap: string,
|
||||
toMarkdown: boolean = true
|
||||
toMarkdown: boolean = true,
|
||||
pageOptions: PageOptions = { onlyMainContent: true }
|
||||
): Promise<Document> {
|
||||
console.log(`Scraping URL: ${urlToScrap}`);
|
||||
urlToScrap = urlToScrap.trim();
|
||||
|
||||
const removeUnwantedElements = (html: string) => {
|
||||
const removeUnwantedElements = (html: string, pageOptions: PageOptions) => {
|
||||
const soup = cheerio.load(html);
|
||||
soup("script, style, iframe, noscript, meta, head").remove();
|
||||
if (pageOptions.onlyMainContent) {
|
||||
// remove any other tags that are not in the main content
|
||||
excludeNonMainTags.forEach((tag) => {
|
||||
soup(tag).remove();
|
||||
});
|
||||
}
|
||||
return soup.html();
|
||||
};
|
||||
|
||||
|
@ -100,11 +107,11 @@ export async function scrapSingleUrl(
|
|||
let text = "";
|
||||
switch (method) {
|
||||
case "firecrawl-scraper":
|
||||
text = await scrapWithCustomFirecrawl(url);
|
||||
text = await scrapWithCustomFirecrawl(url,);
|
||||
break;
|
||||
case "scrapingBee":
|
||||
if (process.env.SCRAPING_BEE_API_KEY) {
|
||||
text = await scrapWithScrapingBee(url);
|
||||
text = await scrapWithScrapingBee(url,"domcontentloaded", pageOptions.fallback === false? 7000 : 15000);
|
||||
}
|
||||
break;
|
||||
case "playwright":
|
||||
|
@ -133,8 +140,8 @@ export async function scrapSingleUrl(
|
|||
}
|
||||
break;
|
||||
}
|
||||
let cleanedHtml = removeUnwantedElements(text);
|
||||
cleanedHtml = await parseTablesToMarkdown(cleanedHtml);
|
||||
let cleanedHtml = removeUnwantedElements(text, pageOptions);
|
||||
|
||||
return [await parseMarkdown(cleanedHtml), text];
|
||||
};
|
||||
|
||||
|
@ -147,6 +154,17 @@ export async function scrapSingleUrl(
|
|||
// }
|
||||
|
||||
let [text, html] = await attemptScraping(urlToScrap, "scrapingBee");
|
||||
// Basically means that it is using /search endpoint
|
||||
if(pageOptions.fallback === false){
|
||||
const soup = cheerio.load(html);
|
||||
const metadata = extractMetadata(soup, urlToScrap);
|
||||
return {
|
||||
url: urlToScrap,
|
||||
content: text,
|
||||
markdown: text,
|
||||
metadata: { ...metadata, sourceURL: urlToScrap },
|
||||
} as Document;
|
||||
}
|
||||
if (!text || text.length < 100) {
|
||||
console.log("Falling back to playwright");
|
||||
[text, html] = await attemptScraping(urlToScrap, "playwright");
|
||||
|
|
|
@ -0,0 +1,47 @@
|
|||
import * as pdfProcessor from '../pdfProcessor';
|
||||
|
||||
describe('PDF Processing Module - Integration Test', () => {
|
||||
it('should correctly process a simple PDF file without the LLAMAPARSE_API_KEY', async () => {
|
||||
delete process.env.LLAMAPARSE_API_KEY;
|
||||
const pdfContent = await pdfProcessor.fetchAndProcessPdf('https://s3.us-east-1.amazonaws.com/storage.mendable.ai/rafa-testing/test%20%281%29.pdf');
|
||||
expect(pdfContent.trim()).toEqual("Dummy PDF file");
|
||||
});
|
||||
|
||||
// We're hitting the LLAMAPARSE rate limit 🫠
|
||||
// it('should download and read a simple PDF file by URL', async () => {
|
||||
// const pdfContent = await pdfProcessor.fetchAndProcessPdf('https://s3.us-east-1.amazonaws.com/storage.mendable.ai/rafa-testing/test%20%281%29.pdf');
|
||||
// expect(pdfContent).toEqual("Dummy PDF file");
|
||||
// });
|
||||
|
||||
// it('should download and read a complex PDF file by URL', async () => {
|
||||
// const pdfContent = await pdfProcessor.fetchAndProcessPdf('https://arxiv.org/pdf/2307.06435.pdf');
|
||||
|
||||
// const expectedContent = 'A Comprehensive Overview of Large Language Models\n' +
|
||||
// ' a a,∗ b,∗ c,d,∗ e,f e,f g,i\n' +
|
||||
// ' Humza Naveed , Asad Ullah Khan , Shi Qiu , Muhammad Saqib , Saeed Anwar , Muhammad Usman , Naveed Akhtar ,\n' +
|
||||
// ' Nick Barnes h, Ajmal Mian i\n' +
|
||||
// ' aUniversity of Engineering and Technology (UET), Lahore, Pakistan\n' +
|
||||
// ' bThe Chinese University of Hong Kong (CUHK), HKSAR, China\n' +
|
||||
// ' cUniversity of Technology Sydney (UTS), Sydney, Australia\n' +
|
||||
// ' dCommonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia\n' +
|
||||
// ' eKing Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia\n' +
|
||||
// ' fSDAIA-KFUPM Joint Research Center for Artificial Intelligence (JRCAI), Dhahran, Saudi Arabia\n' +
|
||||
// ' gThe University of Melbourne (UoM), Melbourne, Australia\n' +
|
||||
// ' hAustralian National University (ANU), Canberra, Australia\n' +
|
||||
// ' iThe University of Western Australia (UWA), Perth, Australia\n' +
|
||||
// ' Abstract\n' +
|
||||
// ' Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and\n' +
|
||||
// ' beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse\n' +
|
||||
// ' topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs,\n' +
|
||||
// ' robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in\n' +
|
||||
// ' LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering\n' +
|
||||
// ' the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise\n' +
|
||||
// ' yet comprehensive overview of the recent developments in this field. This article provides an overview of the existing literature\n' +
|
||||
// ' on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background\n' +
|
||||
// ' concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to not only\n' +
|
||||
// ' provide a systematic survey but also a quick comprehensive reference for the researchers and practitioners to draw insights from\n' +
|
||||
// ' extensive informative summaries of the existing works to advance the LLM research.\n'
|
||||
// expect(pdfContent).toContain(expectedContent);
|
||||
// }, 60000);
|
||||
|
||||
});
|
|
@ -0,0 +1,114 @@
|
|||
import { Document } from "../../../../lib/entities";
|
||||
import { replacePathsWithAbsolutePaths, replaceImgPathsWithAbsolutePaths } from "../replacePaths";
|
||||
|
||||
describe('replacePaths', () => {
|
||||
describe('replacePathsWithAbsolutePaths', () => {
|
||||
it('should replace relative paths with absolute paths', () => {
|
||||
const documents: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'This is a [link](/path/to/resource) and an image ![alt text](/path/to/image.jpg).'
|
||||
}];
|
||||
|
||||
const expectedDocuments: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'This is a [link](https://example.com/path/to/resource) and an image ![alt text](https://example.com/path/to/image.jpg).'
|
||||
}];
|
||||
|
||||
const result = replacePathsWithAbsolutePaths(documents);
|
||||
expect(result).toEqual(expectedDocuments);
|
||||
});
|
||||
|
||||
it('should not alter absolute URLs', () => {
|
||||
const documents: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'This is an [external link](https://external.com/path) and an image ![alt text](https://example.com/path/to/image.jpg).'
|
||||
}];
|
||||
|
||||
const result = replacePathsWithAbsolutePaths(documents);
|
||||
expect(result).toEqual(documents); // Expect no change
|
||||
});
|
||||
|
||||
it('should not alter data URLs for images', () => {
|
||||
const documents: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'This is an image: ![alt text]().'
|
||||
}];
|
||||
|
||||
const result = replacePathsWithAbsolutePaths(documents);
|
||||
expect(result).toEqual(documents); // Expect no change
|
||||
});
|
||||
|
||||
it('should handle multiple links and images correctly', () => {
|
||||
const documents: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'Here are two links: [link1](/path1) and [link2](/path2), and two images: ![img1](/img1.jpg) ![img2](/img2.jpg).'
|
||||
}];
|
||||
|
||||
const expectedDocuments: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'Here are two links: [link1](https://example.com/path1) and [link2](https://example.com/path2), and two images: ![img1](https://example.com/img1.jpg) ![img2](https://example.com/img2.jpg).'
|
||||
}];
|
||||
|
||||
const result = replacePathsWithAbsolutePaths(documents);
|
||||
expect(result).toEqual(expectedDocuments);
|
||||
});
|
||||
|
||||
it('should correctly handle a mix of absolute and relative paths', () => {
|
||||
const documents: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'Mixed paths: [relative](/path), [absolute](https://example.com/path), and [data image]().'
|
||||
}];
|
||||
|
||||
const expectedDocuments: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'Mixed paths: [relative](https://example.com/path), [absolute](https://example.com/path), and [data image]().'
|
||||
}];
|
||||
|
||||
const result = replacePathsWithAbsolutePaths(documents);
|
||||
expect(result).toEqual(expectedDocuments);
|
||||
});
|
||||
|
||||
});
|
||||
|
||||
describe('replaceImgPathsWithAbsolutePaths', () => {
|
||||
it('should replace relative image paths with absolute paths', () => {
|
||||
const documents: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'Here is an image: ![alt text](/path/to/image.jpg).'
|
||||
}];
|
||||
|
||||
const expectedDocuments: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'Here is an image: ![alt text](https://example.com/path/to/image.jpg).'
|
||||
}];
|
||||
|
||||
const result = replaceImgPathsWithAbsolutePaths(documents);
|
||||
expect(result).toEqual(expectedDocuments);
|
||||
});
|
||||
|
||||
it('should not alter data:image URLs', () => {
|
||||
const documents: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'An image with a data URL: ![alt text]().'
|
||||
}];
|
||||
|
||||
const result = replaceImgPathsWithAbsolutePaths(documents);
|
||||
expect(result).toEqual(documents); // Expect no change
|
||||
});
|
||||
|
||||
it('should handle multiple images with a mix of data and relative URLs', () => {
|
||||
const documents: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'Multiple images: ![img1](/img1.jpg) ![img2]() ![img3](/img3.jpg).'
|
||||
}];
|
||||
|
||||
const expectedDocuments: Document[] = [{
|
||||
metadata: { sourceURL: 'https://example.com' },
|
||||
content: 'Multiple images: ![img1](https://example.com/img1.jpg) ![img2]() ![img3](https://example.com/img3.jpg).'
|
||||
}];
|
||||
|
||||
const result = replaceImgPathsWithAbsolutePaths(documents);
|
||||
expect(result).toEqual(expectedDocuments);
|
||||
});
|
||||
});
|
||||
});
|
19
apps/api/src/scraper/WebScraper/utils/blocklist.ts
Normal file
19
apps/api/src/scraper/WebScraper/utils/blocklist.ts
Normal file
|
@ -0,0 +1,19 @@
|
|||
const socialMediaBlocklist = [
|
||||
'facebook.com',
|
||||
'twitter.com',
|
||||
'instagram.com',
|
||||
'linkedin.com',
|
||||
'pinterest.com',
|
||||
'snapchat.com',
|
||||
'tiktok.com',
|
||||
'reddit.com',
|
||||
'tumblr.com',
|
||||
'flickr.com',
|
||||
'whatsapp.com',
|
||||
'wechat.com',
|
||||
'telegram.org',
|
||||
];
|
||||
|
||||
export function isUrlBlocked(url: string): boolean {
|
||||
return socialMediaBlocklist.some(domain => url.includes(domain));
|
||||
}
|
60
apps/api/src/scraper/WebScraper/utils/excludeTags.ts
Normal file
60
apps/api/src/scraper/WebScraper/utils/excludeTags.ts
Normal file
|
@ -0,0 +1,60 @@
|
|||
export const excludeNonMainTags = [
|
||||
"header",
|
||||
"footer",
|
||||
"nav",
|
||||
"aside",
|
||||
".header",
|
||||
".top",
|
||||
".navbar",
|
||||
"#header",
|
||||
".footer",
|
||||
".bottom",
|
||||
"#footer",
|
||||
".sidebar",
|
||||
".side",
|
||||
".aside",
|
||||
"#sidebar",
|
||||
".modal",
|
||||
".popup",
|
||||
"#modal",
|
||||
".overlay",
|
||||
".ad",
|
||||
".ads",
|
||||
".advert",
|
||||
"#ad",
|
||||
".lang-selector",
|
||||
".language",
|
||||
"#language-selector",
|
||||
".social",
|
||||
".social-media",
|
||||
".social-links",
|
||||
"#social",
|
||||
".menu",
|
||||
".navigation",
|
||||
"#nav",
|
||||
".breadcrumbs",
|
||||
"#breadcrumbs",
|
||||
".form",
|
||||
"form",
|
||||
"#search-form",
|
||||
".search",
|
||||
"#search",
|
||||
".share",
|
||||
"#share",
|
||||
".pagination",
|
||||
"#pagination",
|
||||
".widget",
|
||||
"#widget",
|
||||
".related",
|
||||
"#related",
|
||||
".tag",
|
||||
"#tag",
|
||||
".category",
|
||||
"#category",
|
||||
".comment",
|
||||
"#comment",
|
||||
".reply",
|
||||
"#reply",
|
||||
".author",
|
||||
"#author",
|
||||
];
|
|
@ -1,41 +0,0 @@
|
|||
export async function getImageDescription(
|
||||
imageUrl: string,
|
||||
backText: string,
|
||||
frontText: string
|
||||
): Promise<string> {
|
||||
const { OpenAI } = require("openai");
|
||||
const openai = new OpenAI();
|
||||
|
||||
try {
|
||||
const response = await openai.chat.completions.create({
|
||||
model: "gpt-4-turbo",
|
||||
messages: [
|
||||
{
|
||||
role: "user",
|
||||
content: [
|
||||
{
|
||||
type: "text",
|
||||
text:
|
||||
"What's in the image? You need to answer with the content for the alt tag of the image. To help you with the context, the image is in the following text: " +
|
||||
backText +
|
||||
" and the following text: " +
|
||||
frontText +
|
||||
". Be super concise.",
|
||||
},
|
||||
{
|
||||
type: "image_url",
|
||||
image_url: {
|
||||
url: imageUrl,
|
||||
},
|
||||
},
|
||||
],
|
||||
},
|
||||
],
|
||||
});
|
||||
|
||||
return response.choices[0].message.content;
|
||||
} catch (error) {
|
||||
console.error("Error generating image alt text:", error?.message);
|
||||
return "";
|
||||
}
|
||||
}
|
88
apps/api/src/scraper/WebScraper/utils/imageDescription.ts
Normal file
88
apps/api/src/scraper/WebScraper/utils/imageDescription.ts
Normal file
|
@ -0,0 +1,88 @@
|
|||
import Anthropic from '@anthropic-ai/sdk';
|
||||
import axios from 'axios';
|
||||
|
||||
export async function getImageDescription(
|
||||
imageUrl: string,
|
||||
backText: string,
|
||||
frontText: string,
|
||||
model: string = "gpt-4-turbo"
|
||||
): Promise<string> {
|
||||
try {
|
||||
const prompt = "What's in the image? You need to answer with the content for the alt tag of the image. To help you with the context, the image is in the following text: " +
|
||||
backText +
|
||||
" and the following text: " +
|
||||
frontText +
|
||||
". Be super concise."
|
||||
|
||||
switch (model) {
|
||||
case 'claude-3-opus': {
|
||||
if (!process.env.ANTHROPIC_API_KEY) {
|
||||
throw new Error("No Anthropic API key provided");
|
||||
}
|
||||
const imageRequest = await axios.get(imageUrl, { responseType: 'arraybuffer' });
|
||||
const imageMediaType = 'image/png';
|
||||
const imageData = Buffer.from(imageRequest.data, 'binary').toString('base64');
|
||||
|
||||
const anthropic = new Anthropic();
|
||||
const response = await anthropic.messages.create({
|
||||
model: "claude-3-opus-20240229",
|
||||
max_tokens: 1024,
|
||||
messages: [
|
||||
{
|
||||
role: "user",
|
||||
content: [
|
||||
{
|
||||
type: "image",
|
||||
source: {
|
||||
type: "base64",
|
||||
media_type: imageMediaType,
|
||||
data: imageData,
|
||||
},
|
||||
},
|
||||
{
|
||||
type: "text",
|
||||
text: prompt
|
||||
}
|
||||
],
|
||||
}
|
||||
]
|
||||
});
|
||||
|
||||
return response.content[0].text;
|
||||
}
|
||||
default: {
|
||||
if (!process.env.OPENAI_API_KEY) {
|
||||
throw new Error("No OpenAI API key provided");
|
||||
}
|
||||
|
||||
const { OpenAI } = require("openai");
|
||||
const openai = new OpenAI();
|
||||
|
||||
const response = await openai.chat.completions.create({
|
||||
model: "gpt-4-turbo",
|
||||
messages: [
|
||||
{
|
||||
role: "user",
|
||||
content: [
|
||||
{
|
||||
type: "text",
|
||||
text: prompt,
|
||||
},
|
||||
{
|
||||
type: "image_url",
|
||||
image_url: {
|
||||
url: imageUrl,
|
||||
},
|
||||
},
|
||||
],
|
||||
},
|
||||
],
|
||||
});
|
||||
return response.choices[0].message.content;
|
||||
}
|
||||
}
|
||||
} catch (error) {
|
||||
console.error("Error generating image alt text:", error?.message);
|
||||
return "";
|
||||
}
|
||||
}
|
|
@ -1,4 +1,3 @@
|
|||
// import * as cheerio from 'cheerio';
|
||||
import { CheerioAPI } from "cheerio";
|
||||
interface Metadata {
|
||||
title?: string;
|
||||
|
@ -8,6 +7,14 @@ interface Metadata {
|
|||
robots?: string;
|
||||
ogTitle?: string;
|
||||
ogDescription?: string;
|
||||
ogUrl?: string;
|
||||
ogImage?: string;
|
||||
ogAudio?: string;
|
||||
ogDeterminer?: string;
|
||||
ogLocale?: string;
|
||||
ogLocaleAlternate?: string[];
|
||||
ogSiteName?: string;
|
||||
ogVideo?: string;
|
||||
dctermsCreated?: string;
|
||||
dcDateCreated?: string;
|
||||
dcDate?: string;
|
||||
|
@ -17,7 +24,6 @@ interface Metadata {
|
|||
dctermsSubject?: string;
|
||||
dcSubject?: string;
|
||||
dcDescription?: string;
|
||||
ogImage?: string;
|
||||
dctermsKeywords?: string;
|
||||
modifiedTime?: string;
|
||||
publishedTime?: string;
|
||||
|
@ -33,6 +39,14 @@ export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
|
|||
let robots: string | null = null;
|
||||
let ogTitle: string | null = null;
|
||||
let ogDescription: string | null = null;
|
||||
let ogUrl: string | null = null;
|
||||
let ogImage: string | null = null;
|
||||
let ogAudio: string | null = null;
|
||||
let ogDeterminer: string | null = null;
|
||||
let ogLocale: string | null = null;
|
||||
let ogLocaleAlternate: string[] | null = null;
|
||||
let ogSiteName: string | null = null;
|
||||
let ogVideo: string | null = null;
|
||||
let dctermsCreated: string | null = null;
|
||||
let dcDateCreated: string | null = null;
|
||||
let dcDate: string | null = null;
|
||||
|
@ -42,7 +56,6 @@ export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
|
|||
let dctermsSubject: string | null = null;
|
||||
let dcSubject: string | null = null;
|
||||
let dcDescription: string | null = null;
|
||||
let ogImage: string | null = null;
|
||||
let dctermsKeywords: string | null = null;
|
||||
let modifiedTime: string | null = null;
|
||||
let publishedTime: string | null = null;
|
||||
|
@ -62,11 +75,18 @@ export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
|
|||
robots = soup('meta[name="robots"]').attr("content") || null;
|
||||
ogTitle = soup('meta[property="og:title"]').attr("content") || null;
|
||||
ogDescription = soup('meta[property="og:description"]').attr("content") || null;
|
||||
ogUrl = soup('meta[property="og:url"]').attr("content") || null;
|
||||
ogImage = soup('meta[property="og:image"]').attr("content") || null;
|
||||
ogAudio = soup('meta[property="og:audio"]').attr("content") || null;
|
||||
ogDeterminer = soup('meta[property="og:determiner"]').attr("content") || null;
|
||||
ogLocale = soup('meta[property="og:locale"]').attr("content") || null;
|
||||
ogLocaleAlternate = soup('meta[property="og:locale:alternate"]').map((i, el) => soup(el).attr("content")).get() || null;
|
||||
ogSiteName = soup('meta[property="og:site_name"]').attr("content") || null;
|
||||
ogVideo = soup('meta[property="og:video"]').attr("content") || null;
|
||||
articleSection = soup('meta[name="article:section"]').attr("content") || null;
|
||||
articleTag = soup('meta[name="article:tag"]').attr("content") || null;
|
||||
publishedTime = soup('meta[property="article:published_time"]').attr("content") || null;
|
||||
modifiedTime = soup('meta[property="article:modified_time"]').attr("content") || null;
|
||||
ogImage = soup('meta[property="og:image"]').attr("content") || null;
|
||||
dctermsKeywords = soup('meta[name="dcterms.keywords"]').attr("content") || null;
|
||||
dcDescription = soup('meta[name="dc.description"]').attr("content") || null;
|
||||
dcSubject = soup('meta[name="dc.subject"]').attr("content") || null;
|
||||
|
@ -90,6 +110,14 @@ export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
|
|||
...(robots ? { robots } : {}),
|
||||
...(ogTitle ? { ogTitle } : {}),
|
||||
...(ogDescription ? { ogDescription } : {}),
|
||||
...(ogUrl ? { ogUrl } : {}),
|
||||
...(ogImage ? { ogImage } : {}),
|
||||
...(ogAudio ? { ogAudio } : {}),
|
||||
...(ogDeterminer ? { ogDeterminer } : {}),
|
||||
...(ogLocale ? { ogLocale } : {}),
|
||||
...(ogLocaleAlternate ? { ogLocaleAlternate } : {}),
|
||||
...(ogSiteName ? { ogSiteName } : {}),
|
||||
...(ogVideo ? { ogVideo } : {}),
|
||||
...(dctermsCreated ? { dctermsCreated } : {}),
|
||||
...(dcDateCreated ? { dcDateCreated } : {}),
|
||||
...(dcDate ? { dcDate } : {}),
|
||||
|
@ -99,7 +127,6 @@ export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
|
|||
...(dctermsSubject ? { dctermsSubject } : {}),
|
||||
...(dcSubject ? { dcSubject } : {}),
|
||||
...(dcDescription ? { dcDescription } : {}),
|
||||
...(ogImage ? { ogImage } : {}),
|
||||
...(dctermsKeywords ? { dctermsKeywords } : {}),
|
||||
...(modifiedTime ? { modifiedTime } : {}),
|
||||
...(publishedTime ? { publishedTime } : {}),
|
||||
|
|
|
@ -24,7 +24,6 @@ export const parseTablesToMarkdown = async (html: string): Promise<string> => {
|
|||
if (isTableEmpty) {
|
||||
markdownTable = '';
|
||||
}
|
||||
console.log({markdownTable})
|
||||
replacements.push({ start, end, markdownTable });
|
||||
});
|
||||
}
|
||||
|
|
108
apps/api/src/scraper/WebScraper/utils/pdfProcessor.ts
Normal file
108
apps/api/src/scraper/WebScraper/utils/pdfProcessor.ts
Normal file
|
@ -0,0 +1,108 @@
|
|||
import axios, { AxiosResponse } from "axios";
|
||||
import fs from "fs";
|
||||
import { createReadStream, createWriteStream } from "node:fs";
|
||||
import FormData from "form-data";
|
||||
import dotenv from "dotenv";
|
||||
import pdf from "pdf-parse";
|
||||
import path from "path";
|
||||
import os from "os";
|
||||
|
||||
dotenv.config();
|
||||
|
||||
export async function fetchAndProcessPdf(url: string): Promise<string> {
|
||||
const tempFilePath = await downloadPdf(url);
|
||||
const content = await processPdfToText(tempFilePath);
|
||||
fs.unlinkSync(tempFilePath); // Clean up the temporary file
|
||||
return content;
|
||||
}
|
||||
|
||||
async function downloadPdf(url: string): Promise<string> {
|
||||
const response = await axios({
|
||||
url,
|
||||
method: 'GET',
|
||||
responseType: 'stream',
|
||||
});
|
||||
|
||||
const tempFilePath = path.join(os.tmpdir(), `tempPdf-${Date.now()}.pdf`);
|
||||
const writer = createWriteStream(tempFilePath);
|
||||
|
||||
response.data.pipe(writer);
|
||||
|
||||
return new Promise((resolve, reject) => {
|
||||
writer.on('finish', () => resolve(tempFilePath));
|
||||
writer.on('error', reject);
|
||||
});
|
||||
}
|
||||
|
||||
export async function processPdfToText(filePath: string): Promise<string> {
|
||||
let content = "";
|
||||
|
||||
if (process.env.LLAMAPARSE_API_KEY) {
|
||||
const apiKey = process.env.LLAMAPARSE_API_KEY;
|
||||
const headers = {
|
||||
Authorization: `Bearer ${apiKey}`,
|
||||
};
|
||||
const base_url = "https://api.cloud.llamaindex.ai/api/parsing";
|
||||
const fileType2 = "application/pdf";
|
||||
|
||||
try {
|
||||
const formData = new FormData();
|
||||
formData.append("file", createReadStream(filePath), {
|
||||
filename: filePath,
|
||||
contentType: fileType2,
|
||||
});
|
||||
|
||||
const uploadUrl = `${base_url}/upload`;
|
||||
const uploadResponse = await axios.post(uploadUrl, formData, {
|
||||
headers: {
|
||||
...headers,
|
||||
...formData.getHeaders(),
|
||||
},
|
||||
});
|
||||
|
||||
const jobId = uploadResponse.data.id;
|
||||
const resultType = "text";
|
||||
const resultUrl = `${base_url}/job/${jobId}/result/${resultType}`;
|
||||
|
||||
let resultResponse: AxiosResponse;
|
||||
let attempt = 0;
|
||||
const maxAttempts = 10; // Maximum number of attempts
|
||||
let resultAvailable = false;
|
||||
|
||||
while (attempt < maxAttempts && !resultAvailable) {
|
||||
try {
|
||||
resultResponse = await axios.get(resultUrl, { headers });
|
||||
if (resultResponse.status === 200) {
|
||||
resultAvailable = true; // Exit condition met
|
||||
} else {
|
||||
// If the status code is not 200, increment the attempt counter and wait
|
||||
attempt++;
|
||||
await new Promise((resolve) => setTimeout(resolve, 250)); // Wait for 2 seconds
|
||||
}
|
||||
} catch (error) {
|
||||
console.error("Error fetching result:", error);
|
||||
attempt++;
|
||||
await new Promise((resolve) => setTimeout(resolve, 250)); // Wait for 2 seconds before retrying
|
||||
// You may want to handle specific errors differently
|
||||
}
|
||||
}
|
||||
|
||||
if (!resultAvailable) {
|
||||
content = await processPdf(filePath);
|
||||
}
|
||||
content = resultResponse.data[resultType];
|
||||
} catch (error) {
|
||||
console.error("Error processing document:", filePath, error);
|
||||
content = await processPdf(filePath);
|
||||
}
|
||||
} else {
|
||||
content = await processPdf(filePath);
|
||||
}
|
||||
return content;
|
||||
}
|
||||
|
||||
async function processPdf(file: string){
|
||||
const fileContent = fs.readFileSync(file);
|
||||
const data = await pdf(fileContent);
|
||||
return data.text;
|
||||
}
|
80
apps/api/src/scraper/WebScraper/utils/replacePaths.ts
Normal file
80
apps/api/src/scraper/WebScraper/utils/replacePaths.ts
Normal file
|
@ -0,0 +1,80 @@
|
|||
import { Document } from "../../../lib/entities";
|
||||
|
||||
export const replacePathsWithAbsolutePaths = (documents: Document[]): Document[] => {
|
||||
try {
|
||||
documents.forEach((document) => {
|
||||
const baseUrl = new URL(document.metadata.sourceURL).origin;
|
||||
const paths =
|
||||
document.content.match(
|
||||
/(!?\[.*?\])\(((?:[^()]+|\((?:[^()]+|\([^()]*\))*\))*)\)|href="([^"]+)"/g
|
||||
) || [];
|
||||
|
||||
paths.forEach((path: string) => {
|
||||
const isImage = path.startsWith("!");
|
||||
let matchedUrl = path.match(/\(([^)]+)\)/) || path.match(/href="([^"]+)"/);
|
||||
let url = matchedUrl[1];
|
||||
|
||||
if (!url.startsWith("data:") && !url.startsWith("http")) {
|
||||
if (url.startsWith("/")) {
|
||||
url = url.substring(1);
|
||||
}
|
||||
url = new URL(url, baseUrl).toString();
|
||||
}
|
||||
|
||||
const markdownLinkOrImageText = path.match(/(!?\[.*?\])/)[0];
|
||||
if (isImage) {
|
||||
document.content = document.content.replace(
|
||||
path,
|
||||
`${markdownLinkOrImageText}(${url})`
|
||||
);
|
||||
} else {
|
||||
document.content = document.content.replace(
|
||||
path,
|
||||
`${markdownLinkOrImageText}(${url})`
|
||||
);
|
||||
}
|
||||
});
|
||||
});
|
||||
|
||||
return documents;
|
||||
} catch (error) {
|
||||
console.error("Error replacing paths with absolute paths", error);
|
||||
return documents;
|
||||
}
|
||||
};
|
||||
|
||||
export const replaceImgPathsWithAbsolutePaths = (documents: Document[]): Document[] => {
|
||||
try {
|
||||
documents.forEach((document) => {
|
||||
const baseUrl = new URL(document.metadata.sourceURL).origin;
|
||||
const images =
|
||||
document.content.match(
|
||||
/!\[.*?\]\(((?:[^()]+|\((?:[^()]+|\([^()]*\))*\))*)\)/g
|
||||
) || [];
|
||||
|
||||
images.forEach((image: string) => {
|
||||
let imageUrl = image.match(/\(([^)]+)\)/)[1];
|
||||
let altText = image.match(/\[(.*?)\]/)[1];
|
||||
|
||||
if (!imageUrl.startsWith("data:image")) {
|
||||
if (!imageUrl.startsWith("http")) {
|
||||
if (imageUrl.startsWith("/")) {
|
||||
imageUrl = imageUrl.substring(1);
|
||||
}
|
||||
imageUrl = new URL(imageUrl, baseUrl).toString();
|
||||
}
|
||||
}
|
||||
|
||||
document.content = document.content.replace(
|
||||
image,
|
||||
`![${altText}](${imageUrl})`
|
||||
);
|
||||
});
|
||||
});
|
||||
|
||||
return documents;
|
||||
} catch (error) {
|
||||
console.error("Error replacing img paths with absolute paths", error);
|
||||
return documents;
|
||||
}
|
||||
};
|
131
apps/api/src/search/googlesearch.ts
Normal file
131
apps/api/src/search/googlesearch.ts
Normal file
|
@ -0,0 +1,131 @@
|
|||
import axios from 'axios';
|
||||
import * as cheerio from 'cheerio';
|
||||
import * as querystring from 'querystring';
|
||||
|
||||
const _useragent_list = [
|
||||
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0',
|
||||
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
|
||||
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
|
||||
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
|
||||
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
|
||||
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 Edg/111.0.1661.62',
|
||||
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/111.0'
|
||||
];
|
||||
|
||||
function get_useragent(): string {
|
||||
return _useragent_list[Math.floor(Math.random() * _useragent_list.length)];
|
||||
}
|
||||
|
||||
async function _req(term: string, results: number, lang: string, start: number, proxies: any, timeout: number, tbs: string = null, filter: string = null) {
|
||||
const params = {
|
||||
"q": term,
|
||||
"num": results, // Number of results to return
|
||||
"hl": lang,
|
||||
"start": start,
|
||||
};
|
||||
if (tbs) {
|
||||
params["tbs"] = tbs;
|
||||
}
|
||||
if (filter) {
|
||||
params["filter"] = filter;
|
||||
}
|
||||
try {
|
||||
const resp = await axios.get("https://www.google.com/search", {
|
||||
headers: {
|
||||
"User-Agent": get_useragent()
|
||||
},
|
||||
params: params,
|
||||
proxy: proxies,
|
||||
timeout: timeout,
|
||||
});
|
||||
return resp;
|
||||
} catch (error) {
|
||||
if (error.response && error.response.status === 429) {
|
||||
throw new Error('Google Search: Too many requests, try again later.');
|
||||
}
|
||||
throw error;
|
||||
}
|
||||
}
|
||||
|
||||
class SearchResult {
|
||||
url: string;
|
||||
title: string;
|
||||
description: string;
|
||||
|
||||
constructor(url: string, title: string, description: string) {
|
||||
this.url = url;
|
||||
this.title = title;
|
||||
this.description = description;
|
||||
}
|
||||
|
||||
toString(): string {
|
||||
return `SearchResult(url=${this.url}, title=${this.title}, description=${this.description})`;
|
||||
}
|
||||
}
|
||||
|
||||
export async function google_search(term: string, advanced = false, num_results = 7, tbs = null, filter = null, lang = "en", proxy = null, sleep_interval = 0, timeout = 5000, ) :Promise<string[]> {
|
||||
const escaped_term = querystring.escape(term);
|
||||
|
||||
let proxies = null;
|
||||
if (proxy) {
|
||||
if (proxy.startsWith("https")) {
|
||||
proxies = {"https": proxy};
|
||||
} else {
|
||||
proxies = {"http": proxy};
|
||||
}
|
||||
}
|
||||
|
||||
// TODO: knowledge graph, answer box, etc.
|
||||
|
||||
let start = 0;
|
||||
let results : string[] = [];
|
||||
let attempts = 0;
|
||||
const maxAttempts = 20; // Define a maximum number of attempts to prevent infinite loop
|
||||
while (start < num_results && attempts < maxAttempts) {
|
||||
try {
|
||||
const resp = await _req(escaped_term, num_results - start, lang, start, proxies, timeout, tbs, filter);
|
||||
const $ = cheerio.load(resp.data);
|
||||
const result_block = $("div.g");
|
||||
if (result_block.length === 0) {
|
||||
start += 1;
|
||||
attempts += 1;
|
||||
} else {
|
||||
attempts = 0; // Reset attempts if we have results
|
||||
}
|
||||
result_block.each((index, element) => {
|
||||
const linkElement = $(element).find("a");
|
||||
const link = linkElement && linkElement.attr("href") ? linkElement.attr("href") : null;
|
||||
const title = $(element).find("h3");
|
||||
const ogImage = $(element).find("img").eq(1).attr("src");
|
||||
const description_box = $(element).find("div[style='-webkit-line-clamp:2']");
|
||||
const answerBox = $(element).find(".mod").text();
|
||||
if (description_box) {
|
||||
const description = description_box.text();
|
||||
if (link && title && description) {
|
||||
start += 1;
|
||||
if (advanced) {
|
||||
// results.push(new SearchResult(link, title.text(), description));
|
||||
} else {
|
||||
results.push(link);
|
||||
}
|
||||
}
|
||||
}
|
||||
});
|
||||
await new Promise(resolve => setTimeout(resolve, sleep_interval * 1000));
|
||||
} catch (error) {
|
||||
if (error.message === 'Too many requests') {
|
||||
console.warn('Too many requests, breaking the loop');
|
||||
break;
|
||||
}
|
||||
throw error;
|
||||
}
|
||||
|
||||
if (start === 0) {
|
||||
return results;
|
||||
}
|
||||
}
|
||||
if (attempts >= maxAttempts) {
|
||||
console.warn('Max attempts reached, breaking the loop');
|
||||
}
|
||||
return results
|
||||
}
|
45
apps/api/src/search/index.ts
Normal file
45
apps/api/src/search/index.ts
Normal file
|
@ -0,0 +1,45 @@
|
|||
import { google_search } from "./googlesearch";
|
||||
import { serper_search } from "./serper";
|
||||
|
||||
export async function search({
|
||||
query,
|
||||
advanced = false,
|
||||
num_results = 7,
|
||||
tbs = null,
|
||||
filter = null,
|
||||
lang = "en",
|
||||
proxy = null,
|
||||
sleep_interval = 0,
|
||||
timeout = 5000,
|
||||
}: {
|
||||
query: string;
|
||||
advanced?: boolean;
|
||||
num_results?: number;
|
||||
tbs?: string;
|
||||
filter?: string;
|
||||
lang?: string;
|
||||
proxy?: string;
|
||||
sleep_interval?: number;
|
||||
timeout?: number;
|
||||
}) {
|
||||
try {
|
||||
if (process.env.SERPER_API_KEY && !tbs) {
|
||||
return await serper_search(query, num_results);
|
||||
}
|
||||
return await google_search(
|
||||
query,
|
||||
advanced,
|
||||
num_results,
|
||||
tbs,
|
||||
filter,
|
||||
lang,
|
||||
proxy,
|
||||
sleep_interval,
|
||||
timeout
|
||||
);
|
||||
} catch (error) {
|
||||
console.error("Error in search function: ", error);
|
||||
return []
|
||||
}
|
||||
// if process.env.SERPER_API_KEY is set, use serper
|
||||
}
|
28
apps/api/src/search/serper.ts
Normal file
28
apps/api/src/search/serper.ts
Normal file
|
@ -0,0 +1,28 @@
|
|||
import axios from "axios";
|
||||
import dotenv from "dotenv";
|
||||
|
||||
dotenv.config();
|
||||
|
||||
export async function serper_search(q, num_results) : Promise<string[]> {
|
||||
let data = JSON.stringify({
|
||||
q: q,
|
||||
"num": num_results,
|
||||
|
||||
});
|
||||
|
||||
let config = {
|
||||
method: "POST",
|
||||
url: "https://google.serper.dev/search",
|
||||
headers: {
|
||||
"X-API-KEY": process.env.SERPER_API_KEY,
|
||||
"Content-Type": "application/json",
|
||||
},
|
||||
data: data,
|
||||
};
|
||||
const response = await axios(config);
|
||||
if (response && response.data && Array.isArray(response.data.organic)) {
|
||||
return response.data.organic.map((a) => a.link);
|
||||
} else {
|
||||
return [];
|
||||
}
|
||||
}
|
|
@ -1,7 +1,12 @@
|
|||
import { withAuth } from "../../lib/withAuth";
|
||||
import { supabase_service } from "../supabase";
|
||||
|
||||
const FREE_CREDITS = 100;
|
||||
|
||||
export async function billTeam(team_id: string, credits: number) {
|
||||
return withAuth(supaBillTeam)(team_id, credits);
|
||||
}
|
||||
export async function supaBillTeam(team_id: string, credits: number) {
|
||||
if (team_id === "preview") {
|
||||
return { success: true, message: "Preview team, no credits used" };
|
||||
}
|
||||
|
@ -52,8 +57,11 @@ export async function billTeam(team_id: string, credits: number) {
|
|||
return { success: true, credit_usage };
|
||||
}
|
||||
|
||||
// if team has enough credits for the operation, return true, else return false
|
||||
export async function checkTeamCredits(team_id: string, credits: number) {
|
||||
return withAuth(supaCheckTeamCredits)(team_id, credits);
|
||||
}
|
||||
// if team has enough credits for the operation, return true, else return false
|
||||
export async function supaCheckTeamCredits(team_id: string, credits: number) {
|
||||
if (team_id === "preview") {
|
||||
return { success: true, message: "Preview team, no credits used" };
|
||||
}
|
||||
|
|
34
apps/api/src/services/logging/log_job.ts
Normal file
34
apps/api/src/services/logging/log_job.ts
Normal file
|
@ -0,0 +1,34 @@
|
|||
import { supabase_service } from "../supabase";
|
||||
import { FirecrawlJob } from "../../types";
|
||||
import "dotenv/config";
|
||||
|
||||
export async function logJob(job: FirecrawlJob) {
|
||||
try {
|
||||
// Only log jobs in production
|
||||
if (process.env.ENV !== "production") {
|
||||
return;
|
||||
}
|
||||
const { data, error } = await supabase_service
|
||||
.from("firecrawl_jobs")
|
||||
.insert([
|
||||
{
|
||||
success: job.success,
|
||||
message: job.message,
|
||||
num_docs: job.num_docs,
|
||||
docs: job.docs,
|
||||
time_taken: job.time_taken,
|
||||
team_id: job.team_id === "preview" ? null : job.team_id,
|
||||
mode: job.mode,
|
||||
url: job.url,
|
||||
crawler_options: job.crawlerOptions,
|
||||
page_options: job.pageOptions,
|
||||
origin: job.origin,
|
||||
},
|
||||
]);
|
||||
if (error) {
|
||||
console.error("Error logging job:\n", error);
|
||||
}
|
||||
} catch (error) {
|
||||
console.error("Error logging job:\n", error);
|
||||
}
|
||||
}
|
|
@ -1,4 +1,19 @@
|
|||
const { Logtail } = require("@logtail/node");
|
||||
//dot env
|
||||
require("dotenv").config();
|
||||
export const logtail = new Logtail(process.env.LOGTAIL_KEY);
|
||||
import { Logtail } from "@logtail/node";
|
||||
import "dotenv/config";
|
||||
|
||||
// A mock Logtail class to handle cases where LOGTAIL_KEY is not provided
|
||||
class MockLogtail {
|
||||
info(message: string, context?: Record<string, any>): void {
|
||||
console.log(message, context);
|
||||
}
|
||||
error(message: string, context: Record<string, any> = {}): void {
|
||||
console.error(message, context);
|
||||
}
|
||||
}
|
||||
|
||||
// Using the actual Logtail class if LOGTAIL_KEY exists, otherwise using the mock class
|
||||
// Additionally, print a warning to the terminal if LOGTAIL_KEY is not provided
|
||||
export const logtail = process.env.LOGTAIL_KEY ? new Logtail(process.env.LOGTAIL_KEY) : (() => {
|
||||
console.warn("LOGTAIL_KEY is not provided - your events will not be logged. Using MockLogtail as a fallback. see logtail.ts for more.");
|
||||
return new MockLogtail();
|
||||
})();
|
||||
|
|
|
@ -3,8 +3,8 @@ import { getWebScraperQueue } from "./queue-service";
|
|||
import "dotenv/config";
|
||||
import { logtail } from "./logtail";
|
||||
import { startWebScraperPipeline } from "../main/runWebScraper";
|
||||
import { WebScraperDataProvider } from "../scraper/WebScraper";
|
||||
import { callWebhook } from "./webhook";
|
||||
import { logJob } from "./logging/log_job";
|
||||
|
||||
getWebScraperQueue().process(
|
||||
Math.floor(Number(process.env.NUM_WORKERS_PER_QUEUE ?? 8)),
|
||||
|
@ -16,7 +16,11 @@ getWebScraperQueue().process(
|
|||
current_step: "SCRAPING",
|
||||
current_url: "",
|
||||
});
|
||||
const start = Date.now();
|
||||
|
||||
const { success, message, docs } = await startWebScraperPipeline({ job });
|
||||
const end = Date.now();
|
||||
const timeTakenInSeconds = (end - start) / 1000;
|
||||
|
||||
const data = {
|
||||
success: success,
|
||||
|
@ -30,6 +34,20 @@ getWebScraperQueue().process(
|
|||
};
|
||||
|
||||
await callWebhook(job.data.team_id, data);
|
||||
|
||||
await logJob({
|
||||
success: success,
|
||||
message: message,
|
||||
num_docs: docs.length,
|
||||
docs: docs,
|
||||
time_taken: timeTakenInSeconds,
|
||||
team_id: job.data.team_id,
|
||||
mode: "crawl",
|
||||
url: job.data.url,
|
||||
crawlerOptions: job.data.crawlerOptions,
|
||||
pageOptions: job.data.pageOptions,
|
||||
origin: job.data.origin,
|
||||
});
|
||||
done(null, data);
|
||||
} catch (error) {
|
||||
if (error instanceof CustomError) {
|
||||
|
@ -56,6 +74,19 @@ getWebScraperQueue().process(
|
|||
"Something went wrong... Contact help@mendable.ai or try again." /* etc... */,
|
||||
};
|
||||
await callWebhook(job.data.team_id, data);
|
||||
await logJob({
|
||||
success: false,
|
||||
message: typeof error === 'string' ? error : (error.message ?? "Something went wrong... Contact help@mendable.ai"),
|
||||
num_docs: 0,
|
||||
docs: [],
|
||||
time_taken: 0,
|
||||
team_id: job.data.team_id,
|
||||
mode: "crawl",
|
||||
url: job.data.url,
|
||||
crawlerOptions: job.data.crawlerOptions,
|
||||
pageOptions: job.data.pageOptions,
|
||||
origin: job.data.origin,
|
||||
});
|
||||
done(null, data);
|
||||
}
|
||||
}
|
||||
|
|
|
@ -1,12 +1,16 @@
|
|||
import { RateLimiterRedis } from "rate-limiter-flexible";
|
||||
import * as redis from "redis";
|
||||
import { RateLimiterMode } from "../../src/types";
|
||||
|
||||
const MAX_REQUESTS_PER_MINUTE_PREVIEW = 5;
|
||||
const MAX_CRAWLS_PER_MINUTE_STARTER = 2;
|
||||
const MAX_CRAWLS_PER_MINUTE_STANDAR = 4;
|
||||
const MAX_CRAWLS_PER_MINUTE_STANDARD = 4;
|
||||
const MAX_CRAWLS_PER_MINUTE_SCALE = 20;
|
||||
|
||||
const MAX_REQUESTS_PER_MINUTE_ACCOUNT = 40;
|
||||
const MAX_REQUESTS_PER_MINUTE_ACCOUNT = 20;
|
||||
|
||||
const MAX_REQUESTS_PER_MINUTE_CRAWL_STATUS = 120;
|
||||
|
||||
|
||||
|
||||
|
||||
|
@ -29,13 +33,20 @@ export const serverRateLimiter = new RateLimiterRedis({
|
|||
duration: 60, // Duration in seconds
|
||||
});
|
||||
|
||||
export const crawlStatusRateLimiter = new RateLimiterRedis({
|
||||
storeClient: redisClient,
|
||||
keyPrefix: "middleware",
|
||||
points: MAX_REQUESTS_PER_MINUTE_CRAWL_STATUS,
|
||||
duration: 60, // Duration in seconds
|
||||
});
|
||||
|
||||
|
||||
export function crawlRateLimit(plan: string){
|
||||
if(plan === "standard"){
|
||||
return new RateLimiterRedis({
|
||||
storeClient: redisClient,
|
||||
keyPrefix: "middleware",
|
||||
points: MAX_CRAWLS_PER_MINUTE_STANDAR,
|
||||
points: MAX_CRAWLS_PER_MINUTE_STANDARD,
|
||||
duration: 60, // Duration in seconds
|
||||
});
|
||||
}else if(plan === "scale"){
|
||||
|
@ -56,10 +67,15 @@ export function crawlRateLimit(plan: string){
|
|||
}
|
||||
|
||||
|
||||
export function getRateLimiter(preview: boolean){
|
||||
if(preview){
|
||||
return previewRateLimiter;
|
||||
}else{
|
||||
return serverRateLimiter;
|
||||
|
||||
|
||||
export function getRateLimiter(mode: RateLimiterMode){
|
||||
switch(mode) {
|
||||
case RateLimiterMode.Preview:
|
||||
return previewRateLimiter;
|
||||
case RateLimiterMode.CrawlStatus:
|
||||
return crawlStatusRateLimiter;
|
||||
default:
|
||||
return serverRateLimiter;
|
||||
}
|
||||
}
|
||||
|
|
|
@ -1,6 +1,56 @@
|
|||
import { createClient } from "@supabase/supabase-js";
|
||||
import { createClient, SupabaseClient } from "@supabase/supabase-js";
|
||||
|
||||
export const supabase_service = createClient<any>(
|
||||
process.env.SUPABASE_URL,
|
||||
process.env.SUPABASE_SERVICE_TOKEN,
|
||||
);
|
||||
// SupabaseService class initializes the Supabase client conditionally based on environment variables.
|
||||
class SupabaseService {
|
||||
private client: SupabaseClient | null = null;
|
||||
|
||||
constructor() {
|
||||
const supabaseUrl = process.env.SUPABASE_URL;
|
||||
const supabaseServiceToken = process.env.SUPABASE_SERVICE_TOKEN;
|
||||
// Only initialize the Supabase client if both URL and Service Token are provided.
|
||||
if (process.env.USE_DB_AUTHENTICATION === "false") {
|
||||
// Warn the user that Authentication is disabled by setting the client to null
|
||||
console.warn(
|
||||
"\x1b[33mAuthentication is disabled. Supabase client will not be initialized.\x1b[0m"
|
||||
);
|
||||
this.client = null;
|
||||
} else if (!supabaseUrl || !supabaseServiceToken) {
|
||||
console.error(
|
||||
"\x1b[31mSupabase environment variables aren't configured correctly. Supabase client will not be initialized. Fix ENV configuration or disable DB authentication with USE_DB_AUTHENTICATION env variable\x1b[0m"
|
||||
);
|
||||
} else {
|
||||
this.client = createClient(supabaseUrl, supabaseServiceToken);
|
||||
}
|
||||
}
|
||||
|
||||
// Provides access to the initialized Supabase client, if available.
|
||||
getClient(): SupabaseClient | null {
|
||||
return this.client;
|
||||
}
|
||||
}
|
||||
|
||||
// Using a Proxy to handle dynamic access to the Supabase client or service methods.
|
||||
// This approach ensures that if Supabase is not configured, any attempt to use it will result in a clear error.
|
||||
export const supabase_service: SupabaseClient = new Proxy(
|
||||
new SupabaseService(),
|
||||
{
|
||||
get: function (target, prop, receiver) {
|
||||
const client = target.getClient();
|
||||
// If the Supabase client is not initialized, intercept property access to provide meaningful error feedback.
|
||||
if (client === null) {
|
||||
console.error(
|
||||
"Attempted to access Supabase client when it's not configured."
|
||||
);
|
||||
return () => {
|
||||
throw new Error("Supabase client is not configured.");
|
||||
};
|
||||
}
|
||||
// Direct access to SupabaseService properties takes precedence.
|
||||
if (prop in target) {
|
||||
return Reflect.get(target, prop, receiver);
|
||||
}
|
||||
// Otherwise, delegate access to the Supabase client.
|
||||
return Reflect.get(client, prop, receiver);
|
||||
},
|
||||
}
|
||||
) as unknown as SupabaseClient;
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
import { supabase_service } from "./supabase";
|
||||
|
||||
export const callWebhook = async (teamId: string, data: any) => {
|
||||
try {
|
||||
const { data: webhooksData, error } = await supabase_service
|
||||
.from('webhooks')
|
||||
.select('url')
|
||||
|
@ -37,5 +38,9 @@ export const callWebhook = async (teamId: string, data: any) => {
|
|||
data: dataToSend,
|
||||
error: data.error || undefined,
|
||||
}),
|
||||
});
|
||||
}
|
||||
});
|
||||
} catch (error) {
|
||||
console.error(`Error sending webhook for team ID: ${teamId}`, error.message);
|
||||
}
|
||||
};
|
||||
|
||||
|
|
|
@ -20,7 +20,39 @@ export interface WebScraperOptions {
|
|||
url: string;
|
||||
mode: "crawl" | "single_urls" | "sitemap";
|
||||
crawlerOptions: any;
|
||||
pageOptions: any;
|
||||
team_id: string;
|
||||
origin?: string;
|
||||
}
|
||||
|
||||
export interface FirecrawlJob {
|
||||
success: boolean;
|
||||
message: string;
|
||||
num_docs: number;
|
||||
docs: any[];
|
||||
time_taken: number;
|
||||
team_id: string;
|
||||
mode: string;
|
||||
url: string;
|
||||
crawlerOptions?: any;
|
||||
pageOptions?: any;
|
||||
origin: string;
|
||||
}
|
||||
|
||||
export enum RateLimiterMode {
|
||||
Crawl = "crawl",
|
||||
CrawlStatus = "crawl-status",
|
||||
Scrape = "scrape",
|
||||
Preview = "preview",
|
||||
Search = "search",
|
||||
|
||||
}
|
||||
|
||||
export interface AuthResponse {
|
||||
success: boolean;
|
||||
team_id?: string;
|
||||
error?: string;
|
||||
status?: number;
|
||||
}
|
||||
|
||||
|
||||
|
|
|
@ -33,15 +33,18 @@ Here's an example of how to use the SDK with error handling:
|
|||
|
||||
// Crawl a website
|
||||
const crawlUrl = 'https://mendable.ai';
|
||||
const crawlParams = {
|
||||
const params = {
|
||||
crawlerOptions: {
|
||||
excludes: ['blog/'],
|
||||
includes: [], // leave empty for all pages
|
||||
limit: 1000,
|
||||
},
|
||||
pageOptions: {
|
||||
onlyMainContent: true
|
||||
}
|
||||
};
|
||||
|
||||
const crawlResult = await app.crawlUrl(crawlUrl, crawlParams);
|
||||
const crawlResult = await app.crawlUrl(crawlUrl, params);
|
||||
console.log(crawlResult);
|
||||
|
||||
} catch (error) {
|
||||
|
@ -83,18 +86,21 @@ To crawl a website with error handling, use the `crawlUrl` method. It takes the
|
|||
async function crawlExample() {
|
||||
try {
|
||||
const crawlUrl = 'https://example.com';
|
||||
const crawlParams = {
|
||||
const params = {
|
||||
crawlerOptions: {
|
||||
excludes: ['blog/'],
|
||||
includes: [], // leave empty for all pages
|
||||
limit: 1000,
|
||||
},
|
||||
pageOptions: {
|
||||
onlyMainContent: true
|
||||
}
|
||||
};
|
||||
const waitUntilDone = true;
|
||||
const timeout = 5;
|
||||
const crawlResult = await app.crawlUrl(
|
||||
crawlUrl,
|
||||
crawlParams,
|
||||
params,
|
||||
waitUntilDone,
|
||||
timeout
|
||||
);
|
||||
|
|
|
@ -10,13 +10,26 @@ var __awaiter = (this && this.__awaiter) || function (thisArg, _arguments, P, ge
|
|||
import axios from 'axios';
|
||||
import dotenv from 'dotenv';
|
||||
dotenv.config();
|
||||
/**
|
||||
* Main class for interacting with the Firecrawl API.
|
||||
*/
|
||||
export default class FirecrawlApp {
|
||||
/**
|
||||
* Initializes a new instance of the FirecrawlApp class.
|
||||
* @param {FirecrawlAppConfig} config - Configuration options for the FirecrawlApp instance.
|
||||
*/
|
||||
constructor({ apiKey = null }) {
|
||||
this.apiKey = apiKey || process.env.FIRECRAWL_API_KEY || '';
|
||||
if (!this.apiKey) {
|
||||
throw new Error('No API key provided');
|
||||
}
|
||||
}
|
||||
/**
|
||||
* Scrapes a URL using the Firecrawl API.
|
||||
* @param {string} url - The URL to scrape.
|
||||
* @param {Params | null} params - Additional parameters for the scrape request.
|
||||
* @returns {Promise<ScrapeResponse>} The response from the scrape operation.
|
||||
*/
|
||||
scrapeUrl(url_1) {
|
||||
return __awaiter(this, arguments, void 0, function* (url, params = null) {
|
||||
const headers = {
|
||||
|
@ -32,7 +45,7 @@ export default class FirecrawlApp {
|
|||
if (response.status === 200) {
|
||||
const responseData = response.data;
|
||||
if (responseData.success) {
|
||||
return responseData.data;
|
||||
return responseData;
|
||||
}
|
||||
else {
|
||||
throw new Error(`Failed to scrape URL. Error: ${responseData.error}`);
|
||||
|
@ -45,8 +58,17 @@ export default class FirecrawlApp {
|
|||
catch (error) {
|
||||
throw new Error(error.message);
|
||||
}
|
||||
return { success: false, error: 'Internal server error.' };
|
||||
});
|
||||
}
|
||||
/**
|
||||
* Initiates a crawl job for a URL using the Firecrawl API.
|
||||
* @param {string} url - The URL to crawl.
|
||||
* @param {Params | null} params - Additional parameters for the crawl request.
|
||||
* @param {boolean} waitUntilDone - Whether to wait for the crawl job to complete.
|
||||
* @param {number} timeout - Timeout in seconds for job status checks.
|
||||
* @returns {Promise<CrawlResponse | any>} The response from the crawl operation.
|
||||
*/
|
||||
crawlUrl(url_1) {
|
||||
return __awaiter(this, arguments, void 0, function* (url, params = null, waitUntilDone = true, timeout = 2) {
|
||||
const headers = this.prepareHeaders();
|
||||
|
@ -62,7 +84,7 @@ export default class FirecrawlApp {
|
|||
return this.monitorJobStatus(jobId, headers, timeout);
|
||||
}
|
||||
else {
|
||||
return { jobId };
|
||||
return { success: true, jobId };
|
||||
}
|
||||
}
|
||||
else {
|
||||
|
@ -73,8 +95,14 @@ export default class FirecrawlApp {
|
|||
console.log(error);
|
||||
throw new Error(error.message);
|
||||
}
|
||||
return { success: false, error: 'Internal server error.' };
|
||||
});
|
||||
}
|
||||
/**
|
||||
* Checks the status of a crawl job using the Firecrawl API.
|
||||
* @param {string} jobId - The job ID of the crawl operation.
|
||||
* @returns {Promise<JobStatusResponse>} The response containing the job status.
|
||||
*/
|
||||
checkCrawlStatus(jobId) {
|
||||
return __awaiter(this, void 0, void 0, function* () {
|
||||
const headers = this.prepareHeaders();
|
||||
|
@ -90,20 +118,45 @@ export default class FirecrawlApp {
|
|||
catch (error) {
|
||||
throw new Error(error.message);
|
||||
}
|
||||
return { success: false, status: 'unknown', error: 'Internal server error.' };
|
||||
});
|
||||
}
|
||||
/**
|
||||
* Prepares the headers for an API request.
|
||||
* @returns {AxiosRequestHeaders} The prepared headers.
|
||||
*/
|
||||
prepareHeaders() {
|
||||
return {
|
||||
'Content-Type': 'application/json',
|
||||
'Authorization': `Bearer ${this.apiKey}`,
|
||||
};
|
||||
}
|
||||
/**
|
||||
* Sends a POST request to the specified URL.
|
||||
* @param {string} url - The URL to send the request to.
|
||||
* @param {Params} data - The data to send in the request.
|
||||
* @param {AxiosRequestHeaders} headers - The headers for the request.
|
||||
* @returns {Promise<AxiosResponse>} The response from the POST request.
|
||||
*/
|
||||
postRequest(url, data, headers) {
|
||||
return axios.post(url, data, { headers });
|
||||
}
|
||||
/**
|
||||
* Sends a GET request to the specified URL.
|
||||
* @param {string} url - The URL to send the request to.
|
||||
* @param {AxiosRequestHeaders} headers - The headers for the request.
|
||||
* @returns {Promise<AxiosResponse>} The response from the GET request.
|
||||
*/
|
||||
getRequest(url, headers) {
|
||||
return axios.get(url, { headers });
|
||||
}
|
||||
/**
|
||||
* Monitors the status of a crawl job until completion or failure.
|
||||
* @param {string} jobId - The job ID of the crawl operation.
|
||||
* @param {AxiosRequestHeaders} headers - The headers for the request.
|
||||
* @param {number} timeout - Timeout in seconds for job status checks.
|
||||
* @returns {Promise<any>} The final job status or data.
|
||||
*/
|
||||
monitorJobStatus(jobId, headers, timeout) {
|
||||
return __awaiter(this, void 0, void 0, function* () {
|
||||
while (true) {
|
||||
|
@ -134,6 +187,11 @@ export default class FirecrawlApp {
|
|||
}
|
||||
});
|
||||
}
|
||||
/**
|
||||
* Handles errors from API responses.
|
||||
* @param {AxiosResponse} response - The response from the API.
|
||||
* @param {string} action - The action being performed when the error occurred.
|
||||
*/
|
||||
handleError(response, action) {
|
||||
if ([402, 409, 500].includes(response.status)) {
|
||||
const errorMessage = response.data.error || 'Unknown error occurred';
|
||||
|
|
4
apps/js-sdk/firecrawl/package-lock.json
generated
4
apps/js-sdk/firecrawl/package-lock.json
generated
|
@ -1,12 +1,12 @@
|
|||
{
|
||||
"name": "@mendable/firecrawl-js",
|
||||
"version": "0.0.7",
|
||||
"version": "0.0.9",
|
||||
"lockfileVersion": 3,
|
||||
"requires": true,
|
||||
"packages": {
|
||||
"": {
|
||||
"name": "@mendable/firecrawl-js",
|
||||
"version": "0.0.7",
|
||||
"version": "0.0.9",
|
||||
"license": "MIT",
|
||||
"dependencies": {
|
||||
"axios": "^1.6.8",
|
||||
|
|
|
@ -1,10 +1,13 @@
|
|||
{
|
||||
"name": "@mendable/firecrawl-js",
|
||||
"version": "0.0.9",
|
||||
"version": "0.0.13",
|
||||
"description": "JavaScript SDK for Firecrawl API",
|
||||
"main": "build/index.js",
|
||||
"types": "types/index.d.ts",
|
||||
"type": "module",
|
||||
"scripts": {
|
||||
"build": "tsc",
|
||||
"publish":"npm run build && npm publish --access public",
|
||||
"test": "echo \"Error: no test specified\" && exit 1"
|
||||
},
|
||||
"repository": {
|
||||
|
|
|
@ -2,17 +2,60 @@ import axios, { AxiosResponse, AxiosRequestHeaders } from 'axios';
|
|||
import dotenv from 'dotenv';
|
||||
dotenv.config();
|
||||
|
||||
interface FirecrawlAppConfig {
|
||||
/**
|
||||
* Configuration interface for FirecrawlApp.
|
||||
*/
|
||||
export interface FirecrawlAppConfig {
|
||||
apiKey?: string | null;
|
||||
}
|
||||
|
||||
interface Params {
|
||||
/**
|
||||
* Generic parameter interface.
|
||||
*/
|
||||
export interface Params {
|
||||
[key: string]: any;
|
||||
}
|
||||
|
||||
/**
|
||||
* Response interface for scraping operations.
|
||||
*/
|
||||
export interface ScrapeResponse {
|
||||
success: boolean;
|
||||
data?: any;
|
||||
error?: string;
|
||||
}
|
||||
|
||||
/**
|
||||
* Response interface for crawling operations.
|
||||
*/
|
||||
export interface CrawlResponse {
|
||||
success: boolean;
|
||||
jobId?: string;
|
||||
data?: any;
|
||||
error?: string;
|
||||
}
|
||||
|
||||
/**
|
||||
* Response interface for job status checks.
|
||||
*/
|
||||
export interface JobStatusResponse {
|
||||
success: boolean;
|
||||
status: string;
|
||||
jobId?: string;
|
||||
data?: any;
|
||||
error?: string;
|
||||
}
|
||||
|
||||
/**
|
||||
* Main class for interacting with the Firecrawl API.
|
||||
*/
|
||||
export default class FirecrawlApp {
|
||||
private apiKey: string;
|
||||
|
||||
/**
|
||||
* Initializes a new instance of the FirecrawlApp class.
|
||||
* @param {FirecrawlAppConfig} config - Configuration options for the FirecrawlApp instance.
|
||||
*/
|
||||
constructor({ apiKey = null }: FirecrawlAppConfig) {
|
||||
this.apiKey = apiKey || process.env.FIRECRAWL_API_KEY || '';
|
||||
if (!this.apiKey) {
|
||||
|
@ -20,7 +63,13 @@ export default class FirecrawlApp {
|
|||
}
|
||||
}
|
||||
|
||||
async scrapeUrl(url: string, params: Params | null = null): Promise<any> {
|
||||
/**
|
||||
* Scrapes a URL using the Firecrawl API.
|
||||
* @param {string} url - The URL to scrape.
|
||||
* @param {Params | null} params - Additional parameters for the scrape request.
|
||||
* @returns {Promise<ScrapeResponse>} The response from the scrape operation.
|
||||
*/
|
||||
async scrapeUrl(url: string, params: Params | null = null): Promise<ScrapeResponse> {
|
||||
const headers: AxiosRequestHeaders = {
|
||||
'Content-Type': 'application/json',
|
||||
'Authorization': `Bearer ${this.apiKey}`,
|
||||
|
@ -34,7 +83,7 @@ export default class FirecrawlApp {
|
|||
if (response.status === 200) {
|
||||
const responseData = response.data;
|
||||
if (responseData.success) {
|
||||
return responseData.data;
|
||||
return responseData;
|
||||
} else {
|
||||
throw new Error(`Failed to scrape URL. Error: ${responseData.error}`);
|
||||
}
|
||||
|
@ -44,9 +93,18 @@ export default class FirecrawlApp {
|
|||
} catch (error: any) {
|
||||
throw new Error(error.message);
|
||||
}
|
||||
return { success: false, error: 'Internal server error.' };
|
||||
}
|
||||
|
||||
async crawlUrl(url: string, params: Params | null = null, waitUntilDone: boolean = true, timeout: number = 2): Promise<any> {
|
||||
/**
|
||||
* Initiates a crawl job for a URL using the Firecrawl API.
|
||||
* @param {string} url - The URL to crawl.
|
||||
* @param {Params | null} params - Additional parameters for the crawl request.
|
||||
* @param {boolean} waitUntilDone - Whether to wait for the crawl job to complete.
|
||||
* @param {number} timeout - Timeout in seconds for job status checks.
|
||||
* @returns {Promise<CrawlResponse | any>} The response from the crawl operation.
|
||||
*/
|
||||
async crawlUrl(url: string, params: Params | null = null, waitUntilDone: boolean = true, timeout: number = 2): Promise<CrawlResponse | any> {
|
||||
const headers = this.prepareHeaders();
|
||||
let jsonData: Params = { url };
|
||||
if (params) {
|
||||
|
@ -59,7 +117,7 @@ export default class FirecrawlApp {
|
|||
if (waitUntilDone) {
|
||||
return this.monitorJobStatus(jobId, headers, timeout);
|
||||
} else {
|
||||
return { jobId };
|
||||
return { success: true, jobId };
|
||||
}
|
||||
} else {
|
||||
this.handleError(response, 'start crawl job');
|
||||
|
@ -68,9 +126,15 @@ export default class FirecrawlApp {
|
|||
console.log(error)
|
||||
throw new Error(error.message);
|
||||
}
|
||||
return { success: false, error: 'Internal server error.' };
|
||||
}
|
||||
|
||||
async checkCrawlStatus(jobId: string): Promise<any> {
|
||||
/**
|
||||
* Checks the status of a crawl job using the Firecrawl API.
|
||||
* @param {string} jobId - The job ID of the crawl operation.
|
||||
* @returns {Promise<JobStatusResponse>} The response containing the job status.
|
||||
*/
|
||||
async checkCrawlStatus(jobId: string): Promise<JobStatusResponse> {
|
||||
const headers: AxiosRequestHeaders = this.prepareHeaders();
|
||||
try {
|
||||
const response: AxiosResponse = await this.getRequest(`https://api.firecrawl.dev/v0/crawl/status/${jobId}`, headers);
|
||||
|
@ -82,8 +146,13 @@ export default class FirecrawlApp {
|
|||
} catch (error: any) {
|
||||
throw new Error(error.message);
|
||||
}
|
||||
return { success: false, status: 'unknown', error: 'Internal server error.' };
|
||||
}
|
||||
|
||||
/**
|
||||
* Prepares the headers for an API request.
|
||||
* @returns {AxiosRequestHeaders} The prepared headers.
|
||||
*/
|
||||
prepareHeaders(): AxiosRequestHeaders {
|
||||
return {
|
||||
'Content-Type': 'application/json',
|
||||
|
@ -91,14 +160,34 @@ export default class FirecrawlApp {
|
|||
} as AxiosRequestHeaders;
|
||||
}
|
||||
|
||||
/**
|
||||
* Sends a POST request to the specified URL.
|
||||
* @param {string} url - The URL to send the request to.
|
||||
* @param {Params} data - The data to send in the request.
|
||||
* @param {AxiosRequestHeaders} headers - The headers for the request.
|
||||
* @returns {Promise<AxiosResponse>} The response from the POST request.
|
||||
*/
|
||||
postRequest(url: string, data: Params, headers: AxiosRequestHeaders): Promise<AxiosResponse> {
|
||||
return axios.post(url, data, { headers });
|
||||
}
|
||||
|
||||
/**
|
||||
* Sends a GET request to the specified URL.
|
||||
* @param {string} url - The URL to send the request to.
|
||||
* @param {AxiosRequestHeaders} headers - The headers for the request.
|
||||
* @returns {Promise<AxiosResponse>} The response from the GET request.
|
||||
*/
|
||||
getRequest(url: string, headers: AxiosRequestHeaders): Promise<AxiosResponse> {
|
||||
return axios.get(url, { headers });
|
||||
}
|
||||
|
||||
/**
|
||||
* Monitors the status of a crawl job until completion or failure.
|
||||
* @param {string} jobId - The job ID of the crawl operation.
|
||||
* @param {AxiosRequestHeaders} headers - The headers for the request.
|
||||
* @param {number} timeout - Timeout in seconds for job status checks.
|
||||
* @returns {Promise<any>} The final job status or data.
|
||||
*/
|
||||
async monitorJobStatus(jobId: string, headers: AxiosRequestHeaders, timeout: number): Promise<any> {
|
||||
while (true) {
|
||||
const statusResponse: AxiosResponse = await this.getRequest(`https://api.firecrawl.dev/v0/crawl/status/${jobId}`, headers);
|
||||
|
@ -124,6 +213,11 @@ export default class FirecrawlApp {
|
|||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Handles errors from API responses.
|
||||
* @param {AxiosResponse} response - The response from the API.
|
||||
* @param {string} action - The action being performed when the error occurred.
|
||||
*/
|
||||
handleError(response: AxiosResponse, action: string): void {
|
||||
if ([402, 409, 500].includes(response.status)) {
|
||||
const errorMessage: string = response.data.error || 'Unknown error occurred';
|
||||
|
|
|
@ -49,7 +49,7 @@
|
|||
// "maxNodeModuleJsDepth": 1, /* Specify the maximum folder depth used for checking JavaScript files from 'node_modules'. Only applicable with 'allowJs'. */
|
||||
|
||||
/* Emit */
|
||||
// "declaration": true, /* Generate .d.ts files from TypeScript and JavaScript files in your project. */
|
||||
"declaration": true, /* Generate .d.ts files from TypeScript and JavaScript files in your project. */
|
||||
// "declarationMap": true, /* Create sourcemaps for d.ts files. */
|
||||
// "emitDeclarationOnly": true, /* Only output d.ts files and not JavaScript files. */
|
||||
// "sourceMap": true, /* Create source map files for emitted JavaScript files. */
|
||||
|
@ -70,7 +70,7 @@
|
|||
// "noEmitHelpers": true, /* Disable generating custom helper functions like '__extends' in compiled output. */
|
||||
// "noEmitOnError": true, /* Disable emitting files if any type checking errors are reported. */
|
||||
// "preserveConstEnums": true, /* Disable erasing 'const enum' declarations in generated code. */
|
||||
// "declarationDir": "./", /* Specify the output directory for generated declaration files. */
|
||||
"declarationDir": "./types", /* Specify the output directory for generated declaration files. */
|
||||
// "preserveValueImports": true, /* Preserve unused imported values in the JavaScript output that would otherwise be removed. */
|
||||
|
||||
/* Interop Constraints */
|
||||
|
@ -105,5 +105,7 @@
|
|||
/* Completeness */
|
||||
// "skipDefaultLibCheck": true, /* Skip type checking .d.ts files that are included with TypeScript. */
|
||||
"skipLibCheck": true /* Skip type checking all .d.ts files. */
|
||||
}
|
||||
},
|
||||
"include": ["src/**/*"],
|
||||
"exclude": ["node_modules", "dist", "**/__tests__/*"]
|
||||
}
|
||||
|
|
107
apps/js-sdk/firecrawl/types/index.d.ts
vendored
Normal file
107
apps/js-sdk/firecrawl/types/index.d.ts
vendored
Normal file
|
@ -0,0 +1,107 @@
|
|||
import { AxiosResponse, AxiosRequestHeaders } from 'axios';
|
||||
/**
|
||||
* Configuration interface for FirecrawlApp.
|
||||
*/
|
||||
export interface FirecrawlAppConfig {
|
||||
apiKey?: string | null;
|
||||
}
|
||||
/**
|
||||
* Generic parameter interface.
|
||||
*/
|
||||
export interface Params {
|
||||
[key: string]: any;
|
||||
}
|
||||
/**
|
||||
* Response interface for scraping operations.
|
||||
*/
|
||||
export interface ScrapeResponse {
|
||||
success: boolean;
|
||||
data?: any;
|
||||
error?: string;
|
||||
}
|
||||
/**
|
||||
* Response interface for crawling operations.
|
||||
*/
|
||||
export interface CrawlResponse {
|
||||
success: boolean;
|
||||
jobId?: string;
|
||||
data?: any;
|
||||
error?: string;
|
||||
}
|
||||
/**
|
||||
* Response interface for job status checks.
|
||||
*/
|
||||
export interface JobStatusResponse {
|
||||
success: boolean;
|
||||
status: string;
|
||||
jobId?: string;
|
||||
data?: any;
|
||||
error?: string;
|
||||
}
|
||||
/**
|
||||
* Main class for interacting with the Firecrawl API.
|
||||
*/
|
||||
export default class FirecrawlApp {
|
||||
private apiKey;
|
||||
/**
|
||||
* Initializes a new instance of the FirecrawlApp class.
|
||||
* @param {FirecrawlAppConfig} config - Configuration options for the FirecrawlApp instance.
|
||||
*/
|
||||
constructor({ apiKey }: FirecrawlAppConfig);
|
||||
/**
|
||||
* Scrapes a URL using the Firecrawl API.
|
||||
* @param {string} url - The URL to scrape.
|
||||
* @param {Params | null} params - Additional parameters for the scrape request.
|
||||
* @returns {Promise<ScrapeResponse>} The response from the scrape operation.
|
||||
*/
|
||||
scrapeUrl(url: string, params?: Params | null): Promise<ScrapeResponse>;
|
||||
/**
|
||||
* Initiates a crawl job for a URL using the Firecrawl API.
|
||||
* @param {string} url - The URL to crawl.
|
||||
* @param {Params | null} params - Additional parameters for the crawl request.
|
||||
* @param {boolean} waitUntilDone - Whether to wait for the crawl job to complete.
|
||||
* @param {number} timeout - Timeout in seconds for job status checks.
|
||||
* @returns {Promise<CrawlResponse | any>} The response from the crawl operation.
|
||||
*/
|
||||
crawlUrl(url: string, params?: Params | null, waitUntilDone?: boolean, timeout?: number): Promise<CrawlResponse | any>;
|
||||
/**
|
||||
* Checks the status of a crawl job using the Firecrawl API.
|
||||
* @param {string} jobId - The job ID of the crawl operation.
|
||||
* @returns {Promise<JobStatusResponse>} The response containing the job status.
|
||||
*/
|
||||
checkCrawlStatus(jobId: string): Promise<JobStatusResponse>;
|
||||
/**
|
||||
* Prepares the headers for an API request.
|
||||
* @returns {AxiosRequestHeaders} The prepared headers.
|
||||
*/
|
||||
prepareHeaders(): AxiosRequestHeaders;
|
||||
/**
|
||||
* Sends a POST request to the specified URL.
|
||||
* @param {string} url - The URL to send the request to.
|
||||
* @param {Params} data - The data to send in the request.
|
||||
* @param {AxiosRequestHeaders} headers - The headers for the request.
|
||||
* @returns {Promise<AxiosResponse>} The response from the POST request.
|
||||
*/
|
||||
postRequest(url: string, data: Params, headers: AxiosRequestHeaders): Promise<AxiosResponse>;
|
||||
/**
|
||||
* Sends a GET request to the specified URL.
|
||||
* @param {string} url - The URL to send the request to.
|
||||
* @param {AxiosRequestHeaders} headers - The headers for the request.
|
||||
* @returns {Promise<AxiosResponse>} The response from the GET request.
|
||||
*/
|
||||
getRequest(url: string, headers: AxiosRequestHeaders): Promise<AxiosResponse>;
|
||||
/**
|
||||
* Monitors the status of a crawl job until completion or failure.
|
||||
* @param {string} jobId - The job ID of the crawl operation.
|
||||
* @param {AxiosRequestHeaders} headers - The headers for the request.
|
||||
* @param {number} timeout - Timeout in seconds for job status checks.
|
||||
* @returns {Promise<any>} The final job status or data.
|
||||
*/
|
||||
monitorJobStatus(jobId: string, headers: AxiosRequestHeaders, timeout: number): Promise<any>;
|
||||
/**
|
||||
* Handles errors from API responses.
|
||||
* @param {AxiosResponse} response - The response from the API.
|
||||
* @param {string} action - The action being performed when the error occurred.
|
||||
*/
|
||||
handleError(response: AxiosResponse, action: string): void;
|
||||
}
|
BIN
apps/playwright-service/.DS_Store
vendored
BIN
apps/playwright-service/.DS_Store
vendored
Binary file not shown.
2
apps/playwright-service/.gitignore
vendored
2
apps/playwright-service/.gitignore
vendored
|
@ -145,7 +145,7 @@ dmypy.json
|
|||
cython_debug/
|
||||
|
||||
# PyCharm
|
||||
# JetBrains specific template is maintainted in a separate JetBrains.gitignore that can
|
||||
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
|
||||
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
|
||||
# and can be added to the global gitignore or merged into this file. For a more nuclear
|
||||
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
|
||||
|
|
|
@ -21,6 +21,7 @@ async def root(body: UrlModel): # Using Pydantic model for request body
|
|||
await page.goto(body.url) # Adjusted to use the url from the request body model
|
||||
page_content = await page.content() # Get the HTML content of the page
|
||||
|
||||
await context.close()
|
||||
await browser.close()
|
||||
|
||||
json_compatible_item_data = {"content": page_content}
|
||||
|
|
|
@ -30,14 +30,12 @@ scraped_data = app.scrape_url(url)
|
|||
|
||||
# Crawl a website
|
||||
crawl_url = 'https://mendable.ai'
|
||||
crawl_params = {
|
||||
'crawlerOptions': {
|
||||
'excludes': ['blog/*'],
|
||||
'includes': [], # leave empty for all pages
|
||||
'limit': 1000,
|
||||
params = {
|
||||
'pageOptions': {
|
||||
'onlyMainContent': True
|
||||
}
|
||||
}
|
||||
crawl_result = app.crawl_url(crawl_url, params=crawl_params)
|
||||
crawl_result = app.crawl_url(crawl_url, params=params)
|
||||
```
|
||||
|
||||
### Scraping a URL
|
||||
|
@ -57,14 +55,17 @@ The `wait_until_done` parameter determines whether the method should wait for th
|
|||
|
||||
```python
|
||||
crawl_url = 'https://example.com'
|
||||
crawl_params = {
|
||||
params = {
|
||||
'crawlerOptions': {
|
||||
'excludes': ['blog/*'],
|
||||
'includes': [], # leave empty for all pages
|
||||
'limit': 1000,
|
||||
},
|
||||
'pageOptions': {
|
||||
'onlyMainContent': True
|
||||
}
|
||||
}
|
||||
crawl_result = app.crawl_url(crawl_url, params=crawl_params, wait_until_done=True, timeout=5)
|
||||
crawl_result = app.crawl_url(crawl_url, params=params, wait_until_done=True, timeout=5)
|
||||
```
|
||||
|
||||
If `wait_until_done` is set to `True`, the `crawl_url` method will return the crawl result once the job is completed. If the job fails or is stopped, an exception will be raised.
|
||||
|
|
92
tutorials/data-extraction-using-llms.mdx
Normal file
92
tutorials/data-extraction-using-llms.mdx
Normal file
|
@ -0,0 +1,92 @@
|
|||
# Extract website data using LLMs
|
||||
|
||||
Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code. With Groq fast inference speeds and firecrawl parellization, you can extract data from web pages *super* fast.
|
||||
|
||||
## Setup
|
||||
|
||||
Install our python dependencies, including groq and firecrawl-py.
|
||||
|
||||
```bash
|
||||
pip install groq firecrawl-py
|
||||
```
|
||||
|
||||
## Getting your Groq and Firecrawl API Keys
|
||||
|
||||
To use Groq and Firecrawl, you will need to get your API keys. You can get your Groq API key from [here](https://groq.com) and your Firecrawl API key from [here](https://firecrawl.dev).
|
||||
|
||||
## Load website with Firecrawl
|
||||
|
||||
To be able to get all the data from a website page and make sure it is in the cleanest format, we will use [FireCrawl](https://firecrawl.dev). It handles by-passing JS-blocked websites, extracting the main content, and outputting in a LLM-readable format for increased accuracy.
|
||||
|
||||
Here is how we will scrape a website url using Firecrawl. We will also set a `pageOptions` for only extracting the main content (`onlyMainContent: True`) of the website page - excluding the navs, footers, etc.
|
||||
|
||||
```python
|
||||
from firecrawl import FirecrawlApp # Importing the FireCrawlLoader
|
||||
|
||||
url = "https://about.fb.com/news/2024/04/introducing-our-open-mixed-reality-ecosystem/"
|
||||
|
||||
firecrawl = FirecrawlApp(
|
||||
api_key="fc-YOUR_FIRECRAWL_API_KEY",
|
||||
)
|
||||
page_content = firecrawl.scrape_url(url=url, # Target URL to crawl
|
||||
params={
|
||||
"pageOptions":{
|
||||
"onlyMainContent": True # Ignore navs, footers, etc.
|
||||
}
|
||||
})
|
||||
print(page_content)
|
||||
```
|
||||
|
||||
Perfect, now we have clean data from the website - ready to be fed to the LLM for data extraction.
|
||||
|
||||
## Extraction and Generation
|
||||
|
||||
Now that we have the website data, let's use Groq to pull out the information we need. We'll use Groq Llama 3 model in JSON mode and pick out certain fields from the page content.
|
||||
|
||||
We are using LLama 3 8b model for this example. Feel free to use bigger models for improved results.
|
||||
|
||||
```python
|
||||
import json
|
||||
from groq import Groq
|
||||
|
||||
client = Groq(
|
||||
api_key="gsk_YOUR_GROQ_API_KEY", # Note: Replace 'API_KEY' with your actual Groq API key
|
||||
)
|
||||
|
||||
# Here we define the fields we want to extract from the page content
|
||||
extract = ["summary","date","companies_building_with_quest","title_of_the_article","people_testimonials"]
|
||||
|
||||
completion = client.chat.completions.create(
|
||||
model="llama3-8b-8192",
|
||||
messages=[
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a legal advisor who extracts information from documents in JSON."
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
# Here we pass the page content and the fields we want to extract
|
||||
"content": f"Extract the following information from the provided documentation:\Page content:\n\n{page_content}\n\nInformation to extract: {extract}"
|
||||
}
|
||||
],
|
||||
temperature=0,
|
||||
max_tokens=1024,
|
||||
top_p=1,
|
||||
stream=False,
|
||||
stop=None,
|
||||
# We set the response format to JSON object
|
||||
response_format={"type": "json_object"}
|
||||
)
|
||||
|
||||
|
||||
# Pretty print the JSON response
|
||||
dataExtracted = json.dumps(str(completion.choices[0].message.content), indent=4)
|
||||
|
||||
print(dataExtracted)
|
||||
```
|
||||
|
||||
## And Voila!
|
||||
|
||||
You have now built a data extraction bot using Groq and Firecrawl. You can now use this bot to extract structured data from any website.
|
||||
|
||||
If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev).
|
91
tutorials/rag-llama3.mdx
Normal file
91
tutorials/rag-llama3.mdx
Normal file
|
@ -0,0 +1,91 @@
|
|||
---
|
||||
title: "Build a 'Chat with website' using Groq Llama 3"
|
||||
description: "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot."
|
||||
---
|
||||
|
||||
## Setup
|
||||
|
||||
Install our python dependencies, including langchain, groq, faiss, ollama, and firecrawl-py.
|
||||
|
||||
```bash
|
||||
pip install --upgrade --quiet langchain langchain-community groq faiss-cpu ollama firecrawl-py
|
||||
```
|
||||
|
||||
We will be using Ollama for the embeddings, you can download Ollama [here](https://ollama.com/). But feel free to use any other embeddings you prefer.
|
||||
|
||||
## Load website with Firecrawl
|
||||
|
||||
To be able to get all the data from a website and make sure it is in the cleanest format, we will use FireCrawl. Firecrawl integrates very easily with Langchain as a document loader.
|
||||
|
||||
Here is how you can load a website with FireCrawl:
|
||||
|
||||
```python
|
||||
from langchain_community.document_loaders import FireCrawlLoader # Importing the FireCrawlLoader
|
||||
|
||||
url = "https://firecrawl.dev"
|
||||
loader = FireCrawlLoader(
|
||||
api_key="fc-YOUR_API_KEY", # Note: Replace 'YOUR_API_KEY' with your actual FireCrawl API key
|
||||
url=url, # Target URL to crawl
|
||||
mode="crawl" # Mode set to 'crawl' to crawl all accessible subpages
|
||||
)
|
||||
docs = loader.load()
|
||||
```
|
||||
|
||||
## Setup the Vectorstore
|
||||
|
||||
Next, we will setup the vectorstore. The vectorstore is a data structure that allows us to store and query embeddings. We will use the Ollama embeddings and the FAISS vectorstore.
|
||||
We split the documents into chunks of 1000 characters each, with a 200 character overlap. This is to ensure that the chunks are not too small and not too big - and that it can fit into the LLM model when we query it.
|
||||
|
||||
```python
|
||||
from langchain_community.embeddings import OllamaEmbeddings
|
||||
from langchain_text_splitters import RecursiveCharacterTextSplitter
|
||||
from langchain_community.vectorstores import FAISS
|
||||
|
||||
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
|
||||
splits = text_splitter.split_documents(docs)
|
||||
vectorstore = FAISS.from_documents(documents=splits, embedding=OllamaEmbeddings())
|
||||
```
|
||||
|
||||
## Retrieval and Generation
|
||||
|
||||
Now that our documents are loaded and the vectorstore is setup, we can, based on user's question, do a similarity search to retrieve the most relevant documents. That way we can use these documents to be fed to the LLM model.
|
||||
|
||||
|
||||
```python
|
||||
question = "What is firecrawl?"
|
||||
docs = vectorstore.similarity_search(query=question)
|
||||
```
|
||||
|
||||
## Generation
|
||||
Last but not least, you can use the Groq to generate a response to a question based on the documents we have loaded.
|
||||
|
||||
```python
|
||||
from groq import Groq
|
||||
|
||||
client = Groq(
|
||||
api_key="YOUR_GROQ_API_KEY",
|
||||
)
|
||||
|
||||
completion = client.chat.completions.create(
|
||||
model="llama3-8b-8192",
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": f"You are a friendly assistant. Your job is to answer the users question based on the documentation provided below:\nDocs:\n\n{docs}\n\nQuestion: {question}"
|
||||
}
|
||||
],
|
||||
temperature=1,
|
||||
max_tokens=1024,
|
||||
top_p=1,
|
||||
stream=False,
|
||||
stop=None,
|
||||
)
|
||||
|
||||
print(completion.choices[0].message)
|
||||
```
|
||||
|
||||
## And Voila!
|
||||
|
||||
You have now built a 'Chat with your website' bot using Llama 3, Groq Llama 3, Langchain, and Firecrawl. You can now use this bot to answer questions based on the documentation of your website.
|
||||
|
||||
If you have any questions or need help, feel free to reach out to us at [Firecrawl](https://firecrawl.dev).
|
Loading…
Reference in New Issue
Block a user