mirror of
https://github.com/mendableai/firecrawl.git
synced 2024-11-15 19:22:19 +08:00
Playwright Scrape API
This is a simple web scraping service built with Express and Playwright.
Features
- Scrapes HTML content from specified URLs.
- Blocks requests to known ad-serving domains.
- Blocks media files to reduce bandwidth usage.
- Uses random user-agent strings to avoid detection.
- Waits after page load, and optionally checks for a given selector, to ensure the page is fully rendered.
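The blocking and user-agent features above can be sketched as a few pure helper functions. This is a minimal illustration, not the service's actual code: the domain list, the extension list, and the names `shouldBlock` and `pickUserAgent` are all assumptions made for this example.

```typescript
// Hypothetical blocklist; the service's real list is not shown in this README.
const AD_DOMAINS: string[] = ["doubleclick.net", "googlesyndication.com"];

// File extensions treated as media here are an assumption for illustration.
const MEDIA_EXTENSIONS: string[] = [".png", ".jpg", ".jpeg", ".gif", ".mp3", ".mp4"];

// True when the request's hostname is a listed ad domain or one of its subdomains.
function isAdRequest(requestUrl: string): boolean {
  const hostname = new URL(requestUrl).hostname;
  return AD_DOMAINS.some((d) => hostname === d || hostname.endsWith("." + d));
}

// True when the request's path ends in one of the listed media extensions.
function isMediaRequest(requestUrl: string): boolean {
  const pathname = new URL(requestUrl).pathname.toLowerCase();
  return MEDIA_EXTENSIONS.some((ext) => pathname.endsWith(ext));
}

// Decide whether a request should be aborted; unparsable URLs are let through.
function shouldBlock(requestUrl: string): boolean {
  try {
    return isAdRequest(requestUrl) || isMediaRequest(requestUrl);
  } catch {
    return false;
  }
}

// Pick one user-agent string at random; `random` is injectable for testing.
function pickUserAgent(agents: string[], random: () => number = Math.random): string {
  if (agents.length === 0) {
    throw new Error("No user agents configured");
  }
  return agents[Math.floor(random() * agents.length)];
}
```

With Playwright, a predicate like this could be wired in through `page.route("**/*", ...)`, calling `route.abort()` when `shouldBlock(route.request().url())` is true and `route.continue()` otherwise.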
Install
npm install
npx playwright install
RUN
npm run build
npm start
OR
npm run dev
USE
curl -X POST http://localhost:3003/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "wait_after_load": 1000,
    "timeout": 15000,
    "headers": {
      "Custom-Header": "value"
    },
    "check_selector": "#content"
  }'
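The same request can be issued from TypeScript. The sketch below assumes Node 18+ (for the global `fetch`) and that the service replies with JSON; the response shape is not documented here, so it is returned as `unknown`. The `ScrapeOptions` interface and both function names are hypothetical, with field names taken from the curl example above.

```typescript
// Request fields mirror the curl example; the numeric values are presumably milliseconds.
interface ScrapeOptions {
  url: string;
  wait_after_load?: number;
  timeout?: number;
  headers?: Record<string, string>;
  check_selector?: string;
}

interface ScrapeRequestInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Build the POST request (method, headers, JSON body) without sending it.
function buildScrapeRequest(options: ScrapeOptions): ScrapeRequestInit {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(options),
  };
}

// Send the request to the given endpoint and return the parsed JSON reply.
async function scrape(endpoint: string, options: ScrapeOptions): Promise<unknown> {
  const response = await fetch(endpoint, buildScrapeRequest(options));
  if (!response.ok) {
    throw new Error(`Scrape request failed with status ${response.status}`);
  }
  return response.json();
}
```

A call would look like `await scrape(endpoint, { url: "https://example.com", check_selector: "#content" })`, where `endpoint` is the service's `/scrape` URL on whichever port it is running.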
USING WITH FIRECRAWL
Add the following line to /apps/api/.env so that the API uses this Playwright microservice for scraping operations:
PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3003/scrape