mirror of
https://github.com/mendableai/firecrawl.git
synced 2024-11-16 11:42:24 +08:00
48 lines
945 B
Markdown
# Playwright Scrape API
This is a simple web scraping service built with Express and Playwright.
## Features
- Scrapes HTML content from specified URLs.
- Blocks requests to known ad-serving domains.
- Blocks media files to reduce bandwidth usage.
- Uses random user-agent strings to reduce the chance of bot detection.
- Applies a wait strategy to ensure the page is fully rendered before the HTML is returned.
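
The two blocking features above can be sketched as a single URL filter. This is only an illustration, not the service's actual code: the domain and extension lists, and the helper names `hostnameOf`, `pathOf`, and `shouldBlock`, are hypothetical.

```typescript
// Hypothetical blocklists for illustration; the service's real lists are not shown in this README.
const AD_DOMAINS: readonly string[] = ["doubleclick.net", "googlesyndication.com"];
const MEDIA_EXTENSIONS: readonly string[] = [".png", ".jpg", ".jpeg", ".gif", ".mp3", ".mp4", ".webm"];

// Extract the lowercase hostname from an absolute URL.
// A simple regex is used so the sketch needs no DOM or Node typings;
// it does not handle URLs that embed user credentials.
function hostnameOf(url: string): string {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^/?#:]+)/i.exec(url);
  return (match?.[1] ?? "").toLowerCase();
}

// Return the URL with its query string and fragment removed, lowercased.
function pathOf(url: string): string {
  return (url.split(/[?#]/)[0] ?? "").toLowerCase();
}

// Decide whether a request should be aborted: either the host is an
// ad domain (or a subdomain of one), or the path ends in a media extension.
function shouldBlock(url: string): boolean {
  const host = hostnameOf(url);
  const isAd = AD_DOMAINS.some((d) => host === d || host.endsWith("." + d));
  const path = pathOf(url);
  const isMedia = MEDIA_EXTENSIONS.some((ext) => path.endsWith(ext));
  return isAd || isMedia;
}

// With Playwright, a filter like this could be wired in through request interception:
//   await page.route("**/*", (route) =>
//     shouldBlock(route.request().url()) ? route.abort() : route.continue());
```

Keeping the decision in a pure function makes the blocking rules easy to unit test without launching a browser.
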
## Install
```bash
npm install
npx playwright install
```
## RUN
```bash
npm run build
npm start
```
OR
```bash
npm run dev
```
## USE
```bash
curl -X POST http://localhost:3000/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "wait_after_load": 1000,
    "timeout": 15000,
    "headers": {
      "Custom-Header": "value"
    },
    "check_selector": "#content"
  }'
```
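
The same request can be built programmatically. The sketch below mirrors the fields from the `curl` example; the `ScrapeRequest` and `HttpInit` types and the `buildScrapeInit` helper are hypothetical names, and treating only `url` as required is an assumption, since this README does not document which fields are optional.

```typescript
// Request fields as they appear in the curl example.
// Assumption: only `url` is required; the README does not say.
interface ScrapeRequest {
  url: string;
  wait_after_load?: number; // value from the example: 1000 (presumably milliseconds)
  timeout?: number;         // value from the example: 15000 (presumably milliseconds)
  headers?: Record<string, string>;
  check_selector?: string;
}

// Minimal request description, shaped so it can be passed as the
// second argument to fetch().
interface HttpInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Validate the target URL and serialize the request as a JSON POST.
function buildScrapeInit(request: ScrapeRequest): HttpInit {
  if (!/^https?:\/\//i.test(request.url)) {
    throw new Error("url must start with http:// or https://");
  }
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(request),
  };
}

// Usage against a locally running service (endpoint taken from the curl example):
//   const res = await fetch("http://localhost:3000/scrape",
//     buildScrapeInit({ url: "https://example.com", check_selector: "#content" }));
```

The response format is not described in this README, so the sketch stops at building the request.
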
## USING WITH FIRECRAWL
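To have the Firecrawl API use this Playwright microservice for scraping operations, add `PLAYWRIGHT_MICROSERVICE_URL=http://localhost:3003/scrape` to `/apps/api/.env`. The port in this URL must match the port the service actually listens on; note that the `curl` example in the USE section targets port 3000.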