Commit Graph

10 Commits

Author SHA1 Message Date
Gergő Móricz
8d467c8ca7 WebScraper refactor into scrapeURL (#714)
* feat: use strictNullChecking

* feat: switch logger to Winston

* feat(scrapeURL): first batch

* fix(scrapeURL): error swallow

* fix(scrapeURL): add timeout to EngineResultsTracker

* fix(scrapeURL): report unexpected error to sentry

* chore: remove unused modules

* feat(transformers/coerce): warn when a format's response is missing

* feat(scrapeURL): feature flag priorities, engine quality sorting, PDF and DOCX support

* (add note)

* feat(scrapeURL): wip readme

* feat(scrapeURL): LLM extract

* feat(scrapeURL): better warnings

* fix(scrapeURL/engines/fire-engine;playwright): fix screenshot

* feat(scrapeURL): add forceEngine internal option

* feat(scrapeURL/engines): scrapingbee

* feat(scrapeURL/transformers): uploadScreenshot

* feat(scrapeURL): more intense tests

* bunch of stuff

* get rid of WebScraper (mostly)

* adapt batch scrape

* add staging deploy workflow

* fix yaml

* fix logger issues

* fix v1 test schema

* feat(scrapeURL/fire-engine/chrome-cdp): remove wait inserts on actions

* scrapeURL: v0 backwards compat

* logger fixes

* feat(scrapeurl): v0 returnOnlyUrls support

* fix(scrapeURL/v0): URL leniency

* fix(batch-scrape): ts non-nullable

* fix(scrapeURL/fire-engine/chromecdp): fix wait action

* fix(logger): remove error debug key

* feat(requests.http): use dotenv expression

* fix(scrapeURL/extractMetadata): extract custom metadata

* fix crawl option conversion

* feat(scrapeURL): Add retry logic to robustFetch

* fix(scrapeURL): crawl stuff

* fix(scrapeURL): LLM extract

* fix(scrapeURL/v0): search fix

* fix(tests/v0): grant larger response size to v0 crawl status

* feat(scrapeURL): basic fetch engine

* feat(scrapeURL): playwright engine

* feat(scrapeURL): add url-specific parameters

* Update readme and examples

* added e2e tests for most parameters. Still a few actions, location and iframes to be done.

* fixed type

* Nick:

* Update scrape.ts

* Update index.ts

* added actions and base64 check

* Nick: skipTls feature flag?

* 403

* todo

* todo

* fixes

* yeet headers from url specific params

* add warning when final engine has feature deficit

* expose engine results tracker for ScrapeEvents implementation

* ingest scrape events

* fixed some tests

* comment

* Update index.test.ts

* fixed rawHtml

* Update index.test.ts

* update comments

* move geolocation to global f-e option, fix removeBase64Images

* Nick:

* trim url-specific params

* Update index.ts

---------

Co-authored-by: Eric Ciarla <ericciarla@yahoo.com>
Co-authored-by: rafaelmmiller <8574157+rafaelmmiller@users.noreply.github.com>
Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2024-11-07 20:57:33 +01:00
y5n
4278fae51e Update README.md 2024-09-09 10:55:31 +08:00
y5n
1ea9131e63 feat: Update redis deployment to run redis with password if REDIS_PASSWORD is configured 2024-09-07 16:00:32 +08:00
rafaelsideguide
7a61325500 map + search + scrape markdown bug 2024-08-16 17:57:11 -03:00
Jakob Stadlhuber
2dc7be3869 Remove liveness and readiness probes from worker.yaml
This commit removes the liveness and readiness probes configuration from the Kubernetes worker manifest. Additionally, a Service definition for the worker application has been removed. These changes might be necessary to update the deployment strategy or simplify the configuration.
2024-07-24 19:38:54 +02:00
Jakob Stadlhuber
d68f349109 Update Kubernetes YAMLs and add worker service
Refactored container configurations in worker, api, and playwright-service YAMLs to streamline syntax and add missing fields. Added a service definition for the worker component and included a new environment variable in the configmap for rate-limiting. These changes enhance configuration clarity and ensure proper resource definitions.
2024-07-24 19:31:37 +02:00
Jakob Stadlhuber
f26bda2477 Update Docker build paths in Kubernetes setup README
Corrected relative paths for Docker build commands to ensure the appropriate directories are targeted. This fix is crucial for successful image builds and deployment consistency in the Kubernetes cluster setup.
2024-07-24 19:06:19 +02:00
Jakob Stadlhuber
895e80caa4 Add liveness and readiness probes to Kubernetes configs
Introduced liveness and readiness probes for the Playwright service, API, and worker components. This ensures that Kubernetes can better manage the health and availability of these services by periodically checking their endpoints. This enhancement will improve the robustness and reliability of the deployed applications.
2024-07-24 19:00:23 +02:00
Jakob Stadlhuber
497aa5d25e Update Kubernetes configs for playwright-service, api, and worker
Added new ConfigMap for playwright-service and adjusted existing references.
Applied imagePullPolicy: Always to ensure all images are updated promptly.
Updated README to include --no-cache for Docker build instructions.
2024-07-24 17:55:45 +02:00
Eric Ciarla
8e39083d8c Update examples section 2024-06-21 15:40:46 -04:00