yanlong.wang
|
579f259cb9
|
fix: detect when readability does not work
|
2024-06-20 18:20:13 +08:00 |
|
yanlong.wang
|
eaa06781e3
|
fix: normalize-url pollution
|
2024-06-20 14:53:25 +08:00 |
|
yanlong.wang
|
6f37e5d3b4
|
feat: x-remove-selector
|
2024-06-18 18:07:38 +08:00 |
|
yanlong.wang
|
ee008ebe10
|
fix: improved code rules
|
2024-06-13 16:27:30 +08:00 |
|
yanlong.wang
|
fd9a86bc00
|
chore: fix abuse timing
|
2024-06-11 13:57:19 +08:00 |
|
Yanlong Wang
|
70d80bbcfe
|
fix: abuse condition
|
2024-06-10 17:41:38 +08:00 |
|
Yanlong Wang
|
5789ae1407
|
chore: don't abuse our service
|
2024-06-10 17:23:50 +08:00 |
|
yanlong.wang
|
1e3bae6aad
|
fix: timeout parsing
|
2024-06-05 19:50:48 +08:00 |
|
yanlong.wang
|
a9936d322e
|
fix: search descriptions
|
2024-06-05 19:47:04 +08:00 |
|
yanlong.wang
|
165cce6c91
|
refactor: options dto
|
2024-06-05 18:55:40 +08:00 |
|
Yanlong Wang
|
f0668a96b4
|
fix: potential circular crawling
|
2024-06-02 23:23:39 +08:00 |
|
Yanlong Wang
|
be91371b93
|
fix: ignore blockade for authenticated users
|
2024-06-02 09:09:21 +08:00 |
|
Yanlong Wang
|
154d8ede45
|
fix: truncate svg
|
2024-06-02 08:57:39 +08:00 |
|
Yanlong Wang
|
7a7e49bc00
|
fix: blockade query
|
2024-06-01 08:06:46 +08:00 |
|
Yanlong Wang
|
d2bebec60f
|
fix: abuse blocker
|
2024-06-01 02:01:12 +08:00 |
|
Yanlong Wang
|
249408df6b
|
fix
|
2024-06-01 01:07:50 +08:00 |
|
Yanlong Wang
|
43dee08dcc
|
security: detect abuse
|
2024-06-01 00:57:51 +08:00 |
|
Yanlong Wang
|
908157b61e
|
fix: pdf cache
|
2024-05-31 19:05:17 +08:00 |
|
Yanlong Wang
|
9c60b4b93d
|
fix: setup expire for pdf caches
|
2024-05-31 18:36:23 +08:00 |
|
Yanlong Wang
|
1ba21da0c5
|
fix: pdf cache
|
2024-05-31 18:26:05 +08:00 |
|
Yanlong Wang
|
fd0b77285f
|
fix: firebase fail to save large docs
|
2024-05-31 18:16:37 +08:00 |
|
Yanlong Wang
|
964b66b6ab
|
fix: data crunching import
|
2024-05-31 17:32:16 +08:00 |
|
Yanlong Wang
|
9ac40606d5
|
fix: bulk fix multiple issues
|
2024-05-31 17:30:57 +08:00 |
|
Yanlong Wang
|
0c15946874
|
fix: trimstart url
|
2024-05-30 20:29:31 +08:00 |
|
Yanlong Wang
|
33e14e5404
|
feat: extract text from pdf (#70)
* feat: pdf
* fix
* fix
|
2024-05-30 20:21:33 +08:00 |
|
yanlong.wang
|
7c5712363c
|
feat: allow custom rate limit per uid
|
2024-05-23 15:36:09 +08:00 |
|
yanlong.wang
|
8eee95119d
|
feat: index brief in JSON format
|
2024-05-23 12:06:07 +08:00 |
|
yanlong.wang
|
4f37de24f6
|
fix: docs
|
2024-05-21 17:35:16 +08:00 |
|
Yanlong Wang
|
a8e0628460
|
feat: links and images summary (#63)
* wip: dedicated link and image summary
* fix
* fix
* fix
* fix: docs
* fix
* fix
* fix
|
2024-05-21 17:34:19 +08:00 |
|
Yanlong Wang
|
df71c9a534
|
fix: stop using pool
|
2024-05-20 01:12:22 +08:00 |
|
Yanlong Wang
|
4077fa7040
|
fix: geoip encoding
|
2024-05-17 09:31:22 +08:00 |
|
Yanlong Wang
|
2941be6096
|
fix: potential unencoded query
|
2024-05-17 09:15:37 +08:00 |
|
Yanlong Wang
|
ed9e9f43cf
|
fix: block rough requests
|
2024-05-16 20:22:26 +08:00 |
|
yanlong.wang
|
8ec8c1e718
|
fix: logging for search error
|
2024-05-16 19:01:30 +08:00 |
|
yanlong.wang
|
e0e37ad4d7
|
fix: potential chargeAmount mismatch
|
2024-05-16 18:43:41 +08:00 |
|
yanlong.wang
|
8b0916f858
|
fix: race condition while logging chargeAmount
|
2024-05-16 18:26:18 +08:00 |
|
yanlong.wang
|
6f4819bc49
|
chore: tweak deployment
|
2024-05-16 17:46:53 +08:00 |
|
yanlong.wang
|
322cb86f21
|
fix: on no results
|
2024-05-16 17:30:47 +08:00 |
|
yanlong.wang
|
e2698b48bd
|
fix: rate limit tag for search
|
2024-05-16 16:58:10 +08:00 |
|
yanlong.wang
|
72e1c46a6c
|
fix: improve search responsiveness
|
2024-05-16 15:47:49 +08:00 |
|
Yanlong Wang
|
0583645613
|
fix: noCache in search
|
2024-05-16 00:42:30 +08:00 |
|
Yanlong Wang
|
4556954d17
|
fix: image url
|
2024-05-16 00:39:24 +08:00 |
|
Yanlong Wang
|
6f65083f8d
|
feat: control cache tolerance and select target using headers
|
2024-05-16 00:10:20 +08:00 |
|
yanlong.wang
|
77fc500f41
|
fix: allow x-return-format header alias
|
2024-05-15 12:24:46 +08:00 |
|
Yanlong Wang
|
445624c405
|
fix: early return for search
|
2024-05-15 08:47:16 +08:00 |
|
Yanlong Wang
|
1cf8e83857
|
fix: add cache tolerance
|
2024-05-15 08:06:35 +08:00 |
|
Yanlong Wang
|
d100c3fc5f
|
fix: search result cache save
|
2024-05-14 19:57:49 +08:00 |
|
Yanlong Wang
|
ec4ce4fef3
|
chore: update rate limits
|
2024-05-14 19:44:35 +08:00 |
|
Yanlong Wang
|
2e3c217479
|
feat: web search (#57)
|
2024-05-14 19:39:43 +08:00 |
|
Yanlong Wang
|
f171e54ac9
|
fix: log charge amount
|
2024-05-14 17:25:59 +08:00 |
|
yanlong.wang
|
ffc6899acd
|
chore: reduce resource
|
2024-05-13 18:35:11 +08:00 |
|
yanlong.wang
|
e417cd8a53
|
fix: tidyMarkdown feature in turndown rules
|
2024-05-09 15:15:15 +08:00 |
|
Yanlong Wang
|
36bf5d96b5
|
fix: remove tidyMarkdown entirely
|
2024-05-09 11:33:56 +08:00 |
|
Yanlong Wang
|
59f807cb7c
|
fix: tidyMarkdown
|
2024-05-09 11:32:26 +08:00 |
|
Yanlong Wang
|
6b6774f43b
|
fix: tidyMarkdown
|
2024-05-09 11:25:51 +08:00 |
|
Yanlong Wang
|
4bee36ed4a
|
fix: patch tidyMarkdown
|
2024-05-09 11:06:20 +08:00 |
|
Yanlong Wang
|
de22127d2f
|
fix: leak of crippled listeners
|
2024-05-08 19:51:55 +08:00 |
|
Yanlong Wang
|
62dc75f78e
|
fix: consider image data-src and make generated alt text optional (#50)
* fix: image src and alt
* fix
* docs: doc about x-with-generated-alt
* fix: deps
|
2024-05-08 18:29:11 +08:00 |
|
Yanlong Wang
|
8cfd0d67dc
|
feat: jina paywall (#49)
* feat: integrate with jina embeddings paywall
|
2024-05-08 18:25:26 +08:00 |
|
Yanlong Wang
|
2e025d10cf
|
fix: the complex regexp caused node.js process to hang
Co-authored-by: Claude 3 opus
|
2024-05-05 16:29:39 +08:00 |
|
Yanlong Wang
|
fef1d0faf1
|
bump: deps
|
2024-05-05 10:54:11 +08:00 |
|
Yanlong Wang
|
a0d1a7234b
|
chore: tweak health check
|
2024-05-02 08:39:54 +08:00 |
|
Yanlong Wang
|
9e02080103
|
fix: error on browser crashes
|
2024-05-02 03:23:57 +08:00 |
|
Yanlong Wang
|
55b954ffeb
|
fix: tweak health check
|
2024-04-30 18:56:46 +08:00 |
|
Yanlong Wang
|
528b3e5fed
|
fix: add health check to detect puppeteer stall
|
2024-04-30 18:30:31 +08:00 |
|
Yanlong Wang
|
ae29055142
|
chore: tweaks
|
2024-04-29 20:12:11 +08:00 |
|
yanlong.wang
|
867636d037
|
fix: apply rate limit to 100qpm per IP
|
2024-04-29 18:54:51 +08:00 |
|
yanlong.wang
|
15606f38d7
|
fix: on null element
|
2024-04-29 17:28:07 +08:00 |
|
yanlong.wang
|
53a4361c23
|
fix: block firebase runtime intrusion
|
2024-04-29 17:21:34 +08:00 |
|
yanlong.wang
|
059c8aa61e
|
fix: remove exposed function before cleanup
|
2024-04-29 15:51:23 +08:00 |
|
yanlong.wang
|
bfc6d678d8
|
fix: split report handler from other page preps
|
2024-04-29 15:19:05 +08:00 |
|
Yanlong Wang
|
036f6dc776
|
chore: tweak runtime config
|
2024-04-29 09:49:29 +08:00 |
|
Yanlong Wang
|
6ac2863e89
|
bump: deps
|
2024-04-28 22:28:24 +08:00 |
|
yanlong.wang
|
a6a5b7c530
|
fix: respond with markdown
|
2024-04-25 18:58:42 +08:00 |
|
yanlong.wang
|
69231ad59e
|
feat: full markdown mode
|
2024-04-25 18:21:04 +08:00 |
|
yanlong.wang
|
adc05fe20a
|
fix
|
2024-04-25 16:09:23 +08:00 |
|
yanlong.wang
|
39a446f5e7
|
fix: root content
|
2024-04-25 15:43:17 +08:00 |
|
yanlong.wang
|
f1016649ac
|
fix: firebase limit on document size causing cache failures
|
2024-04-25 12:24:05 +08:00 |
|
yanlong.wang
|
94a72052f4
|
fix: reduce frequency of screenshot if possible
|
2024-04-24 19:43:24 +08:00 |
|
Yanlong Wang
|
7ee2c327a3
|
refactor: reorganize features (#37)
* wip
* fix
* wip
* cleanup
* fix
* fix
* cache: may rescue using stale cache
* fix: target 384mb ram per page
* fix: log about pool size
* fix
* clean
* fix: cache and snapshot reporting
|
2024-04-24 19:21:12 +08:00 |
|
dependabot[bot]
|
e36d3b0f24
|
chore(deps): bump protobufjs and firebase-admin in /backend/functions (#35)
Bumps [protobufjs](https://github.com/protobufjs/protobuf.js) to 7.2.6 and updates ancestor dependency [firebase-admin](https://github.com/firebase/firebase-admin-node). These dependencies need to be updated together.
Updates `protobufjs` from 7.2.4 to 7.2.6
- [Release notes](https://github.com/protobufjs/protobuf.js/releases)
- [Changelog](https://github.com/protobufjs/protobuf.js/blob/master/CHANGELOG.md)
- [Commits](https://github.com/protobufjs/protobuf.js/compare/protobufjs-v7.2.4...protobufjs-v7.2.6)
Updates `firebase-admin` from 11.11.1 to 12.1.0
- [Release notes](https://github.com/firebase/firebase-admin-node/releases)
- [Commits](https://github.com/firebase/firebase-admin-node/compare/v11.11.1...v12.1.0)
---
updated-dependencies:
- dependency-name: protobufjs
  dependency-type: indirect
- dependency-name: firebase-admin
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
|
2024-04-24 16:37:38 +08:00 |
|
Yanlong Wang
|
4b208f44b5
|
fix: process not quitting on errors
|
2024-04-21 10:17:05 +08:00 |
|
Charuka Samarakoon
|
d47310a6f7
|
fix: allocating incorrect max value due to missing parentheses (#26)
|
2024-04-19 09:01:23 +08:00 |
|
yanlong.wang
|
d4ca381c38
|
fix: explicitly reject non http protocols
|
2024-04-18 15:35:06 +08:00 |
|
yanlong.wang
|
abc817e960
|
feat: block media resources to improve speed
|
2024-04-18 15:06:28 +08:00 |
|
yanlong.wang
|
cbc13ecbbd
|
fix: catch turndown errors
|
2024-04-18 13:51:54 +08:00 |
|
yanlong.wang
|
0975b35ca2
|
chore: turn up concurrency a bit based on analysis
|
2024-04-18 11:53:55 +08:00 |
|
yanlong.wang
|
a211366501
|
fix: expose publishedTime if possible
|
2024-04-17 12:36:36 +08:00 |
|
Yanlong Wang
|
6e36f0a447
|
fix: wrong url normalization
|
2024-04-17 09:55:41 +08:00 |
|
Yanlong Wang
|
781b835466
|
fix: keep url details
|
2024-04-17 09:48:26 +08:00 |
|
Yanlong Wang
|
11a5a90611
|
fix: favor nominal url over real url
|
2024-04-17 09:30:49 +08:00 |
|
Yanlong Wang
|
bda7e76e50
|
chore: increase max instances to target 10k concurrent requests
|
2024-04-17 09:22:26 +08:00 |
|
Yanlong Wang
|
50ed9cc248
|
feat: fallback to google archive (#16)
* feat: fallback to google archive
* fix
|
2024-04-16 09:17:45 -07:00 |
|
yanlong.wang
|
8a2b095bd7
|
fix: give expireAt for image cache
|
2024-04-16 15:46:05 +08:00 |
|
Han Xiao
|
b3fb4c5c57
|
feat: add image captioning (#6)
* Fix contentText assignment in CrawlerHost class
* fix: recover vscode configurations
* feat: add image captioning
* feat: add image captioning
* clean: vscode config
* chore: fix some ts warnings
* feat: auto alt text
* fix
* chore: improve prompt
* clean: unused config
* fix: failure condition
* fix: remove redundant code
* fix: catch parse error
* fix: catch parse error
---------
Co-authored-by: Yanlong Wang <yanlong.wang@naiver.org>
|
2024-04-15 20:51:31 -07:00 |
|
Han Xiao
|
18373626b2
|
fix: catch parse error
|
2024-04-15 19:27:40 -07:00 |
|
Han Xiao
|
9b190127aa
|
fix: clean broken markdown
|
2024-04-13 21:40:51 -07:00 |
|
Han Xiao
|
ef23d810f8
|
feat: clean broken markdown
|
2024-04-13 19:21:35 -07:00 |
|
Han Xiao
|
8378cb06ee
|
chore: rename url2text to reader
|
2024-04-13 12:25:42 -07:00 |
|
Han Xiao
|
e050a5bffa
|
Merge remote-tracking branch 'origin/main'
|
2024-04-13 11:42:21 -07:00 |
|
Han Xiao
|
8e241c7f5a
|
chore: rename url2text to reader
|
2024-04-13 11:42:15 -07:00 |
|
Yanlong Wang
|
dbeb69582a
|
puppeteer stealth
|
2024-04-13 22:27:50 +08:00 |
|
Yanlong Wang
|
33d7cfc41c
|
fix
|
2024-04-13 08:25:52 +08:00 |
|
Yanlong Wang
|
95799988da
|
fix: use gpt bot UA
|
2024-04-13 08:13:50 +08:00 |
|
Yanlong Wang
|
950338261a
|
fix
|
2024-04-13 08:07:55 +08:00 |
|
Yanlong Wang
|
5199b00eeb
|
fix
|
2024-04-13 08:04:07 +08:00 |
|
Yanlong Wang
|
5ed3f90b9c
|
fix
|
2024-04-13 07:53:58 +08:00 |
|
Yanlong Wang
|
be7eeec11b
|
fix
|
2024-04-12 14:17:30 +08:00 |
|
Yanlong Wang
|
2da1b7f3a5
|
fix
|
2024-04-12 14:17:04 +08:00 |
|
Yanlong Wang
|
fdd8a8aa8d
|
fix
|
2024-04-12 12:27:42 +08:00 |
|
Yanlong Wang
|
78c8444096
|
fix
|
2024-04-12 10:59:37 +08:00 |
|
Yanlong Wang
|
629ab270be
|
fix
|
2024-04-12 10:24:56 +08:00 |
|
Yanlong Wang
|
664d4b1c9f
|
fix
|
2024-04-12 09:25:19 +08:00 |
|
Han Xiao
|
2dc0850c8c
|
chore: rename url2text to reader
|
2024-04-11 15:44:12 -07:00 |
|
Han Xiao
|
c1743db305
|
chore: clean code
|
2024-04-11 15:29:57 -07:00 |
|
yanlong.wang
|
b29a569d39
|
fix
|
2024-04-11 19:20:17 +08:00 |
|
yanlong.wang
|
7e366aca68
|
fix
|
2024-04-11 19:12:07 +08:00 |
|
yanlong.wang
|
a9426341f6
|
fix
|
2024-04-11 19:06:45 +08:00 |
|
yanlong.wang
|
5cfb78b275
|
gfm
|
2024-04-11 19:06:06 +08:00 |
|
yanlong.wang
|
9d0d54e511
|
fix
|
2024-04-11 19:00:27 +08:00 |
|
yanlong.wang
|
e17ef6dba0
|
fix
|
2024-04-11 18:28:51 +08:00 |
|
yanlong.wang
|
77174f1511
|
fix
|
2024-04-11 17:24:42 +08:00 |
|
yanlong.wang
|
94e65381bd
|
fix
|
2024-04-11 17:14:41 +08:00 |
|
yanlong.wang
|
b2f8b11cdc
|
wip
|
2024-04-10 19:57:00 +08:00 |
|
yanlong.wang
|
b46e859a30
|
wip
|
2024-04-10 19:43:53 +08:00 |
|
yanlong.wang
|
89d6d49f06
|
wip
|
2024-04-10 19:32:07 +08:00 |
|