# Reader
Your LLMs deserve better input.

Reader converts any URL to an **LLM-friendly** input with a simple prefix `https://r.jina.ai/`. Get improved output for your agent and RAG systems at no cost.

- Live demo: https://jina.ai/reader
- Or just visit these URLs and see for yourself: https://r.jina.ai/https://github.com/jina-ai/reader, https://r.jina.ai/https://x.com/elonmusk

> Feel free to use the Reader API in production. It is free, stable, and scalable. We maintain it actively as one of the core products of Jina AI.

<img width="973" alt="image" src="https://github.com/jina-ai/reader/assets/2041322/2067c7a2-c12e-4465-b107-9a16ca178d41">

## Updates
- **2024-04-24**: You now have more fine-grained control over the Reader API [using headers](#using-request-headers), e.g. forwarding cookies or using an HTTP proxy.
- **2024-04-15**: Reader now supports image reading! It captions all images at the specified URL and adds `Image [idx]: [caption]` as an alt tag (if they initially lack one). This enables downstream LLMs to interact with the images in reasoning, summarizing, etc. [See example here](https://x.com/JinaAI_/status/1780094402071023926).
## Usage
Simply prepend `https://r.jina.ai/` to any URL. For example, to convert the URL `https://en.wikipedia.org/wiki/Artificial_intelligence` to an LLM-friendly input, use the following URL:

[https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence](https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence)

All images on that page that lack an `alt` tag are auto-captioned by a VLM (vision language model) and formatted as `![Image [idx]: [VLM_caption]](img_URL)`. This should give your downstream text-only LLM *just enough* hints to include those images in reasoning, selecting, and summarization.
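
If you would rather call the endpoint from code than from the browser, the snippet below is a minimal sketch in TypeScript (assuming Node 18+ with the global `fetch`); the target URL is just an example.

```typescript
// A minimal sketch of calling Reader from code rather than the browser.
// Assumes Node 18+ (global fetch); the target URL is only an example.
const target = "https://en.wikipedia.org/wiki/Artificial_intelligence";

async function readAsMarkdown(url: string): Promise<string> {
  const res = await fetch(`https://r.jina.ai/${url}`);
  if (!res.ok) {
    throw new Error(`Reader request failed: ${res.status} ${res.statusText}`);
  }
  // The response body is plain text/markdown, ready to drop into an LLM prompt.
  return res.text();
}

readAsMarkdown(target).then((md) => console.log(md.slice(0, 500)));
```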
### Streaming mode
Streaming mode is useful when you find that the standard mode gives you an incomplete result; in streaming mode, the Reader waits a bit longer until the page is *stably* rendered. Use the `Accept` header to enable streaming mode:
```bash
curl -H "Accept: text/event-stream" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```
The data comes in a stream; each subsequent chunk contains more complete information. **The last chunk should provide the most complete and final result.** If you come from the world of LLMs, note that this is different behavior from LLMs' text-generation streaming.

For example, compare the two curl commands below. You can see that the streaming one eventually gives you the complete information, whereas the standard mode does not. This is because the content on this particular site is loaded by JavaScript *after* the page is fully loaded, so the standard mode returns the page "too soon".

```bash
curl -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
```
> Note: `-H 'x-no-cache: true'` is used only for demonstration purposes to bypass the cache.

Streaming mode is also useful if your downstream LLM/agent system requires immediate content delivery or needs to process data in chunks to interleave I/O and LLM processing times. This allows for quicker access and more efficient data handling:
```text
Reader API:  streamContent1 ----> streamContent2 ----> streamContent3 ---> ...
                    |                    |                    |
                    v                    |                    |
Your LLM:  LLM(streamContent1)           |                    |
                                         v                    |
                                LLM(streamContent2)           |
                                                              v
                                                     LLM(streamContent3)
```
Note that in terms of completeness: `... > streamContent3 > streamContent2 > streamContent1`, each subsequent chunk contains more complete information.
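
To consume the stream from code, you can read the `text/event-stream` response and keep the latest event, since later chunks supersede earlier ones. Below is a minimal TypeScript sketch assuming Node 18+ and that each event carries the page content in its `data:` lines; the exact event payload format is an assumption.

```typescript
// Minimal sketch: read Reader's event stream and keep the most recent chunk.
// Assumes Node 18+ (global fetch, web streams) and that each SSE event carries
// the content in its `data:` lines; the exact payload shape is an assumption.
async function readStreamed(url: string): Promise<string> {
  const res = await fetch(`https://r.jina.ai/${url}`, {
    headers: { Accept: "text/event-stream" },
  });
  if (!res.ok || !res.body) {
    throw new Error(`Reader request failed: ${res.status}`);
  }

  const decoder = new TextDecoder();
  let buffer = "";
  let latest = "";

  // SSE events are separated by a blank line; data lines start with "data:".
  for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
    buffer += decoder.decode(chunk, { stream: true });
    const events = buffer.split("\n\n");
    buffer = events.pop() ?? ""; // keep the trailing partial event
    for (const event of events) {
      const data = event
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).trimStart())
        .join("\n");
      if (data) latest = data; // later events are more complete
    }
  }
  return latest;
}

readStreamed("https://en.m.wikipedia.org/wiki/Main_Page").then((md) => console.log(md));
```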
### Using request headers
As you have already seen above, you can control the behavior of the Reader API using request headers. Here is a complete list of supported headers (a usage sketch follows the list).
- You can ask the Reader API to forward cookie settings via the `x-set-cookie` header.
  - Note that requests with cookies will not be cached.
- You can bypass `readability` filtering via the `x-respond-with` header, specifically:
  - `x-respond-with: markdown` returns markdown *without* going through `readability`
  - `x-respond-with: html` returns `documentElement.outerHTML`
  - `x-respond-with: text` returns `document.body.innerText`
  - `x-respond-with: screenshot` returns the URL of the webpage's screenshot
- You can specify a proxy server via the `x-proxy-url` header.
- You can bypass the cached page (lifetime 300s) via the `x-no-cache` header.
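
For example, here is a minimal TypeScript sketch (Node 18+) of passing a few of these headers from code; the header values and target URL below are illustrative only.

```typescript
// Minimal sketch: pass Reader request headers programmatically.
// Assumes Node 18+ (global fetch); the target URL and header values are examples only.
async function readWithHeaders(url: string): Promise<string> {
  const res = await fetch(`https://r.jina.ai/${url}`, {
    headers: {
      "x-respond-with": "text", // skip readability, return document.body.innerText
      "x-no-cache": "true",     // bypass the 300s cache
      // "x-proxy-url": "http://my-proxy.example.com:8080", // hypothetical proxy
    },
  });
  return res.text();
}

readWithHeaders("https://example.com").then(console.log);
```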
### JSON mode (super early beta)
This is still very early, and the result is not yet a really "useful" JSON; it contains only three fields: `url`, `title`, and `content`. Nonetheless, you can use the `Accept` header to control the output format:
```bash
curl -H "Accept: application/json" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```
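
From code, the JSON output can be consumed as in the sketch below (TypeScript, Node 18+). The interface only mirrors the three fields named above and is an assumption about the response shape, which may change while JSON mode is in beta.

```typescript
// A minimal sketch of requesting JSON mode and reading its three fields.
// Assumes Node 18+ (global fetch); the interface mirrors only the fields
// documented above and may change while JSON mode is in beta.
interface ReaderJson {
  url: string;
  title: string;
  content: string;
}

async function readAsJson(url: string): Promise<ReaderJson> {
  const res = await fetch(`https://r.jina.ai/${url}`, {
    headers: { Accept: "application/json" },
  });
  return (await res.json()) as ReaderJson;
}

readAsJson("https://en.m.wikipedia.org/wiki/Main_Page").then((page) =>
  console.log(page.title, page.content.length),
);
```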
## Install
You will need the following tools to run the project:
- Node v18 (The build fails for Node version >18)
- Firebase CLI (`npm install -g firebase-tools`)

For the backend, go to the `backend/functions` directory and install the npm dependencies.
```bash
git clone git@github.com:jina-ai/reader.git
cd reader/backend/functions
npm install
```
## What is the `thinapps-shared` submodule?
You might notice a reference to the `thinapps-shared` submodule, an internal package we use to share code across our products. While it's not open-sourced and isn't integral to Reader's functions, it mainly helps with decorators, logging, secrets management, etc. Feel free to ignore it for now.

That said, this is *the single codebase* behind `https://r.jina.ai`, so every time we commit here, we deploy the new version to `https://r.jina.ai`.
## Having trouble on some websites?
Please raise an issue with the URL you are having trouble with. We will look into it and try to fix it.
## License
Reader is backed by [Jina AI](https://jina.ai) and licensed under [Apache-2.0](./LICENSE).