mirror of https://github.com/intergalacticalvariable/reader.git synced 2024-11-15 19:22:20 +08:00

📚 This is an adapted version of Jina AI's Reader for local deployment using Docker. Convert any URL to an LLM-friendly input with a simple prefix http://127.0.0.1:3000/https://website-to-scrape.com/

Reader

Your LLMs deserve better input.

Reader converts any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/. Get improved output for your agent and RAG systems at no cost.

Live demo: https://jina.ai/reader
Or just visit https://r.jina.ai/https://github.com/jina-ai/reader or https://r.jina.ai/https://x.com/elonmusk and see for yourself.

Feel free to use https://r.jina.ai/* in production. It is free, stable and scalable. We are maintaining it actively as one of the core products of Jina AI.

Usage

Standard mode

Simply prepend https://r.jina.ai/ to any URL. For example, to convert the URL https://en.wikipedia.org/wiki/Artificial_intelligence to an LLM-friendly input, use the following URL:

https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
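
The prefixing rule can be sketched as a small helper. This is only an illustration: `reader_url` is a hypothetical name, not part of Reader, and the `base` parameter is an assumption that lets the same helper point at a local deployment.

```python
def reader_url(target: str, base: str = "https://r.jina.ai/") -> str:
    """Return `target` prefixed with the Reader base URL.

    `reader_url` and `base` are illustrative names, not part of Reader itself.
    """
    if not target.startswith(("http://", "https://")):
        raise ValueError(f"expected an absolute http(s) URL, got {target!r}")
    # Normalise the base so exactly one slash separates the prefix and the target.
    return base.rstrip("/") + "/" + target
```

Calling `reader_url("https://en.wikipedia.org/wiki/Artificial_intelligence")` produces the URL shown above. Passing `base="http://127.0.0.1:3000"` would build the equivalent URL for a local Docker deployment.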

Streaming mode

Set the Accept header to text/event-stream to turn on streaming:

curl -H "Accept: text/event-stream" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page

Note: if you run this example and get a single response instead of streaming output, someone else has requested the same URL within the last 5 minutes and the result is already cached, so the server returns it instantly. Try a different URL to see the streaming output.

If your downstream LLM/agent system requires immediate content delivery or needs to process data in chunks to interleave the IO and LLM time, use Streaming Mode. This allows for quicker access and efficient handling of data:


Reader API:  streamContent1 ----> streamContent2 ----> streamContent3 ---> ... 
                          |                    |                     |
                          v                    |                     |
Your LLM:                 LLM(streamContent1)  |                     |
                                               v                     |
                                               LLM(streamContent2)   |
                                                                     v
                                                                     LLM(streamContent3)

Streaming mode is also useful when the target page is large and takes long to render. If standard mode gives you incomplete content, try streaming mode.
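
Here is a minimal sketch of consuming the stream chunk by chunk, as in the diagram above. It assumes the response uses standard Server-Sent Events framing (`data:` lines, with a blank line ending each event). The README does not say what each event payload contains, so the sketch hands the raw payload to a callback. `iter_sse_data` and `stream_reader` are hypothetical helper names.

```python
import urllib.request
from typing import Callable, Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in a text/event-stream body."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and comment lines are ignored here.
    # An event still pending when the stream ends is dropped, as in the SSE format.


def stream_reader(url: str, handle: Callable[[str], None]) -> None:
    """Request `url` in streaming mode and pass each payload to `handle`."""
    request = urllib.request.Request(url, headers={"Accept": "text/event-stream"})
    with urllib.request.urlopen(request) as response:
        # Iterating over the response yields one bytes line at a time.
        for payload in iter_sse_data(line.decode("utf-8") for line in response):
            handle(payload)
```

For example, `stream_reader("https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page", print)` would print each payload as it arrives. In a real pipeline, `handle` would be the call into your LLM.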

JSON mode

JSON mode is still at an early stage and the result is not yet a particularly "useful" JSON: it contains only three fields, url, title and content. Nonetheless, you can set the Accept header to application/json to get it:

curl -H "Accept: application/json" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
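
A response with those three fields could be read as in the sketch below. It assumes `url`, `title` and `content` sit at the top level of the returned JSON object, which the README does not state explicitly. `ReaderPage` and `parse_reader_json` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class ReaderPage:
    url: str
    title: str
    content: str


def parse_reader_json(body: str) -> ReaderPage:
    """Parse a JSON-mode response body into a ReaderPage.

    Assumes url, title and content are top-level keys (an assumption,
    not something the README confirms).
    """
    doc = json.loads(body)
    if not isinstance(doc, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in ("url", "title", "content") if key not in doc]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return ReaderPage(url=doc["url"], title=doc["title"], content=doc["content"])
```

Failing loudly on a missing field is a deliberate choice here: the format is described as early, so a changed shape is better caught at the boundary than further downstream.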

Install

You will need the following tools to run the project:

Node v18 (the build fails on Node versions above 18)
Firebase CLI (npm install -g firebase-tools)

For the backend, clone the repository, go to the backend/functions directory and install the npm dependencies.

git clone git@github.com:jina-ai/reader.git
cd reader/backend/functions
npm install

What is the `thinapps-shared` submodule?

You might notice a reference to the thinapps-shared submodule, an internal package we use to share code across our products. It is not open-sourced and is not integral to Reader's functionality; it mainly helps with decorators, logging, secrets management, etc. Feel free to ignore it for now.

That said, this is the single codebase behind https://r.jina.ai, so every time we commit here, we deploy the new version to https://r.jina.ai.

Having trouble on some websites?

Please raise an issue with the URL you are having trouble with. We will look into it and try to fix it.

License

Reader is backed by Jina AI and licensed under Apache-2.0.
