SimpleChatTC:SimpleProxy:Pdf2Text update /cleanup readme

hanishkvc 2025-11-02 22:11:07 +05:30
parent 494d063657
commit e1cf2bae7e
1 changed files with 26 additions and 19 deletions


@@ -94,13 +94,17 @@ remember to
 * cd tools/server/public_simplechat/local.tools; python3 ./simpleproxy.py --config simpleproxy.json
-* remember that this is a relatively minimal dumb proxy logic along with optional stripping of non textual
-content like head, scripts, styles, headers, footers, ... Be careful when accessing web through this and
-use it only with known safe sites.
+* remember that this is a relatively minimal dumb proxy logic which can fetch html or pdf content and
+inturn optionally provide plain text version of the content by stripping off non textual/core contents.
+Be careful when accessing web through this and use it only with known safe sites.
 * look into local.tools/simpleproxy.json for specifying
+* the white list of allowed.schemes
+* you may want to use this to disable local file access and or disable http access,
+and inturn retaining only https based urls or so.
 * the white list of allowed.domains
+* review and update this to match your needs.
 * the shared bearer token between server and client ui
 * other builtin tool / function calls like calculator, javascript runner, DataStore dont require the
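The scheme and domain whitelisting described in this hunk could be checked roughly as follows. This is only a sketch: the `CONFIG` dictionary, its key layout, the suffix-based domain matching and the name `is_url_allowed` are assumptions made for illustration, and the real simpleproxy.json and its validator may be structured differently.

```python
from urllib.parse import urlparse

# Hypothetical in-memory config; the key names echo allowed.schemes and
# allowed.domains from the readme, but the actual json layout may differ.
CONFIG = {
    "allowed.schemes": ["https"],          # 'file' and 'http' left out on purpose
    "allowed.domains": ["example.com", "example.org"],
}


def is_url_allowed(url: str, config: dict) -> bool:
    """Return True if the url's scheme and host are both whitelisted."""
    parts = urlparse(url)
    if parts.scheme not in config["allowed.schemes"]:
        return False
    if parts.scheme == "file":
        # Local files have no host to compare against the domain list.
        return True
    host = (parts.hostname or "").lower()
    # Assumption: a domain entry matches itself and any of its subdomains.
    return any(host == d or host.endswith("." + d)
               for d in config["allowed.domains"])


if __name__ == "__main__":
    for candidate in ("https://news.example.com/x", "file:///etc/passwd"):
        print(candidate, is_url_allowed(candidate, CONFIG))
```

Leaving `file` out of the scheme list, as in this sample config, is what blocks local file access.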
@@ -389,15 +393,15 @@ like
 sessions by getting it to also create and execute mathematical expressions or code to verify
 such stuff and so.
-* access content from internet and augment the ai model's context with additional data as
-needed to help generate better responses. this can also be used for
+* access content (including html, pdf, text based...) from local file system or the internet
+and augment the ai model's context with additional data as needed to help generate better
+responses. This can also be used for
 * generating the latest news summary by fetching from news aggregator sites and collating
 organising and summarising the same
-* searching for specific topics and summarising the results
+* searching for specific topics and summarising the search results and or fetching and
+analysing found data to generate summary or to explore / answer queries around that data ...
 * or so
-* one could also augment additional data / info by accessing text content from pdf files
 * save collated data or generated analysis or more to the provided data store and retrieve
 them later to augment the analysis / generation then. Also could be used to summarise chat
 session till a given point and inturn save the summary into data store and later retrieve
@@ -444,16 +448,18 @@ Either way always remember to cross check the tool requests and generated respon
 * search_web_text - search for the specified words using the configured search engine and return the
 plain textual content from the search result page.
+* pdf2text - fetch/read specified pdf file and extract its textual content
+* this depends on the pypdf python based open source library
 the above set of web related tool calls work by handshaking with a bundled simple local web proxy
 (/caching in future) server logic, this helps bypass the CORS restrictions applied if trying to
 directly fetch from the browser js runtime environment.
-* pdf2text - fetch/read specified pdf file and extract its textual content
-* local file access is enabled for this feature, so be careful as to where and under which user id
-the simple proxy will be run.
-* this depends on the pypdf python based open source library
+Local file access is also enabled for web fetch and pdf tool calls, if one uses the file:/// scheme
+in the url, so be careful as to where and under which user id the simple proxy will be run.
+* one can always disable local file access by removing 'file' from the list of allowed.schemes in
+simpleproxy.json config file.
 Implementing some of the tool calls through the simpleproxy.py server and not directly in the browser
 js env, allows one to isolate the core of these logic within a discardable VM or so, by running the
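As a rough illustration of the pdf2text tool call, text extraction with pypdf over either the whole document or a run of adjacent pages might look like the sketch below. The function names `extract_pages_text` and `pdf2text` and the 1-based `start` / `end` parameters are hypothetical; only `PdfReader`, `reader.pages` and `page.extract_text()` are taken from pypdf itself, and the proxy's actual handler may differ.

```python
def extract_pages_text(pages, start=None, end=None):
    """Join the text of pages start..end (1-based, inclusive).

    `pages` is any sequence whose items offer extract_text(), such as
    pypdf's reader.pages. Bounds outside the document are clamped.
    """
    total = len(pages)
    first = 1 if start is None else max(1, start)
    last = total if end is None else min(total, end)
    # extract_text() may yield nothing for image-only pages, hence the `or ""`.
    return "\n".join((pages[i].extract_text() or "")
                     for i in range(first - 1, last))


def pdf2text(path, start=None, end=None):
    """Read a local pdf file and return its plain text (hypothetical helper)."""
    # Imported lazily so the page-range helper above works without pypdf.
    from pypdf import PdfReader  # third party: pip install pypdf
    reader = PdfReader(path)
    return extract_pages_text(reader.pages, start, end)
```

Calling `pdf2text("paper.pdf", 2, 4)` would return the text of pages 2 to 4, while omitting the bounds covers the full document.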
@@ -463,7 +469,7 @@ Depending on the path specified wrt the proxy server, it executes the correspond
 urltext path is used (and not urlraw), the logic in addition to fetching content from given url, it
 tries to convert html content into equivalent plain text content to some extent in a simple minded
 manner by dropping head block as well as all scripts/styles/footers/headers/nav blocks and inturn
-dropping the html tags.
+also dropping the html tags. Similarly for pdf2text.
 The client ui logic does a simple check to see if the bundled simpleproxy is running at specified
 proxyUrl before enabling these web and related tool calls.
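The simple minded html to text conversion mentioned for the urltext path can be approximated with the standard library's `html.parser`. The class and function names below are invented for this sketch and it is not the proxy's actual code; it only shows the idea of skipping head/script/style/footer/header/nav blocks and then dropping the remaining tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text that sits outside of the blocks listed in SKIP."""

    SKIP = {"head", "script", "style", "footer", "header", "nav"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside any skipped block
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Tags never reach this method, so keeping only the data drops them.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(html_to_text("<body><nav>menu</nav><p>Hello <b>world</b></p></body>"))
```

A sketch like this assumes reasonably well formed markup; an unclosed header or nav block would swallow everything after it.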
@@ -475,7 +481,8 @@ The bundled simple proxy
 * it provides for a basic white list of allowed domains to access, to be specified by the end user.
 This should help limit web access to a safe set of sites determined by the end user. There is also
-a provision for shared bearer token to be specified by the end user.
+a provision for shared bearer token to be specified by the end user. One could even control what
+schemes are supported wrt the urls.
 * it tries to mimic the client/browser making the request to it by propogating header entries like
 user-agent, accept and accept-language from the got request to the generated request during proxying
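The header propagation described in the last bullet could be sketched as below. The names `PROPAGATED` and `build_outgoing_headers` are hypothetical and the real proxy may forward headers differently; the sketch only shows copying the three named headers from the incoming request and dropping everything else.

```python
# Headers the readme names as being carried over to the proxied request.
PROPAGATED = ("user-agent", "accept", "accept-language")


def build_outgoing_headers(incoming: dict) -> dict:
    """Pick the whitelisted headers from the incoming request's headers."""
    # Header names are case-insensitive, so normalise before looking them up.
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in PROPAGATED if name in lowered}


if __name__ == "__main__":
    got = {"User-Agent": "Mozilla/5.0", "Accept": "*/*", "Cookie": "secret"}
    print(build_outgoing_headers(got))
```

The resulting dictionary could then be handed to something like `urllib.request.Request(url, headers=...)` when generating the outgoing request.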
@@ -572,13 +579,15 @@ users) own data or data of ai model.
 Trap http response errors and inform user the specific error returned by ai server.
-Initial go at a pdf2text tool call. For now it allows local pdf files to be read and their text content
-extracted and passed to ai model for further processing, as decided by ai and end user.
+Initial go at a pdf2text tool call. It allows web / local pdf files to be read and their text content
+extracted and passed to ai model for further processing, as decided by ai and end user. One could
+either work with the full pdf or a subset of adjacent pages.
 SimpleProxy
 * Convert from a single monolithic file into a collection of modules.
 * UrlValidator to cross check scheme and domain of requested urls,
 the whitelist inturn picked from config json
+* Helpers to fetch file from local file system or the web, transparently
 #### ToDo
@@ -594,8 +603,6 @@ same when saved chat is loaded.
 MAYBE make the settings in general chat session specific, rather than the current global config flow.
-Provide tool to allow for specified pdf files to be converted to equivalent plain text form, so that ai
-can be used to work with the content in those PDFs.
 ### Debuging the handshake and beyond