SimpleChatTC:SimpleProxy:Pdf2Text update /cleanup readme

2025-11-02 22:11:07 +05:30 · 2025-11-02 22:11:07 +05:30 · e1cf2bae7e
parent 494d063657
commit e1cf2bae7e
1 changed files with 26 additions and 19 deletions
--- a/tools/server/public_simplechat/readme.md
+++ b/tools/server/public_simplechat/readme.md
@@ -94,13 +94,17 @@ remember to
  * cd tools/server/public_simplechat/local.tools; python3 ./simpleproxy.py --config simpleproxy.json
-  * remember that this is a relatively minimal dumb proxy logic along with optional stripping of non textual
+  * remember that this is a relatively minimal dumb proxy logic which can fetch html or pdf content and
-  content like head, scripts, styles, headers, footers, ... Be careful when accessing web through this and
+  inturn optionally provide plain text version of the content by stripping off non textual/core contents.
-  use it only with known safe sites.
+  Be careful when accessing web through this and use it only with known safe sites.
  * look into local.tools/simpleproxy.json for specifying
    * the white list of allowed.schemes
      * you may want to use this to disable local file access and or disable http access,
        and inturn retaining only https based urls or so.
    * the white list of allowed.domains
      * review and update this to match your needs.
    * the shared bearer token between server and client ui
 * other builtin tool / function calls like calculator, javascript runner, DataStore dont require the
@@ -389,15 +393,15 @@ like
 sessions by getting it to also create and execute mathematical expressions or code to verify
 such stuff and so.
-* access content from internet and augment the ai model's context with additional data as
+* access content (including html, pdf, text based...) from local file system or the internet
-needed to help generate better responses. this can also be used for
+and augment the ai model's context with additional data as needed to help generate better
 responses. This can also be used for
  * generating the latest news summary by fetching from news aggregator sites and collating
  organising and summarising the same
-  * searching for specific topics and summarising the results
+  * searching for specific topics and summarising the search results and or fetching and
  analysing found data to generate summary or to explore / answer queries around that data ...
  * or so
 * one could also augment additional data / info by accessing text content from pdf files
 * save collated data or generated analysis or more to the provided data store and retrieve
 them later to augment the analysis / generation then. Also could be used to summarise chat
 session till a given point and inturn save the summary into data store and later retrieve
@@ -444,16 +448,18 @@ Either way always remember to cross check the tool requests and generated respon
 * search_web_text - search for the specified words using the configured search engine and return the
 plain textual content from the search result page.
 * pdf2text - fetch/read specified pdf file and extract its textual content
  * this depends on the pypdf python based open source library
 the above set of web related tool calls work by handshaking with a bundled simple local web proxy
 (/caching in future) server logic, this helps bypass the CORS restrictions applied if trying to
 directly fetch from the browser js runtime environment.
-* pdf2text - fetch/read specified pdf file and extract its textual content
+Local file access is also enabled for web fetch and pdf tool calls, if one uses the file:/// scheme
 in the url, so be careful as to where and under which user id the simple proxy will be run.
-  * local file access is enabled for this feature, so be careful as to where and under which user id
+* one can always disable local file access by removing 'file' from the list of allowed.schemes in
-  the simple proxy will be run.
+simpleproxy.json config file.
  * this depends on the pypdf python based open source library
 Implementing some of the tool calls through the simpleproxy.py server and not directly in the browser
 js env, allows one to isolate the core of these logic within a discardable VM or so, by running the
@@ -463,7 +469,7 @@ Depending on the path specified wrt the proxy server, it executes the correspond
 urltext path is used (and not urlraw), the logic in addition to fetching content from given url, it
 tries to convert html content into equivalent plain text content to some extent in a simple minded
 manner by dropping head block as well as all scripts/styles/footers/headers/nav blocks and inturn
-dropping the html tags.
+also dropping the html tags. Similarly for pdf2text.
 The client ui logic does a simple check to see if the bundled simpleproxy is running at specified
 proxyUrl before enabling these web and related tool calls.
@@ -475,7 +481,8 @@ The bundled simple proxy
 * it provides for a basic white list of allowed domains to access, to be specified by the end user.
  This should help limit web access to a safe set of sites determined by the end user. There is also
-  a provision for shared bearer token to be specified by the end user.
+  a provision for shared bearer token to be specified by the end user. One could even control what
  schemes are supported wrt the urls.
 * it tries to mimic the client/browser making the request to it by propogating header entries like
  user-agent, accept and accept-language from the got request to the generated request during proxying
@@ -572,13 +579,15 @@ users) own data or data of ai model.
 Trap http response errors and inform user the specific error returned by ai server.
-Initial go at a pdf2text tool call. For now it allows local pdf files to be read and their text content
+Initial go at a pdf2text tool call. It allows web / local pdf files to be read and their text content
-extracted and passed to ai model for further processing, as decided by ai and end user.
+extracted and passed to ai model for further processing, as decided by ai and end user. One could
 either work with the full pdf or a subset of adjacent pages.
 SimpleProxy
 * Convert from a single monolithic file into a collection of modules.
 * UrlValidator to cross check scheme and domain of requested urls,
  the whitelist inturn picked from config json
 * Helpers to fetch file from local file system or the web, transparently
 #### ToDo
@@ -594,8 +603,6 @@ same when saved chat is loaded.
 MAYBE make the settings in general chat session specific, rather than the current global config flow.
 Provide tool to allow for specified pdf files to be converted to equivalent plain text form, so that ai
 can be used to work with the content in those PDFs.
 ### Debuging the handshake and beyond
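
Note on the readme text changed in this commit: the scheme and domain whitelisting it describes (allowed.schemes and allowed.domains in simpleproxy.json, cross checked by a UrlValidator) could look roughly like the sketch below. This is not the actual simpleproxy.py code. The config layout, the `is_url_allowed` name, and the exact-match domain check are all assumptions made for illustration; the real proxy may structure its config and matching differently.

```python
from urllib.parse import urlparse

# Hypothetical config, loosely mirroring the allowed.schemes and
# allowed.domains entries the readme mentions. The real simpleproxy.json
# layout may differ.
CONFIG = {
    "allowed.schemes": ["https"],  # 'file' and 'http' deliberately left out
    "allowed.domains": ["example.com", "en.wikipedia.org"],
}


def is_url_allowed(url: str, config: dict) -> bool:
    """Return True if the url's scheme and domain are both whitelisted.

    Assumption: domains are compared by exact hostname match.
    """
    parts = urlparse(url)
    # Reject any scheme that is not explicitly listed.
    if parts.scheme not in config["allowed.schemes"]:
        return False
    # Local files carry no domain, so only the scheme check applies.
    if parts.scheme == "file":
        return True
    host = parts.hostname or ""
    return host in config["allowed.domains"]


if __name__ == "__main__":
    for candidate in (
        "https://example.com/doc.pdf",
        "http://example.com/doc.pdf",
        "file:///tmp/notes.pdf",
    ):
        print(candidate, is_url_allowed(candidate, CONFIG))
```

With 'file' absent from the scheme list, a file:/// url is rejected before any fetch is attempted, which matches the readme's advice on disabling local file access.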