SimpleChatTC:SimpleProxy:Pdf2Text update /cleanup readme
parent 494d063657
commit e1cf2bae7e
@ -94,13 +94,17 @@ remember to
|
||||||
|
|
||||||
* cd tools/server/public_simplechat/local.tools; python3 ./simpleproxy.py --config simpleproxy.json
|
* cd tools/server/public_simplechat/local.tools; python3 ./simpleproxy.py --config simpleproxy.json
|
||||||
|
|
||||||
* remember that this is a relatively minimal dumb proxy logic along with optional stripping of non textual
|
* remember that this is a relatively minimal dumb proxy logic which can fetch html or pdf content and
|
||||||
content like head, scripts, styles, headers, footers, ... Be careful when accessing web through this and
|
inturn optionally provide plain text version of the content by stripping off non textual/core contents.
|
||||||
use it only with known safe sites.
|
Be careful when accessing web through this and use it only with known safe sites.
|
||||||
|
|
||||||
* look into local.tools/simpleproxy.json for specifying
|
* look into local.tools/simpleproxy.json for specifying
|
||||||
|
|
||||||
|
* the white list of allowed.schemes
|
||||||
|
* you may want to use this to disable local file access and or disable http access,
|
||||||
|
and inturn retaining only https based urls or so.
|
||||||
* the white list of allowed.domains
|
* the white list of allowed.domains
|
||||||
|
* review and update this to match your needs.
|
||||||
* the shared bearer token between server and client ui
|
* the shared bearer token between server and client ui
|
||||||
|
|
||||||
* other builtin tool / function calls like calculator, javascript runner, DataStore dont require the
|
* other builtin tool / function calls like calculator, javascript runner, DataStore dont require the
|
||||||
|
|
@ -389,15 +393,15 @@ like
|
||||||
sessions by getting it to also create and execute mathematical expressions or code to verify
|
sessions by getting it to also create and execute mathematical expressions or code to verify
|
||||||
such stuff and so.
|
such stuff and so.
|
||||||
|
|
||||||
* access content from internet and augment the ai model's context with additional data as
|
* access content (including html, pdf, text based...) from local file system or the internet
|
||||||
needed to help generate better responses. this can also be used for
|
and augment the ai model's context with additional data as needed to help generate better
|
||||||
|
responses. This can also be used for
|
||||||
* generating the latest news summary by fetching from news aggregator sites and collating
|
* generating the latest news summary by fetching from news aggregator sites and collating
|
||||||
organising and summarising the same
|
organising and summarising the same
|
||||||
* searching for specific topics and summarising the results
|
* searching for specific topics and summarising the search results and or fetching and
|
||||||
|
analysing found data to generate summary or to explore / answer queries around that data ...
|
||||||
* or so
|
* or so
|
||||||
|
|
||||||
* one could also augment additional data / info by accessing text content from pdf files
|
|
||||||
|
|
||||||
* save collated data or generated analysis or more to the provided data store and retrieve
|
* save collated data or generated analysis or more to the provided data store and retrieve
|
||||||
them later to augment the analysis / generation then. Also could be used to summarise chat
|
them later to augment the analysis / generation then. Also could be used to summarise chat
|
||||||
session till a given point and inturn save the summary into data store and later retrieve
|
session till a given point and inturn save the summary into data store and later retrieve
|
||||||
|
|
@ -444,16 +448,18 @@ Either way always remember to cross check the tool requests and generated respon
|
||||||
* search_web_text - search for the specified words using the configured search engine and return the
|
* search_web_text - search for the specified words using the configured search engine and return the
|
||||||
plain textual content from the search result page.
|
plain textual content from the search result page.
|
||||||
|
|
||||||
|
* pdf2text - fetch/read specified pdf file and extract its textual content
|
||||||
|
* this depends on the pypdf python based open source library
|
||||||
|
|
||||||
the above set of web related tool calls work by handshaking with a bundled simple local web proxy
|
the above set of web related tool calls work by handshaking with a bundled simple local web proxy
|
||||||
(/caching in future) server logic, this helps bypass the CORS restrictions applied if trying to
|
(/caching in future) server logic, this helps bypass the CORS restrictions applied if trying to
|
||||||
directly fetch from the browser js runtime environment.
|
directly fetch from the browser js runtime environment.
|
||||||
|
|
||||||
* pdf2text - fetch/read specified pdf file and extract its textual content
|
Local file access is also enabled for web fetch and pdf tool calls, if one uses the file:/// scheme
|
||||||
|
in the url, so be careful as to where and under which user id the simple proxy will be run.
|
||||||
|
|
||||||
* local file access is enabled for this feature, so be careful as to where and under which user id
|
* one can always disable local file access by removing 'file' from the list of allowed.schemes in
|
||||||
the simple proxy will be run.
|
simpleproxy.json config file.
|
||||||
|
|
||||||
* this depends on the pypdf python based open source library
|
|
||||||
|
|
||||||
Implementing some of the tool calls through the simpleproxy.py server and not directly in the browser
|
Implementing some of the tool calls through the simpleproxy.py server and not directly in the browser
|
||||||
js env, allows one to isolate the core of these logic within a discardable VM or so, by running the
|
js env, allows one to isolate the core of these logic within a discardable VM or so, by running the
|
||||||
|
|
@ -463,7 +469,7 @@ Depending on the path specified wrt the proxy server, it executes the correspond
|
||||||
urltext path is used (and not urlraw), the logic in addition to fetching content from given url, it
|
urltext path is used (and not urlraw), the logic in addition to fetching content from given url, it
|
||||||
tries to convert html content into equivalent plain text content to some extent in a simple minded
|
tries to convert html content into equivalent plain text content to some extent in a simple minded
|
||||||
manner by dropping head block as well as all scripts/styles/footers/headers/nav blocks and inturn
|
manner by dropping head block as well as all scripts/styles/footers/headers/nav blocks and inturn
|
||||||
dropping the html tags.
|
also dropping the html tags. Similarly for pdf2text.
|
||||||
|
|
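The urltext behaviour described above can be approximated with Python's standard `html.parser` module. This is only a rough sketch under assumptions, not the proxy's actual code: the names `TextOnlyParser`, `html_to_text` and `SKIPPED_BLOCKS` are hypothetical, and the real logic may handle nesting and whitespace differently.

```python
from html.parser import HTMLParser

# Blocks whose whole content is dropped, following the description above.
SKIPPED_BLOCKS = {"head", "script", "style", "header", "footer", "nav"}


class TextOnlyParser(HTMLParser):
    """Hypothetical helper: collects text found outside the skipped blocks."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped block
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_BLOCKS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_BLOCKS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Tags themselves never reach this method, so they are dropped implicitly.
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the remaining text chunks, one per line."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `html_to_text(page_source)` on a fetched page would return only the body text that sits outside the dropped blocks.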
||||||
The client ui logic does a simple check to see if the bundled simpleproxy is running at specified
|
The client ui logic does a simple check to see if the bundled simpleproxy is running at specified
|
||||||
proxyUrl before enabling these web and related tool calls.
|
proxyUrl before enabling these web and related tool calls.
|
||||||
|
|
@ -475,7 +481,8 @@ The bundled simple proxy
|
||||||
|
|
||||||
* it provides for a basic white list of allowed domains to access, to be specified by the end user.
|
* it provides for a basic white list of allowed domains to access, to be specified by the end user.
|
||||||
This should help limit web access to a safe set of sites determined by the end user. There is also
|
This should help limit web access to a safe set of sites determined by the end user. There is also
|
||||||
a provision for shared bearer token to be specified by the end user.
|
a provision for shared bearer token to be specified by the end user. One could even control what
|
||||||
|
schemes are supported wrt the urls.
|
||||||
|
|
||||||
* it tries to mimic the client/browser making the request to it by propagating header entries like
|
* it tries to mimic the client/browser making the request to it by propagating header entries like
|
||||||
user-agent, accept and accept-language from the got request to the generated request during proxying
|
user-agent, accept and accept-language from the got request to the generated request during proxying
|
||||||
|
|
@ -572,13 +579,15 @@ users) own data or data of ai model.
|
||||||
|
|
||||||
Trap http response errors and inform user the specific error returned by ai server.
|
Trap http response errors and inform user the specific error returned by ai server.
|
||||||
|
|
||||||
Initial go at a pdf2text tool call. For now it allows local pdf files to be read and their text content
|
Initial go at a pdf2text tool call. It allows web / local pdf files to be read and their text content
|
||||||
extracted and passed to ai model for further processing, as decided by ai and end user.
|
extracted and passed to ai model for further processing, as decided by ai and end user. One could
|
||||||
|
either work with the full pdf or a subset of adjacent pages.
|
||||||
|
|
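Working with either the full pdf or a subset of adjacent pages could look roughly like the sketch below. It relies on pypdf's `PdfReader`, `reader.pages` and `extract_text()`; the function names, the 1-based `start_page` / `end_page` parameters and the use of -1 to mean "up to the last page" are assumptions made for illustration, not the tool call's actual interface.

```python
def page_indexes(total_pages: int, start_page: int = 1, end_page: int = -1) -> range:
    """Clamp a 1-based inclusive page window and return 0-based page indexes.

    A negative or too large end_page is treated as the last page (assumed convention).
    """
    if end_page < 0 or end_page > total_pages:
        end_page = total_pages
    start_page = max(1, start_page)
    return range(start_page - 1, end_page)


def pdf_to_text(source, start_page: int = 1, end_page: int = -1) -> str:
    """Extract text from the selected adjacent pages of a pdf path or file object."""
    # Imported lazily so that page_indexes stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(source)
    texts = []
    for index in page_indexes(len(reader.pages), start_page, end_page):
        # extract_text may yield nothing for image-only pages
        texts.append(reader.pages[index].extract_text() or "")
    return "\n".join(texts)
```

With these assumptions, `pdf_to_text("report.pdf")` would read the whole file, while `pdf_to_text("report.pdf", 3, 5)` would read only pages 3 to 5.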
||||||
SimpleProxy
|
SimpleProxy
|
||||||
* Convert from a single monolithic file into a collection of modules.
|
* Convert from a single monolithic file into a collection of modules.
|
||||||
* UrlValidator to cross check scheme and domain of requested urls,
|
* UrlValidator to cross check scheme and domain of requested urls,
|
||||||
the whitelist inturn picked from config json
|
the whitelist inturn picked from config json
|
||||||
|
* Helpers to fetch file from local file system or the web, transparently
|
||||||
|
|
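A scheme and domain cross check of the kind listed above might be sketched as follows. The `allowed.schemes` and `allowed.domains` keys are the ones named in this readme; everything else is an assumption, including the `validate_url` name, the sample values, and the rule that a host matches when it equals a listed domain or is a subdomain of it. The actual UrlValidator may match domains differently.

```python
from urllib.parse import urlparse

# Stand-in for the parsed simpleproxy.json; the values are illustrative only.
config = {
    "allowed.schemes": ["https"],
    "allowed.domains": ["example.com"],
}


def validate_url(url: str, cfg: dict) -> tuple[bool, str]:
    """Return (ok, reason) after checking the url's scheme and domain."""
    parts = urlparse(url)
    if parts.scheme not in cfg["allowed.schemes"]:
        return False, f"scheme not allowed: {parts.scheme!r}"
    if parts.scheme == "file":
        # Local files carry no domain, so only the scheme check applies.
        return True, "ok"
    host = parts.hostname or ""
    for domain in cfg["allowed.domains"]:
        # Assumed rule: exact match or a subdomain of a listed domain.
        if host == domain or host.endswith("." + domain):
            return True, "ok"
    return False, f"domain not allowed: {host!r}"
```

Because `file` is absent from the sample `allowed.schemes`, a `file:///` url is rejected here, which mirrors the advice about disabling local file access through the config.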
||||||
#### ToDo
|
#### ToDo
|
||||||
|
|
||||||
|
|
@ -594,8 +603,6 @@ same when saved chat is loaded.
|
||||||
|
|
||||||
MAYBE make the settings in general chat session specific, rather than the current global config flow.
|
MAYBE make the settings in general chat session specific, rather than the current global config flow.
|
||||||
|
|
||||||
Provide tool to allow for specified pdf files to be converted to equivalent plain text form, so that ai
|
|
||||||
can be used to work with the content in those PDFs.
|
|
||||||
|
|
||||||
### Debugging the handshake and beyond
|
### Debugging the handshake and beyond
|
||||||
|
|
||||||
|
|
|
||||||