Rename search-drops to urltext-tag-drops, to indicate its more
generic semantics. Search drops specified in the UI by the user
will now be mapped to the urltext-tag-drops header entry of a
urltext web fetch request.
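A rough sketch of how such a mapping could look, assuming a simple
comma separated tag:id encoding for the header value (the encoding
and helper name here are illustrative, not necessarily what the
code uses):

    # Illustrative only: build the urltext-tag-drops header entry
    # from the drops specified by the user in the UI.
    def build_tag_drops_header(drops):
        """drops: list of (tag, id) pairs specified in the UI."""
        value = ",".join(f"{tag}:{ident}" for tag, ident in drops)
        return {"urltext-tag-drops": value}

    headers = build_tag_drops_header([("div", "sidebar"), ("div", "footer")])
    # -> {"urltext-tag-drops": "div:sidebar,div:footer"}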
Implement a crude urltext-tag-drops logic in TextHtmlParser.
If there is any mismatch between opening and closing tags in the
html being parsed, in particular for the type of tag being targeted
for dropping, things can get messed up.
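The crude-ness meant here is plain depth counting on the targeted
tag, roughly along these lines (a minimal sketch using Python's
html.parser purely for illustration; the actual TextHtmlParser is
part of the web ui code). An unmatched open or close of the
targeted tag skews the depth counter, which is the mess up noted
above:

    from html.parser import HTMLParser

    class TagDropParser(HTMLParser):
        """Crude text extractor that drops a (tag, id) subtree by depth counting."""
        def __init__(self, drop_tag, drop_id):
            super().__init__()
            self.drop_tag, self.drop_id = drop_tag, drop_id
            self.drop_depth = 0    # >0 while inside the subtree being dropped
            self.text = []

        def handle_starttag(self, tag, attrs):
            if self.drop_depth > 0:
                if tag == self.drop_tag:
                    self.drop_depth += 1
            elif tag == self.drop_tag and dict(attrs).get("id") == self.drop_id:
                self.drop_depth = 1

        def handle_endtag(self, tag):
            # a missing/extra close of drop_tag throws this off - the crude part
            if self.drop_depth > 0 and tag == self.drop_tag:
                self.drop_depth -= 1

        def handle_data(self, data):
            if self.drop_depth == 0:
                self.text.append(data)

    p = TagDropParser("div", "sidebar")
    p.feed('<p>keep</p><div id="sidebar"><p>drop</p></div><p>keep too</p>')
    print("".join(p.text))   # -> keepkeep too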
Allow the web tools handshake helper to pass additional header
entries provided by its caller.
Make use of this to send a list of tag and id pairs for the web
search tool, which will be used to drop divs matching the specified
ids.
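A hypothetical sketch of the helper change, with the endpoint path,
port and names assumed for illustration only: caller supplied
header entries get merged into the request sent to simpleproxy.py.

    import urllib.parse
    import urllib.request

    def proxy_handshake(proxy_base, path, url, token, extra_headers=None):
        """Fetch url through the given simpleproxy.py endpoint, merging any
        caller provided extra header entries (eg urltext-tag-drops)."""
        headers = {"Authorization": f"Bearer {token}"}
        if extra_headers:
            headers.update(extra_headers)
        full = f"{proxy_base}/{path}?url={urllib.parse.quote(url, safe='')}"
        req = urllib.request.Request(full, headers=headers)
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8", errors="replace")

    # web search tool call passing tag/id pairs so matching divs get dropped
    # text = proxy_handshake("http://127.0.0.1:3128", "urltext",
    #                        "https://html.duckduckgo.com/html/?q=llama",
    #                        "sharedtoken",
    #                        {"urltext-tag-drops": "div:header,div:footer"})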
Rename the path and tags/identifiers from Pdf2Text to PdfText.
Rename the function call to pdf_to_text; this should also help
indicate the semantics more unambiguously, just in case, especially
for smaller models.
This makes the logic more generic, and also prepares for additional
parameters to be passed to the simpleproxy.py helper handshakes.
Ex: restricting the extracted contents of a pdf to specified start
and end page numbers or the like.
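For example, the pdf_to_text tool declaration could later grow
optional page range parameters along these lines (a sketch only;
start_page/end_page are assumed names, not current code):

    pdf_to_text_tool = {
        "type": "function",
        "function": {
            "name": "pdf_to_text",
            "description": "Extract plain text from a pdf given its url "
                           "(http://, https:// or file:// based local path).",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "url of the pdf"},
                    # possible future additions (illustrative names)
                    "start_page": {"type": "integer", "description": "first page to extract"},
                    "end_page": {"type": "integer", "description": "last page to extract"},
                },
                "required": ["url"],
            },
        },
    }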
Needed to tweak the description further for the ai model to
understand that it is ok to pass file:// scheme based urls.
Had forgotten how big web site pages have become, as also the need
for a larger ResultDataLength for a one shot PDF read, so as to get
at least a good enough amount of content from large pdfs.
Allow the user to limit the max amount of result data returned to
the ai after a tool call.
It is set to 2K by default.
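A minimal sketch of the trimming implied, assuming a user settable
limit defaulting to 2K (names here are illustrative):

    RESULT_DATA_LENGTH_DEFAULT = 2 * 1024   # 2K default

    def trim_tool_result(text, max_len=RESULT_DATA_LENGTH_DEFAULT):
        """Limit the tool call result data passed back to the ai model."""
        return text if len(text) <= max_len else text[:max_len]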
Update the pdf2text tool description to try to make the local file
path support more explicit.
Make the description a bit more explicit about supporting local
file paths through the file:// url scheme, as the tested ai model
was complaining that the file url scheme is not supported. Need to
check whether this new description makes things better.
Convert the text to bytes before writing to the http pipe.
Keep CORS happy by passing Access-Control-Allow-Origin in the
response header.
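In a python http.server based helper like simpleproxy.py, that
boils down to something along these lines (a sketch, not the exact
code):

    from http.server import BaseHTTPRequestHandler

    class ProxyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            text = "extracted text ..."
            data = text.encode("utf-8")     # convert text to bytes for the pipe
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Access-Control-Allow-Origin", "*")  # keep CORS happy
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)          # bytes, not str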
Instead of using the shared bearer token as is, hash it with the
current year and use the hash.
Keep the /aum path out of the auth check.
In future the bearer token could be transformed more often, as well
as mixed with an additional nonce/dynamic token got from the server
during the initial /aum handshake, as also a running counter and so
on ...
NOTE: All this circus is not good enough, given that currently the
simpleproxy.py handshakes work over plain http. However these
skeletons are put in place for the future, if needed.
TODO: There is a once in a blue moon race when the year transitions
between the client generating the request and the server handling
it. Otherwise year transitions do not matter, because the client
always creates a fresh token, and the server checks for a year
change to generate a fresh token if required.
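A sketch of the year salted token idea, with hashlib assumed on
both ends (the actual hash construction in the code may differ):

    import datetime
    import hashlib

    def yearly_token(shared_bearer):
        """Hash the shared bearer token with the current year; the hash
        goes on the wire instead of the token itself."""
        year = str(datetime.date.today().year)
        return hashlib.sha256((shared_bearer + year).encode("utf-8")).hexdigest()

    def check_auth(path, got_token, shared_bearer):
        if path == "/aum":    # handshake path kept out of the auth check
            return True
        return got_token == yearly_token(shared_bearer)

Recomputing on every check, as in this sketch, is one way to avoid
explicit fresh-token regeneration on the server; the once in a blue
moon race at the year boundary noted above remains either way.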
Avoid code duplication by creating helpers for setup and toolcall.
Also send an indication of the path that will be used, when checking
at runtime setup whether the simpleproxy.py server is running.
Initial go at implementing a web search tool call, which uses the
existing UrlText support of the bundled simpleproxy.py.
It allows the user to control which search engine is used, by
letting them set the search engine url template.
The logic comes with search engine url template strings for
duckduckgo, brave, bing and google, with duckduckgo set by default.
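The template strings in question follow roughly this shape, with
the search query substituted into a placeholder; the exact
placeholder syntax and urls below are illustrative, with duckduckgo
as the default:

    from urllib.parse import quote_plus

    SEARCH_URL_TEMPLATES = {
        "duckduckgo": "https://html.duckduckgo.com/html/?q={query}",
        "brave": "https://search.brave.com/search?q={query}",
        "bing": "https://www.bing.com/search?q={query}",
        "google": "https://www.google.com/search?q={query}",
    }
    DEFAULT_SEARCH_ENGINE = "duckduckgo"

    def search_url(query, engine=DEFAULT_SEARCH_ENGINE):
        return SEARCH_URL_TEMPLATES[engine].format(query=quote_plus(query))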