Define a typealias for HttpHeaders and use it wherever needed.
In turn map this to email.message.Message and dict for now.
If and when python evolves the Http Headers type into a better
one, it will need replacing in only one place.
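A minimal sketch of the idea, assuming the alias is the union of the
two types currently in play:

    from email.message import Message
    from typing import Union

    # single place to update, if python ever grows a dedicated
    # HTTP headers type
    HttpHeaders = Union[Message, dict]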
Add a ToolManager class which
* maintains the list of tool calls and in turn allows any given
  tool call to be executed and its response returned along with the
  needed metadata
* generates the overall tool calls metadata
* adds ToolCallResponseEx, which maintains the full TCOutResponse for
  use by tc_handle callers
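A rough sketch of how such a manager might hang together; the method
names here are assumptions, and the response field merely stands in
for the full TCOutResponse:

    from dataclasses import dataclass, field
    from typing import Any, Callable

    @dataclass
    class ToolCallResponseEx:
        # keeps the full tool call output response around for
        # tc_handle callers, alongside any needed metadata
        name: str
        response: Any
        meta: dict = field(default_factory=dict)

    class ToolManager:
        def __init__(self):
            self.tools: dict[str, tuple[Callable, dict]] = {}

        def register(self, name: str, func: Callable, meta: dict):
            self.tools[name] = (func, meta)

        def metas(self) -> list[dict]:
            # overall tool calls metadata, e.g. for the client handshake
            return [meta for _, meta in self.tools.values()]

        def run(self, name: str, **kwargs) -> ToolCallResponseEx:
            func, meta = self.tools[name]
            return ToolCallResponseEx(name, func(**kwargs), meta)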
Avoid duplicating the handling of some of the basic required http
header entries.
Move the check for any dependencies, before enabling a tool call,
into the respective tc??? module.
* for now this also demotes the logic from the previous fine-grained
  per-tool-call dependency check to a more global dependency check at
  the respective module level, as sketched below
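A hedged sketch of the module-level gate; the imported dependency and
flag name are purely illustrative:

    # at the top of a tc??? module: one global check instead of the
    # earlier per-tool-call dependency probing
    try:
        import pypdf  # whatever this particular module depends on
        gbModuleEnabled = True
    except ImportError:
        gbModuleEnabled = False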
Will be looking at changing the handshake between the AnveshikaSallap
web tech based client logic and this tool calls server to follow
the emerging interoperable MCP standard.
Also remember to pick the tagDropREs from the passed args object and
not from the received http header.
Even TCHtmlText is updated to get the tags to drop from the passed
args object and not the received http header. In turn allow the ai
to pass this optional arg as it sees fit, in coordination with the
user.
Instead of manually setting up rfile and wfile after switching to
ssl mode wrt a client request, now use the builtin setup provided
by the RequestHandler logic, so that these and anything else required
get set up as needed on the new socket from the ssl handshake, just
in case new things are needed in future.
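A minimal sketch of the idea, assuming a BaseHTTPRequestHandler
subclass with an ssl context installed on it at startup:

    import ssl
    from http.server import BaseHTTPRequestHandler

    class ProxyHandler(BaseHTTPRequestHandler):
        ssl_ctx: ssl.SSLContext  # assumed set on the class at startup

        def setup(self):
            # the tls handshake happens inside wrap_socket(); the
            # base class setup() then builds rfile/wfile (and anything
            # future versions may need) from the wrapped socket
            self.request = self.ssl_ctx.wrap_socket(
                self.request, server_side=True)
            super().setup()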
Minimal skeleton to allow dict-style [] access to a dataclass-based
class's attributes/fields, along with a get member function similar
to dict's. This simplifies the flow and avoids duplicating data
between the attribute and dict related name and data spaces.
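A sketch of that skeleton; the mixin name is an assumption:

    from dataclasses import dataclass

    class AttrDictMixin:
        # forward [] access and get() straight to the attributes,
        # so there is only ever one copy of the data
        def __getitem__(self, key):
            return getattr(self, key)

        def __setitem__(self, key, value):
            setattr(self, key, value)

        def get(self, key, default=None):
            return getattr(self, key, default)

    @dataclass
    class Example(AttrDictMixin):
        host: str = "localhost"
        port: int = 8080

    # Example()["port"] and Example().get("port") both give 8080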
Add a helper base class to try to map a dataclass's attributes into
an underlying dict.
TODO: this potentially duplicates data in both the normal attribute
space and the dict items space, and will require additional standard
helper logic to be overridden to keep both spaces in sync, et al.
Rather, having been away from python internals for a long time now,
on pausing and thinking a bit, it is better to move to a simpler
arch where the attributes are directly worked on for the dict-style
[] access.
Instead of maintaining the config and some of the runtime states,
identified as gMe, as a generic literal dictionary which grows at
runtime with fields as required, try creating it as a class of
classes. In turn use the dataclass annotation to let the boilerplate
code get auto generated.
A config module is created with the above; however the remaining
part of the code is not yet updated to work with this new structure.
process_args and load_config are moved into the new Config class.
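A hedged sketch of the class-of-classes shape; the fields and json
layout here are invented for illustration, not the real gMe contents:

    import json
    from dataclasses import dataclass, field

    @dataclass
    class ServerConfig:
        host: str = "127.0.0.1"
        port: int = 8888

    @dataclass
    class Config:
        server: ServerConfig = field(default_factory=ServerConfig)
        allowed_schemes: list = field(
            default_factory=lambda: ["http", "https"])

        def load_config(self, path: str):
            # previously a free function, now owned by Config
            with open(path) as f:
                data = json.load(f)
            self.server = ServerConfig(**data.get("server", {}))
            self.allowed_schemes = data.get(
                "allowed.schemes", self.allowed_schemes)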
Otherwise the aum path was not handled immediately wrt exceptions.
This also ensures any future changes wrt get request handling
get handled immediately wrt exceptions that may be missed by any
targeted exception handling.
Given that the default HTTPServer handles only one connection, and in
turn one request, at any given time, if a client opens a connection
and then doesn't do anything with it, it will block other clients by
leaving their requests in the network queue for a long time.
So to overcome the above issue switch to ThreadingHTTPServer, which
starts a new thread for each request.
Given that previously the ssl wrapping was done on the main server
socket, even after switching to ThreadingHTTPServer the ssl/tls
handshake still occurs in the main thread before a child thread is
started for parallel request handling, so the ssl handshake phase
blocks other client requests.
So now avoid wrapping the main server socket in ssl; instead wait for
ThreadingHTTPServer to start the new thread for a client request, i.e.
after a connection is accepted for the client, before trying to wrap
the connection in ssl. This ensures that the ssl handshake occurs in
this child (i.e. client request related) thread, so some rogue entity
opening a http connection and not doing the ssl handshake won't block
others. In turn, in this case, the rfile and wfile instances within
the proxy handler need to be remapped to the new ssl wrapped socket.
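A sketch of this per-thread wrap with the manual rfile/wfile remap
described here (the cleanup noted earlier in this log later delegates
the remap to the base class's own setup()); the cert paths are
illustrative:

    import ssl
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain("server.crt", "server.key")

    class ProxyHandler(BaseHTTPRequestHandler):
        def handle(self):
            # runs in the per-request thread, so the tls handshake
            # no longer blocks the accept loop in the main thread
            self.connection = ctx.wrap_socket(
                self.connection, server_side=True)
            # remap rfile/wfile to the new ssl wrapped socket
            self.rfile = self.connection.makefile("rb", -1)
            self.wfile = self.connection.makefile("wb", 0)
            super().handle()

    # note: the listening socket itself is no longer ssl wrapped
    ThreadingHTTPServer(("0.0.0.0", 8443), ProxyHandler).serve_forever()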
Pass a list to keep track of the numbering at different depths,
as well as to delay incrementing the numbering to the last minute.
Don't let recursion go beyond a predefined limit.
This simple scheme doesn't work. Rather, the pdf outline seems
to follow the logic below:
if a child list is found when processing the current list, don't
increment the numbering.
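A hedged sketch of that walk; the nested-list outline shape is the
one pypdf hands back, and MAX_DEPTH plus the names are illustrative:

    MAX_DEPTH = 8  # predefined recursion limit

    def walk_outline(outline, nums=None, depth=0):
        if depth > MAX_DEPTH:
            return
        nums = (nums or []) + [0]
        for item in outline:
            if isinstance(item, list):
                # child list found: recurse without incrementing
                # the current level's numbering
                walk_outline(item, nums, depth + 1)
            else:
                nums[-1] += 1  # increment only when emitting an entry
                print(".".join(map(str, nums)),
                      getattr(item, "title", item))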
To make it easier for the ai model to understand that this works
mainly for html pages, and not say xml or pdf, the server service
path is renamed from urltext to htmltext. For those, one needs to
use the other explicit tool calls provided, like fetchpdftext or
fetchxmltext.
SearchWebText is also updated to use htmltext now.
At the simpleproxy end
* Add the tag names hierarchy before the contents of a tag
* Remember to convert the tagDrops to lower case, as the HTMLParser
  base class seems to do that by default (see the sketch after this
  list).
At the client ui end
* if undefined, remember to pass an empty list wrt tagDrops.
* clean up the func description and also mention possible tagDrops
  for RSS feeds in the tool meta
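A one-liner sketch of those two reminders; the args shape is an
assumption:

    # HTMLParser reports tag names in lower case, so match it,
    # and fall back to an empty list when tagDrops is undefined
    tag_drops = [tag.lower() for tag in args.get("tagDrops", [])]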
Update the initial skeleton wrt the tag drops logic
* had forgotten to convert the object to a json string at the client
  end
* had confused js and python, and tried accessing the dict elements
  using . notation rather than [] notation in python.
* if the id-filtered tag to be dropped is found, from then on track
  all other tags of the same type (independent of id), so that start
  and end tags can be matched. Because the end tag callback won't
  carry attributes, all other tags of the same type need to be
  tracked, for proper winding and unwinding to find the matching
  end tag.
* remember to reset the tracked drop tag type to None once the
  matching end tag at the same depth is found. This should avoid
  some unnecessary unwinding.
* set/fix the type wrt tagDrops explicitly to the needed depth and
  ensure the dummy one and any explicitly received one are of the
  right type.
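A rough sketch of that winding/unwinding scheme, assuming an
html.parser.HTMLParser subclass; the id-based filter and names are
illustrative:

    from html.parser import HTMLParser

    class TextHtmlParser(HTMLParser):
        def __init__(self, drop_id: str):
            super().__init__()
            self.drop_id = drop_id
            self.drop_tag = None  # tag type currently being dropped
            self.drop_depth = 0
            self.text = []

        def handle_starttag(self, tag, attrs):
            if self.drop_tag is None:
                if dict(attrs).get("id") == self.drop_id:
                    self.drop_tag = tag
                    self.drop_depth = 1
            elif tag == self.drop_tag:
                # end tags carry no attributes, so every tag of the
                # same type has to be counted for depth matching
                self.drop_depth += 1

        def handle_endtag(self, tag):
            if self.drop_tag == tag:
                self.drop_depth -= 1
                if self.drop_depth == 0:
                    self.drop_tag = None  # reset once matched

        def handle_data(self, data):
            if self.drop_tag is None:
                self.text.append(data)

    p = TextHtmlParser("header")
    p.feed('<div id="header">skip<div>skip</div></div><p>keep</p>')
    print("".join(p.text))  # -> keep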
Tested with the duckduckgo search engine; the div based unneeded
header is now avoided in the returned search result.
Rename search-drops to urltext-tag-drops, to indicate its more
generic semantic. Rather, the search drops specified in the UI by
the user will be mapped to the urltext-tag-drops header entry of a
urltext web fetch request.
Implement a crude urltext-tag-drops logic in TextHtmlParser.
If there is any mismatch between the opening and closing tags in the
html being parsed, wrt the type of tag being targeted for dropping,
things can mess up.
Rename the path and tags/identifiers from Pdf2Text to PdfText.
Rename the function call to pdf_to_text; this should also help
indicate the semantic more unambiguously, just in case, especially
for smaller models.
Added logic to help get a file from either the local file system
or from the web, based on the url specified.
Update the pdfmagic module to use the same, so that it can support
both local as well as web based pdfs.
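A hedged sketch of that helper's shape; the name get_file is an
assumption:

    import urllib.parse
    import urllib.request

    def get_file(url: str) -> bytes:
        parsed = urllib.parse.urlparse(url)
        if parsed.scheme in ("", "file"):
            # bare paths and file:// urls read from the local fs
            with open(parsed.path or url, "rb") as f:
                return f.read()
        with urllib.request.urlopen(url) as resp:
            return resp.read()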
Bring in the debug module, which I had forgotten to commit after
moving the debug helper code from simpleproxy.py to the debug module.
Also move the debug dump helper to its own module.
Also remember to specify the class name in quotes, similar to
referring to a class within a member of that class, wrt python type
checking.
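The quoting point in a nutshell, with a hypothetical class:

    class Example:
        # Example is not yet bound while the class body executes,
        # so the hint must be the quoted string 'Example'
        def clone(self) -> 'Example':
            return Example()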
Add an --allowed.schemes config entry as a needed config.
Set up the url validator.
Use this wrt urltext, urlraw and pdf2text.
This allows the user to control whether local file access is enabled
or not. By default, in the sample simpleproxy.json config file,
local file access is allowed.
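A sketch of the validator idea; the list contents and names are
assumptions based on the description:

    import urllib.parse

    # from the --allowed.schemes config entry; dropping "file"
    # disables local file access
    gAllowedSchemes = ["http", "https", "file"]

    def validate_url(url: str) -> bool:
        return urllib.parse.urlparse(url).scheme in gAllowedSchemes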
Make the description a bit more explicit about supporting local
file paths as part of the url scheme, as the tested ai model was
cribbing about the file url scheme not being supported. Need to
check if this new description makes things better.
Convert the text to bytes for writing to the http pipe.
Ensure CORS is kept happy by passing Access-Control-Allow-Origin in
the header.
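A minimal sketch of both points, assuming a BaseHTTPRequestHandler
context and a hypothetical send_text helper:

    from http.server import BaseHTTPRequestHandler

    class ProxyHandler(BaseHTTPRequestHandler):
        def send_text(self, text: str):
            self.send_response(200)
            # keep the browser side client's CORS checks happy
            self.send_header("Access-Control-Allow-Origin", "*")
            self.end_headers()
            self.wfile.write(text.encode("utf-8"))  # bytes, not str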