Pass a list to keep track of the numbering at different depths,
as well as to delay incrementing the numbering till the last minute.
Don't let recursion go beyond a predefined limit.
This simple scheme doesn't work as-is. Rather, the pdf outline seems
to follow the logic below:
If a child list is found when processing the current list, don't
increment the numbering.
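A minimal sketch of the above, assuming the outline arrives as a
nested python list where each entry is either a title string or a
child list; names and the limit value here are illustrative:

    MAX_DEPTH = 8  # assumed predefined recursion limit

    def outline_to_text(entries, numbering=None, depth=0, out=None):
        """Flatten a nested outline into numbered lines like '1.2 Title'."""
        numbering = [] if numbering is None else numbering
        out = [] if out is None else out
        if depth >= MAX_DEPTH:
            return out
        numbering.append(0)
        for entry in entries:
            if isinstance(entry, list):
                # child list found: recurse without incrementing the
                # current numbering, per the observed pdf outline logic
                outline_to_text(entry, numbering, depth + 1, out)
            else:
                numbering[-1] += 1  # increment delayed till actual use
                out.append("{} {}".format(".".join(map(str, numbering)), entry))
        numbering.pop()
        return out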
To make it easier for the ai model to understand that this works
mainly for html pages and not, say, xml or pdf, the server service
path is renamed from urltext to htmltext. For those other formats,
one needs to use the other explicit tool calls provided, like
fetchpdftext or fetchxmltext.
SearchWebText is also updated to use htmltext now.
At the simpleproxy end
* Add the tag names hierarchy before the contents of a tag.
* Remember to convert the tagDrops to lower case, as the HTMLParser
  base class seems to do that by default (a rough sketch of these
  simpleproxy-end bits follows below).
At the client ui end
* If undefined, remember to pass an empty list wrt tagDrops.
* Clean up the func description and also mention possible tagDrops
  for RSS feeds in the tool meta.
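The sketch; the class and member names here are illustrative, not
the repo's exact ones:

    from html.parser import HTMLParser

    class TagHierarchyParser(HTMLParser):

        def __init__(self, tagDrops):
            super().__init__()
            # HTMLParser lowercases tag names before invoking the
            # handlers, so normalise the drop list up front
            self.tagDrops = [t.lower() for t in (tagDrops or [])]
            self.tagStack = []
            self.text = []

        def handle_starttag(self, tag, attrs):
            self.tagStack.append(tag)

        def handle_endtag(self, tag):
            if self.tagStack and self.tagStack[-1] == tag:
                self.tagStack.pop()

        def handle_data(self, data):
            if data.strip() and not any(t in self.tagDrops for t in self.tagStack):
                # tag names hierarchy placed before the tag's contents
                self.text.append("{}: {}".format(
                    ".".join(self.tagStack), data.strip()))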
Update the initial skeleton wrt the tag drops logic
* had forgotten to convert the object to a json string at the
  client end
* had confused js with python and tried accessing the dict
  elements using . notation rather than [] notation in python.
* if the id-filtered tag to be dropped is found, from then on
  track all other tags of the same type (independent of id),
  so that start and end tags can be matched. Because the end tag
  callback won't have attributes, all other tags of the same type
  need to be tracked, for proper winding and unwinding to find the
  matching end tag (a rough sketch follows this list).
* remember to reset the tracked drop tag type to None once the
  matching end tag at the same depth is found. This should avoid
  some unnecessary unwinding.
* set/fix the type wrt tagDrops explicitly to the needed depth and
  ensure the dummy one and any explicitly got one are of the right
  type.
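The sketch of the drop tracking described above, assuming tagDrops
entries of the form "div#someid" (tag plus optional id); names are
hypothetical:

    from html.parser import HTMLParser

    class DropTrackingParser(HTMLParser):

        def __init__(self, tagDrops):
            super().__init__()
            self.tagDrops = [t.lower() for t in (tagDrops or [])]
            self.dropTag = None  # tag type currently being dropped
            self.dropDepth = 0   # nesting count of that tag type
            self.text = []

        def _matches_drop(self, tag, attrs):
            tagId = dict(attrs).get("id") or ""
            return tag in self.tagDrops or "{}#{}".format(tag, tagId) in self.tagDrops

        def handle_starttag(self, tag, attrs):
            if self.dropTag is not None:
                # already dropping: end tag callbacks carry no attributes,
                # so every tag of the same type (independent of id) has to
                # be tracked to wind/unwind to the matching end tag
                if tag == self.dropTag:
                    self.dropDepth += 1
                return
            if self._matches_drop(tag, attrs):
                self.dropTag = tag
                self.dropDepth = 1

        def handle_endtag(self, tag):
            if self.dropTag is not None and tag == self.dropTag:
                self.dropDepth -= 1
                if self.dropDepth == 0:
                    # matching end tag at the same depth found: reset,
                    # avoiding unnecessary unwinding later
                    self.dropTag = None

        def handle_data(self, data):
            if self.dropTag is None and data.strip():
                self.text.append(data.strip())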
Tested with the duckduckgo search engine; the unneeded div based
header is now avoided in the returned search result.
Rename search-drops to urltext-tag-drops, to indicate its more
generic semantics. Search drops specified in the UI by the user
will be mapped to the urltext-tag-drops header entry of a urltext
web fetch request.
Implement a crude urltext-tag-drops logic in TextHtmlParser.
If there is any mismatch between opening and closing tags in the
html being parsed, and in turn wrt the type of tag being targeted
for dropping, things can mess up.
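For illustration, the client-to-server mapping might look as below;
the endpoint path, port and the json encoding of the header value
are assumptions, not the repo's confirmed wire format:

    import json
    import urllib.request

    searchDrops = ["div#header", "nav"]  # drops specified by the user in the UI
    req = urllib.request.Request("http://127.0.0.1:8088/urltext?url=https%3A%2F%2Fexample.com")
    # UI search drops land in the urltext-tag-drops header entry
    req.add_header("urltext-tag-drops", json.dumps(searchDrops))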
Rename the path and tags/identifiers from Pdf2Text to PdfText.
Rename the function call to pdf_to_text; this should also help
indicate the semantics more unambiguously, just in case, especially
for smaller models.
Added logic to help get a file from either the local file system
or from the web, based on the url specified.
Update the pdfmagic module to use the same, so that it can support
both local as well as web based pdfs.
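A minimal sketch of that logic, assuming file:// urls (or plain
paths) map to local reads and http(s) urls to web fetches; the
helper name is hypothetical:

    import urllib.parse
    import urllib.request

    def get_file_data(url):
        """Return the raw bytes for url, from the local fs or the web."""
        parts = urllib.parse.urlparse(url)
        if parts.scheme in ("", "file"):
            with open(parts.path or url, "rb") as f:
                return f.read()
        with urllib.request.urlopen(url) as resp:
            return resp.read()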
Bring in the debug module, which I had forgotten to commit after
moving the debug helper code from simpleproxy.py to it.
Also move the debug dump helper to its own module.
Also remember to specify the class name in quotes, similar to
referring to a class within a member of the class, wrt python type
checking.
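i.e. the usual python forward-reference quoting (the class name
here is illustrative):

    class DebugDump:
        def clone(self) -> "DebugDump":
            # quotes needed: the name DebugDump is not yet bound while
            # the class body itself is being evaluated
            ...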
Add the --allowed.schemes config entry as a needed config.
Set up the url validator.
Use this wrt urltext, urlraw and pdf2text.
This allows the user to control whether local file access is
enabled or not. By default, in the sample simpleproxy.json config
file, local file access is allowed.
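A minimal sketch of the validator, assuming allowed.schemes is a
list like ["http", "https", "file"]; names and the error handling
style are illustrative:

    import urllib.parse

    def validate_url(url, allowedSchemes, allowedDomains):
        parts = urllib.parse.urlparse(url)
        if parts.scheme not in allowedSchemes:
            raise ValueError("scheme not allowed: {}".format(parts.scheme))
        # file urls carry no network location, so the domain check
        # applies only to web schemes
        if parts.scheme != "file" and parts.hostname not in allowedDomains:
            raise ValueError("domain not allowed: {}".format(parts.hostname))
        return parts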
Make the description a bit more explicit about supporting local
file paths as part of the url scheme, as the tested ai model was
cribbing about the file url scheme not being supported. Need to
check if this new description makes things better.
Convert the text to bytes for writing to the http pipe.
Ensure CORS is kept happy by passing Access-Control-Allow-Origin
in the header.
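Along these lines, assuming a http.server BaseHTTPRequestHandler;
the wildcard origin is an assumption:

    def send_text(handler, text):
        data = text.encode("utf-8")  # the http pipe wants bytes, not str
        handler.send_response(200)
        handler.send_header("Content-Type", "text/plain; charset=utf-8")
        # keep CORS happy for the browser based client ui
        handler.send_header("Access-Control-Allow-Origin", "*")
        handler.send_header("Content-Length", str(len(data)))
        handler.end_headers()
        handler.wfile.write(data)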
Add a newline between name and content in the xml representation
of the tool response, so that it is easier to distinguish things.
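Purely to illustrate the newline placement; the actual tag and
variable names of the tool response are not shown here:

    name = "urltext"
    content = "the fetched page text ..."
    xmlResp = "<result>\n{}\n{}\n</result>".format(name, content)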
Add github, linkedin and apnews domains to allowed.domains for
simpleproxy.py
Instead of using the shared bearer token as is, hash it with the
current year and use the hash.
Keep the /aum path out of the auth check.
In future the bearer token could be transformed more often, as well
as with an additional nonce/dynamic token got from the server during
the initial /aum handshake, as also a running counter and so on ...
NOTE: All this circus is not good enough, given that currently the
simpleproxy.py handshakes work over http. However these skeletons
are put in place for the future, if needed.
TODO: There is a once in a blue moon race when the year transitions
between the client generating the request and the server handling it.
But otherwise year transitions don't matter, because the client
always creates a fresh token, and the server checks for a year
change to generate a fresh token if required.
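A minimal sketch of the year-salted token, assuming sha256 over a
simple concatenation; the actual transform in simpleproxy.py may
differ:

    import datetime
    import hashlib

    def bearer_token(sharedToken, year=None):
        # client always computes this fresh; server caches and
        # regenerates on a year change, hence the rare boundary
        # race noted in the TODO above
        if year is None:
            year = datetime.datetime.now().year
        raw = "{}{}".format(sharedToken, year).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()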
Add a config entry called bearer.insecure which will contain a
token used for bearer auth of http requests.
Make bearer.insecure and allowed.domains needed configs, and exit
the program if they aren't got through the cmdline or config file.
Ensure load_config gets called on encountering --config in the
cmdline, so that the user has control over whether the cmdline or
the config file decides the final value of any given parameter.
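A sketch of the ordering: --config is acted upon at the position it
appears, so whichever comes later on the cmdline wins; the json
config format follows the sample simpleproxy.json, other names are
illustrative:

    import json

    def load_config(path, config):
        with open(path) as f:
            config.update(json.load(f))

    def process_cmdline(args, config):
        i = 0
        while i < len(args):
            cmd, val = args[i], args[i + 1]
            if cmd == "--config":
                load_config(val, config)  # applied right here, not deferred
            else:
                config[cmd[2:]] = val  # strip the leading --
            i += 2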
Ensure that str type values in the cmdline are picked up directly,
without running them through ast.literal_eval, because otherwise one
would have to ensure, through the cmdline arg mechanism, that the
string quotes are retained for literal_eval.
Have the """ function note/description immediately below the def
line, so that it is interpreted as a function docstring.
Now both follow a similar mechanism and do the following
(a rough sketch follows this list):
* exit on finding any issue, so that things are in a known
  state from a usage perspective, without any confusion/overlook
* check whether the cmdlineArgCmd/configCmd being processed is a
  known one or not
* check that the value of the cmd is of the expected type
* have a generic flow which can accommodate more cmds in future
  in a simple way
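A generic flow along the above lines, with the docstring immediately
below the def line as noted earlier; the knownCmds table and its
entries are assumptions:

    import ast
    import sys

    knownCmds = {"bearer.insecure": str, "allowed.domains": list, "debug": bool}

    def process_cmd(cmd, value):
        """Validate one cmdlineArgCmd/configCmd entry; exit on any issue."""
        if cmd not in knownCmds:
            print("ERRR: unknown cmd", cmd)
            sys.exit(1)  # exit so things remain in a known state
        expected = knownCmds[cmd]
        if expected is str:
            # str values are used as-is; ast.literal_eval would force one
            # to retain quotes through the cmdline arg mechanism
            parsed = value
        else:
            parsed = ast.literal_eval(value) if isinstance(value, str) else value
        if not isinstance(parsed, expected):
            print("ERRR: {} expects a {}".format(cmd, expected.__name__))
            sys.exit(1)
        return parsed

Accommodating a future cmd then only needs a new knownCmds entry.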
Bing raised a challenge for chrome triggered search requests after
a few requests, which were spread a few minutes apart, while still
seemingly allowing wget based search to continue (again spread a
few minutes apart).
Added a simple helper to trace this; use --debug True to enable
the same.
Mimicking the got req in the generated req helps with duckduckgo
also, and not just yahoo.
Also update allowed.domains to allow a url generated by the ai when
trying to access bing's news aggregation url.
Tag the messages wrt ValidateUrl and UrlReq.
Also dump the req.
Move the check for --allowed.domains to ValidateUrl.
NOTE: Also, with mimicking of the user agent et al from the got
request to the generated request, yahoo search/news is returning
results now, instead of the bland error before (a rough sketch of
the mimicking follows).
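The sketch, assuming the got request comes via http.server (so
gotHeaders is a mapping-like object) and urllib drives the generated
request; the copied header subset is an assumption:

    import urllib.request

    def url_req(gotHeaders, url):
        req = urllib.request.Request(url)
        # mimic the browser-ish identity of the got request in the
        # generated request; helps yahoo and duckduckgo respond properly
        for key in ("User-Agent", "Accept", "Accept-Language"):
            if gotHeaders.get(key):
                req.add_header(key, gotHeaders[key])
        return urllib.request.urlopen(req)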