Commit Graph

73 Commits

Author SHA1 Message Date
hanishkvc efcd81aa2f SimpleChatTCRV:SimpleProxy:DumpHeaders 2025-12-04 19:41:39 +05:30
hanishkvc 9484bea71a SimpleChatTC:PdfText:Basic Outline and its Numbering done
Pass a list to keep track of the numbering at different depths
as well as to delay incrementing the numbering to the last min

Dont let recursion go beyond a predefined limit
2025-12-04 19:41:39 +05:30
hanishkvc 15e99843db SimpleChatTC:PdfText:Numbering T2 - Need diff scheme
This increaments before itself, but we need to increment after
2025-12-04 19:41:39 +05:30
hanishkvc bd60437cc6 SimpleChatTC:PdfText: Numbering T1 - Diff Scheme needed
This simple scheme doesnt work. Rather the pdf outline seems
to follow below logic

If a child list is found when processing the current list, dont
increment the numbering.
2025-12-04 19:41:39 +05:30
hanishkvc 51707b5169 SimpleChatTC:PdfText:Add initial skeleton for outline 2025-12-04 19:41:39 +05:30
hanishkvc 0628226ea1 SimpleChatTC:XmlFiltered: Avoid showing skipped tags as no content
Dont even insert skipped tags as tag blocks with empty content.

This should make the resultant xml cleaner and make it use less
space.
2025-12-04 19:41:39 +05:30
hanishkvc 143f9c0b1a SimpleChatTC:Rename fetch_web_url_text to fetch_html_text
To make it easier for the ai model to understand that this works
mainly for html pages and not say xml or pdf or so. For those
one needs to use other explict tool calls provided like fetchpdftext
or fetchxmltext or so

The server service path renamed from urltext to htmltext.

SearchWebText also updated to use htmltext now
2025-12-04 19:41:39 +05:30
hanishkvc 9f5c3d7776 SimpleChatTC:XmlFiltered: Use re with heirarchy of tags to filter
Rename xmltext to xmlfiltered.

This simplifies the filtering related logic as well as gives more
fine grained flexibility wrt filtering bcas of re.
2025-12-04 19:41:39 +05:30
hanishkvc 9ed1cf9886 SimpleChatTC:XMLFiltered: Retain xml tags with selective dropping
instead of the prefixing of tag heirarchy retain the xml structure
while parallely allowing unwanted tags and their contents to be
dropped.
2025-12-04 19:41:39 +05:30
hanishkvc b8bb258dd5 SimpleChatTC:XmlText: Cleanup initial go
At simpleproxy end

* Add the tag names hierarchy before contents of a tag

* Remember to convert the tagDrops to small case as HTMLParser base
  class seems to do that by default.

At the client ui end

* if undefined remember to pass a empty list wrt tagDrops.

* cleanup the func description and also mention possible tagDrops
  for RSS feeds in the tool meta
2025-12-04 19:41:39 +05:30
hanishkvc 92b0dd7d36 SimpleChatTC:SimpleProxy:XMLText: initial go
Take the existing urltext logic including its html parser and
strip it out to be simpler.
2025-12-04 19:41:39 +05:30
hanishkvc 2394d38d58 SimpleChatTC:Cleanup: General T2
Pretty print SimpleProxy gMe config

Dont ignore the got http response status text.

Update readme wrt why autoSecs
2025-12-04 19:41:39 +05:30
hanishkvc c316f5a2bd SimpleChatTC:WebTools:UrlText:HtmlParser: tag drops - refine
Update the initial skeleton wrt the tag drops logic

* had forgotten to convert object to json string at the client end
* had confused between js and python and tried accessing the dict
  elements using . notation rather than [] notation in python.
* if the id filtered tag to be dropped is found, from then on
  track all other tags of the same type (independent of id),
  so that start and end tags can be matched. bcas end tag call
  wont have attribute, so all other tags of same type need to
  be tracked, for proper winding and unwinding to try find
  matching end tag
* remember to reset the tracked drop tag type to None once matching
  end tag at same depth is found. should avoid some unnecessary
  unwinding.
* set/fix the type wrt tagDrops explicitly to needed depth and
  ensure the dummy one and any explicitly got one is of right type.

Tested with duckduckgo search engine and now the div based unneeded
header is avoided in returned search result.
2025-12-04 19:41:39 +05:30
hanishkvc 06fd41a88e SimpleChatTC:WebTools: urltext-tag-drops python side - skel
Rename search-drops to urltext-tag-drops, to indicate its more
generic semantic. Rather search drops specified in UI by user
will be mapped to urltext-tag-drops header entry of a urltext
web fetch request.

Implement a crude urltext-tag-drops logic in TextHtmlParser.
If there is any mismatch with opening and closing tags in the
html being parsed and inturn wrt the type of tag being targetted
for dropping, things can mess up.
2025-12-04 19:41:39 +05:30
hanishkvc 2cdf3f574c SimpleChatTC:SimpleProxy: Validate deps wrt enabled service paths
helps ensure only service paths that can be serviced are enabled

Use same to check for pypdf wrt pdftext
2025-12-04 19:41:39 +05:30
hanishkvc 1d1894ad14 SimpleChatTC:PdfText:Cleanup rename to follow a common convention
Rename path and tags/identifiers from Pdf2Text to PdfText

Rename the function call to pdf_to_text, this should also help
indicate semantic more unambiguously, just in case, especially
for smaller models.
2025-12-04 19:41:39 +05:30
hanishkvc 9efab62702 SimpleChatTC:SimpleProxy:Add generic arxiv.org entry to allowed 2025-12-04 19:41:39 +05:30
hanishkvc 3b929f934f SimpleChatTC:SimpleProxy:Switch web flow to use file helpers
This also indirectly adds support for local file system access
through the web / fetch (ie urlraw and urltext) service request paths.
2025-12-04 19:41:39 +05:30
hanishkvc 494d063657 SimpleChatTC:SimpleProxy: getting local / web file module ++
Added logic to help get a file from either the local file system
or from the web, based on the url specified.

Update pdfmagic module to use the same, so that it can support
both local as well as web based pdf.

Bring in the debug module, which I had forgotten to commit, after
moving debug helper code from simpleproxy.py to the debug module
2025-12-04 19:41:39 +05:30
hanishkvc a3beacf16a SimpleChatTC:SimpleProxy:Pdf2Text cleanup page number handling
Its not necessary to request a page number range always.

Take care of page number starting from 1 and underlying data having
0 as the starting index
2025-12-04 19:41:39 +05:30
hanishkvc d012d127bf SimpleChatTC:SimpleProxy: Avoid circular deps wrt Type Checking
also move debug dump helper to its own module

also remember to specify the Class name in quotes, similar to
refering to a class within a member of th class wrt python type
checking.
2025-12-04 19:41:39 +05:30
hanishkvc 350d7d77e0 SimpleChatTC:SimpleProxy: Move web requests to its own module 2025-12-04 19:41:39 +05:30
hanishkvc a7de002fd0 SimpleChatTC:SimpleProxy:Move pdf logic into its own module 2025-12-04 19:41:39 +05:30
hanishkvc b18aed4449 SimpleChatTC:SimpleProxy: AuthAndRun hlpr for paths that check auth
Also trap any exceptions while handling and send exception info
to the client requesting service
2025-12-04 19:41:39 +05:30
hanishkvc c597572e10 SimpleChatTC:SimpleProxy: Use urlvalidator
Add --allowed.schemes config entry as a needed config.

Setup the url validator.

Use this wrt urltext, urlraw and pdf2text

This allows user to control whether local file access is enabled
or not. By default in the sample simpleproxy.json config file
local file access is allowed.
2025-12-04 19:41:39 +05:30
hanishkvc 6cab95657f SimpleChatTC:SimpleProxy:UrlValidator initial go
Check if the specified scheme is allowed or not.

If allowed then call corresponding validator to check remaining
part of the url is fine or not
2025-12-04 19:41:39 +05:30
hanishkvc c8407a1240 SimpleChatTC:SimpleProxy:UrlValidator module initial skeleton
Copy validate_url and build initial skeleton
2025-12-04 19:41:39 +05:30
hanishkvc dd0a7ec500 SimpleChatTC:Pdf2Text: Make it work with a subset of pages
Initial go, need to review the code flow as well as test it out
2025-12-04 19:41:39 +05:30
hanishkvc dfeb94d3f6 SimpleChatTC:Pdf2Text: cleanup initial go
Make the description bit more explicit with it supporting local
file paths as part of the url scheme, as the tested ai model was
cribbing about not supporting file url scheme. Need to check if
this new description will make things better.

Convert the text to bytes for writing to the http pipe.

Ensure CORS is kept happy by passing AccessControlAllowOrigin in
header.
2025-12-04 19:41:39 +05:30
hanishkvc 6054ddfb65 SimpleChatTC:SimpleProxy:Pdf2Text: Initial go 2025-12-04 19:41:39 +05:30
hanishkvc 5ec29087ea SimpleChatTC:SimpleProxy:Pdf2Text: Move handling url to its own 2025-12-04 19:41:39 +05:30
hanishkvc ecfdb66c94 SimpleChatTC:SimpleProxy:Pdf2Text:Initial plumbing
Get the pdf2text request for processing.
2025-12-04 19:41:39 +05:30
hanishkvc da98a961ab SimpleChatTC:SimpleProxy: Enable allowing or not requested feature 2025-12-04 19:41:39 +05:30
hanishkvc 59effa6ea8 SimpleChatTC:Cleanup: tool resp xml, some allowed domains
Add a newline between name and content in the xml representation
of the tool response, so that it is more easy to distinguish things

Add github, linkedin and apnews domains to allowed.domains for
simpleproxy.py
2025-12-04 19:41:39 +05:30
hanishkvc cf06c8682b SimpleChatTC:Reasoning+: Update readme wrt reasoning, flow cleanup
Also cleanup the minimal based showing of chat messages a bit

And add github.com to allowed list
2025-12-04 19:41:39 +05:30
hanishkvc aa17edfa78 SimpleChatTC:SimpleProxy: Include some news sites in allowed domains 2025-12-04 19:41:39 +05:30
hanishkvc 84403973cd SimpleChatTC:SimpleProxy: once in a bluemoon transformed bearer
instead of using the shared bearer token as is, hash it with
current year and use the hash.

keep /aum path out of auth check.

in future bearer token could be transformed more often, as well as
with additional nounce/dynamic token from server got during initial
/aum handshake as also running counter and so ...

NOTE: All these circus not good enough, given that currently the
simpleproxy.py handshakes work over http. However these skeletons
put in place, for future, if needed.

TODO: There is a once in a bluemoon race when the year transitions
between client generating the request and server handling the req.
But other wise year transitions dont matter bcas client always
creates fresh token, and server checks for year change to genrate
fresh token if required.
2025-12-04 19:41:39 +05:30
hanishkvc 6d08cda9c8 SimpleChatTC:SimpleProxy: Check for bearer authorization
As noted in the comments in code, this is a very insecure flow
for now.
2025-12-04 19:41:39 +05:30
hanishkvc 3f1fd289eb SimpleChatTC:SimpleProxy:BearerInsecure a needed config
Add a config entry called bearer.insecure which will contain a
token used for bearer auth of http requests

Make bearer.insecure and allowed.domains as needed configs, and
exit program if they arent got through cmdline or config file.
2025-12-04 19:41:39 +05:30
hanishkvc 0caa2e8101 SimpleChatTC:SimpleProxy: Prg Parameters handling cleanup - next
Ensure load_config gets called on encountering --config in cmdline,
so that the user has control over whether cmdline or config file
will decide the final value of any given parameter.

Ensure that str type values in cmdline are picked up directly, without
running them through ast.literal_eval, bcas otherwise one will have to
ensure throught the cmdline arg mechanism that string quote is retained
for literal_eval

Have the """ function note/description below def line immidiately
so that it is interpreted as a function description.
2025-12-04 19:41:39 +05:30
hanishkvc f221a2c356 SimpleChatTC:SimpleProxy:LoadConfig ProcessArgs cleanup - initial
Now both follow a similar mechanism and do the following

* exit on finding any issue, so that things are in a known
  state from usage perspective, without any confusion/overlook

* check if the cmdlineArgCmd/configCmd being processed is a known
  one or not.

* check value of the cmd is of the expected type

* have a generic flow which can accomodate more cmds in future
  in a simple way
2025-12-04 19:41:39 +05:30
hanishkvc 252fb91e95 SimpleChatTC:WebSearchPlus: Update readme, Wikipedia in allowed
If using wikipedia or so, remember to have sufficient context window
in general wrt the ai engine as well as wrt the handshake / chat
end point.
2025-12-04 19:41:39 +05:30
hanishkvc 2192ae6dd3 SimpleChatTC:Cleanup whitespace - github editorconfig checker
Add missing newline to ending bracket line of json config file
2025-12-04 19:41:39 +05:30
hanishkvc f74ce327e5 SimpleChatTC: Cleanup whitespaces
identified by llama.cpp editorconfig check

* convert tab to spaces in json config file
* remove extra space at end of line
2025-12-04 19:41:39 +05:30
hanishkvc 9e97880dde SimpleChatTC:SimpleProxy:Cleanup
avoid logically duplicate debug log
2025-12-04 19:41:39 +05:30
hanishkvc 4c1c363504 SimpleChatTC:SimpleProxy: debug dumps to identify funny bing
bing raised a challenge for chrome triggered search requests after
few requests, which were spread few minutes apart, while still
seemingly allowing wget based search to continue (again spread
few minutes apart).

Added a simple helper to trace this, use --debug True to enable
same.
2025-12-04 19:41:39 +05:30
hanishkvc c109da870f SimpleChatTC:SimpleProxy: mimicing got req helps wrt duckduckgo
mimicing got req in generated req helps with duckduckgo also and
not just yahoo.

also update allowed.domains to allow a url generated by ai when
trying to access the bing's news aggregation url
2025-12-04 19:41:39 +05:30
hanishkvc bebf846157 SimpleChatTC:SimpleProxy:Cleanup a bit
The tagging of messages wrt ValidateUrl and UrlReq

Also dump req

Move check for --allowed.domains to ValidateUrl

NOTE: Also with mimicing of user agent etal from got request to
the generated request, yahoo search/news is returning results now,
instead of the bland error before.
2025-12-04 19:41:39 +05:30
hanishkvc d0b9103176 SimpleChatTC:SimpleProxy:Try mimic real client using got req info
ie include User-Agent, Accept-Language and Accept in the generated
request using equivalent values got in the request being proxied.
2025-12-04 19:41:39 +05:30
hanishkvc e6e0adbe90 SimpleChatTC:SimpleProxy: Some debug prints which give info 2025-12-04 19:41:39 +05:30