Commit Graph

39 Commits

Author SHA1 Message Date
hanishkvc dd0a7ec500 SimpleChatTC:Pdf2Text: Make it work with a subset of pages
Initial go, need to review the code flow as well as test it out
2025-12-04 19:41:39 +05:30
hanishkvc dfeb94d3f6 SimpleChatTC:Pdf2Text: cleanup initial go
Make the description bit more explicit with it supporting local
file paths as part of the url scheme, as the tested ai model was
cribbing about not supporting file url scheme. Need to check if
this new description will make things better.

Convert the text to bytes for writing to the http pipe.

Ensure CORS is kept happy by passing AccessControlAllowOrigin in
header.
2025-12-04 19:41:39 +05:30
hanishkvc 6054ddfb65 SimpleChatTC:SimpleProxy:Pdf2Text: Initial go 2025-12-04 19:41:39 +05:30
hanishkvc 5ec29087ea SimpleChatTC:SimpleProxy:Pdf2Text: Move handling url to its own 2025-12-04 19:41:39 +05:30
hanishkvc ecfdb66c94 SimpleChatTC:SimpleProxy:Pdf2Text:Initial plumbing
Get the pdf2text request for processing.
2025-12-04 19:41:39 +05:30
hanishkvc da98a961ab SimpleChatTC:SimpleProxy: Enable allowing or not requested feature 2025-12-04 19:41:39 +05:30
hanishkvc 84403973cd SimpleChatTC:SimpleProxy: once in a bluemoon transformed bearer
instead of using the shared bearer token as is, hash it with
current year and use the hash.

keep /aum path out of auth check.

in future bearer token could be transformed more often, as well as
with additional nounce/dynamic token from server got during initial
/aum handshake as also running counter and so ...

NOTE: All these circus not good enough, given that currently the
simpleproxy.py handshakes work over http. However these skeletons
put in place, for future, if needed.

TODO: There is a once in a bluemoon race when the year transitions
between client generating the request and server handling the req.
But other wise year transitions dont matter bcas client always
creates fresh token, and server checks for year change to genrate
fresh token if required.
2025-12-04 19:41:39 +05:30
hanishkvc 6d08cda9c8 SimpleChatTC:SimpleProxy: Check for bearer authorization
As noted in the comments in code, this is a very insecure flow
for now.
2025-12-04 19:41:39 +05:30
hanishkvc 3f1fd289eb SimpleChatTC:SimpleProxy:BearerInsecure a needed config
Add a config entry called bearer.insecure which will contain a
token used for bearer auth of http requests

Make bearer.insecure and allowed.domains as needed configs, and
exit program if they arent got through cmdline or config file.
2025-12-04 19:41:39 +05:30
hanishkvc 0caa2e8101 SimpleChatTC:SimpleProxy: Prg Parameters handling cleanup - next
Ensure load_config gets called on encountering --config in cmdline,
so that the user has control over whether cmdline or config file
will decide the final value of any given parameter.

Ensure that str type values in cmdline are picked up directly, without
running them through ast.literal_eval, bcas otherwise one will have to
ensure throught the cmdline arg mechanism that string quote is retained
for literal_eval

Have the """ function note/description below def line immidiately
so that it is interpreted as a function description.
2025-12-04 19:41:39 +05:30
hanishkvc f221a2c356 SimpleChatTC:SimpleProxy:LoadConfig ProcessArgs cleanup - initial
Now both follow a similar mechanism and do the following

* exit on finding any issue, so that things are in a known
  state from usage perspective, without any confusion/overlook

* check if the cmdlineArgCmd/configCmd being processed is a known
  one or not.

* check value of the cmd is of the expected type

* have a generic flow which can accomodate more cmds in future
  in a simple way
2025-12-04 19:41:39 +05:30
hanishkvc 9e97880dde SimpleChatTC:SimpleProxy:Cleanup
avoid logically duplicate debug log
2025-12-04 19:41:39 +05:30
hanishkvc 4c1c363504 SimpleChatTC:SimpleProxy: debug dumps to identify funny bing
bing raised a challenge for chrome triggered search requests after
few requests, which were spread few minutes apart, while still
seemingly allowing wget based search to continue (again spread
few minutes apart).

Added a simple helper to trace this, use --debug True to enable
same.
2025-12-04 19:41:39 +05:30
hanishkvc bebf846157 SimpleChatTC:SimpleProxy:Cleanup a bit
The tagging of messages wrt ValidateUrl and UrlReq

Also dump req

Move check for --allowed.domains to ValidateUrl

NOTE: Also with mimicing of user agent etal from got request to
the generated request, yahoo search/news is returning results now,
instead of the bland error before.
2025-12-04 19:41:39 +05:30
hanishkvc d0b9103176 SimpleChatTC:SimpleProxy:Try mimic real client using got req info
ie include User-Agent, Accept-Language and Accept in the generated
request using equivalent values got in the request being proxied.
2025-12-04 19:41:39 +05:30
hanishkvc e6e0adbe90 SimpleChatTC:SimpleProxy: Some debug prints which give info 2025-12-04 19:41:39 +05:30
hanishkvc 840cab0b1c SimpleChatTC:SimpleProxy: Include a sample config file
with allowed domains set to few sites in general to show its use

this includes some sites which allow search to be carried out
through them as well as provide news aggregation
2025-12-04 19:41:39 +05:30
hanishkvc 370326b1ec SimpleChatTC:SimpleProxy: Cleanup domain filtering and general
Had confused between js and python wrt accessing dictionary
contents and its consequence on non existent key. Fixed it.

Use different error ids to distinguish between failure in common
urlreq and the specific urltext and urlraw helpers.
2025-12-04 19:41:39 +05:30
hanishkvc 71ad609db6 SimpleChatTC:SimpleProxy: AllowedDomains based filtering
Allow fetching from only specified allowed.domains
2025-12-04 19:41:39 +05:30
hanishkvc 58954c8814 SimpleChatTC:SimpleProxy: Update doc following python convention 2025-12-04 19:41:39 +05:30
hanishkvc 62dcd506e3 SimpleChatTC:SimpleProxy:Allow for loading json based config file
The config entries should be named same as their equivalent cmdline
argument entries but without the -- prefix
2025-12-04 19:41:39 +05:30
hanishkvc 98d43fac7f SimpleChatTC:WebFetch: Try confirm simpleproxy before enabling 2025-12-04 19:41:39 +05:30
hanishkvc d04c8cd38d SimpleChatTC:SimpleProxy: Ensure CORS related headers sent always
Add a new send headers common helper and use the same wrt the
overridden send_error as well as do_OPTIONS

This ensures that if there is any error during proxy opertions,
the send_error propogates to the fetch from any browser properly
without browser intercepting it with a CORS error
2025-12-04 19:41:39 +05:30
hanishkvc 73a144c44d SimpleChatTC:SimpleProxy:HtmlParser more generic and flexible
also now track header, footer and nav so that they arent captured
2025-12-04 19:41:39 +05:30
hanishkvc 9ff2c596ee SimpleChatTC:SimpleProxy:Options just in case 2025-12-04 19:41:39 +05:30
hanishkvc bf63b8f45a SimpleChatTC:SimpleProxy:UrlText: Slightly better trimming
First identify lines which have only whitespace and replace them
with lines with only newline char in them.

Next strip out adjacent lines, if they have only newlines
2025-12-04 19:41:39 +05:30
hanishkvc 266e825c68 SimpleChatTC:SimpleProxy:UrlText: Try strip empty lines some what 2025-12-04 19:41:39 +05:30
hanishkvc b46bbc542a SimpleChatTC:SimpleProxy:UrlText: Avoid style blocks also 2025-12-04 19:41:39 +05:30
hanishkvc f493e1af59 SimpleChatTC:SimpleProxy:UrlText: Capture body except for scripts 2025-12-04 19:41:39 +05:30
hanishkvc 45b05df21b SimpleChatTC:SimpleProxy: Switch to html.parser
As html can be malformed, xml ElementTree XMLParser cant handle
the same properly, so switch to the HtmlParser helper class that is
provided by python and try extend it.

Currently a minimal skeleton to just start it out, which captures
only the body contents.
2025-12-04 19:41:39 +05:30
hanishkvc d5f4183f7c SimpleChatTC:SimpleProxy: ElementTree, No _UrlopenRet
As _UrlopenRet not exposed for use outside urllib, so decode and
encode the data.

Add skeleton to try get the html/xml tree top elements
2025-12-04 19:41:39 +05:30
hanishkvc 6537559360 SimpleChatTC:SimpleProxy:Common UrlReq helper for UrlRaw & UrlText
Declare the result of UrlReq as a DataClass, so that one doesnt
goof up wrt updating and accessing members.

Duplicate UrlRaw into UrlText, need to add Text extracting from
html next for UrlText
2025-12-04 19:41:39 +05:30
hanishkvc e600e62e86 SimpleChatTC:SimpleProxy: Cleanup few messages 2025-12-04 19:41:39 +05:30
hanishkvc 3bab4de0e8 SimpleChatTC:SimpleProxy:UrlRaw: Fixup basic oversight wrt 1st go 2025-12-04 19:41:39 +05:30
hanishkvc 73ef9f7d46 SimpleChatTC:SimpleProxy:implement handle_urlraw
A basic go at it
2025-12-04 19:41:39 +05:30
hanishkvc 73054a5832 SimpleChatTC:SimpleProxy: Extract and check path, route to handlers 2025-12-04 19:41:39 +05:30
hanishkvc c99788e290 SimpleChatTC:SimpleProxy: Cleanup for basic run 2025-12-04 19:41:39 +05:30
hanishkvc 80fd065993 SimpleChatTC:SimpleProxy: Start server, Show requested path 2025-12-04 19:41:39 +05:30
hanishkvc 05c0ade8be SimpleChatTC:SimpleProxy:Process args --port 2025-12-04 19:41:39 +05:30