SimpleChatTC:PdfText:Basic Outline and its Numbering done

Pass a list to keep track of the numbering at different depths, as well as to delay incrementing the numbering until the last minute.
Don't let the recursion go beyond a predefined limit.
2025-11-08 04:32:30 +05:30 · 2025-11-08 04:32:30 +05:30 · 9484bea71a
parent 15e99843db
commit 9484bea71a
2 changed files with 22 additions and 12 deletions
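
The numbering scheme described in the commit message can be sketched on its own, without pypdf. This is a minimal illustration, not the committed code: it assumes an outline shaped as nested Python lists with dict entries carrying a '/Title' key, and the names number_outline, MAX_DEPTH and sample are hypothetical stand-ins.

```python
from typing import Any

MAX_DEPTH = 4  # hypothetical limit, mirrors PDFOUTLINE_MAXDEPTH in the diff


def number_outline(node: Any, prefix: list[int]) -> str:
    """Return numbered titles for a nested outline (lists nest, dicts are entries)."""
    if len(prefix) > MAX_DEPTH:
        return ""  # too deep: drop this entry or subtree
    if not isinstance(node, list):
        # Leaf entry: bump the counter for the current depth only now,
        # so a nested list never consumes a number by itself.
        prefix[-1] += 1
        return ".".join(map(str, prefix)) + ":" + node["/Title"] + "\n"
    prefix.append(0)  # open a new depth level, its counter starts at 0
    text = "".join(number_outline(child, prefix) for child in node)
    prefix.pop()  # close the level so later siblings resume the parent numbering
    return text


# Assumed sample data: two entries, a nested sub-list, then one more entry.
sample = [
    {"/Title": "Intro"},
    {"/Title": "Methods"},
    [{"/Title": "Setup"}, {"/Title": "Run"}],
    {"/Title": "Results"},
]
print(number_outline(sample, []), end="")
# 1:Intro
# 2:Methods
# 2.1:Setup
# 2.2:Run
# 3:Results
```

Because the shared list is mutated in place, a nested list inherits the number of the entry just before it (2.1 and 2.2 fall under 2 above), and the pop restores the parent counter for the next sibling. The sketch builds the number with plain string concatenation; reusing double quotes inside a double-quoted f-string expression, as the diff does, is only accepted by Python 3.12 and later.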
--- a/tools/server/public_simplechat/local.tools/pdfmagic.py
+++ b/tools/server/public_simplechat/local.tools/pdfmagic.py
@@ -10,21 +10,24 @@ if TYPE_CHECKING:
    from simpleproxy import ProxyHandler


-def extract_pdfoutline(ol: Any, prefix: str):
+PDFOUTLINE_MAXDEPTH=4
+
+
+def extract_pdfoutline(ol: Any, prefix: list[int]):
    """
-    Extract the pdf outline recursively.
-    1st tuple entry returned indicates whether to increase outline entry numbering
-    2nd tuple entry returns the outline string that provides the extracted outline.
+    Helps extract the pdf outline recursively, along with its numbering.
    """
+    if (len(prefix) > PDFOUTLINE_MAXDEPTH):
+        return ""
    if type(ol).__name__ != type([]).__name__:
-        return (1, f"{prefix}:{ol['/Title']}\n")
+        prefix[-1] += 1
+        return f"{".".join(map(str,prefix))}:{ol['/Title']}\n"
    olText = ""
-    olNum = 1
+    prefix.append(0)
    for (i,iol) in enumerate(ol):
-        got = extract_pdfoutline(iol, f"{prefix}.{olNum}")
-        olNum += got[0]
-        olText += got[1]
-    return (0, olText)
+        olText += extract_pdfoutline(iol, prefix)
+    prefix.pop()
+    return olText


 def process_pdftext(url: str, startPN: int, endPN: int):
@@ -53,8 +56,11 @@ def process_pdftext(url: str, startPN: int, endPN: int):
        startPN = 1
    if (endPN <= 0) or (endPN > len(oPdf.pages)):
        endPN = len(oPdf.pages)
-    outlineGot = extract_pdfoutline(oPdf.outline, "")
-    tPdf += outlineGot[1]
+    # Add the pdf outline, if available
+    outlineGot = extract_pdfoutline(oPdf.outline, [])
+    if outlineGot:
+        tPdf += f"\n\nOutline Start\n\n{outlineGot}\n\nOutline End\n\n"
+    # Add the pdf page contents
    for i in range(startPN, endPN+1):
        pd = oPdf.pages[i-1]
        tPdf = tPdf + pd.extract_text()
--- a/tools/server/public_simplechat/readme.md
+++ b/tools/server/public_simplechat/readme.md
@@ -462,6 +462,7 @@ plain textual content from the search result page.

 * fetch_pdf_as_text - fetch/read specified pdf file and extract its textual content
  * this depends on the pypdf python based open source library
+  * creates an outline of titles along with numbering, if the pdf contains an outline/toc

 * fetch_xml_filtered - fetch/read specified xml file and optionally filter out any specified tags
  * allows one to specify a list of tags related REs,
@@ -676,6 +677,9 @@ sliding window based drop off or even before they kick in, this can help in many

 * capture the body of ai server not ok responses, to help debug as well as to show same to user.

+* extract and include the outline of titles (along with calculated numbering) in the text output of pdftext
+  * ensure that one doesn't recurse beyond a predefined limit.
+

 #### ToDo