Tutorial 5 min read

How to Train Your Bot on Messy PDF Manuals

Author
Felix Lee
Oct 24, 2025

PDFs are notoriously hard for AI to parse. Tables get scrambled, headers disappear, and footnotes ruin the flow. Here is our battle-tested method for cleaning your data before you hit "Train".

The "Garbage In, Garbage Out" Problem

Most users just drag and drop a 500-page technical manual into their chatbot builder and expect magic. But LLMs (Large Language Models) read text linearly, one token after another. When a PDF has a two-column layout or a table that spans pages, the extracted text can come out in the wrong order: lines from separate columns get interleaved and table rows get split apart.
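To make the problem concrete, here is a toy illustration (not a real PDF parser). It reads two columns row by row across the page, the way a layout-unaware extractor might. The sample sentences and function names are invented for this example.

```python
# Toy illustration: a two-column page read row by row instead of column by column.
# The sample text is invented for this example.
left_column = [
    "Turn the valve clockwise",
    "until the gauge reads zero.",
]
right_column = [
    "Warning: never open the",
    "housing while powered.",
]

def read_across_rows(left, right):
    """Join lines the way a layout-unaware extractor might: row by row."""
    return " ".join(f"{l} {r}" for l, r in zip(left, right))

def read_by_column(left, right):
    """Join lines in the intended reading order: left column, then right."""
    return " ".join(left + right)

scrambled = read_across_rows(left_column, right_column)
intended = read_by_column(left_column, right_column)
print(scrambled)
print(intended)
```

The first printed line mixes the instruction with the warning mid-sentence, which is roughly what the model ends up "reading" when column order is lost.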

Common Mistake

Don't upload scanned PDFs (images of text) without OCR. A scanned page has no text layer, so unless the platform has built-in OCR (Botech does), the bot will see nothing but blank pages.
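One way to catch this before uploading is to check whether each page yields any text at all. The sketch below assumes you already have one extracted string per page. The function name and the 20-character threshold are arbitrary choices for illustration.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is (nearly) empty.

    page_texts: list of strings, one per page, as produced by a PDF parser.
    min_chars is an arbitrary threshold chosen for this sketch.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        if len((text or "").strip()) < min_chars:
            flagged.append(number)
    return flagged

# Invented extraction result: pages 2 and 3 came back empty, as scanned pages would.
sample_pages = ["Chapter 1: Installing the pump and checking seals.", "", "   \n"]
print(pages_without_text(sample_pages))
```

With pypdf, the per-page strings could come from `[page.extract_text() for page in pypdf.PdfReader(path).pages]`. If most pages are flagged, the file is probably a scan and needs OCR first.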

Step 1: Pre-processing the PDF

Before uploading, run your PDF through a parser to see what the text actually looks like. If you see running headers or page markers appearing in the middle of sentences (e.g., "Chapter 1... [Page 4]... Continued"), you need to crop the margins so that headers and footers are left out of the extraction.
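If cropping margins isn't practical, a rough alternative is to drop lines that repeat on most pages, since those are usually running headers or footers. This is a minimal sketch that assumes one extracted string per page. The function name and the 60% threshold are made up for illustration.

```python
from collections import Counter

def strip_repeated_lines(page_texts, min_share=0.6):
    """Drop lines that recur on at least min_share of pages (likely headers/footers).

    min_share is an arbitrary threshold for this sketch; tune it per document.
    """
    # Count each distinct line once per page.
    counts = Counter()
    for text in page_texts:
        counts.update({line.strip() for line in text.splitlines() if line.strip()})

    # Require at least two occurrences so short documents are left alone.
    threshold = max(2, min_share * len(page_texts))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for text in page_texts:
        kept = [line for line in text.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned

# Invented sample: the same running header appears on every page.
pages = [
    "ACME Pump Manual\nOpen the housing and",
    "ACME Pump Manual\nremove the filter.",
    "ACME Pump Manual\nReplace the seal.",
]
print(strip_repeated_lines(pages))
```

This only catches lines that are identical from page to page. Markers that change, such as "[Page 4]", still need a pattern-based cleanup.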

Code Snippet: Cleaning Text with Python

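import re

import pypdf  # extracts the raw text, e.g. pypdf.PdfReader(path).pages[0].extract_text()

def clean_text(text):
    # Remove lines that contain only a page number
    text = re.sub(r'\n\d+\n', '\n', text)
    # Rejoin words hyphenated across line breaks
    # (note: this also merges true compounds that happen to break at the hyphen)
    text = text.replace('-\n', '')
    return text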

# Botech does this automatically, but good to know!

Step 2: Chunking Strategy

Botech uses a "sliding window" chunking strategy. We break your document into overlapping segments of 500 tokens each. Because text near a cut point appears in both neighboring segments, context isn't lost at the boundaries.
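Here is a minimal sketch of the sliding-window idea, not Botech's actual implementation. The 500-token size comes from the description above. The 50-token overlap and the function name are assumptions made for illustration.

```python
def sliding_window_chunks(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens.

    size=500 follows the article; overlap=50 is an assumed value for illustration.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

# Synthetic tokens stand in for real tokenizer output.
tokens = [f"tok{i}" for i in range(1200)]
chunks = sliding_window_chunks(tokens)
print([len(c) for c in chunks])
```

Each window starts `size - overlap` tokens after the previous one, so the last 50 tokens of one chunk are the first 50 of the next.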

Common ways to choose the cut points:

  • Semantic Chunking: Breaking text at paragraph ends.
  • Fixed Size: Breaking strictly at token limits (risky, because a cut can land mid-sentence or mid-table row).
  • Recursive: The Botech method. We look for headers first, then paragraphs.
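The recursive approach can be sketched roughly as follows. This is an illustration of the general idea, not Botech's code. It assumes headers look like "Chapter 1" or "Section 2" at the start of a line, and it counts words instead of tokens. The header pattern and function name are hypothetical.

```python
import re

# Assumptions for this sketch: a header is a line that starts with "Chapter" or
# "Section" plus a number, and words stand in for tokens.
HEADER = re.compile(r"^(?=(?:Chapter|Section) \d+)", flags=re.MULTILINE)
PARAGRAPH = re.compile(r"\n\s*\n")

def recursive_split(text, max_words=500, splitters=(HEADER, PARAGRAPH)):
    """Split at headers first, then paragraphs, then fixed-size word windows."""
    text = text.strip()
    if not text:
        return []
    words = text.split()
    if len(words) <= max_words:
        return [text]
    if not splitters:
        # Last resort: cut strictly at the word limit.
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    chunks = []
    for piece in splitters[0].split(text):
        chunks.extend(recursive_split(piece, max_words, splitters[1:]))
    return chunks

# Invented sample document.
sample = (
    "Chapter 1 Setup\n\nUnpack the unit. Check the seals.\n\n"
    "Chapter 2 Operation\n\nOpen the valve slowly.\n\nClose it after use."
)
result = recursive_split(sample, max_words=10)
for chunk in result:
    print(repr(chunk))
```

A section that already fits under the limit stays whole. Only oversized sections are broken down further, so the fixed-size cut is used as a last resort.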

Step 3: Testing the Response

Once trained, ask your bot a specific question that references a table in your PDF. If it hallucinates, your chunks are likely too small to contain the full row of data. Increase chunk size in the Botech "Advanced Settings" tab.
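If you can inspect the chunks yourself, a quick sanity check is to see whether any single chunk contains every cell of the table row you are asking about. The sketch below assumes you have the chunks as plain strings. The function name and sample data are invented.

```python
def chunks_containing_row(chunks, row_values):
    """Return indexes of chunks that contain every cell value of a table row."""
    return [
        index
        for index, chunk in enumerate(chunks)
        if all(value in chunk for value in row_values)
    ]

# Invented example: a spec-table row split across two chunks.
chunks = [
    "Model X200 | Max pressure",
    "| 16 bar | Flow rate 40 L/min",
]
row = ["X200", "16 bar", "40 L/min"]
hits = chunks_containing_row(chunks, row)
print(hits or "row is split across chunks; consider a larger chunk size")
```

An empty result means no chunk holds the full row, which matches the symptom described above.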

Conclusion

Clean data beats a better model every time. Spend 10 minutes formatting your PDF, and your bot will be 10x smarter.
