A PDF that changes based on how its read
- gpvos - 5296 sekunder sedanI would suggest changing the title to the actual title of the article: Adaptive PDFs.
Assuming the program works, the PDF will not actually look different to me than to anyone else looking at it, so there is nothing that "changes based on who is reading". It is just that text extraction, a wholly different (and much fuzzier) process than viewing the PDF, and something that the same person can do, will now return structured (Markdown) text. (One might say the PDF changes based on how you are reading it.) A great idea, IMHO.
- gnunicorn - 4880 sekunder sedanJust because everything is a potential threat vector now: doesn't this also mean you could easily put AI specific malicious instructions into the PDF that the regular human would never notice?
Like the "white text between the lines that only appears when copy-pasted"-hack that some professors have been doing in their exercises to their students to include pink elephants in the output and stuff. But worse. Just thinking of a electricity bill pdf you provide as proof of address to some company that uses an LLM to extraxt that address and pre-process that doc. But instead we can command it to do something else that a regular human wouldn't even ever notice...
Just a thought
- Tomte - 2648 sekunder sedan> LaTeX, Chrome's print-to-PDF, most export tools don't produce tags
LaTeX is actually one of the best ways to create tagged PDF: https://latex3.github.io/tagging-project/tagging-status/ and https://www.overleaf.com/learn/latex/An_introduction_to_tagg...
- al_hag - 3191 sekunder sedanIn the US, publicly funded organizations are required to code their PDF with semantic structure to support machine access by screen readers and other assistive technologies [1], [2].
Given the low adherence to accessibility standards e.g. in academic publishing [3], LLM parsing needs creating a commercial incentive for comparable structured access would be marvelous.
[1] https://www.section508.gov/create/pdfs/common-tags-and-usage...
[2] https://pdfa.org/resource/tagged-pdf-best-practice-guide-syn...
- Xotic007 - 2868 sekunder sedanCool but it's relying on every extractor honoring that replacement-text property which you said yourself is hit or miss. So it's clean markdown until someone runs it through a tool that ignores it and quietly gets the messy version and has no idea that happened.
- jheimark - 5393 sekunder sedanThis looks really interesting. Optimizing for humans vs. agents feels like the new wave of Desktop vs. Mobile (where mobile won) - agents are going to win even faster.
Where is the repo? It's mentioned but I can't find it.
- jexp - 4662 sekunder sedanShouldn’t it be possible since forever to put machine readable source information into PDF metadata. It’s more a problem of the tools and programs generating the PDFs.
We spend millions turning structured information into PDFs and billions to extract the same data from a printer rendering language
- tombert - 1513 sekunder sedanI always export my Typst with PDF/A. It basically guarantees maximal compatibility and none of the annoying dynamic bullshit. I wish everyone would do this, at least for documents that don't need the fancy dynamic PDF features.
- iLoveOncall - 4327 sekunder sedanI'd be more interested in the contrary. A PDF that ensures it's only readable by humans.
I guess the exact same technique can actually be used.
- mschuster91 - 896 sekunder sedan> The advantage isn't fewer tokens. It's that the same tokens now carry structure.
> Headings, lists, structure. One file, no separate versions, no conversion step.
... and I guess that AI wasn't just used as a target to write the software against, but also to fluff up the PR piece?
- fsckboy - 2002 sekunder sedan>This didn't matter when humans were the only readers. But now most PDFs end up in an LLM.
but it did matter, a lot. the PDF format was originally proprietary and was designed to be proprietary and to disallow casual text extraction. I just didn't like the way you glossed over that, "it was OK that people for over 30 years were not given any way for the information they were given to be unshackled, but now it matters because our AI overlords were prefer that so we must change things!"
- froh - 4916 sekunder sedan[dead]
Nördnytt! 🤓