We've been using #docling a lot at work lately to parse all kinds of documents in various formats. It's handy for converting them into a common JSON document.
ICYMI: Taming Unstructured Data: From PDFs to JSON with Quarkus and Docling https://open.substack.com/pub/myfear/p/quarkus-docling-data-preparation-for-ai?r=17bggb&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false
#Java #quarkus #Docling #Data
Taming Unstructured Data: From PDFs to JSON with Quarkus and Docling
Build a fast, scalable converter to turn business documents into structured data
https://myfear.substack.com/p/quarkus-docling-data-preparation-for-ai
#Java #Quarkus #Docling #AIML #PDF #DocumentParsing
Today's #TerSoftware topic is "paper management". I recently tested OCR for digitizing tables and... I wasn't very happy with the result.
I think #OCR works best when it stays closely tied to the scanned document (for example, making a PDF file searchable), but for text extraction it's still a big "it depends".
In my short journey, I tested #Tesseract and #Docling. Maybe it works with well-written code, but I ended up giving in and doing it by hand.
Tesseract seems quite easy to install on Linux (even on #openSUSE Leap, which has its limitations because it derives from the enterprise SUSE, I found it easy), but Docling required some juggling with Python environments (using conda and pip).
For running text, Tesseract already seems good enough. It can be run from the command line and, at least on openSUSE Leap, several dictionaries are packaged to make things easier.
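For anyone who wants to script that command-line call, here is a minimal sketch in Python. It assumes the `tesseract` executable is on the PATH and that the Portuguese language data (`por`) is installed. The function names and file names are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, lang="por"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, output_base, lang="por"):
    """Run Tesseract on one image and return the path of the text file it writes."""
    # Fail early with a clear message if Tesseract is not installed.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # By default Tesseract writes plain text to OUTPUTBASE.txt.
    return f"{output_base}.txt"
```

Calling `run_ocr("scan.png", "scan")` would then leave the recognized text in `scan.txt`. As noted above, this is likely to be good enough for running text, less so for tables.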
Taking part in the #Docling workshop at the #OpenSource AI conference. This is a project I heard about at #DINAconCH a few months ago, and it seems to have since exploded in popularity on PyPI and GitHub, in part thanks to the #CHopen community.
There are strong overlaps with what I've been doing at #ProxeusApp - my notes from the Docling deep-dive have been posted here: https://log.alets.ch/105/
Simplify AI data integration with RamaLama and RAG
https://developers.redhat.com/articles/2025/04/03/simplify-ai-data-integration-ramalama-and-rag#
#Docling #Ramalama #podman #aiml
Wrestling with PDF files today… delighted to find #Docling https://ds4sd.github.io/docling/
It’s a solid CLI for parsing documents. It was annoying to install, but it works well. I still have manual cleanup to do, but it's way easier than doing everything by hand, and the output is higher quality than other AI options.
Docling: AI-powered document processing!
PDFs & DOCXs in your AI workflow? Docling makes it easy! Converts to markdown & JSON for RAG and more. Blazing fast!
Docling, IBM’s new open-source toolkit, is designed to more easily unearth that information for generative AI applications. The toolkit streamlines the process of turning unstructured documents into JSON and Markdown files that are easy for large language models (LLMs) and other foundation models to digest.
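As a rough illustration of that document-to-JSON-and-Markdown flow, here is a minimal Python sketch. It assumes the `docling` package is installed and uses `DocumentConverter` with the `export_to_dict()` and `export_to_markdown()` methods as described in Docling's documentation. The helper names, the `converted` output directory, and the file layout are hypothetical.

```python
import json
from pathlib import Path


def output_paths(source, out_dir):
    """Return the (JSON, Markdown) target paths for a source document."""
    stem = Path(source).stem
    out = Path(out_dir)
    return out / f"{stem}.json", out / f"{stem}.md"


def convert_document(source, out_dir="converted"):
    """Convert one document with Docling and write JSON and Markdown files."""
    # Imported lazily so this module still loads where Docling is not installed.
    from docling.document_converter import DocumentConverter

    json_path, md_path = output_paths(source, out_dir)
    json_path.parent.mkdir(parents=True, exist_ok=True)

    # Assumption: convert() accepts a local path or URL and returns a result
    # whose .document can be exported to a dict or to Markdown.
    document = DocumentConverter().convert(source).document

    json_path.write_text(
        json.dumps(document.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    md_path.write_text(document.export_to_markdown(), encoding="utf-8")
    return json_path, md_path
```

Calling `convert_document("report.pdf")` would write `converted/report.json` and `converted/report.md`. The JSON keeps the document structure for downstream processing, while the Markdown is the easier form to feed into a RAG pipeline or an LLM prompt.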