Launch HN: Parsewise (YC P25) – Reason Across Documents with an API

Hi all, it’s Greg and Max, founders of Parsewise.

Parsewise transforms a bucket of unstructured data into schema-compliant data, retaining lineage for values resolved across documents. Imagine giving Claude a bunch of files and asking for a CSV or JSON output. If you have tried this, you know both the system limitations (number of files, type of inputs, cost, latency) and the human-facing challenge of having no way to validate the results quickly. We solve both. We help tech teams simplify their unstructured data ETL, and loop in business experts for the definitions and for instant validation.

Here is a video with a few use cases: https://www.youtube.com/watch?v=dbRllnnh47w

Parsewise in the words of someone coming to us: “I need to extract information from insurance policy PDFs, phone calls that have been transcribed, emails, etc. I am NOT looking for something that would just extract data point by data point, page by page into a structured well-defined schema but more something more agentic that can understand that information might be across documents and that it should reason over what to extract.”

We started the company based on a decade of experience (and pain) in complex data transformation and data analysis / synthesis. Greg built classical ETL and implemented AI workflows at Palantir. At Bain, Max did highly complex data analysis in the financial sector, similar to the work many of our customers do.

Parsewise works by taking in a bucket of data (think hundreds or thousands of PDFs, Excel files, etc.) and outputting schema-compliant data where every single value is traceable down to word-level citations across multiple documents in the bucket. We provide API customers with ways to show the lineage in their own applications, or they can use our platform for internal operations. At the core of the data processing we have self-improving agent definitions. They define the acceptable sources, the logic for resolving or combining values, and the rule for highlighting uncertainty to the end user.
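
To make the lineage idea concrete, here is a minimal illustrative sketch of what a value with word-level citations across documents could look like. It is not the Parsewise API or schema: `Citation`, `ResolvedValue`, `resolve`, the field names, and the "most cited value wins" rule are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    """Word-level pointer into one source document (hypothetical shape)."""
    document: str    # file name within the bucket
    page: int        # 1-based page number
    word_start: int  # index of the first cited word on the page
    word_end: int    # index of the last cited word (inclusive)


@dataclass
class ResolvedValue:
    """One schema field whose value was resolved across documents."""
    field_name: str
    value: str
    citations: list[Citation] = field(default_factory=list)
    uncertain: bool = False  # set when sources disagree

    def sources(self) -> list[str]:
        """Distinct documents backing this value, in first-seen order."""
        return list(dict.fromkeys(c.document for c in self.citations))


def resolve(field_name: str, candidates: list[tuple[str, Citation]]) -> ResolvedValue:
    """Toy resolution rule (an assumption): pick the most frequently cited
    value and flag the result as uncertain if any source disagrees."""
    if not candidates:
        raise ValueError("no candidate values to resolve")
    counts: dict[str, int] = {}
    for value, _ in candidates:
        counts[value] = counts.get(value, 0) + 1
    best = max(counts, key=counts.get)
    return ResolvedValue(
        field_name=field_name,
        value=best,
        citations=[c for v, c in candidates if v == best],
        uncertain=len(counts) > 1,
    )


# Made-up example: three documents mention a policy start date, one disagrees.
candidates = [
    ("2024-01-01", Citation("policy.pdf", 3, 120, 122)),
    ("2024-01-01", Citation("email.txt", 1, 45, 47)),
    ("2024-02-01", Citation("call_transcript.txt", 2, 10, 12)),
]
result = resolve("policy_start_date", candidates)
```

In this toy example the resolved value keeps only the citations that support it, and the disagreement from the third document surfaces as the `uncertain` flag so a reviewer knows where to look.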

The underlying tech is model and cloud agnostic and can be deployed in private networks. We have seen the best results with Gemini models for visual reasoning, achieving SOTA (beating Claude Fable) on the strongest grounded reasoning benchmark we have found (Databricks OfficeQA). Notably, we focused more on the “human harness” rather than the model harness, leaning into the actual friction we saw in uptake, which is around verifiability. That means optimizing the time and clicks required to trust the outcomes. We use vLLMs for parsing, and then we use small models for efficient large scale exhaustive search. Unlike RAG, we do not sample; instead, we exhaustively find all relevant values for a given query. We use larger models for decision making around resolutions and flagging inconsistencies to users.
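
As a rough illustration of the difference between sampling and exhaustive search (not our implementation; the function names are hypothetical and a keyword count stands in for a small model), compare a top-k retrieval step with a scan over every chunk:

```python
from typing import Callable, Sequence


def top_k_retrieval(
    chunks: Sequence[str], score: Callable[[str], float], k: int
) -> list[str]:
    """RAG-style sampling: keep only the k highest-scoring chunks."""
    return sorted(chunks, key=score, reverse=True)[:k]


def exhaustive_search(
    chunks: Sequence[str], is_relevant: Callable[[str], bool]
) -> list[str]:
    """Exhaustive scan: check every chunk and keep all that match."""
    return [chunk for chunk in chunks if is_relevant(chunk)]


def keyword_score(chunk: str) -> float:
    """Stand-in for a small relevance model: count mentions of 'premium'."""
    return float(chunk.lower().count("premium"))


def mentions_premium(chunk: str) -> bool:
    return keyword_score(chunk) > 0


# Made-up chunks from different documents in one bucket.
chunks = [
    "Policy A: annual premium is 1,200 EUR.",
    "Email: the premium was raised to 1,350 EUR in March.",
    "Call transcript: customer asked about claims handling.",
    "Endorsement: premium discount of 5% applies from June.",
]

sampled = top_k_retrieval(chunks, keyword_score, k=2)
found = exhaustive_search(chunks, mentions_premium)
```

With `k=2`, the sampled set drops the endorsement even though it changes the premium, while the exhaustive scan returns all three relevant chunks. The cost is one relevance check per chunk, which is why a small model is the natural fit for that step.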

This exhaustiveness and explicit value sourcing is unique to our platform, and it goes beyond the first step of data parsing that many existing providers cover.

We would love to welcome builders and tinkerers to try Parsewise on your complex document challenges. We have a ton of ideas on how we can expand the product and make it better, but would appreciate feedback and ideas from the community!

23 points | by gergelycsegzi 2 hours ago

5 comments

gergelycsegzi 2 hours ago
Ah probably should add a link to our website: https://www.parsewise.ai/api
- stevesimmons 1 hour ago
  "retaining lineage"
  - gergelycsegzi 1 hour ago
    "That is a great catch!"
gorgmah 1 hour ago
I recently worked on an internal tool to do this kind of thing, mostly plugging Mistral OCR into Gemini to extract structured data from documents. We then perform automated diffs too.
There seems to be an insane amount of competition in the "Intelligent Document Processing" market, for instance Parseur, whose founder is often on HN himself.
What do you think sets you apart from competition like:
1) Mistral Document AI: depending on the model, it looks way cheaper than yours. OCR model pricing ranges from 0.001 to 0.004 EUR / page, and they have structured output wired into the OCR API if needed (things then get fed to one of their LLMs), plus it is EU-based and GDPR-ready.
2) Parseur / Rossum / Docsumo / Nanonets (which is YC 2017)?
- joss82 18 minutes ago
  Hi, Parseur founder here :D
  I understand what they are trying to do, but to me it feels like the moment when MongoDB entered the database space with a semi-structured, "flexible" storage format. It has its uses, mostly for prototyping.
  But in high-volume production workloads, giving a structure to the data you extract (what Parseur does through defining the Fields in your Mailbox, basically giving your output data a schema) adds a ton of value, and the larger the dataset, the truer this is.
  Usually, you start by defining where you want your data to go and which structure it should have, then work backwards from there and start extracting the data. This is the key to automating your document workflow.
- gergelycsegzi 1 hour ago
  Great question!
  1. We are working with the assumption that OCR is (or soon will be) solved at super low prices.
  So if we have the extracted data, what can we do with it? Where we see Parsewise making a difference is in use cases that span documents. For example, if you are extracting the same 5 fields from every invoice, there are lots of solutions, as you listed (+ Reducto etc.). However, once you have a set of documents (e.g. an entire mortgage application package) and you are trying to get a structured response out, your options are either an LLM API (if things fit into context and you are okay with limited citations) or building a pipeline with LLMs. I posted it in another comment, but an example of trawling through 90k pages is here: https://www.parsewise.ai/officeqa-sota
  2. As long as we rely on LLMs, the outcomes will be non-deterministic, so the bottleneck is and will remain human verification (at least for somewhat complex use cases). The architecture we have built optimizes for the human reviewer by providing values and citations that are as granular as possible. This works either through our platform or through API clients.
red_hare 54 minutes ago
I say this with a lot of love: The vibecoded applications in your demo reek of AI slop design.
This isn't a critique of your product. It's just that the beige-orange theme, the pill components, and the left-border highlight give me the same visceral reaction as reading a paragraph littered with em dashes and "not X but Y." It makes me take you less seriously.
Cool demo otherwise.
- gergelycsegzi 24 minutes ago
  Haha no appreciate it! That's on me for not calling it out explicitly (was trying to make the video as short as possible), but the demo UIs were literally vibe coded to show the ease of integration https://youtu.be/F1cSuZal03s?si=1H4zTcO-8cosLbVr&t=70
mauryaudayan 51 minutes ago
LlamaParse also does this. What is different here?
- maxhofer 8 minutes ago
  Mostly cross-doc reasoning at scale (e.g., 90k-page corpora) as opposed to doc-to-markdown conversions.
gnerd00 1 hour ago
> implemented AI workflows at Palantir
you show this in the first paragraph, before many other details
> We would love to welcome builders and tinkerers
Love? Really? There is cognitive dissonance here. I read this as "we are security-state friendly so we can get that big security-state funding" plus "people who work for free like love, so we say that word".
Coupled with the free-riding of VC capital on decades of open work, I just cannot not say this.
- gergelycsegzi 1 hour ago
  I learnt a lot at Palantir, though I always worked in commercial, so no ties to the security state (for better or worse). (Also, side note: we are working towards enabling frontier performance with smaller open models, which allows our customers to protect their data. https://www.parsewise.ai/officeqa-sota )
  And I do get genuine joy from helping our users, so love it is :)
  - Johnny_Bonk 1 hour ago
    Have you ever really thought deeply about your reasoning that it’s all fine and dandy and just love, given where the paycheck you got every two weeks for 50 hours of work a week was coming from? It’s pretty convenient to look the other way when it’s a categorical fact your old employer knowingly assists in genocide, mass deportations, and democratic backsliding. Your intentions may be genuine and good-natured, and if that’s the case then hopefully this new chapter of your company will serve good things, and hopefully not more triangulation of data points vis-à-vis document parsing to help continue what Palantir and the like are doing.
    - gergelycsegzi 1 hour ago
      Planning to serve good things for sure, and I appreciate your note. Ofc I didn't agree with everything Palantir was doing (to the extent that we even knew about it at the time). I was working on vaccine distribution and cancer research as well, so it definitely felt like helping.