PDFTriage: Question Answering over Long, Structured Documents

General Thoughts

Overall, I like the idea of creating metadata and functions, but each function call will blow up the total token count. Judging by Table 3 and Figure 13, this approach does not work miracles. There is no embedding that can be used with graph databases. Paper link

Highlights

fetch_table, fetch_figure, and retrieve ⤴️

Table 3: Answer Quality Scoring ⤴️

Why are the results in this table so different from Figure 3? In my opinion, structure-related questions are not relevant for page and chunk embedding. Was the overall preference in Figure 3 calculated with respect to the taxonomy proportions? Structure-related questions are only 3.7%.
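A quick arithmetic check on the taxonomy proportions quoted in these notes shows how little weight structure questions would carry if the overall preference were a share-weighted average. That aggregation is an assumption made here for illustration, not something the paper confirms; the shares are the ones from the quoted question-type list, and `max_overall_shift` is a hypothetical helper.

```python
# Question-type shares (percent) as quoted from the paper's taxonomy.
SHARES = {
    "Figure Questions": 6.5,
    "Text Questions": 26.2,
    "Table Reasoning": 7.4,
    "Structure Questions": 3.7,
    "Summarization": 16.4,
    "Extraction": 21.2,
    "Rewrite": 5.2,
    "Outside Questions": 8.6,
    "Cross-page Tasks": 1.1,
    "Classification": 3.7,
}


def max_overall_shift(category: str, per_category_shift: float) -> float:
    """How far one category can move a share-weighted average.

    ASSUMPTION: the overall score is the share-weighted mean of the
    per-category scores. The paper may aggregate differently.
    """
    total = sum(SHARES.values())
    return SHARES[category] / total * per_category_shift


# The ten shares add up to 100%.
total_share = round(sum(SHARES.values()), 1)

# Even a full 1.0-point swing on structure questions moves the overall
# average by only about 0.037 under this assumption.
structure_effect = max_overall_shift("Structure Questions", 1.0)
```

If the aggregation really is share-weighted, a large per-category gain on structure questions in Table 3 would be nearly invisible in an overall preference number.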

The key contributions of this paper are: • We identify a gap in question answering over structured documents with current LLM approaches, namely treating documents as plain text rather than structured objects; • We release a dataset of tagged question types, along with model responses, in order to facilitate further research on this topic; and • We present a method of prompting the model, called PDFTriage, that improves the ability of an LLM to respond to questions over structured documents. ⤴️

  8. Outside Questions (8.6%): Ask a question that can’t be answered with just the document.
  9. Cross-page Tasks (1.1%): Ask a question that needs multiple parts of the document to answer.
  10. Classification (3.7%): Ask about the type of the document. ⤴️

ReAct (Yao et al., 2022) or Toolformer (Schick et al., 2023) -like way ⤴️

Check papers for improvement

section #5? ⤴️

It’s more of a structure question than a text question

PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents ⤴️

Structured text is not just simple headers, titles, and text. It also includes tables and images, and follows a standard structure.

On average a document contains 4,257 tokens of text ⤴️

Not bad

3.1 Document Representation ⤴️

Metadata creation process! Quite important, but not much technical detail is given.
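To make the missing detail more concrete, here is a minimal sketch of what document structure metadata might look like and how it could be flattened into prompt text. The `Section` and `DocumentMetadata` classes, their field names, and the output format are all hypothetical; the notes above do not record how the paper actually represents this.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    page: int  # 1-based page where the section starts


@dataclass
class DocumentMetadata:
    # Hypothetical container; field names are illustrative only.
    num_pages: int
    sections: list[Section] = field(default_factory=list)
    tables: list[str] = field(default_factory=list)   # table captions
    figures: list[str] = field(default_factory=list)  # figure captions

    def to_prompt(self) -> str:
        """Flatten the structure into plain text that a prompt can carry."""
        lines = [f"Pages: {self.num_pages}"]
        lines += [f"Section: {s.title} (page {s.page})" for s in self.sections]
        lines += [f"Table {i}: {c}" for i, c in enumerate(self.tables, 1)]
        lines += [f"Figure {i}: {c}" for i, c in enumerate(self.figures, 1)]
        return "\n".join(lines)


# Made-up example document, not taken from the paper's dataset.
meta = DocumentMetadata(
    num_pages=12,
    sections=[Section("Introduction", 1), Section("Results", 7)],
    tables=["Quarterly revenue"],
    figures=["System overview"],
)
prompt_text = meta.to_prompt()
```

A compact summary like this is cheap in tokens compared with the document body, which is presumably why it can be included in every prompt.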

  1. Figure Questions (6.5%): Ask a question about a figure in the document.
  2. Text Questions (26.2%): Ask a question about the document.
  3. Table Reasoning (7.4%): Ask a question about a table in the document.
  4. Structure Questions (3.7%): Ask a question about the structure of the document.
  5. Summarization (16.4%): Ask for a summary of parts of the document or the full document.
  6. Extraction (21.2%): Ask for specific content to be extracted from the document.
  7. Rewrite (5.2%): Ask for a rewrite of some text in the document. ⤴️

we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents from 10 different categories of question types for document QA ⤴️

Great resource for evaluation

100-word pieces ⤴️

No semantic chunking (not even recursive). I wonder how that would score.
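For reference, fixed-size chunking of the kind quoted above (100-word pieces) can be sketched in a few lines. The function name `chunk_by_words` and the whitespace tokenization are assumptions; the exact splitting rules used in the paper are not recorded in these notes.

```python
def chunk_by_words(text: str, size: int = 100) -> list[str]:
    """Split text into consecutive pieces of at most `size` words.

    Words are whitespace-separated tokens. Sentence and section
    boundaries are ignored, which is what makes this non-semantic.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# 250 words -> chunks of 100, 100, and 50 words
chunks = chunk_by_words("word " * 250)
```

A semantic or recursive splitter would instead cut at paragraph or sentence boundaries, so a chunk would never start in the middle of a sentence.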

leverage document structure by augmenting prompts with both document structure metadata ⤴️

Document structure metadata!

Landeghem et al. (2023) ⤴️

DUDE ⤴️

Check paper

Page Retrieval ⤴️

Very simple and straightforward approach

QASPER (Dasigi et al., 2021) ⤴️

Check paper

five different functions ⤴️

multi-step reasoning across the whole document. ⤴️

So multi-document question answering is a subject of this paper, right?

fetch_pages, fetch_sections ⤴️
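The five function names that appear in these highlights (fetch_pages, fetch_sections, fetch_table, fetch_figure, retrieve) suggest a small tool layer over a parsed document. Below is a rough sketch of such a layer. The `Doc` structure, the argument shapes, the return formats, and the `call_tool` dispatcher are assumptions for illustration, and `retrieve` uses naive word overlap as a stand-in for whatever retrieval the paper actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    # Hypothetical parsed-document container; field names are illustrative.
    pages: dict[int, str] = field(default_factory=dict)
    sections: dict[str, str] = field(default_factory=dict)
    tables: dict[str, str] = field(default_factory=dict)
    figures: dict[str, str] = field(default_factory=dict)


def fetch_pages(doc: Doc, numbers: list[int]) -> str:
    """Return the text of the requested pages, skipping unknown ones."""
    return "\n".join(doc.pages[n] for n in numbers if n in doc.pages)


def fetch_sections(doc: Doc, titles: list[str]) -> str:
    """Return the text of the requested sections, skipping unknown ones."""
    return "\n".join(doc.sections[t] for t in titles if t in doc.sections)


def fetch_table(doc: Doc, name: str) -> str:
    return doc.tables.get(name, "")


def fetch_figure(doc: Doc, name: str) -> str:
    return doc.figures.get(name, "")


def retrieve(doc: Doc, query: str) -> str:
    """Stand-in retrieval: the page sharing the most words with the query."""
    if not doc.pages:
        return ""
    query_words = set(query.lower().split())

    def overlap(page_number: int) -> int:
        return len(query_words & set(doc.pages[page_number].lower().split()))

    return doc.pages[max(doc.pages, key=overlap)]


TOOLS = {
    "fetch_pages": fetch_pages,
    "fetch_sections": fetch_sections,
    "fetch_table": fetch_table,
    "fetch_figure": fetch_figure,
    "retrieve": retrieve,
}


def call_tool(doc: Doc, name: str, argument):
    """Dispatch a model-issued call such as ("fetch_pages", [1, 2])."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](doc, argument)


# Made-up example document.
doc = Doc(
    pages={1: "Revenue grew in the third quarter",
           2: "Appendix with methodology notes"},
    sections={"Introduction": "This report covers the fiscal year."},
    tables={"Table 1": "Q1,Q2\n10,12"},
)
page_two = call_tool(doc, "fetch_pages", [2])
best_page = call_tool(doc, "retrieve", "revenue growth")
```

Each call's return value would be appended to the prompt, which is where the token growth mentioned under General Thoughts comes from.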

