
NovelHopQA

Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts

About NovelHopQA

NovelHopQA is a large-scale benchmark designed to test how language models handle multi-step reasoning over long passages from real novels. With 4,000 questions spanning 64k–128k-token excerpts and up to 4 reasoning hops, it reveals that even top models struggle as tasks get longer and more complex. NovelHopQA highlights key challenges in deep comprehension and multi-hop inference, providing a valuable tool for improving future language models.

Key Findings & Impact

Paper Overview
Simplified Methodology Diagram
Figure: Simplified Methodology. NovelHopQA constructs multi-hop QA chains by extracting high-frequency keywords from novels, filtering paragraphs, and incrementally building reasoning paths (hops) that culminate in challenging question–answer pairs. Each hop integrates more context, enabling deeper reasoning.

Current large language models (LLMs) struggle to answer questions that span tens of thousands of tokens, especially when multi-hop reasoning is involved. Prior benchmarks explore long-context comprehension or multi-hop reasoning in isolation, but none jointly vary context length and reasoning depth in natural narrative settings. NovelHopQA closes this gap: it is the first benchmark to jointly vary hop depth (1–4 hops) and context length (64k–128k tokens) in natural narratives. When crucial evidence is buried in the middle of a long context, accuracy can plunge by more than 20 points (Liu et al., 2023), and even frontier models score below 50% exact match on multi-document suites, showing that larger context windows alone cannot solve cross-document reasoning.

Related Work

Multi-hop benchmarks fall into two groups: those that probe multi-hop reasoning over short, isolated passages, and those that test long-context comprehension without requiring deep multi-hop chains.

NovelHopQA fills this gap by simultaneously testing reasoning depth and long-context comprehension in coherent narratives.

Benchmark Features
  • 4,000 multi-hop QA examples from 83 full-length public-domain novels
  • Context windows of 64k, 96k, and 128k tokens
  • Questions require integrating 1–4 reasoning hops across narrative chains
  • Human validation ensures high alignment (>6.5/7) and hop-match accuracy (>94%)
Methodology

We build NovelHopQA through a four-stage pipeline:

  1. Novel Selection: We selected 83 English novels from Project Gutenberg, spanning mystery, adventure, romance, and literary classics, including both first- and third-person narration.
  2. Anchor-Keyword Discovery: For each novel, we prompted GPT-4o-mini to suggest five "anchor" keywords (characters, locations, or objects central to the plot). If any keyword appears fewer than 50 times in the text, we discard it and re-sample (see the sketch after this list).
  3. Paragraph Chaining & QA Generation: We implemented a keyword-guided process that:
    • Selects paragraphs containing specific keywords
    • Extracts new related keywords for subsequent hops
    • Chains paragraphs with increasing hop depth (1-4)
    • Regenerates QA pairs at each step to integrate new evidence
  4. QA Validation: We filter examples using model and human validation to ensure answerability and correct hop depth. Ten human annotators confirmed high alignment (>6.5/7) and hop-match accuracy (>94%).
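
As a rough illustration of the anchor-keyword check in stage 2, here is a minimal Python sketch. The propose_keywords helper, the re-sampling cap, and the whole-word matching rule are assumptions for illustration, not the authors' implementation.

import re

MIN_OCCURRENCES = 50   # frequency threshold from stage 2
MAX_RESAMPLES = 10     # illustrative cap on re-sampling, not specified above

def keyword_count(text, keyword):
    # Case-insensitive, whole-word count of the keyword in the novel text.
    return len(re.findall(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE))

def pick_anchor_keywords(novel_text, propose_keywords):
    # propose_keywords(novel_text) stands in for the GPT-4o-mini prompt that
    # suggests five candidate anchor keywords (characters, locations, objects).
    for _ in range(MAX_RESAMPLES):
        candidates = propose_keywords(novel_text)
        if all(keyword_count(novel_text, k) >= MIN_OCCURRENCES for k in candidates):
            return candidates
        # Otherwise discard the batch and re-sample, as described in stage 2.
    raise RuntimeError("no sufficiently frequent anchor keywords found")
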
Results

We evaluated six state-of-the-art models: o1, GPT-4o, GPT-4o-mini, Gemini 2.5 Pro, Gemini 2.0 Flash, and Gemini 2.0 Flash Lite. Key findings:

  • Impact of hop depth: All models exhibit consistent performance degradation as hop depth increases. On average, accuracy drops roughly 12 points from 1-hop to 4-hop at 64k context length.
  • Impact of context length: Longer contexts also lead to reduced accuracy, though the effect is milder than that of hop count. Across models, 1-hop performance drops about 5 points when moving from 64k to 128k contexts.
  • No model maintains strong performance on the hardest tasks (4-hop at 128k), where even top models dip below 80% accuracy.

Contact

For questions about the benchmark, collaboration opportunities, or to report issues, please contact us at: abhaygupta1266@gmail.com

We welcome contributions and feedback from the research community to help improve long-context multi-hop reasoning in future language models.

Citation

@misc{gupta2025novelhopqadiagnosingmultihopreasoning,
      title={NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts}, 
      author={Abhay Gupta and Michael Lu and Kevin Zhu and Sean O'Brien and Vasu Sharma},
      year={2025},
      eprint={2506.02000},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.02000}
}

This table reports the accuracy (%) of six language models on the NovelHopQA benchmark, grouped by context length (64k, 96k, 128k tokens) and reasoning hop count (1–4). All models experience accuracy drops as context length and hop count increase, revealing the challenge of multi-hop reasoning over long narratives.

Context  Hop  Gemini 2.5 Pro  o1     GPT-4o  Gemini 2.0 Flash  Gemini 2.0 Flash Lite  GPT-4o-mini  Avg.
64k      1    92.34           92.51  90.12   87.37             82.53                  75.49        86.73
64k      2    87.84           87.66  84.25   77.02             71.39                  74.77        80.48
64k      3    85.12           84.99  81.34   74.25             70.05                  73.14        78.13
64k      4    82.45           82.15  78.47   71.76             65.33                  68.04        74.69
96k      1    90.12           90.35  88.83   82.26             78.44                  72.25        83.71
96k      2    86.03           85.88  82.67   74.02             67.04                  67.44        77.18
96k      3    83.71           83.41  80.41   73.38             66.05                  66.97        75.66
96k      4    80.98           80.68  76.92   70.26             62.81                  65.59        72.87
128k     1    89.10           88.76  86.95   81.77             75.31                  70.03        81.99
128k     2    84.70           84.33  80.52   69.13             62.21                  63.95        74.14
128k     3    82.20           81.92  78.03   68.78             62.07                  62.95        72.66
128k     4    78.55           78.80  74.64   67.32             57.39                  61.18        69.65
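
To make these trends concrete, the short calculation below reproduces the headline drops from the Avg. column of the table (values transcribed from above); it is only a reading aid, not part of the benchmark's evaluation code.

# Average accuracy by hop depth at 64k, and the 1-hop average at 128k,
# transcribed from the Avg. column of the table above.
avg_64k = {1: 86.73, 2: 80.48, 3: 78.13, 4: 74.69}
avg_128k_hop1 = 81.99

hop_depth_drop = avg_64k[1] - avg_64k[4]          # 1-hop -> 4-hop at fixed 64k length
context_length_drop = avg_64k[1] - avg_128k_hop1  # 64k -> 128k at fixed 1 hop

print(f"Drop from 1-hop to 4-hop at 64k: {hop_depth_drop:.2f} points")       # ~12.0
print(f"Drop from 64k to 128k at 1 hop:  {context_length_drop:.2f} points")  # ~4.7
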
Performance Visualizations

The following graphs visualize model performance across different hop depths and context lengths, clearly showing the consistent accuracy drops as both factors increase.

Model Performance Across Hop Depths
Figure: accuracy versus hop depth, with one panel each for 64k, 96k, and 128k token contexts.

These visualizations demonstrate that even top-performing models experience significant performance degradation as reasoning depth increases, especially at longer context lengths.

Model Performance Across Context Lengths

These graphs show how model accuracy changes with increasing context length for each hop level, highlighting the combined challenge of long contexts and multi-hop reasoning.

Figure: accuracy versus context length, with one panel each for hop levels 1 through 4.

These graphs reveal that while context length affects performance, the impact is more pronounced at higher hop levels, suggesting that reasoning complexity is a stronger driver of degradation than context length alone.

NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts


Related Work

NovelHopQA builds upon and extends prior work on long-context comprehension and multi-hop question answering.

NovelHopQA is the first benchmark to jointly vary both context length and reasoning depth in natural narrative settings, providing a controlled diagnostic for multi-hop reasoning at scale.

Benchmark Features
Feature                  Details
Number of QA examples    4,000
Source novels            83 (Project Gutenberg)
Context window sizes     64k, 96k, 128k tokens
Reasoning hops           1–4
Human validation         >6.5/7 alignment, >94% hop-match accuracy
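
For concreteness, a single benchmark example can be pictured as a record like the one below; the field names are illustrative assumptions rather than the released dataset's actual schema.

# Hypothetical shape of a single NovelHopQA example; the field names below are
# illustrative assumptions, not the released dataset's actual schema.
example = {
    "novel": "<Project Gutenberg title>",
    "context_window": "64k",                    # one of 64k, 96k, 128k
    "hop_depth": 3,                             # 1-4 reasoning hops
    "context": "<continuous narrative span containing all hop paragraphs>",
    "question": "<final multi-hop question>",
    "answer": "<gold answer>",
}
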
Methodology Pipeline
Technical Pipeline Details

For each book and hop depth H ∈ {1, 2, 3, 4}, we assemble contexts and QA pairs as follows:

  1. Hop 1: Select a paragraph containing one of the book’s anchor keywords k₁. Prompt GPT-4o to generate a single-hop QA pair (Q₁, A₁) from this paragraph.
  2. Hops h = 2, …, H:
    1. Extract a new keyword kₕ from the context Cₕ₋₁ using a related-keyword prompt.
    2. Sample a paragraph that contains both k₁ and kₕ, and append it to the growing context: Cₕ = Cₕ₋₁ ∥ new paragraph.
    3. Prompt GPT-4o to re-generate a single QA pair (Qₕ, Aₕ) over the full context Cₕ, ensuring the new question integrates evidence from all h paragraphs.
  3. Paragraph Exclusivity: Remove each selected paragraph from the pool to prevent reuse. If no matching paragraph is found after seven attempts, abort the chain and restart with a fresh anchor.

This process grows each datapoint from (C₁, Q₁, A₁) at the first hop to (Cₕ, Qₕ, Aₕ) at the final hop h = H, yielding coherent multi-hop QA examples grounded in authentic narrative context. Each 64k, 96k, or 128k window is sampled from a continuous span, and all hop paragraphs are required to fall within it, ensuring the QA chain reflects a cohesive narrative flow.
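
Below is a minimal Python sketch of the chaining loop described above. The helper functions (extract_keyword, generate_qa, sample_paragraph) and the retry granularity are assumptions standing in for the GPT-4o prompts and keyword-guided paragraph selection; the sketch only mirrors the control flow (hop-by-hop context growth, keyword co-occurrence, paragraph exclusivity, and the seven-attempt abort), not the authors' exact code.

MAX_ATTEMPTS = 7  # abort the chain after seven failed paragraph searches

def build_chain(pool, anchor, H, extract_keyword, generate_qa, sample_paragraph):
    # pool: unused paragraphs from one context window; anchor: anchor keyword k1.
    # extract_keyword, generate_qa, and sample_paragraph are stand-ins for the
    # related-keyword prompt, the GPT-4o QA prompt, and keyword-guided selection.
    context = sample_paragraph(pool, must_contain=[anchor])
    if context is None:
        return None
    pool.remove(context)                          # paragraph exclusivity
    question, answer = generate_qa(context)       # hop-1 pair (Q1, A1)

    for h in range(2, H + 1):
        paragraph = None
        for _ in range(MAX_ATTEMPTS):
            new_kw = extract_keyword(context)     # k_h from C_{h-1}
            paragraph = sample_paragraph(pool, must_contain=[anchor, new_kw])
            if paragraph is not None:
                break
        if paragraph is None:
            return None                           # caller restarts with a fresh anchor
        pool.remove(paragraph)                    # paragraph exclusivity
        context = context + "\n\n" + paragraph    # C_h = C_{h-1} || new paragraph
        question, answer = generate_qa(context)   # regenerate (Q_h, A_h) over C_h

    return context, question, answer
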

Golden-Context Filtering: To ensure answerability, all six models are evaluated on the original golden contexts used to generate each QA pair. Any question missed by any model is discarded, resulting in a dataset where all retained QA pairs are answerable by current leading models. This step ensures high dataset validity; detailed results are provided in the research paper.
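
A compact sketch of this filtering rule, assuming a hypothetical answers_correctly(model, item) helper that runs one model on one question with its golden context:

def golden_context_filter(qa_items, models, answers_correctly):
    # Keep a QA pair only if every evaluated model answers it correctly when
    # given the golden context it was generated from; any miss discards it.
    # answers_correctly(model, item) is a hypothetical helper, not a real API.
    return [item for item in qa_items
            if all(answers_correctly(model, item) for model in models)]
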

Irrelevant and No-Context Sanity Check: To confirm that questions require actual reasoning and are not solvable by recall alone, 800 QA pairs (100 per hop) are tested under irrelevant and no-context conditions. Models perform poorly in these settings, indicating that correct answers depend on contextual grounding rather than memorization. This strengthens the benchmark's focus on true reasoning. Detailed results are provided in the research paper.
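
The sketch below shows one way such a sanity check could be scored, assuming hypothetical helpers; the field names, is_correct, and irrelevant_context_for are illustrative choices, not details stated above.

def sanity_check_accuracy(subset, model, is_correct, irrelevant_context_for):
    # Score one model on the 800-pair subset under three conditions: the golden
    # context, an irrelevant context, and no context at all. is_correct(...) and
    # irrelevant_context_for(...) are hypothetical helpers for illustration.
    totals = {"golden": 0, "irrelevant": 0, "no_context": 0}
    for item in subset:
        totals["golden"] += is_correct(model, item, item["golden_context"])
        totals["irrelevant"] += is_correct(model, item, irrelevant_context_for(item))
        totals["no_context"] += is_correct(model, item, "")
    return {condition: count / len(subset) for condition, count in totals.items()}
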

Human Validation Results

The following table reports the average scores from 10 independent human annotators who evaluated the alignment and hop-match accuracy for each hop depth H ∈ {1, 2, 3, 4}. Alignment is rated on a 1–7 Likert scale, and Hop Match measures the percentage of questions judged to require exactly H reasoning steps. High average alignment (>6.5/7) and hop-match (>94%) indicate strong dataset quality and clear multi-hop structure. These results demonstrate that the questions are both contextually grounded and require the intended number of reasoning steps.

Metric           H = 1  H = 2  H = 3  H = 4
Alignment (1–7)  6.69   6.58   6.58   6.57
Hop Match (%)    95.9   94.9   94.9   95.2
Evaluation Results

All six models (o1, GPT-4o, GPT-4o-mini, Gemini 2.5 Pro, Gemini 2.0 Flash, and Gemini 2.0 Flash Lite) were evaluated at every combination of context length and hop depth; the full accuracy breakdown appears in the results table above, and the key findings (consistent drops with hop depth, milder drops with context length, and sub-80% accuracy on 4-hop questions at 128k) are summarized in the Results section earlier on this page.

These results highlight that simply increasing context window size isn't enough—robust multi-hop reasoning remains a key challenge for LLMs, even at the frontier of model capabilities.