5 changes: 3 additions & 2 deletions .gitignore
@@ -25,6 +25,7 @@ share/python-wheels/
.installed.cfg
*.egg
MANIFEST
+.DS_Store

# PyInstaller
# Usually these files are written by a python script from a template
@@ -98,7 +99,7 @@ ipython_config.py
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
-#uv.lock
+uv.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
@@ -173,7 +174,7 @@ cython_debug/
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
+.idea/

# Abstra
# Abstra is an AI-powered process automation framework.
41 changes: 41 additions & 0 deletions baseline_quest/README.md
@@ -0,0 +1,41 @@
# Baseline Experiments with QUEST

## Experiment Setup
1. Index with Chroma (`indexing/index_chroma.py`)
- Either split the entire document into 512-token chunks with an 80-token overlap, or index only the first 512 tokens of each document (same as QUEST)
- Embedded in batches of 256 chunks
- Embedding model: `bge-small-en-v1.5`

2. Decompose (optional)
- Use `gpt-4o-mini` to decompose the query into subqueries connected with operators
- E.g. "Stoloniferous plants or crops originating from Bolivia" -> `retrieve("crops from Bolivia", 100) | retrieve("stoloniferous plants", 100)`
- E.g. "Neogene mammals of Africa that are Odd-toed ungulates" -> `retrieve("Neogene mammals of Africa", 100) & retrieve("Odd-toed ungulates", 100)`
- Example decomposition Python files: `decompose/k_scripts/query_9` and `decompose/k_scripts/query_10`
- Generate decompositions with `decompose/generate_decompositions.py`

3. Retrieval (@ k)
- Retrieve with query / subquery (vector similarity)
- If the entire document was indexed, retrieve the 200 most similar chunks, then map them to their documents
- If only the first 512 tokens were indexed, retrieve the k most similar chunks
- Retrieve after decomposition with query / subquery
- `decompose/execute_decompositions.py` executes all of the decomposition Python scripts generated in step 2 and uses the same vector similarity retrieval for each subquery.
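
The `|` and `&` operators in the step 2 examples can be read as set union and intersection over retrieved titles. Below is a minimal sketch of that idea, not the code in `decompose/`: the `Retrieved` class, the `TOY_INDEX` lookup, and the placeholder titles are all assumptions made for illustration, and a real run would query the Chroma index instead of a dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retrieved:
    """Ordered, de-duplicated document titles (hypothetical helper)."""
    titles: tuple

    def __or__(self, other):
        # Union: keep the left-hand order, then append unseen right-hand titles.
        seen = set(self.titles)
        extra = tuple(t for t in other.titles if t not in seen)
        return Retrieved(self.titles + extra)

    def __and__(self, other):
        # Intersection: keep only titles returned by both subqueries.
        allowed = set(other.titles)
        return Retrieved(tuple(t for t in self.titles if t in allowed))


# Toy stand-in for the vector index, with made-up placeholder titles.
TOY_INDEX = {
    "crops from Bolivia": ["Doc A", "Doc B", "Doc C"],
    "stoloniferous plants": ["Doc D", "Doc C", "Doc E"],
}


def retrieve(subquery, k):
    """Return up to k titles for a subquery (dictionary lookup, not similarity search)."""
    return Retrieved(tuple(TOY_INDEX.get(subquery, [])[:k]))


union = retrieve("crops from Bolivia", 100) | retrieve("stoloniferous plants", 100)
both = retrieve("crops from Bolivia", 100) & retrieve("stoloniferous plants", 100)
print(union.titles)
print(both.titles)
```

Because a union can return more than k titles and an intersection fewer, the size of the final prediction set varies per query, which is presumably why the results table reports |Pred| for the decomposition columns.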

## Data
The data comes directly from QUEST (https://github.com/google-research/language/tree/master/language/quest#examples).
- The embedded documents are: https://storage.googleapis.com/gresearch/quest/documents.zip
- `data/train_subset1.jsonl` contains 20 queries randomly sampled from QUEST's `train.jsonl`.
- `data/train_subset2.jsonl` contains 20 non-union queries randomly sampled from QUEST's `train.jsonl`.
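
Before embedding, each document is split according to one of the two settings in Experiment Setup. The sketch below shows a sliding window with the stated sizes (512 tokens, 80-token overlap) and the first-512-tokens variant. It is an illustration only: `chunk_tokens` is a hypothetical name, the input is assumed to be an already tokenized list, and `indexing/index_chroma.py` may handle tokenization and edge cases differently.

```python
def chunk_tokens(tokens, size=512, overlap=80, first_only=False):
    """Split a token list into overlapping windows, or keep only the first window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if first_only:
        # "First 512 tokens" setting: one chunk per document.
        return [tokens[:size]]
    step = size - overlap  # each new window starts 432 tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the document
    return chunks


# Integers stand in for token ids in this toy example.
tokens = list(range(1000))
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```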

## Retrieval
Retrieval is done with `semantic_retrieval/retrieve.py`.
- For decomposition, we retrieve titles only; the code for this is in `decompose/retrieve.py`
- For vector similarity, we retrieve (title, chunk) tuples (`INCLUDE_CHUNKS = true`)
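
When entire documents are indexed, several top-ranked chunks can belong to the same document, so the ranked (title, chunk) hits have to be collapsed into unique titles before taking the top k. A minimal sketch of that mapping follows, assuming hits arrive sorted by similarity; `chunks_to_documents` is a hypothetical helper, not a function from `semantic_retrieval/retrieve.py`, and the chunk texts are placeholders.

```python
def chunks_to_documents(ranked_chunks, k):
    """Collapse ranked (title, chunk) hits into the first k unique titles."""
    titles = []
    seen = set()
    for title, _chunk in ranked_chunks:
        if len(titles) >= k:
            break
        if title not in seen:
            # The first (best-ranked) chunk of a document decides its position.
            seen.add(title)
            titles.append(title)
    return titles


# Toy hits, best first; the chunk strings are placeholders.
hits = [
    ("Good Omens", "chunk 3"),
    ("Eric (novel)", "chunk 0"),
    ("Good Omens", "chunk 1"),
    ("Blood Price", "chunk 2"),
]
top_titles = chunks_to_documents(hits, k=2)
print(top_titles)
```

Ranking a document by its best chunk is one possible choice; aggregating scores across chunks would be an alternative.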

## Results
| | Retrieve (entire document) | Retrieve (first 512 tokens) | Decompose + Retrieve* (entire document) | Decompose + Retrieve* (first 512 tokens) |
|--------------|----------------------------|-----------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------|
| Recall @ 20 | 0.0886 | 0.1127 | - | |
| Recall @ 50 | 0.1663 | 0.1593 | 0.1560 (\|Pred\| = 61.10) | 0.1617 (\|Pred\| = 60.70) |
| Recall @ 100 | 0.2122 | 0.2250 | 0.2285 (\|Pred\| = 205.95) (k for subqueries increased in size) | 0.2157 (\|Pred\| = 209.30) (k for subqueries increased in size) |
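
For reference, a minimal sketch of how recall at k can be computed per query, assuming it is the fraction of gold documents (the `docs` field of a QUEST example) that appear among the first k predicted titles. The evaluation code is not part of this README, so `recall_at_k` and the averaging over queries are assumptions, and the prediction list below is made up.

```python
def recall_at_k(predicted, gold, k=None):
    """Fraction of gold titles found in the first k predictions (all predictions if k is None)."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    top = predicted if k is None else predicted[:k]
    return len(gold_set.intersection(top)) / len(gold_set)


# Gold titles taken from one QUEST example; the predictions are hypothetical.
gold = ["Sooty tern", "Bulwer's petrel", "Black noddy", "Bar-tailed godwit",
        "Masked booby", "Red-footed booby", "Roseate tern"]
predicted = ["Sooty tern", "Mallard", "Black noddy"]

score = recall_at_k(predicted, gold, k=20)
print(round(score, 4))
```

A table entry would then be the mean of this score over all queries in the subset.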

## Data
61 changes: 61 additions & 0 deletions baseline_quest/data/example.jsonl
@@ -0,0 +1,61 @@
{
"query": "Non Horror demon novels.",
"docs": ["List of the Lost", "Blood Price", "The Black Spider", "The Castle in the Forest", "The Devil in Love (novel)", "Melmoth the Wanderer", "Practical Demonkeeping", "Artemis Fowl and the Lost Colony", "The Black Tattoo", "Good Omens", "Eric (novel)"], "original_query": "<mark>Demon novels</mark> that are not <mark>Horror novel series</mark>",
"scores": null,
"metadata": {
"template": "_ that are not _",
"domain": "books",
"fluency": ["Fluent: It is clear, and grammatically correct."],
"meaning": ["Same Meaning: The paraphrased query asks for the same set of items as the original query. All the highlighted clauses are included."], "naturalness": ["Yes - A user could plausibly issue this query."],
"relevance_ratings": {
"List of the Lost": ["Definitely relevant"],
"Blood Price": ["Definitely relevant"],
"The Black Spider": ["Likely relevant"],
"The Castle in the Forest": ["Likely relevant"],
"The Devil in Love (novel)": ["Definitely relevant"],
"Melmoth the Wanderer": ["Definitely relevant"],
"Practical Demonkeeping": ["Likely relevant"],
"Artemis Fowl and the Lost Colony": ["Definitely relevant"],
"The Black Tattoo": ["Definitely relevant"],
"Good Omens": ["Likely relevant"],
"Eric (novel)": ["Definitely relevant"]},
"evidence_ratings": {"List of the Lost": ["Complete"],
"Blood Price": ["Complete"],
"The Black Spider": ["Partial"],
"The Castle in the Forest": ["Partial"],
"The Devil in Love (novel)": ["Complete"],
"Melmoth the Wanderer": ["Complete"],
"Practical Demonkeeping": ["Partial"],
"Artemis Fowl and the Lost Colony": ["Complete"],
"The Black Tattoo": ["Complete"],
"Good Omens": ["Partial"],
"Eric (novel)": ["Complete"]
},
"attributions": {
"List of the Lost": [{"Non Horror demon novels.": "The book is about a 1970s relay team in Boston who accidentally kill a homeless person, whose death brings misfortune to the team."}],
"Blood Price": [{"Non Horror demon novels.": "He tells her that the killer is a demon, that she actually did see him disappear."}],
"The Black Spider": [{"demon": "The hunter used his demonic powers to instill a curse in the kiss, which would ensure his payment."}],
"The Castle in the Forest": [{"Non Horror demon novels.": "'''''The Castle in the Forest''''' is the last novel by writer Norman Mailer, published in the year of his death, 2007. It is the story of Adolf Hitler's childhood as seen through the eyes of Dieter, a demon sent to put him on his destructive path. The novel explores the idea that Hitler was the product of incest. It forms a thematic contrast with the writer's immediately previous novel ''The Gospel According to the Son'' (1999), which deals with the early life of Jesus. It received a good deal of praise, including a glowing review from Lee Siegel of ''The New York Times Book Review'', and was the ''New York Times'' Bestseller for 2007."}],
"The Devil in Love (novel)": [{"Non Horror demon novels.": "Author of ''The Devil in Love, Jacques Cazotte''\n'''''The Devil in Love''''' (, 1772) is an occult romance by Jacques Cazotte which tells of a demon, or devil, who falls in love with a young Spanish nobleman named Don Alvaro, an amateur human dabbler, and attempts, in the guise of a young woman, to win his affections."}],
"Melmoth the Wanderer": [{"Non Horror demon novels.": "'''''Melmoth the Wanderer''''' is an 1820 Gothic novel by Irish playwright, novelist and clergyman Charles Maturin. The novel's titular character is a scholar who sold his soul to the devil in exchange for 150 extra years of life, and searches the world for someone who will take over the pact for him, in a manner reminiscent of the Wandering Jew."}],
"Practical Demonkeeping": [{"Non Horror demon novels": "His first novel, it deals with a demon from Hell and his master."}],
"Artemis Fowl and the Lost Colony": [{"Non Horror demon novels.": "In Barcelona, Spain, Artemis Fowl II and Butler, his bodyguard, wait for a demon. They suddenly encounter a demon who transports Artemis through time."}],
"The Black Tattoo": [{"Non Horror demon novels.": "'''''The Black Tattoo''''' is a young adult fantasy novel by Sam Enthoven, published in 2006. It deals with a boy, Charlie, becoming possessed by a demon that manifests itself in the form of a black tattoo on his body."}],
"Good Omens": [{"demon novels.": "There are attempts by the angel Aziraphale and the demon Crowley to sabotage the coming of the end times, having grown accustomed to their comfortable surroundings in England."}],
"Eric (novel)": [{"demon novels.": "the Demon King"}]
}
}
}

{
"query": "what are Oceanian realm fauna that are also both Birds of North America and Fauna of Europe", "docs": ["Sooty tern", "Bulwer's petrel", "Black noddy", "Bar-tailed godwit", "Masked booby", "Red-footed booby", "Roseate tern"],
"original_query": "<mark>Oceanian realm fauna</mark> that are also both <mark>Birds of North America</mark> and <mark>Fauna of Europe</mark>",
"scores": null,
"metadata": {
"template": "_ that are also both _ and _",
"relevance_ratings": null,
"evidence_ratings": null,
"attributions": null,
"domain": "animals"
}
}