This repository anonymously releases the codes and data for the paper Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection.
- [28/05/2026] Our code for Loong is released!
- [29/05/2026] Our paper is published on arXiv: arXiv:2605.30274!
Loong is a human-like long document translation agent that employs reasoning-driven adaptive context selection optimized via reinforcement learning to resolve context limitation and noise issues in DocMT-LLMs, achieving significant gains in document translation quality and ultra-long document stability.
Loong is equipped with a "3E" memory model to store and retrieve candidate context information as the segment-wise translation proceeds:
- Enssence: Summaries of the processed segments
- Exemplar: Source-Target sentence pairs of all previous contents
- Entity: Structured records of all appeared entities (people, location, events, etc.)
Loong actively filter these candidate context items with a three-step observe-and-act trajectory. In each step, Loong observe one kind of candidate items, determine whether it is helpful to translation through reasoning, and make its final information selection. The filtered information subset is then employed to guide the translation process of the current segment to produce the final output. We conduct parallel samping on each step, constructing preference data for DPO training to optimize Loong's context selection and utilization strategies, enhancing its overall performance.
| Directory | Contents |
|---|---|
./data |
Testing Data |
./scripts |
Shared Python modules used by sampling and inference |
./scripts/prompts |
Prompts for the agent |
./sample |
Training-data sampling launchers (run_sample.sh, run_process.sh, process.py) |
./train |
LLaMA-Factory recipes and launcher for SFT + DPO fine-tuning |
./inference |
Inference launcher (run_infer.sh) |
./evaluate |
Scripts for sCOMET, dCOMET and LLM-as-a-Judge |
./results |
Testing outputs |
./installation |
Pip requirement files for the four conda environments |
./LLaMA-Factory |
Vendored copy of LLaMA-Factory installed editably by the training environment |
./COMET |
Vendored copy of doc-mt-metrics/COMET installed editably by the dCOMET evaluation environment |
Loong relies on five separate conda environments β each isolates a stage of the
pipeline with conflicting dependency requirements (vLLM serving, the agent
client used for sampling/inference, LLaMA-Factory training, the sentence-level
COMET service, and the document-level COMET fork). The pip requirement files
for all five are provided under ./installation, and the two environments
that depend on editable installs (llama-factory and doccomet) point at
vendored copies of the source repositories shipped with this release.
| Environment | Purpose | Requirements |
|---|---|---|
vllm |
vLLM model serving (deployed via OpenAI-compatible API) | installation/requirements_vllm.txt |
loong |
Trajectory sampling and inference (client side) | installation/requirements_loong.txt |
llama-factory |
SFT + LoRA-DPO fine-tuning | installation/requirements_llama-factory.txt |
comet |
sCOMET service deployment and scoring | installation/requirements_comet.txt |
doccomet |
dCOMET (document-level COMET) evaluation | installation/requirements_doccomet.txt |
Create each environment and install its dependencies:
# vLLM serving
conda create -n vllm python=3.11 -y
conda activate vllm
pip install -r installation/requirements_vllm.txt
# Sampling and inference (client)
conda create -n loong python=3.11 -y
conda activate loong
pip install -r installation/requirements_loong.txt
# Training (editable install of vendored LLaMA-Factory)
conda create -n llama-factory python=3.11 -y
conda activate llama-factory
pip install -r installation/requirements_llama-factory.txt
# sCOMET service + evaluation
conda create -n comet python=3.10 -y
conda activate comet
pip install -r installation/requirements_comet.txt
# dCOMET (editable install of vendored doc-mt-metrics/COMET)
conda create -n doccomet python=3.9 -y
conda activate doccomet
pip install -r installation/requirements_doccomet.txtTraining data is produced in two steps: first sample raw trajectories with
run_sample.sh, then convert them into LLaMA-Factory SFT/DPO datasets with
run_process.sh. A COMET scoring service must be deployed beforehand, since the
sampling pipeline calls it on every trajectory.
Two services must be running before Step 1: a vLLM-served LLM (consumed by the
agent) and a COMET scorer (consumed for trajectory rewards). Their endpoints
must match the urls and comet_apis arrays used by run_sample.sh.
vLLM serving β run inside the vllm environment. Launch one process per
GPU you want to dedicate to serving; the endpoints are then passed to Step 1
through the urls array.
conda activate vllm
export LLM_MODEL=qwen # must match --served-model-name below; read by the sampling/inference clients
CUDA_VISIBLE_DEVICES=0 nohup vllm serve Qwen3-8B \
--host 0.0.0.0 \
--port 8010 \
--served-model-name "$LLM_MODEL" \
--enable-prefix-caching \
&> vllm.log &COMET serving β evaluate/deploy_comet.sh launches a COMET model server in
the background (logs to evaluate/deploy.log). The endpoint exposed here must
match the comet_apis entries used by Step 1 (and the comet_api argument
used later by sCOMET evaluation).
Set the following inside the script before running:
model_pathβ path to the COMET checkpoint to serve (e.g.,wmt22-comet-da/model.ckpt).portβ listening port (default8090).
conda activate comet
bash evaluate/deploy_comet.sh- sample/run_sample.sh
Runs the observe-and-act sampling pipeline over the News Commentary v18.1 source files
to collect training data.
The input directory is expected to hold per-language sub-directories whose files are
named ${src_lang}.${doc_id} and ${tgt_lang}.${doc_id} (e.g., en-zh/en.0,
en-zh/zh.0). Outputs are written under ${out_dir}/${language}/${doc_id}.
Set the following inside the script before running:
in_dirβ parent directory containing per-language sub-dirs of News Commentary v18.1 raw training files.out_dirβ output directory for sampled trajectories (default./results).languagesβ bash array of one or more translation directions, choices=[en-zh,en-de,en-fr,zh-en,de-en,fr-en].urlsβ bash array of one or more deployed vLLM model APIs (e.g.,127.0.0.1:8000); the worker pool size equals the number of URLs.comet_apisβ bash array of one or more deployed COMET model APIs (e.g.,127.0.0.1:8088); workers are sharded across them.tokenizer_pathβ path to the LLM's tokenizer.encoder_pathβ path to theall-distilroberta-v1checkpoint.window_sizeβ number of sentences per page within a document.
conda activate loong
LLM_MODEL=qwen bash sample/run_sample.sh # value must match the --served-model-name registered on the vLLM endpoints- sample/run_process.sh
Constructs the SFT and DPO training data from the trajectories sampled in Step 1.
Set the following inside the script before running:
input_pathβ directory holding the per-chapter trajectories emitted by Step 1.output_pathβ directory where the SFT/DPO JSON files anddataset_info.jsonare written.tokenizer_pathβ path to the LLM's tokenizer (used for length filtering).
bash sample/run_process.sh- train/run_train.sh
Fine-tunes the base LLM in two stages: full-parameter SFT followed by LoRA-based DPO, driven by LLaMA-Factory recipes.
Set the following inside the recipe files before running:
full_sft.yamlmodel_name_or_pathβ path to the pre-trained LLM checkpoint.deepspeedβ path tods_z3_config.json.dataset_dirβ path to the SFT training data.templateβqwenfor Qwen2.5,qwen3for Qwen3,llama3for Llama3.1.
lora_dpo.yamlmodel_name_or_pathβ path to the SFT checkpoint.dataset_dirβ path to the DPO training data.templateβqwenfor Qwen2.5,qwen3for Qwen3,llama3for Llama3.1.
conda activate llama-factory
bash train/run_train.shInference also depends on a vLLM-served LLM β point it at the fine-tuned
checkpoint produced by Model Tuning. Launch one process per GPU you want to
dedicate to serving, and pass the endpoint(s) to run_infer.sh via address.
conda activate vllm
export LLM_MODEL=qwen # must match --served-model-name below; read by the inference client
CUDA_VISIBLE_DEVICES=0 nohup vllm serve <path/to/finetuned/checkpoint> \
--host 0.0.0.0 \
--port 8010 \
--served-model-name "$LLM_MODEL" \
--enable-prefix-caching \
&> vllm.log &- inference/run_infer.sh
Translates each document under the given source test file with the trained Loong agent and writes hypotheses to the result directory.
Set the following inside the script before running:
addressβ deployed vLLM model API (e.g.,127.0.0.1:8000).languageβ translation direction, choices=[en-zh,en-de,en-fr,zh-en,de-en,fr-en].src_fileβ source test file.
conda activate loong
LLM_MODEL=qwen bash inference/run_infer.sh # value must match the --served-model-name registered on the vLLM endpointWe provide three evaluators under ./evaluate to assess translation quality from complementary perspectives:
sentence-level COMET (sCOMET), document-level COMET (dCOMET), and an LLM-as-a-Judge protocol.
All three scripts share the same I/O convention:
data_dirβ directory containing the source and reference files, named${src_lang}.${doc_id}and${tgt_lang}.${doc_id}(e.g.,en.0,zh.0).result_dirβ directory containing the hypothesis files produced by inference, named${tgt_lang}.${doc_id}.languageβ translation direction, choices=[en-zh,en-de,en-fr,zh-en,de-en,fr-en].
- evaluate/eval_scomet.sh
Posts sourceβhypothesisβreference triples to a deployed COMET service and reports the
per-document average and the overall average. Output is written to ${result_dir}/comet.txt.
bash eval_scomet.sh <data_dir> <result_dir> <language> [comet_api]
# comet_api defaults to 127.0.0.1:8090- evaluate/eval_dcomet_total.sh
Concatenates all documents and uses comet-score with document-id boundaries to compute
document-level COMET in a single pass. Output is appended to ${result_dir}/doccomet_total.txt.
The path to the wmt22-comet-da checkpoint can be provided either as a 4th positional
argument or through the COMET_MODEL_PATH environment variable.
conda activate doccomet
bash eval_dcomet_total.sh <data_dir> <result_dir> <language> <comet_model_path>- evaluate/eval_llm.sh
Prompts a judge LLM to score each hypothesis document holistically on five dimensions β
General Quality, Cohesion, Coherence, Style Consistency, and Terminology Consistency
(0β100 each, plus a Meta average). Output is written to ${result_dir}/llm_${model}.txt.
Set the judge model inside eval_llm.sh, and provide OpenAI-compatible credentials
either via environment variables or by editing the script:
# eval_llm.sh
model= # judge model name (e.g., gpt-4.1)
# environment variables (read by default)
export OPENAI_API_KEY=... # OpenAI-compatible API key
export OPENAI_BASE_URL=... # OpenAI-compatible endpointbash eval_llm.sh <data_dir> <result_dir> <language>If you find this repo useful, please cite our paper as:
@misc{wang2026loonghumanlikelongdocument,
title={Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection},
author={Yutong Wang and Xuebo Liu and Derek F. Wong and Zhilin Li and Rongqing Jiang and Min Zhang and Shimin Tao and Daimeng Wei and Min Zhang},
year={2026},
eprint={2605.30274},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.30274},
}
This codebase builds on the following open-source projects, and we thank their authors and maintainers for releasing them:
- vLLM β high-throughput LLM serving used to host the agent's translation backbone.
- LlamaFactory β training framework used for SFT and LoRA-DPO fine-tuning.
- uvicorn β ASGI server that hosts our COMET evaluation endpoint.
- COMET β sentence-level translation quality metric.
- doc-mt-metrics β document-level COMET extension used for dCOMET evaluation.

