To quickly evaluate the performance of a model across different bootcamp environments, you can use the `run_eval.py` script. It supports multiple configuration options to accommodate a range of testing needs.
Below is a complete example command demonstrating how to run the evaluation script:
```bash
cd InternBootcamp
python examples/unittests/run_eval.py \
    --url http://127.0.0.1:8000/v1 \
    --api_key EMPTY \
    --model_name r1_32B \
    --test_dir /path/to/test_dir \
    --max_concurrent_requests 128 \
    --template r1 \
    --max_tokens 32768 \
    --temperature 0 \
    --timeout 6000 \
    --api_mode completion \
    --max_retries 16 \
    --max_retrying_delay 60 \
    --resume
```

Here are the main parameters supported by the script and their meanings:
| Parameter Name | Type | Example Value | Description |
|---|---|---|---|
| `--url` | str | `http://127.0.0.1:8000/v1` | Base URL for the OpenAI API. |
| `--api_key` | str | `EMPTY` | API key required to access the model service. Default is `EMPTY`. |
| `--model_name` | str | `r1_32B` | The name of the model used, e.g., `r1_32B` or other custom model names. |
| `--test_dir` | str | `/path/to/test_dir` | Path to the directory containing test data (should include JSONL files). |
| `--max_concurrent_requests` | int | `128` | Maximum number of concurrent requests allowed globally. |
| `--template` | str | `r1` | Predefined conversation template (e.g., `r1`, `qwen`, `internthinker`, `chatml`). |
| `--max_tokens` | int | `32768` | Maximum number of tokens generated by the model. |
| `--temperature` | float | `0` | Controls randomness in text generation; lower values yield more deterministic results. |
| `--timeout` | int | `6000` | Request timeout in milliseconds. |
| `--api_mode` | str | `completion` | API mode; options are `completion` or `chat_completion`. |
| `--sys_prompt` | str | `"You are an expert reasoner..."` | System prompt content; only effective when `--api_mode` is `chat_completion`. |
| `--max_retries` | int | `16` | Maximum number of retries per failed request. |
| `--max_retrying_delay` | int | `60` | Maximum delay between retries, in seconds. |
| `--resume` | bool | `true` | Resume from a previous run. |
| `--check_model_url` | bool | `true` | Check if the model service URL is available before starting the evaluation. |
- `--sys_prompt` is only effective if `--api_mode` is set to `chat_completion`.
- `--template` is only effective if `--api_mode` is set to `completion`.
- Valid values for `--template` include: `r1`, `qwen`, `internthinker`, `chatml` (from the predefined `TEMPLATE_MAP`).
- If `--sys_prompt` is not provided, the default system prompt from the template will be used (if any).
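To illustrate how a conversation template turns a raw prompt into the text sent in `completion` mode, here is a purely hypothetical sketch. The real `TEMPLATE_MAP` in the repository defines its own formats; only the `chatml` entry below follows the widely known ChatML convention, and the other entries are placeholders.

```python
# Hypothetical illustration of a template map. The actual TEMPLATE_MAP in the
# repository may differ in both structure and contents.
TEMPLATE_MAP = {
    # ChatML-style wrapping (a widely used convention, shown for illustration)
    "chatml": "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n",
    # Placeholder entries; the repository defines the real formats
    "r1": "{prompt}",
    "qwen": "{prompt}",
    "internthinker": "{prompt}",
}

def apply_template(name: str, prompt: str) -> str:
    """Wrap a raw prompt with the named conversation template."""
    return TEMPLATE_MAP[name].format(prompt=prompt)
```

In `chat_completion` mode this wrapping is unnecessary, since the server applies the model's chat template itself; that is consistent with `--template` only taking effect in `completion` mode.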
Evaluation results will be saved under the directory:

```
examples/unittests/output/{model_name}_{test_dir}_{timestamp}
```
The output includes:

- **Detailed Results**:
  - Each JSONL file's result is saved in `output/details/`, named after the original JSONL file.
  - Each record contains the following fields:
    - `id`: Sample ID.
    - `prompt`: Input prompt.
    - `output`: Model-generated output.
    - `output_len`: Length of the output in tokens.
    - `ground_truth`: Ground truth answer.
    - `score`: Score calculated by the `verify_score` method.
    - `extracted_output`: Extracted output via the `extract_output` method.
- **Metadata**:
  - Metadata including average score and average output length per bootcamp is saved in `output/meta.jsonl`.
- **Summary Report**:
  - A summary report is saved as an Excel file at `output/{model_name}_scores.xlsx`, including:
    - Average score and output length per bootcamp.
    - Overall average score and output length across all bootcamps.
- **Progress Log**:
  - Progress information is logged in `output/progress.log`, showing real-time progress and estimated remaining time for each dataset.
- **Parameter Configuration**:
  - The full configuration used in the current run is saved in `output/eval_args.json` for experiment reproducibility.
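Since each details file is a JSONL file whose records carry `score` and `output_len` fields, you can post-process results yourself. Here is a minimal sketch, assuming one JSON record per line; the function name `summarize_details` is ours, not part of the script.

```python
import json
from pathlib import Path

def summarize_details(details_dir: str) -> dict:
    """Compute per-file average score and output length from output/details/."""
    summary = {}
    for path in sorted(Path(details_dir).glob("*.jsonl")):
        records = [json.loads(line) for line in path.open() if line.strip()]
        if not records:
            continue
        summary[path.stem] = {
            "avg_score": sum(r["score"] for r in records) / len(records),
            "avg_output_len": sum(r["output_len"] for r in records) / len(records),
        }
    return summary
```

This reproduces the per-bootcamp averages the script itself writes to `output/meta.jsonl`, which is a convenient cross-check when aggregating results from multiple runs.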
- **Concurrency Settings**:
  - Adjust `--max_concurrent_requests` based on machine capabilities and the size of the test set to avoid resource exhaustion due to excessive concurrency.
- **URL Health Check**:
  - Before starting the evaluation, the script automatically checks whether the model service is running and has registered the specified `model_name`.
  - If the service is not ready, it will wait up to 60 minutes (default), retrying every 60 seconds.
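The health check can be understood as polling the OpenAI-compatible `GET {url}/models` endpoint until the model name appears. Here is a minimal sketch of that idea; the function name and defaults are ours, and the script's actual implementation may differ.

```python
import json
import time
import urllib.request

def wait_for_model(base_url: str, model_name: str,
                   max_wait_s: int = 3600, interval_s: int = 60) -> bool:
    """Poll the OpenAI-compatible /models endpoint until model_name is served."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
                served = {m["id"] for m in json.load(resp).get("data", [])}
            if model_name in served:
                return True
        except OSError:
            pass  # service not reachable yet; retry after the interval
        time.sleep(interval_s)
    return False
```

The defaults of one hour total and one attempt per minute mirror the behavior described above.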
- **Error Handling Mechanism**:
  - Each request can be retried up to `--max_retries` times using exponential backoff (up to `--max_retrying_delay` seconds).
  - If all retries fail, the script raises an exception and terminates processing of the current sample.
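The retry policy described above can be sketched as follows. This is an illustration of exponential backoff capped at `max_retrying_delay`, not the script's exact code; `call_with_retries` is a name we introduce here.

```python
import time

def call_with_retries(request_fn, max_retries: int = 16,
                      max_retrying_delay: float = 60.0):
    """Retry request_fn with exponential backoff capped at max_retrying_delay."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries:
                raise  # all retries exhausted: propagate, skipping this sample
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_retrying_delay.
            time.sleep(min(2 ** attempt, max_retrying_delay))
```

With the example settings (`--max_retries 16`, `--max_retrying_delay 60`), delays grow 1, 2, 4, ... seconds until they hit the 60-second cap and stay there.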
After execution, the output directory structure looks like this:

```
examples/unittests/output/
└── {model_name}_{test_dir}_{timestamp}/
    ├── details/
    │   ├── test_file1.jsonl
    │   ├── test_file2.jsonl
    │   └── ...
    ├── meta.jsonl
    ├── eval_args.json
    ├── progress.log
    └── {model_name}_scores.xlsx
```