To quickly evaluate the performance of a model across different bootcamp environments, you can use the `run_eval.py` script. It supports multiple configuration options to accommodate a range of testing needs.
Below is a complete example command demonstrating how to run the evaluation script:
```bash
cd InternBootcamp
python examples/unittests/run_eval.py \
    --url http://127.0.0.1:8000/v1 \
    --api_key EMPTY \
    --model_name r1_32B \
    --test_dir /path/to/test_dir \
    --max_concurrent_requests 128 \
    --template r1 \
    --max_tokens 32768 \
    --temperature 0 \
    --timeout 6000 \
    --api_mode completion \
    --max_retries 16 \
    --max_retrying_delay 60 \
    --resume
```

Here are the main parameters supported by the script and their meanings:
| Parameter Name | Type | Example Value | Description |
|---|---|---|---|
| `--url` | str | `http://127.0.0.1:8000/v1` | Base URL for the OpenAI API. |
| `--api_key` | str | `EMPTY` | API key required to access the model service. Default is `EMPTY`. |
| `--model_name` | str | `r1_32B` | The name of the model used, e.g., `r1_32B` or other custom model names. |
| `--test_dir` | str | `/path/to/test_dir` | Path to the directory containing test data (should include JSONL files). |
| `--max_concurrent_requests` | int | `128` | Maximum number of concurrent requests allowed globally. |
| `--template` | str | `r1` | Predefined conversation template (e.g., `r1`, `qwen`, `internthinker`, `chatml`). |
| `--max_tokens` | int | `32768` | Maximum number of tokens generated by the model. |
| `--temperature` | float | `0` | Controls randomness in text generation; lower values yield more deterministic results. |
| `--timeout` | int | `6000` | Request timeout in milliseconds. |
| `--api_mode` | str | `completion` | API mode; options are `completion` or `chat_completion`. |
| `--sys_prompt` | str | `"You are an expert reasoner..."` | System prompt content; only effective when `--api_mode` is `chat_completion`. |
| `--max_retries` | int | `16` | Maximum number of retries per failed request. |
| `--max_retrying_delay` | int | `60` | Maximum delay between retries, in seconds. |
| `--resume` | bool | `true` | Resume from a previous run. |
| `--check_model_url` | bool | `true` | Check if the model service URL is available before starting the evaluation. |
- `--sys_prompt` is only effective if `--api_mode` is set to `chat_completion`.
- `--template` is only effective if `--api_mode` is set to `completion`.
- Valid values for `--template` include: `r1`, `qwen`, `internthinker`, `chatml` (from the predefined `TEMPLATE_MAP`).
- If `--sys_prompt` is not provided, the default system prompt from the template will be used (if any).
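To illustrate how a conversation template turns a raw prompt into the text sent in `completion` mode, here is a purely hypothetical sketch. The real `TEMPLATE_MAP` in the repository defines its own formats; only the `chatml` entry below follows the widely known ChatML convention, and the other entries are placeholders.

```python
# Hypothetical illustration of a template map. The actual TEMPLATE_MAP in the
# repository may differ in both structure and contents.
TEMPLATE_MAP = {
    # ChatML-style wrapping (a widely used convention, shown for illustration)
    "chatml": "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n",
    # Placeholder entries; the repository defines the real formats
    "r1": "{prompt}",
    "qwen": "{prompt}",
    "internthinker": "{prompt}",
}

def apply_template(name: str, prompt: str) -> str:
    """Wrap a raw prompt with the named conversation template."""
    return TEMPLATE_MAP[name].format(prompt=prompt)
```

In `chat_completion` mode this wrapping is unnecessary, since the server applies the model's chat template itself; that is consistent with `--template` only taking effect in `completion` mode.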
Evaluation results will be saved under the directory:

```
examples/unittests/output/{model_name}_{test_dir}_{timestamp}
```
The output includes:

- **Detailed Results**:
  - Each JSONL file's result is saved in `output/details/`, named after the original JSONL file.
  - Each record contains the following fields:
    - `id`: Sample ID.
    - `prompt`: Input prompt.
    - `output`: Model-generated output.
    - `output_len`: Length of the output in tokens.
    - `ground_truth`: Ground truth answer.
    - `score`: Score calculated by the `verify_score` method.
    - `extracted_output`: Extracted output via the `extract_output` method.
- **Metadata**:
  - Metadata including average score and average output length per bootcamp is saved in `output/meta.jsonl`.
- **Summary Report**:
  - A summary report is saved as an Excel file at `output/{model_name}_scores.xlsx`, including:
    - Average score and output length per bootcamp.
    - Overall average score and output length across all bootcamps.
- **Progress Log**:
  - Progress information is logged in `output/progress.log`, showing real-time progress and estimated remaining time for each dataset.
- **Parameter Configuration**:
  - The full configuration used in the current run is saved in `output/eval_args.json` for experiment reproducibility.
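Since each details file is a JSONL file whose records carry `score` and `output_len` fields, you can post-process results yourself. Here is a minimal sketch, assuming one JSON record per line; the function name `summarize_details` is ours, not part of the script.

```python
import json
from pathlib import Path

def summarize_details(details_dir: str) -> dict:
    """Compute per-file average score and output length from output/details/."""
    summary = {}
    for path in sorted(Path(details_dir).glob("*.jsonl")):
        records = [json.loads(line) for line in path.open() if line.strip()]
        if not records:
            continue
        summary[path.stem] = {
            "avg_score": sum(r["score"] for r in records) / len(records),
            "avg_output_len": sum(r["output_len"] for r in records) / len(records),
        }
    return summary
```

This reproduces the per-bootcamp averages the script itself writes to `output/meta.jsonl`, which is a convenient cross-check when aggregating results from multiple runs.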
- **Concurrency Settings**:
  - Adjust `--max_concurrent_requests` based on machine capabilities and the size of the test set to avoid resource exhaustion due to excessive concurrency.
- **URL Health Check**:
  - Before starting the evaluation, the script automatically checks whether the model service is running and has registered the specified `model_name`.
  - If the service is not ready, it will wait up to 60 minutes (default), retrying every 60 seconds.
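The health check can be understood as polling the OpenAI-compatible `GET {url}/models` endpoint until the model name appears. Here is a minimal sketch of that idea; the function name and defaults are ours, and the script's actual implementation may differ.

```python
import json
import time
import urllib.request

def wait_for_model(base_url: str, model_name: str,
                   max_wait_s: int = 3600, interval_s: int = 60) -> bool:
    """Poll the OpenAI-compatible /models endpoint until model_name is served."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
                served = {m["id"] for m in json.load(resp).get("data", [])}
            if model_name in served:
                return True
        except OSError:
            pass  # service not reachable yet; retry after the interval
        time.sleep(interval_s)
    return False
```

The defaults of one hour total and one attempt per minute mirror the behavior described above.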
- **Error Handling Mechanism**:
  - Each request can be retried up to `--max_retries` times using exponential backoff (up to `--max_retrying_delay` seconds).
  - If all retries fail, the script raises an exception and terminates processing of the current sample.
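The retry policy described above can be sketched as follows. This is an illustration of exponential backoff capped at `max_retrying_delay`, not the script's exact code; `call_with_retries` is a name we introduce here.

```python
import time

def call_with_retries(request_fn, max_retries: int = 16,
                      max_retrying_delay: float = 60.0):
    """Retry request_fn with exponential backoff capped at max_retrying_delay."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries:
                raise  # all retries exhausted: propagate, skipping this sample
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_retrying_delay.
            time.sleep(min(2 ** attempt, max_retrying_delay))
```

With the example settings (`--max_retries 16`, `--max_retrying_delay 60`), delays grow 1, 2, 4, ... seconds until they hit the 60-second cap and stay there.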
After execution, the output directory structure looks like this:

```
examples/unittests/output/
└── {model_name}_{test_dir}_{timestamp}/
    ├── details/
    │   ├── test_file1.jsonl
    │   ├── test_file2.jsonl
    │   └── ...
    ├── meta.jsonl
    ├── eval_args.json
    ├── progress.log
    └── {model_name}_scores.xlsx
```