InternBootcamp Evaluation Guide

To quickly evaluate a model's performance across different bootcamp environments, use the run_eval.py script. It supports multiple configuration options and is flexible enough to accommodate a variety of testing needs.


Example Execution Command

Below is a complete example command demonstrating how to run the evaluation script:

cd InternBootcamp
python examples/unittests/run_eval.py \
    --url http://127.0.0.1:8000/v1 \
    --api_key EMPTY \
    --model_name r1_32B \
    --test_dir /path/to/test_dir \
    --max_concurrent_requests 128 \
    --template r1 \
    --max_tokens 32768 \
    --temperature 0 \
    --timeout 6000 \
    --api_mode completion \
    --max_retries 16 \
    --max_retrying_delay 60 \
    --resume

Parameter Description

Here are the main parameters supported by the script and their meanings:

| Parameter | Type | Example | Description |
| --- | --- | --- | --- |
| --url | str | http://127.0.0.1:8000/v1 | Base URL of the OpenAI-compatible API. |
| --api_key | str | EMPTY | API key for the model service. Defaults to EMPTY. |
| --model_name | str | r1_32B | Name of the model to evaluate, e.g. r1_32B or another custom model name. |
| --test_dir | str | /path/to/test_dir | Directory containing the test data (JSONL files). |
| --max_concurrent_requests | int | 128 | Global cap on the number of concurrent requests. |
| --template | str | r1 | Predefined conversation template (r1, qwen, internthinker, chatml). |
| --max_tokens | int | 32768 | Maximum number of tokens the model may generate. |
| --temperature | float | 0 | Sampling temperature; lower values yield more deterministic output. |
| --timeout | int | 6000 | Request timeout in milliseconds. |
| --api_mode | str | completion | API mode: completion or chat_completion. |
| --sys_prompt | str | "You are an expert reasoner..." | System prompt; only effective when --api_mode is chat_completion. |
| --max_retries | int | 16 | Maximum number of retries per failed request. |
| --max_retrying_delay | int | 60 | Maximum delay between retries, in seconds. |
| --resume | bool | true | Resume from a previous run instead of starting over. |
| --check_model_url | bool | true | Check that the model service URL is reachable before starting the evaluation. |

Parameter Relationships
  • --sys_prompt is only effective if --api_mode is set to chat_completion.
  • --template is only effective if --api_mode is set to completion.
  • Valid values for --template include: r1, qwen, internthinker, chatml (from predefined TEMPLATE_MAP).
  • If --sys_prompt is not provided, the default system prompt from the template will be used (if any).
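The mode-dependent behavior above can be sketched roughly as follows. This is a minimal illustration, not the actual code from run_eval.py: the template strings shown here are placeholders, and only the documented relationships (template in completion mode, sys_prompt in chat_completion mode) are taken from this guide.

```python
# Illustrative stand-in for the real TEMPLATE_MAP in run_eval.py;
# the actual template strings will differ.
TEMPLATE_MAP = {
    "r1": "<|user|>{prompt}<|assistant|>",
    "chatml": "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n",
}

def build_request(prompt, api_mode, template=None, sys_prompt=None):
    if api_mode == "completion":
        # --template is only consulted in completion mode.
        return {"prompt": TEMPLATE_MAP[template].format(prompt=prompt)}
    # chat_completion mode: --sys_prompt is only consulted here.
    messages = []
    if sys_prompt:
        messages.append({"role": "system", "content": sys_prompt})
    messages.append({"role": "user", "content": prompt})
    return {"messages": messages}
```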

Output Results

Evaluation results will be saved under the directory:
examples/unittests/output/{model_name}_{test_dir}_{timestamp}
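The naming scheme can be reconstructed roughly as below; note this is a hypothetical sketch, and both the timestamp format and the use of the test directory's basename are assumptions not confirmed by the script.

```python
import os
import time

def output_dir(model_name, test_dir, root="examples/unittests/output"):
    # Assumed timestamp format; run_eval.py may use a different one.
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    # Assumes only the basename of --test_dir appears in the directory name.
    return os.path.join(root, f"{model_name}_{os.path.basename(test_dir)}_{timestamp}")
```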

The output includes:

  1. Detailed Results:

    • Each JSONL file's result is saved in output/details/, named after the original JSONL file.
    • Each record contains the following fields:
      • id: Sample ID.
      • prompt: Input prompt.
      • output: Model-generated output.
      • output_len: Length of the output in tokens.
      • ground_truth: Ground truth answer.
      • score: Score calculated by verify_score method.
      • extracted_output: Extracted output via extract_output method.
  2. Metadata:

    • Metadata including average score and average output length per bootcamp is saved in output/meta.jsonl.
  3. Summary Report:

    • A summary report is saved as an Excel file at output/{model_name}_scores.xlsx, including:
      • Average score and output length per bootcamp.
      • Overall average score and output length across all bootcamps.
  4. Progress Log:

    • Progress information is logged in output/progress.log, showing real-time progress and estimated remaining time for each dataset.
  5. Parameter Configuration:

    • The full configuration used in the current run is saved in output/eval_args.json for experiment reproducibility.
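The per-bootcamp averages reported in meta.jsonl can be recomputed from the detailed results alone. The sketch below is a rough reconstruction under the record layout documented above (score and output_len fields); it is not the script's own aggregation code.

```python
import json
from pathlib import Path

def summarize_details(details_dir):
    """Average score and output length per bootcamp from output/details/*.jsonl."""
    summary = {}
    for path in Path(details_dir).glob("*.jsonl"):
        records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
        if not records:
            continue
        summary[path.stem] = {
            "avg_score": sum(r["score"] for r in records) / len(records),
            "avg_output_len": sum(r["output_len"] for r in records) / len(records),
        }
    return summary
```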

Notes

  1. Concurrency Settings:

    • Adjust --max_concurrent_requests based on machine capabilities and the size of the test set to avoid resource exhaustion due to excessive concurrency.
  2. URL Health Check:

    • Before starting the evaluation, the script automatically checks whether the model service is running and has registered the specified model_name.
    • If the service is not ready, it will wait up to 60 minutes (default), retrying every 60 seconds.
  3. Error Handling Mechanism:

    • Each request can be retried up to --max_retries times using exponential backoff (up to --max_retrying_delay seconds).
    • If all retries fail, the script raises an exception and terminates processing of the current sample.
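The retry behavior described above amounts to capped exponential backoff. A minimal sketch, assuming a synchronous request function and no jitter (the actual implementation in run_eval.py is asynchronous and may differ in detail):

```python
import time

def call_with_retries(request_fn, max_retries=16, max_retrying_delay=60):
    """Retry request_fn with exponential backoff capped at max_retrying_delay seconds."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries:
                raise  # all retries exhausted: give up on this sample
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to the cap.
            time.sleep(min(2 ** attempt, max_retrying_delay))
```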

Example Output Directory Structure

After execution, the output directory structure looks like this:

examples/unittests/output/
└── {model_name}_{test_dir}_{timestamp}/
    ├── details/
    │   ├── test_file1.jsonl
    │   ├── test_file2.jsonl
    │   └── ...
    ├── meta.jsonl
    ├── eval_args.json
    ├── progress.log
    └── {model_name}_scores.xlsx