Unverified Commit 2387c22b authored by Xiaoyu Zhang, committed by GitHub

Ci monitor support performance (#10965)

parent 592ddf37
@@ -32,7 +32,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install requests matplotlib pandas
- name: Run CI Analysis
env:
@@ -43,9 +43,20 @@ jobs:
cd scripts/ci_monitor
python ci_analyzer.py --token $GITHUB_TOKEN --limit ${{ github.event.inputs.limit || '1000' }} --output ci_analysis_$(date +%Y%m%d_%H%M%S).json
- name: Run Performance Analysis
env:
GITHUB_TOKEN: ${{ secrets.GH_PAT_FOR_NIGHTLY_CI }}
PYTHONUNBUFFERED: 1
PYTHONIOENCODING: utf-8
run: |
cd scripts/ci_monitor
python ci_analyzer_perf.py --token $GITHUB_TOKEN --limit 500 --output-dir performance_tables_$(date +%Y%m%d_%H%M%S)
- name: Upload Analysis Results
uses: actions/upload-artifact@v4
with:
name: ci-analysis-results-${{ github.run_number }}
path: |
scripts/ci_monitor/ci_analysis_*.json
scripts/ci_monitor/performance_tables_*
retention-days: 30
# SGLang CI Monitor
> **Note**: This README.md is primarily generated by Claude 4 with some manual adjustments.
A comprehensive toolkit to analyze CI failures and performance trends for the SGLang project. This toolkit includes two main tools:
1. **CI Analyzer** (`ci_analyzer.py`): Analyzes CI failures and provides detailed failure pattern analysis
2. **Performance Analyzer** (`ci_analyzer_perf.py`): Tracks performance metrics over time and generates trend charts
## Features
### CI Analyzer (`ci_analyzer.py`)
- **Simple Analysis**: Analyze recent CI runs and identify failure patterns
- **Category Classification**: Automatically categorize failures by type (unit-test, performance, etc.); a keyword-matching sketch follows this list
- **Pattern Recognition**: Identify common failure patterns (timeouts, build failures, etc.)
- **CI Links**: Direct links to recent failed CI runs for detailed investigation
- **Last Success Tracking**: Track the last successful run for each failed job with PR information
- **JSON Export**: Export detailed analysis data to JSON format
- **Automated Monitoring**: GitHub Actions workflow for continuous CI monitoring
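The category classification and failure-pattern recognition above amount to keyword and regex matching over job names and log text. A minimal sketch of that idea, assuming hypothetical rule tables (`CATEGORY_KEYWORDS` and `FAILURE_PATTERNS` are illustrative, not the analyzer's actual rules):

```python
import re

# Hypothetical rule tables; the real analyzer's rules may differ.
CATEGORY_KEYWORDS = {
    "unit-test": ["unit-test", "pytest"],
    "performance": ["performance-test", "bench"],
    "accuracy": ["accuracy", "eval"],
}
FAILURE_PATTERNS = {
    "Timeout": re.compile(r"timed?\s*out", re.IGNORECASE),
    "Build Failure": re.compile(r"build\s+(failed|error)", re.IGNORECASE),
    "OOM": re.compile(r"out of memory", re.IGNORECASE),
}

def categorize_job(job_name: str) -> str:
    """Map a CI job name to a coarse category via keyword matching."""
    name = job_name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in name for kw in keywords):
            return category
    return "other"

def classify_failure(log_text: str) -> list[str]:
    """Return the label of every failure pattern that matches a job log."""
    return [label for label, rx in FAILURE_PATTERNS.items() if rx.search(log_text)]
```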
### Performance Analyzer (`ci_analyzer_perf.py`)
- **Performance Tracking**: Monitor performance metrics across CI runs over time
- **Automated Chart Generation**: Generate time-series charts for each performance metric
- **Multi-Test Support**: Track performance for all test types (throughput, latency, accuracy)
- **CSV Export**: Export performance data in structured CSV format
- **Trend Analysis**: Visualize performance trends with the generated time-series charts
- **Comprehensive Metrics**: Track output throughput, E2E latency, TTFT, accept length, and more; a log-extraction sketch follows this list
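Metrics like these have to be scraped out of raw CI job logs before they can be tabulated. A minimal sketch of that extraction step, assuming the metrics appear in logs as labeled numbers (the regexes here are illustrative assumptions, not the script's actual parsing rules):

```python
import re

# Illustrative patterns; the actual log format may differ.
METRIC_PATTERNS = {
    "output_throughput_token_s": re.compile(r"output throughput[^\d]*([\d.]+)", re.IGNORECASE),
    "median_e2e_latency_ms": re.compile(r"median e2e latency[^\d]*([\d.]+)", re.IGNORECASE),
    "median_ttft_ms": re.compile(r"median ttft[^\d]*([\d.]+)", re.IGNORECASE),
    "accept_length": re.compile(r"accept length[^\d]*([\d.]+)", re.IGNORECASE),
}

def extract_metrics(log_text: str) -> dict[str, float]:
    """Pull every recognized performance metric out of one job's log."""
    found = {}
    for name, pattern in METRIC_PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            found[name] = float(match.group(1))
    return found
```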
### Common Features
- **Automated Monitoring**: GitHub Actions workflow for continuous CI and performance monitoring
## Installation
### For CI Analyzer
No additional dependencies are required beyond the Python standard library and `requests`:
```bash
pip install requests
```
### For Performance Analyzer
Additional dependencies required for chart generation:
```bash
pip install requests matplotlib pandas
```
## Usage
### CI Analyzer
#### Basic Usage
```bash
# Replace YOUR_GITHUB_TOKEN with your actual token from https://github.com/settings/tokens
python ci_analyzer.py --token YOUR_GITHUB_TOKEN
```
#### Advanced Usage
```bash
# Analyze last 1000 runs
@@ -39,16 +65,45 @@
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 1000
python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_analysis.json
```
### Performance Analyzer
#### Basic Usage
```bash
# Analyze performance trends from recent CI runs
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN
```
#### Advanced Usage
```bash
# Analyze last 1000 PR Test runs
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 1000
# Custom output directory
python ci_analyzer_perf.py --token YOUR_GITHUB_TOKEN --limit 500 --output-dir my_performance_data
```
**Important**: Make sure your GitHub token has the `repo` and `workflow` scopes; otherwise you'll get 404 errors.
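If you would rather verify the token up front than debug 404s, classic personal access tokens echo their granted scopes back in the `X-OAuth-Scopes` response header. A small check sketched under that assumption (fine-grained tokens report permissions differently):

```python
import requests

def check_token_scopes(token: str) -> None:
    """Print the scopes GitHub reports for a classic personal access token."""
    resp = requests.get(
        "https://api.github.com/user",
        headers={"Authorization": f"token {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    scopes = resp.headers.get("X-OAuth-Scopes", "")
    print(f"Granted scopes: {scopes or '(none reported)'}")
    for needed in ("repo", "workflow"):
        if needed not in scopes:
            print(f"WARNING: missing '{needed}' scope; expect 404 errors")

# check_token_scopes("YOUR_GITHUB_TOKEN")
```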
## Parameters
### CI Analyzer Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 100 | Number of CI runs to analyze |
| `--output` | ci_analysis.json | Output JSON file for detailed data |
### Performance Analyzer Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--token` | Required | GitHub Personal Access Token |
| `--limit` | 100 | Number of PR Test runs to analyze |
| `--output-dir` | performance_tables | Output directory for CSV tables and PNG charts |
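For reference, a minimal argparse sketch consistent with the performance-analyzer flags in the table above (illustrative; the real script's option handling may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI flags mirroring the parameter table above."""
    parser = argparse.ArgumentParser(description="SGLang CI performance analyzer")
    parser.add_argument("--token", required=True,
                        help="GitHub Personal Access Token")
    parser.add_argument("--limit", type=int, default=100,
                        help="Number of PR Test runs to analyze")
    parser.add_argument("--output-dir", default="performance_tables",
                        help="Output directory for CSV tables and PNG charts")
    return parser

# Example: build_parser().parse_args(["--token", "YOUR_GITHUB_TOKEN", "--limit", "500"])
```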
## Getting GitHub Token
1. Go to [GitHub Settings > Personal Access Tokens](https://github.com/settings/tokens)
@@ -62,15 +117,15 @@ python ci_analyzer.py --token YOUR_GITHUB_TOKEN --limit 500 --output my_analysis
## Output
The toolkit provides:
### CI Analyzer Output
#### Console Output
- Overall statistics (total runs, success rate, etc.)
- Category failure breakdown
- Most frequently failed jobs (Top 50) with direct CI links
- Failure pattern analysis
#### JSON Export
Detailed analysis data including:
- Complete failure statistics
- Job failure counts
@@ -78,8 +133,51 @@ Detailed analysis data including:
- Failure patterns
- Recent failure details
### Performance Analyzer Output
#### Console Output
- Performance data collection progress
- Summary statistics of collected tests and records
- Generated file locations (CSV tables and PNG charts)
#### File Outputs
- **CSV Tables**: Structured performance data with columns:
- `created_at`: Timestamp of the CI run
- `run_number`: GitHub Actions run number
- `pr_number`: Pull request number (if applicable)
- `author`: Developer who triggered the run
- `head_sha`: Git commit SHA
- Performance metrics (varies by test type):
- `output_throughput_token_s`: Output throughput in tokens/second
- `median_e2e_latency_ms`: Median end-to-end latency in milliseconds
- `median_ttft_ms`: Median time-to-first-token in milliseconds
- `accept_length`: Accept length for speculative decoding tests
- `url`: Direct link to the GitHub Actions run
- **PNG Charts**: Time-series visualization charts for each metric (a plotting sketch follows this list):
- X-axis: Time (MM-DD HH:MM format)
- Y-axis: Performance metric values
- File naming: `{test_name}_{metric_name}.png`
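A minimal sketch of how one such chart could be rendered from a generated CSV with pandas and matplotlib (paths and column names follow the conventions above; the real script's plotting code may differ):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, e.g. for CI machines
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Load one generated table; columns follow the layout documented above.
csv_path = "performance_tables/performance-test-1-gpu-part-1_summary/test_bs1_default.csv"
df = pd.read_csv(csv_path, parse_dates=["created_at"]).sort_values("created_at")

metric = "output_throughput_token_s"
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(df["created_at"], df[metric], marker="o", linewidth=1)
ax.set_xlabel("Time")
ax.set_ylabel(metric)
ax.set_title(f"test_bs1_default: {metric}")
ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d %H:%M"))  # MM-DD HH:MM ticks
fig.autofmt_xdate()
fig.savefig(f"test_bs1_default_{metric}.png", dpi=150, bbox_inches="tight")
```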
#### Directory Structure
```
performance_tables/
├── performance-test-1-gpu-part-1_summary/
│ ├── test_bs1_default.csv
│ ├── test_bs1_default_output_throughput_token_s.png
│ ├── test_online_latency_default.csv
│ ├── test_online_latency_default_median_e2e_latency_ms.png
│ └── ...
├── performance-test-1-gpu-part-2_summary/
│ └── ...
└── performance-test-2-gpu_summary/
└── ...
```
## Example Output
### CI Analyzer Example
```
============================================================
@@ -412,6 +510,58 @@ Failure Pattern Analysis:
Build Failure: 15 times
```
### Performance Analyzer Example
```
============================================================
SGLang Performance Analysis Report
============================================================
Getting recent 100 PR Test runs...
Got 100 PR test runs...
Collecting performance data from CI runs...
Processing run 34882 (2025-09-26 03:16)...
Found performance-test-1-gpu-part-1 job (success)
Found performance-test-1-gpu-part-2 job (success)
Found performance-test-2-gpu job (success)
Processing run 34881 (2025-09-26 02:45)...
Found performance-test-1-gpu-part-1 job (success)
Found performance-test-1-gpu-part-2 job (success)
...
Performance data collection completed!
Generating performance tables to directory: performance_tables
Generated table: performance_tables/performance-test-1-gpu-part-1_summary/test_bs1_default.csv
Generated chart: performance_tables/performance-test-1-gpu-part-1_summary/test_bs1_default_output_throughput_token_s.png
Generated table: performance_tables/performance-test-1-gpu-part-1_summary/test_online_latency_default.csv
Generated chart: performance_tables/performance-test-1-gpu-part-1_summary/test_online_latency_default_median_e2e_latency_ms.png
...
Performance tables and charts generation completed!
============================================================
Performance Analysis Summary
============================================================
Total PR Test runs processed: 100
Total performance tests found: 15
Total performance records collected: 1,247
Performance test breakdown:
performance-test-1-gpu-part-1: 7 tests, 423 records
performance-test-1-gpu-part-2: 5 tests, 387 records
performance-test-2-gpu: 6 tests, 437 records
Generated files:
CSV tables: 18 files
PNG charts: 18 files
Output directory: performance_tables/
Analysis completed successfully!
```
## CI Job Categories
The tool automatically categorizes CI jobs into:
@@ -459,11 +609,17 @@ logging.basicConfig(level=logging.DEBUG)
## Automated Monitoring
Both the CI analyzer and the performance analyzer are available as a GitHub Actions workflow that runs automatically every 6 hours. The workflow:
### CI Analysis
- Analyzes the last 1000 CI runs (configurable)
- Generates detailed failure reports
- Uploads analysis results as JSON artifacts
### Performance Analysis
- Analyzes the last 1000 PR Test runs (configurable)
- Generates performance trend data and charts
- Uploads CSV tables and PNG charts as artifacts
### Workflow Configuration
@@ -472,7 +628,16 @@ The workflow is located at `.github/workflows/ci-monitor.yml` and uses the `GH_P
### Manual Trigger
You can manually trigger the workflow from the GitHub Actions tab with custom parameters:
- `limit`: Number of CI runs to analyze (default: 1000); an API-based dispatch sketch follows
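The same dispatch can also be triggered through GitHub's REST API (`POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches`). A minimal sketch; the repository path and branch here are assumptions:

```python
import requests

# Assumed repository and branch; adjust to the repo you monitor.
url = ("https://api.github.com/repos/sgl-project/sglang"
       "/actions/workflows/ci-monitor.yml/dispatches")
resp = requests.post(
    url,
    headers={
        "Authorization": "token YOUR_GITHUB_TOKEN",
        "Accept": "application/vnd.github+json",
    },
    json={"ref": "main", "inputs": {"limit": "1000"}},  # inputs must be strings
    timeout=10,
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
print("Workflow dispatch accepted")
```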
### Artifacts Generated
The workflow generates and uploads the following artifacts:
- **CI Analysis**: JSON files with failure analysis data
- **Performance Analysis**:
- CSV files with performance metrics organized by test type
- PNG charts showing performance trends over time
- Directory structure: `performance_tables_{timestamp}/`
## License
......