Real-time log monitoring system powered by Claude AI. Combines statistical pattern analysis with AI intelligence to detect anomalies, identify root causes, and recommend automated remediation actions. Built for production systems requiring proactive incident detection.
A production-grade log analysis platform that processes application logs to detect anomalies before they impact users. Statistical anomaly detection compares each window against a learned baseline; Claude AI then performs root cause analysis and recommends remediation actions.
Key Capabilities:
- Real-time log ingestion and analysis
- Statistical baseline tracking with exponential moving averages
- AI-powered root cause analysis and impact assessment
- Automated alert generation with rate limiting
- Actionable remediation recommendations
- Multi-service anomaly correlation
```
Application Logs → FastAPI Endpoint → Background Processing
                                             │
                                   ┌──────────────────┐
                                   │ Pattern Analyzer │
                                   └────────┬─────────┘
                                            │
                         ┌──────────────────┴──────────────────┐
                         │                                     │
                  Current Stats                         Baseline Stats
                 (5-min window)                          (Redis cache)
                         │                                     │
                         └──────────────────┬──────────────────┘
                                            │
                               Statistical Comparison
                              (Error rate, volume, etc.)
                                            │
                                   Anomaly Detected?
                                            │
                                           Yes
                                            │
                              ┌───────────────────────────┐
                              │    Claude AI Analysis     │
                              │    - Root cause           │
                              │    - Impact assessment    │
                              │    - Recommended actions  │
                              └────────────┬──────────────┘
                                           │
                                  ┌──────────────────┐
                                  │   Alert System   │
                                  │  - Rate limiting │
                                  │  - PostgreSQL    │
                                  └──────────────────┘
```
Data Flow:
- Logs ingested via POST endpoint
- Background task analyzes 5-minute window
- Compare statistics to Redis-cached baseline
- If anomaly detected → Claude AI analyzes root cause
- Create alert (if not in cooldown period)
- Update baseline with exponential moving average
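The data flow above can be sketched end to end. This is a minimal, illustrative pipeline, not the project's actual code: the AI analysis and alerting steps are passed in as stubs, and all function names are hypothetical. The thresholds (5x relative, 5% absolute, α = 0.2) come from the detection and baseline sections below.

```python
# Illustrative end-to-end window pipeline; AI and alert steps are injected stubs.
ALPHA = 0.2  # EMA weight for baseline updates


def process_window(logs, baseline, analyze_fn, alert_fn):
    """Analyze one 5-minute window of logs against the baseline (in place)."""
    total = len(logs)
    errors = sum(1 for log in logs if log["level"] in ("ERROR", "CRITICAL"))
    error_rate = errors / total if total else 0.0

    anomalies = []
    # Relative (5x baseline) AND absolute (>5%) thresholds
    if error_rate > baseline["avg_error_rate"] * 5 and error_rate > 0.05:
        anomalies.append({"type": "error_spike", "error_rate": error_rate})

    if anomalies:
        analysis = analyze_fn(anomalies, logs)  # Claude AI step (stubbed here)
        alert_fn(anomalies, analysis)           # rate-limited alerting (stubbed)

    # Update baseline with the exponential moving average
    baseline["avg_error_rate"] = (
        ALPHA * error_rate + (1 - ALPHA) * baseline["avg_error_rate"]
    )
    return anomalies
```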
Metrics Tracked:
- Error rate (ERROR + CRITICAL logs)
- Log volume (total logs per window)
- Service count (unique services logging)
- Level distribution (INFO, WARN, ERROR, CRITICAL)
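All four tracked metrics can be computed in a single pass over the window. A sketch (field names follow the ingest payload shown later; the function name is illustrative):

```python
from collections import Counter


def window_metrics(logs):
    """Compute the tracked statistics for one analysis window."""
    levels = Counter(log["level"] for log in logs)
    total = len(logs)
    errors = levels["ERROR"] + levels["CRITICAL"]
    return {
        "total_logs": total,
        "error_rate": errors / total if total else 0.0,
        "service_count": len({log["service"] for log in logs}),
        "level_distribution": dict(levels),
    }
```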
Detection Algorithms:
- Error rate spike: 5x baseline AND >5% absolute
- Volume spike: 3x baseline
- Service changes: ±3 services from baseline
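The volume and service rules follow the same shape as the error-rate rule. A sketch applying the thresholds above (the function and key names are assumptions; the source does not say whether "±3" is inclusive, so `>= 3` is a guess):

```python
def detect_volume_and_service_anomalies(current, baseline):
    """Apply the 3x volume and ±3 service-count thresholds."""
    anomalies = []
    if (baseline["avg_total_logs"] > 0
            and current["total_logs"] > baseline["avg_total_logs"] * 3):
        anomalies.append({
            "type": "volume_spike",
            "multiplier": current["total_logs"] / baseline["avg_total_logs"],
        })
    # Treat a shift of 3 or more services (either direction) as anomalous
    if abs(current["service_count"] - baseline["avg_service_count"]) >= 3:
        anomalies.append({
            "type": "service_change",
            "delta": current["service_count"] - baseline["avg_service_count"],
        })
    return anomalies
```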
Baseline Management:
- Exponential moving average (α = 0.2)
- 24-hour Redis cache per window size
- Automatic baseline updates
Claude AI Integration:
- Analyzes anomaly context and log samples
- Identifies likely root causes
- Assesses user/business impact
- Generates remediation recommendations
- Provides confidence scores (0.0-1.0)
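Model replies sometimes arrive wrapped in a Markdown code fence rather than as bare JSON, so the response needs defensive parsing. A sketch of what a `parse_json_response` helper (referenced in the analysis snippet below) might look like; this implementation is an assumption, not the project's code:

```python
import json
import re


def parse_json_response(text: str) -> dict:
    """Extract the first JSON object from a model reply.

    Handles plain JSON as well as replies wrapped in ```json fences.
    """
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    return json.loads(candidate[start:end + 1])
```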
Prompt Engineering:
```
Analyze this system anomaly:

ANOMALIES DETECTED:
- error_spike: high severity (0.15 vs baseline 0.01)

CURRENT METRICS:
- Error rate: 15%
- Services affected: 3

ERROR SAMPLES:
- ERROR: Database connection timeout
- CRITICAL: Payment processing failed

Provide: root_cause, impact, recommended_actions, confidence
```

Alert Structure:
- Anomaly ID (timestamp-based)
- Severity (critical, high, medium, low)
- Category (error_spike, volume_spike, service_change)
- Affected services
- AI analysis and recommendations
- Confidence score
Rate Limiting:
- Max 1 alert per category per 15 minutes
- Prevents alert fatigue
- Redis-based cooldown tracking
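The cooldown maps naturally onto an atomic set-if-absent with a TTL (Redis `SET key value NX EX 900`). A sketch of the logic against an in-memory stand-in so it runs without a Redis server; the class and method names are illustrative:

```python
import time

COOLDOWN_SECONDS = 15 * 60  # max 1 alert per category per 15 minutes


class CooldownTracker:
    """In-memory stand-in for Redis SET NX EX cooldown keys."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._expires = {}  # category -> expiry timestamp

    def should_alert(self, category: str) -> bool:
        now = self._now()
        if self._expires.get(category, 0) > now:
            return False  # still cooling down; suppress the alert
        self._expires[category] = now + COOLDOWN_SECONDS
        return True
```

With real Redis the check-and-set must be the single atomic `SET ... NX EX` call, otherwise two workers can both pass the check and double-alert.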
Async Processing:
- Non-blocking log ingestion
- Background analysis tasks
- Parallel AI calls
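The "parallel AI calls" bullet maps onto `asyncio.gather`. A sketch with the Claude call stubbed out (the real call would be an `await` on the API client, as in the analysis snippet below):

```python
import asyncio


async def analyze_one(anomaly: dict) -> dict:
    """Stand-in for an awaited Claude API call."""
    await asyncio.sleep(0)  # real code would await client.messages.create(...)
    return {"type": anomaly["type"], "confidence": 0.9}


async def analyze_all(anomalies: list[dict]) -> list[dict]:
    # Fire all analyses concurrently instead of awaiting them one at a time
    return await asyncio.gather(*(analyze_one(a) for a in anomalies))


results = asyncio.run(analyze_all([{"type": "error_spike"},
                                   {"type": "volume_spike"}]))
```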
Data Persistence:
- PostgreSQL for alert history
- Redis for baselines and cooldowns
- Indexed queries for fast retrieval
Observability:
- Structured JSON logging
- Health check endpoints
- Processing time tracking
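Structured JSON logging needs only the standard library. A minimal formatter sketch (one JSON object per line; the field set here is an assumption):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("log-analyzer")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("analysis window complete")
```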
```python
# Exponential moving average for a smooth baseline
alpha = 0.2
updated_baseline = {
    "avg_error_rate": (
        alpha * current_error_rate
        + (1 - alpha) * previous_baseline["avg_error_rate"]
    ),
    "avg_total_logs": (
        alpha * current_total_logs
        + (1 - alpha) * previous_baseline["avg_total_logs"]
    ),
}
```

Why exponential moving average?
- Recent data weighted more heavily
- Adapts to gradual changes
- Smooths out temporary spikes
```python
# Error rate spike detection: relative AND absolute thresholds
if (current_error_rate > baseline_error_rate * 5
        and current_error_rate > 0.05):
    anomaly = {
        "type": "error_spike",
        "severity": "high" if current_error_rate > 0.2 else "medium",
        # Guard against a zero baseline when reporting the multiplier
        "multiplier": current_error_rate / max(baseline_error_rate, 1e-6),
    }
```

Multi-threshold approach:
- Relative threshold (5x baseline)
- Absolute threshold (>5%)
- Prevents false positives from low baselines
```python
async def analyze_anomaly(anomaly_data):
    """Use Claude to analyze a detected anomaly.

    Returns structured insights (root cause, impact, actions, confidence).
    Assumes `client` is an anthropic.AsyncAnthropic instance and that
    build_analysis_prompt / parse_json_response are defined elsewhere.
    """
    prompt = build_analysis_prompt(anomaly_data)
    response = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1500,
        temperature=0.2,  # low temperature for factual analysis
        messages=[{"role": "user", "content": prompt}],
    )
    return parse_json_response(response.content[0].text)
```

Response Format:
```json
{
  "root_cause": "Database connection pool exhaustion",
  "impact": "Payment processing degraded, ~15% of transactions failing",
  "recommended_actions": [
    "Scale database connection pool from 20 to 50",
    "Investigate long-running queries blocking connections",
    "Enable query timeout enforcement"
  ],
  "confidence": 0.87
}
```

```shell
git clone https://github.com/yourusername/ai-log-analyzer.git
cd ai-log-analyzer
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Set environment variables
export ANTHROPIC_API_KEY=your_key_here
export DATABASE_URL=postgresql://localhost/log_analyzer
export REDIS_URL=redis://localhost:6379

# Run server
python src/main.py
```

Dependencies:
```
fastapi>=0.104.0
uvicorn>=0.24.0
anthropic>=0.8.0
asyncpg>=0.29.0
redis>=5.0.0
pydantic>=2.0.0
```

```shell
curl -X POST http://localhost:8000/logs/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "logs": [
      {
        "timestamp": "2024-01-15T10:30:00Z",
        "level": "ERROR",
        "service": "payment-api",
        "message": "Database connection timeout",
        "context": {"duration_ms": 5000}
      }
    ]
  }'
```

```shell
curl http://localhost:8000/alerts/recent?limit=10
```

Response:
```json
{
  "alerts": [
    {
      "anomaly_id": "anom_20240115_103045",
      "detected_at": "2024-01-15T10:30:45Z",
      "severity": "high",
      "category": "error_spike",
      "description": "Database connection pool exhaustion...",
      "affected_services": ["payment-api", "user-service"],
      "recommended_actions": [
        "Scale database connection pool",
        "Investigate long-running queries"
      ],
      "confidence": 0.87
    }
  ]
}
```

| Metric | Value |
|---|---|
| Log Ingestion | <50ms (async) |
| Analysis Window | 5 minutes |
| Baseline Update | <10ms (Redis) |
| AI Analysis | 1-2s (Claude API) |
| Alert Creation | <100ms (PostgreSQL) |
Scalability:
- 10K+ logs/minute throughput
- Sub-second anomaly detection
- Redis-cached baselines (no DB reads)
AI/ML:
- Claude Sonnet 4 for root cause analysis
- Structured prompt engineering
- Confidence scoring
- Impact assessment
Algorithms:
- Exponential moving average for baselines
- Multi-threshold anomaly detection
- Statistical pattern recognition
Production Patterns:
- Async Python (FastAPI)
- Background task queues
- Rate-limited alerting
- Redis caching
- PostgreSQL persistence
SRE Concepts:
- Anomaly detection
- Incident response automation
- Alert fatigue prevention
- Baseline drift handling
Use Cases:
- Error rate spike detection
- Service degradation alerts
- Anomaly root cause analysis
- Automated incident triage
- Remediation recommendation
- Alert correlation
- Proactive issue detection
- Service health monitoring
- Pattern-based alerting
MIT License