Health Monitoring

BWS provides built-in health monitoring endpoints and logging capabilities to help you monitor your server's status and performance.

Health Endpoints

BWS automatically provides health check endpoints for monitoring:

Basic Health Check

GET /health

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "0.1.5",
  "uptime": 3600
}

Detailed Health Status

GET /health/detailed

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "0.1.5",
  "uptime": 3600,
  "sites": [
    {
      "name": "main",
      "hostname": "localhost",
      "port": 8080,
      "status": "active",
      "requests_served": 1542,
      "last_request": "2024-01-15T10:29:45Z"
    }
  ],
  "system": {
    "memory_usage": "45.2 MB",
    "cpu_usage": "12.5%",
    "disk_usage": "78.3%"
  }
}

Monitoring Configuration

Health Check Settings

[monitoring]
enabled = true
health_endpoint = "/health"
detailed_endpoint = "/health/detailed"
metrics_endpoint = "/metrics"

Custom Health Checks

[monitoring.checks]
disk_threshold = 90    # Alert if disk usage > 90%
memory_threshold = 80  # Alert if memory usage > 80%
response_time_threshold = 1000  # Alert if response time > 1000ms
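Each threshold is a simple percentage comparison against a measured value. The logic can be sketched as follows; the function name and output format here are illustrative, not part of BWS:

```shell
#!/bin/sh
# Illustrative sketch of the threshold comparison behind the alerts
# above. check_threshold and its message format are hypothetical.
check_threshold() {
    name=$1; value=$2; limit=$3
    if [ "$value" -gt "$limit" ]; then
        echo "ALERT: $name at ${value}% (limit ${limit}%)"
    else
        echo "OK: $name at ${value}% (limit ${limit}%)"
    fi
}

check_threshold disk 95 90    # above disk_threshold -> alert
check_threshold memory 60 80  # below memory_threshold -> ok
```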

Logging Configuration

Basic Logging Setup

[logging]
level = "info"
format = "json"
output = "stdout"

File Logging

[logging]
level = "info"
format = "json"
output = "file"
file_path = "/var/log/bws/bws.log"
max_size = "100MB"
max_files = 10
compress = true
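If you prefer OS-level rotation over the built-in max_size/max_files settings, an equivalent logrotate rule might look like this (using the path configured above):

```
/var/log/bws/bws.log {
    size 100M
    rotate 10
    compress
    missingok
    notifempty
    copytruncate
}
```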

Structured Logging

[logging]
level = "debug"
format = "json"
include_fields = [
    "timestamp",
    "level",
    "message",
    "request_id",
    "client_ip",
    "user_agent",
    "response_time",
    "status_code"
]
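With these settings, each record is emitted as one JSON object per line. A hypothetical entry (field values invented for illustration) might look like:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "message": "request completed",
  "request_id": "req-4f2a",
  "client_ip": "203.0.113.7",
  "user_agent": "curl/8.5.0",
  "response_time": 12,
  "status_code": 200
}
```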

Log Levels

Available Levels

  • ERROR: Error conditions and failures
  • WARN: Warning conditions
  • INFO: General informational messages
  • DEBUG: Detailed debugging information
  • TRACE: Very detailed tracing information

Log Level Examples

# Production
[logging]
level = "info"

# Development
[logging]
level = "debug"

# Troubleshooting
[logging]
level = "trace"

Metrics Collection

Basic Metrics

BWS automatically collects:

  • Request count
  • Response times
  • Error rates
  • Active connections
  • Memory usage
  • CPU usage

Prometheus Integration

[monitoring.prometheus]
enabled = true
endpoint = "/metrics"
port = 9090

Example metrics output:

# HELP bws_requests_total Total number of requests
# TYPE bws_requests_total counter
bws_requests_total{site="main",method="GET",status="200"} 1542

# HELP bws_response_time_seconds Response time in seconds
# TYPE bws_response_time_seconds histogram
bws_response_time_seconds_bucket{site="main",le="0.1"} 1200
bws_response_time_seconds_bucket{site="main",le="0.5"} 1500
bws_response_time_seconds_bucket{site="main",le="1.0"} 1540
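Assuming Prometheus scrapes the /metrics endpoint above, standard PromQL queries over these series can feed dashboards and alerts, for example:

```
# Requests per second, per site, over the last 5 minutes
rate(bws_requests_total[5m])

# Fraction of responses with a 5xx status
sum(rate(bws_requests_total{status=~"5.."}[5m]))
  / sum(rate(bws_requests_total[5m]))

# 95th-percentile response time from the histogram buckets
histogram_quantile(0.95, rate(bws_response_time_seconds_bucket[5m]))
```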

Alerting

Health Check Monitoring

#!/bin/bash
# health_check.sh
HEALTH_URL="http://localhost:8080/health"
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_URL")

if [ "$RESPONSE" -eq 200 ]; then
    echo "BWS is healthy"
    exit 0
else
    echo "BWS health check failed: HTTP $RESPONSE"
    exit 1
fi
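One way to run this check on a schedule is a system cron entry; the install path and logger tag below are assumptions:

```
# /etc/cron.d/bws-health — run the check every minute
* * * * * root /usr/local/bin/health_check.sh || logger -t bws "health check failed"
```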

Uptime Monitoring Script

#!/bin/bash
# uptime_check.sh
while true; do
    if curl -f -s http://localhost:8080/health > /dev/null; then
        echo "$(date): BWS is running"
    else
        echo "$(date): BWS is DOWN" | mail -s "BWS Alert" admin@example.com
    fi
    sleep 60
done

System Integration

Systemd Integration

# /etc/systemd/system/bws.service
[Unit]
Description=BWS Web Server
After=network.target

[Service]
Type=simple
User=bws
ExecStart=/usr/local/bin/bws --config /etc/bws/config.toml
Restart=always
RestartSec=5

# Note: systemd has no built-in HTTP health check directive.
# Run health_check.sh from a separate timer unit, or use
# WatchdogSec= if BWS supports sd_notify.

[Install]
WantedBy=multi-user.target

Docker Health Checks

# Dockerfile
FROM rust:1.89-slim

# ... build steps ...

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

EXPOSE 8080
CMD ["bws", "--config", "/app/config.toml"]

Docker Compose Health Check

# docker-compose.yml
version: '3.8'
services:
  bws:
    build: .
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    restart: unless-stopped

Monitoring Tools Integration

Grafana Dashboard

{
  "dashboard": {
    "title": "BWS Monitoring",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(bws_requests_total[5m])"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(bws_response_time_seconds_bucket[5m]))"
          }
        ]
      }
    ]
  }
}

Nagios Check

#!/bin/bash
# check_bws.sh for Nagios
HOST="localhost"
PORT="8080"
WARNING_TIME=1000
CRITICAL_TIME=2000

RESPONSE_TIME=$(curl -o /dev/null -s -w "%{time_total}" "http://$HOST:$PORT/health")
RESPONSE_TIME_MS=$(echo "$RESPONSE_TIME * 1000" | bc)

if (( $(echo "$RESPONSE_TIME_MS > $CRITICAL_TIME" | bc -l) )); then
    echo "CRITICAL - Response time: ${RESPONSE_TIME_MS}ms"
    exit 2
elif (( $(echo "$RESPONSE_TIME_MS > $WARNING_TIME" | bc -l) )); then
    echo "WARNING - Response time: ${RESPONSE_TIME_MS}ms"
    exit 1
else
    echo "OK - Response time: ${RESPONSE_TIME_MS}ms"
    exit 0
fi

Log Analysis

Log Parsing with jq

# Extract error logs
jq 'select(.level == "ERROR")' bws.log

# Count requests by status code
jq -r '.status_code' bws.log | sort | uniq -c

# Average response time
jq -r '.response_time' bws.log | awk '{sum+=$1; count++} END {print sum/count}'
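If jq is not installed, aggregates like the average response time can be approximated with awk alone. This self-contained sketch uses an inline sample in place of a real bws.log:

```shell
#!/bin/sh
# Average response_time from JSON-lines logs without jq. The field
# extraction is crude (split on the key name); the sample records
# below stand in for a real log file.
cat > /tmp/bws_sample.log <<'EOF'
{"status_code":200,"response_time":10}
{"status_code":200,"response_time":30}
{"status_code":500,"response_time":5}
EOF

awk -F'"response_time":' '
    NF > 1 { gsub(/[^0-9].*/, "", $2); sum += $2; n++ }
    END    { if (n) print sum / n }
' /tmp/bws_sample.log
```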

Log Aggregation

# Tail logs in real-time
tail -f /var/log/bws/bws.log | jq '.'

# Filter by log level
tail -f /var/log/bws/bws.log | jq 'select(.level == "ERROR" or .level == "WARN")'

# Monitor specific endpoints
tail -f /var/log/bws/bws.log | jq 'select(.path == "/api/users")'

Performance Monitoring

Request Tracking

[monitoring.requests]
track_response_times = true
track_status_codes = true
track_user_agents = true
track_client_ips = true
sample_rate = 1.0  # 100% sampling
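Here sample_rate is the probability that any given request is recorded. A minimal sketch of that per-request decision (the function name is illustrative):

```shell
#!/bin/sh
# Hypothetical per-request sampling decision for a given sample_rate.
SAMPLE_RATE=1.0   # 1.0 = record every request

should_sample() {
    # Succeed (exit 0) with probability SAMPLE_RATE.
    awk -v r="$SAMPLE_RATE" 'BEGIN { srand(); exit (rand() < r ? 0 : 1) }'
}

if should_sample; then
    echo "request tracked"
fi
```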

Memory Monitoring

[monitoring.memory]
track_usage = true
alert_threshold = 80  # Alert at 80% usage
gc_metrics = true

Connection Monitoring

[monitoring.connections]
track_active = true
track_total = true
max_connections = 1000
timeout = 30  # seconds

Troubleshooting Health Issues

Common Health Check Failures

Service Unavailable

# Check if BWS is running
ps aux | grep bws

# Check port binding (use ss where netstat is unavailable)
netstat -tulpn | grep :8080
ss -tulpn | grep :8080

# Check configuration
bws --config-check

High Response Times

# Check system load
top
htop

# Check disk usage
df -h

# Check memory usage
free -h

Memory Leaks

# Monitor memory over time
while true; do
    ps -o pid,ppid,cmd,%mem,%cpu -p "$(pgrep -d, bws)"
    sleep 10
done

# Generate memory dump (if available)
kill -USR1 $(pgrep bws)

Health Check Debugging

# Test health endpoint
curl -v http://localhost:8080/health

# Check detailed health
curl -s http://localhost:8080/health/detailed | jq '.'

# Monitor health over time
while true; do
    echo "$(date): $(curl -s http://localhost:8080/health | jq -r '.status')"
    sleep 30
done

Best Practices

Health Check Configuration

  • Set appropriate timeouts (3-5 seconds)
  • Use consistent intervals (30-60 seconds)
  • Include multiple health indicators
  • Monitor both application and system metrics

Logging Strategy

  • Use structured logging (JSON format)
  • Include correlation IDs for request tracking
  • Log at appropriate levels
  • Rotate logs to prevent disk space issues

Alerting Guidelines

  • Set realistic thresholds
  • Avoid alert fatigue
  • Include actionable information
  • Test alert mechanisms regularly

Monitoring Coverage

  • Monitor all critical endpoints
  • Track business metrics, not just technical
  • Set up both reactive and proactive monitoring
  • Document monitoring procedures

Next Steps