Health Monitoring

BWS provides built-in health monitoring endpoints and logging capabilities to help you monitor your server's status and performance.

Health Endpoints

BWS automatically provides health check endpoints for monitoring:

Basic Health Check

GET /health

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "0.1.5",
  "uptime": 3600
}

Detailed Health Status

GET /health/detailed

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "version": "0.1.5",
  "uptime": 3600,
  "sites": [
    {
      "name": "main",
      "hostname": "localhost",
      "port": 8080,
      "status": "active",
      "requests_served": 1542,
      "last_request": "2024-01-15T10:29:45Z"
    }
  ],
  "system": {
    "memory_usage": "45.2 MB",
    "cpu_usage": "12.5%",
    "disk_usage": "78.3%"
  }
}

Monitoring Configuration

Health Check Settings

[monitoring]
enabled = true
health_endpoint = "/health"
detailed_endpoint = "/health/detailed"
metrics_endpoint = "/metrics"

Custom Health Checks

[monitoring.checks]
disk_threshold = 90    # Alert if disk usage > 90%
memory_threshold = 80  # Alert if memory usage > 80%
response_time_threshold = 1000  # Alert if response time > 1000ms
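Each threshold is a simple percentage comparison against a measured value. The logic can be sketched as follows; the function name and output format here are illustrative, not part of BWS:

```shell
#!/bin/sh
# Illustrative sketch of the threshold comparison behind the alerts
# above. check_threshold and its message format are hypothetical.
check_threshold() {
    name=$1; value=$2; limit=$3
    if [ "$value" -gt "$limit" ]; then
        echo "ALERT: $name at ${value}% (limit ${limit}%)"
    else
        echo "OK: $name at ${value}% (limit ${limit}%)"
    fi
}

check_threshold disk 95 90    # above disk_threshold -> alert
check_threshold memory 60 80  # below memory_threshold -> ok
```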

Logging Configuration

Basic Logging Setup

[logging]
level = "info"
format = "json"
output = "stdout"

File Logging

[logging]
level = "info"
format = "json"
output = "file"
file_path = "/var/log/bws/bws.log"
max_size = "100MB"
max_files = 10
compress = true
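If you prefer OS-level rotation over the built-in max_size/max_files settings, an equivalent logrotate rule might look like this (using the path configured above):

```
/var/log/bws/bws.log {
    size 100M
    rotate 10
    compress
    missingok
    notifempty
    copytruncate
}
```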

Structured Logging

[logging]
level = "debug"
format = "json"
include_fields = [
    "timestamp",
    "level",
    "message",
    "request_id",
    "client_ip",
    "user_agent",
    "response_time",
    "status_code"
]
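With these settings, each record is emitted as one JSON object per line. A hypothetical entry (field values invented for illustration) might look like:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "message": "request completed",
  "request_id": "req-4f2a",
  "client_ip": "203.0.113.7",
  "user_agent": "curl/8.5.0",
  "response_time": 12,
  "status_code": 200
}
```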

Log Levels

Available Levels

  • ERROR: Error conditions and failures
  • WARN: Warning conditions
  • INFO: General informational messages
  • DEBUG: Detailed debugging information
  • TRACE: Very detailed tracing information

Log Level Examples

# Production
[logging]
level = "info"

# Development
[logging]
level = "debug"

# Troubleshooting
[logging]
level = "trace"

Metrics Collection

Basic Metrics

BWS automatically collects:

  • Request count
  • Response times
  • Error rates
  • Active connections
  • Memory usage
  • CPU usage

Prometheus Integration

[monitoring.prometheus]
enabled = true
endpoint = "/metrics"
port = 9090

Example metrics output:

# HELP bws_requests_total Total number of requests
# TYPE bws_requests_total counter
bws_requests_total{site="main",method="GET",status="200"} 1542

# HELP bws_response_time_seconds Response time in seconds
# TYPE bws_response_time_seconds histogram
bws_response_time_seconds_bucket{site="main",le="0.1"} 1200
bws_response_time_seconds_bucket{site="main",le="0.5"} 1500
bws_response_time_seconds_bucket{site="main",le="1.0"} 1540
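Assuming Prometheus scrapes the /metrics endpoint above, standard PromQL queries over these series can feed dashboards and alerts, for example:

```
# Requests per second, per site, over the last 5 minutes
rate(bws_requests_total[5m])

# Fraction of responses with a 5xx status
sum(rate(bws_requests_total{status=~"5.."}[5m]))
  / sum(rate(bws_requests_total[5m]))

# 95th-percentile response time from the histogram buckets
histogram_quantile(0.95, rate(bws_response_time_seconds_bucket[5m]))
```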

Alerting

Health Check Monitoring

#!/bin/bash
# health_check.sh
HEALTH_URL="http://localhost:8080/health"
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_URL")

if [ "$RESPONSE" -eq 200 ]; then
    echo "BWS is healthy"
    exit 0
else
    echo "BWS health check failed: HTTP $RESPONSE"
    exit 1
fi
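One way to run this check on a schedule is a system cron entry; the install path and logger tag below are assumptions:

```
# /etc/cron.d/bws-health — run the check every minute
* * * * * root /usr/local/bin/health_check.sh || logger -t bws "health check failed"
```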

Uptime Monitoring Script

#!/bin/bash
# uptime_check.sh
while true; do
    if curl -f -s http://localhost:8080/health > /dev/null; then
        echo "$(date): BWS is running"
    else
        echo "$(date): BWS is DOWN" | mail -s "BWS Alert" admin@example.com
    fi
    sleep 60
done

System Integration

Systemd Integration

# /etc/systemd/system/bws.service
[Unit]
Description=BWS Web Server
After=network.target

[Service]
Type=simple
User=bws
ExecStart=/usr/local/bin/bws --config /etc/bws/config.toml
Restart=always
RestartSec=5

# Note: systemd has no built-in HTTP health check directive.
# Run health_check.sh from a separate timer unit, or use
# WatchdogSec= if BWS supports sd_notify.

[Install]
WantedBy=multi-user.target

Docker Health Checks

# Dockerfile
FROM rust:1.89-slim

# ... build steps ...

HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

EXPOSE 8080
CMD ["bws", "--config", "/app/config.toml"]

Docker Compose Health Check

# docker-compose.yml
version: '3.8'
services:
  bws:
    build: .
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    restart: unless-stopped

Monitoring Tools Integration

Grafana Dashboard

{
  "dashboard": {
    "title": "BWS Monitoring",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(bws_requests_total[5m])"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(bws_response_time_seconds_bucket[5m]))"
          }
        ]
      }
    ]
  }
}

Nagios Check

#!/bin/bash
# check_bws.sh for Nagios
HOST="localhost"
PORT="8080"
WARNING_TIME=1000
CRITICAL_TIME=2000

RESPONSE_TIME=$(curl -o /dev/null -s -w "%{time_total}" "http://$HOST:$PORT/health")
RESPONSE_TIME_MS=$(echo "$RESPONSE_TIME * 1000" | bc)

if (( $(echo "$RESPONSE_TIME_MS > $CRITICAL_TIME" | bc -l) )); then
    echo "CRITICAL - Response time: ${RESPONSE_TIME_MS}ms"
    exit 2
elif (( $(echo "$RESPONSE_TIME_MS > $WARNING_TIME" | bc -l) )); then
    echo "WARNING - Response time: ${RESPONSE_TIME_MS}ms"
    exit 1
else
    echo "OK - Response time: ${RESPONSE_TIME_MS}ms"
    exit 0
fi

Log Analysis

Log Parsing with jq

# Extract error logs
jq 'select(.level == "ERROR")' bws.log

# Count requests by status code
jq -r '.status_code' bws.log | sort | uniq -c

# Average response time
jq -r '.response_time' bws.log | awk '{sum+=$1; count++} END {print sum/count}'
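If jq is not installed, aggregates like the average response time can be approximated with awk alone. This self-contained sketch uses an inline sample in place of a real bws.log:

```shell
#!/bin/sh
# Average response_time from JSON-lines logs without jq. The field
# extraction is crude (split on the key name); the sample records
# below stand in for a real log file.
cat > /tmp/bws_sample.log <<'EOF'
{"status_code":200,"response_time":10}
{"status_code":200,"response_time":30}
{"status_code":500,"response_time":5}
EOF

awk -F'"response_time":' '
    NF > 1 { gsub(/[^0-9].*/, "", $2); sum += $2; n++ }
    END    { if (n) print sum / n }
' /tmp/bws_sample.log
```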

Log Aggregation

# Tail logs in real-time
tail -f /var/log/bws/bws.log | jq '.'

# Filter by log level
tail -f /var/log/bws/bws.log | jq 'select(.level == "ERROR" or .level == "WARN")'

# Monitor specific endpoints
tail -f /var/log/bws/bws.log | jq 'select(.path == "/api/users")'

Performance Monitoring

Request Tracking

[monitoring.requests]
track_response_times = true
track_status_codes = true
track_user_agents = true
track_client_ips = true
sample_rate = 1.0  # 100% sampling
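Here sample_rate is the probability that any given request is recorded. A minimal sketch of that per-request decision (the function name is illustrative):

```shell
#!/bin/sh
# Hypothetical per-request sampling decision for a given sample_rate.
SAMPLE_RATE=1.0   # 1.0 = record every request

should_sample() {
    # Succeed (exit 0) with probability SAMPLE_RATE.
    awk -v r="$SAMPLE_RATE" 'BEGIN { srand(); exit (rand() < r ? 0 : 1) }'
}

if should_sample; then
    echo "request tracked"
fi
```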

Memory Monitoring

[monitoring.memory]
track_usage = true
alert_threshold = 80  # Alert at 80% usage
gc_metrics = true

Connection Monitoring

[monitoring.connections]
track_active = true
track_total = true
max_connections = 1000
timeout = 30  # seconds

Troubleshooting Health Issues

Common Health Check Failures

Service Unavailable

# Check if BWS is running
ps aux | grep bws

# Check port binding (use ss where netstat is unavailable)
netstat -tulpn | grep :8080
ss -tulpn | grep :8080

# Check configuration
bws --config-check

High Response Times

# Check system load
top
htop

# Check disk usage
df -h

# Check memory usage
free -h

Memory Leaks

# Monitor memory over time
while true; do
    ps -o pid,ppid,cmd,%mem,%cpu -p "$(pgrep -d, bws)"
    sleep 10
done

# Generate memory dump (if available)
kill -USR1 $(pgrep bws)

Health Check Debugging

# Test health endpoint
curl -v http://localhost:8080/health

# Check detailed health
curl -s http://localhost:8080/health/detailed | jq '.'

# Monitor health over time
while true; do
    echo "$(date): $(curl -s http://localhost:8080/health | jq -r '.status')"
    sleep 30
done

Best Practices

Health Check Configuration

  • Set appropriate timeouts (3-5 seconds)
  • Use consistent intervals (30-60 seconds)
  • Include multiple health indicators
  • Monitor both application and system metrics

Logging Strategy

  • Use structured logging (JSON format)
  • Include correlation IDs for request tracking
  • Log at appropriate levels
  • Rotate logs to prevent disk space issues

Alerting Guidelines

  • Set realistic thresholds
  • Avoid alert fatigue
  • Include actionable information
  • Test alert mechanisms regularly

Monitoring Coverage

  • Monitor all critical endpoints
  • Track business metrics, not just technical
  • Set up both reactive and proactive monitoring
  • Document monitoring procedures

Next Steps