SuperChat Monitoring Guide

Complete guide for monitoring and observability of SuperChat servers.

Overview
Log Files
Prometheus Metrics
Grafana Setup
Alert Rules
Health Checks
Performance Profiling
Log Aggregation
Common Patterns

Overview

SuperChat provides multiple monitoring mechanisms:

Log files - Server activity, errors, debug info
Prometheus metrics - Real-time performance metrics (port 9090)
Health endpoint - Simple health check HTTP endpoint
pprof - CPU and memory profiling (port 6060)

Critical Security Note: Ports 9090 (metrics) and 6060 (pprof) must NEVER be exposed publicly. Use firewall rules and SSH tunneling for access.

Log Files

SuperChat writes three log files:

server.log

Location: ~/.local/share/superchat/server.log or $XDG_DATA_HOME/superchat/server.log

Contents: All server activity (connections, messages, errors)

Rotation: Truncated on each server startup

Format:

2025/10/08 14:30:15.123456 [INFO] Server started on port 6465
2025/10/08 14:30:20.234567 [INFO] Connection from 192.168.1.100:54321
2025/10/08 14:30:21.345678 [INFO] Session 1 set nickname: alice
2025/10/08 14:30:25.456789 [ERROR] Rate limit exceeded for session 2

Monitoring:

# Tail logs in real-time
tail -f ~/.local/share/superchat/server.log

# Search for errors
grep ERROR ~/.local/share/superchat/server.log

# Count connections today
grep "Connection from" ~/.local/share/superchat/server.log | wc -l

# Find rate limit violations
grep "rate limit exceeded" ~/.local/share/superchat/server.log

errors.log

Location: ~/.local/share/superchat/errors.log

Contents: Error-level logs only

Rotation: Append mode, persists across restarts

Use case: Long-term error tracking, debugging intermittent issues

Monitoring:

# Check recent errors
tail -n 50 ~/.local/share/superchat/errors.log

# Count errors today
grep "$(date +%Y/%m/%d)" ~/.local/share/superchat/errors.log | wc -l

# Find specific error patterns
grep "database" ~/.local/share/superchat/errors.log

debug.log

Location: ~/.local/share/superchat/debug.log

Contents: Debug-level logs (only when --debug flag is used)

Rotation: Append mode

Use case: Development, troubleshooting complex issues

Enable debug logging:

scd --debug

Warning: Debug logs can grow quickly and may contain sensitive information. Only enable when needed.

systemd Journal

If using systemd, logs are also sent to the journal:

# View all SuperChat logs
sudo journalctl -u superchat

# Follow logs in real-time
sudo journalctl -u superchat -f

# Last 100 lines
sudo journalctl -u superchat -n 100

# Since specific time
sudo journalctl -u superchat --since "2025-10-08 14:00:00"

# Only errors
sudo journalctl -u superchat -p err

# JSON output (for parsing)
sudo journalctl -u superchat -o json

Prometheus Metrics

SuperChat exposes Prometheus metrics on port 9090 at /metrics.

Critical: This port must be firewalled! Access via SSH tunnel only.

Accessing Metrics

Local access (on server):

curl http://localhost:9090/metrics

Remote access (via SSH tunnel):

# From your local machine
ssh -L 9090:localhost:9090 user@server

# Then access locally
curl http://localhost:9090/metrics

Available Metrics

Session Metrics

superchat_active_sessions (Gauge)

Current number of active sessions
Use: Monitor concurrent user count
Alert: Spike may indicate DoS attack or viral growth

superchat_sessions_created_total (Counter)

Total sessions created since server start
Use: Track total connections over time
Rate: rate(superchat_sessions_created_total[5m]) = sessions/second

superchat_sessions_disconnected_total (Counter)

Total sessions disconnected
Use: Track disconnect rate
Alert: High rate may indicate connectivity issues

Message Metrics

superchat_messages_received_total{type="..."} (Counter)

Total messages received from clients by type
Labels: type (SET_NICKNAME, POST_MESSAGE, etc.)
Use: Track message volume by type
Example types: POST_MESSAGE, PING, LIST_MESSAGES

superchat_messages_sent_total{type="..."} (Counter)

Total messages sent to clients by type
Labels: type (MESSAGE_LIST, NEW_MESSAGE, ERROR, etc.)
Use: Track server responses by type

superchat_messages_broadcast_total (Counter)

Total unique messages broadcast (not deliveries)
Use: Track actual message creation rate
Rate: rate(superchat_messages_broadcast_total[5m]) = messages/second

superchat_messages_delivered_total{channel_id="...",thread_id="..."} (Counter)

Total message deliveries to clients
Labels: channel_id, thread_id
Use: Track message delivery volume per channel/thread
Note: One broadcast = N deliveries (N = subscriber count)

Subscription Metrics

superchat_channel_subscribers{channel_id="..."} (Gauge)

Active subscribers per channel
Labels: channel_id
Use: Monitor channel popularity
Alert: Sudden spike in one channel = possible spam attack

superchat_thread_subscribers{thread_id="..."} (Gauge)

Active subscribers per thread
Labels: thread_id
Use: Track thread engagement

Broadcast Metrics

superchat_broadcast_fanout{type="..."} (Histogram)

Number of recipients per broadcast message
Labels: type (channel or thread)
Buckets: 1, 5, 10, 25, 50, 100, 250, 500, 1000, 2000, 5000
Use: Understand broadcast reach
Query: histogram_quantile(0.95, superchat_broadcast_fanout_bucket) = 95th percentile fanout

superchat_broadcast_duration_seconds{type="..."} (Histogram)

Time taken to broadcast a message to all subscribers
Labels: type (channel or thread)
Use: Monitor broadcast performance
Alert: P95 > 1s = performance issue
Query: histogram_quantile(0.95, rate(superchat_broadcast_duration_seconds_bucket[5m]))

Go Runtime Metrics (Built-in)

go_goroutines (Gauge)

Current number of goroutines
Use: Detect goroutine leaks
Alert: Continuously increasing = leak

go_memstats_alloc_bytes (Gauge)

Bytes of allocated heap memory
Use: Monitor memory usage

go_memstats_heap_inuse_bytes (Gauge)

Bytes in in-use heap spans
Use: Monitor heap size

process_cpu_seconds_total (Counter)

Total CPU time consumed
Use: Calculate CPU usage percentage
Query: rate(process_cpu_seconds_total[5m]) * 100 = CPU %

process_resident_memory_bytes (Gauge)

Resident memory size (RSS)
Use: Monitor overall memory usage

Example Queries

Active users:

superchat_active_sessions

Message rate (messages per second):

rate(superchat_messages_broadcast_total[5m])

Connection rate (new sessions per minute):

rate(superchat_sessions_created_total[1m]) * 60

Average broadcast fanout:

rate(superchat_messages_delivered_total[5m]) / rate(superchat_messages_broadcast_total[5m])

P95 broadcast latency:

histogram_quantile(0.95, rate(superchat_broadcast_duration_seconds_bucket[5m]))

CPU usage:

rate(process_cpu_seconds_total[5m]) * 100

Memory usage (MB):

process_resident_memory_bytes / 1024 / 1024

Goroutine leak detection:

deriv(go_goroutines[5m])  # Positive value = increasing goroutines

Grafana Setup

Installation

Ubuntu/Debian:

sudo apt-get install -y software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Docker:

docker run -d \
  --name=grafana \
  -p 3000:3000 \
  grafana/grafana-oss

Access: http://localhost:3000 (default login: admin/admin)

Prometheus Data Source

Navigate to Configuration → Data Sources
Click Add data source
Select Prometheus
Set URL: http://localhost:9090 (or SSH tunnel)
Click Save & Test

Dashboard Creation

Create a new dashboard:

Click + → Create Dashboard
Add panels with queries:

Panel: Active Sessions

Query: superchat_active_sessions
Visualization: Stat or Time series
Unit: Short

Panel: Message Rate

Query: rate(superchat_messages_broadcast_total[5m])
Visualization: Graph
Unit: Messages/sec

Panel: Broadcast Latency (P95)

Query: histogram_quantile(0.95, rate(superchat_broadcast_duration_seconds_bucket[5m]))
Visualization: Graph
Unit: Seconds
Threshold: Warning at 0.5s, Critical at 1s

Panel: CPU Usage

Query: rate(process_cpu_seconds_total[5m]) * 100
Visualization: Gauge
Unit: Percent (0-100)
Threshold: Warning at 70%, Critical at 90%

Panel: Memory Usage

Query: process_resident_memory_bytes / 1024 / 1024
Visualization: Gauge
Unit: MB

Panel: Connection Rate

Query: rate(superchat_sessions_created_total[5m]) * 60
Visualization: Graph
Unit: Connections/min

Panel: Top Channels by Subscribers

Query: topk(10, superchat_channel_subscribers)
Visualization: Bar chart

Panel: Goroutine Count

Query: go_goroutines
Visualization: Graph
Alert: Continuously increasing

Sample Dashboard JSON

Save this as superchat-dashboard.json and import into Grafana:

{
  "dashboard": {
    "title": "SuperChat Server Metrics",
    "panels": [
      {
        "title": "Active Sessions",
        "targets": [{"expr": "superchat_active_sessions"}],
        "type": "stat"
      },
      {
        "title": "Message Rate",
        "targets": [{"expr": "rate(superchat_messages_broadcast_total[5m])"}],
        "type": "graph"
      },
      {
        "title": "Broadcast Latency P95",
        "targets": [{"expr": "histogram_quantile(0.95, rate(superchat_broadcast_duration_seconds_bucket[5m]))"}],
        "type": "graph"
      }
    ]
  }
}

Alert Rules

Prometheus Alert Configuration

Create prometheus/alerts.yml:

groups:
  - name: superchat_alerts
    interval: 30s
    rules:
      # Server down
      - alert: SuperChatServerDown
        expr: up{job="superchat"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SuperChat server is down"
          description: "No metrics received for 5 minutes"

      # High connection rate (possible DoS)
      - alert: HighConnectionRate
        expr: rate(superchat_sessions_created_total[1m]) > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High connection rate detected"
          description: "{{ $value }} connections/sec (threshold: 100)"

      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(superchat_messages_sent_total{type="ERROR"}[5m])) /
          sum(rate(superchat_messages_received_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate (>10%)"
          description: "{{ $value | humanizePercentage }} of messages are errors"

      # High broadcast latency
      - alert: HighBroadcastLatency
        expr: histogram_quantile(0.95, rate(superchat_broadcast_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High broadcast latency"
          description: "P95 broadcast latency: {{ $value }}s (threshold: 1s)"

      # Database size growing rapidly
      - alert: DatabaseGrowthRapid
        expr: |
          predict_linear(
            process_resident_memory_bytes[1h], 24*3600
          ) > 16 * 1024 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Database growing rapidly"
          description: "Projected to reach 16GB in 24 hours"

      # Goroutine leak
      - alert: GoroutineLeakSuspected
        expr: deriv(go_goroutines[10m]) > 0.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Possible goroutine leak"
          description: "Goroutine count increasing: {{ $value }}/sec"

      # Active sessions near capacity
      - alert: SessionsNearCapacity
        expr: superchat_active_sessions > 9000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active sessions near tested limit"
          description: "{{ $value }} active sessions (tested max: 10k)"

      # CPU usage high
      - alert: HighCPUUsage
        expr: rate(process_cpu_seconds_total[5m]) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage: {{ $value }}% (threshold: 80%)"

Alertmanager Configuration

Create alertmanager/config.yml:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#superchat-alerts'
        title: 'SuperChat Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'

Health Checks

HTTP Health Endpoint

Endpoint: http://localhost:9090/health

Response (healthy):

{
  "status": "ok",
  "uptime": 12345,
  "sessions": 42,
  "database": "ok",
  "directory_enabled": true
}

Use cases:

Load balancer health checks
Monitoring system checks
Kubernetes liveness/readiness probes

Health Check Script

Create healthcheck.sh:

#!/bin/bash
# SuperChat health check script

set -e

# Check if server is listening on TCP port
if ! nc -z localhost 6465; then
  echo "ERROR: Server not listening on port 6465"
  exit 1
fi

# Check health endpoint
RESPONSE=$(curl -sf http://localhost:9090/health || echo "")
if [ -z "$RESPONSE" ]; then
  echo "ERROR: Health endpoint not responding"
  exit 1
fi

# Parse JSON response
STATUS=$(echo "$RESPONSE" | jq -r '.status')
DB_STATUS=$(echo "$RESPONSE" | jq -r '.database')

if [ "$STATUS" != "ok" ]; then
  echo "ERROR: Server status: $STATUS"
  exit 1
fi

if [ "$DB_STATUS" != "ok" ]; then
  echo "ERROR: Database status: $DB_STATUS"
  exit 1
fi

echo "OK: SuperChat server is healthy"
exit 0

Usage:

chmod +x healthcheck.sh
./healthcheck.sh

Kubernetes Probes

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: superchat
    image: superchat:latest
    ports:
    - containerPort: 6465
    - containerPort: 9090
    livenessProbe:
      httpGet:
        path: /health
        port: 9090
      initialDelaySeconds: 10
      periodSeconds: 30
    readinessProbe:
      httpGet:
        path: /health
        port: 9090
      initialDelaySeconds: 5
      periodSeconds: 10

Performance Profiling

SuperChat exposes pprof endpoints on port 6060.

Critical: This port must NEVER be exposed publicly!

Accessing pprof

Via SSH tunnel:

# From your local machine
ssh -L 6060:localhost:6060 user@server

# Access pprof endpoints
open http://localhost:6060/debug/pprof/

Available Profiles

CPU Profile (30 seconds):

go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

Heap Profile:

go tool pprof http://localhost:6060/debug/pprof/heap

Goroutine Profile:

go tool pprof http://localhost:6060/debug/pprof/goroutine

Allocs Profile (memory allocations):

go tool pprof http://localhost:6060/debug/pprof/allocs

Block Profile (blocking operations):

go tool pprof http://localhost:6060/debug/pprof/block

Interactive pprof

# Capture CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

# Interactive commands:
(pprof) top10      # Show top 10 functions by CPU time
(pprof) list main  # Show source code with line-by-line profile
(pprof) web        # Generate call graph (requires graphviz)
(pprof) pdf        # Save call graph as PDF

Flame Graphs

# Install go-torch (flame graph tool)
go install github.com/uber/go-torch@latest

# Generate flame graph
go-torch -u http://localhost:6060 -t 30

Log Aggregation

syslog Integration

SuperChat logs to systemd journal, which can forward to syslog.

Forward to remote syslog:

# /etc/rsyslog.d/superchat.conf
:programname, isequal, "scd" @@remote-syslog-server:514

logrotate Configuration

Create /etc/logrotate.d/superchat:

/var/lib/superchat/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 640 superchat superchat
    postrotate
        systemctl reload superchat > /dev/null 2>&1 || true
    endscript
}

ELK Stack Integration

Filebeat configuration (/etc/filebeat/filebeat.yml):

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/lib/superchat/server.log
      - /var/lib/superchat/errors.log
    fields:
      app: superchat
      type: server_log

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "superchat-%{+yyyy.MM.dd}"

Common Patterns

Detecting DoS Attacks

# High connection rate from single IP
curl -s http://localhost:9090/metrics | grep superchat_sessions_created_total | awk '{print $2}'

# Check if rate is increasing rapidly
# Compare values 1 minute apart

Identifying Spam Users

# Top message senders (requires log analysis)
grep "POST_MESSAGE" ~/.local/share/superchat/server.log | \
  awk '{print $4}' | sort | uniq -c | sort -rn | head -10

Monitoring Database Growth

# Database file size
du -h ~/.local/share/superchat/superchat.db

# Track growth over time
watch -n 60 'du -h ~/.local/share/superchat/superchat.db'

Analyzing Broadcast Performance

# Average fanout per broadcast type
avg by (type) (superchat_broadcast_fanout)

# P99 broadcast latency
histogram_quantile(0.99, rate(superchat_broadcast_duration_seconds_bucket[5m]))

Finding Memory Leaks

# Capture heap profile every 10 minutes
while true; do
  TIMESTAMP=$(date +%Y%m%d_%H%M%S)
  go tool pprof -text http://localhost:6060/debug/pprof/heap > heap_$TIMESTAMP.txt
  sleep 600
done

# Compare heap profiles
go tool pprof -base heap_20250108_140000.txt heap_20250108_150000.txt

Next Steps

DEPLOYMENT.md - Server deployment guide
CONFIGURATION.md - Configuration reference
SECURITY.md - Security hardening
BACKUP_AND_RECOVERY.md - Backup strategies