
Monitoring

FluxiQ Core includes comprehensive monitoring with Prometheus metrics, Grafana dashboards, and alerting.

Metrics Overview

Application Metrics

| Metric | Type | Description |
|---|---|---|
| fluxiq_http_requests_total | Counter | Total HTTP requests by method, path, status |
| fluxiq_http_request_duration_seconds | Histogram | Request latency distribution |
| fluxiq_pix_transactions_total | Counter | PIX transactions by type and status |
| fluxiq_pix_transaction_amount_total | Counter | Total PIX transaction amount (centavos) |
| fluxiq_account_balance | Gauge | Current account balance |
| fluxiq_active_connections | Gauge | Active WebSocket connections |
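
To make the label dimensions in the table concrete, the sketch below renders one counter sample in the Prometheus text exposition format. The label keys `method`, `path`, and `status` are assumed to match the "by method, path, status" description; the helper name `format_sample` and the sample values are hypothetical and are not part of FluxiQ Core.

```python
def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample line in the Prometheus text exposition format."""

    def escape(raw: str) -> str:
        # Label values escape backslash, double quote, and newline.
        return raw.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if not labels:
        return f"{name} {value}"
    body = ",".join(f'{key}="{escape(val)}"' for key, val in labels.items())
    return f"{name}{{{body}}} {value}"


# Hypothetical sample: label keys assumed from the table, values invented.
line = format_sample(
    "fluxiq_http_requests_total",
    {"method": "GET", "path": "/health", "status": "200"},
    1027,
)
print(line)
# fluxiq_http_requests_total{method="GET",path="/health",status="200"} 1027
```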

Ledger Metrics

| Metric | Type | Description |
|---|---|---|
| ledger_operations_total | Counter | Operations by backend (tigerbeetle/fallback) and status |
| ledger_operation_duration_seconds | Histogram | Ledger operation latency |
| ledger_circuit_breaker_state | Gauge | Circuit breaker state (0=closed, 1=open, 2=half-open) |
| ledger_fallback_total | Counter | Fallback events |
| ledger_backend_health | Gauge | Backend health (1=healthy, 0=unhealthy) |
| ledger_active_backend | Gauge | Currently active backend |
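
The numeric encodings of `ledger_circuit_breaker_state` and `ledger_backend_health` are easy to misread in ad hoc tooling. A minimal sketch of decoding them follows; the `BreakerState` and `describe_ledger` names are hypothetical, and only the value mappings come from the table above.

```python
from enum import IntEnum


class BreakerState(IntEnum):
    """Values documented for ledger_circuit_breaker_state."""
    CLOSED = 0
    OPEN = 1
    HALF_OPEN = 2


def describe_ledger(breaker_value: float, backend_health: float) -> str:
    """Turn raw gauge values into a human-readable summary."""
    state = BreakerState(int(breaker_value))  # raises ValueError on unknown values
    health = "healthy" if backend_health == 1 else "unhealthy"
    label = state.name.lower().replace("_", "-")
    return f"circuit breaker {label}, backend {health}"


print(describe_ledger(2, 1))  # circuit breaker half-open, backend healthy
```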

Infrastructure Metrics

| Metric | Type | Description |
|---|---|---|
| nats_messages_total | Counter | NATS messages by stream and subject |
| nats_consumer_lag | Gauge | Consumer message lag |
| redis_connections_active | Gauge | Active Redis connections |
| redis_memory_used_bytes | Gauge | Redis memory usage |
| pg_connections_active | Gauge | Active PostgreSQL connections |
| pg_query_duration_seconds | Histogram | Query latency |

Grafana Dashboards

Main Operations Dashboard

The primary dashboard provides a real-time view of system health:

  • Request rate (TPS) with SLO overlay
  • Error rate by status code
  • P50/P95/P99 latency
  • PIX In/Out transaction volume
  • Active accounts gauge
  • Revenue tracking (daily/monthly)

Import: infrastructure/monitoring/main-operations-dashboard.json
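
The P50/P95/P99 panels are derived from histogram buckets rather than from raw request timings. The sketch below estimates a quantile from cumulative buckets by linear interpolation, in the style of Prometheus's `histogram_quantile` function. It is an illustration of the technique, not the dashboard's actual query, and the bucket boundaries and counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with an infinite bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if math.isinf(upper_bound) or in_bucket == 0:
                # No finite upper edge (or an empty bucket) to interpolate in:
                # fall back to the lower edge of this bucket.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative latency buckets, in seconds.
buckets = [(0.05, 600), (0.1, 900), (0.2, 980), (0.5, 1000), (math.inf, 1000)]
p95 = histogram_quantile(0.95, buckets)
print(round(p95, 4))  # 0.1625
```

Because the estimate interpolates within a bucket, its accuracy depends on how finely the buckets are spaced around the latency range of interest.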

TigerBeetle Performance Dashboard

  • Transfer throughput (TPS)
  • Account lookup latency
  • Batch size distribution
  • io_uring operations
  • Memory usage and disk I/O

Import: infrastructure/monitoring/tigerbeetle-performance-dashboard.json

Ledger Fallback Dashboard

  • Circuit breaker state (gauge)
  • Active backend indicator
  • Operations by backend (timeseries)
  • Fallback rate and error rate
  • P50/P95/P99 latency by backend
  • Recent fallback events

Import: infrastructure/monitoring/ledger-fallback-dashboard.json

Alert Policies

Critical Alerts

| Alert | Condition | Action |
|---|---|---|
| Both Backends Down | TigerBeetle AND fallback unhealthy for 30s | Page on-call |
| High Error Rate | >5% error rate for 5 minutes | Page on-call |
| Database Down | PostgreSQL unreachable for 1 minute | Page on-call |
| Zero Transactions | No PIX transactions for 15 minutes (business hours) | Investigate |
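
As an illustration of the High Error Rate condition, the sketch below computes an error ratio from two snapshots of a request counter keyed by status code. It assumes that "error" means a 5xx response, which the table does not specify; the `error_rate` helper and the counter values are hypothetical.

```python
def error_rate(before: dict[str, int], after: dict[str, int]) -> float:
    """Fraction of requests between two counter snapshots that returned 5xx."""
    deltas = {status: after[status] - before.get(status, 0) for status in after}
    total = sum(deltas.values())
    if total == 0:
        return 0.0
    # Assumption: only 5xx statuses count as errors.
    errors = sum(n for status, n in deltas.items() if status.startswith("5"))
    return errors / total


# Hypothetical counter values at the start and end of a window.
before = {"200": 10_000, "500": 20}
after = {"200": 10_900, "500": 120}
rate = error_rate(before, after)  # 100 errors out of 1000 requests
should_page = rate > 0.05         # threshold from the table
print(rate, should_page)
```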

Warning Alerts

| Alert | Condition | Action |
|---|---|---|
| Circuit Breaker Open | TigerBeetle circuit open for 1 minute | Monitor |
| High Fallback Rate | >10 fallback events/min for 2 minutes | Investigate |
| P95 Latency High | >200ms for 5 minutes | Investigate |
| NATS Consumer Lag | >1000 messages lag for 5 minutes | Check workers |
| Redis Memory High | >80% capacity | Scale up |
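
Most conditions above fire only once a threshold has been exceeded continuously for some duration. One way to express that "for N minutes" rule is sketched below, assuming samples arrive in time order. The `breached_for` helper and the sample data are hypothetical; in practice the alerting system would normally evaluate this itself.

```python
from datetime import datetime, timedelta


def breached_for(
    samples: list[tuple[datetime, float]],
    threshold: float,
    duration: timedelta,
) -> bool:
    """True if the latest unbroken run of samples above threshold spans duration."""
    streak_start = None
    for timestamp, value in samples:
        if value > threshold:
            if streak_start is None:
                streak_start = timestamp
        else:
            streak_start = None  # any recovery resets the clock
    if streak_start is None:
        return False
    return samples[-1][0] - streak_start >= duration


# Hypothetical P95 latency samples (seconds), one per minute.
start = datetime(2026, 2, 3, 12, 0)
steady = [(start + timedelta(minutes=i), 0.25) for i in range(6)]
dipped = [(ts, 0.15 if i == 3 else v) for i, (ts, v) in enumerate(steady)]

print(breached_for(steady, 0.2, timedelta(minutes=5)))  # True
print(breached_for(dipped, 0.2, timedelta(minutes=5)))  # False
```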

Health Checks

Application Health

```http
GET /health
```

```json
{
  "status": "healthy",
  "version": "1.2.0",
  "uptime": 864000,
  "checks": {
    "database": "ok",
    "redis": "ok",
    "nats": "ok",
    "tigerbeetle": "ok"
  }
}
```
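
A client consuming this response usually wants to know which dependency failed rather than just the overall status. A minimal sketch of extracting that follows. The `failing_checks` helper is hypothetical, and the non-`"ok"` values in the degraded example (`"degraded"`, `"error"`, `"timeout"`) are invented for illustration, since only `"ok"` is documented.

```python
import json


def failing_checks(body: str) -> list[str]:
    """Return the names of dependency checks that are not reported as "ok"."""
    payload = json.loads(body)
    checks = payload.get("checks", {})
    return sorted(name for name, state in checks.items() if state != "ok")


healthy = (
    '{"status": "healthy", "checks": '
    '{"database": "ok", "redis": "ok", "nats": "ok", "tigerbeetle": "ok"}}'
)
# Hypothetical degraded payload; these failure values are assumptions.
degraded = (
    '{"status": "degraded", "checks": '
    '{"database": "ok", "redis": "error", "nats": "ok", "tigerbeetle": "timeout"}}'
)

print(failing_checks(healthy))   # []
print(failing_checks(degraded))  # ['redis', 'tigerbeetle']
```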

Readiness and Liveness

  • GET /ready — Returns 200 when all dependencies are connected
  • GET /live — Returns 200 if the process is running (Kubernetes liveness probe)

Log Aggregation

Structured JSON logs are shipped to Cloud Logging:

```json
{
  "timestamp": "2026-02-03T12:00:00.000Z",
  "level": "info",
  "message": "PIX charge paid",
  "request_id": "req_01HQGX...",
  "charge_id": "chg_01HQGX...",
  "amount": 15000,
  "duration_ms": 8,
  "backend": "tigerbeetle"
}
```
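
For services that need to emit entries in the same shape, the sketch below shows one way to produce such a line with Python's standard `logging` module. It is not FluxiQ Core's own logger: the `JsonFormatter` class and the `fields` key passed through `extra` are hypothetical conventions chosen for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Structured fields are passed via extra={"fields": {...}} (assumed convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("fluxiq.example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info(
    "PIX charge paid",
    extra={"fields": {"amount": 15000, "duration_ms": 8, "backend": "tigerbeetle"}},
)
```

Keeping numeric fields such as `duration_ms` as JSON numbers rather than strings is what allows comparison filters on them in log queries.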

Useful Log Queries

```bash
# All errors in the last hour (without --freshness, gcloud searches the last day)
gcloud logging read 'severity>=ERROR AND resource.type="cloud_run_revision"' --freshness=1h --limit=50

# PIX transaction logs
gcloud logging read 'jsonPayload.message=~"PIX"' --limit=20

# Slow operations (duration_ms > 100)
gcloud logging read 'jsonPayload.duration_ms>100' --limit=20
```

SLO Targets

| Metric | Target | Current |
|---|---|---|
| Availability | 99.95% | 99.99% |
| P95 Latency | <200ms | 45ms |
| Error Rate | <0.1% | 0.03% |
| PIX Processing Time | <5s | <2s |
| Failover Time | <10s | <5s |
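
An availability target is easier to reason about as a downtime budget. The sketch below converts the percentages in the table into allowed minutes of downtime, assuming a 30-day measurement window. The window length is an assumption (this page does not state one), and `allowed_downtime_minutes` is a hypothetical helper.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Downtime budget, in minutes, implied by an availability percentage."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - availability_pct) / 100


target_budget = allowed_downtime_minutes(99.95)  # about 21.6 minutes per 30 days
current_usage = allowed_downtime_minutes(99.99)  # about 4.32 minutes per 30 days
print(round(target_budget, 2), round(current_usage, 2))
```

Under that assumption, the current 99.99% figure corresponds to using roughly a fifth of the budget that the 99.95% target allows.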
