Monitoring
FluxiQ Core exposes Prometheus metrics, ships Grafana dashboards, and defines alert policies for critical and warning conditions.
Metrics Overview
Application Metrics
| Metric | Type | Description |
|---|---|---|
| fluxiq_http_requests_total | Counter | Total HTTP requests by method, path, status |
| fluxiq_http_request_duration_seconds | Histogram | Request latency distribution |
| fluxiq_pix_transactions_total | Counter | PIX transactions by type and status |
| fluxiq_pix_transaction_amount_total | Counter | Total PIX transaction amount (centavos) |
| fluxiq_account_balance | Gauge | Current account balance |
| fluxiq_active_connections | Gauge | Active WebSocket connections |
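
To make the "by method, path, status" labelling concrete, here is a minimal sketch of a labelled counter rendered in the Prometheus text exposition format. The `LabeledCounter` class and the `/charges` path are hypothetical and exist only for illustration; a real service would normally rely on a Prometheus client library rather than hand-rolled code like this.

```python
from collections import defaultdict


class LabeledCounter:
    """Hypothetical in-memory counter keyed by label values (illustration only)."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Counters are monotonic: they may only go up.
        if amount < 0:
            raise ValueError("counters can only increase")
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(str(labels[n]) for n in self.label_names)
        self._values[key] += amount

    def render(self):
        # One sample line per label combination; label-value escaping is omitted here.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key in sorted(self._values):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {self._values[key]:g}")
        return "\n".join(lines) + "\n"


requests_total = LabeledCounter(
    "fluxiq_http_requests_total",
    "Total HTTP requests by method, path, status",
    ["method", "path", "status"],
)
requests_total.inc(method="GET", path="/health", status=200)
requests_total.inc(method="GET", path="/health", status=200)
requests_total.inc(method="POST", path="/charges", status=500)  # hypothetical path
print(requests_total.render())
```

Each distinct label combination becomes its own time series, which is why high-cardinality labels (for example raw paths with IDs in them) are usually normalised before being recorded.
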
Ledger Metrics
| Metric | Type | Description |
|---|---|---|
| ledger_operations_total | Counter | Operations by backend (tigerbeetle/fallback) and status |
| ledger_operation_duration_seconds | Histogram | Ledger operation latency |
| ledger_circuit_breaker_state | Gauge | Circuit breaker state (0=closed, 1=open, 2=half-open) |
| ledger_fallback_total | Counter | Fallback events |
| ledger_backend_health | Gauge | Backend health (1=healthy, 0=unhealthy) |
| ledger_active_backend | Gauge | Currently active backend |
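
The numeric gauge encodings in the table can be decoded as follows. This is a sketch under assumptions: `summarize_ledger` is a hypothetical helper, it assumes one `ledger_backend_health` sample per backend, and the severity wording is illustrative rather than taken from FluxiQ Core.

```python
from enum import IntEnum


class BreakerState(IntEnum):
    """Values of ledger_circuit_breaker_state as listed in the table."""
    CLOSED = 0
    OPEN = 1
    HALF_OPEN = 2


def summarize_ledger(breaker_value, primary_health, fallback_health):
    """Classify ledger status from gauge samples (hypothetical helper).

    primary_health and fallback_health are ledger_backend_health samples
    (1 = healthy, 0 = unhealthy); one sample per backend is an assumption.
    """
    state = BreakerState(int(breaker_value))  # raises ValueError on unknown values
    if primary_health == 0 and fallback_health == 0:
        return "critical: both backends unhealthy"
    if state is BreakerState.OPEN:
        return "warning: circuit open"
    if state is BreakerState.HALF_OPEN:
        return "warning: circuit half-open"
    return "ok: circuit closed"


print(summarize_ledger(0, 1, 1))
print(summarize_ledger(1, 0, 1))
print(summarize_ledger(2.0, 1, 1))
print(summarize_ledger(1, 0, 0))
```
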
Infrastructure Metrics
| Metric | Type | Description |
|---|---|---|
| nats_messages_total | Counter | NATS messages by stream and subject |
| nats_consumer_lag | Gauge | Consumer message lag |
| redis_connections_active | Gauge | Active Redis connections |
| redis_memory_used_bytes | Gauge | Redis memory usage |
| pg_connections_active | Gauge | Active PostgreSQL connections |
| pg_query_duration_seconds | Histogram | Query latency |
Grafana Dashboards
Main Operations Dashboard
The primary dashboard provides a real-time view of system health:
- Request rate (TPS) with SLO overlay
- Error rate by status code
- P50/P95/P99 latency
- PIX In/Out transaction volume
- Active accounts gauge
- Revenue tracking (daily/monthly)
Import: infrastructure/monitoring/main-operations-dashboard.json
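
The P50/P95/P99 panels are derived from a Histogram metric, so the percentiles are estimates interpolated from bucket counts rather than exact values. The sketch below shows the idea with linear interpolation inside the matching bucket, similar in spirit to PromQL's `histogram_quantile()` but simplified. The bucket boundaries and counts are made-up illustration data, not FluxiQ measurements.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket must have an upper bound of +Inf, as Prometheus
    histograms do. Values are assumed to start at 0.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Cannot interpolate into +Inf or an empty bucket.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound  # not reached for well-formed cumulative buckets


# Illustrative cumulative buckets in seconds: (upper bound, observations <= bound).
latency_buckets = [
    (0.01, 500),
    (0.05, 900),
    (0.1, 980),
    (0.5, 1000),
    (math.inf, 1000),
]
for q in (0.50, 0.95, 0.99):
    estimate = histogram_quantile(q, latency_buckets)
    print(f"P{int(q * 100)} = {estimate * 1000:.2f} ms")
```

Because the estimate can never be finer than the bucket it falls in, bucket boundaries are usually chosen to be dense around the SLO threshold.
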
TigerBeetle Performance Dashboard
- Transfer throughput (TPS)
- Account lookup latency
- Batch size distribution
- io_uring operations
- Memory usage and disk I/O
Import: infrastructure/monitoring/tigerbeetle-performance-dashboard.json
Ledger Fallback Dashboard
- Circuit breaker state (gauge)
- Active backend indicator
- Operations by backend (timeseries)
- Fallback rate and error rate
- P50/P95/P99 latency by backend
- Recent fallback events
Import: infrastructure/monitoring/ledger-fallback-dashboard.json
Alert Policies
Critical Alerts
| Alert | Condition | Action |
|---|---|---|
| Both Backends Down | TigerBeetle AND fallback unhealthy for 30s | Page on-call |
| High Error Rate | >5% error rate for 5 minutes | Page on-call |
| Database Down | PostgreSQL unreachable for 1 minute | Page on-call |
| Zero Transactions | No PIX transactions for 15 minutes (business hours) | Investigate |
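
As a rough illustration of the High Error Rate condition, the sketch below computes an error fraction from two scrapes of `fluxiq_http_requests_total` taken at the start and end of a window. Treating only 5xx statuses as errors is an assumption (the source does not define "error"), and the sample values are invented.

```python
def counter_delta(start, end):
    """Increase of a counter between two samples, tolerating a reset to zero."""
    return end - start if end >= start else end


def error_rate(start_by_status, end_by_status):
    """Fraction of requests in the window whose status is 5xx.

    Both arguments map HTTP status code -> fluxiq_http_requests_total sample.
    Counting only 5xx as errors is an assumption made for this sketch.
    """
    total = errors = 0.0
    for status, end_value in end_by_status.items():
        delta = counter_delta(start_by_status.get(status, 0.0), end_value)
        total += delta
        if 500 <= int(status) <= 599:
            errors += delta
    return errors / total if total else 0.0


# Invented counter samples at the start and end of a 5-minute window.
start = {200: 10_000, 404: 120, 500: 40}
end = {200: 10_900, 404: 130, 500: 130}
rate = error_rate(start, end)
print(f"error rate over window: {rate:.1%}")
print("breaches 5% threshold:", rate > 0.05)
```
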
Warning Alerts
| Alert | Condition | Action |
|---|---|---|
| Circuit Breaker Open | TigerBeetle circuit open for 1 minute | Monitor |
| High Fallback Rate | >10 fallback events/min for 2 minutes | Investigate |
| P95 Latency High | >200ms for 5 minutes | Investigate |
| NATS Consumer Lag | >1000 messages lag for 5 minutes | Check workers |
| Redis Memory High | >80% capacity | Scale up |
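
Most conditions in both tables include a hold time ("for 5 minutes"), meaning the alert should fire only after the threshold has been breached continuously for that long. A minimal sketch of that behaviour follows; `HeldCondition`, the one-minute evaluation interval, and the latency samples are assumptions for illustration, not FluxiQ's alerting implementation.

```python
class HeldCondition:
    """Fires only after a condition has been true continuously for hold_seconds."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self._since = None  # timestamp at which the current breach started

    def observe(self, now, breached):
        """Record one evaluation and return 'inactive', 'pending' or 'firing'."""
        if not breached:
            self._since = None  # any recovery resets the hold timer
            return "inactive"
        if self._since is None:
            self._since = now
        return "firing" if now - self._since >= self.hold_seconds else "pending"


# P95 Latency High: >200ms for 5 minutes, evaluated once a minute (assumed interval).
THRESHOLD_SECONDS = 0.200
alert = HeldCondition(hold_seconds=300)
samples = [0.150, 0.250, 0.260, 0.240, 0.230, 0.220, 0.210, 0.180]
states = [
    alert.observe(now=i * 60, breached=p95 > THRESHOLD_SECONDS)
    for i, p95 in enumerate(samples)
]
print(states)
```

The hold time trades detection speed for fewer false pages: a single slow scrape moves the alert to pending but never pages anyone.
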
Health Checks
Application Health
```http
GET /health
```

```json
{
  "status": "healthy",
  "version": "1.2.0",
  "uptime": 864000,
  "checks": {
    "database": "ok",
    "redis": "ok",
    "nats": "ok",
    "tigerbeetle": "ok"
  }
}
```

Readiness and Liveness
- GET /ready: Returns 200 when all dependencies are connected
- GET /live: Returns 200 if the process is running (Kubernetes liveness probe)
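
A monitoring script can use the `checks` object of the `/health` response to report which dependency is failing. The sketch below only parses a response body. The `failing_checks` helper, the `"degraded"` status, and the `"timeout"` check value are hypothetical, since the source shows only the healthy case.

```python
import json


def failing_checks(health_body):
    """Return the names of dependencies whose check is not "ok"."""
    payload = json.loads(health_body)
    checks = payload.get("checks", {})
    return sorted(name for name, result in checks.items() if result != "ok")


# Hypothetical unhealthy response; only the "ok" value appears in the source.
sample = """
{
  "status": "degraded",
  "version": "1.2.0",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "nats": "timeout",
    "tigerbeetle": "ok"
  }
}
"""
print(failing_checks(sample))
```
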
Log Aggregation
Structured JSON logs are shipped to Cloud Logging:
```json
{
  "timestamp": "2026-02-03T12:00:00.000Z",
  "level": "info",
  "message": "PIX charge paid",
  "request_id": "req_01HQGX...",
  "charge_id": "chg_01HQGX...",
  "amount": 15000,
  "duration_ms": 8,
  "backend": "tigerbeetle"
}
```

Useful Log Queries
```bash
# All errors in the last hour
gcloud logging read 'severity>=ERROR AND resource.type="cloud_run_revision"' --freshness=1h --limit=50

# PIX transaction logs
gcloud logging read 'jsonPayload.message=~"PIX"' --limit=20

# Slow queries (>100ms)
gcloud logging read 'jsonPayload.duration_ms>100' --limit=20
```

SLO Targets
| Metric | Target | Current |
|---|---|---|
| Availability | 99.95% | 99.99% |
| P95 Latency | <200ms | 45ms |
| Error Rate | <0.1% | 0.03% |
| PIX Processing Time | <5s | <2s |
| Failover Time | <10s | <5s |
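
The availability target can be translated into an error budget, which is often easier to reason about than a percentage. The sketch below works through the arithmetic for the figures in the table. The 30-day window is an assumption, as the source does not state the measurement period.

```python
def allowed_downtime_minutes(availability_target, window_days=30):
    """Minutes of downtime permitted by the target over the window."""
    return (1.0 - availability_target) * window_days * 24 * 60


def budget_consumed(availability_target, observed_availability):
    """Fraction of the error budget used at the observed availability."""
    budget = 1.0 - availability_target
    return (1.0 - observed_availability) / budget


# Target 99.95% and current 99.99% are taken from the table above.
print(f"{allowed_downtime_minutes(0.9995):.1f} min of downtime allowed per 30 days")
print(f"{budget_consumed(0.9995, 0.9999):.0%} of the error budget consumed")
```
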