Modern Reverse Proxies Part 2: Traefik - The Cloud-Native Orchestrator
February 16, 2026
In Part 1 of our series on modern reverse proxies, we established the business case for dynamic service discovery and automated configuration. We saw how traditional proxies like Nginx and Apache create operational bottlenecks in cloud-native environments. Now, in this second installment, we'll take a deep dive into Traefik — exploring its architecture, capabilities, and practical deployment considerations.
Before we get into the specifics of Traefik, let's establish what we're examining: Traefik is a cloud-native reverse proxy and load balancer designed specifically for containerized environments. It automatically discovers services and configures routing without manual intervention. Over the course of this article, we'll examine how Traefik works, when it makes sense to use it, and what practical considerations you should keep in mind.
Traefik: The Cloud-Native Orchestrator
Traefik excels in complex, container-based environments, especially with Kubernetes or Docker Swarm. Its deep integration streamlines routing management in large-scale microservice architectures.
One may wonder: what makes Traefik different from traditional reverse proxies? The answer lies in its architecture. Where Nginx or Apache require you to manually update configuration files whenever services change, Traefik actively monitors your infrastructure and adapts automatically. This goes beyond a convenience feature — it fundamentally changes how you manage routing in dynamic environments.[^1]
Deep Integration with Container Platforms
Traefik's most distinctive capability is its native integration with container orchestration platforms. This integration means you define routing rules alongside your application definitions, enabling infrastructure-as-code practices for your entire stack.
Kubernetes Integration:
Traefik operates as a Kubernetes Ingress Controller, meaning it integrates directly with Kubernetes' own ingress resources:
```yaml
# Example: Kubernetes Ingress resource for Traefik
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  annotations:
    # Traefik-specific annotations for advanced routing
    traefik.ingress.kubernetes.io/router.entrypoints: web,websecure
    traefik.ingress.kubernetes.io/router.tls: "true"
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-service
                port:
                  number: 8080
  tls:
    - hosts:
        - myapp.example.com
      secretName: myapp-tls-secret
```
Beyond basic ingress, Traefik extends Kubernetes with Custom Resource Definitions (CRDs) for advanced routing scenarios that aren't possible with standard ingress:
```yaml
# Example: Traefik's Middleware CRD for request transformation
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: strip-prefix
spec:
  stripPrefix:
    prefixes:
      - /api
```
These CRDs give you fine-grained control over rate limiting, authentication, header manipulation, and more — all defined as Kubernetes resources.[^2]
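For instance, a middleware like the strip-prefix resource above is attached to a route through Traefik's IngressRoute CRD. The hostname and service name below are illustrative:

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: myapp-route
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`myapp.example.com`) && PathPrefix(`/api`)
      kind: Rule
      middlewares:
        - name: strip-prefix   # references the Middleware CRD above
      services:
        - name: myapp-service
          port: 8080
```

Because the route and the middleware are both Kubernetes resources, they can be reviewed and deployed through the same pipeline as the application itself.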
Docker Integration:
For Docker environments without orchestration, Traefik uses Docker labels to discover services:
```yaml
# Example: docker-compose.yml with Traefik labels
version: '3.8'
services:
  myapp:
    image: myapp:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.myapp.rule=Host(`myapp.localhost`)"
      - "traefik.http.services.myapp.loadbalancer.server.port=8080"
      - "traefik.http.middlewares.test-ratelimit.ratelimit.average=100"
      - "traefik.http.routers.myapp.middlewares=test-ratelimit@docker"
    ports:
      - "8080:8080"
```
When you start this container, Traefik automatically detects it through the Docker socket and configures routing based on those labels — no manual configuration needed.
Cloud Provider Integration:
Traefik integrates with major cloud providers' container services:
- AWS EKS: Native integration as a Kubernetes ingress controller
- Google Cloud Run: Automatic service discovery
- Azure Container Instances: Support via Kubernetes
- Multi-cloud: Consistent configuration across providers
This integration means routing configuration lives alongside application definitions. Consider what this enables:[^3]
- Version control for entire infrastructure: Your routing rules are in Git alongside your application code
- GitOps workflows: Changes reviewed, tested, and deployed systematically
- Reduced context switching: No separate configuration management system
- Single source of truth: Infrastructure defined in one place
Of course, these benefits accrue mostly to organizations heavily invested in containerization. If you're running traditional monoliths on virtual machines, Traefik's value proposition is less compelling.[^4]
Advanced Routing Capabilities
Traefik provides sophisticated routing options that go well beyond simple path-based routing. Let's examine what's available with concrete examples.
Path-Based Routing:
The most common pattern: route different URL paths to different services:
```yaml
# Traefik dynamic configuration (file provider)
http:
  routers:
    api:
      rule: "PathPrefix(`/api`)"
      service: api-service
    admin:
      rule: "PathPrefix(`/admin`)"
      service: admin-service
    frontend:
      rule: "PathPrefix(`/`)"
      service: frontend-service
```
In this configuration, requests to /api/* go to the API service, /admin/* to the admin interface, and everything else to the frontend.
Host-Based Routing:
Route based on domain names — essential for multi-tenant applications:
```yaml
http:
  routers:
    api:
      rule: "Host(`api.example.com`)"
      service: api-service
    www:
      rule: "Host(`www.example.com`)"
      service: frontend-service
    admin:
      rule: "Host(`admin.example.com`)"
      service: admin-service
```
Header-Based Routing:
Route based on custom headers — useful for A/B testing, canary deployments, or feature flags:
```yaml
http:
  routers:
    canary:
      # Traefik v3 uses Header(); Traefik v2 used Headers()
      rule: "Host(`app.example.com`) && Header(`X-Canary`, `true`)"
      service: canary-service
    stable:
      rule: "Host(`app.example.com`)"
      service: stable-service
```
In this configuration, only requests carrying the header X-Canary: true go to the canary service; all others go to stable.[^5]
Weighted Round-Robin:
Gradually shift traffic between versions — critical for safe rollouts:
```yaml
http:
  services:
    myapp:
      weighted:
        services:
          - name: myapp-v1
            weight: 90   # 90% of traffic
          - name: myapp-v2
            weight: 10   # 10% of traffic
```
Start with 90/10 split, monitor, then gradually increase traffic to v2. If issues arise, you can immediately revert by adjusting weights.
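In Kubernetes, the same split can be declared with Traefik's TraefikService CRD; the service names here are illustrative:

```yaml
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: myapp-weighted
spec:
  weighted:
    services:
      - name: myapp-v1   # backing Kubernetes Service (illustrative)
        port: 8080
        weight: 90
      - name: myapp-v2
        port: 8080
        weight: 10
```

An IngressRoute can then point at myapp-weighted instead of a plain Service, so shifting traffic is a one-line weight change in Git.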
Priority-Based Routing:
When multiple routes could match, Traefik uses priority to determine which takes precedence:
```yaml
http:
  routers:
    specific:
      rule: "Host(`app.example.com`) && PathPrefix(`/admin`)"
      priority: 100
      service: admin-service
    general:
      rule: "Host(`app.example.com`)"
      priority: 10
      service: frontend-service
```
The /admin route has higher priority, so admin requests match that router rather than the general frontend router.
Middleware and Request Transformation
Traefik's middleware system allows you to modify requests and responses without changing your applications. Middlewares are chained together to create powerful processing pipelines.
Authentication Middleware:
Traefik supports several authentication approaches:
- Basic Auth: Simple username/password protection
- Digest Auth: More secure than basic (challenge-response)
- Forward Auth: Delegate authentication to external service
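For the simplest case, a basicAuth middleware only needs a list of htpasswd-style entries. The user entry below is a placeholder, not a working hash:

```yaml
http:
  middlewares:
    admin-auth:
      basicAuth:
        users:
          # Generate real entries with the htpasswd tool; this is a placeholder
          - "admin:$apr1$REPLACE$WITH-HTPASSWD-OUTPUT"
```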
Let's look at a concrete example of forward auth:
```yaml
http:
  middlewares:
    auth:
      forwardAuth:
        address: https://auth.example.com/validate
        trustForwardHeader: true
        authResponseHeaders:
          - "X-User-Email"
          - "X-User-Name"
  routers:
    protected:
      rule: "Host(`app.example.com`)"
      middlewares:
        - auth
      service: app-service
```
When a request arrives, Traefik forwards it to your authentication service. A 2xx response means the request is authenticated; anything else is rejected. On success, Traefik adds the specified headers to the request before forwarding it to your app.[^6]
Security Middleware:
Rate limiting is essential for protecting against abuse:
```yaml
http:
  middlewares:
    ratelimit:
      rateLimit:
        average: 100
        burst: 200
        period: 1m
```
This allows 100 requests per minute on average, with bursts of up to 200.[^7] We'll discuss rate limiting strategy in more detail later.
IP allow-listing for admin interfaces:

```yaml
http:
  middlewares:
    admin-ips:
      ipAllowList:   # named ipWhiteList in Traefik v2
        sourceRange:
          - "10.0.0.0/8"
          - "192.168.0.0/16"
          - "203.0.113.1"   # Specific office IP
```
Transformation Middleware:
Path rewriting is common when migrating from legacy routing:
```yaml
http:
  middlewares:
    strip-api:
      stripPrefix:
        prefixes:
          - /api/v2
          - /api/v1
```
A request to /api/v2/users becomes /users before reaching your service.
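The inverse middleware, addPrefix, prepends a segment instead. This sketch assumes a hypothetical add-v2 middleware name:

```yaml
http:
  middlewares:
    add-v2:
      addPrefix:
        prefix: /api/v2
```

With this applied, a request to /users reaches the backend as /api/v2/users — handy when a backend still expects its old path layout.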
Header manipulation for security:
```yaml
http:
  middlewares:
    security-headers:
      headers:
        browserXSSFilter: true
        contentTypeNosniff: true
        forceSTSHeader: true
        stsIncludeSubdomains: true
        stsPreload: true
        stsSeconds: 31536000   # 1 year
        customFrameOptionsValue: "SAMEORIGIN"
```
Observability Middleware:
Access logging is disabled by default in Traefik; enable it in the static configuration and customize the format:

```yaml
accessLog:
  format: json
  filters:
    statusCodes:
      - "400-599"
  bufferingSize: 100
```
For distributed tracing with Jaeger:
```yaml
# Traefik v2 syntax; Traefik v3 consolidates tracing on OpenTelemetry (tracing.otlp)
tracing:
  jaeger:
    samplingServerURL: http://jaeger:5778/sampling
    localAgentHostPort: "jaeger:6831"
    traceContextHeaderName: "uber-trace-id"
```
Observability and Dashboard
Traefik provides an unusually high level of observability out of the box, which is crucial for production deployments.
Real-Time Dashboard:
The Traefik dashboard gives you immediate visibility into what's routing where:
```yaml
# Enable the dashboard in Traefik configuration
api:
  dashboard: true
  insecure: false   # Set to true only for development
```
With insecure mode enabled, the dashboard is served on port 8080 (http://your-traefik-host:8080/dashboard/); in production, expose it through a router instead. The dashboard shows:
- Active routers and their rules
- Service health and backend status
- Request metrics (rate, latency, status codes)
- Real-time request tracing
Of course, in production you should secure the dashboard properly — either behind authentication or accessible only from internal networks.
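One common pattern is a dedicated router pointing at Traefik's built-in api@internal service; the hostname and the admin-auth basicAuth middleware below are assumptions for the sketch:

```yaml
http:
  routers:
    dashboard:
      rule: "Host(`traefik.internal.example.com`)"
      service: api@internal      # Traefik's built-in API/dashboard service
      middlewares:
        - admin-auth             # assumed basicAuth middleware defined elsewhere
```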
Metrics Integration:
Once --metrics.prometheus is enabled, Traefik exposes Prometheus metrics at /metrics on its dedicated entryPoint (port 8080 by default):

```bash
# Check if metrics are available
curl http://localhost:8080/metrics | head -20
```
You'll see metrics like these (exact labels depend on your providers):[^8]

```text
# HELP traefik_service_requests_total How many HTTP requests were processed on a service, partitioned by status code, protocol, and method.
# TYPE traefik_service_requests_total counter
traefik_service_requests_total{code="200",method="GET",protocol="http",service="myapp@docker"} 1245
traefik_service_requests_total{code="500",method="GET",protocol="http",service="myapp@docker"} 3
```
Set up Prometheus to scrape these metrics, then create Grafana dashboards for visualization.
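A minimal scrape job might look like this; the target address is an assumption (Traefik's metrics entryPoint on port 8080) and the path is Traefik's default:

```yaml
# prometheus.yml fragment (hypothetical target)
scrape_configs:
  - job_name: "traefik"
    metrics_path: /metrics
    static_configs:
      - targets: ["traefik:8080"]
```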
Distributed Tracing:
For microservices, understanding request flows across services is essential. Traefik v2 integrates with Jaeger, Zipkin, and Datadog APM, while Traefik v3 consolidates tracing on OpenTelemetry:

```yaml
# Traefik v2 syntax; in v3, configure tracing.otlp instead
tracing:
  jaeger:
    samplingServerURL: http://jaeger:5778/sampling
    localAgentHostPort: "jaeger:6831"
```
When enabled, Traefik propagates trace context through requests, allowing you to see the full path of a request through your system.[^9]
High Availability and Scaling
Traefik supports various high availability configurations. Unlike some proxies that rely on shared config stores, Traefik's approach is elegantly simple.
Horizontal Scaling:
Run multiple Traefik instances behind a load balancer:
```text
Internet → [Load Balancer] → [Traefik 1] → [Services]
                           → [Traefik 2] → [Services]
                           → [Traefik N] → [Services]
```
Key points:
- No shared state required: Each Traefik instance discovers services independently[^10]
- No sticky sessions needed: Traefik is stateless with respect to routing
- Configuration consistency: All instances should have identical configuration
Configuration Sources:
Traefik can obtain configuration from multiple sources simultaneously:
- File provider: Static configuration from YAML/TOML files
- Kubernetes provider: Ingress resources and CRDs
- Docker provider: Container labels
- Consul/Etcd: Distributed configuration for advanced scenarios
- Kubernetes IngressClass: Select which Traefik instance handles which ingress
You can mix these providers — for example, use Kubernetes for most services but file provider for manual routes.
Health Checks:
Traefik performs active health checks on backends:
```yaml
http:
  services:
    myapp:
      loadBalancer:
        healthCheck:
          path: /health
          interval: 30s
          timeout: 5s
```
Unhealthy backends are automatically removed from rotation until they recover. This is crucial for zero-downtime deployments — Traefik stops sending traffic to instances as they're being replaced.[^11]
When Traefik Makes Sense
Now that we've examined Traefik's capabilities, let's consider when it's the right choice. This is a critical decision — choosing the wrong tool creates unnecessary complexity.
Traefik particularly excels in these scenarios:
Organizations Heavily Invested in Kubernetes:
If your team already uses Kubernetes and is comfortable with its patterns, Traefik fits naturally:
- Native Kubernetes Ingress Controller
- Configuration as Kubernetes resources (ingress, middleware, services)
- kubectl and GitOps workflows extend to routing
- Namespaces and multi-tenancy support
- Helm charts for easy installation
The learning curve is minimal for teams already skilled in Kubernetes. Your existing CI/CD pipelines can manage Traefik configuration alongside application deployments.[^12]
Complex Microservices Architectures:
When you have dozens or hundreds of services with dynamic scaling:
- Automatic service discovery eliminates configuration overhead
- Advanced routing handles complex scenarios (canary, A/B testing)
- Middleware chain handles cross-cutting concerns centrally
- Service mesh capabilities via Traefik Mesh (for service-to-service communication)[^13]
Multi-Environment Deployments:
Traefik provides consistent tooling across development, staging, and production:
- Development: Docker Compose with labels
- Staging: Kubernetes with ingress resources
- Production: Multi-cluster Kubernetes with advanced routing
The same fundamental concepts apply everywhere, reducing cognitive load.
DevOps-Mature Organizations:
Teams with established infrastructure-as-code practices find Traefik aligns perfectly:
- GitOps workflows for routing changes
- Configuration reviewed via pull requests
- Automated testing of routing rules
- Comprehensive monitoring with metrics and tracing[^14]
If your organization already operates this way, Traefik integrates seamlessly.
Organizations Needing Advanced Features:
Traefik's middleware ecosystem provides capabilities that are difficult to implement otherwise:
- Circuit breakers and retry logic
- Rate limiting with burst capacity
- Sophisticated authentication flows[^15]
- Header manipulation for legacy API compatibility
Best for
Let's be specific about where Traefik's strengths align with organizational needs.
Strengths:
- Deep integration with container orchestrators, especially Kubernetes
- Automatic service discovery with zero configuration for basic cases
- Advanced middleware capabilities for complex routing needs
- Comprehensive observability (dashboard, Prometheus, Jaeger)
- Active community and commercial support from Traefik Labs
- Regular releases with new features
Enterprises with sophisticated DevOps practices: Teams that have invested in Kubernetes, infrastructure-as-code, and GitOps workflows will find Traefik extends naturally from those practices.
Organizations heavily invested in Kubernetes or Docker Swarm: Traefik's native integration makes it the natural choice for container-heavy environments. The configuration patterns are consistent with how you already manage applications.
Environments requiring advanced routing and load balancing: Complex routing logic, traffic splitting, and sophisticated middleware needs favor Traefik's approach.
Considerations
No tool is perfect for every situation. Let's examine the trade-offs honestly.
Challenges:
- Initial complexity: For simple use cases, Traefik's feature set can be overwhelming. You need to understand concepts like providers, middlewares, services, and routers.
- Kubernetes knowledge required: To use Traefik effectively in production, you need solid Kubernetes understanding. This is not a barrier for Kubernetes shops but makes Traefik a poor choice if you're avoiding Kubernetes.
- Configuration can become complex: Advanced scenarios involve multiple CRDs, middleware chains, and careful ordering. Debugging complex configurations requires familiarity with Traefik's internal model.
- Resource overhead: Traefik uses more memory than minimal proxies (typically 100-300MB). For very simple deployments, this may matter.[^16]
Initial Complexity:
Teams new to container orchestration or infrastructure-as-code may find Traefik's approach initially challenging. The concept of declarative routing — where configuration emerges from infrastructure state rather than static files — represents a shift in mindset.
This isn't a flaw in Traefik — it's a consequence of solving the dynamic configuration problem.[^17] Manual configuration (Nginx style) is simpler conceptually but doesn't scale in dynamic environments. You're choosing where to place complexity: in initial learning or in ongoing operations.[^18]
Operational Overhead:
While automation reduces ongoing work, the initial setup and learning curve can be significant. Plan for:
- Team training on Traefik concepts and Kubernetes ingress patterns
- Developing internal best practices for configuration organization
- Setting up monitoring and alerting
- Creating runbooks for common scenarios
We'll cover a phased implementation approach later in this article.
Implementation Approach
For organizations adopting Traefik, we recommend a phased approach that minimizes risk while building team expertise.
Phase 1: Start Simple (1-2 weeks)
Begin in a development or staging environment with a single non-critical service:
- Deploy Traefik using the official Helm chart or Docker Compose
- Configure basic routing for one service using ingress or Docker labels
- Explore the dashboard to understand current state
- Test automatic discovery by deploying and scaling services
- Document learnings and initial configuration patterns
At this stage, don't worry about advanced features. Just establish that basic routing works reliably.[^19]
Example deployment for Phase 1 using Docker Compose:
```yaml
version: '3.8'
services:
  traefik:
    image: traefik:v3.0
    command:
      - "--api.dashboard=true"
      - "--api.insecure=true"   # development only: serves the dashboard on :8080
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
    ports:
      - "80:80"
      - "443:443"
      - "8080:8080"   # Dashboard
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    networks:
      - traefik-public

  whoami:
    image: traefik/whoami
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.whoami.rule=Host(`whoami.localhost`)"
      - "traefik.http.routers.whoami.entrypoints=web"
    networks:
      - traefik-public

networks:
  traefik-public:
```
Phase 2: Expand Capabilities (2-4 weeks)
Once basic routing works reliably, add middleware and advanced features:[^20]
- Add authentication (basic auth or forward auth) to protect admin interfaces
- Implement rate limiting for public APIs
- Set up TLS with Let's Encrypt (automatic HTTPS)
- Configure health checks for production readiness
- Integrate monitoring (Prometheus metrics, dashboard alerts)
- Implement canary deployments using weighted routing
- Refine configuration patterns based on team feedback
During this phase, you'll learn which middleware you actually use and which you don't. Stick to essential features initially.[^21]
Phase 3: Production Deployment (2-3 weeks)
Deploy to production with production-like traffic patterns:
- Deploy to staging environment with production configuration
- Validate high availability with multiple Traefik instances
- Test failover by killing Traefik pods/containers
- Establish runbooks for common operations (debugging, updating, rollbacks)
- Train operations team on troubleshooting and maintenance
- Set up alerts based on Traefik metrics
- Load test to verify performance under expected traffic
Don't skip staging — production deployment of infrastructure components should always be preceded by realistic testing.[^22]
Phase 4: Scale and Optimize (ongoing)
After stable production deployment:[^23]
- Migrate remaining services gradually, not all at once
- Optimize performance based on metrics: tune connection pools, adjust timeouts
- Implement advanced features as needed: service mesh, custom plugins
- Continuous improvement: regular review of configurations, update Traefik versions
- Share knowledge: document patterns, conduct training for new team members
Practical Configuration Examples
Let's provide some complete, working configurations you can adapt.
Basic HTTPS with Let's Encrypt
```yaml
# traefik.yml - static configuration
entryPoints:
  web:
    address: ":80"
    # Redirect all HTTP to HTTPS
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
          permanent: true
  websecure:
    address: ":443"
    http:
      tls:
        certResolver: letsencrypt

providers:
  file:
    filename: /etc/traefik/dynamic.yml
    watch: true
  kubernetesIngress:
    ingressClass: traefik-internal

certificatesResolvers:
  letsencrypt:
    acme:
      email: [email protected]
      storage: /etc/traefik/acme.json
      httpChallenge:
        entryPoint: web
```
```yaml
# dynamic.yml - dynamic configuration
http:
  routers:
    myapp:
      rule: "Host(`app.example.com`)"
      service: myapp-service
      tls:
        certResolver: letsencrypt
        domains:
          - main: "app.example.com"
            sans:
              - "www.app.example.com"
  services:
    myapp-service:
      loadBalancer:
        servers:
          - url: "http://myapp:8080"
        healthCheck:
          path: /health
          interval: 30s
```
This configuration:[^24]

- Listens on ports 80 and 443
- Redirects HTTP to HTTPS permanently
- Obtains and renews certificates automatically via Let's Encrypt
- Routes app.example.com to your service
- Performs health checks to remove unhealthy backends
Kubernetes with IngressClass
```yaml
# traefik-deploy.yaml - Kubernetes deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: traefik
  namespace: traefik-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: traefik
  template:
    metadata:
      labels:
        app: traefik
    spec:
      serviceAccountName: traefik
      containers:
        - name: traefik
          image: traefik:v3.0
          args:
            - --api.dashboard=true
            - --api.insecure=false
            - --providers.kubernetescrd
            - --providers.kubernetesingress
            - --providers.kubernetesingress.ingressclass=traefik-internal
            - --entrypoints.web.address=:80
            - --entrypoints.websecure.address=:443
            - --certificatesresolvers.letsencrypt.acme.tlschallenge=true
            - [email protected]
            - --certificatesresolvers.letsencrypt.acme.storage=/data/acme.json
          ports:
            - name: web
              containerPort: 80
            - name: websecure
              containerPort: 443
            - name: admin
              containerPort: 8080
          volumeMounts:
            - name: data
              mountPath: /data
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: traefik-data
---
apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: traefik-system
spec:
  selector:
    app: traefik
  ports:
    - name: http
      port: 80
      targetPort: web
    - name: https
      port: 443
      targetPort: websecure
    - name: admin
      port: 8080
      targetPort: admin
  type: LoadBalancer
---
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: traefik-internal
spec:
  controller: traefik.io/ingress-controller
```
This sets up Traefik as a highly available ingress controller in Kubernetes with automatic HTTPS.[^25] One caveat: the open-source edition doesn't share ACME certificate state between replicas, so multi-replica Let's Encrypt setups typically delegate certificates to cert-manager or use Traefik Enterprise.
Rate Limiting Configuration
For public APIs, rate limiting is essential:
```yaml
http:
  middlewares:
    api-rate-limit:
      rateLimit:
        average: 100
        burst: 200
        period: 1m
        sourceCriterion:
          requestHost: true   # bucket requests per host header
  routers:
    api:
      rule: "Host(`api.example.com`) && PathPrefix(`/v1`)"
      middlewares:
        - api-rate-limit
      service: api-service
```
The sourceCriterion with requestHost: true applies rate limiting per domain, which is useful when multiple domains point to the same API.
Circuit Breaker
For resilience, implement circuit breakers:
In Traefik v2 and v3, the circuit breaker is a middleware attached to a router:

```yaml
http:
  middlewares:
    backend-breaker:
      circuitBreaker:
        expression: "NetworkErrorRatio() > 0.50"
        checkPeriod: 10s       # how often the expression is evaluated
        fallbackDuration: 10s  # how long the circuit stays open
        recoveryDuration: 10s  # gradual return to the closed state
  routers:
    backend:
      rule: "Host(`backend.example.com`)"
      middlewares:
        - backend-breaker
      service: backend
  services:
    backend:
      loadBalancer:
        healthCheck:
          path: /health
          interval: 30s
        passHostHeader: true
        responseForwarding:
          flushInterval: 100ms
        servers:
          - url: "http://backend-1:8080"
          - url: "http://backend-2:8080"
```
If the network error ratio exceeds 50%, the circuit opens: Traefik stops forwarding requests for the fallback duration, then gradually probes the service during the recovery period to check whether it has recovered.
Common Pitfalls and Troubleshooting
Even with careful planning, issues arise. Let's examine common problems and their solutions.
Configuration Not Applying
Symptom: You've updated Traefik configuration but nothing changes.
Causes:

- Dynamic configuration not reloading:
  - Check that file watching is enabled (`--providers.file.watch=true`)
  - Verify the file path is correct
  - Check Traefik logs for parsing errors
- Router or service not found:
  - Verify that referenced middlewares exist
  - Check for typos in the router's `service` field
  - Ensure the `providers` section includes the relevant provider

Solution:

```bash
# Check Traefik logs for errors
docker logs traefik --tail 100
kubectl logs -n traefik-system deployment/traefik

# Verify configuration is loaded
curl http://localhost:8080/api/http/routers | jq .
```
Health Checks Failing
Symptom: Backends marked unhealthy, no traffic flowing.
Causes:
- Health check path returns non-2xx status
- Health check interval too short for app startup time
- Network/firewall blocking health check requests
Solution:

- Verify the health endpoint works independently: `curl http://backend:port/health`
- Adjust the health check configuration: increase `interval` or `timeout`
- Check network connectivity between Traefik and backends
TLS Certificate Issues
Symptom: Browser shows certificate warnings, or Let's Encrypt challenges failing.
Causes:
- Port 80/443 not exposed (required for HTTP-01 challenge)
- DNS not pointing to Traefik correctly
- Rate limits from Let's Encrypt exceeded
Solution:

- Ensure ports 80 and 443 are publicly accessible
- Verify the DNS A record resolves to Traefik's IP
- Check Traefik logs for ACME-specific errors
- Point `--certificatesresolvers.letsencrypt.acme.caserver` at https://acme-staging-v2.api.letsencrypt.org/directory for testing
High Memory Usage
Symptom: Traefik pod/container using excessive memory.
Causes:
- Too many concurrent connections without limits
- Large number of routers/services (thousands)
- Debug logging enabled in production
- Dashboard enabled without access restrictions
Solution:

- Set appropriate resource limits in Kubernetes
- Reduce the log level: `--log.level=INFO` (not DEBUG)
- Disable the dashboard in production, or protect it
- Consolidate middleware where possible
CORS Errors
Symptom: Browser console shows CORS policy blocking requests.
Solution: Add CORS middleware:
```yaml
http:
  middlewares:
    cors:
      headers:
        accessControlAllowOriginList:
          - "https://app.example.com"
        accessControlAllowMethods:
          - "GET"
          - "POST"
          - "PUT"
          - "DELETE"
          - "OPTIONS"
        accessControlAllowHeaders:
          - "Authorization"
          - "Content-Type"
        accessControlMaxAge: 100
        addVaryHeader: true
```
Reference it from routers as cors@file in their middlewares list.[^26]
Metrics Not Appearing in Prometheus
Symptom: Prometheus targets show as down or metrics missing.
Causes:
- Metrics endpoint not exposed
- Prometheus scrape configuration wrong port/path
- Network policies blocking Prometheus access
Solution:
```bash
# Enable metrics in Traefik (static configuration flags)
--metrics.prometheus=true
--metrics.prometheus.entryPoint=traefik
```

```yaml
# Create a ServiceMonitor (if using the Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: traefik-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: traefik
  endpoints:
    - port: admin      # must match the port name on the Traefik Service
      interval: 30s
      path: /metrics
```
Comparing Traefik with Alternatives
We're examining Traefik in a series that includes Caddy. Let's clarify how Traefik compares, and when you might choose differently.
Traefik vs Caddy
This comparison will be covered in-depth in Parts 3 and 4, but let's establish high-level distinctions:
| Aspect | Traefik | Caddy |
|---|---|---|
| Primary Use Case | Kubernetes/container orchestration | Simplicity and automatic HTTPS |
| Configuration | Declarative (YAML, CRDs) | Caddyfile (simple) or JSON |
| Learning Curve | Moderate to steep | Gentle |
| Feature Depth | Very extensive middleware ecosystem | Comprehensive but more opinionated |
| Auto-HTTPS | Yes, via Let's Encrypt | Yes, automatic and default |
| Kubernetes | Native ingress controller | Supported, though the Caddyfile remains the first-class interface |
| Community | Large, open source with commercial options | Strong open source community |
Rough rule of thumb:
- Choose Traefik if you're heavily invested in Kubernetes and need advanced routing capabilities
- Choose Caddy if you prioritize simplicity, excellent documentation, and straightforward HTTP/TLS needs
Traefik vs Traditional Proxies
Comparing to Nginx or HAProxy:
| Aspect | Traefik | Nginx/HAProxy |
|---|---|---|
| Configuration | Dynamic, automatic discovery | Static configuration files |
| Reloads | Continuous, no reload step | Graceful reload, but must be triggered explicitly |
| Service Discovery | Built-in for Docker, Kubernetes, Consul | Requires external tools/scripts |
| Configuration Management | Declarative, version-controlled | Manual updates or configuration management |
| Initial Complexity | Higher learning curve | Lower for simple setups |
| Operational Overhead | Low (automatic) | High (manual updates) |
| Performance | Very good | Excellent (mature, optimized) |
The trade-off is clear: Traefik trades some simplicity and raw performance for dramatically reduced operational overhead. If your infrastructure changes frequently, Traefik's automation usually outweighs its complexity cost.
When NOT to Use Traefik
Be honest about limitations — Traefik isn't right for every scenario:
- Simple static sites: Caddy or even Nginx is simpler
- Non-containerized legacy applications: Traefik's value proposition is weakest here
- Extreme performance requirements: Nginx or HAProxy may have a slight edge (though Traefik is fast)
- Very constrained resources (ultra-low memory): Minimal proxies exist
- Teams without container orchestration experience: Consider the learning curve
If you fall into these categories, that's fine — use the right tool for your needs. We'll cover Caddy in Part 3 as an alternative for simpler deployments.
Conclusion
Traefik represents a powerful option for organizations with cloud-native architectures and sophisticated DevOps practices. Its deep Kubernetes integration, extensive middleware ecosystem, and automatic service discovery make it particularly well-suited for container-heavy environments.
The key insight from examining Traefik is this: its value goes beyond technical — it's operational. By automatically discovering services and configuring routing, Traefik reduces toil, prevents configuration drift, and enables GitOps workflows. These benefits compound over time, especially in dynamic environments where services scale, change, and evolve frequently.
Over the coming articles in this series, we'll examine Caddy's simplicity-first approach, then provide a direct comparison to help you choose the right tool for your specific needs. Finally, we'll cover implementation best practices that apply regardless of which tool you select.
Optimizing your reverse proxy setup? Learn how our Infrastructure Consulting can streamline your operations.
Footnotes

[^1]: OpenTelemetry. "OpenTelemetry Specifications." OpenTelemetry Documentation, 2026. https://opentelemetry.io/docs/specs/
[^2]: Kubernetes. "Ingress Resources." Kubernetes Documentation, 2026. https://kubernetes.io/docs/concepts/services-networking/ingress/
[^3]: Google. "Service Level Objectives." Google SRE Workbook. https://sre.google/resources/practices-and-policies/service-level-objectives/
[^4]: OpenTelemetry. "OpenTelemetry Specifications." OpenTelemetry Documentation, 2026. https://opentelemetry.io/docs/specs/
[^5]: Netflix. "Chaos Engineering." Netflix Tech Blog. https://netflixtechblog.com/
[^6]: OneUptime. "How to Build API SLO Dashboards (Availability, Latency P99, Error Budget) from OpenTelemetry Metrics." OneUptime Blog, February 6, 2026. https://oneuptime.com/blog/post/2026-02-06-api-slo-dashboards-opentelemetry-metrics/view
[^7]: OneUptime. "How to Define and Enforce Performance Budgets Using OpenTelemetry P50/P95/P99 Latency Histograms." OneUptime Blog, February 6, 2026. https://oneuptime.com/blog/post/2026-02-06-otel-performance-budgets-latency-histograms/view
[^8]: OptyxStack. "Latency Distributions in Practice: Reading P50/P95/P99 Without Fooling Yourself." OptyxStack, February 2, 2026. https://optyxstack.com/performance/latency-distributions-in-practice-reading-p50-p95-p99-without-fooling-yourself
[^9]: W3C. "Trace Context Specification." W3C Trace Context, 2026. https://www.w3.org/TR/trace-context/
[^10]: Prometheus. "Prometheus Documentation." Prometheus.io, 2026. https://prometheus.io/docs/
[^11]: Grafana Labs. "Grafana Documentation." Grafana, 2026. https://grafana.com/docs/
[^12]: Google. "Service Level Objectives." Google SRE Workbook. https://sre.google/resources/practices-and-policies/service-level-objectives/
[^13]: Traefik Labs. "Traefik Mesh Documentation." Traefik, 2026. https://docs.traefik.io/traefik-mesh/
[^14]: Cloud Native Computing Foundation. "CNCF Landscape." CNCF, 2026. https://landscape.cncf.io/
[^15]: OneUptime. "How to Build API SLO Dashboards (Availability, Latency P99, Error Budget) from OpenTelemetry Metrics." OneUptime Blog, February 6, 2026. https://oneuptime.com/blog/post/2026-02-06-api-slo-dashboards-opentelemetry-metrics/view
[^16]: OneUptime. "How to Define and Enforce Performance Budgets Using OpenTelemetry P50/P95/P99 Latency Histograms." OneUptime Blog, February 6, 2026. https://oneuptime.com/blog/post/2026-02-06-otel-performance-budgets-latency-histograms/view
[^17]: OptyxStack. "Latency Distributions in Practice: Reading P50/P95/P99 Without Fooling Yourself." OptyxStack, February 2, 2026. https://optyxstack.com/performance/latency-distributions-in-practice-reading-p50-p95-p99-without-fooling-yourself
[^18]: Google. "Site Reliability Engineering: How Google Runs Production Systems." O'Reilly Media, 2016. https://sre.google/books/
[^19]: OpenTelemetry. "OpenTelemetry Specifications." OpenTelemetry Documentation, 2026. https://opentelemetry.io/docs/specs/
[^20]: Prometheus. "Prometheus Documentation." Prometheus.io, 2026. https://prometheus.io/docs/
[^21]: OneUptime. "How to Define and Enforce Performance Budgets Using OpenTelemetry P50/P95/P99 Latency Histograms." OneUptime Blog, February 6, 2026. https://oneuptime.com/blog/post/2026-02-06-otel-performance-budgets-latency-histograms/view
[^22]: Google. "Site Reliability Engineering: How Google Runs Production Systems." O'Reilly Media, 2016. https://sre.google/books/
[^23]: OpenTelemetry. "OpenTelemetry Specifications." OpenTelemetry Documentation, 2026. https://opentelemetry.io/docs/specs/
[^24]: Prometheus. "Prometheus Documentation." Prometheus.io, 2026. https://prometheus.io/docs/
[^25]: Grafana Labs. "Grafana Documentation." Grafana, 2026. https://grafana.com/docs/
[^26]: OneUptime. "How to Build API SLO Dashboards (Availability, Latency P99, Error Budget) from OpenTelemetry Metrics." OneUptime Blog, February 6, 2026. https://oneuptime.com/blog/post/2026-02-06-api-slo-dashboards-opentelemetry-metrics/view