Last updated: March 22, 2026
Circuit breakers prevent cascading failures in distributed systems. Configuring them correctly is harder than it looks: wrong failure thresholds cause nuisance trips, wrong wait durations cause extended outages, and missing fallbacks turn circuit trips into user-visible errors. AI tools help by recommending thresholds based on your SLOs and generating the configuration code.
## Circuit Breaker Configuration Parameters
Before generating any config, you need to understand the parameters:
| Parameter | What It Controls | Common Mistake |
|---|---|---|
| `slidingWindowSize` | Number of calls used to evaluate the failure rate | Too small = hypersensitive |
| `failureRateThreshold` | % failures before opening | Too low = nuisance trips |
| `waitDurationInOpenState` | How long to stay open before retrying | Too short = thundering herd |
| `permittedNumberOfCallsInHalfOpenState` | Calls used to test in half-open | Too many = slow recovery |
| `slowCallDurationThreshold` | What counts as “slow” | Must be set per-service SLO |
| `slowCallRateThreshold` | % slow calls before opening | Often forgotten |
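To see how these parameters interact, here is a minimal, stdlib-only sketch of the underlying state machine. This is a simplified illustration with hypothetical names, not the Resilience4j implementation: it models a count-based window, a failure-rate trip, and a single half-open probe.

```java
import java.util.ArrayDeque;

/** Minimal count-based circuit breaker sketch (illustration, not the Resilience4j API). */
class MiniCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int slidingWindowSize;
    private final double failureRateThreshold; // percent
    private final long waitDurationMillis;
    private final ArrayDeque<Boolean> window = new ArrayDeque<>(); // true = failure
    private State state = State.CLOSED;
    private long openedAt;

    MiniCircuitBreaker(int windowSize, double failureRateThreshold, long waitMillis) {
        this.slidingWindowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitDurationMillis = waitMillis;
    }

    /** Gate every outbound call through this check. */
    boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= waitDurationMillis) {
            state = State.HALF_OPEN; // wait elapsed: let a probe through
        }
        return state != State.OPEN;
    }

    /** Record the outcome of a permitted call. */
    void record(boolean failure, long nowMillis) {
        if (state == State.HALF_OPEN) {
            // Single-probe simplification: success closes, failure re-opens
            if (failure) { state = State.OPEN; openedAt = nowMillis; }
            else { state = State.CLOSED; window.clear(); }
            return;
        }
        window.addLast(failure);
        if (window.size() > slidingWindowSize) window.removeFirst();
        long failures = window.stream().filter(f -> f).count();
        double rate = 100.0 * failures / window.size();
        if (window.size() >= slidingWindowSize && rate >= failureRateThreshold) {
            state = State.OPEN;
            openedAt = nowMillis;
        }
    }

    State state() { return state; }
}
```

Real implementations add time-based windows, slow-call tracking, and multiple half-open probes, but the trip/wait/probe cycle above is the core of every parameter in the table.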
## Using AI to Recommend Thresholds

Prompt:

```
I'm configuring a circuit breaker for calls from my OrderService to a PaymentService.

Context:
- PaymentService p99 latency: 800ms
- PaymentService p50 latency: 120ms
- Acceptable failure rate: 5% (per SLO)
- During a payment outage, I want to stop retrying after 20% failure rate
- Retry attempts should resume after 30 seconds
- Traffic: ~200 payment requests/minute

Recommend Resilience4j circuit breaker configuration with explanations.
```
Claude’s configuration with reasoning:

```yaml
# application.yml — Resilience4j configuration
resilience4j:
  circuitbreaker:
    instances:
      payment-service:
        # Use count-based window of 20 calls (covers ~6 seconds at 200 req/min)
        # Large enough to be statistically meaningful, small enough to react quickly
        sliding-window-type: COUNT_BASED
        sliding-window-size: 20

        # Open circuit when 20%+ of calls fail (4/20 calls)
        # Your SLO allows 5%, so 20% means something is definitely wrong
        failure-rate-threshold: 20

        # Open circuit when 40%+ of calls take >1.5s (~2x p99 latency)
        # This catches partial degradation before full outages
        slow-call-duration-threshold: 1500ms
        slow-call-rate-threshold: 40

        # Stay open for 30 seconds — let PaymentService recover
        # Shorter risks thundering herd on recovery
        wait-duration-in-open-state: 30s

        # Test with 3 calls in half-open (conservative — don't hammer recovering service)
        # Circuit closes when the half-open failure rate stays below the threshold
        permitted-number-of-calls-in-half-open-state: 3

        # Minimum calls before evaluating (avoids tripping on cold start)
        minimum-number-of-calls: 10

        # Health indicator for monitoring
        register-health-indicator: true

        # Automatic transition from open to half-open after the wait duration,
        # without waiting for a call to trigger it
        automatic-transition-from-open-to-half-open-enabled: true
```
Claude explains each setting and its rationale. GPT-4 generates similar config but often omits slowCallDurationThreshold and slowCallRateThreshold — missing the slow-call circuit breaker is a common misconfiguration.
## Resilience4j Java Implementation

```java
// PaymentServiceClient.java — full circuit breaker + retry + fallback
@Service
@Slf4j
@RequiredArgsConstructor
public class PaymentServiceClient {

    private final WebClient webClient;
    private final CircuitBreakerRegistry circuitBreakerRegistry;
    private final RetryRegistry retryRegistry;
    private final PaymentQueueService paymentQueueService;
    private final AlertingService alertingService;

    /**
     * Process payment with circuit breaker, retry, and fallback.
     * The circuit breaker wraps the call and retry wraps the circuit breaker,
     * so every retry attempt is recorded by the breaker.
     */
    public Mono<PaymentResult> processPayment(PaymentRequest request) {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("payment-service");
        Retry retry = retryRegistry.retry("payment-service");

        return Mono.fromCallable(() -> request)
            .flatMap(this::callPaymentApi)
            .transformDeferred(CircuitBreakerOperator.of(cb))
            .transformDeferred(RetryOperator.of(retry))
            .onErrorResume(CallNotPermittedException.class, ex -> {
                log.warn("Payment circuit breaker OPEN — using fallback for request {}",
                    request.getOrderId());
                return processFallbackPayment(request);
            })
            .onErrorResume(ex -> {
                log.error("Payment service failed after retries for order {}: {}",
                    request.getOrderId(), ex.getMessage());
                return Mono.error(new PaymentServiceUnavailableException(
                    "Payment service temporarily unavailable", ex));
            });
    }

    private Mono<PaymentResult> callPaymentApi(PaymentRequest request) {
        return webClient.post()
            .uri("/payments")
            .bodyValue(request)
            .retrieve()
            .onStatus(status -> status.is5xxServerError(),
                resp -> resp.bodyToMono(String.class)
                    .flatMap(body -> Mono.error(
                        new PaymentServerException("Payment service error: " + body))))
            .bodyToMono(PaymentResult.class)
            .timeout(Duration.ofMillis(2000)); // Hard timeout
    }

    /**
     * Fallback: queue the payment for async processing when circuit is open.
     */
    private Mono<PaymentResult> processFallbackPayment(PaymentRequest request) {
        return paymentQueueService.enqueue(request)
            .map(queueId -> PaymentResult.pending(request.getOrderId(), queueId));
    }

    @EventListener
    public void onCircuitBreakerStateChange(CircuitBreakerOnStateTransitionEvent event) {
        log.warn("Payment circuit breaker state: {} → {}",
            event.getStateTransition().getFromState(),
            event.getStateTransition().getToState());

        // Alert on-call if circuit opens
        if (event.getStateTransition().getToState() == CircuitBreaker.State.OPEN) {
            alertingService.sendAlert(Alert.builder()
                .severity(Severity.HIGH)
                .message("Payment service circuit breaker opened")
                .build());
        }
    }
}
```
## .NET Polly Configuration
Claude handles multi-language circuit breaker configuration:
Prompt: “Configure a Polly v8 circuit breaker for a .NET 8 service calling an external inventory API.”
```csharp
// InventoryServiceExtensions.cs — generated by Claude (logger wiring made explicit)
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;
using Polly.Timeout;

public static class InventoryServiceExtensions
{
    public static IServiceCollection AddInventoryService(
        this IServiceCollection services,
        IConfiguration configuration)
    {
        services.AddHttpClient<IInventoryService, InventoryService>(client =>
        {
            client.BaseAddress = new Uri(configuration["Services:Inventory:BaseUrl"]!);
            client.Timeout = TimeSpan.FromSeconds(5);
        })
        .AddResilienceHandler("inventory-pipeline", (builder, context) =>
        {
            // Resolve a logger from DI for the strategy callbacks
            var logger = context.ServiceProvider
                .GetRequiredService<ILogger<InventoryService>>();

            // Retry — added first, so it is the outermost strategy:
            // each retry attempt passes through the circuit breaker below
            builder.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
            {
                MaxRetryAttempts = 2,
                Delay = TimeSpan.FromMilliseconds(200),
                BackoffType = DelayBackoffType.Exponential,
                ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                    .Handle<HttpRequestException>()
                    .HandleResult(r => r.StatusCode >= HttpStatusCode.InternalServerError),
            });

            // Circuit breaker
            builder.AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
            {
                // Open after 30% failure rate over a 30-second sampling window
                FailureRatio = 0.3,
                SamplingDuration = TimeSpan.FromSeconds(30),
                MinimumThroughput = 20,
                // Stay open for 45 seconds
                BreakDuration = TimeSpan.FromSeconds(45),
                // Count server errors and timeouts as failures
                ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                    .Handle<HttpRequestException>()
                    .Handle<TimeoutRejectedException>()
                    .HandleResult(r => r.StatusCode >= HttpStatusCode.InternalServerError),
                OnOpened = args =>
                {
                    logger.LogWarning(
                        "Inventory circuit breaker opened. Duration: {Duration}",
                        args.BreakDuration);
                    return ValueTask.CompletedTask;
                },
                OnClosed = args =>
                {
                    logger.LogInformation("Inventory circuit breaker closed.");
                    return ValueTask.CompletedTask;
                },
            });

            // Per-attempt timeout (innermost)
            builder.AddTimeout(TimeSpan.FromSeconds(3));
        });

        return services;
    }
}
```
## Threshold Tuning Guide (AI-Generated)

Use this prompt to get tuning advice for existing deployments:

```
My circuit breaker trips 3-4 times per day even when PaymentService is healthy.
Current config: failureRateThreshold=10, slidingWindowSize=10, waitDuration=15s

Datadog metrics show:
- PaymentService p99 latency: 650ms
- Payment error rate: 1.2% (within SLO)
- Circuit opens when a single slow burst occurs (e.g., DB query at PaymentService)

How should I tune the circuit breaker to reduce nuisance trips while still
protecting against real outages?
```
Claude’s tuning recommendations:

- Increase `slidingWindowSize` from 10 to 30: 10 calls is too small, since a single failure out of 10 is an instant 10%
- Increase `failureRateThreshold` from 10% to 25%: your SLO allows 5% errors, so 25% is a clear signal something is wrong
- Set `minimumNumberOfCalls` explicitly (e.g., 15) to avoid trips on startup or during low-traffic periods
- Add `slowCallDurationThreshold=1500ms` + `slowCallRateThreshold=50` to catch degradation separately from errors
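Applied to the original settings, the tuned configuration would look like this (values taken from the recommendations above; property names from Resilience4j):

```yaml
resilience4j:
  circuitbreaker:
    instances:
      payment-service:
        sliding-window-size: 30              # was 10
        failure-rate-threshold: 25           # was 10
        minimum-number-of-calls: 15          # now explicit
        wait-duration-in-open-state: 15s     # unchanged
        slow-call-duration-threshold: 1500ms # new: catch latency degradation
        slow-call-rate-threshold: 50
```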
## Adding Hystrix Legacy Config (Migration Path)
Many older Spring Boot services still use Netflix Hystrix, which is no longer actively maintained. AI tools help you both configure existing Hystrix and plan the migration to Resilience4j.
Prompt: “Generate Hystrix configuration for my UserService → AuthService call. AuthService p99 is 400ms. Then show me the equivalent Resilience4j config for when we migrate.”
Claude’s Hystrix config (for legacy services):

```yaml
# application.yml — Hystrix (legacy)
hystrix:
  command:
    auth-service:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 800   # 2x p99
      circuitBreaker:
        requestVolumeThreshold: 20       # Minimum calls to evaluate
        sleepWindowInMilliseconds: 30000 # Stay open 30s
        errorThresholdPercentage: 25     # Open at 25% errors
      metrics:
        rollingStats:
          timeInMilliseconds: 10000      # Evaluate over 10s window
          numBuckets: 10
```
Claude’s equivalent Resilience4j migration target:

```yaml
# application.yml — Resilience4j equivalent
resilience4j:
  circuitbreaker:
    instances:
      auth-service:
        sliding-window-type: TIME_BASED
        sliding-window-size: 10 # 10-second window (matches Hystrix rollingStats)
        failure-rate-threshold: 25
        minimum-number-of-calls: 20
        wait-duration-in-open-state: 30s
        slow-call-duration-threshold: 800ms
        slow-call-rate-threshold: 50
        automatic-transition-from-open-to-half-open-enabled: true
```
This side-by-side mapping is something Claude generates reliably. GPT-4 can do it too, but sometimes translates rollingStats.timeInMilliseconds incorrectly as a count-based window rather than a time-based one.
## Using AI for Fallback Strategy Design
The fallback is as important as the circuit breaker itself. A circuit breaker without a graceful fallback just converts remote failures into local exceptions. Claude is particularly good at generating fallback implementations when you describe your business constraints.
Prompt:

```
My SearchService calls an external recommendations API. When the circuit is open,
I need a fallback. Constraints:
- We have a local Redis cache of last-known recommendations (TTL 1 hour)
- If cache is empty, return top-10 static product IDs from a config list
- Log which fallback was used for observability

Generate the fallback implementation in Java with Resilience4j.
```
Claude’s output:

```java
@Service
@Slf4j
@RequiredArgsConstructor
public class RecommendationService {

    private final RecommendationsApiClient apiClient;
    private final RedisTemplate<String, List<String>> redisTemplate;
    private final RecommendationProperties properties;
    private final MeterRegistry meterRegistry;

    @CircuitBreaker(name = "recommendations-api", fallbackMethod = "getRecommendationsFallback")
    public List<String> getRecommendations(String userId) {
        return apiClient.fetchRecommendations(userId);
    }

    private List<String> getRecommendationsFallback(String userId, Throwable ex) {
        // Try Redis cache first
        String cacheKey = "recs:" + userId;
        List<String> cached = redisTemplate.opsForValue().get(cacheKey);
        if (cached != null && !cached.isEmpty()) {
            log.info("recommendations fallback: using Redis cache for user={}", userId);
            meterRegistry.counter("recommendations.fallback", "source", "cache").increment();
            return cached;
        }

        // Fall back to static defaults
        log.warn("recommendations fallback: Redis miss, using static defaults for user={}", userId);
        meterRegistry.counter("recommendations.fallback", "source", "static").increment();
        return properties.getDefaultProductIds();
    }
}
```
This pattern — tiered fallback with observability counters — is exactly what production systems need, and Claude generates it from a plain-English description.
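The tiered decision itself is easy to unit-test in isolation, separately from Redis and the circuit breaker. A stdlib-only sketch, with hypothetical names and the cache and defaults replaced by plain collections:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;

/** Tiered fallback sketch: primary call, then last-known cache, then static defaults. */
class TieredRecommendations {
    static List<String> get(String userId,
                            Callable<List<String>> primary,
                            Map<String, List<String>> cache,
                            List<String> defaults) {
        try {
            return primary.call();                 // normal path
        } catch (Exception ex) {
            List<String> cached = cache.get("recs:" + userId);
            if (cached != null && !cached.isEmpty()) {
                return cached;                     // tier 1: last-known cache
            }
            return defaults;                       // tier 2: static config list
        }
    }
}
```

Extracting the tiers into a pure function like this lets you assert the cache-hit, cache-miss, and happy paths without standing up Redis or tripping a real breaker.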
## Observability: Exposing Circuit Breaker Metrics
Circuit breakers are only useful if you can monitor their state. Claude generates Prometheus/Grafana integration automatically:
```yaml
# application.yml — expose Resilience4j metrics via Actuator
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus,circuitbreakers
  health:
    circuitbreakers:
      enabled: true
  metrics:
    tags:
      application: ${spring.application.name}

resilience4j:
  circuitbreaker:
    instances:
      payment-service:
        register-health-indicator: true
        event-consumer-buffer-size: 10
```
Ask Claude for a Grafana dashboard query and it generates PromQL like:

```promql
# Current state — the state gauge reports 1 for the active state, 0 otherwise
resilience4j_circuitbreaker_state{application="order-service", name="payment-service", state="open"}

# Failure rate over time
resilience4j_circuitbreaker_failure_rate{name="payment-service"}

# Call rate through the circuit breaker (the calls timer exports a _seconds_count series)
rate(resilience4j_circuitbreaker_calls_seconds_count{name="payment-service"}[5m])
```
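Dashboards show you the state after the fact; an alerting rule pages someone while it matters. A sketch of a matching Prometheus rule, assuming the metric and label names above (adjust `name`, severity labels, and the `for` window to your setup):

```yaml
groups:
  - name: circuit-breakers
    rules:
      - alert: PaymentCircuitBreakerOpen
        # Fires only if the breaker stays open for a full minute,
        # so brief trips that self-heal in half-open do not page anyone
        expr: resilience4j_circuitbreaker_state{name="payment-service", state="open"} == 1
        for: 1m
        labels:
          severity: high
        annotations:
          summary: "Payment circuit breaker has been open for over a minute"
```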
## Related Reading
- Claude vs Copilot for Java Spring Boot Microservices
- AI Tools for Automated Rate Limiter Configuration
- AI CI/CD Pipeline Optimization
Built by theluckystrike — More at zovo.one