Converter: add optional retry policy for PayloadCodec calls #2373

Open
whitemanthedj wants to merge 1 commit into temporalio:main from whitemanthedj:codec-retry

@whitemanthedj

Adds NewCodecDataConverterWithOptions and a CodecDataConverterOptions struct exposing an opt-in retry policy applied per individual codec call. Motivated by codecs that call external services (the in-tree RemotePayloadCodec, KMS-backed codecs, schema-registry codecs) where transient failures today abort the whole conversion with no recovery path.

NewCodecDataConverter signature is unchanged; the zero-value options struct preserves current behavior.

Fixes #2370

What was changed

New public surface, all in converter/codec.go:

func NewCodecDataConverterWithOptions(
    parent DataConverter,
    codecs []PayloadCodec,
    options CodecDataConverterOptions,
) DataConverter

type CodecDataConverterOptions struct {
    RetryPolicy *CodecRetryPolicy
    IsRetryable func(error) bool
    Context     context.Context
}

type CodecRetryPolicy struct {
    InitialInterval    time.Duration
    BackoffCoefficient float64
    MaximumInterval    time.Duration
    MaximumAttempts    int
    ExpirationInterval time.Duration
}

Behavior:

  • Default (no RetryPolicy set): each codec is invoked exactly once and the first error aborts the chain. Byte-for-byte identical to current behavior. A TestCodecDataConverter_Retry_NewCodecDataConverterIsZeroOptions test asserts equivalence.
  • With RetryPolicy set: each codec in the chain is wrapped in a retry loop independently. IsRetryable (if non-nil) classifies each error; nil means retry all, matching the convention in internal/common/backoff.Retry and temporal.RetryPolicy (empty NonRetryableErrorTypes retries everything). Context (defaulting to context.Background()) governs cancellation of sleeps between attempts.
  • CodecRetryPolicy zero-value fields fall back to defaults: a 100ms initial interval, a backoff coefficient of 2.0, and a maximum interval of 100 times the initial interval. These mirror the conventions used by internal/common/retry.Default* and are declared as three unexported package constants at the top of converter/codec.go.
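To make the zero-value defaults concrete, here is a minimal sketch of how they could be applied and how the wait between attempts would then grow. The names retryPolicy, withDefaults, nextInterval, and the three constants are hypothetical stand-ins; the PR does not show its actual constant names or helper code, and treating a zero BackoffCoefficient as "unset" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical stand-ins for the three unexported defaults described above.
const (
	defaultInitialInterval    = 100 * time.Millisecond
	defaultBackoffCoefficient = 2.0
	defaultMaxIntervalFactor  = 100
)

// retryPolicy mirrors only the CodecRetryPolicy fields that affect spacing.
type retryPolicy struct {
	InitialInterval    time.Duration
	BackoffCoefficient float64
	MaximumInterval    time.Duration
}

// withDefaults fills zero-value fields (assumed to mean "unset").
func withDefaults(p retryPolicy) retryPolicy {
	if p.InitialInterval == 0 {
		p.InitialInterval = defaultInitialInterval
	}
	if p.BackoffCoefficient == 0 {
		p.BackoffCoefficient = defaultBackoffCoefficient
	}
	if p.MaximumInterval == 0 {
		p.MaximumInterval = defaultMaxIntervalFactor * p.InitialInterval
	}
	return p
}

// nextInterval returns the wait that follows cur, capped at MaximumInterval.
func nextInterval(p retryPolicy, cur time.Duration) time.Duration {
	next := time.Duration(float64(cur) * p.BackoffCoefficient)
	if next > p.MaximumInterval {
		next = p.MaximumInterval
	}
	return next
}

func main() {
	p := withDefaults(retryPolicy{})
	wait := p.InitialInterval
	for attempt := 1; attempt <= 8; attempt++ {
		fmt.Printf("wait after failed attempt %d: %v\n", attempt, wait)
		wait = nextInterval(p, wait)
	}
}
```

With all fields left at zero, the waits in this sketch double from 100ms and stop growing once they reach the 10s cap (100 times the initial interval).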

Retry boundary is per-codec, not per-chain. A chain of [compress, encrypt] only retries the failing codec. Re-running a successful predecessor would invalidate its output for non-idempotent codecs such as authenticated encryption and KMS data-key wrapping.
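The per-codec boundary can be illustrated with a small sketch. It is not the PR's callWithRetry: the codec interface works on raw bytes rather than payloads, the wait is fixed instead of backing off, the chain is applied in slice order, and every name here (codec, retryOpts, encodeChain, countingStub, flakyStub) is a hypothetical stand-in chosen so the example compiles without the SDK.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// codec is a simplified stand-in for PayloadCodec that works on raw bytes.
type codec interface {
	Encode(in []byte) ([]byte, error)
}

// retryOpts is a hypothetical, trimmed-down mirror of the PR's options.
type retryOpts struct {
	MaxAttempts int
	Interval    time.Duration    // fixed wait; backoff growth omitted
	IsRetryable func(error) bool // nil means every error is retried
	Ctx         context.Context  // nil falls back to context.Background()
}

// callWithRetry retries one codec call and returns the last codec error.
func callWithRetry(o retryOpts, call func() ([]byte, error)) ([]byte, error) {
	ctx := o.Ctx
	if ctx == nil {
		ctx = context.Background()
	}
	attempts := o.MaxAttempts
	if attempts < 1 {
		attempts = 1 // always make at least one call
	}
	var lastErr error
	for attempt := 1; attempt <= attempts; attempt++ {
		out, err := call()
		if err == nil {
			return out, nil
		}
		lastErr = err
		if o.IsRetryable != nil && !o.IsRetryable(err) {
			break // classified as permanent
		}
		if attempt == attempts {
			break // exhausted; skip the final sleep
		}
		select {
		case <-time.After(o.Interval):
		case <-ctx.Done():
			return nil, lastErr // surface the codec error, not ctx.Err()
		}
	}
	return nil, lastErr
}

// encodeChain runs each codec once per success; only a failing codec repeats.
func encodeChain(o retryOpts, codecs []codec, in []byte) ([]byte, error) {
	cur := in
	for _, c := range codecs {
		out, err := callWithRetry(o, func() ([]byte, error) { return c.Encode(cur) })
		if err != nil {
			return nil, err
		}
		cur = out
	}
	return cur, nil
}

// countingStub always succeeds and records how often it was called.
type countingStub struct{ calls int }

func (c *countingStub) Encode(in []byte) ([]byte, error) {
	c.calls++
	return append([]byte("z:"), in...), nil
}

// flakyStub fails the first `failures` calls, then succeeds.
type flakyStub struct{ failures, calls int }

func (f *flakyStub) Encode(in []byte) ([]byte, error) {
	f.calls++
	if f.calls <= f.failures {
		return nil, errors.New("kms 503")
	}
	return append([]byte("e:"), in...), nil
}

func main() {
	first, second := &countingStub{}, &flakyStub{failures: 2}
	o := retryOpts{MaxAttempts: 3, Interval: time.Millisecond}
	out, err := encodeChain(o, []codec{first, second}, []byte("hi"))
	fmt.Println(string(out), err, first.calls, second.calls)
}
```

In this run the first stub is called once and the flaky stub three times, which is the property the per-codec boundary is meant to guarantee: the predecessor's output is reused rather than regenerated.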

Files modified:

  • converter/codec.go (+186, -6): new exported types and constructor, three unexported default constants, private callWithRetry helper (no internal/ import), options field on CodecDataConverter, updated WithSerializationContext to propagate options through derived converters. Existing NewCodecDataConverter doc comment updated to mention the new sibling.
  • converter/codec_test.go (+334): flakyCodec, countingCodec, retryTestSerializationContext, and fastRetry helpers, plus ten new retry tests.

Zero changes outside converter/. No new module dependencies (go.mod and go.sum are untouched; verified by go mod tidy producing no diff). No changes under internal/. No changes to the PayloadCodec interface. No changes to NewPayloadCodecHTTPHandler.

Out of scope, deliberately:

  • Retry behavior on PayloadConverter and FailureConverter. Those have programmer-error failure modes rather than transient I/O.
  • Adding context.Context to the PayloadCodec interface itself. The options-struct Context only governs sleeps between retries.
  • Retry inside NewPayloadCodecHTTPHandler. Server-side retry is the operator's responsibility.
  • Promotion of internal/common/backoff to a public package.

Why?

A PayloadCodec may call an external service during Encode and Decode. The in-tree RemotePayloadCodec is the canonical example (HTTP to a sidecar), and KMS-backed encryption codecs and schema-registry codecs are common community variants. Any of these can fail transiently because of network blips, throttling, or 5xx responses. Today the SDK has no way to retry a codec call: the first error from any codec immediately aborts the encode/decode and surfaces to user code. Each downstream user has to either fork the codec or wrap retry logic inside their own codec implementation. That duplicates effort across the ecosystem and does nothing for the in-tree RemotePayloadCodec, which has no retries today (pc.options.Client.Do(req) followed by return payloads, err).

This proposal complements PR #2228, which made codec-failure side effects non-catastrophic for session workflows. That fix prevents permanent bricking when a codec failure cascades through a session activity cancellation, but a transient DeadlineExceeded from a remote codec still kills the in-flight workflow task and forces the server to retry the entire task. This PR is the root-cause complement: codecs that opt in can absorb the transient blip in place without ever losing the workflow task.

Determinism: codecs run at the SDK serialization boundary, never inside the workflow goroutine's user code. Retrying does not introduce new history events, new commands, or new branches in workflow code. History replay re-invokes the codec the same way it would on the original execution. Replay-aware retries are safe by construction.

Design choices the maintainers may want to weigh in on (carried from issue #2370):

  1. Keep the local CodecRetryPolicy struct as proposed, or thread something else through? (Using temporal.RetryPolicy would require breaking an import cycle from converter/ back into temporal/.)
  2. Keep the per-codec retry boundary as proposed, or would a RetryWholeChain bool opt-in be preferred?
  3. IsRetryable nil = retry all matches the backoff.Retry and temporal.RetryPolicy conventions. Would the safer IsRetryable nil = retry none default be preferred instead?
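For design choice 3, it may help to see what an explicit classifier could look like when a caller does not want the retry-all default. The sketch below has the func(error) bool shape of IsRetryable; the httpStatusError type and the isTransient name are hypothetical, and the specific rules (retry on 429 and 5xx, retry on timeouts, never retry on cancellation) are illustrative assumptions rather than anything the PR prescribes.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// httpStatusError is a hypothetical error a remote codec might return.
type httpStatusError struct{ Code int }

func (e *httpStatusError) Error() string {
	return fmt.Sprintf("codec endpoint returned HTTP %d", e.Code)
}

// isTransient reports whether err looks worth retrying.
func isTransient(err error) bool {
	// Caller-initiated cancellation should stop retries immediately.
	if errors.Is(err, context.Canceled) {
		return false
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}
	// Throttling and server-side failures are assumed to be transient.
	var se *httpStatusError
	if errors.As(err, &se) {
		return se.Code == 429 || se.Code >= 500
	}
	// Network-level timeouts are retried; other network errors are not.
	var ne net.Error
	if errors.As(err, &ne) {
		return ne.Timeout()
	}
	// Unknown errors are treated as permanent in this sketch.
	return false
}

func main() {
	wrapped := fmt.Errorf("encode: %w", &httpStatusError{Code: 503})
	fmt.Println(isTransient(wrapped))
	fmt.Println(isTransient(&httpStatusError{Code: 400}))
	fmt.Println(isTransient(errors.New("bad key id")))
}
```

A function like this would be passed as the IsRetryable field. Because it returns false for anything it does not recognize, it behaves like the "retry none" default for unknown errors while still retrying the cases the caller has explicitly listed.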

Checklist

  1. Closes #2370 (Add optional retry policy for PayloadCodec failures)

  2. How was this tested:

Local verification, all from internal/cmd/build:

go run . check                                   # vet, errcheck, staticcheck, doclink: clean
go run . unit-test -run "TestCodecDataConverter" # 11 tests pass (10 new plus 1 pre-existing)
go run . unit-test                               # full suite passes, zero FAIL lines
go run . integration-test -dev-server            # passes; one pre-existing Nexus flake unrelated to converter

Ten new tests in converter/codec_test.go:

  • TestCodecDataConverter_Retry_EncodeSucceedsAfterTransientFailures: flaky codec fails twice and succeeds on the third attempt.
  • TestCodecDataConverter_Retry_EncodeExhaustsAttempts: always-failing codec exhausts MaximumAttempts and returns the last error.
  • TestCodecDataConverter_Retry_DecodeRetried: symmetric coverage on the Decode path.
  • TestCodecDataConverter_Retry_NonRetryableErrorFailsImmediately: IsRetryable returns false; single attempt.
  • TestCodecDataConverter_Retry_DefaultPolicyPreservesOldBehavior: backward compatibility confirmed against the pre-existing errorCodecOnEncode fake.
  • TestCodecDataConverter_Retry_PerCodecBoundary: verifies a successful predecessor codec is not re-run when a later codec retries.
  • TestCodecDataConverter_Retry_ContextCancellationStopsRetries: context cancellation aborts the retry loop and surfaces the codec error, not context.Canceled.
  • TestCodecDataConverter_Retry_NewCodecDataConverterIsZeroOptions: NewCodecDataConverter produces output behaviorally identical to NewCodecDataConverterWithOptions with a zero-value options struct.
  • TestCodecDataConverter_Retry_OptionsPreservedThroughSerializationContext: WithSerializationContext derivation does not silently drop retry config.
  • TestCodecDataConverter_Retry_NilContextDefaultsToBackground: Context: nil does not panic and falls back to context.Background().

The pre-existing TestCodecDataConverter_ToPayload_EncodeError was not modified and continues to pass without changes. The six other pre-existing TestCodecDataConverter_* tests (propagation, signing-mismatch, etc.) also pass unchanged.

In addition to the in-repo tests, an out-of-tree consumer program calls NewCodecDataConverterWithOptions against a flaky codec that returns kms 503 on the first two attempts. The call succeeds on the third attempt and the program prints err=<nil> payload=true. This confirms the public API shape is usable from outside the SDK module.

All retry tests use sub-millisecond intervals so the new suite runs in well under one second.

  3. Any docs updates needed?

No external docs updates are required. The public API additions carry doc comments on every exported symbol matching the density set by serialization_context.go. The existing NewCodecDataConverter doc comment was updated to point readers at the new sibling constructor for the retry-aware path. No README updates needed; no docs.temporal.io updates needed since this is a Go SDK addition with self-documenting types.

@CLAassistant

CLAassistant commented May 28, 2026

CLA assistant check
All committers have signed the CLA.

@whitemanthedj whitemanthedj changed the title from "DRAFT: converter: add optional retry policy for PayloadCodec calls" to "Converter: add optional retry policy for PayloadCodec calls" May 29, 2026
@whitemanthedj whitemanthedj marked this pull request as ready for review May 29, 2026 16:38
@whitemanthedj whitemanthedj requested a review from a team as a code owner May 29, 2026 16:38