Converter: add optional retry policy for PayloadCodec calls#2373
Open
whitemanthedj wants to merge 1 commit into
Open
Converter: add optional retry policy for PayloadCodec calls#2373whitemanthedj wants to merge 1 commit into
whitemanthedj wants to merge 1 commit into
Conversation
Adds NewCodecDataConverterWithOptions and a CodecDataConverterOptions struct exposing an opt-in retry policy applied per individual codec call. Motivated by codecs that call external services (the in-tree RemotePayloadCodec, KMS-backed codecs, schema-registry codecs) where transient failures today abort the whole conversion with no recovery path. NewCodecDataConverter signature is unchanged; the zero-value options struct preserves current behavior byte-for-byte. Fixes temporalio#2370
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds NewCodecDataConverterWithOptions and a CodecDataConverterOptions struct exposing an opt-in retry policy applied per individual codec call. Motivated by codecs that call external services (the in-tree RemotePayloadCodec, KMS-backed codecs, schema-registry codecs) where transient failures today abort the whole conversion with no recovery path.
NewCodecDataConverter signature is unchanged; the zero-value options struct preserves current behavior.
Fixes #2370
What was changed
New public surface, all in
converter/codec.go:Behavior:
RetryPolicyset): each codec is invoked exactly once and the first error aborts the chain. Byte-for-byte identical to current behavior. ATestCodecDataConverter_Retry_NewCodecDataConverterIsZeroOptionstest asserts equivalence.RetryPolicyset: each codec in the chain is wrapped in a retry loop independently.IsRetryable(if non-nil) classifies each error; nil means retry all, matching the convention ininternal/common/backoff.Retryandtemporal.RetryPolicy(emptyNonRetryableErrorTypesretries everything).Context(defaulting tocontext.Background()) governs cancellation of sleeps between attempts.CodecRetryPolicyzero-value field defaults: 100ms initial interval, backoff coefficient 2.0, max interval equal to 100 times initial. These mirror the conventions used byinternal/common/retry.Default*via three unexported package constants declared at the top ofconverter/codec.go.Retry boundary is per-codec, not per-chain. A chain of
[compress, encrypt]only retries the failing codec. Re-running a successful predecessor would invalidate its output for non-idempotent codecs such as authenticated encryption and KMS data-key wrapping.Files modified:
converter/codec.go(+186, -6): new exported types and constructor, three unexported default constants, privatecallWithRetryhelper (nointernal/import),optionsfield onCodecDataConverter, updatedWithSerializationContextto propagate options through derived converters. ExistingNewCodecDataConverterdoc comment updated to mention the new sibling.converter/codec_test.go(+334):flakyCodec,countingCodec,retryTestSerializationContext, andfastRetryhelpers, plus ten new retry tests.Zero changes outside
converter/. No new module dependencies (go.modandgo.sumare untouched; verified bygo mod tidyproducing no diff). No changes underinternal/. No changes to thePayloadCodecinterface. No changes toNewPayloadCodecHTTPHandler.Out of scope, deliberately:
PayloadConverterandFailureConverter. Those have programmer-error failure modes rather than transient I/O.context.Contextto thePayloadCodecinterface itself. The options-structContextonly governs sleeps between retries.NewPayloadCodecHTTPHandler. Server-side retry is the operator's responsibility.internal/common/backoffto a public package.Why?
A
PayloadCodecmay call an external service during Encode and Decode. The in-treeRemotePayloadCodecis the canonical example (HTTP to a sidecar), and KMS-backed encryption codecs plus schema-registry codecs are common community variants. Any of these can fail transiently from network blips, throttling, or 5xx responses. Today the SDK has no way to retry a codec call: the first error from any codec immediately aborts the encode/decode and surfaces to user code. Each downstream user has to either fork the codec or wrap retry logic inside their own codec implementation, which duplicates effort across the ecosystem and does not help the stdlibRemotePayloadCodec, which has no retries today (pc.options.Client.Do(req)followed byreturn payloads, err).This proposal complements PR #2228, which made codec-failure side effects non-catastrophic for session workflows. That fix prevents permanent bricking when a codec failure cascades through a session activity cancellation, but a transient
DeadlineExceededfrom a remote codec still kills the in-flight workflow task and forces the server to retry the entire task. This PR is the root-cause complement: codecs that opt in can absorb the transient blip in place without ever losing the workflow task.Determinism: codecs run at the SDK serialization boundary, never inside the workflow goroutine's user code. Retrying does not introduce new history events, new commands, or new branches in workflow code. History replay re-invokes the codec the same way it would on the original execution. Replay-aware retries are safe by construction.
Design choices the maintainers may want to weigh in on (carried from issue #2370):
CodecRetryPolicystruct as proposed, or thread something else (temporal.RetryPolicywould require breaking an import cycle fromconverter/back intotemporal/).RetryWholeChain boolopt-in be preferred.IsRetryable nil = retry allmatchesbackoff.Retryandtemporal.RetryPolicyconventions. Would the saferIsRetryable nil = retry nonedefault be preferred instead.Checklist
Closes Add optional retry policy for PayloadCodec failures #2370
How was this tested:
Local verification, all from
internal/cmd/build:Ten new tests in
converter/codec_test.go:TestCodecDataConverter_Retry_EncodeSucceedsAfterTransientFailures: flaky codec fails twice and succeeds on the third attempt.TestCodecDataConverter_Retry_EncodeExhaustsAttempts: always-failing codec exhaustsMaximumAttemptsand returns the last error.TestCodecDataConverter_Retry_DecodeRetried: symmetric coverage on the Decode path.TestCodecDataConverter_Retry_NonRetryableErrorFailsImmediately:IsRetryablereturns false; single attempt.TestCodecDataConverter_Retry_DefaultPolicyPreservesOldBehavior: backward compatibility confirmed against the pre-existingerrorCodecOnEncodefake.TestCodecDataConverter_Retry_PerCodecBoundary: verifies a successful predecessor codec is not re-run when a later codec retries.TestCodecDataConverter_Retry_ContextCancellationStopsRetries: context cancellation aborts the retry loop and surfaces the codec error, notcontext.Canceled.TestCodecDataConverter_Retry_NewCodecDataConverterIsZeroOptions:NewCodecDataConverterproduces output behaviorally identical toNewCodecDataConverterWithOptionswith a zero-value options struct.TestCodecDataConverter_Retry_OptionsPreservedThroughSerializationContext:WithSerializationContextderivation does not silently drop retry config.TestCodecDataConverter_Retry_NilContextDefaultsToBackground:Context: nildoes not panic and falls back tocontext.Background().The pre-existing
TestCodecDataConverter_ToPayload_EncodeErrorwas not modified and continues to pass without changes. The six other pre-existingTestCodecDataConverter_*tests (propagation, signing-mismatch, etc.) also pass unchanged.In addition to in-repo tests, an out-of-tree consumer program that calls
NewCodecDataConverterWithOptionsagainst a flaky codec returningkms 503for the first two attempts succeeds on the third attempt and printserr=<nil> payload=true. This confirms the public API shape is usable from outside the SDK module.All retry tests use sub-millisecond intervals so the new suite runs in well under one second.
No external docs updates are required. The public API additions carry doc comments on every exported symbol matching the density set by
serialization_context.go. The existingNewCodecDataConverterdoc comment was updated to point readers at the new sibling constructor for the retry-aware path. No README updates needed; no docs.temporal.io updates needed since this is a Go SDK addition with self-documenting types.