ALL PROXY REWARDS ARE BAD…
BUT SOME CAN BE MADE USEFUL!

Inference-time alignment is simple and effective: generate candidate responses from your model, score them with a reward model, and pick a response. No retraining needed! One successful approach is Best-of-n (BoN), where you select the highest-scoring response.
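
In code, the whole loop is a few lines (a minimal sketch; generate and proxy_reward are hypothetical stand-ins for your language model and reward model):

    def best_of_n(prompt, n, generate, proxy_reward):
        # Draw n candidates, score each with the proxy, return the best.
        candidates = [generate(prompt) for _ in range(n)]
        scores = [proxy_reward(prompt, c) for c in candidates]
        return candidates[scores.index(max(scores))]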


However, these reward models function as proxies for our desired objectives such as helpfulness and safety. By overoptimizing for a misspecified reward, we can fall victim to REWARD HACKING!

REWARD HACKING: The true reward first improves, then collapses as we optimize the proxy harder.
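
You can see the effect in a toy simulation (illustrative only: the true-reward shape below is a hypothetical misspecification, not the paper's setup):

    import numpy as np

    rng = np.random.default_rng(0)

    def r_true(u):
        # Hypothetical true reward vs. proxy quantile u: aligned with the
        # proxy up to u = 0.95, sharply misaligned in the right tail.
        return np.where(u < 0.95, u, 0.95 - 15.0 * (u - 0.95))

    for n in [1, 2, 4, 8, 16, 32, 64, 128]:
        # BoN picks the max of n proxy quantiles (Beta(n, 1)-distributed).
        u_max = rng.random((20_000, n)).max(axis=1)
        print(f"n={n:3d}  mean true reward = {r_true(u_max).mean():.3f}")

The mean true reward peaks at moderate n, then collapses as BoN pushes selection into the misspecified tail.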

Our contributions:

1. We mathematically formalise inference-time reward hacking and prove its inevitability.

2. We introduce Best-of-Poisson, which approximates the optimal tilted distribution (recalled below) with negligible KL gap.

3. We develop HedgeTune, a lightweight framework that finds optimal inference-time parameters.
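
For context, the "optimal tilted distribution" in contribution 2 is the standard KL-regularized RLHF solution (a standard result; the paper's exact notation may differ):

    \pi^*_\beta(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\bigl(\beta\, r(x, y)\bigr)

where π_ref is the base model, r the reward, and β the tilt strength.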

Inference-time Reward Hacking

[Figure: true reward vs. number of samples n (1 to 28), with "safe" and "danger" regions marked. Tagline: YOU CAN HEDGE TO FIND IT!]
REWARD HACKING IS INEVITABLE

We derive exact conditions under which one of the four possible regimes (shown on the right) is inevitable for common inference-time schemes such as Best-of-n.

Consequence: A unique hacking threshold θ exists, and we have a principled way to find it!

HEDGING CAN MITIGATE INFERENCE-TIME HACKING

BEST-OF-N

1. Draw n candidates.
2. Score them with the proxy reward.
3. Select the one with the highest score.
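
A useful fact behind the analysis: if u is the proxy-reward quantile of a single sample, BoN selects the maximum of n i.i.d. uniform quantiles, which is Beta(n, 1)-distributed. A quick numerical check (numpy assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 8
    u_max = rng.random((100_000, n)).max(axis=1)
    # The selected quantile is Beta(n, 1); its mean is n / (n + 1).
    print(u_max.mean(), n / (n + 1))  # both ~0.889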

SOFT BEST-OF-N

1. Draw n candidates.
2. Score them with the proxy reward.
3. Sample a response from a softmax over the scores with temperature λ.
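
A minimal sketch of the SBoN selection step, given precomputed proxy scores (numpy assumed; here λ divides the scores, so smaller λ optimizes harder — the paper's temperature convention may differ):

    import numpy as np

    def sbon_select(scores, lam, rng=None):
        # Sample an index from softmax(scores / lam): small lam recovers
        # Best-of-n (argmax), large lam approaches uniform sampling.
        if rng is None:
            rng = np.random.default_rng()
        z = np.asarray(scores, dtype=float) / lam
        z -= z.max()                     # stabilize the exponentials
        p = np.exp(z)
        p /= p.sum()
        return int(rng.choice(len(p), p=p))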

BEST-OF-POISSON [OURS]

1. Sample n from a Poisson distribution with parameter μ.
2. Apply BoN with this n.

Provably close to the optimal RLHF tilted distribution & efficient at inference time!
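
A sketch of BoP (hypothetical generate/proxy_reward callables again; we clamp n to at least 1 so there is always a candidate — the paper's handling of n = 0 may differ):

    import numpy as np

    def best_of_poisson(prompt, mu, generate, proxy_reward, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        # Random sample size n ~ Poisson(mu), clamped to >= 1 (assumption).
        n = max(1, int(rng.poisson(mu)))
        candidates = [generate(prompt) for _ in range(n)]
        scores = [proxy_reward(prompt, c) for c in candidates]
        return candidates[int(np.argmax(scores))]

Note that the Poisson mean μ is a continuous knob, unlike the integer n in BoN.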

HEDGETUNE: An efficient algorithm for finding the optimal inference-time parameter θ for BoN, SBoN, and BoP.
We only require samples from the proxy and true rewards!

1. Build the score function ψ(u, θ) for the chosen method.
2. Define the residual R(θ) = E[ r_true(u) · ψ(u, θ) ].
3. Find the optimal θ as the root of the residual R(θ), using bisection or Newton's method.
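
A minimal sketch of the HedgeTune loop, assuming paired samples (proxy quantile u_i, true reward r_i), the method-specific weight ψ from the paper, and a bracket [lo, hi] known to contain the root:

    import numpy as np
    from scipy.optimize import brentq

    def hedgetune(u, r, psi, lo, hi):
        # Monte Carlo estimate of the residual R(theta) = E[r_true(u) * psi(u, theta)].
        def residual(theta):
            return float(np.mean(r * psi(u, theta)))
        # Root-find the residual; brentq is a robust bisection-style method.
        return brentq(residual, lo, hi)

Here θ is the method's knob: the number of samples n for BoN, the temperature λ for SBoN, or the Poisson mean μ for BoP.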

Check our paper to see how hedging mitigates hacking in real-world settings!