BUT SOME CAN BE MADE USEFUL!
Inference-time alignment is simple and effective: generate candidate responses from your model, score them with a reward model, and pick a response. No retraining needed! One successful approach is Best-of-n, where you select the highest-scoring response.
However, reward models are only proxies for our desired objectives, such as helpfulness and safety. By overoptimizing a misspecified reward, we can fall victim to REWARD HACKING!
Our contributions:
1. We mathematically formalise inference-time reward hacking and prove its inevitability.
2. We introduce Best-of-Poisson, which approximates the optimal tilted distribution with negligible KL gap.
3. We develop HedgeTune, a lightweight framework that finds optimal inference-time parameters.
Inference-time Reward Hacking
We derive exact conditions under which each of the four possible regimes (shown on the right) is inevitable for common inference-time schemes such as Best-of-n.
Consequence: A unique hacking threshold θ† exists, and we have a principled way to find it!
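One way to read this threshold (a hedged sketch; interpreting ψ(u, θ) as the θ-sensitivity of the selection distribution over proxy-reward quantiles is our assumption, chosen to match the residual R(θ) used by HedgeTune below):

```latex
% Hedged sketch: V is the expected true reward of the selected response,
% theta is the inference-time parameter (n, lambda, or mu), u is the
% proxy-reward quantile, and psi(u, theta) is the assumed theta-sensitivity
% of the selection distribution p_theta.
V(\theta) = \mathbb{E}_{u \sim p_\theta}\!\left[ r_{\mathrm{true}}(u) \right],
\qquad
R(\theta) = \frac{\mathrm{d}}{\mathrm{d}\theta} V(\theta)
          = \mathbb{E}\!\left[ r_{\mathrm{true}}(u)\, \psi(u,\theta) \right],
\qquad
R(\theta^{\dagger}) = 0 .
```

On this reading, turning up the inference-time parameter below θ† still improves the true reward, while pushing past θ† stops paying off.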
BEST-OF-N
1. Draw n candidates
2. Score them using the proxy reward
3. Select the one with the highest score.
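A minimal sketch of this procedure (the `generate` and `proxy_reward` callables are placeholders for your sampler and reward model, not APIs from the paper):

```python
import numpy as np

def best_of_n(prompt, n, generate, proxy_reward):
    """Best-of-n: draw n candidates, score each with the proxy reward,
    and return the highest-scoring response."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = np.array([proxy_reward(prompt, c) for c in candidates])
    return candidates[int(np.argmax(scores))]
```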
SOFT BEST-OF-N
1. Draw n candidates
2. Score them using the proxy reward
3. Sample a response from a softmax over the scores using a temperature λ.
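A minimal sketch, reusing the placeholder callables above; dividing the scores by the temperature λ is the standard softmax-with-temperature convention and an assumption here:

```python
import numpy as np

def soft_best_of_n(prompt, n, lam, generate, proxy_reward, rng=None):
    """Soft Best-of-n: draw n candidates, then sample one response from a
    softmax over the proxy scores with temperature lam."""
    rng = rng or np.random.default_rng()
    candidates = [generate(prompt) for _ in range(n)]
    scores = np.array([proxy_reward(prompt, c) for c in candidates])
    logits = scores / lam                    # smaller lam -> closer to hard Best-of-n
    probs = np.exp(logits - logits.max())    # subtract max for numerical stability
    probs /= probs.sum()
    return candidates[rng.choice(n, p=probs)]
```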
BEST-OF-POISSON [OURS]
1. Sample n from a Poisson distribution with parameter μ
2. Apply BoN using this n.
Provably close to the optimal RLHF tilted distribution & efficient at inference time!
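A minimal sketch on top of the `best_of_n` helper above; treating a draw of n = 0 as a single candidate is our assumption, and the paper's exact construction may differ:

```python
import numpy as np

def best_of_poisson(prompt, mu, generate, proxy_reward, rng=None):
    """Best-of-Poisson: sample n ~ Poisson(mu), then run Best-of-n with that n."""
    rng = rng or np.random.default_rng()
    n = max(1, rng.poisson(mu))   # assumption: always draw at least one candidate
    return best_of_n(prompt, n, generate, proxy_reward)
```

Since μ is a continuous knob (unlike the integer n), it is convenient to tune with the threshold-finding procedure below.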
HEDGETUNE [OURS]
We only require samples from the proxy and true rewards!
1. Build the score ψ(u, θ) for the chosen method
2. Define the residual R(θ) = E[ r_true(u) · ψ(u, θ) ]
3. Find θ†, the root of the residual R(θ), using bisection or Newton's method
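A minimal sketch of these three steps for the Best-of-n case, using paired proxy/true reward samples. The specific score ψ(u, n) = ∂/∂n [n·u^(n-1)] = u^(n-1)(1 + n·ln u), i.e. the sensitivity of the selected-quantile density to n, is our illustrative assumption; the paper defines the exact ψ for each scheme, and Newton's method could replace the bracketing root-finder used here.

```python
import numpy as np
from scipy.optimize import brentq

def hedgetune_best_of_n(proxy_scores, true_scores, n_lo=1.0, n_hi=64.0):
    """Find the hacking threshold n† as the root of the residual
    R(n) = E[ r_true(u) * psi(u, n) ] over empirical proxy-reward quantiles u."""
    # Step 0: convert proxy scores to empirical quantiles u in (0, 1),
    # keeping each u paired with the true reward of the same sample.
    ranks = np.argsort(np.argsort(proxy_scores))
    u = (ranks + 0.5) / len(proxy_scores)
    r = np.asarray(true_scores, dtype=float)

    # Steps 1-2: build the (assumed) Best-of-n score psi and the residual R(n).
    def residual(n):
        psi = u ** (n - 1.0) * (1.0 + n * np.log(u))
        return float(np.mean(r * psi))

    # Step 3: root-find; assumes R changes sign on [n_lo, n_hi]
    # (otherwise no interior threshold exists in that range).
    return brentq(residual, n_lo, n_hi)
```

Here `proxy_scores` and `true_scores` are reward arrays for the same responses sampled from the base policy; the returned value is the candidate threshold n† beyond which, under this model, scaling Best-of-n further no longer helps the true reward.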
Check our paper to see how hedging mitigates hacking in real-world settings!