BUT SOME CAN BE MADE USEFUL!
Inference-time alignment is simple and effective: generate candidate responses from your model, score them with a reward model, and pick a response. No retraining needed! One successful approach is Best-of-n, where you select the highest-scoring response.
However, reward models are only proxies for our desired objectives, such as helpfulness and safety. By overoptimizing a misspecified reward, we can fall victim to REWARD HACKING!
Our contributions:
1. We mathematically formalise inference-time reward hacking and prove its inevitability.
2. We introduce Best-of-Poisson, which approximates the optimal tilted distribution with negligible KL gap.
3. We develop HedgeTune, a lightweight framework that finds optimal inference-time parameters.
Inference-time Reward Hacking
We derive exact conditions under which each of the four possible regimes (shown on the right) is inevitable for common inference-time schemes such as Best-of-n.
Consequence: A unique hacking threshold θ† exists, and we have a principled way to find it!
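One way to read this threshold (a hedged sketch; interpreting ψ(u, θ) as the θ-sensitivity of the selection distribution over proxy-reward quantiles is our assumption, chosen to match the residual R(θ) used by HedgeTune below):

```latex
% Hedged sketch: V is the expected true reward of the selected response,
% theta is the inference-time parameter (n, lambda, or mu), u is the
% proxy-reward quantile, and psi(u, theta) is the assumed theta-sensitivity
% of the selection distribution p_theta.
V(\theta) = \mathbb{E}_{u \sim p_\theta}\!\left[ r_{\mathrm{true}}(u) \right],
\qquad
R(\theta) = \frac{\mathrm{d}}{\mathrm{d}\theta} V(\theta)
          = \mathbb{E}\!\left[ r_{\mathrm{true}}(u)\, \psi(u,\theta) \right],
\qquad
R(\theta^{\dagger}) = 0 .
```

On this reading, turning up the inference-time parameter below θ† still improves the true reward, while pushing past θ† stops paying off.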
BEST-OF-N
1. Draw n candidates
2. Score them using the proxy reward
3. Select the one with the highest score.
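A minimal sketch of this procedure (the `generate` and `proxy_reward` callables are placeholders for your sampler and reward model, not APIs from the paper):

```python
import numpy as np

def best_of_n(prompt, n, generate, proxy_reward):
    """Best-of-n: draw n candidates, score each with the proxy reward,
    and return the highest-scoring response."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = np.array([proxy_reward(prompt, c) for c in candidates])
    return candidates[int(np.argmax(scores))]
```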
SOFT BEST-OF-N
1. Draw n candidates
2. Score them using the proxy reward
3. Sample a response from a softmax over the scores using a temperature λ.
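A minimal sketch, reusing the placeholder callables above; dividing the scores by the temperature λ is the standard softmax-with-temperature convention and an assumption here:

```python
import numpy as np

def soft_best_of_n(prompt, n, lam, generate, proxy_reward, rng=None):
    """Soft Best-of-n: draw n candidates, then sample one response from a
    softmax over the proxy scores with temperature lam."""
    rng = rng or np.random.default_rng()
    candidates = [generate(prompt) for _ in range(n)]
    scores = np.array([proxy_reward(prompt, c) for c in candidates])
    logits = scores / lam                    # smaller lam -> closer to hard Best-of-n
    probs = np.exp(logits - logits.max())    # subtract max for numerical stability
    probs /= probs.sum()
    return candidates[rng.choice(n, p=probs)]
```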
BEST-OF-POISSON [OURS]
1. Sample n from a Poisson distribution with parameter μ
2. Apply BoN using this n.
Provably close to the optimal RLHF tilted distribution & efficient at inference time!
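A minimal sketch on top of the `best_of_n` helper above; treating a draw of n = 0 as a single candidate is our assumption, and the paper's exact construction may differ:

```python
import numpy as np

def best_of_poisson(prompt, mu, generate, proxy_reward, rng=None):
    """Best-of-Poisson: sample n ~ Poisson(mu), then run Best-of-n with that n."""
    rng = rng or np.random.default_rng()
    n = max(1, rng.poisson(mu))   # assumption: always draw at least one candidate
    return best_of_n(prompt, n, generate, proxy_reward)
```

Since μ is a continuous knob (unlike the integer n), it is convenient to tune with the threshold-finding procedure below.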
HEDGETUNE [OURS]
We only require samples from the proxy and true rewards!
1. Build the score ψ(u, θ) for the chosen method
2. Define the residual R(θ) = E[ r_true(u) · ψ(u, θ) ]
3. Find θ†, the root of the residual R(θ), using bisection or Newton's method
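A minimal sketch of these three steps for the Best-of-n case, using paired proxy/true reward samples. The specific score ψ(u, n) = ∂/∂n [n·u^(n-1)] = u^(n-1)(1 + n·ln u), i.e. the sensitivity of the selected-quantile density to n, is our illustrative assumption; the paper defines the exact ψ for each scheme, and Newton's method could replace the bracketing root-finder used here.

```python
import numpy as np
from scipy.optimize import brentq

def hedgetune_best_of_n(proxy_scores, true_scores, n_lo=1.0, n_hi=64.0):
    """Find the hacking threshold n† as the root of the residual
    R(n) = E[ r_true(u) * psi(u, n) ] over empirical proxy-reward quantiles u."""
    # Step 0: convert proxy scores to empirical quantiles u in (0, 1),
    # keeping each u paired with the true reward of the same sample.
    ranks = np.argsort(np.argsort(proxy_scores))
    u = (ranks + 0.5) / len(proxy_scores)
    r = np.asarray(true_scores, dtype=float)

    # Steps 1-2: build the (assumed) Best-of-n score psi and the residual R(n).
    def residual(n):
        psi = u ** (n - 1.0) * (1.0 + n * np.log(u))
        return float(np.mean(r * psi))

    # Step 3: root-find; assumes R changes sign on [n_lo, n_hi]
    # (otherwise no interior threshold exists in that range).
    return brentq(residual, n_lo, n_hi)
```

Here `proxy_scores` and `true_scores` are reward arrays for the same responses sampled from the base policy; the returned value is the candidate threshold n† beyond which, under this model, scaling Best-of-n further no longer helps the true reward.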
Check our paper to see how hedging mitigates hacking in real-world settings!