Generate. Judge. Revise. Stop when good enough.
When to use
- Quality bar is clear.
- Output can improve with feedback.
- Evaluation criteria can be written or coded.
- Failure is costly.
- Extra loop cost is acceptable.
Do not use
- No reliable eval signal.
- Feedback would be vague or subjective.
- Fast answer matters more than polished answer.
- Output is high-risk and the loop would stand in for human review.
- Revision loops may drift from user intent.
Goal
- Produce stronger output through controlled iteration.
- Keep evaluator strict.
- Stop before cost or drift outweighs the gain.
Rules
- Define rubric before generating.
- Separate generator and evaluator roles.
- Evaluator returns score, reasons, and fix hints.
- Set max rounds.
- Stop on pass, exhausted budget, or score plateau.
- Save failing examples for rubric tuning.
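The evaluator contract and stop rule above could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `Evaluation` class, the `stop_reason` helper, the 0.0 to 1.0 score scale, and the default thresholds (`pass_score`, `max_rounds`, `min_gain`) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Evaluation:
    """What the evaluator returns each round (hypothetical shape)."""
    score: float                                    # assumed scale: 0.0 to 1.0
    reasons: List[str] = field(default_factory=list)
    fix_hints: List[str] = field(default_factory=list)


def stop_reason(history: List[Evaluation],
                pass_score: float = 0.9,    # assumed pass threshold
                max_rounds: int = 4,        # assumed round budget
                min_gain: float = 0.01) -> Optional[str]:
    """Return 'pass', 'budget', 'plateau', or None to keep looping."""
    if not history:
        return None
    if history[-1].score >= pass_score:
        return "pass"
    if len(history) >= max_rounds:
        return "budget"
    # Plateau: the latest round improved the score by less than min_gain.
    if len(history) >= 2 and history[-1].score - history[-2].score < min_gain:
        return "plateau"
    return None
```

Checking pass first means a candidate that clears the bar on the final allowed round is still reported as a pass rather than a budget stop.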
Good eval signals
- Schema validity.
- Test pass rate.
- Grounding to source facts.
- Style compliance.
- Policy compliance.
- Ranking score.
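As one example of a signal that can be coded, schema validity might be checked like this. The sketch uses only the standard-library `json` module; the `REQUIRED_FIELDS` schema and the `check_schema` helper are hypothetical, and scoring by fraction of valid fields is just one possible choice.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical schema: required field name -> expected Python type.
REQUIRED_FIELDS: Dict[str, type] = {"title": str, "summary": str, "tags": list}


def check_schema(candidate: str) -> Tuple[float, List[str]]:
    """Score a JSON candidate as the fraction of required fields that are valid."""
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError as exc:
        return 0.0, [f"Not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return 0.0, ["Top-level value must be a JSON object."]

    problems: List[str] = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"Missing field '{name}'.")
        elif not isinstance(data[name], expected):
            problems.append(f"Field '{name}' must be {expected.__name__}.")

    score = 1.0 - len(problems) / len(REQUIRED_FIELDS)
    return score, problems
```

Each problem string names the field and the fix, so the same list can be passed back to the generator as feedback.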
Flow
- Define pass/fail rubric.
- Generate candidate.
- Evaluate against rubric.
- If pass, return candidate.
- If fail, revise using feedback.
- Repeat until stop condition.
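The flow above can be sketched as a small loop. Everything here is illustrative: `refine`, the `Feedback` tuple, and the toy generator and evaluator are hypothetical stand-ins for whatever model calls or checks a real setup would use, and only the pass and round-budget stop conditions are shown.

```python
from typing import Callable, List, Tuple

# (score, reasons, fix_hints) as returned by the evaluator role.
Feedback = Tuple[float, List[str], List[str]]


def refine(generate: Callable[[str, List[str]], str],
           evaluate: Callable[[str], Feedback],
           task: str,
           pass_score: float = 0.9,     # assumed pass threshold
           max_rounds: int = 3) -> dict:
    """Generate, evaluate, and revise until pass or the round budget is spent."""
    hints: List[str] = []
    best, best_score = "", float("-inf")
    for round_no in range(1, max_rounds + 1):
        candidate = generate(task, hints)          # revision uses last round's hints
        score, _reasons, hints = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if score >= pass_score:
            return {"status": "pass", "rounds": round_no,
                    "score": score, "output": candidate}
    # Budget spent: return the best candidate seen, not necessarily the last.
    return {"status": "stopped", "rounds": max_rounds,
            "score": best_score, "output": best}


# Toy roles, kept separate: the generator only sees the task and the hints.
def toy_generate(task: str, hints: List[str]) -> str:
    return task + "." if hints else task


def toy_evaluate(candidate: str) -> Feedback:
    if candidate.endswith("."):
        return 1.0, [], []
    return 0.5, ["Missing final period."], ["End the sentence with a period."]


result = refine(toy_generate, toy_evaluate, "Loops need stop rules")
```

Tracking the best candidate guards against a late revision that scores worse than an earlier one.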
Failure modes
- Evaluator too soft.
- Evaluator and generator share blind spots.
- Feedback not actionable.
- Loop overfits rubric.
- No stop rule.
- No human review path for high-risk output.
Output
## Result
- Status: pass | fail | stopped
- Rounds: [n]
- Score: [score]
## Final Output
[candidate]
## Evaluation Notes
- [reason]
- [remaining risk]
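
A loop result could be rendered into the template above with a small helper. This is a sketch only: `render_report` and its parameters are hypothetical, and the two-decimal score format is an assumption.

```python
from typing import List

ALLOWED_STATUSES = {"pass", "fail", "stopped"}


def render_report(status: str, rounds: int, score: float,
                  candidate: str, notes: List[str]) -> str:
    """Fill the Result / Final Output / Evaluation Notes template."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"Unknown status: {status!r}")
    lines = [
        "## Result",
        f"- Status: {status}",
        f"- Rounds: {rounds}",
        f"- Score: {score:.2f}",
        "## Final Output",
        candidate,
        "## Evaluation Notes",
    ]
    lines += [f"- {note}" for note in notes]
    return "\n".join(lines)
```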