lizikk 1f9907df4c ralph-loop[agent-skills-dev]: checkpoint before iteration

2026-03-23 11:24:19 +08:00

Critic Checklist Reference

Loaded by Step 5 (Critic Evaluation). 5 dimensions, 20 items. Each scored PASS/FAIL. Threshold: 18/20 to pass. Maximum 3 revision rounds.

Table of Contents

Scoring Rules
Dimension 1: Clarity (5 items)
Dimension 2: Accuracy (4 items)
Dimension 3: Style (5 items)
Dimension 4: Reproducibility (3 items)
Dimension 5: Caption (3 items)
Evaluation Output Template
Revision Protocol

Scoring Rules

Each item: PASS (1 point) or FAIL (0 points)
Total: 20 points maximum
Pass threshold: 18/20 (90%)
Maximum revision rounds: 3
Each revision must address ALL flagged FAIL items
Items not applicable to the figure type: auto-PASS (e.g., error bars for conceptual figures)
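
The scoring rules above can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: the function name `score_checklist` and the convention of using `None` for "not applicable" are assumptions made for the example.

```python
from typing import Dict, List, Optional

TOTAL_ITEMS = 20      # 5 + 4 + 5 + 3 + 3 items across the five dimensions
PASS_THRESHOLD = 18   # 18/20 (90%), per the scoring rules

# Item IDs as used in the checklist tables: C1-C5, A1-A4, S1-S5, R1-R3, P1-P3.
ITEM_IDS: List[str] = (
    [f"C{i}" for i in range(1, 6)]
    + [f"A{i}" for i in range(1, 5)]
    + [f"S{i}" for i in range(1, 6)]
    + [f"R{i}" for i in range(1, 4)]
    + [f"P{i}" for i in range(1, 4)]
)


def score_checklist(results: Dict[str, Optional[bool]]) -> dict:
    """Score one evaluation round.

    `results` maps item ID -> True (PASS), False (FAIL), or None
    (hypothetical convention: not applicable, counted as auto-PASS).
    """
    if set(results) != set(ITEM_IDS):
        raise ValueError("results must cover exactly the 20 checklist items")
    failed = sorted(item for item, ok in results.items() if ok is False)
    score = TOTAL_ITEMS - len(failed)
    return {
        "score": score,
        "failed": failed,
        "verdict": "PASS" if score >= PASS_THRESHOLD else "REVISE",
    }


# Example: a conceptual figure (error bars not applicable) with three failures.
example = {item: True for item in ITEM_IDS}
example["A4"] = None    # auto-PASS
example["C4"] = False
example["S2"] = False
example["P2"] = False
result = score_checklist(example)
print(result)
```

With three FAIL items the score is 17/20, one point below the threshold, so the verdict is REVISE; the auto-PASS item does not cost a point.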

Dimension 1: Clarity (5 items)

| ID | Check | PASS Criteria | Common FAIL |
|----|-------|---------------|-------------|
| C1 | Caption self-explanatory | Reader understands figure without reading paper body | Caption says only "Results" with no detail |
| C2 | Labels readable | All text >= minimum font size at print dimensions | 6pt text on IEEE single-column figure |
| C3 | Visual hierarchy | Most important data visually prominent (size, color, position) | All lines same weight, "Ours" indistinguishable |
| C4 | No overlapping elements | Labels, legends, data points, annotations do not occlude each other | Legend sits on top of data region |
| C5 | Adequate whitespace | Margins around figure, between subfigures, around legends | Elements crammed to edges, no breathing room |

How to Fix Common FAIL

C1: Add "showing that [method] achieves [X]% improvement over [baseline]" to caption
C2: Increase font size or reduce figure complexity (fewer elements)
C3: Use thicker lines / brighter colors for "Ours", thinner / muted for baselines
C4: Relocate legend (outside plot area, or to whitespace region)
C5: Add plt.tight_layout(pad=1.5) or increase figure dimensions
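
As a rough illustration of the C2 to C5 fixes, the sketch below draws three series with "Ours" emphasized, the legend moved outside the axes, and padding added. The data values, the 8pt font size, and the colors are placeholders chosen for the example; they are not venue requirements or entries from academic-styles.md.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Illustrative numbers only, not results from any real experiment.
epochs = [1, 2, 3, 4, 5]
series = {
    "Baseline A": [0.52, 0.61, 0.66, 0.69, 0.71],
    "Baseline B": [0.50, 0.63, 0.68, 0.72, 0.74],
    "Ours": [0.55, 0.68, 0.75, 0.79, 0.82],
}

fig, ax = plt.subplots(figsize=(3.5, 2.6))
for name, values in series.items():
    is_ours = name == "Ours"
    ax.plot(
        epochs,
        values,
        label=name,
        linewidth=2.5 if is_ours else 1.0,        # C3: thicker line for "Ours"
        color="tab:red" if is_ours else "gray",   # C3: muted baselines (placeholder colors)
        zorder=3 if is_ours else 2,               # C3: draw "Ours" on top
    )

# C2: set label and tick sizes explicitly (8pt is an assumed minimum).
ax.set_xlabel("Epoch", fontsize=8)
ax.set_ylabel("Accuracy", fontsize=8)
ax.tick_params(labelsize=8)

# C4: anchor the legend outside the plot area so it cannot cover data.
legend = ax.legend(
    loc="upper left", bbox_to_anchor=(1.02, 1.0), fontsize=8, frameon=False
)

# C5: add padding around the axes.
fig.tight_layout(pad=1.5)
```

Because the legend sits outside the axes, saving with `bbox_inches='tight'` (see R3) is one way to make sure it is not clipped in the output file.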

Dimension 2: Accuracy (4 items)

| ID | Check | PASS Criteria | Common FAIL |
|----|-------|---------------|-------------|
| A1 | Data fidelity | Plotted values exactly match source data (spot-check 3 points) | Transposed rows/columns, wrong method assigned to line |
| A2 | Encoding faithful | Visual encoding matches data semantics (e.g., larger bar = larger value) | Inverted y-axis without indicator, log scale unlabeled |
| A3 | Axes honest | No truncated axes without break markers; scale is fair | Y-axis starts at 90% making 1% difference look huge |
| A4 | Error bars correct | Error bars represent stated metric (std, sem, CI) and are labeled | Error bars present but unlabeled, or wrong metric |

How to Fix Common FAIL

A1: Cross-reference code data arrays with source table, print values to verify
A2: Check axis direction, scale type (linear/log), and legend-to-data mapping
A3: Start y-axis at 0, or use axis break (//) marker if truncated
A4: Add "(mean +/- std, n=5)" to caption or legend
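
The A1 spot-check can be automated along these lines. This is a sketch under assumptions: the helper name `spot_check`, the dictionary layout (method name mapped to a list of values), and the numbers are all made up for illustration.

```python
import random

# Hypothetical source table: method -> value per dataset (illustrative only).
source_table = {
    "Baseline": [71.2, 64.5, 80.1],
    "Ours": [74.8, 69.3, 83.0],
}

# The arrays as they appear in the plotting code.
plotted = {
    "Baseline": [71.2, 64.5, 80.1],
    "Ours": [74.8, 69.3, 83.0],
}


def spot_check(source, plotted, n_points=3, seed=0):
    """Compare `n_points` randomly chosen cells; return a list of mismatches.

    Each mismatch is (method, index, source_value, plotted_value).
    Values are compared exactly, matching the A1 PASS criterion.
    """
    rng = random.Random(seed)
    cells = [(m, i) for m, vals in source.items() for i in range(len(vals))]
    mismatches = []
    for method, idx in rng.sample(cells, min(n_points, len(cells))):
        expected = source[method][idx]
        actual = plotted[method][idx]
        print(f"{method}[{idx}]: source={expected} plotted={actual}")
        if actual != expected:
            mismatches.append((method, idx, expected, actual))
    return mismatches


mismatches = spot_check(source_table, plotted)

# A common A1 failure: values assigned to the wrong method.
swapped = {"Baseline": source_table["Ours"], "Ours": source_table["Baseline"]}
swapped_mismatches = spot_check(source_table, swapped, n_points=6)
```

Setting `n_points` to the total number of cells turns the spot-check into a full comparison, which is cheap for the small inline arrays that R1 requires.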

Dimension 3: Style (5 items)

| ID | Check | PASS Criteria | Common FAIL |
|----|-------|---------------|-------------|
| S1 | Palette matches venue | Colors from venue-specific palette in academic-styles.md | Using matplotlib default colors instead of venue palette |
| S2 | Colorblind safe | All series distinguishable under simulated deuteranopia | Red and green lines only differ by hue |
| S3 | Grayscale compatible | Figure readable when desaturated (for B&W printing) | Two series both map to medium gray |
| S4 | Font compliant | Font family and sizes match venue specification | Sans-serif used for NeurIPS (should be serif) |
| S5 | Size compliant | Figure fits within venue column width at stated dimensions | 8" wide figure for IEEE single-column (max 3.5") |

How to Fix Common FAIL

S1: Replace colors with hex values from references/academic-styles.md
S2: Add secondary encoding (markers: o, s, ^, D, v) alongside color
S3: Add line styles (solid, dashed, dotted, dash-dot) as tertiary encoding
S4: Update matplotlib.rcParams['font.family']
S5: Adjust figsize to venue constraints
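
One way to approximate the S3 check is to convert each series color to a gray level and flag pairs that land too close together, then apply the S2/S3 fixes by pairing each series with its own marker and line style. In the sketch below the Rec. 601 luma weights are a standard approximation, but the 30-level gap threshold and the palette are assumptions; the colors are placeholders, not values from references/academic-styles.md.

```python
from itertools import combinations


def gray_level(hex_color: str) -> float:
    """Approximate brightness on a 0-255 scale using Rec. 601 luma weights."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.299 * r + 0.587 * g + 0.114 * b


def grayscale_clashes(palette, min_gap=30.0):
    """Return pairs of series whose gray levels differ by less than `min_gap`.

    `min_gap` is an arbitrary threshold picked for this sketch.
    """
    return [
        (name_a, name_b)
        for (name_a, col_a), (name_b, col_b) in combinations(palette.items(), 2)
        if abs(gray_level(col_a) - gray_level(col_b)) < min_gap
    ]


# Placeholder palette for illustration only.
palette = {"Ours": "#D55E00", "Baseline A": "#0072B2", "Baseline B": "#009E73"}
clashes = grayscale_clashes(palette)
print(clashes)

# S2/S3 fix: add markers and line styles as secondary and tertiary encodings.
MARKERS = ["o", "s", "^", "D", "v"]
LINESTYLES = ["-", "--", ":", "-."]
styles = {
    name: {
        "color": color,
        "marker": MARKERS[i % len(MARKERS)],
        "linestyle": LINESTYLES[i % len(LINESTYLES)],
    }
    for i, (name, color) in enumerate(palette.items())
}
```

Each entry of `styles` can be passed straight to `ax.plot(x, y, label=name, **styles[name])`. A luma comparison is only a proxy; a rendered, desaturated copy of the figure is still the more reliable check.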

Dimension 4: Reproducibility (3 items)

| ID | Check | PASS Criteria | Common FAIL |
|----|-------|---------------|-------------|
| R1 | Code complete | Code runs without modification (all data inline, no external files) | References `data.csv` that doesn't exist |
| R2 | Imports present | All required imports listed at top of code | Missing `import seaborn as sns` |
| R3 | Output configured | Explicit DPI, savefig with bbox_inches='tight', both PNG and PDF | No savefig, or missing DPI, or only PNG |

Scope: Only applies to code-based figures (data-plot, comparison-chart, result-visualization). Auto-PASS for AI-generated figures.

How to Fix Common FAIL

R1: Replace file reads with inline data arrays (np.array([...]))
R2: Trace the code line by line; every module or function it calls must have a matching import
R3: Add plt.savefig('fig.png', dpi=300, bbox_inches='tight') + PDF variant
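
A minimal script that would satisfy R1 to R3 might look like the following. The data values and the output file names `fig.png` / `fig.pdf` are placeholders for the example.

```python
# R2: every module used below is imported here.
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# R1: data is inline, so the script has no external file dependencies.
# Illustrative values only.
epochs = np.array([1, 2, 3, 4])
accuracy = np.array([0.61, 0.70, 0.76, 0.79])

fig, ax = plt.subplots(figsize=(3.5, 2.6))
ax.plot(epochs, accuracy, marker="o")
ax.set_xlabel("Epoch")
ax.set_ylabel("Accuracy")

# R3: explicit DPI, tight bounding box, and both PNG and PDF outputs.
output_files = []
for ext in ("png", "pdf"):
    path = f"fig.{ext}"
    fig.savefig(path, dpi=300, bbox_inches="tight")
    output_files.append(path)

plt.close(fig)
```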

Dimension 5: Caption (3 items)

| ID | Check | PASS Criteria | Common FAIL |
|----|-------|---------------|-------------|
| P1 | Content described | Caption describes WHAT the figure shows (not just "Results") | "Figure 1: Comparison." — no detail |
| P2 | Key finding stated | Caption highlights the main takeaway for the reader | Describes setup but not conclusion |
| P3 | Subfigures referenced | Multi-panel figures: each panel described in caption (a, b, c) | 4 panels but caption only mentions "left" and "right" |

How to Fix Common FAIL

P1: Structure as "Figure N: [verb] [what]. [context]." e.g., "Figure 3: Comparison of training convergence across 5 methods on CIFAR-10."
P2: Add final sentence: "Method A converges 2x faster than the strongest baseline."
P3: Add "(a) Training loss. (b) Validation accuracy. (c) Inference latency."
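
The three caption fixes compose naturally: description first, then panel labels, then the key finding as the final sentence. The helper below is a hypothetical sketch of that assembly (the name `build_caption` and its parameters are not part of the skill), using the example strings from the fixes above.

```python
from typing import Optional, Sequence


def build_caption(
    number: int,
    description: str,
    finding: str,
    panels: Optional[Sequence[str]] = None,
) -> str:
    """Assemble a caption covering P1 (content), P3 (panels), and P2 (finding)."""
    # P1: "Figure N: [verb] [what]. [context]."
    parts = [f"Figure {number}: {description.rstrip('.')}."]
    # P3: label each panel (a), (b), (c), ...
    if panels:
        letters = "abcdefghijklmnopqrstuvwxyz"
        parts.append(
            " ".join(
                f"({letters[i]}) {text.rstrip('.')}."
                for i, text in enumerate(panels)
            )
        )
    # P2: the key finding goes last.
    parts.append(finding.rstrip(".") + ".")
    return " ".join(parts)


caption = build_caption(
    3,
    "Comparison of training convergence across 5 methods on CIFAR-10",
    "Method A converges 2x faster than the strongest baseline",
    panels=["Training loss", "Validation accuracy", "Inference latency"],
)
print(caption)
```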

Evaluation Output Template

## Critic Evaluation — Round N/3

| Dim.            | Items | Score | Failed IDs | Issues |
|-----------------|-------|-------|------------|--------|
| Clarity         | 5     | X/5   | [CX]       | [description] |
| Accuracy        | 4     | X/4   | [AX]       | [description] |
| Style           | 5     | X/5   | [SX]       | [description] |
| Reproducibility | 3     | X/3   | [RX]       | [description] |
| Caption         | 3     | X/3   | [PX]       | [description] |
| **Total**       | **20**| **X/20** |         |        |

Verdict: PASS (>= 18) / REVISE (< 18)

[If REVISE:]
Revision actions:
1. [Fix for failed item 1]
2. [Fix for failed item 2]
...
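
Filling in the template can be mechanical once the failed items are known. The sketch below assumes failures arrive as a dictionary of item ID to issue description; the function name `render_evaluation` and the use of `-` for empty cells are choices made for the example, not part of the template.

```python
from typing import Dict

# (dimension name, item ID prefix, number of items), as in the checklist.
DIMENSIONS = [
    ("Clarity", "C", 5),
    ("Accuracy", "A", 4),
    ("Style", "S", 5),
    ("Reproducibility", "R", 3),
    ("Caption", "P", 3),
]


def render_evaluation(round_no: int, failures: Dict[str, str]) -> str:
    """Render one round of the evaluation template as Markdown text."""
    lines = [
        # \u2014 is the dash used in the template heading.
        f"## Critic Evaluation \u2014 Round {round_no}/3",
        "",
        "| Dim. | Items | Score | Failed IDs | Issues |",
        "|------|-------|-------|------------|--------|",
    ]
    total = 0
    for name, prefix, count in DIMENSIONS:
        failed = sorted(item for item in failures if item.startswith(prefix))
        score = count - len(failed)
        total += score
        ids = ", ".join(failed) or "-"
        issues = "; ".join(failures[item] for item in failed) or "-"
        lines.append(f"| {name} | {count} | {score}/{count} | {ids} | {issues} |")
    lines.append(f"| **Total** | **20** | **{total}/20** | | |")

    verdict = "PASS" if total >= 18 else "REVISE"
    lines += ["", f"Verdict: {verdict}"]
    if verdict == "REVISE":
        lines += ["", "Revision actions:"]
        for n, item in enumerate(sorted(failures), start=1):
            lines.append(f"{n}. Fix {item}: {failures[item]}")
    return "\n".join(lines)


report = render_evaluation(
    1,
    {
        "C4": "Legend overlaps data region",
        "S2": "Series differ by hue only",
        "P2": "Caption omits key finding",
    },
)
print(report)
```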

Revision Protocol

Round Management

Round 1: Initial evaluation
  ↓ PASS → proceed to Step 6
  ↓ REVISE → apply fixes
Round 2: Re-evaluate ALL items (not just failed ones)
  ↓ PASS → proceed to Step 6
  ↓ REVISE → apply fixes
Round 3: Final evaluation
  ↓ PASS → proceed to Step 6
  ↓ REVISE → output with [!] Quality warning

Rules

Each round evaluates ALL 20 items fresh (fixes can introduce new issues)
Revision must address every FAIL item — no "defer to next round"
After round 3, if still < 18/20:
- Output the figure with explicit warning
- List all remaining FAIL items
- Suggest user manual review
A PASS in any round immediately proceeds to Step 6 (no unnecessary iterations)
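
The round management and rules above amount to a bounded loop. The sketch below is one possible reading, assuming two caller-supplied callables: `evaluate(figure)` returns the failed items of a fresh 20-item evaluation, and `revise(figure, failures)` returns a figure with every flagged item addressed. Both names, and the returned dictionary, are hypothetical.

```python
from typing import Any, Callable, Dict

Failures = Dict[str, str]  # item ID -> issue description


def run_critic_loop(
    figure: Any,
    evaluate: Callable[[Any], Failures],
    revise: Callable[[Any, Failures], Any],
    max_rounds: int = 3,
    threshold: int = 18,
    total_items: int = 20,
) -> dict:
    """Evaluate up to `max_rounds` times, revising between rounds."""
    failures: Failures = {}
    for round_no in range(1, max_rounds + 1):
        # Every round re-evaluates all items; fixes can introduce new issues.
        failures = evaluate(figure)
        if total_items - len(failures) >= threshold:
            # A PASS in any round proceeds immediately to Step 6.
            return {"figure": figure, "round": round_no,
                    "failures": failures, "warning": False}
        if round_no < max_rounds:
            # The revision must address every FAIL item from this round.
            figure = revise(figure, failures)
    # Still below threshold after the final round: output with a warning.
    return {"figure": figure, "round": max_rounds,
            "failures": failures, "warning": True}


# Stub figure: a dict carrying its own outstanding issues (illustration only).
start = {"issues": {"C4": "legend overlap", "S2": "hue only", "P2": "no finding"}}


def stub_evaluate(fig):
    return dict(fig["issues"])


def fix_everything(fig, failures):
    return {"issues": {}}


def fix_nothing(fig, failures):
    return fig


fixed = run_critic_loop(start, stub_evaluate, fix_everything)
stuck = run_critic_loop(start, stub_evaluate, fix_nothing)
print(fixed["round"], fixed["warning"])
print(stuck["round"], stuck["warning"])
```

In the first run the figure passes on round 2 and the loop stops; in the second it exhausts all three rounds and is returned with the warning flag set and its remaining failures listed.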

Quality Warning Format

[!] Quality warning: Figure output with 3 unresolved issues after 3 revision rounds.

Remaining issues:
- [S2] Colorblind safety: series 3 and 5 may be confused under deuteranopia
- [C4] Minor legend overlap with rightmost data point
- [P2] Caption describes the setup but does not state the key finding

Recommendation: Manual review before submission.
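
For completeness, the warning block can be generated from the remaining failures. The helper name `format_quality_warning` and its input shape (item ID mapped to description) are assumptions for this sketch; the wording follows the format shown above.

```python
from typing import Dict


def format_quality_warning(remaining: Dict[str, str], rounds: int = 3) -> str:
    """Format the quality warning emitted when the final round still fails."""
    count = len(remaining)
    noun = "issue" if count == 1 else "issues"
    lines = [
        f"[!] Quality warning: Figure output with {count} unresolved {noun} "
        f"after {rounds} revision rounds.",
        "",
        "Remaining issues:",
    ]
    # One bullet per remaining FAIL item, prefixed with its checklist ID.
    lines += [f"- [{item}] {desc}" for item, desc in remaining.items()]
    lines += ["", "Recommendation: Manual review before submission."]
    return "\n".join(lines)


warning = format_quality_warning(
    {
        "S2": "Colorblind safety: series 3 and 5 may be confused under deuteranopia",
        "C4": "Minor legend overlap with rightmost data point",
        "P2": "Caption does not state the key finding",
    }
)
print(warning)
```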

Cross-reference: SKILL.md §Step 5, §NEVER Rule #7
