Would you consider adding different sorting strategies for the model ranking list?
Observed behavior:
The current sorting occasionally produces unintuitive ordering: models that perform similarly on green but very differently on orange and red are still ranked by green only.
Example in image:
#16 and #17 have very similar green scores, but differ a lot on orange and red. In this specific case I would assume GPT-5.4 to be a "better" pick than Gemini 3 Pro Preview (Low), given its higher percentage of partially solved challenges (orange), yet it ranks lower.
A similar case occurs just below at #18, #19 and #20, where all three have similar green scores but the bottom one looks slightly more favorable.
Potential suggestions:
Would it make sense to add different sorting methods (or adjust the existing one), perhaps something like:
- sort by green first; when green scores are within 1% of each other, sort by orange
- sort by green plus a weighted share of orange (e.g. orange counts for 1/10th the value of green)
- sort by red ascending (least red first); when tied, sort by orange ascending (probably a bad idea)
- ...
I assume some of these make less sense than others, but I was wondering whether a strategy can be found that is less likely to produce these results.
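
To make the suggestions above concrete, here is a rough Python sketch of the three strategies. It is only an illustration, not how the ranking is actually implemented: the `ModelResult` type, the function names, and the sample percentages are all made up for this example, and the "within 1%" rule is interpreted as grouping entries whose green score is within 1 percentage point of the best entry in the group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    # Hypothetical record; field names are assumptions, not the site's schema.
    name: str
    green: float   # % of challenges fully solved
    orange: float  # % of challenges partially solved
    red: float     # % of challenges failed


def sort_green_only(results):
    """Baseline: rank by green descending (the behavior described above)."""
    return sorted(results, key=lambda r: r.green, reverse=True)


def sort_green_then_orange(results, tolerance=1.0):
    """Rank by green; entries within `tolerance` points of the top of their
    group are treated as tied and ordered by orange descending."""
    ranked, group = [], []
    for r in sort_green_only(results):
        # Start a new group once this entry falls outside the tolerance.
        if group and group[0].green - r.green > tolerance:
            ranked.extend(sorted(group, key=lambda g: g.orange, reverse=True))
            group = []
        group.append(r)
    ranked.extend(sorted(group, key=lambda g: g.orange, reverse=True))
    return ranked


def sort_weighted(results, orange_weight=0.1):
    """Rank by green plus a weighted share of orange, descending."""
    return sorted(
        results,
        key=lambda r: r.green + orange_weight * r.orange,
        reverse=True,
    )


def sort_by_red(results):
    """Rank by red ascending; ties broken by orange ascending."""
    return sorted(results, key=lambda r: (r.red, r.orange))


# Made-up numbers, chosen only to reproduce the kind of case described above.
sample = [
    ModelResult("Model A", green=40.0, orange=10.0, red=50.0),
    ModelResult("Model B", green=39.5, orange=30.0, red=30.5),
    ModelResult("Model C", green=35.0, orange=5.0, red=60.0),
]

for strategy in (sort_green_only, sort_green_then_orange, sort_weighted, sort_by_red):
    print(strategy.__name__, [r.name for r in strategy(sample)])
```

With these sample values, green-only sorting puts Model A ahead of Model B, while the other three strategies put Model B first. One caveat with the tolerance approach: "within 1%" is not transitive, so where the group boundaries fall depends on which entry anchors each group.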