Comment 1 by
Heinz van Kempen
CEGT 2 is finished. We now
have 1088 games for each of the top engines. The website is updated.
http://www.husvankempen.de/nunn/
I have put together some extensive
statistics comparing CEGT 1 and 2
and the performance on Athlon and Pentium respectively. For those who really
want to read more about a project that has already been in progress for some
months now, here is an explanation from my side.
We think that it is
necessary to play at least one thousand games for each engine to draw any
conclusions, because then we have error bars in the rating list of around ±18
ELO points, which is still quite a lot.
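
For those who want to check the figure of around 18 ELO points themselves, here is a rough sketch in Python. It is not the calculation the rating software performs. It assumes an engine scoring about 50%, a 95% confidence level (z = 1.96) and a draw ratio of 30%; the draw ratio in particular is an assumption for illustration, not a CEGT figure, and the function names are made up.

```python
import math

def elo_from_score(p):
    """Convert a score fraction p (0 < p < 1) into an Elo difference."""
    return 400.0 * math.log10(p / (1.0 - p))

def elo_error_bar(n_games, draw_ratio=0.30, z=1.96):
    """Approximate error bar (in Elo) for an engine scoring about 50%.

    draw_ratio is an assumed value for illustration, not a CEGT figure.
    """
    # Per-game variance of the score (1, 0.5 or 0) around a mean of 0.5
    variance = (1.0 - draw_ratio) / 4.0
    # Half-width of the confidence interval for the score fraction
    half_width = z * math.sqrt(variance / n_games)
    # Translate the upper end of the interval into Elo points
    return elo_from_score(0.5 + half_width)

for n in (100, 500, 1000, 2000):
    print(n, round(elo_error_bar(n), 1))
```

Under these assumptions 1000 games give roughly 18 points and 100 games a little under 60. Since the error bar shrinks only with the square root of the number of games, halving it takes four times as many games.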
So let us see:
CEGT 1 and 2 were played on
the same computers using a wide spectrum of all Nunn positions and some general
books. Opponents for the engines were partly different, but this alone does not
explain the huge differences you still have with only 500 games. The
differences are due to statistical aberrations, being of course even more
pronounced with only 300, 200 or 100 games for each engine.
CEGT 1: 4080 games, 16 engines, 510 games for each engine
CEGT 2: 5202 games, 18 engines, 578 games for each engine
In total: 1088 games for each top engine
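
The games-per-engine figures can be checked with a few lines of Python. This small sketch assumes that every engine played the same number of games in a tournament; the function name is made up for illustration.

```python
def games_per_engine(total_games, n_engines):
    """Games played by each engine, assuming all engines played equally often."""
    # Every game is counted once for each of its two participants
    return 2 * total_games / n_engines

cegt1 = games_per_engine(4080, 16)
cegt2 = games_per_engine(5202, 18)
print(cegt1, cegt2, cegt1 + cegt2)
```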
Similarities and striking differences:
Shredder 9 dominated in
both. Its ELO after a total of 1088 games is now exactly 50 points higher than
that of the next best, Fritz 8 Bilbao.
Junior 9 seemed to be the second-best
engine in CEGT 1, scoring 17.5 points more than Fritz 8, but then in CEGT 2 the
same Fritz 8 version scored 41 points more from the games than Junior!
Almost unbelievable. Now the rating of Fritz from the combined CEGTs is 15
points higher than that of Junior.
Hiarcs and Chess Tiger were equally good in both CEGTs, but it is striking that
Chess Tiger performs much better overall on Athlon than on Pentium CPUs.
Gandalf 6, really weird. Rank 4 in
CEGT 1, scoring 15.5 points more from games than Hiarcs. Only rank 9 in CEGT 2,
and 23.5 points less in the overall table than Hiarcs.
Ruffian 2.1.0 in both
tournaments behind most other commercials
List, only rank 10 in CEGT 1,
but number 6 in the total score of CEGT 2 after 578 games and better than
Ruffian and Gandalf in that one
ProDeo with a good performance in
both. After 1088 games for each we have a rating difference of 1 point between
List and ProDeo, so even after 5000 games or more we will not be able to tell
which of the two is the better amateur.
Chessmaster Steadfast is 32 ELO
points better than CMX Yoda, but with only just over 500 games for each setting
it would not be correct to claim that it is better. Believe it or not, just
look at the error bars.
SOS 5 started furiously in CEGT
1 and for a long time seemed to be the best amateur. But after 400, 600,
800 games it dropped and dropped, and now Fruit is ahead of SOS, because Fruit
performed better the longer the tournaments lasted. Anyway, after 1088 games
SOS 5 is still ahead of Aristarch, which still means a considerable improvement
over SOS 4.
After seeing all this we do
not dare to draw any conclusions at all for those engines where we have only
just over 500 games so far.
Considering all this, and
regardless of the time control, whether Blitz or the longer time controls we
are using, it is for example absurd to state after only 100 games that Engine X
has improved by 50 ELO points, or that it can already be seen that a new
version is not better than the previous one.
Let me give an example:
You have Engine X version
3.0 and play 100 games against different opponents, and you have Engine X
version 4.0 and play 100 games under the same conditions. Then you get the
same rating for both and claim that there is no improvement. You repeat this
and play another 100 games for both and find that the new one is 120 points
better, and you are enthusiastic about the improvement. You play a third time
and now the old version is 120 ELO points better, and you are disappointed. But
when you look at the error bars you find that all the results are statistically
in the normal range, because error bars with 100 games for an engine version
are ±60. And even then, 5% of results still fall outside this statistically
normal range.
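
The example above can be tried out with a small simulation. The following Python sketch is an illustration only, not what we run for CEGT: it lets an engine play many separate runs of 100 games against opposition of exactly equal strength and counts how often the measured rating stays within 60 points of the true value (zero). The draw ratio of 30%, the fixed seed and the function names are all assumptions made for the sketch.

```python
import math
import random

def measured_elo(n_games, rng, draw_ratio=0.30):
    """Play n_games against equally strong opposition; return the measured Elo offset."""
    win_prob = (1.0 - draw_ratio) / 2.0  # wins and losses are equally likely
    score = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < win_prob:
            score += 1.0          # win
        elif r < win_prob + draw_ratio:
            score += 0.5          # draw
        # otherwise a loss, which adds nothing
    # Clamp so that a 0% or 100% score cannot cause a division by zero
    p = min(max(score / n_games, 0.001), 0.999)
    return 400.0 * math.log10(p / (1.0 - p))

def fraction_within(n_runs, n_games, bound, seed=1):
    """Repeat the run n_runs times; return (share within +/-bound, lowest, highest)."""
    rng = random.Random(seed)
    results = [measured_elo(n_games, rng) for _ in range(n_runs)]
    inside = sum(1 for r in results if abs(r) <= bound)
    return inside / n_runs, min(results), max(results)

frac, low, high = fraction_within(n_runs=2000, n_games=100, bound=60)
print(f"within +/-60: {frac:.1%}, extremes: {low:.0f} to {high:.0f}")
```

Under these assumptions the share within the bound should come out in the neighbourhood of 95%, and the extremes show how far a single 100-game run can stray even though the engine has not changed at all.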
Can you then understand how
hard it is to come to any conclusions at all?
This is why we are
continuing to investigate the "truth" and do not post quick
sensational results after only 50 or 100 games for a new engine. It may
happen (if we do not lose our energy) that we will soon even have 1500 or
2000 games for the top commercials and at least 1000 for more and more top
amateurs, which should be a minimum.
On the other hand, something
like this must seem crazy to normal people, but some of you will understand
that it is a lot of fun to run those engine tournaments, and we will probably
continue for a few more months.
Anyway, we should stop
before the men in white coats come with the straitjackets.