Comment 1 by Heinz van Kempen
CEGT 2 is finished. We now have 1088 games for each of the top engines. The website is updated.
I have compiled extensive statistics comparing CEGT 1 and CEGT 2, and the performance on Athlon and Pentium respectively. For those who really want to read more about a project that has already been in progress for some months, here is an explanation from my side.
We think that it is necessary to play at least one thousand games for each engine to draw any conclusions, because then we have error bars in the rating list of around ±18 ELO points, which is still quite a lot.
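For those who want to check the size of such error bars themselves, here is a rough sketch. It is not the method used to produce the CEGT rating list; it is only a normal approximation that assumes equally strong opponents, independent games and a draw rate of 30% (the draw rate, the function name elo_margin and its parameters are my own assumptions for illustration).

```python
import math

def elo_margin(n_games, draw_rate=0.3, z=1.96):
    """Approximate 95% error bar (in ELO points) for one engine's rating.

    Assumes equally strong opposition (expected score 50%) and independent
    games. draw_rate=0.3 is an assumed value, not a CEGT statistic.
    """
    # Per-game standard deviation of the score when wins and losses are equally likely.
    sigma = math.sqrt(1.0 - draw_rate) / 2.0
    # Standard error of the average score after n_games.
    score_error = sigma / math.sqrt(n_games)
    # Slope of the ELO curve at a 50% score: 400 / (ln(10) * 0.25).
    slope = 400.0 / (math.log(10) * 0.25)
    return z * score_error * slope

for n in (100, 500, 1000, 2000):
    print(n, "games: about +/-", round(elo_margin(n)), "ELO")
```

Under these assumptions the margin for 1000 games comes out close to the ±18 quoted above, and it shrinks only with the square root of the number of games, so four times as many games are needed to halve it.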
So let us see:
CEGT 1 and 2 were played on the same computers using a wide spectrum of all Nunn positions and some general books. Opponents for the engines were partly different, but this alone does not explain the huge differences you still have with only 500 games. The differences are due to statistical aberrations, being of course even more pronounced with only 300, 200 or 100 games for each engine.
CEGT 1: 4080 games, 16 engines, 510 games for each engine
CEGT 2: 5202 games, 18 engines, 578 games for each engine
Total: 1088 games for each top engine
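The numbers above can be checked with a few lines, assuming every engine in a tournament played the same number of games (the function name games_per_engine is just for illustration). Each game is counted once in the tournament total but involves two engines, so the games per engine are twice the total divided by the number of engines.

```python
def games_per_engine(total_games, engines):
    # Every game appears once in the total but is played by two engines.
    return 2 * total_games // engines

cegt1 = games_per_engine(4080, 16)
cegt2 = games_per_engine(5202, 18)
print(cegt1, cegt2, cegt1 + cegt2)
```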
Similarities and striking differences:
Shredder 9 dominated in both. Its ELO after a total of 1088 games is now exactly 50 points higher than that of the next best - Fritz 8 Bilbao
Junior 9 seemed to be the second best engine in CEGT 1, scoring 17.5 points more than Fritz 8, but then in CEGT 2 the same Fritz 8 version scored 41 points more from the games than Junior !!! Almost unbelievable. Now the rating of Fritz from the combined CEGTs is 15 points higher than that of Junior
Hiarcs and Chess Tiger were equally good in both CEGTs, but it is striking that Chess Tiger performs overall much better on Athlon than on Pentium CPUs
Gandalf 6, really weird. Rank 4 in CEGT 1, scoring 15.5 points more from the games than Hiarcs. Only rank 9 in CEGT 2 and 23.5 points fewer than Hiarcs in the overall table
Ruffian 2.1.0 in both tournaments behind most other commercials
List, only rank 10 in CEGT 1, but number 6 in the total score of CEGT 2 after 578 games and better than Ruffian and Gandalf in that one
ProDeo with a good performance in both. After 1088 games for each we have 1 point rating difference between List and ProDeo, so we will not even be able to tell after 5000 games or more which one is the best amateur
Chessmaster Steadfast is 32 ELO points better than CMX Yoda, but with only just over 500 games for each setting it would not be correct to claim that it is better. Believe it or not, just look at the error bars
SOS 5 started furiously in CEGT 1 and for a long time it seemed to be the best amateur. But after 400, 600, 800 games it dropped and dropped, and now Fruit is ahead of SOS, because it performed better the longer the tournaments lasted. Anyway, after 1088 games SOS 5 is still ahead of Aristarch, which still means a considerable improvement over SOS 4.
After seeing all this we do not dare to draw any conclusions at all for those engines where we have only just over 500 games so far.
Considering all this, and regardless of the time control, be it Blitz or the longer time controls we are using, it is for example absurd to state after only 100 games that Engine X has improved by 50 ELO points, or that it can already be seen that a new version is not better than the previous one
Let me give an example:
You have Engine X version 3.0 and play 100 games against different opponents, and you have Engine X version 4.0 and play 100 games under the same conditions. You get the same rating for both and claim that there is no improvement. You repeat this, playing another 100 games with each, and find that the new one is 120 points better, and you are enthusiastic about the improvement. You play a third time and now the old version is 120 ELO points better, and you are disappointed. But when you look at the error bars you find that all the results are statistically in the normal range, because the error bars with 100 games for an engine version are ±60. And then there are still 5% of results that fall outside even this statistically normal range.
Can you then understand how hard it is to come to any conclusions at all?
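To illustrate the example, here is a small simulation sketch. It is not taken from our tournaments: it assumes an engine that is exactly as strong as its opposition, independent games and a draw rate of 30% (all function names, the draw rate, the number of trials and the seed are assumptions of mine). It plays many runs of 100 games and counts how often the measured rating lands more than 60 ELO points away from the true value of zero.

```python
import math
import random

def elo_from_score(p):
    # Convert a score fraction into an ELO difference; clamp to avoid log of zero.
    p = min(max(p, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / p - 1.0)

def simulate_match(rng, n_games, draw_rate):
    # Equal strength: wins and losses are equally likely, draws fill the rest.
    win_prob = (1.0 - draw_rate) / 2.0
    points = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < win_prob:
            points += 1.0
        elif r < win_prob + draw_rate:
            points += 0.5
    return points / n_games

def fraction_outside(n_games=100, bar=60.0, trials=2000, draw_rate=0.3, seed=1):
    # Share of simulated runs whose measured rating misses the truth by more than `bar`.
    rng = random.Random(seed)
    outside = 0
    for _ in range(trials):
        elo = elo_from_score(simulate_match(rng, n_games, draw_rate))
        if abs(elo) > bar:
            outside += 1
    return outside / trials

print("share of 100-game runs outside +/-60 ELO:", fraction_outside())
```

Under these assumptions only a few percent of the runs should fall outside the ±60 band, but nothing has changed in the engine between those runs; the swings come from chance alone.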
This is why we are continuing to investigate the "truth" and do not post quick sensational results after only 50 or 100 games for a new engine. It may happen (if we do not lose our energy) that we will soon even have 1500 or 2000 games for the top commercials and at least 1000 for more and more top amateurs, what should be
On the other hand something like this seems to be crazy for normal people, but some of you will understand that it is a lot of fun to run those engine tournaments and probably we will continue a few months more.
Anyway, we should stop before the men in the white coats come with the straitjackets.