Comment 2 by
Heinz van Kempen
Okay, as interest in Ktulu seems to be very high, we will try
to give a first careful impression. At the same time we want to show,
by describing how the rating developed, how much such a test can
fluctuate because of statistical anomalies. In any case, no one should really
expect an ELO 100 points better than the previous version.
Did we ever have an engine that
improved by that much at this level so far?
But maybe it comes close to this. So
let us see.
Anyone out there who missed our
comparison for Junior, Fritz and Gandalf in CEGT 1 and 2?
Anyone still believing that 150 games
for each engine already give a reliable picture of how well a new engine
version performs, and that it is not necessary to play 1000 games or more for each
engine to claim anything well founded?
This is not to criticize other testers:
we also like all the results posted, even those with only four games. All results
combined also give a good picture, and not all testers have several machines or
like to combine their results with others.
We saw a lot of results posted, some
sensational, some mediocre or as expected, some poor under Winboard, where
bugs in Ktulu were reported to show up.
Now have a look at Ktulu 7.0 in CEGT 3 and consider it at this point as another way
to evaluate the strength of the new Ktulu.
We played the first 150 games for
Ktulu in CEGT 3 and after that we got the following incredible and sensational
rating list
(combined games from Charles,
Christian and me, time control 40/40 adapted to a 2 GHz P4 CPU via Crafty
benchmark, corresponding to AEGT):
| Rank | Program | Elo | + | - | Games | Score | Av.Op. | Draws |
|---|---|---|---|---|---|---|---|---|
| 1 | Shredder 9 | 2750 | 17 | 17 | 1237 | 69.8 % | 2604 | 28.2 % |
| 2 | Fritz 8 | 2699 | 18 | 18 | 1088 | 62.1 % | 2613 | 28.6 % |
| 3 | Ktulu 7.0 | 2693 | 49 | 48 | 150 | 67.3 % | 2567 | 30.7 % |
| 4 | Junior 9 | 2682 | 16 | 16 | 1238 | 60.4 % | 2609 | 29.9 % |
| 5 | Hiarcs 9 | 2653 | 17 | 17 | 1088 | 55.3 % | 2616 | 33.1 % |
| 6 | Gandalf 6.0 | 2648 | 17 | 17 | 1088 | 54.6 % | 2616 | 31.4 % |
| 7 | Chess Tiger 15.0 | 2643 | 16 | 16 | 1088 | 53.8 % | 2617 | 36.9 % |
| 8 | CM 10000 Steadfast | 2639 | 24 | 24 | 510 | 52.5 % | 2622 | 34.7 % |
| 9 | Ruffian 2.1.0 | 2629 | 17 | 17 | 1088 | 51.7 % | 2618 | 32.9 % |
| 10 | List 512 | 2618 | 17 | 17 | 1088 | 50.0 % | 2618 | 33.2 % |
| 11 | Pro Deo 1.1 | 2617 | 17 | 17 | 1088 | 49.8 % | 2618 | 29.7 % |
| 12 | Spike 0.9 | 2616 | 47 | 47 | 150 | 56.0 % | 2574 | 29.3 % |
| 13 | CMX Yoda | 2607 | 23 | 23 | 578 | 49.0 % | 2614 | 33.4 % |
| 14 | Fruit 2.0 | 2589 | 17 | 17 | 1088 | 45.5 % | 2620 | 29.0 % |
| 15 | SOS 5 for Arena | 2586 | 17 | 17 | 1088 | 45.1 % | 2620 | 34.4 % |
| 16 | Deep Sjeng 1.6 | 2580 | 26 | 26 | 510 | 43.4 % | 2626 | 28.4 % |
| 17 | Aristarch 4.50 | 2577 | 17 | 17 | 1088 | 43.7 % | 2621 | 31.2 % |
| 18 | SlowChess Blitz WV | 2572 | 20 | 20 | 728 | 45.0 % | 2607 | 36.7 % |
| 19 | Ktulu 5.1 | 2562 | 26 | 26 | 510 | 40.8 % | 2627 | 28.6 % |
| 20 | Thinker 4.7a | 2556 | 23 | 23 | 578 | 41.3 % | 2617 | 34.9 % |
| 21 | DanChess CCT7 | 2552 | 45 | 46 | 150 | 46.3 % | 2577 | 34.0 % |
| 22 | Zappa 1.0 | 2549 | 47 | 47 | 150 | 46.3 % | 2574 | 30.0 % |
| 23 | Anaconda 2.0.1 | 2549 | 21 | 21 | 728 | 41.4 % | 2609 | 33.7 % |
| 24 | Delfi 4.5 | 2540 | 21 | 21 | 728 | 40.1 % | 2610 | 31.9 % |
| 25 | Pharaon 3.2 | 2536 | 21 | 21 | 728 | 39.6 % | 2610 | 35.2 % |
| 26 | AnMon5.50 | 2535 | 45 | 46 | 150 | 43.3 % | 2581 | 34.7 % |
| 27 | Naum 1.7 | 2522 | 53 | 53 | 104 | 42.3 % | 2576 | 38.5 % |
| 28 | Patriot 1.3.0 | 2503 | 27 | 27 | 510 | 32.4 % | 2631 | 25.5 % |
| 29 | Yace 0.99.87 | 2465 | 49 | 50 | 149 | 33.6 % | 2584 | 26.8 % |
| 30 | Amyan 1.595 | 2447 | 56 | 57 | 114 | 31.6 % | 2581 | 28.1 % |
What prevented us from posting those
results in all the fora a few days ago when we had them?
Mainly the knowledge we collected over
years of testing, and being aware that we had similar cases before, although not
as extremely pronounced.
So what do we have here?
A new sensation, an engine performing
on the level of the best from ChessBase, being more than 130 ELO better than
Ktulu 5.1, being the new number 3 in our list?!?
No, we thought, the most probable
thing we have here is a very good and improved engine that happened to start
with very positive results, like tossing a coin ten times and seeing it
land on tails seven times. So we decided to be careful and not
to publish these results, because we want people to trust our rating list, which would not be the
case if they then saw Ktulu dropping like a stone afterwards. What was still
possible after already 150 games?
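To put a number on the coin comparison, here is a minimal sketch that computes how often a fair coin shows tails at least seven times in ten tosses. The function name `prob_at_least` is illustrative, and a fair coin (p = 0.5) is an assumption of the example.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Seven or more tails in ten tosses of a fair coin.
p_seven_plus = prob_at_least(7, 10)
print(f"P(at least 7 tails in 10 tosses) = {p_seven_plus:.3f}")
```

Such a run happens in roughly one series out of six, so a lucky start is nothing unusual.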
Anyway we thought ... look at the
error bars: even if Ktulu drops by the maximum there would still remain a
rating of 2645 ELO (2693-48), which would be 83 points more than Ktulu 5.1 and
a very remarkable improvement at this high level, where usually every single
point means a lot of work, testing, removing bugs, adding useful things.
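As a rough cross-check of the error bars quoted above, the sketch below estimates a 95% margin from the score, the draw ratio and the number of games alone. It rests on several assumptions: a normal approximation for the score, the logistic Elo curve, and all opponents treated as one average rating. It is not the method used to compute the list, and the function names are illustrative.

```python
import math

def elo_diff(score):
    """Elo difference implied by an expected score (logistic Elo curve)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_margins(score, draw_ratio, games, z=1.96):
    """Approximate (plus, minus) Elo margins for a given result.

    score and draw_ratio are fractions (0.673, not 67.3).
    """
    wins = score - draw_ratio / 2.0
    # Per-game variance of a result that is 1, 0.5 or 0.
    variance = wins + 0.25 * draw_ratio - score ** 2
    half_width = z * math.sqrt(variance / games)
    centre = elo_diff(score)
    plus = elo_diff(score + half_width) - centre
    minus = centre - elo_diff(score - half_width)
    return plus, minus

# Ktulu 7.0 after 150 games: 67.3 % score, 30.7 % draws.
plus_150, minus_150 = elo_margins(0.673, 0.307, 150)
# Same score and draw ratio, but hypothetically after 1000 games.
plus_1000, minus_1000 = elo_margins(0.673, 0.307, 1000)
print(f"150 games:  +{plus_150:.0f} / -{minus_150:.0f}")
print(f"1000 games: +{plus_1000:.0f} / -{minus_1000:.0f}")
```

With Ktulu's figures from the list this lands within a few points of the published +49/-48. With the same score and draw ratio over 1000 games the margins shrink to roughly 20 points or less.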
The next double round robin came and
Ktulu only scored 11 out of 26 games against Shredder, Junior and some of the
best amateur engines (in CEGT 4 Ktulu will also play against Fritz, Hiarcs and
Gandalf).
What would happen to the rating list
adding only these 26 games? Not much, because we already had 150 games before?
So take a look:
| Rank | Program | Elo | + | - | Games | Score | Av.Op. | Draws |
|---|---|---|---|---|---|---|---|---|
| 1 | Shredder 9 | 2750 | 17 | 17 | 1239 | 69.9 % | 2604 | 28.2 % |
| 2 | Fritz 8 | 2699 | 18 | 18 | 1088 | 62.1 % | 2613 | 28.6 % |
| 3 | Junior 9 | 2683 | 16 | 16 | 1240 | 60.5 % | 2609 | 29.8 % |
| 4 | Ktulu 7.0 | 2664 | 46 | 45 | 176 | 63.6 % | 2567 | 26.1 % |
| 5 | Hiarcs 9 | 2653 | 17 | 17 | 1088 | 55.3 % | 2616 | 33.1 % |
| 6 | Gandalf 6.0 | 2649 | 17 | 17 | 1088 | 54.6 % | 2617 | 31.4 % |
| 7 | Chess Tiger 15.0 | 2643 | 16 | 16 | 1088 | 53.8 % | 2617 | 36.9 % |
| 8 | CM 10000 Steadfast | 2639 | 24 | 24 | 510 | 52.5 % | 2622 | 34.7 % |
| 9 | Ruffian 2.1.0 | 2629 | 17 | 17 | 1088 | 51.7 % | 2618 | 32.9 % |
| 10 | List 512 | 2619 | 17 | 17 | 1088 | 50.0 % | 2618 | 33.2 % |
| 11 | Pro Deo 1.1 | 2617 | 17 | 17 | 1088 | 49.8 % | 2619 | 29.7 % |
| 12 | Spike 0.9 | 2614 | 47 | 47 | 152 | 55.9 % | 2573 | 28.9 % |
| 13 | CMX Yoda | 2608 | 23 | 23 | 578 | 49.0 % | 2614 | 33.4 % |
| 14 | Fruit 2.0 | 2589 | 17 | 17 | 1088 | 45.5 % | 2620 | 29.0 % |
| 15 | SOS 5 for Arena | 2586 | 17 | 17 | 1088 | 45.1 % | 2620 | 34.4 % |
| 16 | Deep Sjeng 1.6 | 2580 | 26 | 26 | 510 | 43.4 % | 2626 | 28.4 % |
| 17 | Aristarch 4.50 | 2577 | 17 | 17 | 1088 | 43.7 % | 2621 | 31.2 % |
| 18 | SlowChess Blitz WV | 2572 | 20 | 20 | 730 | 45.0 % | 2607 | 36.6 % |
| 19 | Ktulu 5.1 | 2563 | 26 | 26 | 510 | 40.8 % | 2628 | 28.6 % |
| 20 | Thinker 4.7a | 2557 | 23 | 23 | 578 | 41.3 % | 2617 | 34.9 % |
| 21 | Anaconda 2.0.1 | 2549 | 21 | 21 | 730 | 41.4 % | 2609 | 33.6 % |
| 22 | Zappa 1.0 | 2548 | 47 | 47 | 152 | 46.4 % | 2573 | 29.6 % |
| 23 | DanChess CCT7 | 2547 | 45 | 45 | 152 | 45.7 % | 2576 | 33.6 % |
| 24 | Delfi 4.5 | 2539 | 21 | 21 | 730 | 40.0 % | 2610 | 31.8 % |
| 25 | Pharaon 3.2 | 2537 | 20 | 21 | 730 | 39.7 % | 2610 | 35.1 % |
| 26 | AnMon5.50 | 2534 | 45 | 45 | 152 | 43.4 % | 2580 | 34.2 % |
| 27 | Naum 1.7 | 2528 | 53 | 53 | 106 | 43.4 % | 2575 | 37.7 % |
| 28 | Patriot 1.3.0 | 2503 | 27 | 27 | 510 | 32.4 % | 2632 | 25.5 % |
| 29 | Yace 0.99.87 | 2466 | 49 | 50 | 151 | 33.8 % | 2583 | 26.5 % |
| 30 | Amyan 1.595 | 2449 | 56 | 57 | 116 | 31.9 % | 2580 | 27.6 % |
Ktulu lost 29 rating points through these 26 games alone
and dropped to rank 4 in our list. But also look at the error
bars. In the previous list it looked like a "guaranteed" minimum ELO of
2645; now we have 2664, but a possible minus value of 45 could still be
subtracted in the worst case!!!
In the worst case?
No, we can't trust this: the first rating list, after only 150 games,
suggested the same thing. It
showed a minimum of 2645, but now the minimum should be 2619.
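The size of this drop can be reproduced approximately from the published figures. The sketch below pools the first 150 games with the 11 points from the next 26 and converts the score into a performance rating against the average opponent (Av.Op. 2567 in both lists). This is a simplification under the logistic Elo curve, not the iterative calculation behind the list, and the helper name `performance` is illustrative.

```python
import math

def performance(avg_opponent, score):
    """Performance rating from average opponent rating and score fraction."""
    return avg_opponent - 400.0 * math.log10(1.0 / score - 1.0)

# First list: 150 games at 67.3 %, i.e. about 101 points.
points_before = 0.673 * 150
# Next double round robin: 11 points out of 26 games.
points_after = points_before + 11
score_after = points_after / (150 + 26)

rating_before = performance(2567, 0.673)
rating_after = performance(2567, score_after)
print(f"score after 176 games: {100 * score_after:.1f} %")
print(f"rating: {rating_before:.0f} -> {rating_after:.0f}")
```

The pooled score comes out at about 63.6 % and the rating close to 2664, in line with the second list: 26 weak games are enough to move a 150-game rating by almost 30 points.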
So is Ktulu one of those 5% of engines
whose results still fall outside the statistical error bars? What are the reasons? Is
Ktulu so unbalanced that it can win with tactical blows against the best, like
Shredder, and on the other hand lose against much weaker engines
because of some weaknesses in endgames? We will also give these statistics
below, and a look at the games will help to clarify. For the moment it
seems that the scores against Shredder and Junior are really good, but there are
problems against certain amateurs.
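How often should an engine land outside its error bars at all? The following sketch assumes that the error bars are 95% intervals and that the 30 engines in the list are estimated independently. The second assumption does not hold exactly in a shared rating pool, so the result is only an order of magnitude. The function name is illustrative.

```python
def prob_any_outside(n_engines, coverage=0.95):
    """Chance that at least one of n independent intervals misses."""
    return 1.0 - coverage ** n_engines

# Expected number of engines outside their interval in a 30-engine list.
expected_outside = 30 * (1 - 0.95)
p_any = prob_any_outside(30)
print(f"expected engines outside: {expected_outside:.1f}")
print(f"chance of at least one:   {p_any:.2f}")
```

Under these assumptions one or two engines per list are expected to sit outside their interval at any time, so one outlier among thirty is the normal case and not an exception.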
Let us continue ... the next double
round robin. We were curious now and played some gauntlets in advance. Ktulu
scored 14 points out of 26. Not bad, but not enough to keep this still high
ELO: 12 more points were subtracted (2664 to 2652) and Ktulu dropped to rank 5:
| Rank | Program | Elo | + | - | Games | Score | Av.Op. | Draws |
|---|---|---|---|---|---|---|---|---|
| 1 | Shredder 9 | 2750 | 17 | 17 | 1241 | 69.9 % | 2603 | 28.1 % |
| 2 | Fritz 8 | 2698 | 18 | 18 | 1088 | 62.1 % | 2613 | 28.6 % |
| 3 | Junior 9 | 2682 | 16 | 16 | 1242 | 60.5 % | 2608 | 29.8 % |
| 4 | Hiarcs 9 | 2653 | 17 | 17 | 1088 | 55.3 % | 2615 | 33.1 % |
| 5 | Ktulu 7.0 | 2652 | 43 | 42 | 202 | 62.4 % | 2565 | 24.8 % |
| 6 | Gandalf 6.0 | 2648 | 17 | 17 | 1088 | 54.6 % | 2616 | 31.4 % |
| 7 | Chess Tiger 15.0 | 2642 | 16 | 16 | 1088 | 53.8 % | 2616 | 36.9 % |
| 8 | CM 10000 Steadfast | 2639 | 24 | 24 | 510 | 52.5 % | 2622 | 34.7 % |
| 9 | Ruffian 2.1.0 | 2628 | 17 | 17 | 1088 | 51.7 % | 2617 | 32.9 % |
| 10 | List 512 | 2618 | 17 | 17 | 1088 | 50.0 % | 2618 | 33.2 % |
| 11 | Pro Deo 1.1 | 2616 | 17 | 17 | 1088 | 49.8 % | 2618 | 29.7 % |
| 12 | Spike 0.9 | 2608 | 47 | 47 | 154 | 55.2 % | 2572 | 28.6 % |
| 13 | CMX Yoda | 2607 | 23 | 23 | 578 | 49.0 % | 2613 | 33.4 % |
| 14 | Fruit 2.0 | 2588 | 17 | 17 | 1088 | 45.5 % | 2619 | 29.0 % |
| 15 | SOS 5 for Arena | 2586 | 17 | 17 | 1088 | 45.1 % | 2620 | 34.4 % |
| 16 | Deep Sjeng 1.6 | 2580 | 26 | 26 | 510 | 43.4 % | 2626 | 28.4 % |
| 17 | Aristarch 4.50 | 2576 | 17 | 17 | 1088 | 43.7 % | 2620 | 31.2 % |
| 18 | SlowChess Blitz WV | 2570 | 20 | 20 | 732 | 44.9 % | 2606 | 36.5 % |
| 19 | Ktulu 5.1 | 2562 | 26 | 26 | 510 | 40.8 % | 2627 | 28.6 % |
| 20 | Thinker 4.7a | 2556 | 23 | 23 | 578 | 41.3 % | 2616 | 34.9 % |
| 21 | Anaconda 2.0.1 | 2548 | 21 | 21 | 732 | 41.5 % | 2608 | 33.5 % |
| 22 | Zappa 1.0 | 2547 | 46 | 47 | 154 | 46.4 % | 2572 | 29.2 % |
| 23 | DanChess CCT7 | 2546 | 45 | 45 | 154 | 45.8 % | 2575 | 33.1 % |
| 24 | Delfi 4.5 | 2538 | 21 | 21 | 732 | 40.0 % | 2609 | 31.8 % |
| 25 | Pharaon 3.2 | 2537 | 20 | 21 | 732 | 39.8 % | 2608 | 35.1 % |
| 26 | Naum 1.7 | 2532 | 52 | 52 | 108 | 44.0 % | 2574 | 38.0 % |
| 27 | AnMon5.50 | 2529 | 45 | 45 | 154 | 42.9 % | 2579 | 33.8 % |
| 28 | Patriot 1.3.0 | 2503 | 27 | 27 | 510 | 32.4 % | 2631 | 25.5 % |
| 29 | Yace 0.99.87 | 2461 | 49 | 50 | 153 | 33.3 % | 2581 | 26.1 % |
| 30 | Amyan 1.595 | 2453 | 55 | 56 | 118 | 32.6 % | 2579 | 28.0 % |
The next 26 games also gave Ktulu only 13.5
points out of 26, and another 8 points were lost. Ktulu dropped to rank 6,
but stayed close to Gandalf. So far we had the following progression:
| Number of games | Rating | Change | + | - |
|---|---|---|---|---|
| 150 | 2693 | +15 | 49 | 48 |
| 176 | 2664 | -29 | 46 | 45 |
| 202 | 2652 | -12 | 43 | 42 |
| 228 | 2644 | -8 | 40 | 40 |
Between games 150 and 254, 52 rating points
were lost, almost unbelievable again. Would it go down further? No, Ktulu
started to have good results again: 15, 15.5, 16 and 16.5
points out of 26. More good results from Christian and Charles also came in.
| Number of games | Rating | Change | + | - |
|---|---|---|---|---|
| 96 | 2678 | 0 | 58 | 57 |
| 150 | 2693 | +15 | 49 | 48 |
| 176 | 2664 | -29 | 46 | 45 |
| 202 | 2652 | -12 | 43 | 42 |
| 228 | 2644 | -8 | 40 | 40 |
| 254 | 2641 | -3 | 38 | 38 |
| 280 | 2642 | +1 | 36 | 26 |
| 306 | 2643 | +1 | 34 | 34 |
| 319 | 2647 | +3 | 34 | 33 |
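The progression can be checked against the error bar of the 150-game list. The short sketch below takes the rating after each stage from the table, computes the lower bound that the first list suggested (2693 minus 48) and lists the stages at which the rating fell below it. The variable names are illustrative.

```python
# (number of games, rating) for Ktulu 7.0, taken from the progression table.
progress = [
    (150, 2693), (176, 2664), (202, 2652), (228, 2644),
    (254, 2641), (280, 2642), (306, 2643), (319, 2647),
]

first_games, first_rating = progress[0]
lower_bound = first_rating - 48  # "guaranteed" minimum after 150 games

# Stages at which the rating dropped below that supposed minimum.
below = [games for games, rating in progress[1:] if rating < lower_bound]

# Largest loss relative to the 150-game rating.
lowest = min(rating for _, rating in progress)
max_drop = first_rating - lowest

print(f"lower bound after {first_games} games: {lower_bound}")
print(f"stages below the bound: {below}")
print(f"largest drop: {max_drop} points")
```

Between 228 and 306 games the rating sat a few points below the supposed minimum of 2645, and the largest drop is the 52 points mentioned above, before the rating recovered to 2647.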
| Rank | Program | Elo | + | - | Games | Score | Av.Op. | Draws |
|---|---|---|---|---|---|---|---|---|
| 1 | Shredder 9 | 2750 | 17 | 17 | 1263 | 70.0 % | 2602 | 28.2 % |
| 2 | Fritz 8 | 2698 | 18 | 18 | 1088 | 62.1 % | 2612 | 28.6 % |
| 3 | Junior 9 | 2682 | 16 | 16 | 1263 | 60.7 % | 2607 | 29.9 % |
| 4 | Hiarcs 9 | 2652 | 17 | 17 | 1088 | 55.3 % | 2615 | 33.1 % |
| 5 | Gandalf 6.0 | 2647 | 17 | 17 | 1088 | 54.6 % | 2615 | 31.4 % |
| 6 | Ktulu 7.0 | 2647 | 34 | 33 | 319 | 61.8 % | 2564 | 26.3 % |
| 7 | Chess Tiger 15.0 | 2642 | 16 | 16 | 1088 | 53.8 % | 2616 | 36.9 % |
| 8 | CM 10000 Steadfast | 2638 | 24 | 24 | 510 | 52.5 % | 2621 | 34.7 % |
| 9 | Ruffian 2.1.0 | 2628 | 17 | 17 | 1088 | 51.7 % | 2616 | 32.9 % |
| 10 | List 512 | 2617 | 17 | 17 | 1088 | 50.0 % | 2617 | 33.2 % |
| 11 | Pro Deo 1.1 | 2616 | 17 | 17 | 1088 | 49.8 % | 2617 | 29.7 % |
| 12 | Spike 0.9a | 2607 | 44 | 43 | 175 | 54.9 % | 2573 | 29.7 % |
| 13 | CMX Yoda | 2606 | 23 | 23 | 578 | 49.0 % | 2613 | 33.4 % |
| 14 | Fruit 2.0 | 2588 | 17 | 17 | 1088 | 45.5 % | 2619 | 29.0 % |
| 15 | SOS 5 for Arena | 2585 | 17 | 17 | 1088 | 45.1 % | 2619 | 34.4 % |
| 16 | Deep Sjeng 1.6 | 2579 | 26 | 26 | 510 | 43.4 % | 2625 | 28.4 % |
| 17 | Aristarch 4.50 | 2576 | 17 | 17 | 1088 | 43.7 % | 2620 | 31.2 % |
| 18 | SlowChess Blitz WV | 2570 | 20 | 20 | 753 | 45.0 % | 2605 | 37.1 % |
| 21 | DanChess CCT7 | 2553 | 43 | 43 | 175 | 46.6 % | 2577 | 30.3 % |
| 22 | Anaconda 2.0.1 | 2547 | 20 | 20 | 753 | 41.4 % | 2607 | 33.6 % |
| 24 | Pharaon 3.2 | 2537 | 20 | 20 | 753 | 40.0 % | 2608 | 34.5 % |
| 25 | Delfi 4.5 | 2535 | 21 | 21 | 753 | 39.6 % | 2608 | 31.7 % |
| 28 | Patriot 1.3.0 | 2502 | 27 | 27 | 510 | 32.4 % | 2630 | 25.5 % |
| 29 | Amyan 1.595 | 2483 | 49 | 50 | 142 | 36.3 % | 2581 | 28.9 % |
So what will we have after 1000 games?
Nothing is as sure as uncertainty in this world.
For the moment we can say the
following. Ktulu is a very good engine with tactical strength, and it is fun to
watch its games against the best. The improvement over Ktulu 5.1 in the CEGT
rating list is now 85 ELO, which is more than expected and
still sensational. Whether it will finally be rank 3, 4 or 7 in the world after many
more games, who knows by now? Is this really important? What will the next
Ktulu version bring?
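How firm is the 85-point improvement? A common rule of thumb, used here as an assumption, combines the two error bars as if the ratings were independent (root of the sum of squares). Both engines are rated in the same pool, so the true uncertainty may differ somewhat. The figures are 2647 with a margin of about 34 for Ktulu 7.0 and 2562 with a margin of 26 for Ktulu 5.1; the function name is illustrative.

```python
import math

def difference_with_margin(rating_a, margin_a, rating_b, margin_b):
    """Rating difference and a combined margin, assuming independent errors."""
    diff = rating_a - rating_b
    margin = math.hypot(margin_a, margin_b)
    return diff, margin

diff, margin = difference_with_margin(2647, 34, 2562, 26)
print(f"improvement: {diff} +/- {margin:.0f} ELO")
```

Under this assumption the improvement is 85 points give or take about 43, so the gain itself is well supported, while its exact size is still open.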