Comment 2 by Heinz van Kempen

 

 

Okay, as interest in Ktulu seems to be very high, we will try to give a first, cautious impression. At the same time, by describing how the rating progressed, we want to show how much such a test can fluctuate because of statistical noise. In any case, nobody should really expect a rating 100 Elo points above the previous version.

Have we ever had an engine that improved by that much at this level?

But maybe it comes close to this. So let us see.

 

Anyone out there who missed our comparison for Junior, Fritz and Gandalf in CEGT 1 and 2?

Does anyone still believe that 150 games per engine already give a reliable picture of how well a new engine version performs, and that it is not necessary to play 1000 games or more per engine before making any well-founded claim?

This is not meant as criticism of other testers. We like all the results posted, even those with only four games; combined, they also give a good picture, and not every tester has several machines or wants to pool results with others.
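To put rough numbers on the 150-versus-1000-games point, here is a minimal sketch. It assumes independent games and a normal approximation, and it uses an illustrative engine scoring 55 % with 30 % draws; these figures and the function name `score_margin` are assumptions for illustration, not the method used to compute the CEGT lists.

```python
import math

def score_margin(score, draw_rate, games, z=1.96):
    """Approximate 95 % half-width of the score fraction after `games` games.

    Assumes independent games scored 1 (win), 0.5 (draw) or 0 (loss).
    """
    wins = score - draw_rate / 2.0
    # Per-game variance: E[X^2] - E[X]^2, with X in {0, 0.5, 1}.
    variance = wins + 0.25 * draw_rate - score ** 2
    return z * math.sqrt(variance / games)

# Illustrative values only: 55 % score, 30 % draws.
for n in (4, 150, 1000):
    margin = score_margin(0.55, 0.30, n)
    print(f"{n:5d} games: score known to about +/- {100 * margin:.1f} percentage points")
```

The margin shrinks only with the square root of the number of games, so going from 150 to 1000 games narrows it by a factor of roughly 2.6, not 6.7.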

 

We saw a lot of results posted: some sensational, some mediocre or as expected, and some poor ones under Winboard, where Ktulu bugs were reported to show up.

 

Now have a look at Ktulu 7.0 in CEGT 3 and consider it at this point as another way to evaluate the strength of the new Ktulu.

 

We played the first 150 games for Ktulu in CEGT 3, and after that we got the following incredible and sensational rating list (combined games from Charles, Christian and me; time control 40/40 adapted to a 2 GHz P4 CPU via the Crafty benchmark, corresponding to AEGT):

 

 

 

Rank  Program              Elo    +    -  Games   Score  Av.Op.   Draws
   1  Shredder 9          2750   17   17   1237  69.8 %    2604  28.2 %
   2  Fritz 8             2699   18   18   1088  62.1 %    2613  28.6 %
   3  Ktulu 7.0           2693   49   48    150  67.3 %    2567  30.7 %
   4  Junior 9            2682   16   16   1238  60.4 %    2609  29.9 %
   5  Hiarcs 9            2653   17   17   1088  55.3 %    2616  33.1 %
   6  Gandalf 6.0         2648   17   17   1088  54.6 %    2616  31.4 %
   7  Chess Tiger 15.0    2643   16   16   1088  53.8 %    2617  36.9 %
   8  CM 10000 Steadfast  2639   24   24    510  52.5 %    2622  34.7 %
   9  Ruffian 2.1.0       2629   17   17   1088  51.7 %    2618  32.9 %
  10  List 512            2618   17   17   1088  50.0 %    2618  33.2 %
  11  Pro Deo 1.1         2617   17   17   1088  49.8 %    2618  29.7 %
  12  Spike 0.9           2616   47   47    150  56.0 %    2574  29.3 %
  13  CMX Yoda            2607   23   23    578  49.0 %    2614  33.4 %
  14  Fruit 2.0           2589   17   17   1088  45.5 %    2620  29.0 %
  15  SOS 5 for Arena     2586   17   17   1088  45.1 %    2620  34.4 %
  16  Deep Sjeng 1.6      2580   26   26    510  43.4 %    2626  28.4 %
  17  Aristarch 4.50      2577   17   17   1088  43.7 %    2621  31.2 %
  18  SlowChess Blitz WV  2572   20   20    728  45.0 %    2607  36.7 %
  19  Ktulu 5.1           2562   26   26    510  40.8 %    2627  28.6 %
  20  Thinker 4.7a        2556   23   23    578  41.3 %    2617  34.9 %
  21  DanChess CCT7       2552   45   46    150  46.3 %    2577  34.0 %
  22  Zappa 1.0           2549   47   47    150  46.3 %    2574  30.0 %
  23  Anaconda 2.0.1      2549   21   21    728  41.4 %    2609  33.7 %
  24  Delfi 4.5           2540   21   21    728  40.1 %    2610  31.9 %
  25  Pharaon 3.2         2536   21   21    728  39.6 %    2610  35.2 %
  26  AnMon5.50           2535   45   46    150  43.3 %    2581  34.7 %
  27  Naum 1.7            2522   53   53    104  42.3 %    2576  38.5 %
  28  Patriot 1.3.0       2503   27   27    510  32.4 %    2631  25.5 %
  29  Yace 0.99.87        2465   49   50    149  33.6 %    2584  26.8 %
  30  Amyan 1.595         2447   56   57    114  31.6 %    2581  28.1 %

 

 

 

What prevented us from publishing those results in all the fora a few days ago, when we had them?

Mainly the experience gathered over years of testing, and the awareness that we had seen similar cases before, although none quite this pronounced.

 

So what do we have here?

A new sensation, an engine performing on the level of the best from ChessBase, more than 130 Elo better than Ktulu 5.1, the new number 3 in our list?!?

No, we thought. Most probably what we have here is a very good and improved engine that happened to start with very positive results, like tossing a coin ten times and seeing it land tails seven times. So we decided to be careful and not to publish these results, because we want people to trust our rating list, which would not be the case if they then saw Ktulu dropping like a stone afterwards. Was such a drop still possible even after 150 games?
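The coin comparison can be checked directly. The following small sketch (the helper name `prob_at_least` is ours) computes the probability that a fair coin shows tails at least seven times in ten tosses:

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of at least k successes in n independent trials with success chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Seven or more tails in ten tosses of a fair coin.
print(f"{prob_at_least(7, 10):.3f}")
```

The result is about 17 %, so such a lucky start is nothing unusual.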

 

Anyway, we thought, look at the error bars: even if Ktulu dropped by the maximum, a rating of 2645 Elo (2693 - 48) would still remain. That would be 83 points more than Ktulu 5.1 and a very remarkable improvement at this high level, where usually every single point means a lot of work: testing, removing bugs, adding useful things.
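As a rough cross-check of where such a rating and its error bars come from, here is a sketch using the logistic Elo formula and a normal approximation, fed with Ktulu 7.0's row from the list (67.3 % score, 30.7 % draws, 150 games, average opponent 2567). This is a simplified stand-in, not necessarily the calculation of the rating program that produced the list, and the function names are our own.

```python
import math

def elo_diff(score):
    """Elo difference implied by a score fraction under the logistic Elo model."""
    return 400.0 * math.log10(score / (1.0 - score))

def performance_with_margins(score, draw_rate, games, avg_opponent, z=1.96):
    """Return (rating, plus, minus) from a score against an average opponent rating.

    Assumes independent games and a normal approximation for the score fraction.
    """
    wins = score - draw_rate / 2.0
    variance = wins + 0.25 * draw_rate - score ** 2
    half_width = z * math.sqrt(variance / games)
    rating = avg_opponent + elo_diff(score)
    upper = avg_opponent + elo_diff(score + half_width)
    lower = avg_opponent + elo_diff(score - half_width)
    return rating, upper - rating, rating - lower

# Values taken from the Ktulu 7.0 row of the 150-game list.
rating, plus, minus = performance_with_margins(0.673, 0.307, 150, 2567)
print(f"performance about {rating:.0f}, +{plus:.0f} / -{minus:.0f}")
print(f"approximate 95 % lower bound: {rating - minus:.0f}")
```

Under these assumptions the sketch lands within a few points of the listed 2693 +49/-48. Note that such a lower bound is only a 95 % statement: roughly one result in twenty is expected to fall outside the interval.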

 

The next double round robin came, and Ktulu scored only 11 points out of 26 games against Shredder, Junior and some of the best amateur engines (in CEGT 4 Ktulu will also play against Fritz, Hiarcs and Gandalf).

What would happen to the rating list after adding only these 26 games? Not much, because we already had 150 games before? Take a look:

 

 

   

 

Rank  Program              Elo    +    -  Games   Score  Av.Op.   Draws
   1  Shredder 9          2750   17   17   1239  69.9 %    2604  28.2 %
   2  Fritz 8             2699   18   18   1088  62.1 %    2613  28.6 %
   3  Junior 9            2683   16   16   1240  60.5 %    2609  29.8 %
   4  Ktulu 7.0           2664   46   45    176  63.6 %    2567  26.1 %
   5  Hiarcs 9            2653   17   17   1088  55.3 %    2616  33.1 %
   6  Gandalf 6.0         2649   17   17   1088  54.6 %    2617  31.4 %
   7  Chess Tiger 15.0    2643   16   16   1088  53.8 %    2617  36.9 %
   8  CM 10000 Steadfast  2639   24   24    510  52.5 %    2622  34.7 %
   9  Ruffian 2.1.0       2629   17   17   1088  51.7 %    2618  32.9 %
  10  List 512            2619   17   17   1088  50.0 %    2618  33.2 %
  11  Pro Deo 1.1         2617   17   17   1088  49.8 %    2619  29.7 %
  12  Spike 0.9           2614   47   47    152  55.9 %    2573  28.9 %
  13  CMX Yoda            2608   23   23    578  49.0 %    2614  33.4 %
  14  Fruit 2.0           2589   17   17   1088  45.5 %    2620  29.0 %
  15  SOS 5 for Arena     2586   17   17   1088  45.1 %    2620  34.4 %
  16  Deep Sjeng 1.6      2580   26   26    510  43.4 %    2626  28.4 %
  17  Aristarch 4.50      2577   17   17   1088  43.7 %    2621  31.2 %
  18  SlowChess Blitz WV  2572   20   20    730  45.0 %    2607  36.6 %
  19  Ktulu 5.1           2563   26   26    510  40.8 %    2628  28.6 %
  20  Thinker 4.7a        2557   23   23    578  41.3 %    2617  34.9 %
  21  Anaconda 2.0.1      2549   21   21    730  41.4 %    2609  33.6 %
  22  Zappa 1.0           2548   47   47    152  46.4 %    2573  29.6 %
  23  DanChess CCT7       2547   45   45    152  45.7 %    2576  33.6 %
  24  Delfi 4.5           2539   21   21    730  40.0 %    2610  31.8 %
  25  Pharaon 3.2         2537   20   21    730  39.7 %    2610  35.1 %
  26  AnMon5.50           2534   45   45    152  43.4 %    2580  34.2 %
  27  Naum 1.7            2528   53   53    106  43.4 %    2575  37.7 %
  28  Patriot 1.3.0       2503   27   27    510  32.4 %    2632  25.5 %
  29  Yace 0.99.87        2466   49   50    151  33.8 %    2583  26.5 %
  30  Amyan 1.595         2449   56   57    116  31.9 %    2580  27.6 %

 

 

 

Through these 26 games alone, Ktulu lost 29 rating points and dropped to rank 4 in our list. But look at the error bars as well. In the previous list it looked like a "guaranteed" minimum of 2645 Elo; now we have 2664, but a further 45 points could still be subtracted in the worst case!!!

 

In the worst case?

No, we cannot rely on that; the first rating list, after only 150 games, suggested otherwise. It showed a minimum of 2645, but now the minimum would be 2619.

 

So is Ktulu one of those 5% of engines that fall outside the statistical error margins? What are the reasons? Is Ktulu so unbalanced that it can win with tactical blows against the best, like Shredder, and on the other hand lose against considerably weaker engines because of some weaknesses in endgames? We will also give these statistics below, and a look at the games will help to clarify. For the moment it really seems that the scores against Shredder and Junior are good, but that there are problems against certain amateurs.
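The size of the drop after only 26 extra games can be reproduced approximately from the cumulative score. The sketch below assumes that 67.3 % of 150 games corresponds to 101 points (an inference from the percentage, not a figure given in the lists) and uses the plain logistic Elo formula against the listed average opponent rating; the rating program behind the lists may work differently.

```python
import math

def performance(score, avg_opponent):
    """Rating at which the logistic Elo model predicts `score` against `avg_opponent`."""
    return avg_opponent + 400.0 * math.log10(score / (1.0 - score))

def add_batch(points, games, batch_points, batch_games):
    """Return cumulative (points, games) after adding one more batch of results."""
    return points + batch_points, games + batch_games

# Assumption: 67.3 % of 150 games is taken to mean 101 points.
points, games = 101.0, 150
before = performance(points / games, 2567)

# The double round robin: 11 points out of 26 games.
points, games = add_batch(points, games, 11.0, 26)
after = performance(points / games, 2567)

print(f"score after {games} games: {100 * points / games:.1f} %")
print(f"performance before about {before:.0f}, after about {after:.0f}")
```

With these inputs the cumulative score comes out at the 63.6 % shown in the list, and the implied performance falls by nearly 30 points, in line with the listed change from 2693 to 2664.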

 

Let us continue with the next double round robin. We were curious by now and played some gauntlets in advance. Ktulu scored 14 points out of 26. Not bad, but not enough to keep its still-high Elo: 8 more points were subtracted and Ktulu dropped to rank 5:

 

 

 

Rank  Program              Elo    +    -  Games   Score  Av.Op.   Draws
   1  Shredder 9          2750   17   17   1241  69.9 %    2603  28.1 %
   2  Fritz 8             2698   18   18   1088  62.1 %    2613  28.6 %
   3  Junior 9            2682   16   16   1242  60.5 %    2608  29.8 %
   4  Hiarcs 9            2653   17   17   1088  55.3 %    2615  33.1 %
   5  Ktulu 7.0           2652   43   42    202  62.4 %    2565  24.8 %
   6  Gandalf 6.0         2648   17   17   1088  54.6 %    2616  31.4 %
   7  Chess Tiger 15.0    2642   16   16   1088  53.8 %    2616  36.9 %
   8  CM 10000 Steadfast  2639   24   24    510  52.5 %    2622  34.7 %
   9  Ruffian 2.1.0       2628   17   17   1088  51.7 %    2617  32.9 %
  10  List 512            2618   17   17   1088  50.0 %    2618  33.2 %
  11  Pro Deo 1.1         2616   17   17   1088  49.8 %    2618  29.7 %
  12  Spike 0.9           2608   47   47    154  55.2 %    2572  28.6 %
  13  CMX Yoda            2607   23   23    578  49.0 %    2613  33.4 %
  14  Fruit 2.0           2588   17   17   1088  45.5 %    2619  29.0 %
  15  SOS 5 for Arena     2586   17   17   1088  45.1 %    2620  34.4 %
  16  Deep Sjeng 1.6      2580   26   26    510  43.4 %    2626  28.4 %
  17  Aristarch 4.50      2576   17   17   1088  43.7 %    2620  31.2 %
  18  SlowChess Blitz WV  2570   20   20    732  44.9 %    2606  36.5 %
  19  Ktulu 5.1           2562   26   26    510  40.8 %    2627  28.6 %
  20  Thinker 4.7a        2556   23   23    578  41.3 %    2616  34.9 %
  21  Anaconda 2.0.1      2548   21   21    732  41.5 %    2608  33.5 %
  22  Zappa 1.0           2547   46   47    154  46.4 %    2572  29.2 %
  23  DanChess CCT7       2546   45   45    154  45.8 %    2575  33.1 %
  24  Delfi 4.5           2538   21   21    732  40.0 %    2609  31.8 %
  25  Pharaon 3.2         2537   20   21    732  39.8 %    2608  35.1 %
  26  Naum 1.7            2532   52   52    108  44.0 %    2574  38.0 %
  27  AnMon5.50           2529   45   45    154  42.9 %    2579  33.8 %
  28  Patriot 1.3.0       2503   27   27    510  32.4 %    2631  25.5 %
  29  Yace 0.99.87        2461   49   50    153  33.3 %    2581  26.1 %
  30  Amyan 1.595         2453   55   56    118  32.6 %    2579  28.0 %

 

 

The next 26 games also gave Ktulu only 13.5 points out of 26, and again 8 points were lost. Ktulu dropped to rank 6, but stayed close to Gandalf. So far we had the following progression:

 

Games  Rating  Change    +    -
  150    2693     +15   49   48
  176    2664     -29   46   45
  202    2652      -8   43   42
  228    2644      -8   40   40

 

 

Between game 150 and game 254, 52 rating points were lost, which again is almost unbelievable. Would it go down further? No, Ktulu started to have good results again: 15 points, 15.5 points, 16 points and 16.5 points out of 26. More good results also came in from Christian and Charles.

 

       

Games  Rating  Change    +    -
   96    2678       0   58   57
  150    2693     +15   49   48
  176    2664     -29   46   45
  202    2652      -8   43   42
  228    2644      -8   40   40
  254    2641      -3   38   38
  280    2642      +1   36   26
  306    2643      +1   34   34
  319    2647      +3   34   33

 

 

   

 

 

Rank  Program              Elo    +    -  Games   Score  Av.Op.   Draws
   1  Shredder 9          2750   17   17   1263  70.0 %    2602  28.2 %
   2  Fritz 8             2698   18   18   1088  62.1 %    2612  28.6 %
   3  Junior 9            2682   16   16   1263  60.7 %    2607  29.9 %
   4  Hiarcs 9            2652   17   17   1088  55.3 %    2615  33.1 %
   5  Gandalf 6.0         2647   17   17   1088  54.6 %    2615  31.4 %
   6  Ktulu 7.0           2647   34   33    319  61.8 %    2564  26.3 %
   7  Chess Tiger 15.0    2642   16   16   1088  53.8 %    2616  36.9 %
   8  CM 10000 Steadfast  2638   24   24    510  52.5 %    2621  34.7 %
   9  Ruffian 2.1.0       2628   17   17   1088  51.7 %    2616  32.9 %
  10  List 512            2617   17   17   1088  50.0 %    2617  33.2 %
  11  Pro Deo 1.1         2616   17   17   1088  49.8 %    2617  29.7 %
  12  Spike 0.9a          2607   44   43    175  54.9 %    2573  29.7 %
  13  CMX Yoda            2606   23   23    578  49.0 %    2613  33.4 %
  14  Fruit 2.0           2588   17   17   1088  45.5 %    2619  29.0 %
  15  SOS 5 for Arena     2585   17   17   1088  45.1 %    2619  34.4 %
  16  Deep Sjeng 1.6      2579   26   26    510  43.4 %    2625  28.4 %
  17  Aristarch 4.50      2576   17   17   1088  43.7 %    2620  31.2 %
  18  SlowChess Blitz WV  2570   20   20    753  45.0 %    2605  37.1 %
  21  DanChess CCT7       2553   43   43    175  46.6 %    2577  30.3 %
  22  Anaconda 2.0.1      2547   20   20    753  41.4 %    2607  33.6 %
  24  Pharaon 3.2         2537   20   20    753  40.0 %    2608  34.5 %
  25  Delfi 4.5           2535   21   21    753  39.6 %    2608  31.7 %
  28  Patriot 1.3.0       2502   27   27    510  32.4 %    2630  25.5 %
  29  Amyan 1.595         2483   49   50    142  36.3 %    2581  28.9 %

 

 

So what will we have after 1000 games? Nothing in this world is as certain as uncertainty.

 

For the moment we can say the following. Ktulu is a very good engine with tactical strength, and it is fun to watch its games against the best. The improvement over Ktulu 5.1 in the CEGT rating list is now 85 Elo, which is more than expected and still sensational. Whether it will finally be rank 3, 4 or 7 in the world after many more games, who knows at this point? Is that really important? And what will the next Ktulu version bring?