Evaluation Rating List

I have a dream....

that one day many chess programmers will participate in this new type of competition.


The goal is to improve the evaluation in a new way, that is, without the obstacle of search. Imagine a reasonably strong (open source) engine with a reasonably good search and readable source code, and we replace its evaluation function with our own. What are the advantages and disadvantages?


Advantages


1. It's much easier to discover the weaknesses of your evaluation since search hardly plays its dominant role, if at all. You don't lose (or win) a game because you are outsearched. You lose (or win) a game because of your evaluation.


2. Playing X versus Y -- since the 2 searches are identical -- you are measuring the evaluation strength.


3. If we can determine strength this way, we can create a competition based on fixed-depth games in order to avoid the last issue that may influence the result: engine X and Y spend their time differently, engine X might have a slow evaluation while engine Y has a fast one. With fixed-depth games we eliminate the last obstacle to a reasonably fair estimation of who has the strongest eval within the scope of this project.


4. The learning effect. This will depend on the number of participants, considering the code is open source and GPL.



Disadvantages


I can name a few, but I leave the issue open for public discussion first on Talkchess (aka CCC).


______________________________________________________________________________________________________



How?

The technical part


Step-1 - I found a good candidate in TOGA II 3.0 that will serve as a base. It's Fruit based, uses a mailbox board representation, has readable source code and is probably well known to many programmers. It's CCRL rated 2852, ranked at place 61.


Step-2 - In a nutshell, I isolated the evaluation of my first engine for the PC (Mephisto Gideon, 1993), included it in EVAL.CPP and replaced the TOGA evaluation with GIDEON. Compiled it and played 2 matches of 4000 games each at D=8 and D=10. See results below.


Step-3 - Made a start updating the GIDEON (1993) evaluation to the ProDeo (2017) evaluation. I am halfway and call it REBEL for the moment. Played the same 2 matches against TOGA II 3.0. Results so far:

         DEPTH=8
# PLAYER :  RATING  POINTS PLAYED  (%)
1 Toga3  :  2879.6  4468.5  8000  55.8%
2 Rebel  :  2862.0  4169.5  8000  52.1%
3 Gideon :  2814.4  3362.0  8000  42.0%

Gideon - Toga3  4000 40.7% 
Gideon - Rebel  4000 43.4%

Rebel  - Toga3  4000 47.6%
Rebel  - Gideon 4000 56.6%

         DEPTH=10
# PLAYER :  RATING  POINTS PLAYED  (%)
1 Toga3  :  2882.8  4520.0  8000  56.5%
2 Rebel  :  2866.1  4239.0  8000  53.0%
3 Gideon :  2807.1  3241.0  8000  40.5%

Gideon - Toga3  4000 39.8%
Gideon - Rebel  4000 41.2%

Rebel - Toga3   4000 47.2%
Rebel - Gideon  4000 58.8%

As one can see, the difference between D8 and D10 is not that much, which is a good thing.

______________________________________________________________________________________________________


Download and compilation


Step-1 - Create a project and (only) add the TOGA *.cpp and *.h files to your project.


Step-2 - Open EVAL.CPP and you will see #define ENGINE GIDEON     // 0=TOGA | 1=GIDEON | 2=REBEL


This will compile the GIDEON engine; change GIDEON to REBEL to compile the current REBEL engine, and TOGA will compile the original TOGA II 3.0 engine. Create a new #define to compile your own engine and add the #include files.
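

To give an idea, a minimal sketch of how such a #define based selection could look inside EVAL.CPP; the constant values and the header name below are assumptions for illustration, not the actual TOGA source:

   #define TOGA    0
   #define GIDEON  1
   #define REBEL   2
   #define MYEVAL  3                     // new entry for your own engine

   #define ENGINE  MYEVAL                // 0=TOGA | 1=GIDEON | 2=REBEL | 3=MYEVAL

   #if ENGINE == MYEVAL
   #include "my_eval.h"                  // hypothetical header with your own evaluation
   #endif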


Step-3 - Go to line 347; it's the start of the TOGA evaluation function. Here we convert the TOGA board and color to our own engine format. Then we call our evaluation and return the score in centipawns. That's all there is to it. As a bonus I added the search items ply, alpha and beta for lazy-eval lovers, in case new engine authors want to use this project as a base for creating a chess engine instead of participating in the evaluation competition; that step can always be taken later.
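

As a rough sketch, such a wrapper could look like the fragment below, assuming board_t is the TOGA board type; all other identifiers (my_board_t, my_convert_board, my_evaluate, my_black_to_move) are hypothetical placeholders for your own engine and not the actual TOGA names:

   // Called in place of the original TOGA evaluation. Converts the TOGA
   // board and color to our own format, evaluates, and returns the score
   // in centipawns from the point of view of the side to move.
   static int my_eval_wrapper(const board_t * board, int ply, int alpha, int beta) {

      my_board_t pos;                              // our own board structure

      my_convert_board(board, &pos);               // copy pieces and side to move
                                                   // from the TOGA mailbox board

      int score = my_evaluate(&pos);               // our evaluation, white point of view

      if (my_black_to_move(&pos)) score = -score;  // the caller expects the score
                                                   // for the side to move

      (void) ply; (void) alpha; (void) beta;       // available for lazy-eval,
                                                   // can simply be ignored otherwise
      return score;                                // centipawns
   }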


______________________________________________________________________________________________________


Update


As noted from the number of likes and the poll (and discussion) at Talkchess, the idea was received positively, but nothing happened in practice, which is understandable keeping in mind the work involved for programmers with higher priorities, such as improving their engine the classic way. And I did not expect anything else. Alright, but that doesn't stop me from pouring some oil on the (quenching) fire :-)


Here is what I did: from the collection of engines on my hard drive I selected those that 1) don't extend moves during the first iteration (also called the root-move search) and 2) allow checks in QS, thus exactly as Rebel and ProDeo have always done it, and then played 16,000 depth=1 games. The results are:

ProDeo 2.7 beta - Fruit 2.1       63.4%
ProDeo 2.7 beta - Stockfish 6     45.6%
ProDeo 2.7 beta - Stockfish 7     44.6%
ProDeo 2.7 beta - Stockfish 8     44.8%
ProDeo 2.7 beta - Texel 1.06      59.8%
ProDeo 2.7 beta - Fire 4          66.1%
ProDeo 2.7 beta - Komodo 9.42     46.6%
ProDeo 2.7 beta - Nirvana 2.2     45.6%
ProDeo 2.7 beta - Rodent2         71.3%
ProDeo 2.7 beta - Sting88         60.9%
ProDeo 2.7 beta - Toga3           57.6%
ProDeo 2.7 beta - Strelka2        60.9%
ProDeo 2.7 beta - Arasan 18.3     70.6%
ProDeo 2.7 beta - Crafty 24.1     78.2%


Remarks


None of these numbers are conclusive; there could still be differences in the QS search regarding captures and checks.


Nevertheless, keeping in mind the differences in Elo between ProDeo and the rest of the field, the results suggest the old Rebel | ProDeo evaluation is 1) a lot better than its search, 2) competitive with many higher Elo rated engines and 3) not even far behind the top engines of today, Komodo and Stockfish.


Based on this I would like to repeat what I said at the top of this page: there is value in investing some time in creating such a uniform platform and profiting from each other through competition.


On this page I have suggested TOGA; in the Talkchess discussion Sungorus was suggested, which IMO is an even better choice.