Following the recent meeting in Athens, Nick Faulks, a member of the FIDE Qualification Commission, offers further thoughts on the current FIDE rating system:
Ever since I first became acquainted with the Elo rating system I have intended to look more deeply into its underlying philosophy. However, like most of those who use it on a regular basis, I have been too busy dealing with its details and practicalities to devote sufficient time to that effort. In recent months, three events have drawn my attention back to the fundamentals, and I hope I can add something to the debate currently taking place.
First, I read the paper on the rating system written by GM Dmitry Jakovenko, which approached the issues from a fresh mathematical viewpoint and would have been of great interest even if the writer were not one of the world's top players.
Next, after making some contributions to the ongoing discussion concerning a possible change to the K-factor, I was very pleased to be invited to participate in the "ratings summit" recently held in Athens. Having arrived without fully formed opinions, I was impressed by the depth of knowledge shown by the participants and left with increased confidence that it will be possible for any necessary changes to the system to be made in a rational manner.
Finally, and long after due time, I have obtained (on loan from FIDE Treasurer Nigel Freeman's extensive library) a copy of Prof. Elo's legendary book "The Rating of Chessplayers". After a couple of readings of this I can understand why it is still viewed by experts in the field as a definitive text, even though the second edition (to which I shall refer) was published in 1986. Spurred by this education, I hope that I can express my own thoughts in a useful form.
Future readers of Prof. Elo's work need to be warned that it should be approached more as a developing story than as a conventional textbook. I was greatly alarmed to read in the first chapter that the rating system is necessarily based on the normal distribution. Para 1.31 includes the indented statement
"The many performances of an individual will be normally distributed (Prof. Elo's italics), when evaluated on an appropriate scale."
Such assertions are not uncommon, and are generally justified in one of two ways. The first is the claim that this is a law of nature, which it is not, at least not in any useful way (you can, of course, define an "appropriate scale" to make it true). The second is an appeal to the Central Limit Theorem, which is a very powerful tool in its place, but has no application here. Only when you reach para 8.75, almost at the end of the book, do you find
"Even the assumption of the normal distribution, the vehicle for the entire derivation, is superfluous".
Prof. Elo goes on to recognise that other distributions might be at least equally appropriate, and devotes particular attention to the "fatter-tailed" logistic distribution. However, he does not comment upon the possibility that no suitable distribution might exist. I propose to start my own contribution from this point.
It must be recognised that FIDE's rating system is a highly ambitious project when compared, for instance, to the rankings used in professional tennis. The ATP system is simple, objective and accepted by the players, so it serves its purpose well. It does not, however, make any attempt to determine an expectation for the outcome of a match between two ranked tennis players, and there is not even any great surprise when the bookmakers predict a result which would upset the rankings. By contrast, the Elo rating system does make the claim that, given only the rating difference between two players, it is possible to determine the "expected value" of the outcome.
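To make that claim concrete, here is a minimal sketch of how an expected score might be computed from a rating difference under the two distributions mentioned above. The constants (a standard deviation of 200 points per player for the normal curve, and a scale of 400 for the logistic curve) are the values conventionally quoted for Elo-type systems and are assumptions here, not figures taken from this article. The function names are purely illustrative.

```python
import math

def expected_normal(diff, sigma=200.0):
    """Expected score under a normal model (sigma per player is an assumption)."""
    # The difference of two independent performances has std dev sigma * sqrt(2).
    z = diff / (sigma * math.sqrt(2.0))
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_logistic(diff, scale=400.0):
    """Expected score under a logistic model (scale of 400 is an assumption)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))

# Compare the two curves at a few rating differences.
for diff in (0, 100, 200, 400, 800):
    print(diff, round(expected_normal(diff), 3), round(expected_logistic(diff), 3))
```

The two curves agree closely for small differences and drift apart at large ones, where the logistic curve gives the weaker player slightly better chances. That is the sense in which it is "fatter-tailed".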
The fundamental tenet of the system is that, if you know the expected scores of X playing Y, and of Y playing Z, then you can deduce the expected score of X playing Z. To investigate the implications of this claim, I ask readers to consider the following thought experiment, in six stages.
- Select a group of 1000 players of identical strength, say by holding a large round robin of 3000 players rated Elo 2200 and taking the middle third. We shall call this the 2200 group. Since this is just a thought experiment we can always improve statistical reliability by adding zeroes to the sample sizes.
- Find a similarly sized group of players who each achieve a 75% score in a Scheveningen tournament against the 2200 group. We call this the 2400 group.
- Find a similarly sized group of players who each achieve a 25% score in a Scheveningen tournament against the 2200 group. We call this the 2000 group.
- Find a group of players who each score 50% against an equal number of players from the 2200 group and the 2400 group. This is our 2300 group.
- Find a group of players who each score 50% against an equal number of players from the 2200 group and the 2000 group. This is our 2100 group.
- Play a match between the 2100 group and the 2300 group. We know the result even before the first move is played: the higher group will score precisely 75%.
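The six stages above can be sketched numerically for one particular idealised model. The code below assumes a logistic expected-score curve, scaled so that a 200-point gap yields exactly 75%; that curve, and the helper names `expected` and `solve_rating`, are hypothetical choices for illustration only. Each group's rating is then found by bisection from the score it is defined to achieve.

```python
def expected(diff):
    """Hypothetical curve: logistic, scaled so a 200-point gap gives exactly 75%."""
    return 1.0 / (1.0 + 3.0 ** (-diff / 200.0))

def solve_rating(target, opponents, lo=1000.0, hi=3500.0):
    """Bisect for the rating whose mean expected score against `opponents` is `target`."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        mean_score = sum(expected(mid - r) for r in opponents) / len(opponents)
        if mean_score < target:
            lo = mid   # rating too low to reach the target score
        else:
            hi = mid   # rating high enough, tighten from above
    return (lo + hi) / 2.0

base = 2200.0                                  # stage 1: the 2200 group
upper = solve_rating(0.75, [base])             # stage 2: the "2400 group"
lower = solve_rating(0.25, [base])             # stage 3: the "2000 group"
mid_high = solve_rating(0.50, [base, upper])   # stage 4: the "2300 group"
mid_low = solve_rating(0.50, [base, lower])    # stage 5: the "2100 group"
final_score = expected(mid_high - mid_low)     # stage 6: predicted match score

print(round(upper), round(lower), round(mid_high), round(mid_low), round(final_score, 4))
```

Under this assumed curve the final figure is 75% by construction, since the model has transitivity built in. Real players are under no obligation to follow any such curve.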
But what if, as I suspect is inevitable, this is not the exact outcome? We can always rule out random fluctuations by increasing the group sizes, so I see three ways forward.
A. We can admit that our premise is incorrect, meaning that no perfect rating system can exist and the project should be abandoned. That would be a shame.
B. We can declare that our model is theoretically correct, and therefore that the chessplayers must be at fault. Such an approach is popular with social scientists, but I do not favour it.
C. We can recognise that our rating system will never be perfect, but strive to find parameters which optimise its output and attempt to estimate the inevitable inaccuracies. This would appear to be the most sensible way to proceed.
It is clear that further work will require considerable analysis of historical data, and here we are very fortunate to have the benefit of Jeff Sonas' expertise. His presentation in Athens appeared to show evidence that, while the results of games between players rated over 2200 have generally followed those predicted by the Elo system quite closely, at the lower levels this has been less true. Given what has been said above, I'm not sure whether we should be more surprised by the first half of that statement or the second.
Having reviewed the framework of the system, I hope I am now in a position to do some useful numerical work on some of the issues discussed in Athens, which can be presented in a future commentary. However, one major point of controversy at that meeting, which requires no mathematics, concerns the question of just what information a player's rating is intended to convey. Here, I do not think Prof. Elo's wording can be improved. At the end of 8.75, in what might be considered the final sentence of the main body of his book, he writes
“The system is a hunting system, always seeking the most probable value of the elusive ever-changing player rating.”
It must be noted that we are searching for the best estimate of a player's underlying playing strength, as evidenced by their results. While good or poor runs of form will necessarily impact ratings, the resulting changes are not intended explicitly as a reward or a punishment for these. Nor are ratings published for the excitement of the chess public, who may understandably like to view the top of the list as a form of horse race.
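The "hunting" behaviour can be illustrated with the familiar Elo update, in which a rating moves by K times the difference between the actual and expected score. The sketch below is a simplified illustration under stated assumptions: a logistic expected-score curve with a scale of 400, K = 10, and a hypothetical player whose hidden strength (2300) differs from the published rating (2200). To keep the example deterministic, each game's result is replaced by its long-run average, so this shows the average drift and not a real sequence of results.

```python
def expected(diff):
    """Logistic expected score (the scale of 400 is an assumption)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def update(rating, opponent, score, k=10.0):
    """One Elo-style update: move the rating by k * (actual - expected)."""
    return rating + k * (score - expected(rating - opponent))

true_strength = 2300.0   # hypothetical hidden playing strength
rating = 2200.0          # hypothetical published rating
opponent = 2250.0        # hypothetical fixed opponent rating

history = []
for game in range(300):
    # The long-run mean result stands in for the outcome of a real game.
    average_score = expected(true_strength - opponent)
    rating = update(rating, opponent, average_score)
    history.append(rating)

print(round(history[29]), round(history[-1]))
```

The rating climbs steadily towards the hidden strength without ever being told what it is, and the step size shrinks as the estimate approaches its target. A larger K would close the gap faster, at the cost of reacting more strongly to every individual result.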
The view was expressed in Athens that the Elo system, in conjunction with the current K-factor of 10, does not reflect recent performance closely enough to be the appropriate tool to decide selection into the 2010 candidates matches. That may be correct, but the solution is not to bend the rating system to this purpose. A performance rating, unlike a full rating system, is very simple to specify, and if that is what is required one could easily be used in the selection process. I would recommend that the Presidential Board should at least take this possibility into account in future, but it would be a great mistake to make fundamental changes to the rating system just to improve its suitability for this one very special use.
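As a rough indication of how little is needed to specify a performance rating, here is a minimal sketch. It takes the average rating of the opponents and adds the rating difference at which the expected score equals the percentage actually achieved. The logistic inversion used here is an assumption made for illustration; FIDE's own regulations convert percentages to rating differences with a lookup table, and the function name `performance_rating` and the sample figures are hypothetical.

```python
import math

def performance_rating(opponent_ratings, total_score):
    """Average opponent rating plus the rating difference implied by the score.

    Uses a logistic inversion (an assumption, not FIDE's conversion table).
    """
    games = len(opponent_ratings)
    if games == 0:
        raise ValueError("at least one game is required")
    fraction = total_score / games
    if not 0.0 < fraction < 1.0:
        # A 0% or 100% score implies an unbounded rating difference.
        raise ValueError("performance rating is undefined for 0% or 100% scores")
    average = sum(opponent_ratings) / games
    return average + 400.0 * math.log10(fraction / (1.0 - fraction))

# Hypothetical event: 2.5 points from 4 games against these opponents.
print(round(performance_rating([2700, 2720, 2680, 2750], 2.5)))
```

Unlike a full rating, this figure depends only on the games in the chosen event or period, which is what makes it a candidate for selection purposes without touching the rating system itself.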
Any comments on the points made above, whether in agreement or otherwise, would of course be appreciated. Private comments may be sent to