NON-Computer Rankings

April 6, 2016

Last time I wrote about various ways to combine computer-based ratings' rankings into a single ranking that represents them all, along with a measure of how well they do so. The problem is different for the subjective rankings provided by the various media outlets. These are entertaining (and that's all they're meant to be) but in general are not reliable measurements of relative team quality.

For the subjective human rankings, the only composite I use is a modified median - I order the teams by the rank for which at least half of the rankings rank the team that highly. That's a weaker criterion than I use for computers (where it must be more than half) mainly because with just six rankings there might not be enough teams with a "consensus" ranking to form a top-25. (That sometimes happens even with the weaker criterion.)
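As a rough illustration of the modified median described above, here is a minimal Python sketch. The function name `consensus_ranks`, the `strict` flag, the input layout (one dict of team to rank per poll) and the alphabetical tie-break are all assumptions made for the example, not a description of the actual code behind the reports.

```python
from math import ceil

def consensus_ranks(polls, strict=False):
    """Modified-median composite (illustrative sketch).

    polls:  dict of poll name -> dict of team -> rank (ranked teams only).
    strict: False = "at least half" (media rule),
            True  = "more than half" (computer rule).
    Returns a list of (team, consensus_rank), best first.
    """
    n = len(polls)
    # Number of polls that must rank the team at least this highly.
    need = n // 2 + 1 if strict else ceil(n / 2)

    received = {}
    for ranking in polls.values():
        for team, rank in ranking.items():
            received.setdefault(team, []).append(rank)

    consensus = {}
    for team, ranks in received.items():
        if len(ranks) >= need:
            # The need-th best rank is the best rank that at least
            # `need` polls meet or beat.
            consensus[team] = sorted(ranks)[need - 1]
        # Teams ranked by too few polls get no consensus rank.

    # Tie-break by team name is an assumption, only for determinism.
    return sorted(consensus.items(), key=lambda item: (item[1], item[0]))


example = {
    "A": {"X": 1, "Y": 2, "Z": 3},
    "B": {"X": 2, "Y": 1, "W": 3},
    "C": {"Y": 1, "X": 3, "Z": 2},
    "D": {"X": 1, "Z": 2, "Y": 3},
}
print(consensus_ranks(example))
print(consensus_ranks(example, strict=True))
```

In this made-up example team W is ranked by only one of four polls, so it gets no consensus rank under either rule, which is how a composite can end up with fewer than 25 teams.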

The media rankings don't tell us much about how good the teams are, but it is kind of fun to measure how alike the rankings are and how much that changes as they read each other's rankings. If you know how the rankings are produced, it is also possible to infer which are more "representative" of expert opinion.

The challenge presented by the media rankings stems from the fact that they don't rank every team. Four (Baseball America, USA Today's Coaches' Poll, Perfect Game and D1Baseball.com) rank 25 while two (Collegiate Baseball Newspaper and the National Collegiate Baseball Writers Association) rank 30. Necessarily they do not all rank the same teams, but that would be true even if all ranked the same number of teams (as long as that number is smaller than the number of all teams).

To facilitate rank correlation, for each of the media rankings every team not in its published top N is assigned rank N+1 for that ranking. I then calculate the rank correlations using the teams that are ranked in their respective top N by any ranking. If two teams are both unranked by a ranking, that team-pair is not used in the correlations to the other rankings.
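The N+1 convention can be sketched in a few lines of Python. The helper names `teams_in_play` and `filled_ranks` are hypothetical, and the sketch assumes N is simply the number of teams in a published list.

```python
def teams_in_play(*rankings):
    """Every team that appears in the top N of at least one ranking."""
    return sorted(set().union(*rankings))

def filled_ranks(ranking, teams):
    """Give each team its published rank, or N+1 if it is unranked.

    ranking: dict of team -> rank for one published top N.
    Assumption: N is the number of teams in the published list.
    """
    unranked = len(ranking) + 1
    return {team: ranking.get(team, unranked) for team in teams}


# Toy data: a "top 3" and a "top 4" standing in for a top 25 and a top 30.
top3 = {"X": 1, "Y": 2, "Z": 3}
top4 = {"Y": 1, "W": 2, "X": 3, "V": 4}

teams = teams_in_play(top3, top4)
print(filled_ranks(top3, teams))  # V and W both get rank 4
print(filled_ranks(top4, teams))  # Z gets rank 5
```

V and W are tied at rank 4 in the first ranking, so the V-W pair cannot count as concordant or discordant when that ranking is compared with another.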

The rank correlations I report are based upon comparing the team ranks according to Ranking-I and Ranking-J for each pair of ranked teams TeamX and TeamY. Just count the number of times the same team is ranked "better" in both rankings ("concordant" pairs) and the number of times one team is "better" in one ranking and "worse" in the other ("discordant" pairs), ignoring the pairs that are ranked the same in either or both rankings.

Goodman and Kruskal's gamma

γ = (#concordant − #discordant) / (#concordant + #discordant)

Kendall's tau

τ = (#concordant − #discordant) / #Team-Pairs

The #Team-Pairs forming the denominator for the Kendall's tau calculation depends upon the number of teams ranked in either of the rankings being compared. Teams left unranked by one of the two rankings will show up as ties in that ranking, and you cannot infer concordance or discordance when there is a tie in one of the rankings. In my report for each pair of rankings I include {#teams, #team-pairs} along with τ. The counts of concordant (C) and discordant (D) team-pairs for each pair of rankings make visible how the correlations work.
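The counting and both formulas can be put together in a short Python sketch. It assumes the two inputs already cover the same teams with unranked teams set to N+1, and the function name `rank_correlation` and the returned keys are invented for the illustration.

```python
from itertools import combinations

def rank_correlation(ranks_i, ranks_j):
    """Count concordant/discordant team-pairs and compute gamma and tau.

    ranks_i, ranks_j: dict of team -> rank over the SAME set of teams
    (assumption: unranked teams were already assigned N+1).
    """
    teams = sorted(ranks_i)
    concordant = discordant = 0
    for x, y in combinations(teams, 2):
        di = ranks_i[x] - ranks_i[y]
        dj = ranks_j[x] - ranks_j[y]
        if di * dj > 0:
            concordant += 1      # same team is "better" in both
        elif di * dj < 0:
            discordant += 1      # "better" in one, "worse" in the other
        # di * dj == 0: tied in at least one ranking, so ignored

    n_pairs = len(teams) * (len(teams) - 1) // 2
    decided = concordant + discordant
    diff = concordant - discordant
    return {
        "teams": len(teams),
        "team_pairs": n_pairs,
        "C": concordant,
        "D": discordant,
        "gamma": diff / decided if decided else 0.0,
        "tau": diff / n_pairs if n_pairs else 0.0,
    }


# Five teams; V and W are tied (both unranked, rank 4) in the first ranking.
ranks_i = {"V": 4, "W": 4, "X": 1, "Y": 2, "Z": 3}
ranks_j = {"V": 4, "W": 2, "X": 3, "Y": 1, "Z": 5}
print(rank_correlation(ranks_i, ranks_j))
```

For this toy data there are 10 team-pairs, of which 5 are concordant, 4 are discordant and 1 is a tie, so γ = 1/9 and τ = 1/10: the tie shrinks τ but not γ.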

A few observations and comments follow.

- Generally the number of team-pairs for a given ranking pair (½ × N × (N − 1), where N is the number of teams included in either ranking) will be higher than the sum of concordant and discordant team-pairs for the same ranking pair. Because of ties, |γ| ≥ |τ| always, and the inequality is usually strict.
- The number of discordant pairs for rankings A and B is called the distance between A and B. When the rankings are themselves based upon combinations of rankings (true "polls") the distance tends to be lower. It is no surprise that the Coaches Poll and Baseball Writers Poll are both closer to each other and to the consensus than the rankings not based upon polls.
- One should not use the absolute distances to compare (pairs of) rankings, but for a given ranking-pair the distance over time is expected to decrease. Both correlations are also expected to increase as the season progresses. The same is true for the computer correlations, but for different reasons.
- Speaking of the computers, for those I report "%Concordant", which is related to τ by
  100 × τ = 2 × %Concordant − 100
  because I only include computers that rank all D1 teams. Notice that the majority of the computers are better-correlated than any of the human ranking-pairs.
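Both the |γ| ≥ |τ| claim and the %Concordant relation in the list above follow from the two definitions in a couple of lines. In the sketch below C and D are the concordant and discordant counts, P is shorthand introduced here for #Team-Pairs, and %Concordant is assumed to mean 100 × C / P, which is what the stated relation implies. The inequality also assumes at least one pair is not tied.

```latex
% C = #concordant, D = #discordant, P = #Team-Pairs (shorthand introduced here)
\begin{align*}
\text{No ties: } C + D = P
  &\;\Rightarrow\; \tau = \frac{C - D}{P} = \frac{2C - P}{P} = 2\,\frac{C}{P} - 1 \\
  &\;\Rightarrow\; 100\,\tau = 2 \times \%\text{Concordant} - 100,
  \qquad \%\text{Concordant} = 100\,\frac{C}{P} \\[4pt]
\text{With ties: } C + D \le P
  &\;\Rightarrow\; |\tau| = \frac{|C - D|}{P} \le \frac{|C - D|}{C + D} = |\gamma|
\end{align*}
```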

That last bullet is not a mathematical artifact of having denominator 44,850 rather than #pairs ≤ 703. Quite the opposite - teams ranked from ~51 to ~250 have the ratings from which their rankings are derived "squeezed" towards the average, so the likelihood of discordances is pretty high for those 19,900 team-pairs. Ranking all 300 teams is intrinsically hard (which is why humans don't even give it a serious try). If you want a measure of team quality, go with computer ratings.
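The pair counts quoted above are all instances of ½ × N × (N − 1), which a few lines of Python can confirm. The helper name `team_pairs` is made up for the example, and reading 703 as the pair count for 38 teams is an inference from the arithmetic rather than something stated in the reports.

```python
def team_pairs(n_teams):
    """Number of distinct team-pairs among n_teams: n * (n - 1) / 2."""
    return n_teams * (n_teams - 1) // 2

print(team_pairs(300))           # 44850: all 300 teams
print(team_pairs(38))            # 703: inferred, 38 teams across two media rankings
print(team_pairs(250 - 51 + 1))  # 19900: the 200 teams ranked 51 through 250
```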

This is not meant in any way to disparage human rankings. If I didn't find them as entertaining as they are meant to be I wouldn't go to the considerable trouble of compiling and correlating them. I even use them to highlight upcoming games "between ranked teams." But if I want to know whether team X or team Y has had a better season so far, I look to the rankings that incorporate every game played without knowing the colors on the uniforms.

In memory of
SEBaseball.com

Paul Kislanko