
Percent_rank vs. Cume_dist

PostgreSQL Core Team - EDB

 

During research for my Postgres Window Magic talk, I studied the unusual behavior of percent_rank and cume_dist (cumulative distribution). The Postgres documentation wasn't helpful, so I finally came up with this paragraph to add to the Postgres 10 docs:

 

cume_dist computes the fraction of partition rows that are less than or equal to the current row and its peers, while percent_rank computes the fraction of partition rows that are less than the current row, assuming the current row does not exist in the partition.

There are some key phrases in there. First, cume_dist computes the percentage of rows that are "less than or equal" to the current row, while percent_rank computes a similar percentage of rows that are "less than" the current row, except it also assumes the current row isn't in the partition.
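To make those two definitions concrete, here is a small Python sketch (my own illustration, not the server's implementation) that computes both values straight from the wording above:

```python
def percent_rank(values, x):
    """Fraction of rows less than x, pretending the current row is
    not in the partition: (rows < x) / (N - 1)."""
    n = len(values)
    if n <= 1:
        return 0.0
    return sum(1 for v in values if v < x) / (n - 1)

def cume_dist(values, x):
    """Fraction of rows less than or equal to x: (rows <= x) / N."""
    return sum(1 for v in values if v <= x) / len(values)

rows = [1, 2, 3, 4, 5]  # mirrors generate_series(1, 5)
print([round(percent_rank(rows, x), 2) for x in rows])
# [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(cume_dist(rows, x), 2) for x in rows])
# [0.2, 0.4, 0.6, 0.8, 1.0]
```

The printed lists match the percent_rank and cume_dist columns in the query output below.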

 

That last issue of not counting the current row struck me as odd. I talked to some statisticians about this and did some web searches. The best I can understand is that this special percent_rank behavior is designed to handle cases where there is a single maximum value in the partition. For example:

 

 

SELECT x, (PERCENT_RANK() OVER w)::numeric(10, 2),
       (CUME_DIST() OVER w)::numeric(10, 2)
FROM generate_series(1, 5) AS f(x)
WINDOW w AS (ORDER BY x);
 x | percent_rank | cume_dist
---+--------------+-----------
 1 |         0.00 |      0.20
 2 |         0.25 |      0.40
 3 |         0.50 |      0.60
 4 |         0.75 |      0.80
 5 |         1.00 |      1.00

 

 

In this example, what are the ideal ratios for the last line? Cume_dist is easy since it is "less than or equal" to the current row, so that is clearly 1.00. What about percent_rank, which is "less than"? Without the "assume the current row isn't in the partition" exclusion, it would show 0.80. That would be a silly answer, since every row except the last one is less than the last row. I think this is why the exclusion was added.
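The arithmetic for that last row is easy to check by hand; this quick Python check contrasts the two denominators (percent_rank is documented as (rank - 1) / (total partition rows - 1)):

```python
# Last row of generate_series(1, 5): 4 of the 5 rows are smaller.
n, smaller = 5, 4

without_exclusion = smaller / n        # current row counted in N
with_exclusion = smaller / (n - 1)     # (rank - 1) / (N - 1)

print(without_exclusion)  # 0.8
print(with_exclusion)     # 1.0
```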

 

This issue becomes even more complex when there are many duplicates. For example, imagine this data set:

 

 

WITH cte (x) AS (
        SELECT 0
        UNION ALL
        SELECT 1 FROM generate_series(1, 5)
        UNION ALL
        SELECT 2
)
SELECT x, (PERCENT_RANK() OVER w)::numeric(10, 2),
       (CUME_DIST() OVER w)::numeric(10, 2)
FROM cte
WINDOW w AS (ORDER BY x);
 x | percent_rank | cume_dist
---+--------------+-----------
 0 |         0.00 |      0.14
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 1 |         0.17 |      0.86
 2 |         1.00 |      1.00

 

 

Frankly, neither percent_rank nor cume_dist looks like it produces reasonable results here. If I got a score of 1, is it useful to say that a ratio of 0.17 of the rows are less than mine, or that 0.86 are less than or equal to mine? Imagine this was an SAT test and many people scored 650. Wouldn't it be reasonable to count half of the people with matching scores as less than or equal to mine, which would compute 0.50, rather than including all or none of the matching scores?

 

Right now, there is no way in Postgres to do that, but I think it would be a nice thing to add. I assume a server-side function could be written to do this. Computing the average of percent_rank and cume_dist would yield a similar result.
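A sketch of what such a function could compute (mid_dist is my own hypothetical name, not an existing Postgres function): count the rows strictly below the current value plus half of its peer group, over all N rows. Averaging percent_rank and cume_dist for the same row lands close to the same place:

```python
def mid_dist(values, x):
    """Hypothetical 'midpoint' distribution: rows below x plus
    half of x's peer group, divided by all N rows."""
    n = len(values)
    below = sum(1 for v in values if v < x)
    peers = sum(1 for v in values if v == x)
    return (below + peers / 2) / n

rows = [0] + [1] * 5 + [2]        # the duplicate-heavy data set above
print(round(mid_dist(rows, 1), 2))   # 0.5

# Average of percent_rank (1/6) and cume_dist (6/7) for x = 1:
print(round((1 / 6 + 6 / 7) / 2, 2))  # 0.51
```

For the SAT-style case of x = 1, counting half the peers gives 0.50, close to the 0.51 you get by averaging the two built-in functions.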

 

(Article originally published on Bruce's blog - Monday, October 2, 2017 @ https://momjian.us/main/blogs/pgblog/2017.html#October_2_2017)