Finding the Largest Series with no Gaps
Stack Overflow has this very nice feature to motivate people to stay on their website for as long as possible. Badges:
For scale, you can see how many badges I have. Tons.
How do you calculate these badges? Let’s have a look at the “Enthusiast” and the “Fanatic”. These badges are awarded to anyone who spends a given amount of consecutive days on their platform. Regardless of any wedding date or wife’s birthday, you HAVE TO LOG IN, or the counter starts from zero again.
Now as we’re doing declarative programming, we don’t care about maintaining any state and in-memory counters. We want to express this in the form of online analytic SQL. I.e. consider this data:
| LOGIN_TIME | |---------------------| | 2014-03-18 05:37:13 | | 2014-03-16 08:31:47 | | 2014-03-16 06:11:17 | | 2014-03-16 05:59:33 | | 2014-03-15 11:17:28 | | 2014-03-15 10:00:11 | | 2014-03-15 07:45:27 | | 2014-03-15 07:42:19 | | 2014-03-14 09:38:12 |
That doesn’t help much. Let’s remove the hours from the timestamp. That’s easy:
1
2
3
4
|
SELECT DISTINCT
cast (login_time AS DATE ) AS login_date
FROM logins
WHERE user_id = :user_id
|
Which yields:
| LOGIN_DATE | |------------| | 2014-03-18 | | 2014-03-16 | | 2014-03-15 | | 2014-03-14 |
Now, that we’ve learned about window functions, let’s just add a simple row number to each of these dates:
1
2
3
4
|
SELECT login_date,
row_number() OVER ( ORDER BY login_date)
FROM login_dates
Which produces:
| LOGIN_DATE | RN | |------------|----| | 2014-03-18 | 4 | | 2014-03-16 | 3 | | 2014-03-15 | 2 | | 2014-03-14 | 1 | |
Still easy. Now, what happens, if instead of selecting these values separately, we subtract them?
1
2
3
4
|
SELECT login_date -
row_number() OVER ( ORDER BY login_date)
FROM login_dates
|
We’re getting something like this:
| LOGIN_DATE | RN | GRP | |------------|----|------------| | 2014-03-18 | 4 | 2014-03-14 | | 2014-03-16 | 3 | 2014-03-13 | | 2014-03-15 | 2 | 2014-03-13 | | 2014-03-14 | 1 | 2014-03-13 |
Wow. Interesting. So, 14 - 1 = 13
, 15 - 2 = 13
, 16 - 3 = 13
, but 18 - 4 = 14
.
There’s a simple example for this behaviour:
- ROW_NUMBER() never has gaps. That’s how it’s defined
- Our data, however, does
So when we subtract a “gapless” series of consecutive integers from a “gapful” series of non-consecutive dates, we will get the same date for each “gapless” subseries of consecutive dates, and we’ll get a new date again where the date series had gaps.
Huh.
This means we can now simply GROUP BY
this arbitrary date value:
1
2
3
4
5
6
7
|
SELECT min (login_date), max (login_date),
max (login_date) -
min (login_date) + 1 AS length
FROM login_date_groups
GROUP BY grp
ORDER BY length DESC
|
And we’re done. The largest series of consecutive dates with no gaps has been found:
| MIN | MAX | LENGTH | |------------|------------|--------| | 2014-03-14 | 2014-03-16 | 3 | | 2014-03-18 | 2014-03-18 | 1 |
With the full query being:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
WITH login_dates AS (
SELECT DISTINCT cast (login_time AS DATE ) login_date
FROM logins WHERE user_id = :user_id
),
login_date_groups AS (
SELECT
login_date,
login_date - row_number() OVER ( ORDER BY login_date) AS grp
FROM login_dates
)
SELECT min (login_date), max (login_date),
max (login_date) - min (login_date) + 1 AS length
FROM login_date_groups
GROUP BY grp
ORDER BY length DESC
|
- ۹۵/۰۳/۱۱