SQL Information

Finding the Largest Series with no Gaps

سه شنبه, ۱۱ خرداد ۱۳۹۵، ۰۸:۰۷ ق.ظ

Stack Overflow has this very nice feature to motivate people to stay on their website for as long as possible. Badges:

sql-tricks-slide-090

For scale, you can see how many badges I have. Tons.

How do you calculate these badges? Let’s have a look at the “Enthusiast” and the “Fanatic”. These badges are awarded to anyone who spends a given amount of consecutive days on their platform. Regardless of any wedding date or wife’s birthday, you HAVE TO LOG IN, or the counter starts from zero again.

Now as we’re doing declarative programming, we don’t care about maintaining any state and in-memory counters. We want to express this in the form of online analytic SQL. I.e. consider this data:

| LOGIN_TIME          |
|---------------------|
| 2014-03-18 05:37:13 |
| 2014-03-16 08:31:47 |
| 2014-03-16 06:11:17 |
| 2014-03-16 05:59:33 |
| 2014-03-15 11:17:28 |
| 2014-03-15 10:00:11 |
| 2014-03-15 07:45:27 |
| 2014-03-15 07:42:19 |
| 2014-03-14 09:38:12 |

That doesn’t help much. Let’s remove the hours from the timestamp. That’s easy:

1
2
3
4
SELECT DISTINCT
  cast(login_time AS DATE) AS login_date
FROM logins
WHERE user_id = :user_id

Which yields:

| LOGIN_DATE |
|------------|
| 2014-03-18 |
| 2014-03-16 |
| 2014-03-15 |
| 2014-03-14 |

Now, that we’ve learned about window functions, let’s just add a simple row number to each of these dates:

1
2
3
4
SELECT
  login_date,
  row_number() OVER (ORDER BY login_date)
FROM login_dates

Which produces:
| LOGIN_DATE | RN |
|------------|----|
| 2014-03-18 |  4 |
| 2014-03-16 |  3 |
| 2014-03-15 |  2 |
| 2014-03-14 |  1 |

Still easy. Now, what happens, if instead of selecting these values separately, we subtract them?

1
2
3
4
SELECT
  login_date -
  row_number() OVER (ORDER BY login_date)
FROM login_dates

We’re getting something like this:

| LOGIN_DATE | RN | GRP        |
|------------|----|------------|
| 2014-03-18 |  4 | 2014-03-14 |
| 2014-03-16 |  3 | 2014-03-13 |
| 2014-03-15 |  2 | 2014-03-13 |
| 2014-03-14 |  1 | 2014-03-13 |

Wow. Interesting. So, 14 - 1 = 1315 - 2 = 1316 - 3 = 13, but 18 - 4 = 14

There’s a simple example for this behaviour:

  1. ROW_NUMBER() never has gaps. That’s how it’s defined
  2. Our data, however, does

So when we subtract a “gapless” series of consecutive integers from a “gapful” series of non-consecutive dates, we will get the same date for each “gapless” subseries of consecutive dates, and we’ll get a new date again where the date series had gaps.

Huh.

This means we can now simply GROUP BY this arbitrary date value:

1
2
3
4
5
6
7
SELECT
  min(login_date), max(login_date),
  max(login_date) -
  min(login_date) + 1 AS length
FROM login_date_groups
GROUP BY grp
ORDER BY length DESC

And we’re done. The largest series of consecutive dates with no gaps has been found:

| MIN        | MAX        | LENGTH |
|------------|------------|--------|
| 2014-03-14 | 2014-03-16 |      3 |
| 2014-03-18 | 2014-03-18 |      1 |
With the full query being:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
WITH
  login_dates AS (
    SELECT DISTINCT cast(login_time AS DATE) login_date
    FROM logins WHERE user_id = :user_id
  ),
  login_date_groups AS (
    SELECT
      login_date,
      login_date - row_number() OVER (ORDER BY login_date) AS grp
    FROM login_dates
  )
SELECT
  min(login_date), max(login_date),
  max(login_date) - min(login_date) + 1 AS length
FROM login_date_groups
GROUP BY grp
ORDER BY length DESC
  • سمیرا امیری

نظرات  (۰)

هیچ نظری هنوز ثبت نشده است

ارسال نظر

ارسال نظر آزاد است، اما اگر قبلا در بیان ثبت نام کرده اید می توانید ابتدا وارد شوید.
شما میتوانید از این تگهای html استفاده کنید:
<b> یا <strong>، <em> یا <i>، <u>، <strike> یا <s>، <sup>، <sub>، <blockquote>، <code>، <pre>، <hr>، <br>، <p>، <a href="" title="">، <span style="">، <div align="">
تجدید کد امنیتی