The first 70% of a SQL course seems pretty straightforward; the difficulties begin in the remaining 30%.

From 2015 to 2019, I went through four rounds of interviews for data analyst and data scientist positions at more than a dozen companies. After another unsuccessful interview in 2017, when I got tripped up by hard SQL questions, I started compiling a problem book of medium- and hard-difficulty SQL questions to prepare better for interviews. That guide came in handy in my last round of interviews in 2019. Over the past year I shared it with a couple of friends, and thanks to the extra free time the pandemic gave me, I polished it into this document.

There are many great SQL tutorials for beginners. My favorites are Codecademy's interactive SQL course and Select Star SQL by Zi Chung Kao. But the first 70% of SQL really is quite simple, and the real difficulties begin in the remaining 30%, which beginner guides don't cover. Interviews for data analyst and data scientist positions at technology companies often draw on exactly that 30%. Surprisingly, I couldn't find an exhaustive source for such medium-difficulty questions, so I compiled this guide.

It is useful for interviews, but it will also make you more effective in your current and future jobs. Personally, I believe some of the SQL patterns mentioned here are also useful for ETL systems that power reporting tools and trend-detection analyses.
Keep in mind that data analyst and data scientist interviews cover more than SQL. Other common topics include discussing past projects, A/B testing, metric development, and open-ended analytical questions. About three years ago, Quora posted tips on interviewing for a product analyst position at Facebook, where this topic is discussed in more detail. That said, if improving your SQL will help your interviews, this guide is well worth your time.

In the future, I may port the code from this guide to a site like Select Star SQL, to make it easier to write SQL statements and see the results execute in real time, or add the questions as problems on an interview-prep platform like LeetCode. In the meantime, I just wanted to publish this document so people can use this material now.

Assumptions and how to use this guide
Assumed SQL knowledge: it is assumed you have working knowledge of SQL. You probably use it often at work but want to sharpen your skills on topics like self-joins and window functions.

How to use this guide: since interviews often take place at a whiteboard or in a virtual notepad (with no way to run the code), I recommend taking pencil and paper and writing out a solution to each problem; when you're done, compare your notes with the answers. Or work through your answers with a friend who plays the interviewer!

- Minor syntax errors don't matter much in a whiteboard or notepad interview. But they can distract the interviewer, so ideally try to minimize them and keep all the attention on the logic.
- The answers given are not necessarily the only way to solve each problem. Feel free to comment with additional solutions that can be added to this guide!
Tips for solving complex tasks in SQL interviews
First, standard tips for any programming interview...

- Listen carefully to the problem description and repeat the crux of the problem back to the interviewer
- Spell out an edge case to demonstrate you really understand the problem (i.e., a row that would not be included in the output of the SQL query you are about to write)
- (If the problem calls for it) sketch the intermediate result you are aiming for, such as what a self-join would look like, before writing the final query
- Start writing SQL early rather than waiting for a perfect plan, and talk through your reasoning as you go so the interviewer can follow your logic
Some of the problems here are adapted from old Periscope blog posts (mostly written by Sean Cook around 2014, although his authorship seems to have been removed from the materials after SiSense merged with Periscope), as well as from discussions on StackOverflow. Where applicable, sources are noted at the beginning of each question. Select Star SQL also has a good collection of brainteasers that complement the problems in this document.

Please note that these questions are not literal copies of questions from my own interviews, and they were not used at the companies where I have worked or work.

Self-join problems
No. 1. Percentage change month to month
Context: it is often useful to know how a key metric, such as the monthly audience of active users, changes month to month. Let's say we have a table logins in this form:

| user_id | date |
| --------- | ------------ |
| 1 | 2018-07-01 |
| 234 | 2018-07-02 |
| 3 | 2018-07-02 |
| 1 | 2018-07-02 |
| ... | ... |
| 234 | 2018-10-04 |
Objective: find the month-over-month percentage change in monthly active users (MAU).

Solution: (this solution, like the other code blocks in this document, contains comments on SQL syntax elements that differ between SQL dialects, along with other notes)

WITH mau AS
(
SELECT
DATE_TRUNC('month', date) month_timestamp,
COUNT(DISTINCT user_id) mau
FROM
logins
GROUP BY
DATE_TRUNC('month', date)
)
SELECT
a.month_timestamp previous_month,
a.mau previous_mau,
b.month_timestamp current_month,
b.mau current_mau,
ROUND(100.0*(b.mau - a.mau)/a.mau,2) AS percent_change
FROM
mau a
JOIN
mau b ON a.month_timestamp = b.month_timestamp - interval '1 month'
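Since whiteboard SQL is hard to check by eye, it can help to replay a query against a tiny dataset. Below is a runnable sketch of this solution using Python's built-in sqlite3; SQLite has no DATE_TRUNC or interval arithmetic, so strftime('%Y-%m-01', ...) and date(..., '+1 month') stand in for them (the sample rows are illustrative):

```python
import sqlite3

# Replay the month-over-month MAU query on a tiny in-memory dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, date TEXT);
INSERT INTO logins VALUES
  (1, '2018-07-01'), (234, '2018-07-02'), (3, '2018-07-02'),
  (1, '2018-08-03'), (234, '2018-08-04');
""")
rows = conn.execute("""
WITH mau AS (
    SELECT strftime('%Y-%m-01', date) AS month_timestamp,
           COUNT(DISTINCT user_id) AS mau
    FROM logins
    GROUP BY 1
)
SELECT a.month_timestamp AS previous_month,
       a.mau AS previous_mau,
       b.month_timestamp AS current_month,
       b.mau AS current_mau,
       ROUND(100.0 * (b.mau - a.mau) / a.mau, 2) AS percent_change
FROM mau a
JOIN mau b ON date(a.month_timestamp, '+1 month') = b.month_timestamp
""").fetchall()
print(rows)  # July has 3 distinct users, August has 2: roughly a 33% drop
```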
No. 2. Labeling a tree structure
Context: suppose you have a table tree with two columns: the first indicates a node, and the second that node's parent.

| node | parent |
| ------ | -------- |
| 1 | 2 |
| 2 | 5 |
| 3 | 5 |
| 4 | 3 |
| 5 | NULL |
Task: write SQL that labels each node as inner, root, or leaf, so that for the values above we get the following:

| node | label |
| ------ | ------- |
| 1 | Leaf |
| 2 | Inner |
| 3 | Inner |
| 4 | Leaf |
| 5 | Root |
(Note: more information on tree data structure terminology can be found here. However, it is not needed to solve this problem!)

Solution:

Acknowledgment: Fabian Hoffman proposed this more generalized solution on May 2, 2020. Thank you, Fabian!

WITH join_table AS
(
SELECT
cur.node,
cur.parent,
COUNT(next.node) AS num_children
FROM
tree cur
LEFT JOIN
tree next ON (next.parent = cur.node)
GROUP BY
cur.node,
cur.parent
)
SELECT
node,
CASE
WHEN parent IS NULL THEN 'Root'
WHEN num_children = 0 THEN 'Leaf'
ELSE 'Inner'
END AS label
FROM
join_table
Alternative solution, without explicit joins:

Acknowledgment: William Chardgin pointed out on May 2, 2020 that the condition WHERE parent IS NOT NULL is needed for this solution to return 'Leaf' instead of NULL. Thank you, William!

SELECT
node,
CASE
WHEN parent IS NULL THEN 'Root'
WHEN node NOT IN
(SELECT parent FROM tree WHERE parent IS NOT NULL) THEN 'Leaf'
WHEN node IN (SELECT parent FROM tree) AND parent IS NOT NULL THEN 'Inner'
END AS label
FROM
tree
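The subquery solution can be verified end to end with a few lines of Python and the standard sqlite3 module (the sample data matches the problem statement):

```python
import sqlite3

# Verify the tree-labeling query against the sample data from the problem.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tree (node INTEGER, parent INTEGER);
INSERT INTO tree VALUES (1, 2), (2, 5), (3, 5), (4, 3), (5, NULL);
""")
rows = conn.execute("""
SELECT
    node,
    CASE
        WHEN parent IS NULL THEN 'Root'
        WHEN node NOT IN (SELECT parent FROM tree WHERE parent IS NOT NULL)
            THEN 'Leaf'
        ELSE 'Inner'
    END AS label
FROM tree
ORDER BY node
""").fetchall()
print(rows)  # [(1, 'Leaf'), (2, 'Inner'), (3, 'Inner'), (4, 'Leaf'), (5, 'Root')]
```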
No. 3. User retention per month (several parts)
Acknowledgment: this task is adapted from the SiSense blog article "Using Self Joins to Calculate Your Retention, Churn, and Reactivation Metrics."

Part 1
Context: suppose we have user login statistics for a site in a table logins:

| user_id | date |
| --------- | ------------ |
| 1 | 2018-07-01 |
| 234 | 2018-07-02 |
| 3 | 2018-07-02 |
| 1 | 2018-07-02 |
| ... | ... |
| 234 | 2018-10-04 |
Task: write a query that returns the number of retained users per month. Here, that is defined as the number of users who logged in both this month and the previous month.

Solution:

SELECT
DATE_TRUNC('month', a.date) month_timestamp,
COUNT(DISTINCT a.user_id) retained_users
FROM
logins a
JOIN
logins b ON a.user_id = b.user_id
AND DATE_TRUNC('month', a.date) = DATE_TRUNC('month', b.date) +
interval '1 month'
GROUP BY
date_trunc('month', a.date)
Acknowledgment: Tom Moertel pointed out that deduplicating user_ids before the self-join makes the solution more efficient, and suggested the code below. Thanks, Tom!

Alternative solution:

WITH DistinctMonthlyUsers AS (
SELECT DISTINCT
DATE_TRUNC('MONTH', a.date) AS month_timestamp,
user_id
FROM logins
)
SELECT
CurrentMonth.month_timestamp month_timestamp,
COUNT(PriorMonth.user_id) AS retained_user_count
FROM
DistinctMonthlyUsers AS CurrentMonth
LEFT JOIN
DistinctMonthlyUsers AS PriorMonth
ON
CurrentMonth.month_timestamp = PriorMonth.month_timestamp + INTERVAL '1 MONTH'
AND
CurrentMonth.user_id = PriorMonth.user_id
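Here is a runnable sketch of the deduplicate-then-join approach in SQLite (via Python's sqlite3), with strftime('%Y-%m-01', ...) and date(..., '+1 month') standing in for DATE_TRUNC and interval arithmetic; the sample data is illustrative:

```python
import sqlite3

# Deduplicate (user, month) pairs first, then self-join to count retained users.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, date TEXT);
INSERT INTO logins VALUES
  (1, '2018-07-01'), (234, '2018-07-02'), (3, '2018-07-02'),
  (1, '2018-08-03'), (234, '2018-08-04');
""")
rows = conn.execute("""
WITH DistinctMonthlyUsers AS (
    SELECT DISTINCT strftime('%Y-%m-01', date) AS month_timestamp, user_id
    FROM logins
)
SELECT
    CurrentMonth.month_timestamp,
    COUNT(PriorMonth.user_id) AS retained_user_count
FROM DistinctMonthlyUsers AS CurrentMonth
LEFT JOIN DistinctMonthlyUsers AS PriorMonth
    ON CurrentMonth.month_timestamp = date(PriorMonth.month_timestamp, '+1 month')
   AND CurrentMonth.user_id = PriorMonth.user_id
GROUP BY CurrentMonth.month_timestamp
ORDER BY CurrentMonth.month_timestamp
""").fetchall()
print(rows)  # users 1 and 234 logged in during both July and August
```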
Part 2
Task: now let's take the previous task of counting retained users per month and turn it upside down. We'll write a query to count the users who did not return to the site this month, i.e., "lost" users.

Solution:

SELECT
DATE_TRUNC('month', a.date) month_timestamp,
COUNT(DISTINCT b.user_id) churned_users
FROM
logins a
FULL OUTER JOIN
logins b ON a.user_id = b.user_id
AND DATE_TRUNC('month', a.date) = DATE_TRUNC('month', b.date) +
interval '1 month'
WHERE
a.user_id IS NULL
GROUP BY
DATE_TRUNC('month', a.date)
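A runnable churn check with sqlite3: since FULL OUTER JOIN is only available in recent SQLite releases, this sketch uses the equivalent LEFT JOIN plus NULL filter, reporting each churned user under the month they went missing. Note that with a trailing definition like this, the final month of data makes everyone look churned in the following month, since no later logins exist:

```python
import sqlite3

# Count users who were active in the prior month but not the current one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, date TEXT);
INSERT INTO logins VALUES
  (1, '2018-07-01'), (234, '2018-07-02'), (3, '2018-07-15'),
  (1, '2018-08-01'), (234, '2018-08-09');
""")
rows = conn.execute("""
WITH monthly AS (
    SELECT DISTINCT strftime('%Y-%m-01', date) AS month_timestamp, user_id
    FROM logins
)
SELECT
    date(prior.month_timestamp, '+1 month') AS month_timestamp,
    COUNT(DISTINCT prior.user_id) AS churned_users
FROM monthly prior
LEFT JOIN monthly cur
    ON cur.user_id = prior.user_id
   AND cur.month_timestamp = date(prior.month_timestamp, '+1 month')
WHERE cur.user_id IS NULL
GROUP BY 1
ORDER BY 1
""").fetchall()
print(rows)  # user 3 logged in during July but not August; the last row is the end-of-data artifact
```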
Note that this problem can also be solved with a LEFT or RIGHT join.

Part 3
Note: this is probably a harder task than you would get in a real interview. Think of it as a puzzle, or skip ahead to the next task.

Context: so, we did a good job on the two previous problems. For this new task, we now have a table of lost users, user_churns. If a user was active in the past month but not in the current one, they are entered into the table for the current month. Here's what user_churns looks like:

| user_id | month_date |
| --------- | ------------ |
| 1 | 2018-05-01 |
| 234 | 2018-05-01 |
| 3 | 2018-05-01 |
| 12 | 2018-05-01 |
| ... | ... |
| 234 | 2018-10-01 |
Task: now you want to run a cohort analysis of users active this month who had previously churned and have been reactivated. Create a table of these users. You can use the user_churns and logins tables to build the cohort. In Postgres, the current timestamp is available via current_timestamp.

Solution:

WITH user_login_data AS
(
SELECT
DATE_TRUNC('month', a.date) month_timestamp,
a.user_id,
MAX(b.month_date) as most_recent_churn,
MAX(DATE_TRUNC('month', c.date)) as most_recent_active
FROM
logins a
JOIN
user_churns b
ON a.user_id = b.user_id AND DATE_TRUNC('month', a.date) > b.month_date
JOIN
logins c
ON a.user_id = c.user_id
AND
DATE_TRUNC('month', a.date) > DATE_TRUNC('month', c.date)
WHERE
DATE_TRUNC('month', a.date) = DATE_TRUNC('month', current_timestamp)
GROUP BY
DATE_TRUNC('month', a.date),
a.user_id
HAVING
MAX(b.month_date) > MAX(DATE_TRUNC('month', c.date))
)
SELECT * FROM user_login_data
No. 4. Running total
Acknowledgment: this task is adapted from the SiSense blog article "Cash Flow Modeling in SQL."

Context: suppose we have a table transactions in this form:

| date | cash_flow |
| ------------ | ----------- |
| 2018-01-01 | -1000 |
| 2018-01-02 | -100 |
| 2018-01-03 | 50 |
| ... | ... |
where cash_flow is revenue minus costs for each day.

Objective: write a query to get a running total of cash flow for each day, so that you end up with a table in this form:

| date | cumulative_cf |
| ------------ | --------------- |
| 2018-01-01 | -1000 |
| 2018-01-02 | -1100 |
| 2018-01-03 | -1050 |
| ... | ... |
Solution:

SELECT
a.date date,
SUM(b.cash_flow) as cumulative_cf
FROM
transactions a
JOIN
transactions b ON a.date >= b.date
GROUP BY
a.date
ORDER BY
date ASC
An alternative solution uses a window function (more efficient!):

SELECT
date,
SUM(cash_flow) OVER (ORDER BY date ASC) as cumulative_cf
FROM
transactions
ORDER BY
date ASC
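The window-function version runs unchanged in SQLite (3.25+), which makes it easy to sanity-check against the expected output (sample data from the problem statement):

```python
import sqlite3

# Running total of cash flow via SUM() OVER (ORDER BY ...).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (date TEXT, cash_flow INTEGER);
INSERT INTO transactions VALUES
  ('2018-01-01', -1000), ('2018-01-02', -100), ('2018-01-03', 50);
""")
rows = conn.execute("""
SELECT
    date,
    SUM(cash_flow) OVER (ORDER BY date ASC) AS cumulative_cf
FROM transactions
ORDER BY date ASC
""").fetchall()
print(rows)  # [('2018-01-01', -1000), ('2018-01-02', -1100), ('2018-01-03', -1050)]
```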
No. 5. Moving Average
Acknowledgment: this task is adapted from the SiSense blog article "Moving Averages in MySQL and SQL Server."

Note: a moving average can be calculated in various ways. Here we use a trailing average, so the metric for the seventh day of the month is the average of the previous six days and the day itself.

Context: suppose we have a table signups in this form:

| date | sign_ups |
| ------------ | ---------- |
| 2018-01-01 | 10 |
| 2018-01-02 | 20 |
| 2018-01-03 | 50 |
| ... | ... |
| 2018-10-01 | 35 |
Task: write a query to get the 7-day moving average of daily sign-ups.

Solution:

SELECT
a.date,
AVG(b.sign_ups) average_sign_ups
FROM
signups a
JOIN
signups b ON a.date <= b.date + interval '6 days' AND a.date >= b.date
GROUP BY
a.date
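A quick way to convince yourself the join condition captures a trailing 7-day window is to run it on a small table; in SQLite, date(b.date, '+6 days') replaces b.date + interval '6 days' (sample data is illustrative):

```python
import sqlite3

# Trailing 7-day moving average via a range self-join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE signups (date TEXT, sign_ups INTEGER);
INSERT INTO signups VALUES
  ('2018-01-01', 10), ('2018-01-02', 20), ('2018-01-03', 50),
  ('2018-01-04', 80), ('2018-01-05', 40), ('2018-01-06', 70),
  ('2018-01-07', 30), ('2018-01-08', 60);
""")
rows = conn.execute("""
SELECT
    a.date,
    AVG(b.sign_ups) AS average_sign_ups
FROM signups a
JOIN signups b
    ON a.date <= date(b.date, '+6 days') AND a.date >= b.date
GROUP BY a.date
ORDER BY a.date
""").fetchall()
# The first days average over fewer than seven values; from day 7 on,
# each row is the mean of that day and the six days before it.
print(rows[-1])
```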
No. 6. Multiple join conditions
Acknowledgment: this task is adapted from the SiSense blog article "Analyzing Your Email with SQL."

Context: let's say our table emails contains emails sent from and received by the address zach@g.com:

| id | subject | from | to | timestamp |
| ---- | ---------- | -------------- | -------------- | --------------------- |
| 1 | Yosemite | zach@g.com | thomas@g.com | 2018-01-02 12:45:03 |
| 2 | Big Sur | sarah@g.com | thomas@g.com | 2018-01-02 16:30:01 |
| 3 | Yosemite | thomas@g.com | zach@g.com | 2018-01-02 16:35:04 |
| 4 | Running | jill@g.com | zach@g.com | 2018-01-03 08:12:45 |
| 5 | Yosemite | zach@g.com | thomas@g.com | 2018-01-03 14:02:01 |
| 6 | Yosemite | thomas@g.com | zach@g.com | 2018-01-03 15:01:05 |
| .. | .. | .. | .. | .. |
Task: write a query to get the response time for each email (id) sent to zach@g.com. Do not include emails sent to other addresses. Assume each thread has a unique subject. Keep in mind that a thread may have several back-and-forth emails between zach@g.com and other recipients.

Solution:

SELECT
a.id,
MIN(b.timestamp) - a.timestamp as time_to_respond
FROM
emails a
JOIN
emails b
ON
b.subject = a.subject
AND
a.to = b.from
AND
a.from = b.to
AND
a.timestamp < b.timestamp
WHERE
a.to = 'zach@g.com'
GROUP BY
a.id
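The multi-condition self-join can be tested with sqlite3; timestamp subtraction is done via strftime('%s', ...) to get seconds, and the from/to columns are double-quoted because they collide with SQL keywords (sample rows follow the problem statement):

```python
import sqlite3

# For each email to zach@g.com, find the earliest reply in the same thread.
conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE emails (id INTEGER, subject TEXT, "from" TEXT, "to" TEXT, timestamp TEXT);
INSERT INTO emails VALUES
  (1, 'Yosemite', 'zach@g.com',   'thomas@g.com', '2018-01-02 12:45:03'),
  (2, 'Big Sur',  'sarah@g.com',  'thomas@g.com', '2018-01-02 16:30:01'),
  (3, 'Yosemite', 'thomas@g.com', 'zach@g.com',   '2018-01-02 16:35:04'),
  (4, 'Running',  'jill@g.com',   'zach@g.com',   '2018-01-03 08:12:45'),
  (5, 'Yosemite', 'zach@g.com',   'thomas@g.com', '2018-01-03 14:02:01'),
  (6, 'Yosemite', 'thomas@g.com', 'zach@g.com',   '2018-01-03 15:01:05');
''')
rows = conn.execute('''
SELECT
    a.id,
    strftime('%s', MIN(b.timestamp)) - strftime('%s', a.timestamp) AS seconds_to_respond
FROM emails a
JOIN emails b
    ON b.subject = a.subject
   AND a."to" = b."from"
   AND a."from" = b."to"
   AND a.timestamp < b.timestamp
WHERE a."to" = 'zach@g.com'
GROUP BY a.id
''').fetchall()
print(rows)  # only email 3 received a reply (email 5); emails 4 and 6 never did
```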
Tasks for window functions
No. 1. Find the identifier with the maximum value
Context: suppose we have a table salaries with department and employee salary data in the following format:

depname | empno | salary |
----------- + ------- + -------- +
develop | 11 | 5200 |
develop | 7 | 4200 |
develop | 9 | 4500 |
develop | 8 | 6000 |
develop | 10 | 5200 |
personnel | 5 | 3500 |
personnel | 2 | 3900 |
sales | 3 | 4800 |
sales | 1 | 5000 |
sales | 4 | 4800 |
Task: write a query to get the empno with the highest salary. Make sure your solution handles ties!

Solution:

WITH max_salary AS (
SELECT
MAX(salary) max_salary
FROM
salaries
)
SELECT
s.empno
FROM
salaries s
JOIN
max_salary ms ON s.salary = ms.max_salary
Alternative solution using RANK():

WITH sal_rank AS
(SELECT
empno,
RANK() OVER(ORDER BY salary DESC) rnk
FROM
salaries)
SELECT
empno
FROM
sal_rank
WHERE
rnk = 1;
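Both variants run as-is in SQLite; here is the RANK() version checked against the sample table (empno 8 has the unique top salary of 6000):

```python
import sqlite3

# Find the empno(s) with the highest salary using RANK().
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salaries (depname TEXT, empno INTEGER, salary INTEGER);
INSERT INTO salaries VALUES
  ('develop', 11, 5200), ('develop', 7, 4200), ('develop', 9, 4500),
  ('develop', 8, 6000), ('develop', 10, 5200),
  ('personnel', 5, 3500), ('personnel', 2, 3900),
  ('sales', 3, 4800), ('sales', 1, 5000), ('sales', 4, 4800);
""")
rows = conn.execute("""
WITH sal_rank AS (
    SELECT empno, RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM salaries
)
SELECT empno FROM sal_rank WHERE rnk = 1
""").fetchall()
print(rows)  # [(8,)] -- with a tie at the top, RANK() would return every top-paid empno
```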
No. 2. Average value and ranking with a window function (several parts)
Part 1
Context: suppose we have a table salaries in this format:

depname | empno | salary |
----------- + ------- + -------- +
develop | 11 | 5200 |
develop | 7 | 4200 |
develop | 9 | 4500 |
develop | 8 | 6000 |
develop | 10 | 5200 |
personnel | 5 | 3500 |
personnel | 2 | 3900 |
sales | 3 | 4800 |
sales | 1 | 5000 |
sales | 4 | 4800 |
Task: write a query that returns the same table, but with a new column showing the average salary per department. We would expect a table like this:

depname | empno | salary | avg_salary |
----------- + ------- + -------- + ------------ +
develop | 11 | 5200 | 5020 |
develop | 7 | 4200 | 5020 |
develop | 9 | 4500 | 5020 |
develop | 8 | 6000 | 5020 |
develop | 10 | 5200 | 5020 |
personnel | 5 | 3500 | 3700 |
personnel | 2 | 3900 | 3700 |
sales | 3 | 4800 | 4867 |
sales | 1 | 5000 | 4867 |
sales | 4 | 4800 | 4867 |
Solution:

SELECT
*,
ROUND(AVG(salary),0) OVER (PARTITION BY depname) avg_salary
FROM
salaries
Part 2
Task: write a query that adds a column with each employee's rank within their department based on salary, where the employee with the highest salary gets rank 1. We would expect a table in this form:

depname | empno | salary | salary_rank |
----------- + ------- + -------- + ------------- +
develop | 11 | 5200 | 2 |
develop | 7 | 4200 | 5 |
develop | 9 | 4500 | 4 |
develop | 8 | 6000 | 1 |
develop | 10 | 5200 | 2 |
personnel | 5 | 3500 | 2 |
personnel | 2 | 3900 | 1 |
sales | 3 | 4800 | 2 |
sales | 1 | 5000 | 1 |
sales | 4 | 4800 | 2 |
Solution:

SELECT
*,
RANK() OVER(PARTITION BY depname ORDER BY salary DESC) salary_rank
FROM
salaries
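Both parts can be combined into a single query, which runs unchanged in SQLite (3.25+); this sketch checks the develop department against the expected tables above:

```python
import sqlite3

# Per-department average salary and salary rank in one pass over the table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salaries (depname TEXT, empno INTEGER, salary INTEGER);
INSERT INTO salaries VALUES
  ('develop', 11, 5200), ('develop', 7, 4200), ('develop', 9, 4500),
  ('develop', 8, 6000), ('develop', 10, 5200),
  ('personnel', 5, 3500), ('personnel', 2, 3900),
  ('sales', 3, 4800), ('sales', 1, 5000), ('sales', 4, 4800);
""")
rows = conn.execute("""
SELECT
    depname,
    empno,
    salary,
    ROUND(AVG(salary) OVER (PARTITION BY depname), 0) AS avg_salary,
    RANK() OVER (PARTITION BY depname ORDER BY salary DESC) AS salary_rank
FROM salaries
ORDER BY depname, salary_rank, empno
""").fetchall()
print(rows[0])  # ('develop', 8, 6000, 5020.0, 1)
```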
Other tasks of medium and high difficulty
No. 1. Histograms
Context: let's say we have a table sessions where each row represents a video streaming session with its length in seconds:

| session_id | length_seconds |
| ------------ | ---------------- |
| 1 | 23 |
| 2 | 453 |
| 3 | 27 |
| .. | .. |
Task: write a query to count the sessions that fall into five-second intervals, i.e., for the fragment above, the result would be something like this:

| bucket | count |
| --------- | ------- |
| 20-25 | 2 |
| 450-455 | 1 |
Extra credit for proper bucket labels ("5-10", etc.).

Solution:

WITH bin_label AS
(SELECT
session_id,
FLOOR(length_seconds/5) as bin_label
FROM
sessions
)
SELECT
CONCAT(CAST(bin_label*5 AS VARCHAR), '-', CAST(bin_label*5+5 AS VARCHAR)) bucket,
COUNT(DISTINCT session_id) count
FROM
bin_label
GROUP BY
bin_label
ORDER BY
bin_label ASC
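SQLite lacks FLOOR and (in older versions) CONCAT, but integer division already truncates and the || operator concatenates, so the same idea can be tested like this (the sample lengths are hypothetical, chosen so two sessions share a bucket):

```python
import sqlite3

# Bucket session lengths into 5-second bins and label each bucket.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (session_id INTEGER, length_seconds INTEGER);
INSERT INTO sessions VALUES (1, 23), (2, 453), (3, 24);
""")
rows = conn.execute("""
WITH bin_label AS (
    SELECT session_id, length_seconds / 5 AS bin_label  -- integer division floors here
    FROM sessions
)
SELECT
    CAST(bin_label * 5 AS TEXT) || '-' || CAST(bin_label * 5 + 5 AS TEXT) AS bucket,
    COUNT(DISTINCT session_id) AS count
FROM bin_label
GROUP BY bin_label
ORDER BY bin_label ASC
""").fetchall()
print(rows)  # [('20-25', 2), ('450-455', 1)]
```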
No. 2. Cross join (several parts)
Part 1
Context: let's say we have a table state_streams where each row lists a state and the total number of streaming hours from a video hosting service:

| state | total_streams |
| ------- | --------------- |
| NC | 34569 |
| SC | 33999 |
| CA | 98324 |
| MA | 19345 |
| .. | .. |
(In practice, aggregated tables like this usually have a date column, but we'll exclude it for this task.)

Task: write a query to get pairs of states whose total stream counts are within a thousand of each other. For the fragment above, we would expect something like:

| state_a | state_b |
| --------- | --------- |
| NC | SC |
| SC | NC |
Solution:

SELECT
a.state as state_a,
b.state as state_b
FROM
state_streams a
CROSS JOIN
state_streams b
WHERE
ABS(a.total_streams - b.total_streams) < 1000
AND
a.state <> b.state
For reference, cross joins can also be written without explicit join syntax:

SELECT
a.state as state_a,
b.state as state_b
FROM
state_streams a, state_streams b
WHERE
ABS(a.total_streams - b.total_streams) < 1000
AND
a.state <> b.state
Part 2
Note: this is a bonus question rather than a really important SQL pattern. Feel free to skip it!

Task: how can the SQL from the previous solution be modified to remove duplicates? For example, for the same sample table, the pair NC and SC should appear only once, not twice.

Solution:

SELECT
a.state as state_a,
b.state as state_b
FROM
state_streams a, state_streams b
WHERE
ABS(a.total_streams - b.total_streams) < 1000
AND
a.state > b.state
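The deduplication trick works by imposing an arbitrary strict order on each pair; a quick check with sqlite3 (sample data from the problem statement):

```python
import sqlite3

# Pairs of states with stream totals within 1000, each pair listed once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE state_streams (state TEXT, total_streams INTEGER);
INSERT INTO state_streams VALUES
  ('NC', 34569), ('SC', 33999), ('CA', 98324), ('MA', 19345);
""")
rows = conn.execute("""
SELECT a.state AS state_a, b.state AS state_b
FROM state_streams a, state_streams b
WHERE ABS(a.total_streams - b.total_streams) < 1000
  AND a.state > b.state
""").fetchall()
print(rows)  # [('SC', 'NC')] -- the strict > keeps exactly one ordering of the pair
```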
No. 3. Advanced calculations
Acknowledgment: this task is adapted from a discussion of a question I asked on StackOverflow (my username is zthomas.nc).

Note: this is probably a harder task than you would get in a real interview. Think of it as a puzzle, or feel free to skip it!

Context: suppose we have a table named table in this form, where the same user can have multiple class values:

| user | class |
| ------ | ------- |
| 1 | a |
| 1 | b |
| 1 | b |
| 2 | b |
| 3 | a |
Problem: suppose there are only two possible class values. Write a query to count the number of users in each class, where users with both labels a and b are assigned to class b. For our sample we get the following result:

| class | count |
| ------- | ------- |
| a | 1 |
| b | 2 |
Solution:

WITH usr_b_sum AS
(
SELECT
user,
SUM(CASE WHEN class = 'b' THEN 1 ELSE 0 END) num_b
FROM
table
GROUP BY
user
),
usr_class_label AS
(
SELECT
user,
CASE WHEN num_b > 0 THEN 'b' ELSE 'a' END class
FROM
usr_b_sum
)
SELECT
class,
COUNT(DISTINCT user) count
FROM
usr_class_label
GROUP BY
class
ORDER BY
class ASC
An alternative solution uses scalar subqueries in the SELECT list together with a UNION:

SELECT
'a' class,
(SELECT COUNT(DISTINCT user) FROM table) -
(SELECT COUNT(DISTINCT user) FROM table WHERE class = 'b') count
UNION
SELECT
'b' class,
(SELECT COUNT(DISTINCT user) FROM table WHERE class = 'b') count
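A check of the CTE solution with sqlite3; since table is a reserved word, the sketch renames it to user_classes (an assumed name), with the sample data from the problem:

```python
import sqlite3

# Assign each user to 'b' if they ever have label b, else 'a', then count per class.
# Note: the table is named user_classes here (hypothetical) because TABLE is reserved.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_classes (user INTEGER, class TEXT);
INSERT INTO user_classes VALUES (1, 'a'), (1, 'b'), (1, 'b'), (2, 'b'), (3, 'a');
""")
rows = conn.execute("""
WITH usr_b_sum AS (
    SELECT user, SUM(CASE WHEN class = 'b' THEN 1 ELSE 0 END) AS num_b
    FROM user_classes
    GROUP BY user
),
usr_class_label AS (
    SELECT user, CASE WHEN num_b > 0 THEN 'b' ELSE 'a' END AS class
    FROM usr_b_sum
)
SELECT class, COUNT(DISTINCT user) AS count
FROM usr_class_label
GROUP BY class
ORDER BY class ASC
""").fetchall()
print(rows)  # [('a', 1), ('b', 2)] -- user 1 has both labels and lands in class b
```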