The first 70% of a SQL course seems pretty straightforward; the difficulties begin in the remaining 30%.

From 2015 to 2019, I went through four rounds of interviews for data analyst and data scientist positions at more than a dozen companies. After another unsuccessful interview in 2017, when I got tripped up by hard SQL questions, I started compiling a problem book of medium- and hard-difficulty SQL questions to prepare better for interviews. That guide came in handy in my last round of interviews in 2019. Over the past year I shared it with a couple of friends, and thanks to the extra free time the pandemic gave me, I polished it into this document.

There are many great SQL tutorials for beginners. My favorites are Codecademy's interactive SQL course and Select Star SQL by Zi Chung Kao. But the first 70% of SQL really is quite simple, and the real difficulties begin in the remaining 30%, which beginner guides don't cover. Interviews for data analyst and data scientist positions at technology companies often draw on exactly that 30%. Surprisingly, I couldn't find an exhaustive source for such medium-difficulty questions, so I compiled this guide.

It is useful for interviews, but it will also make you more effective in your current and future jobs. Personally, I believe some of the SQL patterns mentioned here are also useful for ETL systems that power reporting tools and trend-detection analyses.
Keep in mind that data analyst and data scientist interviews cover more than SQL. Other common topics include discussing past projects, A/B testing, metric development, and open-ended analytical questions. About three years ago, Quora posted tips on interviewing for a product analyst position at Facebook, where this topic is discussed in more detail. That said, if improving your SQL will help your interviews, this guide is well worth your time.

In the future, I may port the code from this guide to a site like Select Star SQL, to make it easier to write SQL statements and see the results execute in real time, or add the questions as problems on an interview-prep platform like LeetCode. In the meantime, I just wanted to publish this document so people can use this material now.

Assumptions and how to use this guide
Assumed SQL knowledge: it is assumed you have working knowledge of SQL. You probably use it often at work but want to sharpen your skills on topics like self-joins and window functions.

How to use this guide: since interviews often take place at a whiteboard or in a virtual notepad (with no way to run the code), I recommend taking pencil and paper and writing out a solution to each problem; when you're done, compare your notes with the answers. Or work through your answers with a friend who plays the interviewer!

- Minor syntax errors don't matter much in a whiteboard or notepad interview. But they can distract the interviewer, so ideally try to minimize them and keep all the attention on the logic.
- The answers given are not necessarily the only way to solve each problem. Feel free to comment with additional solutions that can be added to this guide!
Tips for solving complex tasks in SQL interviews
First, standard tips for any programming interview...

- Listen carefully to the problem description and repeat the crux of the problem back to the interviewer
- Spell out an edge case to demonstrate you really understand the problem (i.e., a row that would not be included in the output of the SQL query you are about to write)
- (If the problem calls for it) sketch the intermediate result you are aiming for, such as what a self-join would look like, before writing the final query
- Start writing SQL early rather than waiting for a perfect plan, and talk through your reasoning as you go so the interviewer can follow your logic
Some of the problems here are adapted from old Periscope blog posts (mostly written by Sean Cook around 2014, although his authorship seems to have been removed from the materials after SiSense merged with Periscope), as well as from discussions on StackOverflow. Where applicable, sources are noted at the beginning of each question. Select Star SQL also has a good collection of brainteasers that complement the problems in this document.

Please note that these questions are not literal copies of questions from my own interviews, and they were not used at the companies where I have worked or work.

Self-join problems
No. 1. Percentage change month to month
Context: it is often useful to know how a key metric, such as the monthly audience of active users, changes month to month. Let's say we have a table logins in this form:

| user_id | date |
| --------- | ------------ |
| 1 | 2018-07-01 |
| 234 | 2018-07-02 |
| 3 | 2018-07-02 |
| 1 | 2018-07-02 |
| ... | ... |
| 234 | 2018-10-04 |
Objective: find the month-over-month percentage change in monthly active users (MAU).

Solution: (this solution, like the other code blocks in this document, contains comments on SQL syntax elements that differ between SQL dialects, along with other notes)

WITH mau AS
(
SELECT
DATE_TRUNC('month', date) month_timestamp,
COUNT(DISTINCT user_id) mau
FROM
logins
GROUP BY
DATE_TRUNC('month', date)
)
SELECT
a.month_timestamp previous_month,
a.mau previous_mau,
b.month_timestamp current_month,
b.mau current_mau,
ROUND(100.0*(b.mau - a.mau)/a.mau,2) AS percent_change
FROM
mau a
JOIN
mau b ON a.month_timestamp = b.month_timestamp - interval '1 month'
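Since whiteboard SQL is hard to check by eye, it can help to replay a query against a tiny dataset. Below is a runnable sketch of this solution using Python's built-in sqlite3; SQLite has no DATE_TRUNC or interval arithmetic, so strftime('%Y-%m-01', ...) and date(..., '+1 month') stand in for them (the sample rows are illustrative):

```python
import sqlite3

# Replay the month-over-month MAU query on a tiny in-memory dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, date TEXT);
INSERT INTO logins VALUES
  (1, '2018-07-01'), (234, '2018-07-02'), (3, '2018-07-02'),
  (1, '2018-08-03'), (234, '2018-08-04');
""")
rows = conn.execute("""
WITH mau AS (
    SELECT strftime('%Y-%m-01', date) AS month_timestamp,
           COUNT(DISTINCT user_id) AS mau
    FROM logins
    GROUP BY 1
)
SELECT a.month_timestamp AS previous_month,
       a.mau AS previous_mau,
       b.month_timestamp AS current_month,
       b.mau AS current_mau,
       ROUND(100.0 * (b.mau - a.mau) / a.mau, 2) AS percent_change
FROM mau a
JOIN mau b ON date(a.month_timestamp, '+1 month') = b.month_timestamp
""").fetchall()
print(rows)  # July has 3 distinct users, August has 2: roughly a 33% drop
```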
No. 2. Labeling a tree structure
Context: suppose you have a table tree with two columns: the first indicates a node, and the second that node's parent.

| node | parent |
| ------ | -------- |
| 1 | 2 |
| 2 | 5 |
| 3 | 5 |
| 4 | 3 |
| 5 | NULL |
Task: write SQL that labels each node as inner, root, or leaf, so that for the values above we get the following:

| node | label |
| ------ | ------- |
| 1 | Leaf |
| 2 | Inner |
| 3 | Inner |
| 4 | Leaf |
| 5 | Root |
(Note: more information on tree data structure terminology can be found here. However, it is not needed to solve this problem!)

Solution:

Acknowledgment: Fabian Hoffman proposed this more generalized solution on May 2, 2020. Thank you, Fabian!

WITH join_table AS
(
SELECT
cur.node,
cur.parent,
COUNT(next.node) AS num_children
FROM
tree cur
LEFT JOIN
tree next ON (next.parent = cur.node)
GROUP BY
cur.node,
cur.parent
)
SELECT
node,
CASE
WHEN parent IS NULL THEN 'Root'
WHEN num_children = 0 THEN 'Leaf'
ELSE 'Inner'
END AS label
FROM
join_table
Alternative solution, without explicit joins:

Acknowledgment: William Chardgin pointed out on May 2, 2020 that the condition WHERE parent IS NOT NULL is needed for this solution to return 'Leaf' instead of NULL. Thank you, William!

SELECT
node,
CASE
WHEN parent IS NULL THEN 'Root'
WHEN node NOT IN
(SELECT parent FROM tree WHERE parent IS NOT NULL) THEN 'Leaf'
WHEN node IN (SELECT parent FROM tree) AND parent IS NOT NULL THEN 'Inner'
END AS label
FROM
tree
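The subquery solution can be verified end to end with a few lines of Python and the standard sqlite3 module (the sample data matches the problem statement):

```python
import sqlite3

# Verify the tree-labeling query against the sample data from the problem.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tree (node INTEGER, parent INTEGER);
INSERT INTO tree VALUES (1, 2), (2, 5), (3, 5), (4, 3), (5, NULL);
""")
rows = conn.execute("""
SELECT
    node,
    CASE
        WHEN parent IS NULL THEN 'Root'
        WHEN node NOT IN (SELECT parent FROM tree WHERE parent IS NOT NULL)
            THEN 'Leaf'
        ELSE 'Inner'
    END AS label
FROM tree
ORDER BY node
""").fetchall()
print(rows)  # [(1, 'Leaf'), (2, 'Inner'), (3, 'Inner'), (4, 'Leaf'), (5, 'Root')]
```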
No. 3. User retention per month (several parts)
Acknowledgment: this task is adapted from the SiSense blog article "Using Self Joins to Calculate Your Retention, Churn, and Reactivation Metrics."

Part 1
Context: suppose we have user login statistics for a site in a table logins:

| user_id | date |
| --------- | ------------ |
| 1 | 2018-07-01 |
| 234 | 2018-07-02 |
| 3 | 2018-07-02 |
| 1 | 2018-07-02 |
| ... | ... |
| 234 | 2018-10-04 |
Task: write a query that returns the number of retained users per month. Here, that is defined as the number of users who logged in both this month and the previous month.

Solution:

SELECT
DATE_TRUNC('month', a.date) month_timestamp,
COUNT(DISTINCT a.user_id) retained_users
FROM
logins a
JOIN
logins b ON a.user_id = b.user_id
AND DATE_TRUNC('month', a.date) = DATE_TRUNC('month', b.date) +
interval '1 month'
GROUP BY
date_trunc('month', a.date)
Acknowledgment: Tom Moertel pointed out that deduplicating user_ids before the self-join makes the solution more efficient, and suggested the code below. Thanks, Tom!

Alternative solution:

WITH DistinctMonthlyUsers AS (
SELECT DISTINCT
DATE_TRUNC('MONTH', a.date) AS month_timestamp,
user_id
FROM logins
)
SELECT
CurrentMonth.month_timestamp month_timestamp,
COUNT(PriorMonth.user_id) AS retained_user_count
FROM
DistinctMonthlyUsers AS CurrentMonth
LEFT JOIN
DistinctMonthlyUsers AS PriorMonth
ON
CurrentMonth.month_timestamp = PriorMonth.month_timestamp + INTERVAL '1 MONTH'
AND
CurrentMonth.user_id = PriorMonth.user_id
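Here is a runnable sketch of the deduplicate-then-join approach in SQLite (via Python's sqlite3), with strftime('%Y-%m-01', ...) and date(..., '+1 month') standing in for DATE_TRUNC and interval arithmetic; the sample data is illustrative:

```python
import sqlite3

# Deduplicate (user, month) pairs first, then self-join to count retained users.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, date TEXT);
INSERT INTO logins VALUES
  (1, '2018-07-01'), (234, '2018-07-02'), (3, '2018-07-02'),
  (1, '2018-08-03'), (234, '2018-08-04');
""")
rows = conn.execute("""
WITH DistinctMonthlyUsers AS (
    SELECT DISTINCT strftime('%Y-%m-01', date) AS month_timestamp, user_id
    FROM logins
)
SELECT
    CurrentMonth.month_timestamp,
    COUNT(PriorMonth.user_id) AS retained_user_count
FROM DistinctMonthlyUsers AS CurrentMonth
LEFT JOIN DistinctMonthlyUsers AS PriorMonth
    ON CurrentMonth.month_timestamp = date(PriorMonth.month_timestamp, '+1 month')
   AND CurrentMonth.user_id = PriorMonth.user_id
GROUP BY CurrentMonth.month_timestamp
ORDER BY CurrentMonth.month_timestamp
""").fetchall()
print(rows)  # users 1 and 234 logged in during both July and August
```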
Part 2
Task: now let's take the previous task of counting retained users per month and turn it upside down. We'll write a query to count the users who did not return to the site this month, i.e., "lost" users.

Solution:

SELECT
DATE_TRUNC('month', a.date) month_timestamp,
COUNT(DISTINCT b.user_id) churned_users
FROM
logins a
FULL OUTER JOIN
logins b ON a.user_id = b.user_id
AND DATE_TRUNC('month', a.date) = DATE_TRUNC('month', b.date) +
interval '1 month'
WHERE
a.user_id IS NULL
GROUP BY
DATE_TRUNC('month', a.date)
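A runnable churn check with sqlite3: since FULL OUTER JOIN is only available in recent SQLite releases, this sketch uses the equivalent LEFT JOIN plus NULL filter, reporting each churned user under the month they went missing. Note that with a trailing definition like this, the final month of data makes everyone look churned in the following month, since no later logins exist:

```python
import sqlite3

# Count users who were active in the prior month but not the current one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, date TEXT);
INSERT INTO logins VALUES
  (1, '2018-07-01'), (234, '2018-07-02'), (3, '2018-07-15'),
  (1, '2018-08-01'), (234, '2018-08-09');
""")
rows = conn.execute("""
WITH monthly AS (
    SELECT DISTINCT strftime('%Y-%m-01', date) AS month_timestamp, user_id
    FROM logins
)
SELECT
    date(prior.month_timestamp, '+1 month') AS month_timestamp,
    COUNT(DISTINCT prior.user_id) AS churned_users
FROM monthly prior
LEFT JOIN monthly cur
    ON cur.user_id = prior.user_id
   AND cur.month_timestamp = date(prior.month_timestamp, '+1 month')
WHERE cur.user_id IS NULL
GROUP BY 1
ORDER BY 1
""").fetchall()
print(rows)  # user 3 logged in during July but not August; the last row is the end-of-data artifact
```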
Note that this problem can also be solved with a LEFT or RIGHT join.

Part 3
Note: this is probably a harder task than you would get in a real interview. Think of it as a puzzle, or skip ahead to the next task.

Context: so, we did a good job on the two previous problems. For this new task, we now have a table of lost users, user_churns. If a user was active in the past month but not in the current one, they are entered into the table for the current month. Here's what user_churns looks like:

| user_id | month_date |
| --------- | ------------ |
| 1 | 2018-05-01 |
| 234 | 2018-05-01 |
| 3 | 2018-05-01 |
| 12 | 2018-05-01 |
| ... | ... |
| 234 | 2018-10-01 |
Task: now you want to run a cohort analysis of users active this month who had previously churned and have been reactivated. Create a table of these users. You can use the user_churns and logins tables to build the cohort. In Postgres, the current timestamp is available via current_timestamp.

Solution:

WITH user_login_data AS
(
SELECT
DATE_TRUNC('month', a.date) month_timestamp,
a.user_id,
MAX(b.month_date) as most_recent_churn,
MAX(DATE_TRUNC('month', c.date)) as most_recent_active
FROM
logins a
JOIN
user_churns b
ON a.user_id = b.user_id AND DATE_TRUNC('month', a.date) > b.month_date
JOIN
logins c
ON a.user_id = c.user_id
AND
DATE_TRUNC('month', a.date) > DATE_TRUNC('month', c.date)
WHERE
DATE_TRUNC('month', a.date) = DATE_TRUNC('month', current_timestamp)
GROUP BY
DATE_TRUNC('month', a.date),
a.user_id
HAVING
MAX(b.month_date) > MAX(DATE_TRUNC('month', c.date))
)
SELECT * FROM user_login_data
No. 4. Running total
Acknowledgment: this task is adapted from the SiSense blog article "Cash Flow Modeling in SQL."

Context: suppose we have a table transactions in this form:

| date | cash_flow |
| ------------ | ----------- |
| 2018-01-01 | -1000 |
| 2018-01-02 | -100 |
| 2018-01-03 | 50 |
| ... | ... |
where cash_flow is revenue minus costs for each day.

Objective: write a query to get a running total of cash flow for each day, so that you end up with a table in this form:

| date | cumulative_cf |
| ------------ | --------------- |
| 2018-01-01 | -1000 |
| 2018-01-02 | -1100 |
| 2018-01-03 | -1050 |
| ... | ... |
Solution:

SELECT
a.date date,
SUM(b.cash_flow) as cumulative_cf
FROM
transactions a
JOIN
transactions b ON a.date >= b.date
GROUP BY
a.date
ORDER BY
date ASC
An alternative solution uses a window function (more efficient!):

SELECT
date,
SUM(cash_flow) OVER (ORDER BY date ASC) as cumulative_cf
FROM
transactions
ORDER BY
date ASC
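The window-function version runs unchanged in SQLite (3.25+), which makes it easy to sanity-check against the expected output (sample data from the problem statement):

```python
import sqlite3

# Running total of cash flow via SUM() OVER (ORDER BY ...).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (date TEXT, cash_flow INTEGER);
INSERT INTO transactions VALUES
  ('2018-01-01', -1000), ('2018-01-02', -100), ('2018-01-03', 50);
""")
rows = conn.execute("""
SELECT
    date,
    SUM(cash_flow) OVER (ORDER BY date ASC) AS cumulative_cf
FROM transactions
ORDER BY date ASC
""").fetchall()
print(rows)  # [('2018-01-01', -1000), ('2018-01-02', -1100), ('2018-01-03', -1050)]
```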
No. 5. Moving Average
Acknowledgment: this task is adapted from the SiSense blog article "Moving Averages in MySQL and SQL Server."

Note: a moving average can be calculated in various ways. Here we use a trailing average, so the metric for the seventh day of the month is the average of the previous six days and the day itself.

Context: suppose we have a table signups in this form:

| date | sign_ups |
| ------------ | ---------- |
| 2018-01-01 | 10 |
| 2018-01-02 | 20 |
| 2018-01-03 | 50 |
| ... | ... |
| 2018-10-01 | 35 |
Task: write a query to get the 7-day moving average of daily sign-ups.

Solution:

SELECT
a.date,
AVG(b.sign_ups) average_sign_ups
FROM
signups a
JOIN
signups b ON a.date <= b.date + interval '6 days' AND a.date >= b.date
GROUP BY
a.date
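A quick way to convince yourself the join condition captures a trailing 7-day window is to run it on a small table; in SQLite, date(b.date, '+6 days') replaces b.date + interval '6 days' (sample data is illustrative):

```python
import sqlite3

# Trailing 7-day moving average via a range self-join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE signups (date TEXT, sign_ups INTEGER);
INSERT INTO signups VALUES
  ('2018-01-01', 10), ('2018-01-02', 20), ('2018-01-03', 50),
  ('2018-01-04', 80), ('2018-01-05', 40), ('2018-01-06', 70),
  ('2018-01-07', 30), ('2018-01-08', 60);
""")
rows = conn.execute("""
SELECT
    a.date,
    AVG(b.sign_ups) AS average_sign_ups
FROM signups a
JOIN signups b
    ON a.date <= date(b.date, '+6 days') AND a.date >= b.date
GROUP BY a.date
ORDER BY a.date
""").fetchall()
# The first days average over fewer than seven values; from day 7 on,
# each row is the mean of that day and the six days before it.
print(rows[-1])
```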
No. 6. Multiple join conditions
Acknowledgment: this task is adapted from the SiSense blog article "Analyzing Your Email with SQL."

Context: let's say our table emails contains emails sent from and received by the address zach@g.com:

| id | subject | from | to | timestamp |
| ---- | ---------- | -------------- | -------------- | --------------------- |
| 1 | Yosemite | zach@g.com | thomas@g.com | 2018-01-02 12:45:03 |
| 2 | Big Sur | sarah@g.com | thomas@g.com | 2018-01-02 16:30:01 |
| 3 | Yosemite | thomas@g.com | zach@g.com | 2018-01-02 16:35:04 |
| 4 | Running | jill@g.com | zach@g.com | 2018-01-03 08:12:45 |
| 5 | Yosemite | zach@g.com | thomas@g.com | 2018-01-03 14:02:01 |
| 6 | Yosemite | thomas@g.com | zach@g.com | 2018-01-03 15:01:05 |
| .. | .. | .. | .. | .. |
Task: write a query to get the response time for each email (id) sent to zach@g.com. Do not include emails sent to other addresses. Assume each thread has a unique subject. Keep in mind that a thread may have several back-and-forth emails between zach@g.com and other recipients.

Solution:

SELECT
a.id,
MIN(b.timestamp) - a.timestamp as time_to_respond
FROM
emails a
JOIN
emails b
ON
b.subject = a.subject
AND
a.to = b.from
AND
a.from = b.to
AND
a.timestamp < b.timestamp
WHERE
a.to = 'zach@g.com'
GROUP BY
a.id
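The multi-condition self-join can be tested with sqlite3; timestamp subtraction is done via strftime('%s', ...) to get seconds, and the from/to columns are double-quoted because they collide with SQL keywords (sample rows follow the problem statement):

```python
import sqlite3

# For each email to zach@g.com, find the earliest reply in the same thread.
conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE emails (id INTEGER, subject TEXT, "from" TEXT, "to" TEXT, timestamp TEXT);
INSERT INTO emails VALUES
  (1, 'Yosemite', 'zach@g.com',   'thomas@g.com', '2018-01-02 12:45:03'),
  (2, 'Big Sur',  'sarah@g.com',  'thomas@g.com', '2018-01-02 16:30:01'),
  (3, 'Yosemite', 'thomas@g.com', 'zach@g.com',   '2018-01-02 16:35:04'),
  (4, 'Running',  'jill@g.com',   'zach@g.com',   '2018-01-03 08:12:45'),
  (5, 'Yosemite', 'zach@g.com',   'thomas@g.com', '2018-01-03 14:02:01'),
  (6, 'Yosemite', 'thomas@g.com', 'zach@g.com',   '2018-01-03 15:01:05');
''')
rows = conn.execute('''
SELECT
    a.id,
    strftime('%s', MIN(b.timestamp)) - strftime('%s', a.timestamp) AS seconds_to_respond
FROM emails a
JOIN emails b
    ON b.subject = a.subject
   AND a."to" = b."from"
   AND a."from" = b."to"
   AND a.timestamp < b.timestamp
WHERE a."to" = 'zach@g.com'
GROUP BY a.id
''').fetchall()
print(rows)  # only email 3 received a reply (email 5); emails 4 and 6 never did
```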
Tasks for window functions
No. 1. Find the identifier with the maximum value
Context: suppose we have a table salaries with department and employee salary data in the following format:

depname | empno | salary |
----------- + ------- + -------- +
develop | 11 | 5200 |
develop | 7 | 4200 |
develop | 9 | 4500 |
develop | 8 | 6000 |
develop | 10 | 5200 |
personnel | 5 | 3500 |
personnel | 2 | 3900 |
sales | 3 | 4800 |
sales | 1 | 5000 |
sales | 4 | 4800 |
Task: write a query to get the empno with the highest salary. Make sure your solution handles ties!

Solution:

WITH max_salary AS (
SELECT
MAX(salary) max_salary
FROM
salaries
)
SELECT
s.empno
FROM
salaries s
JOIN
max_salary ms ON s.salary = ms.max_salary
Alternative solution using RANK():

WITH sal_rank AS
(SELECT
empno,
RANK() OVER(ORDER BY salary DESC) rnk
FROM
salaries)
SELECT
empno
FROM
sal_rank
WHERE
rnk = 1;
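Both variants run as-is in SQLite; here is the RANK() version checked against the sample table (empno 8 has the unique top salary of 6000):

```python
import sqlite3

# Find the empno(s) with the highest salary using RANK().
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salaries (depname TEXT, empno INTEGER, salary INTEGER);
INSERT INTO salaries VALUES
  ('develop', 11, 5200), ('develop', 7, 4200), ('develop', 9, 4500),
  ('develop', 8, 6000), ('develop', 10, 5200),
  ('personnel', 5, 3500), ('personnel', 2, 3900),
  ('sales', 3, 4800), ('sales', 1, 5000), ('sales', 4, 4800);
""")
rows = conn.execute("""
WITH sal_rank AS (
    SELECT empno, RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM salaries
)
SELECT empno FROM sal_rank WHERE rnk = 1
""").fetchall()
print(rows)  # [(8,)] -- with a tie at the top, RANK() would return every top-paid empno
```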
No. 2. Average value and ranking with a window function (several parts)
Part 1
Context: suppose we have a table salaries in this format:

depname | empno | salary |
----------- + ------- + -------- +
develop | 11 | 5200 |
develop | 7 | 4200 |
develop | 9 | 4500 |
develop | 8 | 6000 |
develop | 10 | 5200 |
personnel | 5 | 3500 |
personnel | 2 | 3900 |
sales | 3 | 4800 |
sales | 1 | 5000 |
sales | 4 | 4800 |
Task: write a query that returns the same table, but with a new column showing the average salary per department. We would expect a table like this:

depname | empno | salary | avg_salary |
----------- + ------- + -------- + ------------ +
develop | 11 | 5200 | 5020 |
develop | 7 | 4200 | 5020 |
develop | 9 | 4500 | 5020 |
develop | 8 | 6000 | 5020 |
develop | 10 | 5200 | 5020 |
personnel | 5 | 3500 | 3700 |
personnel | 2 | 3900 | 3700 |
sales | 3 | 4800 | 4867 |
sales | 1 | 5000 | 4867 |
sales | 4 | 4800 | 4867 |
Solution:

SELECT
*,
ROUND(AVG(salary),0) OVER (PARTITION BY depname) avg_salary
FROM
salaries
Part 2
Task: write a query that adds a column with each employee's rank within their department based on salary, where the employee with the highest salary gets rank 1. We would expect a table in this form:

depname | empno | salary | salary_rank |
----------- + ------- + -------- + ------------- +
develop | 11 | 5200 | 2 |
develop | 7 | 4200 | 5 |
develop | 9 | 4500 | 4 |
develop | 8 | 6000 | 1 |
develop | 10 | 5200 | 2 |
personnel | 5 | 3500 | 2 |
personnel | 2 | 3900 | 1 |
sales | 3 | 4800 | 2 |
sales | 1 | 5000 | 1 |
sales | 4 | 4800 | 2 |
Solution:

SELECT
*,
RANK() OVER(PARTITION BY depname ORDER BY salary DESC) salary_rank
FROM
salaries
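Both parts can be combined into a single query, which runs unchanged in SQLite (3.25+); this sketch checks the develop department against the expected tables above:

```python
import sqlite3

# Per-department average salary and salary rank in one pass over the table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salaries (depname TEXT, empno INTEGER, salary INTEGER);
INSERT INTO salaries VALUES
  ('develop', 11, 5200), ('develop', 7, 4200), ('develop', 9, 4500),
  ('develop', 8, 6000), ('develop', 10, 5200),
  ('personnel', 5, 3500), ('personnel', 2, 3900),
  ('sales', 3, 4800), ('sales', 1, 5000), ('sales', 4, 4800);
""")
rows = conn.execute("""
SELECT
    depname,
    empno,
    salary,
    ROUND(AVG(salary) OVER (PARTITION BY depname), 0) AS avg_salary,
    RANK() OVER (PARTITION BY depname ORDER BY salary DESC) AS salary_rank
FROM salaries
ORDER BY depname, salary_rank, empno
""").fetchall()
print(rows[0])  # ('develop', 8, 6000, 5020.0, 1)
```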
Other tasks of medium and high difficulty
No. 1. Histograms
Context: let's say we have a table sessions where each row represents a video streaming session with its length in seconds:

| session_id | length_seconds |
| ------------ | ---------------- |
| 1 | 23 |
| 2 | 453 |
| 3 | 27 |
| .. | .. |
Task: write a query to count the sessions that fall into five-second intervals, i.e., for the fragment above, the result would be something like this:

| bucket | count |
| --------- | ------- |
| 20-25 | 2 |
| 450-455 | 1 |
Extra credit for proper bucket labels ("5-10", etc.).

Solution:

WITH bin_label AS
(SELECT
session_id,
FLOOR(length_seconds/5) as bin_label
FROM
sessions
)
SELECT
CONCAT(CAST(bin_label*5 AS VARCHAR), '-', CAST(bin_label*5+5 AS VARCHAR)) bucket,
COUNT(DISTINCT session_id) count
FROM
bin_label
GROUP BY
bin_label
ORDER BY
bin_label ASC
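SQLite lacks FLOOR and (in older versions) CONCAT, but integer division already truncates and the || operator concatenates, so the same idea can be tested like this (the sample lengths are hypothetical, chosen so two sessions share a bucket):

```python
import sqlite3

# Bucket session lengths into 5-second bins and label each bucket.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (session_id INTEGER, length_seconds INTEGER);
INSERT INTO sessions VALUES (1, 23), (2, 453), (3, 24);
""")
rows = conn.execute("""
WITH bin_label AS (
    SELECT session_id, length_seconds / 5 AS bin_label  -- integer division floors here
    FROM sessions
)
SELECT
    CAST(bin_label * 5 AS TEXT) || '-' || CAST(bin_label * 5 + 5 AS TEXT) AS bucket,
    COUNT(DISTINCT session_id) AS count
FROM bin_label
GROUP BY bin_label
ORDER BY bin_label ASC
""").fetchall()
print(rows)  # [('20-25', 2), ('450-455', 1)]
```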
No. 2. Cross join (several parts)
Part 1
Context: let's say we have a table state_streams where each row lists a state and the total number of streaming hours from a video hosting service:

| state | total_streams |
| ------- | --------------- |
| NC | 34569 |
| SC | 33999 |
| CA | 98324 |
| MA | 19345 |
| .. | .. |
(In practice, aggregated tables like this usually have a date column, but we'll exclude it for this task.)

Task: write a query to get pairs of states whose total stream counts are within a thousand of each other. For the fragment above, we would expect something like:

| state_a | state_b |
| --------- | --------- |
| NC | SC |
| SC | NC |
Solution:

SELECT
a.state as state_a,
b.state as state_b
FROM
state_streams a
CROSS JOIN
state_streams b
WHERE
ABS(a.total_streams - b.total_streams) < 1000
AND
a.state <> b.state
For reference, cross joins can also be written without explicit join syntax:

SELECT
a.state as state_a,
b.state as state_b
FROM
state_streams a, state_streams b
WHERE
ABS(a.total_streams - b.total_streams) < 1000
AND
a.state <> b.state
Part 2
Note: this is a bonus question rather than a really important SQL pattern. Feel free to skip it!

Task: how can the SQL from the previous solution be modified to remove duplicates? For example, for the same sample table, the pair NC and SC should appear only once, not twice.

Solution:

SELECT
a.state as state_a,
b.state as state_b
FROM
state_streams a, state_streams b
WHERE
ABS(a.total_streams - b.total_streams) < 1000
AND
a.state > b.state
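The deduplication trick works by imposing an arbitrary strict order on each pair; a quick check with sqlite3 (sample data from the problem statement):

```python
import sqlite3

# Pairs of states with stream totals within 1000, each pair listed once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE state_streams (state TEXT, total_streams INTEGER);
INSERT INTO state_streams VALUES
  ('NC', 34569), ('SC', 33999), ('CA', 98324), ('MA', 19345);
""")
rows = conn.execute("""
SELECT a.state AS state_a, b.state AS state_b
FROM state_streams a, state_streams b
WHERE ABS(a.total_streams - b.total_streams) < 1000
  AND a.state > b.state
""").fetchall()
print(rows)  # [('SC', 'NC')] -- the strict > keeps exactly one ordering of the pair
```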
No. 3. Advanced calculations
Acknowledgment: this task is adapted from a discussion of a question I asked on StackOverflow (my username is zthomas.nc).

Note: this is probably a harder task than you would get in a real interview. Think of it as a puzzle, or feel free to skip it!

Context: suppose we have a table named table in this form, where the same user can have multiple class values:

| user | class |
| ------ | ------- |
| 1 | a |
| 1 | b |
| 1 | b |
| 2 | b |
| 3 | a |
Problem: suppose there are only two possible class values. Write a query to count the number of users in each class, where users with both labels a and b are assigned to class b. For our sample we get the following result:

| class | count |
| ------- | ------- |
| a | 1 |
| b | 2 |
Solution:

WITH usr_b_sum AS
(
SELECT
user,
SUM(CASE WHEN class = 'b' THEN 1 ELSE 0 END) num_b
FROM
table
GROUP BY
user
),
usr_class_label AS
(
SELECT
user,
CASE WHEN num_b > 0 THEN 'b' ELSE 'a' END class
FROM
usr_b_sum
)
SELECT
class,
COUNT(DISTINCT user) count
FROM
usr_class_label
GROUP BY
class
ORDER BY
class ASC
An alternative solution uses scalar subqueries in the SELECT list together with a UNION:

SELECT
'a' class,
(SELECT COUNT(DISTINCT user) FROM table) -
(SELECT COUNT(DISTINCT user) FROM table WHERE class = 'b') count
UNION
SELECT
'b' class,
(SELECT COUNT(DISTINCT user) FROM table WHERE class = 'b') count
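A check of the CTE solution with sqlite3; since table is a reserved word, the sketch renames it to user_classes (an assumed name), with the sample data from the problem:

```python
import sqlite3

# Assign each user to 'b' if they ever have label b, else 'a', then count per class.
# Note: the table is named user_classes here (hypothetical) because TABLE is reserved.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_classes (user INTEGER, class TEXT);
INSERT INTO user_classes VALUES (1, 'a'), (1, 'b'), (1, 'b'), (2, 'b'), (3, 'a');
""")
rows = conn.execute("""
WITH usr_b_sum AS (
    SELECT user, SUM(CASE WHEN class = 'b' THEN 1 ELSE 0 END) AS num_b
    FROM user_classes
    GROUP BY user
),
usr_class_label AS (
    SELECT user, CASE WHEN num_b > 0 THEN 'b' ELSE 'a' END AS class
    FROM usr_b_sum
)
SELECT class, COUNT(DISTINCT user) AS count
FROM usr_class_label
GROUP BY class
ORDER BY class ASC
""").fetchall()
print(rows)  # [('a', 1), ('b', 2)] -- user 1 has both labels and lands in class b
```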