This is a short SQL tutorial on how to do multiple transformation/aggregates of a table in order to perform analyses of the kind:
Given a list of grocery stores that includes how much fruit they've sold, per type of fruit, e.g.:
| store | fruit | price | sold |
|--------------|---------|-------|------|
| Safeway | Apples | 2.00 | 50 |
| Safeway | Oranges | 1.50 | 30 |
| Trader Joe's | Apples | 0.99 | 100 |
| Whole Foods | Apples | 5.99 | 0 |
| Whole Foods | Oranges | 2.50 | 7 |
Find all stores that:
- sold at least one apple
- sold only apples
- sold apples at a price below $1
- sold apples but not oranges
There are lots of ways to do such calculations, so I want to keep the SQL syntax as basic as possible, i.e. limited to the SQL syntax that I personally understand, and that is consistent among the major flavors of SQL, including SQLite, MySQL, and PostgreSQL.
So, basically, no DISTINCT
or WINDOW
. Just: GROUP BY
, COUNT
, and the concepts of subqueries and joins.
The queries in this tutorial work on MySQL, SQlite, and PostgreSQL alike, and you can use the following SQLFiddle to practice queries against:
http://sqlfiddle.com/#!9/24b2e4
Instead of grocery stores and fruit sales, let's use a simple mockup of criminal court data, in which criminal cases have multiple charges, e.g. a defendant can be charged with assault and DUI in a single case:
| rowid | casenum | charge | disposition |
|-------|---------|------------|-------------|
| 1 | A | assault | dismissed |
| 2 | B | battery | convicted |
| 3 | C | conspiracy | dismissed |
| 4 | C | conspiracy | dismissed |
| 5 | D | dui | dismissed |
| 6 | D | dui | convicted |
| 7 | D | assault | convicted |
| 8 | E | evilness | dismissed |
| 9 | E | assault | dismissed |
| 10 | E | assault | dismissed |
Given that data format, how do we find all cases...:
| casenum |
|---------|
| B |
| D |
| casenum | total_charges | total_dismissals |
|---------|---------------|------------------|
| A | 1 | 1 |
| C | 2 | 2 |
| E | 3 | 3 |
3. ...with at least one assault charge, but include all the charge-per-case data for contextual purposes
| rowid | casenum | charge | disposition |
|-------|---------|----------|-------------|
| 1 | A | assault | dismissed |
| 5 | D | dui | dismissed |
| 6 | D | dui | convicted |
| 7 | D | assault | convicted |
| 8 | E | evilness | dismissed |
| 9 | E | assault | dismissed |
| 10 | E | assault | dismissed |
| rowid | casenum | charge | disposition |
|-------|---------|----------|-------------|
| 5 | D | dui | dismissed |
| 6 | D | dui | convicted |
| 7 | D | assault | convicted |
| 8 | E | evilness | dismissed |
| 9 | E | assault | dismissed |
| 10 | E | assault | dismissed |
| casenum | total_charges | total_convictions |
|---------|---------------|-------------------|
| A | 1 | (null) |
| C | 2 | (null) |
| D | 3 | 2 |
If you don't want to use the SQLFiddle, here's the code to create and seed the example table. This statement (and most of the subsequent SELECT queries) should work in MySQL, PostgreSQL, and SQLite:
CREATE TABLE charges (
rowid INTEGER PRIMARY KEY,
casenum VARCHAR(255),
charge VARCHAR(255),
disposition VARCHAR(255)
);
INSERT INTO charges(rowid, casenum, charge, disposition)
VALUES
(1, 'A','assault','dismissed'),
(2, 'B','battery','convicted'),
(3, 'C','conspiracy','dismissed'),
(4, 'C','conspiracy','dismissed'),
(5, 'D','dui','dismissed'),
(6, 'D','dui','convicted'),
(7, 'D','assault','convicted'),
(8, 'E','evilness','dismissed'),
(9, 'E','assault','dismissed'),
(10, 'E','assault','dismissed')
;
This requires a GROUP BY
clause but not necessarily a COUNT
column, as every case without a conviction will be filtered out of the results by the WHERE
clause before the aggregate is done:
SELECT casenum
FROM charges
WHERE disposition = 'convicted'
GROUP BY casenum;
| casenum |
|---------|
| B |
| D |
However, if we want to include the number of convictions per case:
SELECT casenum,
COUNT(1) AS total_convictions
FROM charges
WHERE disposition = 'convicted'
GROUP BY casenum;
| casenum | total_convictions |
|---------|-------------------|
| B | 1 |
| D | 2 |
We don't need the HAVING
clause -- i.e. HAVING total_convictions > 0
-- because the WHERE
clause automatically eliminates all cases without a single convictions (by selecting only the rows that have a conviction before grouping them). If we wanted to find all cases with more than one conviction (which seems of little utility in a real-world analysis), we would have to use a HAVING
clause to filter the aggregated results:
SELECT casenum,
COUNT(1) AS total_convictions
FROM charges
WHERE disposition = 'convicted'
GROUP BY casenum
HAVING total_convictions > 1;
| casenum | total_convictions |
|---------|-------------------|
| D | 2 |
In other words, all cases in which the total number of charges is equal to the number of charges that were dismissed.
Conceptually, we create two aggregate tables:
- A table with 2 columns:
case_num
and a count of all charges per case:total_charges
- A table with 2 columns:
case_num
and a count of all dismissed charges per case:total_dismissals
The query that solves our problem is a INNER JOIN between the two tables on case_num
, in which total_dismissals
is equal to total_charges
.
SELECT casenum,
COUNT(1) AS total_charges
FROM charges
GROUP BY casenum;
| casenum | total_charges |
|---------|---------------|
| A | 1 |
| B | 1 |
| C | 2 |
| D | 3 |
| E | 3 |
SELECT casenum,
COUNT(1) AS total_dismissals
FROM charges
WHERE disposition = 'dismissed'
GROUP BY casenum;
Note that case B
does not show up because none of its charges resulted in a dismissal:
| casenum | total_dismissals |
|---------|------------------|
| A | 1 |
| C | 2 |
| D | 1 |
| E | 3 |
You could create two new tables, temporary or not, and then do the join on those two tables. But since the result of each of the SELECT statements effectively results in a new table, we can just execute both of the aforementioned statements, alias and then join their results, to save us the tedium of creating and deleting temp tables.
Here's how to do it with subqueries:
SELECT tc.casenum,
tc.total_charges,
td.total_dismissals
FROM
(SELECT casenum, COUNT(1) AS total_charges
FROM charges
GROUP BY casenum)
AS tc
INNER JOIN
(SELECT casenum, COUNT(1) AS total_dismissals
FROM charges
WHERE disposition = 'dismissed'
GROUP BY casenum)
AS td
ON tc.casenum = td.casenum
WHERE tc.total_charges = td.total_dismissals;
Note that including tc.total_charges
and td.total_dismissals
is unnecessary, but we include it here for a sanity check in the results:
| casenum | total_charges | total_dismissals |
|---------|---------------|------------------|
| A | 1 | 1 |
| C | 2 | 2 |
| E | 3 | 3 |
Another side note: it's possible to eliminate the WHERE
clause by putting its condition into the ON
clause:
\\ ...
ON tc.casenum = td.casenum
AND tc.total_charges = td.total_dismissals;
The result would be the same as the previous query, but in my opinion, the code does not describe what we want conceptually, and it may be code that is harder to maintain in terms of preventing errors if we incorporate it into a non-trivial analysis.
And I believe it would screw up the efficiency of the query, as casenum
would presumably be indexed, but total_charges
and total_dismissals
would not be.
If you're using a SQL variant that is not MySQL, you can achieve the same result using a more declarative (and arguably more readable) style with the WITH
common table expression -- hat-tip to this great listicle of SQL tricks:
WITH
tc AS (SELECT casenum,
COUNT(1) AS total_charges
FROM charges
GROUP BY casenum),
td AS (SELECT casenum,
COUNT(1) AS total_dismissals
FROM charges
WHERE disposition = 'dismissed'
GROUP BY casenum)
SELECT tc.casenum,
tc.total_charges,
td.total_dismissals
FROM tc
INNER JOIN td
ON tc.casenum = td.casenum
WHERE tc.total_charges = td.total_dismissals;
Hoa Nguyen on the NICAR-L mailing list pointed out a much simpler method that requires no JOINs and far less code:
SELECT casenum
FROM charges
WHERE casenum
NOT IN (SELECT casenum FROM charges WHERE disposition != 'dismissed')
GROUP BY casenum;
My main objection was that I had thought this would always be less optimal than a join, but according to EXPLAIN
on MySQL 5.6.x, I would be very wrong. And the semantics of Hoa's query is likely more intuitive to more people. That said, there are some limitations (specifically with MySQL) when it coems to subqueries (see documentation here). And, in my opinion, the immediate clarity of this code will become much more muddled when doing more complicated JOINs. (but I don't have an examples at the moment to clearly illustrate this)
This seems like it'd be similar to the first example, except with a different WHERE
condition:
SELECT casenum
FROM charges
WHERE charge = 'assault'
GROUP BY casenum;
But the aggregate destroys the individual charge-per-case detail, e.g. how cases D and E each have more than one charge:
| casenum |
|---------|
| A |
| D |
| E |
Sometimes for research and reporting purposes, we want a result table with as much detail as possible, to make it easier to browse the filtered cases.
So to get all cases that include at least one assault charge, but include all other charges for each case, as well as all the original columns, we'll need to do a subquery and INNER JOIN.
Note how the main SELECT
statement has no GROUP BY
clause -- this allows us to list each charge per case:
SELECT charges.*
FROM charges
INNER JOIN
(SELECT casenum
FROM charges
WHERE charge = 'assault'
GROUP BY casenum)
AS assault_cases
ON assault_cases.casenum = charges.casenum;
| rowid | casenum | charge | disposition |
|-------|---------|----------|-------------|
| 1 | A | assault | dismissed |
| 5 | D | dui | dismissed |
| 6 | D | dui | convicted |
| 7 | D | assault | convicted |
| 8 | E | evilness | dismissed |
| 9 | E | assault | dismissed |
| 10 | E | assault | dismissed |
4. All cases that have multiple charges and at least one assault charge, and include the charge-per-case detail
This example is meant to show why we might be interested in doing the third example in the first place. It uses virtually the same code as in the second and third examples, stacked together.
It requires joining together 3 tables:
charges
- this is the originalcharges
table, becausue we need access to the original rows and columnsta
- this table is the result of a subquery that filters thecharges
include only charges of'assault'
, then does aGROUP BY casenum
to return only rows with uniquecasenum
values.tc
- this subquery returns the uniquecasenum
for each case fromcharges
, along with total count of charges per case, i.e. aGROUP BY casenum
.
SELECT charges.*
FROM charges
INNER JOIN
(SELECT casenum
FROM charges
WHERE charge = 'assault'
GROUP BY casenum)
AS ta
ON ta.casenum = charges.casenum
INNER JOIN
(SELECT casenum,
COUNT(1) AS total_charges
FROM charges
GROUP BY casenum)
AS tc
ON tc.casenum = ta.casenum
WHERE tc.total_charges > 1;
Note how case A
is eliminated from the results, as it has only one charge total:
| rowid | casenum | charge | disposition |
|-------|---------|----------|-------------|
| 5 | D | dui | dismissed |
| 6 | D | dui | convicted |
| 7 | D | assault | convicted |
| 8 | E | evilness | dismissed |
| 9 | E | assault | dismissed |
| 10 | E | assault | dismissed |
It's probably more conceptually clearer to eliminate the WHERE
clause in the main SELECT
and use a HAVING
clause in the subquery that finds total charges per case:
SELECT charges.*
FROM charges
INNER JOIN
(SELECT casenum
FROM charges
WHERE charge = 'assault'
GROUP BY casenum)
AS ta
ON ta.casenum = charges.casenum
INNER JOIN
(SELECT casenum,
COUNT(1) AS total_charges
FROM charges
GROUP BY casenum
HAVING total_charges > 1)
AS tc
ON tc.casenum = ta.casenum
;
This is just a join between two tables created by subqueries. But it's different than when we wanted to find cases in which every charge was dismissed, i.e.
- Cases in which total number of dismissals is equal to total number of charges.
By definition, such cases have at least one dismissal.
However, a case in which not all charges were dismissed includes:
- Cases in which total number dismissals is less than total number of charges.
- Cases in which there were no dismissals
In order to capture the second condition, we need to use LEFT JOIN
instead of a INNER JOIN
, as well as add a NULL
check in the WHERE
clause:
SELECT
tc.casenum,
tc.total_charges,
td.total_dismissals
FROM
(SELECT casenum,
COUNT(1) AS total_charges
FROM charges
GROUP BY casenum)
AS tc
LEFT JOIN
(SELECT casenum,
COUNT(1) AS total_dismissals
FROM charges
WHERE disposition = 'dismissed'
GROUP BY casenum)
AS td
ON tc.casenum = td.casenum
WHERE
tc.total_charges != td.total_dismissals
OR td.total_dismissals IS NULL;
| casenum | total_charges | total_convictions |
|---------|---------------|-------------------|
| A | 1 | (null) |
| C | 2 | (null) |
| D | 3 | 2 |
TODO: (fill this out later)
Finding cases that have at least one conviction is much easier than finding cases without a conviction:
SELECT allcases.casenum
FROM
(SELECT casenum
FROM charges
GROUP BY casenum)
AS allcases
LEFT JOIN
(SELECT casenum
FROM charges
WHERE disposition = 'convicted'
GROUP BY casenum)
AS convictions
ON allcases.casenum = convictions.casenum
WHERE convictions.casenum IS NULL;
| casenum |
|---------|
| A |
| C |
| E |