SQL: GROUP BY and HAVING | Study

GROUP BY reshapes a query result so that rows sharing a key value collapse into a single row per key. Aggregate functions (COUNT, SUM, AVG, MIN, MAX) then run within each group instead of across the whole table.

Basic shape

SELECT customer_id, COUNT(*) AS order_count, SUM(total) AS revenue
FROM orders
GROUP BY customer_id;

One row per customer_id, with the count and revenue computed within that customer's orders.

The selection rule

Every column in the SELECT clause must either:

Appear in the GROUP BY clause, or
Be wrapped in an aggregate function.

Otherwise the database can't decide which value of that column to show — there are many rows per group, and only one slot in the output. PostgreSQL and standard SQL enforce this strictly. MySQL historically allowed it and would pick an arbitrary value, which is a footgun.

WHERE vs HAVING — the order matters

The logical order of operations in a SQL query is:

FROM / JOIN — assemble rows
WHERE — filter rows
GROUP BY — collapse into groups
HAVING — filter groups
SELECT — pick columns
ORDER BY — sort
LIMIT — cap

WHERE filters individual rows before grouping. HAVING filters whole groups after aggregation. They are not interchangeable.

SELECT customer_id, SUM(total) AS revenue
FROM orders
WHERE status = 'completed'      -- drop incomplete orders before grouping
GROUP BY customer_id
HAVING SUM(total) > 1000;       -- keep only customers with >$1000 in completed orders

You can't reference SUM(total) in WHERE — it doesn't exist yet at that stage. You can't reference status in HAVING (without re-aggregating) — individual row values are gone after grouping.

GROUP BY multiple columns

SELECT region, product, SUM(units) AS units_sold
FROM sales
GROUP BY region, product;

One row per (region, product) combination. The order of columns in GROUP BY doesn't change which rows appear, but it can affect query plans and how some databases sort the output.

Common gotcha

COUNT(*) counts rows. COUNT(column) counts non-NULL values in that column. They differ when nulls are present:

SELECT COUNT(*), COUNT(email) FROM users;   -- second number is smaller if any email is NULL