SQL: GROUP BY and HAVING
GROUP BY collapses rows that share a key into one row per group so aggregates (COUNT, SUM, AVG) can be applied. HAVING filters groups after aggregation, where WHERE can't reach.
GROUP BY reshapes a query result so that rows sharing a key value collapse into a single row per key. Aggregate functions (COUNT, SUM, AVG, MIN, MAX) then run within each group instead of across the whole table.
Basic shape
SELECT customer_id, COUNT(*) AS order_count, SUM(total) AS revenue
FROM orders
GROUP BY customer_id;
One row per customer_id, with the count and revenue computed within that customer's orders.
The selection rule
Every column in the SELECT clause must either:
- Appear in the
GROUP BYclause, or - Be wrapped in an aggregate function.
Otherwise the database can't decide which value of that column to show — there are many rows per group, and only one slot in the output. PostgreSQL and standard SQL enforce this strictly. MySQL historically allowed it and would pick an arbitrary value, which is a footgun.
WHERE vs HAVING — the order matters
The logical order of operations in a SQL query is:
FROM/JOIN— assemble rowsWHERE— filter rowsGROUP BY— collapse into groupsHAVING— filter groupsSELECT— pick columnsORDER BY— sortLIMIT— cap
WHERE filters individual rows before grouping. HAVING filters whole groups after aggregation. They are not interchangeable.
SELECT customer_id, SUM(total) AS revenue
FROM orders
WHERE status = 'completed' -- drop incomplete orders before grouping
GROUP BY customer_id
HAVING SUM(total) > 1000; -- keep only customers with >$1000 in completed orders
You can't reference SUM(total) in WHERE — it doesn't exist yet at that stage. You can't reference status in HAVING (without re-aggregating) — individual row values are gone after grouping.
GROUP BY multiple columns
SELECT region, product, SUM(units) AS units_sold
FROM sales
GROUP BY region, product;
One row per (region, product) combination. The order of columns in GROUP BY doesn't change which rows appear, but it can affect query plans and how some databases sort the output.
Common gotcha
COUNT(*) counts rows. COUNT(column) counts non-NULL values in that column. They differ when nulls are present:
SELECT COUNT(*), COUNT(email) FROM users; -- second number is smaller if any email is NULL