In SQLite I can do

SELECT * FROM tbl GROUP BY col

and all the results are grouped by col, which means that if some rows have the same col value, only the last one will be included.

Now the same does not work in Postgres or BigQuery.

I get errors saying that I need to specify all the columns I am selecting, in the group statement. But if I do that, then result is not what I expect because now grouping is done by all these columns, not just on one column, so I will get records with duplicate col values...

Any way to achieve SQLite-like grouping in these other dbs?
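The SQLite behavior described above can be sketched with Python's built-in `sqlite3` module; the table name, columns, and sample data here are made up for illustration:

```python
import sqlite3

# Hypothetical table mirroring the question: several rows share the same
# value in "col"; "payload" distinguishes them within a group.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col TEXT, payload INTEGER)")
conn.executemany("INSERT INTO tbl VALUES (?, ?)",
                 [("a", 1), ("a", 2), ("b", 3), ("b", 4)])

# SQLite accepts bare columns alongside GROUP BY: one row per group,
# with "payload" taken from some row in that group.
rows = conn.execute("SELECT col, payload FROM tbl GROUP BY col").fetchall()
print(rows)  # one row for 'a', one for 'b'; which payload appears is arbitrary
```

Running the same `SELECT col, payload FROM tbl GROUP BY col` against PostgreSQL or BigQuery raises the "column must appear in the GROUP BY clause or be used in an aggregate function" class of error the question describes.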
Asked yesterday by Alex; edited 22 hours ago by Zegarek.
  • How do you know which value is selected for the other columns when they're not duplicates? Is the system free to choose which value is returned? Say column A has values 1, 2, 3, 4, but you're grouping by column B: on each run the system could return any of 1, 2, 3, 4. Worse yet, if you have a column C with values A, B, C, D respectively, the system could choose 1 for column A and D for column C (different records). So do you really mean you want *any* value, grouped by column B? Sample data and expected results would help explain what you're after; generally, people want values from the same record. – xQbert Commented yesterday
  • MySQL would support this, but it's not standard SQL. Such RDBMS engines extend GROUP BY with a non-standard implementation: you get one record for each distinct value of your column, and the system is then free to pick the other columns' values from any record. Is this what you're after? You'd need to look for GROUP BY extensions in the other RDBMS. – xQbert Commented yesterday
  • Note that strictly speaking "only the last one will be included" is incorrect: The value of bare columns is usually undefined, with the exception of special behavior around very simple min/max aggregates. You must not rely on the behavior you describe anyway. – user2722968 Commented 4 hours ago
  • @user2722968 It's already here. – Zegarek Commented 4 hours ago

2 Answers

This is a more standard way of achieving what I think you're after, and it should be supported by most RDBMSs that offer CTEs and window functions.

We use the ROW_NUMBER() window function, partitioning by your column and assigning a "random" order. This returns one record for each distinct value of col, with one arbitrary value from each of the other columns:

WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY col ORDER BY random()) AS rn
    FROM table_name
)
SELECT *
FROM cte
WHERE rn = 1;
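As a quick sanity check, here is the same pattern run through Python's `sqlite3` module; the table and sample data are invented, and window functions require SQLite 3.25 or later:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (col TEXT, payload INTEGER)")
conn.executemany("INSERT INTO table_name VALUES (?, ?)",
                 [("a", 1), ("a", 2), ("b", 3)])

# Requires SQLite 3.25+ for window function support.
query = """
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY col ORDER BY random()) AS rn
    FROM table_name
)
SELECT * FROM cte WHERE rn = 1
"""
rows = conn.execute(query).fetchall()
print(rows)  # one arbitrary row per distinct col, each with rn = 1
```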

The ability to include bare columns in a query is an SQLite-specific extension.

In PostgreSQL and BigQuery, you can use any_value() for that:
demo at db<>fiddle

select (any_value(tbl)).* from tbl group by tbl.col;

The function accepts one argument, but you can give it the entire row, then use ().* to unpack it.

If that doesn't look like it's matching your description, it's because that description isn't entirely accurate. In SQLite, if some rows have the same col value, a random one (not necessarily the last) will be included, unless you also called min() or max(), exactly once. Quoting 2.5. Bare columns in an aggregate query (same place the opening quote comes from):

SELECT a, b, max(c) FROM tab1 GROUP BY a;

If there is exactly one min() or max() aggregate in the query, then all bare columns in the result set take values from an input row which also contains the minimum or maximum. So in the query above, the value of the "b" column in the output will be the value of the "b" column in the input row that has the largest "c" value. There are limitations on this special behavior of min() and max():

  1. If the same minimum or maximum value occurs on two or more rows, then bare values might be selected from any of those rows. The choice is arbitrary. There is no way to predict from which row the bare values will be chosen. The choice might be different for different bare columns within the same query.

  2. If there are two or more min() or max() aggregates in the query, then bare column values will be taken from one of the rows on which one of the aggregates has their minimum or maximum value. The choice of which min() or max() aggregate determines the selection of bare column values is arbitrary. The choice might be different for different bare columns within the same query.

  3. This special processing for min() or max() aggregates only works for the built-in implementation of those aggregates. If an application overrides the built-in min() or max() aggregates with application-defined alternatives, then the values selected for bare columns will be taken from an arbitrary row.
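The special min()/max() bare-column behavior quoted above can be observed directly; a minimal sketch with Python's `sqlite3` module, using made-up data for `tab1`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab1 (a TEXT, b TEXT, c INTEGER)")
conn.executemany("INSERT INTO tab1 VALUES (?, ?, ?)",
                 [("x", "low", 1), ("x", "high", 9),
                  ("y", "low", 2), ("y", "high", 8)])

# With exactly one max() aggregate and unique maxima per group, the bare
# column "b" is taken from the row holding each group's maximum of "c".
rows = conn.execute(
    "SELECT a, b, max(c) FROM tab1 GROUP BY a ORDER BY a").fetchall()
print(rows)  # [('x', 'high', 9), ('y', 'high', 8)]
```

Because each group's maximum is unique here, `b` is deterministic; with ties, caveat 1 above applies.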

I agree with xQbert's idea that if you're after a more portable construct, window functions are adopted widely enough for row_number() to be a pretty safe bet. My one remark is that there's no need to waste time on order by if you're after random/arbitrary rows:
example you can switch between all databases available on db<>fiddle

select *
from (select *, row_number() over (partition by col
                                   order by col) as n
      from tbl) as s
where n = 1;

Oracle and SQL Server require that order by to be present; PostgreSQL, SQLite, BigQuery and MySQL don't need it at all. By using the same column for both partition by and order by, you maintain portability without adding needless operations - at least in PostgreSQL, this combination gets the same plan as a version that skips order by entirely.
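The portable subquery form can likewise be sketched with Python's `sqlite3` module (invented sample data; window functions need SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col TEXT, payload INTEGER)")
conn.executemany("INSERT INTO tbl VALUES (?, ?)",
                 [("a", 1), ("a", 2), ("b", 3)])

# Portable form: the partition column doubles as the (no-op) sort key,
# satisfying engines that insist on ORDER BY inside OVER().
query = """
SELECT *
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY col ORDER BY col) AS n
      FROM tbl) AS s
WHERE n = 1
"""
rows = conn.execute(query).fetchall()
print(rows)  # one row per distinct col; the extra n column is always 1
```

Note that the helper column `n` comes back with the results, which is the caveat discussed next.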
One caveat is that you need to deal with the additional column:

  1. SELECT * EXCEPT(n), if the syntax allows it (PostgreSQL and SQLite don't have it, BigQuery does).
  2. Select all but that one, explicitly naming every field (IDEs can do that for you anyway, and the select * you're sacrificing is often considered an anti-pattern).
  3. Ignore it, especially if this query feeds something else that's using an explicit column name list.

Oracle doesn't have that problem because it doesn't allow select * in a subquery or CTE.
