admin管理员组

文章数量:1334887

I have an issue with Athena when trying to query records between dates that are stored as string.

My csv dataset, spread among several files across directories, has a quote_date column with 10/8/2024 format. I created the prices table with quote_date as string (not date) otherwise the column would result empty - in my understanding because the string->date conversion expects an ISO format.

ID,QUOTE_DATE,PRICE
435,10/2/2024,5.80
123,10/12/2024,5.62
489,10/28/2024,5.04

I want to query records between dates on that column.

For some reason this works fine:

select
    customer_id,
    q_date,
    normalized_price
from (
    select *, date_parse(quote_date,'%m/%d/%Y') as q_date
    from prices
    where customer_id is not null
) as tbl
where
    customer_id = 435
order by q_date desc
#   customer_id   q_date                     normalized_price
1   435           2024-11-19 00:00:00.000    5.62
2   435           2024-11-19 00:00:00.000    5.54
3   435           2024-11-19 00:00:00.000    5.2
...

but for some reason this, adding the filter on q_date, breaks, and unexpectedly on the quote_date column (?!):

select
    customer_id,
    q_date,
    normalized_price
from (
    select *, date_parse(quote_date,'%m/%d/%Y') as q_date
    from prices
    where customer_id is not null
) as tbl
where
    customer_id = 435
  AND
    date(q_date) > date('2024-11-15')
order by q_date desc

INVALID_FUNCTION_ARGUMENT: Invalid format: “QUOTE_DATE” This query ran against the “athena_dtn” database, unless qualified by the query.

Checking the data I can't see any malformed/anexpected string format in the quote_date values, so I really can't figure out what would be the issue.

Any suggestions?


UPDATE

It looks like there is something wrong somewhere when converting the date string.

  • The following query returns INVALID_FUNCTION_ARGUMENT: Invalid format: "QUOTE_DATE"

    SELECT * FROM prices
    WHERE customer_id = 435
      AND date_parse(quote_date, '%m/%d/%Y') > DATE('2024-11-15')
    
  • The following one, adding try() on the date parsing, works just fine (returning the expected results)

    SELECT * FROM prices
    WHERE customer_id = 435
       AND try(date_parse(quote_date, '%m/%d/%Y')) > DATE('2024-11-15')
    
  • The following query works as well, but returning no results (!)

    SELECT * FROM prices
    WHERE customer_id = 435
      AND try(date_parse(quote_date, '%m/%d/%Y')) IS NULL
    

I guess this solves my blocking issue, however it would be nice to learn what might be the underlying problem if anybody can provide some insight on the conversion

I have an issue with Athena when trying to query records between dates that are stored as string.

My csv dataset, spread among several files across directories, has a quote_date column with 10/8/2024 format. I created the prices table with quote_date as string (not date) otherwise the column would result empty - in my understanding because the string->date conversion expects an ISO format.

ID,QUOTE_DATE,PRICE
435,10/2/2024,5.80
123,10/12/2024,5.62
489,10/28/2024,5.04

I want to query records between dates on that column.

For some reason this works fine:

select
    customer_id,
    q_date,
    normalized_price
from (
    select *, date_parse(quote_date,'%m/%d/%Y') as q_date
    from prices
    where customer_id is not null
) as tbl
where
    customer_id = 435
order by q_date desc
#   customer_id   q_date                     normalized_price
1   435           2024-11-19 00:00:00.000    5.62
2   435           2024-11-19 00:00:00.000    5.54
3   435           2024-11-19 00:00:00.000    5.2
...

but for some reason this, adding the filter on q_date, breaks, and unexpectedly on the quote_date column (?!):

select
    customer_id,
    q_date,
    normalized_price
from (
    select *, date_parse(quote_date,'%m/%d/%Y') as q_date
    from prices
    where customer_id is not null
) as tbl
where
    customer_id = 435
  AND
    date(q_date) > date('2024-11-15')
order by q_date desc

INVALID_FUNCTION_ARGUMENT: Invalid format: “QUOTE_DATE” This query ran against the “athena_dtn” database, unless qualified by the query.

Checking the data I can't see any malformed/anexpected string format in the quote_date values, so I really can't figure out what would be the issue.

Any suggestions?


UPDATE

It looks like there is something wrong somewhere when converting the date string.

  • The following query returns INVALID_FUNCTION_ARGUMENT: Invalid format: "QUOTE_DATE"

    SELECT * FROM prices
    WHERE customer_id = 435
      AND date_parse(quote_date, '%m/%d/%Y') > DATE('2024-11-15')
    
  • The following one, adding try() on the date parsing, works just fine (returning the expected results)

    SELECT * FROM prices
    WHERE customer_id = 435
       AND try(date_parse(quote_date, '%m/%d/%Y')) > DATE('2024-11-15')
    
  • The following query works as well, but returning no results (!)

    SELECT * FROM prices
    WHERE customer_id = 435
      AND try(date_parse(quote_date, '%m/%d/%Y')) IS NULL
    

I guess this solves my blocking issue, however it would be nice to learn what might be the underlying problem if anybody can provide some insight on the conversion

Share Improve this question edited Nov 21, 2024 at 18:52 Stefano asked Nov 20, 2024 at 20:18 StefanoStefano 7041 gold badge6 silver badges20 bronze badges 5
  • Can you show us some examples of the data in the actual input files, particularly how the date is stored? – John Rotenstein Commented Nov 20, 2024 at 23:13
  • @JohnRotenstein I added a simplified extract of the csv files – Stefano Commented Nov 21, 2024 at 18:26
  • Can you show us the actual data in the CSV file? It might be that the date is not in the format you expect. – John Rotenstein Commented Nov 21, 2024 at 21:02
  • I have hundreds of similar files, with the same format. There might be though some unexpected non utf-8 character somewhere, but that doesn't show up on an HTML page... Maybe the try function helps with that? ¯_(ツ)_/¯ – Stefano Commented Nov 21, 2024 at 22:23
  • It might be that the dates aren't in M/D/Y format, or it might be that some lines have bad formats or bad characters. You should start by creating a file with some random lines from the normal CSV files and see whether it is working as expected. It would be great to identify a bad line to figure out what is happening. – John Rotenstein Commented Nov 21, 2024 at 22:37
Add a comment  | 

1 Answer 1

Reset to default 0

date is:

an alias for CAST(x AS date).

It expects a date, not timestamp, you can parse the timestamp and process it:

-- sample data
WITH dataset(q_date) as (
    values ('2024-11-19 00:00:00.000'),
           ('2024-11-18 01:00:00.000')
)

-- query
select date_parse(q_date, '%Y-%m-%d %T.%f')
from dataset
where date_parse(q_date, '%Y-%m-%d %T.%f') >= date '2024-11-19';

本文标签: sqlError in Athena when trying to query records between dates that are stored as stringStack Overflow