admin管理员组文章数量:1303374
Question:
Is it possible to create an MD5 digest between IICS and Snowflake that both match and can be productionized, as seen in the column outpus for MD5_ZERO and MD5_FIVE in the code snippet below?
Details:
I am working to determine record updates by hashing column inputs on NUMERICS. The datawarehouse is Snowflake and the ETL tool is Informatica's IICS Data Integration (i.e., mapping).
Basically, IICS is creating an MD5 digest that only matches Snowflake when the input to MD5 is the numeric string representation that I type in manually (i.e., hard-coded). But, this does not work for something that I am trying to productionize where the function hashes values from the table rather than what I hard-code.
I have tried several variations and permutations in both Snowflake and IICS, but for the life of me, I cannot get the two to create a MD5 digest that matches. I checked the documentation for both Snowflake and IICs (see links below) without success.
Example:
For example, when hashing 5.000000
defined as NUMERIC(38,6), I get the following digests in a productionized setting:
- IICS:
E4DA3B7FBBCE2345D7772B0674A318D5
- Snowflake:
E4DA3B7FBBCE2345D7772B0674A318D5
- MD5_FIVE <-- I need this - Snowflake:
463DAAEAA59E931EC54B0D6375A05A0F
- MD5_BAR <-- But I get this
What is happening here is that the MD5 in Snowflake is different when hashing a defined NUMERIC(38,6) value (i.e., 5.000000) that is stored on the table itself than when just manually typing it into the MD5() function itself. So, something like the query below yields a different MD5, which is unexpected - they should be the same.
SELECT MD5(TO_CHAR(FOO)) AS MD5_FOO
,MD5(5.000000) AS MD5_FIVE
FROM TMP
I need Snowflake to generate an MD5 digest that is equal to the output of MD5(5.000000), but as read from the table and not hard-coded into the function itself.
Alternatively, is this potentially a bug with Snowflake's MD5 hashing function?
Troubleshooting example here:
CREATE TABLE TMP (
FOO NUMBER(38,6),
BAR NUMBER(38,6)
);
INSERT INTO TMP VALUES(0, 5)
;
-- Note: When copying the field value from the table, the INSERTED values now have trailing zeroes.
SELECT *
,MD5(0.000000) AS MD5_ZERO
,MD5(5.000000) AS MD5_FIVE
,MD5(TO_VARCHAR(FOO::VARCHAR(38))) AS MD5_FOO
,MD5(TO_VARCHAR(BAR::VARCHAR(38))) AS MD5_BAR
,MD5(TO_CHAR(BAR, '">"$99.0"<"')) AS D2_1
,MD5(TO_CHAR(BAR, '">"00000000000000000000000000000000.000000"<"')) AS D2_2
,MD5(TO_CHAR(BAR, '">"00000000000000000000000000000000.000000"<"')) AS D2_3
,MD5(TO_CHAR(BAR, '">"B9,999.0"<"')) AS D4_1
,MD5(TO_CHAR(BAR, '">"TM"<"')) AS TM
,MD5(TO_CHAR(BAR, '">"TME"<"')) AS TME
,MD5(TO_CHAR(BAR, '">"TM9"<"')) AS TM9
,MD5(TO_CHAR(BAR, '">"0XXX"<"')) AS X4
,MD5(TO_CHAR(BAR, '">"S0XXX"<"')) AS SX4
FROM TMP
;
.html
Question:
Is it possible to create an MD5 digest between IICS and Snowflake that both match and can be productionized, as seen in the column outpus for MD5_ZERO and MD5_FIVE in the code snippet below?
Details:
I am working to determine record updates by hashing column inputs on NUMERICS. The datawarehouse is Snowflake and the ETL tool is Informatica's IICS Data Integration (i.e., mapping).
Basically, IICS is creating an MD5 digest that only matches Snowflake when the input to MD5 is the numeric string representation that I type in manually (i.e., hard-coded). But, this does not work for something that I am trying to productionize where the function hashes values from the table rather than what I hard-code.
I have tried several variations and permutations in both Snowflake and IICS, but for the life of me, I cannot get the two to create a MD5 digest that matches. I checked the documentation for both Snowflake and IICs (see links below) without success.
Example:
For example, when hashing 5.000000
defined as NUMERIC(38,6), I get the following digests in a productionized setting:
- IICS:
E4DA3B7FBBCE2345D7772B0674A318D5
- Snowflake:
E4DA3B7FBBCE2345D7772B0674A318D5
- MD5_FIVE <-- I need this - Snowflake:
463DAAEAA59E931EC54B0D6375A05A0F
- MD5_BAR <-- But I get this
What is happening here is that the MD5 in Snowflake is different when hashing a defined NUMERIC(38,6) value (i.e., 5.000000) that is stored on the table itself than when just manually typing it into the MD5() function itself. So, something like the query below yields a different MD5, which is unexpected - they should be the same.
SELECT MD5(TO_CHAR(FOO)) AS MD5_FOO
,MD5(5.000000) AS MD5_FIVE
FROM TMP
I need Snowflake to generate an MD5 digest that is equal to the output of MD5(5.000000), but as read from the table and not hard-coded into the function itself.
Alternatively, is this potentially a bug with Snowflake's MD5 hashing function?
Troubleshooting example here:
CREATE TABLE TMP (
FOO NUMBER(38,6),
BAR NUMBER(38,6)
);
INSERT INTO TMP VALUES(0, 5)
;
-- Note: When copying the field value from the table, the INSERTED values now have trailing zeroes.
SELECT *
,MD5(0.000000) AS MD5_ZERO
,MD5(5.000000) AS MD5_FIVE
,MD5(TO_VARCHAR(FOO::VARCHAR(38))) AS MD5_FOO
,MD5(TO_VARCHAR(BAR::VARCHAR(38))) AS MD5_BAR
,MD5(TO_CHAR(BAR, '">"$99.0"<"')) AS D2_1
,MD5(TO_CHAR(BAR, '">"00000000000000000000000000000000.000000"<"')) AS D2_2
,MD5(TO_CHAR(BAR, '">"00000000000000000000000000000000.000000"<"')) AS D2_3
,MD5(TO_CHAR(BAR, '">"B9,999.0"<"')) AS D4_1
,MD5(TO_CHAR(BAR, '">"TM"<"')) AS TM
,MD5(TO_CHAR(BAR, '">"TME"<"')) AS TME
,MD5(TO_CHAR(BAR, '">"TM9"<"')) AS TM9
,MD5(TO_CHAR(BAR, '">"0XXX"<"')) AS X4
,MD5(TO_CHAR(BAR, '">"S0XXX"<"')) AS SX4
FROM TMP
;
https://docs.snowflake/en/sql-reference/functions/to_char#examples-that-convert-numbers
https://docs.informatica/data-integration/powercenter/10-5/transformation-language-reference/functions/to_char--numbers-.html
Share Improve this question edited Feb 5 at 14:44 J Weezy asked Feb 4 at 21:13 J WeezyJ Weezy 3,9575 gold badges42 silver badges98 bronze badges 2 |2 Answers
Reset to default 1The issue is with inconsistent inputs to the hashing function.
Using your troubleshooting example as a starting point, here's a table with two columns, one of type NUMBER
with default precision and scale ((38,0)
) and the other with type NUMBER
with explicitly defined precision and scale ((38,6)
). Two rows are added, adding 0, 5 to each column.
CREATE OR REPLACE TEMPORARY TABLE tmp (
num_default NUMBER,
num_custom NUMBER(38,6)
);
INSERT INTO tmp VALUES(0, 0);
INSERT INTO tmp VALUES(5, 5);
Here is the result of calculating the MD5 digest on each value, with the input shown alongside each result. Two things to note:
- the values are being silently coerced into a string expression, which is the type of input accepted by the
MD5()
function - anytime the input is different in any way, the result is going to be different
SELECT
num_default,
MD5(num_default) AS md5_defualt,
num_custom,
MD5(num_custom) AS md5_custom,
FROM tmp;
NUM_DEFAULT | MD5_DEFUALT | NUM_CUSTOM | MD5_CUSTOM |
---|---|---|---|
0 | cfcd208495d565ef66e7dff9f98764da | 0.000000 | 2d0d659d374801a71f65622406b26458 |
5 | e4da3b7fbbce2345d7772b0674a318d5 | 5.000000 | 463daaeaa59e931ec54b0d6375a05a0f |
Here's a query that explicitly converts the values using TO_CHAR()
and specifies the output format. FM
is used to toggle the fill mode to "compact" and TM9
is used to output numbers in "text minimal" format. The former causes the latter to omit trailing zeroes.
See https://docs.snowflake/en/sql-reference/sql-format-models#text-minimal-numeric-formats and https://docs.snowflake/en/sql-reference/sql-format-models#format-modifiers-and-generic-space-handling (I don't know why the "Note" at the end says that fill mode has no effect on text minimal format elements, that's simply not true and contradicts other points in the documentation as well as the actual system behavior).
SELECT
TO_CHAR(num_default, 'FMTM9') AS default_fmtm9,
MD5(TO_CHAR(num_default, 'FMTM9')) AS md5_default_fmtm9,
TO_CHAR(num_custom, 'FMTM9') AS custom_fmtm9,
MD5(TO_CHAR(num_custom, 'FMTM9')) AS md5_custom_fmtm9,
FROM tmp;
DEFAULT_FMTM9 | MD5_DEFAULT_FMTM9 | CUSTOM_FMTM9 | MD5_CUSTOM_FMTM9 |
---|---|---|---|
0 | cfcd208495d565ef66e7dff9f98764da | 0 | cfcd208495d565ef66e7dff9f98764da |
5 | e4da3b7fbbce2345d7772b0674a318d5 | 5 | e4da3b7fbbce2345d7772b0674a318d5 |
Here are some additional queries to demonstrate how numeric literals are processed by default (i.e. as NUMBER(38,0)
) and what happens when you explicitly cast a literal to NUMBER(38,6)
).
SELECT
5,
MD5(5) AS md5_five_literal;
5 | MD5_FIVE_LITERAL |
---|---|
5 | e4da3b7fbbce2345d7772b0674a318d5 |
SELECT
5.000000,
MD5(5.000000) AS md5_five_literal;
5.000000 | MD5_FIVE_LITERAL |
---|---|
5 | e4da3b7fbbce2345d7772b0674a318d5 |
SELECT
5.000000::NUMBER(38,6),
MD5(5.000000::NUMBER(38,6)) AS md5_five_literal_casted;
5.000000::NUMBER(38,6) | MD5_FIVE_LITERAL_CASTED |
---|---|
5.000000 | 463daaeaa59e931ec54b0d6375a05a0f |
The main takeaway from all this is that you need to focus on what exactly is being input into the hashing function. You should strive to remove all ambiguity by performing explicit conversions to the string expression that the Snowflake MD5()
function accepts.
For reference, see https://docs.snowflake/sql-reference/data-type-conversion and https://docs.snowflake/en/sql-reference/functions/md5.
Alternatively, is this potentially a bug with Snowflake's MD5 hashing function?
There's nothing wrong with the MD5 implementation. This same behavior can be illustrated in a different environment. For example, using Go:
package main
import (
"crypto/md5"
"fmt"
)
func printMD5(input string) {
inputBytes := []byte(input)
fmt.Printf("input = %q\n", input)
fmt.Printf("inputBytes = %x\n", inputBytes)
fmt.Printf("md5(inputBytes) = %x\n\n", md5.Sum(inputBytes))
}
func main() {
printMD5("5")
printMD5("5.000000")
}
Output:
input = "5"
inputBytes = 35
md5(inputBytes) = e4da3b7fbbce2345d7772b0674a318d5
input = "5.000000"
inputBytes = 352e303030303030
md5(inputBytes) = 463daaeaa59e931ec54b0d6375a05a0f
I am adding some additional follow-up to Chuckx's very detailed answer for the IICS crowd, because this applies to other data warehouse tools (e.g., MSSQL Server). The problem on my end was with IICS because it's TO_CHAR()
function will cast high precision numbers to a DOUBLE before converting it to string. I validated this with Snowflake, MSSQL Server, and an online MD5 tool and the MD5 checksum was identical, which means IICS is the source of the problem and not Snowflake.
The IICS (i.e., Cloud ETL tool) documentation online was* silent on this behavior - it focused more on its ability to cast date formats as string. I had to go to the Powercenter 10.5 (i.e., on-prem ETL tool) and it explained the behavior in better detail - although it still seemed a bit off.
CREATE OR REPLACE TABLE TMP (
FOO NUMBER(38,6),
BAZ NUMBER(38,6),
BAR INT AUTOINCREMENT (1,1) ORDER
)
INSERT INTO TMP (FOO, BAZ) VALUES (1010.99, 12345678901234567890123456799),
(-15.62567, 0999999999999999999999.999999),
(10842764968208837340, NULL),
(236789034569723, 1234578945469649345876.123456),
(0, 2378964536789761),
(33.15, -15.62567),
(5, -15.62567),
(NULL, 5),
(2378964536789761, 1010.99),
(1234567890123456789012345679, 0),
(1234567890123456789012345679, 33.15),
(1234578945469649345876.123456, 12345678901234567890123456799),
(0999999999999999999999.999999, 5),
(12345678901234567890123456799, 2378964536789761)
SELECT FOO
,MD5(FOO)::VARCHAR(32) AS SF_MD5_FOO
,TO_VARCHAR(FOO)::VARCHAR(38) AS FOO_VARCHAR
,MD5(TO_VARCHAR(FOO))::VARCHAR(32) AS SF_MD5_FOO_VARCHAR
,TO_VARCHAR(TO_DOUBLE(FOO))::VARCHAR(38) AS FOO_DOUBLE
,MD5(TO_VARCHAR(TO_DOUBLE(FOO)))::VARCHAR(32) AS SF_MD5_FOO_DOUBLE
,BAZ
,MD5(BAZ)::VARCHAR(32) AS SF_MD5_BAZ
,TO_VARCHAR(BAZ)::VARCHAR(38) AS BAZ_VARCHAR
,MD5(TO_VARCHAR(BAZ))::VARCHAR(32) AS SF_MD5_BAZ_VARCHAR
,TO_VARCHAR(TO_DOUBLE(BAZ))::VARCHAR(38) AS BAZ_DOUBLE
,MD5(TO_VARCHAR(TO_DOUBLE(BAZ)))::VARCHAR(32) AS SF_MD5_BAZ_DOUBLE
,MD5(CONCAT(
IFF(FOO IS NULL, 0, FOO)
,IFF(BAZ IS NULL, 0, BAZ)
)
)::VARCHAR(32) AS SF_MD5_COMPOSITE
,MD5(CONCAT(
IFF(FOO IS NULL, TO_VARCHAR(0), TO_VARCHAR(FOO))
,IFF(BAZ IS NULL, TO_VARCHAR(0), TO_VARCHAR(BAZ))
)
)::VARCHAR(32) AS SF_MD5_COMPOSITE_VARCHAR
,MD5(CONCAT(
IFF(FOO IS NULL, TO_VARCHAR(TO_DOUBLE(0)), TO_VARCHAR(TO_DOUBLE(FOO)))
,IFF(BAZ IS NULL, TO_VARCHAR(TO_DOUBLE(0)), TO_VARCHAR(TO_DOUBLE(BAZ)))
)
)::VARCHAR(32) AS SF_MD5_COMPOSITE_DOUBLE
FROM TMP
;
This is the link that I used to resolve the problem:
https://docs.informatica/data-integration/powercenter/10-5/transformation-language-reference/functions/to_char--numbers-.html
*Note: The IICS Data Integration documentation shows as updated this month, so the previous instance of what I was working with last month is no longer online. The link below just updated:
https://docs.informatica/integration-cloud/data-integration/current-version/function-reference/functions/to_char--numbers-.html
本文标签:
版权声明:本文标题:cryptography - Why do Snowflake and Informatica (IICS) MD5 function yield a different digest for the same input NUMERIC? - Stack 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1741745411a2395503.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
MD5(TO_CHAR(BAR, 'FMTM9')) AS FMTM9
? – chuckx Commented Feb 9 at 5:03