admin管理员组

文章数量:1277875

Data model

My database contains a table with unique id and JSON blob. JSON blob contains, among other fields, a list of "identifiers" that I have an index on. Identifiers are of a given type (ISIN, VALOR, etc.).
The identifier is held in a JSON subobject, with possibly a priority class (primary, or linked): {<type>: <identifier value>, <class>: true}.
By having each identifier in its own subobject, the array of identifiers can hold multiple identifiers of the same type, for example:

    {
        "identifiers": [
          {"isin": "XS1", "primary": true},
          {"valor": 1234, "linked": true},
          {"isin": "EXTRA"}
        ],
        "workflows": {},
        "slots": {}
    }

Lookup queries

Initial, single field search

Previously I had a query that provided a single array element, to be searched among all JSON blobs "identifiers" lists and return the Ids of those elements. For example the search criteria: [ {"isin": "XS1"} ] would return true for the aforementioned example.

New search rules

Now I want the priority class to filter more precisely.
Basically the conditions are as follows:

Return a record if it contains an identifier with the given type and value (of the search criteria):

  • and primary = true,
  • or linked = true but the object holds no primary identifier of that type
  • or one with no primary or linked class (either having no class at all, or a different class), and there exists no primary or linked identifier of the searched type in the object

Thus the search criteria [ {"isin": "EXTRA"} ] would return false for below identifiers

'{
        "identifiers": [
          {"isin": "XS1", "primary": true},
          {"valor": 1234},
          {"isin": "EXTRA", "linked":true}
        ],
        "workflows": {},
        "slots": {}
    }'

because isin and EXTRA match but there is already an isin which is primary, thus masking all other isins.

Desired query

Query plan for this single item query was good, because i could easly hardcode certain values like "type" in the query itself. Now I would like to make the query resolve a batch of identifiers the same way, i.e.:

'[
      [{"isin": "XS12"}, {"id": "XS12"}, {"schema" : "isin"}],
      [{"valor": 1234}, {"id": 1234}, {"schema" : "valor"}],
      [{"isin": "EXTRA"}, {"id": "EXTRA"}, {"schema" : "isin"}]
    ]'

The query is: Fiddle

WITH scheme_and_id AS (
    -- Convert the input JSON object into rows of key-value pairs
    SELECT jsonb_array_elements('[
      [{"isin": "XS12"}, {"id": "XS12"}, {"schema" : "isin"}],
      [{"valor": 1234}, {"id": 1234}, {"schema" : "valor"}],
      [{"isin": "EXTRA"}, {"id": "EXTRA"}, {"schema" : "isin"}]
    ]'::jsonb) as batch_json
),
     -- first build the JSON to be used to match the index
     resolve_id as (select -- select the JSON objects
                           "id",
                           (blob -> 'identifiers')    as "ids",
                           (scheme_and_id.batch_json[1] -> 'id')::text as identifier,
                           trim((scheme_and_id.batch_json[2] -> 'schema')::text, '"') as selected_schema
                    from blobstable, scheme_and_id
                    where (blob -> 'identifiers') @>  jsonb_build_array( scheme_and_id.batch_json[0])),
     candidates as (
         SELECT
             "id",
             selected_schema  as "schema",
             identifier as "identifier",
             (single_identifier -> 'linked')::boolean as "linked",
             (single_identifier -> 'primary')::boolean as "primary",
             (single_identifier -> selected_schema)::TEXT as selected_id,
             single_identifier
         FROM resolve_id ,
              LATERAL jsonb_array_elements("ids") single_identifier
         where  (single_identifier -> selected_schema)::TEXT is not null
     )
SELECT *
from candidates
where ("primary" and identifier = selected_id)        -- first rule
   or ("linked" and identifier = selected_id and
       not exists(select "id" from candidates where "primary")) -- second rule
   or (("primary" is Null or False) and ("linked" is Null or False) and identifier = selected_id and
       not exists(select "id" from candidates where "linked")) -- third rule
;

And the explanation: I provide arrays of params like so: [{"isin": "XS12"}, {"id": "XS12"}, {"schema" : "isin"}]

{"isin": "XS12"} is the identifier to be matched in:

where (blob -> 'identifiers') @>  jsonb_build_array( scheme_and_id.batch_json[0]))

{"id": "XS12"} and {"schema" : "isin"} parameters are needed to easily match it in the candidates conditions (extracting this dynamically was challenging because the type name can be isin, valor etc, as seen here: (trim is used because somehow schema was being returned with quotation marks

(scheme_and_id.batch_json[1] -> 'id')::text as identifier,
trim((scheme_and_id.batch_json[2] -> 'schema')::text, '"') 

After those operations "resolve_id" contains the "id" and "identifiers" blob of all arrays that contain identifiers from the "scheme_and_id"

And finally in "candidates" I extract needed fields from the blob for the final "where" logic. Addition of this "candidates" ads another nested loop to the explain plan. I am looking for a way to optimize this query. I was even thinking about making or the "where" logic using @> operator, to basically check if contains "type":"value" primary - true, and not contains others, but I see no way to do it like "not contains "isin": any, primary: true

Working example: Fiddle

Basic create table with test data, expected in the repsonse are all 3 selected ids

CREATE TABLE blobstable 
(
  id integer primary key,
  blob jsonb
);

INSERT INTO blobstable 
VALUES 
(1, '{
    "identifiers": [
      {"isin": "XS1", "primary": true},
      {"valor": 1234},
      {"isin": "EXTRA"}
    ],
    "workflows": {},
    "slots": {}
}'),
(2, '{
     "identifiers": [
      {"isin": "XS3", "primary": true},
      {"valor": 456, "linked": true},
      {"cusip": "TRGYN"}
    ],
    "workflows": {},
    "slots": {}
}'),
  (3, '{
    "identifiers": [
      {"isin": "XS12", "primary": true},
      {"valor": 678},
      {"isin": "XS5"}
    ],
    "workflows": {},
    "slots": {}
}')
;

Data model

My database contains a table with unique id and JSON blob. JSON blob contains, among other fields, a list of "identifiers" that I have an index on. Identifiers are of a given type (ISIN, VALOR, etc.).
The identifier is held in a JSON subobject, with possibly a priority class (primary, or linked): {<type>: <identifier value>, <class>: true}.
By having each identifier in its own subobject, the array of identifiers can hold multiple identifiers of the same type, for example:

    {
        "identifiers": [
          {"isin": "XS1", "primary": true},
          {"valor": 1234, "linked": true},
          {"isin": "EXTRA"}
        ],
        "workflows": {},
        "slots": {}
    }

Lookup queries

Initial, single field search

Previously I had a query that provided a single array element, to be searched among all JSON blobs "identifiers" lists and return the Ids of those elements. For example the search criteria: [ {"isin": "XS1"} ] would return true for the aforementioned example.

New search rules

Now I want the priority class to filter more precisely.
Basically the conditions are as follows:

Return a record if it contains an identifier with the given type and value (of the search criteria):

  • and primary = true,
  • or linked = true but the object holds no primary identifier of that type
  • or one with no primary or linked class (either having no class at all, or a different class), and there exists no primary or linked identifier of the searched type in the object

Thus the search criteria [ {"isin": "EXTRA"} ] would return false for below identifiers

'{
        "identifiers": [
          {"isin": "XS1", "primary": true},
          {"valor": 1234},
          {"isin": "EXTRA", "linked":true}
        ],
        "workflows": {},
        "slots": {}
    }'

because isin and EXTRA match but there is already an isin which is primary, thus masking all other isins.

Desired query

Query plan for this single item query was good, because i could easly hardcode certain values like "type" in the query itself. Now I would like to make the query resolve a batch of identifiers the same way, i.e.:

'[
      [{"isin": "XS12"}, {"id": "XS12"}, {"schema" : "isin"}],
      [{"valor": 1234}, {"id": 1234}, {"schema" : "valor"}],
      [{"isin": "EXTRA"}, {"id": "EXTRA"}, {"schema" : "isin"}]
    ]'

The query is: Fiddle

WITH scheme_and_id AS (
    -- Convert the input JSON object into rows of key-value pairs
    SELECT jsonb_array_elements('[
      [{"isin": "XS12"}, {"id": "XS12"}, {"schema" : "isin"}],
      [{"valor": 1234}, {"id": 1234}, {"schema" : "valor"}],
      [{"isin": "EXTRA"}, {"id": "EXTRA"}, {"schema" : "isin"}]
    ]'::jsonb) as batch_json
),
     -- first build the JSON to be used to match the index
     resolve_id as (select -- select the JSON objects
                           "id",
                           (blob -> 'identifiers')    as "ids",
                           (scheme_and_id.batch_json[1] -> 'id')::text as identifier,
                           trim((scheme_and_id.batch_json[2] -> 'schema')::text, '"') as selected_schema
                    from blobstable, scheme_and_id
                    where (blob -> 'identifiers') @>  jsonb_build_array( scheme_and_id.batch_json[0])),
     candidates as (
         SELECT
             "id",
             selected_schema  as "schema",
             identifier as "identifier",
             (single_identifier -> 'linked')::boolean as "linked",
             (single_identifier -> 'primary')::boolean as "primary",
             (single_identifier -> selected_schema)::TEXT as selected_id,
             single_identifier
         FROM resolve_id ,
              LATERAL jsonb_array_elements("ids") single_identifier
         where  (single_identifier -> selected_schema)::TEXT is not null
     )
SELECT *
from candidates
where ("primary" and identifier = selected_id)        -- first rule
   or ("linked" and identifier = selected_id and
       not exists(select "id" from candidates where "primary")) -- second rule
   or (("primary" is Null or False) and ("linked" is Null or False) and identifier = selected_id and
       not exists(select "id" from candidates where "linked")) -- third rule
;

And the explanation: I provide arrays of params like so: [{"isin": "XS12"}, {"id": "XS12"}, {"schema" : "isin"}]

{"isin": "XS12"} is the identifier to be matched in:

where (blob -> 'identifiers') @>  jsonb_build_array( scheme_and_id.batch_json[0]))

{"id": "XS12"} and {"schema" : "isin"} parameters are needed to easily match it in the candidates conditions (extracting this dynamically was challenging because the type name can be isin, valor etc, as seen here: (trim is used because somehow schema was being returned with quotation marks

(scheme_and_id.batch_json[1] -> 'id')::text as identifier,
trim((scheme_and_id.batch_json[2] -> 'schema')::text, '"') 

After those operations "resolve_id" contains the "id" and "identifiers" blob of all arrays that contain identifiers from the "scheme_and_id"

And finally in "candidates" I extract needed fields from the blob for the final "where" logic. Addition of this "candidates" ads another nested loop to the explain plan. I am looking for a way to optimize this query. I was even thinking about making or the "where" logic using @> operator, to basically check if contains "type":"value" primary - true, and not contains others, but I see no way to do it like "not contains "isin": any, primary: true

Working example: Fiddle

Basic create table with test data, expected in the repsonse are all 3 selected ids

CREATE TABLE blobstable 
(
  id integer primary key,
  blob jsonb
);

INSERT INTO blobstable 
VALUES 
(1, '{
    "identifiers": [
      {"isin": "XS1", "primary": true},
      {"valor": 1234},
      {"isin": "EXTRA"}
    ],
    "workflows": {},
    "slots": {}
}'),
(2, '{
     "identifiers": [
      {"isin": "XS3", "primary": true},
      {"valor": 456, "linked": true},
      {"cusip": "TRGYN"}
    ],
    "workflows": {},
    "slots": {}
}'),
  (3, '{
    "identifiers": [
      {"isin": "XS12", "primary": true},
      {"valor": 678},
      {"isin": "XS5"}
    ],
    "workflows": {},
    "slots": {}
}')
;

Share Improve this question edited Feb 25 at 11:31 guruk asked Feb 25 at 8:29 gurukguruk 8310 bronze badges 6
  • As mentioned in your previous post, please show a minimal reproducible example, containing the CREATE TABLE and some insert statements. Please also show a single Markdown Table with your expected results. – Charlieface Commented Feb 25 at 10:29
  • I'm still very unclear: in those final "rules" are we purely looking at a match across the identifiers in a single JSON row, or do you want to look at all rows? Eg does row 1 affect row 3? Also your input search data seems to repeat itself, for example [{"isin": "XS12"}, {"id": "XS12"}, {"schema" : "isin"}] could be just [{"isin": "XS12"}] and everything can be worked out from there. – Charlieface Commented Feb 25 at 11:34
  • Hi, if multiple rows are returned for a single "isin" for example, actually we shoudl select the row with preference to "primary", then "linked" as in the "where"clause, so in that sense they are not fully independent, all candidates shoudl be checked for a particular ID – guruk Commented Feb 25 at 12:30
  • 1 Your current query seems to need a small correction, because the last where filters out only if the id is found by another criteria. Thus, the only reason why {isin:"EXTRA"} of id 1 is not matched, is that because id 1 is matched by valor. But if you remove valor from the criterias, its {isin:"EXTRA"} will still match (while according to rule 3., it should not match because id 1 holds another isin with primary: true). Another way to test is by adding an {"isin": "EXTRA"}, to id 2: it incorrectly shows up as a result. – Guillaume Outters Commented Feb 25 at 15:23
  • 1 I have put what I think would be the corrected version (according to what I have understood from your description) in a fiddle; along with a second version that simplifies the not exists by flattening all "priority classes" (primary, linked) to an easily comparable int field. – Guillaume Outters Commented Feb 25 at 17:34
 |  Show 1 more comment

2 Answers 2

Reset to default 0

Supposing your dataware are more often searched than modified,
I would compute a field dedicated to identifiers search (indexed, and computed ON INSERT OR UPDATE).

In 3 passes, it would contain:

  1. all primary identifiers
  2. and all linked identifiers of types not in 1.
  3. and all identifiers of types not in 1. + 2.

Your search query would then be super simple.

I went with final query, as seen here: https://dbfiddle.uk/gwzGZX8j

       WITH scheme_and_id as (
                select     (id_json -> 'identifier') AS queried_id,
                           trim((id_json-> 'scheme')::text, '"')AS scheme,
                           ordinal
                from jsonb_array_elements('[
      {"identifier": "XS12", "scheme" : "isin"},
      {"identifier": 1234, "scheme" : "valor"},
      {"identifier": "EXTRA", "scheme" : "isin"}
    ]'::jsonb) WITH ORDINALITY as f(id_json, ordinal)),
                 -- first build the JSON to be used to match the index
                 resolve_id as (select -- select the JSON objects
                                       id,
                                       (blob -> 'identifiers')     as "ids",
                                       ordinal,
                                       queried_id,
                                       scheme
                                 from blobstable, scheme_and_id
                                where (blob -> 'identifiers')  @>  jsonb_build_array( jsonb_build_object(scheme, queried_id))),
                 candidates as (
                     SELECT
                         id,
                         ordinal,
                         scheme,
                         queried_id,
                         (identifier_row -> scheme) as candidate_id,
                         (identifier_row -> 'primary')::boolean as "primary",
                         (identifier_row -> 'linked')::boolean as "linked"
                     FROM resolve_id ,
                          LATERAL jsonb_array_elements("ids") identifier_row
                 )
            SELECT id
            from candidates c
                     right join scheme_and_id on c.ordinal = scheme_and_id.ordinal
                and ((c."primary" and c.queried_id = c.candidate_id)                                            -- first rule
                or  (c."linked"   and c.queried_id = c.candidate_id
                    and not exists(select "id" from candidates where c.ordinal = candidates.ordinal and candidate_id is not null and "primary"))  -- second rule
                or (("primary" is Null or False) and ("linked" is Null or False) and c.queried_id = c.candidate_id
                    and not exists(select "id" from candidates where c.ordinal = candidates.ordinal and candidate_id is not null and "linked")))  -- third rule
            order by scheme_and_id.ordinal;

So I added sorting, since I need to return NULL for not found identifiers. I also improved the way I provide the input into the query to drop the nested arrays and value duplication. I also decided to go with this where clause, because it moves all the conditional logic to the final step, and I believe is easier to read.

       from candidates c
                     right join scheme_and_id on c.ordinal = scheme_and_id.ordinal
                and ((c."primary" and c.queried_id = c.candidate_id)                                            -- first rule
                or  (c."linked"   and c.queried_id = c.candidate_id
                    and not exists(select "id" from candidates where c.ordinal = candidates.ordinal and candidate_id is not null and "primary"))  -- second rule
                or (("primary" is Null or False) and ("linked" is Null or False) and c.queried_id = c.candidate_id
                    and not exists(select "id" from candidates where c.ordinal = candidates.ordinal and candidate_id is not null and "linked")))  -- third rule
            order by scheme_and_id.ordinal;

Thank you for help

本文标签: postgresqlEffective JSON array search in batcheswith data extractionStack Overflow