admin管理员组

文章数量:1289635

I am trying to load on prem transaction log table and update a databricks table.In Source table,the column reqdetails table hold all the information and expect ProductID remaining columns are dynamic(not all columns exists in the request) all time the optional/nullable.Could you please let me the process using python in databricks ? Explored StructType but it needs mandatory specific columns.

Source Table

Req ID  Type    Req Details Status
1   Update  ProductID=234;ProductName=LawnMover;Price=58    True
2   Update  ProductID=874;Price=478 True
3   Update  ProductID=678;ProductParentgroup=Watersuppuly;Price=1.6 True

Target table before Update

ProductID   ProductParentgroup  ProductName    Price
234         Utility             Mover          86
874         HOA                 Sink           450
678         Water               Filters        1.2

Target table after Update

ProductID   ProductParentgroup  ProductName        Price
234         Utility             LawnMover          58
874         HOA                 Sink               478
678         Watersupply         Filters            1.6

I am trying to load on prem transaction log table and update a databricks table.In Source table,the column reqdetails table hold all the information and expect ProductID remaining columns are dynamic(not all columns exists in the request) all time the optional/nullable.Could you please let me the process using python in databricks ? Explored StructType but it needs mandatory specific columns.

Source Table

Req ID  Type    Req Details Status
1   Update  ProductID=234;ProductName=LawnMover;Price=58    True
2   Update  ProductID=874;Price=478 True
3   Update  ProductID=678;ProductParentgroup=Watersuppuly;Price=1.6 True

Target table before Update

ProductID   ProductParentgroup  ProductName    Price
234         Utility             Mover          86
874         HOA                 Sink           450
678         Water               Filters        1.2

Target table after Update

ProductID   ProductParentgroup  ProductName        Price
234         Utility             LawnMover          58
874         HOA                 Sink               478
678         Watersupply         Filters            1.6
Share Improve this question edited Feb 20 at 21:11 Hogan 70.5k10 gold badges82 silver badges119 bronze badges asked Feb 20 at 19:01 RUCRUC 831 gold badge3 silver badges6 bronze badges 1
  • 1 Next time please don't use tabs in your code. I had to replace with spaces to get it to line up so your example is readable. You should never use tabs since other people might not see it formed the same, if you want to use the tab key, configure your code editor to replace tabs with spaces. – Hogan Commented Feb 20 at 21:34
Add a comment  | 

2 Answers 2

Reset to default 1

Split ReqDetails from the source table using regexp_extract, join source and target table and then merge the respective columns using coalesce.

If regexp_extract does not match a value in ReqDetails, it returns an empty string.

from pyspark.sql import functions as F

srcDf = ...
targetDf = ...

src2Df = srcDf.withColumn('ProductID', F.regexp_extract('ReqDetails', 'ProductID=(.*?);', 1)) \
     .withColumn('SrcProductParentgroup', F.regexp_extract('ReqDetails', 'ProductParentgroup=(.*?);', 1)) \
     .withColumn('SrcProductName', F.regexp_extract('ReqDetails', 'ProductName=(.*?);', 1)) \
     .withColumn('SrcPrice', F.regexp_extract('ReqDetails', 'Price=(.*)', 1)) \
     .na.replace('', None)


targetDf.join(src2Df, on='ProductID', how='left') \
    .withColumn('ProductParentgroup', F.coalesce('SrcProductParentgroup', 'ProductParentgroup')) \
    .withColumn('ProductName', F.coalesce('SrcProductName', 'ProductName')) \
    .withColumn('Price', F.coalesce('SrcPrice', 'Price')) \
    .select('ProductID', 'ProductParentgroup', 'ProductName', 'Price') \
    .show()

Output:

+---------+------------------+-----------+-----+
|ProductID|ProductParentgroup|ProductName|Price|
+---------+------------------+-----------+-----+
|      234|           Utility|  LawnMover|   58|
|      678|      Watersuppuly|    Filters|  1.6|
|      874|               HOA|       Sink|  478|
+---------+------------------+-----------+-----+

Create a proc with default values of null, then pass in the value you get or null to the procedure. The update code below shows you how to use COALESCE to replace the null values with the existing value or use the passed value.

CREATE PROCEDURE example
(
    @productid integer
    @parentgroup varchar(200) = NULL,
    @productname varchar(200) = NULL,
    @price integer = NULL
)
AS

BEGIN
    SET NOCOUNT ON

    UPDATE tablenameyoudidnottellusthename as A
      SET A.parentgroup = COALESCE(@pargentgroup, A.parentgroup),
          A.productname = COALESCE(@productname, A.productname),
          A.price = COALESCE(@price,A.price)
    WHERE A.productid = @productid
END

本文标签: pythonUpdate target column with optional source columnsStack Overflow