ETL

Oracle Merge性能调整

Posted by bluesky blog on January 30, 2012

Merge的具体功能,就不细说了,在实际的ETL开发过程中,碰到了一个很有意思的问题,

原加工存储片段:

 merge into temp_data.tmp_indpty_prod_stat_h_1 p
 			 using ( select indpty_agmt_rela.conform_indparty_id,
       'S05009' || trim(f.operation_typ_cd) as product_id,
       sum(thisincomeamt - thispayoutamt) as inv_income,
       sum(lastincomeamt - lastpayoutamt + thisincomeamt -
           thispayoutamt) as sum_inv_income
   
  from tcs_indpty_agmt_rela_h indpty_agmt_rela
  left join raw_data.f_acctquotient f on f.customerid =
                                         indpty_agmt_rela.agmt_num
                                     and indpty_agmt_rela.province_cd = 96
                                     and indpty_agmt_rela.agmt_modifier_num = 5
                                     and indpty_agmt_rela.indpty_agmt_rela_cd = '10'
                                     and p_txdate between
                                         indpty_agmt_rela.start_dt and
                                         indpty_agmt_rela.end_dt - 1
 group by indpty_agmt_rela.conform_indparty_id,
          'S05009' || trim(f.operation_typ_cd)) s
 			on (p.conform_indparty_id = s.conform_indparty_id and p.product_id = s.product_id)
 		 when matched then
update
   set p.inv_income = s.inv_income, p.sum_inv_income = s.sum_inv_income;

这就是一个简单的Merge语句,但这个加工存储执行了7,8个小时,数据量为70多万左右,速度实在是不敢恭维,

经过尝试,将Megrge中的生成源数据的SQL语句,做张临时表,并将加工存储,改成如下:

INSERT INTO temp_data.tmp_indpty_prod_stat_h_3
select indpty_agmt_rela.conform_indparty_id,
       'S05009' || trim(f.operation_typ_cd) as product_id,
       sum(thisincomeamt - thispayoutamt) as inv_income,
       sum(lastincomeamt - lastpayoutamt + thisincomeamt -
           thispayoutamt) as sum_inv_income
   
  from tcs_indpty_agmt_rela_h indpty_agmt_rela
  left join raw_data.f_acctquotient f on f.customerid =
                                         indpty_agmt_rela.agmt_num
                                     and indpty_agmt_rela.province_cd = 96
                                     and indpty_agmt_rela.agmt_modifier_num = 5
                                     and indpty_agmt_rela.indpty_agmt_rela_cd = '10'
                                     and p_txdate between
                                         indpty_agmt_rela.start_dt and
                                         indpty_agmt_rela.end_dt - 1
 group by indpty_agmt_rela.conform_indparty_id,
          'S05009' || trim(f.operation_typ_cd);
 	 merge into temp_data.tmp_indpty_prod_stat_h_1 p
 	 using (SELECT CONFORM_INDPARTY_ID, PRODUCT_ID, INV_INCOME, SUM_INV_INCOME
       from temp_data.tmp_indpty_prod_stat_h_3) s
 	 on (p.conform_indparty_id = s.conform_indparty_id and p.product_id = s.product_id)
 	 when matched then
update
   set p.inv_income = s.inv_income, p.sum_inv_income = s.sum_inv_income;

这样改完之后,发现该存储运行完成,只要几分钟,性能大大提升。

很有意思,后来,又在其他几个加工存储中做了验证,

发现Merge中,避免生成数据源的sql 过于复杂,如果无法避免,提前做张临时表,

然Merge直接去读,这样性能很大程度上有所提升。