datax安装使用及使用过程中遇到的问题

DataX 是一个异构数据源离线同步工具,致力于实现包括关系型数据库(MySQL、Oracle等)、HDFS、Hive、ODPS、HBase、FTP等各种异构数据源之间稳定高效的数据同步功能。

相关概念和设计理念不再赘述,这里记录个人初次使用时的过程。

datax环境要求:

 

1.下载datax

③git clone [email protected]:alibaba/DataX.git
 
 

2、解压

本过程中使用方式②安装:
解压下载的安装包
[[email protected] bin]# unzip datax.tar.gz
 
 

3、进入目录进行任务配置:

通过命令生成datax任务的模板json文件:mysql2mysql.json
[[email protected] bin]# cd  ./datax/bin/
[[email protected] bin]# python datax.py -r mysqlreader -w mysqlwriter > mysql2mysql.json
详细如下图:
datax安装使用及使用过程中遇到的问题

 

 

4、根据各自需求,按照模板修改json文件:

[[email protected] bin]# cat mysql2mysql.json
[[email protected] bin]# cat mysql2mysql.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": ["id","user_id"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:mysql://192.168.225.131:3306/iris"],
                                "table": ["seiki1"]
                            }
                        ],
                        "password": "root",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["id","user_id"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://192.168.225.130:3306/iris",
                                "table": ["seiki1"]
                            }
                        ],
                        "password": "root",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
文件详细说明:
数据源:IP:131;
目标库:IP:130;
库名:iris;
表名:seiki1;
字段:id,user_id(writer和reader列名一一对应)。
 
 

5、启动datax同步任务:

[[email protected] bin]# python ./datax.py ./mysql2mysql.json
 
启动信息如下:
[email protected] bin]# python ./datax.py mysql2mysql.json
 
 
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
 
 
 
 
2019-12-27 15:55:08.264 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2019-12-27 15:55:08.275 [main] INFO  Engine - the machine info  =>
 
    osInfo:    Oracle Corporation 1.8 25.221-b11
    jvmInfo:    Linux amd64 3.10.0-957.el7.x86_64
    cpu num:    1
 
    totalPhysicalMemory:    -0.00G
    freePhysicalMemory:    -0.00G
    maxFileDescriptorCount:    -1
    currentOpenFileDescriptorCount:    -1
 
    GC Names    [Copy, MarkSweepCompact]
 
    MEMORY_NAME                    | allocation_size                | init_size                      
    Eden Space                     | 273.06MB                       | 273.06MB                       
    Code Cache                     | 240.00MB                       | 2.44MB                         
    Survivor Space                 | 34.13MB                        | 34.13MB                        
    Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
    Metaspace                      | -0.00MB                        | 0.00MB                         
    Tenured Gen                    | 682.69MB                       | 682.69MB                       
 
……
 
问题1:
2019-12-27 15:55:09.312 [job-0] ERROR RetryUtil - Exception when calling callable, 异常Msg:Code:[MYSQLErrCode-02], Description:[数据库服务的IP地址或者Port错误,请检查填写的IP地址和Port或者联系DBA确认IP地址和Port是否正确。如果是同步中心用户请联系DBA确认idb上录入的IP和PORT信息和数据库的当前实际信息是一致的].  -  具体错误信息为:com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
解决办法:
检查json文件中配置的ip、端口、用户名和密码,及验证网络连通性,注意防火墙策略等。
在本次场景中,是由于防火墙策略问题导致。
 
 
 

6、解决问题后,继续执行datax同步任务:

执行结果信息如下:
2019-12-27 16:14:29.083 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2019-12-27 16:14:29.083 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] do post work.
2019-12-27 16:14:29.084 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do post work.
2019-12-27 16:14:29.084 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2019-12-27 16:14:29.084 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /data/datax/hook
2019-12-27 16:14:29.086 [job-0] INFO  JobContainer -
     [total cpu info] =>
        averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
        -1.00%                         | -1.00%                         | -1.00%
                        
 
 
     [total gc info] =>
         NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
         Copy                 | 1                  | 1                  | 1                  | 0.098s             | 0.098s             | 0.098s             
         MarkSweepCompact     | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
 
 
2019-12-27 16:14:29.086 [job-0] INFO  JobContainer - PerfTrace not enable!
2019-12-27 16:14:29.086 [job-0] INFO  StandAloneJobContainerCommunicator - Total 100858 records, 1295828 bytes | Speed 126.54KB/s, 10085 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 2.048s |  All Task WaitReaderTime 0.437s | Percentage 100.00%
2019-12-27 16:14:29.087 [job-0] INFO  JobContainer -
任务启动时刻                    : 2019-12-27 16:14:17
任务结束时刻                    : 2019-12-27 16:14:29
任务总计耗时                    :                 11s
任务平均流量                    :          126.54KB/s
记录写入速度                    :          10085rec/s
读出记录总数                    :              100858
读写失败总数                    :                   0
成功同步10085条记录,失败0条。
 
问题2:
报错:Could not retrieve transation read-only status server
All Task WaitWriterTime 18.691s |  All Task WaitReaderTime 0.221s | Percentage 0.00%
2019-12-27 15:00:12.241 [0-0-0-writer] WARN  CommonRdbmsWriter$Task - 回滚此次写入, 采用每次写入一行方式提交. 因为:Could not retrieve transation read-only status server
2019-12-27 15:00:13.455 [0-0-0-writer] WARN  CommonRdbmsWriter$Task - 回滚此次写入, 采用每次写入一行方式提交. 因为:Could not retrieve transation read-only status server
2019-12-27 15:00:14.670 [0-0-0-writer] WARN  CommonRdbmsWriter$Task - 回滚此次写入, 采用每次写入一行方式提交. 因为:Could not retrieve transation read-only status server
2019-12-27 15:00:15.896 [0-0-0-writer] WARN  CommonRdbmsWriter$Task - 回滚此次写入, 采用每次写入一行方式提交. 因为:Could not retrieve transation read-only status server
2019-12-27 15:00:17.072 [0-0-0-writer] WARN  CommonRdbmsWriter$Task - 回滚此次写入, 采用每次写入一行方式提交. 因为:Could not retrieve transation read-only status server
2019-12-27 15:00:18.276 [0-0-0-writer] WARN  CommonRdbmsWriter$Task - 回滚此次写入, 采用每次写入一行方式提交. 因为:Could not retrieve transation read-only status server
2019-12-27 15:00:19.499 [0-0-0-writer] WARN  CommonRdbmsWriter$Task - 回滚此次写入, 采用每次写入一行方式提交. 因为:Could not retrieve transation read-only status server
2019-12-27 15:00:20.775 [0-0-0-writer] WARN  CommonRdbmsWriter$Task - 回滚此次写入, 采用每次写入一行方式提交. 因为:Could not retrieve transation read-only status server
2019-12-27 15:00:21.949 [0-0-0-writer] WARN  CommonRdbmsWriter$Task - 回滚此次写入, 采用每次写入一行方式提交. 因为:Could not retrieve transation read-only status server
问题原因:
无法正常对数据库进行数据写入,有可能连接问题,也可能跟隔离级别有关。
 
解决办法:
①安装合适的mysql连接驱动
[[email protected] bin]# python -m pip install mysql-connector
没有pip工具,可以通过以下命令安装:
[[email protected] bin]# yum -y install *pip* 
 
 
②检查数据库隔离级别:
[email protected]|iris>SET GLOBAL transaction_isolation='REPEATABLE-READ';
详情如下:
datax安装使用及使用过程中遇到的问题
如果不是RC,可调修改为RC进行尝试。
修改命令:
[email protected]|(none)>SET GLOBAL transaction_isolation='READ-COMMITTED';
或者
修改my.cnf配置文件,然后重启MySQL服务器,以永久生效。
 
 

 

7、配置多表关联结果插入目标表:

[[email protected] bin]# cat mysql2mysql22.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": ["id","user_id"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:mysql://192.168.225.131:3306/iris"],
                                "querySql":["select a.id,a.user_id,concat(a.user_id,b.name) from seiki1 a inner join aaa b on a.id=b.id"]
                            }
                        ],
                        "password": "root",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["id","user_id","name"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://192.168.225.130:3306/iris",
                                "table": ["seiki1"]
                            }
                        ],
                        "password": "root",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
注意:
①select中字段个数和writer节点上“column”的个数需保持一致,
②writer节点的jdbcUrl,不可以同reader节点那样用“[]”括起来。
>> 字段不对应报错如下:
2019-12-27 17:44:33.306 [job-0] ERROR JobContainer - 运行scheduler 模式[standalone]出错.
2019-12-27 17:44:33.307 [job-0] ERROR JobContainer - Exception when job run
com.alibaba.datax.common.exception.DataXException: Code:[DBUtilErrorCode-00], Description:[您的配置错误.].  - 列配置信息有错误. 因为您配置的任务中,源头读取字段数:3 与 目的表要写入的字段数:2 不相等. 请检查您的配置并作出修改.
如果能确保 select中的字段与目标表保持一致,可在writer节点的配置上去掉“column”相关配置。
>>>writer节点配置增加[]时报错如下:
2019-12-30 14:10:45.251 [job-0] ERROR RetryUtil - Exception when calling callable, 即将尝试执行第1次重试.本次重试计划等待[1000]ms,实际等待[1001]ms, 异常Msg:[Code:[DBUtilErrorCode-10], Description:[连接数据库失败. 请检查您的 账号、密码、数据库名称、IP、Port或者向 DBA 寻求帮助(注意网络环境).].  -  具体错误信息为:java.sql.SQLException: No suitable driver found for ["jdbc:mysql://192.168.225.130:3306/iris"]?
 
 
 

 

8、多表关联同步任务执行结果:

2019-12-27 17:46:18.457 [job-0] INFO  JobContainer -
     [total cpu info] =>
        averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
        -1.00%                         | -1.00%                         | -1.00%
                        
     [total gc info] =>
         NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
         Copy                 | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
         MarkSweepCompact     | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
2019-12-27 17:46:18.457 [job-0] INFO  JobContainer - PerfTrace not enable!
2019-12-27 17:46:18.458 [job-0] INFO  StandAloneJobContainerCommunicator - Total 22 records, 260 bytes | Speed 26B/s, 2 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2019-12-27 17:46:18.459 [job-0] INFO  JobContainer -
任务启动时刻                    : 2019-12-27 17:46:07
任务结束时刻                    : 2019-12-27 17:46:18
任务总计耗时                    :                 10s
任务平均流量                    :               26B/s
记录写入速度                    :              2rec/s
读出记录总数                    :                  22
读写失败总数                    :                   0