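How to find the average time to load data from S3 into Redshift

Question:

I have more than 8 schemas and 200+ tables, and the data is loaded from CSV files into the different schemas.

I would like a SQL script that finds the average time to load data from S3 into Redshift for all 200 tables.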

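Answer

You can examine the STL System Tables for Logging to discover how long queries took to run.

You will probably need to parse the query text to discover which tables were loaded, but you can use the historical load times to calculate a typical load time for each table.

Some particularly useful tables are:

STL_QUERY_METRICS: Contains metrics information, such as the number of rows processed, CPU usage, input/output, and disk use, for queries that have completed running in user-defined query queues (service classes).
STL_QUERY: Returns execution information about a database query.
STL_LOAD_COMMITS: Records the progress of each data file as it is loaded into a database table.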
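
As a rough illustration of the "parse the query text" step, the sketch below assumes the COPY statements have already been exported from STL_QUERY to a pipe-delimited text file, one row per query, in the form `duration_seconds|querytxt`. The file layout and the function name `avg_copy_time_per_table` are assumptions made for this example. It treats the word right after COPY as the table name and averages the durations per table.

```shell
#!/bin/sh
# Hypothetical helper: average COPY duration per table.
# Input (assumed layout): file $1 with lines "duration_seconds|querytxt".
# Only the text up to the next "|" is inspected, which is enough because
# the table name appears right after the COPY keyword.
avg_copy_time_per_table() {
    awk -F'|' '
        {
            # Split the query text into words; expect "copy <table> from ..."
            n = split($2, w, /[ \t]+/)
            first = (w[1] == "") ? 2 : 1      # tolerate leading whitespace
            if (n < first + 1 || tolower(w[first]) != "copy") next
            tbl = tolower(w[first + 1])
            total[tbl] += $1
            count[tbl]++
        }
        END {
            # One line per table: "<table> <average seconds>"
            for (t in total)
                printf "%s %.1f\n", t, total[t] / count[t]
        }
    ' "$1" | sort
}
```

One way to produce such a file might be psql in unaligned, tuples-only mode, for example `psql -At -F'|' -c "select datediff(second, starttime, endtime), querytxt from stl_query where querytxt ilike 'copy %'"` redirected to a file, with connection options added for your cluster. Calling `avg_copy_time_per_table copy_times.txt` then prints one average per table.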

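Answer

There is a smart way to do this. You presumably have an ETL script that migrates the data from S3 to Redshift.

Assuming it is a shell script, capture a timestamp just before the ETL logic for a table starts (call it start), capture another timestamp after the ETL logic for that table ends (call it end), and take the difference towards the end of the script: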

#!/bin/sh
# ...

start=$(date +%s)  # capture start time (seconds since the epoch)

# ETL logic (placeholders for your own steps):
#   [find the right csv on S3]
#   [check for duplicates, whether the file has already been loaded etc]
#   [run your ETL logic, logging to make sure that file has been processed on S3]
#   [copy that table to Redshift, log again to make sure that table has been copied]
#   [error logging, trigger emails, SMS, Slack alerts etc]
#   [ ... ]

end=$(date +%s)  # capture end time

duration=$((end - start))  # difference: time taken by the ETL logic, in seconds

echo "duration is $duration"

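PS: The duration will be in seconds, and you can record it in a log file, a database table, etc. The timestamps are in epoch seconds, and depending on where you log them you can use functions such as:

sec_to_time($duration) - for MySQL

SELECT (TIMESTAMP 'epoch' + 1511680982 * INTERVAL '1 second') AS mytimestamp - for Amazon Redshift (then take the difference between the two epoch timestamps).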
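
To turn individual durations into a per-table average, the log-file idea could be sketched as below. This is only a minimal illustration: the log path `LOAD_LOG`, the line format `table_name duration_seconds`, and the function names are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helpers for keeping a per-table load-time log.
# LOAD_LOG is an assumed path; each line is "table_name duration_seconds".
LOAD_LOG="${LOAD_LOG:-./load_times.log}"

# Append one measurement: log_duration <table> <seconds>
log_duration() {
    printf '%s %s\n' "$1" "$2" >> "$LOAD_LOG"
}

# Print "table average_seconds runs" for every table found in the log.
average_durations() {
    awk '
        { total[$1] += $2; runs[$1]++ }
        END {
            for (t in total)
                printf "%s %.1f %d\n", t, total[t] / runs[t], runs[t]
        }
    ' "$LOAD_LOG" | sort
}
```

In the ETL script, a call such as `log_duration "$table" "$duration"` right after the duration is computed would record each run (with `$table` being whatever variable holds the table name), and `average_durations` could be run at any time to summarize the history.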

Answer

Run this query to find out how fast your COPY queries are running.

select q.starttime, s.query, substring(q.querytxt, 1, 120) as querytxt,
       s.n_files, s.size_mb, s.time_seconds,
       s.size_mb / decode(s.time_seconds, 0, 1, s.time_seconds) as mb_per_s
from (select query, count(*) as n_files,
             sum(transfer_size / (1024 * 1024)) as size_mb,
             (max(end_time) - min(start_time)) / (1000000) as time_seconds,
             max(end_time) as end_time
      from stl_s3client
      where http_method = 'GET' and query > 0 and transfer_time > 0
      group by query) as s
left join stl_query as q on q.query = s.query
where s.end_time >= dateadd(day, -7, current_date)
order by s.time_seconds desc, s.size_mb desc, s.end_time desc
limit 50;

Once you know how many MB/s you are pushing through from S3, you can roughly estimate how long each file will take to load based on its size.
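
The estimate is just size divided by throughput. A minimal sketch, assuming you take a typical `mb_per_s` value from the query above and know the file size in MB (the function name `estimate_load_seconds` is made up for this example):

```shell
#!/bin/sh
# Hypothetical helper: estimate_load_seconds <size_mb> <mb_per_s>
# Prints the estimated load time in seconds, to one decimal place.
# Returns a non-zero status if the throughput is not positive.
estimate_load_seconds() {
    awk -v size="$1" -v rate="$2" 'BEGIN {
        if (rate <= 0) exit 1
        printf "%.1f\n", size / rate
    }'
}
```

For example, `estimate_load_seconds 500 25` estimates a 500 MB file at 25 MB/s. This is only a first approximation: it ignores fixed per-query overhead and any difference in parallelism between loads.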
