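Speeding up a SQL Server CROSS APPLY that gathers aggregated data
Question:
In SQL Server, I am trying to put together a single query that grabs a row and includes aggregated data from the two-hour window before that row, as well as aggregated data from the one-hour window after it. How can I make this run faster?
The rows have timestamps with millisecond precision and are not evenly spaced. I have over 50 million rows in this table, and the query does not seem to complete. There are indexes in many places, but they do not seem to help. I was also considering window functions, but I am not sure whether a sliding window is possible over unevenly spaced rows. Also, for the one-hour window into the future, I am not sure how to do that with a SQL window.
box is a string with 10 unique values. process is a string with 30 unique values. The average duration_ms is 200 ms. Error rows make up less than 0.1% of the data. The 50 million rows cover several years of data.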
select
c1.start_time,
c1.end_time,
c1.box,
c1.process,
datediff(ms,c1.start_time,c1.end_time) as duration_ms,
datepart(dw,c1.start_time) as day_of_week,
datepart(hour,c1.start_time) as hour_of_day,
c3.*,
c5.*
from metrics_table c1
cross apply
(select
avg(cast(datediff(ms,c2.start_time,c2.end_time) as numeric)) as avg_ms,
count(1) as num_process_total,
count(distinct process) as num_process_unique,
count(distinct box) as num_box_unique
from metrics_table c2
where datediff(minute,c2.start_time,c1.start_time) <= 120
and c1.start_time> c2.start_time
and c2.error_code = 0
) c3
cross apply
(select
avg(case when datediff(ms,c4.start_time,c4.end_time)>1000 then 1.0 else 0.0 end) as percent_over_thresh
from metrics_table c4
where datediff(hour,c1.start_time,c4.start_time) <= 1
and c4.start_time> c1.start_time
and c4.error_code= 0
) c5
where
c1.error_code= 0
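Edit
Version: SQL Azure 12.0
Answer
The following should be a step in the right direction... Note: c2.start_time and c4.start_time are no longer wrapped in DATEDIFF functions, which makes the predicates SARGable...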
SELECT
c1.start_time,
c1.end_time,
c1.box,
c1.process,
DATEDIFF(ms, c1.start_time, c1.end_time) AS duration_ms,
DATEPART(dw, c1.start_time) AS day_of_week,
DATEPART(HOUR, c1.start_time) AS hour_of_day,
c3.*,
c5.*
FROM
dbo.metrics_table c1
CROSS APPLY (
SELECT
AVG(CAST(DATEDIFF(ms, c2.start_time, c2.end_time) AS NUMERIC)) AS avg_ms,
COUNT(1) AS num_process_total,
COUNT(DISTINCT process) AS num_process_unique,
COUNT(DISTINCT box) AS num_box_unique
FROM
dbo.metrics_table c2
WHERE
--DATEDIFF(minute,c2.start_time,c1.start_time) <= 120
c2.start_time >= DATEADD(MINUTE, -120, c1.start_time)
AND c2.start_time < c1.start_time
AND c2.error_code = 0
) c3
CROSS APPLY (
SELECT
AVG(CASE WHEN DATEDIFF(ms, c4.start_time, c4.end_time) > 1000 THEN 1.0 ELSE 0.0 END
) AS percent_over_thresh
FROM
dbo.metrics_table c4
WHERE
--DATEDIFF(HOUR, c1.start_time, c4.start_time) <= 1
c4.start_time <= DATEADD(HOUR, 1, c1.start_time)
AND c4.start_time > c1.start_time
AND c4.error_code = 0
) c5
WHERE
c1.error_code = 0;
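Of course, making the query SARGable does no good unless a suitable index is available. The index below should suit all three metrics_table references... (Check which indexes already exist; you may need to create a new one.)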
CREATE NONCLUSTERED INDEX ixf_metricstable_errorcode_starttime ON dbo.metrics_table (
error_code,
start_time
)
INCLUDE (
end_time,
box,
process
)
WHERE
error_code = 0;
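One caveat when swapping DATEDIFF for DATEADD: DATEDIFF counts the datepart boundaries crossed, not elapsed time, so the two filters do not select exactly the same rows near the edge of the window. A minimal sketch with made-up timestamps (the variable names are only illustrative stand-ins for c1.start_time and c2.start_time):

```sql
-- Hypothetical timestamps, chosen to sit just outside a true 120-minute look-back.
DECLARE @row_start DATETIME2(3) = '2017-01-01 12:00:59.000';  -- stands in for c1.start_time
DECLARE @candidate DATETIME2(3) = '2017-01-01 10:00:00.000';  -- stands in for c2.start_time

SELECT
    -- 120 minute boundaries crossed, so "DATEDIFF(...) <= 120" keeps this row
    DATEDIFF(MINUTE, @candidate, @row_start) AS minute_boundaries,
    -- 7259 seconds elapsed, i.e. 120 min 59 s
    DATEDIFF(SECOND, @candidate, @row_start) AS elapsed_seconds,
    -- 0: the range predicate on the bare column drops the row
    CASE WHEN @candidate >= DATEADD(MINUTE, -120, @row_start)
         THEN 1 ELSE 0 END AS kept_by_dateadd_filter;
```

The effect is larger for DATEDIFF(HOUR, ...) <= 1, which can admit rows almost two hours ahead. If the intent is a window of exactly two hours back and one hour forward, the DATEADD form is probably the one that matches it.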
Answer
I used BETWEEN and got good performance on my simple test rig. I also used columnstore, since 50 million records is data-warehouse volume:
CREATE TABLE dbo.metrics_table (
rowId INT IDENTITY,
start_time DATETIME NOT NULL,
end_time DATETIME NOT NULL,
box VARCHAR(10) NOT NULL,
process VARCHAR(10) NOT NULL,
error_code INT NOT NULL
);
-- Add records
;WITH cte AS (
SELECT TOP 3334 ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn
FROM sys.columns c1
CROSS JOIN sys.columns c2
CROSS JOIN sys.columns c3
)
INSERT INTO dbo.metrics_table (start_time, end_time, box, process, error_code)
SELECT
DATEADD(ms, rn, DATEADD(day, rn % 365, '1 Jan 2017')) AS start_time,
DATEADD(ms, rn % 409, DATEADD(ms, rn, DATEADD(day, rn % 365, '1 Jan 2017'))) AS end_time,
'box' + CAST(boxes.box AS VARCHAR(10)) box,
'process' + CAST(processes.process AS VARCHAR(10)) process,
ABS(CAST(rn % 3000 AS BIT) -1) error_code
FROM cte c
CROSS JOIN (SELECT TOP 10 rn FROM cte ORDER BY rn) AS boxes(box)
CROSS JOIN (SELECT TOP 30 rn FROM cte ORDER BY rn) AS processes(process);
-- Create normal clustered index to order the data
CREATE CLUSTERED INDEX cci_metrics_table ON dbo.metrics_table (start_time, end_time, box, process);
--CREATE CLUSTERED INDEX cci_metrics_table ON dbo.metrics_table (box, process, start_time, end_time);
-- Convert to columnstore
CREATE CLUSTERED COLUMNSTORE INDEX cci_metrics_table ON dbo.metrics_table WITH (MAXDOP = 1, DROP_EXISTING = ON);
IF OBJECT_ID('tempdb..#tmp1') IS NOT NULL DROP TABLE #tmp1
-- two hour window before, 1 hour window after
SELECT
c1.start_time,
c1.end_time,
c1.box,
c1.process,
DATEDIFF(ms, c1.start_time, c1.end_time) AS duration_ms,
DATEPART(dw, c1.start_time) AS day_of_week,
DATEPART(hour, c1.start_time) AS hour_of_day,
c2.xavg,
c2.num_process_total,
c2.num_process_unique,
c2.num_box_unique,
c3.percent_over_thresh
INTO #tmp1
FROM dbo.metrics_table c1
CROSS APPLY
(
SELECT
COUNT(1) AS num_process_total,
AVG(CAST(DATEDIFF(ms, start_time, end_time) AS NUMERIC)) xavg,
COUNT(DISTINCT process) num_process_unique,
COUNT(DISTINCT box) num_box_unique
FROM dbo.metrics_table c2
WHERE c2.error_code = 0
AND c2.start_time Between DATEADD(minute, -120, c1.start_time) And c1.start_time
AND c1.start_time > c2.start_time
) c2
CROSS APPLY
(
SELECT
AVG(CASE WHEN DATEDIFF(ms, c4.start_time, c4.end_time) > 1000 THEN 1.0 ELSE 0.0 END) percent_over_thresh
FROM dbo.metrics_table c4
WHERE c4.error_code = 0
AND c4.start_time Between c1.start_time And DATEADD(minute, 60, c1.start_time)
AND c4.start_time > c1.start_time
) c3
WHERE error_code = 0
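To sanity-check the output of the query above, one option is to recompute the look-back count for a single timestamp directly and compare it with what landed in #tmp1. This is only a sketch: @probe is an arbitrary choice of row, and it assumes #tmp1 and dbo.metrics_table exist as created above.

```sql
-- Pick one timestamp that is present in the result set.
DECLARE @probe DATETIME = (SELECT MAX(start_time) FROM #tmp1);

SELECT
    -- Value produced by the CROSS APPLY query for that timestamp
    (SELECT MAX(num_process_total)
       FROM #tmp1
      WHERE start_time = @probe) AS from_tmp1,
    -- Same two-hour look-back, recomputed directly against the base table
    (SELECT COUNT(1)
       FROM dbo.metrics_table
      WHERE error_code = 0
        AND start_time >= DATEADD(minute, -120, @probe)
        AND start_time <  @probe) AS recomputed;
```

The two columns should agree; a mismatch would point at the window predicates rather than at the indexing.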
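I would be surprised if the performance problem were not caused by your WHERE predicates. You have functions in your WHERE clause, which means DATEDIFF has to be computed for every row. In this case you are doing it twice. That means you are performing roughly 100 million calculations. –
@Hogan I tried to go the windowing route, but I did not see a way to look back 2 hours from a given point in time when the data points are not collected at even intervals. That is, the gap from one row to the next might be a few milliseconds, a few seconds, or a few minutes – user4446237
Right, this is not possible in SQL Server (there is no 'RANGE BETWEEN INTERVAL'); you would have to do some pre-aggregation to guarantee one row per minute, etc. But 'COUNT(DISTINCT ...)' is not easily compatible with that. –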
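The pre-aggregation idea from the last comment could look roughly like the sketch below. It assumes a hypothetical helper table dbo.minute_calendar(minute_start) with one row for every minute in the reporting range, including minutes with no activity, so that a ROWS frame of 120 rows really spans 120 minutes. It yields one row per minute rather than per source row, so the windows are only accurate to the minute, and the COUNT(DISTINCT ...) columns are left out because they cannot be rebuilt from per-minute totals.

```sql
-- Assumption: dbo.minute_calendar is a hypothetical table, not part of the original schema.
;WITH per_minute AS (
    SELECT
        cal.minute_start,
        COUNT(m.start_time) AS n,  -- 0 for empty minutes
        COALESCE(SUM(CAST(DATEDIFF(ms, m.start_time, m.end_time) AS BIGINT)), 0) AS total_ms,
        COALESCE(SUM(CASE WHEN DATEDIFF(ms, m.start_time, m.end_time) > 1000
                          THEN 1 ELSE 0 END), 0) AS n_over
    FROM dbo.minute_calendar cal
    LEFT JOIN dbo.metrics_table m
           ON m.start_time >= cal.minute_start
          AND m.start_time <  DATEADD(MINUTE, 1, cal.minute_start)
          AND m.error_code = 0
    GROUP BY cal.minute_start
)
SELECT
    minute_start,
    -- previous 120 whole minutes
    SUM(n) OVER (ORDER BY minute_start
                 ROWS BETWEEN 120 PRECEDING AND 1 PRECEDING) AS num_process_total,
    SUM(total_ms) OVER (ORDER BY minute_start
                        ROWS BETWEEN 120 PRECEDING AND 1 PRECEDING) * 1.0
      / NULLIF(SUM(n) OVER (ORDER BY minute_start
                            ROWS BETWEEN 120 PRECEDING AND 1 PRECEDING), 0) AS avg_ms,
    -- next 60 whole minutes
    SUM(n_over) OVER (ORDER BY minute_start
                      ROWS BETWEEN 1 FOLLOWING AND 60 FOLLOWING) * 1.0
      / NULLIF(SUM(n) OVER (ORDER BY minute_start
                            ROWS BETWEEN 1 FOLLOWING AND 60 FOLLOWING), 0) AS percent_over_thresh
FROM per_minute;
```

Averages are rebuilt as sum over count so that they stay correct across minutes with different row counts; NULLIF guards against windows that contain no rows at all. Joining the result back to metrics_table on the truncated minute would attach these figures to individual rows, at the cost of the per-minute approximation.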