Time Series Smoothing for Better Clustering

In time series analysis, dirty and messy data can distort our reasoning and conclusions. This is especially true in this domain because temporal dependency plays a crucial role when dealing with sequences.

Noise and outliers must be handled with care, often through ad-hoc solutions. In this situation, the tsmoothie package can save us a lot of time when preparing time series for analysis. Tsmoothie is a Python library for time series smoothing and outlier detection that can handle multiple series in a vectorized way. It’s useful because it provides the preprocessing steps we need, such as denoising or outlier removal, while preserving the temporal pattern present in the raw data.

In this post, we use these tricks to improve a clustering task. More precisely, we try to identify changes in financial data with an unsupervised approach. In the end, we expect to point out clear patterns in the closing prices that can be used to inspect the hidden behavior of the market.

THE DATA

As introduced before, we work with financial time series. Many tools and premade datasets provide and store financial data. For our purposes, we use a dataset collected from Kaggle. The Stock data 2000–2018 dataset is a cleaned collection of daily stock prices from 2000 to 2018 for around 39 different stocks. It reports volume and open, high, low, and close prices for each day. We focus on the close prices.

For demonstration purposes, we consider the Amazon stock price, but the same findings also appear in other stock signals.

[Figure: Amazon closing price history and distribution]

Time Series Smoothing

The first step in our workflow is time series preprocessing. Our strategy is intuitive and effective. Given a time series of closing prices, we split it into small sliding pieces. Each piece is then smoothed in order to remove outliers. The smoothing process is essential to reduce the noise present in our series and to point out the true patterns that may emerge over time.
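
To make the splitting step concrete, here is a minimal pure-Python sketch of cutting a series into sliding windows. The function name `sliding_windows`, the `step` parameter, and the sample prices are illustrative assumptions; this is not the utility that tsmoothie ships.

```python
def sliding_windows(series, window_size, step=1):
    """Return consecutive windows of `window_size` points, moving `step` points at a time."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    return [
        list(series[start:start + window_size])
        for start in range(0, len(series) - window_size + 1, step)
    ]


# Made-up closing prices, used only for illustration.
closes = [10.0, 10.5, 10.2, 11.0, 11.4, 11.1, 12.0]
windows = sliding_windows(closes, window_size=4)

print(len(windows))  # 4
print(windows[0])    # [10.0, 10.5, 10.2, 11.0]
```

With `step=1` consecutive windows overlap by all but one point; a larger step reduces the overlap at the cost of fewer windows.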

Tsmoothie provides different smoothing techniques for our purpose. It also has a built-in utility to apply a sliding smoothing approach. The raw time series is partitioned into equal-length windows, which are then smoothed independently. We select Locally Weighted Scatterplot Smoothing (LOWESS) as the smoothing procedure.

LOWESS is a powerful non-parametric technique for fitting a smooth line to given data, through either univariate or multivariate smoothing. For each abscissa value, it fits a regression on the points in a moving range around it, weighted according to their distance, in order to calculate the corresponding ordinate value. The smoothing parameter (alpha) is often chosen purely on a “repeated trial” basis. There is no specific technique for selecting its exact value, and a poorly chosen value may lead to “over-smoothing” or “under-smoothing”.
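
The idea can be illustrated with a small from-scratch sketch. It assumes equally spaced observations, tricube distance weights, a local linear fit, and no robustness iterations; the function name `lowess` and the toy data are hypothetical. It is not tsmoothie's implementation, only a way to see what alpha controls.

```python
import math


def lowess(y, alpha=0.6):
    """Basic LOWESS for equally spaced values (local linear fit, tricube weights)."""
    n = len(y)
    if n < 3:
        return list(y)  # too few points to fit anything locally
    x = list(range(n))
    # alpha is the fraction of points taking part in each local regression.
    k = min(n, max(3, math.ceil(alpha * n)))
    smoothed = []
    for xi in x:
        # The distance to the k-th nearest point sets the local bandwidth.
        h = sorted(abs(xj - xi) for xj in x)[k - 1]
        # Tricube weights: nearby points count more, points at or beyond h count zero.
        w = [(1 - min(abs(xj - xi) / h, 1.0) ** 3) ** 3 for xj in x]
        sw = sum(w)
        mx = sum(wj * xj for wj, xj in zip(w, x)) / sw
        my = sum(wj * yj for wj, yj in zip(w, y)) / sw
        sxx = sum(wj * (xj - mx) ** 2 for wj, xj in zip(w, x))
        sxy = sum(wj * (xj - mx) * (yj - my) for wj, xj, yj in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Evaluate the weighted least-squares line at xi.
        smoothed.append(my + slope * (xi - mx))
    return smoothed


# A straight line plus an alternating +1/-1 disturbance (made-up data).
noisy = [i + (1.0 if i % 2 else -1.0) for i in range(20)]
smoothed = lowess(noisy, alpha=0.6)
print([round(v, 2) for v in smoothed])
```

A larger alpha pulls more points into each local fit and produces a flatter curve; a smaller alpha follows the raw data more closely.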

Below is the result of applying this procedure with sliding windows of length 20 (days) and alpha equal to 0.6. In other words, we compute a LOWESS fit for every generated window.

[Figure: The first smoothed windows from the AMZN stock prices]

Time Series Clustering

The second step uses a clustering algorithm to identify the behaviors in our time series. Creating equal-length windows is meant to make this task easier.

Generally speaking, clustering different time series into similar groups is challenging because each data point follows a temporal structure that we must respect in order to obtain satisfactory results. The distance measures used in standard clustering algorithms, such as Euclidean distance, are often not appropriate for time series. A stronger approach is to replace the default distance measure with a metric designed for comparing time series, such as Dynamic Time Warping.
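
To see why the choice of metric matters, the following sketch implements the classic dynamic-programming form of Dynamic Time Warping and compares it with the Euclidean distance on two toy sequences that have the same shape shifted by one step. The function name `dtw_distance` and the sequences are illustrative assumptions.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] ** 0.5


a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]  # same bump, one step earlier

euclidean = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
print(euclidean)           # 2.0
print(dtw_distance(a, b))  # 0.0
```

The Euclidean distance penalizes the shift point by point, while DTW stretches the time axis until the two bumps line up and reports no difference in shape.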

Searching for 4 clusters with K-means and the Dynamic Time Warping metric produces the following results:

[Figure: clustering results with smoothing]
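
The clustering step can be sketched in a few lines of plain Python. This is a simplified illustration, not the code behind the figure: windows are assigned to the closest centroid by DTW, but centroids are updated with a plain pointwise mean (dedicated libraries typically use a DTW-aware averaging scheme instead), and the initialization is a deterministic farthest-first pick. The names `dtw` and `kmeans_dtw` and the synthetic windows are assumptions, and the windows are assumed to be on a comparable scale already.

```python
def dtw(a, b):
    """Dynamic Time Warping distance between two numeric sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[-1][-1] ** 0.5


def kmeans_dtw(windows, k, n_iter=10):
    """Simplified K-means: DTW for assignment, pointwise mean for the update."""
    # Farthest-first initialization keeps the sketch deterministic.
    centroids = [list(windows[0])]
    while len(centroids) < k:
        farthest = max(windows, key=lambda w: min(dtw(w, c) for c in centroids))
        centroids.append(list(farthest))
    labels = []
    for _ in range(n_iter):
        # Assignment step: each window joins its closest centroid under DTW.
        labels = [min(range(k), key=lambda c: dtw(w, centroids[c])) for w in windows]
        # Update step: pointwise mean of the members (windows share one length).
        for c in range(k):
            members = [w for w, lab in zip(windows, labels) if lab == c]
            if members:
                centroids[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centroids


# Synthetic windows: four basic shapes, each repeated with a small offset.
shapes = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8],  # increasing trend
    [8, 7, 6, 5, 4, 3, 2, 1, 0],  # decreasing trend
    [0, 2, 4, 6, 8, 6, 4, 2, 0],  # downward turning point
    [8, 6, 4, 2, 0, 2, 4, 6, 8],  # upward turning point
]
windows = [[v + eps for v in shape] for shape in shapes for eps in (0.0, 0.3, -0.3)]

labels, centroids = kmeans_dtw(windows, k=4)
print(labels)
```

The cluster numbers depend on the initialization, so they need not match the numbering used in the article.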

As we can see, 4 distinct clusters emerge, representing 4 different market movements: an increasing trend (cluster 0), a decreasing trend (cluster 1), a downward turning point (cluster 2), and an upward turning point (cluster 3). We can repeat the same procedure on the raw time windows, without smoothing, and compare the results.

[Figure: clustering results without smoothing]

Now the difference between the 4 groups is not so marked, and it’s more difficult to interpret the generated clusters. The ability to generate meaningful groups from a clustering algorithm is the most important prerequisite of any unsupervised approach. If we can’t attribute an explanation to the results, they can’t be used to make a decision. In this sense, adopting a smoothing preprocessing step can help the analysis.

[Figure: clusters obtained with smoothing]

SUMMARY

In the financial domain, the concept of volatility is fundamental to making decisions. It measures the uncertainty, i.e. the risk, present in the market. Here we went deeper, extending our idea of market regimes to the short term. We identified four clear market conditions by smoothing our time series blocks to better understand the real dynamics of the data. In this post, we took advantage of time series smoothing in a financial clustering application, but this approach is valid and useful in other contexts involving time series analysis.

Translated from: https://towardsdatascience.com/time-series-smoothing-for-better-clustering-121b98f308e8
