paddlets.models.dl.paddlepaddle.adapter.paddle_dataset_impl

class PaddleDatasetImpl(rawdataset: TSDataset, in_chunk_len: int, out_chunk_len: int, skip_chunk_len: int, sampling_stride: int, time_window: Optional[Tuple] = None)

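Bases: Dataset

An implementation of paddle.io.Dataset.

1> Any (known covariate / observed covariate) columns that will not be used should be removed from the TSDataset before it is passed to this adapter.

2> By default, time_window assumes that each sample contains the feature chunk X (i.e. in_chunk), the skipped chunk (i.e. skip_chunk), and the label chunk Y (i.e. out_chunk).

3> If the caller explicitly passes time_window and its upper bound is larger than len(TSDataset._target) - 1, the built samples contain only the feature chunk X (i.e. in_chunk), without the skipped chunk (i.e. skip_chunk) or the label chunk Y (i.e. out_chunk).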
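The rule in note 3 can be expressed as a one-line check. The sketch below is only an illustration of that rule; the helper name window_includes_labels is hypothetical and is not part of paddlets.

```python
from typing import Tuple


def window_includes_labels(time_window: Tuple[int, int], target_len: int) -> bool:
    """Return True when samples built from this window also carry skip_chunk and Y.

    Hypothetical helper: per note 3, a window whose upper bound exceeds
    target_len - 1 yields samples that contain only X (in_chunk).
    """
    upper_bound = time_window[1]
    return upper_bound <= target_len - 1


# With an 11-point target, (8, 10) stays inside the target, (15, 15) does not.
print(window_includes_labels((8, 10), target_len=11))
print(window_includes_labels((15, 15), target_len=11))
```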

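Parameters
  • rawdataset (TSDataset) – The raw TSDataset used to build the paddle.io.Dataset samples.

  • in_chunk_len (int) – The length of the time series fed to the model as input.

  • out_chunk_len (int) – The length of the sequence produced by the model as output.

  • skip_chunk_len (int) – Optional. The number of time steps skipped between the input sequence and the output sequence; these steps are used neither as features nor as prediction targets. The default value is 0.

  • sampling_stride (int, optional) – The number of time steps between sample i and sample i+1. Specifically, let t be the time index of the target time series, t[i] the start time of sample i, and t[i+1] the start time of sample i+1; then sampling_stride equals t[i+1] - t[i], i.e. the number of time points between two adjacent samples.

  • time_window (Tuple, optional) – A two-element tuple defining the window within which the adapter is allowed to build samples. time_window[0] is the lower bound and time_window[1] is the upper bound. The range is closed on both ends, and each index in it is the tail index of one sample.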
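Because time_window holds tail indices, it can help to see how one tail index maps back onto the three chunks. The following is a minimal sketch under that reading of the parameters; chunk_indices is a hypothetical helper written for illustration and does not reflect how paddlets implements the split internally.

```python
from typing import Tuple


def chunk_indices(tail: int, in_chunk_len: int, skip_chunk_len: int,
                  out_chunk_len: int) -> Tuple[range, range, range]:
    """Map a sample's tail index to (in_chunk, skip_chunk, out_chunk) index ranges.

    Hypothetical helper for illustration; not part of paddlets.
    """
    out_start = tail - out_chunk_len + 1      # first index of Y
    skip_start = out_start - skip_chunk_len   # first index of skip_chunk
    in_start = skip_start - in_chunk_len      # first index of X
    if in_start < 0:
        raise ValueError("tail index is too small to fit a full sample")
    return (range(in_start, skip_start),
            range(skip_start, out_start),
            range(out_start, tail + 1))


in_idx, skip_idx, out_idx = chunk_indices(8, in_chunk_len=4, skip_chunk_len=3, out_chunk_len=2)
print(list(in_idx), list(skip_idx), list(out_idx))
```

For tail index 8 with in_chunk_len = 4, skip_chunk_len = 3 and out_chunk_len = 2, the ranges cover indices 0 to 3, 4 to 6 and 7 to 8 respectively.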

_supported_paddle_versions

The set of paddle module versions currently supported.

Type

Set[str]

_rawdataset

The raw TSDataset used to build the paddle.io.Dataset samples.

Type

TSDataset

_target_in_chunk_len

The length of the time series fed to the model as input.

Type

int

_target_out_chunk_len

The length of the sequence produced by the model as output.

Type

int

_target_skip_chunk_len

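The number of time steps skipped between the input sequence and the output sequence; these steps are used neither as features nor as prediction targets. The default value is 0.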

Type

int

_known_cov_chunk_len

For a single sample, the length of the known covariate chunk.

Type

int

_observed_cov_chunk_len

For a single sample, the length of the observed covariate chunk.

Type

int

_sampling_stride

The number of time steps between sample i and sample i+1.

Type

int

_time_window

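A two-element tuple defining the window within which the adapter is allowed to build samples. time_window[0] is the lower bound and time_window[1] is the upper bound. The range is closed on both ends, and each index in it is the tail index of one sample.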

Type

Tuple, optional

_samples

The list of built samples.

Type

List[Dict[str, np.ndarray]]

Examples

# 1) in_chunk_len examples
# Given:
tsdataset.target = [0, 1, 2, 3, 4]
skip_chunk_len = 0
out_chunk_len = 1

# 1.1) If in_chunk_len = 1, sample[0]:
# X -> skip_chunk -> Y
# (0) -> () -> (1)

# 1.2) If in_chunk_len = 2, sample[0]:
# X -> skip_chunk -> Y
# (0, 1) -> () -> (2)

# 1.3) If in_chunk_len = 3, sample[0]:
# X -> skip_chunk -> Y
# (0, 1, 2) -> () -> (3)

# 2) out_chunk_len examples
# Given:
tsdataset.target = [0, 1, 2, 3, 4]
in_chunk_len = 1
skip_chunk_len = 0

# 2.1) If out_chunk_len = 1, sample[0]:
# X -> skip_chunk -> Y
# (0) -> () -> (1)

# 2.2) If out_chunk_len = 2, sample[0]:
# X -> skip_chunk -> Y
# (0) -> () -> (1, 2)

# 2.3) If out_chunk_len = 3, sample[0]:
# X -> skip_chunk -> Y
# (0) -> () -> (1, 2, 3)

# 3) skip_chunk_len examples
# Given:
tsdataset.target = [0, 1, 2, 3, 4]
in_chunk_len = 1
out_chunk_len = 1

# 3.1) If skip_chunk_len = 0, sample[0]:
# X -> skip_chunk -> Y
# (0) -> () -> (1)

# 3.2) If skip_chunk_len = 1, sample[0]:
# X -> skip_chunk -> Y
# (0) -> (1) -> (2)

# 3.3) If skip_chunk_len = 2, sample[0]:
# X -> skip_chunk -> Y
# (0) -> (1, 2) -> (3)

# 3.4) If skip_chunk_len = 3, sample[0]:
# X -> skip_chunk -> Y
# (0) -> (1, 2, 3) -> (4)

# 4) sampling_stride examples
# Given:
tsdataset.target = [0, 1, 2, 3, 4]
in_chunk_len = 1
skip_chunk_len = 0
out_chunk_len = 1

# 4.1) If sampling_stride = 1, samples:
# X -> skip_chunk -> Y
# (0) -> () -> (1)
# (1) -> () -> (2)
# (2) -> () -> (3)
# (3) -> () -> (4)

# 4.2) If sampling_stride = 2, samples:
# X -> skip_chunk -> Y
# (0) -> () -> (1)
# (2) -> () -> (3)

# 4.3) If sampling_stride = 3, samples:
# X -> skip_chunk -> Y
# (0) -> () -> (1)
# (3) -> () -> (4)
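The enumerations in examples 1) to 4) can be reproduced with a short pure-Python sketch. It assumes samples are enumerated by tail index, starting at in_chunk_len + skip_chunk_len + out_chunk_len - 1 and ending at the last target index, stepping by sampling_stride. The function build_samples is hypothetical and only mirrors the examples; it is not the paddlets implementation.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Tuple[int, ...], Tuple[int, ...], Tuple[int, ...]]


def build_samples(target: Sequence[int], in_chunk_len: int, skip_chunk_len: int,
                  out_chunk_len: int, sampling_stride: int = 1) -> List[Sample]:
    """Enumerate (X, skip_chunk, Y) triples; hypothetical helper for illustration."""
    sample_len = in_chunk_len + skip_chunk_len + out_chunk_len
    first_tail = sample_len - 1      # smallest tail index that fits a full sample
    last_tail = len(target) - 1      # largest index available in the target
    samples: List[Sample] = []
    for tail in range(first_tail, last_tail + 1, sampling_stride):
        head = tail - sample_len + 1
        x_end = head + in_chunk_len
        y_start = x_end + skip_chunk_len
        samples.append((tuple(target[head:x_end]),
                        tuple(target[x_end:y_start]),
                        tuple(target[y_start:tail + 1])))
    return samples


target = [0, 1, 2, 3, 4]
for stride in (1, 2, 3):
    pairs = [(x, y) for x, _, y in build_samples(target, 1, 0, 1, stride)]
    print(stride, pairs)
```

With sampling_stride = 2 the sketch yields the pairs (0) -> (1) and (2) -> (3), matching example 4.2.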
# 5) time_window examples:
# 5.1) The default time_window calculation formula is as follows:
# time_window[0] = 0 + in_chunk_len + skip_chunk_len + (out_chunk_len - 1)
# time_window[1] = max_target_idx
#
# Given:
tsdataset.target = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
in_chunk_len = 4
skip_chunk_len = 3
out_chunk_len = 2
sampling_stride = 1

# The following equation holds:
max_target_idx = len(tsdataset.target) - 1 = 10

# The default time_window is calculated as follows:
time_window[0] = 0 + 4 + 3 + (2 - 1) = 7 + 1 = 8
time_window[1] = max_target_idx = 10
time_window = (8, 10)

# 3 samples will be built in total:
X -> Y
(0, 1, 2, 3) -> (7, 8)
(1, 2, 3, 4) -> (8, 9)
(2, 3, 4, 5) -> (9, 10)


# 5.2) Each element in time_window refers to the TAIL index of a sample, NOT its HEAD index.
# The following two scenarios show how to pass the expected time_window parameter to build samples.
# Given:
tsdataset.target = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
in_chunk_len = 4
skip_chunk_len = 3
out_chunk_len = 2

# Scenario 5.2.1 - Suppose the following training samples are expected to be built:
# X -> Y
# (0, 1, 2, 3) -> (7, 8)
# (1, 2, 3, 4) -> (8, 9)
# (2, 3, 4, 5) -> (9, 10)

# The 1st sample's tail index is 8
# The 2nd sample's tail index is 9
# The 3rd sample's tail index is 10

# Thus, the time_window parameter should be as follows:
time_window = (8, 10)

# The following time_window values are NOT correct:
time_window = (0, 2)
time_window = (0, 10)

# Scenario 5.2.2 - Suppose the following predict sample is expected to be built:
# X -> Y
# (7, 8, 9, 10) -> (14, 15)

# The only sample's tail index is 15.

# Thus, the time_window parameter should be as follows:
time_window = (15, 15)

# 5.3) The maximum allowed time_window upper bound is calculated as follows:
# time_window[1] <= len(tsdataset.target) - 1 + skip_chunk_len + out_chunk_len
# The reason is that the built paddle.io.Dataset is used for a single call of :func:`model.predict`, which
# only allows a single predict sample. Any time_window upper bound larger than that sample's TAIL index
# is not allowed, because the target time series is not long enough to build the past target chunk.
#
# Given:
tsdataset.target = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
in_chunk_len = 4
skip_chunk_len = 3
out_chunk_len = 2

# For a single :func:`model.predict` call:
X = in_chunk = (7, 8, 9, 10)

# max allowed time_window[1] is calculated as follows:
time_window[1] <= len(tsdataset.target) - 1 + skip_chunk_len + out_chunk_len = 11 - 1 + 3 + 2 = 15

# Note that time_window[1] (i.e. 15) is larger than max_target_idx (i.e. 10), but this time_window
# upper bound is still valid, because a predict sample does not need skip_chunk (i.e. [11, 12, 13]) or
# out_chunk (i.e. [14, 15]).

# Any value larger than 15 (e.g. 16) is invalid, because the existing target time series is NOT long
# enough to build X for the prediction sample. See the following example:
# Given:
time_window = (16, 16)

# The calculated out_chunk = (15, 16)
# The calculated skip_chunk = (12, 13, 14)

# Thus, the in_chunk should be [8, 9, 10, 11]
# However, the tail index of the calculated in_chunk (11) is beyond the max target index
# (i.e. len(tsdataset.target) - 1 = 10), so the current target time series cannot provide index 11 to build this sample.
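The upper-bound rule from 5.3 can be checked with a small sketch. It assumes the bound formula stated above and recovers the in_chunk for a predict sample from its tail index; both function names are hypothetical and are not part of paddlets.

```python
from typing import List, Sequence


def max_window_upper_bound(target_len: int, skip_chunk_len: int, out_chunk_len: int) -> int:
    """Largest tail index allowed by the formula in 5.3 (hypothetical helper)."""
    return target_len - 1 + skip_chunk_len + out_chunk_len


def predict_in_chunk(target: Sequence[int], tail: int, in_chunk_len: int,
                     skip_chunk_len: int, out_chunk_len: int) -> List[int]:
    """Return X for the sample ending at `tail`, or raise if X cannot be built."""
    upper = max_window_upper_bound(len(target), skip_chunk_len, out_chunk_len)
    if tail > upper:
        raise ValueError(f"time_window upper bound {tail} exceeds the maximum {upper}")
    in_end = tail - out_chunk_len - skip_chunk_len + 1   # exclusive end of in_chunk
    in_start = in_end - in_chunk_len
    if in_start < 0:
        raise ValueError("tail index is too small to fit in_chunk")
    return list(target[in_start:in_end])


target = list(range(11))
print(max_window_upper_bound(len(target), skip_chunk_len=3, out_chunk_len=2))
print(predict_in_chunk(target, 15, in_chunk_len=4, skip_chunk_len=3, out_chunk_len=2))
```

Passing a tail index of 16 to predict_in_chunk raises ValueError in this sketch, which corresponds to the invalid time_window = (16, 16) case described above.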