This report analyzes why standard vectorization methods, namely statistical summaries (mean/var/AUC), automatic feature extraction (tsfresh, catch22), convolutional representations (MiniRocket), and self-supervised embeddings (TS2Vec), fail when the time series length T is very short, especially T ≤ 10. As an alternative, it argues that Raw Vectorization (Identity Mapping) is not just an easy fallback but the information-theoretically and statistically justified first choice. In the T ≤ 10 regime, summary, automatic-feature, and embedding methods all operate where estimator variance ≫ signal. A transform cannot add information about the target; it can only preserve or discard it, and on such short inputs the extra dimensions mostly carry estimation noise. Raw Vector is the minimal sufficient statistic for nonparametric time series.

1. Problem Definition: Why Short Time Series Are Different

1.1 The Spectrum of Length T

Time series length T is one of the most important variables in choosing a model, but the meaning of “short” varies widely across domains.

Length range	Common name	Recommended representation
T ≤ 10	Ultra-short / Tiny series	Raw Vectorization (this report’s recommendation)
10 < T ≤ 30	Short series	Raw + simple statistics (mean, slope)
30 < T ≤ 200	Medium series	catch22, MiniRocket, tsfresh
T > 200	Long series	TS2Vec, Transformer, InceptionTime

1.2 Degrees-of-Freedom View

1.2.1 Standard Error of the Mean

Under an i.i.d. assumption, the Standard Error (SE) of the sample mean $\bar{x} = (1/T)\sum x_i$ is:

$$\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{T}}$$

where σ is the population standard deviation. Requiring relative error ≤ 10% (i.e., SE/σ ≤ 0.1) gives:

$$\frac{\sigma}{\sqrt{T}} \le 0.1\sigma \;\;\Longrightarrow\;\; T \ge 100$$

At T = 5, SE/σ = 1/√5 ≈ 0.447 (44.7% error); at T = 10, 1/√10 ≈ 0.316 (31.6% error). Even the mean therefore carries estimation noise of roughly one-third to one-half of σ.

1.2.2 Standard Error of the Variance

Under normality, the SE of the sample variance $s^2$ is:

$$\mathrm{SE}(s^2) = \sigma^2 \sqrt{\frac{2}{T-1}}$$

The 10% relative-error criterion gives T ≥ 201. At T = 5, the relative error is 70.7%; at T = 10, 47.1%. Variance estimation in T ≤ 10 is essentially meaningless.

1.2.3 95% Confidence Interval for the Mean

The half-width of the 95% Confidence Interval (CI) is $1.96 \cdot \sigma/\sqrt{T}$. Requiring half-width ≤ 0.1σ gives T ≥ 19.6² ≈ 384.2, i.e., T ≥ 385. At T = 5 the half-width is ±0.877σ (almost as wide as the signal); at T = 10, ±0.620σ.

1.2.4 Summary

Estimator	T for 10% relative error	Error at T = 5	Error at T = 10
SE of mean	T ≥ 100	44.7%	31.6%
SE of variance	T ≥ 201	70.7%	47.1%
95% CI half-width of mean	T ≥ 385	87.7%	62.0%
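The figures in this summary can be re-derived from the three formulas in §1.2. The following is a minimal sketch; the function names `relative_errors` and `thresholds` are illustrative and not part of any library. The thresholds are returned as real numbers, so the smallest admissible integer T is the next integer at or above each value.

```python
import math

def relative_errors(T: int) -> dict:
    """Relative error of each estimator at series length T (i.i.d. normal data)."""
    return {
        "se_mean": 1.0 / math.sqrt(T),         # SE(mean) / sigma
        "se_var": math.sqrt(2.0 / (T - 1)),    # SE(s^2) / sigma^2
        "ci95_half": 1.96 / math.sqrt(T),      # 95% CI half-width / sigma
    }

def thresholds(target: float = 0.10) -> dict:
    """Real-valued length at which each relative error equals `target`."""
    return {
        "se_mean": 1.0 / target**2,
        "se_var": 2.0 / target**2 + 1.0,
        "ci95_half": (1.96 / target) ** 2,
    }

for T in (5, 10):
    print(T, {k: round(v, 3) for k, v in relative_errors(T).items()})
print({k: round(v, 2) for k, v in thresholds().items()})
```

Changing `target` shows how quickly the required length grows: halving the tolerated error quadruples the required T.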

At T ≤ 10, all basic statistics are outside their “trustworthy” region. Higher moments and the AutoCorrelation Function (ACF) are even worse: ACF at lag k requires T−k pairs, so ACF(3) at T = 5 rests on just two pairs and is essentially random. Once k ≥ T there are no pairs at all, so tsfresh’s “autocorrelation lag k” features at those lags are mathematically undefined and typically come back as NaN. The T ≤ 10 regime is not just “few data” but “estimator variance dominates the signal,” and extracting features in this regime amounts to turning noise into dimensions.
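The pair-count argument can be checked with a small simulation. The sketch below uses a generic textbook ACF estimator (an assumption; it is not tsfresh's implementation) on white noise, whose true autocorrelation is zero at every lag, and measures how widely the estimate scatters. The helper names `acf` and `acf_spread` are hypothetical.

```python
import random
import statistics

def acf(x, k):
    """Sample autocorrelation at lag k; NaN when no pairs exist or variance is zero."""
    T = len(x)
    if k >= T:
        return float("nan")          # zero usable pairs
    m = sum(x) / T
    denom = sum((v - m) ** 2 for v in x)
    if denom == 0:
        return float("nan")
    num = sum((x[i] - m) * (x[i + k] - m) for i in range(T - k))  # T - k pairs
    return num / denom

def acf_spread(T, k, trials=500, seed=0):
    """Standard deviation of the lag-k ACF estimate over simulated white-noise series."""
    rng = random.Random(seed)
    vals = [acf([rng.gauss(0, 1) for _ in range(T)], k) for _ in range(trials)]
    return statistics.pstdev(vals)

print("pairs at T=5, lag 3:", 5 - 3)
print("spread at T=5,   lag 1:", round(acf_spread(5, 1), 3))
print("spread at T=200, lag 1:", round(acf_spread(200, 1), 3))
```

The spread at T = 5 is several times the spread at T = 200 even though the underlying process is identical, which is the sense in which the short-series estimate is mostly noise.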

2. Limitations of Existing Vectorization Methods

Method	Output dim	Source	Min length	At T = 5	At T = 10
Statistical Summary (mean, std, min, max, median)	5–10	estimate (user-defined)	T ≥ 2	Estimator variance dominates signal; time order completely lost	Variance stabilizes but time order still lost; weak on time-dependent signals
Area Under the Curve (AUC)	1	trivial (definition)	T ≥ 2	5 values → 1 scalar; [1,5,1,5,1] and [3,3,3,3,3] yield the same AUC	10 values → 1 scalar; infinitely many equal-area equivalence classes
tsfresh comprehensive	783	Christ et al. (2018) docs	≥ 30 in practice	NaN ratio 60–80% (estimate); ACF lag k≥3, high-frequency Fast Fourier Transform (FFT) coefficients, entropy all undefined	NaN ratio 30–50% (estimate); ACF lag 3–5 defined but with very high estimator variance
tsfresh efficient	78	Christ et al. (2018) docs	≥ 20 in practice	NaN ratio 40–60% (estimate)	NaN ratio 20–30% (estimate); usable but trees may split on NaN columns
catch22	22	Lubba et al. (2019)	≥ 50 in practice	DN_HistogramMode, SB_TransitionMatrix, SP_Summaries undefined; over half NaN	Some ACF-based features defined; histogram mode and transition matrix still meaningless
MiniRocket	9,996	Dempster et al. (2021); aeon-toolkit	9 (enforced)	Library refuses input (padding required)	Library accepts, but each length-9 kernel (dilation 1 only) covers almost the entire signal; Proportion of Positive Values (PPV) is computed from very few positions and tends to saturate at 0 or 1
TS2Vec	320 (default)	Yue et al. (2022)	≥ 50 in practice	Sub-series cropping yields length 2–3; max-pooling collapses hierarchy after 1–2 layers; almost no learning signal	Sub-series of length 5 possible but contrastive signal weak; representational power limited even when pretraining domain matches
Raw Vector	T (= 5 or 10)	trivial (definition)	T ≥ 1	All information preserved; zero NaN; directly usable by tree models	All information preserved; no dimensional burden; optimal representation

At T = 5, MiniRocket is rejected by the library; tsfresh and catch22 produce more than 50% NaN.
At T = 10, MiniRocket’s length-9 kernels barely slide at all, so PPV pooling takes only a handful of distinct values; the library accepts the input but loses almost all information.
tsfresh and catch22 produce fewer NaNs at T = 10, but the surviving features still have huge estimator variance, effectively adding noise dimensions.
Only Raw Vector preserves all information at both T = 5 and T = 10 with zero NaN.
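The MiniRocket observation can be made concrete by counting how many positions a length-9 kernel has to slide over. This is a simplified sketch that ignores the padding the real transform applies to part of its kernels (an assumption made for clarity); `valid_positions` and `possible_ppv_values` are hypothetical helper names.

```python
def valid_positions(T, kernel_length=9, dilation=1):
    """Positions where a dilated kernel fits fully inside a length-T series (no padding)."""
    span = (kernel_length - 1) * dilation + 1   # receptive field of the dilated kernel
    return max(0, T - span + 1)

def possible_ppv_values(T, kernel_length=9, dilation=1):
    """All values the proportion of positive outputs can take over those positions."""
    n = valid_positions(T, kernel_length, dilation)
    return [i / n for i in range(n + 1)] if n > 0 else []

for T in (5, 10, 100):
    print(T, valid_positions(T), possible_ppv_values(T)[:5])
```

At T = 10 only two positions exist, so an unpadded PPV can only be 0, 0.5, or 1; at T = 5 the kernel does not fit at all.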

3. Why Raw Vectorization Is Justified

3.1 Terminology

Using the raw time series directly as the input vector goes by several names in the literature:

Raw Feature Vectorization: The most general term; placing each time-step value into a dimension of the vector.
Identity Mapping: The functional view, $f(x) = x$. Used as a baseline in representation-learning literature.
Flattening: Reshaping a (time, feature) 2D structure into 1D. Standard for feeding multivariate short series to tree models.
Direct Mapping: An intentional choice to skip embeddings or dimension reduction in regimes where overfitting risk is high.
Pointwise Encoding: Neural-network terminology where each timestamp is represented by its own raw value.

3.2 Minimal Sufficient Statistic

The relevant object whose dimension we discuss is the minimal sufficient statistic — the coarsest among all sufficient statistics (Lehmann & Casella 1998). Its dimension is the intrinsic information dimension of the distribution family; compressing below it loses sufficiency. Time series classification and regression are typically nonparametric (no assumed distribution family) and time-order matters, so they fall in the nonparametric + time-order-significant regime. There, the minimal sufficient statistic is the data itself, $S(x) = x$, with dimension T. Any compression below T loses sufficiency.

For T ≤ 10, the minimal sufficient dimension is T itself (5 at T = 5, 10 at T = 10), exactly the dimension of Raw Vector. tsfresh (783D), catch22 (22D), MiniRocket (9,996D) all use non-injective transforms and therefore fail to preserve sufficiency, regardless of how large their output dimension is (Appendix E).

The following table summarizes minimal-sufficient dimensions derived via the Fisher-Neyman factorization theorem (Appendix D).

Setting	Distribution-family assumption	Dimension of minimal sufficient
Parametric	Normal $N(\mu, \sigma^2)$ i.i.d.	2 (Σx_i, Σx_i²) — fixed, independent of T
Parametric	k-parameter exponential family	k — fixed, independent of T
Nonparametric	Family unknown, time order matters	T ($S(x) = x$ itself)
Nonparametric	Family unknown, order irrelevant	T (order statistics)

The same conclusion follows from a sample-efficiency view. With N samples each of length T, the data contain NT scalars in total. If a model has P parameters to estimate, a common rule of thumb is NT ≥ 10P. When T ≤ 10, the number of parameters P that can be reliably estimated is itself small: only simple models work, and the input representation should be as low-dimensional as possible. Raw Vector, with dimension exactly T, is the most compatible.

3.3 Curse of Dimensionality

In high-dimensional space, Euclidean distances concentrate (concentration of distances), eroding the discriminative power of distance-based algorithms (Bellman 1961). In the hundreds-to-thousands-dimensional vectors produced by tsfresh and MiniRocket, at most T dimensions' worth of signal is present; the rest behaves like noise.

3.3.1 Definition of Signal-to-Noise Ratio (SNR)

The Signal-to-Noise Ratio (SNR) is defined as the signal variance divided by the noise variance:

$$\mathrm{SNR}_{\mathrm{raw}} = \frac{\sigma^2_{\mathrm{signal}}}{\sigma^2_{\mathrm{noise}}}$$

$\sigma^2_{\mathrm{signal}}$: variance of the meaningful variation in the original time series that is actually related to the target Y.
$\sigma^2_{\mathrm{noise}}$: variance of random fluctuations unrelated to Y — measurement error, sensor noise, environmental noise. Modeled as $x_{\mathrm{observed}} = x_{\mathrm{signal}} + \varepsilon$ with $\varepsilon \sim (0, \sigma^2_{\mathrm{noise}})$.

3.3.2 SNR Dilution Under Dimensional Expansion (Heuristic Model)

When a T-dimensional signal is expanded to a $d_{\mathrm{total}}$-dimensional representation, signal still resides in only T dimensions; the remaining $(d_{\mathrm{total}} - T)$ dimensions are essentially noise-like residuals. A simplified SNR-dilution model is:

$$\mathrm{SNR}_{\mathrm{after}} \approx \mathrm{SNR}_{\mathrm{raw}} \times \frac{d_{\mathrm{signal}}}{d_{\mathrm{total}}}$$

$\mathrm{SNR}_{\mathrm{after}}$: SNR of the output vector after the transform.
$d_{\mathrm{signal}}$: number of output dimensions actually carrying signal. Since transforms cannot add information, $d_{\mathrm{signal}} \le T$.
$d_{\mathrm{total}}$: total output dimension after the transform.

Source caveat: The above is a heuristic model (estimate) used in this report, not a textbook theorem. Exact SNR change depends on noise distribution, signal representation, and nonlinearity of the transform; no result of this exact form appears in Cover & Thomas (2006). The qualitative conclusion — that SNR dilutes proportionally when $d_{\mathrm{total}} \gg d_{\mathrm{signal}}$ — is, however, a general intuition tied to the curse of dimensionality.

3.3.3 Dilution Comparison at T = 10

Method	$d_{\mathrm{total}}$	$d_{\mathrm{signal}}$	$d_{\mathrm{signal}}/d_{\mathrm{total}}$	Interpretation
Raw Vector	10	10	1.000	1× no dilution, no compression (optimal)
Stats (10D)	10	10	1.000	1× no dilution, no compression (but time order lost)
AUC	1	1	1.000	no dilution, but 10× information compression (10D → scalar)
catch22	22	10	0.455	~2× dilution (12 noise dims added)
tsfresh efficient	78	10	0.128	~8× dilution (68 noise dims added)
tsfresh comprehensive	783	10	0.013	~78× dilution (773 noise dims added)
TS2Vec	320	10	0.031	~32× dilution (310 noise dims added)
MiniRocket	9,996	10	0.001	~1,000× dilution (9,986 noise dims added)

Tree models can find spurious information gain on noise dimensions when picking splits, leading to validation degradation. The smaller T is, the smaller $d_{\mathrm{signal}}/d_{\mathrm{total}}$ becomes — so SNR damage from large representations is most severe on short series.
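The dilution ratios for the expanding transforms follow directly from the heuristic model in §3.3.2. A minimal sketch, using the output dimensions quoted in this report and taking $d_{\mathrm{signal}} = \min(T, d_{\mathrm{total}})$ as the upper bound (an assumption of the heuristic, not a measured quantity):

```python
def dilution_ratio(T: int, d_total: int) -> float:
    """Heuristic d_signal / d_total with d_signal capped at min(T, d_total)."""
    d_signal = min(T, d_total)
    return d_signal / d_total

# Output dimensions as quoted in this report.
output_dims = {
    "Raw Vector": None,          # dimension equals T
    "catch22": 22,
    "tsfresh efficient": 78,
    "TS2Vec": 320,
    "tsfresh comprehensive": 783,
    "MiniRocket": 9996,
}

for T in (5, 10):
    for name, d in output_dims.items():
        d_total = T if d is None else d
        print(f"T={T:2d}  {name:22s} ratio={dilution_ratio(T, d_total):.3f}")
```

Running it for T = 5 next to T = 10 shows every ratio halving, which is the "smaller T, worse dilution" effect described above.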

3.4 Affinity with Tree Models

Raw Vectorization fits gradient-boosted tree models — Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), CatBoost — exceptionally well:

Monotone-transform invariance: Tree splits depend only on the ordering of each feature's values, so trees are invariant to strictly monotone per-feature transforms and need no normalization or standardization. The scale-mixing problem of tsfresh-style mixed features disappears.
Native missing-value handling: LightGBM and XGBoost route NaN to a learned branch. Variable-length padding like [x₁, x₂, x₃, NaN, NaN] is handled automatically.
Automatic interactions: Trees can approximate time-pair relationships (differences, ratios) through combinations of axis-aligned splits, capturing much of slope and curvature without manual engineering.
Nonlinear thresholds: Unlike AUC’s single-scalar compression, trees separate signal via threshold combinations like ‘x₃ > 0.5 AND x₅ < 0.3’. When a specific timestamp matters (e.g., the third clinical day), the tree picks it automatically.
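The monotone-transform invariance can be illustrated with a toy one-split "stump". This is a sketch only: `stump_left_indices` is a hypothetical helper that picks the threshold with the fewest misclassifications, not the split finder used by LightGBM or XGBoost, and the data are made up.

```python
import math

def stump_left_indices(values, labels):
    """Indices sent left by the threshold split with the fewest errors
    (binary 0/1 labels, each child predicts its majority class)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_err, best_left = None, None
    for cut in range(1, len(order)):
        if values[order[cut - 1]] == values[order[cut]]:
            continue  # no threshold can separate equal values
        err = 0
        for side in (order[:cut], order[cut:]):
            ones = sum(labels[i] for i in side)
            err += min(ones, len(side) - ones)   # minority count = errors on this side
        if best_err is None or err < best_err:
            best_err, best_left = err, frozenset(order[:cut])
    return best_left

x3 = [0.10, 0.40, 0.35, 0.80, 0.90, 0.05]   # hypothetical values of one time step
y = [0, 0, 0, 1, 1, 0]

raw_split = stump_left_indices(x3, y)
warped_split = stump_left_indices([math.exp(5 * v) for v in x3], y)  # strictly increasing warp
print(sorted(raw_split), raw_split == warped_split)
```

Because the split depends only on the sort order, the same samples go left before and after the warp, which is why rescaling raw time steps is unnecessary for tree models.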

4. Practical Guidelines

4.1 Decision Tree

flowchart TD
    A([Time-series length T]) --> B{Which range?}
    B -->|T ≤ 10| C[Raw Vectorization<br/>required]
    B -->|10 < T ≤ 30| D[Raw Vector and<br/>5–10 stat features]
    B -->|30 < T ≤ 200| E[catch22 or<br/>tsfresh efficient]
    B -->|T > 200| F[MiniRocket<br/>or Hydra]

    linkStyle 0 color:black,font-weight:bold
    linkStyle 1 color:green, background-color:#fff, font-weight:bold
    linkStyle 2 color:black, background-color:white, font-weight:bold
    linkStyle 3 color:black, background-color:white, font-weight:bold
    linkStyle 4 color:black, background-color:white, font-weight:bold

    C --> C1[optional add-on:<br/>simple<br/>derived features]
    D ---> D1[fallback:<br/>catch22<br/>NaN check required]
    E ---> E1[fallback:<br/>MiniRocket<br/>T ≥ 9 needed]
    F ---> F1[fallback:<br/>TS2Vec for<br/>large data]

    classDef root fill:#1F4E79,stroke:#1F4E79,color:#fff
    classDef decision fill:#FFF2CC,stroke:#BF9000,color:#333
    classDef recommended fill:#C6E0B4,stroke:#548235,color:#1F3864
    classDef alternative fill:#DEEBF7,stroke:#2E74B5,color:#1F3864
    classDef addon fill:#E2F0D9,stroke:#548235,color:#333,font-style:italic
    classDef fallback fill:#FCE4D6,stroke:#C65911,color:#333,font-style:italic
 
    class A root
    class B decision
    class C recommended
    class D,E,F alternative
    class C1 addon
    class D1,E1,F1 fallback

4.2 Best Practices

4.2.1 Variable-Length Handling

In practice, samples rarely all share the same T. When T varies between 1 and 10, two strategies are safe and can be combined:

Right padding with sentinel: Pad with NaN to fixed length. LightGBM and XGBoost route NaN to a learned branch, so missingness itself becomes informative.
Length indicator: Append the actual length as one extra dimension, e.g., [x₁, x₂, x₃, NaN, …, NaN, len = 3]. The model can learn length-dependent patterns.
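Both strategies fit in a few lines. The sketch below is one possible layout (right padding followed by a trailing length column); `pad_with_length` and `max_len` are illustrative names, not part of any library.

```python
NAN = float("nan")

def pad_with_length(series, max_len=10):
    """Right-pad a short series with NaN to max_len and append its true length."""
    n = len(series)
    if not 1 <= n <= max_len:
        raise ValueError(f"series length {n} outside 1..{max_len}")
    padded = list(series) + [NAN] * (max_len - n)   # NaN marks "not observed"
    return padded + [float(n)]                      # length indicator as last column

rows = [pad_with_length(s, max_len=5) for s in ([0.2, 0.5, 0.1], [1.0, 0.9, 0.7, 0.4, 0.2])]
for row in rows:
    print(row)
```

Every row has `max_len + 1` columns, so the result can be passed to a tree model as a rectangular matrix while the NaN cells and the length column keep the missingness explicit.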

4.2.2 NaN-Safe Derived Features

On top of Raw Vector, the following simple derived features are defined for every T ≥ 2 and never produce NaN when computed on the observed (unpadded) values, in contrast to library auto-extractors (tsfresh, catch22), which often do. We call these NaN-safe derived features:

x_last − x_first (total change, scalar)
max(x) − min(x) (range, scalar)
x_last (final value, scalar) — often a strong predictor on its own
sign(x_last − x_first) (direction, scalar in {−1, 0, +1})
argmax(x) (location of the maximum, integer in 0 to T−1)

Each is a single scalar, so the augmentation adds exactly 5 dimensions:

T = 5: 5 raw + 5 derived = 10 dimensions
T = 10: 10 raw + 5 derived = 15 dimensions

Both fit comfortably even when N is only around 100, and are vastly smaller than tsfresh (78–783D) or MiniRocket (9,996D).
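The five derived features can be computed with plain list operations. A minimal sketch, assuming the input is the observed series before any NaN padding; `derived_features` is a hypothetical helper name, and ties in `argmax` resolve to the earliest index.

```python
def derived_features(x):
    """Five NaN-safe scalars from an unpadded series with at least one value."""
    if len(x) < 1:
        raise ValueError("need at least one observation")
    change = x[-1] - x[0]
    sign = (change > 0) - (change < 0)                 # -1, 0, or +1
    argmax = max(range(len(x)), key=lambda i: x[i])    # first index of the maximum
    return [change, max(x) - min(x), x[-1], float(sign), float(argmax)]

raw = [1.0, 5.0, 1.0, 5.0, 1.0]
features = raw + derived_features(raw)   # 5 raw + 5 derived = 10 dimensions
print(features)
```

None of these operations divides or takes a logarithm, which is why they stay finite on any finite input.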

4.2.3 Multivariate Short Series

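For V variables and T time-steps, flatten the $V \times T$ array into a single vector of length $V \cdot T$. With V = 10, T = 10, the result is 100 dimensions, which is easy for LightGBM and about 100× smaller than MiniRocket's 9,996 features (about 1,000× smaller if a univariate MiniRocket is applied to each of the 10 channels and the outputs are concatenated, giving roughly 100,000D).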
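Flattening is a one-line reshape, but keeping column names makes the tree's splits interpretable afterwards. A small sketch in variable-major order; the naming scheme `v{j}_t{t}` is an arbitrary choice for illustration.

```python
def flatten(series_by_variable):
    """Flatten V lists of length T into one list of length V*T (variable-major order)."""
    T = len(series_by_variable[0])
    if any(len(row) != T for row in series_by_variable):
        raise ValueError("all variables must share the same length T")
    return [value for row in series_by_variable for value in row]

def column_names(V, T):
    """Names matching the order produced by flatten()."""
    return [f"v{j}_t{t}" for j in range(V) for t in range(T)]

sample = [[0.1, 0.2, 0.3],     # variable 0 over T = 3 steps
          [5.0, 4.0, 3.0]]     # variable 1 over T = 3 steps
print(dict(zip(column_names(2, 3), flatten(sample))))
```

With named columns, a split such as `v1_t2 < 3.5` reads directly as "variable 1 at the third time step".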

4.3 Common Pitfalls

Pitfall 1: “More features are always better.” With T ≤ 10, 800 tsfresh features almost always do worse than T raw values. Check NaN ratio and constant-column ratio first.
Pitfall 2: Padding to bypass MiniRocket’s length requirement. PPV on padded regions teaches the model to learn the length, not the signal — generalization fails when test-time lengths differ.
Pitfall 3: Pretrained TS2Vec or other self-supervised embeddings. If the pretraining domain differs from yours, you trade T-dimensional signal for 320-dimensional noise.
Pitfall 4: Applying Principal Component Analysis (PCA) or Uniform Manifold Approximation and Projection (UMAP) on top of Raw Vector. T ≤ 10 is already low-dimensional; further reduction only loses information.
Pitfall 5: Reflexive per-series z-score standardization. Trees don’t need it, and the sample mean and variance of a single short series are unstable (§1.2), so standardizing each series injects noise.

5. Empirical Comparison: Expected Performance Pattern

The following synthesizes general literature reports and the analyses above into a qualitative pattern for T ≤ 10. Exact numbers depend on the dataset; the ordering below is the pattern this report expects, not a measured benchmark.

Method	Output dim (T = 10)	Source	NaN ratio (expected)	Train time	Expected relative accuracy
Raw Vector	10	trivial	0%	lowest (~ms)	baseline (1.00)
Raw + 5 derived	15	estimate	0%	lowest	best (1.02–1.05) (estimate)
AUC + stats	8	estimate	~5% (estimate)	low	0.85–0.92 (estimate)
tsfresh efficient	78	Christ et al. (2018)	20–30% at T=10 / 40–60% at T=5 (estimate)	medium	0.75–0.85 (estimate)
tsfresh comprehensive	783	Christ et al. (2018)	30–50% at T=10 / 60–80% at T=5 (estimate)	high	0.65–0.80 (estimate)
catch22	22	Lubba et al. (2019)	30–40% at T=10 / 50–70% at T=5 (estimate)	low	0.70–0.85 (estimate)
MiniRocket (T = 9 padded)	9,996	Dempster et al. (2021)	0% (but artifacts)	medium–high	0.60–0.80 (estimate)
TS2Vec	320	Yue et al. (2022)	0% (but weak meaning)	highest	0.55–0.75 (estimate)

The relative accuracies above are estimates of typical patterns; exact values vary by domain. When the SNR is high and length is small (e.g., five to ten clinical measurements), Raw’s advantage becomes even sharper.

6. Conclusion

The T ≤ 10 regime lies outside the design assumptions of standard time-series tools. There, statistical summary, automatic feature extraction, and deep embeddings all run into one of two issues: (a) the estimator is undefined and produces large amounts of NaN, or (b) even when defined, estimator variance dominates signal variance, effectively turning noise into dimensions.

The value of feature engineering scales with both length T and sample size N. The smaller T is, the less room there is for transformation, and the more valuable the raw form becomes. In the T ≤ 10 regime, the most sophisticated algorithm is the simplest representation. Raw Vectorization is not just a quick shortcut. Information-theoretically, it is the lossless representation guaranteed by the Data Processing Inequality (DPI); statistically, it is the minimal sufficient statistic under nonparametric assumptions; practically, it is a perfect match for the monotone-transform invariance, native NaN handling, and automatic interaction learning of tree-based models like LightGBM.

Appendix A. Entropy

The entropy of a discrete random variable A measures its uncertainty in bits (Cover & Thomas 2006):

$$H(A) = -\sum_{a} p(a) \log_2 p(a)$$

A: a discrete random variable (e.g., coin flip, die roll, classification label).
a: a particular value A can take (e.g., for a coin, $a \in \{\text{heads}, \text{tails}\}$).
$p(a) = P(A = a)$: probability that A equals a.
The sum is over all possible values of A.

Physical meaning of “bits”: H(A) is the average number of yes/no questions needed to identify A’s value. A fair coin has $H = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1$ bit: exactly one yes/no question identifies the outcome. A uniform 8-sided die has $H = \log_2 8 = 3$ bits.

Conditional entropy $H(A \mid B)$ is the average remaining uncertainty in A when B is known:

$$H(A \mid B) = -\sum_{a,b} p(a, b) \log_2 p(a \mid b)$$

A basic property is $H(A \mid B) \le H(A)$: on average, knowing B can only reduce (never increase) the uncertainty about A. This is what makes mutual information non-negative (Appendix B).
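Both definitions are short enough to evaluate directly. The sketch below reproduces the coin and die values and checks $H(A \mid B) \le H(A)$ on a small made-up joint distribution; the function names are illustrative.

```python
import math

def entropy(probs):
    """H(A) in bits for a list of probabilities summing to 1."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(A|B) in bits from joint[(a, b)] = p(a, b)."""
    p_b = {}
    for (_, b), p in joint.items():
        p_b[b] = p_b.get(b, 0.0) + p
    return -sum(p * math.log2(p / p_b[b]) for (_, b), p in joint.items() if p > 0)

print(entropy([0.5, 0.5]))        # fair coin
print(entropy([1 / 8] * 8))       # uniform 8-sided die

# Hypothetical joint distribution: B is a noisy copy of a fair coin A.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(conditional_entropy(joint), "<=", entropy([0.5, 0.5]))
```

In the noisy-copy example, observing B removes part of the one bit of uncertainty in A but not all of it.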

Appendix B. Mutual Information

B.1 Definition

The mutual information between two random variables A and B measures the information they share, in bits (Cover & Thomas 2006):

$$I(A; B) = \sum_{a,b} p(a, b) \log_2 \frac{p(a, b)}{p(a)\, p(b)} = H(A) - H(A \mid B)$$

The second equality means: total uncertainty about A minus the residual uncertainty when B is known — the amount of information about A gained by knowing B.

B.2 Properties

Property 1: $I(A; B) \ge 0$ (shared information cannot be negative). Follows directly from $H(A \mid B) \le H(A)$ in Appendix A:

$$I(A; B) = H(A) - H(A \mid B) \ge 0$$

Property 2: $I(A; B) = 0$ if and only if A and B are independent. If A and B are independent, $p(a, b) = p(a)p(b)$, so the log term inside the sum is $\log_2 1 = 0$. Conversely, $I(A; B) = 0$ implies $H(A) = H(A \mid B)$, meaning B reveals nothing about A — i.e., independence.
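The two properties can be verified numerically on tiny joint distributions. A minimal sketch; `mutual_information` is an illustrative helper operating on a dictionary `joint[(a, b)] = p(a, b)`.

```python
import math

def mutual_information(joint):
    """I(A;B) in bits from joint[(a, b)] = p(a, b)."""
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)

independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}   # two fair coins
copy = {(0, 0): 0.5, (1, 1): 0.5}                              # B equals A
noisy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}   # B is a noisy copy

for name, joint in [("independent", independent), ("copy", copy), ("noisy", noisy)]:
    print(name, round(mutual_information(joint), 4))
```

Independent coins share zero bits, an exact copy shares the full one bit, and the noisy copy lands in between.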

B.3 How This Report Uses It

Mutual information is the core tool for stating the Data Processing Inequality (Appendix C.2). The DPI says that a representation f(X) cannot carry more information about Y than X does, $I(f(X); Y) \le I(X; Y)$; both sides of the inequality are mutual information.

How the DPI is used in this report:

§3.2 Minimal sufficient statistic: The claim that Raw Vector is the minimal sufficient statistic for nonparametric time series rests on the factorization argument in Appendix D; the DPI supplies the complementary information-theoretic reading that a non-injective transform can lose, and never gain, information about Y.
§3.3 Curse of dimensionality: When proposing the heuristic SNR-dilution model for transforms with $d_{\mathrm{total}} \gg d_{\mathrm{signal}}$, we rely on the DPI implication that transforms cannot add information (so $d_{\mathrm{signal}} \le T$).
Appendix E (transform invertibility): Each vectorization method’s transform f is checked for injectivity; since only an injective f guarantees equality in the DPI for every possible target, all non-Raw methods can incur information loss in principle.
§6 Conclusion: The claim “Raw Vectorization is the lossless representation guaranteed by the DPI” rests on DPI as its information-theoretic basis.

Appendix C. Markov Chain and Data Processing Inequality

C.1 Markov Chain

Three random variables X, Z, Y form a Markov chain $X \to Z \to Y$ when:

$$P(Y \mid X, Z) = P(Y \mid Z)$$

The left side is the probability of Y given that both X and Z occurred; the right side is the probability of Y given only Z. Their equality means: once Z is known, X gives no additional information about Y — Y receives all of X’s information through Z.

Simple example: “Rain (X) → wet clothes (Z) → catching a cold (Y).” If we already know the clothes are wet (Z), the probability of catching a cold (Y) does not change whether it rained (X) or not — the original cause does not matter once wetness is observed. So $P(Y \mid X, Z) = P(Y \mid Z)$ and X → Z → Y is a Markov chain.

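In this report’s setting, X is the original time series, Z = f(X) is its transformed representation (Raw Vector, tsfresh output, etc.), and Y is the target label. Because Z is computed from X alone, the relevant chain is $Y \to X \to Z$: once X is known, Y adds nothing about Z, i.e., $P(Z \mid X, Y) = P(Z \mid X)$. The representation can therefore learn about Y only through X.

C.2 The DPI Itself

For a Markov chain $Y \to X \to Z$ with $Z = f(X)$ (Cover & Thomas 2006):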

$$I(X; Y) \ge I(Z; Y) = I(f(X); Y)$$

Mutual information $I(\cdot; \cdot)$ is defined in Appendix B. Intuitively, no transform of X can increase the information it carries about Y. Processing creates no new information.
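The inequality can be checked on a toy discrete example. In this sketch (all names and the data-generating rule are made up for illustration), X is uniform on {0, 1, 2, 3}, the label is the parity of X, and two representations are compared: the identity and integer halving.

```python
import math

def mutual_information(pairs):
    """I(U;V) in bits from a list of equally likely (u, v) outcomes."""
    n = len(pairs)
    joint, p_u, p_v = {}, {}, {}
    for u, v in pairs:
        joint[(u, v)] = joint.get((u, v), 0.0) + 1.0 / n
        p_u[u] = p_u.get(u, 0.0) + 1.0 / n
        p_v[v] = p_v.get(v, 0.0) + 1.0 / n
    return sum(p * math.log2(p / (p_u[u] * p_v[v])) for (u, v), p in joint.items())

xs = [0, 1, 2, 3]                      # X uniform over four values

def label(x):
    return x % 2                       # hypothetical target: parity of X

i_raw = mutual_information([(x, label(x)) for x in xs])         # f = identity
i_half = mutual_information([(x // 2, label(x)) for x in xs])   # f(x) = x // 2
print(i_raw, i_half, i_half <= i_raw)
```

Halving merges 0 with 1 and 2 with 3, which are exactly the pairs that differ in parity, so the representation keeps none of the one bit that the raw value carries about the label.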

C.3 The Transform f(X)

f is a deterministic or stochastic function applied to X, with $Z = f(X)$. It falls into two classes:

Injective (invertible) f: $f(x_1) = f(x_2)$ implies $x_1 = x_2$ — a one-to-one map. The inverse $f^{-1}$ exists, so X can be fully recovered from Z.
Non-injective f: there exist $x_1 \ne x_2$ with $f(x_1) = f(x_2)$. X cannot be recovered from Z, and information is lost.

C.4 Injective f — Equality Holds

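Equality in the DPI holds whenever f is injective: $I(f(X); Y) = I(X; Y)$. Because the inverse $f^{-1}$ exists on the image of f, Z carries the same information about Y as X does. Injectivity is sufficient but not necessary: in general, equality holds exactly when Z retains everything in X that is relevant to Y, i.e., $I(X; Y \mid Z) = 0$. Raw Vector (f = identity) is trivially injective, so equality holds for every $T \ge 1$ and for every target Y; zero information loss is guaranteed.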

C.5 Non-Injective f: No Guarantee Against Information Loss

Definition of “downstream”: a machine-learning pipeline typically has two stages:

Upstream: raw data X → transform f → representation Z. Raw Vectorization, tsfresh, catch22, MiniRocket, TS2Vec all live here.
Downstream: representation Z → model → prediction Y. LightGBM, XGBoost, Ridge, etc. live here.

So downstream accuracy means the predictive performance (accuracy, AUC, F1) of the model that consumes Z. It is upper-bounded by how much information Z carries about Y.

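Information loss with non-injective transforms: for any non-injective f (statistical summaries, AUC, tsfresh, catch22, MiniRocket’s PPV pooling, TS2Vec’s max-pooling, …) the DPI only guarantees

$$I(f(X); Y) \le I(X; Y)$$

with strict inequality whenever f merges inputs $x_1 \ne x_2$ whose conditional label distributions $P(Y \mid X = x_1)$ and $P(Y \mid X = x_2)$ differ. (If Y happened to depend on X only through its mean, the mean would lose nothing; but without knowing in advance which aspects of X matter, no non-injective f can rule out such a loss.) When the inequality is strict, the upper bound on downstream accuracy is lower. A model may fail to reach this bound even on Raw Vector; that is a model issue, not a representation issue. Non-injectivity, by contrast, puts the risk of information loss at the representation stage itself.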

Appendix D. Fisher-Neyman Factorization Theorem and Sufficient Statistics

D.1 The Fisher-Neyman Factorization Theorem

Fisher-Neyman factorization theorem (Lehmann & Casella 1998):

$$p_\theta(x) = g_\theta(S(x)) \cdot h(x)$$

Reading the equation: the equation asks whether the joint density $p_\theta(x)$ can be factored into two parts.

LHS $p_\theta(x)$: probability (density) of observing data x given parameter θ. Different θ produce different observation probabilities.
RHS first term $g_\theta(S(x))$: the part affected by θ. Crucially, θ enters not through the full data x but only through the statistic S(x) — all the information needed to estimate θ is compressed into S(x).
RHS second term $h(x)$: a part determined by the shape of the data x alone, with no dependence on θ. It contributes nothing to inferring θ.

If such a factorization exists, the statistic S(x) is a sufficient statistic for θ. Intuition: all information in x about θ flows through S(x); the residual h(x) is irrelevant to inference.

Formal definitions of the symbols:

$p_\theta(x)$: joint probability density (or mass) function indexed by θ.
θ: parameter of the distribution family; can be a scalar, vector, or even infinite-dimensional (nonparametric).
x: observed data. In this report, $x = (x_1, \ldots, x_T) \in \mathbb{R}^T$.
S(x): a statistic — any function of x used for inference.
$g_\theta$: depends only on S(x) and θ.
h(x): depends only on x, not on θ.

D.2 What the Theorem Says

S(x) is a sufficient statistic for θ if and only if $p_\theta(x)$ admits a factorization of the form above.

D.3 Determining the Dimension of the Minimal Sufficient Statistic

Sufficient statistics are not unique. The data x itself is always sufficient (via the trivial factorization $p_\theta(x) = p_\theta(x) \cdot 1$); for a normal distribution, $(\sum x_i, \sum x_i^2, x_1)$ is also sufficient by the definition. To talk about an intrinsic information dimension, we need the smallest sufficient statistic — the minimal sufficient statistic.

Procedure:

Write down the joint density $p_\theta(x)$ for the family $\mathcal{P}_\Theta = \{p_\theta : \theta \in \Theta\}$.
Consider all factorizations of the form $g_\theta(S(x)) \cdot h(x)$.
Find the one with the smallest output dimension of S.
That S(x) is the minimal sufficient statistic, and its output dimension is the minimal sufficient dimension.

D.4 Parametric Example — Normal Distribution

For an i.i.d. sample $x_1, \ldots, x_T$ from $N(\mu, \sigma^2)$:

$$p_{\mu,\sigma^2}(x) = (2\pi\sigma^2)^{-T/2} \exp\left(-\frac{1}{2\sigma^2}\sum (x_i - \mu)^2\right)$$

Expanding the exponent:

$$= \underbrace{(2\pi\sigma^2)^{-T/2} \exp\left(-\frac{1}{2\sigma^2}\left(\sum x_i^2 - 2\mu\sum x_i + T\mu^2\right)\right)}_{g_{\mu,\sigma^2}(\sum x_i,\, \sum x_i^2)} \cdot \underbrace{1}_{h(x)}$$

Only $\sum x_i$ and $\sum x_i^2$ appear in the exponent, so the smallest factorization gives $S(x) = (\sum x_i, \sum x_i^2) \in \mathbb{R}^2$. The minimal sufficient dimension is 2, fixed regardless of T. This is the parametric advantage — fixed compression independent of length.
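The factorization has a directly checkable consequence: two samples with the same $(\sum x_i, \sum x_i^2)$ must have identical normal likelihoods for every $(\mu, \sigma^2)$. The sketch below uses two hand-picked samples that are not permutations of each other; the helper names are illustrative.

```python
import math

def normal_loglik(x, mu, sigma2):
    """Log-likelihood of an i.i.d. N(mu, sigma2) sample."""
    T = len(x)
    return (-0.5 * T * math.log(2 * math.pi * sigma2)
            - sum((v - mu) ** 2 for v in x) / (2 * sigma2))

def sufficient_stat(x):
    """(sum of x_i, sum of x_i^2), the two-dimensional statistic from the factorization."""
    return (sum(x), sum(v * v for v in x))

a = [0.0, 3.0, 3.0]
b = [1.0, 1.0, 4.0]     # different values, same sum (6) and sum of squares (18)
print(sufficient_stat(a), sufficient_stat(b))

for mu, sigma2 in [(0.0, 1.0), (0.7, 2.5), (-3.0, 0.4)]:
    print(mu, sigma2, normal_loglik(a, mu, sigma2), normal_loglik(b, mu, sigma2))
```

Under the normal family the two samples are indistinguishable, which is exactly what it means for the two-dimensional statistic to be sufficient; under a nonparametric family no such collapse is available.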

D.5 Nonparametric Case — X Itself Is Minimal Sufficient

What “nonparametric” means: the family $\mathcal{P}_\Theta = \{p_\theta : \theta \in \Theta\}$ is not parameterized by a finite-dimensional θ. In this report, we mean Θ is the set of all joint distributions on $\mathbb{R}^T$ — no functional assumption on the distribution at all. Then:

θ: the distribution itself — an infinite-dimensional object.
Θ: the set of all joint distributions on $\mathbb{R}^T$.
$p_\theta$: the joint density corresponding to θ. No functional constraint (no normality, no exponential family, etc.).

Claim: Under the nonparametric family, $x = (x_1, \ldots, x_T)$ is itself the minimal sufficient statistic.

Proof.

(a) x is sufficient. Apply the trivial factorization:

$$p_\theta(x) = p_\theta(x) \cdot 1 = g_\theta(S(x)) \cdot h(x)$$

with $S(x) = x$, $g_\theta(s) = p_\theta(s)$, $h(x) = 1$. Fisher-Neyman factorization holds, so x is sufficient.

(b) x is minimal. Suppose, for contradiction, that some statistic $S'$ that genuinely compresses x (i.e., is not injective) is sufficient. Then there exist $x_1 \ne x_2$ with $S'(x_1) = S'(x_2) = s^*$. Sufficiency of $S'$ requires that $P(X = x \mid S'(X) = s)$ does not depend on θ. In particular, on the level set $S'(x) = s^*$:

$$\frac{P_\theta(X = x_1)}{P_\theta(X = x_2)} = \frac{p_\theta(x_1)}{p_\theta(x_2)} = c \quad (\text{constant in } \theta)$$

But under a fully nonparametric family, for any $x_1 \ne x_2$ there is some θ making $p_\theta(x_1)/p_\theta(x_2)$ equal to any positive number we like, so the ratio cannot be constant in θ. Contradiction. Therefore no non-injective $S'$ can be sufficient; every sufficient statistic is injective, x can be recovered from it, and x is minimal sufficient. ∎

D.6 Order Statistics Are Not Sufficient When Time Order Matters

The above argument shows no nontrivial compression of x is sufficient. In particular, the order statistics $x_{(1)} \le \ldots \le x_{(T)}$ — which discard time order — are also a compression and therefore not sufficient when time order matters. Two time series with the same multiset $\{x_1, \ldots, x_T\}$ but different orderings correspond to different distributions $p_\theta$ in the nonparametric family. Hence the minimal sufficient dimension in the nonparametric, time-order-significant setting is exactly T — which is the dimension of Raw Vector.

Appendix E. Invertibility of Each Transform

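This table summarizes whether each vectorization method’s transform f is injective. By the DPI (Appendix C), only an injective transform is guaranteed to preserve all information about Y; a non-injective transform loses information whenever the detail it discards is relevant to Y.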

Method	Nature of the transform	Injective?	Reason for non-injectivity
Raw Vector	$f(x) = x$ (identity)	Injective	trivial
Stats (mean, var, …)	$(\sum x_i / T, \sum (x_i - \bar{x})^2 / (T-1), \min, \max, \ldots)$	non-injective	[1,2,3,4,5] and [5,4,3,2,1] yield identical statistics
AUC	$\sum_i (x_i + x_{i+1}) / 2$	non-injective	infinitely many equal-area equivalence classes
tsfresh	~800 statistical / spectral / entropy functions	non-injective	many features invariant to time shift, partial reversal, etc.
catch22	22 dynamics features	non-injective	built on distribution moments and transition matrices — partially loses time order
MiniRocket	84 length-9 kernels over multiple dilations and biases (9,996 features) + PPV pooling	non-injective	PPV preserves only the proportion of positive convolution outputs; many signals share the same PPV
TS2Vec	neural encoder + max-pooling	non-injective	max-pooling itself is non-injective; also stochastic via dropout

Every method except Raw Vector is non-injective for every T ≥ 2. Raw Vector is the only representation guaranteed to achieve equality in the DPI regardless of how Y depends on X, and therefore the only one guaranteeing zero information loss for every $T \ge 1$.
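The collisions cited for the summary statistics and for AUC are easy to reproduce. A minimal sketch using the standard library; `summary_stats` and `auc` are illustrative helpers, and `auc` is the unit-spacing trapezoid sum from the table.

```python
import statistics

def summary_stats(x):
    """(mean, sample variance, min, max, median): all ignore time order."""
    return (statistics.mean(x), statistics.variance(x), min(x), max(x), statistics.median(x))

def auc(x):
    """Trapezoidal area under the series with unit spacing between time steps."""
    return sum((x[i] + x[i + 1]) / 2 for i in range(len(x) - 1))

up, down = [1, 2, 3, 4, 5], [5, 4, 3, 2, 1]
print(summary_stats(up) == summary_stats(down))   # rising and falling series collide

zigzag, flat = [1, 5, 1, 5, 1], [3, 3, 3, 3, 3]
print(auc(zigzag), auc(flat))                     # oscillating and constant series collide

print(up == down, zigzag == flat)                 # the raw vectors stay distinct
```

In each pair the summary is identical while the raw vectors differ, so any model fed only the summary cannot tell the two series apart.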

References

Bellman, R. (1961). Adaptive Control Processes. Princeton University Press.
Christ, M., Braun, N., Neuffer, J., & Kempa-Liehr, A. W. (2018). Time series feature extraction on basis of scalable hypothesis tests (tsfresh – A Python package). Neurocomputing, 307, 72–77.
Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.
Dempster, A., Schmidt, D. F., & Webb, G. I. (2021). MiniRocket: A very fast (almost) deterministic transform for time series classification. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD ’21) (pp. 248–257).
Lehmann, E. L., & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer.
Lubba, C. H., Sethi, S. S., Knaute, P., Schultz, S. R., Fulcher, B. D., & Jones, N. S. (2019). catch22: CAnonical Time-series CHaracteristics. Data Mining and Knowledge Discovery, 33, 1821–1852.
Yue, Z., Wang, Y., Duan, J., Yang, T., Huang, C., Tong, Y., & Xu, B. (2022). TS2Vec: Towards universal representation of time series. In Proceedings of the AAAI Conference on Artificial Intelligence, 36(8), 8980–8987.
aeon-toolkit. MiniRocket documentation. https://www.aeon-toolkit.org/
