{"id":6495,"date":"2026-05-02T03:05:01","date_gmt":"2026-05-02T08:05:01","guid":{"rendered":"https:\/\/ykim.synology.me\/wordpress\/?p=6495"},"modified":"2026-05-02T19:12:00","modified_gmt":"2026-05-03T00:12:00","slug":"why-raw-vectorization-is-the-right-choice-for-ultra-short-time-series-t-%e2%89%a4-10","status":"publish","type":"post","link":"https:\/\/ykim.synology.me\/wordpress\/why-raw-vectorization-is-the-right-choice-for-ultra-short-time-series-t-%e2%89%a4-10-6495\/","title":{"rendered":"Why Raw Vectorization Is the Right Choice for Ultra-Short Time Series (T \u2264 10)"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"600\" src=\"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/05\/20260502-Round-Rock-Tx-Spherical-Water-Tower-800x600px.jpg\" alt=\"\" class=\"wp-image-6543\" style=\"width:600px\" srcset=\"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/05\/20260502-Round-Rock-Tx-Spherical-Water-Tower-800x600px.jpg 800w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/05\/20260502-Round-Rock-Tx-Spherical-Water-Tower-800x600px-300x225.jpg 300w, https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/05\/20260502-Round-Rock-Tx-Spherical-Water-Tower-800x600px-768x576.jpg 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/figure>\n\n\n<style>.kadence-column6495_1aab8c-97 > .kt-inside-inner-col,.kadence-column6495_1aab8c-97 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_1aab8c-97 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_1aab8c-97 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_1aab8c-97 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_1aab8c-97 > 
.kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_1aab8c-97{position:relative;}.kadence-column6495_1aab8c-97, .kt-inside-inner-col > .kadence-column6495_1aab8c-97:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_1aab8c-97 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_1aab8c-97 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6495_1aab8c-97\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">This report analyzes why standard vectorization methods \u2014 statistical summary (mean\/var\/AUC), automatic feature extraction (tsfresh, catch22), convolutional representations (MiniRocket), and self-supervised embeddings (TS2Vec) \u2014 fail when the time series length T is very short, especially T \u2264 10. As an alternative, it argues that Raw Vectorization (Identity Mapping) is not just an easy fallback but the information-theoretically and statistically justified first choice. In the T \u2264 10 regime, all summary, automatic feature, and embedding methods enter the regime of &#8220;estimator variance \u226b signal.&#8221; No transform adds information; they only amplify noise. Raw Vector is the minimal sufficient statistic for nonparametric time series.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">1. 
Problem Definition: Why Short Time Series Are Different<\/h2>\n\n\n<style>.kadence-column6495_5372dc-5a > .kt-inside-inner-col,.kadence-column6495_5372dc-5a > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_5372dc-5a > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_5372dc-5a > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_5372dc-5a > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_5372dc-5a > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_5372dc-5a{position:relative;}.kadence-column6495_5372dc-5a, .kt-inside-inner-col > .kadence-column6495_5372dc-5a:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_5372dc-5a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_5372dc-5a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6495_5372dc-5a\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">1.1 The Spectrum of Length T<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Time series length T is one of the most important variables in choosing a model, but the meaning of &#8220;short&#8221; varies widely across domains.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Length range<\/th><th>Common name<\/th><th>Recommended representation<\/th><\/tr><\/thead><tbody><tr><td><strong>T \u2264 10<\/strong><\/td><td>Ultra-short \/ Tiny series<\/td><td><strong>Raw Vectorization (this report&#8217;s recommendation)<\/strong><\/td><\/tr><tr><td>10 &lt; T \u2264 30<\/td><td>Short series<\/td><td>Raw + simple statistics (mean, slope)<\/td><\/tr><tr><td>30 &lt; T \u2264 200<\/td><td>Medium series<\/td><td>catch22, MiniRocket, 
tsfresh<\/td><\/tr><tr><td>T &gt; 200<\/td><td>Long series<\/td><td>TS2Vec, Transformer, InceptionTime<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">1.2 Degrees-of-Freedom View<\/h3>\n\n\n\n<h3 class=\"wp-block-heading\">1.2.1 Standard Error of the Mean<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Under the i.i.d. assumption, the Standard Error (SE) of the sample mean $\\bar{x} = (1\/T)\\sum x_i$ is:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$\\mathrm{SE}(\\bar{x}) = \\frac{\\sigma}{\\sqrt{T}}$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">where \u03c3 is the population standard deviation. Requiring relative error \u2264 10% (i.e., SE\/\u03c3 \u2264 0.1) gives:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$\\frac{\\sigma}{\\sqrt{T}} \\le 0.1\\sigma \\;\\;\\Longrightarrow\\;\\; T \\ge 100$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">At T = 5, SE\/\u03c3 = 1\/\u221a5 \u2248 <strong>0.447<\/strong> (44.7% error); at T = 10, 1\/\u221a10 \u2248 <strong>0.316<\/strong> (31.6% error). Even the mean carries estimation noise amounting to roughly one-third to one-half of \u03c3.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1.2.2 Standard Error of the Variance<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Under normality, the SE of the sample variance $s^2$ is:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$\\mathrm{SE}(s^2) = \\sigma^2 \\sqrt{\\frac{2}{T-1}}$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The 10% relative-error criterion gives T \u2265 201. At T = 5, the relative error is 70.7%; at T = 10, 47.1%. Variance estimation at T \u2264 10 is essentially meaningless.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1.2.3 95% Confidence Interval for the Mean<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The half-width of the 95% Confidence Interval (CI) is $1.96 \\cdot \\sigma\/\\sqrt{T}$. Requiring half-width \u2264 0.1\u03c3 gives T \u2265 (1.96\/0.1)\u00b2 \u2248 384.2, i.e., T \u2265 385. 
At T = 5 the half-width is \u00b10.877\u03c3 (almost a full standard deviation); at T = 10, \u00b10.620\u03c3.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1.2.4 Summary<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Estimator<\/th><th>T for 10% relative error<\/th><th>Error at T = 5<\/th><th>Error at T = 10<\/th><\/tr><\/thead><tbody><tr><td>SE of mean<\/td><td>T \u2265 100<\/td><td>44.7%<\/td><td>31.6%<\/td><\/tr><tr><td>SE of variance<\/td><td>T \u2265 201<\/td><td>70.7%<\/td><td>47.1%<\/td><\/tr><tr><td>95% CI half-width of mean<\/td><td>T \u2265 385<\/td><td>87.7%<\/td><td>62.0%<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">At T \u2264 10, all basic statistics are outside their &#8220;trustworthy&#8221; region. Higher moments and the AutoCorrelation Function (ACF) are even worse: ACF at lag k requires T\u2212k pairs, so ACF(3) at T = 5 rests on just two pairs and is essentially random. tsfresh&#8217;s &#8220;autocorrelation lag k&#8221; features become NaN at T = 5 once the lag reaches the series length, because no pairs remain and the estimate is mathematically undefined. The T \u2264 10 regime is not just &#8220;few data&#8221; but &#8220;estimator variance dominates the signal,&#8221; and extracting features in this regime amounts to <em>turning noise into dimensions<\/em>.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">2. 
Limitations of Existing Vectorization Methods<\/h2>\n\n\n<style>.kadence-column6495_6e778d-6a > .kt-inside-inner-col,.kadence-column6495_6e778d-6a > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_6e778d-6a > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_6e778d-6a > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_6e778d-6a > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_6e778d-6a > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_6e778d-6a{position:relative;}.kadence-column6495_6e778d-6a, .kt-inside-inner-col > .kadence-column6495_6e778d-6a:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_6e778d-6a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_6e778d-6a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6495_6e778d-6a\"><div class=\"kt-inside-inner-col\">\n<figure class=\"wp-block-table\"><table><thead><tr><th>Method<\/th><th>Output dim<\/th><th>Source<\/th><th>Min length<\/th><th>At T = 5<\/th><th>At T = 10<\/th><\/tr><\/thead><tbody><tr><td><strong>Statistical Summary<\/strong> (mean, std, min, max, median)<\/td><td>5\u201310<\/td><td>estimate (user-defined)<\/td><td>T \u2265 2<\/td><td>Estimator variance dominates signal; time order completely lost<\/td><td>Variance stabilizes but time order still lost; weak on time-dependent signals<\/td><\/tr><tr><td><strong>Area Under the Curve (AUC)<\/strong><\/td><td>1<\/td><td>trivial (definition)<\/td><td>T \u2265 2<\/td><td>5 values \u2192 1 scalar; [1,5,1,5,1] and [3,3,3,3,3] yield the same AUC<\/td><td>10 values \u2192 1 scalar; infinitely many equal-area equivalence 
classes<\/td><\/tr><tr><td><strong>tsfresh comprehensive<\/strong><\/td><td>783<\/td><td>Christ et al. (2018) docs<\/td><td>\u2265 30 in practice<\/td><td>NaN ratio 60\u201380% (estimate); ACF at lags k \u2265 5, high-frequency Fast Fourier Transform (FFT) coefficients, and entropy features undefined or degenerate<\/td><td>NaN ratio 30\u201350% (estimate); ACF lag 3\u20135 defined but with very high estimator variance<\/td><\/tr><tr><td><strong>tsfresh efficient<\/strong><\/td><td>78<\/td><td>Christ et al. (2018) docs<\/td><td>\u2265 20 in practice<\/td><td>NaN ratio 40\u201360% (estimate)<\/td><td>NaN ratio 20\u201330% (estimate); usable but trees may split on NaN columns<\/td><\/tr><tr><td><strong>catch22<\/strong><\/td><td>22<\/td><td>Lubba et al. (2019)<\/td><td>\u2265 50 in practice<\/td><td>DN_HistogramMode, SB_TransitionMatrix, SP_Summaries undefined; over half NaN<\/td><td>Some ACF-based features defined; histogram mode and transition matrix still meaningless<\/td><\/tr><tr><td><strong>MiniRocket<\/strong><\/td><td>9,996<\/td><td>Dempster et al. (2021); aeon-toolkit<\/td><td>9 (enforced)<\/td><td><strong>Library refuses input<\/strong> (padding required)<\/td><td>Library accepts, but the length-9 kernels cover almost the entire signal; Proportion of Positive Values (PPV) saturates at 0 or 1<\/td><\/tr><tr><td><strong>TS2Vec<\/strong><\/td><td>320 (default)<\/td><td>Yue et al. 
(2022)<\/td><td>\u2265 50 in practice<\/td><td>Sub-series cropping yields length 2\u20133; max-pooling collapses hierarchy after 1\u20132 layers; almost no learning signal<\/td><td>Sub-series of length 5 possible but contrastive signal weak; representational power limited even when pretraining domain matches<\/td><\/tr><tr><td><strong>Raw Vector<\/strong><\/td><td><strong>T (= 5 or 10)<\/strong><\/td><td>trivial (definition)<\/td><td><strong>T \u2265 1<\/strong><\/td><td><strong>All information preserved; zero NaN; directly usable by tree models<\/strong><\/td><td><strong>All information preserved; no dimensional burden; optimal representation<\/strong><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>At T = 5, MiniRocket is rejected by the library; tsfresh and catch22 are estimated to return NaN for roughly half or more of their features.<\/li>\n\n\n\n<li>At T = 10, MiniRocket&#8217;s length-9 kernels have almost no room to slide, so PPV pooling becomes nearly constant; the library accepts the input but loses almost all information.<\/li>\n\n\n\n<li>tsfresh and catch22 produce fewer NaNs at T = 10, but the surviving features still have huge estimator variance, effectively adding noise dimensions.<\/li>\n\n\n\n<li>Only Raw Vector preserves all information at both T = 5 and T = 10 with zero NaN.<\/li>\n<\/ul>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Why Raw Vectorization Is Justified<\/h2>\n\n\n<style>.kadence-column6495_b0d43e-83 > .kt-inside-inner-col,.kadence-column6495_b0d43e-83 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_b0d43e-83 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_b0d43e-83 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_b0d43e-83 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_b0d43e-83 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_b0d43e-83{position:relative;}.kadence-column6495_b0d43e-83, .kt-inside-inner-col > .kadence-column6495_b0d43e-83:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_b0d43e-83 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_b0d43e-83 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6495_b0d43e-83\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">3.1 Terminology<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Using the raw time series directly as the input vector goes by several names in the literature:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Raw Feature Vectorization:<\/strong> The most general term; placing each time-step value into a dimension of the vector.<\/li>\n\n\n\n<li><strong>Identity Mapping:<\/strong> The functional view, $f(x) = x$. Used as a baseline in representation-learning literature.<\/li>\n\n\n\n<li><strong>Flattening:<\/strong> Reshaping a (time, feature) 2D structure into 1D. 
Standard for feeding multivariate short series to tree models.<\/li>\n\n\n\n<li><strong>Direct Mapping:<\/strong> An intentional choice to skip embeddings or dimension reduction in regimes where overfitting risk is high.<\/li>\n\n\n\n<li><strong>Pointwise Encoding:<\/strong> Neural-network terminology where each timestamp is represented by its own raw value.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3.2 Minimal Sufficient Statistic<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The quantity of interest here is the dimension of the <strong>minimal sufficient statistic<\/strong>, the coarsest among all sufficient statistics (Lehmann &amp; Casella 1998). Its dimension is the intrinsic information dimension of the distribution family; compressing below it loses sufficiency. Time series classification and regression are typically nonparametric (no assumed distribution family), and time order matters, so they fall in the <strong>nonparametric + time-order-significant<\/strong> regime. There, the minimal sufficient statistic is the data itself, $S(x) = x$, with dimension T. Any compression below T loses sufficiency.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For T \u2264 10, the minimal sufficient dimension is T itself (5 or 10 in the examples used here), exactly the dimension of Raw Vector. 
tsfresh (783D), catch22 (22D), and MiniRocket (9,996D) all use non-injective transforms and therefore fail to preserve sufficiency, regardless of how large their output dimension is (Appendix E).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The following table summarizes minimal-sufficient dimensions derived via the Fisher-Neyman factorization theorem (Appendix D).<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Setting<\/th><th>Distribution-family assumption<\/th><th>Dimension of minimal sufficient statistic<\/th><\/tr><\/thead><tbody><tr><td>Parametric<\/td><td>Normal $N(\\mu, \\sigma^2)$ i.i.d.<\/td><td><strong>2<\/strong> (\u03a3x_i, \u03a3x_i\u00b2) \u2014 fixed, independent of T<\/td><\/tr><tr><td>Parametric<\/td><td>k-parameter exponential family<\/td><td><strong>k<\/strong> \u2014 fixed, independent of T<\/td><\/tr><tr><td><strong>Nonparametric<\/strong><\/td><td>Family unknown, time order matters<\/td><td><strong>T<\/strong> ($S(x) = x$ itself)<\/td><\/tr><tr><td>Nonparametric<\/td><td>Family unknown, order irrelevant<\/td><td><strong>T<\/strong> (order statistics)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The same conclusion follows from a sample-efficiency view. With N samples each of length T, the data comprise NT scalars in total. If a model has P parameters to estimate, a common rule of thumb is NT \u2265 10P. When T \u2264 10, the number of parameters P that can be reliably estimated is itself small, so only simple models work and the input representation must be as low-dimensional as possible. Raw Vector, with dimension exactly T, is the most compatible choice.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.3 Curse of Dimensionality<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In high-dimensional space, Euclidean distances concentrate (concentration of distances), eroding the discriminative power of distance-based algorithms (Bellman 1961). 
In the hundreds-to-thousands-dimensional vectors produced by tsfresh and MiniRocket, at most T dimensions&#8217; worth of signal is present; the rest behaves as noise.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.3.1 Definition of Signal-to-Noise Ratio (SNR)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The Signal-to-Noise Ratio (SNR) is defined as the signal variance divided by the noise variance:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$\\mathrm{SNR}_{\\mathrm{raw}} = \\frac{\\sigma^2_{\\mathrm{signal}}}{\\sigma^2_{\\mathrm{noise}}}$$\r\n<\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>$\\sigma^2_{\\mathrm{signal}}$<\/strong>: variance of the meaningful variation in the original time series that is actually related to the target Y.<\/li>\n\n\n\n<li><strong>$\\sigma^2_{\\mathrm{noise}}$<\/strong>: variance of random fluctuations unrelated to Y \u2014 measurement error, sensor noise, environmental noise. Modeled as $x_{\\mathrm{observed}} = x_{\\mathrm{signal}} + \\varepsilon$, where $\\varepsilon \\sim (0, \\sigma^2_{\\mathrm{noise}})$ is zero-mean noise.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3.3.2 SNR Dilution Under Dimensional Expansion (Heuristic Model)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When a T-dimensional signal is expanded to a $d_{\\mathrm{total}}$-dimensional representation, signal still resides in at most T dimensions; the remaining $(d_{\\mathrm{total}} - T)$ dimensions are essentially noise-like residuals. A simplified SNR-dilution model is:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$\\mathrm{SNR}_{\\mathrm{after}} \\approx \\mathrm{SNR}_{\\mathrm{raw}} \\times \\frac{d_{\\mathrm{signal}}}{d_{\\mathrm{total}}}$$\r\n<\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>$\\mathrm{SNR}_{\\mathrm{after}}$<\/strong>: SNR of the output vector after the transform.<\/li>\n\n\n\n<li><strong>$d_{\\mathrm{signal}}$<\/strong>: number of output dimensions actually carrying signal. 
Since transforms cannot add information, $d_{\\mathrm{signal}} \\le T$.<\/li>\n\n\n\n<li><strong>$d_{\\mathrm{total}}$<\/strong>: total output dimension after the transform.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Source caveat:<\/strong> The above is a heuristic model (estimate) used in this report, not a textbook theorem. Exact SNR change depends on noise distribution, signal representation, and nonlinearity of the transform; no result of this exact form appears in Cover &amp; Thomas (2006). The qualitative conclusion \u2014 that SNR dilutes proportionally when $d_{\\mathrm{total}} \\gg d_{\\mathrm{signal}}$ \u2014 is, however, a general intuition tied to the curse of dimensionality.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.3.3 Dilution Comparison at T = 10<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Method<\/th><th>$d_{\\mathrm{total}}$<\/th><th>$d_{\\mathrm{signal}}$<\/th><th>$d_{\\mathrm{signal}}\/d_{\\mathrm{total}}$<\/th><th>Interpretation<\/th><\/tr><\/thead><tbody><tr><td><strong>Raw Vector<\/strong><\/td><td>10<\/td><td>10<\/td><td>1.000<\/td><td><strong>1\u00d7 no dilution, no compression (optimal)<\/strong><\/td><\/tr><tr><td>Stats (10D)<\/td><td>10<\/td><td>10<\/td><td>1.000<\/td><td>1\u00d7 no dilution, no compression (but time order lost)<\/td><\/tr><tr><td>AUC<\/td><td>1<\/td><td>10<\/td><td>10.000<\/td><td><strong>10\u00d7 information compression<\/strong> (10D \u2192 scalar)<\/td><\/tr><tr><td>catch22<\/td><td>22<\/td><td>10<\/td><td>0.455<\/td><td>~2\u00d7 dilution (12 noise dims added)<\/td><\/tr><tr><td>tsfresh efficient<\/td><td>78<\/td><td>10<\/td><td>0.128<\/td><td>~8\u00d7 dilution (68 noise dims added)<\/td><\/tr><tr><td>tsfresh comprehensive<\/td><td>783<\/td><td>10<\/td><td>0.013<\/td><td>~78\u00d7 dilution (773 noise dims added)<\/td><\/tr><tr><td>TS2Vec<\/td><td>320<\/td><td>10<\/td><td>0.031<\/td><td>~32\u00d7 dilution (310 noise dims 
added)<\/td><\/tr><tr><td>MiniRocket<\/td><td>9,996<\/td><td>10<\/td><td>0.001<\/td><td>~1,000\u00d7 dilution (9,986 noise dims added)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Tree models can find spurious information gain on noise dimensions when picking splits, leading to validation degradation. The smaller T is, the smaller $d_{\\mathrm{signal}}\/d_{\\mathrm{total}}$ becomes \u2014 so SNR damage from large representations is most severe on short series.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.4 Affinity with Tree Models<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Raw Vectorization fits gradient-boosted tree models \u2014 Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), CatBoost \u2014 exceptionally well:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Monotone-transform invariance:<\/strong> Trees are invariant to monotone input transforms, so no normalization or standardization is needed. The scale-mixing problem of tsfresh-style mixed features disappears.<\/li>\n\n\n\n<li><strong>Native missing-value handling:<\/strong> LightGBM and XGBoost route NaN to a learned branch. Variable-length padding like [x\u2081, x\u2082, x\u2083, NaN, NaN] is handled automatically.<\/li>\n\n\n\n<li><strong>Automatic interactions:<\/strong> Trees pick up time-pair relationships (differences, ratios) through split combinations \u2014 capturing slope and curvature without manual engineering.<\/li>\n\n\n\n<li><strong>Nonlinear thresholds:<\/strong> Unlike AUC&#8217;s single-scalar compression, trees separate signal via threshold combinations like &#8216;x\u2083 > 0.5 AND x\u2085 &lt; 0.3&#8217;. When a specific timestamp matters (e.g., the third clinical day), the tree picks it automatically.<\/li>\n<\/ul>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Practical Guidelines<\/h2>\n\n\n<style>.kadence-column6495_ed61e7-a8 > .kt-inside-inner-col,.kadence-column6495_ed61e7-a8 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_ed61e7-a8 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_ed61e7-a8 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_ed61e7-a8 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_ed61e7-a8 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_ed61e7-a8{position:relative;}.kadence-column6495_ed61e7-a8, .kt-inside-inner-col > .kadence-column6495_ed61e7-a8:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_ed61e7-a8 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_ed61e7-a8 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6495_ed61e7-a8\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">4.1 Decision Tree<\/h3>\n\n\n\n<style>\n.edgeLabel p {\n    background-color: white;\n}\n<\/style>\n<pre class=\"mermaid\">\nflowchart TD\n    A([Time-series length T]) --> B{Which range?}\n    B -->|T \u2264 10| C[Raw Vectorization<br\/>required]\n    B -->|10 &lt; T \u2264 30| D[Raw Vector and<br\/>5\u201310 stat features]\n    B -->|30 &lt; T \u2264 200| E[catch22 or<br\/>tsfresh efficient]\n    B -->|T &gt; 200| F[MiniRocket<br\/>or Hydra]\n\n    linkStyle 0 color:black,font-weight:bold\n    linkStyle 1 color:green, background-color:#fff, font-weight:bold\n    linkStyle 2 color:black, background-color:white, font-weight:bold\n    linkStyle 3 color:black, background-color:white, font-weight:bold\n    linkStyle 4 color:black, background-color:white, font-weight:bold\n\n    C 
--> C1[optional add-on:<br\/> simple<br\/>derived features]\n    D ---> D1[fallback:<br\/> catch22<br\/>NaN check required]\n    E ---> E1[fallback:<br\/> MiniRocket<br\/>T \u2265 9 needed]\n    F ---> F1[fallback:<br\/>TS2Vec for<br\/>large data]\n\n    classDef root fill:#1F4E79,stroke:#1F4E79,color:#fff\n    classDef decision fill:#FFF2CC,stroke:#BF9000,color:#333\n    classDef recommended fill:#C6E0B4,stroke:#548235,color:#1F3864\n    classDef alternative fill:#DEEBF7,stroke:#2E74B5,color:#1F3864\n    classDef addon fill:#E2F0D9,stroke:#548235,color:#333,font-style:italic\n    classDef fallback fill:#FCE4D6,stroke:#C65911,color:#333,font-style:italic\n \n    class A root\n    class B decision\n    class C recommended\n    class D,E,F alternative\n    class C1 addon\n    class D1,E1,F1 fallback\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">4.2 Best Practices<\/h3>\n\n\n\n<h3 class=\"wp-block-heading\">4.2.1 Variable-Length Handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In practice, samples rarely all share the same T. When T varies between 1 and 10, two safe strategies:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Right padding with sentinel:<\/strong> Pad with NaN to fixed length. LightGBM and XGBoost route NaN to a learned branch, so missingness itself becomes informative.<\/li>\n\n\n\n<li><strong>Length indicator:<\/strong> Append the actual length as one extra dimension, e.g., [x\u2081, x\u2082, x\u2083, NaN, &#8230;, NaN, len = 3]. The model can learn length-dependent patterns.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4.2.2 NaN-Safe Derived Features<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">On top of Raw Vector, the following simple derived features are always defined for T \u2265 2 and never produce NaN \u2014 in contrast to library auto-extractors (tsfresh, catch22), which often do. 
We call these <strong>NaN-safe derived features<\/strong>:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>x_last \u2212 x_first<\/strong> (total change, scalar)<\/li>\n\n\n\n<li><strong>max(x) \u2212 min(x)<\/strong> (range, scalar)<\/li>\n\n\n\n<li><strong>x_last<\/strong> (final observed value, scalar) \u2014 often a strong predictor on its own; it duplicates the last raw column unless series are NaN-padded to a common length<\/li>\n\n\n\n<li><strong>sign(x_last \u2212 x_first)<\/strong> (direction, scalar in {\u22121, 0, +1})<\/li>\n\n\n\n<li><strong>argmax(x)<\/strong> (location of the maximum, integer in 0 to T\u22121)<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Each is a single scalar, so the augmentation adds exactly 5 dimensions:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>T = 5<\/strong>: 5 raw + 5 derived = <strong>10 dimensions<\/strong><\/li>\n\n\n\n<li><strong>T = 10<\/strong>: 10 raw + 5 derived = <strong>15 dimensions<\/strong><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Both remain manageable even when N is only around 100, and are vastly smaller than tsfresh (78\u2013783D) or MiniRocket (9,996D).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.2.3 Multivariate Short Series<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For V variables and T time-steps, flatten to a vector of length $V \\times T$. With V = 10, T = 10, the result is 100 dimensions \u2014 easy for LightGBM, and ~1,000\u00d7 smaller than the ~100,000D obtained by running MiniRocket separately on each channel (10 channels \u00d7 9,996 features).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.3 Common Pitfalls<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Pitfall 1:<\/strong> &#8220;More features are always better.&#8221; With T \u2264 10, the roughly 800 tsfresh features (783 in the comprehensive set) typically do worse than T raw values. Check NaN ratio and constant-column ratio first.<\/li>\n\n\n\n<li><strong>Pitfall 2:<\/strong> Padding to bypass MiniRocket&#8217;s length requirement. 
PPV on padded regions teaches the model to learn the <em>length<\/em>, not the signal \u2014 generalization fails when test-time lengths differ.<\/li>\n\n\n\n<li><strong>Pitfall 3:<\/strong> Pretrained TS2Vec or other self-supervised embeddings. If the pretraining domain differs from yours, you trade T-dimensional signal for 320-dimensional noise.<\/li>\n\n\n\n<li><strong>Pitfall 4:<\/strong> Applying Principal Component Analysis (PCA) or Uniform Manifold Approximation and Projection (UMAP) on top of Raw Vector. T \u2264 10 is already low-dimensional; further reduction only loses information.<\/li>\n\n\n\n<li><strong>Pitfall 5:<\/strong> Reflexive z-score standardization. Trees don&#8217;t need it, and the sample-variance estimate is unstable on short series \u2014 standardization injects noise.<\/li>\n<\/ul>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">5. Empirical Comparison: Expected Performance Pattern<\/h2>\n\n\n<style>.kadence-column6495_ebaba0-6b > .kt-inside-inner-col,.kadence-column6495_ebaba0-6b > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_ebaba0-6b > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_ebaba0-6b > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_ebaba0-6b > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_ebaba0-6b > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_ebaba0-6b{position:relative;}.kadence-column6495_ebaba0-6b, .kt-inside-inner-col > .kadence-column6495_ebaba0-6b:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_ebaba0-6b > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_ebaba0-6b > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div 
class=\"wp-block-kadence-column kadence-column6495_ebaba0-6b\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">The following synthesizes general literature reports and the analyses above into a qualitative pattern for T \u2264 10. Exact numbers depend on the dataset, but the qualitative ordering is consistently observed.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Method<\/th><th>Output dim (T = 10)<\/th><th>Source<\/th><th>NaN ratio (expected)<\/th><th>Train time<\/th><th>Expected relative accuracy<\/th><\/tr><\/thead><tbody><tr><td><strong>Raw Vector<\/strong><\/td><td>10<\/td><td>trivial<\/td><td>0%<\/td><td>lowest (~ms)<\/td><td><strong>baseline (1.00)<\/strong><\/td><\/tr><tr><td><strong>Raw + 5 derived<\/strong><\/td><td>15<\/td><td>estimate<\/td><td>0%<\/td><td>lowest<\/td><td><strong>best (1.02\u20131.05) (estimate)<\/strong><\/td><\/tr><tr><td>AUC + stats<\/td><td>8<\/td><td>estimate<\/td><td>~5% (estimate)<\/td><td>low<\/td><td>0.85\u20130.92 (estimate)<\/td><\/tr><tr><td>tsfresh efficient<\/td><td>78<\/td><td>Christ et al. (2018)<\/td><td>20\u201330% at T=10 \/ 40\u201360% at T=5 (estimate)<\/td><td>medium<\/td><td>0.75\u20130.85 (estimate)<\/td><\/tr><tr><td>tsfresh comprehensive<\/td><td>783<\/td><td>Christ et al. (2018)<\/td><td>30\u201350% at T=10 \/ 60\u201380% at T=5 (estimate)<\/td><td>high<\/td><td>0.65\u20130.80 (estimate)<\/td><\/tr><tr><td>catch22<\/td><td>22<\/td><td>Lubba et al. (2019)<\/td><td>30\u201340% at T=10 \/ 50\u201370% at T=5 (estimate)<\/td><td>low<\/td><td>0.70\u20130.85 (estimate)<\/td><\/tr><tr><td>MiniRocket (T = 9 padded)<\/td><td>9,996<\/td><td>Dempster et al. (2021)<\/td><td>0% (but artifacts)<\/td><td>medium\u2013high<\/td><td>0.60\u20130.80 (estimate)<\/td><\/tr><tr><td>TS2Vec<\/td><td>320<\/td><td>Yue et al. 
(2022)<\/td><td>0% (but weak meaning)<\/td><td>highest<\/td><td>0.55\u20130.75 (estimate)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The relative accuracies above are estimates of typical patterns; exact values vary by domain. When the SNR is high and length is small (e.g., five to ten clinical measurements), Raw&#8217;s advantage becomes even sharper.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">6. Conclusion<\/h2>\n\n\n<style>.kadence-column6495_544e55-ec > .kt-inside-inner-col,.kadence-column6495_544e55-ec > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_544e55-ec > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_544e55-ec > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_544e55-ec > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_544e55-ec > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_544e55-ec{position:relative;}.kadence-column6495_544e55-ec, .kt-inside-inner-col > .kadence-column6495_544e55-ec:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_544e55-ec > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_544e55-ec > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6495_544e55-ec\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">The T \u2264 10 regime lies outside the design assumptions of standard time-series tools. 
There, statistical summaries, automatic feature extraction, and deep embeddings all run into at least one of two problems: (a) the estimator is undefined and produces many NaN values, or (b) even when it is defined, estimator variance dominates signal variance, so the added dimensions carry mostly noise.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The value of feature engineering scales with both length T and sample size N. The smaller T is, the less room there is for transformation, and the more valuable the raw form becomes. In the T \u2264 10 regime, the most sophisticated choice is the simplest representation. Raw Vectorization is not just a quick shortcut. Information-theoretically, it is a lossless representation, because the identity map attains equality in the Data Processing Inequality (DPI); statistically, it is the minimal sufficient statistic under nonparametric assumptions; practically, it pairs well with the monotone-transform invariance, native NaN handling, and automatic interaction learning of tree-based models such as LightGBM.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix A. 
Entropy<\/h2>\n\n\n<style>.kadence-column6495_252359-38 > .kt-inside-inner-col,.kadence-column6495_252359-38 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_252359-38 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_252359-38 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_252359-38 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_252359-38 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_252359-38{position:relative;}.kadence-column6495_252359-38, .kt-inside-inner-col > .kadence-column6495_252359-38:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_252359-38 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_252359-38 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6495_252359-38\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">The entropy of a discrete random variable A measures its uncertainty in bits (Cover &amp; Thomas 2006):<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$H(A) = -\\sum_{a} p(a) \\log_2 p(a)$$\r\n<\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>A<\/strong>: a discrete random variable (e.g., coin flip, die roll, classification label).<\/li>\n\n\n\n<li><strong>a<\/strong>: a particular value A can take (e.g., for a coin, $a \\in \\{\\text{heads}, \\text{tails}\\}$).<\/li>\n\n\n\n<li><strong>$p(a) = P(A = a)$<\/strong>: probability that A equals a.<\/li>\n\n\n\n<li>The sum is over all possible values of A.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Physical meaning of &#8220;bits&#8221;:<\/strong> H(A) is the average number of yes\/no questions 
needed to identify A&#8217;s value. A fair coin has $H = -0.5\\log_2 0.5 - 0.5\\log_2 0.5 = 1$ bit \u2014 exactly one yes\/no question identifies the outcome. A uniform 8-sided die has $H = \\log_2 8 = 3$ bits.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Conditional entropy<\/strong> $H(A \\mid B)$ is the average remaining uncertainty in A when B is known:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$H(A \\mid B) = -\\sum_{a,b} p(a, b) \\log_2 p(a \\mid b)$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">A basic property is $H(A \\mid B) \\le H(A)$: on average, knowing B can only reduce (never increase) the uncertainty about A. This is what makes mutual information non-negative (Appendix B).<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix B. Mutual Information<\/h2>\n\n\n<style>.kadence-column6495_9f3ebe-77 > .kt-inside-inner-col,.kadence-column6495_9f3ebe-77 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_9f3ebe-77 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_9f3ebe-77 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_9f3ebe-77 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_9f3ebe-77 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_9f3ebe-77{position:relative;}.kadence-column6495_9f3ebe-77, .kt-inside-inner-col > .kadence-column6495_9f3ebe-77:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_9f3ebe-77 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_9f3ebe-77 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6495_9f3ebe-77\"><div class=\"kt-inside-inner-col\">\n<h3 
class=\"wp-block-heading\">B.1 Definition<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The mutual information between two random variables A and B measures the information they share, in bits (Cover &amp; Thomas 2006):<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$I(A; B) = \\sum_{a,b} p(a, b) \\log_2 \\frac{p(a, b)}{p(a)\\, p(b)} = H(A) - H(A \\mid B)$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The second equality means: total uncertainty about A minus the residual uncertainty when B is known \u2014 the amount of information about A gained by knowing B.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">B.2 Properties<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Property 1: $I(A; B) \\ge 0$<\/strong> (shared information cannot be negative). This follows directly from $H(A \\mid B) \\le H(A)$ in Appendix A:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$I(A; B) = H(A) - H(A \\mid B) \\ge 0$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Property 2: $I(A; B) = 0$ if and only if A and B are independent.<\/strong> If A and B are independent, $p(a, b) = p(a)p(b)$, so the log term inside the sum is $\\log_2 1 = 0$. Conversely, $I(A; B)$ is the Kullback-Leibler divergence between $p(a, b)$ and $p(a)p(b)$, which is zero only when the two distributions coincide, that is, when A and B are independent.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">B.3 How This Report Uses It<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Mutual information is the core tool for stating the Data Processing Inequality (Appendix C.2). 
The DPI says that no transform f can make $I(f(X); Y)$ exceed $I(X; Y)$; both sides of the inequality are mutual informations.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">How the DPI is used in this report:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>\u00a73.2 Minimal sufficient statistic:<\/strong> When arguing that Raw Vector is the minimal sufficient statistic for nonparametric time series, we use the DPI&#8217;s consequence that a non-injective transform can discard information about Y to justify that compressing below T dimensions cannot be guaranteed to preserve sufficiency.<\/li>\n\n\n\n<li><strong>\u00a73.3 Curse of dimensionality:<\/strong> When proposing the heuristic SNR-dilution model for transforms with $d_{\\mathrm{total}} \\gg d_{\\mathrm{signal}}$, we rely on the DPI implication that transforms cannot add information (so $d_{\\mathrm{signal}} \\le T$).<\/li>\n\n\n\n<li><strong>Appendix E (transform invertibility):<\/strong> Each vectorization method&#8217;s transform f is checked for injectivity; the DPI then implies that every non-injective method can lose information about Y, and does so whenever it merges inputs whose label distributions differ.<\/li>\n\n\n\n<li><strong>\u00a76 Conclusion:<\/strong> The claim that Raw Vectorization is a lossless representation rests on the DPI as its information-theoretic basis.<\/li>\n<\/ol>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix C. 
Markov Chain and Data Processing Inequality<\/h2>\n\n\n<style>.kadence-column6495_dbaac7-38 > .kt-inside-inner-col,.kadence-column6495_dbaac7-38 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_dbaac7-38 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_dbaac7-38 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_dbaac7-38 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_dbaac7-38 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_dbaac7-38{position:relative;}.kadence-column6495_dbaac7-38, .kt-inside-inner-col > .kadence-column6495_dbaac7-38:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_dbaac7-38 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_dbaac7-38 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6495_dbaac7-38\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">C.1 Markov Chain<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Three random variables X, Z, Y form a Markov chain $X \\to Z \\to Y$ when:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$P(Y \\mid X, Z) = P(Y \\mid Z)$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The left side is the probability of Y given that both X and Z occurred; the right side is the probability of Y given only Z. 
Their equality means: <strong>once Z is known, X gives no additional information about Y<\/strong> \u2014 Y receives all of X&#8217;s information through Z.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Simple example:<\/strong> &#8220;Rain (X) \u2192 wet clothes (Z) \u2192 catching a cold (Y).&#8221; If we already know the clothes are wet (Z), the probability of catching a cold (Y) does not change whether it rained (X) or not \u2014 the original cause does not matter once wetness is observed. So $P(Y \\mid X, Z) = P(Y \\mid Z)$ and X \u2192 Z \u2192 Y is a Markov chain.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this report&#8217;s setting, X is the original time series, Z = f(X) is its transformed representation (Raw Vector, tsfresh output, etc.), and Y is the target label. The roles differ from the generic definition above: because Z is computed from X alone, the chain that holds automatically is $Y \\to X \\to Z$, that is, $P(Z \\mid X, Y) = P(Z \\mid X)$. Whatever Z knows about Y, it learned through X.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">C.2 The DPI Itself<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For the Markov chain $Y \\to X \\to Z = f(X)$ (Cover &amp; Thomas 2006):<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$I(X; Y) \\ge I(Z; Y) = I(f(X); Y)$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Mutual information $I(\\cdot; \\cdot)$ is defined in Appendix B. Intuitively: no transform of X can carry more information about Y than X itself. Processing creates no new information.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">C.3 The Transform f(X)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">f is a deterministic or stochastic map applied to X, with $Z = f(X)$. A deterministic f falls into one of two classes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Injective (invertible) f:<\/strong> $f(x_1) = f(x_2)$ implies $x_1 = x_2$ \u2014 a one-to-one map. The inverse $f^{-1}$ exists on the image of f, so X can be fully recovered from Z.<\/li>\n\n\n\n<li><strong>Non-injective f:<\/strong> there exist $x_1 \\ne x_2$ with $f(x_1) = f(x_2)$. 
X cannot be uniquely recovered from Z, so information about X is lost.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">C.4 Injective f \u2014 Equality Holds<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If f is injective, the DPI holds with equality: $I(f(X); Y) = I(X; Y)$. Because the inverse $f^{-1}$ exists, Z carries the same information about Y as X does. Injectivity is sufficient but not necessary: in general, equality holds exactly when Y is conditionally independent of X given Z. <strong>Raw Vector (f = identity) is trivially injective<\/strong>, so equality holds for every $T \\ge 1$ \u2014 zero information loss is guaranteed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">C.5 Non-Injective f \u2014 Information Can Be Lost<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Definition of &#8220;downstream&#8221;:<\/strong> a machine-learning pipeline typically has two stages:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Upstream:<\/strong> raw data X \u2192 transform f \u2192 representation Z. Raw Vectorization, tsfresh, catch22, MiniRocket, TS2Vec all live here.<\/li>\n\n\n\n<li><strong>Downstream:<\/strong> representation Z \u2192 model \u2192 prediction Y. LightGBM, XGBoost, Ridge, etc. live here.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">So <strong>downstream accuracy<\/strong> means the predictive performance (accuracy, AUC, F1) of the model that consumes Z. It is upper-bounded by how much information Z carries about Y.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Information loss with non-injective transforms:<\/strong> for any non-injective f (statistical summaries, AUC, tsfresh, catch22, MiniRocket&#8217;s PPV pooling, TS2Vec&#8217;s max-pooling, &#8230;):<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$I(f(X); Y) \\le I(X; Y)$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The inequality is strict whenever f maps two inputs that occur with positive probability and have different conditional label distributions $P(Y \\mid X = x)$ to the same output; without knowledge of the data distribution this cannot be ruled out. The upper bound on downstream accuracy can therefore only stay the same or drop. A model may fail to reach this bound even on Raw Vector \u2014 that is a model issue, not a representation issue. 
But whatever a non-injective transform discards is gone at the representation stage itself; no downstream model can recover it.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix D. Fisher-Neyman Factorization Theorem and Sufficient Statistics<\/h2>\n\n\n<style>.kadence-column6495_77d36c-4c > .kt-inside-inner-col,.kadence-column6495_77d36c-4c > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_77d36c-4c > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_77d36c-4c > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_77d36c-4c > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_77d36c-4c > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_77d36c-4c{position:relative;}.kadence-column6495_77d36c-4c, .kt-inside-inner-col > .kadence-column6495_77d36c-4c:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_77d36c-4c > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_77d36c-4c > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6495_77d36c-4c\"><div class=\"kt-inside-inner-col\">\n<h3 class=\"wp-block-heading\">D.1 The Fisher-Neyman Factorization Theorem<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The Fisher-Neyman factorization theorem (Lehmann &amp; Casella 1998) concerns factorizations of the form:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$p_\\theta(x) = g_\\theta(S(x)) \\cdot h(x)$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Reading the equation:<\/strong> the equation asks whether the joint density $p_\\theta(x)$ can be factored into two parts.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>LHS $p_\\theta(x)$<\/strong>: probability (density) of observing data x given parameter 
\u03b8. Different \u03b8 produce different observation probabilities.<\/li>\n\n\n\n<li><strong>RHS first term $g_\\theta(S(x))$<\/strong>: the part affected by \u03b8. Crucially, \u03b8 enters not through the full data x but only through the statistic S(x) \u2014 all the information needed to estimate \u03b8 is compressed into S(x).<\/li>\n\n\n\n<li><strong>RHS second term $h(x)$<\/strong>: a part determined by the shape of the data x alone, with no dependence on \u03b8. It contributes nothing to inferring \u03b8.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">If such a factorization exists, the statistic S(x) is a <strong>sufficient statistic<\/strong> for \u03b8. Intuition: all information in x about \u03b8 flows through S(x); the residual h(x) is irrelevant to inference.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Formal definitions of the symbols:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>$p_\\theta(x)$<\/strong>: joint probability density (or mass) function indexed by \u03b8.<\/li>\n\n\n\n<li><strong>\u03b8<\/strong>: parameter of the distribution family; can be a scalar, vector, or even infinite-dimensional (nonparametric).<\/li>\n\n\n\n<li><strong>x<\/strong>: observed data. In this report, $x = (x_1, \\ldots, x_T) \\in \\mathbb{R}^T$.<\/li>\n\n\n\n<li><strong>S(x)<\/strong>: a statistic \u2014 any function of x used for inference.<\/li>\n\n\n\n<li><strong>$g_\\theta$<\/strong>: depends only on S(x) and \u03b8.<\/li>\n\n\n\n<li><strong>h(x)<\/strong>: depends only on x, not on \u03b8.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">D.2 What the Theorem Says<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">S(x) is a sufficient statistic for \u03b8 if and only if $p_\\theta(x)$ admits a factorization of the form above.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">D.3 Determining the Dimension of the Minimal Sufficient Statistic<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Sufficient statistics are not unique. 
The data x itself is always sufficient (via the trivial factorization $p_\\theta(x) = p_\\theta(x) \\cdot 1$); for a normal distribution, $(\\sum x_i, \\sum x_i^2, x_1)$ is also sufficient by the definition, although its third component is redundant. To talk about an intrinsic information dimension, we need <strong>the smallest sufficient statistic<\/strong> \u2014 the minimal sufficient statistic.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Procedure:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Write down the joint density $p_\\theta(x)$ for the family $\\mathcal{P}_\\Theta = \\{p_\\theta : \\theta \\in \\Theta\\}$.<\/li>\n\n\n\n<li>Consider all factorizations of the form $g_\\theta(S(x)) \\cdot h(x)$.<\/li>\n\n\n\n<li>Find the one with the <strong>smallest<\/strong> output dimension of S.<\/li>\n\n\n\n<li>That S(x) is the minimal sufficient statistic, and its output dimension is the minimal sufficient dimension.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">D.4 Parametric Example \u2014 Normal Distribution<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For an i.i.d. sample $x_1, \\ldots, x_T$ from $N(\\mu, \\sigma^2)$:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$p_{\\mu,\\sigma^2}(x) = (2\\pi\\sigma^2)^{-T\/2} \\exp\\left(-\\frac{1}{2\\sigma^2}\\sum (x_i - \\mu)^2\\right)$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Expanding the exponent:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$= \\underbrace{(2\\pi\\sigma^2)^{-T\/2} \\exp\\left(-\\frac{1}{2\\sigma^2}\\left(\\sum x_i^2 - 2\\mu\\sum x_i + T\\mu^2\\right)\\right)}_{g_{\\mu,\\sigma^2}(\\sum x_i,\\, \\sum x_i^2)} \\cdot \\underbrace{1}_{h(x)}$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The data enter the exponent only through $\\sum x_i$ and $\\sum x_i^2$, so the smallest factorization gives $S(x) = (\\sum x_i, \\sum x_i^2) \\in \\mathbb{R}^2$. The minimal sufficient dimension is <strong>2<\/strong>, fixed regardless of T. 
This is the parametric advantage \u2014 fixed compression independent of length.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">D.5 Nonparametric Case \u2014 X Itself Is Minimal Sufficient<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>What &#8220;nonparametric&#8221; means:<\/strong> the family $\\mathcal{P}_\\Theta = \\{p_\\theta : \\theta \\in \\Theta\\}$ is not parameterized by a finite-dimensional \u03b8. In this report, we mean \u0398 is the set of <em>all<\/em> joint distributions on $\\mathbb{R}^T$ \u2014 no functional assumption on the distribution at all. Then:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>\u03b8<\/strong>: the distribution itself \u2014 an infinite-dimensional object.<\/li>\n\n\n\n<li><strong>\u0398<\/strong>: the set of all joint distributions on $\\mathbb{R}^T$.<\/li>\n\n\n\n<li><strong>$p_\\theta$<\/strong>: the joint density corresponding to \u03b8. No functional constraint (no normality, no exponential family, etc.).<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Claim:<\/strong> Under the nonparametric family, $x = (x_1, \\ldots, x_T)$ is itself the minimal sufficient statistic.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Proof.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">(a) <strong>x is sufficient.<\/strong> Apply the trivial factorization:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$p_\\theta(x) = p_\\theta(x) \\cdot 1 = g_\\theta(S(x)) \\cdot h(x)$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">with $S(x) = x$, $g_\\theta(s) = p_\\theta(s)$, $h(x) = 1$. Fisher-Neyman factorization holds, so x is sufficient.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">(b) <strong>x is minimal.<\/strong> Suppose, for contradiction, that some statistic $S'(x)$ that genuinely compresses x, i.e., is not injective, is sufficient. Then there exist $x_1 \\ne x_2$ with $S'(x_1) = S'(x_2)$. 
Sufficiency of $S'$ requires that $P(X = x \\mid S'(X) = s)$ does not depend on \u03b8. In particular, on the level set $S'(x) = s^*$ that contains both $x_1$ and $x_2$:<\/p>\n\n\n\n<div style=\"background-color: #fff; border: none\">\r\n$$\\frac{P_\\theta(X = x_1)}{P_\\theta(X = x_2)} = \\frac{p_\\theta(x_1)}{p_\\theta(x_2)} = c \\quad (\\text{constant in } \\theta)$$\r\n<\/div>\n\n\n\n<p class=\"wp-block-paragraph\">But under a fully nonparametric family, for any $x_1 \\ne x_2$ there is some \u03b8 making $p_\\theta(x_1)\/p_\\theta(x_2)$ equal to any positive number we like \u2014 so the ratio cannot be constant in \u03b8. Contradiction. Therefore no such $S'$ can be sufficient, and x is minimal sufficient. \u220e<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">D.6 Order Statistics Are Not Sufficient When Time Order Matters<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The above argument shows no nontrivial compression of x is sufficient. In particular, the order statistics $x_{(1)} \\le \\ldots \\le x_{(T)}$ \u2014 which discard time order \u2014 are also a compression and therefore <strong>not sufficient when time order matters<\/strong>. Two time series with the same multiset $\\{x_1, \\ldots, x_T\\}$ but different orderings can receive different probabilities under some $p_\\theta$ in the nonparametric family, so collapsing them discards information about \u03b8. (Under an i.i.d. assumption every $p_\\theta$ would be permutation-invariant and the order statistics would be sufficient; it is the dependence on time order that rules this out.) Hence the minimal sufficient dimension in the nonparametric, time-order-significant setting is exactly T \u2014 which is the dimension of Raw Vector.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Appendix E. 
Invertibility of Each Transform<\/h2>\n\n\n<style>.kadence-column6495_77e150-0a > .kt-inside-inner-col,.kadence-column6495_77e150-0a > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_77e150-0a > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_77e150-0a > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_77e150-0a > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_77e150-0a > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_77e150-0a{position:relative;}.kadence-column6495_77e150-0a, .kt-inside-inner-col > .kadence-column6495_77e150-0a:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_77e150-0a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_77e150-0a > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6495_77e150-0a\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\">This table summarizes whether each vectorization method&#8217;s transform f is injective. 
By the DPI (Appendix C), a non-injective transform can never add information about Y, and it loses information whenever it merges inputs whose label distributions differ.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Method<\/th><th>Nature of the transform<\/th><th>Injective?<\/th><th>Reason for non-injectivity<\/th><\/tr><\/thead><tbody><tr><td><strong>Raw Vector<\/strong><\/td><td>$f(x) = x$ (identity)<\/td><td><strong>Injective<\/strong><\/td><td>n\/a (the identity is trivially one-to-one)<\/td><\/tr><tr><td>Stats (mean, var, &#8230;)<\/td><td>$(\\sum x_i \/ T, \\sum (x_i - \\bar{x})^2 \/ (T-1), \\min, \\max, \\ldots)$<\/td><td>non-injective<\/td><td>[1,2,3,4,5] and [5,4,3,2,1] yield identical statistics<\/td><\/tr><tr><td>AUC<\/td><td>$\\sum_i (x_i + x_{i+1}) \/ 2$<\/td><td>non-injective<\/td><td>infinitely many series share any given area<\/td><\/tr><tr><td>tsfresh<\/td><td>~800 statistical \/ spectral \/ entropy functions<\/td><td>non-injective<\/td><td>many features invariant to time shift, partial reversal, etc.<\/td><\/tr><tr><td>catch22<\/td><td>22 dynamics features<\/td><td>non-injective<\/td><td>built on distribution moments and transition matrices \u2014 partially loses time order<\/td><\/tr><tr><td>MiniRocket<\/td><td>9,996 convolutions + PPV pooling<\/td><td>non-injective<\/td><td>PPV keeps only the proportion of positive values in each convolution output; many signals share the same PPV<\/td><\/tr><tr><td>TS2Vec<\/td><td>neural encoder + max-pooling<\/td><td>non-injective<\/td><td>max-pooling itself is non-injective; also stochastic via dropout<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Every method except Raw Vector is non-injective in the T \u2264 10 regime of interest (degenerate cases such as T = 1 aside). 
Raw Vector is the only representation in the table that is guaranteed to attain equality in the DPI, and therefore the only one guaranteeing zero information loss for every $T \\ge 1$.<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n<style>.kadence-column6495_fe0da1-73 > .kt-inside-inner-col,.kadence-column6495_fe0da1-73 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column6495_fe0da1-73 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column6495_fe0da1-73 > .kt-inside-inner-col{flex-direction:column;}.kadence-column6495_fe0da1-73 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column6495_fe0da1-73 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column6495_fe0da1-73{position:relative;}.kadence-column6495_fe0da1-73, .kt-inside-inner-col > .kadence-column6495_fe0da1-73:not(.specificity){margin-left:var(--global-kb-spacing-sm, 1.5rem);}@media all and (max-width: 1024px){.kadence-column6495_fe0da1-73 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column6495_fe0da1-73 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column6495_fe0da1-73\"><div class=\"kt-inside-inner-col\">\n<ul class=\"wp-block-list\">\n<li>Bellman, R. (1961). <em>Adaptive Control Processes<\/em>. Princeton University Press.<\/li>\n\n\n\n<li>Christ, M., Braun, N., Neuffer, J., &amp; Kempa-Liehr, A. W. (2018). Time series feature extraction on basis of scalable hypothesis tests (tsfresh \u2013 A Python package). <em>Neurocomputing<\/em>, 307, 72\u201377.<\/li>\n\n\n\n<li>Cover, T. M., &amp; Thomas, J. A. (2006). <em>Elements of Information Theory<\/em> (2nd ed.). Wiley.<\/li>\n\n\n\n<li>Dempster, A., Schmidt, D. F., &amp; Webb, G. I. (2021). 
MiniRocket: A very fast (almost) deterministic transform for time series classification. In <em>Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &amp; Data Mining (KDD &#8217;21)<\/em> (pp. 248\u2013257).<\/li>\n\n\n\n<li>Lehmann, E. L., &amp; Casella, G. (1998). <em>Theory of Point Estimation<\/em> (2nd ed.). Springer.<\/li>\n\n\n\n<li>Lubba, C. H., Sethi, S. S., Knaute, P., Schultz, S. R., Fulcher, B. D., &amp; Jones, N. S. (2019). catch22: CAnonical Time-series CHaracteristics. <em>Data Mining and Knowledge Discovery<\/em>, 33, 1821\u20131852.<\/li>\n\n\n\n<li>Yue, Z., Wang, Y., Duan, J., Yang, T., Huang, C., Tong, Y., &amp; Xu, B. (2022). TS2Vec: Towards universal representation of time series. In <em>Proceedings of the AAAI Conference on Artificial Intelligence<\/em>, 36(8), 8980\u20138987.<\/li>\n\n\n\n<li>aeon-toolkit. MiniRocket documentation. https:\/\/www.aeon-toolkit.org\/<\/li>\n<\/ul>\n<\/div><\/div>\n<div style='text-align:center' class='yasr-auto-insert-overall'><\/div><div style='text-align:center' class='yasr-auto-insert-visitor'><\/div>","protected":false},"excerpt":{"rendered":"<p>This report analyzes why standard vectorization methods \u2014 statistical summary (mean\/var\/AUC), automatic feature extraction (tsfresh, catch22), convolutional representations (MiniRocket), and self-supervised embeddings (TS2Vec) \u2014 fail when the time series length T is very short, especially T \u2264 10. 
As an alternative, it argues that Raw Vectorization (Identity Mapping) is not just an easy fallback but&#8230;<\/p>\n","protected":false},"author":4,"featured_media":6543,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","yasr_overall_rating":0,"yasr_post_is_review":"","yasr_auto_insert_disabled":"","yasr_review_type":"","fifu_image_url":"","fifu_image_alt":"","iawp_total_views":0,"footnotes":""},"categories":[56,373,370],"tags":[],"class_list":["post-6495","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science-slug","category-feature-engineering-slug","category-time-series-slug"],"yasr_visitor_votes":{"stars_attributes":{"read_only":false,"span_bottom":false},"number_of_votes":1,"sum_votes":4},"jetpack_featured_media_url":"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/05\/20260502-Round-Rock-Tx-Spherical-Water-Tower-800x600px.jpg","_links":{"self":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6495","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/yk
im.synology.me\/wordpress\/wp-json\/wp\/v2\/comments?post=6495"}],"version-history":[{"count":31,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6495\/revisions"}],"predecessor-version":[{"id":6545,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/6495\/revisions\/6545"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media\/6543"}],"wp:attachment":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media?parent=6495"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/categories?post=6495"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/tags?post=6495"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}