A Taxonomy of Manufacturing Big Data: Integrating Machine and Human Data

1. Introduction: The Missing Link in Smart Manufacturing
Investment in smart manufacturing and big data analytics has expanded rapidly, yet the focus has remained almost exclusively on Machine Data—the data automatically generated by equipment and systems. Despite dramatic improvements in data volume, granularity, and infrastructure, gains in defect analysis and yield improvement have fallen short of expectations. The reason is clear.
The root cause of defects often lies not in machines but in human judgment and design intent.
- What hypothesis drove this experiment?
- Why was the recipe changed at this point in time?
- On what basis was this lot held or released?
- What correction was applied to this chamber after Preventive Maintenance (PM)?
Such information remains scattered across engineers’ minds, emails, slide decks, and personal notes—never reaching the big data pipeline. No matter how rich the machine data, the absence of context for why a value turned out the way it did imposes a hard ceiling on causal inference and model learning.
This report defines manufacturing big data as two equal branches—Machine Data and Human Data—and proposes a taxonomy that classifies each branch by the purpose and role of the data. The taxonomy is industry-agnostic and uses semiconductor manufacturing as the representative example.
2. Machine Data Taxonomy: The Physical Reality
Machine Data is the output that systems and equipment generate automatically according to predefined logic. Machines are not decision-makers; they are executors following control algorithms, and they record the physical reality that results. Machine Data answers “What happened.”
2.1 Classification
MACHINE DATA
├─ Static  : Asset & Metadata
├─ Dynamic : Operational & Trace Data
└─ Quality : Metrology, Inspection
| Category | Definition | Characteristics |
|---|---|---|
| Static | Equipment and asset identity, configuration, and specifications—information that does not change or changes rarely | Quasi-static, reference |
| Dynamic | Time-dependent data generated during operation (sensors, events, state logs, Fault Detection and Classification (FDC) trace) | High-frequency time-series |
| Quality | Inspection and metrology results—post-hoc verification of process output | Discrete measurement, outcome-oriented |
The three categories operate on different time axes and serve different roles. Static answers “what exists,” Dynamic answers “how it ran,” and Quality answers “what the result was.” Together they form a complete description of the physical state of the process.
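As a rough illustration, the three categories could be modeled as separate record types joined on shared keys. The sketch below is Python; every class name, field name, and the `describe_lot` helper is hypothetical and not part of the taxonomy itself.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class MachineCategory(Enum):
    STATIC = "static"    # what exists
    DYNAMIC = "dynamic"  # how it ran
    QUALITY = "quality"  # what the result was


@dataclass(frozen=True)
class AssetRecord:
    """Static: equipment identity and configuration; changes rarely."""
    equipment_id: str
    chamber: str
    category: MachineCategory = MachineCategory.STATIC


@dataclass(frozen=True)
class TracePoint:
    """Dynamic: one time-stamped sensor sample recorded during operation."""
    equipment_id: str
    lot_id: str
    sensor: str
    timestamp: datetime
    value: float
    category: MachineCategory = MachineCategory.DYNAMIC


@dataclass(frozen=True)
class Measurement:
    """Quality: one discrete metrology or inspection result."""
    lot_id: str
    parameter: str
    value: float
    category: MachineCategory = MachineCategory.QUALITY


def describe_lot(lot_id, assets, trace, measurements):
    """Join the three categories into one physical description of a lot."""
    lot_trace = [t for t in trace if t.lot_id == lot_id]
    used_equipment = {t.equipment_id for t in lot_trace}
    return {
        "what_exists": [a for a in assets if a.equipment_id in used_equipment],
        "how_it_ran": lot_trace,
        "what_the_result_was": [m for m in measurements if m.lot_id == lot_id],
    }
```

Keeping the categories as distinct types reflects their different time axes: assets are looked up by key, trace is filtered by lot, and measurements arrive after the fact.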
3. Human Data Taxonomy: The Engineering Intelligence
Human Data is the data produced by the judgment, design intent, and experience of engineers. It captures “Why and how” and provides the context that gives machine data its meaning.
In a Human-in-the-Loop (HITL) view, engineers infuse data with meaning by:
- Interpreting sensor readings and judging normal versus anomalous
- Forming hypotheses from process results and designing experiments
- Making judgments and deciding actions when standards are violated
- Learning equipment behavior through experience and applying corrections
The artifacts of these activities form Human Data, which is classified by purpose into four categories: Baseline, Knob, Excursion, and Maintenance.
3.1 Classification
HUMAN DATA
├─ Baseline
├─ Knob
│  ├─ Narrow DOE
│  └─ Wide DOE
├─ Excursion
└─ Maintenance
| Category | Definition | Representative Assets |
|---|---|---|
| Baseline | Data that defines the standard state | Recipe, Spec/Limit, Standard Operating Procedure (SOP), Best Known Method (BKM) |
| Knob | Intentional manipulation and exploration data — Design of Experiments (DOE) | DOE plan, split table |
| ├ Narrow DOE | Fine tuning and optimization within a narrow range | Single step or parameter adjustment |
| └ Wide DOE | Exploration and screening across a broad range | Multiple steps or parameters varied simultaneously |
| Excursion | Response data for abnormal situations | Disposition, troubleshooting, Non-Conformance Report / Corrective and Preventive Action (NCR/CAPA), Engineering Change Notice (ECN) / Engineering Change Request (ECR) / Engineering Information Notice (EIN) |
| Maintenance | Maintenance and management activity data | PM records, parts replacement history, heuristic offsets |
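One possible way to capture these categories as structured records is sketched below in Python. The class and field names are hypothetical. Two simplifications are assumptions of this sketch rather than rules of the taxonomy: a DOE scope is required for Knob records and forbidden elsewhere, and Narrow versus Wide is decided only by how many parameters are varied.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class HumanCategory(Enum):
    BASELINE = "baseline"
    KNOB = "knob"
    EXCURSION = "excursion"
    MAINTENANCE = "maintenance"


class DoeScope(Enum):
    NARROW = "narrow"  # single step or parameter adjustment
    WIDE = "wide"      # multiple steps or parameters varied simultaneously


def classify_doe(varied_parameters):
    """Simplified rule (assumption): one varied parameter is Narrow, more is Wide."""
    if not varied_parameters:
        raise ValueError("a DOE must vary at least one parameter")
    return DoeScope.NARROW if len(varied_parameters) == 1 else DoeScope.WIDE


@dataclass
class HumanRecord:
    record_id: str
    category: HumanCategory
    author: str
    rationale: str  # the "why": hypothesis, basis for a hold, reason for a change
    linked_lots: list = field(default_factory=list)
    doe_scope: Optional[DoeScope] = None

    def __post_init__(self):
        # Assumption of this sketch: doe_scope is set if and only if the record is a Knob.
        is_knob = self.category is HumanCategory.KNOB
        if (self.doe_scope is not None) != is_knob:
            raise ValueError("doe_scope must be set for Knob records and only for them")


record = HumanRecord(
    record_id="H-001",
    category=HumanCategory.KNOB,
    author="engineer_a",
    rationale="Check whether lower RF power reduces edge non-uniformity",
    linked_lots=["LOT-1"],
    doe_scope=classify_doe(["rf_power"]),
)
```

The `rationale` and `linked_lots` fields are the point of the exercise: they hold the intent that otherwise stays in emails and slide decks, keyed so it can be joined to machine data.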
4. Integration Strategy: Synergy between Machine and Human
Machine Data and Human Data are limited on their own; value emerges when they are combined. The practical significance of this taxonomy lies in how the two branches are paired.
4.1 Machine ↔ Human Correspondence
| Role | Machine Data | Human Data |
|---|---|---|
| Standard / Fixed | Static | Baseline |
| Intentional Variation | — | Knob |
| Routine Operation | Dynamic | — |
| Abnormal Response | — | Excursion |
| Maintenance | — | Maintenance |
| Outcome Verification | Quality | — |
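The correspondence table can also be expressed as a small lookup, which makes the coverage argument checkable. This is a minimal Python sketch; the dictionary layout and the `roles_covered` helper are illustrative assumptions.

```python
# Role -> (Machine Data category, Human Data category); None means no counterpart.
CORRESPONDENCE = {
    "Standard / Fixed": ("Static", "Baseline"),
    "Intentional Variation": (None, "Knob"),
    "Routine Operation": ("Dynamic", None),
    "Abnormal Response": (None, "Excursion"),
    "Maintenance": (None, "Maintenance"),
    "Outcome Verification": ("Quality", None),
}


def roles_covered(branch):
    """Return the set of roles for which the given branch has a category."""
    index = {"machine": 0, "human": 1}[branch]
    return {role for role, pair in CORRESPONDENCE.items() if pair[index] is not None}


machine_roles = roles_covered("machine")
human_roles = roles_covered("human")
uncovered_by_machine = set(CORRESPONDENCE) - machine_roles
```

Here `uncovered_by_machine` contains Intentional Variation, Abnormal Response, and Maintenance: the roles that go unrecorded when only Machine Data is collected.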
4.2 Key Integration Scenarios
(1) Maintenance × Asset → Predictive Maintenance (PdM)
- Combine asset/parts information with maintenance history (replacement cycles, correction history)
- Match against degradation patterns in Dynamic trace to predict remaining useful life
- Result: shift from scheduled maintenance to Condition-Based Maintenance (CBM) (Lee 2014)
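A minimal sketch of this pairing, in Python, is shown below. It assumes a single health indicator that degrades roughly linearly after each part replacement. That is a strong simplification chosen for brevity, not the method of the cited work. The names `PartReplacement`, `fit_line`, and `remaining_runs`, and all values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartReplacement:
    """Human Data (Maintenance): when a part was last replaced."""
    equipment_id: str
    part: str
    replaced_at_run: int  # index of the first run on the new part


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def remaining_runs(trace, replacement, limit):
    """Extrapolate the indicator trend since the last replacement up to `limit`.

    trace: list of (run_index, indicator_value) pairs from Dynamic trace data.
    Returns the estimated number of runs left, or None if no upward trend is seen.
    """
    points = [(r, v) for r, v in trace if r >= replacement.replaced_at_run]
    if len(points) < 2:
        return None
    slope, intercept = fit_line([r for r, _ in points], [v for _, v in points])
    if slope <= 0:
        return None
    run_at_limit = (limit - intercept) / slope
    return max(0.0, run_at_limit - points[-1][0])


replacement = PartReplacement("ETCH-01", "focus ring", replaced_at_run=100)
trace = [(98, 9.0), (99, 9.5), (100, 1.0), (101, 1.5), (102, 2.0), (103, 2.5), (104, 3.0)]
estimate = remaining_runs(trace, replacement, limit=10.0)  # about 14 runs
```

Without the maintenance record, the drop between runs 99 and 100 would look like an anomaly in the trace. With it, the fit restarts at the replacement and the trend is meaningful.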
(2) Knob (DOE) × Trace → Process Optimization and Virtual Metrology (VM)
- Combine DOE intent (which variables were perturbed and how) with the corresponding Trace data
- Enables input-output relationship modeling, the basis for VM, Advanced Process Control (APC), and soft sensors (Moyne 2017)
- Result: improved experimental efficiency, reduced metrology load, automated process-window discovery
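The input-output modeling step could look roughly like the Python sketch below: a linear virtual-metrology model fitted by ordinary least squares on one DOE knob and one trace feature. The linear form, the coded knob levels, the feature choice, and all numbers are illustrative assumptions; the data is synthetic.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_vm(rows, targets):
    """Least-squares fit with intercept via the normal equations."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    """Predict the metrology value for one wafer from its knob and trace features."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


# Each row: (coded RF power level from the DOE split table, mean pressure offset from trace)
rows = [(-1, 0.2), (-1, -0.1), (0, 0.0), (0, 0.3), (1, -0.2), (1, 0.1)]
thickness = [45.4, 46.3, 50.0, 49.1, 54.6, 53.7]  # synthetic metrology results

coef = fit_vm(rows, thickness)        # [intercept, knob effect, pressure effect]
estimate = predict(coef, (1, 0.0))    # thickness estimate for an unmeasured wafer
```

The DOE plan is what makes the fit informative: because the knob was moved deliberately, its effect can be separated from incidental variation in the trace.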
(3) Baseline × Dynamic/Quality → Drift Detection
- Compare Baseline standards and control limits against real-time Dynamic/Quality data
- Goes beyond classical Statistical Process Control (SPC) limit checks to detect shifts in the data distribution that a deployed model was trained on, known as concept drift (Gama 2014)
- Result: early detection of silent degradation and quiet distribution shifts; trigger for model retraining
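The difference between a limit check and a distribution check can be sketched in Python as follows. The two-sample Kolmogorov-Smirnov statistic with its large-sample critical value is used here as one possible drift test; the source does not prescribe a method, and the limits and values are hypothetical.

```python
import math


def limit_violations(values, lower, upper):
    """Limit check: return the values outside the Baseline limits."""
    return [v for v in values if not lower <= v <= upper]


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of the two samples."""
    points = sorted(set(reference) | set(current))

    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    return max(abs(ecdf(reference, x) - ecdf(current, x)) for x in points)


def drift_detected(reference, current, alpha=0.05):
    """Two-sample KS test using the asymptotic critical value."""
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


# Baseline (Human Data): hypothetical spec limits for a measured parameter
LOWER, UPPER = 9.0, 11.0

reference = [9.9, 10.0, 10.1, 10.0, 9.8, 10.2, 10.0, 9.9, 10.1, 10.0]   # training period
current = [10.4, 10.5, 10.3, 10.6, 10.4, 10.5, 10.3, 10.6, 10.5, 10.4]  # recent window

violations = limit_violations(current, LOWER, UPPER)  # empty: every value is in spec
drifted = drift_detected(reference, current)          # True: the distribution has moved
```

In this example every recent value passes the limit check, yet the distribution has clearly shifted, which is the kind of quiet change that would justify retraining a model.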
(4) Excursion × Quality → Root Cause Analysis (RCA)
- Link engineer judgments and corrective actions during excursions to Quality outcomes via lineage
- Forms a learning corpus of “which action led to which result”
- Result: automated troubleshooting recommendation, foundation for domain Large Language Model (LLM) training (Shintani 2021)
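A toy version of this corpus, in Python, is sketched below. It joins excursion actions to quality outcomes on a lot identifier and ranks past actions for a symptom by how often the lot later passed. The record layout, the pass/fail outcome, and the ranking rule are assumptions made for illustration; a historical pass rate shows association, not proof of cause.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExcursionAction:
    """Human Data (Excursion): what an engineer did in response to a symptom."""
    lot_id: str
    symptom: str
    action: str


def build_corpus(actions, passed_by_lot):
    """Link each action to its Quality outcome through the lot_id lineage key."""
    return [
        (a.symptom, a.action, passed_by_lot[a.lot_id])
        for a in actions
        if a.lot_id in passed_by_lot
    ]


def recommend(corpus, symptom):
    """Rank actions taken for a symptom by their historical pass rate."""
    stats = defaultdict(lambda: [0, 0])  # action -> [passes, total]
    for seen_symptom, action, passed in corpus:
        if seen_symptom == symptom:
            stats[action][0] += int(passed)
            stats[action][1] += 1
    ranked = sorted(stats.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    return [(action, passes / total) for action, (passes, total) in ranked]


actions = [
    ExcursionAction("LOT-1", "particle spike", "wet clean"),
    ExcursionAction("LOT-2", "particle spike", "wet clean"),
    ExcursionAction("LOT-3", "particle spike", "season chamber"),
    ExcursionAction("LOT-4", "particle spike", "season chamber"),
    ExcursionAction("LOT-5", "rate drift", "recipe offset"),
]
passed_by_lot = {"LOT-1": True, "LOT-2": True, "LOT-3": False, "LOT-4": True, "LOT-5": True}

corpus = build_corpus(actions, passed_by_lot)
suggestions = recommend(corpus, "particle spike")
# suggestions == [("wet clean", 1.0), ("season chamber", 0.5)]
```

The join only works if excursion records carry a key, such as the lot identifier, that also appears in Quality data. Free-text notes without that lineage cannot enter the corpus.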
These integrations only work when both branches are managed as equal-tier assets. If Human Data remains scattered across one-off documents, none of the combinations will function.
5. Conclusion: Toward Autonomous Process Control
A truly autonomous factory becomes possible only when data is fully classified and integrated.
- Machine Data alone reveals phenomena but not causes.
- Human Data alone carries intent but cannot be verified.
- When both branches are managed as equal-tier assets, causal inference, learning, and autonomous control begin to operate.
The proposed taxonomy satisfies three criteria:
- Consistent classification axis: both branches are organized by the purpose of the data.
- Completeness: covers standards, intentional variation, routine operation, abnormal response, maintenance, and outcome verification, the six roles listed in Section 4.1.
- Industry generality: the categories do not depend on any one industry; semiconductor manufacturing serves only as the representative example.
Defining and integrating Machine Data and Human Data as equal assets is the starting point for data-driven autonomous manufacturing.
References
- Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), 1–37.
- Lee, J., Wu, F., Zhao, W., Ghaffari, M., Liao, L., & Siegel, D. (2014). Prognostics and health management design for rotary machinery systems—Reviews, methodology and applications. Mechanical Systems and Signal Processing, 42(1–2), 314–334.
- Moyne, J., & Iskandar, J. (2017). Big data analytics for smart manufacturing: Case studies in semiconductor manufacturing. Processes, 5(3), 39.
- Shintani, K., et al. (2021). Knowledge management and AI-driven assistants for semiconductor process engineering. IEEE Transactions on Semiconductor Manufacturing, 34(3), 312–321.
