Simulation Knowledge: Intermediate

What is Materials Informatics (MI)?
Next-generation materials development accelerated by AI × simulation

2026.03.05

Challenges in Materials Informatics (MI) and the aim of this article

This article discusses how to overcome two key challenges in materials informatics (MI)—data scarcity and black-box behavior—by integrating simulation. It also touches on the potential of agentic generative AI and organizational aspects, and introduces ideas and example workflows to accelerate materials development.

For many years, materials development has relied heavily on trial and error guided by the experience and intuition of skilled researchers. However, this traditional approach has a structural limitation: it tends to be inefficient and costly when searching for optimal candidates within a vast chemical space whose combinations are nearly infinite. As a result, the hurdle for discovering truly innovative materials remains high.

One promising approach attracting attention is Materials Informatics (MI). By using a data-driven approach, MI transforms materials exploration into a predictive and efficient process. It not only discovers new insights through large-scale data analysis, but also has the potential to accelerate the exploration process itself by working with robotic synthesis and automated experimentation.

That said, in materials domains with complex behavior, a simple data-driven approach—one that merely collects available experimental data—often struggles to deliver sufficient results. There is a growing need to partially address this challenge through integration with simulation, and to realize more effective materials design workflows grounded in physical mechanisms.

1. Challenges specific to materials design

Why are final material properties not determined by chemical composition alone? The answer lies in the hierarchical internal structure of materials, and how that structure is formed by manufacturing processes. Understanding this fundamental point is an important basis for grasping why simulation integration is necessary.

Multiscale structure

Material properties emerge across multiple length scales.

  • Molecular scale (Å–nm): individual molecular structures and chemical bonds
  • Mesoscale (nm–μm): morphology such as polymer chain entanglement, phase-separated structures, and interfacial structures
  • Macroscale (mm–m): mechanical and thermal behavior as a bulk material
Identifying at which scale a given property emerges, and accurately capturing the corresponding internal structure, is the key to understanding and controlling property-determining mechanisms.

Figure 1. Multiscale structure (hierarchical structure in materials)

Process–Structure–Property (PSP) relationships

The internal structure of a material is dynamically determined by the manufacturing process. Steps such as mixing, curing, and deformation determine molecular orientation and aggregation states, and the resulting internal structure governs the final material properties. This chain of causality is referred to as the Process–Structure–Property (PSP) relationship. However, when one merely collects available data, it is common for only the final property values to be recorded, while process conditions and intermediate internal structures are not sufficiently captured.

Figure 2. Process–Structure–Property (PSP) relationships

2. Challenges of data-driven Materials Informatics (MI)

The materials-specific complexity described above causes several issues, especially when data-driven MI approaches are applied to whatever datasets happen to be on hand.

Limited and noisy data

In many materials development cases, the absolute amount of usable data is insufficient. In addition, data are often siloed by department, and formats are not unified. If hidden variables that strongly affect properties—such as measurement conditions, lot-to-lot variation, pretreatment conditions, and process history—are not recorded, the dataset becomes incomplete and the reliability of models built on it decreases.
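One way to keep such hidden variables from being lost is to make them required fields of every stored measurement. The following is a minimal sketch in Python, not a prescribed schema: the class name `MeasurementRecord` and all field names and example values are hypothetical and would need to be adapted to each organization's data.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class MeasurementRecord:
    """One property measurement together with the context needed to reuse it.

    All field names are illustrative assumptions, not an established standard.
    """
    sample_id: str
    property_name: str
    value: float
    unit: str
    lot_id: str
    measurement_conditions: dict   # e.g. temperature, strain rate
    pretreatment: str
    process_history: tuple = ()    # ordered process steps, e.g. ("mixing", "curing")

    def __post_init__(self):
        # Reject records whose key metadata is empty, so gaps surface at entry time.
        missing = [name for name in ("sample_id", "unit", "lot_id")
                   if not getattr(self, name)]
        if missing:
            raise ValueError(f"missing required metadata: {missing}")


record = MeasurementRecord(
    sample_id="S-001",
    property_name="tensile_modulus",
    value=2.1,
    unit="GPa",
    lot_id="LOT-A",
    measurement_conditions={"temperature_C": 23.0},
    pretreatment="dried 24 h",
    process_history=("mixing", "curing"),
)
row = asdict(record)  # plain dict, ready to be written to a shared store
```

Validating at the point of entry is a design choice: a record with a blank lot or unit is rejected immediately instead of silently degrading a model later.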

Insufficient features (descriptors)

In MI, it is essential to convert chemical structures and related information into numerical vectors (features or descriptors) that machines can learn from. However, simple numerical descriptors cannot adequately represent the physical entities that determine properties, such as polymer orientation, crystalline structure, entanglement, or phase-separated morphology. As a result, models may overlook key mechanisms behind property emergence.
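To make the idea of a descriptor concrete, the sketch below turns a chemical formula into a small numerical vector. It is a deliberately simple illustration under stated assumptions: the function names are hypothetical, only four elements are covered, the atomic masses are rounded, and formulas with brackets or charges are not handled.

```python
import re

# Rounded atomic masses in g/mol for a few elements (illustrative subset).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C2H6O' (no brackets or charges)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def composition_descriptor(formula, elements=("C", "H", "N", "O")):
    """Return [atomic fraction of each element in `elements`..., molar mass]."""
    counts = parse_formula(formula)
    total_atoms = sum(counts.values())
    fractions = [counts.get(element, 0) / total_atoms for element in elements]
    molar_mass = sum(ATOMIC_MASS[element] * n for element, n in counts.items())
    return fractions + [molar_mass]


descriptor = composition_descriptor("C2H6O")
```

Ethanol and dimethyl ether share the formula C2H6O, so this descriptor maps both to the same vector even though their structures and properties differ. Information about structure, let alone orientation or morphology, has to come from richer features.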

Risk of becoming a black box

Data-driven models can fit known datasets with high accuracy. In many cases, however, they are not necessarily learning physical causality; they are capturing statistical correlations. Therefore, when trying to predict for new chemical species or different process conditions not included in the training data, reliability may drop. If the model’s rationale is unclear and it becomes a black box, MI may struggle to fulfill its original purpose: exploring unknown regions of the materials space.
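A simple guard against unreliable extrapolation is to check how far a new candidate lies from the training data in descriptor space before trusting a prediction. The sketch below uses the Euclidean distance to the nearest training point; the function names and the threshold are assumptions for illustration, and in practice features would need to be scaled and the threshold calibrated on held-out data.

```python
import math


def nearest_training_distance(candidate, training_points):
    """Euclidean distance from `candidate` to its closest training point."""
    return min(math.dist(candidate, point) for point in training_points)


def is_within_domain(candidate, training_points, threshold):
    """True if the candidate lies within `threshold` of some training point."""
    return nearest_training_distance(candidate, training_points) <= threshold


# Toy two-dimensional descriptor space (values are made up).
training_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
interpolated = is_within_domain((0.5, 0.5), training_points, threshold=1.0)
extrapolated = is_within_domain((5.0, 5.0), training_points, threshold=1.0)
```

A prediction flagged as outside the domain is not necessarily wrong, but it is a natural candidate for verification by simulation or experiment.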

3. Positioning MI and simulation integration

One solution to the limitations faced by data-driven MI is autonomous high-throughput experimentation using robots. By controlling conditions and processes, high-quality experimental data can be accumulated, and such efforts have already been reported. Another approach is integration with simulation.

Overcoming data scarcity

A straightforward approach is to run virtual experiments by simulation when experimental data are lacking, adding the results to the dataset while checking their accuracy and validity. This can compensate for limited experimental data and expand the dataset required to train MI models.
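The "checking accuracy and validity" step can be sketched as follows: simulated values are compared with experiment on the samples both cover, and are merged only if the error is acceptable. This is a minimal illustration, assuming each dataset is a mapping from a sample identifier to one property value; the function name, the `max_rmse` tolerance, and the example numbers are hypothetical.

```python
import math


def validate_and_merge(experimental, simulated, max_rmse):
    """Merge simulated values into the dataset if they agree with experiment.

    Agreement is measured by the RMSE on samples present in both inputs.
    Experimental values take precedence where both exist.
    """
    shared = sorted(set(experimental) & set(simulated))
    if not shared:
        raise ValueError("no overlapping samples to validate the simulation against")
    errors = [simulated[s] - experimental[s] for s in shared]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    if rmse > max_rmse:
        raise ValueError(f"simulation RMSE {rmse:.3g} exceeds tolerance {max_rmse:.3g}")
    merged = [{"id": s, "value": v, "source": "experiment"}
              for s, v in experimental.items()]
    merged += [{"id": s, "value": v, "source": "simulation"}
               for s, v in simulated.items() if s not in experimental]
    return merged, rmse


experimental = {"A": 1.0, "B": 2.0}
simulated = {"A": 1.1, "B": 1.9, "C": 3.0}
merged, rmse = validate_and_merge(experimental, simulated, max_rmse=0.2)
```

Keeping the `source` tag on every row lets a later model weight or filter simulated points separately from measured ones.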

Obtaining physically meaningful features

A key advantage of simulation is that it can extract microscopic internal information that is difficult to measure directly in experiments, for example:

  • Local 3D molecular structures and dynamics
  • Interfacial structures and interaction energies
  • Intermolecular forces and free volume
By incorporating these physically meaningful intermediate factors as features in MI models, the models can learn causal relationships to final properties more directly.

Reducing black-box behavior and improving explainability

Models that incorporate physical features can improve not only prediction accuracy but also physical interpretability of results. By analyzing which physical features contribute strongly to properties, one can answer the question “why does the model predict this,” and the model is expected to evolve from a mere black box into a predictive tool with explainability and credibility.
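One common way to ask which features contribute strongly is permutation importance: shuffle one feature at a time and measure how much the prediction error grows. The sketch below is a minimal standard-library version under stated assumptions; the toy model and the feature meanings (a simulated free volume and an unrelated second column) are hypothetical stand-ins for a trained MI model and its inputs.

```python
import random


def mean_squared_error(model, rows, targets):
    """Average squared difference between model predictions and targets."""
    return sum((model(row) - t) ** 2 for row, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=20, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled_rows = []
            for row, value in zip(rows, column):
                new_row = list(row)
                new_row[j] = value  # replace only feature j
                shuffled_rows.append(new_row)
            increases.append(mean_squared_error(model, shuffled_rows, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy setup: the property depends only on feature 0 (e.g. a simulated free volume).
def toy_model(row):
    return 2.0 * row[0]


rows = [(float(i), float(i % 3)) for i in range(10)]
targets = [2.0 * row[0] for row in rows]
importances = permutation_importance(toy_model, rows, targets)
```

Here the first importance is positive and the second is exactly zero, because the toy model ignores the second feature. With a real model, a large importance for a physics-based feature is a hint about mechanism, not proof of causality.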

This simulation integration can be incorporated into conventional MI workflows. Compared to a linear flow of “data collection → featurization → model building,” simulation creates three intervention points.

  • First, at the stage of “Data Generation/Curation,” where data are insufficient, simulation supplies various forms of data. It provides physics-based data—such as energy, force, and dynamics in microscopic regions—that cannot be obtained experimentally.
  • Second, at the stage of “Property Prediction & Inverse Design,” detailed simulation analyses are performed to validate the predicted physical quantities/properties and molecular structures.
  • Third (a more advanced effort), at the stage of “Model Training,” the training process is constrained by physical rules such as conservation of mass and energy, using approaches like Physics-Informed Neural Networks (PINNs), making the model “physics-informed.”
Through active integration with simulation, MI is expected to move beyond statistical correlation analysis toward predictive and design tools grounded in physical mechanisms.
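The physics-constrained training mentioned in the third point can be illustrated by its loss function: a data-fit term plus a penalty for violating a physical rule. The sketch below is not a full PINN (there is no network and no differential-equation residual). It only shows the composite loss, assuming a hypothetical model that predicts phase fractions, which must sum to 1 by conservation of mass. The function name and the `weight` parameter are illustrative.

```python
def physics_informed_loss(predicted, measured, weight=1.0):
    """Data misfit plus a penalty for predicted fractions that do not sum to 1.

    predicted, measured: lists of per-sample fraction lists of equal shape.
    weight: relative strength of the physics penalty (a tuning assumption).
    """
    n_samples = len(predicted)
    n_values = sum(len(sample) for sample in predicted)
    # Mean squared error over every predicted value.
    data_term = sum(
        (p - m) ** 2
        for p_sample, m_sample in zip(predicted, measured)
        for p, m in zip(p_sample, m_sample)
    ) / n_values
    # Mean squared violation of the constraint sum(fractions) == 1.
    physics_term = sum((sum(sample) - 1.0) ** 2 for sample in predicted) / n_samples
    return data_term + weight * physics_term


consistent = physics_informed_loss([[0.6, 0.4]], [[0.6, 0.4]])
violating = physics_informed_loss([[0.7, 0.5]], [[0.6, 0.4]])
```

The second call is penalized twice: once for missing the measured values and once because its fractions sum to 1.2. Minimizing such a loss pushes a model toward predictions that remain physically consistent even where data are sparse.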

Figure 3. Integrating MI and simulation (data augmentation, physics-based features, physics-constrained learning)

4. AI evolution and next-generation R&D workflows

In recent years, the role of AI has evolved significantly beyond the framework of property prediction (MI) toward supporting and automating the entire R&D process. In materials design software, the role of AI can be roughly categorized into three types.

| AI role | Description | Examples |
| --- | --- | --- |
| Data science (MI) | Core MI technology that learns correlations between descriptors and properties using statistical models, enabling property prediction and inverse design for unknown materials. | QSPR models, clustering analysis |
| Accelerator (speed-up) | Techniques to reduce the computational cost of physical simulation and/or improve accuracy by replacing part of a physical model with a machine-learning model. | ML potentials, surrogate models |
| Agent (autonomy) | Tools that support and orchestrate research workflows. They assist researchers in dialogue form and automate parts of the process. | Script generation, modeling support, similar-structure generation, result analysis, semi-autonomous design |

A particularly important recent development is the evolution of generative AI as an “agent.” By leveraging technologies such as large language models (LLMs), efforts are being made to build “semi-autonomous design” workflows in which a researcher gives a natural-language instruction like “I want a material with these properties,” and the AI proposes molecular structures, runs simulations, and even summarizes the results in a report. The technical feasibility of such workflows is now beginning to be discussed.
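The control flow of such a semi-autonomous workflow can be sketched as a loop of propose, simulate, and record, with a human approval step before a candidate is accepted. Everything here is a hypothetical placeholder: `propose`, `simulate`, and `approve` stand in for an LLM-based generator, a simulation run, and a researcher's judgment, and the toy callables below exist only to make the loop runnable.

```python
def semi_autonomous_design(target, propose, simulate, approve,
                           tolerance=0.1, max_rounds=3):
    """Iterate propose -> simulate -> record until a candidate is approved.

    Returns (accepted_candidate_or_None, report), where report lists every
    evaluated candidate so the process stays traceable.
    """
    report = []
    for round_index in range(max_rounds):
        for candidate in propose(target, report):
            predicted = simulate(candidate)
            report.append({"round": round_index,
                           "candidate": candidate,
                           "predicted": predicted})
            # A human (or a rule standing in for one) confirms plausible hits.
            if abs(predicted - target) <= tolerance and approve(candidate, predicted):
                return candidate, report
    return None, report


# Toy stand-ins: candidates are numbers and the "simulation" doubles them.
def toy_propose(target, history):
    start = len(history)
    return [start + 1.0, start + 2.0]


accepted, report = semi_autonomous_design(
    target=6.0,
    propose=toy_propose,
    simulate=lambda candidate: 2.0 * candidate,
    approve=lambda candidate, predicted: True,
)
```

Passing the three steps in as functions keeps the loop independent of any particular AI model or simulator, and the accumulated `report` is the raw material for the summary the agent would present to the researcher.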

In this new environment, the role of human researchers also changes. Rather than being mere operators, they will shift toward higher-level decision-making roles such as:

  • Translating business requirements into concrete objectives for AI to optimize (properties, cost, etc.).
  • Judging whether AI-generated results are physically and chemically valid, based on expert knowledge.
  • Connecting acquired insights to actual product development and business strategy.
Future software will increasingly require not only computational functionality, but also interfaces that enable effective collaboration between AI agents and humans, as well as functions that visualize the grounds for AI decisions. This technical evolution is simultaneously creating momentum for changes in organizational structures needed to make effective use of these tools.

5. Embedding MI in an organization

To implement advanced technologies such as simulation-integrated MI and maximize their value, it is not enough to simply deploy powerful tools. Continuous investment and effort in people, organizational structures, and data infrastructure will be critical factors determining success.

  • To use siloed, person-dependent data formats as organizational assets, it is essential to standardize rules for data collection and descriptor formats. This reduces the cost of reusing data.
  • While individual property data belong to each company’s competitive domain, collaboration in non-competitive domains, such as sharing benchmark problems and unifying data formats, is expected to raise the overall technical level of the industry. Such collaboration across companies and between industry and academia goes beyond what single-company optimization can achieve.
  • Bridge personnel become important: people who understand all three domains of materials science (experiments and phenomenon understanding), simulation (physical modeling and numerical computation), and data science (statistics and machine learning), and who can “translate” between specialists.
  • It is necessary to shift from a passive stance of “what can we do with the data we already have” to proactive workflow design: “how do we accumulate high-quality data as part of daily operations.”

To succeed with Materials Informatics (MI)

This article has outlined the basic ideas of materials informatics (MI) and how simulation integration can help overcome challenges such as data scarcity and black-box behavior.
To correctly understand and control material properties, it is essential to consider hierarchical internal structures and the fact that manufacturing processes determine those structures, which in turn determine properties. Because of this complexity, data-driven MI faces challenges such as limited data and black-box behavior. By positioning simulation as a source of physical information and extracting physically meaningful features, it may be possible to overcome these issues and build explainable, reliable predictive models.
Integrated platforms such as J-OCTA support the adoption of these technologies by providing a foundation for individual researchers to practice them, combining state-of-the-art multiscale simulation with MI. Data accumulation, development of bridge talent, and the establishment of supporting software environments are inseparable challenges. Continued efforts on these fronts are expected to support future materials development and contribute to improved corporate competitiveness.
If you are interested in the technologies introduced here, please feel free to contact us. We also accept consultations on running MI-linked proof-of-concepts (PoCs) using J-OCTA and on building data infrastructure.
