More stories

  • Mathematical innovations enable advances in seismic activity detection

    Amid the geothermal development of the Tohoku region, subtle seismic activity beneath the Earth’s surface presents a fascinating challenge for researchers. Earthquake warnings intermittently alert us to larger seismic events, but numerous smaller quakes go undetected, and resource engineers have long sought to detect and understand them.
    Mathematical innovations from Tohoku University researchers are advancing detection of more types — and fainter forms — of seismic waves, paving the way for more effective earthquake monitoring and risk assessment.
    The results of their study were published in IEEE Transactions on Geoscience and Remote Sensing on January 15, 2024.
    Collection of seismic data relies on the number and positioning of sensors called seismometers. Where only limited deployment of seismic sensors is possible, such as on the planet Mars or in long-term monitoring of captured and stored carbon, optimizing the data extracted from each sensor becomes crucial. One promising method for doing so is polarization analysis, which involves studying 3-D particle motion and has garnered attention for its ability to leverage three-component data, offering more information than one-component data. This approach enables the detection and identification of various polarized seismic waveforms, including S-waves, P-waves and others.
    Polarization analysis using a spectral matrix (SPM) in particular is a technique used to analyze the way particles move in three dimensions over time and at different frequencies, in other words, in the time-frequency domain. However, in scenarios where the desired signal is weak compared to background noise — known as low signal-to-noise ratio (SNR) events, which are typical in underground reservoirs — SPM analysis faces limitations. Due to mathematical constraints, it can only characterize linear particle motion (meaning the fast-moving, easy-to-detect P-waves), making the analysis of other waveforms (such as the secondary arriving S-waves) challenging.
    “We overcame the technical challenges of conventional SPM analysis and expanded it for broader polarization realization by introducing time-delay components,” said Yusuke Mukuhira, an assistant professor at the Institute of Fluid Science of Tohoku University and lead author of the study.
    Compared to existing techniques, his team’s incorporation of time-delay components enhanced the accuracy of SPM analysis, enabling the characterization of various polarized waves, including S-waves, and the detection of low-SNR events with smaller amplitudes.

    A key innovation in the study is the introduction of a new weighting function based on the phase information of the first eigenvector — a special vector that, when multiplied by the matrix, results in a scaled version of the original vector. The purpose of the weighting function is to assign different levels of importance to different parts of signals according to their significance, thereby reducing false alarms. Synthetic waveform tests showed that this addition significantly improved the evaluation of seismic wave polarization, a crucial factor in distinguishing signal from noise.
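    To make this concrete, below is a minimal Python sketch of SPM-style polarization analysis under simplified assumptions: it builds a time-smoothed 3×3 spectral matrix from three-component records in the time-frequency domain, eigen-decomposes it, and derives a degree of polarization together with an illustrative phase-based weight from the first eigenvector. The function name, the smoothing choice and the specific weight formula are illustrative stand-ins, not the authors’ exact algorithm.
    ```python
    import numpy as np
    from scipy.ndimage import uniform_filter1d
    from scipy.signal import stft

    def spm_polarization(z, n, e, fs, nperseg=256, smooth=5):
        """Toy SPM-style analysis of three-component (Z, N, E) seismograms."""
        f, t, Z = stft(z, fs=fs, nperseg=nperseg)
        _, _, N = stft(n, fs=fs, nperseg=nperseg)
        _, _, E = stft(e, fs=fs, nperseg=nperseg)
        comps = np.stack([Z, N, E])                       # (3, nfreq, ntime)

        # Spectral (cross-spectral) matrix per time-frequency bin, smoothed along
        # time so that it is not trivially rank one.
        S = np.einsum('aft,bft->abft', comps, comps.conj())
        S = (uniform_filter1d(S.real, smooth, axis=-1)
             + 1j * uniform_filter1d(S.imag, smooth, axis=-1))

        dop = np.zeros(Z.shape)       # degree of polarization in each bin
        weight = np.zeros(Z.shape)    # illustrative phase-based weight
        for i in range(Z.shape[0]):
            for j in range(Z.shape[1]):
                evals, evecs = np.linalg.eigh(S[:, :, i, j])   # ascending order
                l3, l2, l1 = evals
                tot = evals.sum() + 1e-20
                dop[i, j] = ((l1 - l2)**2 + (l1 - l3)**2 + (l2 - l3)**2) / (2 * tot**2)
                u = evecs[:, -1]                               # first eigenvector
                u = u * np.exp(-1j * np.angle(u[np.argmax(np.abs(u))]))
                # A nearly real eigenvector indicates nearly linear particle motion
                weight[i, j] = 1.0 - np.linalg.norm(u.imag) / (np.linalg.norm(u) + 1e-20)
        return f, t, dop, weight
    ```
    Bins where both the degree of polarization and the weight are high would then be natural candidates for detecting weak, polarized arrivals.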
    “Technically, we have developed a signal processing technique that improves particle motion analysis in the time and frequency domain,” Mukuhira said.
    The research team validated their methodology using real-world data recorded at the Groningen gas field in the Netherlands. The results showcased superior seismic motion detection performance, bringing to light two low-SNR events that had previously gone unnoticed by conventional methods.
    These findings hold the potential for applications across various fields, including seismology and geophysics, particularly in monitoring underground conditions with limited observation points. The implications extend to earthquake monitoring, planetary exploration and resource development.

  • Engineering household robots to have a little common sense

    From wiping up spills to serving up food, robots are being taught to carry out increasingly complicated household tasks. Many such home-bot trainees are learning through imitation; they are programmed to copy the motions that a human physically guides them through.
    It turns out that robots are excellent mimics. But unless engineers also program them to adjust to every possible bump and nudge, robots don’t necessarily know how to handle these situations, short of starting their task from the top.
    Now MIT engineers are aiming to give robots a bit of common sense when faced with situations that push them off their trained path. They’ve developed a method that connects robot motion data with the “common sense knowledge” of large language models, or LLMs.
    Their approach enables a robot to logically parse a given household task into subtasks, and to physically adjust to disruptions within a subtask so that the robot can move on without having to go back and start a task from scratch — and without engineers having to explicitly program fixes for every possible failure along the way.
    “Imitation learning is a mainstream approach enabling household robots. But if a robot is blindly mimicking a human’s motion trajectories, tiny errors can accumulate and eventually derail the rest of the execution,” says Yanwei Wang, a graduate student in MIT’s Department of Electrical Engineering and Computer Science (EECS). “With our method, a robot can self-correct execution errors and improve overall task success.”
    Wang and his colleagues detail their new approach in a study they will present at the International Conference on Learning Representations (ICLR) in May. The study’s co-authors include EECS graduate students Tsun-Hsuan Wang and Jiayuan Mao; Michael Hagenow, a postdoc in MIT’s Department of Aeronautics and Astronautics (AeroAstro); and Julie Shah, the H.N. Slater Professor in Aeronautics and Astronautics at MIT.
    Language task
    The researchers illustrate their new approach with a simple chore: scooping marbles from one bowl and pouring them into another. To accomplish this task, engineers would typically move a robot through the motions of scooping and pouring — all in one fluid trajectory. They might do this multiple times, to give the robot a number of human demonstrations to mimic.

    “But the human demonstration is one long, continuous trajectory,” Wang says.
    The team realized that, while a human might demonstrate a single task in one go, that task depends on a sequence of subtasks, or trajectories. For instance, the robot has to first reach into a bowl before it can scoop, and it must scoop up marbles before moving to the empty bowl, and so forth. If a robot is pushed or nudged into a mistake during any of these subtasks, its only recourse is to stop and start from the beginning, unless engineers explicitly label each subtask and program or collect new demonstrations for recovering from every possible failure, so that the robot can self-correct in the moment.
    “That level of planning is very tedious,” Wang says.
    Instead, he and his colleagues found some of this work could be done automatically by LLMs. These deep learning models process immense libraries of text, which they use to establish connections between words, sentences, and paragraphs. Through these connections, an LLM can then generate new sentences based on what it has learned about the kind of word that is likely to follow the last.
    For their part, the researchers found that in addition to sentences and paragraphs, an LLM can be prompted to produce a logical list of subtasks that would be involved in a given task. For instance, if queried to list the actions involved in scooping marbles from one bowl into another, an LLM might produce a sequence of verbs such as “reach,” “scoop,” “transport,” and “pour.”
    “LLMs have a way to tell you how to do each step of a task, in natural language. A human’s continuous demonstration is the embodiment of those steps, in physical space,” Wang says. “And we wanted to connect the two, so that a robot would automatically know what stage it is in a task, and be able to replan and recover on its own.”
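    As a rough illustration of this decomposition step, the sketch below asks an LLM for a subtask list and parses its reply; the query_llm helper is a hypothetical placeholder for whatever chat model is used, and the prompt wording is illustrative rather than taken from the paper.
    ```python
    # Hypothetical sketch: decomposing a household task into subtasks with an LLM.

    def query_llm(prompt: str) -> str:
        # Placeholder: a real implementation would call a chat-completion service.
        return "1. reach\n2. scoop\n3. transport\n4. pour"

    def subtasks_for(task: str) -> list[str]:
        prompt = (f"List, in order, the short verb-like subtasks needed to {task}. "
                  "Return one subtask per line as a numbered list.")
        reply = query_llm(prompt)
        # Strip the numbering, keeping only the verb labels.
        return [line.split(".", 1)[-1].strip()
                for line in reply.splitlines() if line.strip()]

    print(subtasks_for("scoop marbles from one bowl and pour them into another"))
    # ['reach', 'scoop', 'transport', 'pour']
    ```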
    Mapping marbles

    For their new approach, the team developed an algorithm to automatically connect an LLM’s natural language label for a particular subtask with a robot’s position in physical space or an image that encodes the robot state. Mapping a robot’s physical coordinates, or an image of the robot state, to a natural language label is known as “grounding.” The team’s new algorithm is designed to learn a grounding “classifier,” meaning that it learns to automatically identify what semantic subtask a robot is in — for example, “reach” versus “scoop” — given its physical coordinates or an image view.
    “The grounding classifier facilitates this dialogue between what the robot is doing in the physical space and what the LLM knows about the subtasks, and the constraints you have to pay attention to within each subtask,” Wang explains.
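    A minimal sketch of what such a grounding classifier could look like, assuming the robot state is summarized by low-dimensional coordinates and the labels come from the LLM’s subtask list; the synthetic data, the two-dimensional features and the logistic-regression model are illustrative assumptions, not the authors’ architecture.
    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Illustrative training data: robot states sampled along demonstrations, each
    # labeled with the LLM-provided subtask it belongs to.
    SUBTASKS = ["reach", "scoop", "transport", "pour"]
    CENTERS = [(0.2, 0.0), (0.3, -0.1), (0.6, 0.1), (0.9, 0.1)]   # made-up clusters

    rng = np.random.default_rng(0)
    X, y = [], []
    for label, center in zip(SUBTASKS, CENTERS):
        X.append(np.array(center) + 0.03 * rng.standard_normal((50, 2)))
        y += [label] * 50
    X = np.vstack(X)

    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # At run time the classifier reports which subtask the robot is currently in,
    # so a perturbed robot can replan from that stage instead of restarting.
    print(clf.predict([[0.31, -0.08], [0.88, 0.12]]))   # e.g. ['scoop', 'pour']
    ```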
    The team demonstrated the approach in experiments with a robotic arm that they trained on a marble-scooping task. Experimenters trained the robot by physically guiding it through the task of first reaching into a bowl, scooping up marbles, transporting them over an empty bowl, and pouring them in. After a few demonstrations, the team then used a pretrained LLM and asked the model to list the steps involved in scooping marbles from one bowl to another. The researchers then used their new algorithm to connect the LLM’s defined subtasks with the robot’s motion trajectory data. The algorithm automatically learned to map the robot’s physical coordinates in the trajectories and the corresponding image view to a given subtask.
    The team then let the robot carry out the scooping task on its own, using the newly learned grounding classifiers. As the robot moved through the steps of the task, the experimenters pushed and nudged the bot off its path, and knocked marbles off its spoon at various points. Rather than stop and start from the beginning again, or continue blindly with no marbles on its spoon, the bot was able to self-correct, and completed each subtask before moving on to the next. (For instance, it would make sure that it successfully scooped marbles before transporting them to the empty bowl.)
    “With our method, when the robot is making mistakes, we don’t need to ask humans to program or give extra demonstrations of how to recover from failures,” Wang says. “That’s super exciting because there’s a huge effort now toward training household robots with data collected on teleoperation systems. Our algorithm can now convert that training data into robust robot behavior that can do complex tasks, despite external perturbations.”

  • GPT-4 for identifying cell types in single cells matches and sometimes outperforms expert methods

    GPT-4 can accurately annotate cell types from single-cell RNA sequencing data — an analysis step fundamental to interpreting such experiments — with results highly consistent with time-consuming manual annotation of gene information by human experts, according to a study at Columbia University Mailman School of Public Health. The findings are published in the journal Nature Methods.
    GPT-4 is a large language model designed for language understanding and generation. Upon assessment across numerous tissue and cell types, GPT-4 has demonstrated the ability to produce cell type annotations that closely align with manual annotations by human experts and surpass existing automatic algorithms. This capability has the potential to significantly lessen the amount of effort and expertise needed for annotating cell types, a process that can take months. Moreover, the researchers have developed GPTCelltype, an R software package, to facilitate the automated annotation of cell types using GPT-4.
    “The process of annotating cell types for single cells is often time-consuming, requiring human experts to compare genes across cell clusters,” said Wenpin Hou, PhD, assistant professor of Biostatistics at Columbia Mailman School. “Although automated cell type annotation methods have been developed, manual methods to interpret scientific data remain widely used, and such a process can take weeks to months. We hypothesized that GPT-4 can accurately annotate cell types, transitioning the process from manual to a semi- or even fully automated procedure and be cost-efficient and seamless.”
    The researchers assessed GPT-4’s performance across ten datasets covering five species, hundreds of tissue and cell types, and both normal and cancer samples. GPT-4 was queried using GPTCelltype, the software tool developed by the researchers. For comparison, they also evaluated other GPT versions and used manual annotation as the reference.
    As a first step, the researchers explored the various factors that may affect the annotation accuracy of GPT-4. They found that GPT-4 performs best when using the top 10 differentially expressed genes and exhibits similar accuracy across various prompt strategies, including a basic prompt strategy, a chain-of-thought-inspired prompt strategy that includes reasoning steps, and a repeated prompt strategy. GPT-4 matched manual analyses in over 75 percent of cell types in most studies and tissues, demonstrating its competency in generating expert-comparable cell type annotations. In addition, low agreement between GPT-4 and manual annotations in some cell types does not necessarily imply that GPT-4’s annotation is incorrect; in an example involving stromal or connective tissue cells, GPT-4 provided more accurate cell type annotations. GPT-4 was also notably faster.
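    For readers curious what such a query looks like in practice, here is a hedged sketch of the basic prompt strategy: pass a cluster’s top differentially expressed genes to GPT-4 and ask for a cell type. The query_gpt4 helper, the prompt wording and the example marker genes are illustrative placeholders, not the actual interface of the GPTCelltype package.
    ```python
    # Illustrative sketch of the basic prompt strategy for cell type annotation.

    def query_gpt4(prompt: str) -> str:
        # Placeholder: a real implementation would send the prompt to GPT-4.
        return "Natural killer (NK) cells"

    def annotate_cluster(marker_genes: list[str], tissue: str) -> str:
        prompt = (f"Identify the most likely cell type of a {tissue} cell cluster "
                  f"whose top differentially expressed genes are: "
                  f"{', '.join(marker_genes)}. Answer with the cell type name only.")
        return query_gpt4(prompt)

    print(annotate_cluster(["GNLY", "NKG7", "KLRD1", "PRF1", "GZMB"], "human blood"))
    ```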
    Hou and her colleague also assessed GPT-4’s robustness in complex real data scenarios and found that GPT-4 can distinguish between pure and mixed cell types with 93 percent accuracy, and can differentiate between known and unknown cell types with 99 percent accuracy. They also evaluated the reproducibility of GPT-4’s annotations using prior simulation studies: GPT-4 generated identical annotations for the same marker genes in 85 percent of cases. “All of these results demonstrate GPT-4’s robustness in various scenarios,” observed Hou.
    While GPT-4 surpasses existing methods, there are limitations to consider, according to Hou, including the challenge of verifying GPT-4’s quality and reliability because the model discloses little about its training procedures.
    “Since our study focuses on the standard version of GPT-4, fine-tuning GPT-4 could further improve cell type annotation performance,” said Hou.
    Zhicheng Ji of Duke University School of Medicine is a co-author.
    The study was supported by the National Institutes of Health, grants U54AG075936 and R35GM150887.

  • Pairing crypto mining with green hydrogen offers clean energy boost

    Pairing cryptocurrency mining — notable for its outsize consumption of carbon-based fuel — with green hydrogen could provide the foundation for wider deployment of renewable energy, such as solar and wind power, according to a new Cornell University study.
    “Since current cryptocurrency operations now contribute heavily to worldwide carbon emissions, it becomes vital to explore opportunities for harnessing the widespread enthusiasm for cryptocurrency as we move toward a sustainable and climate-friendly future,” said Fengqi You, professor of energy systems engineering at Cornell.
    You and doctoral student Apoorv Lal are authors of “Climate Sustainability Through a Dynamic Duo: Green Hydrogen and Crypto Driving Energy Transition and Decarbonization,” which published March 25 in the Proceedings of the National Academy of Sciences.
    Their research shows how linking the use of energy-intensive cryptocurrency mining with green hydrogen technology — the “dynamic duo,” they call it — can boost renewable energy sectors.
    “Building a green hydrogen infrastructure to help produce cryptocurrency can accelerate renewable energy and create a more sustainable energy landscape,” Lal said.
    Using clean energy sources to power blockchain mining operations and fuel the production of green hydrogen can lead to growing wind and solar capacity — and expand sustainable energy production across the country, the researchers said.
    In its current structure, mining blockchain-based cryptocurrency in the U.S. can use as much carbon-based energy as the entire country of Argentina, according to a 2022 White House Office of Science and Technology report. Nearly all domestic crypto-mining electricity goes to power-hungry computational consensus mechanisms, known as “proof of work,” which are used to verify crypto-assets.

    Preliminary estimates by the U.S. Energy Information Administration suggest that 2023 annual electricity consumption for cryptocurrency mining likely represents from 0.6% to 2.3% of all U.S. electricity consumption.
    “Acknowledging the substantial energy demands of cryptocurrency mining, our research proposes an innovative technology solution,” You said. “By leveraging cryptocurrencies as virtual energy carriers in tandem with using green hydrogen, we can transform what was once an environmental challenge into a dynamic force for climate mitigation and sustainability.”
    In their research, You and Lal examined individual U.S. states to assess potential energy strengths in each region.
    Supporting cryptocurrency can hasten the building of extra energy infrastructure and potentially create 78.4 megawatt-hours of solar energy for each Bitcoin mined in New Mexico, for example, and potentially 265.8 megawatt-hours of wind energy for each Bitcoin mined in Wyoming, according to the paper.
    “While cryptocurrency currently has a high dollar value (Bitcoin traded for more than $73,000 on March 13), you cannot hold it in your hand,” You said. “It’s virtual. Think of cryptocurrency and energy in the same way — much like a gift-card concept. Cryptocurrency also can hold an energy value and that becomes an additional function.”
    To advance a sustainable future for blockchain-based cryptocurrency, the researchers said, stronger federal policies for climate goals and renewable energy need to advance.
    “Coupled with green hydrogen, this approach to cryptocurrency not only mitigates its own environmental impact, but pioneers a sustainable path for renewable energy transition,” You said. “It’s a novel strategy.”
    You is a senior faculty fellow at the Cornell Atkinson Center for Sustainability. Funding for this work was provided by the National Science Foundation.

  • Pushing back the limits of optical imaging by processing trillions of frames per second

    Pushing for a higher speed isn’t just for athletes. Researchers, too, can achieve such feats with their discoveries. This is the case for Jinyang Liang, Professor at the Institut national de la recherche scientifique (INRS), and his team, whose research results have recently been published in Nature Communications.
    The group based at INRS’ Énergie Matériaux Télécommunications Research Centre has developed a new ultrafast camera system that can capture up to 156.3 trillion frames per second with astonishing precision. For the first time, 2D optical imaging of ultrafast demagnetization in a single shot is possible. This new device called SCARF (for swept-coded aperture real-time femtophotography) can capture transient absorption in a semiconductor and ultrafast demagnetization of a metal alloy. This new method will help push forward the frontiers of knowledge in a wide range of fields, including modern physics, biology, chemistry, materials science, and engineering.
    Improving on past advances
    Professor Liang is known around the world as a pioneer of ultrafast imaging. Already, in 2018, he was the principal developer of a major breakthrough in the field, which laid the groundwork for the development of SCARF.
    Until now, ultrafast camera systems have mainly used an approach involving sequentially capturing frames one by one. They would acquire data through brief, repeated measurements, then put everything together to create a movie that reconstructed the observed movement.
    “However, this approach can only be applied to inert samples or to phenomena that happen the exact same way each time. Fragile samples, not to mention non-repeatable phenomena or phenomena with ultrafast speeds, cannot be observed with this method,” said Professor Jinyang Liang, an expert in ultrafast and biophotonic imaging. “For example, phenomena such as femtosecond laser ablation, shock-wave interaction with living cells, and optical chaos cannot be studied this way.”
    The first tool developed by Professor Liang helped fill this gap. The T-CUP (Trillion-frame-per-second compressed ultrafast photography) system was based on passive femtosecond imaging capable of acquiring ten trillion (10¹³) frames per second. This was a major first step towards ultrafast, single-shot real-time imaging.

    Yet challenges still remained.
    “Many systems based on compressed ultrafast photography have to cope with degraded data quality and must trade off sequence depth against field of view. These limitations are attributable to the operating principle, which requires simultaneously shearing the scene and the coded aperture,” said Miguel Marquez, a postdoctoral fellow and co-first author of the study. “SCARF overcomes these challenges. Its imaging modality enables ultrafast sweeping of a static coded aperture while not shearing the ultrafast phenomenon. This provides full-sequence encoding rates of up to 156.3 THz to individual pixels on a camera with a charge-coupled device (CCD). These results can be obtained in a single shot at tunable frame rates and spatial scales in both reflection and transmission modes.”
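    To convey the general principle (not the SCARF implementation itself), the toy sketch below shows how a transient scene that is masked by a static binary code and swept across the detector during a single exposure becomes one linear measurement of all its frames; recovering the frames then amounts to solving the resulting inverse problem, typically with sparsity-exploiting algorithms. The dimensions and code are arbitrary illustrative choices.
    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    T, P = 4, 16                   # frames and scene pixels (toy sizes)
    D = P + T - 1                  # detector pixels spanned by the sweep
    code = rng.integers(0, 2, P)   # static binary coded aperture

    # Linear operator mapping the flattened scene (T*P values) to the snapshot (D):
    # frame t is masked by the code and lands on the detector shifted by t pixels.
    A = np.zeros((D, T * P))
    for t in range(T):
        for p in range(P):
            A[t + p, t * P + p] = code[p]

    x = np.zeros((T, P))
    x[1, 5] = x[2, 9] = 1.0        # a brief transient: two bright pixels in time
    y = A @ x.ravel()              # the single coded, time-integrated exposure

    # A minimum-norm least-squares inversion gives only a rough estimate; practical
    # systems use regularized, sparsity-promoting reconstructions instead.
    x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(np.round(x_hat.reshape(T, P), 2))
    ```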
    A range of applications
    SCARF makes it possible to observe unique phenomena that are ultrafast, non-repeatable, or difficult to reproduce, such as shock wave mechanics in living cells or matter. These advances could potentially be used to develop better pharmaceutics and medical treatments.
    What’s more, SCARF promises very appealing economic spin-offs. Two companies, Axis Photonique and Few-Cycle, are already working with Professor Liang’s team to produce a marketable version of their patent-pending discovery. This represents a great opportunity for Quebec to strengthen its already enviable position as a leader in photonics.
    The work was carried out in the Advanced Laser Light Source (ALLS) Laboratory in collaboration with Professor François Légaré, Director of the Énergie Matériaux Télécommunications Research Centre, and international colleagues Michel Hehn, Stéphane Mangin and Grégory Malinowski of the Institut Jean Lamour at the Université de Lorraine (France) and Zhengyan Li of Huazhong University of Science and Technology (China).
    This research was funded by the Natural Sciences and Engineering Research Council of Canada, the Canada Research Chairs Program, the Canada Foundation for Innovation, the Ministère de l’Économie et de l’Innovation du Québec, the Canadian Cancer Society, the Government of Canada’s New Frontiers in Research Fund, as well as the Fonds de recherche du Québec – Nature et Technologies and the Fonds de recherche du Québec – Santé.

  • Scientists deliver quantum algorithm to develop new materials and chemistry

    U.S. Naval Research Laboratory (NRL) scientists published the Cascaded Variational Quantum Eigensolver (CVQE) algorithm in a recent Physical Review Research article; it is expected to become a powerful tool for investigating the physical properties of electronic systems.
    The CVQE algorithm is a variant of the Variational Quantum Eigensolver (VQE) algorithm that only requires the execution of a set of quantum circuits once rather than at every iteration during the parameter optimization process, thereby increasing the computational throughput.
    “Both algorithms produce a quantum state close to the ground state of a system, which is used to determine many of the system’s physical properties,” said John Stenger, Ph.D., a Theoretical Chemistry Section research physicist. “Calculations that previously took months can now be performed in hours.”
    The CVQE algorithm uses a quantum computer to probe the needed probability mass functions and a classical computer to perform the remaining calculations, including the energy minimization.
    “Finding the minimum energy is computationally hard as the size of the state space grows exponentially with the system size,” said Steve Hellberg, Ph.D., a Theory of Advanced Functional Materials Section research physicist. “Except for very small systems, even the world’s most powerful supercomputers are unable to find the exact ground state.”
    To address this challenge, scientists use a quantum computer with a qubit register, whose state space also increases exponentially, in this case with the number of qubits. By representing the states of a physical system on the state space of the register, a quantum computer can be used to simulate the states in the exponentially large representation space of the system.
    Data can subsequently be extracted by quantum measurements. As quantum measurements are not deterministic, the quantum circuit executions must be repeated multiple times to estimate probability distributions describing the states, a process known as sampling. Variational quantum algorithms, including the CVQE algorithm, identify trial states by a set of parameters that are optimized to minimize the energy.
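    The sketch below illustrates the variational principle these methods share, under deliberately simplified assumptions: a trial state parameterized by a few angles, an energy expectation value, and a classical optimizer that minimizes it. It is a classical toy for a single qubit, not the NRL CVQE algorithm, and involves no sampling on quantum hardware.
    ```python
    import numpy as np
    from scipy.optimize import minimize

    # Toy single-qubit Hamiltonian H = Z + 0.5 X (illustrative choice)
    X = np.array([[0, 1], [1, 0]], dtype=complex)
    Z = np.array([[1, 0], [0, -1]], dtype=complex)
    H = Z + 0.5 * X

    def trial_state(theta, phi):
        # |psi> = cos(theta/2)|0> + exp(i*phi) sin(theta/2)|1>
        return np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])

    def energy(params):
        psi = trial_state(*params)
        return np.real(psi.conj() @ H @ psi)    # <psi|H|psi>

    result = minimize(energy, x0=[0.1, 0.0], method="Nelder-Mead")
    exact = np.linalg.eigvalsh(H)[0]
    print(f"variational energy {result.fun:.6f} vs exact ground energy {exact:.6f}")
    ```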
    “The key difference between the original VQE method and the new CVQE method is that the sampling and optimization processes have been decoupled in the latter such that the sampling can be performed exclusively on the quantum computer and the parameters processed exclusively on a classical computer,” said Dan Gunlycke, D.Phil., Theoretical Chemistry Section Head, who also leads the NRL quantum computing effort. “The new approach also has other benefits. The form of the solution space does not have to comport with the symmetry requirements of the qubit register, and therefore, it is much easier to shape the solution space and implement symmetries of the system and other physically motivated constraints, which will ultimately lead to more accurate predictions of electronic system properties.”
    Quantum computing is a component of quantum science, which has been designated as a Critical Technology Area within the USD(R&E) Technology Vision for an Era of Competition by the Under Secretary of Defense for Research and Engineering Heidi Shyu.
    “Understanding the properties of quantum-mechanical systems is essential in the development of new materials and chemistry for the Navy and Marine Corps,” Gunlycke said. “Corrosion, for instance, is an omnipresent challenge costing the Department of Defense billions every year. The CVQE algorithm can be used to study the chemical reactions causing corrosion and provide critical information to our existing anticorrosion teams in their quest to develop better coatings and additives.”

  • The world is one step closer to secure quantum communication on a global scale

    Researchers at the University of Waterloo’s Institute for Quantum Computing (IQC) have brought together two Nobel prize-winning research concepts to advance the field of quantum communication.
    Scientists can now efficiently produce nearly perfect entangled photon pairs from quantum dot sources.
    Entangled photons are particles of light that remain connected, even across large distances, and the 2022 Nobel Prize in Physics recognized experiments on this topic. Combining entanglement with quantum dots, a technology recognized with the Nobel Prize in Chemistry in 2023, the IQC research team aimed to optimize the process for creating entangled photons, which have a wide variety of applications, including secure communications.
    “The combination of a high degree of entanglement and high efficiency is needed for exciting applications such as quantum key distribution or quantum repeaters, which are envisioned to extend the distance of secure quantum communication to a global scale or link remote quantum computers,” said Dr. Michael Reimer, professor at IQC and Waterloo’s Department of Electrical and Computer Engineering. “Previous experiments only measured either near-perfect entanglement or high efficiency, but we’re the first to achieve both requirements with a quantum dot.”
    By embedding semiconductor quantum dots into a nanowire, the researchers created a source that produces near-perfect entangled photon pairs 65 times more efficiently than previous work. This new source, developed in collaboration with the National Research Council of Canada in Ottawa, can be excited with lasers to generate entangled pairs on command. The researchers then used high-resolution single photon detectors provided by Single Quantum in The Netherlands to boost the degree of entanglement.
    “Historically, quantum dot systems were plagued with a problem called fine structure splitting, which causes an entangled state to oscillate over time. This meant that measurements taken with a slow detection system would prevent the entanglement from being measured,” said Matteo Pennacchietti, a PhD student at IQC and Waterloo’s Department of Electrical and Computer Engineering. “We overcame this by combining our quantum dots with a very fast and precise detection system. We can basically take a timestamp of what the entangled state looks like at each point during the oscillations, and that’s where we have the perfect entanglement.”
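    A small numerical illustration of the effect Pennacchietti describes, under simplified assumptions: the two-photon state oscillates between Bell states at the fine-structure-splitting frequency, so a detector that integrates over a long window washes out the measured entanglement, while fine time bins preserve it. The oscillation frequency and window lengths below are arbitrary illustrative numbers, not the experimental values.
    ```python
    import numpy as np

    def concurrence(rho):
        """Wootters concurrence of a two-qubit density matrix (0 = none, 1 = maximal)."""
        sy = np.array([[0, -1j], [1j, 0]])
        yy = np.kron(sy, sy)
        R = rho @ yy @ rho.conj() @ yy
        lam = np.sqrt(np.clip(np.sort(np.real(np.linalg.eigvals(R)))[::-1], 0, None))
        return max(0.0, lam[0] - lam[1] - lam[2] - lam[3])

    def time_averaged_state(omega, window, steps=2000):
        """State seen by a detector integrating |psi(t)> = (|HH> + e^{i w t}|VV>)/sqrt(2)."""
        HH = np.array([1, 0, 0, 0], dtype=complex)
        VV = np.array([0, 0, 0, 1], dtype=complex)
        rho = np.zeros((4, 4), dtype=complex)
        for t in np.linspace(0.0, window, steps):
            psi = (HH + np.exp(1j * omega * t) * VV) / np.sqrt(2)
            rho += np.outer(psi, psi.conj())
        return rho / steps

    omega = 2 * np.pi          # one oscillation per nanosecond (illustrative)
    for window in (0.01, 0.1, 1.0, 10.0):   # detector timing window in nanoseconds
        print(window, round(concurrence(time_averaged_state(omega, window)), 3))
    # Short windows keep the concurrence near 1; long windows average it away.
    ```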
    To showcase future communications applications, Reimer and Pennacchietti worked with Dr. Norbert Lütkenhaus and Dr. Thomas Jennewein, both IQC faculty members and professors in Waterloo’s Department of Physics and Astronomy, and their teams. Using their new quantum dot entanglement source, the researchers simulated a secure communications method known as quantum key distribution, proving that the quantum dot source holds significant promise in the future of secure quantum communications.

  • Rectifying AI’s usage in the quest for thermoelectric materials

    Using AI, a team of researchers has identified a thermoelectric material with potentially favorable properties. The group was able to navigate AI’s conventional pitfalls, giving a prime example of how AI can revolutionize materials science.
    Details of their finding were published in the journal Science China Materials on March 8, 2024.
    “Traditional methods of finding suitable materials involve trial-and-error, which is time-consuming and often expensive,” says Hao Li, associate professor at Tohoku University’s Advanced Institute for Materials Research (WPI-AIMR) and corresponding author of the paper. “AI transforms this by combing through databases to identify potential materials that can then be experimentally verified.”
    Still, challenges remain. Large-scale materials datasets sometimes contain errors, and overfitting of the predicted temperature-dependent properties is also a common problem. Overfitting occurs when a model learns to capture noise or random fluctuations in the training data rather than the underlying pattern or relationship. As a result, the model performs well on the training data but fails to generalize to new, unseen data. When predicting temperature-dependent properties, overfitting can lead to inaccurate predictions when the model encounters conditions outside the range of the training data.
    Li and his colleagues sought to overcome these issues while developing a thermoelectric material. Such materials convert heat energy into electrical energy, or vice versa, so modeling their temperature-dependent properties accurately is critical.
    “First, we performed a series of rational actions to identify and discard questionable data, obtaining 92,291 data points comprising 7,295 compositions and different temperatures from the Starrydata2 database — an online database that collects digital data from published papers,” states Li.
    Following this, the researchers implemented a composition-based cross-validation method. Crucially, they emphasized that data points with the same compositions but different temperatures should not be split into different sets to avoid overfitting.
    The researchers then built machine learning models using the Gradient Boosting Decision Tree method. The model achieved remarkable R² values of 0.89, ~0.90, and ~0.89 on the training dataset, the test dataset, and new out-of-sample experimental data released in 2023, demonstrating the model’s accuracy in predicting newly available materials.
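    A minimal sketch of the data-splitting idea on a toy dataset: grouping by composition (here with scikit-learn’s GroupKFold) keeps all temperature points of one composition in the same fold, and a gradient-boosted tree model is then evaluated only on compositions it has never seen. The column names, the single stand-in descriptor and the synthetic target are assumptions for illustration, not the Starrydata2 features.
    ```python
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import GroupKFold

    # Toy dataset: each composition measured at several temperatures.
    rng = np.random.default_rng(0)
    comps = np.repeat([f"compound_{i}" for i in range(30)], 10)
    temps = np.tile(np.linspace(300, 800, 10), 30)
    descriptor = rng.normal(size=comps.size)                 # stand-in feature
    target = 0.5 * descriptor + 1e-3 * temps + rng.normal(scale=0.1, size=comps.size)

    df = pd.DataFrame({"composition": comps, "T": temps, "x": descriptor, "zT": target})
    X, y = df[["T", "x"]].to_numpy(), df["zT"].to_numpy()

    # GroupKFold by composition: a compound's temperature points never straddle
    # the train/test boundary, which avoids the overfitting described above.
    scores = []
    for tr, te in GroupKFold(n_splits=5).split(X, y, groups=df["composition"]):
        model = GradientBoostingRegressor().fit(X[tr], y[tr])
        scores.append(model.score(X[te], y[te]))             # R^2 per fold
    print("mean R^2 on held-out compositions:", round(float(np.mean(scores)), 3))
    ```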
    “We could use this model to carry out a large-scale evaluation of the stable materials from the Materials Project database, predicting the potential thermoelectric performance of new materials and providing guidance for experiments,” states Xue Jia, Assistant Professor at WPI-AIMR, and co-author of the paper.
    Ultimately, the study illustrates the importance of following rigorous guidelines for data preprocessing and data splitting in machine learning so that it can address pressing issues in materials science. The researchers are optimistic that their strategy can also be applied to other materials, such as electrocatalysts and batteries.