Science can be messy. And so can the data collected in the process. When the data has repeats, overlaps, or isn’t directly tied to an on-the-ground measurement, standard computer programs can’t resolve the ambiguity. With the colossal amounts of data collected in every field from healthcare to astronomy, new tools to process this information are more important than ever. Andrea Bertozzi, Distinguished Professor of Mathematics and Mechanical and Aerospace Engineering at UCLA, is the lead university principal investigator for a new research initiative to develop public domain tools that sort through complicated scientific data and reduce the amount of data needed.
The team won a Data Reduction for Science award from the Department of Energy in September, providing $4 million over three years. Bertozzi hosted the inaugural meeting of the group, which includes scientists from the University of Utah and Los Alamos National Laboratory, at the California NanoSystems Institute (CNSI) on UCLA’s campus.
When scientists collect large amounts of data, such as reflectance measurements of a glacier taken from space, there can be overlaps in the data. For example, each image stitched together to form an overall map will overlap its neighbors at the edges. While there are established solutions for connecting different chunks of the same map, other problems are more complex and involve large amounts of mostly unlabeled data. When huge data sets lack labels, scientists cannot analyze them all by hand.
Bertozzi and her colleagues are developing a program that uses artificial intelligence to identify a small subset of the available training data that can be used to accurately classify new data. The main goal of the meeting was to organize the project. To broaden their vision for the project, the team connected with scientists from other disciplines at UCLA who have large amounts of data that could benefit from their methods. “The more versatile our algorithms are, the better,” says Bertozzi.
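The article does not name the team’s specific algorithm, but one standard way to pick a small, informative subset of training data is active learning with uncertainty sampling: train on a few labeled points, then repeatedly ask an expert to label only the points the model is least sure about. Below is a minimal sketch of that general idea, assuming a scikit-learn classifier on synthetic data; the dataset, model, seed size, and query budget are all placeholders, not the project’s actual method.

```python
# Sketch of uncertainty-based active learning: iteratively label the points
# the current model is least sure about, so a small labeled subset can drive
# classification of the rest. Illustrative only, not the team's algorithm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

labeled = list(rng.choice(len(X), size=10, replace=False))  # tiny seed set
pool = [i for i in range(len(X)) if i not in labeled]       # unlabeled pool

model = LogisticRegression(max_iter=1000)
for _ in range(20):  # query budget: 20 expert labels (placeholder value)
    model.fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[pool])
    margins = np.abs(probs[:, 1] - probs[:, 0])  # small margin = uncertain
    pick = pool[int(np.argmin(margins))]         # most ambiguous point
    labeled.append(pick)                         # "ask the expert" for its label
    pool.remove(pick)

print(f"labels used: {len(labeled)} of {len(X)}; "
      f"accuracy on the rest: {model.score(X[pool], y[pool]):.3f}")
```

The key design choice in any such scheme is the query rule: rather than labeling data at random, the algorithm spends the expert’s limited time on the examples that most reduce the model’s uncertainty.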
Another problem scientists face is having many indirect measurements but very few direct measurements to relate them to. For instance, they might have remote sensing data covering a large area of a glacier but few, if any, in-person measurements for even a small region of it. This is especially common when the area is remote, such as in the Arctic. In the first paper produced from this project, published earlier this month, Bertozzi and her colleagues developed a method that uses a small set of images hand-annotated by experts to analyze large amounts of satellite data lacking such laborious annotations. Even from space, they can determine the boundaries between bodies of water, like rivers, and the surrounding soil and rocks.
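The paper’s specifics are not given here, but a common way to let a handful of expert annotations classify a large unlabeled collection, in the spirit of the glacier work, is graph-based label propagation: labels spread from the few annotated samples to similar unlabeled ones through feature-space similarity. A sketch using scikit-learn’s LabelSpreading, with synthetic two-cluster data as a hypothetical stand-in for “water” versus “land” pixel features:

```python
# Sketch of graph-based label propagation: 15 expert-labeled samples spread
# their labels to thousands of unlabeled ones via a k-nearest-neighbor graph.
# The synthetic clusters stand in for pixel features; not the paper's method.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_blobs(n_samples=3000, centers=2, random_state=1)
y = np.full(len(X), -1)  # -1 marks a sample as unlabeled
expert_idx = np.random.default_rng(1).choice(len(X), size=15, replace=False)
y[expert_idx] = y_true[expert_idx]  # only 15 hand-annotated points

model = LabelSpreading(kernel="knn", n_neighbors=10).fit(X, y)
agreement = (model.transduction_ == y_true).mean()
print(f"{(y == -1).sum()} unlabeled points classified; "
      f"agreement with ground truth: {agreement:.3f}")
```

Because the labels travel along the similarity graph rather than through a model trained on massive labeled sets, this family of methods needs far fewer annotations than conventional supervised deep learning.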
Bertozzi says the new methods her team is developing require “orders of magnitude less training data than the deep learning method.” That efficiency let them use the small amount of hand-labeled data to guide the analysis of the much larger body of satellite data.
Tags: AI, Grants, mathematics