Global Leader in Big-Data Science
By Eric Swedlund
UA Earns Second $50 Million Grant
What began in 2008 at the University of Arizona as iPlant, a computing tool for plant science, has expanded into an invaluable platform for research of all sorts – earning a second $50 million grant and positioning the university as an international leader in big-data science.
Renamed CyVerse this year, the UA’s cyber infrastructure is used by researchers around the world to store, manage and analyze data sets of staggering size and complexity. The data is key to helping scientists gain new insights in areas like genomics, climate modeling and astronomy, fields where the collection of mass data has outpaced researchers’ ability to make sense of it all.
“This kind of infrastructure is absolutely crucial to the future of science,” said Parker Antin, CyVerse’s principal investigator. “More and more, science involves very large data sets and the analysis of those data sets to create understanding. The data sets are so large and complex and the handling of them is so difficult, that you need this sort of large infrastructure to do it for you, otherwise you grind to a halt very quickly.”
CyVerse marshals the technological resources necessary to crunch big data in a way that frees individual researchers from having to create their own platforms for every unique set of data. The simplicity and flexibility that characterize CyVerse allow any researcher to use whatever tools are necessary, without having to program or even understand how those tools operate.
“It is our job to connect data across all levels to see the big picture – to sift through the data and figure out ‘why’ instead of simply ‘what,’ ” said Kimberly Andrews Espy, the UA’s senior VP for research. “Without the proper platforms, researchers spend too much time staring at frozen computer screens and not enough time making discoveries.”
Five-year $50 million grant from NSF
CyVerse began as the iPlant Collaborative, earning the university a five-year, $50 million National Science Foundation grant in 2008 to provide computational infrastructure for plant sciences. The platform was so successful that NSF expanded its focus in 2013 when the grant was renewed for another five years and $50 million.
“Nobody really knew then what was necessary or what the applications would be,” Antin said. “The first two years of the project were meetings with the scientific community around the world, to ask what is happening in the next five years and what do we need in terms of cyber infrastructure to make it happen?”
CyVerse is a highly sophisticated and thoroughly interconnected set of computing resources, largely built on open-source resources already created by the NSF, including software and hardware solutions for data storage and management, analysis, communications and more.
“The genius of it was to get those individual resources, which were created as stand-alone entities, to talk seamlessly together. That’s very difficult and complicated, but people need it in order to facilitate their science,” Antin said. “We now have a mission that is much broader than it initially was and one of the great things about our infrastructure is it’s the Lego blocks of science. It’s scalable. You can keep adding onto it. People can even bring their own resources to us.”
CyVerse is now approaching 30,000 user accounts across the world, representing all areas of biology, ecology, environmental sciences, geography, climate and space sciences. Those users rely on CyVerse for an ever-growing data storage of more than 1.3 petabytes housed on servers at the UA and its partner institutions – the Texas Advanced Computing Center, Cold Spring Harbor Laboratory and the University of North Carolina Wilmington.
“Some people think that what they need is a lot of supercomputers, but it’s not about having access to the fastest computer in the world, because there are very few projects that actually need that,” Antin said. “What they really need is data management. It’s about having a place to house these huge data sets. With CyVerse, suddenly you can upload unlimited amounts of data, you can process it and share it and point it at data analysis applications.”
Next-generation computing resources
In one recent breakthrough, CyVerse helped UA geneticist Taylor Edwards discover a new tortoise species in northern Mexico. Edwards was able to easily share his data with collaborators and process the genetic information using the CyVerse platform.
Likewise for Fiona McCarthy, a UA associate professor in the School of Animal and Comparative Biomedical Sciences, CyVerse enables work that couldn’t otherwise be done. “This is a real turning point for biologists. This is the first time that we’re able to get more data so rapidly that we actually are lagging behind in understanding what that means,” McCarthy said.
McCarthy, who has worked on CyVerse since its inception, studies bird genomes, focusing on bioinformatics and genomics – research that requires next-generation computing resources. The chicken was the first bird genome to be sequenced, more than 10 years ago. Now there are more than 50 species to be sequenced, and part of McCarthy’s research uses comparative modeling to study how birds evolved and are related to each other.
“One of the things that’s fundamental is dropping boundaries and developing collaborative links for the researchers,” McCarthy said. “I can put my data on CyVerse and decide who I share it with. I don’t have to mail this hard drive to a collaborator. It’s so easy to share data now with just a couple of clicks.”
Science itself is transforming in the era of big data, Antin said. The traditional model of hypothesis-based inquiry is losing ground to what’s called discovery-based science – where the collection and analysis of data can occur without any specific question in mind.
“We have a mantra that we enable science. We don’t do science. We enable other people to do great, great science – by creating this environment and infrastructure for them to succeed,” Antin said. “That’s the power here. It doesn’t matter if it’s in space, or in biology, if you have a major challenge and you need computing resources, we’re your people.”
‘Customers for life’
As CyVerse has grown and expanded, it’s built a reputation in the scientific community as an indispensable resource for investigators across disciplines.
“In the beginning we went out and sought users. Now, we have so much going on that we don’t have to do that anymore. People are coming to us. As soon as we solve their data needs, they’re customers for life,” Antin said.
CyVerse has changed and adapted over time by staying in close contact with its users, creating and adding new capabilities along the way as users’ needs change.
“We have a group of scientific analysts who understand both the cyber infrastructure and the needs of the users. It’s through this interaction we realize and create new capabilities within the infrastructure,” Antin said. “The capabilities are driven by the users. CyVerse can address users who just point and click – but we can also take users who know how to code and they can build their own applications.”
Antin and his co-principal investigators at the UA – Nirav Merchant, director of information technology at Arizona Research Laboratories, and Eric Lyons, assistant professor of plant sciences – lead what’s becoming one of the UA’s most notable scientific attributes.
“One of the reasons it started here is we have world-class plant biologists and world-class computer capability in our faculty,” Antin said. “Having this here is a unique attribute, similar to having the OSIRIS-REx project or the Biosphere. You can think of this as a unique resource on the scale of those because we can enable so many major projects.”