In 2007, NASA set out to fill a major gap in landslide research. Now, a team of master’s students is changing the way we detect landslides.
Master’s students at the University of British Columbia (UBC) are using social media to identify landslides. Despite their global prevalence, landslides frequently go undetected by researchers and responders. NASA’s multi-partner landslide team has been collecting global landslide information since 2007, but the work has proven to be a very time-consuming, manual process.
Recently, NASA launched the Cooperative Open Online Repository (COOLR) to expand its archive of landslide reports. COOLR is the largest database of its kind, containing landslide information from NASA scientists, citizen scientists, and other organizations. However, the landslide team still needs ways to streamline their landslide detection processes and grow their inventory. Fortunately, an initiative with UBC students and BGC Engineering Inc. (BGC) could provide a solution. BGC supports clients across multiple sectors, such as transportation and energy, to understand and manage risk from geological hazards. NASA’s landslide catalog is a powerful tool for private institutions such as BGC as it provides a rich picture of where and when landslides have occurred.
Last November, Corey Froese, Principal Geological Engineer of BGC, reached out to the landslide team to see if they would be interested in collaborating with UBC. The pursuit would involve mentoring a group of four UBC graduate students through their Master of Data Science capstone project to create an automatic landslide detection tool. The team saw the opportunity to engage future scientists in extending COOLR’s capabilities and readily agreed.
Collaborating with new organizations helps NASA learn more about resources available for improving its internal procedures. The tools developed in this capstone project will inform NASA’s methods for locating and categorizing landslides. In turn, this will build NASA’s capacity to manage landslide risk by expanding its publicly available landslide inventory and reducing the time spent identifying and adding new landslides to COOLR. Helping NASA expand this dataset will also increase its utility for BGC by enhancing their ability to calibrate statistical models for future landslide occurrence. Furthermore, a more extensive and comprehensive landslide repository will introduce less bias when used to calibrate and validate NASA’s global landslide models.
Reducing Bias in Models
The smaller the landslide inventory, the harder it is to produce models without bias. Bias describes how far a model’s predictions systematically deviate from reality. A biased landslide model will predict where landslides will occur with less accuracy than an unbiased model. But before researchers can detect bias in their models, they need to thoroughly understand what ‘reality’ looks like. For NASA’s landslide scientists, this means they need to create an inventory of as many recent and historical landslide events from around the world as possible. COOLR is working to fill this gap in landslide research. After NASA runs the landslide prediction models, its scientists can use COOLR to compare the predicted outcome to real-life events. This process is called validating model results, and it’s a crucial step in ensuring that decision-makers can use NASA’s landslide models to predict disaster risk accurately and reduce it.
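To make the idea of validation concrete, here is a minimal sketch in Python. It assumes a simplified setup that the article does not describe: each region gets a predicted landslide probability, and the inventory records whether a landslide was actually observed there (1) or not (0). The function name and the sample numbers are hypothetical and are not taken from NASA’s models; mean error is only one common way to express bias.

```python
def mean_bias(predicted, observed):
    """Average of (prediction - observation) across regions.

    A value near zero suggests no systematic over- or under-prediction;
    a positive value means the model tends to over-predict landslides.
    """
    if not predicted or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and the same length")
    return sum(p - o for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical numbers for four regions (illustration only).
predicted = [0.9, 0.2, 0.6, 0.1]   # model's landslide probabilities
observed = [1, 0, 1, 0]            # 1 = landslide recorded in the inventory

print(round(mean_bias(predicted, observed), 3))
```

With a sparse inventory, many real landslides would be missing from the observed list, so a comparison like this would itself be misleading. That is why a larger inventory matters for validation.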
So, before NASA can reduce bias in its landslide models, all it needs is a continually updating, comprehensive archive of global landslides. Easy, right? Well, there’s only one problem: landslides are underreported. COOLR is a valuable tool, but NASA needs to be aware of landslide events before incorporating them into the database. As a result, the landslide team has been tenacious in their search for innovative ways to locate landslides and expand COOLR’s database.
Finding Landslides with Reddit
The UBC student team, composed of Shengjie Zhang, Mariia Shubina, Badr Jaidi, and Yiting Zhou, partnered with BGC’s Data Science Engineer Autumn Umanetz, Geoscientist Dr. Corey Scheip, and NASA’s landslide scientists to address inadequate landslide detection across the globe. The students are part of UBC’s Master of Data Science in Computational Linguistics program. At the end of the ten-month program, the students had eight to ten weeks to create a group capstone project based on real data and questions from a university partner. “We apply the models that we learned in this program to try to solve a problem in the real world, and I think this is very exciting,” says Shengjie Zhang.
This group’s project, called ‘Social Landslides,’ scrapes Reddit, a social news forum, and uses natural language processing (NLP) to extract landslide information from the posts. NLP involves giving computers the ability to understand written and spoken language. Once the computer can process language, it can be used to extract relevant data from websites. This technique helped the students determine when a landslide happened by teaching computers to evaluate phrases like “last Thursday morning.” The team used this process to collect landslide locations and supporting information from Reddit, such as the conditions that triggered the landslide, casualties, the time of the event, and more.
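As a rough illustration of how a phrase like “last Thursday morning” can be turned into a calendar date, here is a minimal Python sketch. It is not the students’ implementation, which the article does not describe. The function name is hypothetical, and the sketch assumes that “last Thursday” means the most recent Thursday strictly before the day the post was published.

```python
import re
from datetime import date, timedelta

# Index positions match date.weekday(): Monday is 0, Sunday is 6.
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
LAST_WEEKDAY = re.compile(r"\blast\s+(" + "|".join(WEEKDAYS) + r")\b",
                          re.IGNORECASE)


def resolve_last_weekday(text, posted_on):
    """Return the date meant by 'last <weekday>' in text, or None.

    Assumption: the phrase refers to the most recent such weekday
    strictly before the post date.
    """
    match = LAST_WEEKDAY.search(text)
    if match is None:
        return None
    target = WEEKDAYS.index(match.group(1).lower())
    # If the post date falls on the same weekday, go back a full week.
    days_back = (posted_on.weekday() - target) % 7 or 7
    return posted_on - timedelta(days=days_back)


# A post published on Monday, June 14, 2021.
print(resolve_last_weekday("The slope failed last Thursday morning.",
                           date(2021, 6, 14)))  # 2021-06-10
```

Real posts use many more time expressions (“two days ago,” “yesterday evening”), so a production tool would need a far broader set of rules or a trained model.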
While they looked to NASA and BGC for guidance, the students were largely on their own to create and execute a project plan. “When we spoke to BGC and NASA, they told us, ‘Here’s the problem, good luck!’ They helped us, of course, but to move forward, we had to figure it out somehow,” says Badr Jaidi. BGC provided the team with more technical advice, helping the students with NLP, geocoding, and improving the model, while NASA focused more on the end product’s utility. To ensure that the final product would be useful for building NASA’s landslide capacity, the landslide team answered questions about which landslide attributes are important to capture and how NASA uses landslide data. Drawing on their experience looking for landslide events online, the landslide team also provided the students with a list of positive and negative words to scan Reddit for. Positive words, such as earthquake, downpour, flood, or mudslide, are more likely to accompany real landslides, while negative words, such as victory, song, election, or sports, are more likely to be misleading and describe metaphorical landslides. A Reddit post about Fleetwood Mac’s iconic song “Landslide” may give us insight about the changes and challenges of life, but it doesn’t do much for global disaster detection.
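The positive and negative word lists lend themselves to a simple filter. The Python sketch below uses only the example words quoted above, and the scoring rule (positive hits minus negative hits) is an assumption made for illustration. The article does not say how the students actually combined the words, and the full lists were presumably longer.

```python
import re

# Example words quoted in the article; the real lists are not published here.
POSITIVE_WORDS = {"earthquake", "downpour", "flood", "mudslide"}
NEGATIVE_WORDS = {"victory", "song", "election", "sports"}


def keyword_score(text):
    """Count distinct positive words minus distinct negative words in text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & POSITIVE_WORDS) - len(tokens & NEGATIVE_WORDS)


def looks_like_real_landslide(text):
    """Hypothetical rule: mentions 'landslide' and has a positive score."""
    return "landslide" in text.lower() and keyword_score(text) > 0


print(looks_like_real_landslide("Landslide after heavy downpour blocks highway"))  # True
print(looks_like_real_landslide("Fleetwood Mac's song Landslide is iconic"))       # False
```

A rule this simple will misjudge posts that contain none of the listed words, which is one reason to treat it as a first-pass filter ahead of a language model.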
Bringing COOLR and Social Landslides Together
The Social Landslides group’s final product is a script that researchers can run to automatically search Reddit for landslide events within a specific timeframe. Currently, NASA’s landslide team relies on media reports, citizen scientists, and other organizations’ datasets to populate COOLR. With the Social Landslides script, they could automate this process and find more landslides than ever before. According to NASA’s landslides website, if the landslide team were to add up all the hours spent over the past 10 years compiling this inventory, it would total over a year and a half of nonstop landslide cataloging! This innovative approach to finding landslides could reduce landslide cataloging to a fraction of the time, giving researchers more time to study other aspects of global landslide risk.
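The timeframe restriction can be sketched in a few lines of Python. This is an illustration only, not the Social Landslides script. It assumes posts have already been downloaded as dictionaries with a `created_utc` field holding a Unix timestamp, and the function name and sample posts are hypothetical.

```python
from datetime import datetime, timezone


def posts_in_timeframe(posts, start, end):
    """Return posts created at or after start and before end."""
    lo, hi = start.timestamp(), end.timestamp()
    return [post for post in posts if lo <= post["created_utc"] < hi]


def utc(year, month, day):
    """Build a timezone-aware UTC datetime at midnight."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical posts (illustration only).
posts = [
    {"title": "Landslide closes mountain road", "created_utc": utc(2021, 6, 10).timestamp()},
    {"title": "Mudslide after flood", "created_utc": utc(2021, 7, 2).timestamp()},
]

june = posts_in_timeframe(posts, utc(2021, 6, 1), utc(2021, 7, 1))
print([post["title"] for post in june])  # ['Landslide closes mountain road']
```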
The Social Landslides team won two awards for their innovative capstone project: “Overall Best Project” and the “Faculty Choice Award.” “I was really happy working on the project. I learned so much,” says Jaidi. In the future, the NLP and web scraping processes developed by these students could be applied to other social media platforms, such as Twitter, to locate even more landslide events. With an expanded disaster repository, NASA’s Disasters program will be able to reduce bias in its landslide models and provide communities with ever more accurate and comprehensive resources for landslide risk reduction.