Ushahidi's first steps towards integrating Machine Learning

Ushahidi
Mar 26, 2018

Hello, my name is Will, I’m one of the developers at Ushahidi. I hope to tell you briefly about some of the exciting things we’re working on at Ushahidi, the implications for that technology, and my own hopes for the future of the tools we build.

My background is as a computer scientist. Previously I was a researcher in machine learning, I worked as a software developer for financial services companies, and on a whim I moved from Ireland to Montreal to work with tech NGOs building tools to help activists.

Six years ago, I worked on a project focused on documenting human rights abuses and crimes committed within Syria. There have been a number of significant efforts by individuals and organisations to collect data about the Syrian conflict and potential crimes that have occurred. The group I worked with was collecting information about individuals affected by the conflict, the documents generated about it, and additional documents retrieved from Syria. Above all, the group was interested in corroboration. More specifically, it hoped to find a means to link different documents, to highlight those that were meaningfully related, and ultimately to find those that corroborated what had happened for an event being examined.

At the time, there was a great deal of focus on scraping and collecting data. There were, and still are, many websites that stored, collected, and displayed information about those who had been killed or detained, and those who disappeared. The project I worked on had managed to collect significant amounts of data and had produced some meaningful tools to help in analysis. These techniques, though, were still quite manual in their implementation. It appeared evident, based on the types of documents and information the project dealt with, that there was great potential for the application of Machine Learning to better facilitate the work of corroborating events related to the Syrian conflict.

When analysing great quantities of data, you are often dealing with large, complex collections of information that may comprise many millions of pages - far more than even the largest, most well-funded team of researchers could ever hope to decipher. Machine learning can help begin to solve these challenges.

I don’t want to assume that everyone knows precisely what machine learning is. To give a very brief overview, there are various algorithms that describe ways a program can store understanding. At a simple level, this understanding is expressed as an ability to categorize or classify discrete pieces of information. You teach a program to understand by giving it many examples of different categories, then testing it on related examples it has not seen before, and correcting it depending on whether it identifies each example correctly.
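To make that train-and-test loop concrete, here is a toy sketch in Python. It is purely illustrative - not Ushahidi's actual models - and the example messages and categories are invented. It learns a word-frequency profile for each category from labelled examples, then labels a message it has never seen:

```python
# A toy bag-of-words classifier (illustration only, not Ushahidi's models).
from collections import Counter

def tokenize(text):
    return text.lower().split()

class ProfileClassifier:
    """Learns a word-frequency profile per category from labelled examples."""

    def __init__(self):
        self.profiles = {}

    def train(self, examples):
        # examples: iterable of (text, category) pairs
        for text, category in examples:
            profile = self.profiles.setdefault(category, Counter())
            profile.update(tokenize(text))

    def classify(self, text):
        # Score each category by how often its known words appear in the text.
        words = tokenize(text)
        return max(self.profiles,
                   key=lambda c: sum(self.profiles[c][w] for w in words))

clf = ProfileClassifier()
clf.train([
    ("the river burst its banks and flooded the town", "flood"),
    ("water levels rising streets under water", "flood"),
    ("tremors shook the city and buildings collapsed", "earthquake"),
    ("a strong quake and aftershocks were felt downtown", "earthquake"),
])

print(clf.classify("streets are under water after the storm"))  # flood
```

A real system would use far richer features and statistical models, but the shape is the same: examples in, a stored profile of each category, and a guess for unseen data that can be corrected.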

The algorithms are good at discerning whether something is of a particular category that they already understand. However, they find it hard to comprehend novel pieces of data that bear no relation to categories they have seen before, so there are distinct limitations to the scope of understanding that a model can represent. To overcome these restrictions, we can build different models that describe overlapping forms of data, or we can apply sequences of algorithms that progressively refine the categorisation.
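A cascade of that kind can be sketched very simply. The example below is a hand-rolled illustration, not the project's implementation, and the keyword lists are invented: a coarse first stage decides whether a message is a crisis report at all, and only then does a finer second stage refine the category.

```python
# A two-stage cascade (hand-rolled illustration; keyword lists are invented).
COARSE = {
    "crisis": {"flood", "fire", "earthquake", "collapsed", "trapped"},
    "other": {"weather", "traffic", "sports"},
}
FINE = {
    "flood": {"flood", "water", "river"},
    "fire": {"fire", "smoke", "burning"},
    "earthquake": {"earthquake", "tremor", "collapsed"},
}

def overlap(words, vocab):
    return len(words & vocab)

def classify(text):
    words = set(text.lower().split())
    # Stage one: a coarse model decides whether this is a crisis report.
    coarse = max(COARSE, key=lambda c: overlap(words, COARSE[c]))
    if coarse != "crisis":
        return coarse
    # Stage two: a finer model refines only the messages stage one kept.
    return max(FINE, key=lambda c: overlap(words, FINE[c]))

print(classify("smoke and fire reported near the market"))  # fire
```

Each stage only has to understand a narrow slice of the data, which is what makes chaining models useful when no single model can represent the whole domain.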

Over the intervening years since I originally worked on the Syrian project, Machine Learning technology has become vastly more accessible and implementable. When I found Ushahidi, I was thrilled because it presented an existing piece of software designed to help people document what was happening to them, and allowed others to use that information to coordinate efforts or capture experiences. In the last year, I have worked on the COMRADES project, which we are developing with the Knowledge Media Institute at The Open University, a group specialising in Machine Learning. Currently the project has two parts. First, the algorithm attempts to categorize inbound data according to its understanding of the domain that a particular group is working in. For example, in responses to natural disasters: is this inbound tweet, text, or web submission related to a fire, an earthquake, or a flood, or is it a request for resources or an offer of help? The goal of this first element is to quickly triage the large amounts of inbound data that inundate an organization, reducing the human effort required to manually read through each piece of text and figure out what it is trying to convey and how it might be relevant. The project has made this tool available as a Google Spreadsheet to allow anyone to experiment with it. You can find it here.

While the first part of the project is about grouping and filtering inbound data, the second aspect, developed in conjunction with the Natural Language Processing Research Group at the University of Sheffield, is about enhancing the richness of the data that our application can provide, both to organizations working methodically to analyse large amounts of data and to individuals working on the ground to quickly assess situations and devise responses. Using YODIE, a tool for automatic annotation, the program is able to identify important nouns within a particular piece of text and link each noun directly to its associated DBpedia entry. (For anyone who doesn’t know what DBpedia is, it is a manifestation of the semantic web: information is collected about given objects, formatted in a standard way, and then cross-linked. For example, the Golden Gate Bridge is in San Francisco. The DBpedia page for the Golden Gate Bridge tells us the information below.)

Property: dbo:Infrastructure/length
Value: 2.7374088

Property: dbo:abstract
Value: The Golden Gate Bridge is a suspension bridge spanning the Golden Gate strait, the one-mile-wide (1.6 km), three-mile-long (4.8 km) channel between San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California – the northern tip of the San Francisco Peninsula – to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers. The Frommer's travel guide describes the Golden Gate Bridge as "possibly the most beautiful, certainly the most photographed, bridge in the world." It opened in 1937 and was, until 1964, the longest suspension bridge main span in the world, at 4,200 feet (1,300 m). (en)

Property: dbo:architect
Value: dbr:Joseph_Strauss_(engineer), dbr:Charles_Alton_Ellis, dbr:Irving_Morrow

By annotating previously flat, one-dimensional text, we are able to start to provide more connected meaning to the information that our tool is receiving in crises or natural disasters. If the Ushahidi Platform receives a text describing a crash on the Golden Gate Bridge caused by aftershocks, the software will soon be able to correctly categorize the text as related to earthquakes or accidents, and link the user to the DBpedia entry for the Golden Gate Bridge automatically. Ultimately, because DBpedia entries are semantically linked and contain the properties for a given object - in this case, the Golden Gate Bridge - Ushahidi will begin to be able to do things like automatically suggest geolocation. This will help organizations using our software not only to categorize new information but also to make decisions about where to direct their resources more efficiently and effectively.
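As a rough sketch of that geolocation idea (a simplified illustration, not the platform's implementation): once an annotator links a mention of the Golden Gate Bridge to a DBpedia-style record carrying geo coordinates, a suggested location falls out directly. The record below is a hand-written stand-in for real DBpedia data, keeping only two properties.

```python
# Sketch of automatic geolocation suggestion (simplified, not the platform's
# implementation). The record below is a hand-written stand-in for DBpedia.
DBPEDIA = {
    "Golden_Gate_Bridge": {
        "dbo:abstract": "The Golden Gate Bridge is a suspension bridge...",
        "geo:lat": 37.8199,
        "geo:long": -122.4786,
    },
}

def suggest_geolocation(text):
    """Return (entity, lat, long) for the first known entity found in the text."""
    lowered = text.lower()
    for entity, props in DBPEDIA.items():
        name = entity.replace("_", " ").lower()
        if name in lowered and "geo:lat" in props:
            return entity, props["geo:lat"], props["geo:long"]
    return None

print(suggest_geolocation("Crash on the Golden Gate Bridge caused by aftershocks"))
# ('Golden_Gate_Bridge', 37.8199, -122.4786)
```

In practice the entity linking is done by YODIE against the full DBpedia graph rather than a local dictionary, but the payoff is the same: a plain text message gains a machine-readable location.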

These programs are still at a very early phase of development and learning. At Ushahidi we are working on models that will improve the categorization of election monitoring data and crisis response. My hope is that by this time next year, I will be able to say that these tools are dramatically improving the work of our users.