The following post was written by a volunteer developer, Vladimir G. Ermakov, a Master’s student at Carnegie Mellon University in Pennsylvania. Over the past few months he took on an ambitious project: contributing code that would allow us to parse news articles and attempt to auto-detect the primary location that is the subject of any given text.
Cross-posted from blog.swiftly.org
The amount of information available in electronic format is rapidly increasing. It is becoming possible to find out in real time about current events in a particular part of the world from electronic data such as news articles, blog entries, Twitter feeds and SMS messages. Even though the data is available, there is an overwhelming amount of it, and it is hard to stay on top of the events that are relevant. Getting informed about recent developments is particularly important in times of crisis, when lives could depend on a timely response. In this project I am exploring ways to pinpoint the location discussed in text documents. I am able to achieve good results by combining location keywords extracted by the Yahoo! Placemaker service with state-of-the-art machine learning and natural language processing techniques.
The basic approach that I’ve embarked upon is to extract location keywords from a document using the Yahoo! Placemaker service, and then apply classification techniques to determine which of these locations is most relevant to the document at hand. I first conducted experiments with Naïve Bayes and Fisher classifiers using a bag-of-words model for feature extraction, but these did not give good results. I then explored an alternative approach: take the count and position of the location keywords extracted by Placemaker and feed them into an SVM. This proved to be a very effective way of determining the country that is the focus of the document. Applying lemmatization to location adjectives such as Russian, converting them to nouns such as Russia, improved the results even further.
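The count-and-position features described above could be computed along these lines. This is a minimal sketch, not the code used in the project: the adjective table, the function name location_features and the exact normalization are assumptions made for illustration, the place names are supplied by hand rather than by Placemaker, and only single-word place names are handled.

```python
import re
from collections import defaultdict

# Hypothetical adjective-to-noun table; a real system would need a far
# larger mapping or a proper lemmatizer.
ADJECTIVE_TO_COUNTRY = {
    "russian": "Russia",
    "french": "France",
    "japanese": "Japan",
}


def location_features(text, place_names):
    """Return {place: (normalized_count, first_position)} for one document.

    normalized_count is the place's share of all location mentions;
    first_position is the index of its first mention divided by the
    number of tokens (0.0 means the very first word).
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    known = {name.lower(): name for name in place_names}
    counts = defaultdict(int)
    first_seen = {}
    for index, token in enumerate(tokens):
        key = token.lower()
        # Map an adjective such as "Russian" onto its noun, "Russia".
        place = ADJECTIVE_TO_COUNTRY.get(key) or known.get(key)
        if place is None:
            continue
        counts[place] += 1
        first_seen.setdefault(place, index)
    total = sum(counts.values())
    return {
        place: (counts[place] / total, first_seen[place] / len(tokens))
        for place in counts
    }


article = "Russian officials met in Moscow. Russia and France signed a deal."
features = location_features(article, ["Russia", "France", "Moscow"])
```

Here "Russian" and "Russia" are counted as the same place, so Russia accounts for half of the location mentions and also appears first. Each candidate's pair of numbers would then form one input row for a classifier such as an SVM.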
While Reuters-21578 was a great dataset to use for training classifiers and experimenting with the data, the articles in it were collected 20 years ago. What made this project interesting for me is the possibility of visualizing news from around the world on a map, and seeing whether a sudden rise in the number of articles published can be an indicator of important events.
To make this possible I had to obtain a recent dataset. Reuters has archived articles from the last several years on their website. I developed a simple crawler that visited news articles in this archive, downloaded them to my server, and extracted the text content of each article. I then passed this content to the Yahoo! Placemaker service and wrote the data, together with the location labels, to XML files. I could then use my scripts to run the experiments on this new dataset, just as I did with the original data.
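As an illustration of the last step, here is one way an article and its location labels might be written out as XML. The post does not describe the actual file format, so the element and attribute names below, the function name article_to_xml and the example URL are all assumptions, and the labels are hard-coded instead of coming from Placemaker.

```python
import xml.etree.ElementTree as ET


def article_to_xml(url, text, locations):
    """Serialize one article and its location labels to an XML string.

    The element and attribute names here are illustrative only.
    """
    root = ET.Element("article", {"url": url})
    ET.SubElement(root, "text").text = text
    places = ET.SubElement(root, "locations")
    for name, place_type in locations:
        ET.SubElement(places, "location", {"type": place_type}).text = name
    return ET.tostring(root, encoding="unicode")


xml_doc = article_to_xml(
    "http://example.com/article-1",
    "Officials met in Moscow.",
    [("Moscow", "Town"), ("Russia", "Country")],
)

# Reading the document back recovers the labels for later experiments.
parsed = ET.fromstring(xml_doc)
labels = [loc.text for loc in parsed.iter("location")]
```

Keeping the labels next to the article text in one file means the experiments can be rerun without calling the web service again.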
I limited my data collection to the most recent articles. The archive contained over 400,000 news articles for 2010, which was too many to download. I restricted the crawler to randomly picking 10% of the articles from each day of the year. This was still a significant amount of data, 80,000 articles, and fairly representative of the whole archive.
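Per-day sampling of this kind can be sketched as follows. The function name sample_per_day, the fixed seed and the rule of keeping at least one article per day are my own assumptions, not details of the original crawler, and the archive here is a made-up list of URLs.

```python
import random
from collections import defaultdict


def sample_per_day(articles, fraction=0.10, seed=0):
    """Pick roughly `fraction` of the articles from each day.

    `articles` is an iterable of (date_string, url) pairs.
    """
    rng = random.Random(seed)  # fixed seed so a crawl can be repeated
    by_day = defaultdict(list)
    for day, url in articles:
        by_day[day].append(url)
    sampled = {}
    for day, urls in sorted(by_day.items()):
        # Keep at least one article even on very quiet days.
        k = max(1, round(len(urls) * fraction))
        sampled[day] = rng.sample(urls, k)
    return sampled


# Made-up archive: two days with 50 article URLs each.
archive = [
    (f"2010-01-{day:02d}", f"http://example.com/{day}/{n}")
    for day in (1, 2)
    for n in range(50)
]
picked = sample_per_day(archive)
```

Sampling within each day, instead of across the whole year at once, keeps every day represented in proportion to its volume, which matters if the goal is to watch for sudden rises in the number of articles.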
After all the experiments I was able to settle on a working solution for mapping news articles: extract location information from the article using the Yahoo! Placemaker service, making sure to lemmatize location adjectives; extract the normalized count and position of the location keywords within the article; and apply an SVM classifier to decide which of these locations is most important to the article. The results were encouraging, and I believe this solution is ready to deploy in a real-world application. I am hoping to implement an extension to the Swiftriver platform in the near future that uses this method to classify news articles by country.
Vladimir’s paper is a much longer, and much more fascinating, read than what I could share here, if you’d like to read it in full. He can be reached by emailing vermakov [at] emu [dot] edu.
We’re working on folding this and other contributions into the next release of Sweeper. Thanks for the awesome work, Vladimir! Other developers interested in contributing to the Swift platform can find out more here.