Feb 29, 2012

Mapping Twitter Abuse and Recursive Abstraction: Can We Accurately Map Who Abuses Who?

Introduction

In week 14 of our CAST London Sandbox sessions, we learnt about visualising networks with mapping software such as OpenStreetMap, TileMill and MapBox. As a result, I was inspired to use a combination of these tools to see if I could map a sample of abusive tweets from a single day on Twitter.

One of the key questions arising from my research question, “Who Abuses Who on Twitter, and How Do They Respond?”, is how one visualises exactly where abusive tweets are coming from, especially when a large percentage of people don’t switch on geo-location tagging on the Twitter platform.

One of my main interests is ICT for Development, so I monitor both Twitter and news feeds for interesting articles on the use of mobile technology and digital social media in the Global South. I was intrigued by a recent article in the Economist (2012) entitled “#AfricaTweets”, which featured a striking info-graphic summarising how many Africans are tweeting across the continent:

Economist: Number of African Tweets


Looking at the info-graphic above, I was quite impressed that they had been able to get such a good sample of data to give a comprehensive representation of the number of tweets made in the top 20 African countries in Q4 2011. However, once I had tracked down the full press release and actual report by Portland Communications (2012), I found that the sample was based only on geo-located tweets, and therefore, owing to the security and privacy issues discussed in the full report and blog article, it represented only a very small fraction of actual tweets in Africa in Q4 2011. As Graham (2012) rightly points out in his post “A critique of the Economist’s ‘#AfricaTweets’ story”, the Economist’s (2012) brief and witty report of Portland Communications’ data made the data itself a victim of recursive abstraction: the final conclusions of this brief info-graphic of African tweets were several times removed from the underlying data in Portland’s full report.

As we learnt in a later CAST London session discussing digital ethnography and recursive analysis, one must always be aware of the filter imposed by the tools one is researching with, and at each point qualify one’s data with the limitations of the software being used. The Economist’s (2012) article clearly neglected to do this, whereas Portland Communications’ original info-graphic explicitly stated that the data was limited to geo-located tweets:


How Africa Tweets (Portland Communications)

Bearing these issues of recursive analysis in mind, I was still determined to see what kind of map-based (or digitally cartographic) visualisation I could use to represent the data gained from my previous research into the John Terry vs Anton Ferdinand controversy. I decided to use a combination of TileMill and MapBox to see what restrictions I would encounter in my quest to map a small sample of abusive tweets from one day in early February 2012.

Methodology and Discussion

Because I wanted to map what had happened at the height of the John Terry controversy, I decided to use a small sample of 2,000 tweets that I had previously scraped over a 30-minute period on 3rd February 2012. The full volume of tweets around the removal of John Terry’s England captaincy, while he awaits trial for the alleged racial abuse of Anton Ferdinand, was a rather larger dataset of over 4,000 tweets, and from initial analysis I knew that only a very small percentage of tweets, about 0.03%, contained any geo-location data at all. Previous analysis of the Vertices data sheet from my Google Docs scrape showed that there was a field containing ‘Location’ data that could provide some indication of where the Twitter user was based. A paper by Takhteyev et al (2011) described an attempt at automated coding of profile locations with an unnamed tool (Graham et al. 2011). Takhteyev et al found that most of the messages in their original sample (75 per cent) had some location value associated with them: they were sent by users who either specified a location in their Twitter profile or, in a minority of cases, used a Twitter client that automatically updated the location field in their profile. I decided to see if I could find a tool that would allow me to convert these user-inputted locations into longitude and latitude coordinates.
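As a rough illustration of that first audit, here is a minimal Python sketch; the file name and column headers are my own assumptions rather than the actual headers of my Google Docs scrape:

```python
import csv

# A minimal sketch of the initial audit, assuming a Google Docs scrape
# saved as CSV. The file name and column names here are hypothetical --
# the real sheet's headers may differ.
with open("terry_ferdinand_tweets.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

geo_tagged = [r for r in rows if r.get("geo_coordinates", "").strip()]
with_location = [r for r in rows if r.get("location", "").strip()]

print(f"{len(rows)} tweets in sample")
print(f"{len(geo_tagged)} carry device geo-coordinates "
      f"({100 * len(geo_tagged) / len(rows):.2f}%)")
print(f"{len(with_location)} have a free-text profile location")
```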

An analysis of my sample dataset revealed similar patterns to those found by Takhteyev et al (2011). The format of location descriptions in my dataset varied: some users identified specific cities, countries, addresses or general geographical areas like ‘Old Trafford, Manchester’, and some even had map coordinates where the data had been taken from a digital device. However, as Takhteyev et al (2011) found, a careful analysis of my dataset also showed a few erroneous or made-up descriptions, like “Planet of the Japes” or “Not Bolton”, which required me to manually sift through the data and filter out such references.
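Continuing the sketch above, this manual sifting step can be approximated in code; the blocklist below is illustrative only, built by eyeballing the data rather than by any principled rule:

```python
# Drop profile locations that are obviously jokes or refusals. In
# practice the sifting was manual; this blocklist just captures the
# obvious cases from the text above.
JOKE_LOCATIONS = {"planet of the japes", "not bolton"}

def plausible_location(loc: str) -> bool:
    loc = loc.strip().lower()
    if not loc or loc in JOKE_LOCATIONS:
        return False
    # Negated place names ("not ...") are nearly always jokes.
    return not loc.startswith("not ")

usable = [r for r in with_location if plausible_location(r["location"])]
print(f"{len(usable)} tweets kept")  # just under 170 in my sample
```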

Once I had analysed my sample, filtering out users who had not inputted location data and keeping those who had supplied identifiable locations, my sample size decreased quite considerably, from 2,000 to just under 170 tweets. This was quite interesting, as again it showed just how little accurate location data people really tend to put in their Twitter profiles. I searched for automated location-conversion tools and found the Findlatitudeandlongitude.com website, which has a useful batch conversion tool. I pasted my filtered location data into this tool one column at a time:

Batch Geo-coding: http://www.findlatitudeandlongitude.com/batch-geocode/
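The batch tool did the job, but the same step can also be scripted for reproducibility. Here is a sketch using geopy’s Nominatim geocoder; this is an assumption on my part, not the tool I actually used:

```python
import time
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="twitter-abuse-mapping")

def geocode(location_text):
    """Return (lat, long) for a free-text location, or None if not found."""
    result = geolocator.geocode(location_text)
    time.sleep(1)  # Nominatim's usage policy asks for at most 1 request/second
    return (result.latitude, result.longitude) if result else None

print(geocode("Old Trafford, Manchester"))  # roughly (53.46, -2.29)
```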

I then created a single spreadsheet that contained the longitude and latitude, the location data, the name of each user and the text that they had tweeted:
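Continuing from the earlier sketches, the merge itself is only a few lines. The ‘long’ and ‘lat’ headers are deliberate, as TileMill looks for those names (see below); the other column names are my own assumptions about the scrape:

```python
import csv

# Write one row per geocoded tweet. The 'long'/'lat' headers are the
# ones TileMill picks up automatically; the other columns feed the
# interactive tooltips later on.
with open("mapped_tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["name", "tweet", "location", "long", "lat"])
    writer.writeheader()
    for row in usable:
        coords = geocode(row["location"])
        if coords is None:
            continue  # skip locations the geocoder cannot resolve
        lat, lon = coords
        writer.writerow({
            "name": row["name"],    # assumed column name in the scrape
            "tweet": row["text"],   # assumed column name in the scrape
            "location": row["location"],
            "long": lon,
            "lat": lat,
        })
```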

I then exported the spreadsheet as a .csv file and imported it into TileMill as a layer. TileMill automatically reads the ‘long’ and ‘lat’ column headers in the .csv file and pulls them in as markers on the map:

TileMill: View Layer Content (CSV)

I then imported country boundary layers from an open-source cartography data bank:

TileMill: Layers: Country Boundaries

and edited the style.mss file to darken the colours and boundary lines on the map:

TileMill: Map Styles


I wanted to pull through the names, tweets and correlating self-inputted location data of users, so I opened the Interactive tab and pulled through the ‘Mustache’ tag data, which is read from each column header and appears in a white box as the user hovers over or clicks on each marker point on the map:

TileMill: Interactive Mustaches

I then installed some multi-zoom plugins so that I could work out the best zoom levels and boundary points:

TileMill: Map Plugins

The next step was to set the actual boundary areas and zoom levels for the exported map:

TileMill: Map Settings

My Interactive Map

After creating a free MapBox.com account, I exported the data from TileMill to the MapBox.com website and embedded the resulting map below. You can use your mouse to zoom in and out of the map, and mousing over each of the green nodes will reveal an interactive box displaying the name of the user, their tweet, and the location they inputted in the ‘location’ field of their Twitter profile:

The full map can be found on my MapBox account.

Results and Conclusion

What I found really interesting about the whole process of converting location data from my Twitter dataset into a series of usable longitude and latitude coordinates was how little real geo-located data, or accurate self-inputted location data, could be found in a fairly good-sized sample of 2,000 tweets. My comparison of the Economist’s (2012) representation of Portland Communications’ (2012) dataset with Graham’s (2012) subsequent critique, with its caution to recursively analyse one’s data and take account of the limitations of one’s tools, meant that I was well prepared to conduct my research, and to qualify my results, with these limitations in mind.

In a recent paper, Graham et al (2011) discuss how problematic it is to find alternative ways of retrieving geo-located data from tweets, and describe various other methods used to try to obtain geo-located data from the location and description fields, and in some cases the status updates, of users:

While geographic metadata in device locations (i.e. precise coordinates) is unlikely to be subject to much debate about its validity, the self-reported profile location field on a user’s profile is problematic because of its unstructured nature; yet, the profile location is often contemplated in papers discussing the validity of findings about a geographically bound situation such as the Arab spring of 2011 or the Iran election protests of 2009 (Hecht et al. 2011; Gaffney 2010; Lotan et al. 2011).

Vieweg et al. (2010) also handcoded profile locations, but used tweet content in addition to profile content in determining the user’s true physical location. Java et al. (2007) used the Yahoo geocoding API, which attempts to assign a precise location to self-reported profile locations. However, Hecht et al. (2011) found only 66% of the profiles they examined by hand had valid geographic information while 18% were blank and 16% had only non-geographic information, mostly made of popular culture references. As a result geocoding APIs struggle with this input and Hecht et al. recommend preprocessing the information to remove such non-geographic information; however such a preprocessor has yet to be developed and tested.

Graham et al (2011) go on to discuss how, even after extensive boundary-based analysis of over 111 million tweets, both the process and the results remain open to question, and they conclude that there are significant challenges to accurately determining the language of tweets in an automated manner. This again shows how unreliable the Twitter API can be as a single tool when one relies on its data without deeper analysis and a healthy cognisance of its limitations for research.

Recent developments in the way that Twitter shares its more retrospective and detailed data via the UK company DataSift (Lee 2012) may give us better access to data and analysis tools that can more accurately reveal the location of Twitter users. However, other developments (BBC 2012) concerning the unauthorised access of smartphone users’ contact and address-book details, when social networking apps such as Twitter request access to users’ locations and friend connections, may see an even greater movement away from users intentionally sharing personal data, with users reluctant to reveal their real locations over concern that this data will be sold to third-party companies.

Finally, while it is true that there are pitfalls in blindly relying on poor initial summaries of big data from digital social networks like Twitter, especially when these summaries are visualised on maps, social scientists can address the problem of recursive abstraction by documenting the reasoning behind each step of their research, and by citing examples from the data showing where information was included in or excluded from the final visualisation and subsequent report.

Bibliography
