Internet Research is Hard


At the end of last year, a group of researchers thought that they had a great idea for solving a common problem in social media research.

I usually try to avoid calling out specific researchers when I discuss problems in research. I’m making an exception here, because it’s hard to discuss this topic without going into the details. This is not an invitation to harass these individuals. They made a mistake, and have hopefully learnt from it. Mistaken research gets published all the time. That’s why we make more of it.

There aren’t a lot of good-quality data sets available for social media research. A group of researchers thought that scraping Mastodon, a growing social media platform, would solve that problem.1 So they collected all the English-language posts they could from every Mastodon server they could reach, and published the data set as part of a conference paper. That conference paper also put the data to use, exploring how content warnings are used on Mastodon and arguing that the analysis shows what is “appropriate” on Mastodon. Unfortunately, they got both the data collection and the analysis wrong.

Why would good-quality, publicly available data sets about social media be rare? Because collecting and sharing them is often illegal, unethical or both. For example, Twitter’s Terms of Service forbid publishing data sets collected from its service. And privacy laws, like the recent European General Data Protection Regulation, place very strict limits on what can be done with data that includes people’s personal information. The researchers in question thought that Mastodon doesn’t have limitations on data use, but they failed to notice that each node they collected data from has its own Terms of Service. Some of them might allow scraping their posts, but some explicitly forbid it. Scraping data from a federated network isn’t easier; it’s more difficult, since each server has its own rules.

There are some requirements that need to be fulfilled before someone can publish a data set. If it includes people, it’s important that those people can’t be recognised, meaning the data has to be anonymised. This is true even if the data was publicly available, unless those people explicitly gave their consent for having their data published out of the original context. Even if I tweet something, I might later want to delete that tweet – but I can’t if someone has already scraped it and placed it in a data set where it might be available forever.2 Of course, nothing stops people with bad intentions from collecting my tweets, but usually one can trust that researchers at least try to act ethically.

It’s really difficult to perfectly anonymise online data. Even if you remove usernames and other identifying information, just having a piece of text is usually enough for finding the original version of that text online. In this case the problem was even worse, since the researchers failed to understand their data and failed to remove identifying information. To their credit, the researchers have since taken down the data set.
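To illustrate why removing usernames is not enough, here is a minimal sketch in Python. Everything in it is made up for the example: the posts, the account names, and the `reidentify` function, which simply stands in for pasting a piece of text into a search engine. It is not how the researchers processed their data.

```python
# Toy illustration with invented posts; no real data or real search engine is involved.
public_posts = [
    {"author": "@alice@example.social", "text": "Just finished knitting my first scarf!"},
    {"author": "@bob@example.town", "text": "Anyone else watching the eclipse tonight?"},
]


def anonymise(posts):
    """Naive anonymisation: drop the author field and keep only the text."""
    return [{"text": post["text"]} for post in posts]


def reidentify(anonymised, searchable_posts):
    """Hypothetical stand-in for a web search: match each text to its public original."""
    index = {post["text"]: post["author"] for post in searchable_posts}
    return [index.get(record["text"]) for record in anonymised]


released = anonymise(public_posts)
recovered = reidentify(released, public_posts)
```

Even though `released` contains no author field, `recovered` holds both account names again, because the text itself works as an identifier for as long as the original post stays online.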

The research also failed in its analysis, because the researchers didn’t understand their data or the context of research. They used their data set to analyse what is “appropriate” to post about on Mastodon by seeing what type of posts have content warnings in them. The Mastodon user interface allows users to hide their posts and show a content warning instead. They thought that analysing what type of posts are behind content warnings would tell them what is “appropriate” on Mastodon.

Screenshot of Mastodon posting interface, showing a line for writing the content warning

Unfortunately, this simply isn’t how content warnings are used on Mastodon. If you spend any time there, you quickly notice that there are many uses for the content warning feature. For example:

A user can click the “CW” button to label a toot as containing discussion of politics, illness, injury, or bigotry (sexism, racism, homophobia, transphobia, and so forth). All of these topics are “appropriate”, but a user may at their own discretion decide to provide advance warning for the benefit of those readers who wish to mentally prepare themselves for reading about emotionally damaging subject matter. Such CWs are acts of courtesy, not signals of “inappropriate” content. Users often apply CWs to toots about food and cooking, topics that are safe for children to read but may cause distress among readers with eating disorders. CWs can also hide spoilers about movies, books and television shows, and they can be part of the presentation of a joke: the “Content Warning” text contains the setup, and clicking to open the toot then reveals the punchline. By no stretch of the imagination is hiding the punchline of a joke an example of content that strays outside of community norms or that “may hurt people’s feelings”. (Open Letter from the Mastodon Community)

There is nothing inappropriate about knock-knock jokes, but the content warning system is great for them. This reveals a broader problem with researching contexts you are not familiar with. It might seem commonsensical that content warnings are used for hiding inappropriate things, but that’s only if you’ve never encountered them in the wild. It also seems odd to assume that people will use a software feature in just one way, especially the intended way. The history of media is also the history of subverting media.

Another problem with the analysis has to do with the specifics of researching Mastodon, a federated social media network. The researchers didn’t collect data from one place, but from 363 different places. Examples include an instance for sex workers and their allies and an invite-only instance for sharing art. As you can probably guess, what is considered appropriate in these two places (and the other 361 instances) is wildly different. One of the central points of federated social media is to allow communities to define for themselves what kind of behaviour they like to see, leading to different policies on what is acceptable on different instances. Examining what is appropriate on Mastodon by collecting data from all the available servers makes no sense, since there is no single, shared culture of “appropriate”.
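A small sketch in Python shows how pooling can mislead. The instance names and posts below are invented for illustration, and the helper functions are my own, not anything from the paper. The point is only that a single pooled number can describe none of the communities it was computed from.

```python
from collections import defaultdict

# Invented records: (instance name, whether the post has a content warning).
posts = [
    ("art.example", False), ("art.example", False),
    ("art.example", True), ("art.example", False),
    ("support.example", True), ("support.example", True),
    ("support.example", True), ("support.example", False),
]


def cw_rate(records):
    """Share of posts in `records` that carry a content warning."""
    records = list(records)
    return sum(1 for _, has_cw in records if has_cw) / len(records)


def cw_rate_by_instance(records):
    """Compute the content warning rate separately for each instance."""
    grouped = defaultdict(list)
    for instance, has_cw in records:
        grouped[instance].append((instance, has_cw))
    return {instance: cw_rate(group) for instance, group in grouped.items()}


pooled = cw_rate(posts)
per_instance = cw_rate_by_instance(posts)
```

In this toy data the pooled rate is 0.5, while the two instances sit at 0.25 and 0.75. Neither community behaves like the "average" Mastodon that the pooled figure suggests.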

I hope I’m not being unfair by focusing on this one research paper. It’s not the only published piece of research in the world that is wrong. But I found it to be wrong in two interesting ways: by showing how difficult collecting online data can be, and by showing how challenging it is to analyse a context you’re not intimately familiar with. I hope this research paper can work as an example for future researchers.

  1. I’m writing about Mastodon here because that’s what the researchers did. However, Mastodon is only one piece of software among many that can be used to access the Fediverse, a collection of compatible social media services. 

  2. I actually use an automated service to delete all my old tweets. There are very few uses for them, and the most common one on Twitter is to mine them for harassment material.