Using science to reform toxic player behavior in League of Legends
Riot Games decided that just banning players wasn’t good enough.
Riot Games founders and League of Legends creators Brandon Beck and Marc Merrill have encountered bad behavior in massively multiplayer online games since the days of Ultima Online and EverQuest. In all that time, the typical moderator response to the all-too-common racial epithets, homophobic remarks, and bullying that borders on psychological abuse in MMOs has been to simply ban the players and move on. League of Legends definitely could have afforded to go the same route, bleeding off a few bad apples from its 12 million daily players and 32 million active monthly players (as of late 2012) without really affecting the bottom line.
But Beck and Merrill decided that simply banning toxic players wasn’t an acceptable solution for their game. Riot Games began experimenting with more constructive modes of player management through a formal player behavior initiative that actually conducts controlled experiments on its player base to see what helps reduce bad behavior. The results of that initiative have been shared at a lecture at the Massachusetts Institute of Technology and on panels at the Penny Arcade Expo East and the Game Developers Conference.
Prior to the launch of the formal initiative, Riot introduced "the Tribunal" to League of Legends in May of 2011. The Tribunal is basically a community-based court system where the defendants are players who have a large number of reports filed against them by other players. League players can log in to the Tribunal and see the cases that have been created against those players, viewing evidence in the form of sample chat logs and commentary from the players who filed the reports.
Cases in the Tribunal were evaluated independently by both player juries and staff from Riot Player Support. In over a year’s worth of cases, Riot found that the community verdict agreed with the decision of the staff moderators 80 percent of the time. The other 20 percent of the time, the players were more lenient than Riot’s staff would have been (players were never harsher than the staffers).
Riot’s takeaway from the Tribunal experiment was that League players were not only unwilling to put up with toxic behavior in their community, but they were willing to be active participants in addressing the problem. This success inspired Riot to assemble a team of staffers that would make up its formal player behavior initiative, launched just over a year ago.
Jeffrey Lin holds a PhD in Cognitive Neuroscience from the University of Washington. Before joining Riot’s player behavior team and eventually becoming lead designer of Social Systems, Lin worked at Valve Software with experimental psychologist Mike Ambinder, conducting research on games like Left 4 Dead 2 and DOTA 2. The other founding members of the player behavior team were Renjie Li, who holds a PhD in Brain and Cognitive Sciences from the University of Rochester, and Davin Pavlas, who holds a PhD in Human Factors Psychology from the University of Central Florida.
All three doctors are hardcore gamers, a necessary prerequisite for the core team members. “A big part of Riot Games in general is we want to be the most player-focused game company in the world,” Riot producer T. Carl Kwoh told Ars Technica. “Part of that player focus is really understanding that experience and living and breathing that experience.”
Before their experiments could go forward, the team had to create some sort of baseline for what constituted bad behavior in player chat rooms. This meant hand-coding thousands of chat logs and designating each line as positive, neutral, or negative. “Going through that exercise once has provided us with good data that we can rely on as far as intuition goes,” Kwoh said. The player behavior team can now categorize the nature of chat logs quickly.
With this hand-coding system in place, the Riot team ran a small experiment on its player base: it switched cross-team chat, which lets players broadcast a message to everyone on the opposing team, to off by default. Players could still turn the feature on, but they had to actively go into the settings to do so.
Riot then compared the quality of the chat log in the week before and the week after the switch and noted a more than 30 percent swing from negative coded messages to positive ones. And it wasn’t just because fewer people were chatting across teams, either—overall usage of the feature stayed about the same, even after the switch.
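A before-and-after comparison of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `label_shares` helper and the sample labels are invented for the example, and the article does not describe how Riot actually computed its figure.

```python
from collections import Counter

LABELS = ("positive", "neutral", "negative")

def label_shares(labels):
    """Return the fraction of chat lines carrying each hand-coded label."""
    counts = Counter(labels)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}

# Made-up hand-coded samples, not Riot's data.
before = ["negative"] * 5 + ["neutral"] * 3 + ["positive"] * 2
after = ["negative"] * 2 + ["neutral"] * 3 + ["positive"] * 5

shares_before = label_shares(before)
shares_after = label_shares(after)

# Change in each label's share of all coded lines, week over week.
for label in LABELS:
    change = shares_after[label] - shares_before[label]
    print(f"{label}: {change:+.0%}")
```

Comparing shares rather than raw counts keeps the result meaningful even if total chat volume changes between the two weeks.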
Riot also conducted a player-centric analysis of its wealth of chat log data to try to generate an automatic profile that could separate good players from bad players. “One of the cooler things we did is we took our whole player base and categorized the players who are known for toxic behaviors, all the players who are known for positive behaviors, and we can cross-correlate all the words that both populations use,” Lin said. “Any words in common we filter out of the dictionaries. What you’re left with is a dictionary for all the words the bad players use that good players don’t use.”
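The filtering step Lin describes amounts to removing the shared vocabulary from each group's word list. A minimal Python sketch of that idea follows, assuming simple whitespace tokenization; the function name and the sample chat lines are hypothetical and are not Riot's code or data.

```python
def exclusive_vocabularies(toxic_lines, positive_lines):
    """Return (toxic_only, positive_only) word sets with shared words removed."""
    toxic_words = {w for line in toxic_lines for w in line.lower().split()}
    positive_words = {w for line in positive_lines for w in line.lower().split()}
    shared = toxic_words & positive_words  # words both populations use
    return toxic_words - shared, positive_words - shared

# Made-up chat lines for illustration only.
toxic_only, positive_only = exclusive_vocabularies(
    ["uninstall noob", "gg noob team"],
    ["gg well played", "nice gank alex"],
)

print(sorted(toxic_only))     # words only the toxic sample used
print(sorted(positive_only))  # words only the positive sample used
```

In this toy data, "gg" appears on both sides and is dropped, while a first name such as "alex" survives in the positive dictionary, which mirrors the pattern Lin reports.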
The dictionary of common words for bad players falls along depressingly predictable lines, with heavy weighting towards racial and homophobic slurs. The dictionary for good players, however, turned up an inspiring data point. “The first 500 or so words were real life names,” Lin said. “The best experiences are when you trust the other person because they’re a real life friend.”
This kind of chat analysis also makes it possible to predict which players are likely to run into problems in the game even before they show truly bad behavior or generate complaints. “It turns out that if you use the dictionaries, you can predict if a player will show bad behavior with up to 80 percent accuracy from just one game’s chat log,” Lin said. Of course, Riot doesn’t intend to use this kind of predictive modeling to take pre-emptive actions on players who may be heading down the path to undesirable behavior, saying that would contradict the real spirit of the company’s efforts.
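One simple way to turn such dictionaries into a prediction is to count how many of a game's words fall into each one. The Python sketch below is an assumption about how that could work, not a description of Riot's model: the scoring rule, the 0.5 threshold, and the toy dictionaries are all hypothetical.

```python
def toxicity_score(chat_lines, toxic_only, positive_only):
    """Fraction of dictionary hits that come from the toxic-only dictionary."""
    words = [w for line in chat_lines for w in line.lower().split()]
    toxic_hits = sum(w in toxic_only for w in words)
    positive_hits = sum(w in positive_only for w in words)
    total = toxic_hits + positive_hits
    # With no dictionary hits at all there is no evidence either way.
    return toxic_hits / total if total else 0.5

def predict_toxic(chat_lines, toxic_only, positive_only, threshold=0.5):
    """Flag a single game's chat log when its score exceeds the threshold."""
    return toxicity_score(chat_lines, toxic_only, positive_only) > threshold

# Toy dictionaries and chat logs, invented for this example.
toxic_words = {"noob", "uninstall"}
positive_words = {"nice", "thanks", "alex"}

print(predict_toxic(["uninstall noob", "nice"], toxic_words, positive_words))
print(predict_toxic(["thanks alex", "nice one"], toxic_words, positive_words))
```

A real system would need to be validated against labeled games to know how often such a rule is right; nothing here reproduces the accuracy figure Lin cites.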
“The core philosophy of the player behavior team is to work with the players and collaborate with the players. We want to build systems and tools to allow the players to hold each other accountable or to provide more social norms to online society,” Lin said.
The team’s next experiment was to add "Reform Cards" to the Tribunal system. Previously, the Tribunal could deliver bans of various lengths to players, but the banned player wouldn’t know why they were punished. Reform Cards show the punished players the same chat logs and other information that was presented during their Tribunal case and add statistics on the level of agreement among the player judges on the recommended punishment.
Simply giving this information to punished players seems to have led to a distinct reduction in post-punishment reports of bad behavior, as seen in the following graph.
It might seem a bit odd that players with a seven-day ban saw less improvement than those with a three-day ban, but the player behavior team has its own theories on that result. “When you have a three-day ban you can probably be like, ‘Oh, it’s in the middle of the week, whatever, three days is not a big deal,’” said Kwoh. “Whereas seven days you’re almost…”
“You’re almost guaranteed to have your actual play interrupted,” Lin interjected.
“Over a weekend, over your main playing period, I think [a seven-day ban] hits the sweet spot where people [become] a little bit more frustrated, a little bit less receptive to reform, unfortunately,” said Kwoh.
This would suggest that players with 14-day bans should be even less receptive and more likely to still be angry at their punishment once the ban is lifted, but the data shows otherwise. The player behavior team doesn’t have any hard data about why this is, but they do have some educated guesses. “I think that one 14-day [ban] is widely known as your last chance,” said Kwoh. “Being away from the game for that long a time—I think you sort of get the point.”
When the Tribunal was first launched, League players could earn in-game currency for judging cases as long as their judgments were sound. Some players were concerned that this provided the wrong kind of motivation for those who might judge cases, though. So the player behavior team decided to remove the currency rewards for 30 days, resulting in a 10 percent drop in active Tribunal judges. Then the team introduced a public "Justice Reviews" profile page showing personal judging metrics like the number of players someone has perma-banned and the number of “toxic days” they’ve helped prevent.
The result? “We saw a 100 percent increase in active Tribunal judges that was sustained after [the Justice Reviews profiles] launched,” Lin said. “We also saw that Tribunal judges completed 10 percent more cases daily after the launch of Justice Reviews.”
This experiment went further to prove that League players have a vested interest in working as partners with Riot to improve the quality of their community. Riot has heard the message and now has enough faith in the Tribunal system to let players enact some punishments directly without staffer involvement. “We’re at a point in the Tribunal’s lifespan where we are confident with the accuracy and rate of false positives and trust our players to make the right decisions in the vast majority of cases,” said Lin.
Punishing and reforming negative behavior is important, but the player behavior team also wanted to enable League players to reward one another for positive behaviors. This desire led to the Honor Initiative, which allows players to recognize positive behaviors like helpfulness, friendliness, good teamwork, and sportsmanlike play. The Honor status of every League player is displayed on their profile, encouraging players to work for these kudos from their competitors and teammates.
The most radical experiment Riot has performed on its players so far, though, is based on the psychological principle of priming. Basically, priming involves exposing people to a specific stimulus in order to influence how they’ll react to another stimulus later. In experiments, for example, groups of subjects that discussed the topic of rudeness might be more likely to interrupt an experimenter faster and more often than a control group that discussed the concept of politeness.
Riot Games decided to experiment with priming players by introducing game tips that would be shown to players on loading screens, during gameplay, or both. The tips were divided into five categories, including commentary on positive behavior (“If you cooperate with your teammates you’ll actually win more games.”), negative behavior (“If you engage in toxic behaviors you’ll be punished by our Tribunal system.”), and self-reflection (“Who will be the most sportsmanlike player in this particular game?”). These priming messages could also be displayed in red, blue, or white, a decision spurred by other priming experiments that showed font color could affect test performance in different ways in different cultures.
Despite the large number of variables (tip location, tip type, and color), League of Legends’ large player base meant the player behavior team could examine lots of permutations incredibly quickly. “If you look at the social sciences or neurosciences most studies are 2×2 designs or 2×3 [variable] designs,” Lin said. “That’s because in the lab you’re limited by the number of people you can get through your studies in time: it takes three months to do 20 subjects. But in League of Legends we can do these crazy designs with 217 unique conditions and get the data in a couple of days.”
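To see how quickly conditions multiply in a design like this, the three variables can be crossed in a short Python sketch. The level names are assumptions: the article names only three of the five tip categories, so the other two are left as placeholders, and the article does not break down how the 217 conditions were composed, so the count here is not meant to reproduce that figure. The hash-based assignment is likewise a hypothetical illustration, not Riot's method.

```python
import hashlib
from itertools import product

locations = ["loading_screen", "in_game", "both"]
tip_types = [
    "positive_behavior",
    "negative_behavior",
    "self_reflection",
    "unnamed_category_4",  # placeholder: not named in the article
    "unnamed_category_5",  # placeholder: not named in the article
]
colors = ["red", "blue", "white"]

# Full factorial crossing: every location with every tip type and color.
conditions = list(product(locations, tip_types, colors))
print(len(conditions))  # 3 * 5 * 3 = 45

def assign_condition(player_id, conditions):
    """Deterministically map a player ID to one condition via a hash."""
    digest = hashlib.sha256(player_id.encode("utf-8")).hexdigest()
    return conditions[int(digest, 16) % len(conditions)]

print(assign_condition("player-123", conditions))
```

Hashing the player ID keeps each player in the same condition across sessions without storing an assignment table, which is one common way to run experiments at this scale.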
Riot’s results showed clearly that in-game tips were much less effective at changing behavior than those shown on the loading screen. “As soon as players get into the game they open up the store and start immediately buying items, so in-game [tips] have lesser effects than the loading screen [tips],” Lin said. “During the loading screen you don’t really have much to do other than read the tips, but during in-game you’re already busy setting up your equipment and trying to get to your lanes.”
Other than that, the priming results were a little more difficult to pin down. A red loading-screen message about player abuse was more effective at curbing bad behavior than the same message in white, but a red message about sportsmanship actually produced results in the wrong direction, for instance.
While the priming experiment may open up more questions than it answers, it does show that Riot is trying to influence the behavior that feeds into the Tribunal and Honor systems. The player behavior team also wants to begin experimenting with match chemistry as another way to head off toxic behavior at the pass.
Riot Games hopes that by sharing these initiatives and their results, they can inspire similar efforts across the video game industry. “We’re starting to see sprinkles and pockets of other studios doing similar things now. That really excites us,” Lin said. “We’ve long realized that this isn’t necessarily a problem with online games and League of Legends only. It’s gamers in general and online societies in general. It’s more than just Riot jumping in and solving this problem. We need the players to be involved. They need to be a part of the solution [and] other studios have to get in and be involved as well.”
Thanks to Riot Games for allowing Ars Technica to use slides from its GDC presentation in this feature.
Dennis Scimeca is a freelance writer from Boston, MA. He is eager to test the effects of his tweets on your behavior: @DennisScimeca.