Reddit is suing AI company Perplexity and others for scraping user comments on an industrial scale

Social media platform Reddit sued artificial intelligence company Perplexity AI and three other entities on Wednesday, alleging their involvement in an “industrial-scale, illegal” economy to “scrape” the comments of millions of Reddit users for commercial gain. Reddit’s lawsuit in federal court in New York targets San Francisco-based Perplexity, maker of an AI chatbot and “answer engine” that competes in online search with Google, ChatGPT and others. Also named in the lawsuit are Lithuanian data-scraping company Oxylabs UAB, a web domain called AWMProxy that Reddit describes as a “former Russian botnet,” and Texas-based startup SerpApi, which lists Perplexity as a client on its website. This is Reddit’s second such lawsuit since it sued another major AI company, Anthropic, in June. But the lawsuit filed Wednesday is different in the way it confronts not just an AI company, but the lesser-known services the AI industry relies on to obtain online scripts needed to train AI chatbots. “Scrapers bypass technological protections to steal data, then sell it to customers hungry for training material. Reddit is a prime target because it is one of the largest and most dynamic collections of human conversation ever created,” Ben Lee, Reddit’s chief legal officer, said in a statement Wednesday. Perplexity said it has not yet received the lawsuit, but “will always vigorously fight for users’ rights to freely and fairly access public knowledge. Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest.” SerpApi’s director of customer success, Ryan Schafer, said in an email: “We disagree with Reddit’s allegations and intend to vigorously defend ourselves in court.” Oxylabs did not immediately respond to a request for comment. AWMProxy could not immediately be reached for comment. Reddit compares the companies it’s suing to “would-be bank robbers” who can’t get into the bank vault, so they break into the armored truck instead. The lawsuit claims they circumvent Reddit’s own anti-scraping measures, while also “bypassing Google’s controls and scraping Reddit content directly from Google’s search engine results.” Lee said that because they are unable to scrape Reddit directly, they “mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search. Perplexity is a willing client of at least one of these scrapers, which chooses to buy stolen data rather than enter into a legal agreement with Reddit itself.” Reddit made a similar argument in its lawsuit against Anthropic, alleging that the company ignored Reddit’s appeals to stop using its content. That case was initially filed in California Superior Court, but was later moved to federal court and has a trial scheduled for January. Along with digitized books and news articles, websites like Wikipedia and Reddit are deeply collected written material that can help teach an AI assistant the patterns of human language. Reddit has previously entered into licensing agreements with Google, OpenAI and other companies that pay to train their AI systems on the public comments of Reddit’s more than 100 million daily users. The licensing deals helped the 20-year-old online platform raise money ahead of its Wall Street debut as a publicly traded company last year.