X, formerly known as Twitter, has updated its terms of service (again) to explicitly prohibit the scraping and crawling of data on the platform without prior written consent.
The updated terms, which go into effect on September 29, 2023, introduce strict controls against unauthorized data collection. They come just eight days after X changed its privacy policy to state that the platform will begin collecting users’ biometric data, along with their professional and employment history.
The previous version of the terms allowed crawling as long as it met the guidelines outlined in the robots.txt file – an instruction file given to “crawlers” (or programs) about which parts of a website they are allowed to visit. However, the revised terms have eliminated this provision, requiring explicit written permission from X for any kind of scraping or crawling.
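To illustrate how a crawler typically consults such a file, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the `example.com` URLs are all invented for illustration and are not X's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example (not X's real file).
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network request)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)
public_ok = is_allowed(parser, "ExampleBot", "https://example.com/public/page")
private_ok = is_allowed(parser, "ExampleBot", "https://example.com/private/data")
print(public_ok, private_ok)
```

A well-behaved crawler would run a check like this before each request. Note that robots.txt is only a convention that crawlers choose to honor, which is a separate matter from what a site's terms of service permit.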
Web crawling vs web scraping
Although the two are very similar, they serve different purposes.
Web crawling follows links from one web page to another to build indexes or collections of data, while web scraping downloads web pages to extract a specific set of data for analysis, e.g. product details, pricing information, SEO data, etc.
Essentially, “web scraping” extracts publicly available data from a website and saves it to a local file or folder on your computer, using a “scraper” program that searches for the specific set of data the user is looking for. “Web crawling,” by contrast, discovers target URLs or other links for the purpose of creating one or more indexes, and can supply a scraper with additional targets.
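The difference can be sketched in a few lines of Python using only the standard library. The sample HTML, the `price` class name, and the `example.com` base URL below are made up for illustration: the crawler-style pass collects links to visit next, while the scraper-style pass pulls out one specific field.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up page content, used in place of a real download.
SAMPLE_HTML = """
<html><body>
  <a href="/products/1">Widget</a>
  <a href="https://example.com/about">About</a>
  <span class="price">19.99</span>
  <span class="price">5.00</span>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Crawler-style pass: discover URLs that could be visited next."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's base URL.
                self.links.append(urljoin(self.base_url, href))


class PriceExtractor(HTMLParser):
    """Scraper-style pass: extract one targeted field (hypothetical 'price' spans)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(float(data.strip()))


def discover_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def extract_prices(html):
    extractor = PriceExtractor()
    extractor.feed(html)
    return extractor.prices


print(discover_links(SAMPLE_HTML, "https://example.com/"))
print(extract_prices(SAMPLE_HTML))
```

In practice the two are often combined: a crawler feeds newly discovered URLs to a scraper, which extracts the data of interest from each page.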
Data scraping is one of the most efficient ways to extract data from the internet. The scraper needs an internet connection to fetch the pages, but the saved data can then be analyzed offline.
In conjunction with the updated terms of service, X recently made changes to its robots.txt file. This file gives web crawlers, including Google’s, instructions about which parts of the site they are allowed to access. The changes restrict access to specific types of data, including the likes and retweets associated with individual posts, and account-level information such as an account’s likes, media, and photos.
The decision to tighten restrictions on scraping and data access follows X’s recent platform changes. Those changes included temporarily preventing logged-out users from viewing posts, and then lifting the login requirement for viewing tweets.
X owner Elon Musk said these measures were needed in response to excessive data scraping, which was degrading the platform’s performance for regular users.
Musk has previously spoken out against companies collecting Twitter/X data to train AI models. He has also issued a legal threat against Microsoft, alleging that it was illegally using the platform’s data for AI training.
In July, Musk initiated legal action against “John Doe” defendants involved in unauthorized data collection.
The impact of these stringent measures on data accessibility and X’s relationship with web crawlers, including those of tech giants like Google, remains to be seen.
Editor’s note: This article was written by a staff member of nft now in collaboration with OpenAI’s GPT-3.