Twitter scraper selenium

A Python package for scraping Twitter's front end easily with Selenium.

Table of Contents

Getting Started
- Prerequisites
- Installation
  - Installing from source
  - Installing with PyPI
Usage
- Available Functions in this Package - Summary
- Scraping a profile's details
Privacy
License

Prerequisites

Internet Connection

Python 3.6+

Chrome or Firefox browser installed on your machine

Installation

Installing from source

Download the source code or clone it with:

git clone https://github.com/shaikhsajid1111/twitter-scraper-selenium

Open a terminal inside the downloaded folder and run:

 python3 setup.py install

Installing with PyPI

pip3 install twitter-scraper-selenium

Usage

Available Functions in this Package - Summary

Function Name	Function Description	Scraping Method	Scraping Speed
`scrape_profile()`	Scrapes a Twitter user's profile tweets.	Browser Automation	Slow
`get_profile_details()`	Scrapes a Twitter user's profile details.	HTTP Request	Fast
`scrape_profile_with_api()`	Scrapes a profile's tweets by username. It expects the username of the profile.	Browser Automation & HTTP Request	Fast

Note: The HTTP Request method sends requests directly to Twitter's API to scrape data, while the Browser Automation method visits the page and scrolls while collecting the data.

To scrape a Twitter profile's details:

from twitter_scraper_selenium import get_profile_details

twitter_username = "TwitterAPI"
filename = "twitter_api_data"
browser = "firefox"
headless = True
get_profile_details(twitter_username=twitter_username, filename=filename, browser=browser, headless=headless)

Output:

{
	"id": 6253282,
	"id_str": "6253282",
	"name": "Twitter API",
	"screen_name": "TwitterAPI",
	"location": "San Francisco, CA",
	"profile_location": null,
	"description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
	"url": "https:\/\/t.co\/8IkCzCDr19",
	"entities": {
		"url": {
			"urls": [{
				"url": "https:\/\/t.co\/8IkCzCDr19",
				"expanded_url": "https:\/\/developer.twitter.com",
				"display_url": "developer.twitter.com",
				"indices": [
					0,
					23
				]
			}]
		},
		"description": {
			"urls": []
		}
	},
	"protected": false,
	"followers_count": 6133636,
	"friends_count": 12,
	"listed_count": 12936,
	"created_at": "Wed May 23 06:01:13 +0000 2007",
	"favourites_count": 31,
	"utc_offset": null,
	"time_zone": null,
	"geo_enabled": null,
	"verified": true,
	"statuses_count": 3656,
	"lang": null,
	"contributors_enabled": null,
	"is_translator": null,
	"is_translation_enabled": null,
	"profile_background_color": null,
	"profile_background_image_url": null,
	"profile_background_image_url_https": null,
	"profile_background_tile": null,
	"profile_image_url": null,
	"profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
	"profile_banner_url": null,
	"profile_link_color": null,
	"profile_sidebar_border_color": null,
	"profile_sidebar_fill_color": null,
	"profile_text_color": null,
	"profile_use_background_image": null,
	"has_extended_profile": null,
	"default_profile": false,
	"default_profile_image": false,
	"following": null,
	"follow_request_sent": null,
	"notifications": null,
	"translator_type": null
}

get_profile_details() arguments:

Argument	Argument Type	Description
twitter_username	String	Twitter Username
output_filename	String	Name of the file where the output is stored.
output_dir	String	Directory where the output file is saved.
proxy	String	Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port.

Keys of the output:
Details of each key can be found here.
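
Once the profile data is available, usually only a few fields are needed. The following is a minimal sketch, assuming the data has the shape shown in the output above. `summarize_profile` is a hypothetical helper, not part of this package, and `sample_profile` is a trimmed copy of that output.

```python
from datetime import datetime

# Trimmed copy of the sample output above; a real run would load the saved file instead.
sample_profile = {
    "screen_name": "TwitterAPI",
    "followers_count": 6133636,
    "friends_count": 12,
    "created_at": "Wed May 23 06:01:13 +0000 2007",
    "verified": True,
}


def summarize_profile(profile):
    """Hypothetical helper: pick a few fields and parse the creation date."""
    # Timestamp layout as it appears in the output, e.g. "Wed May 23 06:01:13 +0000 2007".
    created = datetime.strptime(profile["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "screen_name": profile["screen_name"],
        "followers": profile["followers_count"],
        "following": profile["friends_count"],
        # Many keys can be null in the output, so fall back to False.
        "verified": bool(profile.get("verified")),
        "created_year": created.year,
    }


summary = summarize_profile(sample_profile)
print(summary)
```

Because many keys in the output can be null, reading optional fields with `.get()` avoids errors on missing or empty values.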

To scrape a profile's tweets:

In JSON format:

from twitter_scraper_selenium import scrape_profile

microsoft = scrape_profile(twitter_username="microsoft", output_format="json", browser="firefox", tweets_count=10)
print(microsoft)

Output:

{
  "1430938749840629773": {
    "tweet_id": "1430938749840629773",
    "username": "Microsoft",
    "name": "Microsoft",
    "profile_picture": "https://twitter.com/Microsoft/photo",
    "replies": 29,
    "retweets": 58,
    "likes": 453,
    "is_retweet": false,
    "retweet_link": "",
    "posted_time": "2021-08-26T17:02:38+00:00",
    "content": "Easy to use and efficient for all \u2013 Windows 11 is committed to an accessible future.\n\nHere's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW ",
    "hashtags": [],
    "mentions": [],
    "images": [],
    "videos": [],
    "tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
    "link": "https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC"
  },...
}

In CSV format:

from twitter_scraper_selenium import scrape_profile


scrape_profile(twitter_username="microsoft", output_format="csv", browser="firefox", tweets_count=10, filename="microsoft", directory="/home/user/Downloads")

Output:

tweet_id	username	name	profile_picture	replies	retweets	likes	is_retweet	retweet_link	posted_time	content	hashtags	mentions	images	videos	post_url	link
1430938749840629773	Microsoft	Microsoft	https://twitter.com/Microsoft/photo	64	75	521	False		2021-08-26T17:02:38+00:00	Easy to use and efficient for all – Windows 11 is committed to an accessible future. Here's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW	[]	[]	[]	[]	https://twitter.com/Microsoft/status/1430938749840629773	https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC

...
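
Values read back from a CSV file are all strings, so counts and list-like columns need converting. The following is a minimal sketch using only the standard library. It assumes a comma-separated file whose column names match the output above; the in-memory sample stands in for the saved file and carries only a subset of the columns.

```python
import ast
import csv
import io

# Stand-in for the saved file; a real run would use open("microsoft.csv", newline="").
# The comma-separated layout is an assumption; column names come from the output above.
sample_csv = io.StringIO(
    "tweet_id,username,replies,retweets,likes,is_retweet,hashtags\n"
    "1430938749840629773,Microsoft,64,75,521,False,[]\n"
)

rows = []
for row in csv.DictReader(sample_csv):
    # Every CSV value arrives as a string, so convert the numeric columns.
    for key in ("replies", "retweets", "likes"):
        row[key] = int(row[key])
    row["is_retweet"] = row["is_retweet"] == "True"
    # List-like columns are stored as text such as "[]"; literal_eval parses them safely.
    row["hashtags"] = ast.literal_eval(row["hashtags"])
    rows.append(row)

print(rows[0]["likes"], rows[0]["hashtags"])
```

The `tweet_id` column is deliberately left as a string, matching the JSON output.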

scrape_profile() arguments:

Argument	Argument Type	Description
twitter_username	String	Twitter username of the account
browser	String	Browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox.
proxy	String	Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port.
tweets_count	Integer	Number of posts to scrape. Default is 10.
output_format	String	Output format, either JSON or CSV. Default is JSON.
filename	String	Name of the output file when output_format is set to CSV. If not passed, the filename is the same as the username passed.
directory	String	Directory for the output file when output_format is set to CSV. If not passed, the CSV file is saved in the current working directory.
headless	Boolean	Whether to run the crawler headlessly. Default is `True`.

Keys of the output

Key	Type	Description
tweet_id	String	Post identifier (an integer cast to a string)
username	String	Username of the profile
name	String	Name of the profile
profile_picture	String	Profile picture link
replies	Integer	Number of replies to the tweet
retweets	Integer	Number of retweets of the tweet
likes	Integer	Number of likes on the tweet
is_retweet	Boolean	Whether the tweet is a retweet
retweet_link	String	The retweet link if the tweet is a retweet, otherwise an empty string
posted_time	String	Time when the tweet was posted, in ISO 8601 format
content	String	Content of the tweet as text
hashtags	Array	Hashtags present in the tweet, if any
mentions	Array	Mentions present in the tweet, if any
images	Array	Image links, if present in the tweet
videos	Array	Video links, if present in the tweet
tweet_url	String	URL of the tweet
link	String	Link to an external website, if one is present in the tweet
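
The keys above make simple post-processing possible. The following is a minimal sketch, assuming the JSON result is text that `json.loads` can parse and that it is keyed by tweet ID as in the output shown earlier. `top_tweets` is a hypothetical helper, not part of this package, and `sample_json` is a trimmed stand-in for a real result.

```python
import json
from datetime import datetime

# Trimmed stand-in for the JSON text shown in the output above.
sample_json = """
{
  "1430938749840629773": {
    "tweet_id": "1430938749840629773",
    "replies": 29,
    "retweets": 58,
    "likes": 453,
    "is_retweet": false,
    "posted_time": "2021-08-26T17:02:38+00:00"
  }
}
"""


def top_tweets(json_text, limit=5):
    """Hypothetical helper: rank original tweets by likes + retweets + replies."""
    tweets = json.loads(json_text).values()
    originals = [t for t in tweets if not t["is_retweet"]]
    ranked = sorted(
        originals,
        key=lambda t: t["likes"] + t["retweets"] + t["replies"],
        reverse=True,
    )
    return ranked[:limit]


best = top_tweets(sample_json)[0]
# posted_time is ISO 8601, so datetime.fromisoformat can parse it (Python 3.7+).
posted = datetime.fromisoformat(best["posted_time"])
print(best["tweet_id"], posted.date())
```

Note that `datetime.fromisoformat` requires Python 3.7 or newer; on Python 3.6, `datetime.strptime` with an explicit format is an alternative.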

To scrape a profile's tweets with the API:

from twitter_scraper_selenium import scrape_profile_with_api

scrape_profile_with_api('elonmusk', output_filename='musk', tweets_count=100)

scrape_profile_with_api() Arguments:

Argument	Argument Type	Description
username	String	Twitter profile username
tweets_count	Integer	Number of tweets to scrape.
output_filename	String	Name of the file where the output is stored.
output_dir	String	Directory where the output file is saved.
proxy	String	Optional. Proxy to use for scraping. For an authenticated proxy, the format is username:password@host:port.
browser	String	Browser to use for extracting the GraphQL key. Default is Firefox.
headless	Boolean	Whether to run the browser in headless mode.

Output:

{
  "1608939190548598784": {
    "tweet_url" : "https://twitter.com/elonmusk/status/1608939190548598784",
    "tweet_details":{
      ...
    },
    "user_details":{
      ...
    }
  }, ...
}
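
The output above is keyed by tweet ID, so it can be reduced to a simple ID-to-URL mapping. The following is a minimal sketch, assuming the output is saved as a JSON file with that top-level shape. `load_tweet_urls` is a hypothetical helper, and the file name `musk.json` is an assumption; check the name the scraper actually writes.

```python
import json
import tempfile
from pathlib import Path

# Stand-in with the same top-level shape as the output above; nested details left empty.
sample = {
    "1608939190548598784": {
        "tweet_url": "https://twitter.com/elonmusk/status/1608939190548598784",
        "tweet_details": {},
        "user_details": {},
    }
}


def load_tweet_urls(path):
    """Hypothetical helper: map each tweet ID to its URL from a saved JSON file."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return {tweet_id: entry["tweet_url"] for tweet_id, entry in data.items()}


with tempfile.TemporaryDirectory() as tmp:
    # "musk.json" is an assumed file name used only for this demonstration.
    path = Path(tmp) / "musk.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    urls = load_tweet_urls(path)

print(urls)
```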

Using the scraper with a proxy (HTTP proxy)

Pass the proxy argument to the function.

from twitter_scraper_selenium import scrape_profile

scrape_profile("elonmusk", headless=False, proxy="66.115.38.247:5678", output_format="csv", filename="musk")  # proxy in IP:PORT format

Proxy that requires authentication:

from twitter_scraper_selenium import scrape_profile

microsoft_data = scrape_profile(twitter_username="microsoft", browser="chrome", tweets_count=10, output_format="json",
                      proxy="sajid:<password>@<IP>:5678")  #  username:password@IP:PORT
print(microsoft_data)
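
Building the proxy string by hand is easy to get wrong. The following is a minimal sketch of a helper that produces both formats shown above, `IP:PORT` and `username:password@IP:PORT`. `build_proxy` is a hypothetical function, not part of this package, and the second address and credentials are made-up illustrative values.

```python
def build_proxy(host, port, username=None, password=None):
    """Hypothetical helper: build a proxy string in the formats shown above."""
    address = f"{host}:{port}"
    if username is None and password is None:
        # Unauthenticated proxy: plain IP:PORT.
        return address
    if not username or not password:
        raise ValueError("username and password must be given together")
    # Authenticated proxy: username:password@IP:PORT.
    return f"{username}:{password}@{address}"


plain = build_proxy("66.115.38.247", 5678)
authenticated = build_proxy("10.0.0.1", 8080, username="user", password="secret")
print(plain)
print(authenticated)
```

The resulting string can then be passed as the proxy argument of the scraping functions.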