Collecting data from Twitter REST Search API using Python

Twitter is a huge resource of raw data that you can explore to find interesting insights. Millions of companies, politicians, journalists, and ordinary people use this social media platform to interact with their audiences, and mining its data can teach you a lot about trends and users.

This post is the first in a series dedicated to mining Twitter data using Python. In this first part, I focus on how to collect data from the Twitter REST Search API using Python to build our dataset; in the following parts, I'll show how to process and analyze the data.

Table of Contents:
  1. Collecting data from Twitter REST Search API using Python
  2. Pre-processing Twitter Data using Python
  3. Making sense of Twitter data using NLP

Register your application

In order to access Twitter data programmatically, we first need to create an app that interacts with the Twitter REST Search API. To do so, follow this link http://apps.twitter.com and register an app (you need to be logged in to your Twitter account).

Twitter API

Twitter provides a couple of API endpoints to serve the data. In this post, we'll discuss the REST Search API and the Streaming filter API; however, I've used the Twitter REST Search API in this tutorial. The first API enables searching historical tweets from Twitter's search index (going back up to 7 days for standard free accounts), while the second is real-time and starts giving you results from the moment you issue the query. The choice of the suitable API endpoint depends on your application and your needs.

The Twitter REST Search API has a few limitations that you need to be aware of before you start using it. In this post, I'll cover some techniques you can use to maximize the resources the API gives you.

First, the Twitter REST Search API has these limitations:

            “Before getting involved, it’s important to know that the Search API is focused on relevance and not completeness. This means that some tweets and users may be missing from search results. If you want to match for completeness, you should consider using a Streaming API instead.”

            Reference: the Search API

           “Please note that Twitter’s search service and, by extension, the Search API is not meant to be an exhaustive source of tweets. Not all tweets will be indexed or made available via the search interface.”   

           Reference: GET /search/tweets API

Also, Twitter has set rate limits on its APIs; you can find them detailed on the API Rate Limit page. There, we can see that the Search API is limited to 180 requests per 15-minute window with per-user authentication. Doing some math, using the Search API with the user authentication method gives us at most 18,000 tweets per 15-minute window, since every request is limited to 100 tweets for the Standard Search API.

The Twitter REST Search API provides another authentication method dedicated to applications. This method has higher usage limits, precisely up to 450 requests per 15-minute window.
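To put the two ceilings side by side, here is the back-of-the-envelope arithmetic (plain Python, no API calls involved):

# theoretical ceilings of the Standard Search API (100 tweets per request)
TWEETS_PER_REQUEST = 100
USER_AUTH_REQUESTS = 180   # per-user authentication, per 15-minute window
APP_AUTH_REQUESTS = 450    # application-only authentication, per 15-minute window

print(USER_AUTH_REQUESTS * TWEETS_PER_REQUEST)   # 18000 tweets / 15 min
print(APP_AUTH_REQUESTS * TWEETS_PER_REQUEST)    # 45000 tweets / 15 min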

The code sample below shows how to use App Auth with the Twython library; there are also client libraries available in many other languages that you can use without reinventing the wheel:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from twython import Twython, TwythonError
import configparser
import sys

config = configparser.ConfigParser()

# reading APP_KEY and APP_SECRET from the config.ini file
# update it with your credentials before running the app
config.read('config.ini')

APP_KEY = config['twitter_connections_details']['APP_KEY']
APP_SECRET = config['twitter_connections_details']['APP_SECRET']

try:
    # App Auth (OAuth 2): exchange the app credentials for a bearer token,
    # then build the client around that token
    twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)
    ACCESS_TOKEN = twitter.obtain_access_token()
    api = Twython(APP_KEY, access_token=ACCESS_TOKEN)
except TwythonError as e:
    print('Authentication failed: ' + str(e))
    sys.exit(-1)

The api variable is now our entry point for the operations we can perform with Twitter.
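As a quick, optional sanity check, you can issue one small search and count the statuses that come back (the keyword 'python' below is an arbitrary example, not part of this tutorial's query):

# one small request to confirm that the client is properly authenticated
result = api.search(q='python', count=5)
print('Received {0} tweets'.format(len(result['statuses'])))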

As an example for this tutorial, I've picked a specific geographical zone for my search: my town, Montreal. The map below shows the selected search zone.

To filter the results of the API, we need to send a query. In this example, I simply gave the coordinates (latitude, longitude) of a central point with a radius of 12 miles, because I'm interested in receiving all kinds of tweets within the specified region. Below is the search query I used:

search_query = "geocode:45.533471,-73.552805,12mi"

You can perform a search using other filters; all the available parameters of the API are documented on the Search API page.
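For illustration only, the sketch below combines the geocode filter with a keyword operator and a couple of the optional parameters; the keyword "festival" and the parameter values are arbitrary examples, not part of this tutorial's query:

# keywords and operators can be combined with the geocode filter in the same query,
# e.g. tweets mentioning "festival", excluding retweets, around the same central point
richer_query = 'festival -filter:retweets geocode:45.533471,-73.552805,12mi'

tweets_fetched = api.search(q=richer_query,
                            count=100,             # up to 100 tweets per request
                            lang='fr',             # restrict results to a given language
                            result_type='recent')  # 'recent', 'popular' or 'mixed'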

Unlike the Stream API, which gives you the data as soon as it is available and matches your search criteria, the Search API is stateless, which means it doesn't carry the search context over repeated calls. What we need here is to give the server information about the last batch of results received, so that it can send the next batch (this is the pagination concept). The Search API accepts two parameters, max_id and since_id, which serve as the upper and lower bounds of the unique IDs of the tweets: max_id returns only tweets with an ID less than or equal to the given value, and since_id returns only tweets more recent than the given ID.

Below, you find the code I’ve used:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from twython import Twython, TwythonError
import configparser
import sys
import json

config = configparser.ConfigParser()

# reading APP_KEY and APP_SECRET from the config.ini file
# update it with your credentials before running the app
config.read('config.ini')

APP_KEY = config['twitter_connections_details']['APP_KEY']
APP_SECRET = config['twitter_connections_details']['APP_SECRET']

try:
    # App Auth (OAuth 2): exchange the app credentials for a bearer token,
    # then build the client around that token
    twitter = Twython(APP_KEY, APP_SECRET, oauth_version=2)
    ACCESS_TOKEN = twitter.obtain_access_token()
    api = Twython(APP_KEY, access_token=ACCESS_TOKEN)
except TwythonError as e:
    print('Authentication failed: ' + str(e))
    sys.exit(-1)

tweets = []   # list where the collected tweets are stored
tweets_per_request = 100             # maximum allowed per request by the Standard Search API
max_tweets_to_be_fetched = 10000     # number of tweets you want to fetch (optional)

search_query = "geocode:45.533471,-73.552805,12mi"

max_id = None     # upper bound: only tweets with an ID <= max_id are returned
since_id = None   # lower bound: only tweets with an ID > since_id are returned
count_tweets = 0

while count_tweets < max_tweets_to_be_fetched:    # keep calling until we reach the desired number of tweets
    try:
        if not max_id:   # first page of results: no upper bound yet
            if not since_id:
                tweets_fetched = api.search(q=search_query, count=tweets_per_request)
            else:
                tweets_fetched = api.search(q=search_query, count=tweets_per_request, since_id=since_id)
        else:            # following pages: only ask for tweets older than the ones already fetched
            if not since_id:
                tweets_fetched = api.search(q=search_query, count=tweets_per_request, max_id=max_id)
            else:
                tweets_fetched = api.search(q=search_query, count=tweets_per_request, max_id=max_id, since_id=since_id)

        if not tweets_fetched['statuses']:
            print("No more tweets found")
            break

        for tweet in tweets_fetched['statuses']:
            tweets.append(tweet)

        count_tweets += len(tweets_fetched['statuses'])

        sys.stdout.write('\r Number of downloaded Tweets: {0} '.format(count_tweets))

        # the next page should contain only tweets strictly older than the oldest one received so far
        max_id = tweets_fetched['statuses'][-1]['id'] - 1

    except Exception as e:
        # just exit on any error (rate limit reached, network problem, ...)
        print("some error : " + str(e))
        break

# writing the tweets to a JSON file
with open('twitter_data_set.json', 'w') as file:
    json.dump(tweets, file)

The above code writes the data to the JSON file twitter_data_set.json. It uses the App Auth method, so it can fetch up to 45,000 tweets per 15-minute window. As you can see, I've used max_id to fetch the next page of the result set, and the loop keeps paginating until it reaches the requested number of tweets or exhausts all the available results.
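Once the file is written, you can reload the dataset in a later session. A minimal sketch (created_at and text are standard fields of the tweet objects returned by the API):

import json

# reload the dataset produced by the script above
with open('twitter_data_set.json') as file:
    tweets = json.load(file)

print('{0} tweets in the dataset'.format(len(tweets)))
if tweets:
    # each element is the raw tweet object as returned by the API
    print(tweets[0]['created_at'], tweets[0]['text'])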

If you want to use the max_id and/or since_id variables to fetch, for example, newer results than your last run, you can look up the ID of the most recent tweet you collected and pass it as since_id in your next run. The same logic is valid for max_id: say you aborted the execution of your code in the middle and want to resume from where it stopped; all you have to do is pass the ID of the oldest tweet you received (minus one, to avoid a duplicate) as max_id in your next run.
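One simple way to keep track of those two IDs between runs is to dump them to a small state file at the end of each run. The sketch below assumes the tweets list built by the script above; the file name run_state.json is arbitrary:

import json

# after a run: remember the newest and the oldest tweet IDs collected
if tweets:
    state = {
        'newest_id': max(t['id'] for t in tweets),   # candidate since_id for the next run
        'oldest_id': min(t['id'] for t in tweets),   # candidate max_id to resume going back in time
    }
    with open('run_state.json', 'w') as state_file:
        json.dump(state, state_file)

# before the next run: reload the state and initialise the bounds
with open('run_state.json') as state_file:
    state = json.load(state_file)

since_id = state['newest_id']     # fetch only tweets newer than the last run
max_id = state['oldest_id'] - 1   # or resume an aborted backward crawl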

You can find the code in my Git repository; you just need to put your own consumer key and consumer secret credentials in the config.ini file:

Twitter Tutorial
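For reference, the config.ini file read by the scripts above is expected to look roughly like this (the section and key names come from the code; the values are placeholders to replace with your own credentials):

[twitter_connections_details]
APP_KEY = your_consumer_key_here
APP_SECRET = your_consumer_secret_here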

Summary

In this blog post, I've explained how you can interact with the Twitter REST Search API using Twython. I also highlighted some of its limitations and how to use it optimally. Following these tips, you can extract more tweets at a faster rate.

We have now extracted our dataset; in the next blog posts, I'll cover data processing and data analysis. Stay tuned!

Thanks for reading,

@Merouane_Benth