Pre-processing Twitter Data using Python

The main goal of collecting Twitter data is to make sense of it, whether that means gaining insights to make a decision, validating a theory, or simply exploring, as we are doing in this tutorial.

The data we receive from Twitter comes in JSON format with a lot of raw text inside it, and as you might know, unstructured data is hard to analyze. Before we start analyzing the data, we need to do some work on the dataset; this step is called data pre-processing. In this article, we’re going to see the basic steps to clean the text using Python.

Table of Contents:
  1. Collecting data from Twitter REST Search API using Python
  2. Pre-processing Twitter Data using Python
  3. Making sense of Twitter data using NLP

Tweet structure

Let’s first take a look at what a tweet looks like in JSON format:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import json

# Exploring tweet content
file = os.path.dirname(__file__) + '/../Archive/twitter_data_set.json'

with open(file, 'r', encoding='utf-8') as tweets_file:
    data = tweets_file.read()
    tweets = json.loads(data)
    print(json.dumps(tweets[1], indent=4, ensure_ascii=False))

The above code prints one tweet on the console. I’ve set ensure_ascii to False to keep the special characters readable.

{
    "created_at": "Mon Dec 11 13:23:51 +0000 2017",
    "id": 940210522468880386,
    "id_str": "940210522468880386",
    "text": "C'est Lundi et je fait mon rituel de calendrier de l'avent...sans marijuana comme les criminels de la Colombie-Britannique. ",
    "truncated": false,
    "entities": {
        "hashtags": [],
        "symbols": [],
        "user_mentions": [],
        "urls": []
    },
    "metadata": {
        "iso_language_code": "fr",
        "result_type": "recent"
    },
    "source": "<a href=\"http://www.twitter.com\" rel=\"nofollow\">Twitter for BlackBerry</a>",
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 333131633,
        "id_str": "333131633",
        "name": "Guylaine Lau-----",
        "screen_name": "GuylaineLau",
        "location": "Montreal,Quebec",
        "description": "RULE OF LAW ACTIVIST: JAIL CORRUPT LANDLORDS. ECONOMIC & ECOLOGIC PLANET FOR FUTUR. MISS MENSA OR NIKITA FOR ALL. https://t.co/lVjkZTnZ1T",
        "url": null,
        "entities": {
            "description": {
                "urls": [
                    {
                        "url": "https://t.co/lVjkZTnZ1T",
                        "expanded_url": "https://www.facebook.com/GuylaineLau",
                        "display_url": "facebook.com/GuylaineLau",
                        "indices": [
                            114,
                            137
                        ]
                    }
                ]
            }
        },
        "protected": false,
        "followers_count": 589,
        "friends_count": 1646,
        "listed_count": 45,
        "created_at": "Mon Jul 11 01:31:38 +0000 2011",
        "favourites_count": 1158,
        "utc_offset": -18000,
        "time_zone": "Eastern Time (US & Canada)",
        "geo_enabled": false,
        "verified": false,
        "statuses_count": 73483,
        "lang": "en",
        "contributors_enabled": false,
        "is_translator": false,
        "is_translation_enabled": false,
        "profile_background_color": "01090D",
        "profile_background_image_url": "http://abs.twimg.com/images/themes/theme14/bg.gif",
        "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme14/bg.gif",
        "profile_background_tile": true,
        "profile_image_url": "http://pbs.twimg.com/profile_images/537702543703306240/IiyF2NAD_normal.jpeg",
        "profile_image_url_https": "https://pbs.twimg.com/profile_images/537702543703306240/IiyF2NAD_normal.jpeg",
        "profile_link_color": "9266CC",
        "profile_sidebar_border_color": "EEEEEE",
        "profile_sidebar_fill_color": "EFEFEF",
        "profile_text_color": "333333",
        "profile_use_background_image": false,
        "has_extended_profile": false,
        "default_profile": false,
        "default_profile_image": false,
        "following": null,
        "follow_request_sent": null,
        "notifications": null,
        "translator_type": "regular"
    },
    "geo": null,
    "coordinates": null,
    "place": null,
    "contributors": null,
    "is_quote_status": false,
    "retweet_count": 0,
    "favorite_count": 0,
    "favorited": false,
    "retweeted": false,
    "lang": "fr"
}

As you can see from the tweet above, a single tweet carries a lot of data and metadata, which is good from an analytical standpoint because it gives us information that allows some interesting analysis.

Some of the interesting key attributes:

  • text: the content of the tweet itself
  • created_at: creation date of the tweet
  • id: unique ID of the tweet
  • user: the full profile of the author
  • followers_count: number of followers of the user
  • friends_count: number of friends of the user
  • lang: language of the tweet
  • retweet_count: number of retweets
  • favorite_count: number of favorites (likes)
  • entities: list of entities mentioned in the tweet, such as URLs, @mentions, #hashtags, and symbols
  • place, coordinates, geo: geo-location information, if available
  • in_reply_to_user_id: ID of the user the tweet replies to, if the tweet is a reply
  • in_reply_to_status_id: ID of the status the tweet replies to, if the tweet is a reply

As I said before, almost all of the available attributes give us additional context for analyzing the data. However, you probably don’t need all of them to conduct an analysis; it depends on the question you’re trying to answer. For example, if you want to know who the most followed user is, or who is discussing with whom, you need to focus your analysis on specific keys such as user, followers_count, in_reply_to_user_id, and in_reply_to_status_id.
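
For illustration, here’s a minimal sketch (assuming the tweets list loaded earlier in this article) that keeps only a few of these keys per tweet and looks up the most followed author in the dataset:

# A minimal sketch, assuming `tweets` is the list loaded earlier in this article
slim = [
    {
        'text': t['text'],
        'user': t['user']['screen_name'],
        'followers_count': t['user']['followers_count'],
        'in_reply_to_user_id': t['in_reply_to_user_id'],
    }
    for t in tweets
]

# The most followed author among the collected tweets
most_followed = max(slim, key=lambda t: t['followers_count'])
print(most_followed['user'], most_followed['followers_count'])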

Also, when it comes to the text key (the content of the tweet itself), we can conduct a lot of interesting analysis as well, e.g. the most popular hashtags, word frequencies, sentiment analysis, etc. But as I mentioned before, preparing the text is a necessary prior step.

Tweet text preparation

One of the most important techniques is tokenization, which means breaking the text down into words. The main idea is to split a large stream of text into small pieces called tokens (words) to make it easier to analyze.

In the Python world, there is a popular library for processing human-language text called NLTK (Natural Language ToolKit).
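
If NLTK isn’t set up in your environment yet, a quick setup might look like this (install the package with pip install nltk first; the punkt and stopwords resources are the ones used in the rest of this article):

import nltk

nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stop word lists used later in this article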

Let’s take an example of tokenizing a tweet:

from nltk.tokenize import word_tokenize
tweet = '@Merouane_Benth: This is just a tweet example! #NLTK :) http://www.twitter.com'
print(word_tokenize(tweet))
# results:
# ['@', 'Merouane_Benth', ':', 'This', 'is', 'just', 'a', 'tweet', 'example', '!', '#', 'NLTK', ':', ')', 'http', ':', '//www.twitter.com']

The word_tokenize function from NLTK makes the job pretty easy: it splits the text into words, as we want. However, if you take a closer look at the output, you will notice some odd results. I’m referring in particular to the emoticon, the @mention, the #hashtag, and the URL: they are not kept as single tokens. But why?

Well, Twitter data poses some challenges because of the nature of its language, and word_tokenize doesn’t capture these aspects out of the box. The good news is that NLTK has a tokenizer dedicated to Twitter data. Let’s run the same example with the TweetTokenizer class:

from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(tweet)
print(tokens)
# result: 
# ['@Merouane_Benth', ':', 'This', 'is', 'just', 'a', 'tweet', 'example', '!', '#NLTK', ':)', 'http://www.twitter.com']

As you can see, @mentions, #hashtags, emoticons, and URLs are now kept as individual tokens.

Now, if we take a closer look at this tokenizer to understand what happens behind the scenes, we find the following source code inside the class:

######################################################################
# The following strings are components in the regular expression
# that is used for tokenizing. It's important that phone_number
# appears first in the final regex (since it can contain whitespace).
# It also could matter that tags comes after emoticons, due to the
# possibility of having text like
#
#     <:| and some text >:)
#
# Most importantly, the final element should always be last, since it
# does a last ditch whitespace-based tokenization of whatever is left.

# ToDo: Update with http://en.wikipedia.org/wiki/List_of_emoticons ?

# This particular element is used in a couple ways, so we define it
# with a name:
EMOTICONS = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
      |
      <3                         # heart
    )"""

# URL pattern due to John Gruber, modified by Tom Winzig. See
# https://gist.github.com/winzig/8894715

URLS = r"""			# Capture 1: entire matched URL
  (?:
  https?:				# URL protocol and colon
    (?:
      /{1,3}				# 1-3 slashes
      |					#   or
      [a-z0-9%]				# Single letter or digit or '%'
                                       # (Trying not to match e.g. "URI::Escape")
    )
    |					#   or
                                       # looks like domain name followed by a slash:
    [a-z0-9.\-]+[.]
    (?:[a-z]{2,13})
    /
  )
  (?:					# One or more:
    [^\s()<>{}\[\]]+			# Run of non-space, non-()<>{}[]
    |					#   or
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (...(...)...)
    |
    \([^\s]+?\)				# balanced parens, non-recursive: (...)
  )+
  (?:					# End with:
    \([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (...(...)...)
    |
    \([^\s]+?\)				# balanced parens, non-recursive: (...)
    |					#   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]	# not a space or one of these punct chars
  )
  |					# OR, the following to match naked domains:
  (?:
  	(?<!@)			        # not preceded by a @, avoid matching foo@_gmail.com_
    [a-z0-9]+
    (?:[.\-][a-z0-9]+)*
    [.]
    (?:[a-z]{2,13})
    \b
    /?
    (?!@)			        # not succeeded by a @,
                            # avoid matching "foo.na" in "foo.na@example.com"
  )
"""

# The components of the tokenizer:
REGEXPS = (
    URLS,
    # Phone numbers:
    r"""
    (?:
      (?:            # (international)
        \+?[01]
        [\-\s.]*
      )?
      (?:            # (area code)
        [\(]?
        \d{3}
        [\-\s.\)]*
      )?
      \d{3}          # exchange
      [\-\s.]*
      \d{4}          # base
    )"""
    ,
    # ASCII Emoticons
    EMOTICONS
    ,
    # HTML tags:
    r"""<[^>\s]+>"""
    ,
    # ASCII Arrows
    r"""[\-]+>|<[\-]+"""
    ,
    # Twitter username:
    r"""(?:@[\w_]+)"""
    ,
    # Twitter hashtags:
    r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
    ,
    # email addresses
    r"""[\w.+-]+@[\w-]+\.(?:[\w-]\.?)+[\w-]"""
    ,
    # Remaining word types:
    r"""
    (?:[^\W\d_](?:[^\W\d_]|['\-_])+[^\W\d_]) # Words with apostrophes or dashes.
    |
    (?:[+\-]?\d+[,/.:-]\d+[+\-]?)  # Numbers, including fractions, decimals.
    |
    (?:[\w_]+)                     # Words without apostrophes or dashes.
    |
    (?:\.(?:\s*\.){1,})            # Ellipsis dots.
    |
    (?:\S)                         # Everything else that isn't whitespace.
    """
    )

######################################################################
# This is the core tokenizing regex:

WORD_RE = re.compile(r"""(%s)""" % "|".join(REGEXPS), re.VERBOSE | re.I
                     | re.UNICODE)

# WORD_RE performs poorly on these patterns:
HANG_RE = re.compile(r'([^a-zA-Z0-9])\1{3,}')

# The emoticon string gets its own regex so that we can preserve case for
# them as needed:
EMOTICON_RE = re.compile(EMOTICONS, re.VERBOSE | re.I | re.UNICODE)

# These are for regularizing HTML entities to Unicode:
ENT_RE = re.compile(r'&(#?(x?))([^&;\s]+);')


# ... some source code skipped here

######################################################################

class TweetTokenizer:
    r"""
    Tokenizer for tweets.

        >>> from nltk.tokenize import TweetTokenizer
        >>> tknzr = TweetTokenizer()
        >>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
        >>> tknzr.tokenize(s0)
        ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']

    Examples using `strip_handles` and `reduce_len` parameters:

        >>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
        >>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
        >>> tknzr.tokenize(s1)
        [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
    """

    def __init__(self, preserve_case=True, reduce_len=False, strip_handles=False):
        self.preserve_case = preserve_case
        self.reduce_len = reduce_len
        self.strip_handles = strip_handles

    def tokenize(self, text):
        """
        :param text: str
        :rtype: list(str)
        :return: a tokenized list of strings; concatenating this list returns\
        the original string if `preserve_case=False`
        """
        # Fix HTML character entities:
        text = _replace_html_entities(text)
        # Remove username handles
        if self.strip_handles:
            text = remove_handles(text)
        # Normalize word lengthening
        if self.reduce_len:
            text = reduce_lengthening(text)
        # Shorten problematic sequences of characters
        safe_text = HANG_RE.sub(r'\1\1\1', text)
        # Tokenize:
        words = WORD_RE.findall(safe_text)
        # Possibly alter the case, but avoid changing emoticons like :D into :d:
        if not self.preserve_case:
            words = list(map((lambda x : x if EMOTICON_RE.search(x) else
                              x.lower()), words))
        return words


# ... the rest of the source code

Now it’s clear: the TweetTokenizer class relies on regular expression filters to handle these Twitter-specific aspects of the data.
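
The constructor parameters we saw in the source code (preserve_case, reduce_len, strip_handles) are also handy in practice. As a quick sketch on a made-up tweet, preserve_case=False lowercases every token except emoticons, and reduce_len=True shortens exaggerated character repetitions:

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
print(tokenizer.tokenize('@Merouane_Benth: This is sooooooo cool! #NLTK :)'))
# expected output (roughly):
# ['@merouane_benth', ':', 'this', 'is', 'sooo', 'cool', '!', '#nltk', ':)']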

Punctuation

We’ve seen how to tokenize tweets, but we still have some useless standalone punctuation to remove. As a reminder, we want to keep the @mentions, #hashtags, emoticons, and URLs intact while doing so. Let’s see how to do it:

import re

emoticons_str = r"""
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\-o\*\']?                 # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth
      [\-o\*\']?                 # optional nose
      [:;=8]                     # eyes
      [<>]?
      |
      <3                         # heart
    )"""
regex_str = [
    emoticons_str,
    r'(?:@[\w_]+)', # @mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hashtags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+'] # URLs

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)

tweet_tokens_list = ['@Merouane_Benth', ':', 'This', 'is', 'just', 'a', 'tweet', 'example', '!', '#NLTK', ':)', 'http://www.twitter.com']
clean_token_list = []

for element in tweet_tokens_list:
    if tokens_re.findall(element):
        clean_token_list.append(element)
    else:
        if not re.match(r'[^\w\s]', element):
            clean_token_list.append(element)
            
print("Original token list:", tweet_tokens_list)
print("New token list:", clean_token_list)

# results:
# Original token list: ['@Merouane_Benth', ':', 'This', 'is', 'just', 'a', 'tweet', 'example', '!', '#NLTK', ':)', 'http://www.twitter.com']
# New token list: ['@Merouane_Benth', 'This', 'is', 'just', 'a', 'tweet', 'example', '#NLTK', ':)', 'http://www.twitter.com']

I’ve reused the regex filters from the TweetTokenizer source code in my own filter. The reason behind this tweak is to keep the @mentions, #hashtags, emoticons, and URLs whole; otherwise, the punctuation filter would remove them, and we don’t want that.
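
To see why this whitelist matters, here is what the standalone-punctuation check alone would do to the same token list (a quick sketch reusing tweet_tokens_list and the re module imported above):

naive = [t for t in tweet_tokens_list if not re.match(r'[^\w\s]', t)]
print(naive)
# ['This', 'is', 'just', 'a', 'tweet', 'example', 'http://www.twitter.com']
# The @mention, the #hashtag, and the emoticon are gone, which is exactly
# what the regex whitelist above prevents.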

Stop Words

Stop words are words that do not contribute to the deeper meaning of a phrase, e.g. “the”, “a”, and “is”. For some analysis applications, like text classification, it may make sense to remove them.

The NLTK library provides lists of commonly agreed-upon stop words for a variety of languages.

You can load the list this way:

from nltk.corpus import stopwords
stop_words_english = stopwords.words('english')
stop_words_french = stopwords.words('french')
print(' English stop words:', stop_words_english, '\n', 'French stop words:', stop_words_french)

# results:
# English stop words: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn'] 
# French stop words: ['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'je', 'la', 'le', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais', 'aurait', 'aurions', 'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient', 'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 'ayons', 'ayez', 'aient', 'eusse', 'eusses', 'eût', 'eussions', 'eussiez', 'eussent']

I’ve loaded the lists for both English and French. You will notice that the words are all lowercase and have no punctuation. Now, to filter our tokens against these lists, we need to make sure they are prepared the same way.

words = [w.lower() for w in clean_token_list if not w.lower() in stop_words_english]
print(words)

# result: 
# ['@merouane_benth', 'tweet', 'example', '#nltk', ':)', 'http://www.twitter.com']

As you can see from the code above, with just a simple list comprehension I turned our tokens into lowercase and filtered out the English stop words.

At this point, we have covered some of the basics of cleaning text, and I’ll stop the process here. You can now apply what we’ve seen so far to the tweets you previously saved.

You can find the code in my Git repository: Twitter Tutorial

Additional text cleaning considerations

What we’ve seen in this blog post is just the basics of cleaning raw text; in practice, this process can get a lot more complex.

Here is a short list of additional considerations when cleaning text:

  • Stemming: reducing a token (word) to its root, e.g. ‘cleaning’ becomes ‘clean’ (see the sketch after this list)
  • Locating and correcting common typos and misspellings
  • Handling numbers and dates inside the text
  • Extracting text from markup such as HTML, XML, PDF, or other document formats
  • …etc.
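
For example, a minimal stemming sketch with NLTK’s PorterStemmer (one of several stemmers NLTK ships with) could look like this:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(word) for word in ['cleaning', 'cleaned', 'tweets']])
# ['clean', 'clean', 'tweet']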

Conclusion

In this blog post, we’ve seen the structure of a tweet and demonstrated how to pre-process its text. In particular, we’ve covered some basic techniques and how to deal with the specific nature of Twitter data.

After these pre-processing steps, our dataset is now ready for some interesting analysis. I’ll cover that in the next article. Stay tuned!