Making sense of Twitter data with NLP

In this third part of the series on text mining Twitter data, we'll focus on analyzing the dataset we collected through the REST Search API in the previous posts. In this article, we'll look at some descriptive statistics, such as term frequencies, popular hashtags and mentions, as well as some more advanced NLP such as topic modelling and sentiment analysis.

Table of Contents:
  1. Collecting data from Twitter REST Search API using Python
  2. Pre-processing Twitter Data using Python
  3. Making sense of Twitter data using NLP 

Word frequencies

Term frequencies simply refer to counting how many times a word occurs in our dataset. The results help us quickly see what people are talking about.

Assuming you have followed the steps from the previous posts, we now have a list of clean tokens extracted from the tweets.

Let’s group all the tokens in memory and count the frequencies of each token:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from collections import Counter


def print_top_tokens_frequencies(list_tokens, name, n=10):
    """
    this function prints the top (n) tokens
    :param list_tokens: list of tokens
    :param name: name of list to print
    :param n: number of tokens to print, default 10
    :return: None
    """

    tokens_count = Counter(list_tokens)

    # The top n tokens
    print("The Top {n} ".format(n=n) + name)
    for token, count in tokens_count.most_common(n):
        print("{0}: {1}".format(token, count))

# from the previous article we've seen how to tokenize tweets
# list_tokens contains all the tokens of our data set
# print the count of all tokens

counter = Counter(list_tokens)
print('Count all Tokens:',counter, '\n')

# we can use the above function to print the top (n) tokens
print_top_tokens_frequencies(list_tokens, name='Tokens') 
# results

# all tokens
Count all Tokens: Counter({'rt': 1727, 'like': 493, '@bl00zer': 364, '*': 359, 'work': 303, 'stink': 288, '@thetazor': 278, 'follow': 278, 'see': 278, 'one': 276, '': 263, 'ready': 255, ':': 240, 'first': 238, "don't": 232, 'new': 221, 'tell': 202, 'reports': 202, '1': 202, '2': 202, 'give': 202, 'love': 202, 'something': 202, 'make': 202, 'decided': 202, 'stay': 202, 'fire': 202, 'way': 202, 'raw': 202, 'really': 202, 'delicate': 202, 'hot': 202, 'people': 196, 'put': 196, 'exactly': 192, 'want': 183, 'player': 183, 'boy': 182, 'believe': 182, 'go': 181, '@ssssssierra': 181, 'montreal': 179, 'hit': 175, 'thank': 175, 'workshop': 175, 'artificial': 170, 'intelligence': 170, 'designs': 162, 'check': 155, '20': 154, "it's": 152, 'w': 151, 'daily': 150, 'coming': 149, '#dosanddontsforteachers': 145, 'h1z1': 139, 'giveaway': 139, 'enter': 139, 'https://t.co/zibws2oad0': 139, 'ends': 139, '"': 135, '-': 132, 'heart': 128, '$': 128, 'good': 127, 'expected': 124, 'days': 123, "we'll": 122, "you're": 121, 'developed': 118, 'winter': 115, 'think': 114, 'would': 110, '@mechapoetic': 109, 'makes': 108, '7': 108, 'game': 107, 'turn': 105, '@helbigmanuel': 101, 'interested': 101, 'boreal': 101, 'peatlands': 101, 'permafrost': 101, 'methane': 101, '@scottyresearch': 101, '#agu17': 101, '@nasa_above': 101, 'session': 101, 'shouldnt': 101, 'round': 101, 'pot': 101, '3': 101, 'injured': 101, 'least': 101, 'custody': 101, 'unconfirmed': 101, 'explosions': 101, '#newyorkcity': 101, 'metro': 101, 'https://t.co/gg4phrffxx': 101, '@eventcloak': 101, 'alternatively': 101, 'credit': 101, 'hard': 101, 'https://t.co/0d7q5ge5ty': 101, '@veganyogadude': 101, '#wednesdaymotivation': 101, '@imqft': 101, '#startup': 101, '#entrepreneur': 101, '#makeyourownlane': 101, '#defstar5': 101, '#mpgvip': 101, '#quotes': 101, 'https://t.co/yjyoonmyhu': 101, 'stream': 101, 'overlay': 101, 'banner': 101, 'everyone': 101, 'going': 101, 'copy': 101, 'existenz': 101, '@milanweeklypod': 101, 'bonaventura': 101, 'win': 101, 'boost': 101, 'self-confidence': 101, 'discuss': 101, "yesterday's": 101, "today's": 101, 'europa': 101, 'league': 101, 'draw': 101, '@gibbological': 101, '@mariaelemartino': 101, '@thomasgurry': 101, 'synbiotics': 101, 'include': 101, 'post-biotics': 101, 'bacterial': 101, 'https://t.co/2icow4smih': 101, '@twip98': 101, '@premydaremy': 101, 'time': 101, 'said': 101, 'correct': 101, 'beautiful': 101, '@andyfidel_': 101, 'greater': 101, 'driving': 101, 'hub': 101, 'https://t.co/fkevwyjawr': 101, 'ft': 101, '@fiaresponsable': 101, '@mtlintl': 101, '#ai': 101, '#aimtl': 101, '@foxandfriends': 101, 'friend': 101, '@realdonaldtrump': 101, 'bullies': 101, 'everyday': 101, 'yet': 101, 'report': 101, 'fake': 101, 'https://t.co/bq006ykypz': 101, '5': 101, 'proven': 101, 'tips': 101, 'desire': 101, 'https://t.co/wsqrljdeld': 101, 'https://t.co/ghgtl58lrh': 101, 'whelp': 101, 'looks': 101, 'https://t.co/xvgoqub1wr': 101, 'wtf': 101, 'throwing': 101, 'shoes': 101, '!': 101, 'https://t.co/uiper0c7aj': 101, 'nobody': 101, 'likes': 101, 'speak': 101, 'thing': 101, 'truths': 101, 'set': 101, 'team': 101, 'crm': 101, 'finance': 101, 'conference': 101, '#weareready': 101, '@gregquinn_td': 101, '@meganderson_td': 101, '@donoconnortd': 101, '@mclennandan': 101, 'https://t.co/6hham0yonw': 101, '@mckennaconor': 101, '@arponbasu': 101, 'media': 101, 'hire': 101, 'butt': 101, 'kisser': 101, 'pal': 101, 'mourad': 101, 'gohabsgo': 101, '99': 101, 'kn': 101, 'https://t.co/mcsuwkmi23': 101, 'domino': 101, 'advice': 101, 'connected': 101, 'man': 
101, 'business': 101, 'https://t.co/0wuw9flbev': 101, '#business': 101, '#stuffidoinbedrun': 101, 'mouth': 101, 'mate': 101, 'eat': 101, 'toe': 101, 'p': 101, 'suck': 101, 'nipple': 101, 'passion': 101, 'rock': 101, 'hi': 101, 'https://t.co/3kekkfrivm': 101, 'tgim': 101, 'fitcamp': 101, '6-7': 101, 'pm': 101, '#fitnessmotivation': 101, '#fitnesslifestyle': 101, 'https://t.co/qlyxxm2vv1': 101, 'need': 101, 'next': 101, 'step': 101, 'journey': 101, 'tuned': 101, 'https://t.co/jbdjl7tim2': 101, 'hour': 101, 'get': 101, 'today': 101, 'wife': 101, '1/2': 101, 'hours': 101, 'glad': 101, 'city': 101, 'shut': 101, '#keatonjones': 101, 'heartbreaking': 101, 'video': 101, '@this': 101, 'little': 101, 'please': 101, 'sharing': 101, 'cuz': 101, 'needs': 101, 'ser': 101, 'https://t.co/nzyr1590pi': 101, '@scarboo': 101, '@isaziggy0908': 101, '@tommybowie': 101, '@handmaiden413': 101, '@carinamartz7': 101, '@carla_dv7': 101, '@julik317': 101, '@kataltitude': 101, '@loobydoobs': 101, '@lizzie_graupe': 101, '@silenzios': 101, 'leading': 101, 'amazing': 101, '@kpatelneha23': 101, '': 101, 'https://t.co/dk4shcvzpe': 101, 'happened': 101, 'tattoo': 101, 'artist': 101, 'worst': 101, 'feel': 101, 'cheating': 101, 'shit': 101, '@elizabeth1carte': 101, 'saw': 101, 'made': 101, 'https://t.co/uisyywapti': 101, 'https://t.co/frgzi9i5fw': 101, '@maskurbate': 101, 'male': 101, 'stripper': 101, 'gets': 101, 'wild': 101, 'apartment': 101, 'masquerade': 101, 'https://t.co/utti4ma8sx': 101, '#maskurbate': 101, '#malestripper': 101, '#gayporn': 101, 'look': 97, 'students': 97, 'potential': 97, 'boyfriends': 97, '@gen_rai': 95, '@general_katz': 95, '@gamingandpandas': 95, 'mad': 95, "can't": 95, 'stand': 95, '@andrea_dawn3': 91, 'easy': 91, 'recommend': 91, 'https://t.co/rj3znnqoyr': 87, '#isupportmyself': 87, 'period': 86, 'bump': 86, '^': 86, 'also': 85, 'wow': 82, 'spider-man': 82, 'spider-verse': 82, 'official': 82, 'trailer': 82, 'https://t.co/jgvsqgudne': 82, 'membralin': 82, "#alzheimer's": 82, 'disease': 82, 'pathogenesis': 82, 'identified': 82, 'https://t.co/qpgtzyjesa': 82, 'https://t.co/hvul5k0anu': 82, '@noa89': 81, 'simple': 81, 'https://t.co/d0zxr6vuyc': 81, 'best': 80, 'regret': 80, '@piper_thibodeau': 75, 'full': 75, 'res': 75, '@actramtl': 74, 'informative': 74, 'instagram': 74, 'photo': 74, '@waterlilin': 74, 'https://t.co/hwt2a3l81h': 74, 'loved': 73, 'playing': 71, 'madonna': 71, 'virgin': 71, 'hear': 71, 'https://t.co/cvzilq85yu': 71, '#80s': 71, '#80smusic': 71, 'transform': 69, 'workplace': 69, 'sooner': 69, 'https://t.co/eqfx9ypxda': 69, '@acapellascience': 67, '@iamscicomm': 67, 'mean': 67, 'dislike': 67, '#steam': 67, 'encountered': 67, 'https://t.co/qi7erd4m6h': 67, 'found': 67, 'completely': 67, 'vapid': 67, 'b': 67, 'paint': 63, '1846': 63, 'snowball': 63, 'pun': 63, 'wips': 63, 'art': 63, 'videos': 63, 'https://t.co/msrmm67usj': 63, 'another': 63, 'god': 60, 'misogynist': 55, 'thinks': 55, 'women': 55, 'periods': 55, 'every': 55, '28': 55, 'https://t.co/rvdyz8lu3a': 55, '@montrealcp': 55, "winter's": 55, 'real': 55, 'storm': 55, 'quebec': 55, 'tomorrow': 55, 'https://t.co/ifo9k9teqs': 55, 'pillow': 54, '16': 54, '@drfunkin': 54, '@smarkinfested': 54, '@jayjackets': 54, 'yeah': 54, 'mostly': 54, 'webcomics': 54, 'even': 54, '\u200d': 54, 'last': 53, 'nights': 53, 'pre-blended': 53, 'detox': 53, 'soup': 53, 'thanks': 53, '@hbfit': 53, 'tv': 53, 'recipe': 53, 'much': 53, '': 53, 'feeling': 53, 'week': 53, '…': 53, 'https://t.co/rhlpzl20tw': 53, '@circuitdusoleil': 53, '#tbt': 53, 'genie': 53, 'n': 
53, 'ranked': 53, '#letsgogenie2018': 53, '': 53, 'https://t.co/lt41qtr430': 53, 'charlie': 48, 'brown': 48, 'christmas': 48, 'shoe': 48, 'near': 48, '@vans': 48, '#vanspeanuts': 48, 'https://t.co/gnu3xervvf': 48, 'remember': 48, 'treat': 48, 'well': 48, 'cold': 44, '@bbuttsmontreal': 36, 'morning': 36, 'https://t.co/9vx5tzlxaz': 36, '@presssec': 34, 'someone': 34, 'explain': 34, 'big': 34, 'words': 34, 'ok': 34, 'mind': 34, 'https://t.co/ty8efg9mkp': 34, 'night': 32, 'philadephia': 32, '#philadelphia': 32, '#photography': 32, '#urban': 32, '#night': 32, '@thephotohour': 32, 'https://t.co/hicbye9608': 32, 'book': 32, '@silvet': 30, 'never': 30, 'knew': 30, 'abt': 30, 'mum': 30, 'hometown': 30, '#funfacts': 30, '#themoreyouknow': 30, 'https://t.co/xepmbchizx': 30, '#90dayfiance': 30, 'luis': 30, 'stone': 30, 'face': 30, 'molly': 30, 'crying': 30, 'brings': 30, 'thr': 30, 'https://t.co/z7mp45qztd': 30, 'whole': 27, 'throw': 27, 'cover': 27, 'x': 27, 'insert': 27, '29.99': 27, ')': 27, 'https://t.co/hidun9apjk': 27, '30': 27, 'redu': 27, 'https://t.co/jwjosyff96': 27, '@cnn': 21, 'plays': 21, '10:30': 21, ':)': 21, 'welcome': 20, 'jamila': 20, '#hesm': 20, '#hesm17': 20, 'https://t.co/sfatwyd6fu': 20, '@mtlblog': 19, 'brutal': 19, '20cm': 19, 'snowstorm': 19, 'https://t.co/chyj8ubnok': 19, '#montreal': 19, '#quebec': 19, '#canada': 19, '@grescoe': 17, 'country': 17, 'place': 17, 'poor': 17, 'cars': 17, 'rich': 17, 'use': 17, 'public': 17, 'transportation': 17, '@enriquepenalos': 17, 'snorlax': 15, '(': 15, 'iv': 15, '17': 15, 'cp': 15, '2205': 15, '08:51': 15, '44am': 15, '691': 15, 'rue': 15, 'guy': 15, 'https://t.co/kknnybvcru': 15, 'https://t.co/32yjxzmcyb': 15, '@ottawamike05': 14, 'flipping': 14, 'right': 14, 'brrrr': 14, 'hate': 14, '@o_guest': 13, 'learned': 13, 'couple': 13, 'canada': 13, 'practice': 13, 'perfect': 13, ':d': 13, 'sketchbook': 12, '71': 12, 'penguin': 12, 'chicks': 12, 'paintings': 12, 'available': 12, 'https://t.co/kgyqdximly': 12, 'maybe': 8, 'reason': 8, 'western': 8, 'adaptations': 8, 'reputed': 8, 'anime': 8, 'manga': 8, 'bad': 8, 'less': 8, 'fault': 8, 'source': 8, 'material': 8, 'vladimir': 5, 'putin': 5, 'orders': 5, 'russian': 5, 'forces': 5, 'start': 5, 'pulling': 5, 'syria': 5, 'https://t.co/k8ojmfsj5c': 5, 'https://t.co/l2efswxvmh': 5, 'buds': 4, 'hosting': 4, 'vegan': 4, 'pop': 4, 'sunday': 4, 'save': 4, 'duck': 4, 'sample': 4, 'sale': 4, 'bring': 4, 'https://t.co/8dtdouv18c': 4, '@benefry': 4, 'raven': 4, 'h': 4, 'price': 4, '“': 4, 'convicted': 4, 'life': 4, '#fantasy': 4, '#romance': 4, '#iartg': 4, '@roaringpurr': 4, 'https://t.co/jrdtqrkjq4': 4, '@tfcoachcarlson': 2, '@femalecn': 2, '@ichampionwomen': 2, 'club': 2, 'td': 2, 'call': 2, 'recap': 2, 'decisions': 2, 'many': 2, 'https://t.co/uvzk2gw3vg': 2}) 

# top 10 tokens
The Top 10 Tokens
rt: 1727
like: 493
@bl00zer: 364
*: 359
work: 303
stink: 288
@thetazor: 278
follow: 278
see: 278
one: 276

From the above results, we can quickly see which words recur most often. However, we still have some noise and unwanted tokens, such as the acronym RT at the top, which refers to a retweet. These kinds of tokens are not meaningful to keep in the analysis, so let's remove all the unwanted tokens and run the analysis again.

# filter out all unwanted tokens from list_tokens

list_clean_tokens = [token for token in list_tokens
                     if token != 'rt' and token.isalpha()]

print_top_tokens_frequencies(list_clean_tokens, name='Clean Tokens')

# Results:

The Top 10 Clean Tokens
like: 493
work: 303
stink: 288
follow: 278
see: 278
one: 276
ready: 255
first: 238
new: 221
tell: 202

Now it's a lot cleaner: we can see that the words 'like' and 'work' are the most popular.
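
Beyond printing the counts, a quick bar chart makes the ranking easier to read at a glance. Here is a minimal sketch using matplotlib (my choice, any plotting library will do), reusing list_clean_tokens from above:

import matplotlib.pyplot as plt
from collections import Counter

# count the cleaned tokens and keep the 10 most frequent
top_tokens = Counter(list_clean_tokens).most_common(10)
labels, counts = zip(*top_tokens)

# horizontal bar chart, most frequent token at the top
plt.barh(labels, counts)
plt.gca().invert_yaxis()
plt.xlabel('Frequency')
plt.title('Top 10 Clean Tokens')
plt.tight_layout()
plt.show()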

In my first article, I showed the format of a single tweet in its raw JSON form. We noticed some interesting fields there that can be used in the analysis:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from collections import Counter


def print_top_user_followers(list_tweets, name, n=10):
    """
    this function prints the top (n) users by followers count
    :param list_tweets: list of tweets
    :param name: name of list
    :param n: number of users to print, default 10
    :return: None
    """

    dict_users = {}

    for tweet in list_tweets:
        dict_users['@'+tweet['user']['screen_name']] = tweet['user']['followers_count']

    followers_count = Counter(dict_users)
    # The top n tokens
    print("The Top {n} ".format(n=n) + " users with bigest bumber of " + name)
    for token, count in followers_count.most_common(n):
        print("{0}: {1}".format(token, count))


def print_top_retweeted_posts(list_tweets, name, n=10):
    """
    this function prints the top (n) retweeted posts
    :param list_tweets: list of tweets
    :param name: name of list
    :param n: top posts, default 10
    :return: None
    """
    
    dict_retweeted = {}

    for tweet in list_tweets:
        dict_retweeted['@' + tweet['user']['screen_name'] + '  tweet_id: ' + tweet['id_str']] = tweet['retweet_count']

    retweet_count = Counter(dict_retweeted)
    # The top n tokens
    print("The Top {n} ".format(n=n) + " posts " + name)
    for token, count in retweet_count.most_common(n):
        print("{0}: {1}".format(token, count))

# loop through list of tweets and extract the data
list_hashtags = []
list_mentions = []

for tweet in list_tweets:
    for elem in tweet['final_tokens']:

        if elem.startswith('#'):
            list_hashtags.append(elem)
        elif elem.startswith('@'):
            list_mentions.append(elem)


print_top_tokens_frequencies(list_hashtags, name='Hashtags')

# Results 
The Top 10 Hashtags
#dosanddontsforteachers: 145
#agu17: 101
#newyorkcity: 101
#wednesdaymotivation: 101
#startup: 101
#entrepreneur: 101
#makeyourownlane: 101
#defstar5: 101
#mpgvip: 101
#quotes: 101

print_top_tokens_frequencies(list_mentions, name='Mentions')

# Results
The Top 10 Mentions
@bl00zer: 364
@thetazor: 278
@ssssssierra: 181
@mechapoetic: 109
@helbigmanuel: 101
@scottyresearch: 101
@nasa_above: 101
@eventcloak: 101
@veganyogadude: 101
@imqft: 101

print_top_user_followers(list_tweets, 'followers')

# Results
The Top 10 users with the biggest number of followers
@sciencebeta: 124001
@isaacrthorne: 21489
@MatthieuDugal: 16765
@ggheorghiu: 8892
@Wave80radio: 6348
@karinejoly: 4956
@DatingTipsRocks: 3958
@LittleBurgundy: 2602
@justputmeon: 2380
@jonwas1: 2292

print_top_retweeted_posts(list_tweets_en, 'retweeted')

# Results 
The Top 10 posts retweeted
@Coriana_Hunt  tweet_id: 940210067848269825: 7979
@vaxildian  tweet_id: 940210567410798593: 172
@Armonah_  tweet_id: 940210162878570496: 137
@pichusophie3698  tweet_id: 940210686717845504: 83
@jeanje_donofrio  tweet_id: 940210228368261121: 29
@TheTazor  tweet_id: 940210503133159424: 23
@Awalyneuh  tweet_id: 940210641947766784: 23
@bconlon101  tweet_id: 940210223968550912: 19
@honda5252  tweet_id: 940210131916038146: 13
@justputmeon  tweet_id: 940210500662525952: 10

We've seen how quickly we can get insights from Twitter data, but this is not all we can do with it. We can go a step further by applying more advanced machine learning algorithms to predict the sentiment and topics of tweets.

Topic modelling (LDA)

Latent Dirichlet Allocation (LDA) is an algorithm that discovers the topics present in a document based on the frequencies of the words it contains. Gensim is a Python library for topic modelling built around LDA; we're going to use it along with NLTK to produce an LDA model for our dataset.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from nltk.stem.porter import PorterStemmer
from gensim import corpora, models

# Create a stemmer of class PorterStemmer
stemmer = PorterStemmer()

texts = []
# stem each token
# list_tokenized_tweets is a list of lists; each inner list contains the tokens of one tweet
for tokenized_tweet in list_tokenized_tweets:

    stemmed_tokens = [stemmer.stem(i) for i in tokenized_tweet]
    texts.append(stemmed_tokens)

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]

ldamodel = models.ldamodel.LdaModel(corpus, num_topics=8, id2word=dictionary, passes=20)

print(ldamodel.print_topics(num_topics=7, num_words=4))

# Results

[(6, '0.051*"delic" + 0.051*"raw" + 0.041*"come" + 0.025*"make"'), 
(2, '0.050*"montreal" + 0.050*"like" + 0.045*"daili" + 0.042*"winter"'), 
(5, '0.044*"see" + 0.041*"want" + 0.030*"readi" + 0.030*"love"'), 
(1, '0.081*"follow" + 0.064*"like" + 0.047*"w" + 0.046*"design"'), 
(7, '0.060*"stink" + 0.041*"work" + 0.041*"get" + 0.041*"hour"'), 
(3, '0.033*"make" + 0.031*"feel" + 0.031*"p" + 0.031*"passion"'), 
(0, '0.023*"first" + 0.023*"man" + 0.023*"busi" + 0.023*"domino"')]

Each line is a topic, with its individual terms and their weights. Since I collected this data in mid-December, topic 2 clearly seems to be about winter in Montreal. We've chosen to print only the top 7 topics in our document, with 4 words each, but you can change these parameters for your model.
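
The model can also score an individual tweet against the learned topics. A small sketch, reusing the dictionary and ldamodel defined above (picking the first tweet is just for illustration):

# topic distribution of a single tweet
bow = dictionary.doc2bow(texts[0])

# list of (topic_id, probability) pairs for this tweet
for topic_id, probability in ldamodel.get_document_topics(bow):
    print('topic {0}: {1:.3f}'.format(topic_id, probability))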

The results of topic modelling (LDA) depend strongly on the features present in the corpus matrix, which is generally very sparse, especially with Twitter data. One technique to improve the model is to reduce the dimensionality of the matrix by filtering out low-frequency terms. Other methods, like filtering on part-of-speech tags to keep only meaningful tokens, can also help.
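
For instance, gensim's Dictionary can drop very rare and very common terms before the corpus is built. A hedged sketch (the thresholds are arbitrary and should be tuned for your data):

# keep only tokens that appear in at least 5 tweets
# and in no more than 50% of all tweets
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)

corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=8, id2word=dictionary, passes=20)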

Sentiment Analysis

Sentiment analysis at the document level aims to determine the emotional tone of a piece of text. This technique relies on machine learning algorithms to build a sentiment classification model that predicts whether the sentiment is positive or negative.

There are quite a few toolkits available for supervised text classification. Scikit-learn is one of them, with multiple machine learning algorithms already built in. In this tutorial, we'll use Logistic Regression, a linear model commonly used for binary classification. Let's see how we can build a model to predict the sentiment of our dataset.

The first thing we need to build a model is a training dataset. Since we are working with Twitter data, it's best to have a similar training dataset that is well labelled and pre-processed. I've used this training corpus, which contains 1,578,627 classified tweets; each row is marked as 1 for positive sentiment and 0 for negative sentiment.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Read in the data from our training dataset
df = pd.read_csv('SentimentAnalysisDataset.csv',  error_bad_lines=False)

# Exploring the positive Vs the negative tweets
print('Mean of training dataset sentiment', df['Sentiment'].mean())

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['SentimentText'],
                                                    df['Sentiment'],
                                                    random_state=0)

print(' A sample from X_train set:\n\n', X_train.iloc[1], '\n\n X_train size: ', X_train.shape, "\n\n X_test size", X_test.shape)
# Results

Mean of training dataset sentiment 0.500551750525

A sample from X_train set:

 @rachelstarlive Yeah not bad for a little thing!   I wish we could bring in a Camcorder though!  

X_train size:  (1183959,) 

X_test size: (394653,)

From the results above, we can see that the training dataset is evenly distributed between positive and negative tweets, and that our training set holds about 75% of the data while the test set holds the remaining 25% (the default split of train_test_split).
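
This 75/25 split is simply the default of train_test_split; if you prefer a different ratio, you can pass test_size explicitly. A minimal sketch for an 80/20 split (we'll keep the default split for the rest of the article, so the numbers below stay as they are):

# an explicit 80/20 split instead of the default 75/25
X_train, X_test, y_train, y_test = train_test_split(df['SentimentText'],
                                                    df['Sentiment'],
                                                    test_size=0.2,
                                                    random_state=0)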

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Fit the CountVectorizer to the training data
vectorizer = CountVectorizer().fit(X_train)
print('Number of features inside the document :', len(vectorizer.get_feature_names()))

# transform the documents in the training data to a document-term matrix
X_trainvectorized = vectorizer.transform(X_train)

# Train the model
classification_model = LogisticRegression()
classification_model.fit(X_trainvectorized, y_train)

# Predict the transformed test documents
predictions = classification_model.predict(vectorizer.transform(X_test))

print('ROC AUC: ', roc_auc_score(y_test, predictions))

# Results

Number of features inside the document : 564816
ROC AUC:  0.798728017955

As we can see, the prediction score (ROC AUC) for this model is roughly 0.80, which is not bad at all at first glance. However, I'm curious to see which tokens the model weighs most heavily in each direction:

# get the feature names as numpy array
features = np.array(vectorizer.get_feature_names())

# Sort the coefficients from the model
coefficient_index = classification_model.coef_[0].argsort()

# print 10 smallest and 10 largest coefficients ( negative and positive Tokens )
print('Tokens with large coefficients (Positive_tokens): {}\n'.format(features[coefficient_index[:-11:-1]]))
print('Tokens with small coefficients (negative_tokens):{}'.format(features[coefficient_index[:10]]))

# Results

Tokens with large coefficients (Positive_tokens): ['iamsoannoyed' 'tojosan' 'myfax' 'goldymom' 'smiling' 'worries'
 'jonthanjay' 'gracias' 'iammaxathotspot' 'suesshirtshop']

Tokens with small coefficients (negative_tokens): ['inaperfectworld' 'dontyouhate' 'sad' 'saddens' 'saddest' 'heartbroken'
 'pakcricket' 'anqju' 'xbllygbsn' 'disapointed']

We can see from the output above that words like 'sad', 'inaperfectworld' and 'heartbroken' are correctly weighted as negative. But we also have tokens like 'anqju', 'myfax' and 'tojosan' that don't seem correctly classified, and I'm not even sure whether they mean anything positive or negative😐.

Well, this classifier is just a basic application of a plain Logistic Regression template. Building a bulletproof model can get a lot more complex.

In what follows, I'll walk you through some ways of optimizing the classification model and see how it performs:

1. Reducing sparsity

In our previous model, we used all the features (tokens) present in the document (564,816 tokens). However, these words don't all contribute equally to the overall contextual polarity of a document. The idea here is to remove the meaningless words, such as the stop words we've seen in the previous article, Pre-processing Twitter data.

Another improvement is to set a filter on word frequency in the corpus. The intuition is that a word appearing often in a document matters for that document, while a word appearing in almost every document carries little information; TF-IDF is a weighting scheme that captures exactly this trade-off. We'll use a TF-IDF vectorizer with a minimum document-frequency filter, and we'll also replace the default tokenizer with the NLTK Twitter tokenizer.

Let's see how these improvements perform:

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

new_vectorizer = TfidfVectorizer(min_df=5, stop_words='english',
                                 tokenizer=nltk.TweetTokenizer().tokenize).fit(X_train)
print('Number of features inside the document :', len(new_vectorizer.get_feature_names()))

X_trainvectorized = new_vectorizer.transform(X_train)

classification_model = LogisticRegression()
classification_model.fit(X_trainvectorized, y_train)

predictions = classification_model.predict(new_vectorizer.transform(X_test))

print('ROC AUC: ', roc_auc_score(y_test, predictions))

features = np.array(new_vectorizer.get_feature_names())

coefficient_index = classification_model.coef_[0].argsort()

print('Tokens with small coefficients (negative_tokens):{}\n'.format(features[coefficient_index[:10]]))
print('Tokens with large coefficients (Positive_tokens): {}'.format(features[coefficient_index[:-11:-1]]))
# Results

Number of features inside the document: 66473
ROC AUC:  0.783482247527

Tokens with small coefficients (negative_tokens): ['sad' 'miss' 'sadly' 'poor' 'bummed' 'unfortunately' 'missing' 'sucks'
 'gutted' 'sick']

Tokens with large coefficients (Positive_tokens): ['www.tweeteradder.com' 'thank' 'thanks' 'www.tweeterfollow.com'
 'www.iamsoannoyed.com' 'welcome' 'smile' 'smiling' 'congratulations'
 'proud']

From the results above, we notice that the prediction score has dropped slightly while the number of features has dropped drastically. We can say that we achieved basically the same score with far fewer tokens. Also, if we take a look at the top 10 positive and negative tokens, the classification looks better, except that we still have some links😑.
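
One simple way to get rid of those links is to strip URLs before vectorizing. Scikit-learn vectorizers accept a custom preprocessor callable, so a hedged sketch could look like this (the regular expression is a rough assumption, adjust it to your data):

import re

def strip_urls(text):
    """remove http(s) links and www-style links before tokenization"""
    return re.sub(r'(https?://\S+|www\.\S+)', '', text.lower())

# same TF-IDF setup as above, with the URL-stripping preprocessor plugged in
new_vectorizer = TfidfVectorizer(min_df=5, stop_words='english',
                                 preprocessor=strip_urls,
                                 tokenizer=nltk.TweetTokenizer().tokenize).fit(X_train)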

2. N-grams

Another well-known way to improve the classification model is N-grams, a technique that captures patterns of consecutive words in a document. Basically, n-grams group n consecutive tokens into a single feature (2-grams, or bigrams, are pairs of adjacent tokens), and the model treats each group as one token. For example, bigrams count pairs of adjacent words, which can give us compound features such as 'white house' and 'not bad'.

Let's see how we can implement this:

# we extract both 1-grams and 2-grams (unigrams and bigrams)

new_vectorizer = TfidfVectorizer(min_df=5, stop_words='english', ngram_range=(1,2),tokenizer=nltk.TweetTokenizer().tokenize).fit(X_train)
print('Number of features inside the document :', len(new_vectorizer.get_feature_names()))

X_trainvectorized = new_vectorizer.transform(X_train)

classification_model = LogisticRegression()
classification_model.fit(X_trainvectorized, y_train)

predictions = classification_model.predict(new_vectorizer.transform(X_test))

print('ROC AUC: ', roc_auc_score(y_test, predictions))

features = np.array(new_vectorizer.get_feature_names())

coefficient_index = classification_model.coef_[0].argsort()

print('Tokens with small coefficients (negative_tokens):{}\n'.format(features[coefficient_index[:10]]))
print('Tokens with large coefficients (Positive_tokens): {}'.format(features[coefficient_index[:-11:-1]]))
# Results 

Number of features inside the document : 269882

ROC AUC:  0.797029494271

Tokens with small coefficients (negative_tokens):['sad' 'miss' 'sucks' "can't" 'sick' 'poor' 'missing' 'wish' 'sadly'
 'hurts']

Tokens with large coefficients (Positive_tokens): ["can't wait" 'wish luck' 'thank' 'thanks' 'welcome' 'smile' 'yay' '=('
 'awesome' 'glad']

We can see that the number of features has increased, which was expected since bigrams add many combinations of adjacent tokens. We can also notice that the prediction score has climbed back up to 79.70%. Looking at the top 10 positive tokens, the classification model now takes compound tokens such as 'wish luck' and "can't wait" into consideration, which makes sense because they contribute to the meaning of the sentence.
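
If you're curious about how much weight a particular bigram carries, you can look it up in the fitted vectorizer's vocabulary and read the corresponding coefficient. A small sketch, assuming the bigram model trained above:

# look up the learned coefficient for a specific bigram
bigram = "can't wait"
index = new_vectorizer.vocabulary_.get(bigram)

if index is not None:
    print('coefficient of "{0}": {1:.3f}'.format(bigram, classification_model.coef_[0][index]))
else:
    print('"{0}" is not in the vocabulary'.format(bigram))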

I think our classification model can do a pretty good job now. Surely it's not the most sophisticated model on the market, but it can handle general English tweets. Let's test it against a sample from our dataset and see how the predictions go:

print(classification_model.predict(new_vectorizer.transform(['1 hour to get to work today, my wife is only 1/2 way to work so about 2 hours for her.  Glad the city decided to shut off the 20 again',
                                                             "IT'S COMING... exactly what you need to make it to the next step of your journey.  Stay tuned.",
                                                             "If you are yourself you are BEAUTIFUL",
                                                             "help looks like winter has decided to stay",
                                                             "Thank you for leading an amazing workshop @kpatelneha23 !",
                                                             "Artificial Intelligence to Transform Workplace Sooner Than Expected"])))

# Results 1 for positive and 0 for negative

[0 1 1 0 1 1]

As you can see, the predictions are accurate: our model classified all these tweets correctly😀.
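
If you want more than hard 0/1 labels, LogisticRegression can also report how confident it is through predicted probabilities. A short sketch reusing the same vectorizer and model (the two sample tweets are taken from the list above):

sample = ["Thank you for leading an amazing workshop @kpatelneha23 !",
          "help looks like winter has decided to stay"]

# column 1 holds the probability of the positive class
probabilities = classification_model.predict_proba(new_vectorizer.transform(sample))
for tweet, prob in zip(sample, probabilities[:, 1]):
    print('{0:.2f}  {1}'.format(prob, tweet))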

We've seen in this example how to use Logistic Regression to classify text. There are other algorithms that can do the same job and possibly perform better.
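
For instance, swapping in a linear support vector machine is almost a one-line change with scikit-learn. A hedged sketch, reusing the bigram TF-IDF features from above (not tuned, just to show the idea):

from sklearn.svm import LinearSVC

# same TF-IDF features, different linear classifier
svm_model = LinearSVC()
svm_model.fit(X_trainvectorized, y_train)

svm_predictions = svm_model.predict(new_vectorizer.transform(X_test))
print('ROC AUC (LinearSVC): ', roc_auc_score(y_test, svm_predictions))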

You can find the code in my Git repository: Twitter Tutorial

Conclusion

In this article, we've discussed how to extract interesting terms and information from a Twitter dataset. With word frequencies, sorting and filtering, we can quickly get a holistic view of our data, and these techniques are quite simple to implement. We then moved on to more advanced NLP and machine learning approaches for classifying text and demonstrated a few ways to improve the models and achieve good performance.

In the next article, I'll cover how to use geographic data of tweets and how to visualize it. Stay tuned😉!