A blog about nothing

A collection of useless things that may turn out to be useful

Twitter Data Analysis

Posted by Diego on July 29, 2015


With the recent Windows 10 release, I decided to bring back to life some old Python code that I wrote over a year ago to analyse Twitter data. I was interested in seeing how people are reacting to the release.

The code is divided into two parts: “GetTwitterData.py”, which, as the name implies, fetches data from Twitter, and “Sentiment.py”, which analyses text for positive/negative sentiment.

 

1. Get the data

To access the Twitter API it’s necessary to create an application at https://dev.twitter.com/apps. I won’t go through the details of how to do that because there are plenty of tutorials online.

The code to get the data has a lot of comments in it, so it shouldn’t be hard to understand. One thing to note is that there is some level of “hard coding” in it. For example, the “search term” (in this case, Microsoft) and the “mode” are set in the file itself.

Regarding the “mode”, there are two ways the function can work: “topN” or “live track”. The first outputs the latest N tweets on a particular subject, while the second outputs tweets as they are generated in real time (you will need to interrupt the program to make it stop).
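As an aside, the hard-coded search term could be lifted into an argument by building the query string from parameters. The sketch below is only an illustration under a few assumptions: it is written in Python 3 (the script itself is Python 2), `build_url` is a hypothetical helper that is not part of GetTwitterData.py, and the `language` parameter on the streaming endpoint is assumed to be available for the account in use.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"
STREAM_URL = "https://stream.twitter.com/1.1/statuses/filter.json"


def build_url(mode, term, topn=10, lang="en"):
    """Return the request URL for the given mode and search term.

    Hypothetical helper: the original script builds these URLs inline.
    """
    if mode == "topN":
        # Same query as the script: q, lang and count on the search endpoint.
        query = {"q": term, "lang": lang, "count": topn}
        return SEARCH_URL + "?" + urlencode(query)
    if mode == "live track":
        # Assumption: the streaming filter accepts a `language` parameter.
        query = {"track": term, "language": lang}
        return STREAM_URL + "?" + urlencode(query)
    raise ValueError("unknown mode: %r" % mode)


print(build_url("topN", "Windows 10", topn=5))
print(build_url("live track", "Microsoft"))
```

Using `urlencode` also takes care of escaping spaces and other special characters in the search term, which string concatenation does not.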

Code:

 

#Used to fetch live stream data from twitter.
#To get credentials: "https://dev.twitter.com/apps"

import oauth2 as oauth
import urllib2 as urllib
import json
from pprint import pprint

api_key = "XXXXX"
api_secret = "XXXXX"
access_token_key = "XXXXX-XXXXX"
access_token_secret = "XXXXX"



_debug = 0

oauth_token    = oauth.Token(key=access_token_key, secret=access_token_secret)
oauth_consumer = oauth.Consumer(key=api_key, secret=api_secret)

signature_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()

http_method = "GET"


http_handler  = urllib.HTTPHandler(debuglevel=_debug)
https_handler = urllib.HTTPSHandler(debuglevel=_debug)

def twitter_track(url, method, parameters):
    #Build and sign the OAuth request using the HTTP method passed in
    req = oauth.Request.from_consumer_and_token(oauth_consumer,
                                             token=oauth_token,
                                             http_method=method,
                                             http_url=url,
                                             parameters=parameters)

    req.sign_request(signature_method_hmac_sha1, oauth_consumer, oauth_token)
    headers = req.to_header()

    if method == "POST":
        encoded_post_data = req.to_postdata()
    else:
        encoded_post_data = None
    url = req.to_url()

    opener = urllib.OpenerDirector()
    opener.add_handler(http_handler)
    opener.add_handler(https_handler)
    response = opener.open(url, encoded_post_data)

    return response


def getData(mode, topn):
    parameters = []

    #returns an infinite stream of tweets, hence the need to ^C to break out of the for loop
    #use the first URL to get all sort of tweets
    if mode=='live track':
        #url = "https://stream.twitter.com/1/statuses/sample.json" 
        url = "https://stream.twitter.com/1.1/statuses/filter.json?track=Microsoft" #track one subject
        response = twitter_track(url, "GET", parameters)
        for line in response:
            text = line.strip()
            #line is a string, so I'm doing some very basic (and error-prone) string manipulation - room for improvement here
            s = text.find("text")
            e = text.find("source")
            print text[s+7:e-3]
            print ""


    elif mode=="topN":#will return TOP N tweets on the subject
        tweet_count = '&count='+str(topn)    # tweets/page
        queryparams = '?q=Microsoft&lang=en'+tweet_count
        url = "https://api.twitter.com/1.1/search/tweets.json" + queryparams

        #Ignoring the "parameters" variable - quite easy to use the URL
        response = twitter_track(url, "GET", parameters)
        data = json.load(response)#contains all N tweets
        #pprint(data) # data is a dictionary
        for tweet in data["statuses"]:
            print tweet["text"]

#TO DO:
#search term is hardcoded - parametrize it
#clean unnecessary characters, ex: URLS are coming like: http:\/\/t.co\/RC1Z7IaMu5
if __name__ == '__main__':
    #Options:
    #live track: Track function where all tweets or a single search criterion can be tracked in real time
    #            Tweets do not repeat, second parameter ignored
    #topN: Displays last N tweets on the subject
    getData("live track",10)
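The string slicing in the “live track” branch is the part marked as room for improvement. Since each line of the stream is a JSON document, one option is to parse it with the `json` module, which also turns escaped URLs such as http:\/\/t.co\/RC1Z7IaMu5 back into plain ones. This is a minimal sketch in Python 3, not the code I ran; `extract_text` is a hypothetical helper and the sample line is made up for illustration.

```python
import json


def extract_text(line):
    """Return the tweet text from one line of the stream, or None."""
    line = line.strip()
    if not line:
        return None  # blank keep-alive line
    try:
        message = json.loads(line)
    except ValueError:
        return None  # truncated or non-JSON line
    if not isinstance(message, dict):
        return None
    # Messages without a "text" field (e.g. delete notices) give None.
    return message.get("text")


# Made-up sample line, shaped like the fields the script slices on.
sample = r'{"text": "Upgrading now http:\/\/t.co\/RC1Z7IaMu5", "source": "web"}'
print(extract_text(sample))
```

Lines for which the helper returns None can simply be skipped before printing.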

 

Here is an example of the output:

[image: example output]

And here is how I called the code to pipe the output to a text file, which is then used in the sentiment analysis:

[image: command used to pipe the output to a text file]

 

 

2. Analyse it

The analysis script takes two arguments: the first is the path to a “sentiment” file and the second is the file we want to analyse. The sentiment file I’m using is called AFINN-111. AFINN is a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words were manually labelled by Finn Årup Nielsen in 2009-2011. The file is tab-separated and contains 2477 words. The code reads both files, creates a dictionary of word-value pairs from the AFINN file, and loops through the lines of the tweet file, looking up each word in that dictionary and summing the values to get the tweet’s score (a very simplistic word-by-word approach).

 

import sys
import json
import operator

def main():

    if len(sys.argv) == 3:  #script name plus the two file arguments
        sent_file = open(sys.argv[1])
        tweet_file = open(sys.argv[2])
    else:
        sent_file = open("AFINN-111.txt")
        tweet_file = open("microsoft.txt")


    #load dictionary
    scores = {}
    for line in sent_file:
        term, score  = line.split("\t")  # The file is tab-delimited.
        scores[term] = int(score)

    #missing_scores = {} #not using this at the moment
    ranked_tweets = {}

    for line in tweet_file:
        line= line.strip()
        if line=="":#ignore blank lines
            continue

        #print line
        tweet_score = 0
        try:
            words = line.split()

            for word in words:
                tweet_score += scores.get(word, 0)

            if tweet_score !=0:
                ranked_tweets[line] = tweet_score

        except KeyError:
            continue

    print "Number of tweets scored: "+str(len(ranked_tweets))
    d = dict((k, v) for k, v in ranked_tweets.items() if v > 0)
    print "    Positive Tweets:: "+str(len(d))
    d = dict((k, v) for k, v in ranked_tweets.items() if v < 0)
    print "    Negative Tweets:: "+str(len(d))

    print ""
    print ""

    print "Top 10 Best tweets: "
    for key, value in sorted(ranked_tweets.iteritems(), key=lambda (k,v): (v,k), reverse=True)[0:10]:
        print "   %s: %s" % (key, value)

    print " "
    print " "

    print "Top 10 Worst tweets: "
    for key, value in sorted(ranked_tweets.iteritems(), key=lambda (k,v): (v,k))[0:10]:
        print "   %s: %s" % (key, value)


    #Print the whole dictionary
    #for key, value in ranked_tweets.iteritems():
        #print key, value


if __name__ == '__main__':
    main()
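One limitation of the lookup above is that tokens are matched exactly as they appear after splitting on whitespace, so “LOVE” or “good!” will not match the lowercase entries in the sentiment file. A possible refinement, sketched in Python 3 with a tiny made-up lexicon in place of the real AFINN file, is to lowercase the text and strip punctuation before the lookup. The function names here are hypothetical and this is not the code that produced the results below.

```python
import re


def load_scores(lines):
    """Build a {term: score} dict from tab-separated AFINN-style lines."""
    scores = {}
    for line in lines:
        term, score = line.rstrip("\n").split("\t")
        scores[term] = int(score)
    return scores


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def tweet_score(text, scores):
    """Sum the scores of all tokens found in the lexicon."""
    return sum(scores.get(word, 0) for word in tokenize(text))


# Illustrative sample only; the real file has 2477 entries.
sample_lexicon = ["good\t3\n", "love\t3\n", "bad\t-3\n", "hate\t-3\n"]
scores = load_scores(sample_lexicon)

print(tweet_score("LOVE the new start menu, good job!", scores))  # 6
print(tweet_score("I hate forced updates. Bad idea", scores))     # -6
```

Like the original, this still scores word by word, so multi-word entries in the sentiment file are never matched.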

 

I left the program running for over an hour and collected around 20,000 tweets, of which 3,492 were scored. One important thing to note (and I only realized it when I saw how low that number was) is that I forgot to restrict the “live track” option of “GetTwitterData” to English tweets. Because the sentiment file is English-only, many non-English tweets were collected in the first step and then simply ignored in this one, which explains the low classification rate.

The result is quite interesting: of the 3,492 tweets scored, 3667 are positive and 825 negative, which possibly indicates a good reception from the community.

Below I print the 10 best and worst tweets (please mind the language in the negative ones).

 

[image: toptweets]

 

Things to consider (improve):

· I see a lot of retweets. Maybe I shouldn’t consider them? Or, if you retweet something that someone else liked/hated, does it mean you like/hate it too?

· Only consider tweets in English (this is more for the first part of the program).

· Should I consider a different way to decide whether a tweet was scored or not? A tweet_score of zero can mean either that no words were scored or that bad words cancelled out good words. Is there a difference?
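Two of these points could be explored with a small change to the scoring step: count how many words were actually found in the sentiment file, and flag retweets so they can be included or excluded. The following is a rough sketch in Python 3 under stated assumptions: the function names and the “unscored”/“mixed” labels are my own, the lexicon is a made-up sample, and retweets are assumed to be recognisable by an “RT @” prefix in the collected text.

```python
def score_with_hits(text, scores):
    """Return (total score, number of words found in the lexicon)."""
    total, hits = 0, 0
    for word in text.lower().split():
        if word in scores:
            total += scores[word]
            hits += 1
    return total, hits


def classify(text, scores):
    """Separate 'no scored words' from 'good and bad words cancelled out'."""
    total, hits = score_with_hits(text, scores)
    if hits == 0:
        return "unscored"
    if total == 0:
        return "mixed"
    return "positive" if total > 0 else "negative"


def is_retweet(text):
    # Assumption: classic retweets start with "RT @" in the collected text.
    return text.startswith("RT @")


# Made-up sample lexicon and tweets, for illustration only.
scores = {"good": 3, "bad": -3}
for tweet in ["good idea", "good and bad", "hello world", "RT @someone: bad"]:
    print(tweet, "->", classify(tweet, scores), "| retweet:", is_retweet(tweet))
```

With the hit count available, “mixed” tweets could be reported separately instead of being dropped together with the unscored ones.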

 

 

 

Note: thanks to this post on how to post Python code.
