The purpose of this project is to detect and compare trends in sentiment data extracted from Tweets that contain certain keywords. Using the access tokens from a Twitter developer account, users can run historical searches that return Tweets matching certain specifications (such as containing keywords or being written in a specific language). The program then plots the sentiment scores of those Tweets against the scores of a baseline set of Tweets from the same time period. Each sentiment score is the compound score produced by NLTK's VADER analyzer. The Tweet text is stored in local files, and the data can be reused or appended to by rerunning the program over a different time frame with matching user-defined parameters.
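For reference, a compound score for a single piece of Tweet text can be computed with NLTK's VADER as in the minimal sketch below. The example text and the inline download call are illustrative, and the project's own text-cleaning step is not shown here.

```python
# Minimal sketch of obtaining a compound sentiment score with NLTK's VADER.
# Assumes nltk is installed; the example text is made up for illustration.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # no-op if the lexicon is already present

sia = SentimentIntensityAnalyzer()
tweet_text = "The new park downtown is wonderful, great day out!"
compound = sia.polarity_scores(tweet_text)["compound"]  # value in [-1, 1]
print(compound)
```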
As Y values (sentiment scores) approach 1, the average sentiment parsed from Tweets containing the specified keyword(s) becomes more positive. This can have multiple causes, and it should not be assumed that increased positive sentiment in Tweets containing a keyword implies positive sentiment about that keyword. For instance, if the keyword list is (covid, coronavirus) and the sentiment score rises over the course of a week, people may be expressing that they feel more positive about COVID-19 itself, or perhaps a lockdown was just lifted or a vaccine became available. In other words, the sentiment of Tweets containing a specific set of keywords may be incidental to those keywords rather than directed at them. More specific keyword lists can reduce this effect, but it cannot be eliminated entirely.
A sentiment score value in the graph is calculated as the average of the scores collected in a specified interval. Each of the initial scores is the compound sentiment score calculated by NLTK's VADER for a specific Tweet's cleaned(1) text. By comparing a baseline set of averaged scores over a certain time frame to the scores collected for the desired keywords over the same span, it is possible to determine whether Tweets containing those keywords differ from the baseline sentiment, and whether there are any trends in the sentiment scores over time. The baseline scores are currently calculated by searching Twitter's API v2 for the 25 most commonly tweeted English words(2)(3). Here we assume that, in general, the sentiment of Tweets containing the most common words correlates with the most common sentiments.
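The per-interval averaging described above can be sketched as follows. The function name, data layout, and example values are hypothetical and not taken from the project code; it simply buckets (timestamp, compound score) pairs into fixed-length intervals and averages each bucket.

```python
# Illustrative sketch of per-interval averaging of compound scores.
from collections import defaultdict
from datetime import datetime, timedelta

def average_by_interval(scored_tweets, start, interval_minutes):
    """scored_tweets: iterable of (datetime, compound_score) pairs."""
    interval = timedelta(minutes=interval_minutes)
    buckets = defaultdict(list)
    for created_at, compound in scored_tweets:
        index = (created_at - start) // interval  # which interval the Tweet falls in
        buckets[index].append(compound)
    # The average of each bucket becomes one Y value in the sentiment graph.
    return {i: sum(scores) / len(scores) for i, scores in sorted(buckets.items())}

scored = [
    (datetime(2021, 6, 1, 0, 5), 0.42),
    (datetime(2021, 6, 1, 0, 20), -0.10),
    (datetime(2021, 6, 1, 1, 15), 0.65),
]
print(average_by_interval(scored, datetime(2021, 6, 1), interval_minutes=60))
# roughly {0: 0.16, 1: 0.65}
```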
The Twitter Search API v2 allows users with Elevated access to retrieve 2 million Tweets each month, and users with Academic access to retrieve 10 million Tweets each month. Users with Elevated access are limited to 450 requests per 15-minute interval using recent search (up to one week in the past). Users with Academic access can also use full-archive search, which is limited to 300 requests per 15 minutes and 1 request per second. Each response to a request can contain up to 100 Tweets. For an Elevated access user, this translates to a maximum of roughly 2,500 Tweets per hour when collection is spread evenly over a month, and a burst maximum of 45,000 Tweets per 15-minute window. For Academic access, both recent search and full-archive search can sustain roughly 10,000 Tweets per hour for an entire month; recent search has the same burst maximum as Elevated access, while full-archive search can retrieve at most 30,000 Tweets per 15-minute window. Keep in mind that this application uses double the number of requests that would otherwise be expected, since an equal number of baseline Tweets is collected whenever keyword Tweets are requested. Thus Elevated access allows the user to collect roughly 1,250 keyword Tweets per hour on a sustained basis, and Academic access extends that limit to roughly 5,000. When choosing an interval length for Tweet collection, ensure that it divides the total collection duration evenly (both are expressed in minutes). Otherwise, the dataset will end with an unevenly sized interval that likely will not contain the desired quantity of Tweets, which can appear as an outlier in the sentiment graph.
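As a rough way to sanity-check a planned run against these limits, something like the sketch below can be used. The function and constant names are hypothetical, and the figures simply restate the limits described above; it is not part of the project's actual code.

```python
# Back-of-the-envelope budget check for a planned collection run.
TWEETS_PER_REQUEST = 100
MONTHLY_CAP = 2_000_000          # Elevated access; use 10_000_000 for Academic access

def check_plan(duration_minutes, interval_minutes, tweets_per_interval):
    # The interval must divide the total duration evenly, or the final interval
    # will be short and its averaged score may look like an outlier.
    if duration_minutes % interval_minutes != 0:
        raise ValueError("interval_minutes must divide duration_minutes evenly")
    intervals = duration_minutes // interval_minutes
    # Keyword Tweets plus an equal number of baseline Tweets doubles the total.
    total_tweets = intervals * tweets_per_interval * 2
    requests = -(-total_tweets // TWEETS_PER_REQUEST)  # ceiling division
    print(f"{intervals} intervals, {total_tweets} Tweets, ~{requests} requests")
    if total_tweets > MONTHLY_CAP:
        print("Warning: this plan exceeds the monthly Tweet cap.")

# One week of collection, one-hour intervals, 100 keyword Tweets per interval.
check_plan(duration_minutes=7 * 24 * 60, interval_minutes=60, tweets_per_interval=100)
# 168 intervals, 33600 Tweets, ~336 requests
```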