Demo: Track Use of Sentiment Analysis Code
9.4.3. Demo: Track Use of Sentiment Analysis Code#
In this code demo, we will take the sentiment analysis code we used in the last chapter (Data Mining), and we will turn it into a function which will make it easier to use.
After turning it into a function, though, we will add code to that function to track how it is used. We could theoretically take the information we are tracking and send the results to some other account.
This sort of tracking can be part of program telemetry, which can be useful in figuring out where software is broken or where it is most or least useful. But it can also violate the privacy of anyone using our function who doesn’t know we are tracking its use, or it can be used maliciously to steal user information.
Reddit PRAW Setup#
import praw
(optional) use the fake version of Reddit praw, so you don’t have to use real Reddit developer access passwords
%run ../../fake_apis/fake_praw.ipynb
# Load all your developer access passwords into Python
# TODO: Put your reddit username, password, and special developer access passwords below:
username="fake_reddit_username"
password="sa@#4*fdf_fake_password_$%DSG#%DG"
client_id="45adf$TW_fake_client_id_JESdsg1O"
client_secret="56sd_fake_client_secret_%Yh%"
# Give the praw code your reddit account info so
# it can perform reddit actions
reddit = praw.Reddit(
    username=username, password=password,
    client_id=client_id, client_secret=client_secret,
    user_agent="a custom python script"
)
load sentiment analysis library and make analyzer#
import nltk
nltk.download(["vader_lexicon"])
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
[nltk_data] Downloading package vader_lexicon to
[nltk_data] C:\Users\kmthayer\AppData\Roaming\nltk_data...
[nltk_data] Package vader_lexicon is already up-to-date!
original code to loop through submissions, finding average sentiment#
This is the code from chapter 8 that loops through submissions in the “cuteanimals” subreddit and calculates the average sentiment
num_submissions = 0
total_sentiment = 0
# Look up the subreddit "cuteanimals", then find the "hot" list, getting up to 10 submissions
submissions = reddit.subreddit("cuteanimals").hot(limit=10)
# Turn the submission results into a Python List
submissions_list = list(submissions)
for submission in submissions_list:
    # Calculate sentiment
    submission_sentiment = sia.polarity_scores(submission.title)["compound"]
    num_submissions += 1
    total_sentiment += submission_sentiment
    print("Sentiment: " + str(submission_sentiment))
    print(" Submission Title: " + submission.title)
    print()
average_sentiment = total_sentiment / num_submissions
print("Average sentiment was " + str(average_sentiment))
Sentiment: 0.5093
Submission Title: Look at my cute dog!
Sentiment: 0.0
Submission Title: A baby lizard!
Sentiment: 0.6239
Submission Title: The cutest bird ever!
Average sentiment was 0.3777333333333333
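One thing to notice about the code above is that it divides by num_submissions, so it would raise a ZeroDivisionError if the subreddit returned no submissions at all. The sketch below shows one possible guard. The helper name safe_average is hypothetical (it is not part of the chapter's code), and the list holds the three sentiment scores printed above so that the example runs without Reddit access.

```python
def safe_average(values):
    """Return the mean of values, or None when the list is empty."""
    if len(values) == 0:
        # Nothing to average, so avoid dividing by zero
        return None
    return sum(values) / len(values)


# The three title sentiments from the output above
title_sentiments = [0.5093, 0.0, 0.6239]

print(safe_average(title_sentiments))  # average of the three scores
print(safe_average([]))                # None, instead of an error
```

Returning None is only one choice here; raising a clearer error message would be another reasonable option.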
Make a function using the code above for finding the average sentiment#
We now make a function of that code above by doing the following:
Add a def line at the start to make a function called find_average_sentiment.
Indent all the old code so that it becomes the contents of the function find_average_sentiment.
Make the function take two arguments:
subreddit_name, which takes the place of “cuteanimals”, so the person calling the function can choose which subreddit to search.
display_progress, which defaults to False. This decides whether or not the print statements are run when the function is run, so we can see the progress if we want, or just get the answer by default.
At the end of the function, return average_sentiment as the result.
def find_average_sentiment(subreddit_name, display_progress=False):
    num_submissions = 0
    total_sentiment = 0
    # Look up the subreddit given as a parameter, then find the "hot" list, getting up to 10 submissions
    submissions = reddit.subreddit(subreddit_name).hot(limit=10)
    # Turn the submission results into a Python List
    submissions_list = list(submissions)
    for submission in submissions_list:
        # Calculate sentiment
        submission_sentiment = sia.polarity_scores(submission.title)["compound"]
        num_submissions += 1
        total_sentiment += submission_sentiment
        if display_progress:
            print("Sentiment: " + str(submission_sentiment))
            print(" Submission Title: " + submission.title)
            print()
    average_sentiment = total_sentiment / num_submissions
    if display_progress:
        print("Average sentiment was " + str(average_sentiment))
    return average_sentiment
Now let’s try using the function
find_average_sentiment("cuteanimals")
0.3777333333333333
find_average_sentiment("science", display_progress=True)
Sentiment: -0.5255
Submission Title: Scientists have cloned dangerous dinosaurs!
Sentiment: 0.7574
Submission Title: Scientists have created the best tasting food ever!
Sentiment: 0.0
Submission Title: F*** magnets, how do they work? And I don't wanna talk to a scientist
Average sentiment was 0.0773
0.0773
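The two calls above differ only in whether display_progress is given. As a smaller illustration of how a default argument behaves, here is a sketch using a hypothetical stand-in function, describe_scores, which averages a plain list of numbers instead of contacting Reddit.

```python
def describe_scores(scores, display_progress=False):
    """Average a list of numbers, optionally printing each one along the way."""
    total = 0
    for score in scores:
        total += score
        if display_progress:
            print("Score: " + str(score))
    return total / len(scores)


# Leaving out display_progress uses the default (False), so nothing is printed
quiet_result = describe_scores([1.0, 3.0])

# Passing display_progress=True turns the print statements on
loud_result = describe_scores([1.0, 3.0], display_progress=True)
```

Both calls return the same average; the argument only changes what gets printed while the function runs.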
Modify the function so it tracks use#
Now we make another version of the same function, but with a small difference:
We make a list variable called sentiment_searches, which exists outside the function.
At the start of the function, we add the subreddit being searched to that list. This way, as the function gets used, we’ll keep a history of its use in the sentiment_searches list.
# Make a list to save what subreddit was used for each time `find_average_sentiment` is run
sentiment_searches = []
def find_average_sentiment(subreddit_name, display_progress=False):
    # Add the current subreddit being searched to the sentiment_searches list
    sentiment_searches.append(subreddit_name)

    num_submissions = 0
    total_sentiment = 0
    # Look up the subreddit name given as a parameter, then find the "hot" list, getting up to 10 submissions
    submissions = reddit.subreddit(subreddit_name).hot(limit=10)
    # Turn the submission results into a Python List
    submissions_list = list(submissions)
    for submission in submissions_list:
        # Calculate sentiment
        submission_sentiment = sia.polarity_scores(submission.title)["compound"]
        num_submissions += 1
        total_sentiment += submission_sentiment
        if display_progress:
            print("Sentiment: " + str(submission_sentiment))
            print(" Submission Title: " + submission.title)
            print()
    average_sentiment = total_sentiment / num_submissions
    if display_progress:
        print("Average sentiment was " + str(average_sentiment))
    return average_sentiment
Now let’s run this version of the function
find_average_sentiment("cuteanimals")
0.3777333333333333
find_average_sentiment("science")
0.0773
It looks like it works like normal, but our calls to the function have been tracked!
display(sentiment_searches)
['cuteanimals', 'science']
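The same kind of tracking can be attached to a function without editing its body at all, for example with a Python decorator. The following is only a sketch: the names track_calls and call_log are hypothetical, and fake_average_sentiment is a stand-in that returns a fixed number so the example runs without Reddit access.

```python
import functools

# Hypothetical list that records every tracked call
call_log = []


def track_calls(func):
    """Wrap func so that each call's name and arguments are saved to call_log."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_log.append((func.__name__, args, kwargs))
        # Run the original function so the caller sees normal behavior
        return func(*args, **kwargs)
    return wrapper


@track_calls
def fake_average_sentiment(subreddit_name, display_progress=False):
    # Stand-in for find_average_sentiment; always returns 0.0
    return 0.0


fake_average_sentiment("cuteanimals")
fake_average_sentiment("science", display_progress=True)
print(call_log)
```

From the caller's point of view the decorated function behaves exactly as before, which is part of why this kind of tracking is easy to miss.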
Now, if we were being malicious, we would hide this code in some other code library that we would try to convince you to use, so that you wouldn’t notice it. And instead of just saving those subreddit names to a variable, we would send them to ourselves, perhaps by putting code into our find_average_sentiment function that logs into a different Reddit account and private messages that info to ourselves.
How can we trust code libraries?#
If people can make code libraries track us and violate our privacy, how can we trust them? We could try looking at the source code for the PRAW library to try and make sure the library we are using isn’t doing anything bad, but no programmer can be expected to read through all the libraries they use. There is unfortunately no simple answer to this.
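For a library written in Python, we can at least pull up the source code of a single function using the standard inspect module. The sketch below uses json.dumps from the Python standard library as a stand-in for a third-party function (an assumption made so the example runs without installing anything). It only shows how to read the code, not how to prove the code is safe.

```python
import inspect
import json

# Get the Python source code of one function as a string
source_text = inspect.getsource(json.dumps)

# Show just the beginning so we can start reading it
print(source_text[:300])
```

This only works when the Python source is available: inspect.getsource raises a TypeError for built-in functions written in C, and an OSError when the source file cannot be found.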
In fact, there are cases where people have messed with code libraries:
The United States National Security Agency “paid massive computer security firm RSA $10 million to promote a flawed encryption system so that the surveillance organization could wiggle its way around security.”
Does US national security outweigh global computer security?
Shortly after the Russian invasion of Ukraine in 2022, someone modified a popular NodeJS code library so that it would automatically destroy files if it was run on a computer in Russia or Belarus.
Does opposing a military invasion justify sabotaging a code library?
And those are just the intentional problems with code libraries. All sorts of code libraries and computer programs are full of security flaws, which are regularly discovered and fixed (though who knows how much the flaws were exploited first).