9.4.3. Demo: Track Use of Sentiment Analysis Code#
In this code demo, we will take the sentiment analysis code we used in the last chapter (Data Mining), and we will turn it into a function which will make it easier to use.
After turning it into a function, though, we will add code to that function to track how it is used. We could theoretically take the information we are tracking and send the results to some other account.
This sort of tracking can be part of program telemetry, which can be useful in figuring out where software is broken or where it is most or least useful. But it can also violate the privacy of anyone using our function who doesn’t know we are tracking its use, or be used maliciously to steal user information.
Log into Bluesky (atproto)#
These are our normal steps to get atproto loaded and logged into Bluesky.
helper function for atproto links
NOTE: You don’t need to worry about the details of how this works. It is just here to make the later code easier to use.
import re #load a "regular expression" library for helping to parse text
from atproto import IdResolver # Load the atproto IdResolver library to get official ATProto user IDs

# function to convert a feed from a weblink url to the special atproto "at" URI
def getATFeedLinkFromURL(url):
    # Get the user handle and feed id from the weblink url
    match = re.search(r'https://bsky.app/profile/([^/]+)/feed/([^/]+)', url)
    if not match:
        raise ValueError("Invalid Bluesky feed URL format.")
    user_handle, feed_id = match.groups()

    # Get the official atproto user ID (did) from the handle
    resolver = IdResolver()
    did = resolver.handle.resolve(user_handle)
    if not did:
        raise ValueError(f'Could not resolve DID for handle "{user_handle}".')

    # Construct the at:// URI
    post_uri = f"at://{did}/app.bsky.feed.generator/{feed_id}"
    return post_uri

# function to convert a post's special atproto "at" URI to a weblink url
def getWebLinkFromPost(post):
    # Get the user id and post id from the post's "at" URI
    match = re.search(r'at://([^/]+)/app.bsky.feed.post/([^/]+)', post.uri)
    if not match:
        raise ValueError("Invalid Bluesky atproto post URL format.")
    user_id, post_id = match.groups()

    post_uri = f"https://bsky.app/profile/{user_id}/post/{post_id}"
    return post_uri

# function to take an author profile and generate a weblink url
def getWebLinkFromProfile(authorInfo):
    author_uri = f"https://bsky.app/profile/{authorInfo.did}"
    return author_uri
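To see what these helpers produce, here is a minimal sketch that repeats the post-link conversion on a stand-in object. The SimpleNamespace post, the user ID did:plc:example123, and the post ID 3kabc are all made up for illustration; a real post would come from the atproto client.

```python
import re
from types import SimpleNamespace

def getWebLinkFromPost(post):
    # Pull the user ID (DID) and post ID out of the post's at:// URI
    match = re.search(r'at://([^/]+)/app.bsky.feed.post/([^/]+)', post.uri)
    if not match:
        raise ValueError("Invalid Bluesky atproto post URL format.")
    user_id, post_id = match.groups()
    return f"https://bsky.app/profile/{user_id}/post/{post_id}"

# Hypothetical stand-in for a post object (only the .uri attribute is needed here)
example_post = SimpleNamespace(uri="at://did:plc:example123/app.bsky.feed.post/3kabc")

web_link = getWebLinkFromPost(example_post)
print(web_link)
```

The printed link has the same profile/post shape as the links you see in a browser when viewing a Bluesky post.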
from atproto import Client
(optional) make a fake Bluesky connection with the fake_atproto library
For testing purposes, we’ve added this line of code, which loads a fake version of atproto, so it won’t actually connect to Bluesky. If you want to try to actually connect to Bluesky, don’t run this line of code.
%run ../../fake_apis/fake_atproto.ipynb
# Login to Bluesky
# TODO: put your account name and password below
client = Client(base_url="https://bsky.social")
client.login("your_account_name.bsky.social", "m#5@_fake_bsky_password_$%Ds")
load sentiment analysis library and make analyzer#
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
[nltk_data] Downloading package vader_lexicon to
[nltk_data] C:\Users\kmthayer\AppData\Roaming\nltk_data...
[nltk_data] Package vader_lexicon is already up-to-date!
original code to loop through submissions, finding average sentiment#
This is the code from chapter 8 that loops through submissions in the “cuteanimals” subreddit and calculates the average sentiment
num_submissions = 0
total_sentiment = 0

# Look up the subreddit "cuteanimals", then find the "hot" list, getting up to 10 submission
submissions = reddit.subreddit("cuteanimals").hot(limit=10)

# Turn the submission results into a Python List
submissions_list = list(submissions)

for submission in submissions_list:
    #calculate sentiment
    submission_sentiment = sia.polarity_scores(submission.title)["compound"]
    num_submissions += 1
    total_sentiment += submission_sentiment

    print("Sentiment: " + str(submission_sentiment))
    print(" Submission Title: " + submission.title)

average_sentiment = total_sentiment / num_submissions
print("Average sentiment was " + str(average_sentiment))
NameError Traceback (most recent call last)
Cell In[6], line 5
2 total_sentiment = 0
4 # Look up the subreddit "cuteanimals", then find the "hot" list, getting up to 10 submission
----> 5 submissions = reddit.subreddit("cuteanimals").hot(limit=10)
7 # Turn the submission results into a Python List
8 submissions_list = list(submissions)
NameError: name 'reddit' is not defined
Make a function using the code above for finding the average sentiment#
We now make a function of that code above by doing the following:
Add a def line at the start to make a function called find_average_sentiment
Indent all the old code so that it becomes the contents of the function
Make the function take two arguments:
subreddit_name, which takes the place of “cuteanimals”, so the person calling the function can choose which subreddit to search
display_progress, which defaults to False. This decides whether or not the print statements are run when the function is run, so we can see the progress if we want, or just get the answer by default
At the end of the function, return the average_sentiment as the result
def find_average_sentiment(subreddit_name, display_progress = False):
    num_submissions = 0
    total_sentiment = 0

    # Look up the subreddit given as a parameter, then find the "hot" list, getting up to 10 submissions
    submissions = reddit.subreddit(subreddit_name).hot(limit=10)

    # Turn the submission results into a Python List
    submissions_list = list(submissions)

    for submission in submissions_list:
        #calculate sentiment
        submission_sentiment = sia.polarity_scores(submission.title)["compound"]
        num_submissions += 1
        total_sentiment += submission_sentiment

        # Only print the progress if the caller asked for it
        if display_progress:
            print("Sentiment: " + str(submission_sentiment))
            print(" Submission Title: " + submission.title)

    average_sentiment = total_sentiment / num_submissions

    if display_progress:
        print("Average sentiment was " + str(average_sentiment))

    return average_sentiment
Now let’s try using the function
find_average_sentiment("science", display_progress=True)
Sentiment: -0.5255
Submission Title: Scientists have cloned dangerous dinosaurs!
Sentiment: 0.7574
Submission Title: Scientists have created the best tasting food ever!
Sentiment: 0.0
Submission Title: F*** magnets, how do they work? And I don't wanna talk to a scientist
Average sentiment was 0.0773
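One thing the function above does not handle is a search that returns no submissions, in which case the final division would fail with a ZeroDivisionError. The following is a minimal sketch of one way to guard against that, not the version used in the rest of this demo. To keep it self-contained, it takes a plain list of titles and a scoring function; average_title_sentiment and fake_score are hypothetical names, and fake_score is only a stand-in for the real sentiment analyzer.

```python
def average_title_sentiment(titles, score_function, display_progress=False):
    # With no titles there is nothing to average, so return None
    # instead of dividing by zero
    if len(titles) == 0:
        return None

    total_sentiment = 0
    for title in titles:
        title_sentiment = score_function(title)
        total_sentiment += title_sentiment

        # Only print the progress if the caller asked for it
        if display_progress:
            print("Sentiment: " + str(title_sentiment))
            print("   Title: " + title)

    return total_sentiment / len(titles)


# Hypothetical stand-in scorer: 1.0 if the title contains "best", otherwise 0.0
def fake_score(title):
    return 1.0 if "best" in title else 0.0


empty_result = average_title_sentiment([], fake_score)
some_result = average_title_sentiment(["the best food", "magnets"], fake_score)
print(empty_result)
print(some_result)
```

Returning None for an empty search is just one possible choice; raising a more descriptive error would also be reasonable.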
Modify the function so it tracks use#
Now we make another version of the same function, but with a small difference:
We make a list variable called sentiment_searches which exists outside the function. At the start of the function we add the subreddit being searched to that list. This way, as the function gets used, we’ll keep a history of its use in the sentiment_searches list.
# Make a list to save what subreddit was used for each time `find_average_sentiment` is run
sentiment_searches = []

def find_average_sentiment(subreddit_name, display_progress = False):
    # Add the current subreddit being searched to the sentiment_searches list
    sentiment_searches.append(subreddit_name)

    num_submissions = 0
    total_sentiment = 0

    # Look up the subreddit name given as a parameter, then find the "hot" list, getting up to 10 submissions
    submissions = reddit.subreddit(subreddit_name).hot(limit=10)

    # Turn the submission results into a Python List
    submissions_list = list(submissions)

    for submission in submissions_list:
        #calculate sentiment
        submission_sentiment = sia.polarity_scores(submission.title)["compound"]
        num_submissions += 1
        total_sentiment += submission_sentiment

        # Only print the progress if the caller asked for it
        if display_progress:
            print("Sentiment: " + str(submission_sentiment))
            print(" Submission Title: " + submission.title)

    average_sentiment = total_sentiment / num_submissions

    if display_progress:
        print("Average sentiment was " + str(average_sentiment))

    return average_sentiment
Now let’s run this version of the function
find_average_sentiment("cuteanimals")
find_average_sentiment("science")
It looks like it works like normal, but our calls to the function have been tracked!
sentiment_searches
['cuteanimals', 'science']
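The same kind of tracking can be added without touching the body of the function at all. The sketch below uses a Python decorator to wrap a function so that the first argument of each call is recorded. The names track_first_argument, tracked_searches, and fake_find_average_sentiment are hypothetical and used only for illustration, and the stand-in function returns a fixed number rather than contacting any social media site.

```python
import functools

# List that quietly collects the first argument of every tracked call
tracked_searches = []

def track_first_argument(function):
    @functools.wraps(function)
    def wrapper(first_argument, *args, **kwargs):
        # Record what was searched for, then run the original function as usual
        tracked_searches.append(first_argument)
        return function(first_argument, *args, **kwargs)
    return wrapper

@track_first_argument
def fake_find_average_sentiment(subreddit_name, display_progress=False):
    # Stand-in: a real version would look up posts and score them
    return 0.0

fake_find_average_sentiment("cuteanimals")
fake_find_average_sentiment("science", display_progress=True)
print(tracked_searches)
```

This sketch assumes the search term is passed as the first positional argument. Because the wrapper returns exactly what the original function returns, someone calling the function would see no difference in its behavior.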
Now, if we were being malicious, we would hide this code in some other code library that we would try to convince you to use, so that you wouldn’t notice it. And instead of just saving those searches or posts to a variable, we would send them to ourselves, perhaps by putting code into our social media code library that logs into a different account and private messages that info to us.
How can we trust code libraries?#
If people can make code libraries track us and violate our privacy, how can we trust them? We could try looking at the source code for the PRAW library to try and make sure the library we are using isn’t doing anything bad, but no programmer can be expected to read through all the libraries they use. There is unfortunately no simple answer to this.
In fact, there are cases where people have messed with code libraries:
The United States National Security Agency “paid massive computer security firm RSA $10 million to promote a flawed encryption system so that the surveillance organization could wiggle its way around security.”
Does US national security outweigh global computer security?
Shortly after the Russian invasion of Ukraine in 2022, someone modified a popular NodeJS code library so that it would automatically destroy files if it was run on a computer in Russia or Belarus.
Does opposing a military invasion justify sabotaging a code library?
And those are just the intentional problems with code libraries. All sorts of code libraries and computer programs are full of security flaws, which are regularly discovered and fixed (though who knows how much the flaws were exploited first).