9.4.3. Demo: Track Use of Sentiment Analysis Code


In this code demo, we will take the sentiment analysis code from the last chapter (Data Mining) and turn it into a function, which will make it easier to use.

After turning it into a function, though, we will add code to that function to track how it is used. We could theoretically take the information we are tracking and send the results to some other account.

This sort of tracking can be part of program telemetry, which can be useful in figuring out where software is broken or where it is most or least useful. But it can also violate the privacy of anyone using our function who doesn’t know we are tracking its use, or be used maliciously to steal user information.

Log into Bluesky (atproto)

These are our normal steps to get atproto loaded and logged into Bluesky:

from atproto import Client

(Optional) Make a fake Bluesky connection with the fake_atproto library

For testing purposes, we’ve added this line of code, which loads a fake version of atproto, so it won’t actually connect to Bluesky. If you want to actually connect to Bluesky, don’t run this line of code.

%run ../../fake_apis/fake_atproto.ipynb

Fake atproto (bsky.app) is replacing the atproto.blue library. Fake atproto doesn't need real passwords, and prevents you from accessing real Bluesky

# Login to Bluesky
# TODO: put your account name and password below

client = Client(base_url="https://bsky.social")
client.login("your_account_name.bsky.social", "m#5@_fake_bsky_password_$%Ds")

Fake atproto is pretending to set up a client connection to: https://bsky.social

Fake atproto is pretending log into your account: your_account_name.bsky.social

Load sentiment analysis library and make analyzer

import nltk
nltk.download(["vader_lexicon"])
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\kmthayer\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!

Original code to run a search and loop through posts, finding average sentiment

This is the code from chapter 8 that loops through posts from a search for “news” and calculates the average sentiment.

# Run a search for the word "news" on Bluesky
search_query = "news"
search_results = client.app.bsky.feed.search_posts({'q': search_query}).posts

num_posts = 0
total_sentiment = 0

for post in search_results:
    
    #calculate sentiment
    post_sentiment = sia.polarity_scores(post.record.text)["compound"]
    num_posts += 1
    total_sentiment += post_sentiment

    print("Sentiment: " + str(post_sentiment))
    print("   post text: " + post.record.text)
    print()


average_sentiment = total_sentiment / num_posts
print("Average sentiment was " + str(average_sentiment))

Sentiment: 0.784
   post text: Breaking news: A lovely cat took a nice long nap today!

Sentiment: 0.0
   post text: Breaking news: Someone said a really mean thing on the internet today!

Sentiment: 0.7088
   post text: Breaking news: Some grandparents made some yummy cookies for all the kids to share!

Sentiment: -0.6114
   post text: Breaking news: All the horrors of the universe revealed at last!

Average sentiment was 0.22034999999999996

Make a function using the code above for finding the average sentiment

We now make a function of that code above by doing the following:

- Add a def line at the start to make a function called find_average_sentiment
- Indent all the old code so that it becomes the contents of the function find_average_sentiment
- Make the function take two arguments:
  - search_query, which takes the place of “news”, so the person calling the function can choose what search to run
  - display_progress, which defaults to False. This decides whether or not the print statements run when the function is called, so we can see the progress if we want, or just get the answer by default
- At the end of the function, return the average_sentiment as the result

def find_average_sentiment(search_query, display_progress = False):
    # Run a search on Bluesky for the passed-in search_query
    search_results = client.app.bsky.feed.search_posts({'q': search_query}).posts
    
    num_posts = 0
    total_sentiment = 0
    
    for post in search_results:
        
        #calculate sentiment
        post_sentiment = sia.polarity_scores(post.record.text)["compound"]
        num_posts += 1
        total_sentiment += post_sentiment

        if(display_progress):
            print("Sentiment: " + str(post_sentiment))
            print("   post text: " + post.record.text)
            print()
    
    
    average_sentiment = total_sentiment / num_posts
    if(display_progress):
        print("Average sentiment was " + str(average_sentiment))
    
    return average_sentiment

Now let’s try using the function

find_average_sentiment("news")

0.22034999999999996

find_average_sentiment("scientists", display_progress=True)

Sentiment: -0.5255
   post text: Scientists have cloned dangerous dinosaurs!

Sentiment: 0.7574
   post text: Scientists have created the best tasting food ever!

Sentiment: 0.0
   post text: F*** magnets, how do they work? And I don't wanna talk to any scientists

Average sentiment was 0.0773

0.0773
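One case the function above does not handle is a search that returns no posts: num_posts stays at 0, and the division at the end raises a ZeroDivisionError. Below is a minimal sketch of one way to guard against that. The helper average_or_none is a hypothetical name made up for this illustration (it is not part of the original demo), and it works on a plain list of sentiment scores rather than on Bluesky posts:

```python
def average_or_none(scores):
    """Return the mean of a list of sentiment scores, or None if the list is empty.

    Hypothetical helper for illustration; not part of the original demo.
    """
    if len(scores) == 0:
        # No posts matched the search, so there is nothing to average
        return None
    return sum(scores) / len(scores)


print(average_or_none([0.5, -0.5, 1.0]))
print(average_or_none([]))
```

Returning None is only one possible choice; the function could just as reasonably return 0 or raise a clearer error message.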

Modify the function so it tracks use

Now we make another version of the same function, but with a small difference:

- We make a list variable called sentiment_searches, which exists outside the function.
- At the start of the function, we add the search_query being searched to that list. This way, as the function gets used, we’ll keep a history of its use in the sentiment_searches list.

# Make a list to save what search query was used each time `find_average_sentiment` is run
sentiment_searches = []

def find_average_sentiment(search_query, display_progress = False):

    # Add the current search_query being searched to the sentiment_searches list
    sentiment_searches.append(search_query)
    
    # Run a search on Bluesky for the passed-in search_query
    search_results = client.app.bsky.feed.search_posts({'q': search_query}).posts
    
    num_posts = 0
    total_sentiment = 0
    
    for post in search_results:
        
        #calculate sentiment
        post_sentiment = sia.polarity_scores(post.record.text)["compound"]
        num_posts += 1
        total_sentiment += post_sentiment

        if(display_progress):
            print("Sentiment: " + str(post_sentiment))
            print("   post text: " + post.record.text)
            print()
    
    
    average_sentiment = total_sentiment / num_posts
    if(display_progress):
        print("Average sentiment was " + str(average_sentiment))
    
    return average_sentiment

Now let’s run this version of the function

find_average_sentiment("news")

0.22034999999999996

find_average_sentiment("scientists")

0.0773

It looks like it works like normal, but our calls to the function have been tracked!

display(sentiment_searches)

['news', 'scientists']

Now, if we were being malicious, we would hide this tracking code inside some other code library that we would try to convince you to use, so that you wouldn’t notice it. And instead of just saving those searches or posts to a variable, we would send them to ourselves, perhaps by adding code to our social media code library that logs into a different account and private-messages that information to us.
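The same tracking trick does not have to be written into each function by hand. As a rough sketch (the names track_calls, call_log, and shout are made up for illustration and do not come from any real library), a Python decorator could quietly record the arguments of any function it wraps, without changing what that function returns:

```python
import functools

# Hypothetical hidden log, playing the same role as sentiment_searches
call_log = []

def track_calls(func):
    """Wrap func so every call is recorded in call_log before it runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Save the function name and the arguments it was called with
        call_log.append((func.__name__, args, kwargs))
        # Run the original function and pass its result through unchanged
        return func(*args, **kwargs)
    return wrapper

@track_calls
def shout(text):
    return text.upper() + "!"

print(shout("news"))
print(call_log)
```

From the caller's point of view, shout behaves normally. Only someone who looks at call_log or reads the decorator would notice that calls are being recorded.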

How can we trust code libraries?

If people can make code libraries track us and violate our privacy, how can we trust them? We could try looking at the source code for the atproto library to try and make sure the library we are using isn’t doing anything bad, but no programmer can be expected to read through all the libraries they use. There is unfortunately no simple answer to this.
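For libraries written in Python, a first look at the source does not require leaving the notebook. As a small sketch, the standard library function inspect.getsource returns the source text of a function. Here it is pointed at json.dumps purely as a stand-in for a library function we might want to check, and the keyword search is a crude heuristic chosen for illustration, not a real security audit:

```python
import inspect
import json

# Get the Python source code of a library function as a string
source = inspect.getsource(json.dumps)

# Show the first few lines so we can start reading it
for line in source.splitlines()[:5]:
    print(line)

# Crude check (illustration only): does the code mention anything network-related?
suspicious_words = ["socket", "urllib", "requests"]
found = [word for word in suspicious_words if word in source]
print("Suspicious words found:", found)
```

This only works when the Python source of the function is available, and it only shows one function at a time, which is part of why reading every library we depend on is not realistic.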

In fact, there are cases where people have messed with code libraries:

- The United States National Security Agency “paid massive computer security firm RSA $10 million to promote a flawed encryption system so that the surveillance organization could wiggle its way around security.”
  - Does US national security outweigh global computer security?
- Shortly after the Russian invasion of Ukraine in 2022, someone modified a popular NodeJS code library so that it would automatically destroy files if it was run on a computer in Russia or Belarus.
  - Does opposing a military invasion justify sabotaging a code library?

And those are just the intentional problems with code libraries. All sorts of code libraries and computer programs are full of security flaws, which are regularly discovered and fixed (though who knows how much the flaws were exploited first).