{
"cells": [
{
"cell_type": "markdown",
"id": "a779ca3a-c174-4e88-93ac-2124c7ea049a",
"metadata": {},
"source": [
"# Demo: Track Use of Sentiment Analysis Code"
]
},
{
"cell_type": "markdown",
"id": "123456789-930485093240532940945-0324095320945904325",
"metadata": {
"tags": []
},
"source": [" _Choose Social Media Platform: __Reddit__ | Discord | Bluesky | No Coding_ "]
},
{
"cell_type": "markdown",
"id": "fab16b2e-a406-45d7-aa81-ae758bf73103",
"metadata": {},
"source": [
"In this code demo, we will take the sentiment analysis code we used in the last chapter (Data Mining), and we will turn it into a function which will make it easier to use.\n",
"\n",
"After turning it into a function though, we will add code to that function to track how it is used. We could theoretically take this information we are tracking and send to results to some other account.\n",
"\n",
"This sort of tracking can be part of tracking program [telemetry](https://en.wikipedia.org/wiki/Telemetry#Software), which can be useful in figure out where software is broken or where it is most or least useful. But it can also be violating the privacy of anyone using our funtion who doesn't know we are tracking its use, or used maliciously to steal user information."
]
},
{
"cell_type": "markdown",
"id": "7d500d2c-21ca-4a38-96de-703857a7d7e6",
"metadata": {},
"source": [
"## Reddit PRAW Setup"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "3d3b3c2d-3500-448a-a9f4-9d500977b792",
"metadata": {},
"outputs": [],
"source": [
"import praw"
]
},
{
"cell_type": "markdown",
"id": "fcae7037-b587-4649-afec-4271d0fbca28",
"metadata": {},
"source": [
"(optional) use the fake version of Reddit praw, so you don't have to use real Reddit developer access passwords"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "5360c90a-3a63-426d-ba79-d34bac9be03b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Fake praw is replacing the praw library. Fake praw doesn't need real passwords, and prevents you from accessing real reddit"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%run ../../fake_apis/fake_praw.ipynb"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "4652addb-023c-45e5-bb0c-2d27cf3c8564",
"metadata": {},
"outputs": [],
"source": [
"# Load all your developer access passwords into Python\n",
"# TODO: Put your reddit username, password, and special developer access passwords below:\n",
"username=\"fake_reddit_username\"\n",
"password=\"sa@#4*fdf_fake_password_$%DSG#%DG\"\n",
"client_id=\"45adf$TW_fake_client_id_JESdsg1O\"\n",
"client_secret=\"56sd_fake_client_secret_%Yh%\""
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "722134dc-4578-438e-8e90-2f6135d7e440",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Fake praw is pretending to collect account info to use on reddit"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Give the praw code your reddit account info so\n",
"# it can perform reddit actions\n",
"reddit = praw.Reddit(\n",
" username=username, password=password,\n",
" client_id=client_id, client_secret=client_secret,\n",
" user_agent=\"a custom python script\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5e1b794a-b178-4b06-b992-7e88a466b55d",
"metadata": {},
"source": [
"### load sentiment analysis library and make analyzer"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "9822d030-1db1-47c1-8276-763f66d07be5",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package vader_lexicon to\n",
"[nltk_data] C:\\Users\\kmthayer\\AppData\\Roaming\\nltk_data...\n",
"[nltk_data] Package vader_lexicon is already up-to-date!\n"
]
}
],
"source": [
"import nltk\n",
"nltk.download([\"vader_lexicon\"])\n",
"from nltk.sentiment import SentimentIntensityAnalyzer\n",
"sia = SentimentIntensityAnalyzer()"
]
},
{
"cell_type": "markdown",
"id": "0cb3be89-1f51-4ca6-804f-4ccd43ffa513",
"metadata": {},
"source": [
"### original code to loop through submissions, finding average sentiment\n",
"This is the code from chapter 8 that loops through submissions in the \"cuteanimals\" subreddit and calculates the average sentiment"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "7877be09-65a1-404e-9eb8-a332744d291b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Fake praw is pretending to select the subreddit: cuteanimals"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sentiment: 0.5093\n",
" Submission Title: Look at my cute dog!\n",
"\n",
"Sentiment: 0.0\n",
" Submission Title: A baby lizard!\n",
"\n",
"Sentiment: 0.6239\n",
" Submission Title: The cutest bird ever!\n",
"\n",
"Average sentiment was 0.3777333333333333\n"
]
}
],
"source": [
"num_submissions = 0\n",
"total_sentiment = 0\n",
"\n",
"# Look up the subreddit \"cuteanimals\", then find the \"hot\" list, getting up to 10 submission\n",
"submissions = reddit.subreddit(\"cuteanimals\").hot(limit=10)\n",
"\n",
"# Turn the submission results into a Python List\n",
"submissions_list = list(submissions)\n",
"\n",
"for submission in submissions_list:\n",
" #calculate sentiment\n",
" submission_sentiment = sia.polarity_scores(submission.title)[\"compound\"]\n",
" num_submissions += 1\n",
" total_sentiment += submission_sentiment\n",
"\n",
" print(\"Sentiment: \" + str(submission_sentiment))\n",
" print(\" Submission Title: \" + submission.title)\n",
" print()\n",
"\n",
"\n",
"\n",
"average_sentiment = total_sentiment / num_submissions\n",
"print(\"Average sentiment was \" + str(average_sentiment))"
]
},
{
"cell_type": "markdown",
"id": "0dc6c4d9-a522-4a06-ad47-669895660113",
"metadata": {},
"source": [
"## Make a function using the code above for finding the average sentiment\n",
"We now make a function of that code above by doing the following:\n",
"- Add a `def` line at the start to make a function called `find_average_sentiment`\n",
"- Indent all the old code so that it becomes the contents of the function `find_average_sentiment`\n",
"- Make the function take two arguments:\n",
" - `subreddit_name`, which takes place of \"cuteanimals\", so the person calling the function can choose which subreddit to search\n",
" - `display_progress` which defaults to False. This decides whether or not the print statements are run when the function is run, so we can see the progress if we want, or just get the answer by default\n",
"- At the end of the function, return the average_sentiment as the result"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "0caeb3fd-738c-4284-824f-a19468c9fce9",
"metadata": {},
"outputs": [],
"source": [
" def find_average_sentiment(subreddit_name, display_progress = False):\n",
" num_submissions = 0\n",
" total_sentiment = 0\n",
"\n",
" # Look up the subreddit given as a parameter, then find the \"hot\" list, getting up to 10 submission\n",
" submissions = reddit.subreddit(subreddit_name).hot(limit=10)\n",
"\n",
" # Turn the submission results into a Python List\n",
" submissions_list = list(submissions)\n",
"\n",
" for submission in submissions_list:\n",
" #calculate sentiment\n",
" submission_sentiment = sia.polarity_scores(submission.title)[\"compound\"]\n",
" num_submissions += 1\n",
" total_sentiment += submission_sentiment\n",
"\n",
" if(display_progress):\n",
" print(\"Sentiment: \" + str(submission_sentiment))\n",
" print(\" Submission Title: \" + submission.title)\n",
" print()\n",
"\n",
"\n",
"\n",
" average_sentiment = total_sentiment / num_submissions\n",
" if(display_progress):\n",
" print(\"Average sentiment was \" + str(average_sentiment))\n",
" \n",
" return average_sentiment"
]
},
{
"cell_type": "markdown",
"id": "a51294ef-7be6-4b8c-952f-259bf458f3b4",
"metadata": {},
"source": [
"Now let's try using the function"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "6509d197-1668-4698-8f66-75f06d9051a2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Fake praw is pretending to select the subreddit: cuteanimals"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.3777333333333333"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"find_average_sentiment(\"cuteanimals\")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "55e9313a-2ec4-4e19-b1ad-8a40f73e1294",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Fake praw is pretending to select the subreddit: science"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sentiment: -0.5255\n",
" Submission Title: Scientists have cloned dangerous dinosaurs!\n",
"\n",
"Sentiment: 0.7574\n",
" Submission Title: Scientists have created the best tasting food ever!\n",
"\n",
"Sentiment: 0.0\n",
" Submission Title: F*** magnets, how do they work? And I don't wanna talk to a scientist\n",
"\n",
"Average sentiment was 0.0773\n"
]
},
{
"data": {
"text/plain": [
"0.0773"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"find_average_sentiment(\"science\", display_progress=True)"
]
},
{
"cell_type": "markdown",
"id": "62f91c06-8d66-4272-8f13-56cba15ce7b4",
"metadata": {},
"source": [
"## Modify the function so it tracks use\n",
"Now we make another version of the same function, but with a small difference:\n",
"- We make a list variable called `sentiment_searches` which exists outside the function.\n",
"- At the start of the function we add the subreddit being searched to that list.\n",
"This way, as the function gets used, we'll keep a history of its use in the `sentiment_searches` list"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "4ee4b0d0-f696-4f24-b341-c21169c10b4b",
"metadata": {},
"outputs": [],
"source": [
"# Make a list to save what subreddit was used for each time `find_average_sentiment` is run\n",
"sentiment_searches = []\n",
"\n",
"def find_average_sentiment(subreddit_name, display_progress = False):\n",
" \n",
" # Add the current subreddit being searched to the sentiment_searches list\n",
" sentiment_searches.append(subreddit_name)\n",
" \n",
" num_submissions = 0\n",
" total_sentiment = 0\n",
"\n",
" # Look up the subreddit name given as a parameter, then find the \"hot\" list, getting up to 10 submission\n",
" submissions = reddit.subreddit(subreddit_name).hot(limit=10)\n",
"\n",
" # Turn the submission results into a Python List\n",
" submissions_list = list(submissions)\n",
"\n",
" for submission in submissions_list:\n",
" #calculate sentiment\n",
" submission_sentiment = sia.polarity_scores(submission.title)[\"compound\"]\n",
" num_submissions += 1\n",
" total_sentiment += submission_sentiment\n",
"\n",
" if(display_progress):\n",
" print(\"Sentiment: \" + str(submission_sentiment))\n",
" print(\" Submission Title: \" + submission.title)\n",
" print()\n",
"\n",
"\n",
"\n",
" average_sentiment = total_sentiment / num_submissions\n",
" if(display_progress):\n",
" print(\"Average sentiment was \" + str(average_sentiment))\n",
" \n",
" return average_sentiment"
]
},
{
"cell_type": "markdown",
"id": "75ac59a1-ea54-49a4-b8bd-55f040b50ac5",
"metadata": {},
"source": [
"Now let's run this version of the function"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "3b482148-69a4-4245-8817-9772909edcc3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Fake praw is pretending to select the subreddit: cuteanimals"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.3777333333333333"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"find_average_sentiment(\"cuteanimals\")"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "a794010c-e8f0-42d8-bc38-55039f09aba6",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Fake praw is pretending to select the subreddit: science"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"0.0773"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"find_average_sentiment(\"science\")"
]
},
{
"cell_type": "markdown",
"id": "6e38696f-c2d1-4260-882c-02870422e341",
"metadata": {},
"source": [
"It looks like it works like normal, but our calls to the function have been tracked!"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "5b2b234f-fef9-41e6-9c82-f3ac1b9bf6c1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['cuteanimals', 'science']"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(sentiment_searches)"
]
},
{
"cell_type": "markdown",
"id": "39ec5a8f-ae53-4e6d-9347-2b6e2c91d58b",
"metadata": {},
"source": [
"Now, if we were being malicious, we would hide this code in some other code library we would try to convince you to use, that way you wouldn't notice the code. And instead of just saving those searches or posts to a variable, we would send it to ourselves, perhaps by putting code into our social media code library to log into a different account and private messaged that info to ourselves."
]
},
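{
"cell_type": "markdown",
"id": "8c41e7d2-5b90-4f3a-9e16-2a7d3c0b6f11",
"metadata": {},
"source": [
"As a rough sketch of how tracking like this could be tucked away, the cell below uses a Python decorator to record every call to whatever function it wraps. The names `call_log`, `track_calls`, and `count_words` are made up for this illustration and are not part of PRAW or nltk. This is only one possible approach, and a real malicious library would likely send the log somewhere else instead of keeping it in a list."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e07b5a19-3c6d-4d82-b1f4-6a9c2e5d7f30",
"metadata": {},
"outputs": [],
"source": [
"import functools\n",
"from datetime import datetime, timezone\n",
"\n",
"# Hypothetical log kept by the tracking code (an assumption for this sketch)\n",
"call_log = []\n",
"\n",
"def track_calls(func):\n",
"    # Return a wrapper that records each call, then runs the real function\n",
"    @functools.wraps(func)\n",
"    def wrapper(*args, **kwargs):\n",
"        call_log.append({\n",
"            \"function\": func.__name__,\n",
"            \"args\": args,\n",
"            \"kwargs\": kwargs,\n",
"            \"time\": datetime.now(timezone.utc).isoformat(),\n",
"        })\n",
"        return func(*args, **kwargs)\n",
"    return wrapper\n",
"\n",
"# A made-up example function; the caller sees nothing unusual when using it\n",
"@track_calls\n",
"def count_words(text):\n",
"    return len(text.split())\n",
"\n",
"count_words(\"Look at my cute dog!\")\n",
"\n",
"# The call was still recorded behind the scenes\n",
"display(call_log)"
]
},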
{
"cell_type": "markdown",
"id": "d7fb9326-86bd-43f1-9367-4944c941e9de",
"metadata": {},
"source": [
"## How can we trust code libraries?\n",
"If people can make code libraries track us and violate our privacy, how can we trust them? We could try looking at the [source code for the PRAW library](https://github.com/praw-dev/praw/) to try and make sure the library we are using isn't doing anything bad, but no programmer can be expected to read through all the libraries they use. There is unfortunately no simple answer to this.\n",
"\n",
"In fact, there are cases where people have messed with code libraries:\n",
"- The United States National Security Agency \"[paid massive computer security firm RSA $10 million to promote a flawed encryption system so that the surveillance organization could wiggle its way around security.](https://gizmodo.com/nsa-paid-security-firm-10-million-bribe-to-keep-encryp-1487442397)\"\n",
" - Does US national security outweigh global computer security? \n",
"- Shortly after the Russian invasion of Ukraine in 2022, someone modified a popular NodeJS code library so that it would [automatically destroy files if it was run on a computer in Russia or Belarus](https://arstechnica.com/information-technology/2022/03/sabotage-code-added-to-popular-npm-package-wiped-files-in-russia-and-belarus/).\n",
" - Does opposing a military invasion justify sabatoging a code library? \n",
" \n",
"And those are just the intentional problems with code libraries. All sorts of code libraries and computer programs are full of security flaws, which are regularly discovered and fixed (though who knows how much the flaws were exploited first).\n"
]
}
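,
{
"cell_type": "markdown",
"id": "2f9d6b3a-7e14-4c58-a0d5-9b1c4e8f7a62",
"metadata": {},
"source": [
"One small and very partial check is to read a function's source code from inside Python. The sketch below assumes the cells above have been run, so that `find_average_sentiment` is defined, and uses the standard library's `inspect.getsource` to print its code and look for the tracking variable by name. This only works for Python code that hasn't been deliberately obscured, and it doesn't scale to whole libraries, so treat it as an illustration rather than a real safeguard."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b4a7c0e5-1d92-4b6f-8e33-5c2f9a1d0e74",
"metadata": {},
"outputs": [],
"source": [
"import inspect\n",
"\n",
"# Get the text of the function as it was defined earlier in this notebook\n",
"source_code = inspect.getsource(find_average_sentiment)\n",
"print(source_code)\n",
"\n",
"# A crude check: does the function mention the tracking list by name?\n",
"print(\"Mentions sentiment_searches:\", \"sentiment_searches\" in source_code)"
]
}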
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}