Ch10.5.2: Demo: Extra Data From Twitter
In order to get alt-text data from images in Tweets, we’re going to have to look at how to get extra data from Twitter.
Note: You don’t really need to understand this whole process; you can just copy and paste the final code pieces to use them yourself. We are including this explanation in case you want to know how it works.
The examples here are based on examples from this website
But first, let’s do our normal tweepy set-up:
Normal Tweepy Set-Up#
import tweepy
(optional) use the fake version of tweepy, so you don’t have to use real twitter developer access passwords
%run ../../fake_tweepy/fake_tweepy.ipynb
# Load all your developer access passwords into Python
# TODO: Put your twitter account's special developer access passwords below:
bearer_token = "n4tossfgsafs_fake_bearer_token_isa53#$%$"
consumer_key = "sa@#4@fdfdsa_fake_consumer_key_$%DSG#%DG"
consumer_secret = "45adf$T$A_fake_consumer_secret_JESdsg"
access_token = "56sd5Ss4tsea_fake_access_token_%YE%hDsdr"
access_token_secret = "j^$dr_fake_access_token_secret_^A5s#DR5s"
# Give the tweepy code your developer access passwords so
# it can perform twitter actions
client = tweepy.Client(
    bearer_token=bearer_token,
    consumer_key=consumer_key, consumer_secret=consumer_secret,
    access_token=access_token, access_token_secret=access_token_secret
)
Get media (including image) data#
If we want to get media (including image) data from tweets when we are using search_recent_tweets, then we have to include:
expansions='attachments.media_keys', which tells Tweepy to get the media information for the tweet
media_fields=['preview_image_url', 'height', 'width'], which tells Tweepy which information to get for each piece of media
Let’s do a search for tweets that include the word dog, and have an image, and are not retweets (so we don’t just get the same tweet for all the times it was retweeted):
query = "dog -is:retweet has:images"
tweet_search_results = client.search_recent_tweets(
    query=query,
    expansions='attachments.media_keys',  # tell twitter to download the media related to this tweet
    media_fields=['preview_image_url', 'height', 'width']  # when getting the media, make sure to include this info
)
Now, when our search comes back, it has both the Tweet information and the information about media (including images) in those Tweets.
Unfortunately, the Tweet info and the media info come back in two separate parts of the tweet_search_results:
tweet_search_results.data has the list of tweets
tweet_search_results.includes['media'] has a list of the pieces of media in the tweets
display(tweet_search_results.data)
[namespace(text='Look at my cute dog!',
id=2342352355,
author_id=213412413,
data={'attachments': {'media_keys': ['7_4353463']}}),
namespace(text='check out these dog photos',
id=93298432,
author_id=309453565,
data={'attachments': {'media_keys': ['4_354354', '4_324654']}}),
namespace(text='lol silly dog!',
id=43954354,
author_id=309453565,
data={'attachments': {'media_keys': ['5_45353']}})]
display(tweet_search_results.includes['media'])
[namespace(media_key='7_4353463',
type='photo',
height=600,
width=800,
alt_text='Photo of a small dog lying flat on floor, looking exhausted',
url='fake_website_photo1.jpg'),
namespace(media_key='4_354354',
type='photo',
height=300,
width=400,
alt_text=None,
url='fake_website_photo2.jpg'),
namespace(media_key='4_324654',
type='photo',
height=300,
width=400,
alt_text=None,
url='fake_website_photo3.jpg'),
namespace(media_key='5_45353',
type='photo',
height=1200,
width=1024,
alt_text='photo taken by fake user 2',
url='fake_website_photo4.jpg')]
The way this comes back doesn’t directly tell us which piece of media is part of which tweet. Instead, each piece of media has a special ID called the media_key, and each tweet has a list of the media_keys that are part of the tweet:
for a tweet in tweet_search_results.data, the media keys are in tweet.data['attachments']['media_keys']
for a piece of media in tweet_search_results.includes['media'], the media key is in media.media_key
So, if we are looking at a tweet, and look at the media keys, we will want to look up the media information that goes with that key. Looking up something based on a key is easiest to do with a dictionary in Python. So, what we will do is make a dictionary where the keys are media_keys, and the values are the media information. It will look something like this:
Below is the code to do this (using several Python shorthand tricks at once):
media_lookup = {m.media_key: m for m in tweet_search_results.includes['media']}
display(media_lookup)
{'7_4353463': namespace(media_key='7_4353463',
type='photo',
height=600,
width=800,
alt_text='Photo of a small dog lying flat on floor, looking exhausted',
url='fake_website_photo1.jpg'),
'4_354354': namespace(media_key='4_354354',
type='photo',
height=300,
width=400,
alt_text=None,
url='fake_website_photo2.jpg'),
'4_324654': namespace(media_key='4_324654',
type='photo',
height=300,
width=400,
alt_text=None,
url='fake_website_photo3.jpg'),
'5_45353': namespace(media_key='5_45353',
type='photo',
height=1200,
width=1024,
alt_text='photo taken by fake user 2',
url='fake_website_photo4.jpg')}
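The one-line dictionary comprehension above packs a whole loop into a single expression. As a minimal sketch of what it is doing, here is the same lookup built with an ordinary for loop. The fake_media list is hypothetical stand-in data (made with types.SimpleNamespace to imitate the objects displayed above), so this snippet runs without any Twitter access:

```python
from types import SimpleNamespace

# Hypothetical stand-in for tweet_search_results.includes['media']
fake_media = [
    SimpleNamespace(media_key='7_4353463', type='photo', height=600, width=800),
    SimpleNamespace(media_key='4_354354', type='photo', height=300, width=400),
]

# Long-hand version of: {m.media_key: m for m in fake_media}
media_lookup_long = {}
for m in fake_media:
    # the key is the media_key, the value is the whole media object
    media_lookup_long[m.media_key] = m

# The comprehension builds the same dictionary in one line
media_lookup_short = {m.media_key: m for m in fake_media}

print(media_lookup_long == media_lookup_short)
print(media_lookup_long['4_354354'].height)
```

Both versions produce the same dictionary, so which one to use is mostly a matter of which is easier for you to read.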
Now we can choose a tweet, find the media_keys for that tweet, and then look up the media information for each of those media keys:
# get the first tweet
first_tweet = tweet_search_results.data[0]
print("displaying info for tweet: " + first_tweet.text)
# get the media keys for the first tweet
first_tweet_media_keys = first_tweet.data['attachments']['media_keys']
# loop through the media keys
for media_key in first_tweet_media_keys:
    # lookup the info about this particular media_key
    media_info = media_lookup[media_key]
    # print out some info about this piece of media
    print(" type: " + media_info.type)
    print(" height: " + str(media_info.height))
    print(" width: " + str(media_info.width))
    print()
displaying info for tweet: Look at my cute dog!
type: photo
height: 600
width: 800
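Since the goal of this demo is alt-text, the same lookup can be used to collect the alt_text of every image on every tweet. The sketch below is one possible way to do that, not the only one. The get_alt_texts function and the fake_tweets and fake_media lists are hypothetical names and stand-in data made up for this illustration. It also assumes that a tweet without media simply has no 'attachments' entry, and that, when using the real Twitter API instead of the fake one, alt_text has been requested in media_fields:

```python
from types import SimpleNamespace

# Hypothetical stand-ins imitating tweet_search_results.data and
# tweet_search_results.includes['media']
fake_tweets = [
    SimpleNamespace(text='Look at my cute dog!',
                    data={'attachments': {'media_keys': ['7_4353463']}}),
    SimpleNamespace(text='check out these dog photos',
                    data={'attachments': {'media_keys': ['4_354354']}}),
    SimpleNamespace(text='a tweet with no media', data={}),
]
fake_media = [
    SimpleNamespace(media_key='7_4353463',
                    alt_text='Photo of a small dog lying flat on floor, looking exhausted'),
    SimpleNamespace(media_key='4_354354', alt_text=None),
]

media_lookup = {m.media_key: m for m in fake_media}

def get_alt_texts(tweet, media_lookup):
    """Return the alt_text (or None) of each piece of media on one tweet."""
    # Assumption: a tweet without media has no 'attachments' entry
    media_keys = tweet.data.get('attachments', {}).get('media_keys', [])
    alt_texts = []
    for media_key in media_keys:
        media_info = media_lookup.get(media_key)
        # skip any key that is missing from the lookup
        if media_info is not None:
            alt_texts.append(media_info.alt_text)
    return alt_texts

for tweet in fake_tweets:
    print(tweet.text, "->", get_alt_texts(tweet, media_lookup))
```

An alt_text of None means the person who posted the image did not write a description for it.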
Get user information#
User information works the same way that media information did, though there will only be one author per tweet. We have to set an expansion and tell Twitter which user fields to download:
query = "dog -is:retweet has:images"
tweet_search_results = client.search_recent_tweets(
    query=query,
    expansions='author_id',  # tell twitter to download the author related to this tweet
    user_fields=['profile_image_url']  # when getting the author, make sure to include this info
)
Then we make a lookup dictionary for the user information:
user_lookup = {u.id: u for u in tweet_search_results.includes['users']}
display(user_lookup)
{213412413: namespace(id=213412413,
name='Fake User 1',
username='fakeuser1',
profile_image_url='fake_profile_image1.jpg'),
309453565: namespace(id=309453565,
name='Fake User 2',
username='fakeuser2',
profile_image_url='fake_profile_image2.jpg')}
Then we can find the author_id of a tweet in tweet.author_id, and look it up in the user_lookup dictionary:
# get the first tweet
first_tweet = tweet_search_results.data[0]
print("displaying info for tweet: " + first_tweet.text)
# get the author id for the first tweet
first_tweet_author_id = first_tweet.author_id
# look up the info about the author with that id
author = user_lookup[first_tweet_author_id]
# print out some info about the author
print(" author name: " + author.name)
print(" author username: " + author.username)
print(" author profile image: " + author.profile_image_url)
displaying info for tweet: Look at my cute dog!
author name: Fake User 1
author username: fakeuser1
author profile image: fake_profile_image1.jpg
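Because media and user information both use the same lookup pattern, the two can be combined. The sketch below assumes a single search was made with both expansions (as far as we know, tweepy accepts a list such as expansions=['attachments.media_keys', 'author_id'], but check this before relying on it). The describe_tweet function and the fake_results object are hypothetical, made up here so the snippet runs without Twitter access:

```python
from types import SimpleNamespace

# Hypothetical stand-in for a search result that used both expansions
fake_results = SimpleNamespace(
    data=[
        SimpleNamespace(text='Look at my cute dog!', author_id=213412413,
                        data={'attachments': {'media_keys': ['7_4353463']}}),
    ],
    includes={
        'media': [SimpleNamespace(media_key='7_4353463', type='photo',
                                  height=600, width=800)],
        'users': [SimpleNamespace(id=213412413, name='Fake User 1',
                                  username='fakeuser1')],
    },
)

# Build both lookup dictionaries; .get(..., []) avoids a KeyError
# if one of the parts is missing from the results
media_lookup = {m.media_key: m for m in fake_results.includes.get('media', [])}
user_lookup = {u.id: u for u in fake_results.includes.get('users', [])}

def describe_tweet(tweet, media_lookup, user_lookup):
    """Return a one-line summary joining a tweet to its author and media."""
    author = user_lookup[tweet.author_id]
    media_keys = tweet.data.get('attachments', {}).get('media_keys', [])
    media_types = [media_lookup[key].type for key in media_keys]
    return author.username + ": " + tweet.text + " " + str(media_types)

for tweet in fake_results.data:
    print(describe_tweet(tweet, media_lookup, user_lookup))
```

Keeping one dictionary per kind of extra data means each tweet only has to store small id values, and the full details are looked up when needed.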