Ch10.5.2: Demo: Extra Data From Twitter#

In order to get alt-text data from images in Tweets, we’re going to have to look at how to get extra data from Twitter.

Note: You don’t really need to understand this whole process. You can just copy and paste the final code pieces to use them yourself. We are including this explanation in case you want to know how it works.

The examples here are based on examples from this website

But first, let’s do our normal tweepy set-up:

Normal Tweepy Set-Up#

import tweepy

(optional) use the fake version of tweepy, so you don’t have to use real twitter developer access passwords

%run ../../fake_tweepy/fake_tweepy.ipynb
Fake tweepy is replacing the tweepy library. Fake Tweepy doesn't need real passwords, and prevents you from accessing real twitter
# Load all your developer access passwords into Python
# TODO: Put your twitter account's special developer access passwords below:
bearer_token = "n4tossfgsafs_fake_bearer_token_isa53#$%$"
consumer_key = "sa@#4@fdfdsa_fake_consumer_key_$%DSG#%DG"
consumer_secret = "45adf$T$A_fake_consumer_secret_JESdsg"
access_token = "56sd5Ss4tsea_fake_access_token_%YE%hDsdr"
access_token_secret = "j^$dr_fake_consumer_key_^A5s#DR5s"
# Give the tweepy code your developer access passwords so
# it can perform twitter actions
client = tweepy.Client(
   bearer_token=bearer_token,
   consumer_key=consumer_key, consumer_secret=consumer_secret,
   access_token=access_token, access_token_secret=access_token_secret
)
Fake Tweepy is pretending to log in to twitter

Get media (including image) data#

To get media (including image) data from tweets when we are using search_recent_tweets, we have to include two extra arguments:

  • expansions='attachments.media_keys' which tells Tweepy to get the media information for the tweet

  • media_fields=['preview_image_url', 'height', 'width'] which tells Tweepy which information to get for each piece of media.

Let’s do a search for tweets that include the word dog, and have an image, and are not retweets (so we don’t just get the same tweet for all the times it was retweeted):

query = "dog -is:retweet has:images"

tweet_search_results = client.search_recent_tweets(
                                    query=query,
                                    expansions='attachments.media_keys', #tell twitter to download the media related to this tweet
                                    media_fields=['preview_image_url', 'height', 'width']  # when getting the media, make sure to include this info
                                    )
Fake Tweepy is pretending to search for 'dog -is:retweet has:images' and is returning some fake tweets.
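
The search above asks for preview_image_url, height, and width. Since our goal is alt-text, here is a small sketch of how the same request could also ask for it. This assumes the real Twitter API v2, where alt_text and url are among the documented media fields and media fields generally have to be requested to be returned. The search_kwargs name is just for illustration, and the actual call is left commented out so the snippet runs without any passwords.

```python
# A sketch of the same search arguments, with alt_text and url added.
# Assumption: the real Twitter API v2 accepts 'alt_text' and 'url' as media fields.
search_kwargs = {
    "query": "dog -is:retweet has:images",
    "expansions": "attachments.media_keys",
    "media_fields": ["alt_text", "url", "preview_image_url", "height", "width"],
}

# With a client set up as above, the call would look like this:
# tweet_search_results = client.search_recent_tweets(**search_kwargs)

print(search_kwargs["media_fields"])
```

Fake Tweepy returns the same fake media information either way, so this only matters when talking to the real Twitter.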

Now, when our search comes back, it has both the Tweet information and the information about media (including images) in those Tweets.

Unfortunately the Tweet info and the media info come back in two separate parts of the tweet_search_results:

  • tweet_search_results.data has the list of tweets

  • tweet_search_results.includes['media'] has a list of the pieces of media in the tweets

display(tweet_search_results.data)
[namespace(text='Look at my cute dog!',
           id=2342352355,
           author_id=213412413,
           data={'attachments': {'media_keys': ['7_4353463']}}),
 namespace(text='check out these dog photos',
           id=93298432,
           author_id=309453565,
           data={'attachments': {'media_keys': ['4_354354', '4_324654']}}),
 namespace(text='lol silly dog!',
           id=43954354,
           author_id=309453565,
           data={'attachments': {'media_keys': ['5_45353']}})]
display(tweet_search_results.includes['media'])
[namespace(media_key='7_4353463',
           type='photo',
           height=600,
           width=800,
           alt_text='Photo of a small dog lying flat on floor, looking exhausted',
           url='fake_website_photo1.jpg'),
 namespace(media_key='4_354354',
           type='photo',
           height=300,
           width=400,
           alt_text=None,
           url='fake_website_photo2.jpg'),
 namespace(media_key='4_324654',
           type='photo',
           height=300,
           width=400,
           alt_text=None,
           url='fake_website_photo3.jpg'),
 namespace(media_key='5_45353',
           type='photo',
           height=1200,
           width=1024,
           alt_text='photo taken by fake user 2',
           url='fake_website_photo4.jpg')]

The way this comes back doesn’t directly tell us which piece of media is part of which tweet. Instead, for each piece of media, there is a special id number called the media_key, and for each tweet there is a list of media_keys that are part of the tweet.

  • for a tweet in tweet_search_results.data, the media_keys are in tweet.data['attachments']['media_keys']

  • for a piece of media in tweet_search_results.includes['media'], the media_key is in media.media_key

So, if we are looking at a tweet and its media keys, we will want to look up the media information that goes with each key. Looking up something based on a key is easiest to do with a dictionary in Python. So, what we will do is make a dictionary where the keys are media_keys and the values are the media information. Looking up a key such as '7_4353463' in that dictionary will then give us the information for that one piece of media.

Below is the code to do this (using several Python short hand tricks at once):

media_lookup = {m.media_key: m for m in tweet_search_results.includes['media']}

display(media_lookup)
{'7_4353463': namespace(media_key='7_4353463',
           type='photo',
           height=600,
           width=800,
           alt_text='Photo of a small dog lying flat on floor, looking exhausted',
           url='fake_website_photo1.jpg'),
 '4_354354': namespace(media_key='4_354354',
           type='photo',
           height=300,
           width=400,
           alt_text=None,
           url='fake_website_photo2.jpg'),
 '4_324654': namespace(media_key='4_324654',
           type='photo',
           height=300,
           width=400,
           alt_text=None,
           url='fake_website_photo3.jpg'),
 '5_45353': namespace(media_key='5_45353',
           type='photo',
           height=1200,
           width=1024,
           alt_text='photo taken by fake user 2',
           url='fake_website_photo4.jpg')}
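
If the one-line version of media_lookup is hard to read, the sketch below spells out the same idea as an ordinary loop and checks that both versions give the same dictionary. The media_list here is a made-up stand-in for tweet_search_results.includes['media'], built with SimpleNamespace so the example runs on its own.

```python
from types import SimpleNamespace

# Stand-in media list (hypothetical), shaped like tweet_search_results.includes['media']
media_list = [
    SimpleNamespace(media_key='7_4353463', type='photo',
                    alt_text='Photo of a small dog lying flat on floor, looking exhausted'),
    SimpleNamespace(media_key='4_354354', type='photo', alt_text=None),
]

# Long form: start with an empty dictionary and add one entry per piece of media
media_lookup_loop = {}
for m in media_list:
    media_lookup_loop[m.media_key] = m

# Short form: a dictionary comprehension that does the same thing in one line
media_lookup = {m.media_key: m for m in media_list}

print(media_lookup == media_lookup_loop)
print(media_lookup['4_354354'].type)
```

Both forms map each media_key to the full media information, so either one can be used in the steps that follow.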

Now we can choose a tweet, find the media_keys for that tweet, and then look up the media information for each of those keys:

# get the first tweet
first_tweet = tweet_search_results.data[0]

print("displaying info for tweet: " + first_tweet.text)

# get the media keys for the first tweet
first_tweet_media_keys = first_tweet.data['attachments']['media_keys']

# loop through the media keys
for media_key in first_tweet_media_keys:
    # lookup the info about this particular media_key
    media_info = media_lookup[media_key]
    
    # print out some info about this piece of media
    print("  type: " + media_info.type)
    print("  height: " + str(media_info.height))
    print("  width: " + str(media_info.width))
    print()
displaying info for tweet: Look at my cute dog!
  type: photo
  height: 600
  width: 800
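
The same lookup can be repeated for every tweet to collect the alt-text we were after. Below is a minimal sketch using made-up stand-in data shaped like the fake results above. The alt_texts_for function and the third tweet (one with no media) are hypothetical additions for illustration. The sketch uses .get so that a tweet without attachments does not cause an error.

```python
from types import SimpleNamespace

# Stand-in data (hypothetical), shaped like the fake search results above
tweets = [
    SimpleNamespace(text='Look at my cute dog!',
                    data={'attachments': {'media_keys': ['7_4353463']}}),
    SimpleNamespace(text='check out these dog photos',
                    data={'attachments': {'media_keys': ['4_354354', '4_324654']}}),
    SimpleNamespace(text='a tweet with no media', data={}),
]
media = [
    SimpleNamespace(media_key='7_4353463',
                    alt_text='Photo of a small dog lying flat on floor, looking exhausted'),
    SimpleNamespace(media_key='4_354354', alt_text=None),
    SimpleNamespace(media_key='4_324654', alt_text=None),
]
media_lookup = {m.media_key: m for m in media}


def alt_texts_for(tweet, media_lookup):
    """Return the alt_text (or None) for each piece of media attached to a tweet."""
    # .get() gives an empty result instead of an error when a tweet has no attachments
    media_keys = tweet.data.get('attachments', {}).get('media_keys', [])
    return [media_lookup[key].alt_text for key in media_keys if key in media_lookup]


# loop through every tweet and print the alt-text of each attached piece of media
for tweet in tweets:
    print(tweet.text)
    for alt_text in alt_texts_for(tweet, media_lookup):
        if alt_text is None:
            print("  (no alt-text)")
        else:
            print("  alt-text: " + alt_text)
```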

Get user information#

User information works the same way that media information did, though there will only be one author per tweet. We have to set an expansion and tell Twitter which user fields to download:

query = "dog -is:retweet has:images"

tweet_search_results = client.search_recent_tweets(
                                    query=query,
                                    expansions='author_id', #tell twitter to download the author related to this tweet
                                    user_fields=['profile_image_url']  # when getting the author, make sure to include this info
                                    )
Fake Tweepy is pretending to search for 'dog -is:retweet has:images' and is returning some fake tweets.

Then we make a lookup dictionary for the user information:

user_lookup = {u.id: u for u in tweet_search_results.includes['users']}

display(user_lookup)
{213412413: namespace(id=213412413,
           name='Fake User 1',
           username='fakeuser1',
           profile_image_url='fake_profile_image1.jpg'),
 309453565: namespace(id=309453565,
           name='Fake User 2',
           username='fakeuser2',
           profile_image_url='fake_profile_image2.jpg')}

Then we can find the author_id of a tweet in tweet.author_id, and look it up in the user_lookup dictionary:

first_tweet = tweet_search_results.data[0]

print("displaying info for tweet: " + first_tweet.text)

# get the author id for the first tweet
first_tweet_author_id = first_tweet.author_id

# look up the author in the user_lookup dictionary
author = user_lookup[first_tweet_author_id]

# print out some info about the author
print("  author name: " + author.name)
print("  author username: " + author.username)
print("  author profile image: " + author.profile_image_url)
displaying info for tweet: Look at my cute dog!
  author name: Fake User 1
  author username: fakeuser1
  author profile image: fake_profile_image1.jpg
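
As with media, the user lookup can be used on every tweet rather than just the first one. The sketch below groups tweet texts by author username, using made-up stand-in data shaped like the fake results above. The tweets_by_username name is a hypothetical choice for illustration, and .get is used in case an author is missing from the lookup.

```python
from types import SimpleNamespace

# Stand-in data (hypothetical), shaped like the fake search results above
tweets = [
    SimpleNamespace(text='Look at my cute dog!', author_id=213412413),
    SimpleNamespace(text='check out these dog photos', author_id=309453565),
    SimpleNamespace(text='lol silly dog!', author_id=309453565),
]
users = [
    SimpleNamespace(id=213412413, name='Fake User 1', username='fakeuser1'),
    SimpleNamespace(id=309453565, name='Fake User 2', username='fakeuser2'),
]
user_lookup = {u.id: u for u in users}

# group the tweet texts by the username of their author
tweets_by_username = {}
for tweet in tweets:
    author = user_lookup.get(tweet.author_id)
    username = author.username if author is not None else "(unknown author)"
    tweets_by_username.setdefault(username, []).append(tweet.text)

for username, texts in tweets_by_username.items():
    print(username + ": " + str(len(texts)) + " tweet(s)")
```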