Project 2: Twitter Trends

What do people tweet?
Draw their feelings on a map
to discover trends

Introduction
Logistics
The Autograder
Phase 1: The Feelings in Tweets

Tweets
Problem 1 (1 pt)
Problem 2 (2 pt)
Problem 3 (1 pt)
Problem 4 (1 pt)

Phase 2: The Geometry of Maps

Problem 5 (2 pt)
Problem 6 (1 pt)

Phase 3: The Mood of the Nation

Problem 7 (2 pt)
Problem 8 (2 pt)

Extensions

Introduction

In this project, you will develop a geographic visualization of Twitter data across the USA. You will need to use lists and data abstraction techniques to create a modular program. This project uses ideas from Sections 2.1, 2.2, and 2.3 of the Composing Programs online textbook.

The map displayed above depicts how the people in different states feel about Texas. This image is generated by:

Collecting public Twitter posts (tweets) that have been tagged with geographic locations and filtering for those that contain the "texas" query term,
Assigning a sentiment (positive or negative) to each tweet, based on all of the words it contains,
Aggregating tweets by the state with the closest geographic center, and finally
Coloring each state according to the aggregate sentiment of its tweets. Red means positive sentiment; blue means negative.

The details of how to conduct each of these steps are contained within the project description. By the end of this project, you will be able to map the sentiment of any word or phrase. The trends.zip archive contains all the starter code and a small set of data.

The project uses several files, but all of your changes will be made to the first one.

`trends.py`	A starter implementation of the main project file.
`geo.py`	Geographic positions, 2-D projection equations, and geographic distance functions.
`database.py`	Functions for creating databases and interacting with them.
`maps.py`	Functions for drawing maps.
`data.py`	Functions for loading Twitter data from files.
`graphics.py`	A simple Python graphics library.
`ucb.py`	Utility functions for 61A.
`autograder.py`	Utility functions for grading.

The data directory contains all the data files that the project needs in order to run. The trends.zip archive contains this directory: download it to get started. Downloading each file individually is error-prone.

The autograder also uses a file called tests.pkl. This file is included in the zip archive.

Finally, we have provided a larger dataset that you can use once you are done with the project. See instructions at the end of this document for ways to get the dataset.

Logistics

This is a one-week project with no partners.

Start early! Feel free to ask for help early and often. The course staff is here to assist you, but we can't help everyone an hour before the deadline. Piazza and IRC await. You are not alone!

There are 15 possible points (12 for correctness and 3 for composition). You only need to submit the file trends.py. You do not need to modify any other files for this project. To submit the project, change to the directory where the trends.py file is located and run submit proj2.

The Autograder

We've included an autograder which includes tests for each question. Just as in the Hog project, you will have to unlock some of the tests first before you can use them to test your project. To unlock tests for a particular question, run the following command from your terminal:

python3 autograder.py -u q1

Once you have unlocked the tests, you can invoke the autograder for a particular question as follows:

python3 autograder.py -q q1

To help with debugging, you can also start an interactive prompt if an error occurs by adding the -i flag at the end:

python3 autograder.py -q q1 -i

You can also invoke the autograder for all problems at once using:

python3 autograder.py

One last note: you might have noticed a file called tests.pkl that came with the project. This file is used to store autograder tests, so make sure not to modify it. If you need to get a fresh copy, you can download it here.

Phase 1: The Feelings in Tweets

In this phase, you will create an abstract data type for tweets, split the text of a tweet into words, and calculate the amount of positive or negative feeling in a tweet.

Tweets

First, we will define an abstract data type for tweets. To ensure that we do not violate abstraction barriers later in the project, we will create two different representations:

(A) The constructor make_tweet returns a Python list with the following items:

text: string, the text of the tweet, all in lowercase
time: datetime object, when the tweet was posted
latitude: a floating-point number, the latitude of the tweet's location
longitude: a floating-point number, the longitude of the tweet's location

(B) The alternate constructor make_tweet_fn returns a function that takes a string argument that is one of the keys above and returns the corresponding value.
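
To illustrate what "two representations of one abstract data type" means without touching the tweet code itself, here is a minimal sketch built around a hypothetical 2-D point. None of the names below (make_point, make_point_fn, point_x, and so on) appear in the starter code; they are assumptions made for the example only.

```python
# Hypothetical example: two representations of a 2-D point ADT.
# None of these names come from the project's starter code.

def make_point(x, y):
    """Representation (A): a point stored as a Python list."""
    return [x, y]

def point_x(point):
    return point[0]

def point_y(point):
    return point[1]

def make_point_fn(x, y):
    """Representation (B): a point stored as a function from key to value."""
    def point(key):
        if key == 'x':
            return x
        elif key == 'y':
            return y
        raise KeyError(key)
    return point

def point_x_fn(point):
    return point('x')

def point_y_fn(point):
    return point('y')

def describe(get_x, get_y, point):
    """Code written against selectors works with either representation."""
    return '({0}, {1})'.format(get_x(point), get_y(point))

print(describe(point_x, point_y, make_point(3, 4)))
print(describe(point_x_fn, point_y_fn, make_point_fn(3, 4)))
```

Both calls print the same description, because describe only ever reaches the data through selectors. That is the property the -f flag later in the project relies on.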

Problem 1 (1 pt)

Implement the missing selector and constructor functions for these two representations: tweet_text, tweet_time, tweet_location correspond to representation (A); make_tweet_fn corresponds to representation (B).

For tweet_location you should return a position object. The constructors and selectors for this abstract data type can be found in geo.py. Remember to preserve data abstraction!

The two representations created by make_tweet and make_tweet_fn do not need to work together, but each constructor should work with its corresponding selectors.

As with project 1, you will need to unlock the tests first before using them:

python3 autograder.py -u q1
python3 autograder.py -q q1

Problem 2 (2 pt)

Improve the extract_words function as follows: Assume that a word is any consecutive substring of text that consists only of ASCII letters. The string ascii_letters in the string module contains all letters in the ASCII character set. The extract_words function should list all such words in order and nothing else.
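
The definition of a word above can be made concrete with a short sketch. This is one possible approach, not necessarily the intended one: the helper name ascii_words is hypothetical, and the strategy (turn every non-letter into a space, then split) is an assumption chosen for brevity.

```python
from string import ascii_letters

def ascii_words(text):
    """Return the maximal runs of ASCII letters in text, in order.

    Hypothetical helper for illustration; not part of trends.py.
    """
    # Replace every character that is not an ASCII letter with a space,
    # then split on whitespace to collect the remaining runs of letters.
    cleaned = ''.join(ch if ch in ascii_letters else ' ' for ch in text)
    return cleaned.split()

print(ascii_words("anything else.....not my job"))
print(ascii_words("paperclips! they're so awesome, cool, & useful!"))
```

Note that under this definition an apostrophe splits "they're" into two words, and digits or accented characters act as separators rather than as parts of a word.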

Unlock, implement and test your implementation before moving on:

python3 autograder.py -u q2
python3 autograder.py -q q2

Problem 3 (1 pt)

Implement the sentiment abstract data type, which represents a sentiment value that may or may not exist. The constructor make_sentiment takes either a numeric value within the interval -1 to 1, or None to indicate that the value does not exist. Implement the selectors has_sentiment and sentiment_value as well. You may use any representation you choose, but the rest of your program should not depend on this representation.
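
The idea of a value that "may or may not exist" can be sketched with a hypothetical optional-score type. The names make_optional_score, has_score, and score_value are invented for this example, and the tagged-tuple representation is just one arbitrary choice among many.

```python
# Hypothetical "optional score" ADT; these names are not from trends.py.

def make_optional_score(value):
    """Return a score holding a number in [-1, 1], or nothing if value is None."""
    assert value is None or -1 <= value <= 1, 'score out of range'
    return ('score', value)

def has_score(score):
    """Return whether the score holds a number."""
    return score[1] is not None

def score_value(score):
    """Return the number held by the score, which must exist."""
    assert has_score(score), 'no value to return'
    return score[1]

neutral = make_optional_score(0)
unknown = make_optional_score(None)
print(has_score(neutral), score_value(neutral))  # a real value that happens to be 0
print(has_score(unknown))                        # the absence of any value
```

The point of the sketch is the distinction between zero and missing: testing the stored value for truthiness would confuse the two, which is why has_score compares against None explicitly.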

Unlock, implement and test your implementation before moving on:

python3 autograder.py -u q3
python3 autograder.py -q q3

You can also call the print_sentiment function to print the sentiment values of all sentiment-carrying words in a line of text.

python3 trends.py -p computer science is my favorite!
python3 trends.py -p life without lambda: awful or awesome?

Problem 4 (1 pt)

Implement analyze_tweet_sentiment, which takes a tweet (of the abstract data type) and returns a sentiment. Read the docstrings for get_word_sentiment and analyze_tweet_sentiment in trends.py to understand how the two functions interact. Your implementation should not depend on the representation of a sentiment!

The tweet_words function should prove useful here: it combines the tweet_text selector and extract_words function from the previous questions to return a list of words in a tweet.

Unlock, implement and test your implementation before moving on:

python3 autograder.py -u q4
python3 autograder.py -q q4

Phase 2: The Geometry of Maps

In this phase, we will implement two functions that together determine the centers of U.S. states. The shape of a state is represented as a list of polygons. Some states (e.g. Hawaii) consist of multiple polygons, but most states (e.g. Colorado) consist of only one polygon (still represented as a length-one list).

We will use the position abstract data type to represent geographic latitude-longitude positions on the Earth. The data abstraction, defined at the top of geo.py, has the constructor make_position and the selectors latitude and longitude.

Problem 5 (2 pt)

Implement find_centroid, which takes a polygon and returns three values: the coordinates of its centroid and its area. The input polygon is represented as a list of position values that are consecutive vertices of its perimeter. The first vertex is always identical to the last.

The centroid of a two-dimensional shape is its center of balance, defined as the intersection of all straight lines that evenly divide the shape into equal-area halves. find_centroid returns the centroid and area of an individual polygon.

The formula for computing the centroid of a polygon appears on Wikipedia. The formula relies on vertices being consecutive (either clockwise or counterclockwise; both give the same answer), a property that you may assume always holds for the input.

Hint: latitudes correspond to the x values, and longitudes correspond to the y values.

The area of a polygon is never negative. Depending on how you compute the area, you may need to use the built-in abs function to return a non-negative number.

Manipulate positions using their selectors (latitude and longitude) rather than assuming a particular representation.
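
As a rough illustration of the polygon centroid formula (the standard "shoelace" sums), the sketch below works on plain (x, y) tuples rather than on position values, so it is not a drop-in solution. The name polygon_centroid is hypothetical, and the sketch assumes the polygon has non-zero area; a degenerate polygon would cause a division by zero here.

```python
def polygon_centroid(vertices):
    """Return (cx, cy, area) for a closed polygon given as (x, y) tuples.

    Hypothetical helper for illustration. Assumes the first vertex is
    repeated as the last and that the polygon's area is not zero.
    """
    cross_sum = 0  # twice the signed area
    x_sum = 0      # numerator for the centroid's x coordinate
    y_sum = 0      # numerator for the centroid's y coordinate
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        cross = x0 * y1 - x1 * y0
        cross_sum += cross
        x_sum += (x0 + x1) * cross
        y_sum += (y0 + y1) * cross
    signed_area = cross_sum / 2
    cx = x_sum / (6 * signed_area)
    cy = y_sum / (6 * signed_area)
    # The signed area is negative for one vertex ordering; report its magnitude.
    return cx, cy, abs(signed_area)

square = [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)]
print(polygon_centroid(square))
print(polygon_centroid(list(reversed(square))))
```

Reversing the vertex order flips the sign of the intermediate area but leaves the centroid unchanged, since the sign cancels in the division. That is why only the area needs abs.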

Unlock, implement and test your implementation before moving on:

python3 autograder.py -u q5
python3 autograder.py -q q5

Problem 6 (1 pt)

Implement find_state_center, which takes a state represented by a list of polygons and returns a position object, its centroid.

The centroid of a collection of polygons can be computed by geometric decomposition. The centroid of a shape is the weighted average of the centroids of its component polygons, weighted by their area.
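
A small numeric sketch of an area-weighted average may help. It assumes each component has already been reduced to a plain (cx, cy, area) tuple; the name combine_centroids is hypothetical, and the project itself expects a position value rather than a tuple as the result.

```python
def combine_centroids(parts):
    """Return the area-weighted average (x, y) of (cx, cy, area) tuples.

    Hypothetical helper for illustration. Assumes the total area is not zero.
    """
    total_area = sum(area for _, _, area in parts)
    x = sum(cx * area for cx, _, area in parts) / total_area
    y = sum(cy * area for _, cy, area in parts) / total_area
    return x, y

# A small piece centered at (0, 0) and a piece twice as large centered at (3, 0).
print(combine_centroids([(0, 0, 1), (3, 0, 2)]))
```

The combined centroid sits two thirds of the way toward the larger piece, which is what weighting by area should produce.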

Unlock, implement and test your implementation before moving on:

python3 autograder.py -u q6
python3 autograder.py -q q6

Once you are finished, draw_centered_map will draw the 10 states closest to a given state (including that state). A red dot should appear over the two-letter postal code of the specified state.

python3 trends.py -d CA

Your program should work identically, even if you use the functional representation for tweets defined in question 1, using the -f flag.

python3 trends.py -f -d CA

Phase 3: The Mood of the Nation

In this phase, you will group tweets by their nearest state center and calculate the average positive or negative feeling in all the tweets associated with a state.

The name us_states is bound to a database containing the shape of each U.S. state, keyed by its two-letter postal code. You can use the keys of this database to iterate over all the U.S. states.

A database is an abstract data type that we've created to store data. Data is stored in a database as key-value pairs. Given a key, we can access the value associated with that key from the database. This is similar to indexing a list except databases are indexed with keys, not numbers. Keys can be numbers or strings. Values can be anything. Consider the following:

>>> database = make_database()
>>> database = add_value(database, "color", 25) 
>>> database = add_value(database, 3, [1, "cool"]) 
>>> get_keys(database)
["color", 3]
>>> get_value_from_key(database, 3)
[1, "cool"]
>>> get_len(database)
2
>>> get_items(database)
[["color", 25], [3, [1, "cool"]]]

The following are the constructors and selectors for databases:

make_database(): creates a database
add_value(database, key, value): returns a new database, which is a copy of the given database with an added mapping from that key to that value. If the key already exists in the database, the new key-value pair replaces the previous one in the new database.
get_keys(database): returns a list that contains all the keys found in that database
get_values(database): returns a list that contains all the values found in that database
get_value_from_key(database, key): returns the value associated with that key. It will raise an error if the key is not found in the database.
get_len(database): returns the number of key-value pairs in the database
get_items(database): returns a list of key-value pairs. Each key-value pair is represented as a list.
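
To make the behavior of these functions concrete, here is a minimal sketch of how such a database could be represented. It is not the contents of database.py: the list-of-pairs representation, the choice of KeyError, and the decision to keep a replaced key in its original position are all assumptions made for this example.

```python
# Illustrative stand-in for the database ADT; not the code in database.py.

def make_database():
    """Return an empty database (assumed here to be a list of [key, value] pairs)."""
    return []

def add_value(database, key, value):
    """Return a new database with key mapped to value; the input is not changed."""
    new_database = []
    replaced = False
    for k, v in database:
        if k == key:
            new_database.append([key, value])  # assumption: replace in place
            replaced = True
        else:
            new_database.append([k, v])
    if not replaced:
        new_database.append([key, value])
    return new_database

def get_keys(database):
    return [k for k, _ in database]

def get_values(database):
    return [v for _, v in database]

def get_value_from_key(database, key):
    for k, v in database:
        if k == key:
            return v
    raise KeyError(key)  # assumption: the real error type may differ

def get_len(database):
    return len(database)

def get_items(database):
    return [[k, v] for k, v in database]

database = make_database()
database = add_value(database, "color", 25)
database = add_value(database, 3, [1, "cool"])
print(get_keys(database))
print(get_items(database))
```

Because add_value returns a new database instead of changing its argument, callers must rebind the name, as in database = add_value(database, ...). Forgetting the assignment silently discards the new pair.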

Problem 7 (2 pt)

Implement group_tweets_by_state, which takes a sequence of tweets and returns a database. The keys of the returned database are state names (two-letter postal codes), and the values are lists of tweets that appear closer to that state's center than any other.

You should not include any states as keys that are not nearest to any tweet. You may want to define additional functions to organize your implementation into modular components. You will need to use the database of us_states described above.
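
The grouping idea can be sketched on simplified data. The example below uses plain (x, y) tuples, an ordinary Python dictionary, and straight-line distance, all of which are stand-ins: the project works with tweets, the database abstraction, and the geographic distance functions in geo.py. The names nearest_label and group_by_nearest are hypothetical.

```python
from math import hypot

def nearest_label(point, centers):
    """Return the label in centers whose (x, y) center is closest to point."""
    def distance_to(label):
        cx, cy = centers[label]
        return hypot(point[0] - cx, point[1] - cy)
    return min(centers, key=distance_to)

def group_by_nearest(points, centers):
    """Return a dict from label to the list of points nearest that label.

    Labels that are nearest to no point never appear as keys.
    """
    groups = {}
    for point in points:
        groups.setdefault(nearest_label(point, centers), []).append(point)
    return groups

centers = {'A': (0, 0), 'B': (10, 0), 'C': (100, 100)}
points = [(1, 1), (9, 0), (2, 0)]
print(group_by_nearest(points, centers))
```

Building the groups by looping over the points, rather than over the centers, is what keeps unused labels such as 'C' out of the result.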

Unlock, implement and test your implementation before moving on:

python3 autograder.py -u q7
python3 autograder.py -q q7

Problem 8 (2 pt)

Implement average_sentiments. This function takes the database returned by group_tweets_by_state and also returns a database. The keys of the returned database are the state names (two-letter postal codes), and the values are average sentiment values for all the tweets that have sentiment value in that state.

If a state has no tweets with sentiment values, leave it out of the returned database entirely. Do not include a state with no sentiment using a zero sentiment value. Zero represents neutral sentiment, not unknown sentiment. States with unknown sentiment will appear gray, while states with neutral sentiment will appear white.
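
The difference between "no sentiment" and "zero sentiment" can be sketched on simplified data. In this example each group is a plain list of numbers, with None standing in for a tweet that has no sentiment, and the result is an ordinary dictionary; the real function works with tweets, sentiment values, and databases instead. The name average_known is hypothetical.

```python
def average_known(groups):
    """Return a dict from label to the mean of its non-None scores.

    Labels whose scores are all None (or empty) are left out entirely.
    """
    averages = {}
    for label, scores in groups.items():
        known = [s for s in scores if s is not None]
        if known:  # skip groups with nothing to average
            averages[label] = sum(known) / len(known)
    return averages

groups = {
    'CA': [0.5, None, -0.25],  # two known scores, one unknown
    'TX': [None],              # nothing known: omitted from the result
    'NY': [0.0],               # known and neutral: kept, with value 0
}
print(average_known(groups))
```

The test on known checks whether the list is empty, not whether the scores are zero, so a neutral group such as 'NY' is kept while 'TX' is dropped.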

Unlock, implement and test your implementation before moving on:

python3 autograder.py -u q8
python3 autograder.py -q q8

You should now be able to draw maps that are colored by sentiment corresponding to tweets that contain a given term. The correct map for Texas appears at the top of this page.

python3 trends.py -m texas
python3 trends.py -m sandwich
python3 trends.py -m obama
python3 trends.py -m "my life"

Your program should work identically, even if you use the functional representation for tweets defined in question 1, using the -f flag.

python3 trends.py -f -m texas

Finally, you can download a larger dataset once you are done with your project. After extracting from the archive, you can move tweets2011.txt and tweets2014.txt to the `data` directory.

Warning: this dataset is 153 MB in zipped form. If you would rather not download the files, you can copy your trends project onto your class account, and do the following on your class account:

cd trends/data
setup-tweets

If you run your project from your class account, make sure to use the -X flag with ssh (on Macs or Linux) or enable XMing (on Windows)!

Note: as stated in the accompanying README.txt, the dataset is intended solely for use with this project. Contents of tweets2014.txt may not be redistributed or made public (e.g. on a version-control repository). After setting up the new tweets in your data directory, you can use the -m flag above to search for more phrases and the -t flag to specify the data file, as in the following:

python3 trends.py -m christmas -t tweets2011.txt
python3 trends.py -m christmas -t tweets2014.txt

Congratulations! One more 61A project completed.

Extensions

These extensions are optional and ungraded. In this class, you are welcome to program just for fun. If you build something interesting, come to office hours and give us a demo. However, please do not change the behavior or signature of the functions you have already implemented.

Implement a function draw_map_by_hour that visualizes the tweets that were posted during each hour of the day. For example, you'll discover that "sandwich" tweets appear most positive at 10:00pm: late night snack!
Punctuation can be an indicator of sentiment as well. Add an emoticon (smiley) detector that attributes positive sentiment to happy faces :-) and negative sentiment to sad ones.
In the standard implementation, some tweets are associated with different states than the ones in which they occurred. For example, all tweets from Manhattan are assigned to New Jersey. New Yorkers would be appalled! Write a function find_containing_state that finds the state that actually contains a tweet position.
The graphics.py package supports animation. Use the slide_shape method to have states and dots slide into place.
Correct the spelling of tweets before you compute their sentiment.
Calculate the total average sentiment of the whole country for a term and display it using the maps.py and graphics.py modules (try to understand the implementation of draw_most_talkative_states, then use it as a foundation and modify it as needed).

Acknowledgements: Aditi Muralidharan developed this project with John DeNero. Hamilton Nguyen extended it. Keegan Mann developed the autograder. Many others have contributed as well.
