GWU

CSCI 1111

Introduction to Software Development, Fall 2021

GWU Computer Science


Project: Natural Language Processing

In this project, you will write a Java program that will analyze a dataset of tweets. There are two basic tasks that we'll accomplish:

  1. Finding the most common topics discussed
  2. Analyzing the sentences to detect sentiment (how positive or negative the sentence was)
You'll be working in teams of up to two students for this project, which is divided into several components that will be graded individually. Once a component has been graded, we will make available a solution for you to use in the next section of the project.

Unlike previous assignments in this class, this project is a lot more open-ended with fewer instructions: implementation details are left to the students, and you may use resources you find online, as long as you cite them (but you may not have other people write the project for you, breaking the spirit of the assignment). This project is a small, but realistic research project on a real-world dataset, and is meant to be fun and give you an idea of some of the interesting applications of the ability to program and process data.

Because so much of the implementation is left up to you to figure out, we are devoting class time this semester to working on the project together, and we're also looking forward to your questions on Piazza!

While you're working on this project, remember to have FUN! If you're stuck on the same thing for 15 minutes without making any progress, please raise your hand in class, ask a neighbor, or message us on Piazza!






Part 1: Dataset download and setup (due Wed 4/13 at 11:59pm)

The dataset we'll be using is a Twitter dataset automatically collected by researchers at Stanford University (if you're interested in more details, they've published an academic paper on their work with this dataset).

WARNING: the Twitter dataset was automatically collected by researchers using the Twitter API; as such, it has not been screened nor sanitized to remove potentially offensive and/or objectionable material. We are not condoning nor endorsing any of the material in the dataset by using it in this project.

If you feel uncomfortable using this Twitter dataset for this project, please feel free to download any of the Plain Text UTF-8 text files from any of the 60,000 free Ebooks in the public domain from Project Gutenberg to use instead for this project.

Alternatively, if you have your own dataset of English sentences (at least a few hundred, with some associated metadata for each sentence) you wish to analyze, you are welcome to use that as the basis for this project -- please post a message on Ed if you wish to do so, so we can review your materials and make sure they will be compatible with the rest of this project.

Option 1: download the Twitter dataset

Their dataset can be downloaded with the link here, and should be extracted to a local directory on your computer where you will work on this project. The download contains two files -- a training dataset of 1,600,000 tweets, and a test set of a few hundred. We'll focus on the test dataset, as the training file is very large and difficult to open with something like MS Excel or a text editor. You can delete the large training file, but it might be interesting to run your analysis on a larger subset of tweets later on.

Use a text editor or MS Excel to open the test dataset and take a look at the file format -- we will be extracting the tweet itself, the username, and the timestamp of the tweet for our analysis. Notice that the file is in csv (comma-separated values) format.

Option 2: download an EBook

You can download any of the Plain Text UTF-8 text files from any of the 60,000 free Ebooks from Project Gutenberg to use instead for this project. Each such text file typically contains a table of contents at the beginning, where each line in the table of contents appears as a chapter heading later in the file (if the book you happened to pick doesn't follow this format, pick a different one).

Reading in the sentences from a file

Next, in a separate file called Driver.java, declare a main method that you will complete in order to read in the .csv or .txt file you downloaded. Practice using Google (or other resources) to find a way to read in a file, line-by-line, using a while loop, and print the results to the screen (for now). As a warning, the code to do this most likely requires a try-catch block to handle exceptions; we haven't covered these yet in class, so please feel free to post questions to Ed if you need help. If you get a compilation error around the code you use to open a file, make sure that you have a try-catch block around those lines.
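One possible shape for this code is sketched below (the filename testdata.csv is a placeholder -- substitute whichever file you downloaded). The try-catch around the file-opening code is what prevents the compilation error mentioned above:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class Driver {
    // Reads a file line-by-line into a list; the try-catch is required
    // because file I/O can throw an IOException at runtime.
    public static ArrayList<String> readLines(String filename) {
        ArrayList<String> lines = new ArrayList<String>();
        try {
            BufferedReader reader = new BufferedReader(new FileReader(filename));
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            reader.close();
        } catch (IOException e) {
            System.out.println("Could not read file: " + e.getMessage());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Print each line to the screen (for now)
        for (String line : readLines("testdata.csv")) { // assumed filename
            System.out.println(line);
        }
    }
}
```

This is only a sketch -- there are other valid approaches (for example, the Scanner class), and you should adapt the filename and error handling to your setup.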

Creating the Sentence class

Next, let's write a class called Sentence that has at least the following methods:

  1. public Sentence(String text, String author, String timestamp): a constructor that sets three attributes
  2. public getters and setters getText, setText, getAuthor, setAuthor, getTimestamp, and setTimestamp, with argument and/or return types matching those passed in to the constructor.
  3. public String toString(): returns a human-legible string representation of the current object in the format {author:kinga, sentence:"Hello CSCI 1111!", timestamp:"May 11 2009"}
    or
    {author:janeausten, sentence:"Mr. Bennet was among the earliest of those who waited on Mr. Bingley.", chapter:"Chapter 2"}
    if you are processing books instead of tweets.
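A minimal sketch of what this class could look like follows; the attribute names and the exact toString spacing are our choices, so check them against the tester's expected output:

```java
public class Sentence {
    private String text;
    private String author;
    private String timestamp;

    public Sentence(String text, String author, String timestamp) {
        this.text = text;
        this.author = author;
        this.timestamp = timestamp;
    }

    public String getText() { return text; }
    public void setText(String text) { this.text = text; }
    public String getAuthor() { return author; }
    public void setAuthor(String author) { this.author = author; }
    public String getTimestamp() { return timestamp; }
    public void setTimestamp(String timestamp) { this.timestamp = timestamp; }

    // e.g. {author:kinga, sentence:"Hello CSCI 1111!", timestamp:"May 11 2009"}
    public String toString() {
        return "{author:" + author + ", sentence:\"" + text
                + "\", timestamp:\"" + timestamp + "\"}";
    }
}
```

If you are processing a book, swap the timestamp label for chapter in toString to match the second example above.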

Save your class in a file called Sentence.java. You can test your Sentence file by running the Part1Tester.java file. You'll have to run your Driver manually to test it.

When you have finished, upload your solutions to BB -- ONLY ONE GROUP MEMBER should upload the solution. Please make sure to do this by the deadline (4/13 at 11:59pm). If you finish this project early in this lecture, move on to Part 2 below.

GRADING RUBRIC for Part 1:
Test cases passed in tester: 13 points
Uses good coding style and formatting (as discussed in lecture) for both files: 4 points
java -jar checkstyle-9.2.1-all.jar -c ./CS1111_checks.xml Sentence.java
java -jar checkstyle-9.2.1-all.jar -c ./CS1111_checks.xml Driver.java
main method opens a file correctly and prints out its contents line-by-line using a loop: 4 points

Part 2: Processing data into objects (due 04/15 at 11:59pm)

Processing sentences from the file

We're going to write some code to store each sentence and its metadata as an object in an ArrayList. Declare such an object, and modify your loop in main to store each Sentence object in the ArrayList, using generics to indicate the correct type of object stored. For now, just add an empty Sentence to the list each time you read in a line; in a minute, we'll discuss how to use some other code you'll write to create Sentence objects from the input file lines.

Whether you're working with tweets or books, we'll need to write code that is able to extract the three pieces of data we want from each sentence/tweet: the sentence/tweet, the author (either a twitter user or the author of the book), and temporal information about the sentence (either the timestamp of the tweet, or what chapter the sentence comes from in the book). See instructions below for one of those two options:

Whether you're reading in tweets from the csv or literary sentences, you'll notice both formats contain double quotation marks: you should REMOVE all double quotation marks from all the fields you store, using methods available in the String library. For example, the tweet @mikefish Fair enough. But i have the Kindle2 and I think it's perfect :) should become something like @mikefish Fair enough But i have the Kindle2 and I think its perfect. Use the same approach to remove all punctuation (periods, commas, etc.), but be aware that removing periods can be unexpectedly tricky -- ask for help! If you want to remove periods using replaceAll, you'll want to call replaceAll("\\.", "") instead of replaceAll(".", "") -- the latter will delete the entire tweet, because replaceAll treats its first argument as a regular expression, and an unescaped period is a special character that matches any single character.
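This cleaning could be sketched as a small helper like the one below (the method name cleanText is our own -- you might instead do this inline where you build your Sentence objects):

```java
public class CleaningSketch {
    // Removes double quotes, periods, commas, and apostrophes from a field.
    // Note the escaped period: "." alone is a regex that matches ANY character,
    // so replaceAll(".", "") would wipe out the whole string.
    public static String cleanText(String field) {
        return field.replaceAll("\"", "")
                    .replaceAll("\\.", "")
                    .replaceAll(",", "")
                    .replaceAll("'", "");
    }
}
```

You can extend the chain with more replaceAll calls for any other punctuation you decide to strip.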

In order to process the sentences from the file, add one of the following two methods to your Sentence class:

You'll need to connect one of the methods above to your main method, so that the input can be processed, line-by-line, into an ArrayList of Sentence objects (you've already set that loop up).
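As a sketch of the extraction step for the Twitter csv, the hypothetical helper below shows one way to pull out the three fields from a line; we assume the line consists of quoted fields separated by commas, with the timestamp third, the username fifth, and the tweet text last -- verify this field order against your copy of the file before relying on it:

```java
public class ExtractSketch {
    // Splitting on "," (quote, comma, quote) keeps commas INSIDE the tweet
    // intact, which a plain split(",") would not. Field positions below are
    // an assumption about the dataset's column order.
    public static String[] extractFields(String line) {
        String[] fields = line.split("\",\"");
        String timestamp = fields[2];
        String author = fields[4];
        String text = fields[5].replaceAll("\"", ""); // strip the closing quote
        return new String[] { text, author, timestamp };
    }
}
```

Your convertLine method would do something like this and then package the three strings into a Sentence object.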

Test cases for sentence processing (Twitter dataset only)
In another file called ExtractTest.java, write at least two test cases in a main method:
  1. One test case should check to make sure that you can process a line with a tweet correctly where the tweet has a comma
  2. Another test case should check to make sure that you can process the timestamp of the tweet to extract just the month, day, and year as the toString expects it to be

Your test cases should demonstrate this functionality by passing in a String to the convertLine method and making sure that the Sentence is correctly captured. Use print statements to print out the results of comparing one Sentence object to another that holds the desired answer.

Processing the Ebook dataset is a bit more complex than the Twitter dataset (and harder to write test cases for with what we know so far), so you may omit writing tests if you are working on that topic -- just make sure to at least visually verify that you can correctly read in the first few sentences of the first three chapters of the book.

You can test your Sentence file by running the appropriate part of the Part2Tester.java file. You'll have to run your Driver manually to test it.

When you have finished, upload your solutions to BB -- ONLY ONE GROUP MEMBER should upload the solution. Please make sure to do this by the deadline (4/15 at 11:59pm). If you finish this project early in this lecture, move on to Part 3 below.


GRADING RUBRIC for Part 2:
Sentence class convertLine/addToEnd method works correctly together with reading in a file line-by-line (correctly stores timestamp, removes all double quotes, handles tweets with commas / handles sentences across multiple lines): 15 points for Ebook; 9 points for Twitter (3 per test case in tester)
ArrayList of sentences in main is correctly declared using generics, and Sentence objects are added to it correctly: 3 points
At least two tests correctly written, as specified above: 6 points (Twitter only)

Part 3: Processing sentences into words, topic modeling (due Wed 4/20 at 11:59pm)

As we just saw, dataset identification, collection, and pre-processing can be one of the most difficult parts of a research project! Now that we have our sentences and their metadata neatly placed into an ArrayList of Sentence objects, the fun (and easier) part can start!

For this section, we want to know what are the most common topics across all of our sentences. Taking this pulse could have a number of real-world applications; for example, people have studied tweets about COVID to try to track new outbreaks that happened in 2020, before the widespread availability of PCR tests, by looking at what symptoms were being reported in which locations.

Here, we're going to work on splitting up sentences into words, doing a little cleaning, and then making a list of the most common individual words or phrases across our dataset. You're free to implement all of these pieces manually using what we've learned this semester and the String library, or, you can look online for NLP libraries that accomplish this sort of thing already (make sure you cite your sources with comments in the code if you do this!).

Splitting up a sentence into words

First, let's modify your Sentence class to add a method that takes the text of the sentence, and returns an ArrayList that contains the words in the sentence. You can do this by splitting up the sentence over the space character ' '. Call the method public ArrayList<String> splitSentence().
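A sketch of this method is shown below; the class here is a stand-in with only a text field, since inside your real Sentence class the method would use the text attribute you already have:

```java
import java.util.ArrayList;
import java.util.Arrays;

public class SplitSketch {
    private String text;

    public SplitSketch(String text) { this.text = text; }

    // Splits the sentence text over the space character into a list of words.
    public ArrayList<String> splitSentence() {
        return new ArrayList<String>(Arrays.asList(text.split(" ")));
    }
}
```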

Finding the most common words across all sentences

Next, write a method called public static HashMap<String,Integer> printTopWords in your Driver that takes as argument an ArrayList of Sentences, extracts the words for each sentence, and updates a HashMap object that keeps track of how many times each word appeared in the entire dataset. Note that you must store Integers in the HashMap, not ints, so you should declare it as HashMap<String, Integer>.
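The counting step at the heart of printTopWords could look like the sketch below (the method name countWords and its per-list signature are our own; your printTopWords would call splitSentence on each Sentence in its ArrayList argument and feed the words through this same logic, reusing one shared HashMap):

```java
import java.util.ArrayList;
import java.util.HashMap;

public class CountSketch {
    // For every word seen, bump its count in the map.
    // getOrDefault supplies 0 the first time a word appears.
    public static HashMap<String, Integer> countWords(ArrayList<String> words) {
        HashMap<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : words) {
            counts.put(word, counts.getOrDefault(word, 0) + 1);
        }
        return counts;
    }
}
```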

Once you have written that code, call it in your main method to obtain the HashMap of word counts (think about what you need to pass into it as an argument to glue everything together, and what to do with what the method returns). Then, copy the code below into your main method to have it loop through the HashMap and print out all its entries:

import java.util.Map; //place with imports
import java.util.Collections; //place with imports

// find the largest count so we can pad the numbers to the same width
Map.Entry<String, Integer> maxEntry = null;
for (Map.Entry<String, Integer> entry : YOUR_HASH_MAP.entrySet()) {
    if (maxEntry == null || entry.getValue().compareTo(maxEntry.getValue()) > 0) {
        maxEntry = entry;
    }
}
int maxValueLen = maxEntry.getValue().toString().length();
ArrayList<String> results = new ArrayList<String>();
for (Map.Entry<String, Integer> set : YOUR_HASH_MAP.entrySet()) {
    String value = set.getValue().toString();
    while (value.length() < maxValueLen) {
        value = " " + value; // left-pad so the list sorts numerically
    }
    results.add(value + " of " + set.getKey());
}
Collections.sort(results);
Collections.reverse(results); // largest counts first
for (int i = 0; i < results.size() && i < 100; i++) {
    System.out.println(results.get(i));
}


Once again, we'll cover what this code does in CSCI 1112! In the meantime, you'll need to modify the code above to change YOUR_HASH_MAP to whatever you called your HashMap variable. Run your main method in the Driver to see if it works!

Word cleaning

You probably noticed in your printout that a lot of words were not very interesting (like "as" and "the"). In NLP, we call these stopwords, and we usually remove them before analyzing the words we collect from our sentences.

People have compiled various lists of stopwords, and we're going to use one example (below) -- you could also get fancy and import other lists from various NLP libraries if you wanted, or just copy the array below into your splitSentence method:
String[] stopwords = {"a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "aren't", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "can't", "cannot", "could", "couldn't", "did", "didn't", "do", "does", "doesn't", "doing", "don't", "down", "during", "each", "few", "for", "from", "further", "had", "hadn't", "has", "hasn't", "have", "haven't", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "isn't", "it", "it's", "its", "itself", "let's", "me", "more", "most", "mustn't", "my", "myself", "no", "nor", "not", "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "she'll", "she's", "should", "shouldn't", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "wasn't", "we", "we'd", "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "won't", "would", "wouldn't", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"}; //from https://www.ranks.nl/stopwords

Now, write code that checks every word that you found when splitting the sentence by spaces against this list; if the word is found in this list of stopwords, don't include it in the ArrayList of words you're going to return. You'll notice that the stopwords are all lowercase, but your words may not be; use the .toLowerCase() method from the String class to convert your words to all lowercase before checking whether they appear in this list of stopwords. Now run your main method again to call splitSentence and check that the stopwords were removed correctly -- what do the most common topics look like now?
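The filtering step could be sketched as below (we use a deliberately abbreviated stopword list here for illustration -- use the full array above in your own code; whether you keep the original casing or the lowercased word is a design choice, and lowercasing makes the later counting step merge "The" and "the"):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StopwordSketch {
    // Abbreviated stopword list for illustration only.
    private static final List<String> STOPWORDS =
            Arrays.asList("a", "the", "is", "and", "as", "i");

    // Lowercases each word and keeps it only if it is not a stopword.
    public static ArrayList<String> removeStopwords(ArrayList<String> words) {
        ArrayList<String> kept = new ArrayList<String>();
        for (String word : words) {
            String lower = word.toLowerCase();
            if (!STOPWORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }
}
```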

Optional: Further cleaning

You may notice that some words/topics may be repetitions of another word; perhaps they are synonyms, or they are the same word with different plurality (cat vs cats, for example). In NLP, it is common to perform stemming and/or lemmatization to handle issues like the latter. If you're feeling very ambitious, feel free to look up a Java library that can do this sort of thing for you (at the sentence level). Otherwise, you can add further pre-processing to manually convert a plural into a singular using the .replaceAll() method of the String class. Feel free to also modify the stopword list manually to get rid of any additional words you don't think are important.

Optional: N-grams for phrases

While it's awesome that we're able to model the topics of the tweets (or Ebook) using the most common words in all the sentences, sometimes what we're really after isn't a single word, but a phrase made up of multiple words (such as "machine learning" or "out of bounds"). Taking a look at your results from the previous section, are there any such common words that might have been phrases instead? Try to see if you can improve your model by, when processing consecutive words left over after stopwords have been removed, considering each consecutive pair of words as a phrase (a bi-gram), and counting these frequencies in your HashMap as well. Note: some phrases may be nonsensical, but they will also not appear commonly enough to make it into the top 100 we print. Repeat this process for triplets (tri-grams) if you wish.
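The bi-gram counting described above can be sketched like this (a standalone helper for illustration -- in your project the counts would go into the same HashMap as the single words, and the input is assumed to already be lowercased with stopwords removed):

```java
import java.util.ArrayList;
import java.util.HashMap;

public class BigramSketch {
    // Counts each consecutive pair of words as one phrase.
    public static HashMap<String, Integer> countBigrams(ArrayList<String> words) {
        HashMap<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i + 1 < words.size(); i++) {
            String bigram = words.get(i) + " " + words.get(i + 1);
            counts.put(bigram, counts.getOrDefault(bigram, 0) + 1);
        }
        return counts;
    }
}
```

Tri-grams work the same way, joining three consecutive words instead of two.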

You can test your Sentence file by running the appropriate part of the Part3Tester.java file. This tester cannot test the extra credit.

When you have finished, upload your solutions to BB -- ONLY ONE GROUP MEMBER should upload the solution. Please make sure to do this by the deadline (4/20 at 11:59pm). If you finish this project early in this lecture, move on to Part 4 below.


GRADING RUBRIC for Part 3:
Sentences are correctly split over spaces: 2 points
HashMap correctly stores the number of times each word appears in the dataset: 5 points
The HashMap is printed out correctly, showing the top 100 words and their frequency: 5 points
Stopwords do not appear in the HashMap: 5 points
Further word cleaning was performed: up to 5 points (extra credit)
N-grams and/or phrases were calculated: up to 30 points (extra credit)

Part 4: Measuring emotions of tweets (or literary sentences) (due Fri 04/22 at 11:59pm)

People are often interested in monitoring trends in natural language datasets, and one way to do that is to measure how positive or negative a sentence might be. For example, an automated mental health chatbot could be trained to monitor a conversation, and look for indicators of increasing depression. This sort of analysis is often extended to tweets, reddit posts, and other social media.

In this section of the project, we're going to learn how to install and use a Java library that will allow us to measure the sentiment (positive or negative emotion) in a sentence.

Downloading the Stanford CoreNLP libraries

In order to allow our Java code access to libraries that are not a standard part of Java, we need to download these libraries and make sure that our Java program knows how to find them. We can download these files by pasting the urls below into your browser:
https://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
https://nlp.stanford.edu/software/stanford-english-corenlp-2018-10-05-models.jar

It will take several minutes to download these files -- please read ahead and study the code below while you're waiting.

Next, copy both files into the same directory where your Java files are. Extract the contents of the .zip file using the terminal commands (if the first command doesn't work, you can unzip the way you would normally unzip files):
unzip stanford-corenlp-full-2018-10-05.zip
cd stanford-corenlp-full-2018-10-05
cp ejml-0.23.jar ..
cp stanford-corenlp-3.9.2.jar ..
cd ..

Make sure the .jar files end up copied into the same directory as your Java files.

Adding code to score the sentiment of a sentence

Now that we've downloaded the Stanford CoreNLP libraries (note that they are files that end in .jar, which stands for "java archive"), we can import these libraries at the top of our Sentence.java:
import java.util.Properties;
import org.ejml.simple.SimpleMatrix;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations.SentimentAnnotatedTree;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;

In a minute, we'll show you how to tell the Java compiler and runtime where these libraries live on your machine.

Next, add the code above at the top of your Sentence.java. Then, also add the code below that defines a method that is able to score a sentence for sentiment using the library we just downloaded:

public int getSentiment(String tweet){
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, parse, sentiment");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = pipeline.process(tweet);
    CoreMap sentence = annotation.get(CoreAnnotations.SentencesAnnotation.class).get(0);
    Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
    return RNNCoreAnnotations.getPredictedClass(tree);
}

Let's go through what the pieces above are doing. Properties are flags that users can set to determine what processing steps to apply to the sentence. Here, you can see those properties being set with setProperty, where the string passed in lists the different types of processing the sentence will be subject to. You may recognize the tokenization and sentence-splitting steps (tokenize and ssplit), similar to the splitting we did manually above. The last annotator listed is sentiment, which instructs the library to return a sentiment score between 0 and 4 for the sentence: 0=Very Negative, 1=Negative, 2=Neutral, 3=Positive, or 4=Very Positive. Sentiment analysis, like our topic modeling, also requires that raw sentences be tokenized and pre-processed in order to make its judgments.

The library is called by creating a pipeline object, which passes the tweet through the processing steps we just outlined. After that, the next three lines of code take the processed sentence (recall, it no longer looks like a human-readable sentence) and pass it through a machine learning model that has been trained elsewhere in the library to guess the sentiment of this particular sentence. Pretty cool, and all in just a few lines of code!

Now we're almost ready to use this code to score sentences. First, since the getSentiment method lives in the Sentence class, it doesn't need an argument: change the method to operate on the text field of the Sentence class instead.

Next, in your main method, loop through your ArrayList of sentences and print out each sentence along with its sentiment score. You'll notice that the println method actually calls your toString method if you pass it your Sentence object -- this is a neat thing that Java does that you'll learn more about if you take CSCI 1112.

Finally, in order to run your code, you need to make sure that the .jar files we downloaded earlier are in a place where Java can find them. There are many ways to set this up elegantly, but for now we'll just point the javac and java programs to the paths of three .jar files the code above needs to run. Try the command below that applies to your operating system; if it doesn't work and you've verified that all the files are in the current directory, see the part below for instructions on how to unpack the .jar files and get it working:
javac -classpath ".:./ejml-0.23.jar:./stanford-corenlp-3.9.2.jar:./stanford-english-corenlp-2018-10-05-models.jar" *.java for Mac/linux, or
javac -classpath ".\stanford-corenlp-3.9.2.jar;.;./ejml-0.23.jar;./stanford-english-corenlp-2018-10-05-models.jar" *.java for Windows.
and then
java -classpath ".:./ejml-0.23.jar:./stanford-corenlp-3.9.2.jar:./stanford-english-corenlp-2018-10-05-models.jar" Driver for Mac/linux, or
java -classpath ".\stanford-corenlp-3.9.2.jar;.;./ejml-0.23.jar;./stanford-english-corenlp-2018-10-05-models.jar" Driver for Windows.
You'll learn more about the classpath in CSCI 1112 as well.

If the commands above didn't work, do the following to get the jar files unpacked and then try re-running the command as detailed below:

  1. In the terminal, type the following two commands:
    jar xvf ejml-0.23.jar
    jar xvf stanford-corenlp-3.9.2.jar
    jar xvf stanford-english-corenlp-2018-10-05-models.jar
  2. Try to run your code again with the correct command above

What do you think about the sentiments it scored for all the tweets (or literary sentences)? Do you agree or disagree?

When you have finished, upload your solutions to BB -- ONLY ONE GROUP MEMBER should upload the solution. Please make sure to do this by the deadline (4/22 at 11:59pm). If you finish this project early in this lecture, move on to Part 5 below.


GRADING RUBRIC for Part 4:
Sentence.java compiles with all the changes above: 5 points
getSentiment has been correctly modified to operate on the text field: 2 points
A loop has been written that correctly iterates through all sentences, printing them out and displaying their sentiment scores: 5 points

Part 5: Temporal analysis of topics and sentences (due Wed 04/27 at 11:59pm)

We now have a way to find the major themes/topics of a group of sentences, as well as the ability to analyze the sentiment of these sentences. Next, we're going to have you write some code to be able to filter your ArrayList of sentences to interesting subsets.

First, define and implement a method in the Sentence class called public boolean keep(String temporalRange) that returns true or false depending on whether or not the Sentence object is within either:
A date range such as "May 31 2009-Jun 02 2009" if you are using tweets, or
A chapter number such as "Chapter 2" depending on the Ebook you chose.

To get some practice with real-world programming and debugging, we recommend you use built-in Java libraries to convert each date (which is a String) into a long integer, so that you can check whether one date is less than and/or greater than another. These libraries typically convert dates into the number of seconds since an arbitrary reference date in the past, so you can compare them with the less-than and greater-than operators, like normal integers. You can google (or use) any way you'd like to make this conversion, but here are some examples:

  1. We googled for "java convert string to timestamp" and this was a top result from Stackoverflow. You can follow some of the "solutions" there to get code to convert strings to Timestamps.
  2. You'll also likely want to import the SimpleDateFormat class to use in the solution above or ones like it; check out its API to see how you can provide an expression for the format your dates are in.
  3. Once you have a Timestamp object for each date, you can use its methods in its API to convert the timestamp into the number of seconds.
  4. Don't forget to import all the classes you need, and be aware that like when opening files, some of this code requires you have another try-catch block to handle exceptions.
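Putting those hints together, one possible sketch of the conversion follows; the "MMM dd yyyy" pattern is our assumption about the format you stored in the timestamp field (e.g. "May 31 2009"), and returning -1 on a parse failure is an arbitrary choice:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class DateSketch {
    // Converts a date like "May 31 2009" into seconds since Jan 1 1970,
    // so two dates can be compared with < and >. Note the try-catch:
    // parse can throw a checked ParseException.
    public static long toSeconds(String date) {
        try {
            SimpleDateFormat format = new SimpleDateFormat("MMM dd yyyy", Locale.ENGLISH);
            return format.parse(date).getTime() / 1000; // milliseconds -> seconds
        } catch (ParseException e) {
            return -1; // arbitrary "could not parse" signal
        }
    }
}
```

Your keep method could then check that the sentence's converted timestamp falls between the converted start and end of the range.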

Next, modify your code in main to use the keep method to generate a new ArrayList that filters out tweets or sentences to just those in the temporal range specified.

Finally, using all the code you've written for this project, plan and then execute a further analysis of the dataset that answers one or more of the following potential questions, and/or investigates something similar (check with the instructor if you want to deviate from this list):

  1. Are some months more positive or negative, on average, compared to others in the Twitter dataset, or,
    Are some chapters more positive or negative than others, on average?
    Think about how you could convert the sentiment scores produced by the code into a way you can compare their average values, or how you would graph the distribution of ratings (nothing fancy needed, can be in text!)
  2. Are sentences associated with some of the more popular topics more positive or more negative, on average? You can write another keep method that is able to filter sentences based on keywords, rather than temporal ranges.
When you've finished running some experiments by modifying what you pass to keep in main, write a paragraph summarizing your results, and include metrics to support your conclusions.
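For comparing groups of sentiment scores (question 1 above), a simple averaging helper like the sketch below may be useful -- the name and the -1 empty-list convention are our own choices:

```java
import java.util.ArrayList;

public class AverageSketch {
    // Averages a list of sentiment scores (each 0-4); useful for comparing
    // one month or chapter against another. Returns -1 for an empty list.
    public static double average(ArrayList<Integer> scores) {
        if (scores.isEmpty()) {
            return -1;
        }
        int sum = 0;
        for (int score : scores) {
            sum += score;
        }
        return (double) sum / scores.size();
    }
}
```

Running this on the filtered ArrayLists produced by keep gives you a per-range metric you can cite in your summary paragraph.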

When you have finished, upload your solutions to BB -- ONLY ONE GROUP MEMBER should upload the solution. Please make sure to do this by the deadline (4/27 at 11:59pm). That's the last part of the project -- congratulations!


GRADING RUBRIC for Part 5:
keep has been correctly implemented to handle a temporal range and/or a keyword: 5 points
All code submitted is properly indented, uses good variable naming, and is appropriately commented: 5 points
main has been correctly modified to generate filtered ArrayLists using keep: 5 points
One of the two analyses above has been completed (or a third instructor-approved analysis), and a paragraph summarizing the results with metrics was provided: 10 points
Any additional (instructor-approved) analysis has been completed, and a paragraph summarizing the results with metrics was provided: 10 points per analysis (extra credit)

All done! Where to go next?

Great work on this project! If you're interested in learning more about Natural Language Processing, including doing summer research with Dr. Kinga, feel free to drop her an email!

Otherwise, take a break before studying for finals!