

Programming and Literature



Objectives

 

We will focus on text analysis in this module: writing code to digest text and analyze it in some way.

We're using the term literature to broadly include all kinds of text.
 

3.20 Video:

 


Part A: The Flesch-Kincaid readability score

 

First, read about Flesch-Kincaid readability.

Determining the readability, or reading level, of a passage is critical in helping grade-school teachers select appropriate text passages.

We'll use the F-K reading ease score $$ 206.835 - 1.015 \left(\frac{\mbox{total words}}{\mbox{total sentences}}\right) -84.6 \left(\frac{\mbox{total syllables}}{\mbox{total words}}\right) $$
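
For concreteness, here is a minimal sketch of the score computation, assuming the three counts have already been obtained (the function name is illustrative, not part of the module's files):

    def fk_reading_ease(total_words, total_sentences, total_syllables):
        # Flesch-Kincaid reading ease, per the formula above.
        words_per_sentence = total_words / total_sentences
        syllables_per_word = total_syllables / total_words
        return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

    # Example: 130 words in 10 sentences with 195 syllables
    # gives about 206.835 - 1.015*13 - 84.6*1.5 = 66.74
    print(fk_reading_ease(130, 10, 195))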

 

Now let's consider automated scoring:

 

3.21 Exercise: Download the following: syllable.py and test_syllable.py. Then fill in code to implement the syllable counting rules and run the tests in test_syllable.py.
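
The rules to implement are described in syllable.py. As a rough, generic illustration only (not the module's rules), a common heuristic counts runs of consecutive vowels and adjusts for a trailing silent "e":

    def rough_syllable_count(word):
        # Generic heuristic: each run of consecutive vowels counts as
        # one syllable; a trailing silent 'e' is discounted.
        word = word.lower()
        vowels = "aeiouy"
        count = 0
        previous_was_vowel = False
        for ch in word:
            is_vowel = ch in vowels
            if is_vowel and not previous_was_vowel:
                count += 1
            previous_was_vowel = is_vowel
        if word.endswith("e") and count > 1:
            count -= 1
        # Every word has at least one syllable.
        return max(count, 1)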
 

3.22 Audio:
 

3.23 Exercise: Download flesch_kincaid.py, wordtool.py, wordsWithPOS.txt, stopwords.txt, and the following sample text files:

In your module PDF, report the FK score for each of these. Then analyze a text from Project Gutenberg (or elsewhere) whose FK score is surprising (either unexpectedly low or unexpectedly high), and explain why it's surprising.
 


Part B: Word-cloud analysis of text

 

What we'd like to do in this part is analyze a text and produce a quick, telling snapshot of the whole.

For example, here's a word-cloud of the 10 most frequently occurring words in the Federalist Papers:

And for comparison, the same for Darwin's The Origin of Species:

If you looked only at the snapshots and hadn't been told the book titles, you could surely tell that the first is about government or politics, and the second about biology.

Of course there are now far more sophisticated ways of automatically analyzing text. But a good starting point is a simple word cloud:

  • We'll count the occurrences of nouns in a text (restricting to nouns keeps things simple).
  • Then we'll identify the 10 most frequently occurring nouns.
  • After that, we'll draw each noun using a font size proportional to its count.
  • Later, you'll improve on the drawing (you can see that "species" practically hides all the other nouns in Darwin's cloud).
 

Let's start with counting nouns:

  • First, it's a good idea to review both tuples and dictionaries from Module 1.

  • We'll use wordtool to read a file word-by-word, and wordtool also has a list of nouns, which will let us check whether a word is a noun.

  • We'll be careful to avoid so-called stopwords:
    • These are common short words like "a", "an", "the".
    • While most aren't nouns, we'll nonetheless remove them because any kind of word analysis typically begins by removing stopwords.

  • So, the general idea in counting nouns is:
    import wordtool as wt
    
    nouns = wt.get_nouns()              # Wordtool will build this list
    stop_words = wt.get_stop_words()    # And this one.
    
    # We'll use this dictionary for counting occurrence:
    noun_count = dict() 
    
    def compute_noun_count(filename):
        wt.open_file_byword(filename)
        w = wt.next_word()
        while w != None: 
    
            # If w is not a stop word and is a noun, update its count:
            if (w not in stop_words) and (w in nouns):
                if w in noun_count:
                    # Already seen: increment its count.
                    noun_count[w] += 1
                else:
                    # First occurrence: start its count at 1.
                    noun_count[w] = 1
    
            w = wt.next_word()
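
  • As a quick check, one could call the function and print the dictionary (a hedged sketch; the filename here is only a placeholder, not one of the module's files):
    compute_noun_count('sample.txt')
    print(noun_count)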
      
 

The second step is to identify the noun with the highest count:

  • The idea is this:
    • Suppose we write code to identify, from within the dictionary, the word with the highest count.
    • This becomes the largest word in the word cloud.
    • We then remove it from the dictionary.
    • Now we apply the same function to find the word with the highest count. This will now (because we removed the top word) find the word with the second-highest count.
    • Then we remove that word, and so on.

  • So the goal is to write a function that looks like this:
    def get_top():
        best_noun = None
        best_count = 0
    
        # Scan the dictionary to find the noun with the highest count:
        for noun, count in noun_count.items():
            if count > best_count:
                best_noun = noun
                best_count = count
    
        # Note: we're returning a tuple
        return (best_noun, best_count)
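
  • As an aside (not part of the module's files): Python's built-in max() can find the same pair in one line, although unlike the loop above it raises an error if the dictionary is empty:
    (best_noun, best_count) = max(noun_count.items(), key=lambda pair: pair[1])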
      
 

3.24 Exercise: Download text_analyzer.py and test_text_analyzer.py. Then fill in the needed code in text_analyzer.py and run the second file (which describes the desired output).
 

3.25 Audio:
 

Now that we have the analysis working, we're ready to draw word clouds:

  • We'll use the simple approach of declaring a maximum font size.

  • Once we have a count for a particular noun, simply compute a proportionate font size:
        # Font size based on proportion to most frequently occurring noun.
        font_size = int( (count / max_count) * max_font )
      

  • The loop structure is:
    (noun, count) = text.get_top()
    
    for i in range(n):
        # Compute font size for noun using count
        # Draw at a random location
    
        # Remove this noun from the dictionary
        text.noun_count.pop(noun)
        # Get the next pair
        (noun, count) = text.get_top()
      
    The dictionary method pop() removes the given key (and its value) from the dictionary.
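
For a concrete picture of the whole drawing step, here is a self-contained sketch using matplotlib. It is only an illustration: word_cloud.py may use a different drawing tool, the counts below are hard-coded stand-ins for text_analyzer's output, and for simplicity it iterates over the dictionary directly rather than repeatedly calling get_top():

    import random
    import matplotlib.pyplot as plt

    # Stand-in counts (in word_cloud.py these would come from text_analyzer):
    noun_count = {"species": 1500, "selection": 800, "variation": 600,
                  "form": 500, "plant": 400}

    max_font = 60
    max_count = max(noun_count.values())

    fig, ax = plt.subplots()
    ax.axis("off")

    for noun, count in noun_count.items():
        # Font size proportional to the most frequently occurring noun.
        font_size = int((count / max_count) * max_font)
        # Draw at a random location in the unit square.
        ax.text(random.random(), random.random(), noun, fontsize=font_size)

    plt.show()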
 

3.26 Exercise: Download word_cloud.py and examine the code to see the above ideas implemented. Then run it to obtain the word clouds we've seen.
 

3.27 Exercise: In my_word_cloud.py, write code to improve on the drawing. You can use colors, add checks to minimize overlap (which is challenging), or add geometric figures. Then, find two texts whose clouds align nicely with their contents. In your module PDF, show screenshots of these word clouds and describe why they align.
 

3.28 Audio:
 


Optional further exploration

 

If you'd like to explore further:

  • There is now an emerging field, sometimes called computational literary studies.

  • The idea is to analyze (in different ways) thousands and thousands of books, which no single human can read and digest, and to ask the question: what can we learn that any human could not?

  • Here is one news story on that topic, and here's another. One interesting and quite unexpected result: did you know that, measured by the percentage of writers who are female, 1970 was the worst year (25%) since 1870 (50%)?

  • Sophisticated analyses of text include:
    • Detecting sentiment, dialogue, gender, topic.
    • Extracting structure, flow of events.

  • The computational analysis of text and other kinds of data in the humanities is sometimes called the digital humanities.
 






© 2020, Rahul Simha