@mhbeals
Created May 11, 2021 10:49
Visualising Texts

Imports and Exports

First, let us recap yesterday's concordance programme.

I asked you to create a programme to display every instance of the word "to" in our sample sentence, in the context of the word immediately before and after it. I suggested that you use:

  • a List
  • an Iterator / While Loop
  • the .join() and .append() Functions
  • a For Loop
  • the Print Function

Reconstruct your programme below to ensure you've remembered all the concepts we covered:

my_string = "The time has come for all good men to come to the aid of their country"
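If you get stuck, here is one possible reconstruction. It is only a minimal sketch; the names my_words and concordance are my own choices, and your version may well look different.

```python
my_string = "The time has come for all good men to come to the aid of their country"

# split the sentence into a list of words
my_words = my_string.split()

# collect every "to" together with the word before and the word after it
concordance = []
i = 1
while i < len(my_words) - 1:
    if my_words[i] == "to":
        concordance.append(" ".join(my_words[i - 1:i + 2]))
    i = i + 1

# print each entry on its own line
for entry in concordance:
    print(entry)
```

Starting i at 1 and stopping one short of the end means there is always a word on either side of the one being checked.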

Cleaning our data

Wonderful! Because we were using a relatively simple sentence, we did not need to clean our data. If we use a longer passage, such as the first chapter of "A Tale of Two Cities", we will need to do a bit more preparatory work.

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way-- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only. There were a king with a large jaw and a queen with a plain face, on the throne of England; there were a king with a large jaw and a queen with a fair face, on the throne of France. In both countries it was clearer than crystal to the lords of the State preserves of loaves and fishes, that things in general were settled for ever. It was the year of Our Lord one thousand seven hundred and seventy-five. Spiritual revelations were conceded to England at that favoured period, as at this. Mrs. Southcott had recently attained her five-and-twentieth blessed birthday, of whom a prophetic private in the Life Guards had heralded the sublime appearance by announcing that arrangements were made for the swallowing up of London and Westminster. Even the Cock-lane ghost had been laid only a round dozen of years, after rapping out its messages, as the spirits of this very year last past (supernaturally deficient in originality) rapped out theirs. Mere messages in the earthly order of events had lately come to the English Crown and People, from a congress of British subjects in America: which, strange to relate, have proved more important to the human race than any communications yet received through any of the chickens of the Cock-lane brood. 
France, less favoured on the whole as to matters spiritual than her sister of the shield and trident, rolled with exceeding smoothness down hill, making paper money and spending it. Under the guidance of her Christian pastors, she entertained herself, besides, with such humane achievements as sentencing a youth to have his hands cut off, his tongue torn out with pincers, and his body burned alive, because he had not kneeled down in the rain to do honour to a dirty procession of monks which passed within his view, at a distance of some fifty or sixty yards. It is likely enough that, rooted in the woods of France and Norway, there were growing trees, when that sufferer was put to death, already marked by the Woodman, Fate, to come down and be sawn into boards, to make a certain movable framework with a sack and a knife in it, terrible in history. It is likely enough that in the rough outhouses of some tillers of the heavy lands adjacent to Paris, there were sheltered from the weather that very day, rude carts, bespattered with rustic mire, snuffed about by pigs, and roosted in by poultry, which the Farmer, Death, had already set apart to be his tumbrils of the Revolution. But that Woodman and that Farmer, though they work unceasingly, work silently, and no one heard them as they went about with muffled tread: the rather, forasmuch as to entertain any suspicion that they were awake, was to be atheistical and traitorous. In England, there was scarcely an amount of order and protection to justify much national boasting. 
Daring burglaries by armed men, and highway robberies, took place in the capital itself every night; families were publicly cautioned not to go out of town without removing their furniture to upholsterers' warehouses for security; the highwayman in the dark was a City tradesman in the light, and, being recognised and challenged by his fellow-tradesman whom he stopped in his character of “the Captain,” gallantly shot him through the head and rode away; the mail was waylaid by seven robbers, and the guard shot three dead, and then got shot dead himself by the other four, “in consequence of the failure of his ammunition:” after which the mail was robbed in peace; that magnificent potentate, the Lord Mayor of London, was made to stand and deliver on Turnham Green, by one highwayman, who despoiled the illustrious creature in sight of all his retinue; prisoners in London gaols fought battles with their turnkeys, and the majesty of the law fired blunderbusses in among them, loaded with rounds of shot and ball; thieves snipped off diamond crosses from the necks of noble lords at Court drawing-rooms; musketeers went into St. Giles's, to search for contraband goods, and the mob fired on the musketeers, and the musketeers fired on the mob, and nobody thought any of these occurrences much out of the common way. In the midst of them, the hangman, ever busy and ever worse than useless, was in constant requisition; now, stringing up long rows of miscellaneous criminals; now, hanging a housebreaker on Saturday who had been taken on Tuesday; now, burning people in the hand at Newgate by the dozen, and now burning pamphlets at the door of Westminster Hall; to-day, taking the life of an atrocious murderer, and to-morrow of a wretched pilferer who had robbed a farmer's boy of sixpence. All these things, and a thousand like them, came to pass in and close upon the dear old year one thousand seven hundred and seventy-five. 
Environed by them, while the Woodman and the Farmer worked unheeded, those two of the large jaws, and those other two of the plain and the fair faces, trod with stir enough, and carried their divine rights with a high hand. Thus did the year one thousand seven hundred and seventy-five conduct their Greatnesses, and myriads of small creatures--the creatures of this chronicle among the rest--along the roads that lay before them.

So, let's try that again but with a bit more grace. First, we'll need to import the re library, short for Regex, or Regular Expressions. This library provides new functions for finding and replacing text in your dataset.

To import a standard Python library, you need only use the command import followed by the library name, in this case re. Import the library below.

Next we'll need to run two extra steps on our string before we split it into words:

  • Remove punctuation
  • Standardise the case

In the box below, I have included a string, my_string that contains a direct copy of the first chapter of Dickens's A Tale of Two Cities. On the subsequent line, you'll need to use the function re.sub() to remove the punctuation marks. A simple (if slightly blunt) way of doing this is through negative criteria.

We'll use five building blocks of regular expressions

  • []: putting characters in square brackets tells the expression to match any one of the items within it
  • \w: any word character (a letter, a digit, or an underscore)
  • \s: any whitespace
  • ^: not (when it is placed first inside the square brackets)
  • +: one or more times

to create the expression [^\w\s]+, which translates to "find anything that is not a word character or a whitespace, one or more times". You need the "one or more times" in case there are two punctuation marks together.

We then place this in the re.sub(r'x', y, z) function, where

  • x is the regular expression, surrounded by r'' to mark it as a raw string, so that Python leaves backslashes such as \w for the regular expression to interpret
  • y is what you want to replace it with in single quotation marks (in this case nothing, or '')
  • z is the name of the string you are transforming.

In the box below, on the second line, reassign my_string to the re.sub() function, replacing x, y, and z as appropriate.

Use a truncated print command—print(my_string[0:100])—to see if your transformation was successful.
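As a rough illustration of the step above, here is a sketch on a short stand-in sentence rather than the full chapter. The variable name sample and the pattern [^\w\s]+ (one negated character class, matched one or more times) are my assumptions, not the only way to write it.

```python
import re

# a short stand-in for the full chapter (assumption: your my_string is much longer)
sample = "It was the best of times, it was the worst of times!"

# replace every run of characters that are neither word characters nor whitespace with nothing
cleaned = re.sub(r'[^\w\s]+', '', sample)

# a truncated print to check the result
print(cleaned[0:100])
```

Because the replacement is an empty string, a mark sitting between two words with no space (such as a double hyphen) will join them together, which is part of what makes this approach blunt.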

After removing all the punctuation, we'll need to regularise the case. This is done by using the simple x.lower() function on your string, where x is the variable name.

In the box below, reassign my_string to a lowercase version of itself, and then split it into a word list using the x.split() function.
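A minimal sketch of those two steps, again on a short stand-in string (your own my_string will hold the whole cleaned chapter):

```python
# a short stand-in for the cleaned chapter
my_string = "It was the best of times it was the worst of times"

# regularise the case
my_string = my_string.lower()

# split on whitespace to get a list of words
my_words = my_string.split()

print(my_words[0:5])
```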

Let's check to see if our new text runs as smoothly as our old! Run the programme below and see your results!

ngram_list = []
i = 1

while i < len(my_words):
    if i < len(my_words)-1:
        ngram = ' '.join(my_words[(i-1):(i + 2)])
        ngram_list.append(ngram)
    i = i + 1
    
i = 0
while i < 6:
    print(ngram_list[i])
    i = i+1

Importing Data

Congratulations, you are doing great! Now, this is all well and good, but who has time to open an ebook and cut and paste a chapter into a string? Nobody, that's who. So instead you'll need to learn how to open files that are already on your computer.

First, download a copy of https://www.gutenberg.org/files/98/98-0.txt and then upload it to your working directory (Files --> upload)

You can then easily open and import the data from a file in the same directory using the .read() function:

with open('filename.txt', 'r') as filename:
    data = filename.read()

Let's break that down. The first line opens a file called filename.txt in r, or read, mode and creates a temporary variable called filename that refers to the open file. We then read the file's contents into a new string called data.

You have to use the function .read() because just assigning filename would give you, in essence, a handle on the file (along with its metadata), rather than the data itself.

Try importing A Tale of Two Cities into a variable and then printing the first 500 characters.
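Here is a sketch of the pattern from start to finish. So that it runs anywhere, it first writes a tiny stand-in file to a temporary folder; the file name sample.txt, the variable names, and the encoding='utf-8' argument are all my assumptions. With the real download you would skip the writing step and use the name of your uploaded file.

```python
import os
import tempfile

# create a tiny stand-in file (assumption: yours is the Gutenberg download instead)
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'sample.txt')
with open(path, 'w', encoding='utf-8') as file:
    file.write("It was the best of times, it was the worst of times")

# open the file in read mode and store its contents as one string
with open(path, 'r', encoding='utf-8') as file:
    text = file.read()

# slicing never fails, even if the text is shorter than 500 characters
print(text[0:500])
```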

Breaking Out the Dictionary

Now, let's remove those headers, so our statistics aren't thrown off! Open the text file and remove the first and last few paragraphs of standardised text.

Creating a Dictionary

A dictionary is a special sort of list that has a list of key entries, each with its own value—like a dictionary.

You can also think of it like this. Imagine an index card with two words on it, a key in the corner and a value or text in the centre. If you have hundreds of these cards, each with a different key, you can place them in a single box. That box is the dictionary and each card is a two-value list within it.
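To make the index-card picture concrete, here is a small sketch. The dictionary name cards and its contents are invented purely for illustration.

```python
# each key (the corner of the card) points to one value (the centre of the card)
cards = {"apple": "a round fruit", "pear": "a bell-shaped fruit"}

# look up a value by its key
print(cards["apple"])

# add a new card, or overwrite an existing one, by assigning to its key
cards["plum"] = "a small purple fruit"
print(len(cards))
```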

So, if we want to know how often a particular ngram appears in the larger text, we can make a list of all the different (unique) ngrams, each with its own tally of instances.

First, we'll need to split our strings into a word list and transform it into an ngram list. Do this by running the concordance software we created last time (below) changing the variables as appropriate to whichever text you'd like to work with first.

ngram_list = []
i = 1

while i < len(text_words):
    if i < len(text_words)-1:
        ngram = ' '.join(text_words[(i-1):(i + 2)])
        ngram_list.append(ngram)
    i = i + 1
    
i = 0
while i < 6:
    print(ngram_list[i])
    i = i+1

Now we first need to create a new dictionary, which we'll call count. You assign a blank dictionary using a pair of empty {}, curly brackets.

You then need to create a simple for loop which goes through every ngram in your ngram list (that you created above). For each entry you need a bit of code that does two things:

  • Adds a new entry (key and value) to the dictionary
  • If that key already exists, add one to the value

This is a lot simpler than you think!

First, let's remember our simple operations. If you had a variable with a value of 0, and you wanted to add 1 to it, you would simply write:

variable = variable + 1

But how do you create that variable in the first place? Well, luckily, to add a new value to a dictionary, you just need to assign a value to a key in the exact same way that you change a value:

dictionary_name[key_name] = value

Look familiar? Just like you can use an index number to access a particular letter in a string or an item in a list, you can use a key to access a certain entry in a dictionary. So, if, for each ngram in your ngram list, you wanted to add a new value, you would just use

count[ngram] = 0

But this would keep resetting the value to zero. Instead, you can use the x.get(y,z) function to

  • check to see if that key (y) exists
  • if it doesn't, return a predetermined fallback value (z) without changing the dictionary
  • if it does, just retrieve the value

Thus count.get(ngram, 0) will get you the current value of that ngram. You can then just +1 and assign the whole lot to that variable, count[ngram].
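A quick sketch of how .get() behaves, using a single made-up ngram. Note that .get() on its own only reads; it is the assignment that creates or updates the entry.

```python
count = {}

# .get() returns the fallback (0) for a missing key but does not add the key
print(count.get("it was the", 0))
print("it was the" in count)

# assigning the result back is what creates, and later updates, the entry
count["it was the"] = count.get("it was the", 0) + 1
count["it was the"] = count.get("it was the", 0) + 1
print(count["it was the"])
```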

Try it out below.

  • Create a for loop to go through your ngram list
  • Create an assignment that retrieves the value of that ngram (or falls back to zero) and adds one

count = {}

for ngram in ngram_list:
    count[ngram] = count.get(ngram,0) + 1

Now that we have this handy dictionary of all our ngrams, let us display the ones that are most interesting—the ones with the most instances.

Because this is a small text, it is incredibly unlikely that any set of words will appear more than 100 times, so let us start there.

Create a standard while loop counter, but instead of starting at 0, set i to 100.

Then, within this loop, you'll need to pull out the key and value of each entry. You can do this with the items() function. On its own, this function returns every key and value pair in the dictionary. However, you can create a for loop for a dictionary the same way you would for a list:

for ngram,tally in count.items():
    print(ngram, tally)  # do something with each pair

In the box below, create a while loop that starts at 100 and moves down to 0.

Within that loop, use the items() function to go through each dictionary item and

  • check if that ngram has a tally of the same value as i— this will essentially sort the list, highest to lowest
  • if it does, print that ngram and tally however you like

Note: Make sure your for loop and your i count down have the same indentation!

for ngram,tally in count.items():
    print(ngram + ": " + str(tally))
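The loop above prints every entry in whatever order the dictionary holds them. The countdown described in the exercise could be sketched like this; the small count dictionary and the printed list are stand-ins I have invented so the example runs on its own.

```python
# a tiny stand-in dictionary (assumption: yours comes from your ngram list)
count = {"it was the": 10, "of the season": 2, "was the age": 2, "best of times": 1}

printed = []  # keeps a record of the order, just for checking

i = 100
while i > 0:
    # look through every entry for tallies equal to the current value of i
    for ngram, tally in count.items():
        if tally == i:
            print(ngram + ": " + str(tally))
            printed.append(ngram)
    # the countdown sits at the same indentation as the for loop
    i = i - 1
```

This walks through the whole dictionary once for every value of i, which is simple to follow but slow on a large text.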

Great job! But let's be a bit more precise. Let's only show those that match certain criteria.

Perhaps you only want ngrams that include the word "black". How would you do this? Really it is as simple as asking

if 'black' in ngram:

Once you have that formula, recreate your while/for/if/print programme below. But, instead of your one if condition, you'll need to use and followed by a second condition:

if this == that and "this_other_thing" == "that_other_thing":

Try it out!

for ngram,tally in count.items():
    if 'black' in ngram:
        print(ngram + ": " + str(tally))
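To show the and in action, here is a sketch that joins two conditions in a single if. The ngrams and tallies are invented sample data, and matches is a name I have added only to collect the results.

```python
# invented sample data standing in for your real count dictionary
count = {"the black horse": 3, "a black cloak": 1, "it was the": 10}

matches = []
for ngram, tally in count.items():
    # both conditions must be true for the entry to be shown
    if tally >= 2 and 'black' in ngram:
        print(ngram + ": " + str(tally))
        matches.append(ngram)
```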

Finally, build a dictionary (count) of all the ngrams in your text, but only display the entries with at least 2 instances.

Wait! Why print when you can save? Doesn't that make more sense? Instead of finishing your loop with a print command, save the data (the key and tally) to a string.

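dictionary_data = dictionary_data + ngram + "," + str(tally) + "\n"

You'll have to create an empty string called dictionary_data before the loop, and wrap the tally in str(), because Python cannot join a number directly to a string. The "," will insert a comma between the key and the tally while the "\n" will add a line break between each entry. You'll end up with one long string with all your values.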

Now, save that to a file...

with open('filename.txt', 'w') as file:
    file.write(dictionary_data)

dictionary_data = ""

for ngram,tally in count.items():
    if tally >= 2:
        dictionary_data = dictionary_data + ngram + ": " + str(tally) + "\n"
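Putting the building and the saving together might look like the sketch below. The sample tallies, the comma separator, the file name ngram_counts.csv, and the temporary folder are all assumptions made so the example is self-contained; in your notebook you could simply write to a file name in your working directory.

```python
import os
import tempfile

# invented sample tallies standing in for your real count dictionary
count = {"it was the": 10, "was the age": 2, "best of times": 1}

# build one long string, one "ngram,tally" pair per line
dictionary_data = ""
for ngram, tally in count.items():
    if tally >= 2:
        dictionary_data = dictionary_data + ngram + "," + str(tally) + "\n"

# write the string to a file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'ngram_counts.csv')
with open(path, 'w', encoding='utf-8') as file:
    file.write(dictionary_data)
```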

Turning up the Heat(map)

Now we will be taking our data and transforming it into a visualisation. To do this we will be using the incredibly powerful (and prolific) library matplotlib. The important thing to remember is that this is only one tiny example of what this library can help you achieve—the only limit is your imagination. Let's begin, however, by reviewing our data.

In order to make an effective visualisation, we'll need some good quality data. Our end goal is to have a dataset (namely a dictionary) in the following format:

x: [a, b, c, d, e]

where x represents a word in our text and a-e represent the index numbers of that word, or rather, the numerical representation of where that word appears in our text. This will allow us to make a heatmap of the entire book, with coloured stripes at each appearance.

So, where do we begin? Let's think about this backward.

To get a dictionary of all words and their appearance in the text, we'll first need to have a list of all the words in the text.

We'll also need to have that list regularised, to remove punctuation and casing.

Is there anything I'm forgetting? Oh, yes! As an English text, it will contain many (many) prepositions and articles that are not really that interesting for this particular visualisation, so we'll need to use a list of English stop words to help us focus our results.

Creating a dictionary of words is pretty straightforward! We just need to go through our list of words and add each unique one to the dictionary. But what about the list of instances? That's not as simple as just "adding 1" like we did last time.

Instead, we'll need to create a new list for each value. Remember, a value can be anything: a string, an integer, a list. So, each word's value will start out as a new list (just avoid naming a variable list, as that would hide Python's built-in list type).

Next, we'll need to determine if a list for that word already exists. To do this, we'll assign the value of that key (using the get command) to a new variable, called entry.

Now, our entry might be empty, so we'll have to test for that. If it has no value (if entry == None), then it is just a matter of assigning a new list with that index number to that key.

If it's not (use the else command), we'll have to .append() the index number to the existing entry list before assigning entry to the key.

Oh wait! How do we get the index number? There are many ways to go about that, but perhaps the easiest is just to keep track of which word we are on with a simple i variable. Because we want to know the word number, let's set i as 1 rather than zero and after each loop through the word list increment i by one.

Try out your code below:

#create a blank dictionary

location_dictionary = {}

#set your word number iterator to 1

counter = 1

#create a for loop for your word list

for word in text_words:

    #assign the existing value for that key word to an entry variable
    
    entry = location_dictionary.get(word)

    #ask if that entry is None
    
    if entry == None:

        #if it is, assign a list of that word number [i] as the value of that key in your dictionary
        
        location_dictionary[word] = [counter]
        
    #use the else command
    
    else:
        #append the new word number to your entry list
        
        entry.append(counter)
        
        #assign the expanded entry list as the value of that key in your dictionary
        
        location_dictionary[word] = entry

    # increase your iterator
    
    counter += 1
    
# print it!

print(location_dictionary)
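Once the long-hand version makes sense, you might like to see a more compact variant. This sketch swaps in two built-in helpers that the walkthrough above does not use: enumerate(), which hands you the word number alongside each word, and .setdefault(), which returns the existing list for a key or stores a new empty one. The short word list is a stand-in.

```python
# a short stand-in word list (assumption: yours is the whole book)
text_words = ["it", "was", "the", "best", "of", "times", "it", "was", "the", "worst"]

location_dictionary = {}

# enumerate(..., start=1) counts the words from 1, as we humans do
for counter, word in enumerate(text_words, start=1):
    # fetch the word's list (creating an empty one if needed) and add this position
    location_dictionary.setdefault(word, []).append(counter)

print(location_dictionary["it"])
```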

Well done. Now, let's create two helper variables in the box below.

The first is a list of common English stop words (obtained from GitHub). We'll call this stop.

The second is a wordcount of your word list. Note, this is your word list, not your dictionary (which only has one copy of each word). We want to know how long the actual text is so we can make our graph the right length. Also, because we are people, rather than computers, let's start our count at 1.

stop = "a,about,above,after,again,against,all,am,an,and,any,are,arent,as,at,be,because,been,before,being,below,between,both,but,by,cant,cannot,could,couldnt,did,didnt,do,does,doesnt,doing,dont,down,during,each,few,for,from,further,had,hadnt,has,hasnt,have,havent,having,he,hed,hell,hes,her,here,heres,hers,herself,him,himself,his,how,hows,i,id,ill,im,ive,if,in,into,is,isnt,it,its,its,itself,lets,me,more,most,mustnt,my,myself,no,nor,not,of,off,on,once,only,or,other,ought,our,ours,ourselves,out,over,own,same,shant,she,shed,shell,shes,should,shouldnt,so,some,such,than,that,thats,the,their,theirs,them,themselves,then,there,theres,these,they,theyd,theyll,theyre,theyve,this,those,through,to,too,under,until,up,very,was,wasnt,we,wed,well,were,weve,were,werent,what,whats,when,whens,where,wheres,which,while,who,whos,whom,why,whys,with,wont,would,wouldnt,you,youd,youll,youre,youve,your,yours,yourself,yourselves".split(",")
wordcount = len(text_words) + 1
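One thing to watch when testing words against a stop list: the in keyword behaves differently on a string and on a list. The sketch below uses a shortened, made-up stop list to show the difference.

```python
# a shortened, made-up stop list held as one comma-separated string
stop_string = "a,about,herself,the"

# the same words as a list of separate items
stop_list = stop_string.split(",")

# on a string, in looks for the letters anywhere inside, so part of "herself" matches
print("self" in stop_string)

# on a list, in only matches whole items
print("self" in stop_list)
```

With a plain string, a perfectly interesting word can be thrown out just because its letters appear inside a longer stop word, so a list is the safer choice.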

Now the fun begins. If you want to create visualisations in Python, there are many (many) libraries that can assist you. However, one of the most commonly used (and best documented) is matplotlib. This has many functions to create simple bar, scatter and line charts out of the box, but is much more powerful than that. If you can imagine it in 2D, you can (probably) draw it with matplotlib. So, let's import the library.

But wait! It's actually rather huge, so let's just import one sub-section of it, called pyplot, using the code below. You'll notice that pyplot has been abbreviated to plt. You can abbreviate it to anything you want, but most of the examples online will use plt so it's best to go with convention.

from matplotlib import pyplot as plt

So, where do we begin? Before we make our heat map, let us create a simple graph, just so we can understand the format.

First, create a new figure and give it a size (in inches)

plt.figure(figsize=(x,y)) 

Because we are using a Jupyter Notebook, it will scale down if it's too large for the window, but in the computer's memory it will be full size.

Next, let's create some points.

plt.scatter([xa,xb,xc],[ya,yb,yc], label='A') 

This will create 3 points. They will take their x and y coordinates from a particular position within each of these two lists, for example (xa,ya). All three will be the same colour with label 'A'. If you want other colours/labels, you'll need a separate entry (plt.scatter) for each one.

If you want to limit how big the x and y axes are you can use

plt.xlim(xs,xe)
plt.ylim(ys,ye)

Where xs is the first number and xe the last.

You can also choose where the "ticks" or hash marks are along each axis.

plt.xticks([xa,xb,xc])
plt.yticks([ya,yb,yc])

Finally, you can add those little final touches, such as a title for the chart

plt.title('My title')

and a legend (with its location)

plt.legend(loc="upper right")

This is finished up by displaying the figure and then closing it (removing it from memory so you can do another)

plt.show()
plt.close()

# instantiate the figure
plt.figure(figsize=(10,5)) 

# plot points
plt.bar([1,5,6],[2,4,1], label='A', color = 'red') 
plt.bar([2],[1], label='B', color = 'blue') 
plt.bar([1,3],[4,1], label='C', color = 'black') 

# set axis limits
plt.xlim(0,10)
plt.ylim(0,5)

# set axis tick marks
plt.yticks([2,4,6,8,10])
plt.xticks([1,3,5,7,9])

# set title
plt.title('My title')

# set legend location
plt.legend(loc="upper center")

# show chart
plt.show()

# close down chart
plt.close()


Isn't that great? The example above uses plt.bar to draw bar charts; you can do the same thing with plt.scatter for points or plt.plot for line charts. The parameter color = 'red' lets you choose the colour you want! There are so many aspects you can tweak (see www.matplotlib.org), but for now, have a play with the code above until you understand all the components listed here.

Now onto the work at hand. We want to create a heatmap (a bar code in this case) of all the instances of a certain word in our text. We can do this by making a bar chart where all the bars are the same height (lateral thinking!)

So what information do we need to do this? First, we need to know what the most common word(s) in our word list are. We have a dictionary of every word and a list of all the instances of it. By going through our dictionary and using the len function, we can determine which one has the longest list (the most instances).

Let's start with the assumption that at least 1 word has at least 1 instance. So create a value variable with a value of 1, then go through our whole dictionary (using a for loop and the .items() function) and check if the length of each list is bigger than our current value. If it is, change the value of value to the new length (otherwise do nothing). At the end of our loop, value will be an integer representing the length of the longest list in your dictionary. Use the print command to find out what it is.

# create a value variable with a value of 1

value = 1

# create for loop through your dictionary

for word,instances in location_dictionary.items():

    # if the length of the current word's list is greater than value
    
    if len(instances) > value:
                
        #update the value of value
        
        value = len(instances)

# print the final value

print(value)
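For comparison, the same number can be found in one line with Python's built-in max() function. This is only a sketch on a tiny invented dictionary; the long-hand loop above does exactly the same job.

```python
# a tiny invented dictionary of words and their positions
location_dictionary = {"it": [1, 7], "was": [2, 8], "the": [3, 9, 11], "best": [4]}

# measure the length of every list and keep the largest
value = max(len(instances) for instances in location_dictionary.values())

print(value)
```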

That's a big number! I bet it's "the". Don't you? Now we need a smaller list of just the top 5 or 6 words, excluding the stop words. So, let's create a new list for our top five and then, starting at our value, iterate through all the entries in our dictionary checking if that word

  • has the same number of instances as our current iterator (value)
  • is not in the stop word list
  • and that our top-five list isn't already full

If all those conditions are met, add that word to our top-five list. Once every entry has been checked, subtract one from our iterator (so we can look for the next most common word).

This system isn't perfect (can you see why?) but it'll do for our test. Write and run your code below (it may take a minute!)

# create a blank list for your top five

top_five = []

# create a while loop counting down to zero

value = 7987  # start at the highest count found in the previous step (7987 in this example)

while value > 0:

    # create a for loop to go through your dictionary

    for word,instances in location_dictionary.items():

        # check if the length of the list (value) is the same as your current counter
        
        if len(instances) == value:
        
            # and check that your top five list is not yet full
        
            if len(top_five) < 5:
        
                # and check that the word isn't in the stop list
        
                if word not in stop:
        
                    # append the word to your top five list
            
                    top_five.append(word)
                    
    # subtract one from the counter
    
    value -=1
    
# print your top five list
print(top_five)
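If the countdown feels slow, one alternative is to sort the words by how many positions each has and keep the first five. This sketch swaps in the list .sort() method with a key, which the walkthrough above does not use; the dictionary, the stop list, and the helper function number_of_instances are all invented for illustration.

```python
# invented sample data standing in for your real dictionary and stop list
location_dictionary = {
    "the": [1, 4, 9, 12], "king": [2, 10], "queen": [5, 11, 13], "was": [3, 6],
    "jaw": [7], "face": [8], "throne": [14], "france": [15],
}
stop = ["the", "was"]

# keep only the words that are not stop words
candidates = []
for word in location_dictionary:
    if word not in stop:
        candidates.append(word)

# a hypothetical helper: how many times does this word appear?
def number_of_instances(word):
    return len(location_dictionary[word])

# sort from most to fewest instances; tied words keep their original order
candidates.sort(key=number_of_instances, reverse=True)

top_five = candidates[0:5]
print(top_five)
```

Either way, when several words share the same count at the cut-off, which of them makes the list is a matter of dictionary order rather than anything meaningful.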

Now we have our list of the five most common words (did you print to see what they were?)

The last step is to make our chart. Now, we don't want to make each bar individually (that would take too long!) Instead, make a small for loop to automatically populate your five words.

  • create a for loop to go through your top five words
    • go through every word in your dictionary
      • if the word in the dictionary is the same as your word in the top five list
        • create a bar

That bar should use variables! So instead of the x, use the list (value) in your dictionary and the label should be your word (key). You can make y whatever you like, but I would suggest 1 for simplicity's sake. You'll also probably want to add a width=10. Remember, your text could be tens of thousands of words long, so a single line might be impossible to see!

What about the xlim? Remember when we checked for the wordcount of our whole document? That seems a good length for the x-axis.

You can also use blank lists for the xticks and yticks. You don't need them!

Once you've titled your chart and placed your legend, run your code below.

# create a for loop for your top five items

for topword in top_five:
    
    # instantiate the figure
    plt.figure(figsize=(50,10)) 

    # create a for loop to go through your dictionary
    
    for word,instances in location_dictionary.items():
        
        # if the word in your top five matches the word in your dictionary
        
        if topword == word:
    
            # create a bar plot using the dictionary value as x, 1 as the y, the word as the label, and a width of 10
        
            plt.bar(instances,1,label = word, width=10)

            # set the chart title

            plt.title(topword,fontsize=30)

            # show chart
            plt.show()

            # close down chart
            plt.close()
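As a variation, here is a sketch that also applies the xlim, blank ticks, and legend suggested earlier, and looks each word up directly by its key instead of looping through the whole dictionary. The data, the figure size, and the bar width are invented for a 20-word toy text, and the matplotlib.use("Agg") line is an assumption that lets the sketch run without a screen; leave that line out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen; omit this line in a Jupyter Notebook
from matplotlib import pyplot as plt

# invented sample data for a 20-word toy text
location_dictionary = {"king": [2, 10, 18], "queen": [5, 13]}
top_five = ["king", "queen"]
wordcount = 21

for topword in top_five:
    # instantiate the figure
    plt.figure(figsize=(10, 2))

    # look the word up directly by its key and draw one bar per position
    bars = plt.bar(location_dictionary[topword], 1, label=topword, width=0.5)

    # make every chart span the whole text, and hide the tick marks
    plt.xlim(0, wordcount)
    plt.xticks([])
    plt.yticks([])

    # title, legend, show, and close
    plt.title(topword)
    plt.legend(loc="upper right")
    plt.show()
    plt.close()
```

Giving every chart the same xlim means the stripes for different words line up, so you can compare where in the text each one clusters.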

Well done!
