A Mueller Report Wordcloud, or Who Needs Regex Anyway?

It’s Thursday, April 18th, 2019. Which means, for Americans and the world, it’s officially Mueller Time!

That said, it’s a workday and I don’t really have time to read 440 or so pages written by lawyers, even legendary lawyers like the crew at the Special Counsel’s Office. What I do have time for is a word cloud. I’ve never really made a word cloud before, so let’s get to it! Here’s one flavor of the finished product. Simple, elegant, and quite Soviet.

Sheet 1.png

Step one: Find some words!

  1. Find some words. (Done.)

    1. While the original report is not searchable, a variety of folks have now offered searchable versions. I found mine on Google Drive.

  2. Copy/Paste the words from a searchable PDF into a powerful word processor.

    1. Notepad didn’t work for me; for some reason the clipboard contents were too big for it. I first used OneNote, but I was worried it wouldn’t support the next step.

    2. So I resorted to Notepad++, because my next step is to get the text into a single column of words (plus garbage characters, symbols, and numbers).

  3. In Notepad++ I ran a find/replace on all the spaces ‘ ‘ to convert them to carriage returns (line breaks), so that each word sits on its own line.

    1. This was my reference.

  4. Open the file in Tableau Prep Builder for cleaning.

    1. Rename my column “Text”

    2. Clean -> Remove Punctuation

    3. Clean -> Remove Numbers

    4. Create a calculated field to clean up some of the weird remaining special characters

    5. Sort the remaining text in descending order; Prep will naturally sort by count([Text]). Then exclude the common English words that aren’t pertinent, such as articles and helper verbs: “the”, “to”, “of”, “and”, “that”, ...

      1. This part’s got a bit of subjectivity to it, but my frame of reference is that ambiguous words or even less evocative words, like pronouns, “him”, “his”, “I”, aren’t helpful to the viz. I made this an iterative process.

    6. Lastly, I’m going to filter out the words that appear only once. One could also do this in Tableau, but it’s useful to understand how to do it in Tableau Prep Builder. I know this data set is going to have a long tail, and many of these single instances are either garbage data, due to the nature of reading in a PDF image and converting it to text, or simply won’t appear on the word cloud anyway, so this keeps the data set a bit more manageable.

      1. First, and this is the interesting bit, duplicate the [Text] field, because in Prep Builder you need one instance to Group By and one instance to Count() or otherwise aggregate.

      2. Then, add an Aggregate step, and put the original [Text] in the Group By side and the duplicated [Text] in the Aggregate side and set it to Count.

      3. I then joined the original data set to the counts, because what I really want is to let Tableau do the aggregations after I filter out the least commonly used words.

      4. Add another Clean step and filter by calculation on the count. I opted for >= 10, which is stricter than just dropping single instances, because I want this data set to perform reasonably well.

    7. If you still see ‘common’ words which aren’t interesting, revisit Step 5 or exclude them in Tableau.

    8. Output to a Hyper file and build out your Word Cloud in Tableau.

    9. I also opted to filter out additional ‘ambiguous’ words in Tableau Desktop.

    10. I built a filter in Tableau Desktop to further limit the number of words by count. I settled on about 60, which kept the viz manageable.

    11. You can find my workbook on Tableau Public: https://public.tableau.com/profile/josephschafer#!/vizhome/MuellerReportWordCloud/MuellerReportWordCloudrwb
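
For anyone without Tableau Prep Builder handy, the cleaning and counting steps above can be roughly approximated in Python. This is a minimal sketch, not the flow used for the viz: the function names, the tiny stop-word set, and the lowercasing are all assumptions made for illustration, and it returns the counts directly instead of joining them back to the original rows. (Yes, it uses a regex. Sorry.)

```python
import re
from collections import Counter

# Hypothetical stop-word list built from the examples in the post;
# the real list was built up iteratively and is longer.
STOP_WORDS = {"the", "to", "of", "and", "that", "him", "his", "i"}


def tokenize(text):
    """Split on whitespace: the equivalent of one word per line."""
    return text.split()


def clean(token):
    """Drop punctuation, digits, and stray symbols, keeping letters only.

    Lowercasing is an assumption; the post does not say whether case was folded.
    """
    return re.sub(r"[^A-Za-z]", "", token).lower()


def word_counts(text, min_count=10, stop_words=STOP_WORDS):
    """Count cleaned words, skipping stop words and anything rarer than min_count."""
    cleaned = (clean(t) for t in tokenize(text))
    counts = Counter(w for w in cleaned if w and w not in stop_words)
    return {w: n for w, n in counts.items() if n >= min_count}


# Made-up sample text, repeated so the words clear the >= 10 threshold.
sample = "The President, the Campaign and Russia. " * 12
print(word_counts(sample))
```

The `min_count=10` default mirrors the >= 10 filter from the Prep flow; raising it plays the same role as the extra count filter in Tableau Desktop.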

flow.png
Mueller Report Word Cloud (rwb).png