My Word Cloud Process - ITT We All Laugh, Learn, and See Gradual Iterative Updates

  1. #1
    gadzooks Dark Matter [keratinize my mild-tasting blossoming]
    Months ago, I got super into doing all kinds of data analysis on NiS thread/post content for practice/self-education, as well as for sheer lulz.

    https://niggasin.space/thread/31288
    https://niggasin.space/thread/31496

    One of those areas attracted heightened interest, particularly from a couple of homies: mmQ and Grimace.

    (By the way, Grimace, I realize that for you the priority is the Totse dump, and I am working on that as well. But parsing HTML tables is brutal, and, for ethical reasons, I don't want to simply throw hundreds of threads back onto the Internet directly like the Wayback Machine does, so I'm pretty much stuck parsing the files. Totse had two different HTML formats... Zoklet has only the one. I'm working on it all in tandem, I promise).

    Now, back to word clouds...

    The first one I did was for NiS's top candidate for most controversial figure. My motivation to choose him at the time was not out of some form of admiration of any kind, but, rather, because I figured his word cloud would likely be interesting and/or entertaining.

    Btw, for anyone who does not know exactly what a word cloud is: it takes a large body of text, statistically determines the N most frequently occurring words within it, and renders them as a pretty, colorful image showing all the top words, with the size of each word corresponding to its frequency.

    For example, see the original infinityshock word cloud:

    [image: the original infinityshock word cloud]

    But now, a bit more about the process...

    First off, you might be thinking... Won't words like "the", "a", "to", and so on, always be the top used words?

    Yes, those are the most frequently used words, of course, but any kind of linguistic analysis of a large corpus (body) of text involves a few steps to clean the data up a bit. Those super common words mentioned above are referred to as stop words. There are a few ways to remove them programmatically - I believe I used a publicly available list from online to filter the text for the above word cloud, but many NLP (Natural Language Processing) libraries, such as NLTK for Python (the one I typically use), ship with built-in stop word lists that you just choose and declare when you're preparing the data.
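
    For instance, here's a minimal sketch of that filtering step with NLTK (assuming the relevant NLTK data, i.e. the stopwords and punkt corpora, has already been downloaded via nltk.download):

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    raw_text = "The quick brown fox jumps over the lazy dog"

    # The built-in list of super common English words ("the", "a", "to", etc.)
    stop_words = set(stopwords.words('english'))

    # Split the text into individual words, then drop the stop words
    tokens = word_tokenize(raw_text.lower())
    filtered = [word for word in tokens if word.isalpha() and word not in stop_words]
    # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']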

    There are MANY other ways in which textual data can be prepared for analysis, but, for a word cloud, which is actually an incredibly simple analysis compared to other NLP use cases, it's literally just about counting how often words occur. Nothing all that fancy, really.
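
    In Python, for instance, that counting step really is just a couple of lines (a sketch using the standard library, run on an already-cleaned list of words):

    from collections import Counter

    words = ['hi', 'homie', 'fuck', 'hi', 'fuck', 'fuck']

    # Tally every word, then keep only the N most frequent
    top_words = Counter(words).most_common(2)
    # [('fuck', 3), ('hi', 2)]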

    But, cleaning and preparing the data is always an important step.

    Case in point:

    [image: unfiltered infinityshock word cloud]

    That's a word cloud generated (just now) from the exact same text data, but without any fancy pre-processing or filtering (other than stop words).

    Notice how "Bill" and "Krosby" are among his 20 most frequently used words? I think that, when I made the original, I simply manually added those two words to the stop word list because they came up so much (kind of a quick and dirty brute force method).

    (Apparently infinityshock references, or quotes, kr0z, with some regularity).

    OH, and that reminds me...

    Quotes...

    One reflection I had about my original word cloud (much later on) was that I did not filter out quotes... So, it is technically including words the target poster didn't actually use themselves. This skews the data.

    Right now, as we speak, I am running a Python script on the data I have already archived to parse out quotes. I will elaborate on my specific method of doing so in a subsequent post in this thread.
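
    (As a teaser, here's a minimal sketch of one general way to do it with BeautifulSoup; note that the "quote" class name below is just a hypothetical placeholder, not necessarily how NiS actually marks up its quote blocks:)

    from bs4 import BeautifulSoup

    def strip_quotes(post_html):
        """Remove quoted text from a post so only the poster's own words remain."""
        soup = BeautifulSoup(post_html, 'html.parser')
        # Hypothetical class name; the real markup has to be inspected first
        for quote in soup.find_all(class_='quote'):
            quote.decompose()  # delete the quote block from the tree entirely
        return soup.get_text()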
  2. #2
    WellHung Black Hole
    Gadzooks, you're my latest man crush. You're fucken sexy, baby.
  3. #3
    gadzooks Dark Matter [keratinize my mild-tasting blossoming]
    And a note about the actual artistic/visual representation...

    Full disclosure: I am using already existing online tools to generate the imagery.

    The bulk of the work, though, is in crawling, archiving, and ultimately parsing and collating all the relevant data.

    The next most time-consuming step is cleaning/pre-processing the text data for analysis.

    Running a word frequency analysis on a large chunk of text really only involves a couple lines of code in most high-level languages (Python, JavaScript, etc.).

    Generating the visual, while also an interesting topic, is something I just "outsource".

    But, if I had to write my own, I'd probably leverage a pre-existing JavaScript visualization library (like D3).

    And then it would simply be a matter of importing the cleaned/pre-processed data, finding the most frequently occurring words, and adjusting the font size of each word as a function of its frequency.
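
    If I did roll my own, the sizing math might look something like this sketch (plain Python; the actual drawing would still be handed off to something like D3):

    # Map each word's count onto a font size, scaled linearly between a
    # minimum and maximum so that even rare words remain legible.
    def font_sizes(word_counts, min_size=12, max_size=96):
        lo, hi = min(word_counts.values()), max(word_counts.values())
        span = (hi - lo) or 1  # avoid division by zero when all counts match
        return {word: min_size + (count - lo) / span * (max_size - min_size)
                for word, count in word_counts.items()}

    print(font_sizes({'fuck': 5, 'again': 2, 'hi': 2}))
    # {'fuck': 96.0, 'again': 12.0, 'hi': 12.0}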
  4. #4
    WellHung Black Hole
    Is Mozilla Firefox a browser?
  5. #5
    Sophie Pedophile Tech Support
    That's pretty dope Gadzooks. Also, post your source.
  6. #6
    gadzooks Dark Matter [keratinize my mild-tasting blossoming]
    Originally posted by WellHung Is Mozilla Firefox a browser?

    Yes.
  7. #7
    WellHung Black Hole
    Whatever happened to Internet Explorer? Always liked how the rings around that little globe would keep moving while it was loading.
  8. #8
    WellHung Black Hole
    Gadzooks, I didn't understand any of your computer jargon... in layman's terms, what did you achieve?
  9. #9
    gadzooks Dark Matter [keratinize my mild-tasting blossoming]
    Originally posted by Sophie That's pretty dope Gadzooks. Also, post your source.

    I will do this at some point... Maybe as a GitHub gist, or even a full repository...

    The only part I'm kinda hesitant about posting is the code I came up with for crawling the site.

    It could very, VERY, easily be abused, if anyone with ill intent towards this site were to adapt it to nefarious purposes (*cough* infinityshock *cough*).

    But the code I use to analyze I'll post in this thread after I've cleaned it up just a little bit (I write really dirty code the first time through, and am self-conscious about others seeing it in that state, lol).
  10. #10
    gadzooks Dark Matter [keratinize my mild-tasting blossoming]
    Originally posted by WellHung Gadzooks, I didn't understand any of your computer jargon... in layman's terms, what did you achieve?

    Good question.

    I'm a huge fan of the ELI5 subreddit for explanations of things outside my own realm(s) of expertise, so I will try to come up with something similar here.

    Just gimme a sec to think it through.
  11. #11
    gadzooks Dark Matter [keratinize my mild-tasting blossoming]
    Originally posted by gadzooks It could very, VERY, easily be abused, if anyone with ill intent towards this site were to adapt it to nefarious purposes (*cough* infinityshock *cough*).

    Actually, I was thinking mainly of my "NiS bot" code here...

    Archiving each page as a plain old HTML file doesn't involve establishing sessions or making requests (other than "GET"), so I could probably include that.
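
    (The core of that part is trivially simple anyway; a sketch with the requests library, using one of the thread URLs from the OP:)

    import requests

    # Fetch a thread page with a plain GET request and archive it as an HTML file
    thread_id = 31288
    response = requests.get(f'https://niggasin.space/thread/{thread_id}')
    response.raise_for_status()  # bail out on HTTP errors

    with open(f'thread_{thread_id}.html', 'w', encoding='utf-8') as f:
        f.write(response.text)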

    The problem is that, like I said, my code is dirty and disorganized as is... I literally have files called things like "nisSORT.py" that are just filled with completely unrelated NiS stuff.
  12. #12
    Sophie Pedophile Tech Support
    Originally posted by gadzooks I will do this at some point… Maybe as a GitHub gist, or even a full repository…

    The only part I'm kinda hesitant about posting is the code I came up with for crawling the site.

    It could very, VERY, easily be abused, if anyone with ill intent towards this site were to adapt it to nefarious purposes (*cough* infinityshock *cough*).

    But the code I use to analyze I'll post in this thread after I've cleaned it up just a little bit (I write really dirty code the first time through, and am self-conscious about others seeing it in that state, lol).

    You could leave the crawler out; however, you might add something that will make it easy for the program to parse data from crawlers. Make it a repo, and if you throw in some features that can be used in an Open Source Intelligence operation, I'll bring some exposure to the project if you'd like.
  13. #13
    gadzooks Dark Matter [keratinize my mild-tasting blossoming]
    An ELI5 (using as little technical terminology as possible):

    So a word cloud basically involves:
    1. Take a large collection of words (forget for a second that they actually compose meaningful sentences... think of, for example, an entire NiS member's post history as a big bag of random English words).
    2. Literally just count how often each word in that bag of words comes up. The amount of data (i.e. number of words) can be quite large, but computers are able to perform such calculations pretty quickly.
    3. Filter based on certain conditions.
    4. Display words as an image.

    For example (oversimplified, but informative):

    LIST_OF_RANDOM_WORDS = [
        "hi", "what", "hello", "homie", "fuck", "really", "why", "i",
        "dunno", "hi", "again", "fuck", "again", "fuck", "fuck", "fuck"
    ]

    Let's say we're looking for the top 3 most used words...

    First, the program counts the occurrence of each word (it's a prerequisite for filtering out the top X number of words requested):
    hi: 2.
    what: 1.
    hello: 1.
    homie: 1.
    fuck: 5.
    really: 1.
    why: 1.
    i: 1.
    dunno: 1.
    again: 2.

    The program then sorts this list by occurrence count:
    fuck: 5.
    again: 2.
    hi: 2.
    what: 1.
    hello: 1.
    homie: 1.
    really: 1.
    why: 1.
    i: 1.
    dunno: 1.

    It then filters out all but the top three:
    fuck: 5.
    again: 2.
    hi: 2.

    And then you have:
    FUCK, AGAIN, HI.

    The word cloud generator assigns a font size relative to each word's number of occurrences:
    FUCK (font-size = 12 * 5).
    AGAIN (font-size = 12 * 2).
    HI (font-size = 12 * 2).

    Note that the number 12 is simply a constant value used to compute a variable font size.
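
    For the code-inclined, the entire walkthrough above fits in a few lines of Python (a sketch using the same toy data and the same 12x sizing rule):

    from collections import Counter

    words = ["hi", "what", "hello", "homie", "fuck", "really", "why", "i",
             "dunno", "hi", "again", "fuck", "again", "fuck", "fuck", "fuck"]

    # Count, sort, and keep the top three, all in one step
    top_three = Counter(words).most_common(3)
    # [('fuck', 5), ('hi', 2), ('again', 2)]  (ties keep first-seen order)

    # Assign each word a font size proportional to its count
    for word, count in top_three:
        print(f"{word.upper()} (font-size = {12 * count})")
    # FUCK (font-size = 60)
    # HI (font-size = 24)
    # AGAIN (font-size = 24)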
  14. #14
    aldra JIDF Controlled Opposition
    lol, nude niggers
  15. #15
    gadzooks Dark Matter [keratinize my mild-tasting blossoming]
    Originally posted by aldra lol, nude niggers

    I know, right?

    That's exactly why I chose finny.

    I'm planning to do a bunch more once ALL post data is parsed and cleaned/processed...

    But, unless someone specifically requests one, I generally pick them based on how entertaining the cloud will be.

    It's actually generally the most obnoxious posters that make for the best result.
  16. #16
    gadzooks Dark Matter [keratinize my mild-tasting blossoming]
    ON PARSING:

    I'll post some actual code (python) later, but for now, I'll kinda walk through the general process.

    First, what is parsing?

    To parse something has multiple meanings depending on context, but it ultimately always comes down to one simple thing: taking a sequence of arbitrary data and extracting meaningful information from it.

    HTML is an excellent example, because it follows a rather simple pattern (especially compared to parsing full, Turing-complete programming languages).

    HTML is closely related to XML (eXtensible Markup Language); both descend from SGML, and XHTML is the variant of HTML that strictly conforms to XML's rules.

    XML uses "<" and ">" symbols to delimit tags, which mark the significant "stopping points" (for lack of a better term) in the document.

    For example (in HTML in particular), a paragraph is denoted as follows:


    <p>Hello world. I am a paragraph. I consist of multiple sentences of text.</p>


    An HTML parser extracts everything contained between the opening <p> tag and the closing </p> tag.

    Often, though, each element/tag will have attributes:


    <p class="post">Hello world. I am a paragraph. I consist of multiple sentences of text.</p>
    <p>Hello world. I am a paragraph. I consist of multiple sentences of text. BUT I AM NOT A POST.</p>
    <p class="post">Hello world. I am a paragraph. I consist of multiple sentences of text.</p>


    So, using an XML/HTML parser, and assuming that one wants to extract all the paragraphs that fall into the class of "post", one would do the following (represented as pseudocode):


    all_paragraphs = find_all_elements_wrapped_between_p(source_text)

    all_posts = find_all_elements_with_specified_class(all_paragraphs, "post")


    The variable "all_posts" now holds all of the items labelled as posts.
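
    In real Python the code is nearly as short; here's a sketch using BeautifulSoup (one common HTML parsing library, though not necessarily the one I'll end up posting):

    from bs4 import BeautifulSoup

    # Sample HTML mirroring the three paragraphs above
    source_text = '''
    <p class="post">Hello world. I am a paragraph.</p>
    <p>Hello world. BUT I AM NOT A POST.</p>
    <p class="post">Hello world. I am also a paragraph.</p>
    '''

    soup = BeautifulSoup(source_text, 'html.parser')
    all_paragraphs = soup.find_all('p')            # every <p> element
    all_posts = soup.find_all('p', class_='post')  # only those with class="post"
    print(len(all_posts))  # 2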

    Okay, I have to take a break here, because trying to explain all this using as little technical terminology as possible is actually quite a cognitive exercise (but a useful one, even for me, because, as Einstein once said: "If you can't explain it simply, you don't understand it well enough"... or something like that, at least).