Posts by gadzooks
-
2019-05-26 at 4:44 PM UTC in My Word Cloud Process - ITT We All Laugh, Learn, and See Gradual Iterative Updates
-
2019-05-26 at 4:41 PM UTC in My Word Cloud Process - ITT We All Laugh, Learn, and See Gradual Iterative Updates
And a note about the actual artistic/visual representation...
Full disclosure: I am using already existing online tools to generate the imagery.
The bulk of the work, though, is in crawling, archiving, and ultimately parsing and collating all the relevant data.
The next most time-consuming step is cleaning/pre-processing the text data for analysis.
Running a word frequency analysis on a large chunk of text really only involves a couple of lines of code in most high-level languages (Python, JavaScript, etc.).
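To give a sense of how little code that is, here is a minimal sketch in Python using only the standard library. The function name `word_frequencies` and the crude regex tokenizer are assumptions made for illustration, not the actual script used for these word clouds.

```python
import re
from collections import Counter

def word_frequencies(text, top_n=10):
    """Return the top_n most common words in text as (word, count) pairs."""
    # Crude tokenizer (an assumption): lowercase, then grab runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

sample = "the cat sat on the mat and the dog sat too"
print(word_frequencies(sample, top_n=2))  # [('the', 3), ('sat', 2)]
```

Note that a stop word ("the") tops even this toy example.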
Generating the visual, while also an interesting topic, is something I just "outsource".
But, if I had to write my own, I'd probably leverage a pre-existing JavaScript visualization library (like D3).
After that, it would simply be a matter of importing the cleaned/pre-processed data, finding the most frequently occurring words, and adjusting the font size of each word as a function of its frequency.
-
2019-05-26 at 4:35 PM UTC in My Word Cloud Process - ITT We All Laugh, Learn, and See Gradual Iterative Updates
Months ago, I got super into doing all kinds of data analysis on NiS thread/post content for practice/self-education, as well as for sheer lulz.
https://niggasin.space/thread/31288
https://niggasin.space/thread/31496
One of those areas attracted heightened interest, especially from a couple of homies in particular, mmQ and Grimace.
(By the way, Grimace, I realize that for you the priority is the Totse dump, and I am working on that as well. But parsing HTML tables is brutal and, for ethical reasons, I don't want to simply throw hundreds of threads back onto the Internet directly like the Wayback Machine does. So I'm pretty much stuck parsing the files, and Totse had two different HTML formats... Zoklet has only the one. I'm working on it all in tandem, I promise.)
Now, back to word clouds...
The first one I did was for NiS's top candidate for most controversial figure. I chose him at the time not out of admiration of any kind but, rather, because I figured his word cloud would likely be interesting and/or entertaining.
Btw, for anyone who does not know exactly what a word cloud is: it takes a large body of text, determines the N most frequently occurring words within it, and produces a pretty and colorful image showing those top words, with the size of each word scaled to its frequency.
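A rough sketch of that size-to-frequency mapping, in Python. The linear scaling, the pixel range, and the example counts are all assumptions for illustration; actual word cloud tools may well use a different curve (log or square root scaling, for instance).

```python
def font_size(count, min_count, max_count, min_px=12, max_px=72):
    """Linearly map a word count onto a font size in pixels."""
    if max_count == min_count:
        # Every word is equally frequent, so there is nothing to scale.
        return max_px
    scale = (count - min_count) / (max_count - min_count)
    return round(min_px + scale * (max_px - min_px))

# Made-up counts, purely for illustration.
counts = {"parsing": 40, "cloud": 25, "word": 10}
lo, hi = min(counts.values()), max(counts.values())
sizes = {word: font_size(n, lo, hi) for word, n in counts.items()}
print(sizes)  # {'parsing': 72, 'cloud': 42, 'word': 12}
```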
For example, see the original infinityshock word cloud:
But now, a bit more about the process...
First off, you might be thinking... Won't words like "the", "a", "to", and so on, always be the top used words?
Yes, they are the most frequently used words, of course, but any kind of linguistic analysis of a large corpus (body) of text involves a few steps to clean the data up a bit. Those super common words mentioned above are referred to as stop words. There are a few ways to remove them programmatically. I believe I used a publicly available list online to filter the large body of text for the above word cloud, but many NLP (Natural Language Processing) libraries, such as NLTK for Python (the one I typically use), have built-in stop word lists that you just select when you're preparing the data.
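Here is a minimal sketch of stop word filtering in plain Python. The tiny `STOP_WORDS` set, the function name, and the `extra` parameter (a hypothetical hook for adding words by hand) are assumptions for illustration. With NLTK installed and its stop word corpus downloaded, `nltk.corpus.stopwords.words("english")` could supply a fuller list in place of the hard-coded one.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "to", "and", "of", "in", "on", "is", "it"}

def content_word_counts(text, stop_words=STOP_WORDS, extra=()):
    """Count words in text, skipping stop words plus any extra words to drop."""
    blocked = set(stop_words) | {w.lower() for w in extra}
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in blocked)

text = "The parser is slow and the parser is brittle"
print(content_word_counts(text).most_common(2))  # [('parser', 2), ('slow', 1)]
```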
There are MANY other ways in which textual data can be prepared for analysis, but, for a word cloud, which is actually an incredibly simple analysis compared to other NLP use cases, it's literally just about counting how often words occur. Nothing all that fancy, really.
But, cleaning and preparing the data is always an important step.
Case in point:
That's a word cloud generated (just now) from the exact same text data, but before doing any fancy pre-processing or filtering (other than stop words).
Notice how "Bill" and "Krosby" are among his 20 most frequently used words? I think that, when I made the original, I simply manually added those two words to the stop word list because they came up so much (kind of a quick and dirty brute force method).
(Apparently infinityshock references, or quotes, kr0z, with some regularity).
OH, and that reminds me...
Quotes...
One reflection I had about my original word cloud (much later on) was that I did not filter out quotes... So, it is technically including words the target poster didn't actually use themselves. This skews the data.
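For illustration only, here is one way quote filtering could look in Python. It assumes quoted text sits inside `<blockquote>` elements in the archived HTML, which may not match the real markup, and it is not necessarily the method used here. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class QuoteStripper(HTMLParser):
    """Collect text that is not inside any <blockquote> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current <blockquote> nesting level
        self.chunks = []  # text found outside of quotes

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)

def strip_quotes(post_html):
    """Return the post's own text with quoted blocks removed."""
    parser = QuoteStripper()
    parser.feed(post_html)
    parser.close()
    # Join the kept text and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())

post = "<blockquote>someone else said this</blockquote><p>my own reply</p>"
print(strip_quotes(post))  # my own reply
```

Tracking nesting depth rather than a simple on/off flag means quotes inside quotes get dropped too.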
Right now, as we speak, I am running a Python script on the data I have already archived to parse out quotes. I will elaborate on my specific method of doing so in a subsequent post in this thread.
-
2019-05-26 at 5:51 AM UTC in goodbye
-
2019-05-26 at 5:28 AM UTC in I can't believe 2008 was 11 years ago
Father's Day is coming up...
I like Amazon gift cards, or cash...
Actually, cash is better.
-
2019-05-26 at 5:28 AM UTC in I can't believe 2008 was 11 years ago
-
2019-05-26 at 5:24 AM UTC in I can't believe 2008 was 11 years ago
I can't believe I fucked your mom ~28 years ago.
-
2019-05-26 at 5:21 AM UTC in GADZOOKA BAZOOKA
I don't care about being unbanned.
I almost never go into TC as it is.
I really do need to pass out.
-
2019-05-26 at 5:19 AM UTC in So in tinychat the damndest thing happened
lmao i think i got banned.
Prolly for the best.
I need to finish off this drink and pass the fuck out.
-
2019-05-26 at 4:33 AM UTC in Panthrax's weekend: September 2005
tfw parsing HTML tables for hours...
FML.
-
2019-05-26 at 4:33 AM UTC in Panthrax's weekend: September 2005
I've been spending all night parsing my archived Totse threads...
Largely because I promised panny...
Nigga AWOL at the moment, but I'm carrying on anyway.
-
2019-05-26 at 4:32 AM UTC in Panthrax's weekend: September 2005
Topic: My Weekend, And Yours.
panthrax
Moderator
posted 09-19-2005 14:31
I don't remember a god damn thing about my weekend. As I stare onto the computer desk, the desk this very computer sits on, I admire the ovals, circles, and multi-colored entities we call "pills".
Soma 350mg x 5 (chewed)
Klonopin 1mg x 3
Xanax .5mg x 2
That was my day by day weekend. And I don't remember any of it.
How was your weekend, if you remember it?
[This message has been edited by panthrax (edited 09-19-2005).]
-
2019-05-26 at 4:29 AM UTC in I went to the free buffet yesterday
Warren or Jimmy?
-
2019-05-26 at 4:28 AM UTC in Anyone got ACP's Number?
She told me she was single.
Lying fucking slut.
-
2019-05-26 at 4:28 AM UTC in Anyone got ACP's Number?
*googles it*
Yep, it's Jenny.
-
2019-05-26 at 4:27 AM UTC in Anyone got ACP's Number?
-
2019-05-26 at 4:26 AM UTC in So in tinychat the damndest thing happened
Wait, how did this happen in TC?
Does TC mean something different now?
-
2019-05-26 at 4:24 AM UTC in So in tinychat the damndest thing happened
-
2019-05-25 at 6:23 PM UTC in I need video editing software
^ I finally managed to upload it as a gif (took longer than editing the video in the first place).
-
2019-05-25 at 6:22 PM UTC in I need video editing software