squeegee joined in and replied with this 2 years ago, 11 minutes later, 1 hour after the original post[^][v]#1,237,401
the bitter seething restless rage of the dilettante. the faceless pitiful moans of the lost ones. finding solace in their friendless echoes. it's almost poetic that they'll be dead soon
squeegee replied with this 2 years ago, 2 hours later, 3 days after the original post[^][v]#1,237,753
@previous (A)
Alright then. You know what, i believe you. you could be trying to play it off with a subtle, dismissive nonchalance, thinking i'll maybe decide it's not really bothering anyone because you think i'm posting in response to threads like these as a way to fight-fire-with-fire. And i'll quit because it's played out. you're almost even challenging me to post more and even more explicit pics. almost taunting me. almost like you're kinda thinking "i'll bet little miss princess butternuts over there knows all the best hardcore transgender pornography," and might could be coaxed into literally wallpapering this place like the window of a Bourbon Street burlesque show cabaret. but why would you want that? to make me look like a deviant by prodding me into spamming the most hardcore pornography -you can't even understand what you're looking at for a good couple of minutes and then, "OH." i ha- i had no idea that porn could DO that. Wo-Woahhaha, god damn. that's a breathing tube.
and you know what, it doesn't matter if you LOVE watching transgender porn and you've already seen Mariana Cordoba's entire... catalog and recognize more transgender porn actresses by their crotch shot than you can recognize famous CIS porn actresses by their name and face, even with pornhub's pornstar directory pulled up. you'd still shoot your mouth off to spite your face and endlessly attack me over snorkel porn -i didn't invent intubation, i've never even seen it. and even if it IS a genre of transgender porn you're still going to google it, and when you find it, don't blame me. it's what you want. but i'm not going to take all the abuse you're afraid of and foist onto anyone and everyone around that's not you cleverly disguised as a person who's just going about their day doing normal people things, like mocking and ridiculing whole categories of people -the same kinds of people, as always, the ones who it's in fashion to hate cause that's what people do, is spend time going out of their way to be the cause of a scene. what was it you said, you're not really so much interested in the topic as you are interested in the responses people have to the topic. not just the topic, tho, but your expression of the topic.
like one about kids and the lifestyle choices of adults that you say, is what, "debauchery magnified by armies of ai bots waging an information war."
on the minds of children, like a sickness.
And, you know, you've been working for months and months trying to train Lambda and other language models on the minichan data you have -YOU SENT me my entire posting history and claimed it was scraped, but you know damn well that the schema on this site is half PHP and all the usual tags a bot could scrape for are server side, and the only things formatting the readable content are style tags like BOLD and a bunch of divs, and yeah, the text is there, buried behind unparseable syntax and php preprocessing page layout and content separately.
Yeah, you know, you said you couldn't do it normally and had to use a mouse gesture bot to automate it and it took a long time, and blamed cloudflare when i was figuring out that this place is built like a Norwegian seed vault on an iceberg. but you just have access to the entire database backend and that's why the squeegee data came formatted so perfectly raw, like it was organized by schema, and not the "beautifulsoup" that NLP usually scrapes and then processes and formats and dumps devoid of special characters and uncommon words. No, all the fantabulous squeegisms were maintained. you know what that tells me?
that tells me you are totally being for real, that porn is boring and softcore. but don't hold your breath for anything exciting, it's not coming. you'll have to intubate your own debauchery, Mr. Information War.
squeegee replied with this 2 years ago, 54 minutes later, 4 days after the original post[^][v]#1,238,053
@previous (H)
really? lol, i didn't actually know that. i just don't see how the whole thing comes together on the client side without the server, like, shuffling two halves of a deck together. one half seems to be just the content, like a .json, the other half seems to be, like, how it all comes together with users and, idk, inspect the source, even ai can't make heads or tails of how to parse it logically. it's not an unsophisticated bit of software. it's just styled to look unassuming is my guess, but idk.
Anonymous H replied with this 2 years ago, 34 minutes later, 5 days after the original post[^][v]#1,238,057
@1,238,053 (squeegee)
> i just don't see how the whole thing comes together on the client side without the server
You're right, it doesn't. PHP is an example of server side rendering.
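If it helps to picture it, here's the idea in miniature (Python instead of PHP, and a made-up two-field template, just to show where the assembly happens; a sketch, not this site's actual code):
from string import Template

# a toy server-side render: the server fills the template and ships finished html,
# so the browser only ever receives the assembled string -there's nothing left for
# client-side code to put together
page = Template('<h3 class="c"><strong>$name</strong></h3><div class="body">$body</div>')
html = page.substitute(name='Falco', body='WHAT TYPE? Like a benadryl?')
print(html)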
> even ai can't make heads or tails of how to parse it logically.
Parsing it isn't a problem at all.
squeegee replied with this 2 years ago, 3 hours later, 5 days after the original post[^][v]#1,238,102
@previous (H)
idk how to code, so i don't know what the actual problem was, but i couldn't get beautiful soup to scrape content, i forgot the effing thing it said, but it needed uhhh, a css tag of some kind i think to know which elements to pull from. i looked, it needed a class and i couldn't get it to work with class "c" or see <strong>A</strong> as a name field, and the name="reply_783782479" and id being based on what, and if, it was a reply to, and the actual comments are i guess inheriting attributes since they are just kinda chilling there like they don't give a fuck.
i'm sure it makes sense if i knew better how... to speak computer. but yeah, i got past the dang cookie message and that was more work than i thought it would be. this was like 6 months ago. it IS very tidy though. it looks simple, but i at least know that sometimes that's one of the harder things to do, my coding is all kinds of janky.
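for the record, the kind of class-based pull i think beautiful soup wanted looks something like this -just a sketch, the class "c" and <strong> bits are guesses from poking at the source, 'topic.html' is whatever you saved the page as, and i never got it running:
from bs4 import BeautifulSoup

# sketch only: find every element with class "c" and print whatever <strong> name is inside it
soup = BeautifulSoup(open('topic.html', 'r').read(), 'html.parser')
for header in soup.find_all(class_='c'):
    name = header.find('strong')
    if name is not None:
        print(header.get('id'), name.get_text())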
You ever heard the expression: one month in the lab saves you an hour in the library?
Turn on your brain and don't make the board think for you. Same goes for ChatGPT, and whatever other fancy bloat-max python libraries you're using.
For highly regular parsing problems like this, it's often a lot faster to just parse it out by hand than to use some fancy library or try to work with whatever bullshit ChatGPT is giving you.
> a css tag of some kind i think to know which elements to pull from. i looked, it needed a class and i couldn't get it to work with class "c"
I have no idea wtf you're talking about. You don't need to worry about pulling in css or even doing much in the way of interpreting it. Everything can be parsed out of the single html page, ignoring all else.
Here's one way to attack a problem like this:
-Start up ipython.
-Load in the html page of a topic as a string, call it variable 'thread'.
thread = open('thread_html_file.html','r').read()
-Look at the same html file in a syntax highlighted text editor. Also load it in your web browser so you know the type of outputs you should even be expecting to parse out.
-In the html file, try to find a delimiter that sits at the start or end of every post. Here's one: <div class="body"
-Try it out and split using that delimiter:
posts = thread.split('<div class="body"')
look at entry 0:
posts[0]
entry 1:
posts[1]
the last entry:
posts[-1]
and so on.
Notice that the zeroth (first) element is still trash but the rest seem pretty good. That is, they contain one and only one post per entry. Okay, we discard the zeroth element:
posts = thread.split('<div class="body"')[1:]
Notice the new zeroth element which contains the OP's text starts differently than the remaining. You may have to parse it slightly differently. No big deal. Let's focus on everything except the zeroth. Look at them and notice they all start the same: ' id="reply_box_XXXXXX">'. Ex:
posts[2]:
' id="reply_box_722533">WHAT TYPE? Like a benadryl?<ul class="menu"></ul></div><h3 class="c" name="reply_722536" id="reply_722536"><strong>Falco</strong> !a7Q.SWNEJw (OP) replied with this <strong><span class="help" title="2016-09-28 08:33:21 UTC — Wednesday the 28th of September 2016, 8:33 AM">6.9 years ago</span></strong>, 16 minutes later, 33 minutes after the original post<span class="reply_id unimportant"><a href="#top">[^]</a> <a href="#bottom">[v]</a> <a href="#reply_722536">#722,536</a></span></h3> '
If you needed to know the reply ID of the post, that's where you'd get it (722533, above). But, regardless, you can crop that stuff out by finding the end of that block, the first occurrence of '">' and taking everything after it:
post = posts[2]
start_delim_loc = post.find('">')
assert start_delim_loc != -1
post = post[start_delim_loc+2:]
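As an aside, if you actually wanted that reply ID, you'd pull it out the same way before cropping -a sketch, assuming every reply chunk starts with the ' id="reply_box_' prefix shown above:
raw = posts[2]
marker = 'id="reply_box_'
id_start = raw.find(marker)
assert id_start != -1
id_end = raw.find('"', id_start + len(marker))
reply_id = int(raw[id_start + len(marker):id_end])   # 722533 for the example above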
Now, do the same thing to find the end of the post. You'll notice that you can use '<ul class="menu">' as a reliable end delimiter. So we just cut off the end like this:
end_delim_loc = post.find('<ul class="menu">')
assert end_delim_loc != -1
post = post[:end_delim_loc]
Print out `post` and you're left with 'WHAT TYPE? Like a benadryl?', which matches what you'd expect if you looked at the thread I randomly selected for this example (http://minichan.net/topic/55881)
Now you just put code like the above in a for loop over our original list `posts` and you're on your way; the whole loop is sketched out below, after the replacements. There will still be html tags within the posts, but just use your head and do things step-wise like this and you can reverse literally all the formatting back to the original. Here are some easy ones:
post = post.replace('<s>', '[s]')
post = post.replace('</s>', '[/s]')
post = post.replace('<strong>', '[b]')
post = post.replace('</strong>', '[/b]')
post = post.replace('<u>', '[u]')
post = post.replace('</u>', '[/u]')
post = post.replace('<em>', '[i]')
post = post.replace('</em>', '[/i]')
post = post.replace('&gt;', '>')
post = post.replace('&lt;', '<')
post = post.replace('&amp;', '&')
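Putting it all together, the whole loop is only about a dozen lines. A sketch under the same assumptions as above (the name parse_thread is just for illustration):
def parse_thread(path):
    # same delimiters and replacements as above; per the note earlier, the OP's entry
    # starts differently and may need its start handled slightly differently
    thread = open(path, 'r').read()
    cleaned = []
    for post in thread.split('<div class="body"')[1:]:
        start = post.find('">')
        assert start != -1
        post = post[start + 2:]
        end = post.find('<ul class="menu">')
        assert end != -1
        post = post[:end]
        for old, new in [('<s>', '[s]'), ('</s>', '[/s]'),
                         ('<strong>', '[b]'), ('</strong>', '[/b]'),
                         ('<u>', '[u]'), ('</u>', '[/u]'),
                         ('<em>', '[i]'), ('</em>', '[/i]'),
                         ('&gt;', '>'), ('&lt;', '<'), ('&amp;', '&')]:
            post = post.replace(old, new)
        cleaned.append(post)
    return cleaned
Run it on a saved topic page, print the result, and eyeball it against the thread in your browser, same as before.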
Citations are a little more tricky, but not that tricky.
The names of posters can also be parsed out separately, using their own start and end delimiters, into a separate list, then you can run an `assert` statement to make sure you found the same number of names as you did posts in the topic.
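Sketched out, with the count check (assuming, from the example block above, that every poster name sits inside a '<h3 class="c" ...><strong>' header):
names = []
for chunk in thread.split('<h3 class="c"')[1:]:
    start = chunk.find('<strong>')
    end = chunk.find('</strong>')
    assert start != -1 and end != -1
    names.append(chunk[start + len('<strong>'):end])
assert len(names) == len(posts)   # one name per post, otherwise one of the delimiter guesses is off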
Print out the output of the parsing to the screen. Make sure it looks sane. Run the code on multiple example threads and add new replacements and rules as you go. Eventually you'll hit everything and can then run the code en masse.
There's about a million different ways to do parsing like this. Some more verbose, some less, such as with regexp everywhere. For me, the above strikes a good middle ground and I find it makes the code more readable than heavy use of regular expressions, which I always find unreadable and ironically more time consuming than writing the code out a little more step-wise, like I did above.
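For comparison, the regexp-everywhere version of the body split is something like this one-liner (same assumptions about the markup; whether it's actually more readable is the trade-off I mean):
import re

# non-greedy match from each body div's opening tag through to its menu <ul>
posts_re = re.findall(r'<div class="body"[^>]*>(.*?)<ul class="menu">', thread, re.S)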