Minichan

Topic: Minichan/Tinychan text dataset

Anonymous A started this discussion 2 years ago #113,133

I started scraping sometime between April and June and finished in July (all this year, 2023). These dates are from memory, so the start could be off by a month or two, but I know for a fact I finished before the end of July.

The unprocessed HTML files total 351 MB (gzipped).
A stripped-down version, where I reverse-parsed the HTML and converted it back to BBCode, is 108 MB (gzipped).
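
To give a sense of what the reverse parsing involves, here's a minimal sketch in Python using BeautifulSoup. The tag mapping is illustrative only; a real converter for the site's markup would need a rule per feature.

# Minimal sketch of HTML -> BBCode reverse parsing with BeautifulSoup.
# The tag mapping is illustrative, not the site's full markup.
from bs4 import BeautifulSoup

TAG_MAP = {"b": "b", "i": "i", "u": "u", "s": "s"}

def html_to_bbcode(html):
    soup = BeautifulSoup(html, "html.parser")
    for html_tag, bb_tag in TAG_MAP.items():
        for tag in soup.find_all(html_tag):
            tag.insert_before(f"[{bb_tag}]")
            tag.insert_after(f"[/{bb_tag}]")
            tag.unwrap()  # keep the contents, drop the HTML tag
    for img in soup.find_all("img"):
        img.replace_with("<img>")  # images become a bare <img> placeholder
    return soup.get_text()

print(html_to_bbcode("<b>This suit is expensive.</b>"))  # [b]This suit is expensive.[/b]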

Here's an example of a stripped-down version of this thread:
/-- 0 Meta
/--- Chinese rice is not plastic rice.
There was a scandal where a small percentage of rice producers were mixing plastic to stretch their profits. This was discovered and they were shut down and punished. It is not like literally all rice in China is plastic. That is a very illogical way of interpreting things.
/-- 1 Matthew R. Miller
i eat shit lol
/-- 2 Marsden
@previous

I am disappoint.
/-- 3 Iris
@1
Now now my grandson. Don't be telling all your secrets.
/-- 4 Saibra (OP)
@previous
He cheated on me!
/-- 5 Matthew R. Miller
@previous
ask me about the sacred heart stories!
/-- 6 Anonymous E
<img>
[b]This suit is expensive.[/b]
/-- 7 Peggy (OP)
@5
Ask me about my "dad"!

The format should be mostly self-explanatory, but a few things are worth noting:

I chose to place the name of the OP (e.g. "/-- 0 Meta") before the topic name (e.g. "/--- Chinese rice[...]") because the main point of the scraping was to build a dataset for training neural nets, specifically transformers. I wanted to be able to prompt the network to generate a topic by a specific user and have it fill in the topic name and everything else, as in the example below.
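
For instance, feeding the model just the prefix

/-- 0 Meta
/---

leaves it to complete the topic title and then the rest of the thread. (That prefix reuses the example above; any name would work.)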

I also chose to renumber post IDs to be relative to the OP, because transformers are notoriously bad at arithmetic, and I figured small, thread-local numbers would help the model keep track of which posts in a thread came before which others. A rough sketch of the idea follows.
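
In Python, the renumbering amounts to something like this (a sketch; the function name and post IDs are made up for illustration, not taken from my scraper):

# Sketch of the relative renumbering: whatever the site-wide post IDs are,
# posts within a thread become 0 (the OP), 1, 2, ... in posting order.
# The id_map also lets in-thread references be rewritten to relative form.
def renumber(posts):
    """posts: list of (absolute_id, name, body) tuples in posting order."""
    id_map = {pid: i for i, (pid, _, _) in enumerate(posts)}
    lines = []
    for i, (_, name, body) in enumerate(posts):
        lines.append(f"/-- {i} {name}")
        lines.append(body)
    return id_map, "\n".join(lines)

id_map, text = renumber([
    (1250001, "Meta", "/--- Chinese rice is not plastic rice."),
    (1250009, "Matthew R. Miller", "i eat shit lol"),
])
print(text)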

If anyone is interested in either of these datasets, find a way for me to share it with you, and I'll send it over.

I'm especially interested if anyone uses this for training a neural net. Let me know what you find and how it goes. I'm also happy to talk shop about what I've tried and what I may eventually try.

Anonymous B joined in and replied with this 2 years ago, 18 hours later #1,253,672

You're redoing a lot of work: the text datasets TTEH used to fine-tune the LLaMA models were reformatted in a similar way. The timestamps were removed, but the names were kept in the same order.

If you are going to do this right, can you find a way to weight the MRM posts less? As it stands, the model trains on each poster in proportion to how often they post, so it doesn't give an equal blend of all posters. Something like the sketch below would do it.
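
For example, inverse-frequency example weights (a rough sketch, not tied to any particular training library; the names are made up):

# Sketch of inverse-frequency weighting: each training example (one post)
# gets weight 1 / (number of posts by its author), so every poster
# contributes equal total weight no matter how much they post. During
# training, multiply each example's loss by its weight.
from collections import Counter

def poster_weights(posters):
    """posters: one author name per training example."""
    counts = Counter(posters)
    return [1.0 / counts[name] for name in posters]

print(poster_weights(["MRM", "MRM", "MRM", "Marsden", "Iris"]))
# [0.333..., 0.333..., 0.333..., 1.0, 1.0]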