Topic: regex regular expressions for Firefox extension History AutoDelete Rebooted

boof started this discussion 3 months ago #133,261

This thread is referenced in https://minichan.net/topic/132404#reply_1420524 where the Firefox extension History AutoDelete Rebooted is described. Here, details are given about making regular expression patterns to match URLs in Firefox browser history for deletion. Regular expressions, also known as regex, are a method for matching and locating text, and are used in many different situations. They have a particular syntax whereby special metacharacters represent things other than the literal characters that they appear to be. This is a list of the metacharacters that are used in these notes. More details about each are given a little later.

^	anchor:  start of string or line
[^]	negated character class
$	anchor:  end of string or line
.	any character except line break
\	escape, apply to metacharacters to make them literal, or apply to regular characters for
	special meanings.  e.g. \.  \?  \/  \n
()	capturing group
?	quantifier:  zero or one
+	quantifier:  one or more 
{2,}	quantifier:  two or more
*	quantifier:  zero or more
(|)	choice (alternation, set union) operator, OR
(?!)	negative lookahead
(?=)	positive lookahead
(?i)	inline modifier:  activate case insensitive mode
(?-i)	inline modifier:  switch off case insensitive mode

(Edited 1 minute later.)

Anonymous B joined in and replied with this 3 months ago, 40 minutes later[^] [v] #1,420,533

i strongly prefer a browser thats functional right off the executable.

like... the more you tweak a browser the more it stands out, and the more it stands out, the easier the glowniggers can git you for posting a meme they dont like.

Anonymous C joined in and replied with this 3 months ago, 11 minutes later, 52 minutes after the original post[^] [v] #1,420,536

Private browsing mode is easier

Oatmeal Fucker !BYUc1TwJMU joined in and replied with this 3 months ago, 28 minutes later, 1 hour after the original post[^] [v] #1,420,540

I use Google Chrome with no addons

Anonymous E joined in and replied with this 3 months ago, 4 hours later, 6 hours after the original post[^] [v] #1,420,560

@1,420,533 (B)

> i strongly prefer a browser thats functional right off the executable.
>
> like... the more you tweak a browser the more it stands out, and the more it stands out, the easier the glowniggers can git you for posting a meme they dont like.

Firefox’s appeal is in customization and add-ons. It’s the browser you tweak, not the one you’re fed.

Anonymous B replied with this 3 months ago, 12 hours later, 18 hours after the original post[^] [v] #1,420,625

@previous (E)
firefoxes appeal is that its a shit browser that people cope about because its not chrome

boof (OP) replied with this 3 months ago, 5 hours later, 23 hours after the original post[^] [v] #1,420,663

The above is not a complete list of metacharacters for regular expressions in general. For more information about regular expressions and their syntax, refer to these sites:
https://www.regular-expressions.info/
https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
https://www.rexegg.com
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions
https://bitlaunch.io/blog/how-to-use-the-grep-command-in-linux/ (grep is a command-line utility for searching text for lines that match a regular expression)
https://en.wikipedia.org/wiki/Regular_expression
https://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions
https://www.youtube.com/watch?v=0LKdKixl5Ug&list=PL55RiY5tL51ryV3MhCbH8bLl7O_RZGUUE
https://www.youtube.com/watch?v=kX3WpzLRiW4&list=PLLdz3KlabJv1UVT8cZ-h4iI7fRqC_rArb

These sites are useful for trying regular expression patterns with text that you can paste into the web pages:
https://regex101.com
https://spannbaueradam.shinyapps.io/r_regex_tester
read https://adamspannbauer.github.io/2018/01/16/r-regex-tester-shiny-app/ for usage notes.
https://www.regextester.com

This is software:
https://github.com/nedrysoft/regex101/blob/master/README.md (Linux, Windows, Mac)
https://sourceforge.net/projects/simregextester (Windows)
https://sourceforge.net/projects/regextester (Windows)
https://sourceforge.net/projects/regexcreator (Windows, Linux, Mac, BSD, ChromeOS)
https://sourceforge.net/projects/regexlab-net (Windows)

boof (OP) double-posted this 3 months ago, 1 day later, 1 day after the original post[^] [v] #1,420,782

Here is more detail about each of the listed metacharacters from earlier:

^
The caret symbol indicates the beginning of a string or line of text. For instance, ^abc would locate the start of this text if it began a line: abcdef, while it would not match anything in a line whose text began as: ghabc.

[^]
The caret as the first character within square brackets indicates any other character (including a space) than those listed afterward. For example, [^589hk] would match any character other than 5, 8, 9, h, and k. In the text 35hmp, the matches would be the 3, m, and p. Also, this[^17puv]that matches thismthat and this9that but not this1that or thisuthat.

$
The dollar sign indicates the end of a string or line of text. For instance, flop$ would locate the "flop" part of this text if it was at the end of a line: 34-vertflop, while it would not match anything in a line whose text ended with: moflopel844.
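
For anyone who wants to try the anchors and the negated class outside the extension, here is a minimal sketch. Using Python's re module is my assumption; the extension runs its own regex engine inside the browser, so treat this only as a way to check the examples above.

```python
import re

# re.search looks for the pattern anywhere in the text, so the anchors do the work.
starts_ok = bool(re.search(r"^abc", "abcdef"))        # abc begins the text
starts_no = bool(re.search(r"^abc", "ghabc"))         # abc is there, but not at the start
ends_ok = bool(re.search(r"flop$", "34-vertflop"))    # flop ends the text
ends_no = bool(re.search(r"flop$", "moflopel844"))    # flop is there, but not at the end

# A negated class matches one character that is not in the list.
singles = re.findall(r"[^589hk]", "35hmp")
between = [t for t in ["thismthat", "this9that", "this1that", "thisuthat"]
           if re.fullmatch(r"this[^17puv]that", t)]

print(starts_ok, starts_no, ends_ok, ends_no)  # True False True False
print(singles)                                 # ['3', 'm', 'p']
print(between)                                 # ['thismthat', 'this9that']
```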

boof (OP) triple-posted this 3 months ago, 1 day later, 3 days after the original post[^] [v] #1,420,941

.
The dot is the wildcard character, and indicates any character (including a space) except for newline (anything that serves as a line break), which is a non-displaying signal to continue text on the next line. Various implementations of regular expressions can be set to a mode whereby the dot can also match a newline character. So, k.t can match k3t, kpt, k#t, k t, and so on.

\
The backslash is the escape character, meaning that it reverts metacharacters to their literal use. For instance, \. is for matching the dot, so that it is not taken as a wildcard character. For example, youtube\.com matches youtube.com, and not say, youtube\tcom. In the context of within square brackets, . is taken as a literal dot, and the backslash is not necessary (nor does it have any effect of changing that usage, so . and \. will be understood as literal . regardless). Patterns that will be used later in these notes include escaping the question mark and the forward slash, i.e. \? and \/. The backslash can also convert ordinarily literal characters to some special meaning. In these notes, \n will be used to indicate an unseen newline character.
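
The difference between the wildcard dot and the escaped dot can be checked with a short sketch like this one. Python's re module is an assumption on my part (the extension uses its own engine), and youtubeXcom is just a made-up test string.

```python
import re

# Unescaped dot: any single character except a line break.
dot_hits = [s for s in ["k3t", "kpt", "k#t", "k t", "k\nt"] if re.fullmatch(r"k.t", s)]
print(dot_hits)  # ['k3t', 'kpt', 'k#t', 'k t']

# Escaped dot: only a literal dot will do.
escaped_real = bool(re.fullmatch(r"youtube\.com", "youtube.com"))
escaped_fake = bool(re.fullmatch(r"youtube\.com", "youtubeXcom"))
unescaped_fake = bool(re.fullmatch(r"youtube.com", "youtubeXcom"))
print(escaped_real, escaped_fake, unescaped_fake)  # True False True

# Inside square brackets the dot is already literal.
bracket_dot = bool(re.fullmatch(r"a[.]b", "a.b"))
bracket_other = bool(re.fullmatch(r"a[.]b", "axb"))
print(bracket_dot, bracket_other)  # True False
```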

()
The round brackets (parentheses) cause their contents to be considered as a group. One reason to do that would be to be able to reference a captured group later in a pattern. Another reason would be for application of quantifier metacharacters, as described next in these notes.

boof (OP) quadruple-posted this 3 months ago, 23 hours later, 4 days after the original post[^] [v] #1,421,119

?
The question mark indicates that the prior character or captured group can be at that place in the pattern either zero or one time. For instance, be?t matches bt and bet. Likewise, art(perk)?ow matches artow and artperkow. The pattern

122[^8m]?

matches 122 and 122R but not 1228 or 122m. Also, .?%2 matches %2 and h%2, among other possibilities.

+
The plus sign indicates that the prior character or captured group can be at that place in the pattern at least once. For instance, ca+t matches cat, caat, caaat, and so on. Likewise, 124(-01)+ matches 124-01, 124-01-01, and so on. Also, [^aeiou]+43 matches c43, 6843, and y7k43, but not e43 or ia43. The pattern v3#.+ matches v3#9, v3#e0, and v3#11p, among others.

{2,}
The curly brackets (braces) with a 2 and then a comma between them indicates that the prior character or captured group can be at that place in the pattern two or more times. For instance, wap{2,}4 matches wapp4, wappp4, and so on. Likewise, (000-){2,}500 matches 000-000-500 and 000-000-000-500, and so on. The pattern EE[^uvUV]{2,}NN matches EE7fNN and EEWMiNN, but not EENN or EEUVrrNN, among others. Also, m.{2,}9 matches m3g9, mWPe9, m10fA9, and so forth.

*
The asterisk indicates that the prior character or captured group can be at that place in the pattern zero or more times. For instance, j4* matches j, j4, j44, and so on. Likewise, (rat)*32 matches 32, rat32, ratrat32, and so on. Also, G-[^@#&]* matches G-, G-B, and G-8c, but not G-@2, G-#m, or G-4&k, among others. The pattern .*bun matches bun, Pbun, #xbun, and so on.
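
The four quantifiers can be compared side by side with a rough sketch like the following. Python's re module is assumed here, and full() is a helper name I made up; it keeps only the texts that a pattern matches from the first character to the last, which is how the examples above are meant.

```python
import re

def full(pattern, texts):
    """Return the texts that the pattern matches from first to last character."""
    return [t for t in texts if re.fullmatch(pattern, t)]

zero_or_one = full(r"be?t", ["bt", "bet", "beet"])
one_or_more = full(r"ca+t", ["ct", "cat", "caaat"])
two_or_more = full(r"wap{2,}4", ["wap4", "wapp4", "wappp4"])
zero_or_more = full(r"(rat)*32", ["32", "rat32", "ratrat32", "ra32"])
negated_plus = full(r"[^aeiou]+43", ["c43", "6843", "y7k43", "e43", "ia43"])

print(zero_or_one)   # ['bt', 'bet']
print(one_or_more)   # ['cat', 'caaat']
print(two_or_more)   # ['wapp4', 'wappp4']
print(zero_or_more)  # ['32', 'rat32', 'ratrat32']
print(negated_plus)  # ['c43', '6843', 'y7k43']
```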

(Edited 4 minutes later.)

boof (OP) quintuple-posted this 3 months ago, 6 minutes later, 4 days after the original post[^] [v] #1,421,120

christ, no amount of editing within 3 minutes was enough to stop the fuckups in the previous reply

boof (OP) sextuple-posted this 3 months ago, 23 hours later, 4 days after the original post[^] [v] #1,421,270

OK, replacing the problematic at symbol with a percent, here's the sentence from the ? paragraph: Also, .?%2 matches %2 and h%2 among other possibilities.

(|)
The pipe within round brackets (parentheses) is for placing two alternative patterns to match, one on each side of the pipe. The pipe is effectively a logical OR operator. More than one pipe can be used so that three or more alternatives can be listed. For example, (go|stop) matches go and stop. Likewise, (one|two|three) matches one, two, and three. Also, h(a|i|o|u)t matches hat, hit, hot, and hut. The pattern b(3|54)+ matches b3, b54, b33, b354, b35454, b54354, b54333 and so on.

(?=)
The question mark and equals sign within round brackets (parentheses) is for performing a positive lookahead, which requires matching whatever appears to the right of a position, but without capturing that text. For example, #43(?=.*%) when applied to #43E78k%556p matches only the #43 part of that text, while #43(?=.*%).*p matches the entire #43E78k%556p text because of the wildcard and letter p that are part of the search pattern after the lookahead.

(?!)
The question mark and exclamation symbol within round brackets (parentheses) is for performing a negative lookahead, which requires not matching whatever appears to the right of a position, but without capturing that text. For example, nab(?!56) when applied to nab1154YC matches only the nab part of the text, while nab(?!56).{2,}YC matches the entire nab1154YC text because of the wildcard and letters YC that are part of the search pattern.
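
Alternation and the two lookaheads are easier to see when the matched text is printed. This is only a sketch using Python's re module (my assumption, not what the extension runs); the test strings are the ones from the paragraphs above plus a few made-up rejects.

```python
import re

# Alternation with a quantifier: any run of 3s and 54s after the b.
alt_hits = [t for t in ["b3", "b54", "b354", "b54333", "b5", "b345"]
            if re.fullmatch(r"b(3|54)+", t)]
print(alt_hits)  # ['b3', 'b54', 'b354', 'b54333']

# Positive lookahead: a % must follow somewhere, but it is not part of the match.
text = "#43E78k%556p"
ahead_only = re.search(r"#43(?=.*%)", text).group()
ahead_more = re.search(r"#43(?=.*%).*p", text).group()
print(ahead_only, ahead_more)  # #43 #43E78k%556p

# Negative lookahead: 56 must not come right after nab.
neg_only = re.search(r"nab(?!56)", "nab1154YC").group()
neg_more = re.search(r"nab(?!56).{2,}YC", "nab1154YC").group()
neg_blocked = re.search(r"nab(?!56)", "nab56YC")
print(neg_only, neg_more, neg_blocked)  # nab nab1154YC None
```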

Oatmeal Fucker !BYUc1TwJMU replied with this 3 months ago, 7 minutes later, 4 days after the original post[^] [v] #1,421,271

@previous (boof)

This is all too technical for me. When can I look at the cartoon porno?

boof (OP) replied with this 3 months ago, 23 hours later, 5 days after the original post[^] [v] #1,421,527

(?i)
The question mark and letter i within round brackets (parentheses) is for activating the case-insensitive mode for matching letters. In that mode, unlike the default case-sensitive mode, lowercase and uppercase versions of the same letters are not distinguished from each other. For example, (?i)f8-ep matches f8-ep, F8-EP, f8-EP, f8-eP, f8-Ep, F8-ep, F8-Ep, and F8-eP.

(?-i)
The question mark, hyphen, and letter i within round brackets (parentheses) is for deactivating the case-insensitive mode for matching letters. The text between (?i) and (?-i) matches regardless of case, while letter text outside of those modifiers has to match whatever case is used in the pattern. For example, (?i)k(?-i)ace matches kace and Kace, but not kAce for instance.
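
A sketch of the same idea in Python, with a loud caveat: Python's re module is my assumption for testing, and as far as I know it does not accept a standalone (?-i) in the middle of a pattern. The scoped group (?i:...) is swapped in below as a stand-in for the (?i)...(?-i) pair, so this is not the exact syntax from these notes.

```python
import re

# (?i) at the very start of a pattern makes the whole pattern case-insensitive.
variants = ["f8-ep", "F8-EP", "f8-eP", "F8-Ep"]
all_match = all(re.fullmatch(r"(?i)f8-ep", v) for v in variants)
print(all_match)  # True

# Scoped stand-in for (?i)k(?-i)ace: only the k ignores case.
scoped = [t for t in ["kace", "Kace", "kAce", "KACE"] if re.fullmatch(r"(?i:k)ace", t)]
print(scoped)  # ['kace', 'Kace']
```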

boof (OP) double-posted this 3 months ago, 23 hours later, 6 days after the original post[^] [v] #1,421,639

patterns for matching a specific domain:

Here is some detail on using regular expressions to specify URL patterns to match for deletion:
The regular expression URL patterns require the start-of-line character ^ and the end-of-line character $. Dots are used to indicate a wildcard, and following a dot with an asterisk indicates that there can be 0 or more possible characters. Different pages at websites extend beyond the main part of the URL (the domain), and so a dot with an asterisk should appear after the domain in the pattern to match all such pages' URLs. To indicate an actual dot, as in .com, the escape character \ must precede the dot. For example: ^https://apnews\.com.*$

One thing to keep in mind is that History AutoDelete Rebooted does not properly handle www. in patterns that you enter. For whatever reason, it was programmed to treat URLs starting with http://www. or https://www. as if those strings of characters were not there. For instance, https://www.youtube.com is seen by the extension as youtube.com, and therefore entering ^https://www\.youtube\.com.*$ as a pattern to match will not work. Use ^youtube\.com.*$ instead.

Another thing to keep in mind when you want to match all URLs of a specific domain is that some sites prefix the domain with a character string and a dot to indicate a subdomain. For instance, https://boards.straightdope.com is part of the site https://straightdope.com (technically the com part is called a top-level domain and straightdope.com is a second-level domain, but since the minimum form is straightdope.com, I mean that minimum base form when I refer to a site's domain).

boof (OP) triple-posted this 3 months ago, 1 day later, 1 week after the original post[^] [v] #1,421,706

If we want a pattern to include all subdomains of straightdope.com, then we need a wildcard indicator and a dot to the left of the domain part. To match regardless of having a subdomain or not, the wildcard and dot are put within brackets to be handled as a group, and an asterisk is placed afterward to indicate that the group could be there zero or more times. The wildcard needs to indicate any character that is not itself a dot, because the dot part of the subdomain is already specified afterward. To indicate prohibition of characters, the characters are placed following a single caret ^, all within square brackets. This prohibition pattern acts as a wildcard. The wildcard has to represent a string that is at least one character long, and to indicate that, we place a plus sign afterward. So, we would enter: ^https://([^\.]+\.)*straightdope\.com.*$

Some sites might use the prefix http://, without the letter s seen in https://. To match for both possibilities, put a ? following the s, which indicates that the s can appear zero or once. So, the pattern is now: ^https?://([^\.]+\.)*straightdope\.com.*$

The straightdope site happens to be accessible with or without a www. part to the URL (not all sites are like that though). To match for both possibilities, we have to remember that the extension drops https://www. from its representation of URLs, so we need to place https?:// within brackets to indicate that it is to be treated as a group of characters, and a ? is placed afterwards to indicate that the group can occur zero times or once. The pattern is: ^(https?://)?([^\.]+\.)*straightdope\.com.*$
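
The three stages of the straightdope pattern can be compared with a small sketch. Python's re module is an assumption for testing purposes; the variable names are made up, and the last URL is only there to show that a lookalike domain is rejected.

```python
import re

step1 = r"^https://([^\.]+\.)*straightdope\.com.*$"      # https only, any subdomains
step2 = r"^https?://([^\.]+\.)*straightdope\.com.*$"     # http or https
step3 = r"^(https?://)?([^\.]+\.)*straightdope\.com.*$"  # the scheme may be absent

urls = [
    "https://boards.straightdope.com",
    "http://straightdope.com",
    "straightdope.com",               # how a www. URL is said to look to the extension
    "https://catstraightdope.com",    # a different domain that ends the same way
]

# For each stage, list the positions (1-based) of the URLs it matches.
matched = {name: [i for i, u in enumerate(urls, 1) if re.search(p, u)]
           for name, p in [("step1", step1), ("step2", step2), ("step3", step3)]}
print(matched)  # {'step1': [1], 'step2': [1, 2], 'step3': [1, 2, 3]}
```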

boof (OP) quadruple-posted this 3 months ago, 1 day later, 1 week after the original post[^] [v] #1,421,778

Note that we would not try to simplify the pattern to ^.*straightdope\.com.*$, because that would allow other domains that happen to end with the same characters, e.g. catstraightdope.com.
A good place to test regular expression patterns is https://regex101.com. When using regex101.com to test patterns with lists of URLs that are each on a separate line, we also need to indicate that the prohibitory wildcard is not a newline indicator, because we could get false matches that span longer than one line. Also, we need to escape each / character with a backslash. History AutoDelete Rebooted does not have those requirements. Copy and paste ^(https?:\/\/)?([^\.\n]+\.)*straightdope\.com.*$ into the area titled Regular Expression.

Copy and paste these lines into the Test String area:
https://boards.straightdope.com
https://straightdope.com
https://abcstraightdope.com
https://abc.defstraightdope.com
http://abc.def.straightdope.com
straightdope.com
defstraightdope.com
abc.def.straightdope.com

The lines that do not have https:// represent URLs that show https://www. when seen in the address bar, but are not handled by History AutoDelete Rebooted that way. The regex101 page should show the first, second, fifth, sixth, and eighth lines with a highlighted background to indicate that they match the given pattern.
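
The same test can be reproduced offline with a sketch like this. Python's re module is my assumption; re.MULTILINE makes ^ and $ work at every line of the pasted list. The second pattern, named loose here, leaves out the \n to show the false matching across lines that was mentioned above.

```python
import re

test_string = """\
https://boards.straightdope.com
https://straightdope.com
https://abcstraightdope.com
https://abc.defstraightdope.com
http://abc.def.straightdope.com
straightdope.com
defstraightdope.com
abc.def.straightdope.com"""

strict = re.compile(r"^(https?:\/\/)?([^\.\n]+\.)*straightdope\.com.*$", re.MULTILINE)
loose = re.compile(r"^(https?:\/\/)?([^\.]+\.)*straightdope\.com.*$", re.MULTILINE)

# Collect every matched line, then report which line numbers they were.
hits = {m.group(0) for m in strict.finditer(test_string)}
matched_lines = [n for n, line in enumerate(test_string.splitlines(), 1) if line in hits]
print(matched_lines)  # [1, 2, 5, 6, 8]

# Without \n in the class, a single match can run across line breaks.
spans_lines = any("\n" in m.group(0) for m in loose.finditer(test_string))
print(spans_lines)  # True
```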

Here is a general pattern to match the URLs that do not have a www. part:
^https?://.*$

Here is a general pattern to match the URLs with www. (and so appear to History AutoDelete Rebooted as having no https://www. as described earlier in these notes):
^(?!https?://).*$

boof (OP) quintuple-posted this 3 months ago, 1 day later, 1 week after the original post[^] [v] #1,421,933

If trying for an exact match of a URL that does not continue beyond the domain part, then keep in mind that the browser history holds a forward slash / afterwards even if the address bar appears to lack it. If you copy https://www.youtube.com seen in the address bar and paste elsewhere, you will see https://www.youtube.com/, for instance. So, ^youtube\.com$ will not work to match the URL, as ^youtube\.com/$ is required. If you'd like to be sure about matching regardless of / being there or not, use a question mark ? afterward: ^youtube\.com/?$
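
A quick sketch of the trailing slash issue, assuming Python's re module for testing. The string stored stands for how the extension is described as seeing https://www.youtube.com/, and youtube.com/feed is a made-up longer URL.

```python
import re

stored = "youtube.com/"

no_slash = bool(re.search(r"^youtube\.com$", stored))     # the pattern forgets the slash
with_slash = bool(re.search(r"^youtube\.com/$", stored))  # the pattern requires the slash
print(no_slash, with_slash)  # False True

# The optional slash accepts both forms but nothing longer.
optional = [u for u in ["youtube.com", "youtube.com/", "youtube.com/feed"]
            if re.search(r"^youtube\.com/?$", u)]
print(optional)  # ['youtube.com', 'youtube.com/']
```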

Here are typical patterns to match for deletion, using the youtube.com site as an example:
match URLs that are based upon https://www.youtube.com:
^youtube\.com/.*$

match URLs that are not based upon https://www.youtube.com:
^(?!youtube\.com/.*).*$

match URLs that are based upon https://www.youtube.com that go beyond the .com part:
^youtube\.com/.+$

match URLs that have the form https://www.youtube.com/watch?v=[whatever]
(e.g. https://www.youtube.com/watch?v=7Qqmr6IiFLE):
^youtube\.com/watch\?v=.+$

match URLs that are based upon https://www.youtube.com, except for those that have the form https://www.youtube.com/watch?v=[whatever]:
^youtube\.com(?!/watch\?v=.+).*$

match URLs that do not have the form https://www.youtube.com/watch?v=[whatever], regardless of being based upon youtube.com:
^(?!youtube\.com/watch\?v=.+).*$

The first of the six patterns above will make History AutoDelete Rebooted blacklist all URLs with youtube.com as the base domain. The second pattern effectively blacklists all URLs except for those that have youtube.com as the base domain, effectively whitelisting that site. The third pattern blacklists all URLs that go beyond the .com part of youtube.com. The fourth pattern selectively blacklists only those URLs that are like https://www.youtube.com/watch?v=7Qqmr6IiFLE, leaving other URLs based upon youtube.com alone. The fifth pattern blacklists all URLs based upon youtube.com except for those that are like https://www.youtube.com/watch?v=7Qqmr6IiFLE. The sixth pattern is like the fifth pattern, but it also effectively blacklists all other URL domains regardless of being based upon youtube.com. The effect is to whitelist the https://www.youtube.com/watch?v=[whatever] form specifically and blacklist everything else.
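
The six patterns can be run against a tiny made-up history to see which entries each one would pick for deletion. This is a sketch only: Python's re module is assumed, as_seen_by_extension() is a hypothetical helper that imitates the www. handling described earlier, and the feed/history and apnews entries are invented samples.

```python
import re

def as_seen_by_extension(url):
    """Hypothetical model of the described behaviour: drop a leading http(s)://www."""
    return re.sub(r"^https?://www\.", "", url)

history = [
    "https://www.youtube.com/",
    "https://www.youtube.com/watch?v=7Qqmr6IiFLE",
    "https://www.youtube.com/feed/history",   # made-up sample path
    "https://apnews.com/",
]
seen = [as_seen_by_extension(u) for u in history]

patterns = {
    1: r"^youtube\.com/.*$",
    2: r"^(?!youtube\.com/.*).*$",
    3: r"^youtube\.com/.+$",
    4: r"^youtube\.com/watch\?v=.+$",
    5: r"^youtube\.com(?!/watch\?v=.+).*$",
    6: r"^(?!youtube\.com/watch\?v=.+).*$",
}

# For each pattern, the positions (1-based) of the history entries it matches.
matched = {n: [i for i, u in enumerate(seen, 1) if re.search(p, u)]
           for n, p in patterns.items()}
print(matched)
# {1: [1, 2, 3], 2: [4], 3: [2, 3], 4: [2], 5: [1, 3], 6: [1, 3, 4]}
```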

boof (OP) sextuple-posted this 3 months ago, 23 hours later, 1 week after the original post[^] [v] #1,422,112

patterns for matching more than one specific domain:

blacklisting:
While unnecessary, you could blacklist more than one domain pattern while using a single line regular expression, with use of the | character which has the meaning of an OR symbol, and round brackets (parentheses). For instance, the following pattern would match all URLs built around the domains youtube.com, reddit.com, phys.org, and archive.org:
^.*(youtube\.com|reddit\.com|phys\.org|archive\.org).*$

However, that pattern also allows matching other domains such as shreddit.com or starchive.org, because of the wildcard to the left of the list of domains. To prevent such unwanted matching, the pattern has to be made more complicated, as was described earlier in these notes about patterns for matching a specific domain:
^(https?://)?([^\.]+\.)*(youtube\.com|reddit\.com|phys\.org|archive\.org).*
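
The difference between the two multi-domain patterns shows up on lookalike domains. Below is a sketch assuming Python's re module; the URLs are made up, and the stricter pattern is written with a trailing $.

```python
import re

domains = r"(youtube\.com|reddit\.com|phys\.org|archive\.org)"
naive = r"^.*" + domains + r".*$"
strict = r"^(https?://)?([^\.]+\.)*" + domains + r".*$"

urls = [
    "https://old.reddit.com/r/regex",
    "https://shreddit.com/",
    "https://starchive.org/",
    "phys.org/news",
]

naive_hits = [u for u in urls if re.search(naive, u)]
strict_hits = [u for u in urls if re.search(strict, u)]
print(len(naive_hits))  # 4, the lookalikes are caught too
print(strict_hits)      # ['https://old.reddit.com/r/regex', 'phys.org/news']
```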

(Edited 54 seconds later.)

boof (OP) septuple-posted this 2 months ago, 22 hours later, 1 week after the original post[^] [v] #1,422,329

whoops, put a $ at the end of the last regular expression line above.

patterns for matching more than one specific domain:

whitelisting:
Ordinary usage of the History AutoDelete Rebooted extension is for blacklisting. Site domain blacklist patterns are easily entered one at a time, and there is no chance of the patterns conflicting. No whitelisting function is provided, but there is a workaround using negative lookaheads. Separate whitelist expressions with negative lookaheads would cancel each other out, however, as every expression would block everything except for the one domain in that expression. To effectively whitelist more than one domain pattern, list them as described earlier with use of the | character within brackets and apply a negative lookahead:
^(?!(https?:\/\/)?([^\.\n]+\.)*(youtube\.com|reddit\.com|phys\.org|archive\.org).*).*$

Notice that the pattern is the same as the single-line blacklisting pattern seen earlier, but other than the ^ and $ characters which remain at the start and end, it is contained within a set of brackets with ?! following the opening bracket and another wildcard .* instance following the closing bracket.
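
Here is a sketch of the whitelist workaround in action, assuming Python's re module. A URL that the pattern matches would be deleted, so the whitelisted domains are the ones that end up in the kept list. The URLs are made-up samples.

```python
import re

whitelist = (r"^(?!(https?:\/\/)?([^\.\n]+\.)*"
             r"(youtube\.com|reddit\.com|phys\.org|archive\.org).*).*$")

urls = [
    "https://old.reddit.com/r/regex",
    "https://apnews.com/",
    "phys.org/news",
    "https://shreddit.com/",
]

deleted = [u for u in urls if re.search(whitelist, u)]
kept = [u for u in urls if not re.search(whitelist, u)]
print(deleted)  # ['https://apnews.com/', 'https://shreddit.com/']
print(kept)     # ['https://old.reddit.com/r/regex', 'phys.org/news']
```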

Z Ц ᄃ ᄃ joined in and replied with this 2 months ago, 3 minutes later, 1 week after the original post[^] [v] #1,422,330

ZimbabweBoy is behind the counter.

boof (OP) replied with this 2 months ago, 23 hours later, 1 week after the original post[^] [v] #1,422,544

Here is a set of URL lines for pattern testing in https://regex101.com or other such place:

thedomain.com
thedomain.com/
http://thedomain.com
http://thedomain.com/
https://thedomain.com
https://thedomain.com/
sub1.thedomain.com
sub1.thedomain.com/
http://sub1.thedomain.com
http://sub1.thedomain.com/
https://sub1.thedomain.com
https://sub1.thedomain.com/
sub2.sub1.thedomain.com
sub2.sub1.thedomain.com/
http://sub2.sub1.thedomain.com
http://sub2.sub1.thedomain.com/
https://sub2.sub1.thedomain.com
https://sub2.sub1.thedomain.com/
thedomain.com/more
thedomain.com/more/
thedomain.com/more.fold/
thedomain.com/more.ext
http://thedomain.com/more
http://thedomain.com/more/
http://thedomain.com/more.fold/
http://thedomain.com/more.ext
https://thedomain.com/more
https://thedomain.com/more/
https://thedomain.com/more.fold/
https://thedomain.com/more.ext
sub1.thedomain.com/more
sub1.thedomain.com/more/
sub1.thedomain.com/more.fold/
sub1.thedomain.com/more.ext
http://sub1.thedomain.com/more
http://sub1.thedomain.com/more/
http://sub1.thedomain.com/more.fold/
http://sub1.thedomain.com/more.ext
https://sub1.thedomain.com/more
https://sub1.thedomain.com/more/
https://sub1.thedomain.com/more.fold/
https://sub1.thedomain.com/more.ext
sub2.sub1.thedomain.com/more
sub2.sub1.thedomain.com/more/
sub2.sub1.thedomain.com/more.fold/
sub2.sub1.thedomain.com/more.ext
http://sub2.sub1.thedomain.com/more
http://sub2.sub1.thedomain.com/more/
http://sub2.sub1.thedomain.com/more.fold/
http://sub2.sub1.thedomain.com/more.ext
https://sub2.sub1.thedomain.com/more
https://sub2.sub1.thedomain.com/more/
https://sub2.sub1.thedomain.com/more.fold/
https://sub2.sub1.thedomain.com/more.ext
thedomain.com/some/more
thedomain.com/some.fold/more
thedomain.com/some/more/
thedomain.com/some.fold/more/
thedomain.com/some/more.fold/
thedomain.com/some.fold/more.fold/
thedomain.com/some/more.ext
thedomain.com/some.fold/more.ext
http://thedomain.com/some/more
http://thedomain.com/some.fold/more
http://thedomain.com/some/more/
http://thedomain.com/some.fold/more/
http://thedomain.com/some/more.fold/
http://thedomain.com/some.fold/more.fold/
http://thedomain.com/some/more.ext
http://thedomain.com/some.fold/more.ext
https://thedomain.com/some/more
https://thedomain.com/some.fold/more
https://thedomain.com/some/more/
https://thedomain.com/some.fold/more/
https://thedomain.com/some/more.fold/
https://thedomain.com/some.fold/more.fold/
https://thedomain.com/some/more.ext
https://thedomain.com/some.fold/more.ext
sub1.thedomain.com/some/more
sub1.thedomain.com/some.fold/more
sub1.thedomain.com/some/more/
sub1.thedomain.com/some.fold/more/
sub1.thedomain.com/some/more.fold/
sub1.thedomain.com/some.fold/more.fold/
sub1.thedomain.com/some/more.ext
sub1.thedomain.com/some.fold/more.ext
http://sub1.thedomain.com/some/more
http://sub1.thedomain.com/some.fold/more
http://sub1.thedomain.com/some/more/
http://sub1.thedomain.com/some.fold/more/
http://sub1.thedomain.com/some/more.fold/
http://sub1.thedomain.com/some.fold/more.fold/
http://sub1.thedomain.com/some/more.ext
http://sub1.thedomain.com/some.fold/more.ext
https://sub1.thedomain.com/some/more
https://sub1.thedomain.com/some.fold/more
https://sub1.thedomain.com/some/more/
https://sub1.thedomain.com/some.fold/more/
https://sub1.thedomain.com/some/more.fold/
https://sub1.thedomain.com/some.fold/more.fold/
https://sub1.thedomain.com/some/more.ext
https://sub1.thedomain.com/some.fold/more.ext
sub2.sub1.thedomain.com/some/more
sub2.sub1.thedomain.com/some.fold/more
sub2.sub1.thedomain.com/some/more/
sub2.sub1.thedomain.com/some.fold/more/
sub2.sub1.thedomain.com/some/more.fold/
sub2.sub1.thedomain.com/some.fold/more.fold/
sub2.sub1.thedomain.com/some/more.ext
sub2.sub1.thedomain.com/some.fold/more.ext
http://sub2.sub1.thedomain.com/some/more
http://sub2.sub1.thedomain.com/some.fold/more
http://sub2.sub1.thedomain.com/some/more/
http://sub2.sub1.thedomain.com/some.fold/more/
http://sub2.sub1.thedomain.com/some/more.fold/
http://sub2.sub1.thedomain.com/some.fold/more.fold/
http://sub2.sub1.thedomain.com/some/more.ext
http://sub2.sub1.thedomain.com/some.fold/more.ext
https://sub2.sub1.thedomain.com/some/more
https://sub2.sub1.thedomain.com/some.fold/more
https://sub2.sub1.thedomain.com/some/more/
https://sub2.sub1.thedomain.com/some.fold/more/
https://sub2.sub1.thedomain.com/some/more.fold/
https://sub2.sub1.thedomain.com/some.fold/more.fold/
https://sub2.sub1.thedomain.com/some/more.ext
https://sub2.sub1.thedomain.com/some.fold/more.ext

boof (OP) double-posted this 2 months ago, 1 day later, 1 week after the original post[^] [v] #1,422,751

In the URL patterns that follow in these notes, thedomain.com stands in for any particular desired base domain pattern such as youtube.com. It can be substituted with another domain as desired, or with more than one domain as in (thedomain\.com|youtube\.com|reddit\.com|phys\.org|archive\.org). Recall that for use at regex101.com, patterns with the metacharacters [^] risk causing unwanted matching beyond single lines unless \n is included within the square brackets. Also, regex101.com requires that literal forward slashes be escaped with the backslash, as \/.

Many of the patterns here may seem to have limited usefulness or interest, but are included because they were good for the exercise of making regular expressions. The patterns given here are not necessarily the only ones possible that work, and could be replaced with simpler versions.

matches using a specified domain:

matches involving any additional subdomains to the base domain pattern or not:
match URLs that have the given domain pattern but only if there are no additional subdomains
^(https?://)?thedomain\.com.*$
match URLs that have the given domain pattern but only if there is at least one additional subdomain
^(https?://)?([^\./]+\.)+thedomain\.com.*$

match URLs that have the given domain pattern but only if there is at least one additional subdomain, also match URLs with any other base domain
The result would be a history containing only the URLs with no additional subdomains with the base domain, and no URLs that have any other base domain.
^((https?://)?([^\./]+\.)+thedomain\.com.*|(?!(https?://)?([^\./]+\.)*thedomain\.com.*).*)$
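
The three subdomain patterns above can be checked against a few lines from the URL list. This is a sketch assuming Python's re module; https://other.org/page is a made-up URL standing for any other base domain.

```python
import re

no_sub = r"^(https?://)?thedomain\.com.*$"
with_sub = r"^(https?://)?([^\./]+\.)+thedomain\.com.*$"
sub_or_other = (r"^((https?://)?([^\./]+\.)+thedomain\.com.*"
                r"|(?!(https?://)?([^\./]+\.)*thedomain\.com.*).*)$")

sample = [
    "thedomain.com/",
    "https://thedomain.com/more",
    "sub1.thedomain.com/",
    "http://sub2.sub1.thedomain.com/more.ext",
    "https://other.org/page",
]

def hits(pattern):
    """Positions (1-based) of the sample URLs that the pattern matches."""
    return [i for i, u in enumerate(sample, 1) if re.search(pattern, u)]

print(hits(no_sub))        # [1, 2]
print(hits(with_sub))      # [3, 4]
print(hits(sub_or_other))  # [3, 4, 5], so only entries 1 and 2 would stay in history
```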

boof (OP) triple-posted this 2 months ago, 22 hours later, 2 weeks after the original post[^] [v] #1,422,890

matches using a specified domain (continued):

matches involving going beyond the .com/ (or whatever that part is set to be) or not:
match URLs that have the given domain pattern but only if there is nothing more beyond the .com/ (or whatever that part is set to be)
^(https?://)?([^\./]+\.)*thedomain\.com/?$

match URLs that have the given domain pattern but only if there is anything more beyond the .com/ (or whatever that part is set to be)
^(https?://)?([^\./]+\.)*thedomain\.com/.+$

match URLs that have the given domain but only if there is anything beyond the .com/ (or whatever that part is set to be), also match if the URL is based on any other domain
The result would be a history containing only the URLs with the given domain that have nothing beyond the .com/ (or whatever that part is set to be), and no other URLs.
^((https?://)?([^\./]+\.)*thedomain\.com/.+|(?!(https?://)?(.*\.)*thedomain\.com.*).*)$
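
A sketch for checking these three patterns, assuming Python's re module. The sample lines are in the style of the URL list, and https://other.org/ is a made-up stand-in for another domain. Whatever the last pattern does not match is what would be left in history.

```python
import re

bare = r"^(https?://)?([^\./]+\.)*thedomain\.com/?$"
deeper = r"^(https?://)?([^\./]+\.)*thedomain\.com/.+$"
deeper_or_other = (r"^((https?://)?([^\./]+\.)*thedomain\.com/.+"
                   r"|(?!(https?://)?(.*\.)*thedomain\.com.*).*)$")

sample = [
    "thedomain.com",
    "https://sub1.thedomain.com/",
    "thedomain.com/more",
    "http://thedomain.com/some/more.ext",
    "https://other.org/",
]

def hits(pattern):
    """Positions (1-based) of the sample URLs that the pattern matches."""
    return [i for i, u in enumerate(sample, 1) if re.search(pattern, u)]

print(hits(bare))             # [1, 2]
print(hits(deeper))           # [3, 4]
print(hits(deeper_or_other))  # [3, 4, 5]
kept = [u for u in sample if not re.search(deeper_or_other, u)]
print(kept)                   # ['thedomain.com', 'https://sub1.thedomain.com/']
```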

boof (OP) quadruple-posted this 2 months ago, 23 hours later, 2 weeks after the original post[^] [v] #1,423,100

matches using a specified domain (continued):

matches involving additional subdomains or not and going beyond the .com/ (or whatever that part is set to be) or not, considered in the same pattern:
match URLs that have the given domain pattern but only if there are no additional subdomains and also nothing beyond the .com/ (or whatever that part is set to be)
^(https?://)?thedomain\.com/?$

match URLs that have the given domain pattern but only if there are any additional subdomains and/or anything beyond the .com/ (or whatever that part is set to be)
^((https?://)?([^\./]+\.)+thedomain\.com.*|(https?://)?([^\./]+\.)*thedomain\.com/.+)$

match URLs that have the given domain pattern but only if there are any additional subdomains and/or anything beyond the .com/ (or whatever that part is set to be), also match URLs that have any other domain regardless
The result would be a history containing only the URLs with no additional subdomains for the given domain and nothing beyond the .com/ (or whatever that part is set to be), and no other URLs.
^((https?://)?([^\./]+\.)+thedomain\.com.*|(https?://)?([^\./]+\.)*thedomain\.com/.+|(?!(https?://)?([^\./]+\.)*thedomain\.com.*).*)$

boof (OP) quintuple-posted this 2 months ago, 1 day later, 2 weeks after the original post[^] [v] #1,423,219

matches using a specified domain (continued):

The last line of the previous reply can be replaced with something very trivial if the URL involved is https://www.thedomain.com: ^((?!thedomain\.com/?$).)*$
and if https://thedomain.com: ^((?!https?://thedomain\.com/?$).)*$

matches involving a specified domain, with a filename.extension pattern or not:
match URLs that have a specified domain and any filename.extension pattern
^(https?://)?([^\./]+\.)*thedomain\.com/.+(?=\.[^\./]+$).+$

match URLs that have a specified domain and no filename.extension pattern
^(https?://)?([^\./]+\.)*thedomain\.com(?!.+\.[^\./]+$).*$

match URLs that have a specified domain and no filename.extension pattern, and match URLs of all other domains regardless
The result would be a history containing only the URLs with the specified domain that end with any filename.extension.
^((https?://)?([^\./]+\.)*thedomain\.com(?!.+\.[^\./]+$).*|(?!(https?://)?([^\.]+\.)*thedomain\.com.*).*)$
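
The filename.extension patterns can be tried on a handful of lines from the URL list. This is a sketch assuming Python's re module, not the extension's own engine.

```python
import re

has_ext = r"^(https?://)?([^\./]+\.)*thedomain\.com/.+(?=\.[^\./]+$).+$"
no_ext = r"^(https?://)?([^\./]+\.)*thedomain\.com(?!.+\.[^\./]+$).*$"

sample = [
    "thedomain.com/more.ext",
    "https://sub1.thedomain.com/some.fold/more.ext",
    "thedomain.com/more.fold/",     # the dot is in a folder name, not a file extension
    "http://thedomain.com/some/more",
    "thedomain.com/",
]

def hits(pattern):
    """Positions (1-based) of the sample URLs that the pattern matches."""
    return [i for i, u in enumerate(sample, 1) if re.search(pattern, u)]

print(hits(has_ext))  # [1, 2]
print(hits(no_ext))   # [3, 4, 5]
```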

(Edited 2 minutes later.)

boof (OP) sextuple-posted this 2 months ago, 1 day later, 2 weeks after the original post[^] [v] #1,423,475

[interstitial topic: inverting a pattern]

Any regular expression pattern can be simply inverted, so that if a pattern is for matching lines and not others, the inversion will match the others and not what the original pattern matched. If the original pattern is PATTERN, then the inversion has the form ^((?!PATTERN).)*$ The PATTERN may have one or both of its anchors ^ and $ removed as redundant, but not necessarily. Test to verify, or just leave as is.

Some of the longer patterns that I have already put in this thread are equivalent to inversions of simpler patterns.
The last line in https://minichan.net/topic/133261#reply_1422751 can be replaced:
^((?!^(https?://)?thedomain\.com.*$).)*$
and the .*$ after .com is redundant so we can use:
^((?!^(https?://)?thedomain\.com).)*$

The last line in https://minichan.net/topic/133261#reply_1422890 can be replaced:
^((?!^(https?://)?([^\./]+\.)*thedomain\.com/?$).)*$
and the second ^ is redundant so we can use:
^((?!(https?://)?([^\./]+\.)*thedomain\.com/?$).)*$

The last line in https://minichan.net/topic/133261#reply_1423219 can be replaced:
^((?!^(https?://)?([^\./]+\.)*thedomain\.com/.+(?=\.[^\./]+$).+$).)*$
and the second ^ and the .+$ prior to ).)*$ are redundant and so we can use:
^((?!(https?://)?([^\./]+\.)*thedomain\.com/.+(?=\.[^\./]+$)).)*$
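To check that a long pattern and its shorter inverted replacement agree, both can be run over the same URLs. This is a rough sketch in Python using the long pattern from reply 1,423,219 and the shortened inversion just above; the sample URLs are made up, and agreement on a few samples is of course not a proof of equivalence.

```python
import re

# Long form: last pattern of reply 1,423,219.
LONG = re.compile(
    r"^((https?://)?([^\./]+\.)*thedomain\.com(?!.+\.[^\./]+$).*"
    r"|(?!(https?://)?([^\.]+\.)*thedomain\.com.*).*)$"
)
# Short form: the inversion with the redundant parts removed.
SHORT = re.compile(r"^((?!(https?://)?([^\./]+\.)*thedomain\.com/.+(?=\.[^\./]+$)).)*$")

# Made-up sample URLs (assumptions for illustration only).
urls = [
    "https://thedomain.com/",
    "https://thedomain.com/more/pic.jpg",
    "https://sub.thedomain.com/more/",
    "https://other.com/file.ext",
]

# For each URL: (LONG matches, SHORT matches)
results = {url: (bool(LONG.search(url)), bool(SHORT.search(url))) for url in urls}
agree = all(a == b for a, b in results.values())
print(results, agree)
```

Only the thedomain.com URL that ends with a filename.extension is left unmatched by both forms, which is the URL that would stay in the history.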

[interstitial topic ends here]

boof (OP) septuple-posted this 2 months ago, 22 hours later, 2 weeks after the original post[^] [v] #1,423,663

matches using a specified domain (continued):

With these next patterns, I show three variations for each description: specified file extension with example being jpg, specified filename and extension with example being more.jpg, and specified list of file extensions being jpg, gif, and png. Change any file extension to ext to match lines in the URL list in https://minichan.net/topic/133261#reply_1422544

match URLs that have a specified domain and a specified filename.extension
^(https?://)?([^\./]+\.)*thedomain\.com/.+\.(jpg|JPG)$
^(https?://)?([^\./]+\.)*thedomain\.com/(.+/)*more\.(jpg|JPG)$
^(https?://)?([^\./]+\.)*thedomain\.com/.+\.((?i)jpg|gif|png(?-i))$
match URLs that have a specified domain and not the specified filename.extension, having an extension is not required
^(https?://)?([^\./]+\.)*thedomain\.com(?!.*\.(jpg|JPG)$).*$
^(https?://)?([^\./]+\.)*thedomain\.com(?!.*/more\.(jpg|JPG)$).*$
^(https?://)?([^\./]+\.)*thedomain\.com(?!.*\.((?i)jpg|gif|png(?-i))$).*$
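A quick way to see how the jpg patterns split up the URLs of one domain is to run the "specified extension" pattern and the "not the specified extension" pattern side by side. This is a minimal Python sketch with made-up sample URLs, assuming Python's re module handles these patterns like the extension does.

```python
import re

# First pattern of each of the two groups above (the jpg variation).
WITH_JPG = re.compile(r"^(https?://)?([^\./]+\.)*thedomain\.com/.+\.(jpg|JPG)$")
WITHOUT_JPG = re.compile(r"^(https?://)?([^\./]+\.)*thedomain\.com(?!.*\.(jpg|JPG)$).*$")

# Made-up sample URLs (assumptions for illustration only).
urls = [
    "https://thedomain.com/pic.jpg",
    "https://thedomain.com/pic.JPG",
    "https://thedomain.com/pic.Jpg",
    "https://thedomain.com/more/",
]

# For each URL: (matches WITH_JPG, matches WITHOUT_JPG)
results = {url: (bool(WITH_JPG.search(url)), bool(WITHOUT_JPG.search(url))) for url in urls}
for url, flags in results.items():
    print(url, flags)
```

Note that (jpg|JPG) covers only those two spellings, so a mixed-case name such as pic.Jpg counts as not having the specified extension.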

match URLs that have a specified domain and not the specified filename.extension, and match URLs that have any other domain regardless
The result would be a history containing only the URLs with the specified domain and the specified filename.extension.
^((https?://)?([^\./]+\.)*thedomain\.com(?!.*\.(jpg|JPG)$).*$|(?!(https?://)?([^\.]+\.)*thedomain\.com.*).*)
^((https?://)?([^\./]+\.)*thedomain\.com(?!.*/more\.(jpg|JPG)$).*$|(?!(https?://)?([^\.]+\.)*thedomain\.com.*).*)
^((https?://)?([^\./]+\.)*thedomain\.com(?!.*\.((?i)jpg|gif|png(?-i))$).*$|(?!(https?://)?([^\.]+\.)*thedomain\.com.*).*)
The three above patterns can be replaced with these simpler forms:
^((?!^(https?://)?([^\./]+\.)*thedomain\.com/.+\.(jpg|JPG)$).)*$
^((?!^(https?://)?([^\./]+\.)*thedomain\.com/(.+/)*more\.(jpg|JPG)$).)*$
^((?!^(https?://)?([^\./]+\.)*thedomain\.com/.+\.((?i)jpg|gif|png(?-i))$).)*$
Also, the second ^ in the above forms is redundant for typical URLs and can be dropped. The first $ is not redundant: without it, a URL such as https://thedomain.com/pic.jpg.html would stop being matched. So we can use:
^((?!(https?://)?([^\./]+\.)*thedomain\.com/.+\.(jpg|JPG)$).)*$
^((?!(https?://)?([^\./]+\.)*thedomain\.com/(.+/)*more\.(jpg|JPG)$).)*$
^((?!(https?://)?([^\./]+\.)*thedomain\.com/.+\.((?i)jpg|gif|png(?-i))$).)*$

boof (OP) octuple-posted this 2 months ago, 1 day later, 2 weeks after the original post[^] [v] #1,423,906

matches using a specified domain (continued):

match URLs that have a specified domain and not the specified filename.extension, but having an extension is required
^(https?://)?([^\./]+\.)*thedomain\.com(?!.*\.(jpg|JPG)$).+(?=\.[^\./]+$).+$
^(https?://)?([^\./]+\.)*thedomain\.com(?!.*/more\.(jpg|JPG)$).+(?=\.[^\./]+$).+$
^(https?://)?([^\./]+\.)*thedomain\.com(?!.*\.((?i)jpg|gif|png(?-i))$).+(?=\.[^\./]+$).+$

match URLs that have a specified domain and not the specified filename.extension, but having an extension is required, and match URLs of all other domains regardless
The result would be a history containing only the URLs with the specified domain that either have the specified filename.extension or have no filename.extension at all.
^((https?://)?([^\./]+\.)*thedomain\.com/(?!.*\.(jpg|JPG)$).+(?=\.[^\./]+$).+|(?!(https?://)?([^\.]+\.)*thedomain\.com.*).*)$
^((https?://)?([^\./]+\.)*thedomain\.com(?!.*/more\.(jpg|JPG)$).+(?=\.[^\./]+$).+|(?!(https?://)?([^\.]+\.)*thedomain\.com.*).*)$
^((https?://)?([^\./]+\.)*thedomain\.com(?!.*\.((?i)jpg|gif|png(?-i))$).+(?=\.[^\./]+$).+|(?!(https?://)?([^\.]+\.)*thedomain\.com.*).*)$

boof (OP) nonuple-posted this 2 months ago, 1 day later, 2 weeks after the original post[^] [v] #1,424,157

referring back to @1,422,544 (boof)

This pattern should match every URL in the above list:
^(https?://)?([^\./]+\.)+[^\./]+(/[^/]+)*/?$

The (https?://)? part is optional so that it covers two cases: URLs that start with http:// or https://, and URLs that start with https://www., which History AutoDelete Rebooted treats as if that prefix were absent. The ([^\./]+\.)+ part matches one or more instances of a block of characters that aren't . or / with a . following each block. The [^\./]+ is for the rightmost block of characters that follows a . and comes before a possible first instance of /. The (/[^/]+)* part matches zero or more instances of a / followed by a block of characters that are not /. The /? part matches an optional / that might finish the URL.
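The breakdown above can be mirrored in code by building the pattern from named pieces. This is a minimal Python sketch; the constant names and sample URLs are made up for illustration.

```python
import re

SCHEME = r"(https?://)?"        # optional http:// or https://
LABELS = r"([^\./]+\.)+"        # one or more blocks without . or /, each followed by a .
LAST_LABEL = r"[^\./]+"         # rightmost block before a possible first /
PATH = r"(/[^/]+)*"             # zero or more blocks that begin with a /
TRAILING_SLASH = r"/?"          # optional / that might finish the URL

GENERAL_URL = re.compile("^" + SCHEME + LABELS + LAST_LABEL + PATH + TRAILING_SLASH + "$")

# Made-up sample strings (assumptions for illustration only).
samples = [
    "https://thedomain.com/more/pic.jpg",
    "thedomain.com",
    "http://sub.thedomain.com/",
    "https://localhost/",
    "thedomain.com//x",
]
results = {s: bool(GENERAL_URL.search(s)) for s in samples}
for s, matched in results.items():
    print(s, matched)
```

The last two samples are not matched: the first of them has no dot-separated host, and the second has an empty path block between two slashes.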

Note that the pattern does not prove that a string is a valid URL, as it does not screen for characters that are disallowed in URLs, but the pattern will match URL forms that are valid if they are of the kind that begin with http:// or https://. For applicability outside of the use of History AutoDelete Rebooted, the possible instance of www. has to be included:
^https?://(www\.)?([^\./]+\.)+[^\./]+(/[^/]+)*/?$

boof (OP) decuple-posted this 2 months ago, 23 hours later, 3 weeks after the original post[^] [v] #1,424,406

matches for patterns without specifying a domain:

Note that a "base domain" here is assumed to be something like ebay.com and not ebay.co.uk, i.e. there are two parts separated by a dot rather than three (or more) parts separated by more dots. There is no good way to accommodate the longer forms without listing specific possible parts.

match URLs that have the base domain form only with no additional subdomain
^(https?://)?[^\.]+\.[^\./]+(/?|/.+)$

match URLs that have at least one subdomain in addition to the base domain
^(https?://)?([^\./]+\.){2,}[^\./]+(/?|/.+)$
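The base-domain assumption described above shows up clearly when the two patterns are run on a few hosts. This is a minimal Python sketch with made-up sample URLs, assuming Python's re module handles these patterns like the extension does.

```python
import re

BASE_ONLY = re.compile(r"^(https?://)?[^\.]+\.[^\./]+(/?|/.+)$")
WITH_SUBDOMAIN = re.compile(r"^(https?://)?([^\./]+\.){2,}[^\./]+(/?|/.+)$")

# Made-up sample URLs (assumptions for illustration only).
urls = [
    "https://ebay.com/itm/123",
    "https://signin.ebay.com/",
    "ebay.co.uk",
]

# For each URL: (matches BASE_ONLY, matches WITH_SUBDOMAIN)
results = {url: (bool(BASE_ONLY.search(url)), bool(WITH_SUBDOMAIN.search(url))) for url in urls}
for url, flags in results.items():
    print(url, flags)
```

As the note above warns, ebay.co.uk is counted as having a subdomain, because the patterns only count dot-separated parts.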

boof (OP) undecuple-posted this 2 months ago, 1 day later, 3 weeks after the original post[^] [v] #1,424,662

matches for patterns without specifying a domain (continued):

match URLs that have nothing beyond the .com/ (or whatever that part happens to be)
^(https?://)?[^/]+/?$

match URLs that have nothing beyond the .com/ (or whatever that part happens to be) except for an optional block of text followed by an optional /, e.g. http://thedomain.com/more/
^(https?://)?[^/]+/?[^/]*/?$

match URLs that have anything beyond the .com/ (or whatever that part happens to be)
The result would be a history containing only the URLs that do not extend beyond the domain.
^(https?://)?([^\./]+\.)*[^\./]+/([^/]+/)*[^/]+/?$
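The three patterns in this reply differ only in how much of the URL after the domain they allow. This is a minimal Python sketch that runs all three on made-up sample URLs, assuming Python's re module handles these patterns like the extension does.

```python
import re

DOMAIN_ONLY = re.compile(r"^(https?://)?[^/]+/?$")
ONE_BLOCK = re.compile(r"^(https?://)?[^/]+/?[^/]*/?$")
BEYOND = re.compile(r"^(https?://)?([^\./]+\.)*[^\./]+/([^/]+/)*[^/]+/?$")

# Made-up sample URLs (assumptions for illustration only).
urls = [
    "https://thedomain.com/",
    "https://thedomain.com/more",
    "http://thedomain.com/more/",
    "https://thedomain.com/more/pic.jpg",
]

# For each URL: (matches DOMAIN_ONLY, matches ONE_BLOCK, matches BEYOND)
results = {
    url: (bool(DOMAIN_ONLY.search(url)), bool(ONE_BLOCK.search(url)), bool(BEYOND.search(url)))
    for url in urls
}
for url, flags in results.items():
    print(url, flags)
```

On these samples DOMAIN_ONLY and BEYOND never match the same URL, while ONE_BLOCK overlaps with both.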

boof (OP) duodecuple-posted this 2 months ago, 23 hours later, 3 weeks after the original post[^] [v] #1,425,038

matches for patterns without specifying a domain (continued):

match URLs that have the base domain form only with no additional subdomain part, and nothing beyond the .com/ (or whatever that part happens to be)
^(https?://)?[^\.]+\.[^\./]+/?$

match URLs that have the base domain form only with no additional subdomain part, and nothing beyond the .com/ (or whatever that part happens to be) except for an optional block of text followed by an optional /, e.g. http://thedomain.com/more/
^(https?://)?[^\.]+\.[^\./]+(/?|/[^/]*)$

match URLs that have any subdomains in addition to the base domain form, and/or anything beyond .com/ (or whatever that part happens to be)
The result would be a history containing only the URLs that have the base domain form with nothing beyond the .com/ (or whatever that part happens to be).
^((https?://)?([^\./]+\.){2,}[^\./]+(/?|/.+)|(https?://)?([^\./]+\.)*[^\./]+/[^/].*)$

The above pattern can be replaced with this simpler form:
^((?!^(https?:\/\/)?[^\.\n]+\.[^\.\/\n]+\/?$).)*$

boof (OP) tridecuple-posted this 2 months ago, 23 hours later, 3 weeks after the original post[^] [v] #1,425,217

matches for patterns without specifying a domain (continued):

matches involving an unspecified filename.extension pattern:
match URLs that have a filename.extension form immediately after the .com/ (or whatever that part happens to be)
^(https?://)?[^/]+\.[^\./]+/[^/]+\.[^/]+$

match URLs ending with a filename.extension form, regardless of appearing immediately after the .com/ (or whatever that part happens to be) or not
^(https?://)?[^/]+\.[^\.]+/.+\.[^/]+$

match URLs not ending with a filename.extension form
The result would be a history containing only the URLs ending with a filename.extension form.
^(https?://)?[^/]+\.[^\.]+(?!.*\.[^\./]+$).*$
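The three filename.extension patterns of this reply can be compared on a handful of URLs. This is a minimal Python sketch with made-up sample URLs, assuming Python's re module handles these patterns like the extension does.

```python
import re

EXT_RIGHT_AFTER_DOMAIN = re.compile(r"^(https?://)?[^/]+\.[^\./]+/[^/]+\.[^/]+$")
EXT_ANYWHERE = re.compile(r"^(https?://)?[^/]+\.[^\.]+/.+\.[^/]+$")
NO_EXT = re.compile(r"^(https?://)?[^/]+\.[^\.]+(?!.*\.[^\./]+$).*$")

# Made-up sample URLs (assumptions for illustration only).
urls = [
    "https://thedomain.com/pic.jpg",
    "https://thedomain.com/more/pic.jpg",
    "https://thedomain.com/more/",
    "https://thedomain.com/",
]

# For each URL: (EXT_RIGHT_AFTER_DOMAIN, EXT_ANYWHERE, NO_EXT)
results = {
    url: (
        bool(EXT_RIGHT_AFTER_DOMAIN.search(url)),
        bool(EXT_ANYWHERE.search(url)),
        bool(NO_EXT.search(url)),
    )
    for url in urls
}
for url, flags in results.items():
    print(url, flags)
```

On these samples EXT_ANYWHERE and NO_EXT split the URLs cleanly between them, and EXT_RIGHT_AFTER_DOMAIN picks out the subset where the file sits directly after the domain.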

Anonymous G joined in and replied with this 2 months ago, 17 minutes later, 3 weeks after the original post[^] [v] #1,425,220

Now you have two problems{i !}

boof (OP) replied with this 2 months ago, 23 hours later, 3 weeks after the original post[^] [v] #1,425,595

matches for patterns without specifying a domain (continued):

matches involving a particular named filename.extension:

match URLs that have the filename.extension immediately after the .com/ (or whatever that part happens to be)
^(https?://)?[^/]+/[^/]+\.(jpg|JPG)$
^(https?://)?[^/]+/more\.(jpg|JPG)$
^(https?://)?[^/]+/[^/]+\.((?i)jpg|gif|png(?-i))$

match URLs ending with the filename.extension, regardless of appearing immediately after the .com/ (or whatever that part happens to be) or not
^(https?://)?([^/]+/)+[^/]+\.(jpg|JPG)$
^(https?://)?([^/]+/)+more\.(jpg|JPG)$
^(https?://)?([^/]+/)+[^/]+\.((?i)jpg|gif|png(?-i))$
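For testing the case-insensitive variation outside the extension, some care is needed, because regex engines differ in how they handle (?i) and (?-i). As far as I know, recent versions of Python's re module reject an inline (?i) that is not at the very start of the pattern, so this sketch swaps in the scoped group (?i:...) instead. That substitution is an assumption about an equivalent form, not the extension's own syntax, and the sample URLs are made up.

```python
import re

# Assumed Python equivalent of:
#   ^(https?://)?([^/]+/)+[^/]+\.((?i)jpg|gif|png(?-i))$
# (?i:...) switches on case-insensitive matching only inside the group.
IMAGE_URL = re.compile(r"^(https?://)?([^/]+/)+[^/]+\.(?i:jpg|gif|png)$")

# Made-up sample URLs (assumptions for illustration only).
urls = [
    "https://thedomain.com/more/pic.PnG",
    "https://thedomain.com/pic.gif",
    "https://thedomain.com/pic.jpeg",
    "https://thedomain.com/more/",
]
results = {url: bool(IMAGE_URL.search(url)) for url in urls}
for url, matched in results.items():
    print(url, matched)
```

The scoped group keeps the rest of the pattern case sensitive, which is what the (?i) ... (?-i) pair is meant to achieve.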

Anonymous H joined in and replied with this 2 months ago, 12 hours later, 3 weeks after the original post[^] [v] #1,425,769

I use Opera

boof (OP) replied with this 2 months ago, 10 hours later, 3 weeks after the original post[^] [v] #1,425,796

matches for patterns without specifying a domain (continued):

matches involving URLs that end with a filename.extension form but do not have the particular filename.extension that we have in mind:

match URLs that have a filename.extension other than the one we have in mind immediately after the .com/ (or whatever that part happens to be)
^(https?://)?[^/]+/[^/]+\.(?!(jpg|JPG)$)[^\./]+$
^(https?://)?[^/]+/(?!more\.(jpg|JPG)$)[^/]+\.[^\./]+$
^(https?://)?[^/]+/[^/]+\.(?!((?i)jpg|gif|png(?-i))$)[^\./]+$

match URLs ending with a filename.extension form, regardless of appearing immediately after the .com/ (or whatever that part happens to be) or not
^(https?://)?([^/]+/)+[^/]+\.(?!(jpg|JPG)$)[^\./]+$
^(https?://)?([^/]+/)+(?!more\.(jpg|JPG)$)[^/]+\.[^\./]+$
^(https?://)?([^/]+/)+[^/]+\.(?!((?i)jpg|gif|png(?-i))$)[^\./]+$
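The negative lookahead in the more.jpg variation only rejects a filename that is exactly more.jpg or more.JPG. This is a minimal Python sketch with made-up sample URLs, assuming Python's re module handles the pattern like the extension does.

```python
import re

# The "more.jpg" variation that allows the file anywhere after the domain.
NOT_MORE_JPG = re.compile(r"^(https?://)?([^/]+/)+(?!more\.(jpg|JPG)$)[^/]+\.[^\./]+$")

# Made-up sample URLs (assumptions for illustration only).
urls = [
    "https://thedomain.com/a/more.jpg",
    "https://thedomain.com/a/other.jpg",
    "https://thedomain.com/a/more.png",
    "https://thedomain.com/a/evenmore.jpg",
    "https://thedomain.com/a/",
]
results = {url: bool(NOT_MORE_JPG.search(url)) for url in urls}
for url, matched in results.items():
    print(url, matched)
```

Because the lookahead sits right after the last /, evenmore.jpg is still matched, while a URL with no filename.extension at the end is not matched at all.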

Anonymous I joined in and replied with this 2 months ago, 4 minutes later, 3 weeks after the original post[^] [v] #1,425,798

Why are people poasting regexen here? Away, away! back to /prog/, please!

boof (OP) replied with this 2 months ago, 1 day later, 3 weeks after the original post[^] [v] #1,425,991

matches for patterns without specifying a domain (continued):

matches involving URLs that do not have the particular named filename.extension that we have in mind, regardless of whether they end with a filename.extension form or not:

match URLs that do not have the particular filename.extension that we have in mind
The result would be a history containing only the URLs for files that have the given particular named filename.extension.
^(https?://)?[^/]+\.[^\./]+(?!.*/[^/]+\.(jpg|JPG)$)(/[^/]+)*/?$
^(https?://)?[^/]+\.[^\./]+(?!.*/more\.(jpg|JPG)$)(/[^/]+)*/?$
^(https?://)?[^/]+\.[^\./]+(?!.*/[^/]+\.((?i)jpg|gif|png(?-i))$)(/[^/]+)*/?$

match URLs that have one block of text beyond the .com/ (or whatever that part happens to be) that can have an optional / afterward if the text is not a filename, and does not have the given particular named filename.extension if the text is a filename
^(https?://)?[^/]+/(?![^/]+\.(jpg|JPG)$)[^/]+/?$
^(https?://)?[^/]+/(?!more\.(jpg|JPG)$)[^/]+/?$
^(https?://)?[^/]+/(?![^/]+\.((?i)jpg|gif|png(?-i))$)[^/]+/?$

match URLs that have anything beyond .com/ (or whatever that part happens to be) that can have an optional / afterward if the text is not a filename, and does not have the given particular named filename.extension if the URL ends with a filename
The result would be a history containing only the URLs for files that have the given particular named filename.extension, and URLs that do not go beyond the .com/ (or whatever that part happens to be).
^(https?://)?([^/]+/)+(?![^/]+\.(jpg|JPG)$)[^/]+/?$
^(https?://)?([^/]+/)+(?!more\.(jpg|JPG)$)[^/]+/?$
^(https?://)?([^/]+/)+(?![^/]+\.((?i)jpg|gif|png(?-i))$)[^/]+/?$
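The stated result of the last group (only the jpg files and the bare-domain URLs stay in the history) can be checked with a few URLs. This is a minimal Python sketch using the jpg variation and made-up sample URLs, assuming Python's re module handles the pattern like the extension does.

```python
import re

# The jpg variation of "anything beyond .com/ that is not a .jpg file".
BEYOND_NOT_JPG = re.compile(r"^(https?://)?([^/]+/)+(?![^/]+\.(jpg|JPG)$)[^/]+/?$")

# Made-up sample URLs (assumptions for illustration only).
urls = [
    "https://thedomain.com/",
    "https://thedomain.com/more/pic.jpg",
    "https://thedomain.com/more/pic.png",
    "https://thedomain.com/more/",
]

# True means the URL is matched, i.e. it would be deleted from the history.
results = {url: bool(BEYOND_NOT_JPG.search(url)) for url in urls}
for url, matched in results.items():
    print(url, matched)
```

The bare-domain URL and the .jpg URL are the two left unmatched, in line with the description above.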

boof (OP) double-posted this 2 months ago, 23 hours later, 4 weeks after the original post[^] [v] #1,426,100

Well, that's my collection of regular expression patterns applicable to the use of the Firefox extension History AutoDelete Rebooted. As a reminder, the extension has the peculiarity of treating URLs in the browser history that start with https://www. as if that part were not there, and the patterns in this thread take that into account. For applicability outside of the use of History AutoDelete Rebooted, the possible instance of www. has to be included. Do that by inserting (www\.)? after the (https?://)? part in these patterns.

(Edited 42 seconds later.)

boof (OP) triple-posted this 2 months ago, 1 day later, 1 month after the original post[^] [v] #1,426,281

@previous (boof)
Also, that (https?://)? would no longer be optional, so replace it with https?:// (remove the surrounding parentheses and the trailing ?).
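The two adjustments for use outside of the extension (a mandatory scheme and an optional www.) amount to a plain text substitution on the pattern. This is a minimal Python sketch; the helper name adapt_for_general_use is made up, and the dot after www is escaped so that it only matches a literal dot.

```python
import re

def adapt_for_general_use(pattern: str) -> str:
    """Replace the optional scheme with a mandatory scheme plus an optional www."""
    return pattern.replace("(https?://)?", r"https?://(www\.)?")

# Example: the "nothing beyond the .com/" pattern from earlier in the thread.
extension_pattern = r"^(https?://)?[^/]+/?$"
general_pattern = adapt_for_general_use(extension_pattern)
print(general_pattern)

# Made-up sample strings (assumptions for illustration only).
samples = ["https://www.thedomain.com/", "http://thedomain.com", "thedomain.com/"]
results = {s: bool(re.search(general_pattern, s)) for s in samples}
print(results)
```

After the change, a string without http:// or https:// at the start is no longer matched.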

@1,425,220 (G)

> Now you have two problems{i !}

And I found these two quotes:

"Hey, I know, I'll use Regex to solve this problem." Now you have two problems.

"The plural of Regex is Regrets."
