This thread is referenced in https://minichan.net/topic/132404#reply_1420524 where the Firefox extension History AutoDelete Rebooted is described. Here, details are given about making regular expression patterns to match URLs in Firefox browser history for deletion. Regular expressions, also known as regex, are a general method for matching and locating text in many different situations. They have a particular syntax whereby special metacharacters represent things other than the literal characters that they appear to be. This is a list of the metacharacters that are used in these notes. More details about each are given a little later.
^ anchor: start of string or line
[^] negated character class
$ anchor: end of string or line
. any character except line break
\ escape, apply to metacharacters to make them literal, or apply to regular characters for special meanings. e.g. \. \? \/ \n
() capturing group
? quantifier: zero or one
+ quantifier: one or more
{2,} quantifier: two or more
* quantifier: zero or more
(|) choice (alternation, set union) operator, OR
(?!) negative lookahead
(?=) positive lookahead
(?i) inline modifier: activate case insensitive mode
(?-i) inline modifier: switch off case insensitive mode
Anonymous B joined in and replied with this 1 week ago, 40 minutes later #1,420,533
i strongly prefer a browser thats functional right off the executable.
like... the more you tweak a browser the more it stands out, and the more it stands out, the easier the glowniggers can git you for posting a meme they dont like.
> i strongly prefer a browser thats functional right off the executable. > > like... the more you tweak a browser the more it stands out, and the more it stands out, the easier the glowniggers can git you for posting a meme they dont like.
Firefox’s appeal is in customization and add-ons. It’s the browser you tweak, not the one you’re fed.
boof (OP) double-posted this 1 week ago, 1 day later, 1 day after the original post #1,420,782
Here is more detail about each of the listed metacharacters from earlier:
^
The caret symbol indicates the beginning of a string or line of text. For instance, ^abc would locate the start of this text if it began a line: abcdef, while it would not match anything in a line whose text began as: ghabc.
[^]
The caret as the first character within square brackets matches any single character (including a space) other than those listed after it. For example, [^589hk] would match any character other than 5, 8, 9, h, and k. In the text 35hmp, the matches would be the 3, m, and p. Also, this[^17puv]that matches thismthat and this9that but not this1that or thisuthat.
$
The dollar sign indicates the end of a string or line of text. For instance, flop$ would locate the "flop" part of this text if it was at the end of a line: 34-vertflop, while it would not match anything in a line whose text ended with: moflopel844.
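For anyone who wants to try the three examples above outside the extension, here is a minimal sketch using Python's re module. Python is my assumption here, and the extension's own regex engine may differ in small details.

```python
import re

# ^ anchors the match at the start of the text.
starts_with_abc = [t for t in ["abcdef", "ghabc"] if re.search(r"^abc", t)]
print(starts_with_abc)   # ['abcdef']

# [^...] matches one character that is not in the list.
not_listed = re.findall(r"[^589hk]", "35hmp")
print(not_listed)        # ['3', 'm', 'p']
between = [t for t in ["thismthat", "this9that", "this1that", "thisuthat"]
           if re.fullmatch(r"this[^17puv]that", t)]
print(between)           # ['thismthat', 'this9that']

# $ anchors the match at the end of the text.
ends_with_flop = [t for t in ["34-vertflop", "moflopel844"]
                  if re.search(r"flop$", t)]
print(ends_with_flop)    # ['34-vertflop']
```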
boof (OP) triple-posted this 1 week ago, 1 day later, 3 days after the original post #1,420,941
.
The dot is the wildcard character, and indicates any character (including a space) except for newline (anything that serves as a line break), which is a non-displaying signal to continue text on the next line. Various implementations of regular expressions can be set to a mode whereby the dot can also match a newline character. So, k.t can match k3t, kpt, k#t, k t, and so on.
\
The backslash is the escape character, meaning that it reverts metacharacters to their literal use. For instance, \. is for matching the dot, so that it is not taken as a wildcard character. For example, youtube\.com matches youtube.com, and not, say, youtubeXcom, which the unescaped pattern youtube.com would also match. Within square brackets, . is taken as a literal dot, and the backslash is not necessary (nor does it change that usage, so . and \. will both be understood as a literal . there). Patterns that will be used later in these notes include escaping the question mark and the forward slash, i.e. \? and \/. The backslash can also convert ordinarily literal characters to some special meaning. In these notes, \n will be used to indicate an unseen newline character.
()
The round brackets (parentheses) cause their contents to be considered as a group. One reason to do that would be to be able to reference a captured group later in a pattern. Another reason would be for application of quantifier metacharacters, as described next in these notes.
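A short Python sketch of the dot, the backslash, and a captured group, again assuming Python's re module rather than the extension's engine. The test strings are only illustrations.

```python
import re

# The dot stands for any one character except a line break.
dot_hits = [s for s in ["k3t", "kpt", "k#t", "k t", "k\nt", "kt"]
            if re.fullmatch(r"k.t", s)]
print(dot_hits)   # ['k3t', 'kpt', 'k#t', 'k t']

# An unescaped dot is a wildcard; an escaped dot is a literal dot.
unescaped = bool(re.fullmatch(r"youtube.com", "youtubeXcom"))
escaped = bool(re.fullmatch(r"youtube\.com", "youtubeXcom"))
literal = bool(re.fullmatch(r"youtube\.com", "youtube.com"))
print(unescaped, escaped, literal)   # True False True

# Inside square brackets the dot is literal with or without a backslash.
plain_class = re.findall(r"[.]", "a.b,c")
escaped_class = re.findall(r"[\.]", "a.b,c")
print(plain_class, escaped_class)    # ['.'] ['.']

# Round brackets capture their contents as a group.
captured = re.fullmatch(r"art(perk)ow", "artperkow").group(1)
print(captured)   # perk
```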
boof (OP) quadruple-posted this 6 days ago, 23 hours later, 4 days after the original post #1,421,119
?
The question mark indicates that the prior character or captured group can be at that place in the pattern either zero or one time. For instance, be?t matches bt and bet. Likewise, art(perk)?ow matches artow and artperkow. The pattern 122[^8m]? matches 122 and 122R but not the whole of 1228 or 122m (there, only the 122 part matches). Also, .?%2 matches %2 and h%2, among other possibilities.
+
The plus sign indicates that the prior character or captured group can be at that place in the pattern at least once. For instance, ca+t matches cat, caat, caaat, and so on. Likewise, 124(-01)+ matches 124-01, 124-01-01, and so on. Also, [^aeiou]+43 matches c43, 6843, and y7k43, but not e43 or ia43. The pattern v3#.+ matches v3#9, v3#e0, and v3#11p, among others.
{2,}
The curly brackets (braces) with a 2 and then a comma between them indicate that the prior character or captured group can be at that place in the pattern two or more times. For instance, wap{2,}4 matches wapp4, wappp4, and so on. Likewise, (000-){2,}500 matches 000-000-500 and 000-000-000-500, and so on. The pattern EE[^uvUV]{2,}NN matches EE7fNN and EEWMiNN, but not EENN or EEUVrrNN, among others. Also, m.{2,}9 matches m3g9, mWPe9, m10fA9, and so forth.
*
The asterisk indicates that the prior character or captured group can be at that place in the pattern zero or more times. For instance, j4* matches j, j4, j44, and so on. Likewise, (rat)*32 matches 32, rat32, ratrat32, and so on. Also, G-[^@#&]* matches G-, G-B, and G-8c, but not the whole of G-@, G-#m, or G-4&k, among others. The pattern .*bun matches bun, Pbun, #xbun, and so on.
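The four quantifiers can be compared side by side with a small Python sketch. The helper name matching is made up for this illustration, and it checks whole-string matches, which is how the examples in these notes are meant.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

zero_or_one = matching(r"be?t", ["bt", "bet", "beet"])
one_or_more = matching(r"ca+t", ["ct", "cat", "caaat"])
two_or_more = matching(r"wap{2,}4", ["wap4", "wapp4", "wappp4"])
zero_or_more = matching(r"(rat)*32", ["32", "rat32", "ratrat32", "ra32"])
with_negation = matching(r"122[^8m]?", ["122", "122R", "1228", "122m"])

print(zero_or_one)    # ['bt', 'bet']
print(one_or_more)    # ['cat', 'caaat']
print(two_or_more)    # ['wapp4', 'wappp4']
print(zero_or_more)   # ['32', 'rat32', 'ratrat32']
print(with_negation)  # ['122', '122R']
```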
boof (OP) sextuple-posted this 5 days ago, 23 hours later, 4 days after the original post #1,421,270
OK, replacing the problematic at symbol with a percent, here's the sentence from the ? paragraph: Also, .?%2 matches %2 and h%2 among other possibilities.
(|)
The pipe within round brackets (parentheses) is for placing two alternative patterns to match, one on each side of the pipe. The pipe is effectively a logical OR operator. More than one pipe can be used so that three or more alternatives can be listed. For example, (go|stop) matches go and stop. Likewise, (one|two|three) matches one, two, and three. Also, h(a|i|o|u)t matches hat, hit, hot, and hut. The pattern b(3|54)+ matches b3, b54, b33, b354, b35454, b54354, b54333 and so on.
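A quick way to check the alternation examples, sketched in Python with whole-string matching. The non-matching test strings (het, b5, b) are my own additions.

```python
import re

# One of four vowels between h and t.
vowel_hits = [w for w in ["hat", "hit", "hot", "hut", "het"]
              if re.fullmatch(r"h(a|i|o|u)t", w)]
print(vowel_hits)    # ['hat', 'hit', 'hot', 'hut']

# A group of alternatives can itself be repeated with a quantifier.
repeat_hits = [w for w in ["b3", "b54", "b354", "b54333", "b5", "b"]
               if re.fullmatch(r"b(3|54)+", w)]
print(repeat_hits)   # ['b3', 'b54', 'b354', 'b54333']
```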
(?=)
The question mark and equals sign within round brackets (parentheses) is for performing a positive lookahead, which requires that the text to the right of a position match the given sub-pattern, but without consuming that text (it does not become part of the match). For example, #43(?=.*%) when applied to #43E78k%556p matches only the #43 part of that text, while #43(?=.*%).*p matches the entire #43E78k%556p text because of the wildcard and letter p that are part of the search pattern after the lookahead.
(?!)
The question mark and exclamation symbol within round brackets (parentheses) is for performing a negative lookahead, which requires that the text to the right of a position not match the given sub-pattern, again without consuming any text. For example, nab(?!56) when applied to nab1154YC matches only the nab part of the text, while nab(?!56).{2,}YC matches the entire nab1154YC text because of the wildcard and letters YC that are part of the search pattern.
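Both lookaheads can be tried in Python as below. The two failing inputs (#43E78k and nab5611) are my own additions to show a lookahead rejecting a match.

```python
import re

text = "#43E78k%556p"
# Positive lookahead: a % must appear somewhere ahead, but it is not consumed.
pos_short = re.search(r"#43(?=.*%)", text).group()
pos_long = re.search(r"#43(?=.*%).*p", text).group()
pos_fail = re.search(r"#43(?=.*%)", "#43E78k")
print(pos_short, pos_long, pos_fail)   # #43 #43E78k%556p None

# Negative lookahead: the text right after nab must not be 56.
neg_short = re.search(r"nab(?!56)", "nab1154YC").group()
neg_long = re.search(r"nab(?!56).{2,}YC", "nab1154YC").group()
neg_fail = re.search(r"nab(?!56)", "nab5611")
print(neg_short, neg_long, neg_fail)   # nab nab1154YC None
```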
boof (OP) replied with this 4 days ago, 23 hours later, 5 days after the original post #1,421,527
(?i)
The question mark and letter i within round brackets (parentheses) is for activating the case-insensitive mode for matching letters. In that mode, unlike the default case-sensitive mode, lowercase and uppercase versions of the same letters are not distinguished from each other. For example, (?i)f8-ep matches f8-ep, F8-EP, f8-EP, f8-eP, f8-Ep, F8-ep, F8-Ep, and F8-eP.
(?-i)
The question mark, hyphen, and letter i within round brackets (parentheses) is for deactivating the case-insensitive mode for matching letters. The text between (?i) and (?-i) matches regardless of case, while letter text outside of those modifiers has to match whatever case is used in the pattern. For example, (?i)k(?-i)ace matches kace and Kace, but not kAce for instance.
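If you test these in Python, note one difference: Python's re module accepts (?i) only at the very start of a pattern and has no standalone (?-i) switch. A rough equivalent there is the scoped form (?i:...), available since Python 3.6, which limits case-insensitivity to the bracketed part. This sketch is about Python only and says nothing about how the extension handles the modifiers.

```python
import re

# (?i) at the start makes the whole pattern case-insensitive.
whole = [v for v in ["f8-ep", "F8-EP", "f8-Ep", "g8-ep"]
         if re.fullmatch(r"(?i)f8-ep", v)]
print(whole)    # ['f8-ep', 'F8-EP', 'f8-Ep']

# Python's scoped stand-in for (?i)k(?-i)ace: only the k ignores case.
scoped = [w for w in ["kace", "Kace", "kAce"]
          if re.fullmatch(r"(?i:k)ace", w)]
print(scoped)   # ['kace', 'Kace']
```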
boof (OP) double-posted this 3 days ago, 23 hours later, 6 days after the original post #1,421,639
patterns for matching a specific domain:
Here is some detail on using regular expressions to specify URL patterns to match for deletion:
The regular expression URL patterns require the start-of-line character ^ and the end-of-line character $. Dots are used to indicate a wildcard, and following a dot with an asterisk indicates that there can be 0 or more possible characters. Different pages at websites extend beyond the main part of the URL (the domain), and so a dot with an asterisk should appear after the domain in the pattern to match all such pages' URLs. To indicate an actual dot, as in .com, the escape character \ must precede the dot. For example: ^https://apnews\.com.*$
One thing to keep in mind is that History AutoDelete Rebooted does not properly handle www. in patterns that you enter. For whatever reason, it was programmed to treat URLs starting with http://www. or https://www. as if those strings of characters were not there. For instance, https://www.youtube.com is seen by the extension as youtube.com, and therefore entering ^https://www\.youtube\.com.*$ as a pattern to match will not work. Use ^youtube\.com.*$ instead.
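To see why the www. pattern fails, here is a small Python model of the behaviour just described. The function as_seen_by_extension is hypothetical: it only imitates the prefix-dropping described in these notes and is not taken from the extension's source.

```python
import re

def as_seen_by_extension(url):
    """Hypothetical model: a leading http://www. or https://www. is dropped."""
    return re.sub(r"^https?://www\.", "", url)

# A site without www. keeps its full URL, so the pattern includes https://.
apnews_hit = bool(re.search(r"^https://apnews\.com.*$",
                            as_seen_by_extension("https://apnews.com/")))
print(apnews_hit)   # True

# A www. site loses its prefix, so only the shortened pattern can match.
seen = as_seen_by_extension("https://www.youtube.com/watch?v=7Qqmr6IiFLE")
print(seen)         # youtube.com/watch?v=7Qqmr6IiFLE
with_www = bool(re.search(r"^https://www\.youtube\.com.*$", seen))
without_www = bool(re.search(r"^youtube\.com.*$", seen))
print(with_www, without_www)   # False True
```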
Another thing to keep in mind when you want to match all URLs of a specific domain is that some sites prefix the domain with a character string and a dot to indicate a subdomain. For instance, https://boards.straightdope.com is part of the site https://straightdope.com (technically the com part is called a top-level domain and straightdope.com is a second-level domain, but since the minimum form is straightdope.com, I mean that minimum base form when I refer to a site's domain).
boof (OP) triple-posted this 2 days ago, 1 day later, 1 week after the original post #1,421,706
If we want a pattern to include all subdomains of straightdope.com, then we need a wildcard indicator and a dot to the left of the domain part. To match regardless of having a subdomain or not, the wildcard and dot are put within brackets to be handled as a group, and an asterisk is placed afterward to indicate that the group could be there zero or more times. The wildcard needs to indicate any character that is not itself a dot, because the dot part of the subdomain is already specified afterward. To indicate prohibition of characters, the characters are placed following a single caret ^, all within square brackets. This prohibition pattern acts as a wildcard. The wildcard has to represent a string that is at least one character long, and to indicate that, we place a plus sign afterward. So, we would enter: ^https://([^\.]+\.)*straightdope\.com.*$
Some sites might use the prefix http://, without the letter s seen in https://. To match for both possibilities, put a ? following the s, which indicates that the s can appear zero or once. So, the pattern is now: ^https?://([^\.]+\.)*straightdope\.com.*$
The straightdope site happens to be accessible with or without a www. part to the URL (not all sites are like that though). To match for both possibilities, we have to remember how the extension does not handle https://www. in its representation of URLs, so we need to place https?:// within round brackets to indicate that it is to be treated as a group of characters, and a ? is placed afterwards to indicate that the group can occur zero or once. The pattern is: ^(https?://)?([^\.]+\.)*straightdope\.com.*$
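The three stages of the pattern can be compared in Python as below. The URL list is made up for the comparison, and the last entry is written the way the extension is said to see a www. URL.

```python
import re

steps = {
    "subdomains": r"^https://([^\.]+\.)*straightdope\.com.*$",
    "http or https": r"^https?://([^\.]+\.)*straightdope\.com.*$",
    "optional scheme": r"^(https?://)?([^\.]+\.)*straightdope\.com.*$",
}
urls = [
    "https://straightdope.com/",
    "https://boards.straightdope.com/",
    "http://boards.straightdope.com/",
    "straightdope.com/",   # www. form, as the extension reportedly sees it
]

# For each stage, collect the URLs that the pattern matches.
results = {name: [u for u in urls if re.search(pattern, u)]
           for name, pattern in steps.items()}
for name, hits in results.items():
    print(name, len(hits))
# subdomains 2
# http or https 3
# optional scheme 4
```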
boof (OP) quadruple-posted this 1 day ago, 1 day later, 1 week after the original post #1,421,778
Note that we would not try to simplify the pattern to ^.*straightdope\.com.*$, because that would allow other domains that happen to end with the same characters, e.g. catstraightdope.com.
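The difference shows up in a two-line Python check, using the catstraightdope.com example from above:

```python
import re

broad = r"^.*straightdope\.com.*$"
precise = r"^(https?://)?([^\.]+\.)*straightdope\.com.*$"
lookalike = "https://catstraightdope.com/"

# The broad pattern accepts the lookalike domain; the precise one rejects it.
broad_hit = bool(re.search(broad, lookalike))
precise_hit = bool(re.search(precise, lookalike))
print(broad_hit, precise_hit)   # True False
```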
A good place to test regular expression patterns is https://regex101.com. When using regex101.com to test patterns with lists of URLs that are each on a separate line, we also need to indicate that the prohibitory wildcard is not a newline indicator, because we could get false matches that span longer than one line. Also, we need to escape each / character with a backslash. History AutoDelete Rebooted does not have those requirements. Copy and paste ^(https?:\/\/)?([^\.\n]+\.)*straightdope\.com.*$ into the area titled Regular Expression.
The lines that do not have https:// represent URLs that show https://www. when seen in the address bar, but are not handled by History AutoDelete Rebooted that way. The regex101 page should show the first, second, fifth, sixth, and eighth lines with a highlighted background to indicate that they match the given pattern.
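The reason for adding \n to the prohibitory wildcard can be reproduced in Python with the MULTILINE flag, which makes ^ and $ work per line much as in a multi-line test on regex101. The two-line history below is a made-up example, not the test list referred to above.

```python
import re

history = "https://example.org/\nhttps://boards.straightdope.com/"

without_newline = r"^(https?://)?([^\.]+\.)*straightdope\.com.*$"
with_newline = r"^(https?:\/\/)?([^\.\n]+\.)*straightdope\.com.*$"

# Without \n in the class, [^\.]+ runs across the line break.
loose_hits = [m.group() for m in re.finditer(without_newline, history, re.MULTILINE)]
strict_hits = [m.group() for m in re.finditer(with_newline, history, re.MULTILINE)]

print(loose_hits)    # ['https://example.org/\nhttps://boards.straightdope.com/']
print(strict_hits)   # ['https://boards.straightdope.com/']
```

The first result is a single false match that swallows both lines; the second matches only the straightdope line.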
Here is a general pattern to match the URLs that do not have a www. part:
^https?://.*$
Here is a general pattern to match the URLs with www. (and so appear to History AutoDelete Rebooted as having no https://www. as described earlier in these notes):
^(?!https?://).*$
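A small Python check of the two general patterns. The list holds made-up URLs written the way the extension is described as seeing them, so the www. site appears without its prefix.

```python
import re

seen_urls = ["https://apnews.com/", "http://example.org/", "youtube.com/"]

# URLs that still start with a scheme, i.e. those without www.
no_www = [u for u in seen_urls if re.search(r"^https?://.*$", u)]
# URLs with no scheme left, i.e. those that had https://www. in the address bar.
had_www = [u for u in seen_urls if re.search(r"^(?!https?://).*$", u)]

print(no_www)    # ['https://apnews.com/', 'http://example.org/']
print(had_www)   # ['youtube.com/']
```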
boof (OP) quintuple-posted this 1 hour ago, 1 day later, 1 week after the original post #1,421,933
If trying for an exact match of a URL that does not continue beyond the domain part, then keep in mind that the browser history holds a forward slash / afterwards regardless of appearing to lack it in the address bar. If you copy https://www.youtube.com seen in the address bar and paste elsewhere, you will see https://www.youtube.com/, for instance. So, ^youtube\.com$ will not work to match the URL, as ^youtube\.com/$ is required. If you'd like to be sure about matching regardless of / there or not, use a question mark ? afterward: ^youtube\.com/?$
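The three exact-match variants can be compared in Python against the stored form (with the slash) and the address-bar form (without it):

```python
import re

patterns = [r"^youtube\.com$", r"^youtube\.com/$", r"^youtube\.com/?$"]

# How each pattern fares against the form held in history and the bare form.
stored = [bool(re.search(p, "youtube.com/")) for p in patterns]
bare = [bool(re.search(p, "youtube.com")) for p in patterns]

print(stored)   # [False, True, True]
print(bare)     # [True, False, True]
```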
Here are typical patterns to match for deletion, using the youtube.com site as an example:
match URLs that are based upon https://www.youtube.com:
^youtube\.com/.*$
The first of the six patterns above will make History AutoDelete Rebooted blacklist all URLs with youtube.com as the base domain. The second pattern effectively blacklists all URLs except for those that have youtube.com as the base domain, effectively whitelisting that site. The third pattern blacklists all URLs that go beyond the .com part of youtube.com. The fourth pattern selectively blacklists only those URLs that are like https://www.youtube.com/watch?v=7Qqmr6IiFLE, leaving other URLs based upon youtube.com alone. The fifth pattern blacklists all URLs based upon youtube.com except for those that are like https://www.youtube.com/watch?v=7Qqmr6IiFLE. The sixth pattern is like the fifth pattern, but it also effectively blacklists all other URL domains regardless of being based upon youtube.com. The effect is to whitelist the https://www.youtube.com/watch?v=[whatever] form specifically and blacklist everything else.
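As a rough illustration of the selective behaviours described, here is a Python sketch. Only ^youtube\.com/.*$ is a pattern given in these notes; the other three are my own guesses at patterns with the described effect, and the feed and apnews URLs are made-up test cases written the way the extension is said to see them.

```python
import re

patterns = {
    "given in the notes": r"^youtube\.com/.*$",
    "watch pages only (assumed)": r"^youtube\.com/watch\?v=.*$",
    "youtube except watch (assumed)": r"^youtube\.com/(?!watch\?v=).*$",
    "everything except watch (assumed)": r"^(?!youtube\.com/watch\?v=).*$",
}
seen_urls = [
    "youtube.com/watch?v=7Qqmr6IiFLE",
    "youtube.com/feed/subscriptions",
    "apnews.com/",
]

# For each pattern, list the URLs it would select for deletion.
selected = {name: [u for u in seen_urls if re.search(p, u)]
            for name, p in patterns.items()}
for name, hits in selected.items():
    print(name, hits)
```

Note the \? in the watch patterns: without the backslash, the ? would act as a quantifier on the h before it instead of matching the literal question mark in the URL.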