Minichan

Topic: regex regular expressions for Firefox extension History AutoDelete Rebooted

boof started this discussion 1 week ago #133,261

This thread is referenced in https://minichan.net/topic/132404#reply_1420524 where the Firefox extension History AutoDelete Rebooted is described. Here, details are given about making regular expression patterns to match URLs in Firefox browser history for deletion. Regular expressions, also known as regex, are a method for matching and locating text in many different situations. They have a particular syntax in which special metacharacters represent something other than the literal characters that they appear to be. This is a list of the metacharacters that are used in these notes. More details about each are given a little later.

^	anchor:  start of string or line
[^]	negated character class
$	anchor:  end of string or line
.	any character except line break
\	escape, apply to metacharacters to make them literal, or apply to regular characters for
	special meanings.  e.g. \.  \?  \/  \n
()	capturing group
?	quantifier:  zero or one
+	quantifier:  one or more 
{2,}	quantifier:  two or more
*	quantifier:  zero or more
(|)	choice (alternation, set union) operator, OR
(?!)	negative lookahead
(?=)	positive lookahead
(?i)	inline modifier:  activate case insensitive mode
(?-i)	inline modifier:  switch off case insensitive mode

(Edited 1 minute later.)

Anonymous B joined in and replied with this 1 week ago, 40 minutes later[^] [v] #1,420,533

i strongly prefer a browser thats functional right off the executable.

like... the more you tweak a browser the more it stands out, and the more it stands out, the easier the glowniggers can git you for posting a meme they dont like.

Anonymous C joined in and replied with this 1 week ago, 11 minutes later, 52 minutes after the original post[^] [v] #1,420,536

Private browsing mode is easier

Oatmeal Fucker !BYUc1TwJMU joined in and replied with this 1 week ago, 28 minutes later, 1 hour after the original post[^] [v] #1,420,540

I use Google Chrome with no addons

Anonymous E joined in and replied with this 1 week ago, 4 hours later, 6 hours after the original post[^] [v] #1,420,560

@1,420,533 (B)

> i strongly prefer a browser thats functional right off the executable.
>
> like... the more you tweak a browser the more it stands out, and the more it stands out, the easier the glowniggers can git you for posting a meme they dont like.

Firefox’s appeal is in customization and add-ons. It’s the browser you tweak, not the one you’re fed.

Anonymous B replied with this 1 week ago, 12 hours later, 18 hours after the original post[^] [v] #1,420,625

@previous (E)
firefoxes appeal is that its a shit browser that people cope about because its not chrome

boof (OP) replied with this 1 week ago, 5 hours later, 23 hours after the original post[^] [v] #1,420,663

The above is not a complete list of metacharacters for regular expressions in general. For more information about regular expressions and their syntax, refer to these sites:
https://www.regular-expressions.info/
https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
https://www.rexegg.com
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions
https://bitlaunch.io/blog/how-to-use-the-grep-command-in-linux/ (grep is a command-line utility for searching text for lines that match a regular expression)
https://en.wikipedia.org/wiki/Regular_expression
https://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions
https://www.youtube.com/watch?v=0LKdKixl5Ug&list=PL55RiY5tL51ryV3MhCbH8bLl7O_RZGUUE
https://www.youtube.com/watch?v=kX3WpzLRiW4&list=PLLdz3KlabJv1UVT8cZ-h4iI7fRqC_rArb

These sites are useful for trying regular expression patterns with text that you can paste into the web pages:
https://regex101.com
https://spannbaueradam.shinyapps.io/r_regex_tester (see https://adamspannbauer.github.io/2018/01/16/r-regex-tester-shiny-app/ for usage notes)
https://www.regextester.com

This is software that you can install to test regular expressions:
https://github.com/nedrysoft/regex101/blob/master/README.md (Linux, Windows, Mac)
https://sourceforge.net/projects/simregextester (Windows)
https://sourceforge.net/projects/regextester (Windows)
https://sourceforge.net/projects/regexcreator (Windows, Linux, Mac, BSD, ChromeOS)
https://sourceforge.net/projects/regexlab-net (Windows)

boof (OP) double-posted this 1 week ago, 1 day later, 1 day after the original post[^] [v] #1,420,782

Here is more detail about each of the listed metacharacters from earlier:

^
The caret symbol indicates the beginning of a string or line of text. For instance, ^abc would locate the start of this text if it began a line: abcdef, while it would not match anything in a line whose text began as: ghabc.

[^]
The caret as the first character within square brackets indicates any other character (including a space) than those listed afterward. For example, [^589hk] would match any character other than 5, 8, 9, h, and k. In the text 35hmp, the matches would be the 3, m, and p. Also, this[^17puv]that matches thismthat and this9that but not this1that or thisuthat.

$
The dollar sign indicates the end of a string or line of text. For instance, flop$ would locate the "flop" part of this text if it was at the end of a line: 34-vertflop, while it would not match anything in a line whose text ended with: moflopel844.
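
If you'd rather check these on your own machine than take my word for it, here is a minimal Python sketch of the three examples above. Python's re module is only a stand-in here (the extension itself is not written in Python), and the variable names are made up for illustration.

```python
import re

# ^ anchors the pattern to the start of the text
starts_with_abc = re.search(r"^abc", "abcdef") is not None   # abc begins the line
abc_not_at_start = re.search(r"^abc", "ghabc") is not None   # abc is there, but not at the start

# [^589hk] matches any single character that is NOT 5, 8, 9, h or k
negated_hits = re.findall(r"[^589hk]", "35hmp")

# $ anchors the pattern to the end of the text
ends_with_flop = re.search(r"flop$", "34-vertflop") is not None
flop_not_at_end = re.search(r"flop$", "moflopel844") is not None

print(starts_with_abc, abc_not_at_start)   # True False
print(negated_hits)                        # ['3', 'm', 'p']
print(ends_with_flop, flop_not_at_end)     # True False
```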

boof (OP) triple-posted this 1 week ago, 1 day later, 3 days after the original post[^] [v] #1,420,941

.
The dot is the wildcard character, and indicates any character (including a space) except for newline (anything that serves as a line break), which is a non-displaying signal to continue text on the next line. Various implementations of regular expressions can be set to a mode whereby the dot can also match a newline character. So, k.t can match k3t, kpt, k#t, k t, and so on.

\
The backslash is the escape character, meaning that it reverts metacharacters to their literal use. For instance, \. is for matching the dot, so that it is not taken as a wildcard character. For example, youtube\.com matches youtube.com, and not, say, youtubexcom (which the unescaped pattern youtube.com would also match). Within square brackets, . is taken as a literal dot, and the backslash is not necessary (nor does it change that usage, so . and \. will both be understood as a literal . there). Patterns that will be used later in these notes include escaping the question mark and the forward slash, i.e. \? and \/. The backslash can also convert ordinarily literal characters to some special meaning. In these notes, \n will be used to indicate an unseen newline character.

()
The round brackets (parentheses) cause their contents to be considered as a group. One reason to do that would be to be able to reference a captured group later in a pattern. Another reason would be for application of quantifier metacharacters, as described next in these notes.
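
Here is a rough Python sketch of the dot, the backslash, and a captured group that is referenced later in the same pattern. The sample strings such as ab-ab are my own, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# The dot matches any single character except a newline
candidates = ["k3t", "kpt", "k#t", "k t", "k\nt", "kt"]
dot_hits = [c for c in candidates if re.fullmatch(r"k.t", c)]

# In "dot matches newline" mode (re.DOTALL in Python) the newline is accepted too
dot_hits_newline_mode = re.fullmatch(r"k.t", "k\nt", re.DOTALL) is not None

# Escaping the dot makes it literal
escaped_ok = re.fullmatch(r"youtube\.com", "youtube.com") is not None
escaped_rejects = re.fullmatch(r"youtube\.com", "youtubexcom") is None
unescaped_accepts = re.fullmatch(r"youtube.com", "youtubexcom") is not None

# Inside square brackets the dot is already literal
bracket_dot = [c for c in ["a.b", "axb"] if re.fullmatch(r"a[.]b", c)]

# \1 refers back to whatever the first group captured
backref_hits = [c for c in ["ab-ab", "ab-cd"] if re.fullmatch(r"(ab|cd)-\1", c)]

print(dot_hits)               # ['k3t', 'kpt', 'k#t', 'k t']
print(dot_hits_newline_mode)  # True
print(bracket_dot)            # ['a.b']
print(backref_hits)           # ['ab-ab']
```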

boof (OP) quadruple-posted this 6 days ago, 23 hours later, 4 days after the original post[^] [v] #1,421,119

?
The question mark indicates that the prior character or captured group can be at that place in the pattern either zero or one time. For instance, be?t matches bt and bet. Likewise, art(perk)?ow matches artow and artperkow. The pattern 122[^8m]? matches all of 122 and 122R, but in 1228 or 122m it matches only the 122 part. Also, .?%2 matches %2 and h%2, among other possibilities.

+
The plus sign indicates that the prior character or captured group can be at that place in the pattern at least once. For instance, ca+t matches cat, caat, caaat, and so on. Likewise, 124(-01)+ matches 124-01, 124-01-01, and so on. Also, [^aeiou]+43 matches c43, 6843, and y7k43, but not e43 or ia43. The pattern v3#.+ matches v3#9, v3#e0, and v3#11p, among others.

{2,}
The curly brackets (braces) with a 2 and then a comma between them indicates that the prior character or captured group can be at that place in the pattern two or more times. For instance, wap{2,}4 matches wapp4, wappp4, and so on. Likewise, (000-){2,}500 matches 000-000-500 and 000-000-000-500, and so on. The pattern EE[^uvUV]{2,}NN matches EE7fNN and EEWMiNN, but not EENN or EEUVrrNN, among others. Also, m.{2,}9 matches m3g9, mWPe9, m10fA9, and so forth.

*
The asterisk indicates that the prior character or captured group can be at that place in the pattern zero or more times. For instance, j4* matches j, j4, j44, and so on. Likewise, (rat)*32 matches 32, rat32, ratrat32, and so on. Also, G-[^@#&]* matches all of G-, G-B, and G-8c, but it does not match all of G-@2, G-#m, or G-4&k (the match stops just before the @, #, or &). The pattern .*bun matches bun, Pbun, #xbun, and so on.
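
The four quantifiers can be tried out with a short Python sketch like the one below. The helper whole_matches is a made-up name, and it uses re.fullmatch so that a candidate only counts when the entire string fits the pattern; re.search, by contrast, is satisfied with a match anywhere inside the text.

```python
import re

def whole_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

zero_or_one = whole_matches(r"be?t", ["bt", "bet", "beet"])
one_or_more = whole_matches(r"ca+t", ["ct", "cat", "caaat"])
two_or_more = whole_matches(r"wap{2,}4", ["wap4", "wapp4", "wappp4"])
zero_or_more = whole_matches(r"(rat)*32", ["32", "rat32", "ratrat32", "ra32"])
optional_negated = whole_matches(r"122[^8m]?", ["122", "122R", "1228", "122m"])

# re.search still finds the 122 part inside 1228, just not the trailing 8
partial = re.search(r"122[^8m]?", "1228").group()

print(zero_or_one)       # ['bt', 'bet']
print(one_or_more)       # ['cat', 'caaat']
print(two_or_more)       # ['wapp4', 'wappp4']
print(zero_or_more)      # ['32', 'rat32', 'ratrat32']
print(optional_negated)  # ['122', '122R']
print(partial)           # 122
```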

(Edited 4 minutes later.)

boof (OP) quintuple-posted this 6 days ago, 6 minutes later, 4 days after the original post[^] [v] #1,421,120

christ, no amount of editing within 3 minutes was enough to stop the fuckups in the previous reply

boof (OP) sextuple-posted this 5 days ago, 23 hours later, 4 days after the original post[^] [v] #1,421,270

OK, replacing the problematic at symbol with a percent, here's the sentence from the ? paragraph: Also, .?%2 matches %2 and h%2 among other possibilities.

(|)
The pipe within round brackets (parentheses) is for placing two alternative patterns to match, one on each side of the pipe. The pipe is effectively a logical OR operator. More than one pipe can be used so that three or more alternatives can be listed. For example, (go|stop) matches go and stop. Likewise, (one|two|three) matches one, two, and three. Also, h(a|i|o|u)t matches hat, hit, hot, and hut. The pattern b(3|54)+ matches b3, b54, b33, b354, b35454, b54354, b54333 and so on.

(?=)
The question mark and equals sign within round brackets (parentheses) is for performing a positive lookahead, which requires that whatever appears to the right of a position fits the lookahead's pattern, but without including that text in the match. For example, #43(?=.*%) when applied to #43E78k%556p matches only the #43 part of that text, while #43(?=.*%).*p matches the entire #43E78k%556p text because of the wildcard and letter p that are part of the search pattern after the lookahead.

(?!)
The question mark and exclamation symbol within round brackets (parentheses) is for performing a negative lookahead, which requires that whatever appears to the right of a position does not fit the lookahead's pattern, again without including that text in the match. For example, nab(?!56) when applied to nab1154YC matches only the nab part of the text (and would match nothing in nab5611), while nab(?!56).{2,}YC matches the entire nab1154YC text because of the wildcard and letters YC that are part of the search pattern.
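
A small Python sketch of alternation and the two lookaheads, using the examples from this post plus a couple of extra strings of my own (het, b45, #43E78k556p, nab5611) to show what gets rejected:

```python
import re

# Alternation: any one of the listed alternatives
vowel_hits = [w for w in ["hat", "hit", "het", "hut"] if re.fullmatch(r"h(a|i|o|u)t", w)]
repeat_hits = [w for w in ["b3", "b54", "b354", "b45", "b"] if re.fullmatch(r"b(3|54)+", w)]

# Positive lookahead: a % must appear somewhere later, but it is not part of the match
pos_short = re.search(r"#43(?=.*%)", "#43E78k%556p").group()
pos_long = re.search(r"#43(?=.*%).*p", "#43E78k%556p").group()
pos_missing = re.search(r"#43(?=.*%)", "#43E78k556p")   # no % anywhere, so no match

# Negative lookahead: 56 must NOT come directly after nab
neg_short = re.search(r"nab(?!56)", "nab1154YC").group()
neg_long = re.search(r"nab(?!56).{2,}YC", "nab1154YC").group()
neg_blocked = re.search(r"nab(?!56)", "nab5611")        # 56 follows nab, so no match

print(vowel_hits)             # ['hat', 'hit', 'hut']
print(repeat_hits)            # ['b3', 'b54', 'b354']
print(pos_short, pos_long)    # #43 #43E78k%556p
print(neg_short, neg_long)    # nab nab1154YC
print(pos_missing, neg_blocked)  # None None
```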

Oatmeal Fucker !BYUc1TwJMU replied with this 5 days ago, 7 minutes later, 4 days after the original post[^] [v] #1,421,271

@previous (boof)

This is all too technical for me. When can I look at the cartoon porno?

boof (OP) replied with this 4 days ago, 23 hours later, 5 days after the original post[^] [v] #1,421,527

(?i)
The question mark and letter i within round brackets (parentheses) is for activating the case-insensitive mode for matching letters. In that mode, unlike the default case-sensitive mode, lowercase and uppercase versions of the same letters are not distinguished from each other. For example, (?i)f8-ep matches f8-ep, F8-EP, f8-EP, f8-eP, f8-Ep, F8-ep, F8-Ep, and F8-eP.

(?-i)
The question mark, hyphen, and letter i within round brackets (parentheses) is for deactivating the case-insensitive mode for matching letters. The text between (?i) and (?-i) matches regardless of case, while letter text outside of those modifiers has to match whatever case is used in the pattern. For example, (?i)k(?-i)ace matches kace and Kace, but not kAce for instance.
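
A caveat if you test these two in Python rather than on a PCRE-style tester: as far as I know, Python's re module accepts (?i) only at the very start of a pattern and does not take a bare (?-i) partway through. The sketch below therefore uses the scoped form (?i:...) as a stand-in for the (?i)k(?-i)ace example; that substitution is mine and is not something the extension requires.

```python
import re

# (?i) at the start: the whole pattern ignores case
variants = ["f8-ep", "F8-EP", "f8-Ep", "g8-ep"]
insensitive_hits = [v for v in variants if re.fullmatch(r"(?i)f8-ep", v)]

# Scoped form: only the k ignores case, while ace must be lowercase
scoped_hits = [v for v in ["kace", "Kace", "kAce"] if re.fullmatch(r"(?i:k)ace", v)]

print(insensitive_hits)  # ['f8-ep', 'F8-EP', 'f8-Ep']
print(scoped_hits)       # ['kace', 'Kace']
```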

boof (OP) double-posted this 3 days ago, 23 hours later, 6 days after the original post[^] [v] #1,421,639

patterns for matching a specific domain:

Here is some detail on using regular expressions to specify URL patterns to match for deletion:
The regular expression URL patterns require the start-of-line character ^ and the end-of-line character $. Dots are used to indicate a wildcard, and following a dot with an asterisk indicates that there can be 0 or more possible characters. Different pages at websites extend beyond the main part of the URL (the domain), and so a dot with an asterisk should appear after the domain in the pattern to match all such pages' URLs. To indicate an actual dot, as in .com, the escape character \ must precede the dot. For example: ^https://apnews\.com.*$

One thing to keep in mind is that History AutoDelete Rebooted does not properly handle www. in patterns that you enter. For whatever reason, it was programmed to treat URLs starting with http://www. or https://www. as if those strings of characters were not there. For instance, https://www.youtube.com is seen by the extension as youtube.com, and therefore entering ^https://www\.youtube\.com.*$ as a pattern to match will not work. Use ^youtube\.com.*$ instead.

Another thing to keep in mind when you want to match all URLs of a specific domain is that some sites prefix the domain with a character string and a dot to indicate a subdomain. For instance, https://boards.straightdope.com is part of the site https://straightdope.com (technically, com is the top-level domain and straightdope.com is a second-level domain, but since the minimum form is straightdope.com, I mean that minimum base form when I refer to a site's domain).

boof (OP) triple-posted this 2 days ago, 1 day later, 1 week after the original post[^] [v] #1,421,706

If we want a pattern to include all subdomains of straightdope.com, then we need a wildcard indicator and a dot to the left of the domain part. To match regardless of having a subdomain or not, the wildcard and dot are put within brackets to be handled as a group, and an asterisk is placed afterward to indicate that the group could be there zero or more times. The wildcard needs to indicate any character that is not itself a dot, because the dot part of the subdomain is already specified afterward. To indicate prohibition of characters, the characters are placed following a single caret ^, all within square brackets. This prohibition pattern acts as a wildcard. The wildcard has to represent a string that is at least one character long, and to indicate that, we place a plus sign afterward. So, we would enter: ^https://([^\.]+\.)*straightdope\.com.*$

Some sites might use the prefix http://, without the letter s seen in https://. To match for both possibilities, put a ? following the s, which indicates that the s can appear zero or once. So, the pattern is now: ^https?://([^\.]+\.)*straightdope\.com.*$

The straightdope site happens to be accessible with or without a www. part to the URL (not all sites are like that though). To match for both possibilities, we have to remember how the extension does not handle https://www. in its representation of URLs, so we need to place https?:// within round brackets to indicate that it is to be treated as a group of characters, and a ? is placed afterwards to indicate that the group can occur zero or one time. The pattern is: ^(https?://)?([^\.]+\.)*straightdope\.com.*$
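
To see what each of the three steps buys you, here is a Python sketch that runs the three versions of the pattern against three forms of the URL. The sample URLs (with the trailing slash that history entries carry) and the names step1 to step3 are only for illustration, and Python is standing in for the extension's own regex engine.

```python
import re

step1 = r"^https://([^\.]+\.)*straightdope\.com.*$"        # https only
step2 = r"^https?://([^\.]+\.)*straightdope\.com.*$"       # http or https
step3 = r"^(https?://)?([^\.]+\.)*straightdope\.com.*$"    # scheme optional (the www. case)

urls = [
    "https://boards.straightdope.com/",
    "http://straightdope.com/",
    "straightdope.com/",   # how a https://www. URL is said to look to the extension
]

def hits(pattern):
    """Return one True/False per sample URL."""
    return [re.search(pattern, u) is not None for u in urls]

print(hits(step1))  # [True, False, False]
print(hits(step2))  # [True, True, False]
print(hits(step3))  # [True, True, True]
```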

boof (OP) quadruple-posted this 1 day ago, 1 day later, 1 week after the original post[^] [v] #1,421,778

Note that we would not try to simplify the pattern to ^.*straightdope\.com.*$, because that would allow other domains that happen to end with the same characters, e.g. catstraightdope.com.
A good place to test regular expression patterns is https://regex101.com. When using regex101.com to test patterns with lists of URLs that are each on a separate line, the negated character class must also exclude the newline character \n, because otherwise we could get false matches that span more than one line. Also, we need to escape each / character with a backslash. History AutoDelete Rebooted does not have those requirements. Copy and paste ^(https?:\/\/)?([^\.\n]+\.)*straightdope\.com.*$ into the area titled Regular Expression.

Copy and paste these lines into the Test String area:
https://boards.straightdope.com
https://straightdope.com
https://abcstraightdope.com
https://abc.defstraightdope.com
http://abc.def.straightdope.com
straightdope.com
defstraightdope.com
abc.def.straightdope.com

The lines that do not have https:// represent URLs that show https://www. when seen in the address bar, but are not handled by History AutoDelete Rebooted that way. The regex101 page should show the first, second, fifth, sixth, and eighth lines with a highlighted background to indicate that they match the given pattern.
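
The same test can be reproduced offline with a short Python sketch. Here re.MULTILINE plays the role of regex101's multiline flag, so that ^ and $ apply to each line of the test string; the variable names are made up.

```python
import re

pattern = re.compile(r"^(https?:\/\/)?([^\.\n]+\.)*straightdope\.com.*$", re.MULTILINE)

lines = [
    "https://boards.straightdope.com",
    "https://straightdope.com",
    "https://abcstraightdope.com",
    "https://abc.defstraightdope.com",
    "http://abc.def.straightdope.com",
    "straightdope.com",
    "defstraightdope.com",
    "abc.def.straightdope.com",
]
test_string = "\n".join(lines)

# Collect every match found in the whole block of text, then map back to line numbers
matched_text = {m.group(0) for m in pattern.finditer(test_string)}
matched_line_numbers = [n for n, line in enumerate(lines, start=1) if line in matched_text]

print(matched_line_numbers)  # [1, 2, 5, 6, 8]
```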

Here is a general pattern to match the URLs that do not have a www. part:
^https?://.*$

Here is a general pattern to match the URLs with www. (and so appear to History AutoDelete Rebooted as having no https://www. as described earlier in these notes):
^(?!https?://).*$
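
Here is a rough Python sketch of how those two general patterns split URLs. The function as_seen_by_extension is hypothetical: it only imitates the behavior described in these notes (dropping a leading http://www. or https://www.) and is not taken from the extension's source code.

```python
import re

def as_seen_by_extension(url):
    """ASSUMPTION: imitate the extension dropping a leading http(s)://www. prefix."""
    return re.sub(r"^https?://www\.", "", url)

without_www = r"^https?://.*$"
with_www = r"^(?!https?://).*$"

seen_plain = as_seen_by_extension("https://apnews.com/")
seen_www = as_seen_by_extension("https://www.youtube.com/")

print(seen_plain)  # https://apnews.com/
print(seen_www)    # youtube.com/

print(re.search(without_www, seen_plain) is not None)  # True
print(re.search(without_www, seen_www) is not None)    # False
print(re.search(with_www, seen_www) is not None)       # True
print(re.search(with_www, seen_plain) is not None)     # False
```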

boof (OP) quintuple-posted this 1 hour ago, 1 day later, 1 week after the original post[^] [v] #1,421,933

If trying for an exact match of a URL that does not continue beyond the domain part, then keep in mind that the browser history holds a forward slash / afterwards, even though the address bar appears to lack it. If you copy https://www.youtube.com seen in the address bar and paste elsewhere, you will see https://www.youtube.com/, for instance. So, ^youtube\.com$ will not work to match the URL, as ^youtube\.com/$ is required. If you'd like to be sure about matching regardless of / there or not, use a question mark ? afterward: ^youtube\.com/?$
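
The trailing slash point is easy to check with a few lines of Python. The string youtube.com/ below is what the history entry is assumed to look like to the extension, going by the description in these notes.

```python
import re

stored = "youtube.com/"   # assumed form of https://www.youtube.com/ as the extension sees it

no_slash = re.search(r"^youtube\.com$", stored) is not None
with_slash = re.search(r"^youtube\.com/$", stored) is not None
optional_slash = [s for s in ["youtube.com/", "youtube.com"] if re.search(r"^youtube\.com/?$", s)]

print(no_slash)        # False
print(with_slash)      # True
print(optional_slash)  # ['youtube.com/', 'youtube.com']
```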

Here are typical patterns to match for deletion, using the youtube.com site as an example:
match URLs that are based upon https://www.youtube.com:
^youtube\.com/.*$

match URLs that are not based upon https://www.youtube.com:
^(?!youtube\.com/.*).*$

match URLs that are based upon https://www.youtube.com that go beyond the .com part:
^youtube\.com/.+$

match URLs that have the form https://www.youtube.com/watch?v=[whatever]
(e.g. https://www.youtube.com/watch?v=7Qqmr6IiFLE):
^youtube\.com/watch\?v=.+$

match URLs that are based upon https://www.youtube.com, except for those that have the form https://www.youtube.com/watch?v=[whatever]:
^youtube\.com(?!/watch\?v=.+).*$

match URLs that do not have the form https://www.youtube.com/watch?v=[whatever], regardless of being based upon youtube.com:
^(?!youtube\.com/watch\?v=.+).*$

The first of the six patterns above will make History AutoDelete Rebooted blacklist all URLs with youtube.com as the base domain. The second pattern effectively blacklists all URLs except for those that have youtube.com as the base domain, effectively whitelisting that site. The third pattern blacklists all URLs that go beyond the .com part of youtube.com. The fourth pattern selectively blacklists only those URLs that are like https://www.youtube.com/watch?v=7Qqmr6IiFLE, leaving other URLs based upon youtube.com alone. The fifth pattern blacklists all URLs based upon youtube.com except for those that are like https://www.youtube.com/watch?v=7Qqmr6IiFLE. The sixth pattern is like the fifth pattern, but it also effectively blacklists all other URL domains regardless of being based upon youtube.com. The effect is to whitelist the https://www.youtube.com/watch?v=[whatever] form specifically and blacklist everything else.
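
To compare the six patterns side by side, here is a Python sketch that runs each of them against four sample history entries. The samples are written the way the extension is described as seeing them (no https://www. on the youtube.com entries); the feed/subscriptions and apnews.com entries are just illustrative choices of mine.

```python
import re

patterns = {
    1: r"^youtube\.com/.*$",
    2: r"^(?!youtube\.com/.*).*$",
    3: r"^youtube\.com/.+$",
    4: r"^youtube\.com/watch\?v=.+$",
    5: r"^youtube\.com(?!/watch\?v=.+).*$",
    6: r"^(?!youtube\.com/watch\?v=.+).*$",
}

samples = [
    "youtube.com/",
    "youtube.com/watch?v=7Qqmr6IiFLE",
    "youtube.com/feed/subscriptions",
    "https://apnews.com/",
]

# For each pattern, list the sample entries it would select for deletion
selected = {
    number: [s for s in samples if re.search(pattern, s)]
    for number, pattern in patterns.items()
}

for number, urls in selected.items():
    print(number, urls)
```

Running it shows, for instance, that pattern 4 selects only the watch?v= entry, while pattern 6 selects everything except that entry.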