
Using regular expressions to search in MacCafé

Last update:  2024/09/06 


1. Introduction

Regular expressions, abbreviated as RegEx (or RegExp), are sequences of characters that describe a search pattern in a character string. Their highly precise syntax makes it possible to develop complex search models and obtain more refined results than with a standard search tool.

From version 3.03 of MacCafé, it is possible to use RegEx to build message filters and to search the database. You will find, in chapter 4.12 as well as in Appendix 10 of the MacCafé documentation, some examples of RegEx that can be used to filter messages.

This tutorial focuses on the use of RegEx in the Search window. The examples given here are intended for beginners and will only introduce them to a tiny part of what RegEx can do. They are based on the reference document cited in the documentation for the 4D language with which MacCafé is programmed:

https://unicode-org.github.io/icu/userguide/strings/regexp.html#regular-expressions

In this tutorial, RegEx and their components are highlighted in orange.

Matches found by a RegEx are highlighted in green, as in MacCafé.

There are many different implementations of RegEx, each more or less compatible with the others. The 4D Match regex() command integrated into MacCafé uses a syntax similar to PCRE (Perl Compatible Regular Expressions) but with some differences. It's a good idea to be aware of them if you're building a RegEx in MacCafé that you later want to use in another environment. In this tutorial, the description of these particularities will appear in red like this paragraph.


2. Character escape

In the construction of a RegEx, certain characters play a special role. For example, the dot . matches any character, while the star * is a repetition operator.

If you need to insert these characters as part of a string, you must precede them with the ‘escape character’ \, like this: \. or \*.

Here are the characters used in the RegEx syntax:

 * ? + [ ( ) { } ^ $ | \ . 

The following characters must also be ‘escaped’ if they are enclosed in square brackets [] :

 [ ] \  and, depending on the context,  - & 

Examples:

To search for the string [MacCafé], we'll use the RegEx \[MacCafé]

To search for the string (etc.), we'll use the RegEx \(etc\.\)
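As a rough illustration outside MacCafé, the two examples above can be tried with Python's re module. This is only a sketch: Python's engine is not the ICU engine used by 4D, even though it handles these escapes in the same way, and the sample texts are invented for the occasion.

```python
import re

# The two RegEx from the examples above, written as raw strings.
bracket_pattern = re.compile(r"\[MacCafé]")
etc_pattern = re.compile(r"\(etc\.\)")

# Hypothetical sample texts, not taken from a real message.
found_bracket = bracket_pattern.search("Subject: [MacCafé] new version")
found_etc = etc_pattern.search("filters, searches (etc.)")

# Without the backslash, the dot stands for any character.
unescaped_dot = re.search(r"\(etc.\)", "(etcX)")

# re.escape() adds the backslashes for you
# (a Python helper, not a MacCafé feature).
escaped = re.escape("(etc.)")

print(found_bracket.group(0))
print(found_etc.group(0))
print(unescaped_dot.group(0))
print(escaped)
```

The third search shows why the escape matters: with an unescaped dot, (etcX) is accepted as well.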


3. Make a RegEx case insensitive

By default, RegEx are case-sensitive. The RegEx \[MacCafé] seen in the previous chapter will find [MacCafé], but not [maccafé] or [MACCAFÉ].

To obtain case insensitivity, use the option i (insensitive) by placing the element (?i) before the string to be searched for:

(?i)\[MacCafé]

It is possible to make only part of the RegEx case insensitive. For example:

Mac(?i)SOUP will find Macsoup, but not macsoup.

The option can also be deactivated for part of the RegEx using the negation (?-i).

The RegEx (?i)MacSOUP or (?-i)MacCafé will match the string macsoup or MacCafé, but not the string macsoup or maccafé.
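The behaviour described in this chapter can be sketched with Python's re module, with one adaptation that is ours and not MacCafé's: Python 3.11 and later refuses an option placed in the middle of a RegEx, so the scoped forms (?i:...) and (?-i:...) are used instead of Mac(?i)SOUP and (?-i)MacCafé. The sample strings are made up.

```python
import re

# (?i) placed at the very beginning: the whole RegEx ignores case.
whole = re.compile(r"(?i)\[MacCafé]")

# Scoped form: only SOUP ignores case, Mac must be typed exactly.
partial = re.compile(r"Mac(?i:SOUP)")

# (?-i:...) switches the option off again, for MacCafé only.
mixed = re.compile(r"(?i)MacSOUP or (?-i:MacCafé)")

checks = [
    (whole, "[MACCAFÉ]"),
    (partial, "Macsoup"),
    (partial, "macsoup"),
    (mixed, "macsoup or MacCafé"),
    (mixed, "macsoup or maccafé"),
]

# For each pair, record whether the RegEx finds a match in the sample.
results = [bool(pattern.search(sample)) for pattern, sample in checks]

for (pattern, sample), found in zip(checks, results):
    print(pattern.pattern, "|", sample, "|", found)
```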


4. Search for a string with alternative characters

Using the RegEx (?i)maccafé, we'll find MacCafé, MACCAFÉ, maccafé, etc. but not MacCafe, MACCAFE or maccafe. By modifying the RegEx as follows:

(?i)maccaf[ée]

we can find all the matches in the string with or without the accent.

Note

[ée] is a character class limited to two elements, but it is possible to build more complex classes. To find out more about character classes, see chapter 12.
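A quick way to convince yourself, sketched here with Python's re module (an assumption on our part, since MacCafé itself relies on 4D): the list of sample spellings is invented, and the last one is there to show a string that must not match.

```python
import re

pattern = re.compile(r"(?i)maccaf[ée]")

# Hypothetical spellings; the last one should be rejected.
samples = ["MacCafé", "MACCAFÉ", "maccafe", "MacCafe", "MacCoffee"]

# Keep only the spellings in which the RegEx finds a match.
matched = [s for s in samples if pattern.search(s)]
print(matched)
```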


5. Search for alternative strings

To search for one string OR another, use the vertical bar |.

red|yellow may match one of the following strings:

red
yellow

a (red|yellow|green) apple may match one of the following strings:

a red apple
a yellow apple
a green apple

Note

In the example above, we've created a group by putting red|yellow|green in brackets. Without the brackets, the result would be significantly different:

a red|yellow|green apple could only match one of the following strings:

a red
yellow
green apple
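The difference made by the group can be checked with a small sketch in Python's re module (used here as a stand-in for the 4D engine). fullmatch() is a Python function that only accepts a candidate if the RegEx covers it entirely, which makes the comparison easy to read.

```python
import re

grouped = re.compile(r"a (red|yellow|green) apple")
ungrouped = re.compile(r"a red|yellow|green apple")

candidates = [
    "a red apple", "a yellow apple", "a green apple",
    "a red", "yellow", "green apple",
]

# fullmatch() requires the RegEx to cover the whole candidate.
grouped_hits = [c for c in candidates if grouped.fullmatch(c)]
ungrouped_hits = [c for c in candidates if ungrouped.fullmatch(c)]

print(grouped_hits)
print(ungrouped_hits)
```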


6. Delimit a word or series of words

\b is used to define the limits (beginning and/or end) of a word or series of words, by detecting what separates this word or series of words from the rest of the text (beginning or end of line, space, hyphen, punctuation mark, etc.).

\bget does not accept prefixes but does accept suffixes.

Possible matches                  No match
What you see is what you get.     I forget.
A getaway.                        This is Bridget.
It's getting better.              unforgettable

get\b does not accept suffixes but does accept prefixes.

Possible matches                  No match
What you see is what you get.     It's getting worse.
I forget.                         She forgets.
This is Bridget.                  unforgettable

\bget\b accepts neither prefixes nor suffixes. The only possible match is the word get.

What you see is what you get.

\bpoint of view\b will exactly match the string point of view.

Note

\B is the negation of \b. For example:

\Bget\B will only find a match if get is surrounded by a prefix and a suffix, as in:

unforgettable

targeting

• Not only letters, but also numbers and the underscore _ are considered to be part of a word. For example, in the strings _get and 123get, the substrings _ and 123 will be seen as prefixes to the string get and not as word separators.

• A word character can be found with \w, a word separator with \W.
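The four combinations of \b and \B seen in this chapter can be compared side by side. The sketch below uses Python's re module, whose \b and \B follow the same idea as in 4D for these simple cases (an assumption worth checking for more exotic text); the helper hits() is ours.

```python
import re

def hits(pattern, texts):
    """Return the texts in which the pattern is found at least once."""
    return [t for t in texts if re.search(pattern, t)]

texts = [
    "What you see is what you get.",
    "A getaway.",
    "I forget.",
    "unforgettable",
    "123get",
]

starts = hits(r"\bget", texts)    # no prefix allowed
ends = hits(r"get\b", texts)      # no suffix allowed
whole = hits(r"\bget\b", texts)   # neither prefix nor suffix
inside = hits(r"\Bget\B", texts)  # prefix and suffix both required

for name, found in [("starts", starts), ("ends", ends),
                    ("whole", whole), ("inside", inside)]:
    print(name, found)
```

Note that 123get is rejected by \bget, because the digits count as a prefix, exactly as explained in the note above.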


7. Search for a string that must be located at the beginning/end of a line

In this chapter and the next, we are going to use two elements whose purpose is to delimit all or part of the text:

• The ^ element designates the start of either a line or the entire text.

• The $ element designates the end of either a line or the entire text.

The m (multiline) option, introduced by the element (?m) placed at the beginning of the RegEx, forces these ^ and $ elements to treat the text as a series of lines and not as a set without line breaks.

When this option is active, ^ designates the beginning of a line, while $ designates the end of a line.

(?m)^MacCafé will match the first occurrence of the string MacCafé placed at the beginning of a line.

(?m)MacCafé$ will match the string MacCafé placed at the end of a line.

Tip

If you want to take account of any punctuation after the string, you can use this RegEx:

(?m)MacCafé[\p{P}\p{Zs}]*$

The character class [\p{P}\p{Zs}] offers an alternative between a punctuation mark (\p{P}) and a space (\p{Zs}). To find out more about the \p operator and Unicode properties, see chapter 12.

The * operator allows any number of characters (including 0) to follow the MacCafé string, as long as each character is a punctuation mark or a space.

This makes it possible to match the strings MacCafé or MacCafé., MacCafé..., MacCafé !, or even MacCafé !..., etc. placed at the end of a line.


8. Search for a string that must be located at the beginning/end of text

Without the m option, ^ designates the beginning of text and $ the end of text.

^MacCafé will match the string MacCafé if it is at the beginning of the text.

MacCafé[\p{P}\p{Zs}]*$ will match MacCafé (or MacCafé. , MacCafé... etc.) placed at the end of the text.
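To compare chapters 7 and 8, here is a sketch in Python's re module. Two things in it are assumptions of ours: the three-line sample text is invented, and the class [\p{P}\p{Zs}] is replaced by the much cruder [.!? ], because Python's standard re module does not know the \p operator.

```python
import re

# Hypothetical three-line text.
text = "I use MacCafé\nMacCafé is nice\nTry MacCafé !..."

# With (?m), ^ and $ work line by line.
line_start = re.search(r"(?m)^MacCafé", text)
line_ends = re.findall(r"(?m)MacCafé[.!? ]*$", text)

# Without (?m), they only look at the two ends of the whole text.
text_start = re.search(r"^MacCafé", text)
text_end = re.search(r"MacCafé[.!? ]*$", text)

print(line_start.start())   # position of the match inside the text
print(line_ends)
print(text_start)
print(text_end.group(0))
```

With the m option, the first search succeeds on the second line; without it, ^MacCafé fails because the text as a whole begins with I use.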


9. Search for a portion of text containing two distinct strings

We're going to use two elements here:

- The dot . which designates any character, except line break characters.

- The star * which, as we have seen, multiplies the preceding character (or group of characters) between 0 times and any number of times.

Note the existence of another operator, +, which repeats the preceding element between 1 time and any number of times.

The combination .* therefore designates any number of any characters (other than line breaks).

MacCafé.*MacSOUP will match a portion of text beginning with MacCafé and ending with MacSOUP, but only if this portion of text is entirely included in the same line of text. For example:

I like MacCafé as well as I liked MacSOUP!

To remove this limitation and allow the searched string to straddle several lines, the s (single line) option is used. This has the effect of modifying the meaning of the dot so that it designates any character, including line breaks. The entire text will therefore be considered by the RegEx as a single line.

(?s)MacCafé.*MacSOUP will thus match a portion of text beginning with MacCafé and ending with MacSOUP, even if the two strings are on separate lines. Example:

I now use MacCafé and I like it just as much↵
as I enjoyed MacSOUP
a few years ago.

Tip

To match MacCafé[…]MacSOUP as well as MacSOUP[…]MacCafé, use this alternative:

(?s)MacCafé.*MacSOUP|MacSOUP.*MacCafé
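The effect of the s option can be reproduced with Python's re module, used here only as an approximation of the 4D engine. The three sample texts are hypothetical.

```python
import re

one_line = "I like MacCafé as well as I liked MacSOUP!"
two_lines = "I now use MacCafé and I like it\nas much as MacSOUP."
reversed_order = "After MacSOUP\nI moved to MacCafé."

plain = re.compile(r"MacCafé.*MacSOUP")
dotall = re.compile(r"(?s)MacCafé.*MacSOUP")
either = re.compile(r"(?s)MacCafé.*MacSOUP|MacSOUP.*MacCafé")

# The dot stops at line breaks unless (?s) is present.
same_line = plain.search(one_line)
across_plain = plain.search(two_lines)
across_dotall = dotall.search(two_lines)

# The alternative accepts the two strings in either order.
wrong_order = dotall.search(reversed_order)
any_order = either.search(reversed_order)

print(same_line.group(0))
print(across_plain, bool(across_dotall))
print(wrong_order, bool(any_order))
```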


10. Search for a portion of text containing n occurrences of a string

Here we will use the {n} multiplicative operator, which searches for exactly n juxtaposed occurrences of the character or group that precedes it (see chapter 12 for more information on multiplicative operators). As we are looking for several occurrences of a string and not a single character, we will need to enclose this string in brackets, i.e. create a group.

Indeed, MacCafé{3} could only match MacCafééé, which is hardly relevant.

(MacCafé){3} could only match MacCaféMacCaféMacCafé, which isn't much use to us either. So let's add the .* combination to allow any number of characters between occurrences of MacCafé:

(MacCafé.*){3}

This new RegEx will find the first three occurrences of MacCafé, only if all three are present in the same line. Example:

Mais dans le dossier ~/Bibliothèque/Application Support il y avait ↵
bien un dossier MacCafé, mais à côté, il y avait les trois fichiers ↵
de base (MacCafé.4DD, MacCafé.4Dindx et MacCafé.Match).

Two comments:

• The MacCafé occurrence seen in the second line has not been taken into account, as it is isolated in this line, whereas the RegEx looks for three in the same line.

• MacCafé's highlighting extends to the end of the third line, instead of stopping just after the third occurrence of MacCafé. We'll see below that simply adding a question mark after the star can remedy this imperfection.

Now let's add the s option:

(?s)(MacCafé.*){3}

This time, we'll find the first three occurrences of MacCafé even though they appear in separate lines.

Il faut les déplacer dans le dossier choisi une fois MacCafé quitté. ↵
C’est au prochain lancement de MacCafé que si et seulement si les ↵
fichiers de la base de donnée ne sont plus où ils étaient (déplacer ↵
n’est pas copier), MacCafé fera cette demande. ↵
En fait, ce n’est pas MacCafé lui même qui gère ce problème, mais 4D qui ↵
a besoin de savoir où sont les fichiers de la BDD pour lancer ↵
l’application. ↵

Pour le premier lancement de MacCafé, la documentation explique très ↵
bien où et où ne pas créer les fichiers de la BDD. ↵

-- ↵
Gilbert

Here again, we can see that the result is not ideal visually. MacCafé highlights the text from the first occurrence of MacCafé to the end of the text, even though the third occurrence of MacCafé is found on the fourth line. This is because, by default, the * operator is 'greedy': it counts as many characters as it can and will therefore extend the selection until it runs out of characters.

So we're going to make the * operator 'lazy', by following it with a question mark:

(?s)(MacCafé.*?){3}

Here is the result:

Il faut les déplacer dans le dossier choisi une fois MacCafé quitté. ↵
C’est au prochain lancement de MacCafé que si et seulement si les ↵
fichiers de la base de donnée ne sont plus où ils étaient (déplacer ↵
n’est pas copier), MacCafé fera cette demande. ↵
En fait, ce n’est pas MacCafé lui même qui gère ce problème, mais 4D qui ↵
a besoin de savoir où sont les fichiers de la BDD pour lancer ↵
l’application. ↵

Pour le premier lancement de MacCafé, la documentation explique très ↵
bien où et où ne pas créer les fichiers de la BDD. ↵

-- ↵
Gilbert

This time, the highlighting starts at the first occurrence of MacCafé and ends just after its third occurrence, because *? now counts the smallest possible number of characters, i.e. 0 if there are no more occurrences to be found.
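The difference between the greedy and the lazy version is easy to measure on a short text. The sketch below uses Python's re module (a substitute for the 4D engine, so timings and limits will differ) and an invented five-line text containing four occurrences of MacCafé.

```python
import re

# Hypothetical text: four occurrences, one per line.
text = "one MacCafé\ntwo MacCafé\nthree MacCafé\nfour MacCafé\nend"

greedy = re.search(r"(?s)(MacCafé.*){3}", text)
lazy = re.search(r"(?s)(MacCafé.*?){3}", text)

# Greedy: the selection runs to the very end of the text.
print(greedy.end() == len(text))

# Lazy: the selection stops right after the third occurrence.
print(repr(lazy.group(0)))
```

Both searches start at the first occurrence; only the end of the selection changes.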

Important notes

• If you don't particularly need a search to be greedy, we recommend that you only do lazy searches by always adding a ? after a * or + operator. Greedy searches are generally much less efficient, and may even take so long that MacCafé has to interrupt them: it will then display an alert indicating that the search process is blocked.

• Our RegEx contains two nested multiplicative operators: * and {n}. This can cause problems if the number n is large and if your database contains very long messages. In such a case, the RegEx may have to go back and forth a lot in the text, which is likely to lead to a MacCafé freeze from which you can only get out with a ‘Force Quit...’ Some applications can detect when a RegEx is in danger of ‘going round in circles’, but this is unfortunately not the case with 4D. So it's better not to tempt the devil and not to increase the value of the number n too much.

Tip

If you need to use several options in the same RegEx, you can group them together. For example:

(?msi) is equivalent to (?m)(?s)(?i)

(?si)(maccaf[ée].*?){3} is an improved version of the previous RegEx, taking into account all the forms of MacCafé.
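Here is the improved RegEx at work in a sketch written with Python's re module (our choice of tool, not MacCafé's). The sample text is made up and mixes three different spellings on three lines.

```python
import re

# Hypothetical text with three spellings on three lines.
text = "MACCAFE first\nthen maccafé\nand finally MacCafé again"

# Options s and i grouped in a single element.
combined = re.search(r"(?si)(maccaf[ée].*?){3}", text)

# The previous, case-sensitive RegEx only sees one occurrence here.
case_sensitive = re.search(r"(?s)(MacCafé.*?){3}", text)

print(repr(combined.group(0)))
print(case_sensitive)
```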


11. Search for messages whose body does not contain a given string

Want to list all the messages from fr.comp.sys.mac.communication that don't mention MacCafé at all? Here's the RegEx you need:

(?si)^(?!.*maccaf[ée])

The s option is of course essential, so that the absence of the string is checked throughout the whole text.

^ indicates that the search must start at the beginning of the text. The group opened by (?! is a negative lookahead: it stipulates that .*maccaf[ée] must not be found from the beginning to the end of the text.

To find messages that mention neither MacCafé nor MacSOUP, use this alternative:

(?si)^(?!.*(maccaf[ée]|macsoup))

In both cases, no text will be highlighted in the body of the messages found (unless you have linked this search with a previous search).

Note

• These RegEx can of course be applied to searches in header fields. In this case, the s option will only be useful if you extend the search to the entire header (choice - All - in the list of fields), but its presence will not prevent the RegEx from working if you launch the search for a given field.

• If you test these RegEx in the RegEx help panel, logically they will only give the result TRUE if the target text entered does not contain the string(s) concerned.
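The two RegEx of this chapter can be tried on a few message bodies with Python's re module, which stands in for the 4D engine here. The function lacks() and the sample bodies are hypothetical. As in MacCafé, a successful match is empty, which is why nothing is highlighted.

```python
import re

absent = re.compile(r"(?si)^(?!.*maccaf[ée])")
neither = re.compile(r"(?si)^(?!.*(maccaf[ée]|macsoup))")

def lacks(pattern, body):
    """True when the forbidden string(s) appear nowhere in the body."""
    return pattern.search(body) is not None

soup_only = "Hello\nI use MacSOUP."
cafe_only = "Hello\nI use MACCAFE."
nothing = "Hello\nnothing to report"

print(lacks(absent, soup_only), lacks(neither, soup_only))
print(lacks(absent, cafe_only), lacks(neither, cafe_only))
print(lacks(absent, nothing), lacks(neither, nothing))

# The match itself contains no text at all.
empty_match = absent.search(nothing).group(0)
```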


12. To go further...

• Multiplicative operators

We have come across multiplicative operators in this tutorial. Olivier Miakinen was kind enough to describe these operators to us in a very comprehensible way.

{n,m} Multiplies the preceding element between n and m times.
{n} Is equivalent to {n,n} (multiplies exactly n times).
{n,} Is equivalent to {n,infinite} (multiplies between n and any number of times).
* Shortcut equivalent to {0,}.
+ Shortcut equivalent to {1,}.
? Shortcut equivalent to {0,1}.

As we saw in chapter 10, it is possible (and recommended) to add a ? after a multiplicative operator to make it ‘lazy’, which will affect the amount of text highlighted by MacCafé and, in some cases, could prevent the search from being blocked.
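The equivalences listed above can be verified mechanically. The sketch below does so with Python's re module (which agrees with 4D on these operators, as far as we know); the helper same_result() and the sample strings are ours.

```python
import re

def same_result(short, long, samples):
    """True if both RegEx accept exactly the same samples."""
    return all(
        bool(re.fullmatch(short, s)) == bool(re.fullmatch(long, s))
        for s in samples
    )

samples = ["", "a", "aa", "aaa", "aaaa", "b"]

star_ok = same_result(r"a*", r"a{0,}", samples)
plus_ok = same_result(r"a+", r"a{1,}", samples)
optional_ok = same_result(r"a?", r"a{0,1}", samples)
exact_ok = same_result(r"a{3}", r"a{3,3}", samples)

# {2,3} accepts two or three repetitions, nothing else.
between = [s for s in samples if re.fullmatch(r"a{2,3}", s)]

print(star_ok, plus_ok, optional_ok, exact_ok)
print(between)
```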


• Character classes

In chapter 4, we created a class of characters offering an alternative between two characters: [ée]. This small class represents one or other of the characters é and e.

It is of course possible to define a larger number of alternatives. For example:

[éèêëeÉÈÊËE]

We can also use the dash - to define intervals:

Class Match
[A-Z] Any uppercase letter (unaccented) from A to Z.
[0-9] Any digit from 0 to 9.
[a-f] Any character of the [abcdef] class.
[A-Za-z] Any unaccented letter, uppercase or lowercase.
[A-Za-z0-9] Any unaccented letter or any digit from 0 to 9.
[A-Fa-f0-9] Any character used in hexadecimal numbering.

If we are looking for a character that does not belong to a given class, we use the negation ^:

[^abc] Any character except a, b and c.
[^0-9] Any character that is not a digit from 0 to 9.
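A few of these classes in action, sketched with Python's re module as a stand-in for the 4D engine. The sample strings are invented.

```python
import re

# Hexadecimal class: every character must belong to it.
hex_number = re.compile(r"[A-Fa-f0-9]+")
valid_hex = hex_number.fullmatch("3FA9c0")
invalid_hex = hex_number.fullmatch("3G")

# Unaccented uppercase letters only: é is ignored, so are lowercase letters.
capitals = re.findall(r"[A-Z]", "MacCafé")

# Negated class: everything that is not a digit from 0 to 9.
non_digits = re.findall(r"[^0-9]", "a1b22c")

print(bool(valid_hex), bool(invalid_hex))
print(capitals)
print(non_digits)
```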

• Unicode properties

In chapter 7, we introduced two Unicode properties: {P} and {Zs}. These predefined classes are very useful, as they allow you to include in the RegEx a specific type of character (using \p) or to exclude it from the RegEx (using \P).

For example, \p{Lu} (Letter, uppercase) designates any uppercase letter such as A, Ç, Ê, or even Ω (uppercase letter omega in the Greek alphabet). Conversely, \P{Lu} refers to any character that is not an uppercase letter.

The following page provides an exhaustive list of Unicode classes:

https://www.unicode.org/reports/tr44/

But as it is rather complicated, we'll just look at this table for the moment:

https://www.unicode.org/reports/tr44/#GC_Values_Table

Below are the most useful Unicode classes for getting started.

Abbr. Full name Match
{Lu} {Uppercase_Letter} Uppercase letter (including, but not limited to, the letters of the Latin alphabet).
{Ll} {Lowercase_Letter} Lowercase letter (including, but not limited to, the letters of the Latin alphabet).
{L} {Letter} Any type of letter. For the Latin alphabet alone, use the {Latin} property (read note below).
{Nd} {Decimal_Number} Decimal number (including, but not limited to, the [0-9] digits; for example, ๔, the digit 4 in Thai, will match). The abbreviation \d, equivalent to \p{Nd}, also includes decimal digits that are not in [0-9].
{No} {Other_Number} Other type of numeric character (exponent, fraction, etc.)
{N} {Number} Any type of numeric character.
{P} {Punctuation} Any type of punctuation character.
{Sm} {Math_Symbol} Mathematical symbol.
{Sc} {Currency_Symbol} Currency symbol.
{So} {Other_Symbol} Other type of symbol (including Emoji). For Emoji alone, use the {Emoji} property (read note below).
{S} {Symbol} Any type of symbol.
{Zs} {Space_Separator} Any type of space.

Note

• A Unicode property can be used both inside and outside a character class. For example, the character class [\p{P}\p{Zs}] in chapter 7 could have been written as an alternative between \p{P} and \p{Zs} using the form (\p{P}|\p{Zs}).

• Although the full name of properties is accepted by 4D, this is not the case for most implementations of PCRE-type RegEx. For this reason (in addition to the fact that it is quicker to enter) it is better to use the abbreviation {Nd} instead of {Decimal_Number}, for example.

• The Match regex() 4D command does not take into account the case sensitivity of Unicode property names and considers, for example, {Lu}, {LU} and {lu} to be equivalent. However, this is a peculiarity of 4D that does not respect standards and will be rejected in other applications.

• The {Emoji} property does not seem to exist outside 4D. As for {Latin}, it is unknown in certain environments (JavaScript, Python, Java, .NET).
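To see which class a given character belongs to, you can look up its Unicode general category. The sketch below does this with Python's unicodedata module, because Python's standard re module does not accept \p{...}; it is therefore an indirect illustration of the table, not a test of the 4D engine. The characters chosen are our own examples.

```python
import re
import unicodedata

# A, Greek capital omega, a, 4, Thai digit four, one half,
# exclamation mark, plus sign, euro sign, space.
samples = ["A", "\u03a9", "a", "4", "\u0e54", "\u00bd",
           "!", "+", "\u20ac", " "]

# The two-letter code returned is the abbreviation used in the table
# (Po is one of the subcategories grouped under P).
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, code in categories.items():
    print(repr(ch), code)

# In Python too, \d goes beyond [0-9] and accepts the Thai digit.
thai_digit_is_d = re.fullmatch(r"\d", "\u0e54") is not None
thai_digit_in_ascii_class = re.fullmatch(r"[0-9]", "\u0e54") is not None
print(thai_digit_is_d, thai_digit_in_ascii_class)
```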


• Useful references

To conclude, here are a few references on RegEx, which you can also find in Appendix 3 of the MacCafé documentation.

References already mentioned in this tutorial

○ The document cited in 4D documentation:

https://unicode-org.github.io/icu/userguide/strings/regexp.html#regular-expressions

○ The table of Unicode character classes:

https://www.unicode.org/reports/tr44/#GC_Values_Table

Other references

○ The pages of the PHP manual devoted to RegEx:

https://www.php.net/manual/en/book.pcre.php

https://www.php.net/manual/en/reference.pcre.pattern.syntax.php

○ A website to build and test RegEx:

https://regex101.com/

To obtain results on this page that are as close as possible to those of MacCafé, leave the default RegEx type PCRE2 checked and click on / gm to the right of the REGULAR EXPRESSION field (REGEX FLAGS menu), then click on unicode to check this option.

Options insensitive (see chapter 3), multi line (see chapter 7) and single line (see chapter 9) can be activated in two ways:

- either by checking them in the REGEX FLAGS menu mentioned above,

- or by entering them in your RegEx as you would do in MacCafé: (?i), (?m) or (?s).

Please note: the option only needs to be activated once, either in the menu or in your RegEx, for it to be applied.

The REGEX FLAGS menu includes a global option (active by default). This allows RegEx to find all matches in the entire text, instead of stopping at the first match found.

The global option is not available in 4D (do not try to add (?g) to your RegEx in MacCafé, you'll just get an error message). To make up for this shortcoming, MacCafé offers a Highlight all matches option, visible in the Message and Header tabs of the Search window (only if you used the RegEx search method). When this option is checked, MacCafé will automatically run the RegEx again and again until all matches have been found and highlighted. See chapter 4.07 of the MacCafé documentation for more information on this subject.
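The idea of running a RegEx again and again can be sketched as follows in Python. This is our own illustration of the principle, certainly not MacCafé's actual code: the function all_matches() is hypothetical, and the engine is Python's re module rather than 4D's.

```python
import re

def all_matches(pattern, text):
    """Collect every match by restarting the search after the previous one."""
    compiled = re.compile(pattern)
    spans = []
    position = 0
    while position <= len(text):
        match = compiled.search(text, position)
        if match is None:
            break
        spans.append(match.span())
        # Step past empty matches so that the loop always moves forward.
        if match.end() > match.start():
            position = match.end()
        else:
            position = match.end() + 1
    return spans

sample = "MacCafé, maccafe and MACCAFÉ"
spans = all_matches(r"(?i)maccaf[ée]", sample)

# Python's own "global" search, for comparison.
reference = [m.span() for m in re.finditer(r"(?i)maccaf[ée]", sample)]

print(spans)
```

Each pair gives the start and end position of one match, which is all a program needs to highlight it.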

○ Finally, a French-speaking newsgroup dedicated to RegEx (unfortunately, no English-speaking newsgroup about RegEx seems to exist):

fr.comp.lang.regexp

This group may seem empty, as it sometimes takes several months for a question to be asked, but since its creation no question has ever gone unanswered. So if you understand and write French, even if you're not fluent, feel free to join this group. After all, our English is far from perfect. ;-)


RegEx tested by Denis (DV), Gérard (Fleuger), Gilbert OLIVIER • Written by Denis (DV)

Many thanks to Olivier Miakinen for proofreading this document,
for his valuable comments and suggested documentation.

Thanks also to Alan B, (first?) English-speaking user,
who suggested integrating RegEx into MacCafé.

This document is published under CC-BY-SA license.

