
Regex simplifications

After studying the regex, I believe it can be simplified with a few changes. This ticket is meant as a recipe that can be followed step by step to apply all of them.

Steps

All of the following steps use this simplified case as a worked example:

dsi_units_regex = r"^((\\hour)|(\\kilogram))|((\\kilo)|(\\milli))?((\\metre)|(\\gram))$"

1. Common back-slashes

Take the units that do not allow a prefix: each of them starts with a backslash (written \\ in the regex). The backslash can therefore be written once per block instead of once per item.

(\\hour)|(\\kilogram) == \\((hour)|(kilogram))

At this point, the parentheses around each individual unit (without the backslash) can also be dropped. Note that this removes the corresponding capture groups, which only matters if the calling code reads them:

\\((hour)|(kilogram)) == \\(hour|kilogram)
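
A quick way to gain confidence that the factored form accepts the same strings is to compare both patterns on a few samples. The following is only a minimal sketch: the sample list and the helper name accepts are assumptions made for illustration, and re.fullmatch is assumed as the matching mode.

```python
import re

# Patterns before and after factoring out the common backslash.
ORIGINAL = r"(\\hour)|(\\kilogram)"
FACTORED = r"\\(hour|kilogram)"

# Sample inputs chosen for illustration (not taken from the real unit list).
SAMPLES = [r"\hour", r"\kilogram", "hour", r"\\hour", r"\hourkilogram", r"\gram", ""]


def accepts(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Pair up the verdict of each pattern for every sample.
results = [(accepts(ORIGINAL, s), accepts(FACTORED, s)) for s in SAMPLES]
same_language = all(before == after for before, after in results)
print(same_language)
```

A sample-based check like this cannot prove equivalence, but it catches the most common slip, which is a misplaced parenthesis.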

Applied to the full regex, this gives:

dsi_units_regex = r"^(\\(hour|kilogram))|(\\(kilo|milli))?(\\(metre|gram))$"

In this simplified example the length drops from 67 to 55 characters. In the real case, with many more units and prefixes, the reduction will be greater.
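
The character counts and the behaviour of the two full patterns can be checked with a short script. This is a sketch under assumptions: the sample strings are illustrative, and re.fullmatch is assumed as the way the pattern is applied (the ticket does not say how the real code calls it).

```python
import re

BEFORE = r"^((\\hour)|(\\kilogram))|((\\kilo)|(\\milli))?((\\metre)|(\\gram))$"
AFTER = r"^(\\(hour|kilogram))|(\\(kilo|milli))?(\\(metre|gram))$"

# Length of each pattern as written in the source file.
print(len(BEFORE), len(AFTER))  # 67 55

# Illustrative inputs: valid units, a bare prefix, a prefixed no-prefix unit, etc.
SAMPLES = [
    r"\hour", r"\kilogram", r"\metre", r"\kilo\metre", r"\milli\gram",
    r"\kilo", r"\kilo\hour", "metre", "",
]

before_hits = [s for s in SAMPLES if re.fullmatch(BEFORE, s)]
after_hits = [s for s in SAMPLES if re.fullmatch(AFTER, s)]
print(before_hits == after_hits)
```

One caveat applies to both patterns equally: the top-level | splits the pattern into "^..." and "...$", so with re.search or re.match a string such as \hourX would still match through the first branch. Using re.fullmatch, as assumed here, avoids that.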

Note: I don't think positive lookahead \\(?=...) or lookbehind (?<=\\) constructs would help much here.

2. Use of template strings

While true template strings were only introduced in Python 3.14, a similar approach can be used (in a different way) in earlier Python versions.

The idea is to write the regex as an f-string-like template in which each group is clearly named. The placeholders are then filled in with a .format() call.

# Def of the "human readable" template regex
# The prefix gets its own group so that "?" also covers its leading backslash
DSI_REGEX_TEMPLATE = r"^({units_without_prefixes})|({prefix})?({unit})$"

# Definition of each group
_UNITS_NO_PREFIX = r"\\(hour|kilogram)"
_PREFIXES = r"\\(kilo|milli)"
_SI_UNITS = r"\\(metre|gram)"

# Definition of the regex that is to be interpreted by the machine
DSI_MACHINE_REGEX = DSI_REGEX_TEMPLATE.format(
    units_without_prefixes=_UNITS_NO_PREFIX,
    prefix=_PREFIXES,
    unit=_SI_UNITS,
)

As seen here, the result is actually two public strings (ignoring the private, underscore-prefixed names): the human-readable template and the regex to be used by the algorithm.
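
To see what the .format() call produces, the snippet can be run as-is and the result printed. In this sketch the prefix placeholder is wrapped in its own group, so that the ? makes the whole prefix optional (backslash included) and the output matches the regex from step 1. The compiled constant DSI_PATTERN is a hypothetical name added for illustration.

```python
import re

# Human-readable template; "({prefix})?" makes the entire prefix optional.
DSI_REGEX_TEMPLATE = r"^({units_without_prefixes})|({prefix})?({unit})$"

# Definition of each group
_UNITS_NO_PREFIX = r"\\(hour|kilogram)"
_PREFIXES = r"\\(kilo|milli)"
_SI_UNITS = r"\\(metre|gram)"

# Regex that is interpreted by the machine
DSI_MACHINE_REGEX = DSI_REGEX_TEMPLATE.format(
    units_without_prefixes=_UNITS_NO_PREFIX,
    prefix=_PREFIXES,
    unit=_SI_UNITS,
)

# Hypothetical compiled constant, so the pattern is parsed only once.
DSI_PATTERN = re.compile(DSI_MACHINE_REGEX)

print(DSI_MACHINE_REGEX)
```

One thing to keep in mind with str.format: literal braces in the template, such as a {2} quantifier, have to be doubled ({{2}}), otherwise they are read as placeholders.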

3. Definition of the regex as a constant and not as a function

The module is called regex_generator because it has public functions that generate the regex. To improve code readability (and performance), the helper functions can instead be called directly to define the groups, and the final regex can then be defined as a constant:

# Def of the "human readable" template regex
# The prefix gets its own group so that "?" also covers its leading backslash
DSI_REGEX_TEMPLATE = r"^({units_without_prefixes})|({prefix})?({unit})$"

# Definition of each group
_UNITS_NO_PREFIX = _get_unit_regex(["hour", "kilogram"])
_PREFIXES = _get_regex(...)
_SI_UNITS = _get_regex(...)

# Definition of the regex that is to be interpreted by the machine
DSI_MACHINE_REGEX = DSI_REGEX_TEMPLATE.format(
    units_without_prefixes=_UNITS_NO_PREFIX,
    prefix=_PREFIXES,
    unit=_SI_UNITS,
)
Edited by Jaime Gonzalez Gomez