Human2Regex Tutorial

0. Preface

Human2Regex (H2R) is a way to spell out a regular expression in an easy-to-read, easy-to-modify language. H2R supports multiple languages as well as many (though not all) regular expression features, such as named groups and quantifiers. You may notice multiple keywords specifying the same thing, and that is intended! Just as there are many ways to express yourself in English, H2R is made to be flexible and easy to understand. With a range, do you prefer "...", "through", or "to"? It's up to you: H2R supports all of them!

1. Your first Match

Every language starts with a "Hello World" program, so let's match the output of those programs. Matching is done using the keyword match followed by what you want to match.

match "Hello World"

The above statement will generate a regular expression that matches "Hello World", like "/Hello World/". Any characters that have a special meaning in a regular expression will automatically be escaped, so you don't need to worry about them. H2R also supports block comments with /**/, or line comments with // or #, so you can explain why or what you intend to match.

/* This is a block comment */
match "Hello World" // matches the output of "Hello World" programs
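To see how the generated expression behaves at run time, here is a minimal JavaScript sketch. It assumes the output is exactly the /Hello World/ pattern quoted above. The escapeLiteral helper is hypothetical: it only illustrates the kind of escaping H2R does for you and is not H2R's actual implementation.

```javascript
// The pattern the tutorial says `match "Hello World"` produces.
const helloWorld = /Hello World/;

console.log(helloWorld.test("Hello World"));      // true
console.log(helloWorld.test("say Hello World!")); // true: unanchored, so it matches anywhere
console.log(helloWorld.test("hello world"));      // false: case sensitive by default

// Hypothetical helper (not part of H2R): escapes characters that have a
// special meaning in a regular expression so they are matched literally.
function escapeLiteral(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const literal = new RegExp(escapeLiteral("1+1=2?"));
console.log(literal.source);         // 1\+1=2\?
console.log(literal.test("1+1=2?")); // true
```

Without the escaping step, "+" and "?" would be read as quantifiers and the pattern would no longer match the literal text.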

Now what if we want to match every case variation of "Hello World" like "hello world" or "hELLO wORLD"? H2R supports the or operator, which allows you to specify many possible combinations.

match "Hello World" or "hello world" or "hELLO wORLD"

Or, you can use a using statement to specify that you want the match to be case insensitive.

2. Using Specifiers

Using statements appear at the beginning. You may have one or more using statements, each of which can contain one or more specifiers. For example:

using global and case insensitive matching

or

using global
using case insensitive

The matching keyword is optional. The flags which are available are:

Specifier	Description	Regex flag
`multiline`	Matches can cross line breaks	/<your regex>/m
`global`	Multiple matches are allowed	/<your regex>/g
`case sensitive`	Match must be exact case	none
`case insensitive`	Match may be any case	/<your regex>/i
`exact`	An exact statement matches a whole line exactly, nothing before, nothing after	/^<your regex>$/

To match any variation of hello world, we would then do the following:

using case insensitive matching
match "hello world"
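As a rough illustration of what the flags in the table do once the expression is used, the JavaScript sketch below writes the patterns by hand. The exact output of H2R for these statements is an assumption based on the "Regex flag" column.

```javascript
// Assumed output of:
//   using global and case insensitive matching
//   match "hello world"
const anyCase = /hello world/gi;
const text = "Hello World, hello world, hELLO wORLD";

// With the g flag, String.prototype.match returns every match.
const found = text.match(anyCase);
console.log(found); // [ 'Hello World', 'hello world', 'hELLO wORLD' ]

// Assumed output of:
//   using exact
//   match "hello world"
const exact = /^hello world$/;
console.log(exact.test("hello world"));  // true
console.log(exact.test("hello world!")); // false: nothing may follow the match
```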

3. Matching multiple items

H2R comes with 2 options to match multiple items in a row. The first is to simply write multiple separate match statements like:

match "hello"
match " "
match "world"

However, you can also use a comma, and, or then for a more concise match.

match "hello", " ", "world"

or

match "hello" and " " and "world"

or

match "hello" then " " then "world"

or any combination like

match "hello", " " and then "world"

4. Optionality

Sometimes you wish to match something that may or may not exist. In H2R, this is done via the optional, optionally, possibly, or maybe keyword.

optionally match "hello world"

will match zero or one occurrence of "hello world". This can be used alongside matching multiple items in a single match statement.

match "hello", maybe " ", "world"

will match "hello", a space if one exists, and "world". However, an optional keyword at the start applies to the entire match statement. Thus,

possibly match "hello", " ", then "world"

will actually make the whole "hello world" an optional match rather than just the first "hello". If you want to make the first item optional but keep the rest required, place the optional keyword immediately after match.
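The difference between an optional item and an optional statement can be sketched in JavaScript. The patterns below are hand-written equivalents and an assumption about what H2R emits; the ^ and $ anchors are added only so that each test checks the whole string.

```javascript
// Assumed equivalent of: match "hello", maybe " ", "world"
const optionalSpace = /^hello ?world$/;
console.log(optionalSpace.test("hello world")); // true
console.log(optionalSpace.test("helloworld"));  // true: the space may be absent
console.log(optionalSpace.test("hello"));       // false: "world" is still required

// Assumed equivalent of: possibly match "hello", " ", then "world"
// The leading keyword wraps the entire statement in one optional group.
const optionalWhole = /^(?:hello world)?$/;
console.log(optionalWhole.test("hello world")); // true
console.log(optionalWhole.test(""));            // true: the whole phrase may be absent
console.log(optionalWhole.test(" world"));      // false: "hello" alone is not optional
```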

5. Negation

You can negate a match with the not operator.

match not "hello world"

or

match anything but "hello world"

will match everything except for "hello world".

6. Other matching specifiers

Many times you don't know exactly what you wish to match. H2R comes with many specifiers that you can use for your matching. For example, you may wish to match any word. You can do that with:

match a word

The a or an is optional. H2R supports the following specifiers:

Specifier	Description	Regex alternative	Note
`anything`	Matches any character	.
`word(s)`	Matches many a-z, A-Z, _, or digit characters	\w+	For letters only, use `letter(s)`
`letter(s)`	Matches any letter character	[a-zA-Z]
`number(s)`	Matches a string of digit characters	\d+
`digit(s)`	Matches any digit character	\d
`integer(s)`	Matches an integer	[+-]?\d+
`decimal(s)`	Matches digits, an optional decimal point and more digits	[+-]?((\d+[,.]?\d*)|([,.]\d+))	Supports both "," and "." decimal points
`character(s)`	Matches a-z, A-Z, _, or digits	\w	For letters only, use `letter(s)`
`whitespace(s)`	Matches any whitespace character	\s
`(word )boundary`	Matches a word boundary	\b
`line feed`/`newline`	Matches a newline	\n
`carriage return`	Matches a carriage return	\r
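To get a feel for a few of these specifiers, the JavaScript sketch below copies patterns from the "Regex alternative" column. The ^ and $ anchors are an addition for the demonstration so that the whole input must match; they are not part of the specifiers themselves.

```javascript
// Patterns taken from the table, anchored so the entire string is checked.
const integer = /^[+-]?\d+$/;
const decimal = /^[+-]?((\d+[,.]?\d*)|([,.]\d+))$/;
const word = /^\w+$/;

console.log(integer.test("-42"));      // true
console.log(integer.test("4.2"));      // false: a decimal point is not an integer

console.log(decimal.test("3.14"));     // true
console.log(decimal.test("-1,5"));     // true: "," is accepted as a decimal point
console.log(decimal.test(".5"));       // true: leading digits are optional
console.log(decimal.test("1.2.3"));    // false: only one decimal point is allowed

console.log(word.test("snake_case1")); // true
console.log(word.test("two words"));   // false: whitespace is not a word character
```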

You can also create ranges of characters to match. Say, for example, you wanted to match any character between a and z. You could write any of the following:

match from "a" to "z" // "from" is optional

or

match between "a" and "z" // "between" is optional

or

match "a" ... "z" // can use "..." or ".."

or

match "a" - "z"

or

match "a" through "z" // can also use thru
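In regular expression terms, a range like this corresponds to a character class. The JavaScript sketch below assumes that all of the spellings above produce [a-z]; that mapping is an assumption rather than confirmed H2R output.

```javascript
// Assumed equivalent of: match from "a" to "z"
const lowercase = /^[a-z]$/;
console.log(lowercase.test("q")); // true
console.log(lowercase.test("Q")); // false: the range is case sensitive
console.log(lowercase.test("7")); // false: digits fall outside the range

// The same idea applied to another range, e.g. match "0" to "9"
const digit = /^[0-9]$/;
console.log(digit.test("7")); // true
```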

7. Repetition

H2R supports 2 types of repetition: single match repetition and grouped repetition. When using match, you can specify the number of repetitions you want just before the item to match.

match 2 digits

or

match exactly 2 digits

will match any 2 digits in a row. You can also specify a range you wish to match.

match 2 ... 5 digits

or

match 2 to 5 digits

or

match between 2 to 5 digits

will match 2, 3, 4, or 5 digits. You can specify whether the final number is exclusive with the exclusive or inclusive keywords.

match 2 to 5 exclusive digits

will only match up to 4 digits. You can also choose to leave the end unspecified.

match 2+ digits

or

match 2 or more digits

will match 2 or more digits. Repetition can be chained with the and or then keywords, or with the optional keyword. For example:

match 1+ digits then optionally "." then optionally 0...8 digits

Suppose you want to repeat a group of these match statements. You can group a repetition using the repeat keyword. Everything underneath that is tabbed (scoped) will be repeated. By default, this will match 0 or more of the following statements.

repeat
	match "Hello "
match "World"

Will match 0 or more "Hello "s, but only 1 "World". The same qualifiers that exist for match statements also exist for repeat statements.

optionally repeat 3...7 times
	match "Hello World"

Will match "Hello World" between 3 and 7 times, or not at all. H2R also supports the following for numbers: One, Two, Three, Four, Five, Six, Seven, Eight, Nine, and Ten.
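The repetition forms in this section map onto regex quantifiers. The JavaScript sketch below uses hand-written patterns that are assumed to be equivalent to the H2R statements in the comments; the ^ and $ anchors are added for the demonstration only.

```javascript
// Single match repetition (assumed equivalents).
const twoToFive = /^\d{2,5}$/;     // match 2 to 5 digits
const twoToFiveExcl = /^\d{2,4}$/; // match 2 to 5 exclusive digits
const twoOrMore = /^\d{2,}$/;      // match 2+ digits
const chained = /^\d+\.?\d{0,8}$/; // match 1+ digits then optionally "." then optionally 0...8 digits

console.log(twoToFive.test("12345"));     // true
console.log(twoToFiveExcl.test("12345")); // false: the upper bound 5 is excluded
console.log(twoOrMore.test("1"));         // false
console.log(chained.test("3.14"));        // true
console.log(chained.test(".5"));          // false: at least one leading digit is required

// Grouped repetition (assumed equivalent of the repeat example):
// zero or more "Hello " followed by exactly one "World".
const repeated = /^(?:Hello )*World$/;
console.log(repeated.test("World"));             // true
console.log(repeated.test("Hello Hello World")); // true
console.log(repeated.test("Hello "));            // false: "World" is required
```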

8. Grouping

Just like regular expressions, capture groups are supported in H2R. Each group is defined using the create a group keyphrase.

create a group
	match "Hello World"

Or using the simplified syntax

group
	match "Hello World"

This will create an unnamed capture group, equivalent to the regular expression "/(Hello World)/". An unnamed capture group will show up in your chosen language's match results, but it will not be given a name. To access this match, you will need to know the index of the group. Most regular expression engines support named capture groups, and H2R highly recommends using this feature. If you wish to do so, simply give the group a name:

create a group called TestGroup
	match "Hello World"

Or using the simplified syntax

group TestGroup is
	match "Hello World"

In most languages, a named group can be accessed through the match result's group list. For example, in JavaScript,

"hello".match(/(?<TestGroup>hello)/).groups

Will return an object with {TestGroup: "hello"}. For another example, check out MDN web docs. Groups can also be optional.

create an optional group
	match "Hello World"

And groups may be nested

create a group called TestGroup
	match "Hello"
	create a group called InnerGroup
		match "World"

The regular expression returned by this will be "/(?<TestGroup>Hello(?<InnerGroup>World))/". Again, in JavaScript, the following

"HelloWorld".match(/(?<TestGroup>Hello(?<InnerGroup>World))/).groups

Will return an object with {TestGroup: "HelloWorld", InnerGroup: "World"}.
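To compare access by index with access by name, here is a short JavaScript sketch using the nested pattern quoted above. The variable names are only illustrative.

```javascript
const pattern = /(?<TestGroup>Hello(?<InnerGroup>World))/;
const result = "HelloWorld".match(pattern);

// By index: 0 is the whole match, and groups are numbered in the order
// their opening parentheses appear.
console.log(result[1]); // HelloWorld
console.log(result[2]); // World

// By name: the same values, without having to count parentheses.
console.log(result.groups.TestGroup);  // HelloWorld
console.log(result.groups.InnerGroup); // World

// match() returns null when nothing matches, so guard before reading groups.
const miss = "Goodbye".match(pattern);
console.log(miss?.groups?.TestGroup); // undefined
```

Named access keeps working if another group is later added in front of these two, whereas the numeric indexes would shift.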

Putting it all together

Grouping, repetition, and matching are the 3 primary elements that make up H2R. They can be combined in any way to generate a regular expression. See the main page for an example that combines all above to parse a URL.

Advanced features

Backreferences

Sometimes you may wish to match the same text that was previously matched. Take, for example, matching opening and closing XML tags such as <hello>world</hello>:

match "<"
create a group called opening_tag
	match a word or digit or "_" or "-"
match ">"
match 0+ not "<"
match "</" 
create a group called closing_tag
	match a word or digit or "_" or "-"
match ">"

To ensure you matched the same opening tag as closing tag, you would normally need to perform an additional step afterwards by checking that the capture groups are equal. However, in most regex engines, this can be performed automatically through backreferences. Backreferences effectively re-capture the same group. Human2Regex allows you to rerun or recapture a previous group.

match "<"
create a group called tag
	match a word or digit or "_" or "-"
match ">"
match 0+ not "<"
match "</" 
recapture tag
match ">"

The regex will only successfully match if both the tags are the same. One thing to note, however: the first group must be captured. For a "function"-like capture see regex subroutines (not yet implemented).

To allow for more natural phrasing, recapture the group and recapture the group called are also supported.
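In JavaScript, a named backreference is written \k<name>. The sketch below is a hand-written approximation of the recapture example, not H2R's literal output; in particular, the tag name is simplified to [\w-]+ and the tag body to [^<]*.

```javascript
// <tag> ... </tag>, where the closing name must equal the captured opening name.
const sameTag = /<(?<tag>[\w-]+)>[^<]*<\/\k<tag>>/;

console.log(sameTag.test("<hello>world</hello>")); // true
console.log(sameTag.test("<hello>world</bye>"));   // false: closing tag differs

const m = "<hello>world</hello>".match(sameTag);
console.log(m.groups.tag); // hello
```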

If statements

Certain regex languages support if statements, which can be used to simplify statements. Human2Regex supports if, else if, and else statements. Inside each if, you can recapture a group or run a new match. This is done as follows:

if match "hello" then optionally "world"
	match "!"
else if match "goodbye" then optionally "world"
	match "!"

create a group called tag
	match "<" then a word or digit or "_" or "-" then ">"
//do we have another tag? keep matching the same tags
if rerun tag
	repeat
		recapture tag
//ignore everything else
else
	match 0+ any thing

Unicode character properties

You can match specific Unicode sequences using "\uXXXX" or "\UXXXXXXXX", where X is a hexadecimal digit.

match "\u0669" // matches Arabic digit 9 "٩"

Unicode character classes/scripts can be matched using the unicode keyword.

match unicode "Latin" // matches any Latin character
match unicode "N" // matches any number character

The following Unicode class specifiers are available:

Class	Description
C	Other
Cc	Control
Cf	Format
Cn	Unassigned
Co	Private use
Cs	Surrogate
L	Letter
Ll	Lower case letter
Lm	Modifier letter
Lo	Other letter
Lt	Title case letter
Lu	Upper case letter
M	Mark
Mc	Spacing mark
Me	Enclosing mark
Mn	Non-spacing mark
N	Number
Nd	Decimal number
Nl	Letter number
No	Other number
P	Punctuation
Pc	Connector punctuation
Pd	Dash punctuation
Pe	Close punctuation
Pf	Final punctuation
Pi	Initial punctuation
Po	Other punctuation
Ps	Open punctuation
S	Symbol
Sc	Currency symbol
Sk	Modifier symbol
Sm	Mathematical symbol
So	Other symbol
Z	Separator
Zl	Line separator
Zp	Paragraph separator
Zs	Space separator

The following Unicode script specifiers are available:

Note: Java and .NET require "Is" in front of the script name. For example, "IsLatin" rather than just "Latin"

Arabic	Armenian	Avestan	Balinese	Bamum
Batak	Bengali	Bopomofo	Brahmi	Braille
Buginese	Buhid	Canadian_Aboriginal	Carian	Chakma
Cham	Cherokee	Common	Coptic	Cuneiform
Cypriot	Cyrillic	Deseret	Devanagari	Egyptian_Hieroglyphs
Ethiopic	Georgian	Glagolitic	Gothic	Greek
Gujarati	Gurmukhi	Han	Hangul	Hanunoo
Hebrew	Hiragana	Imperial_Aramaic	Inherited	Inscriptional_Pahlavi
Inscriptional_Parthian	Javanese	Kaithi	Kannada	Katakana
Kayah_Li	Kharoshthi	Khmer	Lao	Latin
Lepcha	Limbu	Linear_B	Lisu	Lycian
Lydian	Malayalam	Mandaic	Meetei_Mayek	Meroitic_Cursive
Meroitic_Hieroglyphs	Miao	Mongolian	Myanmar	New_Tai_Lue
Nko	Ogham	Old_Italic	Old_Persian	Old_South_Arabian
Old_Turkic	Ol_Chiki	Oriya	Osmanya	Phags_Pa
Phoenician	Rejang	Runic	Samaritan	Saurashtra
Sharada	Shavian	Sinhala	Sora_Sompeng	Sundanese
Syloti_Nagri	Syriac	Tagalog	Tagbanwa	Tai_Le
Tai_Tham	Tai_Viet	Takri	Tamil	Telugu
Thaana	Thai	Tibetan	Tifinagh	Ugaritic
Vai	Yi
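As a point of comparison for the classes and scripts listed above, the JavaScript sketch below uses Unicode property escapes directly. JavaScript requires the u flag for \p{...} and writes scripts as \p{Script=Name}. Whether H2R emits exactly these forms for JavaScript is an assumption.

```javascript
// "\u0669" is ARABIC-INDIC DIGIT NINE.
const arabicNine = "\u0669";

console.log(/\u0669/.test(arabicNine));            // true: matched by code point
console.log(/\p{N}/u.test(arabicNine));            // true: class N (Number)
console.log(/\p{Nd}/u.test(arabicNine));           // true: class Nd (Decimal number)
console.log(/\p{Script=Latin}/u.test("a"));        // true
console.log(/\p{Script=Latin}/u.test(arabicNine)); // false: not a Latin character
console.log(/\p{Lu}/u.test("a"));                  // false: "a" is lower case
console.log(/\p{Lu}/u.test("A"));                  // true
```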