A regular expression primer
Most of the ways to create new fields in Splunk involve regular expressions (sometimes referred to as regex). As mentioned in the Splunk documentation:
There are many books and sites dedicated to regular expressions, so we will only touch upon the subject here. The following examples are really provided for completeness; the Splunk web interface may suffice for most users.
Given the log snippet ip=1.2.3.4, let's pull out the subnet (1.2.3) into a new field called subnet. The simplest pattern would be the following literal string:
ip=(?P<subnet>1.2.3).4
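This is not terribly useful, as it will only find the subnet of that one IP address (and, strictly speaking, each unescaped period matches any character rather than only a literal period). Let's try a slightly more complicated example: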
ip=(?P<subnet>\d+\.\d+\.\d+)\.\d+
Let's step through this pattern:
- ip= simply looks for the raw string ip=.
- ( starts a capture buffer. Everything until the closing parenthesis is part of this capture buffer.
- ?P<subnet>, immediately inside the parentheses, says create a field called subnet from the results of this capture buffer.
- \d matches any single digit, from 0 to 9.
- + says one or more of the item immediately before.
- \. matches a literal period. A period without the backslash matches any character.
- \d+\.\d+ matches the next two parts of the IP address.
- ) ends our capture buffer.
- \.\d+ matches the last part of the IP address. Since it is outside the capture buffer, it will be discarded.
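To see this pattern at work outside Splunk, here is a minimal sketch in Python, whose re module accepts the same (?P<name>...) named-group syntax. The helper name extract_subnet and the sample events are invented for illustration; Splunk itself uses PCRE, so treat this as an approximation of what a field extraction would do rather than Splunk's own behavior.

```python
import re

# The pattern from the walkthrough above, compiled once for reuse.
SUBNET_PATTERN = re.compile(r"ip=(?P<subnet>\d+\.\d+\.\d+)\.\d+")


def extract_subnet(event):
    """Return the text captured as 'subnet', or None if nothing matches."""
    match = SUBNET_PATTERN.search(event)
    return match.group("subnet") if match else None


# Hypothetical sample events, made up for this example.
print(extract_subnet("ip=1.2.3.4"))                  # 1.2.3
print(extract_subnet("user=bob ip=10.20.30.40 ok"))  # 10.20.30
print(extract_subnet("ip=1.2.3"))                    # None: no fourth part
```

The last call returns None because the trailing \.\d+ must match even though its text is discarded.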
Now, let's step through an overly complicated pattern to illustrate a few more concepts:
ip=(?P<subnet>\d+.\d*\.[01234-9]+)\.\d+
Let's step through this pattern:
- ip= simply looks for the raw string ip=.
- (?P<subnet> starts our capture buffer and defines our field name.
- \d means digit. This is one of the many backslash character combinations that represent some sets of characters.
- + says one or more of what came before, in this case \d.
- . matches a single character. This will match the period after the first set of digits, though it would match any single character.
- \d* means zero or more digits.
- \. matches a literal period. The backslash negates the special meaning of any special punctuation character. Not all punctuation marks have a special meaning, but so many do that there is no harm in adding a backslash before a punctuation mark that you want to match literally.
- [ starts a character set. Anything inside the brackets will match a single character in the character set.
- 01234-9 means the characters 0, 1, 2, 3, and the range 4-9.
- ] closes the character set.
- + says one or more of what came before, in this case, the character set.
- ) ends our capture buffer.
- \.\d+ is the final part of the IP address that we are throwing away. It is not actually necessary to include this, but it ensures that we only match if there were, in fact, four sets of numbers.
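The looseness of this pattern is easier to see with a few inputs. The following sketch uses Python's re module as a stand-in for Splunk's PCRE engine (an assumption made for illustration); the helper name loose_subnet and the malformed sample values are invented.

```python
import re

# The deliberately loose pattern from the walkthrough above.
LOOSE_PATTERN = re.compile(r"ip=(?P<subnet>\d+.\d*\.[01234-9]+)\.\d+")


def loose_subnet(event):
    """Return the text captured as 'subnet', or None if nothing matches."""
    match = LOOSE_PATTERN.search(event)
    return match.group("subnet") if match else None


# A well-formed address behaves as expected.
print(loose_subnet("ip=1.2.3.4"))   # 1.2.3

# The unescaped period accepts any character, so "x" slips through.
print(loose_subnet("ip=1x2.3.4"))   # 1x2.3

# \d* may match zero digits, so an empty second part is accepted too.
print(loose_subnet("ip=1..3.4"))    # 1..3
```

Both malformed values are captured, which is why escaping the period and preferring + over * gives a stricter extraction.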
There are a number of different ways to accomplish the task at hand. Here are a few examples that will work:
- ip=(?P<subnet>\d+\.\d+\.\d+)\.\d+
- ip=(?P<subnet>(\d+\.){2}\d+)\.\d+
- ip=(?P<subnet>[\d\.]+)\.\d
- ip=(?P<subnet>.*?\..*?\..*?)\.
- ip=(?P<subnet>\S+)\.
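As a quick check, the sketch below runs all five patterns against the original snippet, again using Python's re module as an assumed stand-in for Splunk's PCRE engine. The helper name subnets and the second, non-numeric sample value are invented for illustration.

```python
import re

# The five alternative patterns listed above, in the same order.
PATTERNS = [
    r"ip=(?P<subnet>\d+\.\d+\.\d+)\.\d+",
    r"ip=(?P<subnet>(\d+\.){2}\d+)\.\d+",
    r"ip=(?P<subnet>[\d\.]+)\.\d",
    r"ip=(?P<subnet>.*?\..*?\..*?)\.",
    r"ip=(?P<subnet>\S+)\.",
]


def subnets(event):
    """Apply every pattern to the event and collect each 'subnet' capture."""
    results = []
    for pattern in PATTERNS:
        match = re.search(pattern, event)
        results.append(match.group("subnet") if match else None)
    return results


# All five agree on the snippet from the text.
print(subnets("ip=1.2.3.4"))
# ['1.2.3', '1.2.3', '1.2.3', '1.2.3', '1.2.3']

# They are not equally strict: the last two accept non-numeric parts.
print(subnets("ip=a.b.c.d"))
# [None, None, None, 'a.b.c', 'a.b.c']
```

Which pattern to prefer depends on how much you trust the data: the stricter forms reject malformed values, while the looser ones are shorter but will capture almost anything.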
For more information about regular expressions, consult the manual pages for Perl Compatible Regular Expressions (PCRE), which can be found online at http://www.pcre.org/pcre.txt, or one of the many regular expression books or websites dedicated to the subject. We will build more expressions as we work through different configurations and searches, but it's definitely worthwhile to have a reference handy.