import triegun "github.com/Maki-Daisuke/go-triegun"
Package go-triegun generates Go code for string matching based on a trie
(prefix tree), which is far faster than the standard regexp package.
Testing whether a string contains another string is a trivial, everyday task.
For example, detecting bots from the User-Agent header is one such task. You can
do it with the regexp package like this:
import "regexp"

var re = regexp.MustCompile("Baiduspider|bingbot|Googlebot|Twitterbot")

if re.MatchString(userAgent) {
	// Matched!
}
It looks quite easy!
But as the number of bot signatures increases, the regexp implementation
becomes very slow. In fact, regexp is overkill for this problem. It can be
solved more simply and much faster.
Here, we can use a trie (prefix tree), which is good at testing whether a string
has any of a set of strings as its prefix. This package generates Go code from a
set of strings based on a trie. The precompiled matcher is much faster than the
regexp one.
In fact, this package does more than a plain trie. It can match not only a prefix, but also the middle of a string.
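
To make the idea concrete, here is a minimal hand-written sketch of trie matching. It is not the code this package generates, which may look quite different; the names node, buildTrie, hasPrefixAny and containsAny are hypothetical and exist only for this illustration. Prefix matching walks the trie once from the start of the input, and substring matching simply retries that walk from every offset.

```go
package main

import "fmt"

// node is one trie node; children are keyed by the next byte.
type node struct {
	children map[byte]*node
	terminal bool // true if some pattern ends at this node
}

func newNode() *node { return &node{children: map[byte]*node{}} }

// buildTrie inserts every pattern into a fresh trie and returns its root.
func buildTrie(patterns []string) *node {
	root := newNode()
	for _, p := range patterns {
		n := root
		for i := 0; i < len(p); i++ {
			next, ok := n.children[p[i]]
			if !ok {
				next = newNode()
				n.children[p[i]] = next
			}
			n = next
		}
		n.terminal = true
	}
	return root
}

// hasPrefixAny reports whether s starts with any pattern stored in the trie.
func hasPrefixAny(root *node, s string) bool {
	n := root
	for i := 0; i < len(s); i++ {
		if n.terminal {
			return true
		}
		next, ok := n.children[s[i]]
		if !ok {
			return false
		}
		n = next
	}
	return n.terminal
}

// containsAny reports whether any pattern occurs anywhere in s,
// by retrying the prefix walk from every offset.
func containsAny(root *node, s string) bool {
	for i := 0; i <= len(s); i++ {
		if hasPrefixAny(root, s[i:]) {
			return true
		}
	}
	return false
}

func main() {
	root := buildTrie([]string{"Baiduspider", "bingbot", "Googlebot", "Twitterbot"})
	ua := "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
	fmt.Println(hasPrefixAny(root, ua)) // false
	fmt.Println(containsAny(root, ua))  // true
}
```

Every signature shares one structure, so each input byte is examined against all signatures at once instead of one alternative at a time.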
Benchmark results on my laptop (MacBook 2015, 1.3 GHz Intel Core M):
$ go test -bench .
PASS
BenchmarkContainsRegexp-4 10000 258149 ns/op
BenchmarkContainsGeneraetd-4 200000 8276 ns/op
BenchmarkHasPrefixRegexp-4 10000 247447 ns/op
BenchmarkHasPrefixGeneraetd-4 1000000 3966 ns/op
BenchmarkIsInRegexp-4 200000 9054 ns/op
BenchmarkIsInGeneraetd-4 500000 5119 ns/op
ok github.com/Maki-Daisuke/go-triegun/test 16.089s
That is about 30x faster than regexp for Contains, and about 60x faster for HasPrefix! It can be much faster still in real-world programs.
You can run the same benchmark test as follows:
$ go get github.com/Maki-Daisuke/go-triegun
$ cd $GOPATH/src/github.com/Maki-Daisuke/go-triegun/test
$ go generate
$ go test -bench .
There are two ways to use this package:
This package includes a command called triegun.
You can install it by typing this on your command line:
$ go get github.com/Maki-Daisuke/go-triegun/cmd/triegun
$ triegun -h
Usage:
  triegun [OPTIONS] [FILES...]

Application Options:
  -p, --package=            package name (default: main)
  -t, --tag=                tag name included in the generated functions
  -C, --disable-contains    Suppress generating code for Contains* functions (default: false)
  -I, --disable-isin        Suppress generating code for IsIn* functions (default: false)
  -P, --disable-hasprefix   Suppress generating code for HasPrefix* functions (default: false)

Help Options:
  -h, --help                Show this help message
triegun reads text from the files given as command arguments, or from STDIN
if no argument is passed. It then generates Go code that matches any of the
input lines and writes the code to STDOUT. For example:
$ cat signatures.txt
Baiduspider
bingbot
Googlebot
Twitterbot
$ triegun -t Bot signatures.txt > matcher.go
It generates the following six functions:
func ContainsBot(b []byte) bool
func ContainsBotString(s string) bool
func HasPrefixBot(b []byte) bool
func HasPrefixBotString(s string) bool
func IsInBot(b []byte) bool
func IsInBotString(s string) bool
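
As a rough guide to what these functions are expected to return, the sketch below spells out the presumed semantics with the standard strings package. This is an assumption based on the function names, which mirror strings.Contains and strings.HasPrefix; in particular, IsIn is assumed here to mean an exact match against one of the signatures. The names containsBotNaive, hasPrefixBotNaive and isInBotNaive are hypothetical and are not part of the generated code.

```go
package main

import (
	"fmt"
	"strings"
)

// signatures is the same set of lines as in signatures.txt above.
var signatures = []string{"Baiduspider", "bingbot", "Googlebot", "Twitterbot"}

// containsBotNaive: true if any signature occurs anywhere in s.
func containsBotNaive(s string) bool {
	for _, sig := range signatures {
		if strings.Contains(s, sig) {
			return true
		}
	}
	return false
}

// hasPrefixBotNaive: true if s starts with any signature.
func hasPrefixBotNaive(s string) bool {
	for _, sig := range signatures {
		if strings.HasPrefix(s, sig) {
			return true
		}
	}
	return false
}

// isInBotNaive: true if s is exactly one of the signatures (assumed meaning).
func isInBotNaive(s string) bool {
	for _, sig := range signatures {
		if s == sig {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsBotNaive("Mozilla/5.0 (compatible; bingbot/2.0)")) // true
	fmt.Println(hasPrefixBotNaive("Twitterbot/1.0"))                       // true
	fmt.Println(isInBotNaive("Twitterbot/1.0"))                            // false
}
```

A naive version like this is also handy as a test oracle: feed the same inputs to it and to the generated matchers and compare the results.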
You can call them as you expect:
package main

import (
	"bufio"
	"os"
)

func main() {
	r := bufio.NewReader(os.Stdin)
	line, err := r.ReadSlice('\n')
	if err != nil {
		// err must be used, otherwise the program does not compile.
		panic(err)
	}
	if ContainsBot(line) {
		// do something
	}
	// or
	if ContainsBotString(string(line)) {
		// do another thing
	}
}
This approach (using the triegun command) works well, but it is not very elegant.
It also falls short when you want to match a newline character ("\n") or other
special characters, because the input is read line by line. In those cases, you
can use "go generate". See the next section.
Since Go 1.4, you can use go generate to generate Go code using only the Go tools.
To reproduce the example above, first prepare a program that generates the matchers, like this:
// makenmatchers.go

// Tell `go build` to ignore this file.
// +build ignore

package main

import triegun "github.com/Maki-Daisuke/go-triegun"

var signatures = []string{
	"Baiduspider",
	"bingbot",
	"Googlebot",
	"Twitterbot",
}

func main() {
	t := triegun.New()
	t.PkgName = "main"
	t.TagName = "Bot"
	t.AddString(signatures...)
	// Generate matcher code into "matchers_generated.go" with "Bot" tag.
	err := t.GenFile("matchers_generated.go")
	if err != nil {
		panic(err)
	}
}
Then, add a special comment in your main code:
package main

//go:generate go run makenmatchers.go

import (
	"bufio"
	"os"
)

func main() {
	r := bufio.NewReader(os.Stdin)
	line, err := r.ReadSlice('\n')
	if err != nil {
		// err must be used, otherwise the program does not compile.
		panic(err)
	}
	if ContainsBot(line) {
		// do something
	}
	// or
	if ContainsBotString(string(line)) {
		// do another thing
	}
}
Now, run go generate:
$ go generate
This produces the file "matchers_generated.go" containing the matcher code. You can now build and run your program:
$ go build
This approach is recommended, because the build process is documented in your
source code and the Go tools are all you need to build it. You don't need make
or any other toolchain.
Either way, choose whichever you prefer!
The Simplified BSD License (2-clause). See LICENSE file also.
Daisuke (yet another) Maki