Go の strings パッケージを読んでみる

Go の標準パッケージのコードを読んでみる。まずは読みやすそうな strings パッケージからはじめてみる。

Homebrew で Go をインストールしている場合は /usr/local/opt/go/libexec/src/ にコードがあるはずなのでまずはエディタで開く。現在のバージョンは 1.11.2。

$ go version
go version go1.11.2 darwin/amd64
$ code /usr/local/opt/go/libexec/src/

strings パッケージと string 型について

まずは https://blog.golang.org/strings を読めとのことらしい。

// Package strings implements simple functions to manipulate UTF-8 encoded strings.
//
// For information about UTF-8 strings in Go, see https://blog.golang.org/strings.

Strings, bytes, runes and characters in Go - The Go Programming Language

The previous blog post explained how slices work in Go, using a number of examples to illustrate the mechanism behind their implementation. Building on that background, this post discusses strings in Go. At first, strings might seem too simple a topic for a blog post, but to use them well requires understanding not only how they work, but also the difference between a byte, a character, and a rune, the difference between Unicode and UTF-8, the difference between a string and a string literal, and other even more subtle distinctions.

Go の string はスライスと関係があるらしい。うまく使うには string がどう動くかだけでなく byte と character と rune の違い、 Unicode と UTF-8 の違い、 string と string リテラルの違いを理解する必要があるらしい。

In Go, a string is in effect a read-only slice of bytes.

string は読み取り専用の byte のスライスだ。 Go を書いたことがある人ならわかると思うけど、 string の n 番目にある要素を取得しても、実際には必ずしも n 文字目の文字は取得できない。

gore> str := "あいうえお"
"あいうえお"
gore> string(str[3])
"ã"

It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

で、ただのバイトの列なので、別にそれが Unicode や UTF-8 のフォーマットになっている必要はない。仮にメチャクチャなバイト列を作ってもそれを string 型として扱うこともできる。

Count

stings.Count はふたつの string s と substr を引数として s 中にある substr の個数を返す。もし substr が空の場合、この関数は utf8.RuneCountInString を使って s の Unicode コードポイント +1 を返す。

gore> s := "abcabcabc"
"abcabcabc"
gore> substr := "ab"
"ab"
gore> strings.Count(s, substr)
3

// Count counts the number of non-overlapping instances of substr in s.
// If substr is an empty string, Count returns 1 + the number of Unicode code points in s.
func Count(s, substr string) int {
    // special case
    if len(substr) == 0 {
        return utf8.RuneCountInString(s) + 1
    }
    if len(substr) == 1 {
        return bytealg.CountString(s, substr[0])
    }
    n := 0
    for {
        i := Index(s, substr)
        if i == -1 {
            return n
        }
        n++
        s = s[i+len(substr):]
    }
}

特に特別なことはしてない。はじめに substr の長さを調べて aerly return する。 substr の長さが2以上の場合は strings.Index を使って substr のインデックスを取得する。存在しない場合は -1 が返ってくるので、見つからなくなるまで走査を続ける。

Contains 他

これら3つの関数は substr が s 中に存在するかどうかを Index 他を使って確認した結果を bool で返す。

// Contains reports whether substr is within s.
func Contains(s, substr string) bool {
    return Index(s, substr) >= 0
}

// ContainsAny reports whether any Unicode code points in chars are within s.
func ContainsAny(s, chars string) bool {
    return IndexAny(s, chars) >= 0
}

// ContainsRune reports whether the Unicode code point r is within s.
func ContainsRune(s string, r rune) bool {
    return IndexRune(s, r) >= 0
}

Index

Index は意外に複雑。

// Index returns the index of the first instance of substr in s, or -1 if substr is not present in s.
func Index(s, substr string) int {
    n := len(substr)
    switch {
    case n == 0:
        return 0
    case n == 1:
        return IndexByte(s, substr[0])
    case n == len(s):
        if substr == s {
            return 0
        }
        return -1
    case n > len(s):
        return -1
    case n <= bytealg.MaxLen:
        // Use brute force when s and substr both are small
        if len(s) <= bytealg.MaxBruteForce {
            return bytealg.IndexString(s, substr)
        }
        c := substr[0]
        i := 0
        t := s[:len(s)-n+1]
        fails := 0
        for i < len(t) {
            if t[i] != c {
                // IndexByte is faster than bytealg.IndexString, so use it as long as
                // we're not getting lots of false positives.
                o := IndexByte(t[i:], c)
                if o < 0 {
                    return -1
                }
                i += o
            }
            if s[i:i+n] == substr {
                return i
            }
            fails++
            i++
            // Switch to bytealg.IndexString when IndexByte produces too many false positives.
            if fails > bytealg.Cutover(i) {
                r := bytealg.IndexString(s[i:], substr)
                if r >= 0 {
                    return r + i
                }
                return -1
            }
        }
        return -1
    }
    c := substr[0]
    i := 0
    t := s[:len(s)-n+1]
    fails := 0
    for i < len(t) {
        if t[i] != c {
            o := IndexByte(t[i:], c)
            if o < 0 {
                return -1
            }
            i += o
        }
        if s[i:i+n] == substr {
            return i
        }
        i++
        fails++
        if fails >= 4+i>>4 && i < len(t) {
            // See comment in ../bytes/bytes_generic.go.
            j := indexRabinKarp(s[i:], substr)
            if j < 0 {
                return -1
            }
            return i + j
        }
    }
    return -1
}

まず substr のバイト長さが0のとき、1のとき、 s と同じとき、 s より長いときは単純なので個別に処理する。でそれ以外の場合は IndexByte で substr の先頭の文字を探してその位置を起点として substr と同じ文字列かどうか確認する。これを繰り返すが、失敗が一定以上になったら違う方法を試す。 s が bytealg.MaxLen 以下の場合は IndexString 、以上の場合はラビン-カープ文字列検索アルゴリズムを使う。 bytealg.MaxLen の値はCPUによって変わる。

LastIndex

Index は s 内に存在する先頭の substr のインデックスを返すが、 LastIndex は最後のインデックスを返す。

// LastIndex returns the index of the last instance of substr in s, or -1 if substr is not present in s.
func LastIndex(s, substr string) int {
    n := len(substr)
    switch {
    case n == 0:
        return len(s)
    case n == 1:
        return LastIndexByte(s, substr[0])
    case n == len(s):
        if substr == s {
            return 0
        }
        return -1
    case n > len(s):
        return -1
    }
    // Rabin-Karp search from the end of the string
    hashss, pow := hashStrRev(substr)
    last := len(s) - n
    var h uint32
    for i := len(s) - 1; i >= last; i-- {
        h = h*primeRK + uint32(s[i])
    }
    if h == hashss && s[last:] == substr {
        return last
    }
    for i := last - 1; i >= 0; i-- {
        h *= primeRK
        h += uint32(s[i])
        h -= pow * uint32(s[i+n])
        if h == hashss && s[i:i+n] == substr {
            return i
        }
    }
    return -1
}