Module: String::Similarity

Defined in:
lib/string/similarity.rb,
lib/string/similarity/version.rb

Overview

String::Similarity provides various methods for calculating string distances.

Constant Summary

VERSION = '2.1.0'

Gem version

Class Method Summary

Class Method Details

.cosine(str1, str2, ngram: 1) ⇒ Float

Calculate the cosine similarity of two strings.

For an explanation of the cosine similarity of two strings, read this excellent Stack Overflow answer.

Parameters:

  • str1 (String)

    first string

  • str2 (String)

    second string

  • ngram (Integer) (defaults to: 1)

    number of consecutive characters grouped into each n-gram; must be >= 1

Returns:

  • (Float)

    cosine similarity of the two arguments.

    • 1.0 if the strings are identical

    • 0.0 if the strings are completely different

    • 0.0 if one of the strings is empty

Raises:

  • (ArgumentError)


# File 'lib/string/similarity.rb', line 20

def self.cosine(str1, str2, ngram: 1)
  raise ArgumentError.new('ngram should be >= 1') if ngram < 1

  return 1.0 if str1 == str2
  return 0.0 if str1.empty? || str2.empty?

  # convert both texts to vectors
  v1 = vector(str1, ngram)
  v2 = vector(str2, ngram)

  # calculate the dot product
  dot_product = dot(v1, v2)

  # calculate the magnitude
  magnitude = mag(v1.values) * mag(v2.values)
  dot_product / magnitude
end
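
The listing above relies on three private helpers (vector, dot and mag) whose source is not shown on this page. The following is a minimal sketch of how they might look, assuming that vector counts overlapping character n-grams, dot sums the products of matching counts, and mag is the Euclidean length. The helper bodies and the name cosine_sketch are assumptions made for illustration, not the library's actual implementation.

```ruby
# Hypothetical stand-ins for the private helpers; the real ones may differ.

# Count every overlapping n-gram of `str` (assumed behaviour of `vector`).
def vector(str, ngram)
  counts = Hash.new(0)
  str.each_char.each_cons(ngram) { |chars| counts[chars.join] += 1 }
  counts
end

# Dot product: only n-grams present in both vectors contribute.
def dot(v1, v2)
  v1.sum { |gram, count| count * v2.fetch(gram, 0) }
end

# Euclidean length of a list of n-gram counts.
def mag(values)
  Math.sqrt(values.sum { |v| v * v })
end

# Same control flow as the listing above, wired to the assumed helpers.
def cosine_sketch(str1, str2, ngram: 1)
  raise ArgumentError, 'ngram should be >= 1' if ngram < 1
  return 1.0 if str1 == str2
  return 0.0 if str1.empty? || str2.empty?

  v1 = vector(str1, ngram)
  v2 = vector(str2, ngram)
  dot(v1, v2) / (mag(v1.values) * mag(v2.values))
end

cosine_sketch('abc', 'xyz')           # no shared characters, so 0.0
cosine_sketch('ab', 'bc')             # one shared character out of two, close to 0.5
cosine_sketch('night', 'nacht', ngram: 2)
```

With ngram: 2 the strings are compared by character pairs rather than single characters, so letter order starts to matter.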

.levenshtein(str1, str2) ⇒ Float

Calculate the Levenshtein similarity for two strings.

This is the reciprocal of the levenshtein_distance, i.e.

1 / levenshtein_distance(str1, str2)

Parameters:

  • str1 (String)

    first string

  • str2 (String)

    second string

Returns:

  • (Float)

    Levenshtein similarity of the two arguments.

    • 1.0 if the strings are identical

    • 0.0 if one of the strings is empty

See Also:

  • .levenshtein_distance


# File 'lib/string/similarity.rb', line 49

def self.levenshtein(str1, str2)
  return 1.0 if str1.eql?(str2)
  return 0.0 if str1.empty? || str2.empty?
  1.0 / levenshtein_distance(str1, str2)
end
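
To make the reciprocal mapping concrete, the sketch below applies the same three rules to an edit distance that is passed in directly. The helper similarity_for is hypothetical and exists only for this illustration; the distances used are the standard values for these word pairs.

```ruby
# Hypothetical helper, not part of the library: applies the same rules as the
# listing above to an edit distance that was computed elsewhere.
def similarity_for(str1, str2, distance)
  return 1.0 if str1.eql?(str2)
  return 0.0 if str1.empty? || str2.empty?

  1.0 / distance
end

similarity_for('kitten', 'kitten', 0)   # identical strings short-circuit to 1.0
similarity_for('kitten', '', 6)         # an empty argument short-circuits to 0.0
similarity_for('kitten', 'sitten', 1)   # one edit: 1.0 / 1 = 1.0
similarity_for('kitten', 'sitting', 3)  # three edits: 1.0 / 3, roughly 0.333
```

One consequence of the formula as written is that two strings a single edit apart score 1.0, the same as identical strings, so callers that need to tell those cases apart may prefer to inspect the distance itself.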

.levenshtein_distance(str1, str2) ⇒ Fixnum

Calculate the Levenshtein distance of two strings.

Parameters:

  • str1 (String)

    first string

  • str2 (String)

    second string

Returns:

  • (Fixnum)

    edit distance between the two strings

    • 0 if the strings are identical



# File 'lib/string/similarity.rb', line 62

def self.levenshtein_distance(str1, str2)
  # base cases
  result = base_case?(str1, str2)
  return result if result

  # Initialize cost-matrix rows
  previous = (0..str2.length).to_a
  current = []

  (0...str1.length).each do |i|
    # first element is always the edit distance from an empty string.
    current[0] = i + 1
    (0...str2.length).each do |j|
      current[j + 1] = [
        # insertion
        current[j] + 1,
        # deletion
        previous[j + 1] + 1,
        # substitution or no operation
        previous[j] + (str1[i].eql?(str2[j]) ? 0 : 1)
      ].min
    end
    previous = current.dup
  end

  current[str2.length]
end
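
The listing above calls a private base_case? helper that is not shown on this page, and it keeps only two rows of the cost matrix at a time. The sketch below is an illustration under assumptions: base_case? is a guess at the helper's shape (it returns the distance for trivial inputs and false otherwise, which works because 0 is truthy in Ruby), and levenshtein_rows is a hypothetical variant of the same recurrence that keeps every row so the matrix can be inspected.

```ruby
# Assumed shape of the private helper: the distance for trivial inputs,
# or false when the full computation is needed. Note that 0 is truthy in
# Ruby, so `return result if result` also fires for identical strings.
def base_case?(str1, str2)
  return 0 if str1.eql?(str2)
  return str2.length if str1.empty?
  return str1.length if str2.empty?

  false
end

# Same recurrence as the listing, but every finished row is kept.
def levenshtein_rows(str1, str2)
  # Row 0: distance from the empty string to each prefix of str2.
  rows = [(0..str2.length).to_a]
  str1.each_char.with_index do |char, i|
    current = [i + 1] # distance from a prefix of str1 to the empty string
    str2.each_char.with_index do |other, j|
      current << [
        current[j] + 1,                         # insertion
        rows.last[j + 1] + 1,                   # deletion
        rows.last[j] + (char == other ? 0 : 1)  # substitution or match
      ].min
    end
    rows << current
  end
  rows
end

rows = levenshtein_rows('cat', 'cut')
rows.each { |row| puts row.join(' ') }
rows.last.last # => 1, the bottom-right cell is the edit distance
```

Keeping only the previous and current rows, as the library listing does, gives the same final cell while using memory proportional to the length of str2 rather than to the product of both lengths.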