Sunday, February 17, 2013

Longest Common Substring

Given two strings s1 and s2. Find out the longest common substring.
Lets solve this problem using Dynamic Programming.
We assume a 2D matrix such that m[i][j] is the maximum length of LCS in substring s1[0..i] and s2[0..j].
Recursive formula to get the length of LCS:
m[i][j]  = 0 if(i==0, j==0)
         = m[i-1][j-1] + 1 if(s1[i] == s2[j])
         = max( m[i][j-1], m[i-1][j] ) if(s1[i] != s2[j])

Using this recursive formula m[s1.size()][s2.size()] will give the maximum length of the common substring.
Code to get the LCS matrix:
 int **getLCSMatrix(string s1, string s2) {
  int r = s1.size() + 1, c = s2.size() + 1;
  int **m = (int**) malloc(r * sizeof(int *));
  for (int i = 0; i < c; i++)
   m[i] = (int *) malloc(c * sizeof(int));

  for (int i = 0; i < r; i++)
   m[i][0] = 0;

  for (int i = 0; i < c; i++)
   m[0][i] = 0;

  for (int i = 1; i < r; i++) {
   for (int j = 1; j < c; j++) {
    if (s1[i - 1] == s2[j - 1]) {
     m[i][j] = m[i - 1][j - 1] + 1;
    } else {
     m[i][j] = max(m[i - 1][j], m[i][j - 1]);
    }
   }
  }
  return m;
 }

Now, we can easily use this matrix to get the maximum length.
Output:
LCS length: 3
Reading out an LCS:
The following code backtracks the choices taken when computing the LCS matrix. If the last characters in the prefixes are equal, they must be in an LCS. If not, check what gave the largest LCS of keeping s1[i] and s2[j], and make the same choice. Just choose one if they were equally long. Call the function with i=s1.size() and j=s2.size().
 string getLCS(int **m, int i, int j, string s1, string s2) {
  if (i == 0 || j == 0)
   return "";
  if (s1[i - 1] == s2[j - 1])
   return getLCS(m, i - 1, j - 1, s1, s2) + s1[i - 1];
  else if (m[i - 1][j] < m[i][j - 1])
   return getLCS(m, i, j - 1, s1, s2);
  else
   return getLCS(m, i - 1, j, s1, s2);
 }
Reading out all the LCS:
 set getAllLCS(int **m, int i, int j, string s1, string s2,
   string s) {
  set v;
  if (i == 0 || j == 0) {
   v.insert(s);
   return v;
  }

  if (s1[i - 1] == s2[j - 1]) {
   set v1 = getAllLCS(m, i - 1, j - 1, s1, s2, s1[i - 1] + s);
   v.insert(v1.begin(), v1.end());
  } else if (m[i - 1][j] == m[i][j - 1]) {
   set v1 = getAllLCS(m, i - 1, j, s1, s2, s);
   set v2 = getAllLCS(m, i, j - 1, s1, s2, s);
   v.insert(v1.begin(), v1.end());
   v.insert(v2.begin(), v2.end());
  } else if (m[i - 1][j] < m[i][j - 1]) {
   set v1 = getAllLCS(m, i, j - 1, s1, s2, s);
   v.insert(v1.begin(), v1.end());
  } else {
   set v1 = getAllLCS(m, i - 1, j, s1, s2, s);
   v.insert(v1.begin(), v1.end());
  }
  return v;
 }
You can find the complete source code on my github page: Here

1 comment:

  1. It is a solution for longest common sub-sequence not longest common substring

    ReplyDelete