Exploring Needleman-Wunsch Algorithm for Global Alignment in Python
Written on
Chapter 1: Introduction to Needleman-Wunsch
In our previous discussion, we explored the fundamental workings of the Needleman-Wunsch algorithm. To recap, this algorithm is utilized to determine the optimal global alignment of two biological sequences based on a defined scoring function.
Now, we’ll advance our understanding by implementing this algorithm in Python, while also examining how Biopython can assist us in the process. This project is an excellent opportunity to enhance your Python programming skills, so let’s get started!
Chapter 1.1: Setting Up the Environment
Before we begin with the Needleman-Wunsch algorithm, we need to prepare a few essentials. First and foremost, we require two sequences to align. Additionally, we must define a scoring function that includes match, mismatch, and gap scores.
# Sequences to align
seq0 = "ACGATACG"
seq1 = "TACGTCG"
# Scoring function
match_score = 1
mismatch_score = -1
gap_score = -2
We can set match and mismatch scores as illustrated above, or alternatively, create a substitution matrix. The latter is particularly relevant for amino acid sequences where conservation is a concern.
# Substitution matrix
substitution_matrix = {
("A", "A"): 1, ("A", "C"): -1, ("A", "G"): -1, ("A", "T"): -1,
("C", "A"): -1, ("C", "C"): 1, ("C", "G"): -1, ("C", "T"): -1,
("G", "A"): -1, ("G", "C"): -1, ("G", "G"): 1, ("G", "T"): -1,
("T", "A"): -1, ("T", "C"): -1, ("T", "G"): -1, ("T", "T"): 1,
}
For simplicity, we will stick to using match and mismatch scores instead of a substitution matrix.
Chapter 1.2: Initializing the Algorithm
We will briefly touch on the details of the Needleman-Wunsch algorithm, as they are more thoroughly covered in a prior article. The algorithm involves traversing the sequences and addressing local alignment challenges. We need to construct both a scoring matrix and a traceback matrix.
The scoring matrix captures scores for local alignments, while the traceback matrix contains information about how these scores were derived. Both matrices are represented as 2D arrays, with one sequence along the rows and the other along the columns. We will add padding to accommodate alignment failures, and we can build these matrices using NumPy.
import numpy as np
scoring = np.zeros((len(seq0) + 1, len(seq1) + 1))
traceback = np.zeros((len(seq0) + 1, len(seq1) + 1), dtype="str")
Next, we initialize the algorithm by assigning gap scores along the first row and column, representing the scenario where alignment is not performed.
# Initialize scores
scoring[0, :] = np.arange(len(seq1) + 1) * gap_score
scoring[:, 0] = np.arange(len(seq0) + 1) * gap_score
# Initialize traceback
traceback[0, 1:] = "H"
traceback[1:, 0] = "V"
Now, let's visualize these matrices.
print(scoring)
Expected output:
[[ 0. -2. -4. -6. -8. -10. -12. -14.]
[ -2. 0. 0. 0. 0. 0. 0. 0.]
[ -4. 0. 0. 0. 0. 0. 0. 0.]
[ -6. 0. 0. 0. 0. 0. 0. 0.]
[ -8. 0. 0. 0. 0. 0. 0. 0.]
[-10. 0. 0. 0. 0. 0. 0. 0.]
[-12. 0. 0. 0. 0. 0. 0. 0.]
[-14. 0. 0. 0. 0. 0. 0. 0.]
[-16. 0. 0. 0. 0. 0. 0. 0.]
print(traceback)
Expected output:
[['' 'H' 'H' 'H' 'H' 'H' 'H' 'H']
['V' '' '' '' '' '' '' '']
['V' '' '' '' '' '' '' '']
['V' '' '' '' '' '' '' '']
['V' '' '' '' '' '' '' '']
['V' '' '' '' '' '' '' '']
['V' '' '' '' '' '' '' '']
['V' '' '' '' '' '' '' '']
['V' '' '' '' '' '' '' '']]
Chapter 1.3: Scoring the Algorithm
Now we proceed to fill in the scoring and traceback matrices. We will calculate values for cells based on the left, upper, and diagonal cells. Starting from cell (1, 1), we evaluate how we can arrive at that cell through a match, mismatch, or gap.
# Define cell
i = 1
j = 1
# Diagonal score
if seq0[i - 1] == seq1[j - 1]:
scoreD = scoring[i - 1, j - 1] + match_score
else:
scoreD = scoring[i - 1, j - 1] + mismatch_score
# Horizontal score
scoreH = scoring[i, j - 1] + gap_score
# Vertical score
scoreV = scoring[i - 1, j] + gap_score
# Determine best score
best_score = np.max([scoreD, scoreH, scoreV])
if scoreD == best_score:
scoring[i, j] = scoreD
traceback[i, j] = "D"
elif scoreH == best_score:
scoring[i, j] = scoreH
traceback[i, j] = "H"
elif scoreV == best_score:
scoring[i, j] = scoreV
traceback[i, j] = "V"
To apply this logic to all cells, we can utilize a loop that traverses the matrix from (1, 1) onwards.
# Loop to fill the scoring and traceback matrices
for i in range(1, len(seq0) + 1):
for j in range(1, len(seq1) + 1):
# Diagonal score
if seq0[i - 1] == seq1[j - 1]:
scoreD = scoring[i - 1, j - 1] + match_scoreelse:
scoreD = scoring[i - 1, j - 1] + mismatch_score
# Horizontal score
scoreH = scoring[i, j - 1] + gap_score
# Vertical score
scoreV = scoring[i - 1, j] + gap_score
# Determine best score
best_score = np.max([scoreD, scoreH, scoreV])
if scoreD == best_score:
scoring[i, j] = scoreD
traceback[i, j] = "D"
elif scoreH == best_score:
scoring[i, j] = scoreH
traceback[i, j] = "H"
elif scoreV == best_score:
scoring[i, j] = scoreV
traceback[i, j] = "V"
Let's examine the completed scoring and traceback matrices.
print(scoring)
Expected output:
[[ 0. -2. -4. -6. -8. -10. -12. -14.]
[ -2. -1. -1. -3. -5. -7. -9. -11.]
[ -4. -3. -2. 0. -2. -4. -6. -8.]
[ -6. -5. -4. -2. 1. -1. -3. -5.]
[ -8. -7. -4. -4. -1. 0. -2. -4.]
[-10. -7. -6. -5. -3. 0. -1. -3.]
[-12. -9. -6. -7. -5. -2. -1. -2.]
[-14. -11. -8. -5. -7. -4. -1. -2.]
[-16. -13. -10. -7. -4. -6. -3. 0.]]
print(traceback)
Expected output:
[['' 'H' 'H' 'H' 'H' 'H' 'H' 'H']
['V' 'D' 'D' 'H' 'H' 'H' 'H' 'H']
['V' 'D' 'D' 'D' 'H' 'H' 'D' 'H']
['V' 'D' 'D' 'V' 'D' 'H' 'H' 'D']
['V' 'D' 'D' 'V' 'V' 'D' 'D' 'D']
['V' 'D' 'V' 'D' 'V' 'D' 'D' 'D']
['V' 'V' 'D' 'D' 'V' 'V' 'D' 'D']
['V' 'V' 'V' 'D' 'H' 'V' 'D' 'D']
['V' 'V' 'V' 'V' 'D' 'H' 'V' 'D']]
Chapter 2: Performing Traceback
The final step is to execute the traceback, starting from the last cell of the traceback matrix and following the directions to produce the alignment. The letters from both sequences are accepted for "D", while gaps are introduced for "H" and "V".
We will implement this using a while loop that continues until we reach the first cell (0, 0). Gaps will be represented by a dash "-".
# Variables to store alignments
aligned_seq0 = ""
aligned_seq1 = ""
# Perform traceback
i = len(seq0)
j = len(seq1)
while i != 0 or j != 0:
# Resolve "D"
if traceback[i, j] == "D":
aligned_seq0 = seq0[i - 1] + aligned_seq0
aligned_seq1 = seq1[j - 1] + aligned_seq1
i -= 1
j -= 1
# Resolve "H"
elif traceback[i, j] == "H":
aligned_seq0 = "-" + aligned_seq0
aligned_seq1 = seq1[j - 1] + aligned_seq1
j -= 1
# Resolve "V"
elif traceback[i, j] == "V":
aligned_seq0 = seq0[i - 1] + aligned_seq0
aligned_seq1 = "-" + aligned_seq1
i -= 1
This method may seem peculiar as we construct the aligned sequences in reverse order, beginning with the last letters of both sequences. As we follow the diagonal, horizontal, or vertical paths, we appropriately decrement the indices.
Finally, let's print the resulting alignments.
# Display the alignment
print(aligned_seq0)
print(aligned_seq1)
Expected output:
-ACGATACG
TACG-T-CG
We can also output the alignment score, which corresponds to the value in the last cell.
# Display the alignment score
print(scoring[-1, -1])
Expected output:
0
If there's any doubt regarding the final score, it can be verified through recomputation. In this instance, we have 3 gaps, 6 matches, and no mismatches. The calculation yields a final score of 0, confirming our earlier result.
Chapter 2.1: Supplementary Video Resources
To further enhance your understanding of the Needleman-Wunsch algorithm, check out these insightful videos.
The first video, "Sequence Alignment | Needleman Wunsch in Python," provides a comprehensive overview of the algorithm’s implementation in Python.
The second video, "Needleman-Wunsch algorithm for global alignment - Python," delves deeper into the mechanics of the algorithm, ensuring you grasp its application fully.