diet-okikae.com

Exploring Needleman-Wunsch Algorithm for Global Alignment in Python

Written on

Chapter 1: Introduction to Needleman-Wunsch

In our previous discussion, we explored the fundamental workings of the Needleman-Wunsch algorithm. To recap, this algorithm is utilized to determine the optimal global alignment of two biological sequences based on a defined scoring function.

Now, we’ll advance our understanding by implementing this algorithm in Python, while also examining how Biopython can assist us in the process. This project is an excellent opportunity to enhance your Python programming skills, so let’s get started!

Chapter 1.1: Setting Up the Environment

Before we begin with the Needleman-Wunsch algorithm, we need to prepare a few essentials. First and foremost, we require two sequences to align. Additionally, we must define a scoring function that includes match, mismatch, and gap scores.

# Sequences to align

seq0 = "ACGATACG"

seq1 = "TACGTCG"

# Scoring function

match_score = 1

mismatch_score = -1

gap_score = -2

We can set match and mismatch scores as illustrated above, or alternatively, create a substitution matrix. The latter is particularly relevant for amino acid sequences where conservation is a concern.

# Substitution matrix

substitution_matrix = {

("A", "A"): 1, ("A", "C"): -1, ("A", "G"): -1, ("A", "T"): -1,

("C", "A"): -1, ("C", "C"): 1, ("C", "G"): -1, ("C", "T"): -1,

("G", "A"): -1, ("G", "C"): -1, ("G", "G"): 1, ("G", "T"): -1,

("T", "A"): -1, ("T", "C"): -1, ("T", "G"): -1, ("T", "T"): 1,

}

For simplicity, we will stick to using match and mismatch scores instead of a substitution matrix.

Chapter 1.2: Initializing the Algorithm

We will briefly touch on the details of the Needleman-Wunsch algorithm, as they are more thoroughly covered in a prior article. The algorithm involves traversing the sequences and addressing local alignment challenges. We need to construct both a scoring matrix and a traceback matrix.

The scoring matrix captures scores for local alignments, while the traceback matrix contains information about how these scores were derived. Both matrices are represented as 2D arrays, with one sequence along the rows and the other along the columns. We will add padding to accommodate alignment failures, and we can build these matrices using NumPy.

import numpy as np

scoring = np.zeros((len(seq0) + 1, len(seq1) + 1))

traceback = np.zeros((len(seq0) + 1, len(seq1) + 1), dtype="str")

Next, we initialize the algorithm by assigning gap scores along the first row and column, representing the scenario where alignment is not performed.

# Initialize scores

scoring[0, :] = np.arange(len(seq1) + 1) * gap_score

scoring[:, 0] = np.arange(len(seq0) + 1) * gap_score

# Initialize traceback

traceback[0, 1:] = "H"

traceback[1:, 0] = "V"

Now, let's visualize these matrices.

print(scoring)

Expected output:

[[ 0. -2. -4. -6. -8. -10. -12. -14.]

[ -2. 0. 0. 0. 0. 0. 0. 0.]

[ -4. 0. 0. 0. 0. 0. 0. 0.]

[ -6. 0. 0. 0. 0. 0. 0. 0.]

[ -8. 0. 0. 0. 0. 0. 0. 0.]

[-10. 0. 0. 0. 0. 0. 0. 0.]

[-12. 0. 0. 0. 0. 0. 0. 0.]

[-14. 0. 0. 0. 0. 0. 0. 0.]

[-16. 0. 0. 0. 0. 0. 0. 0.]

print(traceback)

Expected output:

[['' 'H' 'H' 'H' 'H' 'H' 'H' 'H']

['V' '' '' '' '' '' '' '']

['V' '' '' '' '' '' '' '']

['V' '' '' '' '' '' '' '']

['V' '' '' '' '' '' '' '']

['V' '' '' '' '' '' '' '']

['V' '' '' '' '' '' '' '']

['V' '' '' '' '' '' '' '']

['V' '' '' '' '' '' '' '']]

Chapter 1.3: Scoring the Algorithm

Now we proceed to fill in the scoring and traceback matrices. We will calculate values for cells based on the left, upper, and diagonal cells. Starting from cell (1, 1), we evaluate how we can arrive at that cell through a match, mismatch, or gap.

# Define cell

i = 1

j = 1

# Diagonal score

if seq0[i - 1] == seq1[j - 1]:

scoreD = scoring[i - 1, j - 1] + match_score

else:

scoreD = scoring[i - 1, j - 1] + mismatch_score

# Horizontal score

scoreH = scoring[i, j - 1] + gap_score

# Vertical score

scoreV = scoring[i - 1, j] + gap_score

# Determine best score

best_score = np.max([scoreD, scoreH, scoreV])

if scoreD == best_score:

scoring[i, j] = scoreD

traceback[i, j] = "D"

elif scoreH == best_score:

scoring[i, j] = scoreH

traceback[i, j] = "H"

elif scoreV == best_score:

scoring[i, j] = scoreV

traceback[i, j] = "V"

To apply this logic to all cells, we can utilize a loop that traverses the matrix from (1, 1) onwards.

# Loop to fill the scoring and traceback matrices

for i in range(1, len(seq0) + 1):

for j in range(1, len(seq1) + 1):

# Diagonal score

if seq0[i - 1] == seq1[j - 1]:

scoreD = scoring[i - 1, j - 1] + match_score

else:

scoreD = scoring[i - 1, j - 1] + mismatch_score

# Horizontal score

scoreH = scoring[i, j - 1] + gap_score

# Vertical score

scoreV = scoring[i - 1, j] + gap_score

# Determine best score

best_score = np.max([scoreD, scoreH, scoreV])

if scoreD == best_score:

scoring[i, j] = scoreD

traceback[i, j] = "D"

elif scoreH == best_score:

scoring[i, j] = scoreH

traceback[i, j] = "H"

elif scoreV == best_score:

scoring[i, j] = scoreV

traceback[i, j] = "V"

Let's examine the completed scoring and traceback matrices.

print(scoring)

Expected output:

[[ 0. -2. -4. -6. -8. -10. -12. -14.]

[ -2. -1. -1. -3. -5. -7. -9. -11.]

[ -4. -3. -2. 0. -2. -4. -6. -8.]

[ -6. -5. -4. -2. 1. -1. -3. -5.]

[ -8. -7. -4. -4. -1. 0. -2. -4.]

[-10. -7. -6. -5. -3. 0. -1. -3.]

[-12. -9. -6. -7. -5. -2. -1. -2.]

[-14. -11. -8. -5. -7. -4. -1. -2.]

[-16. -13. -10. -7. -4. -6. -3. 0.]]

print(traceback)

Expected output:

[['' 'H' 'H' 'H' 'H' 'H' 'H' 'H']

['V' 'D' 'D' 'H' 'H' 'H' 'H' 'H']

['V' 'D' 'D' 'D' 'H' 'H' 'D' 'H']

['V' 'D' 'D' 'V' 'D' 'H' 'H' 'D']

['V' 'D' 'D' 'V' 'V' 'D' 'D' 'D']

['V' 'D' 'V' 'D' 'V' 'D' 'D' 'D']

['V' 'V' 'D' 'D' 'V' 'V' 'D' 'D']

['V' 'V' 'V' 'D' 'H' 'V' 'D' 'D']

['V' 'V' 'V' 'V' 'D' 'H' 'V' 'D']]

Chapter 2: Performing Traceback

The final step is to execute the traceback, starting from the last cell of the traceback matrix and following the directions to produce the alignment. The letters from both sequences are accepted for "D", while gaps are introduced for "H" and "V".

We will implement this using a while loop that continues until we reach the first cell (0, 0). Gaps will be represented by a dash "-".

# Variables to store alignments

aligned_seq0 = ""

aligned_seq1 = ""

# Perform traceback

i = len(seq0)

j = len(seq1)

while i != 0 or j != 0:

# Resolve "D"

if traceback[i, j] == "D":

aligned_seq0 = seq0[i - 1] + aligned_seq0

aligned_seq1 = seq1[j - 1] + aligned_seq1

i -= 1

j -= 1

# Resolve "H"

elif traceback[i, j] == "H":

aligned_seq0 = "-" + aligned_seq0

aligned_seq1 = seq1[j - 1] + aligned_seq1

j -= 1

# Resolve "V"

elif traceback[i, j] == "V":

aligned_seq0 = seq0[i - 1] + aligned_seq0

aligned_seq1 = "-" + aligned_seq1

i -= 1

This method may seem peculiar as we construct the aligned sequences in reverse order, beginning with the last letters of both sequences. As we follow the diagonal, horizontal, or vertical paths, we appropriately decrement the indices.

Finally, let's print the resulting alignments.

# Display the alignment

print(aligned_seq0)

print(aligned_seq1)

Expected output:

-ACGATACG

TACG-T-CG

We can also output the alignment score, which corresponds to the value in the last cell.

# Display the alignment score

print(scoring[-1, -1])

Expected output:

0

If there's any doubt regarding the final score, it can be verified through recomputation. In this instance, we have 3 gaps, 6 matches, and no mismatches. The calculation yields a final score of 0, confirming our earlier result.

Chapter 2.1: Supplementary Video Resources

To further enhance your understanding of the Needleman-Wunsch algorithm, check out these insightful videos.

The first video, "Sequence Alignment | Needleman Wunsch in Python," provides a comprehensive overview of the algorithm’s implementation in Python.

The second video, "Needleman-Wunsch algorithm for global alignment - Python," delves deeper into the mechanics of the algorithm, ensuring you grasp its application fully.

Share the page:

Twitter Facebook Reddit LinkIn

-----------------------

Recent Post:

Creating Multi-Styled Lines in a React Application Using Visx

Learn to add multiple styled lines to your React app using the Visx library with this detailed guide.

Transform Your Mornings: 6 Strategies for Increased Productivity

Discover six effective strategies to enhance your morning routine and boost productivity for a successful day ahead.

Navigating Remote Negotiations: 9 Key Strategies for Success

Discover nine effective strategies for enhancing your remote negotiations in a post-pandemic world.