Bioinformatics Tool for Processing DNA Sequences

Over the summer, I’ve been working on a bioinformatics tool for my internship. It handles preprocessing for the genotyping-by-sequencing (GBS) pipeline: it parses .fastq files so that the processed sequences can later be aligned, base called, or analyzed.

The tool has a lot of features, and I will be presenting it at a symposium and writing a paper about it.

It can process both newer UMI (unique molecular identifier) reads and non-UMI reads. Creating the tool was really exciting, since I came up with modifications to existing algorithms to optimize them and add more features.

For example, I combined the Wagner-Fischer dynamic programming algorithm with Sellers’ variant of it, and padded the strings with extra characters, to achieve semi-global alignment, a variable offset, and a dynamic total length for the strings.
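To make the semi-global idea concrete, here is a minimal sketch of Sellers' variant of Wagner-Fischer on its own. It is not the tool's actual code and leaves out the padding trick; the function name `sellers_best_match` and its return value are my own choices for illustration. The only change from the plain edit-distance table is that the top cell of every column is 0, so a match may start anywhere in the text for free.

```python
from typing import Tuple


def sellers_best_match(pattern: str, text: str) -> Tuple[int, int]:
    """Return (edit distance, end index in text) of the best approximate
    occurrence of pattern inside text. Hypothetical helper for illustration."""
    # Column for the empty text prefix: matching pattern[:i] against nothing costs i edits.
    prev = list(range(len(pattern) + 1))
    best_dist, best_end = prev[-1], 0
    for j, t in enumerate(text, start=1):
        # Sellers' change: the top cell is 0, so skipping text before the match is free.
        curr = [0]
        for i, p in enumerate(pattern, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[i - 1] + cost,  # match or substitution
                            prev[i] + 1,         # extra character in the text
                            curr[i - 1] + 1))    # missing character in the text
        # The bottom cell is the cost of the best match that ends at text position j.
        if curr[-1] < best_dist:
            best_dist, best_end = curr[-1], j
        prev = curr
    return best_dist, best_end
```

Calling `sellers_best_match("ACGT", "TTACGTTT")` finds the exact occurrence that ends at index 6 of the text, and a pattern with one mismatch against the text comes back with a distance of 1.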

For quality trimming, there are two algorithms to choose from: one uses a “sliding window”, and the other uses a local average.
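As a rough illustration of the sliding window approach, the sketch below cuts a read at the first window whose average quality falls below a threshold. This is the common textbook form of the idea, not necessarily what the tool does; the names `phred33` and `sliding_window_trim`, the window size of 4, and the threshold of 20 are all assumptions for the example.

```python
def phred33(quality_line):
    """Decode a FASTQ quality string, assuming the Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_line]


def sliding_window_trim(quals, window=4, min_avg=20):
    """Return how many leading bases to keep (hypothetical helper).

    The read is cut at the start of the first window whose mean
    quality is below min_avg."""
    window = min(window, len(quals))  # reads shorter than the window form one window
    if window == 0:
        return 0
    total = sum(quals[:window])
    for start in range(len(quals) - window + 1):
        if start > 0:
            # Slide by one base: add the base entering, drop the base leaving.
            total += quals[start + window - 1] - quals[start - 1]
        if total < min_avg * window:
            return start
    return len(quals)
```

With scores `[30, 30, 30, 30, 30, 10, 10, 10, 10]` and the defaults, the first failing window starts at index 4, so only the first four bases are kept. Keeping a running total means each slide costs constant time instead of re-summing the window.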

When trimming ‘N’ (undetermined) base pairs, the algorithm looks for the cut index that maximizes the percentage of ‘N’ between an end of the sequence and that index. To keep a run of consecutive ‘N’ at the very end of a sequence from skewing that percentage, the initial consecutive ‘N’ are trimmed automatically, and the percentage calculation starts at the first non-‘N’ base pair.
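Here is one way I could sketch that in a few lines. It is a simplified reading of the description above rather than the tool's real code: the name `trim_n_start` and the `min_fraction` cutoff (the best percentage has to reach it before anything beyond the leading run is cut) are assumptions added for the example.

```python
def trim_n_start(seq, min_fraction=0.5):
    """Trim 'N' bases from the start of seq (hypothetical helper)."""
    # Step 1: the leading run of 'N' is always removed.
    start = 0
    while start < len(seq) and seq[start] == "N":
        start += 1
    # Step 2: from the first non-'N' base, find the cut that maximizes
    # the fraction of 'N' between that base and the current index.
    best_fraction, best_cut = 0.0, start
    n_count = 0
    for i in range(start, len(seq)):
        if seq[i] == "N":
            n_count += 1
        fraction = n_count / (i - start + 1)
        if fraction > best_fraction:
            best_fraction, best_cut = fraction, i + 1
    # Assumed rule: only cut further if the best fraction is high enough.
    if best_fraction >= min_fraction:
        return seq[best_cut:]
    return seq[start:]
```

For `"NNANNNACGTACGT"`, the leading `NN` is dropped, the prefix `ANNN` has the highest share of ‘N’ (3 out of 4), and the result is `"ACGTACGT"`. The other end of a read can be handled by reversing it first: `trim_n_start(seq[::-1])[::-1]`.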

I also came up with a simple and cool compression algorithm for the sequences so they use less memory during deduplication, where a lot of sequences have to be saved in a HashMap.
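I won't go into my exact scheme here, but a common baseline for this kind of compression is 2-bit packing, where each of the four bases takes two bits instead of a full character. The sketch below shows that baseline only; it is not the tool's algorithm, the names `pack` and `unpack` are made up for the example, and it assumes the sequence contains only A, C, G, and T.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into one integer, two bits per base."""
    value = 1  # sentinel bit, so leading 'A's (code 0) are not lost
    for base in seq:
        value = (value << 2) | _CODE[base]  # raises KeyError on any other character
    return value


def unpack(value):
    """Invert pack(): read two bits at a time until only the sentinel is left."""
    bases = []
    while value > 1:
        bases.append(_BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))


# Deduplication sketch: count reads by their packed key.
counts = {}
for read in ["ACGT", "AACG", "ACGT"]:
    key = pack(read)
    counts[key] = counts.get(key, 0) + 1
```

The packed integers are hashable, so they can be used directly as map keys, and `unpack` recovers the original sequence when it is time to write the output.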

The source code can be found here (GitHub), and more information can be found on the GitHub wiki for the program. I’ll post and link to the paper when it’s all done (hopefully it gets published!).


