Searching for Variant Sequences with Pinpoint Accuracy

2 Aug 2018

The global market for bioengineered protein drugs is expected to reach $228.4 billion by 2021, up from $172.5 billion in 2016, rising at a compound annual growth rate (CAGR) of 5.8% from 2016 through 2021, according to BCC Research.

An article in APL Bioengineering estimated that by 2018 the global market for industrial enzymes, many of which are sequence variants, would surpass $7.1 billion, with a five-year CAGR of around 8.2%.

With bioengineering and rational design come variant sequences: one or many changes made in a protein or DNA sequence in order to impart desired characteristics to the molecule being studied. A quick text search for patents related to variant proteins uncovered almost 90,000 families and 188,000 patent documents, with 24,226 families and almost 40,000 documents in the last two years alone. This volume of IP calls for effective and efficient variant searching.

Variation searching with existing tools is difficult and labor-intensive, requiring multiple workarounds that may still yield incomplete results. Because it’s not well understood and little training is available, variation searching is not done as routinely as other forms of sequence searching. As a result, opportunities are missed and unnecessary risks are taken.

Although sequence variation searching is applicable to a broad diversity of technologies and applications, the challenges and underlying methodology have much in common.

There are two basic types of searches:

Single point mutations at a specific position: SNPs, gRNA, gene editing; and
Multiple position variations: enzyme optimization.

These changes can be either a variation on an extensively cited common backbone or, in the simplest cases, a single point mutation on a narrowly referenced region of sequence. For multiple variations, it’s common to search for multiple combinations of changes, not just a single change.

Single Position Change

Intuitively, a single position change is the simplest search case. The query sequence is written with the desired change, a search is performed and the results are screened at 100% identity. That returns all hits with just that single change. However, other potentially in-scope results will be missed with this method.

The variant position might have a different residue or a degeneracy character instead of a defined residue, or there could be additional changes within the hit sequences which will be removed with the 100% identity filter.
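As a minimal sketch of the first step described above, the desired change can be written into the query programmatically. The helper name `apply_substitution`, the toy parent sequence and the simple notation parser are illustrative assumptions, not part of any search product.

```python
import re


def apply_substitution(sequence: str, mutation: str) -> str:
    """Return a copy of `sequence` with one substitution such as 'N76D' applied.

    Positions are 1-based, as they are in patent claims. A ValueError is
    raised if the notation is not understood, the position is out of range,
    or the wild-type residue named in the mutation is not found there.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation notation: {mutation!r}")
    wild_type, variant = match.group(1), match.group(3)
    position = int(match.group(2))
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    index = position - 1
    if sequence[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + variant + sequence[index + 1:]


# Toy parent sequence (hypothetical, not taken from any sequence listing).
parent = "MKTNAYLG"
query = apply_substitution(parent, "N4D")
print(query)
```

Checking the wild-type residue before substituting guards against off-by-one numbering errors, which are easy to make when a claim numbers residues differently from the sequence listing.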

Using current methodology, a two-pronged approach combining percent identity and coordinate-specific screening with the appropriate sequence search algorithm provides the best results. A lower percent identity filter is necessary to find hits of this nature, combined with a coordinate-specific screen to narrow results to ones covering the region of variation.
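The two-pronged screen can be sketched as a simple filter over exported hit records. The `Hit` record, its field names and the example values are assumptions made for illustration; real search exports will use their own column names.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    hit_id: str
    percent_identity: float
    query_start: int  # 1-based, inclusive query coordinate of the alignment
    query_end: int    # 1-based, inclusive


def screen_hits(hits, position, min_identity):
    """Keep hits at or above `min_identity` whose alignment covers `position`."""
    return [
        hit
        for hit in hits
        if hit.percent_identity >= min_identity
        and hit.query_start <= position <= hit.query_end
    ]


# Hypothetical result set; position 76 is the variant position of interest.
hits = [
    Hit("A", 100.0, 1, 300),   # exact match covering the position
    Hit("B", 97.5, 50, 120),   # lower identity, still covers the position
    Hit("C", 98.0, 100, 300),  # high identity but does not cover position 76
    Hit("D", 80.0, 1, 300),    # covers the position but below the cutoff
]
kept = screen_hits(hits, position=76, min_identity=95.0)
print([hit.hit_id for hit in kept])
```

The filter only narrows the set to candidates; it does not say which residue each hit carries at the position.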

The limitation on this method is that it is labor-intensive; each result that is not an exact match must be reviewed manually for the presence or absence of the desired variation. There is no automated way to limit results to one position-specific change without losing additional potentially in-scope results.
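To make concrete what that per-hit review amounts to, here is a sketch that inspects one pairwise alignment, assuming the alignment has been exported as two gapped strings of equal length. The function name, the category labels and the set of ambiguity codes treated as degenerate are assumptions for illustration.

```python
# Protein ambiguity codes treated as "degenerate" in this sketch.
DEGENERATE = {"X", "B", "Z", "J"}


def classify_hit(aligned_query, aligned_subject, query_start,
                 position, wild_type, variant):
    """Report what the subject carries at a 1-based query `position`.

    `aligned_query` and `aligned_subject` use '-' for gaps. `query_start`
    is the query coordinate of the first aligned query residue.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have the same length")
    query_pos = query_start - 1
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-":
            continue  # insertion in the subject: no query coordinate consumed
        query_pos += 1
        if query_pos == position:
            if s == variant:
                return "variant"
            if s == wild_type:
                return "wild type"
            if s == "-":
                return "deleted"
            if s in DEGENERATE:
                return "degenerate"
            return "other residue"
    return "not covered"


# Toy alignment: the subject has an extra G inserted before query position 4.
print(classify_hit("MKT-DAYLG", "MKTGDAYLG", 1, 4, "N", "D"))
print(classify_hit("MKT-DAYLG", "MKTGXAYLG", 1, 4, "N", "D"))
```

Hits classified as "degenerate" or "other residue" are the ones a 100% identity filter would have discarded, and they still need a human decision on whether they are in scope.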

Multiple-Position Variations

Multiple-position variations are far more difficult. An example is found in US patent application publication number 20150087572:

“In one aspect, an automatic dishwashing detergent composition comprising a variant protease of a parent protease, said parent protease amino acid sequence being identical to the amino acid sequence of SEQ ID NO:1, said variant protease of said parent protease mutations consisting of one of the following sets of mutations versus said parent protease: N76D + S87R + G118R + S128L + P129Q + S130A.”

This is an extremely narrow example. Tens or even hundreds of combinations are often exemplified and/or claimed, and often the individual components of the combinations are claimed on their own as well. Because of this, a best practice for this type of search is usually to search for any combination of the elements, i.e., to report hits containing just N76D, for example, or N76D and S87R, etc.
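To give a sense of scale, the combinations of the six mutations quoted above can be enumerated with the standard library. The function name `all_combinations` is illustrative; the mutation list is taken from the claim language quoted in this article.

```python
from itertools import combinations

# The six substitutions listed in the quoted claim.
CLAIMED = ["N76D", "S87R", "G118R", "S128L", "P129Q", "S130A"]


def all_combinations(mutations):
    """Yield every non-empty subset of `mutations`, smallest subsets first."""
    for size in range(1, len(mutations) + 1):
        for combo in combinations(mutations, size):
            yield combo


combos = list(all_combinations(CLAIMED))
print(len(combos))  # 63, i.e. 2**6 - 1
print(" + ".join(combos[0]))
print(" + ".join(combos[-1]))
```

Six positions already give 63 combinations to cover, and the count doubles with every additional position.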

In order to do this, an algorithm such as MOTIF (Aptean GenomeQuest) or STN pattern matching is required. Because these algorithms require 100% identity to all remaining regions, even a single mismatch will cause the hit to be lost. One way around the 100% identity requirement is to wild-card intervening regions between variations, so mismatches will be ignored. This is effective, but it isn’t a guarantee, because it’s necessary to retain some flanking regions in order to have a reasonable degree of specificity.
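The wild-carding idea can be illustrated with an ordinary regular expression. This is not MOTIF or STN syntax; Python's `re` module is swapped in purely to show the principle, and the function name, toy sequence and `flank` parameter are assumptions.

```python
import re


def build_wildcard_pattern(parent, substitutions, flank=2):
    """Build a pattern that keeps `flank` residues either side of each variant
    position, writes the variant residue at that position, and wild-cards the
    rest. `substitutions` maps 1-based position -> variant residue.
    """
    keep = set()
    for position in substitutions:
        for p in range(position - flank, position + flank + 1):
            if 1 <= p <= len(parent):
                keep.add(p)
    parts = []
    for p, residue in enumerate(parent, start=1):
        if p in substitutions:
            parts.append(substitutions[p])
        elif p in keep:
            parts.append(residue)
        else:
            parts.append(".")  # any single residue
    return "".join(parts)


# Toy parent sequence (hypothetical) with substitutions N4D and W12A.
parent = "MKTNAYLGSPQWERTV"
pattern = build_wildcard_pattern(parent, {4: "D", 12: "A"}, flank=1)
print(pattern)

mismatch_between = "MKTDAYLCSPQAERTV"  # G8C falls in a wild-carded region
mismatch_in_flank = "MKTDGYLGSPQAERTV"  # A5G falls in a retained flank
print(bool(re.fullmatch(pattern, mismatch_between)))
print(bool(re.fullmatch(pattern, mismatch_in_flank)))
```

The second subject is lost even though it carries both substitutions, which is the trade-off described above: wider flanks give more specificity but more chances for a single mismatch to drop a hit. Note that `.` stands for exactly one residue, so this sketch does not tolerate insertions or deletions.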

In addition, since each position may be either variation or wild type, this type of query will also retrieve wild type sequence, resulting in an even higher level of noise in results. It’s not uncommon to find ten true hits against a background of thousands of wild type sequences.
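The noise problem is easy to reproduce in miniature. In the sketch below, each position of interest accepts either the wild-type or the variant residue, so the unmodified parent matches too; a second pass then keeps only hits carrying at least one variant. Regular expressions stand in for the commercial pattern-matching algorithms, and all names and sequences are hypothetical.

```python
import re

PARENT = "MKTNAYLGSPQWERTV"       # toy parent sequence
SUBSTITUTIONS = {4: "D", 12: "A"}  # N4D and W12A, 1-based positions


def either_or_pattern(parent, substitutions):
    """Allow wild-type OR variant at each listed position, exact elsewhere."""
    parts = []
    for p, residue in enumerate(parent, start=1):
        if p in substitutions:
            parts.append(f"[{residue}{substitutions[p]}]")
        else:
            parts.append(residue)
    return "".join(parts)


def variants_present(subject, substitutions):
    """Return the listed positions at which `subject` carries the variant."""
    return [
        p for p, variant in sorted(substitutions.items())
        if len(subject) >= p and subject[p - 1] == variant
    ]


pattern = re.compile(either_or_pattern(PARENT, SUBSTITUTIONS))
hits = {
    "wild_type": PARENT,
    "single": "MKTDAYLGSPQWERTV",
    "double": "MKTDAYLGSPQAERTV",
}
matched = {name for name, seq in hits.items() if pattern.fullmatch(seq)}
true_hits = {name for name in matched if variants_present(hits[name], SUBSTITUTIONS)}
print(sorted(matched))
print(sorted(true_hits))
```

All three subjects match the pattern, but only two survive the second pass. In a real database the wild-type sequence is typically present many times over, which is where the thousands of noise hits come from.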

A typical search protocol involves tediously hand-creating multiple query sequences to cover all possibilities, performing the search, and then performing different Boolean operations or being clever with Aptean GenomeQuest grouping (group by subject, group size = 1, using a pair of wild type and manually-created sequences in variant notation) in order to remove the unwanted wild type results. This frequently takes hours or even days.

Solutions

What if someone is interested in knowing all variations for a given position, or group of positions? MOTIF can be used with X in the position(s) of interest, but the results can’t be narrowed interactively to drill down into result combinations.
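The question of "all variations at a given position" is essentially a tally. As a rough sketch, assuming the hit sequences have been exported and share the query's numbering without gaps, the residues seen at one or more positions can be counted as follows. The function name and sequences are hypothetical.

```python
from collections import Counter


def tally_positions(subjects, positions):
    """Count the residue combinations seen at 1-based `positions`.

    Subjects too short to cover every requested position are skipped.
    """
    counts = Counter()
    for subject in subjects:
        if all(p <= len(subject) for p in positions):
            counts[tuple(subject[p - 1] for p in positions)] += 1
    return counts


# Toy hit set; positions 4 and 12 are the positions of interest.
subjects = [
    "MKTNAYLGSPQWERTV",  # wild type at both positions
    "MKTDAYLGSPQWERTV",
    "MKTDAYLGSPQAERTV",
    "MKTEAYLGSPQWERTV",
    "MKTDAYLGSPQWERTV",
    "MKTD",              # too short to cover position 12
]
for residues, count in tally_positions(subjects, (4, 12)).most_common():
    print("/".join(residues), count)
```

A static tally like this answers the question once; it does not offer the interactive drill-down into combinations that the paragraph above identifies as missing.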

Aptean GenomeQuest recognized the need for improved variant searching tools and methods and has developed a new search product, the Sequence Variation Discovery Module. It’s an optional component that allows searchers to rapidly screen a large set of results for specific variations, without requiring that the variant sequences be created in advance.

Because it uses interactive filtering, it’s possible to view all results with one group of variations, and then change any position(s) to look at different combinations. The percent identity cutoff can be set interactively as well, so alignments with unanticipated variations can be found.

There is also a set of broad filters, so a group of sequences can be screened for all variations, either over the entire alignment or in just a specific subsequence or group of subsequences.

Viewing results and highlighting specific regions of variation, especially when doing a sequence landscape search, is also very time-consuming. Aptean GenomeQuest’s Sequence Variation Discovery Module provides user-configurable graphics to show just the desired variations. It also creates an exportable table of all variations found in a given query/subject pair, for either pasting into a text-based query or for easier evaluation.
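To show what a table of variations for one query/subject pair might contain, here is a small sketch that walks a gapped pairwise alignment and lists every difference, optionally restricted to a region of the query. It is not the module's implementation or output format; the function name and the labels used for insertions and deletions are assumptions.

```python
def list_variations(aligned_query, aligned_subject, query_start=1, region=None):
    """List differences between two aligned, gapped strings.

    Returns (query_position, kind, label) tuples. `region` is an optional
    (start, end) pair of 1-based, inclusive query coordinates. Insertions
    are reported at the query position they follow.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have the same length")
    rows = []
    position = query_start - 1
    for q, s in zip(aligned_query, aligned_subject):
        if q != "-":
            position += 1
        if q == s:
            continue
        if region is not None and not region[0] <= position <= region[1]:
            continue
        if q == "-":
            rows.append((position, "insertion", f"{position}ins{s}"))
        elif s == "-":
            rows.append((position, "deletion", f"{q}{position}del"))
        else:
            rows.append((position, "substitution", f"{q}{position}{s}"))
    return rows


# Toy alignment containing one substitution, one deletion and one insertion.
query = "MKTNAYLG-SPQ"
subject = "MKTDAY-GASPQ"
for row in list_variations(query, subject):
    print("\t".join(str(field) for field in row))
```

The tab-separated rows can be pasted into a spreadsheet, and the substitution labels reuse the notation that claims such as N76D are written in.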

The different views available allow optional display of the alignment graphics immediately adjacent to full text of claims, so a claim can be evaluated while viewing the search results, saving even more time.

The world of sequence variation IP is always going to be complicated and confusing, and fraught with legal risk. The lack of uniformity in crafting claims and citing variants makes text searching very difficult. The large number of possibilities and combinations results in only a minuscule percentage of sequence variants being included in sequence listings, meaning very few end up indexed in sequence databases.

However, for the variants that are in listings, Aptean GenomeQuest’s new Sequence Variation Discovery Module will make finding that variable needle in a haystack much less time-consuming and painful.

With its extensive data coverage (over 500 million sequences), powerful search tools and user-friendly functionality, Aptean GenomeQuest is the obvious choice for searching the entire sequence domain, both patent and non-patent.

Avoid the pitfalls of using free solutions for IP sequence searching. Download our RFP template or start a free trial today!

Author

Aptean Staff Writer
