
Utilizing cross-product prior knowledge to rapidly de-risk chemical liabilities in therapeutic antibody candidates


There is considerable pressure in the pharmaceutical industry to advance better molecules faster. One pervasive concern for protein-based therapeutics is the presence of potential chemical liabilities. We have developed a simple methodology for rapidly de-risking specific chemical concerns in antibody-based molecules using prior knowledge of each individual liability at a specific position in the molecule’s sequence. Our methodology hinges on the development of sequence-aligned chemical liability databases of molecules from different stages of commercialization and on sequence-aligned experimental data from prior molecules that have been developed at Amgen. This approach goes beyond the standard practice of simply flagging all instances of each motif that fall in a CDR. Instead, we de-risk motifs that are common at a specific site in commercial mAb-based molecules (and therefore did not previously pose an insurmountable barrier to commercialization) and motifs at specific sites for which we have prior experimental data indicating acceptably low levels of modification. We have used this approach successfully to identify candidates in a discovery phase program with exclusively very low risk potential chemical liabilities. Identifying these candidates in the discovery phase allowed us to bypass protein engineering and accelerate the program’s timeline by 6 months.


Pharmaceutical candidates must be screened for many different potential liabilities that could adversely affect their manufacturability, storage, and function. For protein-based therapeutics, including monoclonal antibodies (mAbs), many of these potential liabilities, such as viscosity, aggregation, pharmacokinetics, and process yield, can be measured as attributes of the molecule as a whole. Chemical liabilities, on the other hand, are associated with a specific amino acid residue within the sequence of the molecule. Manifestation of any of these attributes at an inappropriate level could be catastrophic to drug development, so considerable time and money are invested to predict and prevent their appearance. Predicting these attributes as early as possible allows molecules that are likely to harbor unfavorable characteristics to be removed from the screening process early on or engineered to remediate any shortcomings. The majority of the in silico predictive models created to date have been based on first principles (Agrawal et al. 2016, 2018; Chennamsetty et al. 2015; Sáenz-Suárez et al. 2016; Sharma et al. 2014), as it is difficult to amass datasets that are both large enough and diverse enough to train effective machine learning models for biologics, save for a few notable exceptions (Jia and Sun 2017; Sankar et al. 2018; Yang et al. 2017). These models have seen different levels of success, but none has been widely implemented in the industry. A handful of themes are becoming generally accepted: negative charge patches in the variable region of a mAb are predictive of high viscosity (Agrawal et al. 2016; Sharma et al. 2014; Chaudhri et al. 2013; Yadav et al. 2012; Li et al. 2014; Buck et al. 2015), positive charge patches in the variable region correlate with poor pharmacokinetics (Schoch et al. 2015; Boswell et al. 2010; Igawa et al. 2010; Khawli et al. 2004), and solvent accessibility correlates with tryptophan oxidation (Sharma et al. 2014; Barnett et al. 2019; Folzer et al. 2015; Ehrenshaft et al. 2015) and methionine oxidation (Agrawal et al. 2018; Chennamsetty et al. 2015; Sankar et al. 2018; Yang et al. 2017; Barnett et al. 2019). However, each research group tends to develop its own model based on its own dataset, and there is seldom any agreement on a preferred method. Here, we detail a different take on attribute prediction, focusing specifically on de-risking potential chemical liabilities by directly comparing to cross-product sequence and experiment databases of internal and external molecules. Our approach hinges on the growing availability of data from prior antibody-based molecules across all stages of the pipeline, from discovery to late phase, and is sensitive to the fact that in most cases we still lack the requisite size and breadth of data required to yield functional machine learning models.

The case for utilizing cross-product data

The pharmaceutical industry is growing, with a total global market of USD 1.23 trillion in 2020 projected to increase to USD 1.7 trillion by 2025. Growth in the market overall leads to increased competition for specific targets and increased pressure to achieve first-to-market status and the associated 6% average increase in market share (Pharma’s first-to-market advantage | McKinsey 2021). Given that most potential drug targets are simultaneously researched by multiple companies, the pressure to be first to market has led to an increasing focus on speed of development. Still, getting to market first with an inferior product is not acceptable, as a better successor molecule can rapidly render those first-to-market benefits obsolete. Molecules in this intensely competitive environment cannot afford to be burdened with liabilities that will affect manufacturing, distribution, or patient experience, but they still need to get to market faster than the competition.

In order to move molecules through the pipeline more rapidly, one should be able to effectively predict any potential issues a candidate molecule could face and either reject that molecule outright or rapidly engineer out those liabilities. To do this effectively without wasting time and resources, engineering efforts need to be pinpointed to liabilities that are most likely to be problematic to avoid over-engineering. Many of the standard prediction methods for chemical liabilities (e.g., scanning for deamidation by motif) have a tendency to over-predict, leading to extensive engineering efforts to remediate potential issues that would have never posed a problem if left unaltered. Existing predictive modeling efforts seek to reduce the burden of engineering by using computational models to more effectively predict the specific residues in a molecule’s sequence that are most likely to be problematic. In our method, we use a simple statistical approach to effectively utilize the ever-growing array of data that has been collected from previously studied molecules to rapidly and thoroughly de-risk individual sites in new molecules of interest.

Cross-product sequence analysis to rapidly de-risk potential chemical liabilities

Our approach hinges on the fact that antibody-based molecules, which make up the majority of therapeutically relevant proteins, are built onto a highly conserved immunoglobulin scaffold. This conservation makes it possible to rapidly gain structural information about a given site on a molecule simply by aligning its sequence to a reference framework. There are several prominent methodologies in the literature for numbering the residues of an antibody’s sequence based on alignment to a reference (Kabat (Te Wu and Kabat 1970), Chothia (Al-Lazikani et al. 1997), AHo (Annemarie and Andreas 2001), IMGT (Lefranc et al. 2005), etc.). For the purpose of aligning residue level information to structure, we prefer the AHo numbering system (Annemarie and Andreas 2001) because it more appropriately handles CDR loops of differing lengths (Fig. 1A, B); however, our approach is general enough to utilize alignment in any other scheme. Having numbered residues correspond to specific positions in the immunoglobulin structure allows you to make inferences about the environment that a given residue (or sequence motif) will experience. We can then use this as the basis for comparing a motif at a given position to both sequence prevalence data and experimental data from prior molecules that also harbored that same motif at the same aligned position.
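As a minimal sketch of how motif hits might be keyed to aligned positions, the Python below assumes a chain has already been numbered (for example in AHo) by some external tool and is supplied as a mapping from aligned position to residue. The motif list, the function name `find_hotspots`, and the input format are illustrative assumptions, not the tooling used in this work.

```python
from typing import Dict, List, Tuple

# Illustrative motif list only (an assumption); the article does not
# enumerate the motifs it screens for.
LIABILITY_MOTIFS: Tuple[str, ...] = ("NG", "NS", "NT", "DG", "M", "W")


def find_hotspots(numbered: Dict[int, str]) -> List[Tuple[int, str]]:
    """Return (aligned start position, motif) pairs for one numbered chain.

    `numbered` maps an aligned position to its residue. Alignment gaps are
    simply absent from the mapping, so motifs are matched over residues that
    are adjacent in the real sequence even if their numbers are not.
    """
    positions = sorted(numbered)
    residues = "".join(numbered[p] for p in positions)
    hits: List[Tuple[int, str]] = []
    for i, pos in enumerate(positions):
        for motif in LIABILITY_MOTIFS:
            # str.startswith with an offset checks the motif at index i
            if residues.startswith(motif, i):
                hits.append((pos, motif))
    return hits


# Toy fragment with made-up residues around aligned position 67
example = {66: "S", 67: "N", 68: "T", 70: "M"}
print(find_hotspots(example))
```

Keying each hit by its aligned start position rather than its linear index is what lets the same site be compared across molecules with CDR loops of different lengths.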

Fig. 1

Reference alignment of sequences. All sequences are aligned to the AHo system with other numbering systems shown above each indicated sequence. A Comparison of structures from commercial mAb-based molecules with Trp at AHo position 41 and crystal structures in the PDB. The Trp at position 41 in the AHo numbering system falls at 35 or 35A in the Kabat system (red highlighted column) even though this residue is structurally identical across the different mAb crystal structures (Trp41 shown as sticks and circled in red; PDB 4X7T (Jensen et al. 2015), green; PDB 5SX5 (Sickmier et al. 2016), grey; PDB 6B3S (Bagchi et al. 2018), blue). B Comparison of structures from commercial mAb-based molecules with Met at position 136 in AHo numbering and crystal structures in the PDB. The Met at AHo position 136 is numbered 100A, 100E, 100G, or 100M in the Kabat system across the different molecules (red highlighted column) even though crystal structures indicate the position of this Met is highly homologous across the structures (Met136 shown as sticks and circled in red; PDB 6AL4 (Teplyakov et al. 2018), light blue; PDB 5WUV (Lee et al. 2017), light green; PDB 6UMH (Garces et al. 2020), light cyan; PDB 3GIZ (Du et al. 2009), light gray)

By pre-aligning tables of predicted hotspots across a database of sequences to a reference numbering system, we can rapidly compare a specific site in a molecule of interest to other molecules that achieved varying levels of success and apply a kind of “survival of the fittest molecule” approach to determining if a particular site in a particular molecule is likely to be problematic (Fig. 2A). Sites that are extremely common in existing commercial mAb-based molecules are unlikely to cause an issue that would be disruptive to commercialization and can be effectively de-risked. Here, commercial mAb-based molecules means FDA- or EU-approved molecules, for which 94 sequences were publicly available at the time of publication. As an example, there is a potential methionine oxidation site located at position 41 in H_CDR1 that is present in >50% of commercial mAb-based molecules (Table 1). The high prevalence of methionine residues at this site indicates that it is generally not an impediment to commercialization. The exact frequency cutoff above which a site can effectively be de-risked should be defined by the risk tolerance acceptable to a given program.
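The prevalence comparison described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' software: hotspots are represented as (aligned position, motif) pairs, the reference set is a dictionary from molecule name to its hotspots, and both the function names and the cutoff value are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Hotspot = Tuple[int, str]  # (aligned start position, motif)


def motif_prevalence(reference: Dict[str, Set[Hotspot]]) -> Dict[Hotspot, float]:
    """Percentage of reference molecules carrying each (position, motif)."""
    n = len(reference)
    if n == 0:
        return {}
    counts = Counter(h for spots in reference.values() for h in spots)
    return {h: 100.0 * c / n for h, c in counts.items()}


def derisked_by_prevalence(
    query: List[Hotspot], prevalence: Dict[Hotspot, float], cutoff_pct: float
) -> List[Hotspot]:
    """Query hotspots whose prevalence meets a program-defined cutoff."""
    return [h for h in query if prevalence.get(h, 0.0) >= cutoff_pct]


# Toy reference set of four made-up molecules (not real sequence data)
reference = {
    "mol_a": {(41, "M"), (67, "NT")},
    "mol_b": {(41, "M")},
    "mol_c": {(41, "M")},
    "mol_d": set(),
}
prev = motif_prevalence(reference)
query = [(41, "M"), (67, "NT"), (136, "M")]
print(derisked_by_prevalence(query, prev, cutoff_pct=50.0))
```

The cutoff is left as an explicit argument because, as noted above, it depends on the risk tolerance of the program.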

Fig. 2

Cross-product analysis of CDR potential liabilities. A Cross-product motif prevalence showing the percentage of commercial mAbs that have the same motif at the same position as the query (mAb1) in the aligned sequence. Residue numbers correspond to the linear position in mAb1. B Box plots of experimental data spanning discovery to late phase from prior molecules under various stress conditions with the same motif at the same position in the aligned sequence as the query (mAb1) provide additional insight into the likelihood and potential magnitude of modifications in the query sequence

Table 1 Most common potential chemical liability motifs among variable regions of commercial mAbs

To supplement this purely sequence-based approach, we have applied a similar principle to the experimental chemical liability data from prior molecules. By templating our prior molecule experimental chemical liabilities data and aligning it to the AHo numbering system, we can rapidly query the available data across prior molecules that share any given hotspot with a molecule of interest. In this way, we can instantly examine both the average and the spread of the available data for a given motif at a given site across prior molecules and use this as an indicator of risk likelihood for the same motif at the same site in a molecule of interest (Fig. 2B). Specific sites that show minimal or no modification across all prior programs are unlikely to suddenly become problematic in a new program, and our confidence in this assertion increases as the number of data points increases or as variance in the data decreases.
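One way to picture this query is as a filter on aligned position and motif followed by simple summary statistics. The sketch below assumes a flat list of measurement records. The `Measurement` fields and the `summarize_site` helper are hypothetical names chosen for illustration, and the values are invented.

```python
from statistics import mean, stdev
from typing import Dict, List, NamedTuple, Optional


class Measurement(NamedTuple):
    molecule: str        # prior molecule identifier
    position: int        # aligned (e.g. AHo) start position of the motif
    motif: str           # sequence motif, e.g. "NT"
    condition: str       # free-text stress condition label
    pct_modified: float  # measured modification level, in percent


def summarize_site(
    records: List[Measurement], position: int, motif: str
) -> Optional[Dict[str, float]]:
    """Count, mean, max, and spread of prior data for one motif at one site."""
    values = [
        r.pct_modified
        for r in records
        if r.position == position and r.motif == motif
    ]
    if not values:
        return None  # no prior knowledge: the site cannot be de-risked this way
    return {
        "n": len(values),
        "mean": mean(values),
        "max": max(values),
        # sample standard deviation needs at least two points
        "stdev": stdev(values) if len(values) > 1 else 0.0,
    }


# Invented records for illustration only
records = [
    Measurement("prior_1", 67, "NT", "40C_4wk", 0.5),
    Measurement("prior_2", 67, "NT", "40C_4wk", 1.5),
    Measurement("prior_2", 41, "M", "light", 3.0),
]
print(summarize_site(records, 67, "NT"))
```

Reporting the count and spread alongside the mean reflects the point made above: confidence grows with more data points and with lower variance.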

Currently, we leverage this cross-product sequence analysis method on applicable discovery and development stage programs to accelerate molecules with acceptably low risk profiles or to target protein engineering resources towards the specific sites that we cannot effectively de-risk. In a noteworthy example of this method’s success, we utilized an early version of this methodology to de-risk key theoretical sites in a discovery stage mAb program (Fig. 3). By using prior knowledge from cross-product analysis, we were able to bypass the protein engineering stage and move molecules from a discovery campaign directly to development without engineering to remediate potential chemical liability sites. In this case, an Asn residue in H_CDR2 of multiple lead candidates was identified as a potential deamidation risk. This NT motif beginning at AHo position 67 is present in 7.4% of commercial molecules, indicating that there were multiple examples of mAb-based molecules that were able to successfully reach commercialization with this site at this position. Prior internal molecules with the same motif at the same structural position were also identified, and these molecules had previously been shown to have extremely low levels of modification at this site during development (available prior molecule experimental data in our dataset showed deamidation levels were below the limit of detection for available molecules under all recorded stress conditions). By leveraging this prior knowledge in the decision to move these discovery phase molecules forward without protein engineering, the program’s timeline was accelerated by approximately 6 months. Different companies will likely have different existing internal metrics to define what an acceptable level of modification is, and this will also likely be variable for different programs with diverse target candidate profiles. 
We make no recommendations for what will be necessary to meet the needs of any specific program, but as a general rule, we often look for candidates in which all potential chemical liability sites have modification levels below 2–5% under mild stress conditions in prior molecule data (temperatures between 4 and 40 °C, time points out to a maximum of 4 weeks, pH from 4.5 to 8.0, cool white light stress up to 200 klux*hr). Because different programs will have different requirements, additional data under harsher stress conditions (longer time points, added chemical oxidants, etc.) is available in the dataset on demand.
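The general rule above could be encoded roughly as follows. This is a sketch only: the record fields, the function names, and the choice to treat a site with no mild-stress data as not de-risked are assumptions, and the threshold is passed in explicitly because the text gives a range (2–5%) rather than a single value.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class StressResult:
    temp_c: float         # incubation temperature, degrees C
    weeks: float          # duration of the stress
    ph: float             # formulation pH
    light_klux_hr: float  # cumulative cool white light exposure
    pct_modified: float   # measured modification at the site, percent


def is_mild(r: StressResult) -> bool:
    """True if the condition falls inside the mild ranges quoted in the text."""
    return (
        4.0 <= r.temp_c <= 40.0
        and r.weeks <= 4.0
        and 4.5 <= r.ph <= 8.0
        and r.light_klux_hr <= 200.0
    )


def passes_rule_of_thumb(results: Iterable[StressResult], max_pct: float) -> bool:
    """All mild-stress results for a site must fall below `max_pct`."""
    mild = [r for r in results if is_mild(r)]
    if not mild:
        return False  # assumption: no mild-stress data means not de-risked
    return all(r.pct_modified < max_pct for r in mild)


# Invented results for one site
site_results = [
    StressResult(temp_c=40, weeks=4, ph=5.2, light_klux_hr=0, pct_modified=1.1),
    StressResult(temp_c=25, weeks=2, ph=7.0, light_klux_hr=200, pct_modified=0.4),
    # harsher condition (8 weeks): ignored by the mild-stress rule
    StressResult(temp_c=40, weeks=8, ph=5.2, light_klux_hr=0, pct_modified=6.0),
]
print(passes_rule_of_thumb(site_results, max_pct=2.0))
```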

Fig. 3

Potential deamidation site de-risked using the cross-product methodology. NT at AHo 67 in H_CDR2 was identified as a potential risk in two candidate molecules (gray, right). Homology models of variable domains from molecules harboring the NT site of interest are shown as cartoons with the NT site shown as sticks. The cross-product analysis identified 7 molecules that had already been successfully commercialized that shared the same motif at the same site (green, left). Additional molecules for which we had readily available internal data showed no detectable modification at this site under tested stress conditions (blue, middle)

Key requirements, learnings, and future perspectives

This data-driven cross-product approach to de-risking attributes provides incredible value for rapidly assessing the likelihood that a potential chemical liability will impact a candidate molecule. In order to implement this approach, an institution will first need to establish aligned sequence databases of commercial molecules and clinical candidates and pre-screen those sequences for hotspots. For our sequence sets, we focused on three different levels of success to get some granularity into the frequency of hotspots at specific sites in molecules moving through the drug development pipeline: molecules that progressed to process development, molecules selected for first-in-human (FIH) development, and molecules that achieved marketing authorization approval. Second, the institution will need to develop a suitable data template to maintain sequence-aligned experimental chemical liabilities data and, ideally, populate this database both proactively and retroactively through all stages of development from discovery to late stage. We are actively templating data from our historical programs, with data from hundreds of unique experiments available in our dataset at the time of this publication. While we have certainly seen increasing value from our growing dataset, we also found considerable value in early iterations of this endeavor when data on only a limited number of molecules was available, making this approach easily within reach even for many smaller organizations in the industry. Finally, to make relevant subsets of the data rapidly available for de-risking the specific hotspots in a molecule of interest, software needs to be developed to rapidly access the relevant data from the datasets and present it in a concise manner. 
The most challenging part of this process for us has been developing a template for our reference-aligned chemical liabilities data and populating it with data from past programs. Most of this data has historically been stored in PowerPoint and PDF files, and significant manual effort has been required to find, interpret, structure, and clean it so that it can be compared across programs. Forward-thinking companies will do well to proactively implement strategies to ensure that future data is automatically stored in a data-science-friendly template, making it easily accessible for this and any future use cases.
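To make the idea of a data template concrete, the sketch below shows one possible flat record layout with a CSV round trip. The column names and the `LiabilityRecord` class are hypothetical; the article does not describe the actual template, so this should be read as one plausible shape rather than the authors' schema.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class LiabilityRecord:
    molecule_id: str
    chain: str             # e.g. "H" or "L"
    aho_position: int      # aligned start position of the motif
    motif: str
    modification: str      # e.g. "deamidation", "oxidation"
    stress_condition: str
    pct_modified: float


COLUMNS = [f.name for f in fields(LiabilityRecord)]


def to_csv(records: List[LiabilityRecord]) -> str:
    """Serialize records to CSV text with a fixed header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    for r in records:
        writer.writerow(asdict(r))
    return buf.getvalue()


def from_csv(text: str) -> List[LiabilityRecord]:
    """Parse CSV text back into typed records."""
    out = []
    for row in csv.DictReader(io.StringIO(text)):
        row["aho_position"] = int(row["aho_position"])
        row["pct_modified"] = float(row["pct_modified"])
        out.append(LiabilityRecord(**row))
    return out


# Invented example record
recs = [LiabilityRecord("prior_1", "H", 67, "NT", "deamidation", "40C_4wk", 0.3)]
assert from_csv(to_csv(recs)) == recs
```

Storing the aligned position as its own typed column, rather than embedding it in slide text, is what makes cross-program queries possible without manual cleanup.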

Limitations and caveats

The approach described herein is an extremely rapid data-driven method for de-risking potential chemical liability sites in candidate molecules. While there are certainly many benefits to directly querying and assessing sequence and analytical data from prior molecules, it is also important to keep in mind the limitations of this approach. Perhaps the most important limitation is that, in its present form, the approach is principally of value for antibody-based molecules. While antibody-based molecules represent one of the largest and fastest-growing classes of pharmaceuticals, they are certainly not the only class of biologic therapeutics, and this methodology as described would not be translatable to other molecule types. Additionally, this approach centers on the assumption that the modification propensity of a specific potential chemical liability site can be predicted from the combination of its sequence motif and the position of that motif within a sequence alignment. This assumes that chemical liability motifs that align together via one of the major antibody sequence alignment/numbering systems (again, our preference is for AHo) will end up in a similar structural position with a similar local environment. While the fact that all antibody domains reliably adopt immunoglobulin folds, together with the preponderance of available structures, indicates that there is a tremendous amount of structural similarity in these domains, individual chemical liability motifs on individual molecules could certainly fall outside of these assumptions and render the prior molecule data less relevant. Changes in solvent exposure, adjacent charged or hydrophobic patches, or regional flexibility could all potentially affect the modification propensity of a specific residue. 
While some of this variability in local structural environments will be reflected in the spread of prior analytical data, it is of course still possible for a new molecule to be a structural outlier and subsequently behave outside of what would be expected based on prior data. It is thus important to use this sort of data-driven approach to guide the de-risking process in concert with expert oversight in a phase-appropriate manner. As the number of candidates is reduced towards the selection of a final first-in-human trial candidate, additional scrutiny of candidates for structural relatedness to molecules in the prior knowledge dataset is advisable.

Conclusions and future directions

We have developed a straightforward approach to utilizing prior knowledge to de-risk potential chemical liabilities. The method does not require any type of sophisticated modeling and is therefore not burdened by the need for extremely large datasets or continual retraining of the model. Instead, our method relies on the fact that all immunoglobulin-based molecules will adopt the same fold and that residues and motifs can be mapped to specific positions on the 3-dimensional structure simply by aligning the sequence to a reference numbering system. We can then assume that information we gather about a motif at a particular reference position will be more relevant to that motif at the same reference position in future molecules because of the high degree of conservation inherent in the immunoglobulin fold. This approach will of course lose some predictive value in regions like H-CDR3 where loop lengths and conformations can be quite variable, but it does not completely abrogate all predictive value here. By using a numbering system like that proposed by Honegger et al. (Annemarie and Andreas 2001), it is still possible to draw parallels using this relative spatial positioning even in the more variable CDR loops. This information will be more relevant among CDR loops of similar length and sequence, and data from motifs falling at the same site in CDR loops that are more closely related to the sequence of interest should be viewed with additional weight. Even without narrowing the dataset to the most similar CDRs, potential predictive power can still be rapidly assessed by simply looking at the variance in the cross-product experimental data at a particular site, presuming it is common enough to have yielded multiple data-points.

This proposed methodology for de-risking attributes is intended to bridge the gap between the current paradigm of gross over-prediction of potential liabilities and a future state where there is enough data available to support high-accuracy predictive modeling. The main requirement for implementing this methodology is the creation of templated cross-product datasets, which will become more valuable as they grow, not only within the predictive framework described here but also for training machine learning models in the future.

Availability of data and materials

Sequence data for most commercial molecules are available from

Experimental data from Amgen molecules is proprietary.




The authors would like to thank Camille Ybanez and Arun Raghu for assisting with data collection and Andrew Dykstra for helpful discussions.


All research was funded by Amgen Inc.

Author information

Authors and Affiliations



AWJ developed the methodology, aggregated sequence and experimental data, designed the data structure, prototyped software for rapidly analyzing query sequences and wrote the paper. WR designed and implemented the final software tool which allows multiple users to access the data and analyze query molecules. NJA assisted in conceptualizing the initial process, revising the paper, and provided support, guidance and oversight. The author(s) read and approved the final manuscript.

Corresponding author

Correspondence to Neeraj J. Agrawal.

Ethics declarations

Competing interests

The authors declare they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit


About this article


Cite this article

Jacobitz, A.W., Rodezno, W. & Agrawal, N.J. Utilizing cross-product prior knowledge to rapidly de-risk chemical liabilities in therapeutic antibody candidates. AAPS Open 8, 10 (2022).
