Projects

Projects in computational chemistry

As a computational chemist, I develop bespoke computational methods tailored to the unique challenges of each drug discovery program. This has given me the opportunity to work with a range of approaches, integrating AI, physics-based, and QM-based methods to enhance decision-making in structure-based drug discovery. Here are some of the projects I’ve worked on, why I find them interesting, and what I enjoy most about this role.

High-quality dataset curation for biomedical NLP and training the open-source multilingual large language model BLOOM

I was a contributor to BigScience, an open-source science initiative whose collaboration resulted in training a very large multilingual neural network language model and gathering a very large multilingual text dataset on the 28-petaflop Jean Zay (IDRIS) supercomputer located near Paris, France (source: https://bigscience.huggingface.co/). I was involved in both projects: BigBio, for curating biomedical datasets, and BLOOM, for training a multilingual open-source LLM.

BigBio

NLP models have great potential for automating complex healthcare tasks such as clinical decision support, information extraction, and research summarization. To unlock that potential, however, high-quality, domain-specific training datasets are essential. As a core contributor to BigBio, I helped build a community library of over 126 biomedical NLP datasets spanning 13 task categories. This work addresses the underrepresentation of specialized biomedical data and enables reproducible meta-dataset curation.

link to code

BLOOM

Our contributions to BigBio were instrumental in the training of BLOOM, the largest open-source multilingual language model at the time of its release. BLOOM performs competitively across diverse benchmarks and represents a significant step forward for the field: its codebase was publicly released, setting a new precedent in the development of large language models.

1D convolutional neural nets for inferring distance from time-series records of Bluetooth signals

During the COVID-19 pandemic, I contributed to several projects and tech consortiums aimed at addressing critical public health challenges. One of my main focus areas was contact tracing. Because COVID-19 spreads through airborne particles and droplets, tracing the movements of exposed individuals became crucial for controlling transmission. Bluetooth-enabled phones were identified as a promising tool for this, since the signal strength between two devices typically correlates with the distance between them. However, that relationship is noisy and is influenced by factors such as the angle between the devices and manufacturing differences between phones. To address these challenges, I collaborated with researchers at MIT PathCheck to develop an AI model that predicts close proximity between devices from Bluetooth signals while accounting for these confounding factors. Our model placed third in a competition hosted by the National Institute of Standards and Technology (NIST), and we coauthored a paper detailing the model, which was accepted at the Machine Learning for Mobile Health Workshop at NeurIPS. The model was implemented in the PathCheck app, which has been adopted by governments in Minnesota, Hawaii, Guam, Puerto Rico, Teton County (Wyoming), and Cyprus. link to code
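As an illustration of the general idea only, the sketch below runs the forward pass of a tiny 1D convolutional network over a short window of Bluetooth RSSI readings. It is not the model from the paper: the function names, the RSSI scaling, the sample readings, and all weights are hypothetical and hand-picked rather than trained. A real model would learn its weights from labeled data and take additional inputs for the confounding factors.

```python
import math


def conv1d(signal, kernel, bias=0.0):
    """Valid (no padding) 1D cross-correlation, the core operation of a 1D conv layer."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width)) + bias
        for i in range(len(signal) - width + 1)
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def sigmoid(x):
    """Logistic function, maps a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def proximity_score(rssi_dbm, kernel, conv_bias, out_weight, out_bias):
    """Score in (0, 1); higher means the two devices are more likely close.

    Pipeline: scale RSSI -> 1D convolution -> ReLU -> global average pooling
    -> single logistic output unit.
    """
    # Hypothetical scaling: -100 dBm maps to 0.0 and -40 dBm maps to 1.0.
    scaled = [(r + 100.0) / 60.0 for r in rssi_dbm]
    features = relu(conv1d(scaled, kernel, conv_bias))
    pooled = sum(features) / len(features)
    return sigmoid(out_weight * pooled + out_bias)


# Hand-picked (untrained) weights, for illustration only.
KERNEL = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]  # acts as a smoothing filter
CONV_BIAS = 0.0
OUT_WEIGHT = 8.0
OUT_BIAS = -4.0

# Made-up RSSI windows in dBm: strong signal (near) and weak signal (far).
NEAR = [-52, -48, -55, -50, -47, -53]
FAR = [-88, -82, -90, -85, -86, -84]

near_score = proximity_score(NEAR, KERNEL, CONV_BIAS, OUT_WEIGHT, OUT_BIAS)
far_score = proximity_score(FAR, KERNEL, CONV_BIAS, OUT_WEIGHT, OUT_BIAS)
print(f"near: {near_score:.2f}, far: {far_score:.2f}")
```

With these placeholder weights the strong-signal window scores above 0.5 and the weak-signal window below it. The convolution averages neighboring readings, which is one simple way a 1D filter can damp the sample-to-sample noise described above.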

Simulating antigen dosage and key immune cell kinetics using a modified explicit tau-leap Gillespie simulation

Throughout my Ph.D., I studied how vaccine scheduling modulates the antibody titer response. Working with experimental collaborators, we identified a new cellular mechanism that can potentially amplify antibody titers up to 20-fold. Based on this mechanistic understanding, I built a stochastic model that simulates the body’s antibody response to a vaccine: link to code
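For readers unfamiliar with tau-leaping, the sketch below shows a basic explicit tau-leap loop for a toy antigen and B-cell system. It is not the model linked above: the three reactions, the rate constants, the dose schedule, and the function names are hypothetical placeholders. Negative counts are avoided here by simple clamping, which is a simplification and not the critical-reaction handling used in modified tau-leap schemes.

```python
import math
import random


def poisson(lam, rng):
    """Sample a Poisson variate with Knuth's method, in chunks of mean <= 30.

    Chunking relies on the fact that a sum of independent Poisson variates
    is Poisson, and keeps exp(-chunk) well away from floating-point underflow.
    """
    count = 0
    while lam > 0.0:
        chunk = min(lam, 30.0)
        threshold = math.exp(-chunk)
        product = rng.random()
        while product > threshold:
            count += 1
            product *= rng.random()
        lam -= chunk
    return count


def tau_leap(antigen, b_cells, n_steps, tau, rates, doses=None, seed=0):
    """Explicit tau-leap simulation of a toy antigen / B-cell system.

    Hypothetical reactions:
      antigen -> (nothing)      propensity = antigen_decay * antigen
      B -> 2 B                  propensity = b_prolif * B * antigen / (antigen + half_sat)
      B -> (nothing)            propensity = b_death * B
    `doses` maps a step index to an antigen amount added at the start of that step.
    """
    rng = random.Random(seed)
    doses = doses or {}
    trajectory = [(0.0, antigen, b_cells)]
    for step in range(n_steps):
        antigen += doses.get(step, 0)
        # Propensities evaluated at the state at the start of the leap.
        a_decay = rates["antigen_decay"] * antigen
        a_prolif = rates["b_prolif"] * b_cells * antigen / (antigen + rates["half_sat"])
        a_death = rates["b_death"] * b_cells
        # Firings per reaction over the leap are Poisson with mean propensity * tau.
        # Clamping keeps populations non-negative (a simplification).
        n_decay = min(poisson(a_decay * tau, rng), antigen)
        n_prolif = poisson(a_prolif * tau, rng)
        n_death = min(poisson(a_death * tau, rng), b_cells)
        antigen -= n_decay
        b_cells += n_prolif - n_death
        trajectory.append(((step + 1) * tau, antigen, b_cells))
    return trajectory


# Placeholder parameters (per day) and a two-dose schedule: day 0 and day 7.
RATES = {"antigen_decay": 0.5, "b_prolif": 1.0, "b_death": 0.2, "half_sat": 100.0}
TAU = 0.05
DOSES = {0: 1000, 140: 1000}

trajectory = tau_leap(0, 10, n_steps=280, tau=TAU, rates=RATES, doses=DOSES, seed=1)
final_time, final_antigen, final_b_cells = trajectory[-1]
print(f"day {final_time:.1f}: antigen={final_antigen}, B cells={final_b_cells}")
```

Changing the keys of `DOSES` shifts when each dose is given, so the same loop can be rerun to compare schedules. Because each run is stochastic, a comparison would average over many seeds.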

ChillPill

A web app that recommends antidepressants based on patients’ symptoms and the most common side effects of each medication: link to code