Chao Péter Yang

ML Research Assistant

Interpretable Machine Learning Lab.

Biography

Chao Peter Yang is a Master’s student in Interdisciplinary Data Science at Duke University and a Data Scientist Intern at Amazon Robotics. He graduated with Highest Honors in Data Science from the University of Michigan and previously worked as a Senior Data Science Analyst at Curinos, Inc. At Duke’s Interpretable Machine Learning Lab, advised by Prof. Cynthia Rudin and Dr. Stephen Ni-Hahn, he conducts research on symbolic music generation and graph neural networks, with an emphasis on interpretability and generative modeling. His broader interests include agentic AI systems, large models, and applied machine learning, with the goal of bridging theory and practice to create both scientific and real-world impact.

Interests
  • Symbolic Music Generation
  • Graph Neural Networks
  • Generative Modeling
  • Interpretable Machine Learning
  • Agentic AI Systems
  • Large Models
Education
  • M.S. Interdisciplinary Data Science, 2024 - 2026

    Duke University

  • Bachelor’s in Honors Data Science and Mathematical Sciences, 2018 - 2021

    University of Michigan - Ann Arbor

  • International Baccalaureate, 2018

    American International School of Budapest

Experience

 
 
 
 
 
Duke University – Interpretable Machine Learning Lab
Research Assistant
August 2024 – Present Durham, NC
  • Co-first authored ProGress—structured symbolic music generation via rule-guided Discrete Diffusion; accepted at NeurIPS 2025.
  • Researched DiffPool for Heterogeneous GNNs in musical analysis, reducing validation cross-entropy loss by 60%.
  • Developed PPO for Graph Neural Networks to enable RLHF for automated music analysis (under review at ACM CHI ’26).
  • Advised by Prof. Cynthia Rudin and Dr. Stephen Ni-Hahn.
 
 
 
 
 
Amazon.com, Inc. – Amazon Robotics
Data Scientist Intern
May 2025 – August 2025 Boston, MA
  • Researched and developed an advanced AI Agent for root-cause investigation integrating multiple data sources and MCP servers, reducing warehouse troubleshooting time from several days to 2.5 minutes (75% success rate).
  • Built a scalable agentic framework unifying internal agent development using LangGraph and Amazon Bedrock, streamlining multi-agent workflows.
  • Engineered a production-ready evaluation pipeline leveraging LLM-as-a-Judge techniques with Langfuse integration, enabling rapid benchmarking of agent performance.
 
 
 
 
 
Curinos
Senior Data Science Analyst
September 2023 – June 2024 Chicago
  • Researched and developed industry-level nonlinear Asset-Liability Management (ALM) models to predict acquisition and other portfolio balances for smaller banks and credit unions, improving out-of-sample acquisition prediction over legacy models.
  • Built automated ad hoc regression notebooks in PySpark for creating, testing, and validating models under different configurations, halving the time needed to build proof-of-concept models.
 
 
 
 
 
Curinos
Data Science Analyst II
April 2022 – September 2023 Chicago
  • Led the ML engineering team in migrating the legacy modeling pipeline from Cloudera to Databricks, coordinating with the DevSecOps and Application teams on testing, promotion, and release plans. The migration saved more than $100k annually in data platform expenses and cut pipeline processing time by 30% on average. (Publicly acknowledged at a company-wide town hall meeting.)
  • Tuned nonlinear hierarchical price elasticity models en masse for multiple major US banks, each with 10,000+ model segments, improving fit in terms of both AIC and R² with a significantly higher rate of convergence.
  • Installed and managed more than 10,000 price elasticity models per client bank to predict and optimize their deposit portfolios across a wide range of interest rates, with precise Model Risk Management documentation.
 
 
 
 
 
Curinos
Data Science Analyst
August 2021 – April 2022 Chicago
  • Converted a local, single-threaded legacy modeling pipeline to use SparkR and Cloudera, reducing model-fitting run time by up to 30 times.
  • Performed Exploratory Data Analysis (EDA) for client banks to tune and reconfigure their models and data segments, leading to better-performing price elasticity models in terms of MAPE, R², and rate of convergence.
  • Set up and automated custom SQL procedures to clean, wrangle, map, and transform clients’ data feeds for use in the modeling pipeline, partially eliminating the need for manual model data refreshes.
 
 
 
 
 
University of Michigan - Ann Arbor
Honors Student Researcher
May 2020 – April 2021 Ann Arbor
  • Researched a content-based music classification system with neural networks. Advised by Prof. Edward Ionides and Prof. Daniel Forger.
  • Developed new music classification methods using Musical Instrument Digital Interface (MIDI) data and LSTM neural networks, reaching 82% accuracy in music classification, more than a 10% improvement over conventional ML methods.
  • Improved models using supervised machine learning methods such as Support Vector Machines, decision trees, ensemble methods, and k-nearest neighbors.
  • Received the “Highest Honors” distinction in Data Science from UMich, one of only 2 awarded in 2021.

Certificates

Gain foundational knowledge, practical skills, and a functional understanding of how generative AI works
DataCamp
Introduction to Scala
Coursera
Deep Learning Specialization
Coursera
Share Data Through the Art of Visualization

Projects

SanAssist: LLM-Powered Healthcare Data Dashboard
A healthcare data dashboard integrated with a fine-tuned LLM-powered chatbot, enabling dynamic querying, interactive visualizations, and scalable cloud deployment.
Duke ProfMatch: AI-Powered Research Collaboration Tool
An AI-powered platform that helps Duke students find professors whose research aligns with their interests, using natural language queries and graph-based exploration.
Muscribe: Transcribing Music to Scores
A research project on developing a model that transcribes pieces of music into scores.
Californian House Price Prediction with Kaggle Data
Performed EDA and fit a simple XGBoost model to predict house prices in California in a single Jupyter notebook. This is a simple data project to showcase how I’d approach a relatively straightforward modeling task.
Squirrels API - Use Case Development and Documentation
Developing use cases and documentation for the Squirrels API.

Contact

Feel free to leave me a message and I’ll get back to you as soon as possible!