Notes on AI-driven research for systems (ADRS), from sigops.org. The core idea is to use AI tools to rewrite system code itself to improve performance, taking over work that systems researchers have traditionally done by hand. The table below lists examples of such systems in the wild.

The key to these successes lies in automating the core research loop: the iterative cycle of designing, implementing, and evaluating solutions. A researcher typically works through these stages (a toy example of the Evaluation Framework stage follows the list):

  • Problem Formulation: Define the problem to solve (e.g., improve system throughput).
  • Evaluation Framework: Decide which system to use for evaluating the solution, and instrument it. Alternatively, build a new system prototype or simulator.
  • Solution: Manually design a new algorithm or technique.
  • Evaluation: Implement the solution, run it, and compare it against baselines. Iterate by returning to the Solution stage until a satisfactory solution is found.
  • Paper Write-Up: Document the findings.
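
For concreteness, here is a toy, hypothetical illustration of the Evaluation Framework stage: a miniature cache simulator with the policy to evolve clearly marked. Nothing here comes from the post (the problem, names, and EVOLVE-BLOCK marker are all made up for illustration); it just shows the shape of the artifact that the later stages modify and measure.

```python
# Hypothetical stage-2 artifact: a tiny evaluation framework (simulator) for a
# cache-eviction problem, with the policy to evolve clearly marked.
from collections import OrderedDict

def evict(cache: "OrderedDict[str, int]") -> str:
    # EVOLVE-BLOCK: the policy the AI is allowed to rewrite (seed: evict the oldest key, i.e. FIFO).
    return next(iter(cache))

def run_trace(trace, capacity=4):
    """Replay a request trace and report the hit rate for the current policy."""
    cache, hits = OrderedDict(), 0
    for key in trace:
        if key in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.pop(evict(cache))
            cache[key] = 1
    return hits / len(trace)

print(run_trace(["a", "b", "a", "c", "d", "a", "e", "b"]))  # hit rate of the seed policy
```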

The experimentation loop spans the Solution and Evaluation stages. This is where researchers have traditionally spent most of their time, and it is what ADRS automates today. The ADRS architecture consists of five components (a sketch of how they fit together follows the list):

  • Prompt Generator: creates prompts for the LLM. The initial prompt includes a description of the problem and the code of the evaluation framework (e.g., simulator) indicating the code segments that implement the algorithm or the policy to evolve.
  • Solution Generator: proposes a new solution by using LLMs to modify the salient parts of the evaluation framework.
  • Evaluator: evaluates the proposed solutions, typically by running the evaluation framework on a given set of workloads (e.g., traces, benchmarks).
  • Storage: stores all previously generated solutions and their results.
  • Solution Selector: chooses a subset of solutions to refine future prompts.
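
As a rough illustration, here is a minimal sketch of how these five components might fit together. Everything here (class and function names, the scoring convention, the parent-selection heuristic) is a hypothetical reconstruction rather than code from the post; `llm` and `evaluator` are assumed to be user-supplied callables.

```python
import random

class ADRSLoop:
    """Minimal sketch of the ADRS loop; all names here are illustrative."""

    def __init__(self, problem_desc, framework_code, llm, evaluator, num_iters=100):
        self.problem_desc = problem_desc      # problem formulation
        self.framework_code = framework_code  # evaluation framework / simulator source
        self.llm = llm                        # callable: prompt -> candidate policy code
        self.evaluator = evaluator            # callable: candidate code -> numeric score
        self.num_iters = num_iters
        self.storage = []                     # Storage: all (solution, score) pairs so far

    def make_prompt(self, parents):
        # Prompt Generator: problem description + framework code + prior solutions to refine.
        prior = "\n\n".join(f"# score={score:.3f}\n{code}" for code, score in parents)
        return (
            f"{self.problem_desc}\n\n"
            f"Evaluation framework (modify only the marked policy section):\n"
            f"{self.framework_code}\n\n"
            f"Previous attempts and their scores:\n{prior}\n\n"
            f"Propose an improved policy."
        )

    def select_parents(self, k=3):
        # Solution Selector: a few best solutions plus one random pick to keep exploring.
        if not self.storage:
            return []
        best = sorted(self.storage, key=lambda s: s[1], reverse=True)[: k - 1]
        return best + [random.choice(self.storage)]

    def run(self):
        for _ in range(self.num_iters):
            prompt = self.make_prompt(self.select_parents())  # Prompt Generator
            candidate = self.llm(prompt)                      # Solution Generator
            score = self.evaluator(candidate)                 # Evaluator: run the workloads
            self.storage.append((candidate, score))           # Storage
        return max(self.storage, key=lambda s: s[1])          # best solution found
```

Real implementations add richer selection strategies (e.g., island-based evolutionary search) and richer evaluator feedback, but the loop structure is the same.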

(Caveat: the table below was transcribed by GPT from an image of the original; values may contain transcription errors.)

| Task & SOTA Publication | Objective | Result vs. SOTA / Baseline | Model | Time / Cost |
| --- | --- | --- | --- | --- |
| Telemetry Repair, Krentsel et al. [HotNets ’24] | Repair buggy network telemetry | +9% better counter repair rate, +30% higher confidence calibration | GPT-o3, Gemini-2.5-pro | 8h (300 iters), ≤ $10 |
| Adaptive Weight Compression [WIP] | Assign bitrate per column to minimize bits/elem. | Similar bits/elem, 14.2% worse PPL | Gemini-2.5-pro | 12h (200 iters), ≤ $20 |
| Cloudcast, Wooders et al. [NSDI ’24] | Optimize multi-cloud data transfer cost. | Matches SOTA cost. | Gemini-2.5-pro, GPT-o3 | 1h (100 iters), ≤ $5 |
| Expert Parallelism Load Balancer [WIP] | Balance expert-parallel load across GPUs. | Same imbalance, 5× faster runtime vs. internal implementation. | Gemini-2.5-Flash, Gemini-2.5-Flash-Lite | 5h (300 iters), ≤ $10 |
| Global Model Placement, Yu et al. (2025) [arXiv] | Optimize cost for model-to-GPU placement. | 18.5% cheaper than published solution. | Gemini-2.5-Flash | 40m (70 iters), ≤ $5 |
| LLM-SQL, Liu et al. [MLSys ’25] | Reorder tabular data to improve prefix hit rate. | Comparable hit rate, 3.9× faster runtime. | Gemini-2.5-pro, GPT-o3 | 1h (100 iters), ≤ $7 |
| Transaction Scheduling, Cheng et al. [VLDB ’24] | Minimize makespan in transaction scheduling. | 20% better than greedy (offline). | GPT-4.1, GPT-o3 | <2h (100 iters), ≤ $20 |
| Can’t Be Late, Wu et al. [NSDI ’24] | Schedule deadline-driven jobs on single-region spot instances. | Up to 16% (average 7%) higher cost savings vs. SOTA. | GPT-o3, Gemini-2.5-pro | 5h (400 iters), ≤ $20 |
| Can’t Be Late Multi-Region Extension [WIP] | Schedule deadline-driven jobs on multi-region spot instances. | 26% lower cost vs. single-region baseline. | GPT-o3, Gemini-2.5-pro | 1h (100 iters), ≤ $5 |
| Sparse Attention Design, Desai et al. [NeurIPS ’25] | Balance attention sparsity and accuracy. | 7% average error and density improvement vs. SOTA. | GPT-4.1, GPT-o3-mini | 4h (100 iters), ≤ $15 |
| Multi-Agent System Optimization, Hong et al. [ICLR ’24] | Improve multi-agent collaboration using MAST taxonomy. | 7% improvement on ProgramDev. | GPT-5, GPT-4o | <2h (100 iters), ≤ $15 |

Together, these components create an automated feedback loop that the researcher can guide. Here are some lessons from a researcher running ADRS experiments:

  1. Less is More (and More is Less)

Giving the AI less help often produced more creative and powerful solutions.

  • Start with weaker baselines: Giving the AI a highly optimized, state-of-the-art baseline can trap it in local minima and encourage it to make only small tweaks. A simpler starting point gives it more freedom to explore the solution space (see the seed-policy sketch after this list).
  • Provide fewer hints: Detailed instructions can lead to faster initial progress, but they also restrict the search space; fewer hints leave more room to find better solutions.
  • Restrict access to high-level APIs: Giving ADRS access to high-level library APIs speeds up evolution, but we find that restricting that access can lead to the discovery of better optimizations.
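
To make the "weaker baseline" point concrete, here is a hypothetical seed policy one might hand the loop as its starting point; the function name and signature are made up for illustration and are not from the post.

```python
# Hypothetical seed policy for an expert-parallelism load balancer.
# Deliberately naive (plain round-robin, no locality or cost-model hints),
# so the search is free to explore rather than fine-tune a SOTA heuristic.
def assign_expert(expert_id: int, num_gpus: int) -> int:
    """Map an expert to a GPU; this is the function the loop would evolve."""
    return expert_id % num_gpus
```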
  2. Your Solution is Only as Good as Your Evaluator

A flawed or incomplete evaluator is the primary cause of flawed solutions. Don’t let the AI exploit loopholes in the scoring function to maximize its reward.

  • Prevent overfitting with diverse workloads: Just as in traditional systems research, test on diverse workloads to ensure the solution generalizes. We also recommend held-out workloads for robust validation.
  • Prevent "reward hacking": Use a comprehensive test suite that covers correctness, especially edge cases. For example, we had a proposed load-balancing algorithm that seemed to perform well but was actually dropping some work to get better results (a guard against this is sketched after the list).
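
Here is a sketch of the kind of correctness guard that catches this sort of reward hacking. The evaluator and the task representation (tasks as `(id, cost)` tuples) are assumptions for illustration, not the actual evaluator from these experiments.

```python
def evaluate_load_balancer(candidate_policy, tasks, num_workers):
    """Score a candidate policy, rejecting it outright if it drops or duplicates work."""
    # candidate_policy returns one list of (task_id, cost) tuples per worker.
    assignments = candidate_policy(tasks, num_workers)

    # Guard against reward hacking: every task must appear exactly once in the output.
    assigned_ids = sorted(task_id for worker in assignments for task_id, _ in worker)
    if assigned_ids != sorted(task_id for task_id, _ in tasks):
        return float("-inf")  # dropped or duplicated work is invalid, not "better"

    # Only a correct assignment is scored on the real objective (negated makespan here).
    loads = [sum(cost for _, cost in worker) for worker in assignments]
    return -max(loads) if loads else float("-inf")
```

In practice you would run this across a diverse set of workloads plus a held-out set, and aggregate the scores before reporting a single fitness value back to the loop.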

AI research assistants elevate researchers to the role of problem architects and AI advisors. Now go focus on identifying the most impactful problems, designing promising initial approaches, and critically assessing AI-generated solutions. Let’s accelerate scientific discovery! What a time to be alive.