Data Privacy (Spring 2026) lab syllabus
Group policy
- Team size: Each lab should be done in a group of 2.
- Rotation: You are encouraged to form a new group for each lab. This helps you network and exposes you to different coding styles and perspectives.
- Alternative: You may keep the same partner for Labs 1 & 2, and then switch to a new partner for Labs 3 & 4.
- Individual work: You are permitted to work individually, but you will be graded by the same standards as a group of two.
- Submission: Group members must submit identical results (notebooks/reports) and will receive identical grades.
Philosophy
These labs complement in-class quizzes and reflection check-ins. They are designed to be semi-open:
- Tooling allowed: You may use documentation and starter code to reduce boilerplate, but you are responsible for correctness and explanations.
- Analysis-driven: The “answer” is rarely just code. It is an experimental result, a plot, or a trade-off analysis that proves you understand why the code works.
- Iteration required: Most tasks include an “iteration gap” where the first attempt is sub-optimal. Iterate to refine your results.
Lab 1: Privacy attacks (the offense)
Topic: Membership inference and data extraction
Deliverable: Jupyter notebook
- Task 1 (CTF Extraction): Extract a hidden flag (UVA{...}) from a “Black Box” model provided by CorpXYZ.
- Task 2 (MIA): Implement a membership inference attack using loss/perplexity thresholds to identify which model was trained on a specific canary.
- Task 3 (Iteration gap - red teaming):
- Scenario: The model now has a simple filter that blocks the exact secret string.
- Challenge: Generate 3 “stealthy” prompts using different strategies (e.g., semantic variations) that bypass the filter but still trick the model into revealing the secret.
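The loss-threshold idea behind Task 2 can be sketched in a few lines of dependency-free Python. The losses, labels, and threshold below are illustrative placeholders, not outputs of the lab's actual model; the point is only the decision rule: trained-on examples tend to have lower loss, so thresholding per-example loss separates members from non-members.

```python
# Minimal loss-threshold MIA sketch (toy numbers, not the lab model).

def mia_predict(losses, threshold):
    """Predict membership: loss below threshold -> member (True)."""
    return [loss < threshold for loss in losses]

def attack_accuracy(losses, labels, threshold):
    """Fraction of examples whose predicted membership matches the truth."""
    preds = mia_predict(losses, threshold)
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy data: members (seen in training) have lower loss than non-members.
member_losses = [0.4, 0.6, 0.5]
nonmember_losses = [2.1, 1.8, 2.4]
losses = member_losses + nonmember_losses
labels = [True] * 3 + [False] * 3

acc = attack_accuracy(losses, labels, threshold=1.0)
print(acc)  # 1.0 -- threshold 1.0 perfectly separates this toy data
```

In practice you would sweep the threshold on a calibration set (or use perplexity instead of raw loss); the lab's canary setup makes the separation far noisier than this toy example.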
Lab 2: Re-identification & reconstruction (microdata vs. statistics)
Topic: Singling-out, linkage, reconstruction; DP counts as defense
Deliverable: Jupyter notebook
- Part 1 (Singling-out): Compute equivalence class sizes for QI tuples; report the k=1 count/fraction and the rarest QI combinations.
- Part 2 (Linkage / attribute disclosure): Join a synthetic directory table (IDs + QIs) to the microdata; report unique match rate and basic disclosure stats.
- Part 3 (Reconstruction + DP): Reconstruct sensitive bits from published subgroup counts using Z3; then add Laplace/Geometric noise (vary $\epsilon$) and re-run to observe feasibility/instability and utility error.
Lab 3: Private training (the defense)
Topic: DP-SGD, privacy accountants
Deliverable: Jupyter notebook
- Task A (The Accountant): Calculate the noise_multiplier for a target $(\epsilon, \delta)$ and epoch count.
- Task B (Defense): Fine-tune the mini-GPT using Opacus.
- Task C (The Audit): Run your Lab 1 attack against your Lab 3 model. (Reference implementation provided).
- Task D (Iteration gap - performance competition):
- Challenge: DP-SGD is highly sensitive to hyperparameters.
- Goal: Achieve the highest possible validation accuracy for $\epsilon=3.0$.
- Iteration: Tune the learning rate, batch size, and max grad norm.
- Task E (Debug Challenge): Identify the privacy leakage in a provided DP-SGD training loop.
- Analysis: Compare the “Loss Landscapes” of non-private vs. private training.
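The core of DP-SGD (Task B), which Opacus implements for you, is per-example gradient clipping followed by calibrated Gaussian noise. A dependency-free sketch of one step, with made-up gradients and hyperparameters (Opacus's real implementation vectorizes this over parameter tensors):

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def clip(grad, max_norm):
    """Scale one example's gradient so its L2 norm is at most max_norm."""
    norm = l2_norm(grad)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in grad]

def dp_sgd_step(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(col) for col in zip(*clipped)]
    sigma = noise_multiplier * max_norm  # noise std is calibrated to the clip norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    n = len(per_example_grads)
    return [x / n for x in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]   # toy per-example gradients
step = dp_sgd_step(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Note why Task D's tuning is delicate: the clip norm bounds each example's influence (too small and signal is destroyed, too large and the noise must grow), and larger batches average the fixed noise over more examples.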
Lab 4: Secure multi-party computation (MPC)
Topic: Secure inference, arithmetic secret sharing
Deliverable: Jupyter notebook
- Task A (Warmup): Implement a private linear layer ($Y = XW + B$) using Crypten.
- Task B (Iteration gap - the softmax bottleneck):
- Challenge: Exact softmax is slow/unstable in MPC.
- Optimization: Design and test different Polynomial Approximations (e.g., $x^2+x$, Taylor series) for the attention mechanism.
- Task C (Performance): Measure the latency and communication overhead of a full Transformer block in MPC.
- Task D (Debug Challenge): Find the security flaw in a naive secret sharing implementation.
- Analysis: Discuss the “Non-Linearity Tax”: why are LLMs harder to run in MPC than CNNs?
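The arithmetic secret sharing underlying Task A can be illustrated without Crypten: split a value into two additive shares modulo a fixed modulus, so that adding two secrets is just each party adding its own shares locally. The prime modulus here is a toy choice for illustration (Crypten actually computes over a 64-bit integer ring), and the values are placeholders:

```python
import random

P = 2**61 - 1  # toy prime modulus, chosen only for this sketch

def share(secret, rng):
    """Split `secret` into two additive shares that sum to it mod P."""
    s0 = rng.randrange(P)          # a uniformly random share reveals nothing
    s1 = (secret - s0) % P
    return s0, s1

def reconstruct(s0, s1):
    return (s0 + s1) % P

rng = random.Random(42)
a0, a1 = share(12, rng)
b0, b1 = share(30, rng)

# Each party adds its own shares locally; addition needs no communication.
c0, c1 = (a0 + b0) % P, (a1 + b1) % P
print(reconstruct(c0, c1))  # 42
```

This is also where Task D's naive-implementation flaw typically hides: if a share is not drawn uniformly from the whole ring (e.g., from too small a range), a single share leaks information about the secret. Multiplication, and hence softmax, is what requires interaction, which is the root of Task B's bottleneck.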
Technical setup and infrastructure
- Model: A 5-10M parameter GPT-2 style model (provided in utils/model.py).
- Dataset: TinyStories.
- Codebase: Use the provided skeletons in the labs/ directory.
Compute resources
- Recommended: Google Colab (Free Tier).
- Local: Requires PyTorch + Opacus + Crypten.