Funding and Authorship
This post was made possible by research funding generously provided by AnkiHub and was written in collaboration with Giacomo Randazzo.
Comments
This text is also available on AnkiHub’s Forum. For comments and discussion, please go there.
Prelude
We are introducing a mathematical formalization of memory models for spaced repetition. A memory model estimates how likely you are to recall the answer to each card in your Anki collection. The formalization works across traditional models like SM-2 and FSRS, as well as future content-aware memory models. It will serve as the foundation for upcoming posts on evaluating memory models, which will explore how to answer questions like “Is FSRS any good?” and “How much better is FSRS than SM-2?”.
In Content-aware Spaced Repetition I (Giacomo) drew a distinction between two components of a spaced repetition system: the memory model, which predicts how likely a student is to remember a card at any given moment, and the review scheduler, which makes recommendations about when you should actually study each card, taking into account the memory model’s predictions.
We believe that keeping these two concerns separate facilitates clearer reasoning about each one. Our research currently focuses on memory models because scheduler performance is bounded above by the quality of the memory model.
In general, we will think of spaced repetition systems as operating on Retrieval Prompts, which have Answers that the student attempts to recall. Most systems use a flashcard metaphor, where each (Retrieval Prompt, Answer) pair is represented by a digital flashcard. For simplicity, we will use the term card to refer to (Retrieval Prompt, Answer) pairs.
Introduction
A memory model answers the following question for any card in the student’s collection: given everything we know about the student, including their review history, what is the likelihood the student correctly recalls the answer right now?
We call this likelihood the retrievability of a card: the probability that the student will correctly recall the answer at a specific point in time. The formal target is:

$$R = \Pr(Y = 1 \mid s, h, c, t)$$

where $Y$ denotes the recall outcome (1 for success, 0 for forgetting) and is modeled as $Y \sim \mathrm{Bernoulli}(R)$. In this formalization, the memory model outputs a single value $R \in [0, 1]$ for each (student, card, history, time) tuple.
One could frame this as binary classification — will the student recall or not? — but that would discard most of the useful signal. A classifier outputs a hard decision boundary: there is some elapsed time where the prediction flips from “will recall” to “won’t recall,” and that’s all a downstream consumer gets. A scheduler that wants to review a card when retrievability drops to 90% (or any other threshold) cannot extract that information from a binary label. Well-calibrated probability estimates let the scheduler sweep over $\Delta t$ and read off retrievability at any point on the forgetting curve, enabling more nuanced scheduling policies.
Anki uses four rating buttons — Again, Hard, Good, Easy — which capture a more detailed indication of recall confidence. In this formalization we binarize: $g = 0$ for Again (complete failure to recall) and $g = 1$ for any successful recall (Hard, Good, or Easy). The fine-grained ratings are preserved in the observation field $m$, making them available to models that use them without conflating them with the binary recall outcome.
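As a concrete sketch, the binarization can be written as a small function. The names here (`ANKI_RATINGS`, `binarize`) are ours for illustration, not part of any Anki API:

```python
# Hypothetical encoding of Anki's four buttons; only the Again/success
# split is prescribed by the formalization.
ANKI_RATINGS = ("again", "hard", "good", "easy")

def binarize(rating: str) -> int:
    """Map an Anki button press to the binary grade g: 0 only for Again."""
    if rating not in ANKI_RATINGS:
        raise ValueError(f"unknown rating: {rating!r}")
    return 0 if rating == "again" else 1
```

The raw rating string would be kept alongside the binary grade in the observation's metadata field, so models that want the fine-grained signal still have it.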
Formalization
Let’s define the spaces involved:
- $\mathcal{S}$: the set of students
- $\mathcal{C}$: the set of cards
- $\mathcal{G} = \{0, 1\}$: the grade space (1 means recalled)
- $\mathcal{T}$: time, expressed in seconds from a fixed epoch
- $\mathcal{H}$: the history space, an ordered sequence of review observations $(c_i, t_i, g_i, m_i)$ where $c_i \in \mathcal{C}$ is the card reviewed, $t_i \in \mathcal{T}$ is the Unix timestamp, $g_i \in \mathcal{G}$ is the grade, and $m_i$ captures any additional metadata (e.g. the Anki ease buttons: Again/Hard/Good/Easy)
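These spaces can be sketched as plain data types. The field names below are our own choices; the formalization only fixes the shape of an observation:

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class ReviewObservation:
    card: str             # c_i: the card reviewed, represented by its content
    timestamp: int        # t_i: Unix timestamp in seconds
    grade: int            # g_i: 1 = recalled, 0 = forgotten
    metadata: Any = None  # m_i: e.g. the raw Anki button press

# A history h is an ordered sequence of such observations.
History = tuple[ReviewObservation, ...]
```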
A memory model is a function:

$$M : \mathcal{S} \times \mathcal{H} \times \mathcal{C} \to \left( \mathbb{R}_{\ge 0} \to [0, 1] \right)$$
The model takes a student, a review history, and a card, and returns a forgetting curve: a function mapping elapsed time $\Delta t$ since the most recent review to a retrievability estimate. The output is only meaningful for $\Delta t \ge 0$; the absolute timing of reviews is already encoded in the history $h$.
This is equivalent by currying to a flat function $\mathcal{S} \times \mathcal{H} \times \mathcal{C} \times \mathbb{R}_{\ge 0} \to [0, 1]$. We use the higher-order form to make explicit that what the model produces is a forgetting curve — elapsed time is a query into that curve, not an input on equal footing with the student, history, and card.
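The currying equivalence can be made concrete with a toy model. The exponential shape and the decay constant below are illustrative stand-ins, not a fitted model:

```python
import math
from typing import Callable

def memory_model(student, history, card) -> Callable[[float], float]:
    """Higher-order form: returns a forgetting curve over elapsed seconds.
    The exponential shape and decay constant are illustrative only."""
    decay = 86_400.0  # seconds; a stand-in for whatever a real model infers

    def forgetting_curve(dt: float) -> float:
        return math.exp(-dt / decay)

    return forgetting_curve

def memory_model_flat(student, history, card, dt: float) -> float:
    """Flat (uncurried) form: same information, elapsed time as an argument."""
    return memory_model(student, history, card)(dt)
```

The two forms carry identical information; the higher-order form simply lets downstream code hold on to the curve and query it at many elapsed times.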
We use elapsed time rather than an absolute timestamp as the argument of the forgetting curve. The forgetting curve is a statement about how memory decays from a reference point — namely, the moment of the last review — so elapsed time is the natural axis. Absolute timestamps would conflate two things: when the review happened and how much time has passed since. Using $\Delta t$ keeps those concerns separate and also makes it easier to reason about forgetting curves across different students and cards, since $\Delta t = 0$ always means “just reviewed” regardless of when that was.
For most practical purposes we query the model at a specific elapsed time $\Delta t$, obtaining a scalar retrievability estimate:

$$R = M(s, h, c)(\Delta t)$$
Traditional models vs. content-aware models
This formalization makes two departures from traditional models.
First, it breaks the local independence assumption. Most existing memory models — DASH and its variants, Duolingo’s HLR, FSRS — restrict the history used to predict retrievability for card $c$ to only previous reviews of card $c$ itself: $h_c = \{(c_i, t_i, g_i, m_i) \in h : c_i = c\}$. This makes modeling tractable and is a reasonable first approximation — how well you remember a specific medical fact does mostly depend on how often you’ve reviewed that specific fact. But it means the student’s full learning history across all cards is ignored.
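The local independence restriction is just a filter over the history. A minimal sketch, assuming observations are (card, timestamp, grade, metadata) tuples (our layout, not prescribed by the formalization):

```python
def restrict_to_card(history, card):
    """Local independence: keep only the observations whose reviewed
    card matches `card`, preserving their order."""
    return tuple(obs for obs in history if obs[0] == card)

# Illustrative history mixing reviews of two cards.
history = (
    ("card-A", 1000, 1, "good"),
    ("card-B", 2000, 0, "again"),
    ("card-A", 3000, 1, "hard"),
)
```

A traditional model sees only `restrict_to_card(history, c)`; the formalization here passes the full `history` instead.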
Here, the history contains reviews of all cards. This matters because reviewing related material may strengthen or interfere with the retrievability of a given card. It also enables better initial estimates for new cards: a model that can read card content and knows a student’s history with related material can make an informed prediction before the first review of that card, rather than falling back to a generic prior.
Second, cards are represented by their content rather than a categorical identifier. Traditional systems assign each card an opaque ID; here, $\mathcal{C}$ is the card content space. This has a practical consequence: card identity management becomes unnecessary, and edits are handled naturally — if a card’s content changes substantially, the model treats it as a different card. Traditional systems must track edits and decide heuristically whether a change warrants resetting scheduling state; here the problem does not arise.
Stability, difficulty, and derived quantities
Framing the output as a forgetting curve rather than a point-in-time probability opens up useful derived concepts. Stability can be defined as:

$$S(s, h, c) = \inf \{ \Delta t \ge 0 : M(s, h, c)(\Delta t) \le 0.9 \}$$
That is, stability is the elapsed time at which retrievability first drops to 90%. This generalizes the notion used in FSRS and SuperMemo, and only requires that the forgetting curve eventually crosses 0.9.
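As a sketch, stability can be read off any forgetting curve by root-finding. The bisection routine below is ours and assumes a non-increasing curve that crosses the 0.9 threshold within the search horizon; the exponential curve and its decay constant are illustrative, not a fitted model:

```python
import math

def stability(curve, target=0.9, horizon=1e9, tol=1.0):
    """Smallest elapsed time (seconds) at which `curve` first drops to
    `target`, found by bisection. Assumes a non-increasing curve that
    crosses `target` before `horizon`."""
    if curve(horizon) > target:
        raise ValueError("curve never reaches target within horizon")
    lo, hi = 0.0, horizon
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if curve(mid) > target:
            lo = mid
        else:
            hi = mid
    return hi

# Example: for R(dt) = exp(-dt / tau), retrievability hits 0.9 at
# dt = tau * ln(1 / 0.9). The value of tau is illustrative.
tau = 86_400.0
s = stability(lambda dt: math.exp(-dt / tau))
```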
Traditional models such as FSRS and SuperMemo make stability an explicit latent state alongside difficulty — a variable governing how much stability grows with each successful review, derived from and updated by the review history. In our formalization, neither needs to be tracked explicitly: the model takes the full history and card as inputs and can implicitly recover whatever information they encode. Stability is a derived quantity read off from the forgetting curve; difficulty is not prescribed — a model may maintain it as an explicit latent state or derive it implicitly from $h$ and $c$. This keeps the formalization agnostic to how different models represent memory dynamics internally — whether as explicit state variables, neural hidden states, or time-windowed counts.
Second-order uncertainty
Retrievability is already a form of uncertainty: a value of 0.5 means the model is unsure whether the student will recall the card, while 0.9 means it is fairly confident they will. This is first-order uncertainty — uncertainty about the outcome. Second-order uncertainty is uncertainty about that probability itself: how confident is the model in its estimate of 0.9?
A natural question is whether the model can output not just a point estimate of retrievability, but a distribution over it — expressing both a mean and a confidence in that mean. Formally, the codomain of $M$ would change from forgetting curves valued in $[0, 1]$ to forgetting curves valued in $\mathcal{P}([0, 1])$, the set of probability measures on $[0, 1]$:

$$M : \mathcal{S} \times \mathcal{H} \times \mathcal{C} \to \left( \mathbb{R}_{\ge 0} \to \mathcal{P}([0, 1]) \right)$$
Retrievability would then be the mean of the predictive distribution:

$$R = \mathbb{E}_{p \sim M(s, h, c)(\Delta t)}[p]$$
The Beta distribution is one natural example, with mean $\mu$ and concentration $\nu = \alpha + \beta$ encoding confidence in that mean, but the idea applies to any distributional form.
This would be genuinely useful for scheduling. Consider a student who has tagged a set of cards for an upcoming exam. Card A has high mean retrievability but substantial uncertainty — the model is not confident in its estimate. Card B is unrelated to the exam and has low mean retrievability with little uncertainty — confidently predicted to be near-forgotten. A scheduler that sees only point estimates would prioritize Card B, since its expected retrievability is lower. A scheduler aware of second-order uncertainty would instead prioritize Card A: the high uncertainty means the true retrievability could be considerably lower than the mean suggests, and the exam tag raises the stakes of being wrong. Point estimates cannot support this kind of risk-sensitive reasoning.
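The scenario can be made numeric with a toy priority rule. Everything below is invented for illustration: the importance weights, the Beta parameters, and the rule itself (importance times a pessimistic forgetting probability, using the Beta variance formula $\operatorname{Var} = \mu(1-\mu)/(\nu+1)$):

```python
import math

def pessimistic_retrievability(mean, concentration, k=1.0):
    """Mean minus k standard deviations of a Beta distribution with the
    given mean and concentration (Var = mean*(1-mean)/(concentration+1))."""
    var = mean * (1.0 - mean) / (concentration + 1.0)
    return mean - k * math.sqrt(var)

def priority(importance, retrievability):
    """Toy rule: importance-weighted probability of having forgotten."""
    return importance * (1.0 - retrievability)

# Card A: exam-tagged (importance 5), mean 0.9 but very uncertain.
# Card B: off-topic (importance 1), mean 0.4 and tightly estimated.
point_a, point_b = priority(5, 0.9), priority(1, 0.4)
pess_a = priority(5, pessimistic_retrievability(0.9, concentration=3))
pess_b = priority(1, pessimistic_retrievability(0.4, concentration=200))
```

On point estimates Card B wins even with the importance weight; once the pessimistic estimate folds in the uncertainty, Card A wins, matching the risk-sensitive reasoning above.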
However, there is a fundamental obstacle to estimating this uncertainty from review data in a model-agnostic way. For any given (student, history, card) triple, there is only ever one binary outcome to observe. And crucially, the history changes after every review — reviewing a card appends an observation to $h$, permanently altering the input to the model. You cannot step in the same river twice. Unlike estimating a coin’s bias by flipping it repeatedly, each flip here is on a new coin that you can never flip again.
This means the variance of any distributional output is not just hard to estimate in practice — it is structurally unidentified by the data. The law of total variance makes this precise. Let $R$ be the (random) retrievability and $Y \mid R \sim \mathrm{Bernoulli}(R)$ the observed outcome:

$$\operatorname{Var}(Y) = \mathbb{E}[\operatorname{Var}(Y \mid R)] + \operatorname{Var}(\mathbb{E}[Y \mid R]) = \mathbb{E}[R(1 - R)] + \operatorname{Var}(R)$$

Expanding the first term:

$$\mathbb{E}[R(1 - R)] = \mathbb{E}[R] - \mathbb{E}[R^2] = \mathbb{E}[R] - \mathbb{E}[R]^2 - \operatorname{Var}(R)$$

so that

$$\operatorname{Var}(Y) = \mathbb{E}[R]\,(1 - \mathbb{E}[R])$$
The $\operatorname{Var}(R)$ term cancels exactly. The marginal variance of the binary outcome is always $\mathbb{E}[R](1 - \mathbb{E}[R])$, regardless of how spread out the distribution over $R$ is. This holds for any distribution over $[0, 1]$, not just the Beta — the derivation uses nothing specific to the Beta, only that $Y \mid R$ is Bernoulli. Binary observations carry zero information about $\operatorname{Var}(R)$. A model that outputs a distributional form with an explicit concentration parameter can vary that parameter freely without affecting the likelihood; it will be driven by initialization and optimizer dynamics rather than signal in the data. This is not a limitation of the likelihood specifically: Bengs et al. (2023) prove that no proper scoring rule over binary outcomes can incentivize faithful second-order uncertainty quantification, ruling out any alternative training objective as a fix.
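The cancellation is easy to check numerically. Using the Beta variance formula, two distributions with the same mean but wildly different concentrations induce exactly the same marginal Bernoulli variance (a small check written for this post, not code from any model):

```python
def bernoulli_marginal_variance(mean, concentration):
    """Var(Y) for Y ~ Bernoulli(R), R ~ Beta with the given mean and
    concentration: E[R(1-R)] + Var(R), via the law of total variance."""
    var_r = mean * (1.0 - mean) / (concentration + 1.0)
    e_r_one_minus_r = mean * (1.0 - mean) - var_r  # E[R] - E[R^2]
    return e_r_one_minus_r + var_r  # the var_r terms cancel exactly

sharp = bernoulli_marginal_variance(0.9, concentration=1000.0)
diffuse = bernoulli_marginal_variance(0.9, concentration=2.0)
```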
The move from point estimates to forgetting curves does not resolve this. The forgetting curve is defined for a specific (student, history, card) triple; you observe one point on it, the history changes, and the curve you could have queried at other elapsed times no longer exists — the river argument applies just as before.
The resolution is to not treat the distribution over $R$ as a free parameter per instance. If retrievability is computed as a deterministic function of shared parameters — say $R = f_\theta(s, h, c, \Delta t)$, where $\theta$ is learned across all students and cards — then uncertainty over $R$ at a new query point is induced by uncertainty over $\theta$, and $\theta$ is identified by the full dataset. The cancellation no longer applies because we are not varying $\operatorname{Var}(R)$ freely; we are varying $\theta$, and $\theta$ appears in the likelihood of every observation. The “concentration” of the distribution over $R$ at any query point is not a fitted parameter — it emerges from the posterior geometry of $\theta$ relative to that query point. A query point surrounded by many similar training observations will have tight uncertainty; an unusual one will have wide uncertainty.
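A deliberately minimal instance: collapse $f_\theta$ to a single recall rate shared by every observation, ignoring the student, history, card, and elapsed time entirely. This sketch (our own, not any model's actual structure) shows only that a shared parameter's posterior variance is identified by binary data and shrinks as observations accrue:

```python
def posterior_over_shared_rate(outcomes, prior_a=1.0, prior_b=1.0):
    """Beta posterior over a single recall rate theta shared by ALL
    observations. Because theta enters every observation's likelihood,
    its posterior variance IS identified, unlike a per-instance
    concentration parameter."""
    successes = sum(outcomes)
    a = prior_a + successes
    b = prior_b + len(outcomes) - successes
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var

# Same 90% recall rate, 10 observations vs 1000 observations:
_, var_small = posterior_over_shared_rate([1] * 9 + [0])
_, var_large = posterior_over_shared_rate([1] * 900 + [0] * 100)
```

Real models replace the single scalar with a parameter vector conditioning on $(s, h, c, \Delta t)$, but the identification argument is the same.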
This is a legitimate path to second-order uncertainty, but it requires committing to a specific parametric structure — what $f_\theta$ looks like, what “similar” means in input space, how $\theta$ is shared. These are model-specific choices. A Bayesian logistic regression, a Gaussian process, and a neural network with MC dropout would each induce different uncertainty surfaces over the same inputs, driven by their respective inductive biases rather than by anything in the general formalization.
For this reason, the formalization keeps retrievability as a point estimate. Second-order uncertainty is not impossible, but it is model-dependent, and individual model families can provide it where their structural assumptions justify it.
Summary
A memory model is a function from (student, full review history, card) to a forgetting curve over future time deltas. Queried at a specific elapsed time, it returns retrievability — the probability of recall.
The key move relative to traditional formalizations is conditioning on the full review history across all cards, with cards represented by their content in the item space $\mathcal{C}$ rather than by opaque identifiers. This breaks the local independence assumption and enables knowledge transfer and cross-card forgetting patterns.
Stability emerges as a derived quantity from this general framework rather than as a first-class model state. Difficulty is dropped as a framework-level concept: models may have a notion of difficulty, or they may not. Second-order uncertainty (a distribution over retrievability) is a natural extension but is not identifiable from binary review data without model-specific structural assumptions — so it is kept out of the general formalization.
With this framework in place, the next post will use it to define a principled setup for comparing memory models: what metrics to use, and why.