How It Works
- On the first run, the pipeline reads all rows and stores the maximum value of the cursor column.
- On subsequent runs, it reads only rows where the cursor column is greater than the stored value.
- After each run, it updates the stored cursor value to the new maximum.
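The three steps above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual implementation: the `events` table, the `read_increment` function, and the in-memory `state` dict standing in for the stored cursor are all hypothetical names chosen for the example.

```python
import sqlite3

def read_increment(conn, state):
    """Read rows past the stored cursor, then advance the cursor."""
    last = state.get("cursor")  # None on the first run
    if last is None:
        # First run: read every row.
        rows = conn.execute(
            "SELECT id, updated_at FROM events ORDER BY updated_at"
        ).fetchall()
    else:
        # Later runs: read only rows strictly past the stored value.
        rows = conn.execute(
            "SELECT id, updated_at FROM events "
            "WHERE updated_at > ? ORDER BY updated_at",
            (last,),
        ).fetchall()
    if rows:
        # Store the new maximum so the next run starts after it.
        state["cursor"] = max(r[1] for r in rows)
    return rows

# Hypothetical source table with UTC ISO-8601 timestamps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2024-01-01T00:00:00Z"), (2, "2024-01-02T00:00:00Z")],
)

state = {}
first = read_increment(conn, state)    # all existing rows
conn.execute("INSERT INTO events VALUES (3, '2024-01-03T00:00:00Z')")
second = read_increment(conn, state)   # only the newly added row
third = read_increment(conn, state)    # nothing new since the last run
```

Because the filter is a strict greater-than, a run with no new rows returns an empty result and leaves the stored cursor unchanged.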
Choosing the Cursor Column
The cursor column must:
- Increase monotonically: New and updated rows should have greater values than older rows. Timestamp columns such as updated_at work well; created_at only advances on insert, so it suits append-only tables.
- Be indexed: Each run filters on the cursor column, so index it in the source to avoid a full table scan.
- Never decrease: Avoid columns that can be updated to smaller values (e.g. a status that gets reset).
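To illustrate the indexing requirement, the sketch below asks SQLite for its query plan before and after adding an index on the cursor column. It assumes a SQLite source and a hypothetical `events` table; other databases expose their plans differently, so treat it as an illustration of the idea rather than a general recipe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, updated_at TEXT, payload TEXT)"
)

def query_plan(conn):
    """Return SQLite's plan for a typical incremental read as one string."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT id, payload FROM events "
        "WHERE updated_at > '2024-01-01T00:00:00Z'"
    ).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " ".join(r[3] for r in rows)

plan_before = query_plan(conn)  # no index on the cursor column yet

# Index the cursor column (the index name is an arbitrary choice).
conn.execute("CREATE INDEX idx_events_updated_at ON events (updated_at)")

plan_after = query_plan(conn)   # the plan should now mention the index
```

Comparing `plan_before` and `plan_after` shows whether the incremental filter can use the index instead of scanning the whole table.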
Common Pitfalls
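- Gaps in cursor values: If some writes do not advance the cursor column (e.g. batch updates that skip timestamps), you might miss those rows. Prefer columns that are set on every update.
- Timezone: Ensure the cursor column uses a consistent timezone (UTC recommended). Values recorded with mixed offsets may compare incorrectly against the stored cursor.
- Null values: Rows with null in the cursor column may be excluded, since in SQL sources a greater-than comparison against null does not match. Use a column that is always populated.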
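The timezone and null pitfalls can be reproduced with a short sketch. The `to_utc_iso` helper and the `events` table are hypothetical, and the null behavior shown is that of SQLite; other sources may treat nulls differently.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def to_utc_iso(ts):
    """Convert a timezone-aware datetime to a UTC ISO-8601 string."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# The same instant written with two different offsets.
a = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
b = datetime(2024, 1, 1, 7, 0, tzinfo=timezone(timedelta(hours=-5)))

# Compared as raw strings, b sorts before a even though they are equal instants.
mixed_order_wrong = b.isoformat() < a.isoformat()
# After normalizing to UTC, both produce the same cursor value.
normalized_equal = to_utc_iso(a) == to_utc_iso(b)

# A row with a null cursor value never satisfies the incremental filter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2024-01-02T00:00:00Z"), (2, None)],
)
picked_up = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM events WHERE updated_at > '2024-01-01T00:00:00Z'"
    )
]
skipped = [
    r[0] for r in conn.execute("SELECT id FROM events WHERE updated_at IS NULL")
]
```

A periodic check like the `IS NULL` query is one way to detect rows that the incremental filter will never reach.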