Sequences v4

Many applications require that unique surrogate ids be assigned to database entries. Often the database SEQUENCE object is used to produce these. In PostgreSQL, these can be either:

  • A manually created sequence using the CREATE SEQUENCE command and retrieved by calling the nextval() function
  • serial and bigserial columns or, alternatively, GENERATED BY DEFAULT AS IDENTITY columns

However, standard sequences in PostgreSQL aren't multi-node aware and produce values that are unique only on the local node. This is important because unique ids generated by such sequences cause conflict and data loss (by means of discarded INSERT actions) in multi-master replication.

BDR global sequences

For this reason, BDR provides an application-transparent way to generate unique ids using sequences on bigint or bigserial datatypes across the whole BDR group, called global sequences.

BDR global sequences provide an easy way for applications to use the database to generate unique synthetic keys in an asynchronous distributed system that works for mostbut not necessarily allcases.

Using BDR global sequences allows you to avoid the problems with insert conflicts. If you define a PRIMARY KEY or UNIQUE constraint on a column that's using a global sequence, no node can ever get the same value as any other node. When BDR synchronizes inserts between the nodes, they can never conflict.

BDR global sequences extend PostgreSQL sequences, so they are crash-safe. To use them, you must be granted the bdr_application role.

There are various possible algorithms for global sequences:

  • SnowflakeId sequences
  • Globally allocated range sequences

SnowflakeId sequences generate values using an algorithm that doesn't require inter-node communication at any point. It's faster and more robust and has the useful property of recording the timestamp at which the values were created.

SnowflakeId sequences have the restriction that they work only for 64-bit BIGINT datatypes and produce values up to 19 digits long, which might be too long for use in some host language datatypes such as Javascript Integer types. Globally allocated sequences allocate a local range of values that can be replenished as needed by inter-node consensus, making them suitable for either BIGINT or INTEGER sequences.

You can create a global sequence using the bdr.alter_sequence_set_kind() function. This function takes a standard PostgreSQL sequence and marks it as a BDR global sequence. It can also convert the sequence back to the standard PostgreSQL sequence.

BDR also provides the configuration variable bdr.default_sequence_kind, which determines the kind of sequence to create when the CREATE SEQUENCE command is executed or when a serial, bigserial, or GENERATED BY DEFAULT AS IDENTITY column is created. Valid settings are:

  • local, meaning that newly created sequences are the standard PostgreSQL (local) sequences.
  • galloc, which always creates globally allocated range sequences.
  • snowflakeid, which creates global sequences for BIGINT sequences that consist of time, nodeid, and counter components. You can't use it with INTEGER sequences (so you can use it for bigserial but not for serial).
  • timeshard, which is the older version of SnowflakeId sequence and is provided for backward compatibility only. The SnowflakeId is preferred.
  • distributed (the default), which is a special value that you can use only for bdr.default_sequence_kind. It selects snowflakeid for int8 sequences (i.e., bigserial) and galloc sequence for int4 (i.e., serial) and int2 sequences.

The bdr.sequences view shows information about individual sequence kinds.

currval() and lastval() work correctly for all types of global sequence.

SnowflakeId sequences

The ids generated by SnowflakeId sequences are loosely time ordered so you can use them to get the approximate order of data insertion, like standard PostgreSQL sequences. Values generated within the same millisecond might be out of order, even on one node. The property of loose time ordering means they are suitable for use as range partition keys.

SnowflakeId sequences work on one or more nodes and don't require any inter-node communication after the node join process completes. So you can continue to use them even if there's the risk of extended network partitions. They aren't affected by replication lag or inter-node latency.

SnowflakeId sequences generate unique ids in a different way from standard sequences. The algorithm uses three components for a sequence number. The first component of the sequence is a timestamp at the time of sequence number generation. The second component of the sequence number is the unique id assigned to each BDR node, which ensures that the ids from different nodes are always different. The third component is the number generated by the local sequence.

While adding a unique node id to the sequence number is enough to ensure there are no conflicts, we also want to keep another useful property of sequences. The ordering of the sequence numbers roughly corresponds to the order in which data was inserted into the table. Putting the timestamp first ensures this.

A few limitations and caveats apply to SnowflakeId sequences.

SnowflakeId sequences are 64 bits wide and need a bigint or bigserial. Values generated are up to 19 digits long. There's no practical 32-bit integer version, so you can't use it with serial sequences. Use globally allocated range sequences instead.

For SnowflakeId there's a limit of 4096 sequence values generated per millisecond on any given node (about 4 million sequence values per second). In case the sequence value generation wraps around within a given millisecond, the SnowflakeId sequence waits until the next millisecond and gets a fresh value for that millisecond.

Since SnowflakeId sequences encode timestamps into the sequence value, you can generate new sequence values only within the given time frame (depending on the system clock). The oldest timestamp that you can use is 2016-10-07, which is the epoch time for the SnowflakeId. The values wrap to negative values in the year 2086 and completely run out of numbers by 2156.

Since timestamp is an important part of a SnowflakeId sequence, there's additional protection from generating sequences with a timestamp older than the latest one used in the lifetime of a postgres process (but not between postgres restarts).

The INCREMENT option on a sequence used as input for SnowflakeId sequences is effectively ignored. This might be relevant for applications that do sequence ID caching, like many object-relational mapper (ORM) tools, notably Hibernate. Because the sequence is time based, this has little practical effect since the sequence advances to a new noncolliding value by the time the application can do anything with the cached values.

Similarly, you might change the START, MINVALUE, MAXVALUE, and CACHE settings on the underlying sequence, but there's no benefit to doing so. The sequence's low 14 bits are used and the rest is discarded, so the value range limits don't affect the function's result. For the same reason, setval() isn't useful for SnowflakeId sequences.

Timeshard sequences

Timeshard sequences are provided for backward compatibility with existing installations but aren't recommended for new application use. We recommend using the SnowflakeId sequence instead.

Timeshard is very similar to SnowflakeId but has different limits and fewer protections and slower performance.

The differences between timeshard and SnowflakeId are as following:

  • Timeshard can generate up to 16384 per millisecond (about 16 million per second), which is more than SnowflakeId. However, there's no protection against wraparound within a given millisecond. Schemas using the timeshard sequence must protect the use of the UNIQUE constraint when using timeshard values for given column.
  • The timestamp component of timeshard sequence runs out of values in the year 2050 and, if used in combination with bigint, the values wrap to negative numbers in the year 2033. This means that sequences generated after 2033 have negative values. This is a considerably shorter time span than SnowflakeId and is the main reason why SnowflakeId is preferred.
  • Timeshard sequences require occasional disk writes (similar to standard local sequences). SnowflakeIds are calculated in memory so the SnowflakeId sequences are in general a little faster than timeshard sequences.

Globally allocated range sequences

The globally allocated range (or galloc) sequences allocate ranges (chunks) of values to each node. When the local range is used up, a new range is allocated globally by consensus amongst the other nodes. This uses the key space efficiently but requires that the local node be connected to a majority of the nodes in the cluster for the sequence generator to progress when the currently assigned local range is used up.

Unlike SnowflakeId sequences, galloc sequences support all sequence data types provided by PostgreSQL: smallint, integer, and bigint. This means that you can use galloc sequences in environments where 64-bit sequences are problematic. Examples include using integers in javascript, since that supports only 53-bit values, or when the sequence is displayed on output with limited space.

The range assigned by each voting is currently predetermined based on the datatype the sequence is using:

  • smallint 1 000 numbers
  • integer 1 000 000 numbers
  • bigint 1 000 000 000 numbers

Each node allocates two chunks of seq_chunk_size, one for the current use plus a reserved chunk for future usage, so the values generated from any one node increase monotonically. However, viewed globally, the values generated aren't ordered at all. This might cause a loss of performance due to the effects on b-tree indexes and typically means that generated values aren't useful as range partition keys.

The main downside of the galloc sequences is that once the assigned range is used up, the sequence generator has to ask for consensus about the next range for the local node that requires inter-node communication. This could lead to delays or operational issues if the majority of the BDR group isn't accessible. This might be avoided in later releases.

The CACHE, START, MINVALUE, and MAXVALUE options work correctly with galloc sequences. However, you need to set them before transforming the sequence to the galloc kind. The INCREMENT BY option also works correctly. However, you can't assign an increment value that's equal to or more than the above ranges assigned for each sequence datatype. setval() doesn't reset the global state for galloc sequences; don't use it.

A few limitations apply to galloc sequences. BDR tracks galloc sequences in a special BDR catalog bdr.sequence_alloc. This catalog is required to track the currently allocated chunks for the galloc sequences. The sequence name and namespace is stored in this catalog. Since the sequence chunk allocation is managed by Raft, whereas any changes to the sequence name/namespace is managed by the replication stream, BDR currently doesn't support renaming galloc sequences or moving them to another namespace or renaming the namespace that contains a galloc sequence. Be mindful of this limitation while designing application schema.

Converting a local sequence to a galloc sequence

Before transforming a local sequence to galloc, you need to take care of several prerequisites.

1. Verify that sequence and column data type match

Check that the sequence's data type matches the data type of the column with which it will be used. For example, you can create a bigint sequence and assign an integer column's default to the nextval() returned by that sequence. With galloc sequences, which for bigint are allocated in blocks of 1 000 000 000, this quickly results in the values returned by nextval() exceeding the int4 range if more than two nodes are in use.

The following example shows what can happen:

CREATE SEQUENCE int8_seq;

SELECT sequencename, data_type FROM pg_sequences;
 sequencename | data_type
--------------+-----------
 int8_seq     | bigint
(1 row)

CREATE TABLE seqtest (id INT NOT NULL PRIMARY KEY);

ALTER SEQUENCE int8_seq OWNED BY seqtest.id;

SELECT bdr.alter_sequence_set_kind('public.int8_seq'::regclass, 'galloc', 1);
 alter_sequence_set_kind
-------------------------

(1 row)

ALTER TABLE seqtest ALTER COLUMN id SET DEFAULT nextval('int8_seq'::regclass);

After executing INSERT INTO seqtest VALUES(DEFAULT) on two nodes, the table contains the following values:

SELECT * FROM seqtest;
     id
------------
          2
 2000000002
(2 rows)

However, attempting the same operation on a third node fails with an integer out of range error, as the sequence generated the value 4000000002.

Tip

You can retrieve the current data type of a sequence from the PostgreSQL pg_sequences view. You can modify the data type of a sequence with ALTER SEQUENCE ... AS ..., for example, ALTER SEQUENCE public.sequence AS integer, as long as its current value doesn't exceed the maximum value of the new data type.

2. Set a new start value for the sequence

When the sequence kind is altered to galloc, it's rewritten and restarts from the defined start value of the local sequence. If this happens on an existing sequence in a production database, you need to query the current value and then set the start value appropriately. To assist with this use case, BDR allows users to pass a starting value with the function bdr.alter_sequence_set_kind(). If you're already using offset and you have writes from multiple nodes, you need to check what is the greatest used value and restart the sequence at least to the next value.

-- determine highest sequence value across all nodes
SELECT max((x->'response'->'command_tuples'->0->>'nextval')::bigint)
    FROM json_array_elements(
        bdr.run_on_all_nodes(
            E'SELECT nextval(\'public.sequence\');'
            )::jsonb AS x;

-- turn into a galloc sequence
SELECT bdr.alter_sequence_set_kind('public.sequence'::regclass, 'galloc', $MAX + $MARGIN);

Since users can't lock a sequence, you must leave a $MARGIN value to allow operations to continue while the max() value is queried.

The bdr.sequence_alloc table gives information on the chunk size and the ranges allocated around the whole cluster. In this example, we started our sequence from 333, and we have two nodes in the cluster. We can see that we have a number of allocation 4, which is 2 per node, and the chunk size is 1000000 that's related to an integer sequence.

SELECT * FROM bdr.sequence_alloc
    WHERE seqid = 'public.categories_category_seq'::regclass;
          seqid          | seq_chunk_size | seq_allocated_up_to | seq_nallocs |       seq_last_alloc
-------------------------+----------------+---------------------+-------------+-----------------------------
 categories_category_seq |        1000000 |             4000333 |           4 | 2020-05-21 20:02:15.957835+00
(1 row)

To see the ranges currently assigned to a given sequence on each node, use these queries:

  • Node Node1 is using range from 333 to 2000333.
SELECT last_value AS range_start, log_cnt AS range_end
    FROM categories_category_seq WHERE ctid = '(0,2)'; -- first range
 range_start | range_end
-------------+-----------
         334 |   1000333
(1 row)

SELECT last_value AS range_start, log_cnt AS range_end
    FROM categories_category_seq WHERE ctid = '(0,3)'; -- second range
 range_start | range_end
-------------+-----------
     1000334 |   2000333
(1 row)
  • Node Node2 is using range from 2000004 to 4000003.
SELECT last_value AS range_start, log_cnt AS range_end
    FROM categories_category_seq WHERE ctid = '(0,2)'; -- first range
 range_start | range_end
-------------+-----------
     2000334 |   3000333
(1 row)

SELECT last_value AS range_start, log_cnt AS range_end
    FROM categories_category_seq WHERE ctid = '(0,3)'; -- second range
 range_start | range_end
-------------+-----------
     3000334 |   4000333

You can't combine it to a single query (like WHERE ctid IN ('(0,2)', '(0,3)')) as that still shows only the first range.

When a node finishes a chunk, it asks a consensus for a new one and gets the first available. In the example, it's from 4000334 to 5000333. This is the new reserved chunk and starts to consume the old reserved chunk.

UUIDs, KSUUIDs, and other approaches

There are other ways to generate globally unique ids without using the global sequences that can be used with BDR. For example:

  • UUIDs and their BDR variant, KSUUIDs
  • Local sequences with a different offset per node (i.e., manual)
  • An externally coordinated natural key

BDR applications can't use other methods safely: counter-table-based approaches relying on SELECT ... FOR UPDATE, UPDATE ... RETURNING ... or similar for sequence generation doesn't work correctly in BDR because BDR doesn't take row locks between nodes. The same values are generated on more than one node. For the same reason, the usual strategies for "gapless" sequence generation don't work with BDR. In most cases, the application coordinates generation of sequences that must be gapless from some external source using two-phase commit. Or it generates them only on one node in the BDR group.

UUIDs and KSUUIDs

UUID keys instead avoid sequences entirely and use 128-bit universal unique identifiers. These are random or pseudorandom values that are so large that it's nearly impossible for the same value to be generated twice. There's no need for nodes to have continuous communication when using UUID keys.

In the unlikely event of a collision, conflict detection chooses the newer of the two inserted records to retain. Conflict logging, if enabled, records such an event. However, it's exceptionally unlikely to ever occur, since collisions become practically likely only after about 2^64 keys are generated.

The main downside of UUID keys is that they're somewhat inefficient in terms of space and the network. They consume more space not only as a primary key but also where referenced in foreign keys and when transmitted on the wire. Also, not all applications cope well with UUID keys.

BDR provides functions for working with a K-Sortable variant of UUID data, known as KSUUID, which generates values that can be stored using the PostgreSQL standard UUID data type. A KSUUID value is similar to UUIDv1 in that it stores both timestamp and random data, following the UUID standard. The difference is that KSUUID is K-Sortable, meaning that it's weakly sortable by timestamp. This makes it more useful as a database key as it produces more compact btree indexes, which improves the effectiveness of search, and allows natural time-sorting of result data. Unlike UUIDv1, KSUUID values don't include the MAC of the computer on which they were generated, so there are no security concerns from using them.

KSUUID v2 is now recommended in all cases. You can directly sort values generated with regular comparison operators.

There are two versions of KSUUID in BDR: v1 and v2. The legacy KSUUID v1 is deprecated but is kept in order to support existing installations. Don't use it for new installations. The internal contents of v1 and v2 aren't compatible. As such, the functions to manipulate them also aren't compatible. The v2 of KSUUID also no longer stores the UUID version number.

Step and offset sequences

In offset-step sequences, a normal PostgreSQL sequence is used on each node. Each sequence increments by the same amount and starts at differing offsets. For example, with step 1000, node1's sequence generates 1001, 2001, 3001, and so on. node2's sequence generates 1002, 2002, 3002, and so on. This scheme works well even if the nodes can't communicate for extended periods. However, the designer must specify a maximum number of nodes when establishing the schema, and it requires per-node configuration. Mistakes can easily lead to overlapping sequences.

It's relatively simple to configure this approach with BDR by creating the desired sequence on one node, like this:

CREATE TABLE some_table (
    generated_value bigint primary key
);

CREATE SEQUENCE some_seq INCREMENT 1000 OWNED BY some_table.generated_value;

ALTER TABLE some_table ALTER COLUMN generated_value SET DEFAULT nextval('some_seq');

Then, on each node calling setval(), give each node a different offset starting value, for example:

-- On node 1
SELECT setval('some_seq', 1);

-- On node 2
SELECT setval('some_seq', 2);

 -- ... etc

Be sure to allow a large enough INCREMENT to leave room for all the nodes you might ever want to add, since changing it in future is difficult and disruptive.

If you use bigint values, there's no practical concern about key exhaustion, even if you use offsets of 10000 or more. It would take hundreds of years, with hundreds of machines, doing millions of inserts per second, to have any chance of approaching exhaustion.

BDR doesn't currently offer any automation for configuration of the per-node offsets on such step/offset sequences.

Composite keys

A variant on step/offset sequences is to use a composite key composed of PRIMARY KEY (node_number, generated_value), where the node number is usually obtained from a function that returns a different number on each node. You can create such a function by temporarily disabling DDL replication and creating a constant SQL function. Alternatively, you can use a one-row table that isn't part of a replication set to store a different value in each node.

Global sequence management interfaces

BDR provides an interface for converting between a standard PostgreSQL sequence and the BDR global sequence.

The following functions are considered to be DDL, so DDL replication and global locking applies to them.

bdr.alter_sequence_set_kind

Allows the owner of a sequence to set the kind of a sequence. Once set, seqkind is visible only by way of the bdr.sequences view. In all other ways, the sequence appears as a normal sequence.

BDR treats this function as DDL, so DDL replication and global locking applies, if it's currently active. See DDL Replication.

Synopsis

bdr.alter_sequence_set_kind(seqoid regclass, seqkind text)

Parameters

  • seqoid Name or Oid of the sequence to alter.
  • seqkind local for a standard PostgreSQL sequence, snowflakeid or galloc for globally unique BDR sequences, or timeshard for legacy globally unique sequence.

Notes

When changing the sequence kind to galloc, the first allocated range for that sequence uses the sequence start value as the starting point. When there are existing values that were used by the sequence before it was changed to galloc, we recommend moving the starting point so that the newly generated values don't conflict with the existing ones using the following command:

ALTER SEQUENCE seq_name START starting_value RESTART

This function uses the same replication mechanism as DDL statements. This means that the replication is affected by the ddl filters configuration.

The function takes a global DDL lock. It also locks the sequence locally.

This function is transactional. You can roll back the effects with the ROLLBACK of the transaction. The changes are visible to the current transaction.

Only the owner of the sequence can execute the bdr.alter_sequence_set_kind function unless bdr.backwards_compatibility is set is set to 30618 or lower.

bdr.extract_timestamp_from_snowflakeid

This function extracts the timestamp component of the snowflakeid sequence. The return value is of type timestamptz.

Synopsis

bdr.extract_timestamp_from_snowflakeid(snowflakeid bigint)

Parameters

  • snowflakeid Value of a snowflakeid sequence.

Notes

This function executes only on the local node.

bdr.extract_nodeid_from_snowflakeid

This function extracts the nodeid component of the snowflakeid sequence.

Synopsis

bdr.extract_nodeid_from_snowflakeid(snowflakeid bigint)

Parameters

  • snowflakeid Value of a snowflakeid sequence.

Notes

This function executes only on the local node.

bdr.extract_localseqid_from_snowflakeid

This function extracts the local sequence value component of the snowflakeid sequence.

Synopsis

bdr.extract_localseqid_from_snowflakeid(snowflakeid bigint)

Parameters

  • snowflakeid Value of a snowflakeid sequence.

Notes

This function executes only on the local node.

bdr.timestamp_to_snowflakeid

This function converts a timestamp value to a dummy snowflakeid sequence value.

This is useful for doing indexed searches or comparisons of values in the snowflakeid column and for a specific timestamp.

For example, given a table foo with a column id that's using a snowflakeid sequence, we can get the number of changes since yesterday midnight like this:

SELECT count(1) FROM foo WHERE id > bdr.timestamp_to_snowflakeid('yesterday')

A query formulated this way uses an index scan on the column id.

Synopsis

bdr.timestamp_to_snowflakeid(ts timestamptz)

Parameters

  • ts Timestamp to use for the snowflakeid sequence generation.

Notes

This function executes only on the local node.

bdr.extract_timestamp_from_timeshard

This function extracts the timestamp component of the timeshard sequence. The return value is of type timestamptz.

Synopsis

bdr.extract_timestamp_from_timeshard(timeshard_seq bigint)

Parameters

  • timeshard_seq Value of a timeshard sequence.

Notes

This function executes only on the local node.

bdr.extract_nodeid_from_timeshard

This function extracts the nodeid component of the timeshard sequence.

Synopsis

bdr.extract_nodeid_from_timeshard(timeshard_seq bigint)

Parameters

  • timeshard_seq Value of a timeshard sequence.

Notes

This function executes only on the local node.

bdr.extract_localseqid_from_timeshard

This function extracts the local sequence value component of the timeshard sequence.

Synopsis

bdr.extract_localseqid_from_timeshard(timeshard_seq bigint)

Parameters

  • timeshard_seq Value of a timeshard sequence.

Notes

This function executes only on the local node.

bdr.timestamp_to_timeshard

This function converts a timestamp value to a dummy timeshard sequence value.

This is useful for doing indexed searches or comparisons of values in the timeshard column and for a specific timestamp.

For example, given a table foo with a column id that's using a timeshard sequence, we can get the number of changes since yesterday midnight like this:

SELECT count(1) FROM foo WHERE id > bdr.timestamp_to_timeshard('yesterday')

A query formulated this way uses an index scan on the column id.

Synopsis

bdr.timestamp_to_timeshard(ts timestamptz)

Parameters

  • ts Timestamp to use for the timeshard sequence generation.

Notes

This function executes only on the local node.

KSUUID v2 Functions

Functions for working with KSUUID v2 data, K-Sortable UUID data.

bdr.gen_ksuuid_v2

This function generates a new KSUUID v2 value using the value of timestamp passed as an argument or current system time if NULL is passed. If you want to generate KSUUID automatically using the system time, pass a NULL argument.

The return value is of type UUID.

Synopsis

bdr.gen_ksuuid_v2(timestamptz)

Notes

This function executes only on the local node.

bdr.ksuuid_v2_cmp

This function compares the KSUUID v2 values.

It returns 1 if the first value is newer, -1 if the second value is lower, or zero if they are equal.

Synopsis

bdr.ksuuid_v2_cmp(uuid, uuid)

Parameters

  • UUID KSUUID v2 to compare.

Notes

This function executes only on the local node.

bdr.extract_timestamp_from_ksuuid_v2

This function extracts the timestamp component of KSUUID v2. The return value is of type timestamptz.

Synopsis

bdr.extract_timestamp_from_ksuuid_v2(uuid)

Parameters

  • UUID KSUUID v2 value to extract timestamp from.

Notes

This function executes only on the local node.

KSUUID v1 functions

Functions for working with KSUUID v1 data, K-Sortable UUID data(v1).

bdr.gen_ksuuid

This function generates a new KSUUID v1 value, using the current system time. The return value is of type UUID.

Synopsis

bdr.gen_ksuuid()

Notes

This function executes only on the local node.

bdr.uuid_v1_cmp

This function compares the KSUUID v1 values.

It returns 1 if the first value is newer, -1 if the second value is lower, or zero if they are equal.

Synopsis

bdr.uuid_v1_cmp(uuid, uuid)

Notes

This function executes only on the local node.

Parameters

  • UUID KSUUID v1 to compare.

bdr.extract_timestamp_from_ksuuid

This function extracts the timestamp component of KSUUID v1 or UUIDv1 values. The return value is of type timestamptz.

Synopsis

bdr.extract_timestamp_from_ksuuid(uuid)

Parameters

  • UUID KSUUID v1 value to extract timestamp from.

Notes

This function executes on the local node.