@shalvah
Created June 8, 2024 13:40

Bucardo was taking too long to import certain tables, so I decided to look into it by querying the new database to see what was so slow.

First step was checking pg_stat_activity:

> select query, query_start, wait_event from pg_stat_activity where usename = 'bucardo_replicator';

/* Bucardo 5.6.0 */COPY public.some_table("id","...","...","created_at","updated_at") FROM STDIN | 2024-05-02 15:36:13.816752+00 | IO::DataFileRead

This COPY FROM query had been running for several days. The wait_event was IO::DataFileRead.
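
(As an aside, pg_stat_activity can also give the elapsed runtime directly, rather than eyeballing query_start. A sketch of what that would look like; the LIKE pattern is just an illustrative way of narrowing it down to the Bucardo COPY:)

-- sketch: elapsed runtime plus wait details for the Bucardo COPY
-- (the LIKE pattern is illustrative; adjust it to the statement you see above)
select pid,
       now() - query_start as runtime,
       wait_event_type,
       wait_event
from pg_stat_activity
where usename = 'bucardo_replicator'
  and query like '/* Bucardo%COPY%';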

I checked pg_stat_progress_copy to be sure it wasn't stuck.

> select command, type, tuples_processed from pg_stat_progress_copy;

 command  | type | tuples_processed
----------+------+------------------
COPY FROM | PIPE |         49396655

The tuples_processed figure is the number of rows copied so far (strictly tuples, not always rows, but in our case they're the same). Running the query multiple times showed this figure increasing, so the copy wasn't stuck, just slow.
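
(If you're in psql, a convenient way to do the "run it multiple times" part is \watch; a sketch, not what I actually ran:)

-- re-run the progress query every 30 seconds; tuples_processed should keep climbing
select now(), command, tuples_processed from pg_stat_progress_copy
\watch 30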

Taking a look at Amazon RDS Performance Insights also showed lots of waits on IO::DataFileRead.

[Screenshots: Amazon RDS Performance Insights wait charts, dominated by DataFileRead waits]

AWS has a guide on this:

The IO:DataFileRead event occurs when a connection waits on a backend process to read a required page from storage because the page isn't available in shared memory.

So, at the core, this is a memory problem.
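
(A quick way to back this up with numbers rather than a dashboard is the block hit ratio in pg_stat_database; a sketch, not something I ran at the time. A low hit percentage while the COPY runs means most reads are missing shared_buffers and going to storage.)

-- cache hit ratio for the current database
select datname,
       blks_hit,
       blks_read,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) as hit_pct
from pg_stat_database
where datname = current_database();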

The guide lists possible causes of increased waits. That, combined with the AWS guide on tuning RDS for PostgreSQL, led me to think that our shared_buffers setting was too small for this copy.

I wondered if we were still on the default (32 MB, as per the PostgreSQL docs), but checking the setting confirmed this was not the case.

> SHOW shared_buffers;

1933 MB
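
(For reference, shared_buffers is stored internally as a count of 8 kB buffers, which is handy to know when comparing against pg_buffercache below; 1933 MB works out to 1933 × 128 = 247,424 buffers. pg_settings shows the raw value:)

-- the raw setting: a number of 8 kB buffers
select name, setting, unit from pg_settings where name = 'shared_buffers';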

So I decided to investigate the usage of the buffers. To do this, I had to add the pg_buffercache extension:

CREATE EXTENSION pg_buffercache;

Then:

> select buffers_used, buffers_unused from pg_buffercache_summary();

 buffers_used | buffers_unused
--------------+---------------
       241653 |              2 
(1 row)

buffers_unused was 2! This meant we were using practically all of the shared buffer memory.
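
(The same function can give the usage as a percentage directly; a small sketch:)

-- fraction of shared buffers currently holding a page
select buffers_used,
       buffers_unused,
       round(100.0 * buffers_used / (buffers_used + buffers_unused), 2) as pct_used
from pg_buffercache_summary();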

Now to find out where our buffers were being used:

> SELECT n.nspname, c.relname, count(*) AS buffers
  FROM pg_buffercache b JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid) AND
  b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
  JOIN pg_namespace n ON n.oid = c.relnamespace
  GROUP BY n.nspname, c.relname
  ORDER BY 3 DESC 
  LIMIT 10;

 nspname |                   relname                   | buffers
---------+---------------------------------------------+---------
 public  | some_table_pkey                             |   57792
 public  | index_some_table_on_some_id                 |   53626
 public  | index_some_table_on_some_id_and_some_field  |   35264
 public  | index_some_table_on_some_other_id           |   33086
 public  | index_some_table_on_user_id                 |   30236
 public  | index_some_other_table_on_some_field_and_id |   27712
 public  | some_table                                  |    1704
 public  | index_some_table_on_collect_at              |    1041
 public  | index_some_table_on_some_other_field        |     408
 public  | index_some_table_on_deleted_at              |     377
(10 rows)

This showed a pattern: most of the buffers were being used by indexes.
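
(To confirm the pattern rather than eyeballing names, the same join can be grouped by pg_class.relkind; a sketch based on the query above:)

-- buffer usage grouped by relation kind ('i' = index, 'r' = table, 't' = TOAST, ...)
-- assumes the default 8 kB buffer size
SELECT c.relkind, count(*) AS buffers, pg_size_pretty(count(*) * 8192) AS size
  FROM pg_buffercache b JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid) AND
  b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
  GROUP BY c.relkind
  ORDER BY 2 DESC;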

From here, it was an easy fix: I killed the copy, dropped the indexes on the target table, and restarted it; the copy then finished in 40 minutes. Recreating the indexes afterwards took about 20 minutes. All done!
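
(For completeness, the drop/recreate dance looks roughly like this; the index and column names are placeholders following the buffer report above, not the real schema:)

-- before restarting the copy: drop the indexes on the target table.
-- if the primary key index is also dropped (it was the biggest buffer consumer here),
-- that's done by dropping the constraint, not the index.
ALTER TABLE public.some_table DROP CONSTRAINT some_table_pkey;
DROP INDEX IF EXISTS index_some_table_on_some_id;
DROP INDEX IF EXISTS index_some_table_on_some_id_and_some_field;
-- ...and so on for the rest

-- after the copy finishes: recreate them
ALTER TABLE public.some_table ADD PRIMARY KEY (id);
CREATE INDEX index_some_table_on_some_id ON public.some_table (some_id);
CREATE INDEX index_some_table_on_some_id_and_some_field ON public.some_table (some_id, some_field);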
