Presented by Christophe Pettus.
Author's site: thebuild.com
- a database is a set of tables & schema objects,
- MySQL "databases" === Postgres "schemas"
- a "cluster" is a collection of databases served by one Postgres instance
pg_ctl: start/stop the server
$PGDATA (the data directory) holds:
- postgresql.conf
- security: "role" === object that can own other objects and that has privileges
- "user" === role that can log into the system; otherwise synonyms
- important params in config: logging, memory, checkpoints, planner; done.
- use SSDs; put the transaction log onto spinning disk if you want. use server-grade drives that flush their caches gracefully on power failure.
- change logging first, to get the data you need:
- where to log? syslog, or CSV format to files.
- log_min_duration_statement = 250ms   <-- important! logs every statement slower than 250 ms; how you find slow queries
- log_lock_waits = on   <-- log whenever anything is waiting on a lock
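A sketch of applying these from psql (assumes PostgreSQL 9.4+ for ALTER SYSTEM; on older versions, edit postgresql.conf and reload):

    ALTER SYSTEM SET log_min_duration_statement = '250ms';  -- log statements slower than 250 ms
    ALTER SYSTEM SET log_lock_waits = on;                   -- log waits longer than deadlock_timeout
    SELECT pg_reload_conf();                                -- pick up the changes without a restart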
memory configs:
- don't run other servers on the same machine(?)
shared_buffers:
- best: 8GB on systems with >= 32GB RAM; otherwise 25% of total system RAM.
- if the server won't start, look for SHMMAX and SHMALL errors in the kernel logs -- tweak the kernel params:
-- calculate the shared memory in bytes + 20% (a huge decimal number)
-- sudo sysctl -w kernel.shmmax=(value)
-- sudo sysctl -w kernel.shmall=(value)/4096   (shmall is counted in 4096-byte pages)
work_mem:
- start low: 32-64MB.
- look for "temporary file" lines in the logs, to see if queries need more.
- set it to 2-3x the largest temp file you see. (why bigger? the on-disk format is more compact than the same data in actual hot RAM.)
- don't exceed 5-10% of system RAM: this much can be allocated per planner node, per connection.
- can be configured per-session.
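A minimal per-session sketch (the 256MB value is hypothetical; size it off your temp files):

    SET work_mem = '256MB';   -- just for this session/connection
    -- ... run the big sort/aggregate query ...
    RESET work_mem;           -- back to the server default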
maintenance_work_mem:
- 10% of system RAM, up to 1GB
- if VACUUM is having problems keeping up, it may need more.
effective_cache_size:
- set to the amount of file-system cache available.
- no idea what to set? use 50% of system RAM.
- this isn't allocated; it's a hint to the planner about how much cache is effectively available.
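To see where the memory settings currently stand, a quick pg_settings query:

    SELECT name, setting, unit
      FROM pg_settings
     WHERE name IN ('shared_buffers', 'work_mem',
                    'maintenance_work_mem', 'effective_cache_size');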
- random_page_cost && work_mem === biggest performance gains
- prefer xfs or ext4 on linux.
checkpoints: a complete flush of dirty buffers to disk. Triggered by either:
- a set number of WAL segments having been written to disk, or
- a timeout expiring.
tuning these parameters:
- wal_buffers = 16MB
- checkpoint_completion_target = 0.9
- checkpoint_timeout = 10m-30m   # longer to start with; less IO while operating
- checkpoint_segments = 32
monitoring:
- look for checkpoint entries in the log.
- how often are they happening? more often than checkpoint_timeout? if so, the WAL segments are being exhausted faster than the timeout fires; bump up checkpoint_segments until checkpoints are less frequent than the timeout.
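pg_stat_bgwriter gives the same answer without log-grubbing (on the versions this talk covers):

    -- checkpoints_timed: triggered by checkpoint_timeout (good)
    -- checkpoints_req: forced by running out of WAL segments (raise checkpoint_segments)
    SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;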
effective_io_concurrency: set to the number of IO channels, otherwise ignore it; an SSD with 32 channels → set to 32, etc.
random_page_cost: the ratio between the time to grab a random disk page vs. a sequential one; controls index vs. sequential scans (lower favors indexes). 3.0 is typical for RAID10, 2.0 for a SAN, 1.1 for Amazon EBS; SSDs land around 1.5.
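random_page_cost can be tried per-session before committing it to the config; the value and table here are hypothetical:

    SET random_page_cost = 1.5;  -- SSD-ish value, this session only
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;  -- did the plan flip to an index scan?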
fsync = on. Never change this; it controls whether postgres flushes to disk and then waits for the result.
synchronous_commit = on. You can turn this off (it won't corrupt the DB on failure, though the last few commits may be lost).
WAL (write-ahead log):
- a continuous stream of committed database modifications, broken into 16MB segments.
- starts with DB cluster creation, lasts forever.
- checkpoints mark the "last consistent state"; segments from before that checkpoint can be thrown away.
- PUT IT ON ITS OWN FILE SYSTEM: it's basically append-only. its own set of disks, etc. its footprint stays put.
MVCC:
- helps prevent locking; an alternative to pessimistic locking; allows higher performance.
- writers don't block readers; readers get the old version of the row.
- writers block writers to the same row.
- multiple versions of a row may be in the DB; deleted/updated rows aren't immediately removed.
- VACUUM cleans up tuples (rows) that are no longer visible to any transaction.
- post-8.0, autovacuum runs automatically; still a good idea to do a manual vacuum after bulk update/delete operations.
- ANALYZE regenerates table stats to help the planner make good guesses about how to execute queries; always do this after major database changes, such as a restore from backup.
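Sketch, with a hypothetical events table:

    VACUUM ANALYZE events;  -- after a bulk update/delete: reclaim dead tuples, refresh stats
    ANALYZE;                -- after a restore from backup: refresh stats database-wide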
- "share" vs "exclusive" locks exist.
- surprising locks: table-level locks when you add a new non-NULL column -- fix by creating it NULL then changing the column later
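A sketch of that workaround, with a hypothetical table t (on the versions this talk covers, the one-step form rewrites the whole table under an exclusive lock):

    ALTER TABLE t ADD COLUMN status text;            -- fast: no table rewrite
    UPDATE t SET status = 'new';                     -- backfill, ideally in batches
    ALTER TABLE t ALTER COLUMN status SET NOT NULL;  -- add the constraint last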
transaction isolation levels:
- read committed (the default)
- repeatable read
- serializable
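Set per transaction, e.g.:

    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    -- ... work that needs a fully consistent view ...
    COMMIT;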
- keep data in normal form, don't fear joins
- "fast/slow" rule: "fast data" changes a lot, "slow data" infrequently -- put these into separate tables. Slow data tends to be the parent of other data via foreign keys.
- some indexing strategies:
-- an index should be selective: when it's used, it should return a small number of rows.
-- partial index: an index that applies only to rows matching specific conditions (along the lines of "index where (clause)").
-- an index should be frequently used.
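A partial-index sketch with hypothetical names (most queries only care about unshipped orders):

    CREATE INDEX orders_unshipped_idx
        ON orders (created_at)
     WHERE shipped = false;   -- index only the rows the hot queries touch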
- drop unused indexes. create indexes on the basis of real-life needs, and look for sequential scans
- built-in views to check index usage:
-- pg_stat_user_tables -- how many times a sequential scan has been done
-- pg_stat_user_indexes -- how often an index has been used
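Sketches against those views:

    -- tables doing lots of sequential scans (index candidates):
    SELECT relname, seq_scan, idx_scan
      FROM pg_stat_user_tables
     ORDER BY seq_scan DESC LIMIT 10;

    -- indexes never used since stats were reset (drop candidates):
    SELECT indexrelname
      FROM pg_stat_user_indexes
     WHERE idx_scan = 0;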
- SELECT COUNT(*) FROM myhugetable is implemented as a full table scan. Don't do it. pg_class has an approximate count (reltuples), but try to avoid counting; it's not a fast operation on Postgres.
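The approximate count, for a hypothetical myhugetable (only as fresh as the last ANALYZE/autovacuum):

    SELECT reltuples::bigint AS approx_rows
      FROM pg_class
     WHERE relname = 'myhugetable';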
- taming autovacuum -- you can adjust the number of workers, how frequently it runs, etc.; see the autovacuum settings in the configuration file.
- bulk loading: use COPY, not INSERT.
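Sketch, with a hypothetical table and path:

    COPY mytable FROM '/path/to/data.csv' WITH (FORMAT csv);  -- file read by the server process
    -- or, client-side from psql:
    \copy mytable FROM 'data.csv' WITH (FORMAT csv)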
- EXPLAIN or EXPLAIN ANALYZE -- gets the query plan
- http://explain.depesz.com
- estimated vs. actual rows returned differing wildly means the planner's confused
- nested loops often mean joins that you can't use an index for
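e.g. a hypothetical query (note EXPLAIN ANALYZE actually executes it, so be careful with writes):

    EXPLAIN ANALYZE
    SELECT o.*
      FROM orders o
      JOIN customers c ON c.id = o.customer_id
     WHERE o.created_at > now() - interval '1 day';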
- pg_stat_activity -- is it waiting on a query? etc
- tail -f logs
- pg_locks, in connection with pg_stat_activity.
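A sketch joining the two (column names per 9.2+; older versions use procpid/current_query):

    SELECT a.pid, a.query, l.mode, l.relation::regclass
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.pid = l.pid
     WHERE NOT l.granted;   -- sessions stuck waiting on a lock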
- cloud hosting has terrible IO; since DBs are IO-bound, get as much RAM as you can (up to 2x the DB size); CPU capacity isn't as important as RAM; always replicate.
- store configurations in VCS
- our-own-hardware:
-- get SSDs, otherwise SAS drives.
-- RAID10
-- put pg_xlog on its own volume
-- move pg_stat_tmp to a RAMdisk if you want to (transient data, write-intensive)
-- if you only have a small SSD, put your busiest indexes / tables on it.
- nagios: disk, cpu, mem, replication lag (if replication is used)
- "checkpostgres.pl" from Bucardo.org <- use
- pgAdmin3 for management, handy
- log analysis: pgFouine (traditional, not maintained much); pgBadger (newer, active).