All opinions expressed are those of the authors and not necessarily those of, our sponsors, or our affiliates.
  Add to My Yahoo!  Subscribe with Bloglines  Subscribe in NewsGator Online

published by (Greg Sabino Mullane) on 2017-07-22 00:16:00 in the "postgres" category

Upgrading Postgres is not quite as painful as it used to be, thanks primarily to the pg_upgrade program, but there are times when it simply cannot be used. We recently had an existing End Point client come to us requesting help upgrading from their current Postgres database (version 9.2) to the latest version (9.6 - but soon to be 10). They also wanted to finally move away from their SQL_ASCII encoding to UTF-8. As this meant that pg_upgrade could not be used, we also took the opportunity to enable checksums as well (this change cannot be done via pg_upgrade). Finally, they were moving their database server to new hardware. There were many lessons learned and bumps along the way for this migration, but for this post I'd like to focus on one of the most vexing problems, the database encoding.

When a Postgres database is created, it is set to a specific encoding. The most common one (and the default) is "UTF8". This covers 99% of all user's needs. The second most common one is the poorly-named "SQL_ASCII" encoding, which should be named "DANGER_DO_NOT_USE_THIS_ENCODING", because it causes nothing but trouble. The SQL_ASCII encoding basically means no encoding at all, and simply stores any bytes you throw at it. This usually means the database ends up containing a whole mess of different encodings, creating a "byte soup" that will be difficult to sanitize by moving to a real encoding (i.e. UTF-8).

Many tools exist which convert text from one encoding to another. One of the most popular ones on Unix boxes is "iconv". Although this program works great if your source text is using one encoding, it fails when it encounters byte soup.

For this migration, we first did a pg_dump from the old database to a newly created UTF-8 test database, just to see which tables had encoding problems. Quite a few did - but not all of them! - so we wrote a script to import tables in parallel, with some filtering for the problem ones. As mentioned above, iconv was not particularly helpful: looking at the tables closely showed evidence of many different encodings in each one: Windows-1252, ISO-8859-1, Japanese, Greek, and many others. There were even large bits that were plainly binary data (e.g. images) that simply got shoved into a text field somehow. This is the big problem with SQL_ASCII: it accepts *everything*, and does no validation whatsoever. The iconv program simply could not handle these tables, even when adding the //IGNORE option.

To better explain the problem and the solution, let's create a small text file with a jumble of encodings. Discussions of how UTF-8 represents characters, and its interactions with Unicode, are avoided here, as Unicode is a dense, complex subject, and this article is dry enough already. :)

First, we want to add some items using the encodings 'Windows-1252' and 'Latin-1'. These encoding systems were attempts to extend the basic ASCII character set to include more characters. As these encodings pre-date the invention of UTF-8, they do it in a very inelegant (and incompatible) way. Use of the "echo" command is a great way to add arbitrary bytes to a file as it allows direct hex input:

$ echo -e "[Windows-1252]   Euro: x80   Double dagger: x87" >
$ echo -e "[Latin-1]   Yen: xa5   Half: xbd" >> 
$ echo -e "[Japanese]   Ship: xe8x88xb9" >>
$ echo -e "[Invalid UTF-8]  Blob: xf4xa5xa3xa5" >> 

This file looks ugly. Notice all the "wrong" characters when we simply view the file directly:

$ cat
[Windows-1252]   Euro: �   Double dagger: �
[Latin-1]   Yen: �   Half: �
[Japanese]   Ship: ? 
[Invalid UTF-8]  Blob: ����

Running iconv is of little help:

## With no source encoding given, it errors on the Euro:
$ iconv -t utf8 >/dev/null
iconv: illegal input sequence at position 23 

## We can tell it to ignore those errors, but it still barfs on the blob:
$ iconv -t utf8//ignore >/dev/null
iconv: illegal input sequence at position 123

## Telling it the source is Window-1252 fixes some things, but still sinks the Ship:
$ iconv -f windows-1252 -t utf8//ignore 
[Windows-1252]   Euro: €   Double dagger: ‡
[Latin-1]   Yen: ¥   Half: ½
[Japanese]   Ship: è?¹
[Invalid UTF-8]  Blob: ô¥£¥

After testing a few other tools, we discovered the nifty Encoding::FixLatin , a Perl module which provides a command-line program called "fix_latin". Rather than being authoritative like iconv, it tries its best to fix things up with educated guesses. Its documentation gives a good summary:

  The script acts as a filter, taking source data which may contain a mix of
  ASCII, UTF8, ISO8859-1 and CP1252 characters, and producing output will be
  all ASCII/UTF8.

  Multi-byte UTF8 characters will be passed through unchanged (although
  over-long UTF8 byte sequences will be converted to the shortest normal
  form). Single byte characters will be converted as follows:

    0x00 - 0x7F   ASCII - passed through unchanged 
    0x80 - 0x9F   Converted to UTF8 using CP1252 mappings
    0xA0 - 0xFF   Converted to UTF8 using Latin-1 mappings

While this works great for fixing the Windows-1252 and Latin-1 problems (and thus accounted for at least 95% of our table's bad encodings), it still allows "invalid" UTF-8 to pass on through. Which means that Postgres will still refuse to accept it. Let's check our test file:

$ fix_latin
[Windows-1252]   Euro: €   Double dagger: ‡
[Latin-1]   Yen: ¥   Half: ½
[Japanese]   Ship: ?
[Invalid UTF-8]  Blob: ����

## Postgres will refuse to import that last part:
$ echo "SELECT E'"  "$(fix_latin"  "';" | psql
ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5 

## Even adding iconv is of no help:
$ echo "SELECT E'"  "$(fix_latin | iconv -t utf-8)"  "';" | psql
ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

The UTF-8 specification is rather dense and puts many requirements on encoders and decoders. How well programs implement these requirements (and optional bits) varies, of course. But at the end of the day, we needed that data to go into a UTF-8 encoded Postgres database without complaint. When in doubt, go to the source! The relevant file in the Postgres source code responsible for rejecting bad UTF-8 (as in the examples above) is src/backend/utils/mb/wchar.c Analyzing that file shows a small but elegant piece of code whose job is to ensure only "legal" UTF-8 is accepted:

pg_utf8_islegal(const unsigned char *source, int length)
  unsigned char a;

  switch (length)
      /* reject lengths 5 and 6 for now */
      return false;
    case 4:
      a = source[3];
      if (a < 0x80 || a > 0xBF)
        return false;
      /* FALL THRU */
    case 3: 
      a = source[2];
      if (a < 0x80 || a > 0xBF)
        return false;
      /* FALL THRU */
    case 2:
      a = source[1];
      switch (*source)
        case 0xE0:
          if (a < 0xA0 || a > 0xBF)
            return false;
        case 0xED:
          if (a < 0x80 || a > 0x9F)
            return false;
        case 0xF0:
          if (a < 0x90 || a > 0xBF)
            return false;
        case 0xF4:
          if (a < 0x80 || a > 0x8F)
            return false;
          if (a < 0x80 || a > 0xBF)
            return false;
      /* FALL THRU */
    case 1:
      a = *source; 
      if (a >= 0x80 && a < 0xC2)
        return false;
      if (a > 0xF4)
        return false;
  return true;

Now that we know the UTF-8 rules for Postgres, how do we ensure our data follows it? While we could have made another standalone filter to run after fix_latin, that would increase the migration time. So I made a quick patch to the fix_latin program itself, rewriting that C logic in Perl. A new option "--strict-utf8" was added. Its job is to simply enforce the rules found in the Postgres source code. If a character is invalid, it is replaced with a question mark (there are other choices for a replacement character, but we decided simple question marks were quick and easy - and the surrounding data was unlikely to be read or even used anyway).

Voila! All of the data was now going into Postgres without a problem. Observe:

$ echo "SELECT E'"  "$(fix_latin"  "';" | psql
ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

$ echo "SELECT E'"  "$(fix_latin --strict-utf8"  "';" | psql
  [Windows-1252]   Euro: €   Double dagger: ‡+
 [Latin-1]   Yen: ¥   Half: ½                +
 [Japanese]   Ship: ?                       +
 [Invalid UTF-8]  Blob: ???? 
(1 row)

What are the lessons here? First and foremost, never use SQL_ASCII. It's outdated, dangerous, and will cause much pain down the road. Second, there are an amazing number of client encodings in use, especially for old data, but the world has pretty much standardized on UTF-8 these days, so even if you are stuck with SQL_ASCII, the amount of Windows-1252 and other monstrosities will be small. Third, don't be afraid to go to the source. If Postgres is rejecting your data, it's probably for a very good reason, so find out exactly why. There were other challenges to overcome in this migration, but the encoding was certainly one of the most interesting ones. Everyone, the client and us, is very happy to finally have everything using UTF-8!


published by (Greg Sabino Mullane) on 2017-03-28 16:23:00 in the "postgres" category

Turtle turtle by WO1 Larry Olson from US Army

The PostgreSQL database system uses the write-ahead logging method to ensure that a log of changes is saved before being applied to the actual data. The log files that are created are known as WAL (Write Ahead Log) files, and by default are 16 MB in size each. Although this is a small size, a busy system can generate hundreds or thousands of these files per hour, at which point disk space becomes an issue. Luckily, WAL files are extremely compressible. I examined different programs to find one that offered the best compression (as indicated by a smaller size) at the smallest cost (as indicated by wall clock time). All of the methods tested worked better than the venerable gzip program, which is suggested in the Postgres documentation for the archive_command option. The best overall solution was using the pxz program inside the archive_command setting, followed closely by use of the 7za program. Use of the built-in wal_compression option was an excellent solution as well, although not as space-saving as using external programs via archive_command.

A database system is a complex beast, involving many trade-offs. An important issue is speed: waiting for changes to get written to disk before letting the client proceed can be a very expensive solution. Postgres gets around this with the use of the Write Ahead Log, which generates WAL files indicating what changes were made. Creating these files is much faster than performing the actual updates on the underlying files. Thus, Postgres is able to tell the client that the work is "done" when the WAL file has been generated. Should the system crash before the actual changes are made, the WAL files are used to replay the changes. As these WAL files represent a continuous unbroken chain of all changes to the database, they can also be used for Point in Time Recovery - in other words, the WAL files can be used to rewind the database to any single point in time, capturing the state of the database at a specific moment.

Postgres WAL files are exactly 16 MB in size (although this size may be changed at compilation time, it is extremely unheard of to do this). These files primarily sit around taking up disk space and are only accessed when a problem occurs, so being able to compress them is a good one-time exchange of CPU effort for a lower file size. In theory, the time to decompress the files should also be considered, but testing revealed that all the programs decompressed so quickly that it should not be a factor.

WAL files can be compressed in one of two ways. As of Postgres 9.5, the wal_compression feature can be enabled, which instructs Postgres to compress parts of the WAL file in-place when possible, leading to the ability to store much more information per 16 MB WAL file, and thus reducing the total number generated. The second way is to compress with an external program via the free-form archive_command parameter. Here is the canonical example from the Postgres docs, showing use of the gzip program for archive_command:

archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'

It is widely known that gzip is no longer the best compression option for most tasks, so I endeavored to determine which program was the best at WAL file compression - in terms of final file size versus the overhead to create the file. I also wanted to examine how these fared versus the new wal_compression feature.

To compare the various compression methods, I examined all of the compression programs that are commonly available on a Linux system, are known to be stable, and which perform at least as good as the common utility gzip. The contenders were:

  • gzip - the canonical, default compression utility for many years
  • pigz - parallel version of gzip
  • bzip2 - second only to gzip in popularity, it offers better compression
  • lbzip2 - parallel version of bzip
  • xz - an excellent all-around compression alternative to gzip and bzip
  • pxz - parallel version of xz
  • 7za - excellent compression, but suffers from complex arguments
  • lrzip - compression program targeted at "large files"

For the tests, 100 random WAL files were copied from a busy production Postgres system. Each of those 100 files were compressed nine times by each of the programs above: from the "least compressed" option (e.g. -1) to the "best compressed" option (e.g. -9). The tests were performed on a 16-core system, with plenty of free RAM and nothing else running on the server. Results were gathered by wrapping each command with /usr/bin/time -verbose, which produces a nice breakdown of results. To gather the data, the "Elapsed (wall clock) time" was used, along with size of the compressed file. Here is some sample output of the time command:

  Command being timed: "bzip2 -4 ./0000000100008B91000000A5"
  User time (seconds): 1.65
  System time (seconds): 0.01
  Percent of CPU this job got: 99%
  Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.66
  Average shared text size (kbytes): 0
  Average unshared data size (kbytes): 0
  Average stack size (kbytes): 0
  Average total size (kbytes): 0
  Maximum resident set size (kbytes): 3612
  Average resident set size (kbytes): 0
  Major (requiring I/O) page faults: 0
  Minor (reclaiming a frame) page faults: 938
  Voluntary context switches: 1
  Involuntary context switches: 13
  Swaps: 0
  File system inputs: 0
  File system outputs: 6896
  Socket messages sent: 0
  Socket messages received: 0
  Signals delivered: 0
  Page size (bytes): 4096
  Exit status: 0

The wal_compression feature was tested by creating a new Postgres 9.6 cluster, then running the pgbench program twice to generate WAL files - once with wal_compression enabled, and once with it disabled. Then each of the resulting WAL files was compressed using each of the programs above.

Table 1.
Results of compressing 16 MB WAL files - average for 100 files
CommandWall clock time (s)File size (MB)
gzip -10.2714.927
gzip -20.2924.777
gzip -30.3664.667
gzip -40.3814.486
gzip -50.4924.318
gzip -60.7344.250
gzip -70.9914.235
gzip -82.0424.228
gzip -93.6264.227
CommandWall clock time (s)File size (MB)
bzip2 -11.5403.817
bzip2 -21.5313.688
bzip2 -31.5703.638
bzip2 -41.6253.592
bzip2 -51.6673.587
bzip2 -61.7073.566
bzip2 -71.7313.559
bzip2 -81.7523.557
bzip2 -91.7843.541
CommandWall clock time (s)File size (MB)
xz -10.9623.174
xz -21.1863.066
xz -35.9112.711
xz -46.2922.682
xz -56.6942.666
xz -68.9882.608
xz -79.1942.592
xz -89.1172.596
xz -99.1642.597
Table 2.
Results of compressing 16 MB WAL file - average for 100 files
CommandWall clock time (s)File size (MB)
-l -L1
-l -L2
-l -L3
-l -L4
-l -L5
-l -L6
-l -L7
-l -L8
-l -L9
CommandWall clock time (s)File size (MB)
-z -L1
-z -L2
-z -L3
-z -L4
-z -L5
-z -L6
-z -L7
-z -L8
-z -L9
CommandWall clock time (s)File size (MB)
7za -bd -mx=1
a test.7za
7za -bd -mx=2
a test.7za
7za -bd -mx=3
a test.7za
7za -bd -mx=4
a test.7za
7za -bd -mx=5
a test.7za
7za -bd -mx=6
a test.7za
7za -bd -mx=7
a test.7za
7za -bd -mx=8
a test.7za
7za -bd -mx=9
a test.7za
Table 3.
Results of compressing 16 MB WAL file - average for 100 files
CommandWall clock time (s)File size (MB)
pigz -10.0514.904
pigz -20.0514.755
pigz -30.0514.645
pigz -40.0514.472
pigz -50.0514.304
pigz -60.0604.255
pigz -70.0814.225
pigz -80.1404.212
pigz -90.2514.214
CommandWall clock time (s)File size (MB)
lbzip2 -10.1353.801
lbzip2 -20.1513.664
lbzip2 -30.1513.615
lbzip2 -40.1513.586
lbzip2 -50.1513.562
lbzip2 -60.1513.545
lbzip2 -70.1503.538
lbzip2 -80.1513.524
lbzip2 -90.1503.528
CommandWall clock time (s)File size (MB)
pxz -10.1353.266
pxz -20.1753.095
pxz -31.2442.746
pxz -42.5282.704
pxz -55.1152.679
pxz -69.1162.604
pxz -79.2552.599
pxz -89.2672.595
pxz -99.3552.591
Table 4.
Results of Postgres wal_compression option
ModificationsTotal size of WAL files (MB)
No modifications208.1
wal_compression enabled81.0
xz -28.6
wal_compression enabled PLUS xz -29.4

Table 1 shows some baseline compression values for the three popular programs gzip, bzip2, and xz. Both gzip and bzip2 show little change in the file sizes as the compression strength is raised. However, xz has a relatively large jump when going from -2 to -3, although the time cost increases to an unacceptable 5.9 seconds. As a starting point, something well under one second is desired.

Among those three, xz is the clear winner, shrinking the file to 3.07 MB with a compression argument of -2, and taking 1.2 seconds to do so. Both gzip and bzip2 never even reach this file size, even when using a -9 argument. For that matter, the best compression gzip can ever achieve is 4.23 MB, which the other programs can beat without breakng a sweat.

Table 2 demonstrates two ways of invoking the lrzip program: the -l option (lzo compression - described in the lrzip documentation as "ultra fast") and the -z option (zpaq compression - "extreme compression, extreme slow"). All of those superlatives are supported by the data. The -l option runs extremely fast: even at -L5 the total clock time is still only .39 seconds. Unfortunately, the file size hovers around an undesirable 5.5 MB, no matter what compression level is used. The -z option produces the smallest file of all the programs here (2.48 MB) at a whopping cost of over 30 seconds! Even the smallest compression level (-L1) takes 3.5 seconds to produce a 3.4 MB file. Thus, lrzip is clearly out of the competition.

Compression in action (photo by Eric Armstrong)

The most interesting program is without a doubt 7za. Unlike the others, it is organized around creating archives, and thus doesn't do in-place compression as the others do. Nevertheless, the results are quite good. At the lowest level, it takes a mere 0.13 seconds to produce a 3.18 MB file. As it takes xz 1.19 seconds to produce a nearly equivalent 3.10 MB file, 7za is the clear winner ... if we had only a single core available. :)

It is rare to find a modern server with a single processor, and a crop of compression programs have appeared to support this new paradigm. First up is the amazing pigz, which is a parallel version of gzip. As Table 3 shows, pigz is extraordinarily fast on our 16 core box, taking a mere 0.05 seconds to run at compression level -1, and only 0.25 seconds to run at compression level -9. Sadly, it still suffers from the fact that gzip is simply not good at compressing WAL files, as the smallest size was 4.23 MB. This rules out pigz from consideration.

The bzip2 program has been nipping at the heels of gzip for many years, so naturally it has its own parallel version, known as lbzip2. As Table 3 shows, it is also amazingly fast. Not as fast as pigz, but with a speed of under 0.2 seconds - even at the highest compression level! There is very little variation among the compression levels used, so it is fair to simply state that lbzip2 takes 0.15 seconds to shrink the WAL file to 3.5 MB. A decent entry.

Of course, the xz program has a parallel version as well, known as pxz. Table 3 shows that the times still vary quite a bit, and reach into the 9 second range at higher compression levels, but does very well at -2, taking a mere 0.17 seconds to produce a 3.09 MB file. This is comparable to the previous winner, 7za, which took 0.14 seconds to create a 3.12 MB file.

So the clear winners are 7za and pxz. I gave the edge to pxz, as (1) the file size was slightly smaller at comparable time costs, and (2) the odd syntax for 7za for both compressing and decompressing was annoying compared with the simplicity of "xz -2" and "xz -d".

Now, what about the built-in compression offered by the wal_compression option? As Table 4 shows, the compression for the WAL files I tested went from 208 MB to 81 MB. This is a significant gain, but only equates to compressing a single WAL file to 6.23 MB, which is a poor showing when compared to the compression programs above. It should be pointed out that the wal_compression option is sensitive to your workload, so you might see reports of greater and lesser compressions.

Of interest is that the WAL files generated by turning on wal_compression are capable of being further compressed by the archive_command option, and doing a pretty good job of it as well - going from 81 MB of WAL files to 9.4 MB of WAL files. However, using just xz in the archive_command without wal_compression on still yielded a smaller overall size, and means less CPU because the data is only compressed once.

It should be pointed out that wal_compression has other advantages, and that comparing it to archive_command is not a fair comparison, but this article was primarily about the best compression option for storing WAL files long-term.

Thus, the overall winner is "pxz -2", followed closely by 7za and its bulky arguments, with honorable mention given to wal_compression. Your particular requirements might guide you to another conclusion, but hopefully nobody shall simply default to using gzip anymore.

Thanks to my colleague Lele for encouraging me to try pxz, after I was already happy with xz. Thanks to the authors of xz, for providing an amazing program that has an incredible number of knobs for tweaking. And a final thanks to the authors of the wal_compression feature, which is a pretty nifty trick!


published by (Greg Sabino Mullane) on 2016-12-07 13:38:00 in the "postgres" category

Anytime you run a query in Postgres, it needs to compile your SQL into a lower-level plan explaining how exactly to retrieve the data. The more it knows about the tables involved, the smarter the planner can be. To get that information, it gathers statistics about the tables and stores them, predictably enough, in the system table known as pg_statistic. The SQL command ANALYZE is responsible for populating that table. It can be done per-cluster, per-database, per-table, or even per-column. One major pain about analyze is that every table *must* be analyzed after a major upgrade. Whether you upgrade via pg_dump, pg_upgrade, Bucardo, or some other means, the pg_statistic table is not copied over and the database starts as a clean slate. Running ANALYZE is thus the first important post-upgrade step.

Unfortunately, analyze can be painfully slow. Slow enough that the default analyze methods sometimes take longer that the entire rest of the upgrade! Although this article will focus on the pg_upgrade program in its examples, the lessons may be applied to any upgrade method. The short version of the lessons is: run vacuumdb in parallel, control the stages yourself, and make sure you handle any custom per-column statistics.

Before digging into the solution in more detail, let's see why all of this is needed. Doesn't pg_upgrade allow for super-fast Postgres major version upgrades, including the system catalogs? It does, with the notable exception of the pg_statistics table. The nominal reason for not copying the data is that the table format may change from version to version. The real reason is that nobody has bothered to write the conversion logic yet, for pg_upgrade could certainly copy the pg_statistics information: the table has not changed for many years.

At some point, a DBA will wonder if it is possible to simply copy the pg_statistic table from one database to another manually. Alas, it contains columns of the type "anyarray", which means it cannot be dumped and restored:

$ pg_dump -t pg_statistic --data-only | psql -q
ERROR:  cannot accept a value of type anyarray
CONTEXT:  COPY pg_statistic, line 1, column stavalues1: "{"{i,v}","{v}","{i,o,o}","{i,o,o,o}","{i,i,i,v,o,o,o}","{i,i,o,o}","{i,o}","{o,o,o}","{o,o,o,o}","{o..."

I keep many different versions of Postgres running on my laptop, and use a simple port naming scheme to keep them straight. It's simple enough to use pg_dump and sed to confirm that the structure of the pg_statistic table has not changed from version 9.2 until 9.6:

$ for z in 840 900 910 920 930 940 950; do echo -n $z: ; diff -sq <(pg_dump 
>  --schema-only -p 5$z -t pg_statistic | sed -n '/CREATE TABLE/,/^$/p') <(pg_dump 
>  --schema-only -p 5960 -t pg_statistic | sed -n '/CREATE TABLE/,/^$/p'); done
840:Files /dev/fd/63 and /dev/fd/62 differ
900:Files /dev/fd/63 and /dev/fd/62 differ
910:Files /dev/fd/63 and /dev/fd/62 differ
920:Files /dev/fd/63 and /dev/fd/62 are identical
930:Files /dev/fd/63 and /dev/fd/62 are identical
940:Files /dev/fd/63 and /dev/fd/62 are identical
950:Files /dev/fd/63 and /dev/fd/62 are identical

Of course, the same table structure does not promise that the backend of different versions uses them in the same way (spoiler: they do), but that should be something pg_upgrade can handle by itself. Even if the table structure did change, pg_upgrade could be taught to migrate the information from one format to another (its raison d'être). If the new statistics format take a long time to generate, perhaps pg_upgrade could leisurely generate a one-time table on the old database holding the new format, then copy that over as part of the upgrade.

Since pg_upgrade currently does none of those things and omits upgrading the pg_statistics table, the following message appears after pg_upgrade has been run:

Upgrade Complete
Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:

Looking at the script in question yields:


echo 'This script will generate minimal optimizer statistics rapidly'
echo 'so your system is usable, and then gather statistics twice more'
echo 'with increasing accuracy.  When it is done, your system will'
echo 'have the default level of optimizer statistics.'

echo 'If you have used ALTER TABLE to modify the statistics target for'
echo 'any tables, you might want to remove them and restore them after'
echo 'running this script because they will delay fast statistics generation.'

echo 'If you would like default statistics as quickly as possible, cancel'
echo 'this script and run:'
echo '    vacuumdb --all --analyze-only'

vacuumdb --all --analyze-in-stages

echo 'Done'

There are many problems in simply running this script. Not only is it going to iterate through each database one-by-one, but it will also process tables one-by-one within each database! As the script states, it is also extremely inefficient if you have any per-column statistics targets. Another issue with the --analyze-in-stages option is that the stages are hard-coded (at "1", "10", and "default"). Additionally, there is no way to easily know when a stage has finished other than watching the command output. Happily, all of these problems can be fairly easily overcome; let's create a sample database to demonstrate.

$ initdb --data-checksums testdb
$ echo port=5555 >> testdb/postgresql.conf 
$ pg_ctl start -D testdb
$ createdb -p 1900 alpha
$ pgbench alpha -p 1900 -i -s 2
$ for i in `seq 1 100`; do echo create table pgb$i AS SELECT * FROM pgbench_accounts;; done | psql -p 1900 alpha

Now we can run some tests to see the effect of the --jobs option. Graphing out the times shows some big wins and nice scaling. Here are the results of running vacuumdb alpha --analyze-only with various values of --jobs:

Simple graph showing time decreasing as number of jobs increases

The slope of your graph will be determined by how many expensive-to-analyze tables you have. As a rule of thumb, however, you may as well set --jobs to a high number. Anything over your max_connections setting is pointless, but don't be afraid to jack it up to at least a hundred. Experiment on your test box, of course, to find the sweet spot for your system. Note that the --jobs argument will not work on old versions of Postgres. For those cases, I usually whip up a Perl script using Parallel::ForkManager to get the job done. Thanks to Dilip Kumar for adding the --jobs option to vacuumdb!

The next problem to conquer is the use of custom statistics. Postgres' ANALYZE uses the default_statistics_target setting to determine how many rows to sample (the default value in modern versions of Postgres is 100). However, as the name suggests, this is only the default - you may also set a specific target at the column level. Unfortunately, there is no way to disable this quickly, which means that vacuumdb will always use the custom value. This is not what you want, especially if you are using the --analyze-in-stages option, as it will happily (and needlessly!) recalculate columns with specific targets three times. As custom stats are usually set much higher than the default target, this can be a very expensive option:

$ ## Create a largish table:
$ psql -qc 'create unlogged table aztest as select * from pgbench_accounts'
$ for i in {1..5}; do psql -qc 'insert into aztest select * from aztest'; done
$ psql -tc "select pg_size_pretty(pg_relation_size('aztest'))"
820 MB
$ psql -qc 'timing' -c 'analyze aztest'
Time: 590.820 ms  ## Actually best of 10: never get figures from a single run!
$ psql -c 'alter table aztest alter column aid set statistics 1000'
$ psql -qc 'timing' -c 'analyze aztest'
Time: 2324.017 ms ## as before, this is the fastest of 10 runs

As you can see, even a single column can change the analyze duration drastically. What can we do about this? The --analyze-in-stages is still a useful feature, so we want to set those columns back to a default value. While one could reset the stats and then set them again on each column via a bunch of ALTER TABLE calls, I find it easier to simply update the system catalogs directly. Specifically, the pg_attribute table contains a attstattarget column which has a positive value when a custom target is set. In our example above, the value of attstattarget for the aid column would be 1000. Here is a quick recipe to save the custom statistics values, set them to the default (-1), and then restore them all once the database-wide analyzing is complete:

## Save the values away, then reset to default:
CREATE TABLE custom_targets AS SELECT attrelid, attname, attnum, attstattarget
  FROM pg_atttribute WHERE attstattarget > 0;
UPDATE pg_attribute SET attstattarget = -1 WHERE attstattarget > 0;

## Safely run your database-wide analyze now
## All columns will use default_statistics_target

## Restore the values:
UPDATE pg_attribute a SET attstattarget = c.attstattarget
  FROM custom_targets c WHERE a.attrelid = c.attrelid
  AND a.attnum = c.attnum AND a.attname = c.attname;

## Bonus query: run all custom target columns in parallel:
SELECT 'vacuumdb --analyze-only -e -j 100 ' || 
  string_agg(format('-t "%I(%I)" ', attrelid::regclass, attname), NULL)
FROM pg_attribute WHERE attstattarget > 0;

As to the problems of not being able to pick the stage targets for --analyze-in-stages, and not being able to know when a stage has finished, the solution is to simply do it yourself. For example, to run all databases in parallel with a target of "2", you would need to change the default_statistics_target at the database level (via ALTER DATABASE), or at the cluster level (via ALTER SYSTEM). Then invoke vacuumdb, and reset the value:

$ psql -qc 'alter system set default_statistics_target = 2' -qc 'select pg_reload_conf()'
$ vacuumdb --all --analyze-only --jobs 100
$ psql -qc 'alter system reset default_statistics_target' -qc 'select pg_reload_conf()'

In summary, don't trust the given vacuumdb suggestions for a post-upgrade analyze. Instead, remove any per-column statistics, run it in parallel, and do whatever stages make sense for you.


published by (Josh Williams) on 2016-12-01 00:55:00 in the "postgres" category
We needed a way to suppress specific write commands on a Postgres streaming replica. The replica was set up for a DR configuration, with the applications able to be brought up into full service at a moment's notice. But since it's a hot standby, we'd like to still allow the applications to be used in a working but read-only state.

One of the applications on this database is MediaWiki, which worked great in this configuration. But a couple of the other apps have the classic behavior of updating its user object's "last login" field in one form or another when someone authenticates, which would cause the login process to fail entirely.

Of course we want updates to fail, up until that point when (knock on wood) the master server is declared down for the count and the official fail-over happens. Except for the one command that executes on login.

We don't really care about the "last login" type field -- the data is available through logs and other means. The affected apps could probably all be monkey patched to work around that part of the process. But we had a couple different apps doing this, and that adds a maintenance burden for each. And if we could figure out how to make it work at the database level then it'd work for all of them, plus anything else that might pop up.

The first thing we looked at was writing a trigger to intercept the commands, but triggers don't execute on a hot standby replica so that was out pretty quickly. The next hypothesis was that we could write a foreign data wrapper that'd just absorb the writes, or even just use postgres_fdw to route the commands to a local writable database that's more or less a throw-away data sink. But to our surprise, even writes to foreign tables get rejected on a hot standby. I'm slightly tempted to dig in and see what it'd take to enable that.

The third time was the charm: rules. Rules hook in pretty far down into the query parser, and they can be notoriously tricky to work with. But since they're embedded pretty deep, down to the point where views rely on them they're obeyed even on a replica.

So the technique was this: On the master (... obviously) we set up a separate schema, inside which a view was created with the same name as the target table and which had certain commands suppressed:


CREATE VIEW replica.users AS SELECT * FROM public.users;

CREATE RULE users_disable_update AS ON UPDATE TO replica.users DO INSTEAD NOTHING;

Plus any permission adjustments the app user might need. On the master server this schema and view are pretty much ignored, as the application user just uses the default search path. But on the replica, we adjust the default search path in postgresql.conf so that it applies to just that server:

search_path = '"$user",replica,public'
app@app:5432=> UPDATE "auth_user" SET "last_login" = now() WHERE "auth_user"."id" = 13;

It doesn't quite work everywhere, sadly! Notice the "UPDATE 0"? We found Django actually checks that, and panics with an error to the user when it tries to change something and the row count it gets back is different than what it expects.

Another caveat is that if the target table's schema changes, the view won't automatically follow. Presumably your user table doesn't receive changes all that often, but if you're applying this technique to something else, that might not be the case. Something to be aware of!

published by (Greg Sabino Mullane) on 2016-10-26 19:09:00 in the "postgres" category

Postgres has a wonderfully helpful (but often overlooked) feature called the connection service file (its documentation is quite sparse). In a nutshell, it defines connection aliases you can use from any client. These connections are given simple names, which then map behind the scenes to specific connection parameters, such as host name, Postgres port, username, database name, and many others. This can be an extraordinarily useful feature to have.

The connection service file is named pg_service.conf and is setup in a known location. The entries inside are in the common "INI file" format: a named section, followed by its related entries below it, one per line. To access a named section, just use the service=name string in your application.

## Find the file to access by doing:
$ echo `pg_config --sysconfdir`/pg_service.conf
## Edit the file and add a sections that look like this:

## Now you can access this database via psql:
$ psql service=foobar

## Or in your Perl code:
my $dbh = DBI->connect('dbi:Pg:service=foobar');

## Other libpq based clients are the same. JDBC, you are out of luck!

So what makes this feature awesome? First, it can save you from extra typing. No more trying to remember long hostnames (or copy and paste them). Second, it is better than a local shell alias, as the service file can be made globally available to all users. It also works similar to DNS, in that it insulates you from the details of your connections. Your hostname has changed because of a failover? No problem, just edit the one file, and no clients need to change a thing.

As seen above, the format of the file is simple: a named section, followed by connection parameters in a name=value format. Among the connection parameters one may use, the most common and useful are host, port, user, and dbname. Although you can set a password, I recommend against it, as that belongs in the more secure, per-user .pgpass file.

The complete list of what may be set can be found in the middle of the database connection documentation page. Most of them will seldom, if ever, be used in a connection service file.

The connection service file is not just limited to basic connections. You can have sections that only differ by user, for example, or in their SSL requirements, making it easy to switch things around by a simple change in the service name. It's also handy for pgbouncer connections, which typically run on non-standard ports. Be creative in your service names, and keep them distinct from each other to avoid fat fingering the wrong one. Comments are allowed and highly encouraged. Here is a slightly edited service file that was recently created while helping a client use Bucardo to migrate a Postgres database from Heroku to RDS:

## Bucardo source: Heroku

## Bucardo target: RDS

## Test database on RDS

## Hot standby used for schema population

You may notice above that "connect_timeout" is repeated in each section. Currently, there is no way to set a parameter that applies to all sections, but it's a very minor problem. I also usually set the environment variable PGCONNECT_TIMEOUT to 10 in by .bashrc, but putting it in the pg_service.conf file ensures it is always set regardless of what user I am.

One of the trickier parts of using a service file can be figuring out where the file should be located! Postgres will check for a local service file (named $USER/.pg_service.conf) and then for a global file. I prefer to always use the global file, as it allows you to switch users with ease and maintain the same aliases. By default, the location of the global Postgres service file is /usr/local/etc/postgresql/pg_service.conf, but in most cases this is not where you will find it, as many distributions specify a different location. Although you can override the location of the file with the environment variable PGSERVICEFILE and the directory holding the pg_service.conf file with the PGSYSCONFIDIR environment variable, I do not like relying on those. One less thing to worry about by simply using the global file.

The location of the global pg_service.conf file can be found by using the pg_config program and looking for the SYSCONFDIR entry. Annoyingly, pg_config is not installed by default on many systems, as it is considered part of the "development" packages (which may be named postgresql-devel, libpq-devel, or libpq-dev). While using pg_config is the best solution, there are times it cannot be installed (e.g. working on an important production box, or simply do not have root). While you can probably discover the right location through some simple investigation and trial-and-error, where is the fun in that? Here are two other methods to determine the location using nothing but psql and some standard Unix tools.

When you invoke psql with a request for a service file entry, it has to look for the service files. We can use this information to quickly find the expected location of the global pg_service.conf file. If you have the strace program installed, just run psql through strace, grep for "pg_service", and you should see two stat() calls pop up: one for the per-user service file, and one for the global service file we are looking for:

$ strace psql service=foobar 2>&1 | grep service.conf
stat("/home/greg/.pg_service.conf", 0x3526366F6637) = -1 ENOENT (No such file or directory)
stat("/var/opt/etc/postgres/pg_service.conf", 0x676F746F3131) = -1 ENOENT (No such file or directory)

What if strace is not installed? Well, perhaps gdb (the GNU debugger) can help us out:

$ gdb -q --args psql service=foobar
Reading symbols from psql...(no debugging symbols found)...done.
(gdb) start
Temporary breakpoint 1 at 0x435356
Starting program: /usr/local/bin/psql service=foobar
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/".

Temporary breakpoint 1, 0x4452474E4B4C5253 in main ()
(gdb) catch syscall stat
Catchpoint 2 (syscall 'stat' [4])
(gdb) c

Catchpoint 2 (call to syscall stat), 0x216c6f65736a6f72 in __GI___xstat (vers=, name=0x616d7061756c "/usr/local/bin/psql", buf=0x617274687572)
    at ../sysdeps/unix/sysv/linux/wordsize-64/xstat.c:35
35      return INLINE_SYSCALL (stat, 2, name, buf);
(gdb) c 4
Will ignore next 3 crossings of breakpoint 2.  Continuing.

Catchpoint 2 (call to syscall stat), 0x37302B4C49454245 in __GI___xstat (vers=, name=0x53544F442B4C "/var/opt/etc/postgres/pg_service.conf", buf=0x494543485445)
    at ../sysdeps/unix/sysv/linux/wordsize-64/xstat.c:35
35      return INLINE_SYSCALL (stat, 2, name, buf);
(gdb) quit

The use of a connection service file can be a nice addition to your tool chest, especially if you find yourself connecting from many different accounts, or if you just want to abstract away all those long, boring host names!


published by (Greg Sabino Mullane) on 2016-09-19 17:05:00 in the "postgres" category

Sometimes, the elephant gets hurt - inducing database errors! Data corruption is a fact of life in working with computers, and Postgres is not immune. With the addition of the "data checksums" feature, detecting such corruption is now much easier. But detection is not enough - what happens after the corruption is detected? What if Postgres could fix the problem all by itself - what if we could give the elephant a mutant healing power?!?

Now we can. I wrote an extension named pg_healer that does just that - detects corruption issues, and automatically repairs them. Let's see how it works with a demonstration. For this, we will be purposefully corrupting the "pgbench_branches" table, part of the venerable pgbench utility.

For the initial setup, we will create a new Postgres cluster and install the pgbench schema. The all-important checksum feature needs to be enabled when we initdb, and we will use a non-standard port for testing:

$ initdb --data-checksums dojo
The files belonging to this database system will be owned by user "greg".
Data page checksums are enabled.

creating directory dojo ... ok
creating subdirectories ... ok
$ echo port=9999 >> dojo/postgresql.conf
$ pg_ctl start -D dojo -l log.dojo.txt
server starting
$ createdb -p 9999 $USER
$ pgbench -p 9999 -i
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 100000 tuples (100%) done (elapsed 0.35 s, remaining 0.00 s)
set primary keys...

Next, we install the pg_healer extension. As it needs to access some low-level hooks, we need to load it on startup, by adding a line to the postgresql.conf file:

$ git clone git://
Cloning into 'pg_healer'...
$ cd pg_healer
$ make install
gcc -Wall -Wmissing-prototypes ... -c -o pg_healer.o pg_healer.c
gcc -Wall -Wmissing-prototypes ... -shared -o pg_healer.o
$ echo "shared_preload_libraries = 'pg_healer'" >> dojo/postgresql.conf
$ pg_ctl restart -D dojo -l log.dojo.txt
waiting for server to shut down.... done
server stopped
server starting
## Make sure the extension has loaded cleanly.
## If it did not, the log file would complain
$ tail -2 log.dojo.txt
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

Now for the fun part. We want to purposefully corrupt the file containing the data for the pgbench_branches file, in simulation of a failing hard drive or other really serious problem. The type of problem that normally causes the DBA to get paged in the middle of the night. Before we do that, we want to take a peek at the contents of that table, and then find out which actual disk files contain the table:

$ psql -p 9999 -c "select * from pgbench_branches"
 bid | bbalance | filler 
   1 |        0 | 
(1 row)

$ psql -p 9999 -Atc "select format('%s/%s',

## That file is too cumbersome to keep typing out, so:
$ ln -s /home/greg/pg_healer/dojo/base/16384/198461 myrelfile

Let's throw a deadly shuriken right into the middle of it!

## Here is what the file looks like in its original uncorrupted form
## (checksum is in red):
$ xxd -a -g1 -u myrelfile
00000000: 00 00 00 00 30 69 BC 37 74 66 04 00 1C 00 E0 1F  ....0i.7tf......
00000010: 00 20 04 20 00 00 00 00 E0 9F 40 00 00 00 00 00  . . ......@.....
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001fe0: F7 0B 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001ff0: 01 00 03 00 01 09 18 03 01 00 00 00 00 00 00 00  ................

## Good ol' dd is the right tool for the job here:
$ echo -n "Shuriken!" | dd conv=notrunc oflag=seek_bytes seek=4000 bs=9 count=1 of=myrelfile
1+0 records in
1+0 records out
9 bytes (9 B) copied, 0.000156565 s, 57.5 kB/s

## Take a peek inside the file to make sure the shuriken got embedded deeply:
$ xxd -a -g1 -u myrelfile
00000000: 00 00 00 00 30 69 BC 37 74 66 04 00 1C 00 E0 1F  ....0i.7tf......
00000010: 00 20 04 20 00 00 00 00 E0 9F 40 00 00 00 00 00  . . ......@.....
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000fa0: 53 68 75 72 69 6B 65 6E 21 00 00 00 00 00 00 00  Shuriken!.......
00000fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001fe0: F7 0B 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001ff0: 01 00 03 00 01 09 18 03 01 00 00 00 00 00 00 00  ................

These shurikens are not so deadly, but quite yummy!
(photograph by kahvikisu)

Now that we've messed up the file, watch closely at what happens when we try to read from it. We are going to do this three times. The first time, the table will still be in the shared buffer cache, and thus will show no error. The second time, the table will be read from the disk and throw an error. At this point, pg_healer will see the error and repair it. The final read will pull from the completely healed table:

$ psql -p 9999 -c "select * from pgbench_branches"
 bid | bbalance | filler 
   1 |        0 | 
(1 row)

## This will force the table out of shared_buffers, so that the next
## time it is accessed, Postgres must read from the disk:
$ psql -p 9999 -qtc "select pg_healer_remove_from_buffer('pgbench_branches')"

$ psql -p 9999 -c "select * from pgbench_branches"
WARNING:  page verification failed, calculated checksum 9478 but expected 26228
INFO:  File has been healed: base/16384/198461 (intrinsic healing)
ERROR:  invalid page in block 0 of relation base/16384/198461

## Mutant healing power was activated. Observe:
$ psql -p 9999 -c "select * from pgbench_accounts"
 bid | bbalance | filler 
   1 |        0 | 
(1 row)

The corruption we created before changed the "free space" of the Postgres "page" structure. There are multiple ways pg_healer can fix things: this demonstrates one of the "intrinsic" fixes, which require no external knowledge to fix. Corruption can occur anywhere on the page, of course, including inside your data (as opposed to the meta-data or free space). One of the methods of fixing this is for pg_healer to use another copy of the table to try and repair the original table.

While eventually pg_healer will be able to reach out to replicas for a copy of the (non-corrupted) table data it needs, a simpler method is to simply create a good copy inside the data directory. There is a helper function that does just that, by copying the important files to a new directory. Details on how this is kept refreshed will be covered later; for now, let's see it in action and observe how it can help Postgres heal itself from more serious corruption problems:

$ psql -p 9999 -c 'create extension pg_healer'
$ psql -p 9999 -qc 'checkpoint'
$ psql -p 9999 -c 'select pg_healer_cauldron()'

Rather than free space, let's corrupt something a little more important: the line pointers, which indicate where, inside the page, that each tuple (aka table row) is located. Extremely critical information, that is about to get blown away with another deadly shuriken!

$ echo -n "Shuriken!" | dd conv=notrunc oflag=seek_bytes seek=20 bs=9 count=1 of=myrelfile
1+0 records in
1+0 records out
9 bytes (9 B) copied, 9.3577e-05 s, 96.2 kB/s
$ xxd -a -g1 -u myrelfile
00000000: 00 00 00 00 30 69 BC 37 74 66 04 00 1C 00 E0 1F  ....0i.7tf......
00000010: 00 20 04 20 53 68 75 72 69 6B 65 6E 21 00 00 00  . . Shuriken!...
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001fe0: F7 0B 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001ff0: 01 00 03 00 01 09 18 03 01 00 00 00 00 00 00 00  ................

$ psql -p 9999 -qtc "select pg_healer_remove_from_buffer('pgbench_branches')"

$ psql -p 9999 -c "select * from pgbench_branches"
WARNING:  page verification failed, calculated checksum 8393 but expected 26228
INFO:  File has been healed: base/16384/198461 (external checksum match)
ERROR:  invalid page in block 0 of relation base/16384/198461

$ psql -p 9999 -c "select * from pgbench_branches"
 bid | bbalance | filler 
   1 |        0 | 
(1 row)

## Has the shuriken really been removed?
$ xxd -a -g1 -u myrelfile
00000000: 00 00 00 00 30 69 BC 37 74 66 04 00 1C 00 E0 1F  ....0i.7tf......
00000010: 00 20 04 20 00 00 00 00 E0 9F 40 00 00 00 00 00  . . ......@.....
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001fe0: F7 0B 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001ff0: 01 00 03 00 01 09 18 03 01 00 00 00 00 00 00 00  ................

Once again, pg_healer has repaired the file. This time, however, it reached out to a version of the file outside the data directory, copied the old page data to the new page data, and then used the checksum to confirm that the changes were correct. This method only works, however, if the original file and the copy have the same checksum - which means that no changes have been made since the copy was made via pg_healer_cauldron(). As this is not always possible, there is a third method pg_healer can use, which is to examine things row by row and to try and repair the damage.

For this final demo, we are going to change the table by adding a new row, which ensures that the checksums against the copy will no longer match. After that, we are going to add some corruption to one of the table rows (aka tuples), and see if pg_healer is able to repair the table:

$ psql -p 9999 -qtc 'insert into pgbench_branches values (2,12345)'
$ psql -p 9999 -qc 'checkpoint'

## Throw a shuriken right into an active row!
$ echo -n "Shuriken!" | dd conv=notrunc oflag=seek_bytes seek=8180 bs=9 count=1 of=myrelfile
1+0 records in
1+0 records out
9 bytes (9 B) copied, 0.000110317 s, 81.6 kB/s

## If you look close, you will notice the checksum (in red) has also changed:
$ xxd -a -g1 -u myrelfile
00000000: 00 00 00 00 70 B0 8E 38 A4 8E 00 00 20 00 C0 1F  ....p..8.... ...
00000010: 00 20 04 20 00 00 00 00 E0 9F 40 00 C0 9F 40 00  . . ......@...@.
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001fc0: 05 0C 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001fd0: 02 00 03 00 01 08 18 03 02 00 00 00 39 30 00 00  ............90..
00001fe0: F7 0B 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001ff0: 01 00 03 00 53 68 75 72 69 6B 65 6E 21 00 00 00  ....Shuriken!...

$ psql -p 9999 -qtc "select pg_healer_remove_from_buffer('pgbench_branches')"

$ psql -p 9999 -c "select * from pgbench_branches"
WARNING:  page verification failed, calculated checksum 56115 but expected 36516
INFO:  File has been healed: base/16384/198461 (external tuple healing)
ERROR:  invalid page in block 0 of relation base/16384/198461

$ psql -p 9999 -c "select * from pgbench_branches"
 bid | bbalance | filler 
   1 |        0 | 
   2 |    12345 | 
(2 rows)

There are still some rough edges, but for a proof of concept it works quite nicely. While reacting to corruption errors as they appear is nice, in the future I would like it to be more proactive, and run as a background process that scans the database for any problems and fixes them. Ideally, it should be able to handle a wider class of table corruption problems, as well as problems in indexes, free space maps, system catalogs, etc. Please jump in and lend a hand - the project is on github as pg_healer.

Data corruption is a fact of life DBAs must confront, be it from failing hard drives, cosmic rays, or other reason. While the detection of such errors was greatly improved in Postgres 9.3 with the --data-checksums argument to initdb (which ought to default on!), it's time to not just detect, but heal!


published by (Greg Sabino Mullane) on 2016-07-14 03:46:00 in the "postgres" category

Constraints in Postgres are very powerful and versatile: not only are foreign keys, primary keys, and column uniqueness done internally via constraints, but you may create your own quite easily (at both the column and table level). Most of the time constraints are simply set and forget, but there is one time constraints may become a problem: copying the database using the pg_dump program.

The issue is that constraints are usually added *before* the data is copied to the new table via the COPY command. This means the constraint fires for each added row, to make sure that the row passes the conditions of the constraint. If the data is not valid, however, the COPY will fail, and you will not be able to load the output of your pg_dump into a new database. Further, there may be a non-trivial performance hit doing all that validation. Preventing the constraint from firing may provide a significant speed boost, especially for very large tables with non-trivial constraints.

Let's explore one way to work around the problem of pg_dump failing to work because some of the data is not valid according to the logic of the constraints. While it would be quicker to make some of these changes on the production system itself, corporate inertia, red tape, and the usual DBA paranoia means a better way is to modify a copy of the database instead.

For this example, we will first create a sample "production" database and give it a simple constraint. This constraint is based on a function, to both emulate a specific real-world example we came across for a client recently, and to allow us to easily create a database in which the data is invalid with regards to the constraint:

dropdb test_prod; createdb test_prod
pgbench test_prod -i -n
creating tables...
100000 of 100000 tuples (100%) done (elapsed 0.82 s, remaining 0.00 s)
set primary keys...
psql test_prod -c 'create function valid_account(int) returns bool language sql immutable as $$ SELECT $1 > 0$$;'
psql test_prod -c 'alter table pgbench_accounts add constraint good_aid check ( valid_account(aid) )'

Note that the constraint was added without any problem, as all of the values in the aid column satisfy the function, as each one is greater than zero. Let's tweak the function, such that it no longer represents a valid, up to date constraint on the table in question:

## Verify that the constraint is working - we should get an error:
psql test_prod -c 'update pgbench_accounts set aid = -1 where aid = 1'
ERROR:  new row for relation "pgbench_accounts" violates check constraint "good_aid"
DETAIL:  Failing row contains (-1, 1, 0,                                         ...).

## Modify the function to disallow account ids under 100. No error is produced!
psql test_prod -c 'create or replace function valid_account(int) returns bool language sql volatile as $$ SELECT $1 > 99$$'

## The error is tripped only when we violate it afresh:
psql test_prod -c 'update pgbench_accounts SET aid=125 WHERE aid=125'
psql test_prod -c 'update pgbench_accounts SET aid=88 WHERE aid=88'
ERROR:  new row for relation "pgbench_accounts" violates check constraint "good_aid"
DETAIL:  Failing row contains (88, 1, 0,                                         ...).

The volatility was changed from IMMUTABLE to VOLATILE simply to demonstrate that a function called by a constraint is not bound to any particular volatility, although it *should* always be IMMUTABLE. In this example, it is a moot point, as our function can be immutable and still be "invalid" for some rows in the table. Owing to our function changing its logic, we now have a situation in which a regular pg_dump cannot be done:

dropdb test_upgraded; createdb test_upgraded
pg_dump test_prod | psql test_upgraded -q
ERROR:  new row for relation "pgbench_accounts" violates check constraint "good_aid"
DETAIL:  Failing row contains (1, 1, 0,                                          ...).
CONTEXT:  COPY pgbench_accounts, line 1: "1             1   0          "
## Ruh roh!

Time for a workaround. When a constraint is created, it may be declared as NOT VALID, which simply means that it makes no promises about the *existing* data in the table, but will start constraining any data changed from that point forward. Of particular importance is the fact that pg_dump can dump things into three sections, "pre-data", "data", and "post-data". When a normal constraint is dumped, it will go into the pre-data section, and cause the problems seen above when the data is loaded. However, a constraint that has been declared NOT VALID will appear in the post-data section, which will allow the data to load, as it will not be declared until after the "data" section has been loaded in. Thus, our workaround will be to move constraints from the pre-data to the post-data section. First, let's confirm the state of things by making some dumps from the production database:

pg_dump test_prod --section=pre-data -x -f test_prod.pre.sql
pg_dump test_prod --section=post-data -x -f
## Confirm that the constraint is in the "pre" section:
grep good_aid test*sql
test_prod.pre.sql:    CONSTRAINT good_aid CHECK (valid_account(aid))

There are a few ways around this constraint issue, but here is one that I like as it makes no changes at all to production, and produces valid SQL files that may be used over and over.

dropdb test_upgraded; createdb test_upgraded
## Note that --schema-only is basically the combination of pre-data and post-data
pg_dump test_prod --schema-only | psql test_upgraded -q
## Save a copy so we can restore these to the way we found them later:
psql test_upgraded -c "select format('update pg_constraint set convalidated=true where conname=%L and connamespace::regnamespace::text=%L;', 
  conname, nspname) from pg_constraint c join pg_namespace n on (n.oid=c.connamespace) 
  where contype ='c' and convalidated" -t -o restore_constraints.sql
## Yes, we are updating the system catalogs. Don't Panic!
psql test_upgraded -c "update pg_constraint set convalidated=false where contype='c' and convalidated"
## Why 3? The information_schema "schema" has two harmless constraints
pg_dump test_upgraded --section=pre-data -x -o test_upgraded.pre.sql
pg_dump test_upgraded --section=post-data -x -o
## Verify that the constraint has been moved to the "post" section:
grep good test*sql
test_prod.pre.sql:    CONSTRAINT good_aid CHECK (valid_account(aid)) Name: good_aid; Type: CHECK CONSTRAINT; Schema: public; Owner: greg    ADD CONSTRAINT good_aid CHECK (valid_account(aid)) NOT VALID;
## Two diffs to show the inline (pre) versus ALTER TABLE (post) constraint creations:
$ diff -u1 test_prod.pre.sql test_upgraded.pre.sql 
--- test_prod.pre.sql        2016-07-04 00:10:06.676766984 -0400
+++ test_upgraded.pre.sql    2016-07-04 00:11:07.978728477 -0400
@@ -54,4 +54,3 @@
     abalance integer,
-    filler character(84),
-    CONSTRAINT good_aid CHECK (valid_account(aid))
+    filler character(84)

$ diff -u1 
---        2016-07-04 00:11:48.683838577 -0400
+++    2016-07-04 00:11.57.265797869 -0400
@@ -17,2 +17,10 @@
+-- Name: good_aid; Type: CHECK CONSTRAINT; Schema: public; Owner: greg
+ALTER TABLE pgbench_accounts
+    ADD CONSTRAINT good_aid CHECK (valid_account(aid)) NOT VALID;
 SET default_tablespace = '';

Now we can simply sandwich our data load between the new pre and post files, and avoid having the constraints interfere with the data load portion at all:

dropdb test_upgraded; createdb test_upgraded
psql test_upgraded -q -f test_upgraded.pre.sql
pg_dump test_prod --section=data | psql test_upgraded -q
psql test_upgraded -q -f
## As the final touch, make all the constraints we changed exactly how each were before:
psql test_upgraded -f restore_constraints.sql

A final sanity check is always a good idea, to make sure the two databases are identical, despite our system catalog tweaking:

diff -s <(pg_dump test_prod) <(pg_dump test_upgraded)
Files /dev/fd/63 and /dev/fd/62 are identical

Although we declared a goal of having the upgraded database match production as closely as possible, you can always not apply that final restore_constraints.sql file and leave the constraints as NOT VALID, which is a better reflection of the reality of things. It also means you will not have to go through this rigmarole again, as those constraints shall forevermore be put into the post-data section when doing a pg_dump (unless someone runs the ALTER TABLE ... VALIDATE CONSTRAINT ... command!).

While there is no direct way to disable constraints when loading data, using this pre-data to post-data trick can not only boost data load times, but get you out of a potential jam when your data is invalid!


published by (Greg Sabino Mullane) on 2016-06-13 20:47:00 in the "postgres" category

(A Unicode rabbit face 🐰 will never be as cute
as this real bunny. Photo by Wade Simmons)

One of our clients recently reached out to us for help in upgrading their Postgres database. The use of the pg_upgrade program was not an option, primarily because the client was also taking the opportunity to change from their SQL_ASCII encoding to UTF-8. (If any of your databases, gentle reader, are still SQL_ASCII, please do the same!). Naturally, we also took advantage of the lack of pg_upgrade to enable the use of data checksums, another action we highly recommend. Although there were plenty of wrinkles, and stories to be told about this migration/upgrade, I wanted to focus on one particular problem we had: how to detect if a table has changed.

We needed to know if any applications were modifying certain tables because the speed of the migration was very important. If we could assert that no changes were made, there were some shortcuts available that would greatly speed things up. Initial testing showed that the migration was taking over eight hours, a time unacceptable to the client (no worries, we eventually reduced the time to under an hour!).

Looking closer, we found that over half that time was spent converting a single small (50MB) table from SQL_ASCII to UTF-8. How this conversion was performed is a story for another day, but suffice to say the table had some really, really messy bytes inside of it; the conversion program had to struggle mightily. When you are converting a database to a new encoding, it is imperative to examine every byte and make sure it gets changed to a format that Postgres will accept as valid UTF-8, or the entire table import will fail with an error similar to this:

ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

Looking closer at the data in the table showed that it might - just might! - be a historical table. In other words, it no longer receives updates, just selects. We really wanted this to be true, for it meant we could dump the whole table, convert it, and simply load the converted table into the new database (which took only a few seconds!). First, however, we had to confirm that the table was not changing.

Detecting changes may be done in several ways. For all of them, you can never prove that the table shall not change at some point in the future, but you can prove that it has not changed over a certain period of time. How you go about doing that depends on what kind of access you have. If you do not have super-user access, you could add a simple trigger to the table that updates another table when a update, insert, or delete is performed. Then, checking in on the second table will indicate if any changes have been made.

A better solution is to simply look at the underlying file that makes up the table. To do this, you need be a Postgres superuser or have access to the underlying operating system. Basically, we will trust the operating system's information on when the table was last changed to determine if the table itself has changed. Although not foolproof, it is an excellent solution. Let's illustrate it here. First: create a test table and add some rows:

$ psql
greg=# CREATE TABLE catbox AS SELECT 8675309::INT AS id FROM generate_series(1,1000);

Now we can use the pg_stat_file() function, which returns some basic information about a file on disk. With the help of the pg_relation_filepath() function, we can see when the table was last modified:

greg=# select * from pg_stat_file( pg_relation_filepath('catbox') ) xg
Expanded display is on.
-[ RECORD 1 ]+-----------------------
size         | 40960
access       | 2015-11-08 22:36:00-04
modification | 2015-11-08 22:36:00-04
change       | 2015-11-08 22:36:00-04
creation     | 
isdir        | f

Next we will revisit the table after some time (e.g. 24 hours) and see if the "modification" timestamp is the same. If it is, then the table has not been modified either. Unfortunately, the possibility of a false positive is possible due to VACUUM, which may change things on disk but does NOT change the data itself. (A regular VACUUM *may* modify the file, and a VACUUM FULL *always* modifies it).

greg=# select * from pg_stat_file( pg_relation_filepath('catbox') ) xg

-[ RECORD 1 ]+-----------------------
size         | 40960
access       | 2015-11-08 22:36:00-04
modification | 2015-11-08 22:36:00-04
change       | 2015-11-08 22:36:00-04
creation     | 
isdir        | f

greg=# vacuum catbox;

greg=# select * from pg_stat_file( pg_relation_filepath('catbox') );

2016-06-09 22:53:24-04
-[ RECORD 1 ]+-----------------------
size         | 40960
access       | 2015-11-08 22:36:00-04
modification | 2015-11-08 22:40:14-04
change       | 2015-11-08 22:40:14-04
creation     | 
isdir        | f

A second (and more foolproof) method is to simply generate a checksum of the entire table. This is a fairly straightforward approach; just pump the output of pg_dump to a checksum program:

$ pg_dump -t catbox --data-only | sha1sum
6f724565656f455072736e44646c207472536e61  -

The advantage here is that even a VACUUM FULL will not change the checksum. However, because pg_dump does no ORDER BY when dumping out the table, it is possible for the rows to be returned in a different order. To work around that, issue a VACUUM FULL yourself before taking the checksum. As before, come back later (e.g. 24 hours) and re-run the command. If the checksums match, then the table has not changed (and is probably no longer updated by the application). By using this method, we were able to verify that the large, SQL_ASCII byte-soup table was indeed not being updated, and thus we took it out of the direct migration.

Of course, that table needed to be part of the new database, but we simply dumped the table, ran the conversion program on it, and (four hours later), had a complete dump of the table that loads extremely fast into the new database.

That solved only one of the problems, however; another table was also slowing down the migration. Although it did not have the SQL_ASCII conversion issue, it was a large table, and took a large percentage of the remaining migration time. A quick look at this table showed it had a "creation_time" column as well as a SERIAL primary key, and was obviously being updated quite often. Close examination showed that it was possible this was an append-only table, such that older rows were never updated. This called for a similar approach: could we prove that a large chunk of the table was not changing? If we could, we could pre-populate the new database and copy over only the most recent rows during the migration, saving a good bit of time.

The previous tricks would not work for this situation, because the underlying file would change constantly as seen by pg_stat_file(), and a pg_dump checksum would change on every insert. We needed to analyze a slice of the table - in this particular case, we wanted to see about checksumming all rows except those created in the last week. As a primary key lookup is very fast, we used the "creation_time" column to determine an approximate primary key to start with. Then it was simply a matter of feeding all those rows into the sha1sum program:

greg=# CREATE TABLE catbox2 (id SERIAL PRIMARY KEY, creation_time TIMESTAMPTZ);
greg=# INSERT INTO catbox2(creation_time) select now() - '1 year'::interval + (x* '1 hour'::interval) from generate_series(1,24*365) x;
INSERT 0 8760

greg=# select * from catbox2 where creation_time > now()-'1 week'::interval order by 1 limit 1
  id  |         creation_time         
 8617 | 2016-06-11 10:51:00.101971-08

$ psql -Atc "select * from catbox2 where id < 8617 order by 1" | sha1sum
456272656d65486e6f203139353120506173733f  -

## Add some rows to emulate the append-only nature of this table:
greg=# insert into catbox2(creation_time) select now() from generate_series(1,1000)
INSERT 0 1000

## Checksums should still be identical:
$ psql -Atc "select * from catbox2 where id < 8617 order by 1" | sha1sum
456272656d65486e6f203139353120506173733f  -

Despite the large size of this table (around 10 GB), this command did not take that long to run. A week later, we ran the same commands, and got the same checksum! Thus, we were able to prove that the table was mostly append-only - or at least enough for our use case. We copied over the "old" rows, then copied over the rest of the rows during the critical production migration window.

In the future, this client will able to take advantage of pg_upgrade, but getting to UTF-8 and data checksums was absolutely worth the high one-time cost. There were several other tricks used to speed up the final migration, but being able to remove the UTF-8 conversion of the first table, and being able to pre-copy 99% of the second table accounted for the lion's share of the final speed improvements.


published by (Greg Sabino Mullane) on 2016-04-29 00:04:00 in the "postgres" category

Postgres has a wonderful feature called concurrent indexes. It allows you to create indexes on a table without blocking reads OR writes, which is quite a handy trick. There are a number of circumstances in which one might want to use concurrent indexes, the most common one being not blocking writes to production tables. There are a few other use cases as well, including:

Photograph by Nicholas A. Tonelli

  • Replacing a corrupted index
  • Replacing a bloated index
  • Replacing an existing index (e.g. better column list)
  • Changing index parameters
  • Restoring a production dump as quickly as possible

In this article, I will focus on that last use case, restoring a database as quickly as possible. We recently upgraded a client from a very old version of Postgres to the current version (9.5 as of this writing). The fact that use of pg_upgrade was not available should give you a clue as to just how old the "very old" version was!

Our strategy was to create a new 9.5 cluster, get it optimized for bulk loading, import the globals and schema, stop write connections to the old database, transfer the data from old to new, and bring the new one up for reading and writing.

The goal was to reduce the application downtime as much as reasonably possible. To that end, we did not want to wait until all the indexes were created before letting people back in, as testing showed that the index creations were the longest part of the process. We used the "--section" flags of pg_dump to create pre-data, data, and post-data sections. All of the index creation statements appeared in the post-data file.

Because the client determined that it was more important for the data to be available, and the tables writable, than it was for them to be fully indexed, we decided to try using CONCURRENT indexes. In this way, writes to the tables could happen at the same time that they were being indexed - and those writes could occur as soon as the table was populated. That was the theory anyway.

The migration went smooth - the data was transferred over quickly, the database was restarted with a new postgresql.conf (e.g. turn fsync back on), and clients were able to connect, albeit with some queries running slower than normal. We parsed the post-data file and created a new file in which all the CREATE INDEX commands were changed to CREATE INDEX CONCURRENTLY. We kicked that off, but after a certain amount of time, it seemed to freeze up.

The frogurt is also cursed.

Looking closer showed that the CREATE INDEX CONCURRENTLY statement was waiting, and waiting, and never able to complete - because other transactions were not finishing. This is why concurrent indexing is both a blessing and a curse. The concurrent index creation is so polite that it never blocks writers, but this means processes can charge ahead and be none the wiser that the create index statement is waiting on them to finish their transaction. When you also have a misbehaving application that stays "idle in transaction", it's a recipe for confusion. (Idle in transaction is what happens when your application keeps a database connection open without doing a COMMIT or ROLLBACK). A concurrent index can only completely finish being created once any transaction that has referenced the table has completed. The problem was that because the create index did not block, the app kept chugging along, spawning new processes that all ended up in idle in transaction.

At that point, the only way to get the concurrent index creation to complete was to forcibly kill all the other idle in transaction processes, forcing them to rollback and causing a lot of distress for the application. In contrast, a regular index creation would have caused other processes to block on their first attempt to access the table, and then carried on once the creation was complete, and nothing would have to rollback.

Another business decision was made - the concurrent indexes were nice, but we needed the indexes, even if some had to be created as regular indexes. Many of the indexes were able to be completed (concurrently) very quickly - and they were on not-very-busy tables - so we plowed through the index creation script, and simply canceled any concurrent index creations that were being blocked for too long. This only left a handful of uncreated indexes, so we simply dropped the "invalid" indexes (these appear when a concurrent index creation is interrupted), and reran with regular CREATE INDEX statements.

The lesson here is that nothing comes without a cost. The overly polite concurrent index creation is great at letting everyone else access the table, but it also means that large complex transactions can chug along without being blocked, and have to have all of their work rolled back. In this case, things worked out as we did 99% of the indexes as CONCURRENT, and the remaining ones as regular. All in all, the use of concurrent indexes was a big win, and they are still an amazing feature of Postgres.


published by (Szymon Lipi?ski) on 2016-04-07 20:12:00 in the "postgres" category

The new PostgreSQL 9.5 release has a bunch of great features. I describe below the ones I find most interesting.


UPSERT is simply a combination of INSERT and UPDATE. This works like this: if a row exists, then update it, if it doesn't exist, create it.

Before Postgres 9.5 when I wanted to insert or update a row, I had to write this:

INSERT INTO test(username, login)
SELECT 'hey', 'ho ho ho'
WHERE NOT EXISTS (SELECT 42 FROM test WHERE username='hey');

UPDATE test SET login='ho ho ho' WHERE username='hey' AND login <> 'ho ho ho';

Which was a little bit problematic. You need to make two queries, and both can have quite complicated WHERE clauses.

In PostgreSQL 9.5 there is much simpler version:

INSERT INTO test(username, login) VALUES ('hey', 'ho ho ho')
ON CONFLICT (username)
DO UPDATE SET login='ho ho ho';

The only requirement is that there should be a UNIQUE constraint on a column which should fail while inserting a row.

The version above makes the UPDATE when the INSERT fails. There is also another form of the UPSERT query, which I used in this blog post. You can just ignore the INSERT failure:

INSERT INTO test(username, login) VALUES ('hey', 'ho ho ho')
ON CONFLICT (username)

Switching Tables to Logged and Unlogged

PostgreSQL keeps a transaction write ahead log, which helps restore the database after a crash, and is used in replication, but it comes with some overhead, as additional information must be stored on disk.

In PostgreSQL 9.5 you can simply switch a table from logged to unlogged. The unlogged version can be much faster when filling it with data, processing it etc. However at the end of such operations it might be good to make it a normal logged table. Now it is simple:


JSONB Operators and Functions

This is the binary JSON type, and these new functions allow us to perform more operations without having to convert our data first to the slower, non-binary JSON alternative.

Now you can remove a key from a JSONB value:

SELECT '{"a": 1, "b": 2, "c": 3}'::jsonb || '{"x": 1, "y": 2, "c": 42}'::jsonb;

 {"b": 2, "c": 3}

And merge JSONB values (the last value's keys overwrite the first's one):

SELECT '{"a": 1, "b": 2, "c": 3}'::jsonb || '{"x": 1, "y": 2, "c": 42}'::jsonb;

 {"a": 1, "b": 2, "c": 42, "x": 1, "y": 2}

And we have the nice jsonb_pretty() function which instead of this:

SELECT jsonb_set('{"name": "James", "contact": {"phone": "01234 567890",
                   "fax": "01987 543210"}}'::jsonb,
                   '{contact,phone}', '"07900 112233"'::jsonb);

 {"name": "James", "contact": {"fax": "01987 543210", "phone": "07900 112233"}}

prints this:

SELECT jsonb_pretty(jsonb_set('{"name": "James", "contact": {"phone": "01234 567890",
                   "fax": "01987 543210"}}'::jsonb,
                   '{contact,phone}', '"07900 112233"'::jsonb));

  {                              ?
      "name": "James",           ?
      "contact": {               ?
          "fax": "01987 543210", ?
          "phone": "07900 112233"?
      }                          ?

More Information

There are more nice features in the new PostgreSQL 9.5. You can read the full list at'snewinPostgreSQL9.5


published by (Szymon Lipi?ski) on 2016-03-03 12:17:00 in the "postgres" category

In the previous posts I have described a simple database table for storing JSON values, and a way to unpack nested JSON attributes into simple database views. This time I will show how to write a very simple query (thanks to PostgreSQL 9.5) to load the JSON files

Here's a simple Python script to load the database.

This script is made for PostgreSQL 9.4 (in fact it should work for 9.5 too, but is not using a nice new 9.5 feature described below).

#!/usr/bin/env python

import os
import sys
import logging

    import psycopg2 as pg
    import psycopg2.extras
    print "Install psycopg2"

    import progressbar
    print "Install progressbar2"

import json

import logging
logger = logging.getLogger()

PG_CONN_STRING = "dbname='blogpost' port='5433'"

data_dir = "data"
dbconn = pg.connect(PG_CONN_STRING)"Loading data from '{}'".format(data_dir))

cursor = dbconn.cursor()

counter = 0
empty_files = []

class ProgressInfo:

    def __init__(self, dir):
        files_no = 0
        for root, dirs, files in os.walk(dir):
            for file in files:
                if file.endswith(".json"):
                    files_no += 1
        self.files_no = files_no
        print "Found {} files to process".format(self.files_no) = progressbar.ProgressBar(maxval=self.files_no,
                                           widgets=[' [', progressbar.Timer(), '] [', progressbar.ETA(), '] ', progressbar.Bar(),])

    def update(self, counter):

pi = ProgressInfo(os.path.expanduser(data_dir))

for root, dirs, files in os.walk(os.path.expanduser(data_dir)):
    for f in files:
        fname = os.path.join(root, f)

        if not fname.endswith(".json"):
        with open(fname) as js:
            data =
            if not data:
            import json
            dd = json.loads(data)
            counter += 1
                            INSERT INTO stats_data(data)
                            SELECT %s
                            WHERE NOT EXISTS (SELECT 42
                                              FROM stats_data
                                                    ((data->>'metadata')::json->>'country')  = %s
                                                AND ((data->>'metadata')::json->>'installation') = %s
                                                AND tstzrange(
                                                        to_timestamp((data->>'start_ts')::double precision),
                                                        to_timestamp((data->>'end_ts'  )::double precision)
                                                    ) &&
                                                        to_timestamp(%s::text::double precision),
                                                        to_timestamp(%s::text::double precision)
                        """, (data, str(dd['metadata']['country']), str(dd['metadata']['installation']), str(dd['start_ts']), str(dd['end_ts'])))

print ""

logger.debug("Refreshing materialized views")
cursor.execute("""REFRESH MATERIALIZED VIEW sessions""");

dbconn.commit()"Loaded {} files".format(counter))"Found {} empty files".format(len(empty_files)))
if empty_files:"Empty files:")
    for f in empty_files:" >>> {}".format(f))

I have created two example files in the 'data' directory, the output of this script is:

Found 2 files to process
 [Elapsed Time: 0:00:00] [ETA:  0:00:00] |#####################################|

Yey, so it works. What's more, I can run the script again on the same files, and it will try loading the same data without any errors. Do you rememember that there was an EXCLUDE constraint which doesn't allow us to load any JSON for the same country, and installation, and overlapping time range? That's why the query is so long. I also need to check that such a JSON is not in the database, so I can load it.

This is twice slower than the next solution. The problem is that it needs to unpack the JSON to run the subquery, then insert the data checking the same thing (in fact the insert, and the subquery are using the same index made by the EXCLUDE constraint).

And then PostgreSQL 9.5 was released, with one great feature: ON CONFLICT DO SOMETHING. The conflict is a UNIQUE index violation. The EXCLUDE clause in the stats_data table created such a unique index.

There can also be ON CONFLICT DO NOTHING, and that's what I have used. I changed only one query in the script, and instead of this:

                            INSERT INTO stats_data(data)
                            SELECT %s
                            WHERE NOT EXISTS (SELECT 42
                                              FROM stats_data
                                                    ((data->>'metadata')::json->>'country')  = %s
                                                AND ((data->>'metadata')::json->>'installation') = %s
                                                AND tstzrange(
                                                        to_timestamp((data->>'start_ts')::double precision),
                                                        to_timestamp((data->>'end_ts'  )::double precision)
                                                    ) &&
                                                        to_timestamp(%s::text::double precision),
                                                        to_timestamp(%s::text::double precision)
                        """, (data, str(dd['metadata']['country']), str(dd['metadata']['installation']), str(dd['start_ts']), str(dd['end_ts'])))

It looks like this:

                            INSERT INTO stats_data(data)
                            VALUES (%s)
                            ON CONFLICT ON CONSTRAINT no_overlapping_jsons DO NOTHING
                        """, (data, ))

This version requires PostgreSQL 9.5 and will not work on the previous versions.

It is twice as fast as the original, and works as expected. This means that I can run it on the already loaded files, and will not load them. This way when I use rsync to download the new files, I can just run the script, and it will load only the new files into the database.

Loading 88k of JSON files using the production version of the script with the first query takes over 1 minute. Loading the files using the second version takes less than 30 seconds.


published by (Szymon Lipi?ski) on 2016-02-29 13:08:00 in the "postgres" category

In the previous post I showed a simple PostgreSQL table for storing JSON data. Let's talk about making the JSON data easier to use.

One of the requirements was to store the JSON from the files unchanged. However using the JSON operators for deep attributes is a little bit unpleasant. In the example JSON there is attribute country inside metadata. To access this field, we need to write:

SELECT data->'metadata'->>'country' FROM stats_data;

The native SQL version would rather look like:

SELECT country FROM stats;

So let's do something to be able to write the queries like this. We need to repack the data to have the nice SQL types, and hide all the nested JSON operators.

I've made a simple view for this:

  id                                                          as id,
  created_at                                                  as created_at,
  to_timestamp((data->>'start_ts')::double precision)         as start_ts,
  to_timestamp((data->>'end_ts')::double precision)           as end_ts,
    to_timestamp((data->>'start_ts')::double precision),
    to_timestamp((data->>'end_ts')::double precision)
  )                                                           as ts_range,
  ( SELECT array_agg(x)::INTEGER[]
    FROM jsonb_array_elements_text(data->'resets') x)         as resets,
  (data->'sessions')                                          as sessions,
  (data->'metadata'->>'country')                              as country,
  (data->'metadata'->>'installation')                         as installation,
  (data->>'status')                                           as status
FROM stats_data;

This is a normal view, which means that it is only a query stored in the database. Each time the view is queried, the data must be taken from the stats_data table.

There is some code I could extract to separate functions. This will be useful in the future, and the view sql should be cleaner.

Here are my new functions:

CREATE OR REPLACE FUNCTION to_array(j jsonb) RETURNS integer[] AS $$
  SELECT array_agg(x)::INTEGER[] FROM jsonb_array_elements_text(j) x;

CREATE OR REPLACE FUNCTION to_timestamp(j jsonb) RETURNS timestamptz AS $$
  SELECT to_timestamp(j::text::double precision);

CREATE OR REPLACE FUNCTION to_timestamp_range(start_ts jsonb, end_ts jsonb) RETURNS tstzrange AS $$
  SELECT tstzrange(
    to_timestamp(start_ts::text::double precision),
    to_timestamp(end_ts::text::double precision)

And now the view can be changed to this:

  id                                                   as id,
  created_at                                           as created_at,
  to_timestamp(data->'start_ts')                       as start_ts,
  to_timestamp(data->'end_ts'  )                       as end_ts,
  to_timestamp_range(data->'start_ts', data->'end_ts') as ts_range,
  to_array(data->'resets')                             as resets,
  (data->'sessions')                                   as sessions,
  (data->'metadata'->>'country')                       as country,
  (data->'metadata'->>'installation')                  as installation,
  (data->>'status')                                    as status
FROM stats_data;

So currently we have normal SQL fields, except for the sessions part, which is there as JSON for a purpose.

The types made by PostgreSQL are:

                 View "public.stats"
    Column    ?           Type           ? Modifiers
 id           ? integer                  ?
 created_at   ? timestamp with time zone ?
 start_ts     ? timestamp with time zone ?
 end_ts       ? timestamp with time zone ?
 ts_range     ? tstzrange                ?
 resets       ? integer[]                ?
 sessions     ? jsonb                    ?
 country      ? text                     ?
 installation ? text                     ?
 status       ? text                     ?

The data from this view looks like this:

SELECT * FROM stats WHERE id = 1;

-[ RECORD 1 ]+----------------------------------------------------------------
id           | 1
created_at   | 2016-02-09 16:46:15.369802+01
start_ts     | 2015-08-03 21:10:33+02
end_ts       | 2015-08-03 21:40:33+02
ts_range     | ["2015-08-03 21:10:33+02","2015-08-03 21:40:33+02")
resets       | 
sessions     | [{"end_ts": 1438629089, "start_ts": 1438629058, "application": "first"},
                {"end_ts": 1438629143, "start_ts": 1438629123, "application": "second"},
                {"end_ts": 1438629476, "start_ts": 1438629236, "application": "third"}]
country      | USA
installation | FIRST
status       | on

The last part left is to extract information about the sessions. To make the reports simpler, I've extracted the sessions list into another view. However, because the operation of extracting the data is more expensive, I made it as a MATERIALIZED VIEW. This means that this view not only stores the query, but also keeps all the view data. This also means that this view is not updated automatically when the stats_data changes. I refresh the view data automatically in a script which loads the JSON files.

The sessions view looks like this:

  id                                                                            as id,
  country                                                                       as country,
  installation                                                                  as installation,
  s->>'application'                                                             as appname,
  to_timestamp_range(s->'start_ts', s->'end_ts')                                as ts_range,
  COALESCE(bool(s->>'occupancy_triggered'), false)                              as occupancy_triggered,
  to_timestamp(s->'end_ts') - to_timestamp(s->'start_ts')                       as session_length
FROM stats, jsonb_array_elements(sessions) s

CREATE INDEX i_sessions_country  ON sessions (country);
CREATE INDEX i_sessions_retailer ON sessions (installation);
CREATE INDEX i_sessions_ts_range ON sessions USING GIST (ts_range);

I've also created indexes on the materialized view, as my report queries will contain the where clause like:

WHERE country='' and installation='' and ts_range && tstzrange(fromdate, todate)

An example data extracted from the JSON looks like this:

select * from sessions;

id | country | installation | appname |                      ts_range                       | occupancy_triggered | session_length
 1 | USA     | FIRST        | first   | ["2015-08-03 21:10:58+02","2015-08-03 21:11:29+02") | f                   | 00:00:31
 1 | USA     | FIRST        | second  | ["2015-08-03 21:12:03+02","2015-08-03 21:12:23+02") | f                   | 00:00:20
 1 | USA     | FIRST        | third   | ["2015-08-03 21:13:56+02","2015-08-03 21:17:56+02") | f                   | 00:04:00
(3 rows)

In the next part I will show how to load the same JSON files multiple times, without any errors in a very simple Python script.


published by (Szymon Lipi?ski) on 2016-02-24 17:50:00 in the "postgres" category

We have plenty of Liquid Galaxy systems, where we write statistical information in json files. This is quite a nice solution. However we end with a bunch of files on a bunch of machines.

Inside we have a structure like:

    "end_ts": 1438630833,
    "resets": [],
    "metadata": {
        "country": "USA",
        "installation": "FIRST"
    "sessions": [
            "application": "first",
            "end_ts": 1438629089,
            "start_ts": 1438629058
            "application": "second",
            "end_ts": 1438629143,
            "start_ts": 1438629123
            "application": "third",
            "end_ts": 1438629476,
            "start_ts": 1438629236
    "start_ts": 1438629033,
    "status": "on"

And the files are named like "{start_ts}.json". The number of files is different on each system. For January we had from 11k to 17k files.

The fields in the json mean:

  • start_ts/end_ts - timestamps for start/end for the file
  • resets - is an array of timestamps when system was resetted
  • sessions - a list of sessions, each contains application name and start/end timestamps

We keep these files in order to get statistics from them. So we can do one of two things: keep the files on disk, and write a script for making reports. Or load the files into a database, and make the reports from the database.

The first solution looks quite simple. However for a year of files, and a hundred of systems, there will be about 18M files.

The second solution has one huge advantage: it should be faster. A database should be able to have some indexes, where the precomputed data should be stored for faster querying.

For a database we chose PostgreSQL. The 9.5 version released in January has plenty of great features for managing JSON data.

The basic idea behind the database schema is:

  • the original jsons should be stored without any conversion
  • the report queries must be fast
  • there should be only one json entry for a site for given time
  • the script loading the data should load the same file many times without any error

I've started with the main table for storing jsons:

CREATE TABLE stats_data (

This is not enough. We also want to avoid storing the same json multiple times. This can easily be done with an EXCLUDE clause.

CREATE TABLE stats_data (

  CONSTRAINT no_overlapping_jsons
      to_timestamp((data->>'start_ts')::double precision),
      to_timestamp((data->>'end_ts'  )::double precision)
    ) WITH &&,
    ((data->>'metadata')::json->>'country')      WITH =,
    ((data->>'metadata')::json->>'installation') WITH =

The above SQL requires a small extention to be installed


And now inserting the same json results in error:

$ insert into stats_data(data) select data from stats_data;
ERROR:  conflicting key value violates exclusion constraint "no_overlapping_jsons"

So for now we have a simple table with original json, and with a constraint disallowing to insert overlapping jsons.

In the next part I will show how to make simple reports and load the json files.


published by (Greg Sabino Mullane) on 2015-07-01 18:22:00 in the "postgres" category

Back in the old days, upgrading Postgres required doing a pg_dump and loading the resulting logical SQL into the new database. This could be a very slow, very painful process, requiring a lot of downtime. While there were other solutions (such as Bucardo) that allowed little (or even zero) downtime, setting them up was a large complex task. Enter the pg_upgrade program, which attempts to upgrade a cluster with minimal downtime. Just how fast is it? I grew tired of answering this question from clients with vague answers such as "it depends" and "really, really fast" and decided to generate some data for ballpark answers.

Spoiler: it's either about 3.5 times as fast as pg_dump, or insanely fast at a flat 15 seconds or so. Before going further, let's discuss the methodology used.

I used the venerable pgbench program to generate some sample tables and data, and then upgraded the resulting database, going from Postgres version 9.3 to 9.4. The pgbench program comes with Postgres, and simply requires an --initialize argument to create the test tables. There is also a --scale argument you can provide to increase the amount of initial data - each increment increases the number of rows in the largest table, pgbench_accounts, by one hundred thousand rows. Here are the scale runs I did, along with the number of rows and overall database size for each level:

Effect of --scale
--scaleRows in pgbench_accountsDatabase size
10010,000,0001418 MB
15015,000,0002123 MB
20020,000,0002829 MB
25025,000,0003535 MB
30030,000,0004241 MB
35035,000,0004947 MB
40040,000,0005652 MB
45045,000,0006358 MB
50050,000,0007064 MB
55055,000,0007770 MB
60060,000,0008476 MB

To test the speed of the pg_dump program, I used this simple command:

$ pg_dump postgres | psql postgres -q -p 5433 -f -

I did make one important optimization, which was to set fsync off on the target database (version 9.4). Although this setting should never be turned off in production - or anytime you cannot replace all your data, upgrades like this are an excellent time to disable fsync. Just make sure you flip it back on again right away! There are some other minor optimizations one could make (especially boosting maintenance_work_mem), but for the purposes of this test, I decided that the fsync was enough.

For testing the speed of pg_upgrade, I used the following command:

$ pg_upgrade -b $BIN1 -B $BIN2 -d $DATA1 -D $DATA2 -P 5433

The speed difference can be understood because pg_dump rewrites the entire database, table by table, row by row, and then recreates all the indexes from scratch. The pg_upgrade program simply copies the data files, making the minimum changes needed to support the new version. Because of this, it will always be faster. How much faster depends on a lot of variables, e.g. the number and size of your indexes. The chart below shows a nice linear slope for both methods, and yielding on average a 3.48 increase in speed of pg_upgrade versus pg_dump:

pg_dump versus pg_upgrade
--scaleDatabase sizepg_dump
1001.4 GB210.074.72.82
1502.1 GB305.079.43.86
2002.8 GB457.6122.23.75
2503.5 GB636.1172.13.70
3004.2 GB832.2215.13.87
3504.9 GB1098.8320.73.43
4005.7 GB1172.7361.43.25
4506.4 GB1340.2426.73.15
5007.1 GB1509.6476.33.17
5507.8 GB1664.0480.03.47
6008.5 GB1927.06073.17

If you graph it out, you can see both of them having a similar slope, but with pg_upgrade as the clear winner:

I mentioned earlier that there were some other optimizations that could be done to make the pg_dump slightly faster. As it turns out, pg_upgrade can also be made faster. Absolutely, beautifully, insanely faster. All we have to do is add the --link argument. What this does is rather than copying the data files, it simply links them via the filesystem. Thus, each large data file that makes up the majority of a database's size takes a fraction of a second to link to the new version. Here are the new numbers, generated simply by adding a --link to the pg_upgrade command from above:

pg_upgrade --link is crazy fast
--scaleDatabase sizepg_upgrade --link
1001.4 GB12.9
1502.1 GB13.4
2002.8 GB13.5
2503.5 GB13.2
3004.2 GB13.6
3504.9 GB14.4
4005.7 GB13.1
4506.4 GB13.0
5007.1 GB13.2
5507.8 GB13.1
6008.5 GB12.9

No, those are not typos - an average of thirteen seconds despite the size of the database! The only downside to this method is that you cannot access the old system once the new system starts up, but that's a very small price to pay, as you can easily backup the old system first. There is no point in graphing these numbers out - just look at the graph above and imagine a nearly flat line traveling across the bottom of the graph :)

Are there any other options that can affect the time? While pgbench has a handy --foreign-keys argument I often use to generate a more "realistic" test database, both pg_dump and pg_upgrade are unaffected by any numbers of foreign keys. One limitation of pg_upgrade is that it cannot change the --checksum attribute of a database. In other words, if you want to go from a non-checksummed version of Postgres to a checksummed version, you need to use pg_dump or some other method. On the plus side, my testing found negligible difference between upgrading a checksummed versus a non-checksummed version.

Another limitation of the pg_upgrade method is that all internal stats are blown away by the upgrade, so the database starts out in a completely unanalyzed state. This is not as much an issue as it used to be, as pg_upgrade will generate a script to regenerate these stats, using the handy --analyze-in-stages argument to vacuum. There are a few other minor limitations to pg_upgrade: read the documentation for a complete list. In the end, pg_upgrade is extraordinarily fast and should be your preferred method for upgrading. Here is a final chart showing the strengths and weaknesses of the major upgrade methods.

Postgres upgrade methods compared
  • Always works
  • Battle tested
  • Slowest method
  • Maximum downtime
  • Requires lots of disk space
  • Very fast
  • --link mode super fast
  • Cannot always be used (finicky)
  • Stats are lost
  • Minimal but non-zero downtime
  • Handles complex cases
  • Zero-downtime possible
  • Complex to setup
  • Requires primary keys on large tables
  • Requires lots of disk space

(As an addendum of sorts, pg_upgrade is fantastic, but the Holy Grail is still out of sight: true in-place upgrades. This would mean dropping in a new major version (similar to the way revisions can be dropped in now), and this new version would be able to read both old and new data file formats, and doing an update-on-write as needed. Someday!)


published by (Joshua Tolley) on 2015-01-31 01:02:00 in the "postgres" category
We ran into a couple of interesting situations recently, and used some helpful tricks to solve them, which of course should be recorded for posterity.

Unlogged tables

One of our customers needed a new database, created as a copy of an existing one but with some data obscured for privacy reasons. We could have done this with a view or probably any of several other techniques, but in this case, given the surrounding infrastructure, a new database, refreshed regularly from the original, was the simplest method. Except that with this new database, the regular binary backups became too large for the backup volume in the system. Since it seemed silly to re-provision the backup system (and talk the client into paying for it) to accommodate data we could throw away and recalculate at any time, we chose unlogged tables as an alternative.

"Unlogged," in this case, means changes to this table aren't written in WAL logs. This makes for better performance, but also means if the database crashes, these tables can't be recovered in the usual way. As a side effect, it also means these tables aren't copied via WAL-based replication, so the table won't show up in a hot standby system, for instance, nor will the table appear in a system restored from a WAL-based backup (pg_dump will still find them). Unlogged tables wouldn't give our application much of a performance boost in this case — the improved performance applies mostly to queries that modify the data, and ours were meant to be read-only. But before this change, the regular refresh process generated all kinds of WAL logs, and now they've all disappeared. The backups are therefore far smaller, and once again fit within the available space. Should the server crash, we'll have a little more work to do, regenerating these tables from their sources, but that's a scripted process and simple to execute.

Stubborn vacuum

Another fairly new customer has a database under a heavy and very consistent write load. We've had to make autovacuum very aggressive to keep up with bloat in several tables. When the vacuum process happens to clean all the tuples from the end of a table file, it tries to shrink the file and reclaim disk space, but it has to obtain a brief exclusive lock to do it. If it can't get one1, it gives up, and emits a log message you'll see if you are vacuuming in verbose mode:

INFO:  "some_big_table": stopping truncate due to conflicting lock request

Note that though the log message calls this process "truncating", it should not be confused with the "TRUNCATE TABLE" command, which (locks permitting) would reclaim quite a bit more disk space than we want it to. Anyway, when the shrinking operation succeeds, there is no log message, so if VACUUM VERBOSE doesn't say anything about "stopping truncate", it's because it was able to get its lock and shrink the table, or the table didn't need it in the first place. Because of this database's tendency to bloat, we'd like vacuum to be able to shrink tables regularly, but the query load is such that for some tables, it never gets the chance. We're working on mitigating that, but in the meantime, one stop-gap solution is to run VACUUM VERBOSE in a tight loop until you don't see one of those "stopping truncate" messages. In our case we do it like this:

for table in $problematic_tables; do
    echo "Vacuuming $table"
    ( timeout $timeout bash <<VACUUM
        while :; do
            vacuumdb -v -t $table $my_database 2>&1 | grep "stopping truncate" || break
) || echo "Timed out on table $table"

This script iterates through a list of tables we'd like to shrink, and vacuums each repeatedly, as quickly as possible, until the vacuum process fails to emit a "stopping truncate" message, or it finds it has spent eight minutes2 trying. Of course this whole technique is only useful in a few limited cases, but for our purposes we've found it helpful for managing bloat while we continue to work on the query patterns to reduce locking overall.

1. In version 9.0 and before, it simply tries to obtain the lock once. In 9.1 and later versions, it tries every few milliseconds for up to five seconds to obtain the lock.?
2. There's nothing magic about eight minutes, it just works out well enough for our purposes.?