All opinions expressed are those of the authors and not necessarily those of, our sponsors, or our affiliates.
  Add to My Yahoo!  Subscribe with Bloglines  Subscribe in NewsGator Online

published by (Greg Sabino Mullane) on 2017-10-23 10:30:00 in the "postgres" category

(Photo of Turtle Island by Edwin Poon)

Someone recently asked me something about tail_n_mail, a program that watches over your log files, scans for certain patterns, and sends out an email if matches are found. It is frequently used to watch over Postgres logs so you can receive an automatic email alert when Bad Things start happening to your database. The questioner wanted to know if it was possible for tail_n_mail to change its behavior based on the time of day - would it be able to do things differently outside of "business hours"? Although tail_n_mail cannot do so directly, a simple solution is to use alternate configuration files - which get swapped by cron - and the INHERIT keyword.

To demonstrate the solution, let's spin up a Postgres 10 instance, route the logs to syslog, setup tail_n_mail, and then create separate configuration files for different times of the week. First, some setup:

$ initdb --version
initdb (PostgreSQL) 10.0
$ initdb --data-checksums data
$ cat >> data/postgresql.conf << EOT
$ echo 'local0.*  /var/log/postgres.log' | sudo tee -a /etc/rsyslog.conf > /dev/null
$ sudo systemctl restart rsyslog
$ pg_ctl start -D data -l logfile

Grab the latest version of tail_n_mail and verify it:

$ wget --no-verbose{,.asc} 
2017-03-03 10:00:33 URL: [98767/98767] -> "tail_n_mail" [1]
2017-03-03 10:00:33 URL: [163/163] -> "tail_n_mail.asc" [1]
FINISHED --2017-03-03 10:00:33--
Total wall clock time: 0.3s
Downloaded: 2 files, 96K in 0.1s (702 KB/s)
$ gpg --verify tail_n_mail.asc
gpg: assuming signed data in `tail_n_mail'
gpg: Signature made Sun 01 Oct 2017 11:14:07 AM EDT using DSA key ID 14964AC8
gpg: Good signature from "Greg Sabino Mullane "
gpg:                 aka "Greg Sabino Mullane (End Point Corporation) "
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: 2529 DF6A B8F7 9407 E944  45B4 BC9B 9067 1496 4AC8

The main way to configure tail_n_mail is through its configuration file, which is always the first argument given to the program. This file describes where the log files are, what to look for, and a few other important items. In addition, it automatically updates itself each time tail_n_mail is run to keep track of where the last run left of, so the next run can start at the exact same file, and the correct place within that file. In this example, let's assume the DBA wants to get email for every error that pops up in the database (in practice, this means any severity levels that are ERROR, FATAL, or PANIC). The configuration file would look like this:

$ cat > tnm.conf << EOT
FILE: /var/log/postgres.log
PGLOG: syslog
## Okay, we don't want to get emailed on *every* error:
EXCLUDE: could not serialize access due to concurrent update
EXCLUDE: canceling statement due to user request

To test it out, we will generate some errors, and then run tail_n_mail from the command line. If all goes well, it sends out an email and then rewrites the configuration file to indicate how far along it got. The --dry-run option can be used to view the email without actually sending it.

$ for i in 2 4 6 8; do psql -tc "select $i/0"; done
ERROR:  division by zero
ERROR:  division by zero
ERROR:  division by zero
ERROR:  division by zero
$ perl tail_n_mail tnm.conf --dry-run
Subject: localhost.localdomain Postgres errors 4
Auto-Submitted: auto-generated
Precedence: bulk

Date: Tue Oct  3 03:19:22 2017 EDT
Host: localhost.localdomain
Unique items: 1
Matches from /var/log/postgres.log: 4

[1] (between lines 139 and 142, occurs 4 times)
First: Oct   3 03:19:00 localhost postgres[28483]: [6-1]
Last:  Oct   3 03:19:00 localhost postgres[28495]: [6-1]
ERROR: division by zero
STATEMENT: select ?/0
ERROR: division by zero
STATEMENT: select 2/0

  DRYRUN: /usr/sbin/sendmail '' < tnmBWaG6QA1.tnm2

Running it in normal mode rewrites the configuration file:

$ perl tail_n_mail tnm.conf
$ cat tnm.conf
## Config file for the tail_n_mail program
## This file is automatically updated
## Last updated: Mon Oct  2 12:09:29 2017
PGLOG: syslog

## Okay, we don't want to get emailed on *every* error:
EXCLUDE: could not serialize access due to concurrent update
EXCLUDE: canceling statement due to user request

FILE1: /var/log/postgres.log
LASTFILE1: /var/log/postgres.log
OFFSET1: 333

Note how the file was rewritten to include state information about the files we are tracking, but leaves the exclusion rules and their comments in place. Tail_n_mail also attempts to "flatten" similar queries, which is why the four division-by-zero errors all appear as "SELECT ?/0". A sample of one of the literal errors appears below the normalized version.

You are not limited to a single configuration file, however, as the main config file can read in other configuration files via the INHERITS keyword. This allows you to import one or more other configuration files. Not only does this allow different tail_n_mail invocations to share common items to search for, but (as you will see in a bit) can solve the problem at the top of this post: how to change what is being looked for based on the time of day.

Using INHERITS also allows us to store files in version control, without worrying about them getting rewritten on each invocation, as we can store the ephemeral data in one file, and the constant data in a separate, version controlled file. Let's apply that idea to our example:

$ cat > << EOT
PGLOG: syslog
## Okay, we don't want to get emailed on *every* error:
EXCLUDE: could not serialize access due to concurrent update
EXCLUDE: canceling statement due to user request
$ git add && git commit 
  -m "Global config for tail_n_mail"
[master 2441df8] Global config for tail_n_mail
 1 file changed, 7 insertions(+)
 create mode 100644
$ cat > tnm.conf << EOT
FILE: /var/log/postgres.log

After another run, we observe that the inherited file does not change:

$ perl tail_n_mail tnm.conf
$ git status
On branch master
nothing to commit, working tree clean
$ cat tnm.conf

## Config file for the tail_n_mail program
## This file is automatically updated
## Last updated: Mon Oct  2 12:28:10 2017


FILE1: /var/log/postgres.log
LASTFILE1: /var/log/postgres.log
OFFSET1: 13219

Another advantage to moving common items to another file is that we can run multiple tail_n_mails, with slightly different purposes, but all sharing some of the same underlying rules. A common usage is to get an immediate email about almost all database problems, as well as a daily report about all problems. To do this, we create two configuration files and set them up in cron:

$ cp tnm.conf tnm.fatals.conf
$ mv tnm.conf tnm.errors.conf
$ perl -pi -e 's/Postgres errors/Postgres fatals/' tnm.fatals.conf
$ crontab -e
## Every five minutes, check for important problems
*/5 * * * * perl tail_n_mail tnm.fatals.conf
## Once every morning, generate a report of all errors in the last 24 hours.
30 6 * * * perl tail_n_mail tnm.errors.conf
## Note: it is usually easier to have separate "fatals" and "errors" exclusions.

What if we want to change the rules depending on the time of day, per the question that started this article? Simple enough - we just create two "inherited" configuration files, then have cron swap things around as needed. For example, let's say that after 5pm on weekdays, and all weekend, we do not want to receive emails about "division by zero" errors. First, create files named and

$ cp
$ cp
$ ln -sf
$ echo "EXCLUDE: ERROR:  division by zero" >>

Finally, have cron swap the files around at the start and end of business hours:

$ crontab -e
## May need to use 1-5 instead of Mon-Fri on some systems
0 9 * * Mon-Fri ln -sf
0 17 * * Mon-Fri ln -sf

Voila! We've changed the way tail_n_mail runs depending on the time of day. There are many other tricks you can do with tail_n_mail - check out the documentation or post to the mailing list for more help and/or inspiration.


published by (Greg Sabino Mullane) on 2017-09-27 11:00:00 in the "postgres" category

(Not as easy as it looks)
Photo by Flickr user 'edenpictures'

As consultants, the engineers at End Point are often called upon to do work on production systems - in other words, one or more servers that are vital to our client's business. These range from doing years of planning to perform a major upgrade for a long-standing client, down to jumping into a brand-new client for an emergency fix. Either way, the work can be challenging, rewarding, and a little bit nerve-wracking.

Regardless of how you end up there, following some basic good practices may reduce the chance of problems coming up, keep you calm during the chaos, and make the process easier for both you and the client.


Unless this is a true emergency, doing major work on a production system (e.g. upgrading a database server) should involve a good amount or preparation. By the time the big night arrives (yes, it is always late at night!) you should have gone through this list:

  • Do lots and lots of testing. Use systems as close as possible to production, and run through the process until it becomes second nature.
  • Know the impact on the business, and the downtime window anticipated by the rest of the company.
  • Have a coworker familiar with everything on standby, ready to call in if needed.
  • Document the process. For a large project, a spreadsheet or similar document can be quite helpful.
    • Who are the people involved, how to contact them, and what general role do they play?
    • Which task is being done at what time, who is the primary and backup for it, and what other tasks does it block?
    • How is success for each stage measured?
    • What is the rollback plan for each step? When is the point of no return reached?
  • Setup a shared meeting space (IRC, Skype, Slack, Zulip, HipChat, etc.)
  • Confirm connections (e.g. VPN up and running? Passwords not set to expire soon? Can everyone get to Slack? SSH working well?)

Screen Up

The night usually begins with connecting to a client server via SSH. The first order of business should be to invoke 'screen' or 'tmux', which are persistent session managers (aka terminal multiplexers). These keep your connections; if your network drops, so you can easily pick up where you left off. They also allow other people to view and/or join in on what you are doing. Finally, they enable you to easily view and control multiple windows. Some screen-specific tips:

  • Name your screen something obvious to the task such as "bucardo-production-rollout". Always give it a name to prevent people from joining it by accident. I often use just my email, i.e. screen -S or tmux new -s greg_endpoint_com.
  • Keep organized. Try to keep each window to one general task, and give each one a descriptive name:
    • screen: Ctrl-a A     tmux: Ctrl-b ,
    • I find that uppercase names stand out nicely from your terminal traffic.
    • Splitting windows by task also helps scrollback searching.
  • Boost your scrollback buffer so you can see what happened a while ago. The default value is usually much too low.
    • Inside /etc/screenrc or ~/.screenrc: defscrollback 50000
    • Inside /etc/tmux.conf or ~/.tmux.conf: set-option -g history-limit 50000
  • Develop a good configuration file. Nothing too fancy is needed, but being able to see all the window names at once on the bottom of the screen makes things much easier.
  • Consider logging all your screen output.

Discovery and Setup

If you don't already know, spend a little bit of time getting to know the server. A deep analysis is not needed, but you should have a rough idea how powerful it is (CPU, memory), how big it is (disk space), what OS it is (distro/version), and what else is running on it. Although one should be able to easily work on any unmodified *nix server, I almost always make a few changes:

  • Install minimal support software. For me, that usually means apt-get install git-core emacs-nox screen mlocate.
  • Put in some basic aliases to help out. Most important being rm='rm -i'.
  • Switch to a lower-priv account and do the work there when possible. Use sudo instead of staying as root.

For Postgres, you also want to get some quick database information as well. There are many things you could learn, but at a bare minimum check out the version, and per-user settings, the databases and their sizes, and what all the non-default configuration settings are:

select version();
select name, setting, source from pg_settings where source <> 'default' order by 1;

Version Control (git up)

One of the earliest steps is to ensure all of your work is done in a named subdirectory in a non-privileged account. Everything in this directory should be put into version control. This not only timestamps file changes for you, but allows quick recovery from accidentally removing important files. All your small scripts, your configuration files (except .pgpass), your SQL files - put them all in version control. And by version control, I of course mean git, which has won the version control wars (a happy example of the most popular tool also being the best one). Every time you make a change, git commit it.


As a Postgres expert, 99% of my work is done through psql, the canonical command-line client for Postgres. I am often connecting to various database servers in quick succession. Although psql is an amazing tool, there are important considerations to keep in mind.

The .psql_history file, along with readline support, is a wonderful thing. However, it is also a great footgun, owing to the ease of using the "up arrow" and "Ctrl-r" rerun SQL statements. This can lead to running a command on server B that was previously run on server A (and which should never, ever be run on server B!). Here are some ways around this issue:

One could simply remove the use of readline when on a production database. In this way, you will be forced to type everything out. Although this has the advantage of not accidentally toggling back to an older command, the loss of history is very annoying - and it greatly increases the chance of typos, as each command needs to be typed anew.

Another good practice is to empty out the psql_history file if you know it has some potentially damaging commands in it (e.g. you just did a lot of deletions on the development server, and are now headed to production.). Do not simply erase it, however, as the .psql_history provides a good audit trail of exactly what commands you ran. Save the file, then empty it out:

$ alias clean_psql_history='cat ~/.psql_history >> ~/.psql_history.log; truncate -s0 ~/.psql_history'

Although the .psql_history file can be set per-database, I find it too confusing and annoying in practice to use. I like being able to run the same command on different databases via the arrow keys and Ctrl-r.

It is important to exit your psql sessions as soon as possible - never leave it hanging around while you go off and do something else. There are multiple reasons for this:

  • Exiting psql forces a write of the .psql_history file. Better to have a clean output rather than allowing all the sessions to interleave with each other. Also, a killed psql session will not dump its history!
  • Exiting frees up a database connection, and prevents you from unintentionally leaving something idle in transaction.
  • Coming back to an old psql session risks a loss of mental context - what was I doing here again? Why did I leave this around?
  • Less chance of a accidental paste into a window with a psql prompt (/voice_of_experience).

Another helpful thing for psql is a good custom prompt. There should always be some way of telling which database - and server - you are using just by looking at the psql prompt. Database names are often the same (especially with primaries and their replicas), so you need to add a little bit more. Here's a decent recipe, but you can consult the documentation to design your own.

$ echo "set PROMPT1 '%/@%m%R%#%x '" >> ~/.psqlrc
$ psql service=PROD

Connection Service File

Using a connection service file to access Postgres can help keep things sane and organized. Connection files allow you to associate a simple name with multiple connection parameters, allowing abstraction of those details in a manner much safer and cleaner than using shell aliases. Here are some suggestions on using connection files:

  • If possible, use the local user's file (~/.pg_service.conf), but the global file is fine too. Whichever you choose, do not put the password inside them - use the .pgpass file for that.
  • Use short but descriptive service names. Keep them very distinct from each other, to reduce the chance of typing in the wrong name.
  • Use uppercase names for important connections (e.g. PROD for the connection to an important production database). This provides another little reminder to your brain that this is not a normal psql connection. (Remember, PROD = Please Respect Our Data)
  • Resist the temptation to alias them further. Force yourself to type everything out each time, e.g. "psql service=PROD" each time. This keeps your head focused on where you are going.

Dangerous SQL

Some actions are so dangerous, it is a good idea to remove any chance of direct invocation on the wrong server. The best example of this is the SQL 'truncate' command. If I find myself working on multiple servers in which I need to truncate a table, I do NOT attempt to invoke the truncate command directly. Despite all precautions, there are many ways to accidentally run the same truncate command on the wrong serve, whether via a .psql_history lookup, or simply an errant cut-n-paste error. One solution is to put the truncate into a text file, and then invoke that text file, but that simply adds the chance that this file may be invoked on the wrong database. Another solution is to use a text file, but change it when done (e.g. search and replace "truncate" to "notruncate"). This is slightly better, but still error prone as it relies on someone remembering to change the file after each use, and causes the file to no longer represent what was actually run.

For a better solution, create a function that only exists on one of the databases. For example, if you have a test database and a production database, you can run truncate via a function that only exists on the test database. Thus, you may safely run your function calls on the test server, and not worry if the functions accidentally get run on the production server. If they do, you end up with a "function not found error" rather than realizing you just truncated a production table. Of course, you should add some safeguards so that the function itself is never created on the production server. Here is one sample recipe:


DROP FUNCTION IF EXISTS safetruncate(text);

CREATE FUNCTION safetruncate(tname text)
LANGUAGE plpgsql
AS $ic$
    -- The cluster_name setting can be quite handy!
    PERFORM 1 FROM pg_settings WHERE name = 'cluster_name' AND setting = 'testserver';
      RAISE EXCEPTION 'Cannot create this function on this server!';
    END IF;

    EXECUTE FORMAT('truncate table %s', tname);

    RETURN 'Truncated table ' || tname;

Now you can create text files full of indirect truncate calls - filled with SELECT safetruncate('exultations'); instead of literal truncate calls. This file may be safely checked into git, and copy and pasted without worry.

Record keeping

The value of keeping good records of what you are doing on production systems cannot be overstated. While you should utimately use whatever system is best for you, I like to keep a local text file that spells out exactly what I did, when I did it, and what happened. These should be detailed enough that you can return to them a year later and explain to the client exactly what happened.

The primary way to document things as you go along is with good old fashioned cut and paste. Both the .bash_history and .psql_history files provide automatic tracking of entered commands, as a nice backup to your notes. Make liberal use of the tee(1) command to store command output into discrete files (which reduces the need to rely on scrollback).

Rather than entering commands directly into a psql prompt, consider putting them into a text file and then feeding that file to psql. It's a little extra work, but the file can be checked into git, giving an excellent audit trail. Plus, you may have to run those commands again someday.

If you find yourself doing the same thing over and over, write a shell script. This even applies to psql commands. For example, I recently found myself having to examine tables on two servers, and needed to quickly examine a table's size, indexes, and if anyone was currently modifying it. Thus a shall script:

psql service=PROD -c "\d $1"
psql service=PROD -Atc "select pg_relation_size('$1')"
psql service=PROD -Atc "select query from pg_stat_activity where query ~ '$1'"

Wrap Up

When you are done with all your production tasks, write up some final thoughts on how it went, and what the next steps are. Read through all your notes while it is all fresh in your mind and clean them up as needed. Revel in your success, and get some sleep!


published by (Greg Sabino Mullane) on 2017-07-22 00:16:00 in the "postgres" category

Upgrading Postgres is not quite as painful as it used to be, thanks primarily to the pg_upgrade program, but there are times when it simply cannot be used. We recently had an existing End Point client come to us requesting help upgrading from their current Postgres database (version 9.2) to the latest version (9.6 - but soon to be 10). They also wanted to finally move away from their SQL_ASCII encoding to UTF-8. As this meant that pg_upgrade could not be used, we also took the opportunity to enable checksums as well (this change cannot be done via pg_upgrade). Finally, they were moving their database server to new hardware. There were many lessons learned and bumps along the way for this migration, but for this post I'd like to focus on one of the most vexing problems, the database encoding.

When a Postgres database is created, it is set to a specific encoding. The most common one (and the default) is "UTF8". This covers 99% of all user's needs. The second most common one is the poorly-named "SQL_ASCII" encoding, which should be named "DANGER_DO_NOT_USE_THIS_ENCODING", because it causes nothing but trouble. The SQL_ASCII encoding basically means no encoding at all, and simply stores any bytes you throw at it. This usually means the database ends up containing a whole mess of different encodings, creating a "byte soup" that will be difficult to sanitize by moving to a real encoding (i.e. UTF-8).

Many tools exist which convert text from one encoding to another. One of the most popular ones on Unix boxes is "iconv". Although this program works great if your source text is using one encoding, it fails when it encounters byte soup.

For this migration, we first did a pg_dump from the old database to a newly created UTF-8 test database, just to see which tables had encoding problems. Quite a few did - but not all of them! - so we wrote a script to import tables in parallel, with some filtering for the problem ones. As mentioned above, iconv was not particularly helpful: looking at the tables closely showed evidence of many different encodings in each one: Windows-1252, ISO-8859-1, Japanese, Greek, and many others. There were even large bits that were plainly binary data (e.g. images) that simply got shoved into a text field somehow. This is the big problem with SQL_ASCII: it accepts *everything*, and does no validation whatsoever. The iconv program simply could not handle these tables, even when adding the //IGNORE option.

To better explain the problem and the solution, let's create a small text file with a jumble of encodings. Discussions of how UTF-8 represents characters, and its interactions with Unicode, are avoided here, as Unicode is a dense, complex subject, and this article is dry enough already. :)

First, we want to add some items using the encodings 'Windows-1252' and 'Latin-1'. These encoding systems were attempts to extend the basic ASCII character set to include more characters. As these encodings pre-date the invention of UTF-8, they do it in a very inelegant (and incompatible) way. Use of the "echo" command is a great way to add arbitrary bytes to a file as it allows direct hex input:

$ echo -e "[Windows-1252]   Euro: x80   Double dagger: x87" >
$ echo -e "[Latin-1]   Yen: xa5   Half: xbd" >> 
$ echo -e "[Japanese]   Ship: xe8x88xb9" >>
$ echo -e "[Invalid UTF-8]  Blob: xf4xa5xa3xa5" >> 

This file looks ugly. Notice all the "wrong" characters when we simply view the file directly:

$ cat
[Windows-1252]   Euro: �   Double dagger: �
[Latin-1]   Yen: �   Half: �
[Japanese]   Ship: ? 
[Invalid UTF-8]  Blob: ����

Running iconv is of little help:

## With no source encoding given, it errors on the Euro:
$ iconv -t utf8 >/dev/null
iconv: illegal input sequence at position 23 

## We can tell it to ignore those errors, but it still barfs on the blob:
$ iconv -t utf8//ignore >/dev/null
iconv: illegal input sequence at position 123

## Telling it the source is Window-1252 fixes some things, but still sinks the Ship:
$ iconv -f windows-1252 -t utf8//ignore 
[Windows-1252]   Euro: €   Double dagger: ‡
[Latin-1]   Yen: ¥   Half: ½
[Japanese]   Ship: è?¹
[Invalid UTF-8]  Blob: ô¥£¥

After testing a few other tools, we discovered the nifty Encoding::FixLatin , a Perl module which provides a command-line program called "fix_latin". Rather than being authoritative like iconv, it tries its best to fix things up with educated guesses. Its documentation gives a good summary:

  The script acts as a filter, taking source data which may contain a mix of
  ASCII, UTF8, ISO8859-1 and CP1252 characters, and producing output will be
  all ASCII/UTF8.

  Multi-byte UTF8 characters will be passed through unchanged (although
  over-long UTF8 byte sequences will be converted to the shortest normal
  form). Single byte characters will be converted as follows:

    0x00 - 0x7F   ASCII - passed through unchanged 
    0x80 - 0x9F   Converted to UTF8 using CP1252 mappings
    0xA0 - 0xFF   Converted to UTF8 using Latin-1 mappings

While this works great for fixing the Windows-1252 and Latin-1 problems (and thus accounted for at least 95% of our table's bad encodings), it still allows "invalid" UTF-8 to pass on through. Which means that Postgres will still refuse to accept it. Let's check our test file:

$ fix_latin
[Windows-1252]   Euro: €   Double dagger: ‡
[Latin-1]   Yen: ¥   Half: ½
[Japanese]   Ship: ?
[Invalid UTF-8]  Blob: ����

## Postgres will refuse to import that last part:
$ echo "SELECT E'"  "$(fix_latin"  "';" | psql
ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5 

## Even adding iconv is of no help:
$ echo "SELECT E'"  "$(fix_latin | iconv -t utf-8)"  "';" | psql
ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

The UTF-8 specification is rather dense and puts many requirements on encoders and decoders. How well programs implement these requirements (and optional bits) varies, of course. But at the end of the day, we needed that data to go into a UTF-8 encoded Postgres database without complaint. When in doubt, go to the source! The relevant file in the Postgres source code responsible for rejecting bad UTF-8 (as in the examples above) is src/backend/utils/mb/wchar.c Analyzing that file shows a small but elegant piece of code whose job is to ensure only "legal" UTF-8 is accepted:

pg_utf8_islegal(const unsigned char *source, int length)
  unsigned char a;

  switch (length)
      /* reject lengths 5 and 6 for now */
      return false;
    case 4:
      a = source[3];
      if (a < 0x80 || a > 0xBF)
        return false;
      /* FALL THRU */
    case 3: 
      a = source[2];
      if (a < 0x80 || a > 0xBF)
        return false;
      /* FALL THRU */
    case 2:
      a = source[1];
      switch (*source)
        case 0xE0:
          if (a < 0xA0 || a > 0xBF)
            return false;
        case 0xED:
          if (a < 0x80 || a > 0x9F)
            return false;
        case 0xF0:
          if (a < 0x90 || a > 0xBF)
            return false;
        case 0xF4:
          if (a < 0x80 || a > 0x8F)
            return false;
          if (a < 0x80 || a > 0xBF)
            return false;
      /* FALL THRU */
    case 1:
      a = *source; 
      if (a >= 0x80 && a < 0xC2)
        return false;
      if (a > 0xF4)
        return false;
  return true;

Now that we know the UTF-8 rules for Postgres, how do we ensure our data follows it? While we could have made another standalone filter to run after fix_latin, that would increase the migration time. So I made a quick patch to the fix_latin program itself, rewriting that C logic in Perl. A new option "--strict-utf8" was added. Its job is to simply enforce the rules found in the Postgres source code. If a character is invalid, it is replaced with a question mark (there are other choices for a replacement character, but we decided simple question marks were quick and easy - and the surrounding data was unlikely to be read or even used anyway).

Voila! All of the data was now going into Postgres without a problem. Observe:

$ echo "SELECT E'"  "$(fix_latin"  "';" | psql
ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

$ echo "SELECT E'"  "$(fix_latin --strict-utf8"  "';" | psql
  [Windows-1252]   Euro: €   Double dagger: ‡+
 [Latin-1]   Yen: ¥   Half: ½                +
 [Japanese]   Ship: ?                       +
 [Invalid UTF-8]  Blob: ???? 
(1 row)

What are the lessons here? First and foremost, never use SQL_ASCII. It's outdated, dangerous, and will cause much pain down the road. Second, there are an amazing number of client encodings in use, especially for old data, but the world has pretty much standardized on UTF-8 these days, so even if you are stuck with SQL_ASCII, the amount of Windows-1252 and other monstrosities will be small. Third, don't be afraid to go to the source. If Postgres is rejecting your data, it's probably for a very good reason, so find out exactly why. There were other challenges to overcome in this migration, but the encoding was certainly one of the most interesting ones. Everyone, the client and us, is very happy to finally have everything using UTF-8!


published by (Greg Sabino Mullane) on 2017-03-28 16:23:00 in the "postgres" category

Turtle turtle by WO1 Larry Olson from US Army

The PostgreSQL database system uses the write-ahead logging method to ensure that a log of changes is saved before being applied to the actual data. The log files that are created are known as WAL (Write Ahead Log) files, and by default are 16 MB in size each. Although this is a small size, a busy system can generate hundreds or thousands of these files per hour, at which point disk space becomes an issue. Luckily, WAL files are extremely compressible. I examined different programs to find one that offered the best compression (as indicated by a smaller size) at the smallest cost (as indicated by wall clock time). All of the methods tested worked better than the venerable gzip program, which is suggested in the Postgres documentation for the archive_command option. The best overall solution was using the pxz program inside the archive_command setting, followed closely by use of the 7za program. Use of the built-in wal_compression option was an excellent solution as well, although not as space-saving as using external programs via archive_command.

A database system is a complex beast, involving many trade-offs. An important issue is speed: waiting for changes to get written to disk before letting the client proceed can be a very expensive solution. Postgres gets around this with the use of the Write Ahead Log, which generates WAL files indicating what changes were made. Creating these files is much faster than performing the actual updates on the underlying files. Thus, Postgres is able to tell the client that the work is "done" when the WAL file has been generated. Should the system crash before the actual changes are made, the WAL files are used to replay the changes. As these WAL files represent a continuous unbroken chain of all changes to the database, they can also be used for Point in Time Recovery - in other words, the WAL files can be used to rewind the database to any single point in time, capturing the state of the database at a specific moment.

Postgres WAL files are exactly 16 MB in size (although this size may be changed at compilation time, it is extremely unheard of to do this). These files primarily sit around taking up disk space and are only accessed when a problem occurs, so being able to compress them is a good one-time exchange of CPU effort for a lower file size. In theory, the time to decompress the files should also be considered, but testing revealed that all the programs decompressed so quickly that it should not be a factor.

WAL files can be compressed in one of two ways. As of Postgres 9.5, the wal_compression feature can be enabled, which instructs Postgres to compress parts of the WAL file in-place when possible, leading to the ability to store much more information per 16 MB WAL file, and thus reducing the total number generated. The second way is to compress with an external program via the free-form archive_command parameter. Here is the canonical example from the Postgres docs, showing use of the gzip program for archive_command:

archive_command = 'gzip < %p > /var/lib/pgsql/archive/%f'

It is widely known that gzip is no longer the best compression option for most tasks, so I endeavored to determine which program was the best at WAL file compression - in terms of final file size versus the overhead to create the file. I also wanted to examine how these fared versus the new wal_compression feature.

To compare the various compression methods, I examined all of the compression programs that are commonly available on a Linux system, are known to be stable, and which perform at least as good as the common utility gzip. The contenders were:

  • gzip - the canonical, default compression utility for many years
  • pigz - parallel version of gzip
  • bzip2 - second only to gzip in popularity, it offers better compression
  • lbzip2 - parallel version of bzip
  • xz - an excellent all-around compression alternative to gzip and bzip
  • pxz - parallel version of xz
  • 7za - excellent compression, but suffers from complex arguments
  • lrzip - compression program targeted at "large files"

For the tests, 100 random WAL files were copied from a busy production Postgres system. Each of those 100 files were compressed nine times by each of the programs above: from the "least compressed" option (e.g. -1) to the "best compressed" option (e.g. -9). The tests were performed on a 16-core system, with plenty of free RAM and nothing else running on the server. Results were gathered by wrapping each command with /usr/bin/time -verbose, which produces a nice breakdown of results. To gather the data, the "Elapsed (wall clock) time" was used, along with size of the compressed file. Here is some sample output of the time command:

  Command being timed: "bzip2 -4 ./0000000100008B91000000A5"
  User time (seconds): 1.65
  System time (seconds): 0.01
  Percent of CPU this job got: 99%
  Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.66
  Average shared text size (kbytes): 0
  Average unshared data size (kbytes): 0
  Average stack size (kbytes): 0
  Average total size (kbytes): 0
  Maximum resident set size (kbytes): 3612
  Average resident set size (kbytes): 0
  Major (requiring I/O) page faults: 0
  Minor (reclaiming a frame) page faults: 938
  Voluntary context switches: 1
  Involuntary context switches: 13
  Swaps: 0
  File system inputs: 0
  File system outputs: 6896
  Socket messages sent: 0
  Socket messages received: 0
  Signals delivered: 0
  Page size (bytes): 4096
  Exit status: 0

The wal_compression feature was tested by creating a new Postgres 9.6 cluster, then running the pgbench program twice to generate WAL files - once with wal_compression enabled, and once with it disabled. Then each of the resulting WAL files was compressed using each of the programs above.

Table 1.
Results of compressing 16 MB WAL files - average for 100 files
CommandWall clock time (s)File size (MB)
gzip -10.2714.927
gzip -20.2924.777
gzip -30.3664.667
gzip -40.3814.486
gzip -50.4924.318
gzip -60.7344.250
gzip -70.9914.235
gzip -82.0424.228
gzip -93.6264.227
CommandWall clock time (s)File size (MB)
bzip2 -11.5403.817
bzip2 -21.5313.688
bzip2 -31.5703.638
bzip2 -41.6253.592
bzip2 -51.6673.587
bzip2 -61.7073.566
bzip2 -71.7313.559
bzip2 -81.7523.557
bzip2 -91.7843.541
CommandWall clock time (s)File size (MB)
xz -10.9623.174
xz -21.1863.066
xz -35.9112.711
xz -46.2922.682
xz -56.6942.666
xz -68.9882.608
xz -79.1942.592
xz -89.1172.596
xz -99.1642.597
Table 2.
Results of compressing 16 MB WAL file - average for 100 files
CommandWall clock time (s)File size (MB)
-l -L1
-l -L2
-l -L3
-l -L4
-l -L5
-l -L6
-l -L7
-l -L8
-l -L9
CommandWall clock time (s)File size (MB)
-z -L1
-z -L2
-z -L3
-z -L4
-z -L5
-z -L6
-z -L7
-z -L8
-z -L9
CommandWall clock time (s)File size (MB)
7za -bd -mx=1
a test.7za
7za -bd -mx=2
a test.7za
7za -bd -mx=3
a test.7za
7za -bd -mx=4
a test.7za
7za -bd -mx=5
a test.7za
7za -bd -mx=6
a test.7za
7za -bd -mx=7
a test.7za
7za -bd -mx=8
a test.7za
7za -bd -mx=9
a test.7za
Table 3.
Results of compressing 16 MB WAL file - average for 100 files
CommandWall clock time (s)File size (MB)
pigz -10.0514.904
pigz -20.0514.755
pigz -30.0514.645
pigz -40.0514.472
pigz -50.0514.304
pigz -60.0604.255
pigz -70.0814.225
pigz -80.1404.212
pigz -90.2514.214
CommandWall clock time (s)File size (MB)
lbzip2 -10.1353.801
lbzip2 -20.1513.664
lbzip2 -30.1513.615
lbzip2 -40.1513.586
lbzip2 -50.1513.562
lbzip2 -60.1513.545
lbzip2 -70.1503.538
lbzip2 -80.1513.524
lbzip2 -90.1503.528
CommandWall clock time (s)File size (MB)
pxz -10.1353.266
pxz -20.1753.095
pxz -31.2442.746
pxz -42.5282.704
pxz -55.1152.679
pxz -69.1162.604
pxz -79.2552.599
pxz -89.2672.595
pxz -99.3552.591
Table 4.
Results of Postgres wal_compression option
ModificationsTotal size of WAL files (MB)
No modifications208.1
wal_compression enabled81.0
xz -28.6
wal_compression enabled PLUS xz -29.4

Table 1 shows some baseline compression values for the three popular programs gzip, bzip2, and xz. Both gzip and bzip2 show little change in the file sizes as the compression strength is raised. However, xz has a relatively large jump when going from -2 to -3, although the time cost increases to an unacceptable 5.9 seconds. As a starting point, something well under one second is desired.

Among those three, xz is the clear winner, shrinking the file to 3.07 MB with a compression argument of -2, and taking 1.2 seconds to do so. Both gzip and bzip2 never even reach this file size, even when using a -9 argument. For that matter, the best compression gzip can ever achieve is 4.23 MB, which the other programs can beat without breakng a sweat.

Table 2 demonstrates two ways of invoking the lrzip program: the -l option (lzo compression - described in the lrzip documentation as "ultra fast") and the -z option (zpaq compression - "extreme compression, extreme slow"). All of those superlatives are supported by the data. The -l option runs extremely fast: even at -L5 the total clock time is still only .39 seconds. Unfortunately, the file size hovers around an undesirable 5.5 MB, no matter what compression level is used. The -z option produces the smallest file of all the programs here (2.48 MB) at a whopping cost of over 30 seconds! Even the smallest compression level (-L1) takes 3.5 seconds to produce a 3.4 MB file. Thus, lrzip is clearly out of the competition.

Compression in action (photo by Eric Armstrong)

The most interesting program is without a doubt 7za. Unlike the others, it is organized around creating archives, and thus doesn't do in-place compression as the others do. Nevertheless, the results are quite good. At the lowest level, it takes a mere 0.13 seconds to produce a 3.18 MB file. As it takes xz 1.19 seconds to produce a nearly equivalent 3.10 MB file, 7za is the clear winner ... if we had only a single core available. :)

It is rare to find a modern server with a single processor, and a crop of compression programs have appeared to support this new paradigm. First up is the amazing pigz, which is a parallel version of gzip. As Table 3 shows, pigz is extraordinarily fast on our 16 core box, taking a mere 0.05 seconds to run at compression level -1, and only 0.25 seconds to run at compression level -9. Sadly, it still suffers from the fact that gzip is simply not good at compressing WAL files, as the smallest size was 4.23 MB. This rules out pigz from consideration.

The bzip2 program has been nipping at the heels of gzip for many years, so naturally it has its own parallel version, known as lbzip2. As Table 3 shows, it is also amazingly fast. Not as fast as pigz, but with a speed of under 0.2 seconds - even at the highest compression level! There is very little variation among the compression levels used, so it is fair to simply state that lbzip2 takes 0.15 seconds to shrink the WAL file to 3.5 MB. A decent entry.

Of course, the xz program has a parallel version as well, known as pxz. Table 3 shows that the times still vary quite a bit, and reach into the 9 second range at higher compression levels, but does very well at -2, taking a mere 0.17 seconds to produce a 3.09 MB file. This is comparable to the previous winner, 7za, which took 0.14 seconds to create a 3.12 MB file.

So the clear winners are 7za and pxz. I gave the edge to pxz, as (1) the file size was slightly smaller at comparable time costs, and (2) the odd syntax for 7za for both compressing and decompressing was annoying compared with the simplicity of "xz -2" and "xz -d".

Now, what about the built-in compression offered by the wal_compression option? As Table 4 shows, the compression for the WAL files I tested went from 208 MB to 81 MB. This is a significant gain, but only equates to compressing a single WAL file to 6.23 MB, which is a poor showing when compared to the compression programs above. It should be pointed out that the wal_compression option is sensitive to your workload, so you might see reports of greater and lesser compressions.

Of interest is that the WAL files generated by turning on wal_compression are capable of being further compressed by the archive_command option, and doing a pretty good job of it as well - going from 81 MB of WAL files to 9.4 MB of WAL files. However, using just xz in the archive_command without wal_compression on still yielded a smaller overall size, and means less CPU because the data is only compressed once.

It should be pointed out that wal_compression has other advantages, and that comparing it to archive_command is not a fair comparison, but this article was primarily about the best compression option for storing WAL files long-term.

Thus, the overall winner is "pxz -2", followed closely by 7za and its bulky arguments, with honorable mention given to wal_compression. Your particular requirements might guide you to another conclusion, but hopefully nobody shall simply default to using gzip anymore.

Thanks to my colleague Lele for encouraging me to try pxz, after I was already happy with xz. Thanks to the authors of xz, for providing an amazing program that has an incredible number of knobs for tweaking. And a final thanks to the authors of the wal_compression feature, which is a pretty nifty trick!


published by (Greg Sabino Mullane) on 2016-12-07 13:38:00 in the "postgres" category

Anytime you run a query in Postgres, it needs to compile your SQL into a lower-level plan explaining how exactly to retrieve the data. The more it knows about the tables involved, the smarter the planner can be. To get that information, it gathers statistics about the tables and stores them, predictably enough, in the system table known as pg_statistic. The SQL command ANALYZE is responsible for populating that table. It can be done per-cluster, per-database, per-table, or even per-column. One major pain about analyze is that every table *must* be analyzed after a major upgrade. Whether you upgrade via pg_dump, pg_upgrade, Bucardo, or some other means, the pg_statistic table is not copied over and the database starts as a clean slate. Running ANALYZE is thus the first important post-upgrade step.

Unfortunately, analyze can be painfully slow. Slow enough that the default analyze methods sometimes take longer that the entire rest of the upgrade! Although this article will focus on the pg_upgrade program in its examples, the lessons may be applied to any upgrade method. The short version of the lessons is: run vacuumdb in parallel, control the stages yourself, and make sure you handle any custom per-column statistics.

Before digging into the solution in more detail, let's see why all of this is needed. Doesn't pg_upgrade allow for super-fast Postgres major version upgrades, including the system catalogs? It does, with the notable exception of the pg_statistics table. The nominal reason for not copying the data is that the table format may change from version to version. The real reason is that nobody has bothered to write the conversion logic yet, for pg_upgrade could certainly copy the pg_statistics information: the table has not changed for many years.

At some point, a DBA will wonder if it is possible to simply copy the pg_statistic table from one database to another manually. Alas, it contains columns of the type "anyarray", which means it cannot be dumped and restored:

$ pg_dump -t pg_statistic --data-only | psql -q
ERROR:  cannot accept a value of type anyarray
CONTEXT:  COPY pg_statistic, line 1, column stavalues1: "{"{i,v}","{v}","{i,o,o}","{i,o,o,o}","{i,i,i,v,o,o,o}","{i,i,o,o}","{i,o}","{o,o,o}","{o,o,o,o}","{o..."

I keep many different versions of Postgres running on my laptop, and use a simple port naming scheme to keep them straight. It's simple enough to use pg_dump and sed to confirm that the structure of the pg_statistic table has not changed from version 9.2 until 9.6:

$ for z in 840 900 910 920 930 940 950; do echo -n $z: ; diff -sq <(pg_dump 
>  --schema-only -p 5$z -t pg_statistic | sed -n '/CREATE TABLE/,/^$/p') <(pg_dump 
>  --schema-only -p 5960 -t pg_statistic | sed -n '/CREATE TABLE/,/^$/p'); done
840:Files /dev/fd/63 and /dev/fd/62 differ
900:Files /dev/fd/63 and /dev/fd/62 differ
910:Files /dev/fd/63 and /dev/fd/62 differ
920:Files /dev/fd/63 and /dev/fd/62 are identical
930:Files /dev/fd/63 and /dev/fd/62 are identical
940:Files /dev/fd/63 and /dev/fd/62 are identical
950:Files /dev/fd/63 and /dev/fd/62 are identical

Of course, the same table structure does not promise that the backend of different versions uses them in the same way (spoiler: they do), but that should be something pg_upgrade can handle by itself. Even if the table structure did change, pg_upgrade could be taught to migrate the information from one format to another (its raison d'être). If the new statistics format take a long time to generate, perhaps pg_upgrade could leisurely generate a one-time table on the old database holding the new format, then copy that over as part of the upgrade.

Since pg_upgrade currently does none of those things and omits upgrading the pg_statistics table, the following message appears after pg_upgrade has been run:

Upgrade Complete
Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:

Looking at the script in question yields:


echo 'This script will generate minimal optimizer statistics rapidly'
echo 'so your system is usable, and then gather statistics twice more'
echo 'with increasing accuracy.  When it is done, your system will'
echo 'have the default level of optimizer statistics.'

echo 'If you have used ALTER TABLE to modify the statistics target for'
echo 'any tables, you might want to remove them and restore them after'
echo 'running this script because they will delay fast statistics generation.'

echo 'If you would like default statistics as quickly as possible, cancel'
echo 'this script and run:'
echo '    vacuumdb --all --analyze-only'

vacuumdb --all --analyze-in-stages

echo 'Done'

There are many problems in simply running this script. Not only is it going to iterate through each database one-by-one, but it will also process tables one-by-one within each database! As the script states, it is also extremely inefficient if you have any per-column statistics targets. Another issue with the --analyze-in-stages option is that the stages are hard-coded (at "1", "10", and "default"). Additionally, there is no way to easily know when a stage has finished other than watching the command output. Happily, all of these problems can be fairly easily overcome; let's create a sample database to demonstrate.

$ initdb --data-checksums testdb
$ echo port=5555 >> testdb/postgresql.conf 
$ pg_ctl start -D testdb
$ createdb -p 1900 alpha
$ pgbench alpha -p 1900 -i -s 2
$ for i in `seq 1 100`; do echo create table pgb$i AS SELECT * FROM pgbench_accounts;; done | psql -p 1900 alpha

Now we can run some tests to see the effect of the --jobs option. Graphing out the times shows some big wins and nice scaling. Here are the results of running vacuumdb alpha --analyze-only with various values of --jobs:

Simple graph showing time decreasing as number of jobs increases

The slope of your graph will be determined by how many expensive-to-analyze tables you have. As a rule of thumb, however, you may as well set --jobs to a high number. Anything over your max_connections setting is pointless, but don't be afraid to jack it up to at least a hundred. Experiment on your test box, of course, to find the sweet spot for your system. Note that the --jobs argument will not work on old versions of Postgres. For those cases, I usually whip up a Perl script using Parallel::ForkManager to get the job done. Thanks to Dilip Kumar for adding the --jobs option to vacuumdb!

The next problem to conquer is the use of custom statistics. Postgres' ANALYZE uses the default_statistics_target setting to determine how many rows to sample (the default value in modern versions of Postgres is 100). However, as the name suggests, this is only the default - you may also set a specific target at the column level. Unfortunately, there is no way to disable this quickly, which means that vacuumdb will always use the custom value. This is not what you want, especially if you are using the --analyze-in-stages option, as it will happily (and needlessly!) recalculate columns with specific targets three times. As custom stats are usually set much higher than the default target, this can be a very expensive option:

$ ## Create a largish table:
$ psql -qc 'create unlogged table aztest as select * from pgbench_accounts'
$ for i in {1..5}; do psql -qc 'insert into aztest select * from aztest'; done
$ psql -tc "select pg_size_pretty(pg_relation_size('aztest'))"
820 MB
$ psql -qc 'timing' -c 'analyze aztest'
Time: 590.820 ms  ## Actually best of 10: never get figures from a single run!
$ psql -c 'alter table aztest alter column aid set statistics 1000'
$ psql -qc 'timing' -c 'analyze aztest'
Time: 2324.017 ms ## as before, this is the fastest of 10 runs

As you can see, even a single column can change the analyze duration drastically. What can we do about this? The --analyze-in-stages is still a useful feature, so we want to set those columns back to a default value. While one could reset the stats and then set them again on each column via a bunch of ALTER TABLE calls, I find it easier to simply update the system catalogs directly. Specifically, the pg_attribute table contains a attstattarget column which has a positive value when a custom target is set. In our example above, the value of attstattarget for the aid column would be 1000. Here is a quick recipe to save the custom statistics values, set them to the default (-1), and then restore them all once the database-wide analyzing is complete:

## Save the values away, then reset to default:
CREATE TABLE custom_targets AS SELECT attrelid, attname, attnum, attstattarget
  FROM pg_atttribute WHERE attstattarget > 0;
UPDATE pg_attribute SET attstattarget = -1 WHERE attstattarget > 0;

## Safely run your database-wide analyze now
## All columns will use default_statistics_target

## Restore the values:
UPDATE pg_attribute a SET attstattarget = c.attstattarget
  FROM custom_targets c WHERE a.attrelid = c.attrelid
  AND a.attnum = c.attnum AND a.attname = c.attname;

## Bonus query: run all custom target columns in parallel:
SELECT 'vacuumdb --analyze-only -e -j 100 ' || 
  string_agg(format('-t "%I(%I)" ', attrelid::regclass, attname), NULL)
FROM pg_attribute WHERE attstattarget > 0;

As to the problems of not being able to pick the stage targets for --analyze-in-stages, and not being able to know when a stage has finished, the solution is to simply do it yourself. For example, to run all databases in parallel with a target of "2", you would need to change the default_statistics_target at the database level (via ALTER DATABASE), or at the cluster level (via ALTER SYSTEM). Then invoke vacuumdb, and reset the value:

$ psql -qc 'alter system set default_statistics_target = 2' -qc 'select pg_reload_conf()'
$ vacuumdb --all --analyze-only --jobs 100
$ psql -qc 'alter system reset default_statistics_target' -qc 'select pg_reload_conf()'

In summary, don't trust the given vacuumdb suggestions for a post-upgrade analyze. Instead, remove any per-column statistics, run it in parallel, and do whatever stages make sense for you.


published by (Josh Williams) on 2016-12-01 00:55:00 in the "postgres" category
We needed a way to suppress specific write commands on a Postgres streaming replica. The replica was set up for a DR configuration, with the applications able to be brought up into full service at a moment's notice. But since it's a hot standby, we'd like to still allow the applications to be used in a working but read-only state.

One of the applications on this database is MediaWiki, which worked great in this configuration. But a couple of the other apps have the classic behavior of updating its user object's "last login" field in one form or another when someone authenticates, which would cause the login process to fail entirely.

Of course we want updates to fail, up until that point when (knock on wood) the master server is declared down for the count and the official fail-over happens. Except for the one command that executes on login.

We don't really care about the "last login" type field -- the data is available through logs and other means. The affected apps could probably all be monkey patched to work around that part of the process. But we had a couple different apps doing this, and that adds a maintenance burden for each. And if we could figure out how to make it work at the database level then it'd work for all of them, plus anything else that might pop up.

The first thing we looked at was writing a trigger to intercept the commands, but triggers don't execute on a hot standby replica so that was out pretty quickly. The next hypothesis was that we could write a foreign data wrapper that'd just absorb the writes, or even just use postgres_fdw to route the commands to a local writable database that's more or less a throw-away data sink. But to our surprise, even writes to foreign tables get rejected on a hot standby. I'm slightly tempted to dig in and see what it'd take to enable that.

The third time was the charm: rules. Rules hook in pretty far down into the query parser, and they can be notoriously tricky to work with. But since they're embedded pretty deep, down to the point where views rely on them they're obeyed even on a replica.

So the technique was this: On the master (... obviously) we set up a separate schema, inside which a view was created with the same name as the target table and which had certain commands suppressed:


CREATE VIEW replica.users AS SELECT * FROM public.users;

CREATE RULE users_disable_update AS ON UPDATE TO replica.users DO INSTEAD NOTHING;

Plus any permission adjustments the app user might need. On the master server this schema and view are pretty much ignored, as the application user just uses the default search path. But on the replica, we adjust the default search path in postgresql.conf so that it applies to just that server:

search_path = '"$user",replica,public'
app@app:5432=> UPDATE "auth_user" SET "last_login" = now() WHERE "auth_user"."id" = 13;

It doesn't quite work everywhere, sadly! Notice the "UPDATE 0"? We found Django actually checks that, and panics with an error to the user when it tries to change something and the row count it gets back is different than what it expects.

Another caveat is that if the target table's schema changes, the view won't automatically follow. Presumably your user table doesn't receive changes all that often, but if you're applying this technique to something else, that might not be the case. Something to be aware of!

published by (Greg Sabino Mullane) on 2016-10-26 19:09:00 in the "postgres" category

Postgres has a wonderfully helpful (but often overlooked) feature called the connection service file (its documentation is quite sparse). In a nutshell, it defines connection aliases you can use from any client. These connections are given simple names, which then map behind the scenes to specific connection parameters, such as host name, Postgres port, username, database name, and many others. This can be an extraordinarily useful feature to have.

The connection service file is named pg_service.conf and is setup in a known location. The entries inside are in the common "INI file" format: a named section, followed by its related entries below it, one per line. To access a named section, just use the service=name string in your application.

## Find the file to access by doing:
$ echo `pg_config --sysconfdir`/pg_service.conf
## Edit the file and add a sections that look like this:

## Now you can access this database via psql:
$ psql service=foobar

## Or in your Perl code:
my $dbh = DBI->connect('dbi:Pg:service=foobar');

## Other libpq based clients are the same. JDBC, you are out of luck!

So what makes this feature awesome? First, it can save you from extra typing. No more trying to remember long hostnames (or copy and paste them). Second, it is better than a local shell alias, as the service file can be made globally available to all users. It also works similar to DNS, in that it insulates you from the details of your connections. Your hostname has changed because of a failover? No problem, just edit the one file, and no clients need to change a thing.

As seen above, the format of the file is simple: a named section, followed by connection parameters in a name=value format. Among the connection parameters one may use, the most common and useful are host, port, user, and dbname. Although you can set a password, I recommend against it, as that belongs in the more secure, per-user .pgpass file.

The complete list of what may be set can be found in the middle of the database connection documentation page. Most of them will seldom, if ever, be used in a connection service file.

The connection service file is not just limited to basic connections. You can have sections that only differ by user, for example, or in their SSL requirements, making it easy to switch things around by a simple change in the service name. It's also handy for pgbouncer connections, which typically run on non-standard ports. Be creative in your service names, and keep them distinct from each other to avoid fat fingering the wrong one. Comments are allowed and highly encouraged. Here is a slightly edited service file that was recently created while helping a client use Bucardo to migrate a Postgres database from Heroku to RDS:

## Bucardo source: Heroku

## Bucardo target: RDS

## Test database on RDS

## Hot standby used for schema population

You may notice above that "connect_timeout" is repeated in each section. Currently, there is no way to set a parameter that applies to all sections, but it's a very minor problem. I also usually set the environment variable PGCONNECT_TIMEOUT to 10 in by .bashrc, but putting it in the pg_service.conf file ensures it is always set regardless of what user I am.

One of the trickier parts of using a service file can be figuring out where the file should be located! Postgres will check for a local service file (named $USER/.pg_service.conf) and then for a global file. I prefer to always use the global file, as it allows you to switch users with ease and maintain the same aliases. By default, the location of the global Postgres service file is /usr/local/etc/postgresql/pg_service.conf, but in most cases this is not where you will find it, as many distributions specify a different location. Although you can override the location of the file with the environment variable PGSERVICEFILE and the directory holding the pg_service.conf file with the PGSYSCONFIDIR environment variable, I do not like relying on those. One less thing to worry about by simply using the global file.

The location of the global pg_service.conf file can be found by using the pg_config program and looking for the SYSCONFDIR entry. Annoyingly, pg_config is not installed by default on many systems, as it is considered part of the "development" packages (which may be named postgresql-devel, libpq-devel, or libpq-dev). While using pg_config is the best solution, there are times it cannot be installed (e.g. working on an important production box, or simply do not have root). While you can probably discover the right location through some simple investigation and trial-and-error, where is the fun in that? Here are two other methods to determine the location using nothing but psql and some standard Unix tools.

When you invoke psql with a request for a service file entry, it has to look for the service files. We can use this information to quickly find the expected location of the global pg_service.conf file. If you have the strace program installed, just run psql through strace, grep for "pg_service", and you should see two stat() calls pop up: one for the per-user service file, and one for the global service file we are looking for:

$ strace psql service=foobar 2>&1 | grep service.conf
stat("/home/greg/.pg_service.conf", 0x3526366F6637) = -1 ENOENT (No such file or directory)
stat("/var/opt/etc/postgres/pg_service.conf", 0x676F746F3131) = -1 ENOENT (No such file or directory)

What if strace is not installed? Well, perhaps gdb (the GNU debugger) can help us out:

$ gdb -q --args psql service=foobar
Reading symbols from psql...(no debugging symbols found)...done.
(gdb) start
Temporary breakpoint 1 at 0x435356
Starting program: /usr/local/bin/psql service=foobar
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/".

Temporary breakpoint 1, 0x4452474E4B4C5253 in main ()
(gdb) catch syscall stat
Catchpoint 2 (syscall 'stat' [4])
(gdb) c

Catchpoint 2 (call to syscall stat), 0x216c6f65736a6f72 in __GI___xstat (vers=, name=0x616d7061756c "/usr/local/bin/psql", buf=0x617274687572)
    at ../sysdeps/unix/sysv/linux/wordsize-64/xstat.c:35
35      return INLINE_SYSCALL (stat, 2, name, buf);
(gdb) c 4
Will ignore next 3 crossings of breakpoint 2.  Continuing.

Catchpoint 2 (call to syscall stat), 0x37302B4C49454245 in __GI___xstat (vers=, name=0x53544F442B4C "/var/opt/etc/postgres/pg_service.conf", buf=0x494543485445)
    at ../sysdeps/unix/sysv/linux/wordsize-64/xstat.c:35
35      return INLINE_SYSCALL (stat, 2, name, buf);
(gdb) quit

The use of a connection service file can be a nice addition to your tool chest, especially if you find yourself connecting from many different accounts, or if you just want to abstract away all those long, boring host names!


published by (Greg Sabino Mullane) on 2016-09-19 17:05:00 in the "postgres" category

Sometimes, the elephant gets hurt - inducing database errors! Data corruption is a fact of life in working with computers, and Postgres is not immune. With the addition of the "data checksums" feature, detecting such corruption is now much easier. But detection is not enough - what happens after the corruption is detected? What if Postgres could fix the problem all by itself - what if we could give the elephant a mutant healing power?!?

Now we can. I wrote an extension named pg_healer that does just that - detects corruption issues, and automatically repairs them. Let's see how it works with a demonstration. For this, we will be purposefully corrupting the "pgbench_branches" table, part of the venerable pgbench utility.

For the initial setup, we will create a new Postgres cluster and install the pgbench schema. The all-important checksum feature needs to be enabled when we initdb, and we will use a non-standard port for testing:

$ initdb --data-checksums dojo
The files belonging to this database system will be owned by user "greg".
Data page checksums are enabled.

creating directory dojo ... ok
creating subdirectories ... ok
$ echo port=9999 >> dojo/postgresql.conf
$ pg_ctl start -D dojo -l log.dojo.txt
server starting
$ createdb -p 9999 $USER
$ pgbench -p 9999 -i
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 100000 tuples (100%) done (elapsed 0.35 s, remaining 0.00 s)
set primary keys...

Next, we install the pg_healer extension. As it needs to access some low-level hooks, we need to load it on startup, by adding a line to the postgresql.conf file:

$ git clone git://
Cloning into 'pg_healer'...
$ cd pg_healer
$ make install
gcc -Wall -Wmissing-prototypes ... -c -o pg_healer.o pg_healer.c
gcc -Wall -Wmissing-prototypes ... -shared -o pg_healer.o
$ echo "shared_preload_libraries = 'pg_healer'" >> dojo/postgresql.conf
$ pg_ctl restart -D dojo -l log.dojo.txt
waiting for server to shut down.... done
server stopped
server starting
## Make sure the extension has loaded cleanly.
## If it did not, the log file would complain
$ tail -2 log.dojo.txt
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

Now for the fun part. We want to purposefully corrupt the file containing the data for the pgbench_branches file, in simulation of a failing hard drive or other really serious problem. The type of problem that normally causes the DBA to get paged in the middle of the night. Before we do that, we want to take a peek at the contents of that table, and then find out which actual disk files contain the table:

$ psql -p 9999 -c "select * from pgbench_branches"
 bid | bbalance | filler 
   1 |        0 | 
(1 row)

$ psql -p 9999 -Atc "select format('%s/%s',

## That file is too cumbersome to keep typing out, so:
$ ln -s /home/greg/pg_healer/dojo/base/16384/198461 myrelfile

Let's throw a deadly shuriken right into the middle of it!

## Here is what the file looks like in its original uncorrupted form
## (checksum is in red):
$ xxd -a -g1 -u myrelfile
00000000: 00 00 00 00 30 69 BC 37 74 66 04 00 1C 00 E0 1F  ....0i.7tf......
00000010: 00 20 04 20 00 00 00 00 E0 9F 40 00 00 00 00 00  . . ......@.....
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001fe0: F7 0B 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001ff0: 01 00 03 00 01 09 18 03 01 00 00 00 00 00 00 00  ................

## Good ol' dd is the right tool for the job here:
$ echo -n "Shuriken!" | dd conv=notrunc oflag=seek_bytes seek=4000 bs=9 count=1 of=myrelfile
1+0 records in
1+0 records out
9 bytes (9 B) copied, 0.000156565 s, 57.5 kB/s

## Take a peek inside the file to make sure the shuriken got embedded deeply:
$ xxd -a -g1 -u myrelfile
00000000: 00 00 00 00 30 69 BC 37 74 66 04 00 1C 00 E0 1F  ....0i.7tf......
00000010: 00 20 04 20 00 00 00 00 E0 9F 40 00 00 00 00 00  . . ......@.....
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000fa0: 53 68 75 72 69 6B 65 6E 21 00 00 00 00 00 00 00  Shuriken!.......
00000fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001fe0: F7 0B 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001ff0: 01 00 03 00 01 09 18 03 01 00 00 00 00 00 00 00  ................

These shurikens are not so deadly, but quite yummy!
(photograph by kahvikisu)

Now that we've messed up the file, watch closely at what happens when we try to read from it. We are going to do this three times. The first time, the table will still be in the shared buffer cache, and thus will show no error. The second time, the table will be read from the disk and throw an error. At this point, pg_healer will see the error and repair it. The final read will pull from the completely healed table:

$ psql -p 9999 -c "select * from pgbench_branches"
 bid | bbalance | filler 
   1 |        0 | 
(1 row)

## This will force the table out of shared_buffers, so that the next
## time it is accessed, Postgres must read from the disk:
$ psql -p 9999 -qtc "select pg_healer_remove_from_buffer('pgbench_branches')"

$ psql -p 9999 -c "select * from pgbench_branches"
WARNING:  page verification failed, calculated checksum 9478 but expected 26228
INFO:  File has been healed: base/16384/198461 (intrinsic healing)
ERROR:  invalid page in block 0 of relation base/16384/198461

## Mutant healing power was activated. Observe:
$ psql -p 9999 -c "select * from pgbench_accounts"
 bid | bbalance | filler 
   1 |        0 | 
(1 row)

The corruption we created before changed the "free space" of the Postgres "page" structure. There are multiple ways pg_healer can fix things: this demonstrates one of the "intrinsic" fixes, which require no external knowledge to fix. Corruption can occur anywhere on the page, of course, including inside your data (as opposed to the meta-data or free space). One of the methods of fixing this is for pg_healer to use another copy of the table to try and repair the original table.

While eventually pg_healer will be able to reach out to replicas for a copy of the (non-corrupted) table data it needs, a simpler method is to simply create a good copy inside the data directory. There is a helper function that does just that, by copying the important files to a new directory. Details on how this is kept refreshed will be covered later; for now, let's see it in action and observe how it can help Postgres heal itself from more serious corruption problems:

$ psql -p 9999 -c 'create extension pg_healer'
$ psql -p 9999 -qc 'checkpoint'
$ psql -p 9999 -c 'select pg_healer_cauldron()'

Rather than free space, let's corrupt something a little more important: the line pointers, which indicate where, inside the page, that each tuple (aka table row) is located. Extremely critical information, that is about to get blown away with another deadly shuriken!

$ echo -n "Shuriken!" | dd conv=notrunc oflag=seek_bytes seek=20 bs=9 count=1 of=myrelfile
1+0 records in
1+0 records out
9 bytes (9 B) copied, 9.3577e-05 s, 96.2 kB/s
$ xxd -a -g1 -u myrelfile
00000000: 00 00 00 00 30 69 BC 37 74 66 04 00 1C 00 E0 1F  ....0i.7tf......
00000010: 00 20 04 20 53 68 75 72 69 6B 65 6E 21 00 00 00  . . Shuriken!...
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001fe0: F7 0B 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001ff0: 01 00 03 00 01 09 18 03 01 00 00 00 00 00 00 00  ................

$ psql -p 9999 -qtc "select pg_healer_remove_from_buffer('pgbench_branches')"

$ psql -p 9999 -c "select * from pgbench_branches"
WARNING:  page verification failed, calculated checksum 8393 but expected 26228
INFO:  File has been healed: base/16384/198461 (external checksum match)
ERROR:  invalid page in block 0 of relation base/16384/198461

$ psql -p 9999 -c "select * from pgbench_branches"
 bid | bbalance | filler 
   1 |        0 | 
(1 row)

## Has the shuriken really been removed?
$ xxd -a -g1 -u myrelfile
00000000: 00 00 00 00 30 69 BC 37 74 66 04 00 1C 00 E0 1F  ....0i.7tf......
00000010: 00 20 04 20 00 00 00 00 E0 9F 40 00 00 00 00 00  . . ......@.....
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001fe0: F7 0B 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001ff0: 01 00 03 00 01 09 18 03 01 00 00 00 00 00 00 00  ................

Once again, pg_healer has repaired the file. This time, however, it reached out to a version of the file outside the data directory, copied the old page data to the new page data, and then used the checksum to confirm that the changes were correct. This method only works, however, if the original file and the copy have the same checksum - which means that no changes have been made since the copy was made via pg_healer_cauldron(). As this is not always possible, there is a third method pg_healer can use, which is to examine things row by row and to try and repair the damage.

For this final demo, we are going to change the table by adding a new row, which ensures that the checksums against the copy will no longer match. After that, we are going to add some corruption to one of the table rows (aka tuples), and see if pg_healer is able to repair the table:

$ psql -p 9999 -qtc 'insert into pgbench_branches values (2,12345)'
$ psql -p 9999 -qc 'checkpoint'

## Throw a shuriken right into an active row!
$ echo -n "Shuriken!" | dd conv=notrunc oflag=seek_bytes seek=8180 bs=9 count=1 of=myrelfile
1+0 records in
1+0 records out
9 bytes (9 B) copied, 0.000110317 s, 81.6 kB/s

## If you look close, you will notice the checksum (in red) has also changed:
$ xxd -a -g1 -u myrelfile
00000000: 00 00 00 00 70 B0 8E 38 A4 8E 00 00 20 00 C0 1F  ....p..8.... ...
00000010: 00 20 04 20 00 00 00 00 E0 9F 40 00 C0 9F 40 00  . . ......@...@.
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001fc0: 05 0C 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001fd0: 02 00 03 00 01 08 18 03 02 00 00 00 39 30 00 00  ............90..
00001fe0: F7 0B 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00001ff0: 01 00 03 00 53 68 75 72 69 6B 65 6E 21 00 00 00  ....Shuriken!...

$ psql -p 9999 -qtc "select pg_healer_remove_from_buffer('pgbench_branches')"

$ psql -p 9999 -c "select * from pgbench_branches"
WARNING:  page verification failed, calculated checksum 56115 but expected 36516
INFO:  File has been healed: base/16384/198461 (external tuple healing)
ERROR:  invalid page in block 0 of relation base/16384/198461

$ psql -p 9999 -c "select * from pgbench_branches"
 bid | bbalance | filler 
   1 |        0 | 
   2 |    12345 | 
(2 rows)

There are still some rough edges, but for a proof of concept it works quite nicely. While reacting to corruption errors as they appear is nice, in the future I would like it to be more proactive, and run as a background process that scans the database for any problems and fixes them. Ideally, it should be able to handle a wider class of table corruption problems, as well as problems in indexes, free space maps, system catalogs, etc. Please jump in and lend a hand - the project is on github as pg_healer.

Data corruption is a fact of life DBAs must confront, be it from failing hard drives, cosmic rays, or other reason. While the detection of such errors was greatly improved in Postgres 9.3 with the --data-checksums argument to initdb (which ought to default on!), it's time to not just detect, but heal!


published by (Greg Sabino Mullane) on 2016-07-14 03:46:00 in the "postgres" category

Constraints in Postgres are very powerful and versatile: not only are foreign keys, primary keys, and column uniqueness done internally via constraints, but you may create your own quite easily (at both the column and table level). Most of the time constraints are simply set and forget, but there is one time constraints may become a problem: copying the database using the pg_dump program.

The issue is that constraints are usually added *before* the data is copied to the new table via the COPY command. This means the constraint fires for each added row, to make sure that the row passes the conditions of the constraint. If the data is not valid, however, the COPY will fail, and you will not be able to load the output of your pg_dump into a new database. Further, there may be a non-trivial performance hit doing all that validation. Preventing the constraint from firing may provide a significant speed boost, especially for very large tables with non-trivial constraints.

Let's explore one way to work around the problem of pg_dump failing to work because some of the data is not valid according to the logic of the constraints. While it would be quicker to make some of these changes on the production system itself, corporate inertia, red tape, and the usual DBA paranoia means a better way is to modify a copy of the database instead.

For this example, we will first create a sample "production" database and give it a simple constraint. This constraint is based on a function, to both emulate a specific real-world example we came across for a client recently, and to allow us to easily create a database in which the data is invalid with regards to the constraint:

dropdb test_prod; createdb test_prod
pgbench test_prod -i -n
creating tables...
100000 of 100000 tuples (100%) done (elapsed 0.82 s, remaining 0.00 s)
set primary keys...
psql test_prod -c 'create function valid_account(int) returns bool language sql immutable as $$ SELECT $1 > 0$$;'
psql test_prod -c 'alter table pgbench_accounts add constraint good_aid check ( valid_account(aid) )'

Note that the constraint was added without any problem, as all of the values in the aid column satisfy the function, as each one is greater than zero. Let's tweak the function, such that it no longer represents a valid, up to date constraint on the table in question:

## Verify that the constraint is working - we should get an error:
psql test_prod -c 'update pgbench_accounts set aid = -1 where aid = 1'
ERROR:  new row for relation "pgbench_accounts" violates check constraint "good_aid"
DETAIL:  Failing row contains (-1, 1, 0,                                         ...).

## Modify the function to disallow account ids under 100. No error is produced!
psql test_prod -c 'create or replace function valid_account(int) returns bool language sql volatile as $$ SELECT $1 > 99$$'

## The error is tripped only when we violate it afresh:
psql test_prod -c 'update pgbench_accounts SET aid=125 WHERE aid=125'
psql test_prod -c 'update pgbench_accounts SET aid=88 WHERE aid=88'
ERROR:  new row for relation "pgbench_accounts" violates check constraint "good_aid"
DETAIL:  Failing row contains (88, 1, 0,                                         ...).

The volatility was changed from IMMUTABLE to VOLATILE simply to demonstrate that a function called by a constraint is not bound to any particular volatility, although it *should* always be IMMUTABLE. In this example, it is a moot point, as our function can be immutable and still be "invalid" for some rows in the table. Owing to our function changing its logic, we now have a situation in which a regular pg_dump cannot be done:

dropdb test_upgraded; createdb test_upgraded
pg_dump test_prod | psql test_upgraded -q
ERROR:  new row for relation "pgbench_accounts" violates check constraint "good_aid"
DETAIL:  Failing row contains (1, 1, 0,                                          ...).
CONTEXT:  COPY pgbench_accounts, line 1: "1             1   0          "
## Ruh roh!

Time for a workaround. When a constraint is created, it may be declared as NOT VALID, which simply means that it makes no promises about the *existing* data in the table, but will start constraining any data changed from that point forward. Of particular importance is the fact that pg_dump can dump things into three sections, "pre-data", "data", and "post-data". When a normal constraint is dumped, it will go into the pre-data section, and cause the problems seen above when the data is loaded. However, a constraint that has been declared NOT VALID will appear in the post-data section, which will allow the data to load, as it will not be declared until after the "data" section has been loaded in. Thus, our workaround will be to move constraints from the pre-data to the post-data section. First, let's confirm the state of things by making some dumps from the production database:

pg_dump test_prod --section=pre-data -x -f test_prod.pre.sql
pg_dump test_prod --section=post-data -x -f
## Confirm that the constraint is in the "pre" section:
grep good_aid test*sql
test_prod.pre.sql:    CONSTRAINT good_aid CHECK (valid_account(aid))

There are a few ways around this constraint issue, but here is one that I like as it makes no changes at all to production, and produces valid SQL files that may be used over and over.

dropdb test_upgraded; createdb test_upgraded
## Note that --schema-only is basically the combination of pre-data and post-data
pg_dump test_prod --schema-only | psql test_upgraded -q
## Save a copy so we can restore these to the way we found them later:
psql test_upgraded -c "select format('update pg_constraint set convalidated=true where conname=%L and connamespace::regnamespace::text=%L;', 
  conname, nspname) from pg_constraint c join pg_namespace n on (n.oid=c.connamespace) 
  where contype ='c' and convalidated" -t -o restore_constraints.sql
## Yes, we are updating the system catalogs. Don't Panic!
psql test_upgraded -c "update pg_constraint set convalidated=false where contype='c' and convalidated"
## Why 3? The information_schema "schema" has two harmless constraints
pg_dump test_upgraded --section=pre-data -x -o test_upgraded.pre.sql
pg_dump test_upgraded --section=post-data -x -o
## Verify that the constraint has been moved to the "post" section:
grep good test*sql
test_prod.pre.sql:    CONSTRAINT good_aid CHECK (valid_account(aid)) Name: good_aid; Type: CHECK CONSTRAINT; Schema: public; Owner: greg    ADD CONSTRAINT good_aid CHECK (valid_account(aid)) NOT VALID;
## Two diffs to show the inline (pre) versus ALTER TABLE (post) constraint creations:
$ diff -u1 test_prod.pre.sql test_upgraded.pre.sql 
--- test_prod.pre.sql        2016-07-04 00:10:06.676766984 -0400
+++ test_upgraded.pre.sql    2016-07-04 00:11:07.978728477 -0400
@@ -54,4 +54,3 @@
     abalance integer,
-    filler character(84),
-    CONSTRAINT good_aid CHECK (valid_account(aid))
+    filler character(84)

$ diff -u1 
---        2016-07-04 00:11:48.683838577 -0400
+++    2016-07-04 00:11.57.265797869 -0400
@@ -17,2 +17,10 @@
+-- Name: good_aid; Type: CHECK CONSTRAINT; Schema: public; Owner: greg
+ALTER TABLE pgbench_accounts
+    ADD CONSTRAINT good_aid CHECK (valid_account(aid)) NOT VALID;
 SET default_tablespace = '';

Now we can simply sandwich our data load between the new pre and post files, and avoid having the constraints interfere with the data load portion at all:

dropdb test_upgraded; createdb test_upgraded
psql test_upgraded -q -f test_upgraded.pre.sql
pg_dump test_prod --section=data | psql test_upgraded -q
psql test_upgraded -q -f
## As the final touch, make all the constraints we changed exactly how each were before:
psql test_upgraded -f restore_constraints.sql

A final sanity check is always a good idea, to make sure the two databases are identical, despite our system catalog tweaking:

diff -s <(pg_dump test_prod) <(pg_dump test_upgraded)
Files /dev/fd/63 and /dev/fd/62 are identical

Although we declared a goal of having the upgraded database match production as closely as possible, you can always not apply that final restore_constraints.sql file and leave the constraints as NOT VALID, which is a better reflection of the reality of things. It also means you will not have to go through this rigmarole again, as those constraints shall forevermore be put into the post-data section when doing a pg_dump (unless someone runs the ALTER TABLE ... VALIDATE CONSTRAINT ... command!).

While there is no direct way to disable constraints when loading data, using this pre-data to post-data trick can not only boost data load times, but get you out of a potential jam when your data is invalid!


published by (Greg Sabino Mullane) on 2016-06-13 20:47:00 in the "postgres" category

(A Unicode rabbit face 🐰 will never be as cute
as this real bunny. Photo by Wade Simmons)

One of our clients recently reached out to us for help in upgrading their Postgres database. The use of the pg_upgrade program was not an option, primarily because the client was also taking the opportunity to change from their SQL_ASCII encoding to UTF-8. (If any of your databases, gentle reader, are still SQL_ASCII, please do the same!). Naturally, we also took advantage of the lack of pg_upgrade to enable the use of data checksums, another action we highly recommend. Although there were plenty of wrinkles, and stories to be told about this migration/upgrade, I wanted to focus on one particular problem we had: how to detect if a table has changed.

We needed to know if any applications were modifying certain tables because the speed of the migration was very important. If we could assert that no changes were made, there were some shortcuts available that would greatly speed things up. Initial testing showed that the migration was taking over eight hours, a time unacceptable to the client (no worries, we eventually reduced the time to under an hour!).

Looking closer, we found that over half that time was spent converting a single small (50MB) table from SQL_ASCII to UTF-8. How this conversion was performed is a story for another day, but suffice to say the table had some really, really messy bytes inside of it; the conversion program had to struggle mightily. When you are converting a database to a new encoding, it is imperative to examine every byte and make sure it gets changed to a format that Postgres will accept as valid UTF-8, or the entire table import will fail with an error similar to this:

ERROR:  invalid byte sequence for encoding "UTF8": 0xf4 0xa5 0xa3 0xa5

Looking closer at the data in the table showed that it might - just might! - be a historical table. In other words, it no longer receives updates, just selects. We really wanted this to be true, for it meant we could dump the whole table, convert it, and simply load the converted table into the new database (which took only a few seconds!). First, however, we had to confirm that the table was not changing.

Detecting changes may be done in several ways. For all of them, you can never prove that the table shall not change at some point in the future, but you can prove that it has not changed over a certain period of time. How you go about doing that depends on what kind of access you have. If you do not have super-user access, you could add a simple trigger to the table that updates another table when a update, insert, or delete is performed. Then, checking in on the second table will indicate if any changes have been made.

A better solution is to simply look at the underlying file that makes up the table. To do this, you need be a Postgres superuser or have access to the underlying operating system. Basically, we will trust the operating system's information on when the table was last changed to determine if the table itself has changed. Although not foolproof, it is an excellent solution. Let's illustrate it here. First: create a test table and add some rows:

$ psql
greg=# CREATE TABLE catbox AS SELECT 8675309::INT AS id FROM generate_series(1,1000);

Now we can use the pg_stat_file() function, which returns some basic information about a file on disk. With the help of the pg_relation_filepath() function, we can see when the table was last modified:

greg=# select * from pg_stat_file( pg_relation_filepath('catbox') ) xg
Expanded display is on.
-[ RECORD 1 ]+-----------------------
size         | 40960
access       | 2015-11-08 22:36:00-04
modification | 2015-11-08 22:36:00-04
change       | 2015-11-08 22:36:00-04
creation     | 
isdir        | f

Next we will revisit the table after some time (e.g. 24 hours) and see if the "modification" timestamp is the same. If it is, then the table has not been modified either. Unfortunately, the possibility of a false positive is possible due to VACUUM, which may change things on disk but does NOT change the data itself. (A regular VACUUM *may* modify the file, and a VACUUM FULL *always* modifies it).

greg=# select * from pg_stat_file( pg_relation_filepath('catbox') ) xg

-[ RECORD 1 ]+-----------------------
size         | 40960
access       | 2015-11-08 22:36:00-04
modification | 2015-11-08 22:36:00-04
change       | 2015-11-08 22:36:00-04
creation     | 
isdir        | f

greg=# vacuum catbox;

greg=# select * from pg_stat_file( pg_relation_filepath('catbox') );

2016-06-09 22:53:24-04
-[ RECORD 1 ]+-----------------------
size         | 40960
access       | 2015-11-08 22:36:00-04
modification | 2015-11-08 22:40:14-04
change       | 2015-11-08 22:40:14-04
creation     | 
isdir        | f

A second (and more foolproof) method is to simply generate a checksum of the entire table. This is a fairly straightforward approach; just pump the output of pg_dump to a checksum program:

$ pg_dump -t catbox --data-only | sha1sum
6f724565656f455072736e44646c207472536e61  -

The advantage here is that even a VACUUM FULL will not change the checksum. However, because pg_dump does no ORDER BY when dumping out the table, it is possible for the rows to be returned in a different order. To work around that, issue a VACUUM FULL yourself before taking the checksum. As before, come back later (e.g. 24 hours) and re-run the command. If the checksums match, then the table has not changed (and is probably no longer updated by the application). By using this method, we were able to verify that the large, SQL_ASCII byte-soup table was indeed not being updated, and thus we took it out of the direct migration.

Of course, that table needed to be part of the new database, but we simply dumped the table, ran the conversion program on it, and (four hours later), had a complete dump of the table that loads extremely fast into the new database.

That solved only one of the problems, however; another table was also slowing down the migration. Although it did not have the SQL_ASCII conversion issue, it was a large table, and took a large percentage of the remaining migration time. A quick look at this table showed it had a "creation_time" column as well as a SERIAL primary key, and was obviously being updated quite often. Close examination showed that it was possible this was an append-only table, such that older rows were never updated. This called for a similar approach: could we prove that a large chunk of the table was not changing? If we could, we could pre-populate the new database and copy over only the most recent rows during the migration, saving a good bit of time.

The previous tricks would not work for this situation, because the underlying file would change constantly as seen by pg_stat_file(), and a pg_dump checksum would change on every insert. We needed to analyze a slice of the table - in this particular case, we wanted to see about checksumming all rows except those created in the last week. As a primary key lookup is very fast, we used the "creation_time" column to determine an approximate primary key to start with. Then it was simply a matter of feeding all those rows into the sha1sum program:

greg=# CREATE TABLE catbox2 (id SERIAL PRIMARY KEY, creation_time TIMESTAMPTZ);
greg=# INSERT INTO catbox2(creation_time) select now() - '1 year'::interval + (x* '1 hour'::interval) from generate_series(1,24*365) x;
INSERT 0 8760

greg=# select * from catbox2 where creation_time > now()-'1 week'::interval order by 1 limit 1
  id  |         creation_time         
 8617 | 2016-06-11 10:51:00.101971-08

$ psql -Atc "select * from catbox2 where id < 8617 order by 1" | sha1sum
456272656d65486e6f203139353120506173733f  -

## Add some rows to emulate the append-only nature of this table:
greg=# insert into catbox2(creation_time) select now() from generate_series(1,1000)
INSERT 0 1000

## Checksums should still be identical:
$ psql -Atc "select * from catbox2 where id < 8617 order by 1" | sha1sum
456272656d65486e6f203139353120506173733f  -

Despite the large size of this table (around 10 GB), this command did not take that long to run. A week later, we ran the same commands, and got the same checksum! Thus, we were able to prove that the table was mostly append-only - or at least enough for our use case. We copied over the "old" rows, then copied over the rest of the rows during the critical production migration window.

In the future, this client will able to take advantage of pg_upgrade, but getting to UTF-8 and data checksums was absolutely worth the high one-time cost. There were several other tricks used to speed up the final migration, but being able to remove the UTF-8 conversion of the first table, and being able to pre-copy 99% of the second table accounted for the lion's share of the final speed improvements.


published by (Greg Sabino Mullane) on 2016-04-29 00:04:00 in the "postgres" category

Postgres has a wonderful feature called concurrent indexes. It allows you to create indexes on a table without blocking reads OR writes, which is quite a handy trick. There are a number of circumstances in which one might want to use concurrent indexes, the most common one being not blocking writes to production tables. There are a few other use cases as well, including:

Photograph by Nicholas A. Tonelli

  • Replacing a corrupted index
  • Replacing a bloated index
  • Replacing an existing index (e.g. better column list)
  • Changing index parameters
  • Restoring a production dump as quickly as possible

In this article, I will focus on that last use case, restoring a database as quickly as possible. We recently upgraded a client from a very old version of Postgres to the current version (9.5 as of this writing). The fact that use of pg_upgrade was not available should give you a clue as to just how old the "very old" version was!

Our strategy was to create a new 9.5 cluster, get it optimized for bulk loading, import the globals and schema, stop write connections to the old database, transfer the data from old to new, and bring the new one up for reading and writing.

The goal was to reduce the application downtime as much as reasonably possible. To that end, we did not want to wait until all the indexes were created before letting people back in, as testing showed that the index creations were the longest part of the process. We used the "--section" flags of pg_dump to create pre-data, data, and post-data sections. All of the index creation statements appeared in the post-data file.

Because the client determined that it was more important for the data to be available, and the tables writable, than it was for them to be fully indexed, we decided to try using CONCURRENT indexes. In this way, writes to the tables could happen at the same time that they were being indexed - and those writes could occur as soon as the table was populated. That was the theory anyway.

The migration went smooth - the data was transferred over quickly, the database was restarted with a new postgresql.conf (e.g. turn fsync back on), and clients were able to connect, albeit with some queries running slower than normal. We parsed the post-data file and created a new file in which all the CREATE INDEX commands were changed to CREATE INDEX CONCURRENTLY. We kicked that off, but after a certain amount of time, it seemed to freeze up.

The frogurt is also cursed.

Looking closer showed that the CREATE INDEX CONCURRENTLY statement was waiting, and waiting, and never able to complete - because other transactions were not finishing. This is why concurrent indexing is both a blessing and a curse. The concurrent index creation is so polite that it never blocks writers, but this means processes can charge ahead and be none the wiser that the create index statement is waiting on them to finish their transaction. When you also have a misbehaving application that stays "idle in transaction", it's a recipe for confusion. (Idle in transaction is what happens when your application keeps a database connection open without doing a COMMIT or ROLLBACK). A concurrent index can only completely finish being created once any transaction that has referenced the table has completed. The problem was that because the create index did not block, the app kept chugging along, spawning new processes that all ended up in idle in transaction.

At that point, the only way to get the concurrent index creation to complete was to forcibly kill all the other idle in transaction processes, forcing them to rollback and causing a lot of distress for the application. In contrast, a regular index creation would have caused other processes to block on their first attempt to access the table, and then carried on once the creation was complete, and nothing would have to rollback.

Another business decision was made - the concurrent indexes were nice, but we needed the indexes, even if some had to be created as regular indexes. Many of the indexes were able to be completed (concurrently) very quickly - and they were on not-very-busy tables - so we plowed through the index creation script, and simply canceled any concurrent index creations that were being blocked for too long. This only left a handful of uncreated indexes, so we simply dropped the "invalid" indexes (these appear when a concurrent index creation is interrupted), and reran with regular CREATE INDEX statements.

The lesson here is that nothing comes without a cost. The overly polite concurrent index creation is great at letting everyone else access the table, but it also means that large complex transactions can chug along without being blocked, and have to have all of their work rolled back. In this case, things worked out as we did 99% of the indexes as CONCURRENT, and the remaining ones as regular. All in all, the use of concurrent indexes was a big win, and they are still an amazing feature of Postgres.


published by (Szymon Lipi?ski) on 2016-04-07 20:12:00 in the "postgres" category

The new PostgreSQL 9.5 release has a bunch of great features. I describe below the ones I find most interesting.


UPSERT is simply a combination of INSERT and UPDATE. This works like this: if a row exists, then update it, if it doesn't exist, create it.

Before Postgres 9.5 when I wanted to insert or update a row, I had to write this:

INSERT INTO test(username, login)
SELECT 'hey', 'ho ho ho'
WHERE NOT EXISTS (SELECT 42 FROM test WHERE username='hey');

UPDATE test SET login='ho ho ho' WHERE username='hey' AND login <> 'ho ho ho';

Which was a little bit problematic. You need to make two queries, and both can have quite complicated WHERE clauses.

In PostgreSQL 9.5 there is much simpler version:

INSERT INTO test(username, login) VALUES ('hey', 'ho ho ho')
ON CONFLICT (username)
DO UPDATE SET login='ho ho ho';

The only requirement is that there should be a UNIQUE constraint on a column which should fail while inserting a row.

The version above makes the UPDATE when the INSERT fails. There is also another form of the UPSERT query, which I used in this blog post. You can just ignore the INSERT failure:

INSERT INTO test(username, login) VALUES ('hey', 'ho ho ho')
ON CONFLICT (username)

Switching Tables to Logged and Unlogged

PostgreSQL keeps a transaction write ahead log, which helps restore the database after a crash, and is used in replication, but it comes with some overhead, as additional information must be stored on disk.

In PostgreSQL 9.5 you can simply switch a table from logged to unlogged. The unlogged version can be much faster when filling it with data, processing it etc. However at the end of such operations it might be good to make it a normal logged table. Now it is simple:


JSONB Operators and Functions

This is the binary JSON type, and these new functions allow us to perform more operations without having to convert our data first to the slower, non-binary JSON alternative.

Now you can remove a key from a JSONB value:

SELECT '{"a": 1, "b": 2, "c": 3}'::jsonb || '{"x": 1, "y": 2, "c": 42}'::jsonb;

 {"b": 2, "c": 3}

And merge JSONB values (the last value's keys overwrite the first's one):

SELECT '{"a": 1, "b": 2, "c": 3}'::jsonb || '{"x": 1, "y": 2, "c": 42}'::jsonb;

 {"a": 1, "b": 2, "c": 42, "x": 1, "y": 2}

And we have the nice jsonb_pretty() function which instead of this:

SELECT jsonb_set('{"name": "James", "contact": {"phone": "01234 567890",
                   "fax": "01987 543210"}}'::jsonb,
                   '{contact,phone}', '"07900 112233"'::jsonb);

 {"name": "James", "contact": {"fax": "01987 543210", "phone": "07900 112233"}}

prints this:

SELECT jsonb_pretty(jsonb_set('{"name": "James", "contact": {"phone": "01234 567890",
                   "fax": "01987 543210"}}'::jsonb,
                   '{contact,phone}', '"07900 112233"'::jsonb));

  {                              ?
      "name": "James",           ?
      "contact": {               ?
          "fax": "01987 543210", ?
          "phone": "07900 112233"?
      }                          ?

More Information

There are more nice features in the new PostgreSQL 9.5. You can read the full list at'snewinPostgreSQL9.5


published by (Szymon Lipi?ski) on 2016-03-03 12:17:00 in the "postgres" category

In the previous posts I have described a simple database table for storing JSON values, and a way to unpack nested JSON attributes into simple database views. This time I will show how to write a very simple query (thanks to PostgreSQL 9.5) to load the JSON files

Here's a simple Python script to load the database.

This script is made for PostgreSQL 9.4 (in fact it should work for 9.5 too, but is not using a nice new 9.5 feature described below).

#!/usr/bin/env python

import os
import sys
import logging

    import psycopg2 as pg
    import psycopg2.extras
    print "Install psycopg2"

    import progressbar
    print "Install progressbar2"

import json

import logging
logger = logging.getLogger()

PG_CONN_STRING = "dbname='blogpost' port='5433'"

data_dir = "data"
dbconn = pg.connect(PG_CONN_STRING)"Loading data from '{}'".format(data_dir))

cursor = dbconn.cursor()

counter = 0
empty_files = []

class ProgressInfo:

    def __init__(self, dir):
        files_no = 0
        for root, dirs, files in os.walk(dir):
            for file in files:
                if file.endswith(".json"):
                    files_no += 1
        self.files_no = files_no
        print "Found {} files to process".format(self.files_no) = progressbar.ProgressBar(maxval=self.files_no,
                                           widgets=[' [', progressbar.Timer(), '] [', progressbar.ETA(), '] ', progressbar.Bar(),])

    def update(self, counter):

pi = ProgressInfo(os.path.expanduser(data_dir))

for root, dirs, files in os.walk(os.path.expanduser(data_dir)):
    for f in files:
        fname = os.path.join(root, f)

        if not fname.endswith(".json"):
        with open(fname) as js:
            data =
            if not data:
            import json
            dd = json.loads(data)
            counter += 1
                            INSERT INTO stats_data(data)
                            SELECT %s
                            WHERE NOT EXISTS (SELECT 42
                                              FROM stats_data
                                                    ((data->>'metadata')::json->>'country')  = %s
                                                AND ((data->>'metadata')::json->>'installation') = %s
                                                AND tstzrange(
                                                        to_timestamp((data->>'start_ts')::double precision),
                                                        to_timestamp((data->>'end_ts'  )::double precision)
                                                    ) &&
                                                        to_timestamp(%s::text::double precision),
                                                        to_timestamp(%s::text::double precision)
                        """, (data, str(dd['metadata']['country']), str(dd['metadata']['installation']), str(dd['start_ts']), str(dd['end_ts'])))

print ""

logger.debug("Refreshing materialized views")
cursor.execute("""REFRESH MATERIALIZED VIEW sessions""");

dbconn.commit()"Loaded {} files".format(counter))"Found {} empty files".format(len(empty_files)))
if empty_files:"Empty files:")
    for f in empty_files:" >>> {}".format(f))

I have created two example files in the 'data' directory, the output of this script is:

Found 2 files to process
 [Elapsed Time: 0:00:00] [ETA:  0:00:00] |#####################################|

Yey, so it works. What's more, I can run the script again on the same files, and it will try loading the same data without any errors. Do you rememember that there was an EXCLUDE constraint which doesn't allow us to load any JSON for the same country, and installation, and overlapping time range? That's why the query is so long. I also need to check that such a JSON is not in the database, so I can load it.

This is twice slower than the next solution. The problem is that it needs to unpack the JSON to run the subquery, then insert the data checking the same thing (in fact the insert, and the subquery are using the same index made by the EXCLUDE constraint).

And then PostgreSQL 9.5 was released, with one great feature: ON CONFLICT DO SOMETHING. The conflict is a UNIQUE index violation. The EXCLUDE clause in the stats_data table created such a unique index.

There can also be ON CONFLICT DO NOTHING, and that's what I have used. I changed only one query in the script, and instead of this:

                            INSERT INTO stats_data(data)
                            SELECT %s
                            WHERE NOT EXISTS (SELECT 42
                                              FROM stats_data
                                                    ((data->>'metadata')::json->>'country')  = %s
                                                AND ((data->>'metadata')::json->>'installation') = %s
                                                AND tstzrange(
                                                        to_timestamp((data->>'start_ts')::double precision),
                                                        to_timestamp((data->>'end_ts'  )::double precision)
                                                    ) &&
                                                        to_timestamp(%s::text::double precision),
                                                        to_timestamp(%s::text::double precision)
                        """, (data, str(dd['metadata']['country']), str(dd['metadata']['installation']), str(dd['start_ts']), str(dd['end_ts'])))

It looks like this:

                            INSERT INTO stats_data(data)
                            VALUES (%s)
                            ON CONFLICT ON CONSTRAINT no_overlapping_jsons DO NOTHING
                        """, (data, ))

This version requires PostgreSQL 9.5 and will not work on the previous versions.

It is twice as fast as the original, and works as expected. This means that I can run it on the already loaded files, and will not load them. This way when I use rsync to download the new files, I can just run the script, and it will load only the new files into the database.

Loading 88k of JSON files using the production version of the script with the first query takes over 1 minute. Loading the files using the second version takes less than 30 seconds.


published by (Szymon Lipi?ski) on 2016-02-29 13:08:00 in the "postgres" category

In the previous post I showed a simple PostgreSQL table for storing JSON data. Let's talk about making the JSON data easier to use.

One of the requirements was to store the JSON from the files unchanged. However using the JSON operators for deep attributes is a little bit unpleasant. In the example JSON there is attribute country inside metadata. To access this field, we need to write:

SELECT data->'metadata'->>'country' FROM stats_data;

The native SQL version would rather look like:

SELECT country FROM stats;

So let's do something to be able to write the queries like this. We need to repack the data to have the nice SQL types, and hide all the nested JSON operators.

I've made a simple view for this:

  id                                                          as id,
  created_at                                                  as created_at,
  to_timestamp((data->>'start_ts')::double precision)         as start_ts,
  to_timestamp((data->>'end_ts')::double precision)           as end_ts,
    to_timestamp((data->>'start_ts')::double precision),
    to_timestamp((data->>'end_ts')::double precision)
  )                                                           as ts_range,
  ( SELECT array_agg(x)::INTEGER[]
    FROM jsonb_array_elements_text(data->'resets') x)         as resets,
  (data->'sessions')                                          as sessions,
  (data->'metadata'->>'country')                              as country,
  (data->'metadata'->>'installation')                         as installation,
  (data->>'status')                                           as status
FROM stats_data;

This is a normal view, which means that it is only a query stored in the database. Each time the view is queried, the data must be taken from the stats_data table.

There is some code I could extract to separate functions. This will be useful in the future, and the view sql should be cleaner.

Here are my new functions:

CREATE OR REPLACE FUNCTION to_array(j jsonb) RETURNS integer[] AS $$
  SELECT array_agg(x)::INTEGER[] FROM jsonb_array_elements_text(j) x;

CREATE OR REPLACE FUNCTION to_timestamp(j jsonb) RETURNS timestamptz AS $$
  SELECT to_timestamp(j::text::double precision);

CREATE OR REPLACE FUNCTION to_timestamp_range(start_ts jsonb, end_ts jsonb) RETURNS tstzrange AS $$
  SELECT tstzrange(
    to_timestamp(start_ts::text::double precision),
    to_timestamp(end_ts::text::double precision)

And now the view can be changed to this:

  id                                                   as id,
  created_at                                           as created_at,
  to_timestamp(data->'start_ts')                       as start_ts,
  to_timestamp(data->'end_ts'  )                       as end_ts,
  to_timestamp_range(data->'start_ts', data->'end_ts') as ts_range,
  to_array(data->'resets')                             as resets,
  (data->'sessions')                                   as sessions,
  (data->'metadata'->>'country')                       as country,
  (data->'metadata'->>'installation')                  as installation,
  (data->>'status')                                    as status
FROM stats_data;

So currently we have normal SQL fields, except for the sessions part, which is there as JSON for a purpose.

The types made by PostgreSQL are:

                 View "public.stats"
    Column    ?           Type           ? Modifiers
 id           ? integer                  ?
 created_at   ? timestamp with time zone ?
 start_ts     ? timestamp with time zone ?
 end_ts       ? timestamp with time zone ?
 ts_range     ? tstzrange                ?
 resets       ? integer[]                ?
 sessions     ? jsonb                    ?
 country      ? text                     ?
 installation ? text                     ?
 status       ? text                     ?

The data from this view looks like this:

SELECT * FROM stats WHERE id = 1;

-[ RECORD 1 ]+----------------------------------------------------------------
id           | 1
created_at   | 2016-02-09 16:46:15.369802+01
start_ts     | 2015-08-03 21:10:33+02
end_ts       | 2015-08-03 21:40:33+02
ts_range     | ["2015-08-03 21:10:33+02","2015-08-03 21:40:33+02")
resets       | 
sessions     | [{"end_ts": 1438629089, "start_ts": 1438629058, "application": "first"},
                {"end_ts": 1438629143, "start_ts": 1438629123, "application": "second"},
                {"end_ts": 1438629476, "start_ts": 1438629236, "application": "third"}]
country      | USA
installation | FIRST
status       | on

The last part left is to extract information about the sessions. To make the reports simpler, I've extracted the sessions list into another view. However, because the operation of extracting the data is more expensive, I made it as a MATERIALIZED VIEW. This means that this view not only stores the query, but also keeps all the view data. This also means that this view is not updated automatically when the stats_data changes. I refresh the view data automatically in a script which loads the JSON files.

The sessions view looks like this:

  id                                                                            as id,
  country                                                                       as country,
  installation                                                                  as installation,
  s->>'application'                                                             as appname,
  to_timestamp_range(s->'start_ts', s->'end_ts')                                as ts_range,
  COALESCE(bool(s->>'occupancy_triggered'), false)                              as occupancy_triggered,
  to_timestamp(s->'end_ts') - to_timestamp(s->'start_ts')                       as session_length
FROM stats, jsonb_array_elements(sessions) s

CREATE INDEX i_sessions_country  ON sessions (country);
CREATE INDEX i_sessions_retailer ON sessions (installation);
CREATE INDEX i_sessions_ts_range ON sessions USING GIST (ts_range);

I've also created indexes on the materialized view, as my report queries will contain the where clause like:

WHERE country='' and installation='' and ts_range && tstzrange(fromdate, todate)

An example data extracted from the JSON looks like this:

select * from sessions;

id | country | installation | appname |                      ts_range                       | occupancy_triggered | session_length
 1 | USA     | FIRST        | first   | ["2015-08-03 21:10:58+02","2015-08-03 21:11:29+02") | f                   | 00:00:31
 1 | USA     | FIRST        | second  | ["2015-08-03 21:12:03+02","2015-08-03 21:12:23+02") | f                   | 00:00:20
 1 | USA     | FIRST        | third   | ["2015-08-03 21:13:56+02","2015-08-03 21:17:56+02") | f                   | 00:04:00
(3 rows)

In the next part I will show how to load the same JSON files multiple times, without any errors in a very simple Python script.


published by (Szymon Lipi?ski) on 2016-02-24 17:50:00 in the "postgres" category

We have plenty of Liquid Galaxy systems, where we write statistical information in json files. This is quite a nice solution. However we end with a bunch of files on a bunch of machines.

Inside we have a structure like:

    "end_ts": 1438630833,
    "resets": [],
    "metadata": {
        "country": "USA",
        "installation": "FIRST"
    "sessions": [
            "application": "first",
            "end_ts": 1438629089,
            "start_ts": 1438629058
            "application": "second",
            "end_ts": 1438629143,
            "start_ts": 1438629123
            "application": "third",
            "end_ts": 1438629476,
            "start_ts": 1438629236
    "start_ts": 1438629033,
    "status": "on"

And the files are named like "{start_ts}.json". The number of files is different on each system. For January we had from 11k to 17k files.

The fields in the json mean:

  • start_ts/end_ts - timestamps for start/end for the file
  • resets - is an array of timestamps when system was resetted
  • sessions - a list of sessions, each contains application name and start/end timestamps

We keep these files in order to get statistics from them. So we can do one of two things: keep the files on disk, and write a script for making reports. Or load the files into a database, and make the reports from the database.

The first solution looks quite simple. However for a year of files, and a hundred of systems, there will be about 18M files.

The second solution has one huge advantage: it should be faster. A database should be able to have some indexes, where the precomputed data should be stored for faster querying.

For a database we chose PostgreSQL. The 9.5 version released in January has plenty of great features for managing JSON data.

The basic idea behind the database schema is:

  • the original jsons should be stored without any conversion
  • the report queries must be fast
  • there should be only one json entry for a site for given time
  • the script loading the data should load the same file many times without any error

I've started with the main table for storing jsons:

CREATE TABLE stats_data (

This is not enough. We also want to avoid storing the same json multiple times. This can easily be done with an EXCLUDE clause.

CREATE TABLE stats_data (

  CONSTRAINT no_overlapping_jsons
      to_timestamp((data->>'start_ts')::double precision),
      to_timestamp((data->>'end_ts'  )::double precision)
    ) WITH &&,
    ((data->>'metadata')::json->>'country')      WITH =,
    ((data->>'metadata')::json->>'installation') WITH =

The above SQL requires a small extention to be installed


And now inserting the same json results in error:

$ insert into stats_data(data) select data from stats_data;
ERROR:  conflicting key value violates exclusion constraint "no_overlapping_jsons"

So for now we have a simple table with original json, and with a constraint disallowing to insert overlapping jsons.

In the next part I will show how to make simple reports and load the json files.