All opinions expressed are those of the authors and not necessarily those of, our sponsors, or our affiliates.
  Add to My Yahoo!  Subscribe with Bloglines  Subscribe in NewsGator Online

published by (Jon Jensen) on 2017-03-11 01:36:00 in the "bash" category

Little tools can be a nice improvement. Not everything needs to be thought-leaderish.

For example, once upon a time in my Unix infancy I didn't know that mkdir has the -p option to make intervening directories automatically. So back then, in order to create the path a/b/c/ I would've run: mkdir a; mkdir b; mkdir c when I could instead have simply run: mkdir -p a/b/c.

In working at the shell, particularly on my own local machine, I often find myself wanting to move one or several files into a different location, to file them away. For example:

mv -i ~/Downloads/Some Long File Name.pdf ~/some-other-long-file-name.tar.xz ~/archive/new...

at which point I realize that the subdirectory of ~/archive that I want to move those files into does not yet exist.

I can't simply move to the beginning of the line and change mv to mkdir -p without removing my partially-typed ~/archive/new....

I can go ahead and remove that, and then after I run the command I have to change the mkdir back to mv and add back the ~/archive/new....

In one single day I found I was doing that so often that it became tedious, so I re-read the GNU coreutils manpage for mv to see if there was a relevant option I had missed or a new one that would help. And I searched the web to see if a prebuilt tool is out there, or if anyone had any nice solutions.

To my surprise I found nothing suitable, but I did find some discussion forums full of various suggestions and many brushoffs and ill-conceived suggestions that either didn't work for me or seemed much overengineered.

The solution I came up with was very simple. I've been using it for a few months and am happy enough with it to share it and see if it helps anyone else.

In zsh (my main local shell) add to ~/.zshrc:

mkmv() {
    mkdir -p -- "$argv[-1]"
    mv "$@"

And in bash (which I use on most of the many servers I access remotely) add to ~/.bashrc:

mkmv() {
    mkdir -p -- "${!#}"
    mv "$@"

To use: Once you realize you're about to try to move files or directories into a nonexistent directory, simply go to the beginning of the line (^A in standard emacs keybindings) and type mk in front of the mv that was already there:

mkmv -i ~/Downloads/Some Long File Name.pdf ~/some-other-long-file-name.tar.xz ~/archive/new...

It makes the directory (or directories) and then completes the move.

There are a few important considerations that I didn't foresee in my initial naive implementation:

  • Having the name be somethingmv meant less typing than something requiring me to remove the mv.
  • For me, it needs to support not just moving one thing to one destination, but rather a whole list of things. That meant accessing the last argument (the destination) for the mkdir.
  • I also needed to allow through arguments to mv such as -i, -v, and -n, which I often use.
  • The -- argument to mkdir ensures that we don't accidentally end up with any other options and that we can handle a destination with a leading - (however unlikely that is).
  • The mv command needs to have a double-quoted "$@" so that the original parameters are each expanded into double-quoted arguments, allowing for spaces and other shell metacharacters in the paths. (See the zsh and bash manpages for details on the important difference in behavior of "$@" compared to "$*" and either of them unquoted.)

This doesn't support GNU extensions to mv such as a --target-directory that precedes the source paths. I don't use that interactively, so I don't mind.

Because this is such a small thing, I avoided for years bothering to set it up. But now that I use it all the time I'm glad I have it!


published by (Marco Matarazzo) on 2016-12-12 18:32:00 in the "bash" category

Let's say you're working in Bash, and you want to loop over a list of files, using wildcards.

The basic code is:

for f in /path/to/files/*; do
  echo "Found file: $f"

Easy as that. However, there could be a problem with this code: if the wildcard does not expand to actual files (i.e. there's no file under /path/to/files/ directory), $f will expand to the path string itself, and the for loop will still be executed one time with $f containing "/path/to/files/*".

How to prevent this from happening? Nullglob is what you're looking for.

Nullglob, quoting shopts man page, "allows filename patterns which match no files to expand to a null string, rather than themselves".

Using shopt -s you can enable BASH optional behaviors, like Nullglob. Here's the final code:

shopt -s nullglob
for f in /path/to/files/*; do
  echo "Found file: $f"

Another interesting option you may want to check for, supported by Bash since version 3, is failglob.

With failglob enabled, quoting again, "patterns which fail to match filenames during filename expansion result in an expansion error". Depending on what you need, that could even be a better behavior.

Wondering why nullglob it's not the default behavior? Check this very good answer to the question.


published by (Szymon Lipi?ski) on 2016-04-08 13:42:00 in the "bash" category

Bash has quite a nice feature, you can write a command in a console, and then press <TAB> twice. This should should you possible options you can write for this command.

I will show how to integrate this mechanism into a custom python script with two types of arguments. What's more, I want this to be totally generic. I don't want to change it when I will change the options, or change config files.

This script accepts two types of arguments. One type contains mainly flags beginning with '--', the other type is a host name taken from a bunch of chef scripts.

Let's name this script - it will show some information about the host. This way I can use it with: szymon

The szymon part is the name of my special host, and it is taken from one of our chef node definition files.

This script also takes huge number of arguments like: --cpu --memory --format=json

So we have two kinds of arguments: one is a simple string, one begins with --.

To implement the bash completion on double <TAB>, first I wrote a simple python script, which is prints a huge list of all the node names:

#!/usr/bin/env python

from sys import argv
import os
import json

if __name__ == "__main__":
    pattern = ""
    if len(argv) == 2:
        pattern = argv[1]

    chef_dir = os.environ.get('CHEF_DIR', None)
    if not chef_dir:
    node_dirs = [os.path.join(chef_dir, "nodes"),
                 os.path.join(chef_dir, "dev_nodes")]
    node_names = []

    for nodes_dir in node_dirs:
        for root, dirs, files in os.walk(nodes_dir):
            for f in files:
                    with open(os.path.join(root, f), 'r') as nf:
                        data = json.load(nf)

    for name in node_names:
        print name

Another thing was to get a list of all the program options. We used the below one liner. It uses the help information shown by the script. So each time the script changed its options, and it is shown when used --help, the tab completion will have show these new options.

$CHEF_DIR/repo_scripts/ --help | grep '  --' | awk '{print $1}'

The last step to make all this work was making a simple bash script, which uses the above python script, and the one liner. I placed this script in a file $CHEF_DIR/repo_scripts/show.bash-completion.

    local cur prev opts node_names
    opts=`$CHEF_DIR/repo_scripts/ --help | grep '  --' | awk '{print $1}'`
    node_names=`python $CHEF_DIR/repo_scripts/`

    if [[ ${cur} == -* ]] ; then
        COMPREPLY=( $(compgen -W "${opts}" -- ${cur}) )
        return 0

    COMPREPLY=( $(compgen -W "${node_names}" -- ${cur}) )

complete -F _show_complete

The last thing was to source this file, so I've added the below line in my ~/.bashrc.

source $CHEF_DIR/repo_scripts/show.bash-completion

And now pressing the <TAB> twice in a console shows quite nice completion options:

Display all 42 possibilities? (y or n)
... and here go all 42 node names ...
$ h
... and here go all node names beginning with 'h' ...
$ --
.. and here go all the options beginning with -- ...


published by (Jeff Boes) on 2016-01-15 14:30:00 in the "bash" category

Recently I managed to break the bash shell in an interesting and puzzling way. The initial symptoms were very frustrating: a workflow process we use here (creating a development camp) failed for me, but for no one else. That was at least a clue that it was me, not the workflow process.

Eventually, I narrowed down the culprit to the "grep" command (and that was more through luck than steadfast Sherlock-like detective work).

$ grep foo bar

grep: foo: No such file or directory

Eh? grep is mis-parsing the arguments! How does that happen?

So I began to study my bash environment. Eventually I came up with this fascinating little typo:

export GREP_OPTIONS='?color=auto'

That's supposed to be:

export GREP_OPTIONS='--color=auto'

but it got recorded in my .bashrc as a en-dash, not a double-dash. (My guess is that I cut-and-pasted this from a web page where someone over-helpfully "typeset" this command.)

Ironically, this typo is innocuous under Bash 3.x, but when you slot it into a Bash 4.x installation, all heck busts loose.


published by (Ramkumar Kuppuchamy) on 2015-11-04 20:25:00 in the "bash" category
Here are some of the unix command line tools which we feel make our hands faster and lives easier. Let?s go through them in this post and make sure to leave a comment with your favourite!

1. Find the command that you are unaware of

In many situations we need to perform a command line operation but we might not know the right utility to run. The command (apropos) searches for the given keyword against its short description in the unix manual page and returns a list of commands that we may use to accomplish our need.

If you can not find the right utility, then Google is our friend :)

$ apropos "list dir"
$ man -k "find files"

2. Fix typos in our commands

It's normal to make typographical errors when we type so fast. Consider a situation where we need to run a command with a long list of arguments and when executing it returns "command not found" and you noticed that you have made a "typo" on the executed command.
Now, we really do not want to retype the long list of arguments, instead use the following to simply just correct the typo command and execute
$ ^typo_cmd^correct_cmd
 $ dc /tmp
 $ ^dc^cd
The above will navigate to /tmp directory

3. Bang and its Magic

Bang quite useful, when we want to play with the bash history commands . Bang helps by letting you execute commands in history easily when you need them
  • !! --> Execute the last executed command in the bash history
  • !* --> Execute the command with all the arguments passed to the previous command
  • !? --> Get the first argument of the last executed command in the bash history
  • !$ --> Get the last argument of the last executed command in the bash history
  • ! --> Execute a command which is in the specified number in bash history
  • !?keyword? --> Execute a command from bash history for the first pattern match of the specified keyword
  • !-N --> Execute the command that was Nth position from the last in bash history
$ ~/bin/lg-backup
 $ sudo !!
In the last part of the above example we didn't realize that the lg-backup command had to be run as "sudo". Now, Instead of typing the whole command again with sudo, we can just use "sudo !!" which will re-run the last executed command in bash history as sudo, which saves us lot of time.

4. Working with Incron

This incron configuration is almost like crontab setup, but the main difference is that "Incron" monitors a directory for specific changes and triggers future actions as specified
Syntax: $directory $file_change_mask $command_or_action

/var/www/html/contents/ IN_CLOSE_WRITE,IN_CREATE,IN_DELETE /usr/bin/rsync ?exclude ?*.tmp? -a /home/ram/contents/ user@another_host:/home/ram/contents/
 /tmp IN_ALL_EVENTS logger "/tmp action for #file"
The above example shows triggering an "rsync" event whenever there is a change in "/var/www/html/contents" directory. In cases of immediate backup implementations this will be really helpful. Find more about incron here.

5. Double dash

There are situations where we end up in creating/deleting the directories whose name start with a symbol. These directories can not be removed by just using "rm -rf or rmdir". So we need to use the "double dash" (--) to perform deletion of such directories
$ rm -rf -- $symbol_dir
There are situations where you may want to create a few directory that starts with a symbol. "You can just these create directories using double dash(--) and starting the directory name with a symbol"
$ mkdir -- $symbol_dir

6. Comma and Braces Operators

We can do lot with comma and braces to make our life easier when we are performing some operations, lets see few usages
  • Rename and backup operations with comma & braces operator
  • Pattern matching with comma & braces operator
  • Rename and backup (prefixing name) operations on long file names
To backup the httpd.conf to httpd.conf.bak
$ cp httpd.conf{,.bak}
To revert the file from httpd.conf.bak to httpd.conf
$ mv http.conf{.bak,}
To rename the file with prefixing 'old'
$ cp exampleFile old-!#?

7. Read only vim

As we all know, vim is a powerful command line editor. We can also use vim to view files in read only mode if you want to stick to vim
$ vim -R filename
We can also use the "view" tool which is nothing but read only vim
$ view filename 

8. Push and Pop Directories

Sometimes when we are working with various directories and looking at the logs and executing scripts we find alot of our time is spent navigating the directory structure. If you think your directory navigations resembles a stack structure then use push and pop utilities which will save you lots of time
  • Push the directory using pushd
  • List the stack directories using the command "dirs"
  • Pop the directories using popd
  • This is mainly used in navigating between directories

9. Copy text from Linux terminal(stdin) to the system clipboard

Install xclip and create the below alias
$ alias pbcopy=?xclip -selection clipboard?
$ alias pbpaste=?xclip -selection clipboard -o?
We need to have the X window system running it to work. In Mac OS X, these pbcopy and pbpaste commands are readily available to you
To Copy:
$ ls | pbcopy
To Paste:
$ pbpaste > lstxt.txt 

10. TimeMachine like Incremental Backups in Linux using rsync --link-dest

This means that it will not recopy all of the files every single time a backup is performed. Instead, only the files that have been newly created or modified since the last backup will be copied. Unchanged files are hard linked from prevbackup to the destination directory.
$ rsync -a ?link-dest=prevbackup src dst

11. To display the ASCII art of the Process tree

Showing your processes in a tree structure is very useful for confirming the relationship between every process running on your system. Here is an option which is available by default on most of the Linux systems.
$ ps -aux ?forest
?forest is an argument to ps command, which displays ASCII art of process tree

There are many commands available like 'pstree', 'htop' to achieve the same thing.

12. Tree view of git commits

If you want to see git commits in a repo as tree view to understand the commit history better, the below option will be super helpful. This is available with the git installation and you do not need any additional packages.
$ git log ?graph ?oneline

13. Tee

Tee command is used to store and view (at the same time) the output of any other command.
(ie) At the same time it writes to the STDOUT, and to a file. It helps when you want to view the command output and at the time same time if you want to write it into a file or using pbcopy you can copy the output
$ crontab -l | tee crontab.backup.txt
The tee command is named after plumbing terminology for a T-shaped pipe splitter. This Unix command splits the output of a command, sending it to a file and to the terminal output. Thanks Jon for sharing this.

14. ncurses disk usage analyzer

Analysing disk usage with nurses interface, is fast and simple to use.
$ sudo apt-get install ncdu

15. hollywood

You all have seen the hacking scene on hollywood movies. Yes, there is a package which will let you create that for you.
$ sudo apt-add-repository ppa:hollywood/ppa 
$ sudo apt-get update
$ sudo apt-get install hollywood
$ hollywood


published by (Kannan Ponnusamy) on 2015-01-28 14:38:00 in the "bash" category

The Problem

When I launch a long running unix process within a python script, it waits until the process is finished, and only then do I get the complete output of my program. This is annoying if I'm running a process that takes a while to finish. And I want to capture the output and display it in the nice manner with clear formatting.

Using the subprocess and shlex library

Python has a ?batteries included? philosophy. I have used 2 standard libraries to solve this problem.
import subprocess 
import shlex 
  • subprocess - Works with additional processes
  • shlex - Lexical analysis of shell-style syntaxes


To run a process and read all of its output, set the stdout value to PIPE and call communicate().
import subprocess
process = subprocess.Popen(['echo', '"Hello stdout"'], stdout=subprocess.PIPE)
stdout = process.communicate()[0]
print 'STDOUT:{}'.format(stdout)
The above script will wait for the process to complete and then it will display the output. So now we are going to read the stdout line by line and display it in the console untill it completes the process.
output = process.stdout.readline()
This will read a line from the stdout.
The poll() method will return
  • the exit code if the process is completed.
  • None if the process is still running.
while True:
        output = process.stdout.readline()
        if output == '' and process.poll() is not None:
        if output:
    rc = process.poll()
The above will loop and keep on reading the stdout and check for the return code and displays the output in real time.
I had one more problem in parsing the shell commands to pass it to popen when I set the shell=False. Below is an example command:
rsync -avzXH --delete --exclude=*.swp --exclude=**/drivers.ini /media/lgisos/lg.iso root@42-a:/isodevice
To split the string using shell-like syntax I have used shlex library's split method.

Here is the final code looks like

def run_command(self, command):
    process = subprocess.Popen(shlex.split(command), stdout=subprocess.PIPE)
    while True:
        output = process.stdout.readline()
        if output == '' and process.poll() is not None:
        if output:
    rc = process.poll()
    return rc


published by (David Christensen) on 2014-11-27 02:33:00 in the "bash" category
I recently was involved in a project to migrate a client's existing application from MySQL to PostgreSQL, and I wanted to record some of my experiences in doing so in the hopes they would be useful for others.

Note that these issues should not be considered exhaustive, but were taken from my notes of issues encountered and/or things that we had to take into consideration in this migration process.

Convert the schema

The first step is to convert the equivalent schema in your PostgreSQL system, generated from the original MySQL.

We used `mysqldump --compatible=postgresql --no-data` to get a dump which matched PostgreSQL's quoting rules. This file still required some manual editing to cleanup some of the issues, such as removing MySQL's "Engine" specification after a CREATE TABLE statement, but this resulted in a script in which we were able to create a skeleton PostgreSQL database with the correct database objects, names, types, etc.

Some of the considerations here include the database collations/charset. MySQL supports multiple collations/charset per database; in this case we ended up storing everything in UTF-8, which matched the encoding of the PostgreSQL database, so there were no additional changes needed here; otherwise, it would have been necessary to note the original encoding of the individual tables and later convert that to UTF-8 in the next step.

We needed to make the following modifications for datatypes:

MySQL Datatype PostgreSQL Datatype
tinyint int
int(NN) int
blob bytea*
datetime timestamp with timezone
int unsigned int**
enum('1') bool
longtext text
varbinary(NN) bytea

* Note: we ended up converting these specific fields to text, just given the data that was stored in these fields in actuality, which just goes to show you should review your data.

** Note: because PostgreSQL does not have unsigned numeric types, if this feature is an important part of your data model you can/should add a CHECK constraint to the column in question to check that the value is non-negative.

A few other syntactic changes; MySQL's UNIQUE KEY in the CREATE TABLE statement needs to just be UNIQUE.

Some of the MySQL indexes were defined as FULLTEXT indexes as well, which was a keyword PostgreSQL did not recognize. We made note of these, then created just normal indexes for the time being, intending to review to what extent these actually needed full text search capabilities.

Some of the AUTO_INCREMENT fields did not get the DEFAULT value set correctly to a sequence, because those types just ended up as integers without being declared a serial field, so we used the following query to correct this:

-- cleanup missing autoincrement fields

datasource AS (
        information_schema.key_column_usage k
        attrelid = k.table_name :: regclass AND
        attname = k.column_name
        adrelid = k.table_name :: regclass AND
        adnum = k.ordinal_position
        table_name IN (
            SELECT table_name::text FROM information_schema.key_column_usage WHERE constraint_name LIKE '%_pkey' GROUP BY table_name HAVING count(table_name) = 1
        ) AND
        adsrc IS NULL AND
        atttypid = 'integer' ::regtype
frags AS (
        quote_ident(table_name || '_' || column_name || '_seq') AS q_seqname,
        quote_ident(table_name) as q_table,
        quote_ident(column_name) as q_col
queries AS (
        'CREATE SEQUENCE ' || q_seqname || ';
' ||
        'ALTER TABLE ' || q_table || ' ALTER COLUMN ' || q_col || $$ SET DEFAULT nextval('$$ || q_seqname || $$');
    $$ ||
        $$SELECT setval('$$ || q_seqname || $$',(SELECT max($$ || q_col || ') FROM ' || q_table || '));
' AS query
    FROM frags
    COALESCE(string_agg(query, E'n'),$$SELECT 'No autoincrement fixes needed';$$) AS queries FROM queries


Basically the idea is that we look for all table with a defined integer primary key (hand-waving it it by using the _pkey suffix in the constraint name), but without a current default value, then generate the equivalent SQL to create a sequence and set that table's default value to the nextval() for the sequence in question. We also generate SQL to scan that table and set that sequence value to the next appropriate value for the column in question. (Since this is for a migration and we know we'll be the only user accessing these tables we can ignore MVCC.)

Another interesting thing about this script is that we utilize psql's ability to store results in a variable, using the gset command, then we subsequently execute this SQL by interpolating that corresponding variable in the same script.

Convert the data

The next step was to prepare the data load from a MySQL data-only dump. Using a similar dump recipe as for the initial import, we used: `mysqldump --compatible=postgresql --no-create-info --extended-insert > data.sql` to save the data in a dump file so we could iteratively tweak our approach to cleaning up the MySQL data.

Using our dump file, we attempted a fresh load into the new PostgreSQL database. This failed initially due to multiple issues, including ones of invalid character encoding and stricter datatype interpretations in PostgreSQL.

What we ended up doing was to create a filter script to handle all of the "fixup" issues needed here. This involved decoding the data and reencoding to ensure we were using proper UTF8, performing some context-sensitive datatype conversions, etc.

Additional schema modifications

As we were already using a filter script to process the data dump, we decided to take the opportunity to fixup some warts in the current table definitions. This included some fields which were varchar, but should have actually been numeric or integer; as this was a non-trivial schema (100 tables) we were able to use PostgreSQL's system views to identify a list of columns which should should be numeric and were currently not.

Since this was an ecommerce application, we identified columns that were likely candidates for data type reassignment based on field names *count, *qty, *price, *num.

Once we identified the fields in question, I wrote a script to generate the appropriate ALTER TABLE statements to first drop the default, change the column type, then set the new default. This was done via a mapping between table/column name and desired output type.

Convert the application

The final (and needless to say most involved step) was to convert the actual application itself to work with PostgreSQL. Despite the fact that these databases both speak SQL, we had to come up for solutions for the following issues:

Quotation styles

MySQL is more lax with its quoting styles, so some of this migration involved hunting down differences in quoting styles. The codebase contained lots of double-quoted string literals, which PostgreSQL interprets as identifiers, as well as the difference in quoting of column names (backticks for MySQL, double-quotes for PostgreSQL). These had to be identified wherever they appeared and fixed to use a consistent quoting style.

Specific unsupported syntax differences:


MySQL supports the INSERT ON DUPLICATE KEY syntax. Modifying these queries involved creating a special UPSERT-style function to support the different options in use in the code base. We isolated and categorized the uses of INSERT ON DUPLICATE KEY UPDATE into several categories: those which did a straight record replace and those which did some sort of modification. I wrote a utility script (detailed later in this article) which served to replicate the logic needed to handle this as the application would expect.

Upcoming versions of PostgreSQL are likely to incorporate an INSERT ... ON CONFLICT UPDATE/IGNORE syntax, which would produce a more direct method of handling migration of these sorts of queries.


MySQL's INSERT ... IGNORE syntax allows you to insert a row and effectively ignore a primary key violation, assuming that the rest of the row is valid. You can handle this case via creating a similar UPSERT function as in the previous point. Again, this case will be easily resolved if PostgreSQL adopts the INSERT ... ON CONFLICT IGNORE syntax.


MySQL's REPLACE INTO syntax effectively does a DELETE followed by an INSERT; it basically ensures that a specific version of a row exists for the given primary key value. We handle this case by just modifying these queries to do an unconditional DELETE for the Primary Key in question followed by the corresponding INSERT. We ensure these are done within a single transaction so the result is atomic.


Date interval syntax can be slightly different in MySQL; intervals may be unquoted in MySQL, but must be quoted in PostgreSQL. This project necessitated hunting down several instances to add quoting of specific literal INTERVAL instances.

Function considerations


Many times when you insert a records into a MySQL table, later references to this are found using the last_insert_id() SQL function. These sorts of queries need to be modified to utilize the equivalent functionality using PostgreSQL sequences, such as the currval() function.


MySQL has the GROUP_CONCAT function, which serves as a string "join" of sorts. We emulate this behavior in PostgreSQL by using the string_agg aggregate function with the delimiter of choice.

CONCAT_WS() - expected to be but not an issue; PG has this function

PostgreSQL has included a CONCAT_WS() function since PostgreSQL 9.1, so this was not an issue with the specific migration, but could still be an issue if you are migrating to an older version of PostgreSQL.


This function does not exist directly in PostgreSQL, but can be simulated using to_date(). Note however that the format string argument differs between MySQL and PostgreSQL's versions.


MySQL has a date_format() function which transforms a date type to a string with a given format option. PostgreSQL has similar functionality using the to_char() function; the main difference here lies in the format string specifier.


DateDiff() does not exist in PostgreSQL, this is handled by transforming the function call to the equivalent date manipulation operators using the subtraction (-) operator.

rand() to random()

This is more-or-less a simple function rename, as the equivalent functionality for returning a random float between 0.0 <= x <= 1.0 exists in PostgreSQL and MySQL, it's just what the function name itself is. The other difference is that MySQL supports a scale argument so the random number for rand(N) will be returned between 0.0 <= x <= N, whereas you'd have to scale the result in PostgreSQL yourself, via random() * N.


MySQL has an IF() function which returns the second argument in the case the first argument evaluates to true otherwise returns the third argument. This can be trivially converted from IF(expression1, arg2, arg3) to the equilvalent PostgreSQL syntax: CASE WHEN expression1 THEN arg2 ELSE arg3.


MySQL has a function IFNULL() which returns the first argument if it is not NULL, otherwise it returns the second argument. This can effectively be replaced by the PostgreSQL COALESCE() function, which serves the same purpose.


MySQL has a built-in function called split_part() which allows you to access a specific index of an array delimited by a string. PostgreSQL also has this function, however the split_part() function in MySQL allows the index to be negative, in which case this returns the part from the right-hand side.

in MySQL:
split_part('a banana boat', ' ', -1) => 'boat'
in PostgreSQL:
split_part('a banana boat', ' ', -1) => // ERROR:  field position must be greater than zero

I fixed this issue by creating a custom plpgsql function to handle this case. (In my specific case, all of the negative indexes were -1; i.e., the last element in the array, so I created a function to return only the substring occurring after the last instance of the delimiter.)

Performance considerations

You may need to revisit COUNT(*) queries

MySQL MyISAM tables have a very fast COUNT(*) calculation, owing to queries taking a table lock (which means MySQL can cache the COUNT(*) result itself, since there can be no concurrent updates), while PostgreSQL utilizes MVCC in order to calculate table counts which necessitates a full table scan to see which rows are visible to the specific calling snapshot, so this assumption may need to be revisited in order to calculate an equivalent performant query.


MySQL is much more (*ahem*) flexible when it comes to GROUP BY/aggregate queries, allowing some columns to be excluded in a GROUP BY or an aggregate function. Making the equivalent query in PostgreSQL involves transforming the query from SELECT ... GROUP BY (cols) to a SELECT DISTINCT ON (cols) ... and providing an explicit sort order for the rows.

More notes

Don't be afraid to script things; in fact, I would go so far as to suggest that everything you do should be scripted. This process was complicated and there were lots of moving parts to ensure moved in tandem. There were changes being made on the site itself concurrently, so we were doing testing against a dump of the original database at a specific point-in-time. Having everything scripted ensured that this process was repeatable and testable, and that we could get to a specific point in the process without having to remember anything I'd done off-the-cuff.

In addition to scripting the actual SQL/migrations, I found it helpful to script the solutions to various classifications of problems. I wrote some scripts which I used to create some of the various scaffolding/boilerplate for the tables involved. This included a script which would create an UPSERT function for a specific table given the table name, which was used when replacing the INSERT ON DUPLICATE KEY UPDATE functions. This generated script could then be tailored to handle more complex logic beyond a simple UPDATE. (One instance here is an INSERT ON DUPLICATE KEY UPDATE which increased the count of a specific field in the table instead of replacing the value.)

#!/usr/bin/env perl
# -*-cperl-*-

use strict;
use warnings;

use Data::Dumper;

my $table = shift or die "Usage: $0 <table>n";
my @cols = @ARGV;

my $dbh = DBI->connect(...);

my @raw_cols = @{ $dbh->column_info(undef, 'public', $table, '%')->fetchall_arrayref({}) };
my %raw_cols = map { $_->{COLUMN_NAME} => $_ } @raw_cols;

die "Can't find table $tablen" unless @raw_cols;

my @missing_cols = grep { ! defined $raw_cols{$_} } @cols;

die "Referenced non-existing columns: @missing_colsn" if @missing_cols;

my %is_pk;

unless (@cols) {
    @cols = map { $_->{COLUMN_NAME} } @raw_cols;

my @pk_cols = $dbh->primary_key(undef, 'public', $table);

@is_pk{@pk_cols} = (1)x@pk_cols;

my @data_cols = grep { ! $is_pk{$_} } @cols;

die "Missing PK cols from column list!n" unless @pk_cols == grep { $is_pk{$_} } @cols;
die "No data columns!n" unless @data_cols;

print <<EOF
    upsert_$table (@{[
    join ', ' => map {
        "_$_ $raw_cols{ $_ }->{pg_type}"
    } @cols
LANGUAGE plpgsql
        UPDATE $table SET @{[
    join ', ' => map { "$_ = _$_" } @data_cols
]} WHERE @{[
    join ' AND ' => map { "$_ = _$_" } @pk_cols
        END IF;
            INSERT INTO $table (@{[join ',' => @cols]}) VALUES (@{[join ',' => map { "_$_" } @cols]});
        EXCEPTION WHEN unique_violation THEN
            -- Do nothing, and loop to try the UPDATE again.

This script created an upsert function from a given table to update all columns by default, also allowing you to create one with a different number of columns upserted.

I also wrote scripts which could handle/validate some of the column datatype changes. Since there were large numbers of columns which were changed, often multiple in the same table, I was able to have this script create a single ALTER TABLE statement with multiple ALTER COLUMN TYPE USING clauses, plus be able to specify the actual method that these column changes were to take place. These included several different approaches, depending on the target data type, but generally were to solve cases where there were fairly legitimate data that was not picked up by PostgreSQL's input parsers. These included how to interpret blank fields as integers (in some cases we wanted it to be 0, in others we wanted it to be NULL), weird numeric formatting (leaving off numbers before or after the decimal point), etc.

We had to fix up in several locations missing defaults for AUTO_INCREMENT columns. The tables were created with the proper datatype, however we had to find tables which matched a specific naming convention and create/associate a sequence/serial column, set the proper default here, etc. (This was detailed above.)

There was a fair amount of iteration and customization in this process, as there was a fair amount of data which was not of the expected format. The process was iterative, and generally involved attempting to alter the table from within a transaction and finding the next datum which the conversion to the expected type did not work. This would result in a modification of the USING clause of the ALTER TABLE ALTER COLUMN TYPE to accommodate some of the specific issues.

In several cases, there were only a couple records which had bad/bunko data, so I included explicit UPDATE statements to update those data values via primary key. While this felt a bit "impure", it was a quick and preferred solution to the issue of a few specific records which did not fit general rules.


published by (Kirk Harr) on 2014-07-24 10:58:00 in the "bash" category

When working with shell scripts written in bash/csh/etc. one of the primary tools you have to rely on is a simple method of piping output and input from subprocesses called by the script to create complex logic to accomplish the goal of the script. When working with python, this same method of calling subprocesses to redirect the input/output is available, but the overhead of using this method in python would be so cumbersome as to make python a less desirable scripting language. In effect you were implementing large parts of the I/O facilities, and potentially even writing replacements for the existing shell utilities that would perform the same work. Recently, python developers attempted to solve this problem, by updating an existing python subprocess wrapper library called pbs, into an easier to use library called sh.

Sh can be installed using pip, and the author has posted some documentation for the library here:

Using the sh library

After installing the library into your version of python, there will be two ways to call any existing shell command available to the system, firstly you can import the command as though it was itself a python library:

from sh import hostname
In addition, you can also call the command directly by just referencing the sh namespace prior to the command name:
import sh

When running this command on my linux workstation (hostname atlas) it will return the expected results:

Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sh
>>> print(sh.hostname())

However at this point, we are merely replacing a single shell command which prints output to the screen, the real benefit of the shell scripts was that you could chain together commands in order to create complex logic to help you do work.

Advanced Gymnastics

A common use of shell scripts is to provide administrators the ability to quickly filter log file output and to potentially search for specific conditions within those logs, to alert in the event that an application starts throwing errors. With python piping in sh we can create a simple log watcher, which would be capable of calling anything we desire in python when the log file contains any of the conditions we are looking for.

To pipe together commands using the sh library, you would encapsulate each command in series to create a similar syntax to bash piping:

>>> print(sh.wc("-l", "/etc"), "-l"))

This command would have been equivalent to the bash pipe of "ls -l /etc | wc -l" indicating that the long listing of /etc on my workstation contained 199 lines of output. Each piped command is encapsulated inside the parenthesis of the command the precedes it.

For our log listener we will use the tail command along with a python iterator to watch for a potential error condition, which I will represent with the string "ERROR":

>>> for line in sh.tail("-f", "/tmp/test_log", _iter=True):
...     if "ERROR" in line:
...         print line

In this example, once executed, python will call the tail command to follow a particular log file. It will iterate over each line of output produced by tail and if any of the lines contain the string we are watching for python will print that line to standard output. At this point, this would be similar to using the tail command and piping the output to a string search command, like grep. However, you could replace the third line of the python with a more complex action, emailing the error condition out to a developer or administrator for review, or perhaps initiating a procedure to recover from the error automatically.


In this manner with just a few lines of python, much like with bash, one could create a relatively complex process without recreating all the shell commands which already perform this work, or create a convoluted wrapping process of passing output from command to command. This combination of the existing shell commands and the power of python; you get all the functions available to any python environment, with the ease of using the shell commands to do some of the work. In the future I will definitely be using this python library for my own shell scripting needs, as I have generally preferred the syntax and ease of use of python over that of bash, but now I will be able to enjoy both at the same time.