In the last few weeks, a few of us have been working on a project for Puppet involving several lines of concurrent development. We've relied extensively on the distributed nature of Git and the low cost of branching to facilitate this work. Throughout the process, I occasionally find myself pondering a few things:
- How do teams ever coordinate work effectively when their version control system lacks decent branching support?
- The ease with which commits can be sliced and diced and tossed about (merge, rebase, cherry-pick, and so on) is truly delightful
- It is not unreasonable to describe Git as "liberating" in this process: here is a tool with which the logical layer (your commit histories) largely reflects reality, with which the engineer is unencumbered in his/her ability to accomplish the task at hand, and from which the results' cleanliness or messiness is the product of the engineering team's cleanliness or messiness rather than a by-product of the tool's deficiencies
The current process, in accordance with practices in use within the
Puppet project itself, basically involves:
- One "canonical" branch in a particular repository, into which all work is merged by a single individual
- Engineers do work in their own branches/repositories, which they "publish" (in this case, on Github) through occasional pushes
- Different lines of development take place on different branches, keeping the logical threads of development separate until any given piece progresses sufficiently to warrant merging back into the canonical branch
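As a rough illustration of that process, here is a minimal sketch in a throwaway repository. The branch names (canonical, speculative-idea), the file names, and the identity settings are all made up for the example and are not taken from the Puppet project; the real workflow also involves publishing branches on Github, which is left out here.

```shell
# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/canonical   # call the main line "canonical"
git config user.name "Example Dev"
git config user.email "dev@example.com"

# The common history everyone starts from.
echo "base" > README
git add README
git commit -q -m "Common history"

# A separate line of development lives on its own branch.
git checkout -q -b speculative-idea
echo "experiment" > idea.txt
git add idea.txt
git commit -q -m "Try out an idea"

# Later, the integrator merges the finished work into the canonical branch.
git checkout -q canonical
git merge -q --no-ff -m "Merge speculative-idea" speculative-idea

# Show the resulting graph of commits.
git log --graph --oneline
```

The --no-ff option is a choice made for the sketch: it forces a merge commit so that the separate thread of development stays visible in the history.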
Seemingly-speculative development efforts are worth more in this approach, because the most seemingly-speculative work can go out on an independent branch, starting from the common history, to be used later (or not) according to need. The ease of sharing the work, of keeping it cleanly isolated but generally low-cost to integrate later, all reduce the "speculative" part of speculation.
Much of the public discussion of distributed development in practice, using Git, revolves around Linux kernel development. That's of course a massive project with many contributors and a great many lines of development. It's easy to look at distributed version control and the related development practices and say "this is not necessary; my project isn't that complex and doesn't need all this fanciness." Such a conclusion, while understandable, ignores the most important factor in all software development work: human beings do the work.
Human beings can mentally envision complex structures, relationships, and processes with instantaneous ease. While our thought processes on a given thread may move along serially, our general approach to problems often involves a graph or web rather than a single line. Furthermore, concurrent processing is second nature to all of us, depending on the situation:
- The car driver guides the steering wheel such that over the course of traveling forty feet, the car smoothly achieves a ninety-degree change of direction, while coordinating the changing of gears and acceleration through manipulation of clutch, accelerator, and gear shift, all while chatting with the child in the back seat
- The singer performing a Bach aria manipulates diaphragm, jaw, tongue, lips, etc., to achieve the ideal resonance for the current vowel across an intricate repeated sequence of pitch relationships, while focusing on the sound of the organ for tuning and ensemble, and while envisioning the expansive overarching shape of the phrase to ensure the large-scale dynamic fits the musical expression needed
- The child in the outfield hums quietly, thinking about the cartoons he watched yesterday, while intently watching to see if the tee-ball will ever be coming his way
In my experience, when speaking about development tasks with my peers, the most common situation is for the conversation to be muddied by an excess of ideas and possibilities. Too many topics and ways forward bubble about in our collective head, and development forces us to shed these until we arrive at the stripped-bare essentials. Furthermore, it is similarly common that certain questions cannot be answered in the abstract, and require the rolling-up of sleeves to arrive at a solution. Along the way to that solution, how often does one come upon implementation choices that were not previously considered, the implications of which require further assessment?
We often think, individually or collectively, in webs of relationships. A tool that requires us to develop serially defies our basic humanity. This is the true liberation Git brings: concurrent development -- by a team of many, a few, or one -- can be sanely achieved. Put the new thing in a branch and move on. Merging it later will very possibly be easy, but even if it's not, it is always possible.
To quote a special fella, "freedom's untidy". Development tools that facilitate multiple lines of concurrent development mean that one ends up in the situation of dealing with, well, multiple lines of development. The technical problem (no branching!) becomes a meatspace problem (aagh! branches!). There's no magical elixir for that problem, as it requires social solutions, such as email or a wiki. The meatspace problems exist in any case; Git simply forces you to recognize them and plan for them.
Comments
Just for fun, in the Spree Git repository:
git log | grep ^Author: | sed 's/ <.*//; s/^Author: //' | sort | uniq -c | sort -nr
813 Sean Schofield
97 Brian Quinn
81 Stephanie Powell
42 Jorge Calás Lozano
37 paulcc
27 Edmundo Valle Neto
16 Dale Hofkens
13 Gregg Pollack
12 Sonny Cook
11 Bobby Santiago
8 Paul Saieg
7 Robert Kuhr
6 pierre
6 mjwall
6 Eric Budd
5 Fabio Akita
5 Ben Marini
4 Tor Hovland
4 Jason Seifer
2 Wynn Netherland
2 Will Emerson
2 spariev
2 ron
2 Ricardo Shiota Yasuda
1 Yves Dufour
1 yitzhakbg
1 unknown
1 Tomasz Mazur
1 tom
1 Peter Berkenbosch
1 Nate Murray
1 mwestover
1 Manuel Stuefer
1 Joshua Nussbaum
1 Jon Jensen
1 Chris Gaskett
1 Caius Durling
1 Bernd Ahlers
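For comparison, git shortlog can produce a similar per-author tally without the grep/sed/sort pipeline. The sketch below runs it in a throwaway repository with two made-up authors (not the Spree repository), so the names and counts are purely illustrative.

```shell
# Throwaway repository with two hypothetical authors.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Alice Example"
git config user.email "alice@example.com"

echo one > file
git add file
git commit -q -m "first"
echo two >> file
git commit -q -a -m "second"

# One commit from a different author, set just for this command.
git -c user.name="Bob Example" -c user.email="bob@example.com" \
    commit -q --allow-empty -m "third"

# -s prints counts only, -n sorts by number of commits (most first).
# HEAD is named explicitly: without a revision, git shortlog reads a log
# from standard input when it is not attached to a terminal.
tally=$(git shortlog -s -n HEAD)
echo "$tally"
```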
Comments
During a recent discussion about git, I realized yet again that previous knowledge of a Version Control System (VCS) actively hinders understanding of git: this is especially challenging when trying to understand the difference between bare vs non-bare repositories.
An analogy might be helpful: assume a modern newspaper, where the actual contents of the physical pages are stored in a database; i.e., the database might store contents of articles in one table, author information in another, page layout information in yet another table, and information on how an edition is built in yet another table, or perhaps in an external program. Any particular edition of the paper just happens to be a particular instantiation of items that live in the database.
Suppose an editor walks in and tells the staff "Create a special edition that consists of the front pages of the past week's papers." That edition could easily be created by taking all the front page articles from the past week from the database. No new content would be needed in the content tables themselves, just some metadata changes to label the new edition and description of how to build it.
One could consider the database, then, to be the actual newspaper.
Let's apply that analogy to git:
A git repository is the newspaper database. A particular git branch is the equivalent of a particular day's paper: e.g., the edition for February 5, 2009 consisting of a set of articles, glued together by a layout specification, tied to a label 'February 5, 2009'. In git terms, that would be blobs of data, glued together by references, perhaps labeled by either a branch or a tag.
A bare git repository, then, is the newspaper database itself, not a huge stack of all the editions ever printed. That's a large contrast to some other VCSs where a repository is the first edition ever printed, with diffs stored on top of that. Running git clone is equivalent to a database copy of all the tables of the database. Doing a git checkout of a branch is the equivalent of asking the newspaper factory to read in the metadata and content from the database and produce a physical paper instance of the newspaper.
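To make the distinction concrete, here is a small sketch that creates a bare repository and a regular clone of it, then asks git which is which. The directory names (newsroom.git, desk), the file name, and the identity settings are invented for the example.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# A bare repository: only the "database", with no checked-out files.
git init -q --bare newsroom.git

# A clone copies that database and also checks out a working tree.
# (Cloning an empty repository prints a warning, silenced here.)
git clone -q newsroom.git desk 2>/dev/null
cd desk || exit 1
git config user.name "Example Editor"
git config user.email "editor@example.com"

# Produce one "edition" and send it back to the database.
echo "front page" > article.txt
git add article.txt
git commit -q -m "February 5 edition"
git push -q origin HEAD

cd "$work" || exit 1
bare_flag=$(git -C newsroom.git rev-parse --is-bare-repository)
desk_flag=$(git -C desk rev-parse --is-bare-repository)
echo "newsroom.git bare: $bare_flag, desk bare: $desk_flag"

# The bare repository holds the commit but no article.txt on disk.
ls newsroom.git
```

After the push, the bare repository contains the commit and its content as objects, but no article.txt file exists there; only the clone has the printed "paper".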
Comments
It's awesome to see that the Perl 5 source code repository has been migrated from Perforce to Git, and is now active at http://perl5.git.perl.org/. Congratulations to all those who worked hard to migrate the entire version control history, all the way back to the beginning with Perl 1.0!
Skimming through the history turns up some fun things:
- The last Perforce commit appears to have been on 16 December 2008.
- Perl 5 is still under very active development! (It seems a lot of people are missing this simple fact, so I don't feel bad stating it.)
- Perl 5.8.0 was released on 18 July 2002, and 5.6.0 on 23 March 2000. Those both seem so recent ...
- Perl 5.000 was released on 17 October 1994.
- Perl 4.0.00 was released 21 March 1991, and the last Perl 4 release, 4.0.36, was released on 4 February 1993. For having an active lifespan of only 4 or so years till Perl 5 became popular, Perl 4 code sure kicked around on servers a lot longer than that.
- Perl 1.0 was announced by Larry Wall on 18 December 1987. He called Perl a "replacement" for awk and sed. That first release included 49 regression tests.
- Some of the patches are from people whose contact information is long gone, rendered in Git commits as e.g. Dan Faigin, Doug Landauer <unknown@longtimeago>.
- The modern Internet hadn't yet completely taken over, as evidenced by email addresses such as isis!aburt and arnold@emoryu2.arpa.
- The first Larry Wall entry with email address larry@wall.org was 28 June 1988, though he continued to use his jpl.nasa.gov address after that sometimes too.
- There are some weird things in the commit notices. For example, it's hard to believe the snippet of Perl code in the following change notice wasn't somehow mangled in the conversion process:
commit d23b30860e3e4c1bd7e12ed5a35d1b90e7fa214c
Author: Larry Wall <lwall@scalpel.netlabs.com>
Date: Wed Jan 11 11:01:09 1995 -0800
duplicate DESTROY
In order to fix the duplicate DESTROY bug, I need to remove [the
modified] lines from sv_setsv.
Basically, copying an object shouldn't produce another object without an
explicit blessing. I'm not sure if this will break anything. If Ilya
and anyone else so inclined would apply this patch and see if it breaks
anything related to overloading (or anything else object-oriented), I'd
be much obliged.
By the way, here's a test script for the duplicate DESTROY. You'll note
that it prints DESTROYED twice, once for , and once for . I don't
think an object should be considered an object unless viewed through
a reference. When accessed directly it should behave as a builtin type.
#!./perl
= new main;
= '';
sub new {
my ;
local /tmp/ssh-vaEzm16429/agent.16429 = bless $a;
local = ; # Bogusly makes an object.
/tmp/ssh-vaEzm16429/agent.16429;
}
sub DESTROY {
print "DESTROYEDn";
}
Larry
sv.c | 4 ----
1 files changed, 0 insertions(+), 4 deletions(-)
Yes, it really is that weird. Check it out for yourself.
The Easy Git summary information from eg info has some interesting trivia:
Total commits: 36647
Number of contributors: 926
Number of files: 4439
Number of directories: 657
Biggest file size, in bytes: 4176496 (Changes5.8)
Commits: 31178
And there's a nice new POD document instructing how to work with the Perl repository using Git: perlrepository.
In other news, maintenance release Perl 5.8.9 is out, expected to be the last 5.8.x release. The change log shows most bundled modules have been updated.
Finally, use Perl also notes that Booking.com is donating $50,000 to further Perl development, specifically Perl 5.10 development and maintenance. They're also hosting the new Git master repository. Thanks!
Comments
It's no secret that we here at End Point love and encourage the use of version control systems to make life easier both for ourselves and for our clients. While a full-fledged development environment is ideal for maintaining/developing new client code, not everyone has the time to implement one quickly.
A situation we've sometimes found is clients editing/updating production data directly. This can be through a variety of means: direct server access, scp/sftp, or web-based editing tools which save directly to the file system.
I recently implemented a script for a client who uses a web-based tool for managing their content in order to provide transparent version control. While they are still making changes to their site directly, we now have the ability to roll back any changes on a file-by-file basis as they are created, modified, or deleted.
I wanted something that was: 1) fast, 2) useful, and 3) stayed out of the user's way. I turned naturally to git.
In the user's account, I executed git init to create a new git repository in their home directory. I then git added the relevant parts that we definitely wanted under version control. This included all of the relevant static content, the app server files, and associated configuration: basically anything we might want to track changes to.
Finally, I determined the list of directories in which we would like to automatically detect any newly created files. These corresponded to the usual places where new content was apt to show up. I codified the automatic update of the git repo in a script called git_heartbeat, which is called periodically from cron.
The basic listing for git_heartbeat:
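#!/bin/bash
# automatically add any new files in these space-separated directories
AUTO_ADD_DIRS="catalogs/acme/pages htdocs"
# make sure we're in the proper git root directory; stop if it is missing
cd /home/acme || exit 1
# actually add any newly created files in $AUTO_ADD_DIRS
# ($AUTO_ADD_DIRS is left unquoted on purpose so it splits into separate paths)
find $AUTO_ADD_DIRS -print0 | xargs -0 git add
DATE=$(date)
# -a also records modifications and deletions of already tracked files
git commit -q -a -m "Acme Co git heartbeat - $DATE" > /dev/null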
A couple notes:
- git commit -a takes care of the modification/deletion of any already tracked files. The git add ensures that any newly created files are currently in the index and will be included with the commit.
- If no files have been added, modified, or deleted, no checkpoint is created. This ensures that every commit in the log is meaningful and corresponds to an actual change to the site itself.
- Compared to other VCSs which keep metadata in each versioned subdirectory (such as Subversion), this approach stays out of the user's way; we don't have to worry about the user accidentally overwriting/deleting data in their upload directories and thus corrupting the repository.
- This approach is fast; it runs near instantaneously for thousands of files, so we could even push the cron interval to every minute if desired. For our purposes, this system works great as is.
- Once the git tools are installed, there is no need to set up a central repository; git repos are very cheap to create/use and for a use case such as this, require little to no maintenance beyond the initial setup.
Areas of improvement/known issues:
- This script could definitely be improved by reporting which files were added/modified/deleted. However, git's own tools come in quite useful here; for instance, git log --stat will show the files which each heartbeat commit affected.
- Since this is set up as a general cron job running every hour (the period is configurable, obviously), it does preclude extended stagings for non-heartbeat commits; basically, anything which takes longer than the heartbeat interval will be inadvertently committed.
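Rolling a single file back to an earlier heartbeat, as described above, might look something like the following. This is only a sketch in a throwaway repository: the file contents, commit messages, and identity settings are invented, and on a real site you would pick the commit to restore from out of git log rather than assuming it is the previous one.

```shell
site=$(mktemp -d)
cd "$site" || exit 1
git init -q
git config user.name "Example Heartbeat"
git config user.email "heartbeat@example.com"

# Two simulated heartbeat commits; the second records an unwanted edit.
mkdir htdocs
echo "original" > htdocs/index.html
git add htdocs
git commit -q -m "heartbeat 1"
echo "unwanted edit" > htdocs/index.html
git commit -q -a -m "heartbeat 2"

# See which files each heartbeat touched.
git log --stat --oneline

# Restore just this one file as it was one commit earlier, and record that.
git checkout HEAD~1 -- htdocs/index.html
git commit -q -m "Restore htdocs/index.html from previous heartbeat"

restored=$(cat htdocs/index.html)
echo "$restored"
```

Because the restore is itself a commit, the unwanted edit stays in the history and can still be inspected later.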
Comments
The ability to push and pull commits to/from remote repositories is obviously one of the great aspects of Git. However, if you're not careful with how you use git-push, you may find yourself in an embarrassing situation.
When you have multiple remote tracking branches within a Git repository, any bare git push invocation will attempt to push all of those branches out. (This is the "matching" behavior that was the default before Git 2.0; later versions default to pushing only the current branch.) If you have commits stacked up that you weren't quite ready to push out, this can be somewhat unfortunate.
There are a variety of ways to accommodate this:
- use local branches for your commits, only merging those commits into your remote tracking branches when you're ready to push them out;
- push remote tracking branches out whenever you have something worth committing.
However, even with sensible branch management practices, it's worthwhile to know exactly what it is you're pushing. Therefore, if you want to have a sense of what you're potentially doing in calling a bare git push, always call it with the --dry-run option first. This will show you what the push will send out, where the conflicts are, and so on, all without actually performing the push.
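A quick way to convince yourself that --dry-run really sends nothing is to try it against a scratch remote. In this sketch the repository names (shared.git, mine), the file, and the identity settings are made up for illustration.

```shell
work=$(mktemp -d)
cd "$work" || exit 1
git init -q --bare shared.git
git clone -q shared.git mine 2>/dev/null
cd mine || exit 1
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "work in progress" > notes.txt
git add notes.txt
git commit -q -m "Not quite ready"

# Report what would be pushed, without sending anything.
git push --dry-run origin HEAD

# The remote still has no refs at all.
remote_refs=$(git ls-remote origin)
echo "refs on remote after dry run: '${remote_refs}'"
```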
It is ultimately best, though, to understand the different ways of invoking git push so you can control things precisely and only change exactly what you want to change.
git push some_repo some_branch
This will identify the ref named some_branch within your repository and push it out to the some_repo repository. If you are good about having your remote tracking branches use the same name as the source branch in the relevant remote ref, this is a simple, effective way of ensuring that you're pushing out one branch and only one branch. However, it does require that you know the purpose of some_repo; it doesn't do any magic for deciding what the "right" repository to push to is based on some_branch.
To be extremely precise, you can use a full refspec in your push call:
git push some_repo local_branch:refs/heads/new_branch
This would take the local branch local_branch and push it out to the remote repository identified by some_repo, under the branch name new_branch. This is a very useful invocation to understand in order to create new branches in bare repositories to be shared between developers/repositories. While both examples shown here will create the branch in some_repo if it does not already exist, the second example gives the programmer full control over the branch names.
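The refspec form can be tried safely against a scratch bare repository. The sketch below reuses the placeholder names from the text (some_repo, local_branch, new_branch); the paths, the file, and the identity settings are assumptions for the example.

```shell
work=$(mktemp -d)
cd "$work" || exit 1
git init -q --bare some_repo.git
git init -q mine
cd mine || exit 1
git config user.name "Example Dev"
git config user.email "dev@example.com"
git remote add some_repo "$work/some_repo.git"

echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature"
git branch local_branch

# Push local_branch, but call it new_branch on the remote side.
git push -q some_repo local_branch:refs/heads/new_branch

# List the branches that now exist in the remote repository.
remote_heads=$(git ls-remote --heads some_repo)
echo "$remote_heads"
```

Only new_branch shows up in the remote; the name local_branch never leaves the local repository.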
If you're sharing your work with multiple developers/repositories, it can become unwieldy if not impossible to keep your tracking branch names consistent with source branch names in your remote refs. In that case, knowing these invocations of git push is an absolute necessity.
Check out the documentation on git push for a full explanation, and for an example of how to delete a branch in a remote ref. There are considerably more options for the command than what is explained here, but the refspec documentation can be a bit confusing to newcomers, in which case hopefully this discussion provides a bit more clarity. (Then again, perhaps it doesn't.)
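For reference, deleting a branch in a remote uses a refspec with an empty source side. This sketch again uses a scratch bare repository and the placeholder names some_repo and new_branch; everything else (paths, file, identity) is invented for the example.

```shell
work=$(mktemp -d)
cd "$work" || exit 1
git init -q --bare some_repo.git
git init -q mine
cd mine || exit 1
git config user.name "Example Dev"
git config user.email "dev@example.com"
git remote add some_repo "$work/some_repo.git"

echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature"

# Create new_branch on the remote from the current commit.
git push -q some_repo HEAD:refs/heads/new_branch
before=$(git ls-remote --heads some_repo)

# An empty source ("nothing") pushed to new_branch deletes it remotely.
git push -q some_repo :new_branch
after=$(git ls-remote --heads some_repo)

echo "before: $before"
echo "after: '$after'"
```

The local commit is untouched by this; only the branch name in some_repo goes away.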
Comments