
published by noreply@blogger.com (Phin Jensen) on 2017-09-19 18:10:00 in the "browsers" category

Security is often a very difficult thing to get right, especially when it's not easy to find reliable or up-to-date information or the process of testing can be confusing and complicated. We have a lot of history and experience working on the security of websites and servers, and we've found many tools and websites to be very helpful. Here is a collection of those.

Server-side security

There are a number of tools available that can scan your website to check for common vulnerabilities and the quality of SSL/TLS configuration, as well as give great tips on how to improve security for your website.

  • Qualys SSL Labs Server Test takes a simple domain name, performs a series of tests from a variety of clients, and returns a simple letter grade (from A+ down to F) indicating the quality of your SSL/TLS configuration, as well as a detailed summary for a host of configuration options. It covers certificate keys and algorithms; TLS and SSL configurations; cipher suites; handshakes on a wide variety of platforms including Android, iOS, Chrome, Firefox, Internet Explorer and Edge, Safari, and others; common protocols and vulnerabilities; and other details.
  • HTTP Security Report does a similar scan, but provides a much simpler summary of a website, with a numeric score from 0 to 100. It gives a simple, easy-to-understand list of results, with a green check mark or a red X to indicate whether something is configured for security or not. It also provides short paragraphs explaining settings and recommended configurations.
  • HT-Bridge SSL/TLS Server Test is very similar to Qualys SSL Labs Server Test, but provides some valuable extra information, such as PCI-DSS, HIPAA, and NIST guidelines compliance, as well as industry best practices and basic analysis of third-party content.
  • securityheaders.io is another letter-grade scan, but focuses on server headers only. It provides simple explanations for each recommended server header and links to guides on how to configure them correctly.
  • Observatory by Mozilla scans and gives information on HTTP, TLS, and SSH configuration, as well as simple summaries from other websites, including Qualys, HT-Bridge, and securityheaders.io as covered above.
  • SSL-Tools is focused on SSL and TLS configuration and certificates, with tools to scan websites and mail servers, check for common vulnerabilities, and decode certificates.
  • Microsoft Site Scan performs a series of simple tests, focused more on general website guidelines and best practices, including tests for outdated libraries and plugins which can be a security issue.
  • testssl.sh, the final website scanning tool I'll cover, is a more advanced bash script that covers many of the same things these other websites do, but provides lots of options for fine-tuning test methods, returned information, and testing abnormal configurations. It's also open source and doesn't rely on any third parties.
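
As a rough illustration of the kind of check a header-focused scanner performs, the sketch below lists which commonly recommended response headers are missing from a set of headers. The header list and the function name `missing_security_headers` are assumptions made for this example; this is not how securityheaders.io or any of the tools above actually grade a site.

```python
# Hypothetical helper: report which commonly recommended security
# headers are absent from a response. The list below is an assumption
# chosen for illustration, not any scanner's real grading criteria.
RECOMMENDED_HEADERS = (
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
)


def missing_security_headers(response_headers):
    """Return the recommended headers not present (case-insensitive)."""
    present = {name.lower() for name in response_headers}
    return [h for h in RECOMMENDED_HEADERS if h.lower() not in present]


if __name__ == "__main__":
    sample = {
        "Content-Type": "text/html",
        "strict-transport-security": "max-age=31536000",
        "X-Frame-Options": "DENY",
    }
    print(missing_security_headers(sample))
```

A check like this only tells you that a header is present, not that its value is sensible, which is why the scanners above are still worth running.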

These websites provide valuable information on SSL/TLS which can be used to create a secure, fast, and functional server configuration:

  • Security/Server Side TLS on the Mozilla wiki is a fantastic page which provides great summaries, recommendations, and reference information on many TLS topics, including handshakes, OCSP Stapling, HSTS, HPKP, certificate ciphers, and common attacks.
  • Mozilla SSL Configuration Generator is a simple tool that generates boilerplate server configuration files for common servers, including Apache and Nginx, and specific server and OpenSSL versions. It also allows you to target "modern", "intermediate", or "old" clients and servers, which will give the best configuration possible for each level.
  • Is TLS Fast Yet? is a great, simple, and to-the-point informational website which explains why TLS is so important and how to improve its performance so it has the smallest impact possible on your website's speed.

Client-side security

These websites provide information and diagnostic tools to ensure that you are using a secure browser.

  • badssl.com gives a list of links to subdomains with various SSL configurations, including badly configured SSL, so you can have a good idea of what a well-configured website looks like versus one with errors in configuration, weak ciphers or key exchange protocols, or insecure HTTP forms.
  • IPv6 Test checks your network and browser for IPv6 support, showing you your ISP, reverse DNS pointers, both your IPv4 and IPv6 addresses, and giving an idea of when your computer or network may have problems with dual-stack IPv4 + IPv6 remote hosts or DNS.
  • How's My SSL? and Qualys Labs SSL Client Test both check your browser for support of SSL/TLS versions, protocols, ciphers, and features, as well as susceptibility to common vulnerabilities.

General Tools

  • NeverSSL is a simple website that promises to never use SSL. Many public Wi-Fi networks require you to go through a payment or login page first, but that page can be blocked when the first site you try to open is a well-secured one such as Google, Facebook, Twitter, or Amazon, leaving you unable to connect. NeverSSL provides an easy way to reach that login page.
  • crt.sh is a search engine for public TLS certificate information. It provides a history of certificates for a given domain name, with information including issuer and issue date, as well as an advanced search.
  • Digital Attack Map is an interactive map showing DDoS attacks across the world.
  • The Internet-Wide Scan Data Repository is a public archive of scans across the internet, intended for research and provided by the University of Michigan Censys Team.
  • take-a-screenshot.org is a simple website that shows how to take a screenshot on a variety of operating systems and desktop environments. It's a fantastic tool to help less technically-minded people share their screens or issues they're having.


published by noreply@blogger.com (Greg Hanson) on 2017-08-28 22:36:00 in the "ecommerce" category
"Custom eCommerce" means different things to different people and organizations. For some eCommerce shopping cart sites that pump out literally hundreds of websites a year, it may mean you get to choose from a dizzying array of "templates" that can set your website apart from others.
For others, it may be a slightly more involved arrangement where you can "create categories" to group display of your products on your website ... after you have entered your products into a prearranged database schema.
There are many levels of "custom" out there. Generally speaking, the closer you get to true "custom", the more accurate the term "development" becomes.
It is very important for your business that you decide what fits your needs, and that you match your needs to a platform or company that can provide appropriate services. As you can imagine, this will depend entirely on your business.

Example scenarios

For example, a small one- or two-person business that does fulfillment of online orders may be well suited for a pre-built approach, where you pay a monthly fee to simply log into an admin, add your products, and some content, and the company does the rest. It handles all of the "details."
A slightly larger company that has maybe 5-10 employees, and possibly a staff member with some understanding of websites, may choose to purchase a package that requires more customization and company-related input, and perhaps even design or choice of templates.
From this level up, decisions become far more important and complex. For example, even though the previously described company may be perfectly suited with the choice described, if sales are expected to increase dramatically in the near future, or if the company is in a niche market where custom accounting or regulations require specific handling of records, a more advanced approach may be warranted.

What we do

The purpose of this post is not to give you guidelines as to what sort of website service you should buy, or consultancy you should hire for your company. Rather it is to point out some of the types of things that we at End Point do for companies that need a higher level of custom eCommerce development. In fact, the development we do is not limited to eCommerce.

We offer a full range of business consultancy and IT development services. We can guide you through many areas of your business development. True, we primarily provide services to companies that sell things on the web. But we also provide support for inventory management systems in your warehouses, accounts receivable / payable integration with your websites, management of your POS (point of sale) machines, strategic pricing for seasonal products with expiry dates, and the list goes on.

Real-life scenarios

The following is a real-life example of services we have provided for one client.

Case Study: Vervante

Consultant vs Service

Hopefully, these real-life scenarios serve as an example of how complex business needs can be, and of why an out-of-the-box "eCommerce" website will not work in every circumstance. When you find a good business consultant, you will know it. A consultant will not try to make your business fit into their template; they will listen to you and then assemble and tailor products to fit your business.
Regardless of the state of maturity of your business, very seldom will a single "system" or "website" cover all of your business needs. It will more likely be a collection of systems. Which systems and how they work together is likely what will determine success or failure. The more mature your business, the broader the scope of systems required to support the growing requirements of your business.
So whether you are a sole proprietor getting started with your business, or a CTO tasked with organizing and optimizing the many systems in your organization, understanding what type of service or partner you need is the first step. In the future I will spotlight a few other examples of how we have assisted businesses in growing and improving how they do business.


published by noreply@blogger.com (Greg Hanson) on 2017-08-28 21:21:00

A real-life scenario

The following is a real-life example of services we have provided for one of our clients.
Vervante Corporation provides a print-on-demand and order fulfillment service for thousands of customers, in their case, "Authors". Vervante needed a way for these authors to keep track of their products. Essentially, they needed an inventory management system. So we designed a complete system from the ground up that gives Vervante's authors many custom functions that simply are not offered in a pre-built package anywhere.
This is also a good time to mention that you should always view your web presence, in fact your business itself, as a process, not a one time "setup". Your products will change, your customers will change, the web will change, everything will change. If you want your business to be successful, you will change.

Some Specifics

While it is beyond the scope of this case study to describe all of the programs that were developed for Vervante, it will be valuable for the reader to sample just a few of the areas to understand how diverse a single business can be. Here are a few of the functions we have built from scratch, over several years to continue to provide Vervante, their authors, and even their vendors with efficient processes to achieve their daily business needs.

Requirements

  1. Author Requirement - First, in some cases, the best approach to a problem is to use someone else's solution! Vervante's authors have large data files that are converted to a product, and then shipped on demand as the orders come in. So we initially provided a custom file transfer process so that customers could directly upload their files to a server we set up for Vervante. Soon Vervante's rapid growth outpaced the efficacy of this system, so we investigated and determined the most efficient and cost-effective approach was to incorporate a 3rd party service. So we recommended a well known file transfer service and wrote a program to communicate with the file transfer service API. Now a client can easily describe and upload large files to Vervante.


    We were able to effectively showcase our ability to incorporate 3D models and mapping layers at BOMA through the use of Google Earth, Cesium, ArcGIS, Unity, and Sketchfab. We were also able to pull data and develop content for neighboring booths and visitors, demonstrating what an easy and data-agnostic platform Liquid Galaxy can be.

    We're very excited about the increasing traction we have in the real estate industry, and hope our involvement with BOMA will take that to the next level. If you'd like to learn more about Liquid Galaxy, please visit our website or email ask@endpoint.com.

    published by noreply@blogger.com (Mark Johnson) on 2017-06-26 13:00:00 in the "CommonAdjust" category

    Product pricing can be quite complex. A typical Interchange catalog will have at least one table in the ProductFiles directive (often products plus either options or variants) and those tables will often have one or more pricing fields (usually price and sale_price). But usually a single, static price isn't sufficient for more complex needs, such as accessory adjustments, quantity pricing, product grouping--not to mention promotions, sales, or other conditional features that may change a product's price for a given situation, dependent on the user's account or session.

    To handle this variety of pricing possibilities, a catalog developer will typically implement a CommonAdjust algorithm. CommonAdjust can accommodate all the above pricing adjustments and more, and is a powerful tool (yet can become quite arcane when reaching deeper complexity). CommonAdjust is enabled by setting the PriceField directive to a non-existent field value in the tables specified in ProductFiles.

    An adequate introduction to CommonAdjust would need at a minimum its own post, and likely a series. There are many elements that make up a CommonAdjust string, and subtle operator nuances that instruct it to operate in differing patterns. It is even possible for elements themselves to return new CommonAdjust atoms (a feature we will be leveraging in this discussion). So for this post I will assume the reader is generally familiar with CommonAdjust, and we will work through a very simple example.

    To start, let's create a CommonAdjust string that simply replaces the typical PriceField setting, and we'll allow it to accommodate a static sales price:

    ProductFiles products
    PriceField 0
    CommonAdjust :sale_price ;:price
    

    The above, in words, indicates that our products live in the products table, and we want CommonAdjust to handle our pricing by setting PriceField to a non-existent field (0 is a safe bet not to be a valid field in the products table). Our CommonAdjust string is comprised of two atoms, both of which are settors of type database lookup. In the products table, we have 2 fields: sale_price and price. If sale_price is "set" (meaning a non-zero numeric value or another CommonAdjust atom) it will be used as it comes first in the list. The semicolon indicates to Interchange "if a previous atom set a price by this point, we're done with this iteration" and, thus, the price field will be ignored. Otherwise, the next atom is checked (the price field), and as long as the price field is set, it will be used instead.

    A few comments here:
    • The bare colon indicates that the field is not restricted to a particular table. Typically, to specify the field, you would have a value like "products:price" or "variants:price". But part of the power of ProductFiles holding products in different tables is you can pick up a sku from any of them. And at that point, you don't know whether you're looking at a sku from products, variants, or as many additional tables as you'd like to grab products from. But if all of them have a price and sale_price field, then you can pick up the pricing from any of them by leaving the table off. You can think of ":price" as "*:price" where asterisk is "table this sku came from".
    • The only indicator that CommonAdjust recognizes as a terminal value is a non-zero numeric value. The proposed price is coerced to numeric, added on to the accumulated price effects of the rest of the CommonAdjust string (if applicable), and the final value is tested for truth. If it is false (empty, undef, or 0) then the process repeats.
    • What happens if none of the atoms produce a non-zero numeric value? If Interchange reaches the end of the original CommonAdjust string without hitting a set atom, it will relent and return a zero cost.
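
The lookup order described above can be sketched in a few lines of Python. This is a deliberately simplified model of the behavior, not Interchange's actual implementation: the function names `to_number` and `resolve_price` are made up for illustration, and the model ignores accumulated adjustments and atoms that return other atoms.

```python
# Simplified model of the behavior described above; NOT Interchange's
# actual implementation. Atoms are tried left to right, and the first
# one that yields a non-zero numeric value sets the price.

def to_number(value):
    """Coerce a field value to a number; anything unparsable counts as 0."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def resolve_price(row, fields=("sale_price", "price")):
    """Model of 'CommonAdjust :sale_price ;:price' for one product row."""
    for field in fields:
        price = to_number(row.get(field))
        if price:          # non-zero: the atom is "set", so stop here
            return price
    return 0.0             # no atom was set: fall back to zero cost


if __name__ == "__main__":
    print(resolve_price({"sale_price": "7.50", "price": "10.00"}))
    print(resolve_price({"sale_price": "", "price": "10.00"}))
    print(resolve_price({"sale_price": "0", "price": "0"}))
```

In this model a zero is indistinguishable from "not set", which is exactly the limitation the rest of this post works around.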

    At this point, we finally introduce our situation, and one that is not at all uncommon. What if I want a zero price? Let's say I have a promotion for buy one product, get this other product for free. Typically, a developer would be able to expect to override the prices from the database optionally by leveraging the "mv_price" parameter in the cart. So, let's adjust our CommonAdjust to accommodate that:

    CommonAdjust $ ;:sale_price ;:price
    

    The $ settor in the first atom means "look in the line-item hash for the mv_price parameter and use that, if it's set". But as we've discussed above, we "set" an atom by making it a non-zero numeric value or another CommonAdjust atom. So if we set mv_price to 0, we've gained nothing. CommonAdjust will move on to the next atom (sale_price database settor) and pick up that product's pricing from the database. And even if we set that product's sale_price and price fields to 0, it means everyone purchasing that item would get it for free (not just our promotion that allows the item to be free with the specific purchase of another item).

    In the specific case of using the $ settor in CommonAdjust, we can set mv_price to the keyword "free", and that will allow us to price the item for 0. But this restricts us to only be able to use $ and mv_price to have a free item. What if the price comes from a complex calculation, out of a usertag settor? Or out of a calc block settor? The special "free" keyword doesn't work there.

    Fortunately, there is a rarely used CommonAdjust settor that will allow for a 0 price item in a general solution. As I mentioned above, CommonAdjust calculations can themselves return other CommonAdjust atoms, which will then be operated on in a subsequent iteration. This frees us from relying on the special handling that only works with $ and mv_price, because such an atom can be returned from any of the CommonAdjust atoms and still work.

    The settor of interest is >>, and according to what documentation there is on it, it was never even intended to be used as a pricing settor! Rather, it was to be a way of redirecting to additional modes for shipping or tax calculations, which can also leverage CommonAdjust for their particular purposes. However, the key to its usefulness here is this: it does not perform any test on the value tied to it. It is set, untested, into the final result of this call to the chain_cost() routine and returned. And with no test, the fact that it's Perly false as numeric 0 is irrelevant.

    So building on our current CommonAdjust, let's leverage >> to allow our companion product to have a zero cost (assuming it is the 2nd line item in the cart):

    [calcn]
        $Items->[1]{mv_price} = '>>0';
        return;
    [/calcn]
    

    Now what happens is, $ in the first atom picks up the value out of mv_price and, because it's a CommonAdjust atom, is processed in a second iteration. But this CommonAdjust atom is very simple: take the value tied to >> and return it, untested.
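
To make the difference between a plain 0 and '>>0' concrete, here is a small Python model of the chain "$ ;:sale_price ;:price". It is only a sketch of the behavior described in this post, not Interchange's code; `resolve_price` and the dictionary layout are assumptions for illustration.

```python
# Simplified model, NOT Interchange's implementation: a value is either
# a plain number (tested for truth) or a '>>' atom (returned untested).

def resolve_price(item, row):
    """Model of 'CommonAdjust $ ;:sale_price ;:price' for one line item."""
    candidates = [item.get("mv_price"), row.get("sale_price"), row.get("price")]
    for value in candidates:
        text = "" if value is None else str(value).strip()
        if text.startswith(">>"):
            # '>>' hands back its value with no truth test, so 0 survives.
            return float(text[2:] or 0)
        try:
            price = float(text)
        except ValueError:
            price = 0.0
        if price:              # ordinary values must be non-zero to count
            return price
    return 0.0


if __name__ == "__main__":
    row = {"sale_price": "", "price": "25.00"}
    print(resolve_price({"mv_price": 0}, row))       # falls through to price
    print(resolve_price({"mv_price": ">>0"}, row))   # zero is kept
```

With mv_price set to a plain 0 the model falls through to the database price, while '>>0' short-circuits the chain and keeps the zero.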

    Perhaps our pricing is more complex than we can (or would like to) support with using $. So we want to write a usertag, where we have the full power of global Perl at our disposal, but we still have circumstances where that usertag may need to return zero-cost items. Using the built-in "free" solution, we're stuck, short of setting mv_price in the item hash within the usertag, which we may not want to do for a variety of reasons. But using >>, we have no such restriction. So let's change CommonAdjust:

    CommonAdjust $ ;[my-special-pricing] ;:sale_price ;:price
    

    Now instead of setting mv_price in the item, let's construct [my-special-pricing] to do some heavy lifting:

    UserTag my-special-pricing Routine <<EOR
    sub {
        # A bunch of conditional, complicated code, but then ...
        elsif (buy_one_get_one_test($item)) {
            # This is where we know this normally priced item is supposed to be
            # free because of our promotion. Excellent!
    
            return '>>0';
        }
        # remaining code we don't care about for this discussion
    }
    EOR
    

    Now we haven't slapped a zero cost onto the line item in a sticky fashion, like we do by setting mv_price. So presumably, above, if the user gets sneaky and removes the "buy one" sku identified by our promotion, our equally clever buy_one_get_one_test() sniffs it out, and the 0 price is no longer in effect.

    For more information on CommonAdjust, see the Custom Pricing section of the 'price' glossary entry. And for more examples of leveraging CommonAdjust for quantity and attribute pricing adjustments, see the Examples section of the CommonAdjust document entry.



    published by noreply@blogger.com (Ben Witten) on 2017-06-19 18:59:00 in the "Conference" category

    End Point had the privilege of participating in The Ocean Conference at the United Nations, hosted by the International Union for Conservation of Nature (IUCN), these past two weeks. The health of the oceans is critical, and The Ocean Conference, the first United Nations conference on this issue, presents a unique and invaluable opportunity for the world to reverse the precipitous decline of the health of the oceans and seas with concrete solutions.

    A Liquid Galaxy was set up in a prominent area on the main floor of the United Nations. End Point created custom content for the Ocean Conference, using the Liquid Galaxy's Content Management System. Visiting diplomats and government officials were able to experience this content - Liquid Galaxy's interactive panoramic setup allows visitors to feel immersed in the different locations, with video and information spanning their periphery.

    Liquid Galaxy content created for The Ocean Conference included:
    -A study of the Catlin Seaview Survey and how the world's coral reefs are changing
    -360 panoramic underwater videos
    -All Mission Blue Ocean Hope Spots
    -A guided tour of the Monaco Explorations 3 Year Expedition
    -National Marine Sanctuaries around the United States

    We were grateful to be able to create content for such a good cause, and hope to be able to do more good work for the IUCN and the UN! If you'd like to learn more, please visit our website or email ask@endpoint.com.



    published by noreply@blogger.com (Ben Witten) on 2017-06-16 20:44:00 in the "cesium" category
    This past week, End Point attended GEOINT Symposium to showcase Liquid Galaxy as an immersive panoramic GIS solution to GEOINT attendees and exhibitors alike.

    At the show, we showcased Cesium integrating with ArcGIS and WMS, Google Earth, Street View, Sketchfab, Unity, and panoramic video. Using our Content Management System, we created content around these various features so visitors to our booth could take in the full spectrum of capabilities that the Liquid Galaxy provides.

    Additionally, we were able to take data feeds for multiple other booths and display their content during the show! Our work served to show everyone at the conference that the Liquid Galaxy is a data-agnostic immersive platform that can handle any sort of data stream and offer data in a brilliant display. This can be used to show your large complex data sets in briefing rooms, conference rooms, or command centers.

    Given the incredible draw of the Liquid Galaxy, the GEOINT team took special interest in our system and formally interviewed Ben Goldstein in front of the system to learn more! You can view the video of the interview here:



    We look forward to developing the relationships we created at GEOINT, and hope to participate further in this great community moving forward. If you would like to learn more please visit our website or email ask@endpoint.com.









    published by noreply@blogger.com (Greg Sabino Mullane) on 2017-06-06 21:18:00 in the "Amazon" category

    Many of our clients at End Point are using the incredible Amazon Relational Database Service (RDS), which allows for quick setup and use of a database system. Despite minimizing many database administration tasks, some issues still exist, one of which is upgrading. Getting to a new version of Postgres is simple enough with RDS, but we've had clients use Bucardo to do the upgrade, rather than Amazon's built-in upgrade process. Some of you may be exclaiming "A trigger-based replication system just to upgrade?!"; while using it may seem unintuitive, there are some very good reasons to use Bucardo for your RDS upgrade:

    Minimize application downtime

    Many businesses are very sensitive to any database downtime, and upgrading your database to a new version always incurs that cost. Although RDS uses the ultra-fast pg_upgrade --links method, the whole upgrade process can take quite a while - or at least too long for the business to accept. Bucardo can reduce the application downtime from around seven minutes to ten seconds or less.

    Upgrade more than one version at once

    As of this writing (June 2017), RDS only allows upgrading of one major Postgres version at a time. Since pg_upgrade can easily handle upgrading older versions, this limitation will probably be fixed someday. Still, it means even more application downtime - to the tune of seven minutes for each major version. If you are going from 9.3 to 9.6 (via 9.4 and 9.5), that's at least 21 minutes of application downtime, with many unnecessary steps along the way. The total time for Bucardo to jump from 9.3 to 9.6 (or any major version to another one) is still under ten seconds.
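
The downtime arithmetic above can be written out as a quick back-of-the-envelope calculation. The figures are the estimates quoted in this post (about seven minutes per RDS major-version hop, under ten seconds for Bucardo), not measurements, and the function name is hypothetical.

```python
# Back-of-the-envelope comparison using the figures quoted above.
# The per-hop and Bucardo numbers are the article's estimates, not
# measurements; the function name is hypothetical.
MAJOR_VERSIONS = ["9.3", "9.4", "9.5", "9.6"]
MINUTES_PER_RDS_HOP = 7      # "seven minutes for each major version"
BUCARDO_SECONDS = 10         # upper bound of "under ten seconds"


def rds_downtime_minutes(source, target):
    """Downtime when RDS must step through every major version in between."""
    hops = MAJOR_VERSIONS.index(target) - MAJOR_VERSIONS.index(source)
    if hops < 0:
        raise ValueError("target must not be older than source")
    return hops * MINUTES_PER_RDS_HOP


if __name__ == "__main__":
    minutes = rds_downtime_minutes("9.3", "9.6")
    print(f"RDS native: about {minutes} minutes; Bucardo: under {BUCARDO_SECONDS} seconds")
```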

    Application testing with live data

    The Bucardo upgrade process involves setting up a second RDS instance running the newer version, copying the data from the current RDS server, and then letting Bucardo replicate the changes as they come in. With this system, you can have two "live" databases you can point your applications to. With RDS, you must create a snapshot of your current RDS, upgrade *that*, and then point your application to the new (and frozen-in-time) database. Although this is still useful for testing your application against the newer version of the database, it is not as useful as having an automatically-updated version of the database.

    Control and easy rollback

    With Bucardo, the initial setup costs, and the overhead of using triggers on your production database, are balanced a bit by ensuring you have complete control over the upgrade process. The migration can happen when you want, at a pace you want, and can even happen in stages as you point some of the applications in your stack to the new version, while keeping some pointed at the old. And rolling back is as simple as pointing apps back at the older version. You could even set up Bucardo as "master-master", such that both new and old versions can write data at the same time (although this step is rarely necessary).

    Database bloat removal

    Although the pg_upgrade program that Amazon RDS uses for upgrading is extraordinarily fast and efficient, the data files are seldom, if ever, changed at all, and table and index bloat is never removed. On the other hand, an upgrade system using Bucardo creates the tables from scratch on the new database, and thus completely removes all historical bloat. (Indeed, one time a client thought something had gone wrong, as the new version's total database size had shrunk radically - but it was simply removal of all table bloat!).

    Statistics remain in place

    The pg_upgrade program currently has a glaring flaw: it does not copy the information in the pg_statistic table. This means that although an Amazon RDS upgrade completes in about seven minutes, the performance will range somewhere from slightly slow to completely unusable, until all those statistics are regenerated on the new version via the ANALYZE command. How long this can take depends on a number of factors, but in general, the larger your database, the longer it will take - a database-wide analyze can take hours on very large databases. As mentioned above, upgrading via Bucardo relies on COPYing the data to a fresh copy of the table. Although the statistics also need to be created when using Bucardo, the time cost for this does NOT apply to the upgrade time, as it can be done any time earlier, making the effective cost of generating statistics zero.

    Upgrading RDS the Amazon way

    Having said all that, the native upgrade system for RDS is very simple and fast. If the drawbacks above do not apply to you - or can be suffered with minimal business pain - then this way should always be the upgrade approach to use. Here is a quick walk through of how an Amazon RDS upgrade is done.

    For this example, we will create a new Amazon RDS instance. The creation is amazingly simple: just log into aws.amazon.com, choose RDS, choose PostgreSQL (always the best choice!), and then fill in a few details, such as preferred version, server size, etc. The "DB Engine Version" was set as "PostgreSQL 9.3.16-R1", the "DB Instance Class" as db.t2.small -- 1 vCPU, 2 GiB RAM, and "Multi-AZ Deployment" as no. All other choices are the default. To finish up this section of the setup, "DB Instance Identifier" was set to gregtest, the "Master Username" to greg, and the "Master Password" to b5fc93f818a3a8065c3b25b5e45fec19.

    Clicking on "Next Step" brings up more options, but the only one that needs to change is to specify the "Database Name" as gtest. Finally, click the "Launch DB Instance" button. The new database is on the way! Select "View your DB Instance" and then keep reloading until the "Status" changes to Active.

    Once the instance is running, you will be shown a connection string that looks like this: gregtest.zqsvirfhzvg.us-east-1.rds.amazonaws.com:5432. That standard port is not a problem, but who wants to ever type that hostname out, or even have to look at it? The pg_service.conf file comes to the rescue with this new entry inside the ~/.pg_service.conf file:

    [gtest]
    host=gregtest.zqsvirfhzvg.us-east-1.rds.amazonaws.com
    port=5432
    dbname=gtest
    user=greg
    password=b5fc93f818a3a8065c3b25b5e45fec19
    connect_timeout=10
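
Because the service file uses a simple INI-style layout, a quick way to sanity-check an entry before pointing psql at it is to read it with Python's standard configparser. This is only a sketch: libpq has its own parser, so configparser is an approximation, the helper name `load_service` is made up, and the password in the sample text is a placeholder.

```python
import configparser

# Sample text mirrors the entry above; the password is a placeholder.
SERVICE_FILE_TEXT = """
[gtest]
host=gregtest.zqsvirfhzvg.us-east-1.rds.amazonaws.com
port=5432
dbname=gtest
user=greg
password=not-the-real-password
connect_timeout=10
"""


def load_service(text, name):
    """Return the settings of one service section as a plain dict."""
    # interpolation=None keeps any '%' in values (e.g. passwords) literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser[name])


if __name__ == "__main__":
    service = load_service(SERVICE_FILE_TEXT, "gtest")
    print(service["host"], service["port"], service["dbname"])
```

A missing section raises KeyError, which is a cheap way to catch a mistyped service name.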
    

    Now we run a quick test to make sure psql is able to connect, and that the database is an Amazon RDS database:

    $ psql service=gtest -Atc "show rds.superuser_variables"
    session_replication_role
    

    We want to use the pgbench program to add a little content to the database, just to give the upgrade process something to do. Unfortunately, we cannot simply feed the "service=gtest" line to the pgbench program, but a little environment variable craftiness gets the job done:

    $ unset PGSERVICEFILE PGSERVICE PGHOST PGPORT PGUSER PGDATABASE
    $ export PGSERVICEFILE=/home/greg/.pg_service.conf PGSERVICE=gtest
    $ pgbench -i -s 4
    NOTICE:  table "pgbench_history" does not exist, skipping
    NOTICE:  table "pgbench_tellers" does not exist, skipping
    NOTICE:  table "pgbench_accounts" does not exist, skipping
    NOTICE:  table "pgbench_branches" does not exist, skipping
    creating tables...
    100000 of 400000 tuples (25%) done (elapsed 0.66 s, remaining 0.72 s)
    200000 of 400000 tuples (50%) done (elapsed 1.69 s, remaining 0.78 s)
    300000 of 400000 tuples (75%) done (elapsed 4.83 s, remaining 0.68 s)
    400000 of 400000 tuples (100%) done (elapsed 7.84 s, remaining 0.00 s)
    vacuum...
    set primary keys...
    done.
    

    At 68MB in size, this is still not a big database - so let's create a large table, then create a bunch of databases, to make pg_upgrade work a little harder:

    ## Make the whole database 1707 MB:
    $ psql service=gtest -c "CREATE TABLE extra AS SELECT * FROM pgbench_accounts"
    SELECT 400000
    $ for i in {1..5}; do psql service=gtest -qc "INSERT INTO extra SELECT * FROM extra"; done
    
    ## Make the whole cluster about 17 GB:
    $ for i in {1..9}; do psql service=gtest -qc "CREATE DATABASE gtest$i TEMPLATE gtest" ; done
    $ psql service=gtest -c "SELECT pg_size_pretty(sum(pg_database_size(oid))) FROM pg_database WHERE datname ~ 'gtest'"
    17 GB
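
    The numbers above can be sanity-checked with a little arithmetic. This is only a back-of-the-envelope sketch in Python. It assumes that every copied database is the same size as its template:

```python
# Row count of the "extra" table: it starts as a copy of pgbench_accounts
# (400000 rows at scale factor 4) and is doubled by each of the 5 INSERTs.
initial_rows = 400_000
doublings = 5
extra_rows = initial_rows * 2 ** doublings

# Cluster size: the original database plus the 9 copies made from it.
database_mb = 1707            # size quoted in the shell comment above
databases = 1 + 9
cluster_gb = database_mb * databases / 1024

print(extra_rows)             # 12800000
print(round(cluster_gb, 1))   # 16.7
```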
    

    To start the upgrade, we log into the AWS console, and choose "Instance Actions", then "Modify". Our only choices for the new version are "9.4.9" and "9.4.11", plus some older revisions in the 9.3 branch. Why anything other than the latest revision in the next major branch (i.e. 9.4.11) is shown, I have no idea! Choose 9.4.11, scroll down to the bottom, choose "Apply Immediately", then "Continue", then "Modify DB Instance". The upgrade has begun!

    How long will it take? All you can do is keep refreshing to see when your new database is ready. As mentioned above, the total time was 7 minutes and 30 seconds. The logs show how things break down:

    11:52:43 DB instance shutdown
    11:55:06 Backing up DB instance
    11:56:12 DB instance shutdown
    11:58:42 The parameter max_wal_senders was set to a value incompatible with replication. It has been adjusted from 5 to 10.
    11:59:56 DB instance restarted
    12:00:18 Updated to use DBParameterGroup default.postgres9.4
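
    To see where the time goes, the gaps between consecutive log lines can be computed directly. A small Python sketch, using only the timestamps shown above (the message texts are shortened):

```python
from datetime import datetime

# (timestamp, message) pairs copied from the log above
events = [
    ("11:52:43", "DB instance shutdown"),
    ("11:55:06", "Backing up DB instance"),
    ("11:56:12", "DB instance shutdown"),
    ("11:58:42", "max_wal_senders adjusted"),
    ("11:59:56", "DB instance restarted"),
    ("12:00:18", "Updated to use DBParameterGroup default.postgres9.4"),
]

times = [datetime.strptime(stamp, "%H:%M:%S") for stamp, _ in events]

# Seconds elapsed between each log line and the next one
gaps = [int((later - earlier).total_seconds())
        for earlier, later in zip(times, times[1:])]
total = int((times[-1] - times[0]).total_seconds())

for (stamp, message), gap in zip(events, gaps):
    print(f"{stamp}  {message}: {gap} s until the next entry")
print(f"first to last entry: {total // 60} min {total % 60} s")
```

    The span between the first and the last entry comes out at 7 min 35 s, in line with the roughly seven and a half minutes quoted above.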
    

    How much of that time is spent on upgrading though? Surprisingly little. We can do a quick local test to see how long the same database takes to upgrade from 9.3 to 9.4 using pg_upgrade --links: 20 seconds! Ideally Amazon will improve upon the total downtime at some point.

    Upgrading RDS with Bucardo

    As an asynchronous, trigger-based replication system, Bucardo is perfect for situations like this where you need to temporarily sync up two concurrent versions of Postgres. The basic process is to create a new Amazon RDS instance of your new Postgres version (e.g. 9.6), install the Bucardo program on a cheap EC2 box, and then have Bucardo replicate from the old Postgres version (e.g. 9.3) to the new one. Once both instances are in sync, just point your application to the new version and shut the old one down. One way to perform the upgrade is detailed below.

    Some of the steps are simplified, but the overall process is intact. First, find a temporary box for Bucardo to run on. It doesn't have to be powerful, or have much disk space, but as network connectivity is important, using an EC2 box is recommended. Install Postgres (9.6 or better, because of pg_dump) and Bucardo (latest or HEAD recommended), then put your old and new RDS databases into your pg_service.conf file as "rds93" and "rds96" to keep things simple.

    The next step is to make a copy of the database on the new Postgres 9.6 RDS database. We want the bare minimum schema here: no data, no triggers, no indexes, etc. Luckily, this is simple using pg_dump:

    $ pg_dump service=rds93 --section=pre-data | psql -q service=rds96
    

    From this point forward, no DDL should be run on the old server. We take a snapshot of the post-data items right away and save it to a file for later:

    $ pg_dump service=rds93 --section=post-data -f rds.postdata.pg
    

    Time to get Bucardo ready. Recall that Bucardo can only replicate tables that have a primary key or unique index. Tables without one cannot be replicated, but if they are small enough, you can simply copy them over at the final point of migration later.

    $ bucardo install
    $ bucardo add db A dbservice=rds93
    $ bucardo add db B dbservice=rds96
    ## Create a sync and name it 'migrate_rds':
    $ bucardo add sync migrate_rds tables=all dbs=A,B
    

    That's it! The current database will now have triggers that are recording any changes made, so we may safely do a bulk copy to the new database. This step might take a very long time, but that's not a problem.

    $ pg_dump service=rds93 --section=data | psql -q service=rds96
    

    Before we create the indexes on the new server, we start the Bucardo sync to copy over any rows that were changed while the pg_dump was going on. After that, the indexes, primary keys, and other items can be created:

    $ bucardo start
    $ tail -f log.bucardo ## Wait until the sync finishes once
    $ bucardo stop
    $ psql service=rds96 -q -f rds.postdata.pg 
    

    For the final migration, we simply stop anything from writing to the 9.3 database, have Bucardo perform a final sync of any changed rows, and then point your application to the 9.6 database. The whole process can happen very quickly: well under a minute for most cases.

    Upgrading major Postgres versions is never a trivial task, but both Bucardo and pg_upgrade allow it to be orders of magnitude faster and easier than the old method of using the pg_dump utility. Upgrading your Amazon AWS Postgres instance is fast and easy using the AWS pg_upgrade method, but it has limitations, so having Bucardo help out can be a very useful option.



    published by noreply@blogger.com (Jon Jensen) on 2017-06-02 03:58:00 in the ".NET" category

    End Point has the pleasure to announce some very big news!

    After an amicable wooing period, End Point has purchased the software consulting company Series Digital, a NYC-based firm that designs and builds custom software solutions. Over the past decade, Series Digital has automated business processes, brought new ideas to market, and built large-scale dynamic infrastructure.

    Series Digital launched in 2006 in New York City. From the start, Series Digital managed large database installations for financial services clients such as Goldman Sachs, Merrill Lynch, and Citigroup. They also worked with startups including Drop.io, Byte, Mode Analytics, Domino, and Brewster.

    These growth-focused, data-intensive businesses benefited from Series Digital's expertise in scalable infrastructure, project management, and information security. Today, Series Digital supports clients across many major industry sectors and has focused its development efforts on the Microsoft .NET ecosystem. They have strong design and user experience expertise. Their client list is global.

    The Series Digital team began working at End Point on April 3rd, 2017.

    The CEO of Series Digital is Jonathan Blessing. He joins End Point's leadership team as Director of Client Engagements. End Point has had a relationship with Jonathan since 2010, and looks forward with great anticipation to the role he will play expanding End Point's consulting business.

    To help support End Point's expansion into .NET solutions, End Point has hired Dan Briones, a 25-year veteran of IT infrastructure engineering, to serve as Project and Team Manager for the Series Digital group. Dan started working with End Point at the end of March.

    The End Point leadership team is very excited by the addition of Dan, Jonathan, and the rest of the talented Series Digital team: Jon Allen, Ed Huott, Dylan Wooters, Vasile Laur, Liz Flyntz, Andrew Grosser, William Yeack, and Ian Neilsen.

    End Point's reputation has been built upon its excellence in e-commerce, managed infrastructure, and database support. We are excited by the addition of Series Digital, which both deepens those abilities and allows us to offer new services.

    Talk to us to hear about the new ways we can help you!



    published by noreply@blogger.com (Kamil Ciemniewski) on 2017-05-30 18:18:00 in the "computer vision" category
    In the previous two posts on machine learning, I presented a very basic introduction to an approach called "probabilistic graphical models". In this post I'd like to take a tour of some different techniques while creating code that will recognize handwritten digits.

    Handwritten digit recognition is an interesting topic that has been explored for many years. It is now considered one of the best ways to start the journey into the world of machine learning.

    Taking the Kaggle challenge

    We'll take the "digits recognition" challenge as presented on Kaggle, an online platform with challenges for data scientists. Most of the challenges offer real money as prizes. Some of them are there to help us out in our journey of learning data science techniques, and the "digits recognition" contest is one of those.

    The challenge

    As explained on Kaggle:

    MNIST ("Modified National Institute of Standards and Technology") is the de facto "hello world" dataset of computer vision.

    The "digits recognition" challenge is one of the best ways to get acquainted with machine learning and computer vision. The so-called "MNIST" dataset consists of 70k images of handwritten digits, each one grayscale and 28x28 pixels in size. The Kaggle challenge is about taking a subset of 42k of them along with labels (the actual number each image shows) and "training" the computer on that set. The next step is to take the remaining 28k images without the labels and "predict" which number each of them presents.
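
    Each image arrives as one flat row of 28 * 28 = 784 pixel values, so the first practical step is to fold that row back into a square. A minimal Python sketch, assuming the pixels are stored row by row (the sample data is made up and is not a real MNIST digit):

```python
def to_image(flat, side=28):
    """Split a flat list of side*side pixel values into `side` rows."""
    if len(flat) != side * side:
        raise ValueError("expected %d pixels, got %d" % (side * side, len(flat)))
    return [flat[r * side:(r + 1) * side] for r in range(side)]


# A made-up all-black image with one bright pixel
flat = [0] * 784
flat[28 * 3 + 5] = 255          # row 3, column 5 (zero-based)

image = to_image(flat)
print(len(image), len(image[0]))   # 28 28
print(image[3][5])                 # 255
```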

    Here's a short overview of what the digits in the set really look like (along with the numbers they represent):


    I have to admit that for some of them I have a really hard time recognizing the actual numbers on my own :)

    The general approach to supervised learning

    Learning from labelled data is what is called "supervised learning". It's supervised because we're taking the computer by the hand through the whole training data set and "teaching" it what the data linked with each label "looks" like.

    In all such scenarios we can express the data and labels as:
    Y ~ X1, X2, X3, X4, ..., Xn
    Y is called the dependent variable, while each Xn is an independent variable. This formula holds for classification problems as well as for regressions.

    Classification is when the dependent variable Y is so-called categorical, taking values from a concrete set without a meaningful order. Regression is when Y is not categorical, most often continuous.

    In the digits recognition challenge we're faced with the classification task. The dependent variable takes values from the set:
    Y = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
    I'm sure the question you might be asking yourself now is: what are the independent variables Xn? It turns out to be the crux of the whole problem to solve :)

    The plan of attack

    A good introduction to computer vision techniques is a book by J. R. Parker: "Algorithms for Image Processing and Computer Vision". I encourage the reader to buy that book. I took some ideas from it while having fun with my own solution to the challenge.

    The book outlines the ideas revolving around computing image profiles for each side. For each row of pixels, a number representing the distance of the first pixel from the edge is computed. This way we're getting our first independent variables. To capture even more information about digit shapes, we'll also capture the differences between consecutive row values as well as their global maxima and minima. We'll also compute the width of the shape for each row.
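
    The profile idea is easiest to see on a tiny example. The following Python sketch is only an illustration and not the code used for the challenge. The function names and the 4x5 sample shape are made up:

```python
def left_profile(image):
    """For each row, 1-based column of the first non-zero pixel (0 if none)."""
    profile = []
    for row in image:
        position = 0
        for x, value in enumerate(row, start=1):
            if value > 0:
                position = x
                break
        profile.append(position)
    return profile


def differences(profile):
    """Change between consecutive rows; the first row is compared to 0."""
    return [cur - prev for prev, cur in zip([0] + profile[:-1], profile)]


# A made-up 4x5 shape, 1 = ink
image = [
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 1, 1, 1],
]
profile = left_profile(image)
print(profile)               # [3, 2, 1, 2]
print(differences(profile))  # [3, -1, -1, 1]
```

    The right profile works the same way, only scanning each row from the opposite edge.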

    Because the handwritten digits vary greatly in their thickness, we will first preprocess the images to detect so-called skeletons of the digits. The skeleton is an image representation where the thickness of the shape has been reduced to just one pixel.

    Having the image thinned will also allow us to capture some more info about the shapes. We will write an algorithm that walks the skeleton and records the direction change frequencies.
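
    To make "direction change frequencies" more concrete, here is a small Python sketch. It assumes the walk over the skeleton has already produced an ordered list of (row, column) points, and it numbers the eight possible moves clockwise starting from "up". The sample path is made up:

```python
# Direction codes around a pixel, clockwise from "up":
#   8 1 2
#   7 - 3
#   6 5 4
MOVES = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def direction_frequencies(path):
    """Relative frequency of each of the 8 directions along a pixel path."""
    counts = [0] * 8
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        code = MOVES.index((r1 - r0, c1 - c0))   # 0-based direction code
        counts[code] += 1
    steps = sum(counts)
    return [count / steps for count in counts] if steps else counts


# A made-up path: two steps to the right, then two steps down
path = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
frequencies = direction_frequencies(path)
print(frequencies)
```

    For this path half of the moves go right (code 3) and half go down (code 5), so those two entries are 0.5 and the rest are 0.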

    Once we have our set of independent variables Xn, we'll use a classification algorithm to first learn in a supervised way (using the provided labels) and then to predict the values of the test data set. Lastly we'll submit our predictions to Kaggle and see how well we did.

    Having fun with languages

    In the data science world, the lingua franca still remains the R programming language. In recent years Python has also come close in popularity, and nowadays we can say it's the duo of R and Python that rules the data science world (not counting high performance code written e.g. in C++ in production systems).

    Lately a new language designed with data scientists in mind has emerged: Julia. It's a language with characteristics of both dynamically typed scripting languages and strictly typed compiled ones. It compiles its code into efficient native binary via LLVM, but it does so in a JIT fashion, inferring the types when needed on the go.

    While having fun with the Kaggle challenge I'll use Julia and Python for the so-called feature extraction phase (the one in which we're computing information about our Xn variables). I'll then turn towards R for doing the classification itself. Note that I could use any of those languages at each step and get very similar results. The purpose of this series of articles is to be a fun bird's-eye overview, so I decided that this way would be much more interesting.

    Feature Extraction

    The end result of this phase is the data frame saved as a CSV file so that we'll be able to load it in R and do the classification.

    First let's define the general function in Julia that takes the name of the input CSV file and returns a data frame with features of given images extracted into columns:
    using DataFrames
    using ProgressMeter  # provides the @showprogress macro used in extract
    
    function get_data(name :: String, include_label = true)
      println("Loading CSV file into a data frame...")
      table = readtable(string(name, ".csv"))
      extract(table, include_label)
    end
    
    Now the extract function looks like the following:
    """
    Extracts the features from the dataframe. Puts them into
    separate columns and removes all other columns except the
    labels.
    
    The features:
    
    * Left and right profiles (after fitting into the same sized rect):
      * Min
      * Max
      * Width[y]
      * Diff[y]
    * Paths:
      * Frequencies of movement directions
      * Simplified directions:
        * Frequencies of 2 element simplified paths
    """
    function extract(frame :: DataFrame, include_label = true)
      println("Reshaping data...")
      
      function to_image(flat :: Array{Float64}) :: Array{Float64}
        dim      = Base.isqrt(length(flat))
        reshape(flat, (dim, dim))'
      end
      
      from = include_label ? 2 : 1
      frame[:pixels] = map((i) -> convert(Array{Float64}, frame[i, from:end]) |> to_image, 1:size(frame, 1))
      images = frame[:, :pixels] ./ 255
      data = Array{Array{Float64}}(length(images))
      
      @showprogress 1 "Computing features..." for i in 1:length(images)
        features = pixels_to_features(images[i])
        data[i] = features_to_row(features)
      end
      start_column = include_label ? [:label] : []
      columns = vcat(start_column, features_columns(images[1]))
      
      result = DataFrame()
      for c in columns
        result[c] = []
      end
    
      for i in 1:length(data)
        if include_label
          push!(result, vcat(frame[i, :label], data[i]))
        else
          push!(result, vcat([],               data[i]))
        end
      end
    
      result
    end
    
    A few nice things to notice here about Julia itself are:
    • The function documentation is written in Markdown
    • We can nest functions inside other functions
    • The language is dynamically but strongly typed, with optional type annotations
    • Types can be inferred from the context
    • It is often desirable to provide the concrete types to improve performance (but that's an advanced Julia-related topic)
    • Arrays are indexed from 1
    • There's the nice |> operator found e.g. in Elixir (which I absolutely love)
    The above code converts the images to arrays of Float64 and scales the values to lie between 0 and 1 (instead of the original 0..255).

    A thing to notice is that in Julia we can vectorize operations easily, and we're using this fact to tersely convert our numbers:
    images = frame[:, :pixels] ./ 255
    We are referencing the pixels_to_features function which we define as:
    """
    Returns ImageStats struct for the image pixels
    given as an argument
    """
    function pixels_to_features(image :: Array{Float64})
      dim      = Base.isqrt(length(image))
      skeleton = compute_skeleton(image)
      bounds   = compute_bounds(skeleton)
      resized  = compute_resized(skeleton, bounds, (dim, dim))
      left     = compute_profile(resized, :left)
      right    = compute_profile(resized, :right)
      width_min, width_max, width_at = compute_widths(left, right, image)
      frequencies, simples = compute_transitions(skeleton)
    
      ImageStats(dim, left, right, width_min, width_max, width_at, frequencies, simples)
    end
    
    This in turn uses the ImageStats structure:
    immutable ImageStats
      image_dim             :: Int64
      left                  :: ProfileStats
      right                 :: ProfileStats
      width_min             :: Int64
      width_max             :: Int64
      width_at              :: Array{Int64}
      direction_frequencies :: Array{Float64}
    
      # The following adds information about transitions
      # in 2 element simplified paths:
      simple_direction_frequencies :: Array{Float64}
    end
    
    immutable ProfileStats
      min :: Int64
      max :: Int64
      at  :: Array{Int64}
      diff :: Array{Int64}
    end
    
    The pixels_to_features function first gets the skeleton of the digit shape as an image and then uses other functions passing that skeleton to them. The function returning the skeleton utilizes the fact that in Julia it's trivially easy to use Python libraries. Here's its definition:
    using PyCall
    
    @pyimport skimage.morphology as cv
    
    """
    Thin the number in the image by computing the skeleton
    """
    function compute_skeleton(number_image :: Array{Float64}) :: Array{Float64}
      convert(Array{Float64}, cv.skeletonize_3d(number_image))
    end
    
    It uses the scikit-image library's skeletonize_3d function, imported with the @pyimport macro, and calls it as if it were just regular Julia code.

    Next the code crops the digit itself from the 28x28 image and resizes it back to 28x28 so that the edges of the shape always "touch" the edges of the image. For this we need the function that returns the bounds of the shape so that it's easy to do the cropping:
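    # Bounds is not defined in the original listing; this definition is
    # inferred from how it is constructed and used below.
    immutable Bounds
      top    :: Int64
      right  :: Int64
      bottom :: Int64
      left   :: Int64
    end

    function compute_bounds(number_image :: Array{Float64}) :: Bounds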
      rows = size(number_image, 1)
      cols = size(number_image, 2)
    
      saw_top = false
      saw_bottom = false
    
      top = 1
      bottom = rows
      left = cols
      right = 1
    
      for y = 1:rows
        saw_left = false
        row_sum = 0
    
        for x = 1:cols
          row_sum += number_image[y, x]
    
          if !saw_top && number_image[y, x] > 0
            saw_top = true
            top = y
          end
    
          if !saw_left && number_image[y, x] > 0 && x < left
            saw_left = true
            left = x
          end
    
          if saw_top && !saw_bottom && x == cols && row_sum == 0
            saw_bottom = true
            bottom = y - 1
          end
    
          if number_image[y, x] > 0 && x > right
            right = x
          end
        end
      end
      Bounds(top, right, bottom, left)
    end
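
    The same bounding-box idea can be written more compactly. The Python sketch below is not a line-for-line port of the function above: it treats the last row that contains ink as the bottom, and it rejects empty images. The sample image is made up:

```python
def compute_bounds(image):
    """Return (top, right, bottom, left) as 1-based indices of the ink area."""
    rows_with_ink = [y for y, row in enumerate(image, start=1)
                     if any(value > 0 for value in row)]
    cols_with_ink = [x for x in range(1, len(image[0]) + 1)
                     if any(row[x - 1] > 0 for row in image)]
    if not rows_with_ink:
        raise ValueError("image contains no non-zero pixels")
    return (rows_with_ink[0], cols_with_ink[-1],
            rows_with_ink[-1], cols_with_ink[0])


# A made-up 4x4 image, 1 = ink
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(compute_bounds(image))   # (2, 3, 3, 2)
```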
    
    Resizing the image is pretty straight-forward:
    using Images
    
    function compute_resized(image :: Array{Float64}, bounds :: Bounds, dims :: Tuple{Int64, Int64}) :: Array{Float64}
      # Rows are indexed first, so the vertical bounds (top:bottom) come first
      cropped = image[bounds.top:bounds.bottom, bounds.left:bounds.right]
      imresize(cropped, dims)
    end
    
    Next, we need to compute the profile stats as described in our plan of attack:
    function compute_profile(image :: Array{Float64}, side :: Symbol) :: ProfileStats
      @assert side == :left || side == :right
    
      rows = size(image, 1)
      cols = size(image, 2)
    
      columns = side == :left ? collect(1:cols) : (collect(1:cols) |> reverse)
      at = zeros(Int64, rows)
      diff = zeros(Int64, rows)
      min = rows
      max = 0
    
      min_val = cols
      max_val = 0
    
      for y = 1:rows
        for x = columns
          if image[y, x] > 0
            at[y] = side == :left ? x : cols - x + 1
    
            if at[y] < min_val
              min_val = at[y]
              min = y
            end
    
            if at[y] > max_val
              max_val = at[y]
              max = y
            end
            break
          end
        end
        if y == 1
          diff[y] = at[y]
        else
          diff[y] = at[y] - at[y - 1]
        end
      end
    
      ProfileStats(min, max, at, diff)
    end
    
    The widths of shapes can be computed with the following:
    function compute_widths(left :: ProfileStats, right :: ProfileStats, image :: Array{Float64}) :: Tuple{Int64, Int64, Array{Int64}}
      image_width = size(image, 2)
      min_width = image_width
      max_width = 0
      width_ats = length(left.at) |> zeros
    
      for row in 1:length(left.at)
        width_ats[row] = image_width - (left.at[row] - 1) - (right.at[row] - 1)
    
        if width_ats[row] < min_width
          min_width = width_ats[row]
        end
    
        if width_ats[row] > max_width
          max_width = width_ats[row]
        end
      end
    
      (min_width, max_width, width_ats)
    end
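
    A quick worked example shows what the width formula does. In this Python sketch the two profiles are made-up values for a 5-pixel-wide image. Each one holds the 1-based distance of the first ink pixel from its own edge:

```python
def compute_widths(left_at, right_at, image_width):
    """Width of the shape in each row, plus the smallest and largest width."""
    widths = [image_width - (left - 1) - (right - 1)
              for left, right in zip(left_at, right_at)]
    return min(widths), max(widths), widths


# Profiles of a made-up 4-row shape in a 5-pixel-wide image
left_at = [3, 2, 1, 2]     # measured from the left edge
right_at = [2, 4, 5, 1]    # measured from the right edge

print(compute_widths(left_at, right_at, 5))   # (1, 4, [2, 1, 1, 4])
```

    In the first row, for instance, the ink starts at column 3 from the left and column 2 from the right, which leaves 5 - 2 - 1 = 2 pixels of shape.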
    
    And lastly, the transitions:
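    # Point is not defined in the original listing; a (row, column) tuple is assumed
    const Point = Tuple{Int64, Int64}

    function compute_transitions(image :: Array{Float64}) :: Tuple{Array{Float64}, Array{Float64}}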
      history = zeros((size(image,1), size(image,2)))
    
      function next_point() :: Nullable{Point}
        point = Nullable()
    
        for row in 1:size(image, 1) |> reverse
          for col in 1:size(image, 2) |> reverse
            if image[row, col] > 0.0 && history[row, col] == 0.0
              point = Nullable((row, col))
              history[row, col] = 1.0
    
              return point
            end
          end
        end
      end
    
      function next_point(point :: Nullable{Point}) :: Tuple{Nullable{Point}, Int64}
        result = Nullable()
        trans = 0
    
        function direction_to_moves(direction :: Int64) :: Tuple{Int64, Int64}
          # for frequencies:
          # 8 1 2
          # 7 - 3
          # 6 5 4
          [
           ( -1,  0 ),
           ( -1,  1 ),
           (  0,  1 ),
           (  1,  1 ),
           (  1,  0 ),
           (  1, -1 ),
           (  0, -1 ),
           ( -1, -1 ),
          ][direction]
        end
    
        function peek_point(direction :: Int64) :: Nullable{Point}
          actual_current = get(point)
    
          row_move, col_move = direction_to_moves(direction)
    
          new_row = actual_current[1] + row_move
          new_col = actual_current[2] + col_move
    
          if new_row <= size(image, 1) && new_col <= size(image, 2) &&
             new_row >= 1 && new_col >= 1
            return Nullable((new_row, new_col))
          else
            return Nullable()
          end
        end
    
        for direction in 1:8
          peeked = peek_point(direction)
    
          if !isnull(peeked)
            actual = get(peeked)
            if image[actual[1], actual[2]] > 0.0 && history[actual[1], actual[2]] == 0.0
              result = peeked
              history[actual[1], actual[2]] = 1
              trans = direction
              break
            end
          end
        end
    
        ( result, trans )
      end
    
      function trans_to_simples(transition :: Int64) :: Array{Int64}
        # for frequencies:
        # 8 1 2
        # 7 - 3
        # 6 5 4
    
        # for simples:
        # - 1 -
        # 4 - 2
        # - 3 -
        [
          [ 1 ],
          [ 1, 2 ],
          [ 2 ],
          [ 2, 3 ],
          [ 3 ],
          [ 3, 4 ],
          [ 4 ],
          [ 1, 4 ]
        ][transition]
      end
    
      transitions     = zeros(8)
      simples         = zeros(16)
      last_simples    = [ ]
      point           = next_point()
      num_transitions = .0
      ind(r, c) = (c - 1)*4 + r
    
      while !isnull(point)
        point, trans = next_point(point)
    
        if isnull(point)
          point = next_point()
        else
          current_simples = trans_to_simples(trans)
          transitions[trans] += 1
          for simple in current_simples
            for last_simple in last_simples
              simples[ind(last_simple, simple)] +=1
            end
          end
          last_simples = current_simples
          num_transitions += 1.0
        end
      end
    
      (transitions ./ num_transitions, simples ./ num_transitions)
    end
    
    All those gathered features can be turned into rows with:
    function features_to_row(features :: ImageStats)
      lefts       = [ features.left.min,  features.left.max  ]
      rights      = [ features.right.min, features.right.max ]
    
      left_ats    = [ features.left.at[i]  for i in 1:features.image_dim ]
      left_diffs  = [ features.left.diff[i]  for i in 1:features.image_dim ]
      right_ats   = [ features.right.at[i] for i in 1:features.image_dim ]
      right_diffs = [ features.right.diff[i]  for i in 1:features.image_dim ]
      frequencies = features.direction_frequencies
      simples     = features.simple_direction_frequencies
    
      vcat(lefts, left_ats, left_diffs, rights, right_ats, right_diffs, frequencies, simples)
    end
    
    Similarly we can construct the column names with:
    function features_columns(image :: Array{Float64})
      image_dim   = Base.isqrt(length(image))
    
      lefts       = [ :left_min,  :left_max  ]
      rights      = [ :right_min, :right_max ]
    
      left_ats    = [ Symbol("left_at_",  i) for i in 1:image_dim ]
      left_diffs  = [ Symbol("left_diff_",  i) for i in 1:image_dim ]
      right_ats   = [ Symbol("right_at_", i) for i in 1:image_dim ]
      right_diffs = [ Symbol("right_diff_", i) for i in 1:image_dim ]
      frequencies = [ Symbol("direction_freq_", i)   for i in 1:8 ]
      simples     = [ Symbol("simple_trans_", i)   for i in 1:4^2 ]
    
      vcat(lefts, left_ats, left_diffs, rights, right_ats, right_diffs, frequencies, simples)
    end
    
    The data frame constructed with the get_data function can be easily dumped into a CSV file with the writetable function from the DataFrames package.

    As you can see, gathering and extracting features is a lot of work. All of it was needed because in this article we're focusing on the somewhat "classical" way of doing machine learning. You might have heard about algorithms that mimic how the human brain learns. We're not focusing on them here; we will explore them in some future article.

    We use the mentioned writetable on data frames computed for both training and test datasets to store two files: processed_train.csv and processed_test.csv.

    Choosing the model

    For the task of classifying I decided to use the XGBoost library, which is somewhat of a hot new technology in the world of machine learning. It implements gradient boosting of decision trees and often improves on the results of the so-called Random Forest algorithm. The reader can read more about XGBoost on its website: http://xgboost.readthedocs.io/.

    Both Random Forest and XGBoost revolve around the idea called ensemble learning. In this approach we're not getting just one learning model: the algorithm actually creates many variations of models and uses them to collectively come up with better results. This is as much as can be written as a short description, as this article is already quite lengthy.
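
    The general idea of an ensemble can be shown with a toy example. The Python sketch below uses simple majority voting, which is how a random forest combines its trees; XGBoost instead adds up the outputs of trees built one after another, so this is only an illustration of the "many models, one answer" principle. The three models and the feature values are made up:

```python
from collections import Counter


def majority_vote(predictions):
    """Most common label among the models' predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Three made-up "weak" classifiers looking at two features of a digit
def model_a(features):
    return 1 if features["width_max"] < 5 else 8


def model_b(features):
    return 1 if features["left_min"] > 10 else 7


def model_c(features):
    return 7 if features["width_max"] > 3 else 1


def ensemble_predict(features, models):
    """Ask every model for a label and return the majority answer."""
    return majority_vote([model(features) for model in models])


sample = {"width_max": 4, "left_min": 12}
print(ensemble_predict(sample, [model_a, model_b, model_c]))   # 1
```

    Two of the three models answer 1 for this sample, so the ensemble answers 1 even though the third model disagrees.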

    Training the model

    The training and classification code in R is very simple. We first need to load the libraries that will allow us to load data as well as to build the classification model:
    library(xgboost)
    library(readr)
    
    Loading the data into data frames is equally straight-forward:
    processed_train <- read_csv("processed_train.csv")
    processed_test <- read_csv("processed_test.csv")
    We then move on to preparing the vector of labels for each row as well as the matrix of features:
    labels = processed_train$label
    features = processed_train[, 2:141]
    features = scale(features)
    features = as.matrix(features)
    

    The train-test split

    When working with models, one of the ways of evaluating their performance is to split the data into so-called train and test sets. We train the model on one set and then we predict the values from the test set. We then calculate the accuracy of predicted values as the ratio between the number of correct predictions and the number of all observations.

    Because Kaggle provides the test set without labels, for the sake of evaluating the model's performance without the need to submit the results, we'll split our Kaggle-training set into local train and test ones. We'll use the amazing caret library which provides a wealth of tools for doing machine learning:
    library(caret)
    
    index <- createDataPartition(processed_train$label, p = .8, 
                                 list = FALSE, 
                                 times = 1)
    
    train_labels <- labels[index]
    train_features <- features[index,]
    
    test_labels <- labels[-index]
    test_features <- features[-index,]
    
    The above code splits the set so that the distribution of labels is similar in both parts and the train set is approximately 80% of the whole data set.
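
    The principle behind such a label-aware split is simple: sample each label separately, so that both parts keep the same proportions. The Python sketch below illustrates that principle only and is not how caret implements it. The function name and the sample labels are made up:

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices), sampling each label separately."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


labels = [0] * 50 + [1] * 30 + [2] * 20      # made-up labels
train, test = stratified_split(labels)
print(len(train), len(test))   # 80 20
```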

    Using XGBoost as the classification model

    We can now make our data digestible by the XGBoost library:
    train <- xgb.DMatrix(as.matrix(train_features), label = train_labels)
    test  <- xgb.DMatrix(as.matrix(test_features),  label = test_labels)
    The next step is to make XGBoost learn from our data. The actual parameters and their explanations are beyond the scope of this overview article, but the reader can look them up on the XGBoost pages:
    model <- xgboost(train,
                     max_depth = 16,
                     nrounds = 600,
                     eta = 0.2,
                     objective = "multi:softmax",
                     num_class = 10)
    
    It's critically important to pass the objective as "multi:softmax" and num_class as 10, because we are predicting one of ten possible digits rather than a yes/no answer.
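
    The softmax function behind that objective turns one raw score per class into probabilities, and the class with the highest probability becomes the prediction. A minimal Python sketch of the function itself, with made-up scores (this is not XGBoost's internal code):

```python
import math


def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]   # shift avoids overflow
    total = sum(exps)
    return [value / total for value in exps]


# Made-up raw scores for the 10 digit classes 0..9
scores = [0.1, 0.3, 2.5, 0.2, 0.0, 0.4, 0.1, 1.9, 0.3, 0.2]
probabilities = softmax(scores)
predicted = probabilities.index(max(probabilities))
print(predicted)   # 2
```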

    Simple performance evaluation with confusion matrix

    After waiting a while (a couple of minutes) for the last batch of code to finish computing, we now have the classification model ready to be used. Let's use it to predict the labels from our test set:
    predicted = predict(model, test)
    This returns the vector of predicted values. We'd now like to check how well our model predicts the values. One of the easiest ways is to use the so-called confusion matrix.

    As per Wikipedia, a confusion matrix is simply:

    (...) also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another).

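    The caret library provides a very easy to use function for examining the confusion matrix and statistics derived from it. The reference has to be the labels of the test set, and recent versions of caret expect both arguments to be factors with the same levels:
    confusionMatrix(data = factor(predicted, levels = 0:9),
                    reference = factor(test_labels, levels = 0:9))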
    The function returns an R list that gets pretty printed to the R console. In our case it looks like the following:
    Confusion Matrix and Statistics
    
              Reference
    Prediction   0   1   2   3   4   5   6   7   8   9
             0 819   0   3   3   1   1   2   1  10   5
             1   0 923   0   4   5   1   5   3   4   5
             2   4   2 766  26   2   6   8  12   5   0
             3   2   0  15 799   0  22   2   8   0   8
             4   5   2   1   0 761   1   0  15   4  19
             5   1   3   0  13   2 719   3   0   9   6
             6   5   3   4   1   6   5 790   0  16   2
             7   1   7  12   9   2   3   1 813   4  16
             8   6   2   4   7   8  11   8   5 767  10
             9   5   2   1  13  22   6   1  14  14 746
    
    Overall Statistics
                                             
                   Accuracy : 0.9411         
                     95% CI : (0.9358, 0.946)
        No Information Rate : 0.1124         
        P-Value [Acc > NIR] : < 2.2e-16      
                                             
                      Kappa : 0.9345         
     Mcnemar's Test P-Value : NA             
    
    (...)
    
    Each column in the matrix represents the actual labels, while the rows represent what our algorithm predicted them to be. The accuracy rate is also printed for us, and in this case it equals 0.9411. This means that our code predicted the correct handwritten digit for 94.11% of the observations in the test set.
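
    The headline numbers can be recomputed by hand from the matrix printed above, which is a handy sanity check. The sketch below simply retypes that matrix; the variable names are chosen for this example only.

```r
# The confusion matrix printed above, entered row by row
# (rows = predictions, columns = reference labels).
cm <- matrix(c(
  819,   0,   3,   3,   1,   1,   2,   1,  10,   5,
    0, 923,   0,   4,   5,   1,   5,   3,   4,   5,
    4,   2, 766,  26,   2,   6,   8,  12,   5,   0,
    2,   0,  15, 799,   0,  22,   2,   8,   0,   8,
    5,   2,   1,   0, 761,   1,   0,  15,   4,  19,
    1,   3,   0,  13,   2, 719,   3,   0,   9,   6,
    5,   3,   4,   1,   6,   5, 790,   0,  16,   2,
    1,   7,  12,   9,   2,   3,   1, 813,   4,  16,
    6,   2,   4,   7,   8,  11,   8,   5, 767,  10,
    5,   2,   1,  13,  22,   6,   1,  14,  14, 746
), nrow = 10, byrow = TRUE,
   dimnames = list(Prediction = as.character(0:9),
                   Reference  = as.character(0:9)))

# Accuracy: correct predictions (the diagonal) over all observations.
accuracy <- sum(diag(cm)) / sum(cm)  # 7903 out of 8398
round(accuracy, 4)                    # 0.9411

# No Information Rate: the share of the most frequent actual class,
# i.e. the accuracy of always guessing that one digit.
nir <- max(colSums(cm)) / sum(cm)    # 944 out of 8398
round(nir, 4)                         # 0.1124
```

    Comparing the accuracy with the No Information Rate shows how much better the model does than always guessing the most common digit.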

    Submitting the results

    We got an accuracy rate of 0.9411 on our local test set, and it turned out to be very close to the one we got against the test set coming from Kaggle. After predicting the competition values and submitting them, the accuracy rate computed by Kaggle was 0.94357. That's quite okay given that we're not using any of the newer, fancier techniques here.

    Also, we haven't done any parameter tuning, which could surely improve the overall accuracy. We could also revisit the code from the feature extraction phase. One improvement I can think of would be to first crop and resize back, and only then compute the skeleton, which might preserve more information about the shape. We could also use the confusion matrix to find the digit that was confused the most, and then look at the real images that we failed to recognize. This could lead us to conclusions about improvements to our feature extraction code. There's always a way to extract more information.
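
    The last idea can be sketched in a few lines of base R. The vectors below are hypothetical stand-ins for the predictions and true labels of the test set; the point is only to show how to find the largest off-diagonal cell and the positions of the observations behind it.

```r
# Hypothetical labels; in practice these would be the predictions and the
# true labels of the local test set.
toy_actual    <- c(3, 3, 3, 5, 5, 4, 4, 9, 9, 2)
toy_predicted <- c(3, 2, 2, 5, 3, 4, 9, 9, 9, 2)

toy_cm <- table(Prediction = factor(toy_predicted, levels = 0:9),
                Reference  = factor(toy_actual,    levels = 0:9))

# Ignore the correct predictions on the diagonal, then find the largest cell.
errors <- unclass(toy_cm)
diag(errors) <- 0
worst <- which(errors == max(errors), arr.ind = TRUE)
predicted_digit <- as.integer(rownames(errors)[worst[1, 1]])
actual_digit    <- as.integer(colnames(errors)[worst[1, 2]])

# Positions in the test set of the images behind that cell, ready to be plotted.
to_inspect <- which(toy_actual == actual_digit & toy_predicted == predicted_digit)
to_inspect
```

    In this toy data the most common mistake is a 3 being read as a 2, so those are the images worth looking at first.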

    Nowadays, Kagglers from around the world are successfully using advanced techniques like Convolutional Neural Networks, getting accuracy scores close to 0.999. Those live in a somewhat different branch of the machine learning world, though. With this type of neural network we don't need to do the feature extraction on our own. The algorithm includes a step that automatically gathers the features that it later feeds into the network itself. We will take a look at them in a future article.
