xml4h: XML for Humans in Python – 0.1.0 alpha

It’s been quiet here for a while; it’s about time I announced a new project.

Last week I released the first alpha version of xml4h, a library I have created to make it easier to do non-trivial XML processing in Python.

You can find it on:

I made xml4h because I enjoy working with Python — it’s been my language of choice for a while now — but I always found it painful to work with XML using the tools available there. Of course, dealing with XML tends to be painful in its own right, but the existing Python tools only made it harder. Although there was an excellent base of fast, efficient and powerful XML processing tools, interacting with them always seemed to be harder than it should be.

For years I kept an eye out for a better way, until I finally got off my arse and built it myself.

Right now xml4h is young, a little rough, and likely to undergo changes: in other words it’s an alpha. My hope is it will grow quickly into a powerful and intuitive tool that makes it as easy to do XML in Python as it now is to do HTTP or to do SQL.

If you’re into Python and have felt XML-induced pain, it would be great if you could try it out and provide feedback, bug reports, and code to help move the project in the right direction.

Let’s make this thing awesome.

Posted in Coding, Python

JetS3t 0.9.0 Released

The newest version of JetS3t has been released and is now available for download: JetS3t 0.9.0

This release has been a long time coming; sorry about that. I had intended to get a release out late last year but personal factors left me short of time: relocating back to Australia from the U.S. and helping out with our newly-arrived baby boy have kept me pretty busy.

Still, the 0.9.0 version is now official. Here are some of the major new features and improvements.

HttpComponents Upgrade

A major change in the new version is JetS3t’s use of the newer 4.x generation of the key HttpClient library (now more accurately called HttpComponents). The older HttpClient 3.x had been end-of-lifed and while it still worked fine, relying on the obsolete version was not a good long-term option.

I need to thank two contributors in particular for doing a lot of this work. Cheers to Gilles Gaillard and David Kocher for their invaluable work; the upgrade wouldn’t have happened without them.

Note that since this upgrade involved updates to the core JetS3t HTTP code layer, there is a risk of subtle bugs in the HTTP handling with this release. I think the risk is small and some people have been using the pre-release 0.9.0 code successfully for a while now, but when you update to 0.9.0 it’s worth doing a little more testing than you might normally.

Here are some of the new service-specific features.

Amazon S3

  • Support for multiple object deletes in a single request
  • Explicit support for new S3 locations: Oregon (us-west-2), South America (sa-east-1), GovCloud US West (s3-us-gov-west-1), GovCloud US West FIPS 140-2 (s3-fips-us-gov-west-1)
  • Support for server-side encryption, with per-object setting of algorithm and default algorithm configuration with the new s3service.server-side-encryption property
  • Support for Multipart Upload Part – Copy operation, to add data from existing S3 objects to multipart uploads.
  • Support for signing S3 requests with response-altering request parameters like response-content-type and response-content-disposition

Google Storage

  • Support for OAuth2 authentication mechanism, with automatic access token refresh.

Conclusion

Please grab the latest version, try it out and let me know how it goes.

Visit the JetS3t web site to download the latest packaged release, view the code samples or read the API Javadoc.

For a more complete list of changes see the Release History or Release Notes documents.

Or go to the BitBucket developer site to access the latest code, report issues in the bug tracker, and contribute to the project.

P.S. The latest release is on its way to the official Maven2 repository and should be available within a day or so.

Posted in AWS, Java, JetS3t

Python code to convert UTF-8 to Latin-1

After dealing with UTF-8 to latin1 encoding issues repeatedly over the years I finally put the time into crafting a somewhat complete conversion script in Python that handles things like “smart” quotes and other commonly-used symbols.

This probably isn’t the best place to put this code but hopefully it will help someone, most likely me at some time in the future. Unless perhaps sanity prevails and everyone starts using UTF-8 everywhere…

import re

def encode_utf8_to_iso88591(utf8_text):
    '''
    Encode and return the given UTF-8 text as ISO-8859-1 (latin1) with
    unsupported characters replaced by '?', except for common special
    characters like smart quotes and symbols that we handle as well as we can.
    For example, the copyright symbol => '(c)' etc.

    If the given value is not a string it is returned unchanged.

    References:
    en.wikipedia.org/wiki/Quotation_mark_glyphs#Quotation_marks_in_Unicode
    en.wikipedia.org/wiki/Copyright_symbol
    en.wikipedia.org/wiki/Registered_trademark_symbol
    en.wikipedia.org/wiki/Sound_recording_copyright_symbol
    en.wikipedia.org/wiki/Service_mark_symbol
    en.wikipedia.org/wiki/Trademark_symbol
    '''

    if not isinstance(utf8_text, basestring):
        return utf8_text
    # Replace "smart" and other single-quote like things
    utf8_text = re.sub(
        u'[\u02bc\u2018\u2019\u201a\u201b\u2039\u203a\u300c\u300d]',
        "'", utf8_text)
    # Replace "smart" and other double-quote like things
    utf8_text = re.sub(
        u'[\u00ab\u00bb\u201c\u201d\u201e\u201f\u300e\u300f]',
        '"', utf8_text)
    # Replace copyright symbol
    utf8_text = re.sub(u'[\u00a9\u24b8\u24d2]', '(c)', utf8_text)
    # Replace registered trademark symbol
    utf8_text = re.sub(u'[\u00ae\u24c7]', '(r)', utf8_text)
    # Replace sound recording copyright symbol
    utf8_text = re.sub(u'[\u2117\u24c5\u24df]', '(p)', utf8_text)
    # Replace service mark symbol
    utf8_text = re.sub(u'[\u2120]', '(sm)', utf8_text)
    # Replace trademark symbol
    utf8_text = re.sub(u'[\u2122]', '(tm)', utf8_text)
    # Replace/clobber any remaining characters that aren't in ISO-8859-1 with '?'
    return utf8_text.encode('ISO-8859-1', 'replace')

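Be sure to feed this function text that has already been decoded from UTF-8 into a unicode object. This is Python 2 code (note the use of basestring), and a raw byte string containing non-ASCII bytes will typically fail with a UnicodeDecodeError at the final encode step.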
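For anyone on Python 3, where basestring no longer exists, here is a rough sketch of the same idea. The name to_latin1 and the _REPLACEMENTS table are my own for this illustration, and the choice to decode incoming bytes as UTF-8 is an assumption rather than something the original function does.

```python
import re

# Same character groups as the function above: (pattern, ASCII stand-in).
_REPLACEMENTS = [
    (re.compile('[\u02bc\u2018\u2019\u201a\u201b\u2039\u203a\u300c\u300d]'), "'"),
    (re.compile('[\u00ab\u00bb\u201c\u201d\u201e\u201f\u300e\u300f]'), '"'),
    (re.compile('[\u00a9\u24b8\u24d2]'), '(c)'),   # copyright
    (re.compile('[\u00ae\u24c7]'), '(r)'),         # registered trademark
    (re.compile('[\u2117\u24c5\u24df]'), '(p)'),   # sound recording copyright
    (re.compile('\u2120'), '(sm)'),                # service mark
    (re.compile('\u2122'), '(tm)'),                # trademark
]


def to_latin1(text):
    """Return the given text encoded as ISO-8859-1 bytes.

    Accepts str, or bytes that are assumed to hold UTF-8 (an assumption
    of this sketch). Any other value is returned unchanged.
    """
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    if not isinstance(text, str):
        return text
    for pattern, replacement in _REPLACEMENTS:
        text = pattern.sub(replacement, text)
    # Anything still outside ISO-8859-1 becomes '?'
    return text.encode('iso-8859-1', 'replace')
```

Calling to_latin1('\u201cHi\u201d \u2122') should give back plain double quotes and '(tm)', while a character such as the euro sign, which has no Latin-1 equivalent, ends up as '?'.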

Posted in Coding, Python

Vimdiff for three-way merges in Mercurial

I’ve been using vim as my sole code editor for a couple of years now at work. I find that the more I use it and the more I learn (there will always be more to learn about vim) the happier I am with this fantastic tool.

After working through some hairy code merges recently I realised I needed a better approach than relying on inline diffs, where merge conflicts are represented in a single file like so:

<<<<<<< incoming
Someone else's code
=======
My code
>>>>>>> outgoing

Inline diffs are great for resolving relatively simple conflicts but can quickly become confusing if conflicts span many lines or there are significant differences between files.
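To make the marker format concrete, here is a small Python sketch that pulls the two sides out of each conflict region in a file's lines. The function name find_conflicts is made up for this example, and it only understands the simple three-marker layout shown above; it is not part of Mercurial or vim.

```python
def find_conflicts(lines):
    """Yield (first_side, second_side) line lists for each conflict region.

    Assumes the simple layout: '<<<<<<<' opens a region, '=======' separates
    the two sides and '>>>>>>>' closes it.
    """
    first, second, state = [], [], None
    for line in lines:
        if line.startswith('<<<<<<<'):
            # Start a fresh region
            first, second, state = [], [], 'first'
        elif line.startswith('=======') and state == 'first':
            state = 'second'
        elif line.startswith('>>>>>>>') and state == 'second':
            yield first, second
            state = None
        elif state == 'first':
            first.append(line)
        elif state == 'second':
            second.append(line)
        # Lines outside any conflict region are ignored
```

Even with a helper like this, each side is still only visible in isolation, which is part of why a side-by-side view helps for larger conflicts.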

So I configured Mercurial to open vimdiff upon merge conflicts, but the default three-paned vertical-split view wasn’t quite what I wanted. It didn’t include the base version of the conflicted file, and the default window layout made it hard to see exactly what was going on.

A little research turned up a blog post showing how to better configure vimdiff when using git. We use Mercurial at work so I adapted this hint to work with Mercurial’s MergeProgram configuration:

# Three-way merge with vimdiff (shows result in bottom window)
# Based on http://mercurial.selenic.com/wiki/MergingWithVim
# and http://www.toofishes.net/blog/three-way-merging-git-using-vim/

[ui]
merge = vimdiff

[merge-tools]
vimdiff.executable = vim
vimdiff.args = -d -c "wincmd J" "$output" "$local" "$other" "$base"

This will show the merged file in a large window at the bottom with the three pre-merge files of interest — local changes, incoming/other changes, the base file — in a three-pane vertical split at the top. With this set-up and some practice using the vimdiff commands, complex conflicting merges are much easier to deal with.

If you use vim or want to, be sure to check out the excellent Vim Casts video podcasts to learn (or re-learn) how to get the most out of it. Some recent episodes discuss vimdiff in the context of a git workflow but are still full of useful pointers for those not using git.

Posted in Coding, Tips

JetS3t 0.8.1 in the wild

The newest version of JetS3t has been released and is now roaming free. Meet 0.8.1.

This release has been a long time coming, mainly due to my reluctance to finish the documentation. But it’s finally here and comes with some great new features.

Goodies

  • Support for Amazon S3’s multipart uploads, both at the API level and with a MultipartUtils tool that makes it very easy to upload files in multiple parts.
  • Support for Amazon S3’s website configuration, which makes an S3 bucket act more like a traditional website. I’m using this new feature to great effect on JetS3t’s new home domain www.jets3t.org.
    The new domain is served from S3 like the old jets3t.s3.amazonaws.com version, but it works much better if you visit places like the root directory (versus this) or a missing page (versus that).
    Now the new domain just needs some Google-juice, so please update your links to point to www.jets3t.org.
  • Massive improvements to the Synchronize application to reduce its memory footprint when syncing large directory hierarchies and improve its efficiency when comparing local and remote files.
    Synchronize also now supports multipart uploads, so you can back up files larger than 5GB and improve reliability by uploading large files in smaller pieces (see the upload.max-part-size configuration setting in synchronize.properties).
  • Support for custom (non-S3) distribution origins in the CloudFront API. Note that these service changes are not backwards-compatible.
  • A number of bug fixes and other tweaks
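The idea behind uploading a large file in smaller pieces, as mentioned in the Synchronize item above, comes down to simple arithmetic. Here is a sketch in Python (not JetS3t's Java code, and not how MultipartUtils is actually implemented); the function name part_ranges and its parameters are made up for illustration, with max_part_size playing the role of the upload.max-part-size setting.

```python
def part_ranges(file_size, max_part_size):
    """Return (part_number, offset, length) tuples covering file_size bytes."""
    if file_size < 0 or max_part_size <= 0:
        raise ValueError('file_size must be >= 0 and max_part_size > 0')
    parts = []
    offset = 0
    part_number = 1  # S3 numbers multipart upload parts starting from 1
    while offset < file_size:
        # The last part may be shorter than max_part_size
        length = min(max_part_size, file_size - offset)
        parts.append((part_number, offset, length))
        offset += length
        part_number += 1
    return parts
```

Each tuple describes one piece that can be uploaded (and retried) on its own, which is where the reliability benefit comes from.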

See the full list of changes in the Release History or Release Notes documents.

Yes Please

Visit the JetS3t web site to download the latest packaged release, view the latest code samples or read the API Javadoc.

Or go to the BitBucket developer site to access the latest code, report issues in the bug tracker, and contribute to the project.

P.S. The latest release is on its way to the official Maven2 repository and should be available within a day or so.

Posted in AWS, Cloud Computing, Java, JetS3t

JetS3t support for S3 Website Hosting

I have just released code for JetS3t that adds API-level support for Amazon S3’s new Website Hosting feature.

With a Website Hosting configuration applied to an S3 bucket, the bucket can serve static content but will also act in a somewhat dynamic way to serve index and error documents if someone visits URL paths that don’t match a real file.

This makes it much more feasible to serve static website content from S3 without having to worry about users receiving strange XML error messages if they venture off the beaten track or try to access partial URL paths. In particular, it allows you to serve an index.html file from the root of a bucket, just like a real web server.
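As a rough mental model of the index and error document behaviour described above, here is a deliberately simplified Python sketch. It is not the JetS3t API and not Amazon's actual algorithm; the function name resolve and the default document names index.html and error.html are assumptions for the example.

```python
def resolve(keys, path, index_doc='index.html', error_doc='error.html'):
    """Return (status, key) that a website-enabled bucket might serve.

    keys is the set of object keys in the bucket. This simplified model
    ignores redirects and other details of the real service.
    """
    key = path.lstrip('/')
    # Requests for the root or a "directory" path fall back to the index document
    if key == '' or key.endswith('/'):
        key += index_doc
    if key in keys:
        return 200, key
    # Anything else gets the configured error document instead of raw XML
    return 404, error_doc
```

With a bucket holding index.html, docs/index.html and error.html, a request for '/' would be answered with index.html, '/docs/' with docs/index.html, and '/missing' with the error document.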

To find out more read these:

To try out the feature in JetS3t, grab the latest development code and read the example test code to see how it works.

Posted in JetS3t