Built by maltpress

Big imports: some learnings

I’m currently working on (and close to finishing) a pretty big site. It’s a rebuild of a site built in a proprietary, Django-based CMS, with the intention to have the new site look almost exactly like the old one, except with the addition of a store.

I’ve had to learn a lot on this project. Bits which I thought would be scary and time-consuming (dealing with someone else’s front-end code) weren’t always that bad, thanks to the original designers and developers being pretty damn awesome. Bits I thought might be easier – importing and handling the data – less so. I thought I’d share some of my learnings, which might help in two situations: firstly, if you’re building a site and thinking about the day it might need to be radically changed; and secondly, if you’re moving a site from another (proprietary) system to WordPress.

Future proofing

The most important things I’ve learned – or at least had reinforced – are about how you gather and store data for content.

As an example, the site build I’m doing sets dimensions for items. In the original CMS – and I’m not judging this, I know how these sorts of decisions are made – dimensions were stored as a single string, free text, with no formatting.

In some ways, this is great: the site’s editors have complete freedom to put the weird dimensions of their oddly shaped product into the site. You could put a note about the weight of the object if you wanted. Brilliant!

On the other hand, it’s entirely possible to then face a situation where the dimensions suddenly have more of a role to play: calculating (or at least informing) shipping. It’s possible, for example, that an object with a volume over a certain amount would cost a lot more to ship. Had the dimensions been constrained at the outset, with separate fields for height, width and depth, and units set, then this wouldn’t be an issue: as it is, we have to do a conversion job on all the dimensions – checking for units (some use “in”, some say 12″, some are in cm, some have no units), converting imperial to metric, and so on. It’s not a massive job, but it’s extra overhead.
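
To give you an idea of what that conversion job looks like, here’s a rough sketch (in Python, not my actual import code) of normalising free-text dimensions – the unit list and conversion rules are my assumptions for illustration, not the real script:

```python
import re

# Rough conversion factors to centimetres; "in" and the double-prime
# mark (″) are treated as inches. These rules are assumptions based on
# the unit formats described above, not the actual import script.
UNIT_TO_CM = {"cm": 1.0, "in": 2.54, '"': 2.54, "″": 2.54, "mm": 0.1}

def to_cm(value: str, default_unit: str = "cm") -> float:
    """Parse one free-text dimension like '12"', '30 cm' or '7' into cm.

    Anything with an unknown unit raises, which in a real import you'd
    log and queue for manual review rather than crash on.
    """
    match = re.match(r'\s*([\d.]+)\s*([a-z″"]*)\s*$', value.strip(), re.IGNORECASE)
    if not match:
        raise ValueError(f"Unparseable dimension: {value!r}")
    number = float(match.group(1))
    unit = match.group(2).lower() or default_unit  # bare numbers assume cm
    return round(number * UNIT_TO_CM[unit], 2)
```

It’s not clever, but run over a few thousand rows it saves the manual pass – and the values it can’t parse are exactly the ones that need a human anyway.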

Let’s consider the example of a small online store. Because it’s small – say, 20 items – the owner doesn’t use unique ID codes for products (SKUs). Not a problem. But over time, slowly but surely, the number of products creeps up… and then you end up in a situation with 2000 products all needing a code adding to them, manually.
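
If you do end up there, the backfill itself can at least be scripted rather than done by hand. A sketch in Python – the field names and SKU format are invented, and a real WordPress version would loop over products and write post meta instead of dicts:

```python
def backfill_skus(products, prefix="SKU"):
    """Give every product without a code a sequential, zero-padded one.

    'products' is a list of dicts with an optional 'sku' key – purely
    illustrative; swap in whatever your store's data structure is.
    """
    taken = {p["sku"] for p in products if p.get("sku")}
    counter = 1
    for product in products:
        if product.get("sku"):
            continue  # leave hand-assigned codes alone
        candidate = f"{prefix}-{counter:04d}"
        while candidate in taken:  # don't collide with existing codes
            counter += 1
            candidate = f"{prefix}-{counter:04d}"
        product["sku"] = candidate
        taken.add(candidate)
        counter += 1
    return products
```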

My advice to clients is always to gather as much data about “stuff” (products, events, whatever) as you can, and break it down as much as is practically possible. Addresses: do them line by line so you know you can format however you want for whatever front-end processing you need. Dimensions: set units, break them into width and height and depth. That kind of thing.

How you actually store them is unimportant – as arrays, serialized or not, as single values, whatever – but by staying true to variable types when you store the data, you give yourself absolute freedom when, in a couple of years’ time, your business has grown hugely and you need to migrate all that content to a new system.
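
As a tiny illustration of what staying true to types buys you, compare a free-text dimensions string with a structured version (Python and JSON here purely as examples – any typed storage gives you the same benefit):

```python
import json

# Two ways of storing the same dimensions. The free-text form is easy
# to enter but has to be re-parsed (fallibly) later; the structured
# form keeps numbers as numbers. Values are illustrative.
free_text = '12" x 30cm x approx 7'

structured = {"width": 30.48, "height": 30.0, "depth": 7.0, "unit": "cm"}

# Serialising and deserialising the structured version preserves types,
# so a future migration can compute with the values directly – no
# regex archaeology required.
restored = json.loads(json.dumps(structured))
volume = restored["width"] * restored["height"] * restored["depth"]
```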

This is not going to be popular with everyone, however. Editors get frustrated about the number of form fields they need to fill in, and more so at the constraints when there’s that one special case. But you don’t want that one special case to cost you hundreds or thousands of pounds in a year’s time: developers need to be aware that there are special cases, and allow for null, or “special” values (i.e. fall back to a value of “see above” or similar if a value is missing or set to special) rather than doing away with the idea of the constraint. You can’t have every field as a free text box.
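
A sketch of what that escape hatch might look like – the field names and the special value here are illustrative, not from any real system:

```python
# Keep numeric fields for the normal case, and let editors set an
# explicit special value instead of abandoning the constraint.
SPECIAL = "see description"

def display_dimensions(width, height, depth, unit="cm"):
    """Render dimensions, falling back to the special value when any
    part is missing or flagged, rather than allowing free text."""
    parts = (width, height, depth)
    if any(p is None or p == SPECIAL for p in parts):
        return SPECIAL
    return f"{width} × {height} × {depth} {unit}"
```

The constraint survives, the weird one-off product still has somewhere to go, and nobody is typing “about a foot, quite heavy” into a width field.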

Moving content

The original system, as I said, was Django-based, using PostgreSQL as its database.

At first I tried to use an automated tool to convert Postgres to MySQL, and learned one important thing about doing this: don’t. It’s going to fail; if it works, it’s going to mess up character encoding, and it’s just a huge pain. Instead, rest assured that connecting to, and querying, a Postgres DB is actually pretty simple: it’s just standard SQL. If you’re importing into WordPress, don’t worry too much about not being able to use $wpdb, because – and I’ll get to this later – you’ll be doing it on a local, isolated box, and as long as you remove all of this stuff before moving to live or staging you’ll be fine.
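
To show what I mean by “it’s just standard SQL”, here’s the shape of it in Python’s DB-API. I’ve used the stdlib sqlite3 driver so the sketch runs anywhere; for the real thing you’d swap the connection line for psycopg2.connect(...) (and %s placeholders instead of ?) and the queries stay the same. Table and column names are invented:

```python
import sqlite3

# For a real import this would be psycopg2.connect("dbname=oldsite ...");
# everything below the connection line is plain, portable SQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stand-in for the old CMS's schema, so the sketch is self-contained:
cur.execute("CREATE TABLE cms_page (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
cur.execute("INSERT INTO cms_page (title, body) VALUES (?, ?)",
            ("About", "<p>Hello</p>"))
conn.commit()

# The kind of query the import script runs, one table at a time:
cur.execute("SELECT id, title, body FROM cms_page ORDER BY id")
rows = cur.fetchall()
```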

Postgres is not something I’d ever used before, and it turns out it brings several challenges. Firstly, if you’re used to WAMP, XAMPP or another Windows-based Apache/MySQL/PHP stack, Postgres is almost certainly not installed by default. There are lots of tutorials on the web for enabling it, which takes a little while but isn’t too bad.

Secondly, phpPgAdmin sucks if you’re used to phpMyAdmin. The export options are limited, and when an import fails because memory is too low (or the dump is too big), the errors are far from helpful. I spent a long time working out the best way to import the data onto my local dev system, trying pretty much every iteration of a data dump from the live server, before realising that WAMP can use command line tools (yes, I’m an idiot). Run cmd, cd to the bin directory of your Postgres installation, and you can run psql fine – for example, psql -U postgres -d yourdb -f dump.sql to replay a dump.

This is needed because all of these web-based interfaces are terrible at handling big exports and imports. phpPgAdmin would start working, chug away for an hour or so, and then something would interrupt the process and it would fail part way through, either silently or with terrible error messages, and I’d have to start again. At one point I was copying and pasting individual tables from the dump to try to import them one by one: with some tables having 20,000 rows, many of them containing long strings of HTML, my PC’s copy buffer couldn’t cope, let alone the importer. Command line tools? Imported perfectly first time. The browser-based tools have a major failing: they rely on the browser being able to handle the data being output by the importer, and for a big dump that alone can chomp through an insane amount of memory – on top of what’s being used to actually handle the data. Worse still, the data going in is as huge as the data coming out – it’s not going to end well.

Next up was getting all the data in to WordPress. I may share some of my code for this as a Gist or something, but here are my top tips:

  • Images are sods, and WordPress is not built for batch processing. If you’re adding images manually one by one, the time taken to create the different image sizes isn’t noticeable, but import 2000 raw images (the biggest at 19,000px wide!) and it takes forever. Use Photoshop to batch resize your images to a decent physical size and quality before doing anything else with them. It will still take ages, but much less time than if you don’t do this.
  • Accents and other characters in image file names are pretty bad as well, in part because they’re used to populate the attachment title in WordPress. Lots of the images I was importing had them. I ended up using preg_replace to strip them out, then renaming the files as I imported them (if you’re using PHP’s rename() function on Windows, escape backslashes so your image path looks like C:\\wamp\\www etc).
    • WordPress thumbnail resizing plugins generally use AJAX to batch process images, passing filenames through JSON encoding, which just nopes the hell out when accents pop up in file names. You can hack them to get around this, or – more efficiently – do it programmatically when you import.
  • Do not – do not – do any of this importing on a live, or even staging server. My import script is pretty inefficiently written – lots of WP_Query statements with posts_per_page set to -1 – but even if it was more efficient it would eat up memory and processor like nobody’s business. You might be able to do it on some kind of cloud processing account, I suppose, but better to do it on a local machine. There is nothing more frustrating than an import running for two hours then bumming out: do this on any kind of shared hosting and you’ll be timed out within minutes, or you’ll be taking down any number of other sites. Do it on a dedicated server and it’ll still take forever, and then you still have to move files around afterwards.
    • Run the process on – if you have it – a dedicated Linux box. In php.ini and other config, set everything to maximum, and don’t let scripts time out even if you’re batching efficiently. Especially if you’re batching, in fact: you need to know exactly where a script stopped in order to avoid duplicates.
    • If you have to run on a Windows box, turn off any power saving options you have. A PHP process running won’t stop your PC going to sleep when you take the dogs out for a quick walk. Yes, I’m that stupid.
    • Say goodbye to productivity while these processes are running. Run them overnight if you can.
  • Experiment with small batches first. See what processing different fields might need just by echoing out values. Try to do this with a random selection of your content.
  • Back everything up regularly – between batches might be too much, but if (like me) you’re importing into several post types, or processing taxonomies after importing the content, back up between these processes.
  • If you’re clearing out data between tests, clear out the post meta table too. Just doing a SQL query to delete things in the wp_posts table will leave all that stuff behind – use a query like the below to clean everything out.
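
For example, something along these lines clears out both tables (assuming the default wp_ table prefix and, purely for illustration, a post type of product – adjust to whatever you imported):

```sql
-- Remove the meta for the imported posts first, then the posts
-- themselves. MySQL's multi-table DELETE handles the join.
DELETE pm FROM wp_postmeta pm
INNER JOIN wp_posts p ON p.ID = pm.post_id
WHERE p.post_type = 'product';

DELETE FROM wp_posts WHERE post_type = 'product';
```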

Right – that’s not even the half of what I’ve learned on this project, but hopefully it’s a start if you’re looking at importing a big site from a proprietary system into WordPress!
