One of our clients wanted to regularly update a list of dealers along with the parts carried at that dealer, and show them on a map. As I dug into the challenge, I was a bit surprised to find very little information on the web about how to hook up a migration that would essentially import a join table. So I had to create it myself!
The source data is essentially three tables: dealers, parts, and dealers-to-parts. The parts don't change all that much, and the site was already populated with parts, copy, and photos, so we weren't going to import those. The dealers change regularly, so we need a way to refresh the data: hiding old dealers, importing new ones, and updating existing ones. And, as with the parts, the marketing team may add images, text, and more to the dealers once the data is in the site.
So the main migration challenge is getting the dealers-to-parts data into Drupal.
You would think this would be a common thing to migrate. We've brought over this kind of data many times in integrations, but this is the first time since we started using Migrate as our standard tool for this kind of job. But no... if there's an established technique for migrating join tables, I was unable to find it.
One small complication here is we don't have direct access to the source data -- we're getting it exported to CSV and then imported to Drupal on demand, with possible future automation.
A bigger complication is that there is not a single primary key on the source data -- the primary key for the dealer data is spread across two fields: CardCode and Address2. Many of their dealers are chains with multiple locations -- these share a CardCode but have a different Address2 field. (Actual address data is stored separately, in addr1 and addr2 fields -- Address2 usually contains a full address, but sometimes a dealer-specific location code).
We started with the Feeds module for this task, but when it became clear there were multiple, linked sources of data, we decided to switch over to Migrate. We've built several large migrations, and I'm very impressed with the mapping and the ability to import and roll back that you get basically for free with Migrate.
We've also done direct migrations, most often by creating code that iterates through a CSV, JSON, or XML object and creates appropriate content in Drupal. But again, the built-in object mapping in Migrate makes it a clear winner for many kinds of long-running data migrations.
With built-in Drush integration, object-oriented code, a UI, logging, and more, it's become our go-to solution any time we need to import data more complex than we can readily do with Feeds.
Turns out, there were two crucial things we had to solve to get this to work:
- Get sourceMigration mapping to work with multiple keys
- Make a migration with an entityreference target
Source Migrations with Multiple Keys
The dealer migration is very straightforward in this model. Aside from field mapping, about the only thing necessary is to make sure the MigrateMaps provides the appropriate keys:
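Here's a sketch of what that map might look like, assuming a CSV source; the field lengths and machine names here are illustrative, not taken from the actual project:

```php
// Dealer migration map: the primary key spans CardCode and Address2,
// so both columns go into the source key schema.
$this->map = new MigrateSQLMap($this->machineName,
  array(
    'CardCode' => array(
      'type' => 'varchar',
      'length' => 32,
      'not null' => TRUE,
    ),
    'Address2' => array(
      'type' => 'varchar',
      'length' => 255,
      'not null' => TRUE,
    ),
  ),
  MigrateDestinationNode::getKeySchema()
);
```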
Now the tricky part: For the RetailersParts import, we're going to have 3 source key fields and 2 destination key fields.
The RetailersParts import has 3 columns: CardCode, Address2, and SKU. The real trick here, which we'll discuss next, is that our destination is not a node, but an entityreference field. So our destination class is going to be different -- the map we hook up for the RetailersParts import looks like this:
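A sketch of that map, with three source key fields and the two-field key schema from the custom destination class (the class name and field lengths are illustrative):

```php
// RetailersParts map: three source keys identify a row, and the
// destination key comes from our custom entityreference destination.
$this->map = new MigrateSQLMap($this->machineName,
  array(
    'CardCode' => array('type' => 'varchar', 'length' => 32, 'not null' => TRUE),
    'Address2' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE),
    'SKU' => array('type' => 'varchar', 'length' => 64, 'not null' => TRUE),
  ),
  MigrateDestinationEntityReference::getKeySchema()
);
```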
Now, if Migrate properly handled multiple source keys in mapping, we should be able to add a mapping for the nid for the import, and apply the sourceMigration we used for the Retailers, something like this:
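Something like this (the migration name is an example), which is what you might naively expect to work:

```php
// Naive attempt: hand addFieldMapping() both source key fields.
// This does NOT work -- the second argument must be a single
// source field name, a string, not an array.
$this->addFieldMapping('nid', array('CardCode', 'Address2'))
  ->sourceMigration('Retailers');
```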
... but no, that doesn't work, because the second argument to addFieldMapping is a string. How do you get the mapped nid for the other import?
The approach I took, while far from ideal, seems like a fairly good solution. Use prepareRow to create a new source field that is a string of the keys combined with some character not in the source data, and then use the "separator" method to extract back into an array to use for the source mapping. Like this:
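A sketch of that approach, assuming a pipe never appears in the source data (field and migration names are examples):

```php
// In the migration class: build a combined key in prepareRow(), then
// let separator() explode it back into an array for the lookup.
public function prepareRow($row) {
  if (parent::prepareRow($row) === FALSE) {
    return FALSE;
  }
  // Combine the two key fields with a character not in the data.
  $row->dealer_key = $row->CardCode . '|' . $row->Address2;
}

// ... and in the constructor:
$this->addFieldMapping('nid', 'dealer_key')
  ->separator('|')
  ->sourceMigration('Retailers');
```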
Ok. That works very nicely... except it doesn't work.
The mappings for sourceMigration are applied in the handleSourceMigration method, at line 1318. This function takes a $source_keys parameter that it uses to look up the destination keys. $source_keys is described in the code as:
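Paraphrasing from memory (check your copy of migration.inc for the exact wording), the docblock reads roughly:

```php
/**
 * @param mixed $source_keys
 *   Key(s) to be looked up against the source migration(s). This may
 *   be a simple value (a single single-field key), an array of values
 *   (multiple single-field keys to each be looked up), or an array of
 *   arrays (multiple multi-field keys to each be looked up).
 */
```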
So we have managed to send an array of values to this function, but it looks up an individual match for each value, not the mapping for a single multiple-field source key. At this point we're stuck, with no way to get applyMappings() to send handleSourceMigration() a multiple-field $source_keys.
Now, the cleanest approach here would be to override either applyMappings() or handleSourceMigration() in a Migration subclass. applyMappings() is long and convoluted, and wrapping it wouldn't help -- I'd end up duplicating most of it -- so I didn't want to do that. handleSourceMigration() would be better to override: I could nest the array and then call the parent method. But I didn't really want to do that either.
By digging into the migrate code, I found that fieldMappings have a callback() method. Perfect! Let's add a callback to this mapping to format the $source_key in the way that we want.
There's one problem with this approach, described in the issue sourceMigration doesn't work if source migration map uses more than 1 destination key: callbacks are applied in Migration::applyMappings() right after the sourceMigration mappings are applied -- so our callback runs too late to reshape the key.
I've scoured the web for more info about callbacks, migrations, and when these are used, and I did not find anything that specified this sequence, or much about these callbacks at all. So I decided to patch the migrate module and reverse the order in which these get applied. You can do this in sites/all/modules/migrate/includes/migration.inc, around line 1220 in the current dev release of migrate (I'm using 7.x-2.6-rc1+65-dev from 2014-09-18, according to drush pmi).
Here's a patch:
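I won't try to reproduce the exact diff from memory, but the change amounts to moving the callback block above the sourceMigration block inside applyMappings(). In simplified form (method names from the 2.x branch; the real code has more going on around these lines):

```php
// Simplified excerpt of Migration::applyMappings() AFTER the change:
// the callback loop now runs before the source-migration lookup, so a
// callback can reshape the source key first.
foreach ($mapping->getCallbacks() as $callback) {
  $destination_values = call_user_func($callback, $destination_values);
}
if ($source_migrations = $mapping->getSourceMigration()) {
  $destination_values = $this->handleSourceMigration(
    $source_migrations, $destination_values, $mapping->getDefaultValue(), $this);
}
```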
Ok, so now that we have that done, we can add a callback to wrap the source key in another array so it gets handled as a multi-key map value instead of multiple values. While we're at it, we want to look up the product node that we're going to hook up, and create it if it doesn't already exist. And let's leverage &drupal_static so we don't go looking up products over and over again. Here's what the code ends up looking like:
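A sketch of those pieces; the field name (field_sku), bundle (product), and migration name are examples, not necessarily what shipped:

```php
// Nest the exploded key array one level deeper, so
// handleSourceMigration() treats it as ONE multi-field key rather
// than several single-field keys.
public function wrapSourceKey($value) {
  return array($value);
}

// Map a SKU to a product nid, creating a stub node if none exists,
// with a drupal_static() cache so repeated SKUs cost one lookup.
public function lookupProduct($sku) {
  $products = &drupal_static(__FUNCTION__, array());
  if (!isset($products[$sku])) {
    $query = new EntityFieldQuery();
    $result = $query->entityCondition('entity_type', 'node')
      ->entityCondition('bundle', 'product')
      ->fieldCondition('field_sku', 'value', $sku)
      ->execute();
    if (!empty($result['node'])) {
      $products[$sku] = key($result['node']);
    }
    else {
      // No match: create a stub product to reference.
      module_load_include('inc', 'node', 'node.pages');
      $node = new stdClass();
      $node->type = 'product';
      $node->title = $sku;
      node_object_prepare($node);
      node_save($node);
      $products[$sku] = $node->nid;
    }
  }
  return $products[$sku];
}

// In the constructor:
$this->addFieldMapping('nid', 'dealer_key')
  ->separator('|')
  ->callback(array($this, 'wrapSourceKey'))
  ->sourceMigration('Retailers');
$this->addFieldMapping('product_nid', 'SKU')
  ->callback(array($this, 'lookupProduct'));
```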
Ok! We've solved the mapping problem. Now on to the actual entityreference mapping.
Create an EntityReference Destination
It really surprised me that there isn't already a migration destination to handle entityreferences. I mean, there is one, but it's handled as part of the migration for the actual node.
There are several reasons I wanted to handle the references separately from the node content:
- We wanted to preserve some Drupal-side changes to the retailer nodes, so rolling back the dealers is something we want to do rarely, if at all
- Product selection at dealers is likely to change far more often than the dealers, so it makes sense to split that into its own migration
- While scale wasn't a particular issue in this case, since we launched with fewer than 20 products, stacking a bunch of references inside a single column with a separator (the supported option) seems brittle and unlikely to scale well
- We already had the source data split out into two separate files
- It just seems like a better approach!
This actually turned out to be far more straightforward than the multiple-key mapping! Basically I just had to create a custom MigrateDestination subclass. Here are the key methods that needed to be implemented or overridden, annotated.
First up: the key schema. This defines the fields to add to the mapping table for the destination -- our target is not a single node, but a node and a product:
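Something like this, with illustrative column names:

```php
// The destination key is the PAIR of nodes being connected:
// the dealer node and the referenced product node.
static public function getKeySchema() {
  return array(
    'nid' => array(
      'type' => 'int',
      'unsigned' => TRUE,
      'description' => 'ID of the dealer node',
    ),
    'product_nid' => array(
      'type' => 'int',
      'unsigned' => TRUE,
      'description' => 'ID of the referenced product node',
    ),
  );
}
```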
... create the serialization function:
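In Migrate destinations this is the __toString() method, used for display in the UI; a minimal version might look like:

```php
// Human-readable description of this destination, shown in the
// Migrate UI. $fieldName and $bundle are set in the constructor.
public function __toString() {
  return t('Entityreference field %field on %bundle nodes',
    array('%field' => $this->fieldName, '%bundle' => $this->bundle));
}
```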
Next up, define the fields this class has, so migration can enforce mapping them:
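A minimal sketch:

```php
// Declare the destination fields available for mapping, matching
// the keys we defined in getKeySchema().
public function fields() {
  return array(
    'nid' => t('The dealer node the reference is attached to'),
    'product_nid' => t('The product node to reference'),
  );
}
```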
The import method does the heavy lifting. Much of this comes from MigrateDestinationEntity -- really all we're doing is finding the source node, setting the entityreference field, and saving it. The other crucial thing here is to return the destination key, matching what we defined in the getKeySchema() method. The rest is error handling and calling methods as defined by the base class:
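A simplified sketch of that import method, assuming a single-language site (LANGUAGE_NONE) and omitting the base-class hook invocations:

```php
// Load the dealer node, append the product reference if it's not
// already there, and return the destination keys in the order
// getKeySchema() defines them.
public function import(stdClass $object, stdClass $row) {
  if (empty($object->nid)) {
    throw new MigrateException(t('No destination dealer node for this row.'));
  }
  $node = node_load($object->nid);
  if (!$node) {
    throw new MigrateException(t('Could not load node @nid.',
      array('@nid' => $object->nid)));
  }
  $existing = array();
  if (!empty($node->{$this->fieldName}[LANGUAGE_NONE])) {
    foreach ($node->{$this->fieldName}[LANGUAGE_NONE] as $item) {
      $existing[] = $item['target_id'];
    }
  }
  if (!in_array($object->product_nid, $existing)) {
    $node->{$this->fieldName}[LANGUAGE_NONE][] = array(
      'target_id' => $object->product_nid,
    );
    node_save($node);
  }
  // Destination keys, matching getKeySchema().
  return array($node->nid, $object->product_nid);
}
```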
... and finally, provide a rollback method that unsets the single value getting rolled back.
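Roughly like this (again assuming LANGUAGE_NONE; the exact signature may differ slightly between migrate releases):

```php
// Remove just this one product reference from the dealer node,
// leaving any other references -- and the node itself -- intact.
public function rollback(array $key) {
  list($nid, $product_nid) = array_values($key);
  $node = node_load($nid);
  if ($node && !empty($node->{$this->fieldName}[LANGUAGE_NONE])) {
    foreach ($node->{$this->fieldName}[LANGUAGE_NONE] as $delta => $item) {
      if ($item['target_id'] == $product_nid) {
        unset($node->{$this->fieldName}[LANGUAGE_NONE][$delta]);
      }
    }
    // Reindex deltas and save.
    $node->{$this->fieldName}[LANGUAGE_NONE] =
      array_values($node->{$this->fieldName}[LANGUAGE_NONE]);
    node_save($node);
  }
}
```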
That's it, aside from the constructor, which defines the field name, entity type, and bundle.
As I said, I'm a bit surprised there's not something like this in Migrate already, or other people having this need. But it's a pretty straightforward bit of code to implement.
We now have a robust data integration with two separate migrations for dealers and parts carried by each dealer. The RetailersParts migration can be completely rolled back and re-run on a schedule, keeping the site up-to-date with each retailer's current assortment. For updating the overall list of retailers, we wrap our migration with a couple other housekeeping tasks -- first unpublish all retailers, then run the import, and have the import republish each row as it's processed.
Yet another reason to keep the retailer import separate is geocoding -- on the initial import, the migration tools run far faster than Google's geocoding API allows. So after a large migration, we need to update the geocoded data -- a topic for another post. In short, by preserving the existing retailers, we don't have to repeat the geocoding.
Product assortments can then get updated by rolling back the previous migration and running the migration on the new file.