Data Pipeline configuration oddities

I've been working for a little while on a gem to configure and deploy AWS Data Pipelines. At my day job, we use Data Pipeline to schedule various types of repeating jobs. If you're interested, you can see more details on the company blog. To summarize, we wanted a library to configure Data Pipelines as Ruby objects so we could easily compose, reuse, version control, etc.

In developing waterworks, we decided to build on top of the Ruby AWS SDK. We chose Ruby mostly because no single AWS SDK seemed to have a significant advantage for Data Pipeline, and secondarily because most people at the company are comfortable with Ruby. One other advantage of the Ruby SDK over something like the Java one is that the hashes it expects closely mirror the JSON used by the CLI and web console.
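
For context, pushing a pipeline definition through the SDK looks roughly like the sketch below. This is a minimal illustration, assuming the aws-sdk-datapipeline gem; the region, pipeline id, and the single Default object are placeholders, and the object uses the Hash format discussed next.

require "aws-sdk-datapipeline"

client = Aws::DataPipeline::Client.new(region: "us-east-1")

# pipeline_objects takes the Hash format shown below
client.put_pipeline_definition(
  pipeline_id: "df-EXAMPLEPIPELINEID",   # placeholder pipeline id
  pipeline_objects: [
    {
      id: "Default",
      name: "Default",
      fields: [{ key: "scheduleType", string_value: "ondemand" }],
    },
  ]
)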

There is a caveat, however: the SDK's Hash format is subtly different from the JSON format.

JSON (see reference):

{
  "objects": [
    {
      "id": "my_id",
      "name": "my_name",
      "field_key_1": "field_value_1",
      ...
      "field_key_n": "field_value_n"
    },
    ...
  ]
}

Ruby (see reference):

{
  pipeline_objects: [ # required
    {
      id: "my_id", # required
      name: "my_name", # required
      fields: [ # required
        {
          key: "field_key_1", # required
          string_value: "field_value_1",
          # or
          ref_value: "field_value_1",
        },
        ...
      ],
    },
    ...
  ]
}

The main difference is how the fields of each Pipeline Object are encapsulated. In the JSON object, the fields sit at the same level as the id and name, and whether a value is a string or a reference is implicit. In the Ruby Hash format, the fields are wrapped in a fields array whose entries have a key and an explicit string_value or ref_value. This is a slight annoyance, but since each field is bound to a type (see the fields tables in an example object), we can easily build the distinction into our logic.
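
As a rough illustration of that logic (not waterworks' actual implementation), a converter from the JSON-style object to the SDK Hash might look like the following. The REF_FIELDS lookup is hypothetical; in practice it would be derived from the fields tables in the AWS documentation, which say which keys hold references to other objects.

# Hypothetical lookup of which field keys hold references rather than strings.
REF_FIELDS = {
  "schedule" => true,
  "runsOn"   => true,
  "input"    => true,
  "output"   => true,
}.freeze

# Convert one JSON-style pipeline object into the Hash the Ruby SDK expects.
def to_sdk_object(json_object)
  fields = json_object.reject { |k, _| %w[id name].include?(k) }.map do |key, value|
    if REF_FIELDS[key]
      { key: key, ref_value: value }
    else
      { key: key, string_value: value }
    end
  end

  {
    id: json_object.fetch("id"),
    name: json_object.fetch("name"),
    fields: fields,
  }
end

With that sketch, an object like { "id" => "MySchedule", "name" => "MySchedule", "type" => "Schedule", "period" => "1 day" } becomes { id: "MySchedule", name: "MySchedule", fields: [{ key: "type", string_value: "Schedule" }, { key: "period", string_value: "1 day" }] }, which is the shape put_pipeline_definition expects.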