Ruby: Removing duplicates in an Array of Hashes

Sat 22 June 2013 by Pol O'Riain

Removing duplicates from data is a problem I run into quite often. If the data is going into a database then handling duplicate entries is fairly trivial, but say that's not an option.

One easy way to handle this is with Ruby's built-in 'uniq' method for Arrays.

Take for example:

1.9.3p448 :086 > a = [{:id => "a"},{:id => "b"},{:id => "a"},{:id => "c"}]
 => [
     {:id=>"a"}, {:id=>"b"}, {:id=>"a"}, {:id=>"c"}
    ]

1.9.3p448 :087 > b = a.uniq
 => [
      {:id=>"a"}, {:id=>"b"}, {:id=>"c"}
    ]

Taking a more complicated example:

1.9.3p448 :180 > g
 => [
     {:id=>"1", :name=>"a"}, {:id=>"2", :name=>"b"}, 
     {:id=>"1", :name=>"a"}, {:id=>3, :name=>"a"}, 
     {:id=>1, :name=>"b"}
    ]
1.9.3p448 :184 > g.uniq
 => [
     {:id=>"1", :name=>"a"}, {:id=>"2", :name=>"b"}, 
     {:id=>3, :name=>"a"}, {:id=>1, :name=>"b"}
    ]
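The result above follows from how uniq compares elements. As far as I know, Array#uniq relies on each element's hash and eql? methods, and two hashes are only eql? when they hold the same keys with equal values. A minimal sketch, using a cut-down version of the data above (the variable names are made up for illustration):

```ruby
# Cut-down copy of the example data: "1" is a String, 1 is an Integer.
records = [
  { :id => "1", :name => "a" },
  { :id => "1", :name => "a" },  # exact duplicate of the first hash
  { :id => 1,   :name => "a" }   # same name, but :id is an Integer
]

# Hashes with identical keys and values are eql?, so uniq sees a duplicate.
same_content = records[0].eql?(records[1])

# The String "1" and the Integer 1 are not equal, so these hashes differ.
different_type = records[0].eql?(records[2])

deduped = records.uniq

puts same_content    # true
puts different_type  # false
p deduped            # two hashes remain
```

This is why the hashes with :id "1" and :id 1 both survive in the session above.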

You can see that uniq removes the exact duplicate, where both fields are the same, but keeps the hashes where only one field matches another hash. Say the data looks something like the following:

array = [
  {
    :account_name => "Mark Hemingway",
    :id => "901",
    :username => "mark01",
    :gender => "male",
    :order_number => 12
  },
  {
    :account_name => "Mark Hemingway",
    :id => "901",
    :username => "mark01",
    :gender => "male",
    :order_number => 13
  },
  {
    :account_name => "Mark Hemingway",
    :id => "901",
    :username => "mark01",
    :gender => "male",
    :order_number => 14
  }
]

And we want to remove all duplicates of this account. Running uniq on this array will result in:

1.9.3p448 :259 > h = 
    [
     {:account_name=>"Mark Hemingway", :id=>"901", 
     :username=>"mark01", :gender=>"male", :order_number=>12}, 
     {:account_name=>"Mark Hemingway",  :id=>"901", 
     :username=>"mark01", :gender=>"male", :order_number=>13}, 
     {:account_name=>"Mark Hemingway",  :id=>"901", 
     :username=>"mark01", :gender=>"male", :order_number=>14}
    ] 
1.9.3p448 :260 > h.uniq
 => [
     {:account_name=>"Mark Hemingway", :id=>"901", 
     :username=>"mark01", :gender=>"male", :order_number=>12}, 
     {:account_name=>"Mark Hemingway",  :id=>"901", 
     :username=>"mark01", :gender=>"male", :order_number=>13}, 
     {:account_name=>"Mark Hemingway",  :id=>"901", 
     :username=>"mark01", :gender=>"male", :order_number=>14}
    ]

All hashes are considered unique because a single value, the order_number, is different. To get around this, uniq accepts a block, and the block's return value is what gets compared, so we can choose which field should be unique.

1.9.3p448 :261 > h.uniq { |g| g[:id] }
 => [
     {:account_name=>"Mark Hemingway", :id=>"901", 
     :username=>"mark01", :gender=>"male", :order_number=>12}
    ]

uniq now only compares the 'id' field, so although the 'order_number' is different in each hash, it is ignored. The first hash found for each id is the one that is kept.
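Here is the same idea as a small script rather than an irb session. The data is a trimmed-down, hypothetical variant of the example above (I changed one username so the two calls give different results). The second call, which returns several fields from the block, is my own extension and not part of the session above; treat it as a sketch of how you might consider records duplicates only when all of the listed fields match.

```ruby
# Hypothetical order records; the third one has a different username.
orders = [
  { :id => "901", :username => "mark01", :order_number => 12 },
  { :id => "901", :username => "mark01", :order_number => 13 },
  { :id => "901", :username => "mark02", :order_number => 14 }
]

# Compare on :id only; the first hash seen for each id is kept.
by_id = orders.uniq { |order| order[:id] }

# Compare on several fields at once by returning an array from the block.
# values_at pulls the listed keys out of the hash, in the order given.
by_id_and_username = orders.uniq { |order| order.values_at(:id, :username) }

p by_id.map { |order| order[:order_number] }               # [12]
p by_id_and_username.map { |order| order[:order_number] }  # [12, 14]
```

Note that uniq returns a new array and leaves the original untouched; uniq! is the variant that modifies the array in place.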