optimal algorithm grouping data in javascript - Stack Overflow

IT技术

更新时间：2025-04-093

admin管理员组
文章数量:1399887

The following (simplified) json type of data defines a Contact:

{
  id:   number;
  name: string;
  phone: string;
  email: string
}

There is the following group of data:

+---+----------+-------------+---------------------------+ 
|id | name     | phone       |email                      | 
+---+----------+-------------+---------------------------+
|1  | John     | 11111111    |[email protected]              | 
|2  | Marc     | 22222222    |[email protected]              | 
|3  | Ron      | 99999999    |[email protected]              |
|4  | Andrew   | 55555555    |[email protected]              |
|5  | Wim      | 99999999    |[email protected]              |
|6  | Marc     | 33333333    |[email protected]              |
|7  | Dan      | 44444444    |[email protected]              |
+---+----------+-------------+---------------------------+

Goal is to find the groups that belong together using javascript(optionally in lodash, but main idea is to get the algorithm clear) according to the following constraint: a contact belongs to a group when any of the following criteria are the same: name, phone or email. The results shows the id's grouped as arrays in arrays. a contact in a group of 1 is ignored.

In the example above, it means that contacts with ids 1,3,5 belong together since 1,3 share the same email and 3 and 5 share the same phonenumber. Likewise 2,6,7: 2 and 6 have the same name and 6 and 7 have the same email. 5 does not have anything in mon. The expected result therefore is:
[[1,3,5], [2,6,7]]

Background: One solution that works is iterating over every item and check for the remainder of the list if name, email, or phone is the same. If so, group them and take them out of the list (in the example we pare 1 with all the items in the list and only 3 is found). problem is that next items also need to be checked again to these groups, because in this case 5 is not yet detected as part of the group. This makes the algorithm plex, while I suspect there is a simple way of solving this in linear time. There might also be a name for this class of problems? `

The following (simplified) json type of data defines a Contact:

{
  id:   number;
  name: string;
  phone: string;
  email: string
}

There is the following group of data:

+---+----------+-------------+---------------------------+ 
|id | name     | phone       |email                      | 
+---+----------+-------------+---------------------------+
|1  | John     | 11111111    |[email protected]              | 
|2  | Marc     | 22222222    |[email protected]              | 
|3  | Ron      | 99999999    |[email protected]              |
|4  | Andrew   | 55555555    |[email protected]              |
|5  | Wim      | 99999999    |[email protected]              |
|6  | Marc     | 33333333    |[email protected]              |
|7  | Dan      | 44444444    |[email protected]              |
+---+----------+-------------+---------------------------+

Goal is to find the groups that belong together using javascript(optionally in lodash, but main idea is to get the algorithm clear) according to the following constraint: a contact belongs to a group when any of the following criteria are the same: name, phone or email. The results shows the id's grouped as arrays in arrays. a contact in a group of 1 is ignored.

In the example above, it means that contacts with ids 1,3,5 belong together since 1,3 share the same email and 3 and 5 share the same phonenumber. Likewise 2,6,7: 2 and 6 have the same name and 6 and 7 have the same email. 5 does not have anything in mon. The expected result therefore is:
[[1,3,5], [2,6,7]]

Background: One solution that works is iterating over every item and check for the remainder of the list if name, email, or phone is the same. If so, group them and take them out of the list (in the example we pare 1 with all the items in the list and only 3 is found). problem is that next items also need to be checked again to these groups, because in this case 5 is not yet detected as part of the group. This makes the algorithm plex, while I suspect there is a simple way of solving this in linear time. There might also be a name for this class of problems? `

Share Improve this question edited Nov 20, 2018 at 12:01 asked Nov 20, 2018 at 9:18 Han 1468 bronze badges

That looks like a cool problem... Just to make sure that I understood it correctly: what would it be the expected result if #2 had the following phone #: 99999999? Would it be [[1, 2, 3, 5, 6, 7]]? – Josep Commented Nov 20, 2018 at 9:27
1 correct, we would have one group. – Han Commented Nov 20, 2018 at 9:39

Add a ment |

4 Answers 4

Sorted by: Reset to default 3

Idea:

Start with 0 groups
Iterate your list of contacts
Check if there is a group with the contact name, or phone, or email. Merge all the members of those groups as the same group. Then add yourself to that group. If not, begin a new group with yourself and set the name, phone and email group as yourself.

Union find is an efficient structure to handle merging of disjoint sets. Code taken from here. As it uses path pression and union by rank, you can consider that the whole code is linear in the amount of contacts.

var data = [
      {id:1,name:'John',phone:'11111111',email:'[email protected]'},
      {id:2,name:'Marc',phone:'99999999',email:'[email protected]'},
      {id:3,name:'Ron',phone:'99999999',email:'[email protected]'},
      {id:4,name:'Andrew',phone:'55555555',email:'[email protected]'},
      {id:5,name:'Wim',phone:'99999999',email:'[email protected]'},
      {id:6,name:'Marc',phone:'33333333',email:'[email protected]'},
      {id:7,name:'Dan',phone:'44444444',email:'[email protected]'}
];

// UNION-FIND structure, with path ression and union by rank

var UNIONFIND = (function () {
    
    function _find(n)
    {
        if(n.parent == n) return n;	
        n.parent = _find(n.parent);	
        return n.parent;
    }
    
    return {
        makeset:function(id){    
            var newnode = {
                parent: null,
                id: id,
                rank: 0
            };
            newnode.parent = newnode;            
            return newnode;
        },
    
        find: _find,
     
        bine: function(n1, n2) {                                    
            var n1 = _find(n1);
            var n2 = _find(n2);
            
            if (n1 == n2) return;
        
            if(n1.rank < n2.rank)
            {
                n2.parent = n2;
                return n2;
            }
            else if(n2.rank < n1.rank)
            {
                n2.parent = n1;
                return n1;
            }
            else
            {
                n2.parent = n1;
                n1.rank += 1;
                return n1;
            }
        }
    };
})();

var groupHash = {name: {}, phone: {}, email: {}}
var groupNodes = []

data.forEach(function(contact){
  var group = UNIONFIND.makeset(contact.id);
  var groups = new Set();
  ["name", "phone", "email"].forEach(function(attr){
    if (groupHash[attr].hasOwnProperty(contact[attr])) groups.add(groupHash[attr][contact[attr]])
  });
  
  groups = Array.from(groups);
  groups.push(group);
  groupNodes.push(group);
  
  for(var i = 1; i < groups.length; i++) {
    UNIONFIND.bine(groups[0], groups[i]);
  }  
  
  ["name", "phone", "email"].forEach(function(attr){
      groupHash[attr][contact[attr]] = groups[0];
  });
  
})

var contactsInGroup = {}


groupNodes.forEach(function(group){
    var groupId = UNIONFIND.find(group).id;
    
    if (contactsInGroup.hasOwnProperty(groupId) == false) {
      contactsInGroup[groupId] = [];
    }
    
    contactsInGroup[groupId].push(group.id);
})

var result = Object.values(contactsInGroup).filter(function(list){
 return list.length > 1
})

console.log(result)

Any answer that iterates over each of n entries, and then over a growing list of m groups to match against is going to have worst-time performance of O(n*m) (found when there are no two entries that match on any term).

Any answer that iterates over each entry, and then over groups, and uses arrays to test for matching values among q options will further have to pay a penalty of O(q) per match. In the worst case, with say all e-mails the same and all phones different, this will mean O(n*m).

I believe this answer is O(n), because assuming that the number of fields to match with is a constant (in this case, 3: name, phone and email), all operations in the main loop, which is run once per entry, are O(1).

There is an extra plication to fix the fact that, late in the process, we may find a bridge between two (or even 3) groups, as entries can match on different fields with entries from different groups. This may happen several times. To avoid having to rebuild groups during the main loop, we leave merging to the very end, where we first build a map of what-group-ends-up-where, and then finally move all entry IDs to their final group. This can all be done in O(m), with m the number of groups; with an extra O(n) when actually copying entry IDs to the merged groups: overall, we are still in O(n) territory.

The last line builds arrays of ids from the merged groups, and filters out any that does not have over 1 element.

const data = [
    {id:1,name:'John',phone:'11111111',email:'[email protected]'},
    {id:2,name:'Marc',phone:'99999999',email:'[email protected]'},
    {id:3,name:'Ron',phone:'99999999',email:'[email protected]'},
    {id:4,name:'Andrew',phone:'55555555',email:'[email protected]'},
    {id:5,name:'Wim',phone:'99999999',email:'[email protected]'},
    {id:6,name:'Marc',phone:'33333333',email:'[email protected]'},
    {id:7,name:'Dan',phone:'44444444',email:'[email protected]'}
];

const groups = function(inputs) {

    let valuesToGroups = new Map(
        ['name', 'phone', 'email'].map(key => [key, new Map()]));
    let groups = new Map();
    let pendingMerges = [];
    for (const entry of inputs) {
        let group = undefined;
        let found = [];
        for (const [key, valueMap] of valuesToGroups) {
            // look up value in values-index for current key
            group = valueMap.get(entry[key]);
            if (group !== undefined) {
                found.push(group);
                // not breaking allows groups to be merged
            }
        }
        if (found.length === 0) {
            // not found: create new group
            group = groups.size;
            groups.set(group, [entry.id]);
        } else {
            // found: add entry to chosen group
            group = found[0];
            groups.get(group).push(entry.id);
            if (found.length > 1) {
                pendingMerges.push(found);
            }
        }
        // add entry's values to index, pointing to group
        for (const [key, valueMap] of valuesToGroups) {
            valueMap.set(entry[key], group); 
        }        
    }
    // do pending merges; initially, all groups are stand-alone
    let merges = new Map(Array.from(groups.keys()).map(k => [k, k]));
    for (const merge of pendingMerges) {
        // contents will go to the lowest-numbered group
        const sorted = merge.map(groupId => merges.get(groupId)).sort();
        sorted.forEach(groupId => merges.set(groupId, sorted[0]));
    }
    const cleanGroups = new Map();
    groups.forEach((value, key) => { 
        const k = merges.get(key); 
        if ( ! cleanGroups.has(k)) {
            cleanGroups.set(k, []);
        }
        value.forEach(id => cleanGroups.get(k).push(id))
    })
    // return only non-empty groups
    return [... cleanGroups].filter(g => g[1].length>1).map(g => [... g[1]]);
}(data);

console.log(""+JSON.stringify(groups))
// output is [[1,2,3,5,6,7]]

Here is another suggestion of a route you could take. The idea is to use one Array.reduce to group by id and keep all the values (vls) and bined results (ids) in that accumulator object.

This way you can easily pare the name/phone/email using Array.some + Array.includes (which is what the getGroupId function does).

Once you have grouped and have the almost final result just prettify it by removing the groups with length of one and picking only the ids array of the rest:

var data = [ {id:1,name:'John',phone:'11111111',email:'[email protected]'}, {id:2,name:'Marc',phone:'22222222',email:'[email protected]'}, {id:3,name:'Ron',phone:'99999999',email:'[email protected]'}, {id:4,name:'Andrew',phone:'55555555',email:'[email protected]'}, {id:5,name:'Wim',phone:'99999999',email:'[email protected]'}, {id:6,name:'Marc',phone:'33333333',email:'[email protected]'}, {id:7,name:'Dan',phone:'44444444',email:'[email protected]'} ];

const getGroupId = (obj, vals) => Object.entries(obj)
   .find(([k,v]) => v.vls.some(x => vals.includes(x))) || []

const group = d => d.reduce((r, c) => {
   let values = Object.values(c), groupID = getGroupId(r, values)[0]
	
   if(!groupID)	
      r[c.id] = ({ vls: values, ids: [...r[c.id] || [], c.id] })
   else {
      r[groupID] = ({
         vls: [...r[groupID].vls, ...values], ids: [...r[groupID].ids, c.id]
      })
   }
   return r
}, {})

const prettify = grp => Object.values(grp).reduce((r,c) => {
   if(c.ids.length > 1)
     r.push(c.ids)
     return r
}, [])

console.log(prettify(group(data)))

One thing to note is that we do not care about the number of properties since we do Object.values. So you can easily add another address or fax to that list and it would still work with zero code changes.

As per feedback here is another version which works slightly different:

var data = [ {id:1,name:'John',phone:'11111111',email:'[email protected]'}, {id:2,name:'Marc',phone:'22222222',email:'[email protected]'}, {id:3,name:'Ron',phone:'99999999',email:'[email protected]'}, {id:4,name:'Andrew',phone:'55555555',email:'[email protected]'}, {id:5,name:'Wim',phone:'99999999',email:'[email protected]'}, {id:6,name:'Marc',phone:'33333333',email:'[email protected]'}, {id:7,name:'Dan',phone:'44444444',email:'[email protected]'} ];
var testData = [{ id: 1, name: 'John', phone: '1', email: 'a' }, { id: 2, name: 'Marc', phone: '2', email: 'b' }, { id: 3, name: 'Ron', phone: '1', email: 'b' }]; 

const getGroupId = (obj, vals) => Object.entries(obj)
  .find(([k,v]) => v.vls.some(x => vals.includes(x))) || []

const group = d => d.reduce((r,c,i,a) => {
  let values = Object.values(c), groupID = !i ? i : getGroupId(r, values)[0]

  if (!groupID) {		
    let hits = a.filter(x => 
       x.id != c.id && values.some(v => Object.values(x).includes(v)))
    hits.forEach(h => 
       r[c.id] = ({ vls: [...values, ...Object.values(h)], ids: [c.id, h.id] }))
  }
  else
    r[groupID] = r[groupID].ids.includes(c.id) ? r[groupID] : 
      ({ vls: [...r[groupID].vls, ...values], ids: [...r[groupID].ids, c.id] })      
  return r
}, {})

const prettify = grp => Object.values(grp).reduce((r, c) => {
  if (c.ids.length > 1)
    r.push(c.ids)
  return r
}, [])

console.log(prettify(group(data)))      // OP data
console.log(prettify(group(testData)))  // Test data

The reason for this version is due to the testData provided by @Mark which has the 2nd element not matching the first but matching the 3rd which actually matches the 1st ... so they all should be hits.

To get to that once we find a match we look for matches of that same initial match and push in the same group so we can have the maximum amount of data to match on.

The result is that once we get the first group with the first element we then find and push the 3rd as well and from there it is much easier to match the 2nd. The logic is slightly more plex and I would imagine less performant.

One way to acplish what you need, is to separate the contacts into groups. Each group will contain a list of names, phones and emails.

Then iterate through contacts, and see if the current contact falls into any of the groups. If not, create a new group and set its names/phones/emails so that next contacts might fall into the same on.

var data = [
      {id:1,name:'John',phone:'11111111',email:'[email protected]'},
      {id:2,name:'Marc',phone:'22222222',email:'[email protected]'},
      {id:3,name:'Ron',phone:'99999999',email:'[email protected]'},
      {id:4,name:'Andrew',phone:'55555555',email:'[email protected]'},
      {id:5,name:'Wim',phone:'99999999',email:'[email protected]'},
      {id:6,name:'Marc',phone:'33333333',email:'[email protected]'},
      {id:7,name:'Dan',phone:'44444444',email:'[email protected]'}
];

var groups = [];

data.forEach(function(person){
  var phone = person.phone;
  var email = person.email;
  var name = person.name;
  var id = person.id;
  var found = false;
  groups.forEach(function(g){
    if(    g.names.indexOf(name) > -1 
        || g.phones.indexOf(phone)>-1 
        || g.emails.indexOf(email)>-1) {
      found = true;
      g.names.push(name);
      g.phones.push(phone);
      g.emails.push(email);
      g.people.push(id);
    }
  });
  if(!found) {
      groups.push({names:[name],phones:[phone],emails:[email],people:[id]});
  }
  
  
});
var output=[];
groups.forEach(function(g){
  output.push(g.people);
});
console.log(output);   //[ [1,3,5] , [2,6,7] , [4] ]

本文标签： optimal algorithm grouping data in javascriptStack Overflow

版权声明：本文标题：optimal algorithm grouping data in javascript - Stack Overflow 内容由网友自发贡献，该文观点仅代表作者本人，转载请联系作者并注明出处：http://www.betaflare.com/web/1744176121a2594010.html，本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容，一经查实，本站将立刻删除。

编程频道|软件玩家 - 软件改变生活！

optimal algorithm grouping data in javascript - Stack Overflow

4 Answers 4

更多相关文章