Count matches and mismatches per group

Please help with a shell script for the following. I need to count the number of consistent variables in each lane (col1) across samples (col2). For example, since the values (col4) of lane1 variable1 are the same across all three samples, variable1 counts as a consistent variable. Similarly, lane2 variable2 and variable3 are both inconsistent.

lane1  sample1 variable1 ab
lane1  sample2 variable1 ab
lane1  sample3 variable1 ab   

lane1  sample1 variable2 cd
lane1  sample2 variable2 cd
lane1  sample3 variable2 cd

lane1  sample1 variable3 gh
lane1  sample2 variable3 ab
lane1  sample3 variable3 gh

lane2  sample1 variable1 ac
lane2  sample2 variable1 ac
lane2  sample3 variable1 ac

lane2  sample1 variable2 gt
lane2  sample2 variable2 gt
lane2  sample3 variable2 ac

lane2  sample1 variable3 ga
lane2  sample2 variable3 ga
lane2  sample3 variable3 ac


Number of consistent and inconsistent variables across all three samples:

       #Consistent  #Inconsistent
lane1  2            1
lane2  1            2


Perl solution:

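use warnings;
use strict;
use feature qw{ say };

my %values;
while (<>) {
    next unless /\S/; # Skip empty or whitespace-only lines
    my ($lane, $sample, $var, $val) = split;
    die "Duplicate $lane $sample $var\n" if $values{$lane}{$var}{$val}{$sample};
    $values{$lane}{$var}{$val}{$sample} = 1;
}

my %results;
for my $lane (sort keys %values) {
    for my $var (keys %{ $values{$lane} }) {
        # Number of distinct values seen for this variable in this lane
        my $count = keys %{ $values{$lane}{$var} };
        if (1 == $count) {
            $results{$lane}{consistent}++;
        } else {
            $results{$lane}{inconsistent}++;
        }
    }
    # Default to 0 so a lane with no (in)consistent variables still prints a number
    say join "\t", $lane, map { $results{$lane}{$_} // 0 } qw{ consistent inconsistent };
}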
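
Since the question asks for a shell script, here is a minimal awk sketch of the same idea, offered as one possible approach and not as part of the original answer. It assumes whitespace-separated columns in the order lane, sample, variable, value, and treats a (lane, variable) pair as consistent when every sample carries the same value. The function name count_consistent and the file name data.txt are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): reads the four-column
# records on stdin and prints "lane<TAB>consistent<TAB>inconsistent".
count_consistent() {
    awk '
        NF < 4 { next }                      # skip blank or incomplete lines
        {
            key = $1 SUBSEP $3               # lane + variable
            if (!(key in first)) {           # first value seen for this pair
                first[key] = $4
                lanes[$1] = 1
            } else if (first[key] != $4) {   # a later sample disagrees
                mixed[key] = 1
            }
        }
        END {
            for (key in first) {
                split(key, part, SUBSEP)
                if (key in mixed) bad[part[1]]++
                else good[part[1]]++
            }
            for (lane in lanes)
                printf "%s\t%d\t%d\n", lane, good[lane], bad[lane]
        }
    ' | sort                                 # awk array order is unspecified
}
```

With the sample data saved as data.txt, running `count_consistent < data.txt` should print one tab-separated line per lane, sorted by lane name. Unlike the Perl version, this sketch does not check for duplicate lane/sample/variable records.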
