How to compare 2 lists of ranges in bash?

Answer #1 100 %

It depends on how big your files are, of course. If they are not big enough to exhaust the memory, you can try this 100% bash solution:

declare -a min=() # array of lower bounds of ranges
declare -a max=() # array of upper bounds of ranges

# read ranges in second file, store then in arrays min and max
while read a b; do
    min+=( "$a" );
    max+=( "$b" );
done < file2

# read ranges in first file    
while read a b; do
    # loop over indexes of min (and max) array
    for i in "${!min[@]}"; do
        if (( max[i] >= a && min[i] <= b )); then # if ranges overlap
            echo "${min[i]} ${max[i]}" # print range
            unset min[i] max[i]        # performance optimization
        fi
    done
done < file1

This is just a starting point. There are many possible performance / memory footprint improvements. But they strongly depend on the sizes of your files and on the distributions of your ranges.

EDIT 1: improved the range overlap test.

EDIT 2: reused the excellent optimization proposed by RomanPerekhrest (unset already printed ranges from file2). The performance should be better when the probability that ranges overlap is high.

EDIT 3: performance comparison with the awk version proposed by RomanPerekhrest (after fixing the initial small bugs): awk is between 10 and 20 times faster than bash on this problem. If performance is important and you hesitate between awk and bash, prefer:

awk 'NR == FNR { a[FNR] = $1; b[FNR] = $2; next; }
    { for (i in a)
          if ($1 <= b[i] && a[i] <= $2) {
              print a[i], b[i]; delete a[i]; delete b[i];
          } 
    }' file2 file1

You’ll also like:


© 2023 CodeForDev.com -