Unit 4.13 MERGING FILES

PROG

Merge the files u413a.txt and u413b.txt
Both are assumed to be in sorted order on input
The output file, u413.out, will also be sorted

PED

To learn how to merge files

To prepare us for the Master-File update problem

DO LOOP INVARIANT

number1 contains the current number from u413a -- first file
number2 contains the current number from u413b -- second file
all numbers before number1 from first file are already put in output file
all numbers before number2 from second file are already put in output file
output file is sorted
all the numbers in the output file are less than the current number from first
file
all the numbers in the output file are less than the current number from
second file


This problem introduces us to an important algorithm in business data processing, merging two files. We merge two files when both input files are sorted and we want an output file that is also sorted. Imagine, that you have two companies, each with an alphabetic list of customers. Both companies merge. We would want a single list of customers--also in alphabeticalorder. We say we merge the two lists of customers. The same thing happens in computers all the time.

A similar problem is the master file update problem which will be discussed in yuour homework. There is a big file containing all the customers and their current balance. This file is kept sorted. Each month, we type in a file containing all the new transactions, payments and new orders. This file is sorted with a sort utility. The system outputs a new file containing all the
new balances for each customer. It is also sorted. We can use this new file after the next month's transactions. This is a very common way to handle batch updates in commercial business data processing systems such as banks, utilities, etc. (The other option would be a situation where one can enter the customers and update the information on-line as would be done with an Automatic Teller Machine. This requires arrays or an indexed random-access file. Arrays will be covered in the next section. Random-access files are beyond the scope of this course.)

The fundamental idea of this program is that each time through the loop we add a single number to the output file each time. The number we add should obey the do loop invariant. We also have the next number to be added from file1 in number1. And, likewise, we have the next number to be
added from file2 in number2. If the number to be added is from file1, we read in a new value for number1. If we decide to add the next number from file2, then we must read a new value for number2 from file2.

Thus, our first refinement is:


read in first number from file1
read in first number from file2
while(more numbers to process) do begin
  add the appropriate number from file1 or file2 to the output file
  writeln;
end;

We expand this a little bit:


read in first number from file1
read in first number from file2
while(more numbers to process) do begin
  if number for file1 is smaller then number for file2
    add the number from file1 to output file
    read in next number from file1
  end
  else
  if number for file2 is smaller than number for file1
    add the number from file2 to output file
    read in next number file2
  end;
  writeln;
end;

There's only one special case--when we finish either file1 or file2. If we have read in all the numbers from file1 and put them in the output, then we have to copy the remainder of file2 to file1. Likewise, if we read in all the numbers from file2 and put them in the output, then we have to copy of the remaidnerof file1 to file2.

Let's look at our do loop invariant.

number1 contains the current number from u413a -- first file
number2 contains the current number from u413b -- second file
all numbers before number1 from first file are already put in output file
all numbers before number2 from second file are already put in output file
output file is sorted
all the numbers in the output file are less than the current number from first
file
all the numbers in the output file are less than the current number from
second file

Check 1: that the do loop invariant and the condition being true implies that executing the statements leaves the do loop invariant true

There are two major possibilities to consider. The first is that number1 < number2. From the do loop invariant, we know that all the numbers that came from file1 that were less than this number are also in the file. Since, number1 is less than number2, we know that all the numbers in the output file are also less than number2. So if we add number1 to the output file, then all the numbers in the output file will still be less than number2. When we read the next number from file1, the fact that file1 is sorted implies that the next number read in will be greater than the old value. Thus, it will now be true that all numbers in the output file are both less than number1 and less than number2. Since the file was sorted prior to adding number1 and that number was larger than all the other numbers in the output file--the file will remain sorted. If we executed the other branch of the if, then the same sort of argument would apply.

Check 2: that the do loop invariant and the condition not being true implies the POSTCONDITION for the loop

If the condition is not true, that means that both number1 and number2 were -1 the sentinel value. The loop invariant says that all the numbers from file1 before the -1 sentinel are in the output file. Likewise, all the numbers from file2 are before the -1 sentinel in number2. Also, the file is sorted--so the output file is a sorted merge of the two outputs.

Check 3: that the do loop invariant is initially true. We did that by reading in NUMBER but not including that value in the SUM.

There is nothing in the output file and the first number from file1 is in number1 adn the first number from file2 is in number2.

Check 4: that executing the statements once makes progress towards the goal. Since each read statement removes a number from the file and advances us toward the end of what must be a finite set, then the statement makes progress.

We move one number from either file in each time through the loop.