Migrating applications from a mainframe to distributed systems also means moving the data files from the source system across to the target. Data files must be transformed to ensure they will be usable in the same way on the target system and that the integrity of the data is preserved.

The file systems on Windows and Linux have very different features from those on mainframes, however, so files and the way they are read from and written to deserve a closer look. In this article we will limit our discussion to non-indexed files (be they sequential, generational or libraries). We will examine:

  • The main differences in the file formats on mainframe and distributed platforms, without regard to encoding issues like EBCDIC to ASCII
  • The options for file formats on the target platform
  • The pros and cons of each.

We’ll use an IBM z/OS source system as an example but the issues and choices apply equally to other systems (VSE, BS2000, …). We’ll also use Windows as the example target platform, but a Linux system is equally applicable.

Mainframe file formats

On z/OS there are two main file formats: Variable (V) and Fixed (F). They’re often referred to as VB and FB (the B standing for Blocked, which is the most common way of organizing them on physical storage media).

Fixed format

FB files have (perhaps not too surprisingly) a fixed record length. This means that every record in the file will have exactly the same number of bytes. The file system on the mainframe (the VTOC) stores information on the file, which amongst other things keeps track of this record length. The files can be easily viewed in a text-oriented editor on the mainframe like ISPF, as the system knows where to split the file into records based on the fixed record length. There is no delimiter between the records, and any kind of textual or binary data can be stored in the file.

For example

FILE1 contains the data:

ABCDEFGHI

The VTOC records that the logical record length of FILE1 is 3. The system therefore knows that there are 3 records in the file:

ABC
DEF
GHI
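The same split can be sketched in a few lines of Python. This is only an illustration of the principle: the function name split_fixed is our own, and the record length is passed in by hand, standing in for what the VTOC would supply.

```python
def split_fixed(data: bytes, lrecl: int) -> list[bytes]:
    """Split raw FB data into records of exactly lrecl bytes each."""
    if lrecl <= 0:
        raise ValueError("record length must be positive")
    if len(data) % lrecl != 0:
        raise ValueError("data length is not a multiple of the record length")
    # No delimiters exist in the data, so we simply cut every lrecl bytes.
    return [data[i:i + lrecl] for i in range(0, len(data), lrecl)]


records = split_fixed(b"ABCDEFGHI", 3)
print(records)  # [b'ABC', b'DEF', b'GHI']
```

Note that nothing in the data itself says where a record ends; with a different length the same bytes would yield different records.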

Variable Format

Variable format files have a (drum-roll…) variable record length. This means that each record in the file can have a different number of bytes, up to a set maximum length. To know the length of each individual record, the system uses a 4 byte RDW (record descriptor word) field, which is prefixed to each record and specifies the length of that particular record. On z/OS, the VTOC stores information on the maximum permitted length of an individual record in the file. Again the files can be easily viewed in an editor, as the system splits the file into records based on the RDW.

For example

FILE1 contains:

ABCDEFGHIJK

The VTOC contains information that the maximum record length of FILE1 is 4. By reading the RDW fields it knows the length of each record:

 
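To make the role of the RDW concrete, here is a minimal Python sketch. It assumes the usual z/OS layout of a 2 byte big-endian length (which counts the 4 RDW bytes themselves) followed by 2 reserved zero bytes, and it ignores blocking and spanned records. The split of the data into records of 4, 3 and 4 bytes is purely our own assumption for illustration, and build_vb and split_vb are hypothetical helper names.

```python
import struct


def build_vb(records):
    """Prefix each record with a 4 byte RDW (length includes the RDW itself)."""
    out = bytearray()
    for rec in records:
        out += struct.pack(">HH", len(rec) + 4, 0) + rec
    return bytes(out)


def split_vb(data):
    """Walk the RDWs and return the record payloads without their RDW."""
    records, pos = [], 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated RDW")
        length, _reserved = struct.unpack_from(">HH", data, pos)
        if length < 4 or pos + length > len(data):
            raise ValueError("invalid RDW length")
        records.append(data[pos + 4:pos + length])
        pos += length
    return records


# Assumed split of ABCDEFGHIJK into records of 4, 3 and 4 data bytes.
raw = build_vb([b"ABCD", b"EFG", b"HIJK"])
print(split_vb(raw))  # [b'ABCD', b'EFG', b'HIJK']
```

Unlike the fixed format, the record boundaries here travel with the data, so no external length is needed to split the file.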

Windows file formats

On the Windows platform the main distinction is between two file types: text files, which can be processed as line sequential files, and binary files.

Text files

When storing data in text files, every logical record is terminated with a line delimiter (on Windows this is the CR LF combination of carriage return and line feed). These files are easily viewable in standard text editors on the platform like Notepad, WordPad, UltraEdit and others.

Since binary data can itself contain LF and CR characters, as well as page breaks and a multitude of other non-display characters, text files are not a good format for storing binary data. Text files are best for storing character data, adhere to a consistent encoding of characters in bytes, and may be prefixed with Unicode byte order marks.

Binary files

Binary files include a huge range of files, but broadly can be seen as anything for which no well-known visualization technique exists, or for which the contents are not consistently mappable to characters. Data in these files can be in any proprietary format, and it’s up to the programs which access them to interpret that data correctly.

In a COBOL to COBOL migration, for example, the COBOL compiler vendor usually provides a number of binary formats which can be used for the target files.

The square peg and the round hole

Trying to map the one set of file formats to the other can seem a little like trying to fit a square peg into a round hole. So how do you determine which type of file format you should target? The following are some of the main considerations when mapping file formats between source and target systems:

Use of binary fields.

If binary fields are used, then the chosen file format must be able to support this.

Need to visualise and edit text in files.
If the files need to be easily viewed by standard utilities, or manually edited with text editors, then text files are the most suitable format.

If FB files need to remain strictly fixed.
Since there is no VTOC in the target system, and line sequential files can be manually edited, the file system won’t stop you from storing for example 79 or 81 characters into a record that used to be FB 80. The same applies for VB files: since there is no VTOC, any previous upper length limit no longer applies! Essentially all files become fully variable when they are line sequential.
Note: From the standpoint of the actual programs this is not an issue. A program that previously wrote a file of FB 80 would still write a file with records of length 80. When reading data that was changed by another program or person, however, records that are shorter or longer than the known length of 80 could cause processing issues, since part of the data could get padded or truncated.
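If strict lengths matter, a simple check can flag records that have drifted. The Python sketch below is one possible approach, not a feature of any particular product: find_bad_lengths is a hypothetical helper, and it assumes CRLF delimiters and a single-byte encoding so that characters and bytes coincide.

```python
def find_bad_lengths(text: str, expected: int = 80) -> list[tuple[int, int]]:
    """Return (line number, actual length) for records that are not `expected` long."""
    lines = text.split("\r\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final delimiter
    return [
        (number, len(line))
        for number, line in enumerate(lines, start=1)
        if len(line) != expected
    ]


sample = "A" * 80 + "\r\n" + "B" * 79 + "\r\n" + "C" * 81 + "\r\n"
print(find_bad_lengths(sample))  # [(2, 79), (3, 81)]
```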

If a distinction needs to be made/retained between fixed and variable format files.
On the source system FB and VB files are two separate formats. This distinction may need to be retained for example if you have many sorts in your JCL and some sorts use FB files and others use VB files. On the source system the 4 byte RDW is counted as part of the record and so sort position 5 is actually the start of the real data. If the distinction between FB and VB files is dropped then the VB sort statements need to be modified to shift all the offsets by 4 bytes.
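The offset shift itself is simple arithmetic, as the following Python sketch shows. It assumes the sort keys have already been parsed into (start, length, format, order) tuples; the helper name shift_sort_fields and that tuple layout are our own choices for illustration.

```python
RDW_LENGTH = 4


def shift_sort_fields(fields, rdw_length=RDW_LENGTH):
    """Convert VB key positions (which count the RDW) to positions in the data."""
    shifted = []
    for start, length, fmt, order in fields:
        if start <= rdw_length:
            raise ValueError(f"key at position {start} overlaps the RDW")
        shifted.append((start - rdw_length, length, fmt, order))
    return shifted


# Equivalent of SORT FIELDS=(5,10,CH,A,15,4,CH,D) on a VB file
print(shift_sort_fields([(5, 10, "CH", "A"), (15, 4, "CH", "D")]))
# [(1, 10, 'CH', 'A'), (11, 4, 'CH', 'D')]
```

Only the start positions change; lengths, formats and sort order stay as they were.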

Formatting concerns for files transmitted to remote systems.
Are these formats read from or written to directly by the migrated application or does a conversion step need to be applied first?

The requirements for sorting and other utilities.
Not all file-based utilities like SORT support all file formats, so the more utilities are used the more important that compatibility with open standards becomes.

Target file formats

Below is a list of some of the most frequently used target file formats encountered in mainframe migrations, including formats which are implemented by COBOL compiler vendors. Each format has advantages and disadvantages when considering the points given in the previous section.

Line sequential formats

The following two formats are line sequential, hence the data is more easily visualizable.

Plain Text Files
Records are separated by CRLF. If the data of a record contains CR and/or LF bytes (0x0D or 0x0A), the file cannot be read back correctly (since 1 line no longer matches 1 record!).
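A short Python sketch illustrates the problem. The helpers write_plain and read_plain are hypothetical, and the reader here only splits on the full CRLF pair; a reader that also treats a lone LF as a line break would be affected in the same way.

```python
CRLF = b"\r\n"


def write_plain(records):
    """Write each record followed by a CRLF delimiter, with no escaping."""
    return b"".join(rec + CRLF for rec in records)


def read_plain(data):
    """Split the file back into lines at every CRLF."""
    lines = data.split(CRLF)
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final delimiter
    return lines


records = [b"AB\r\nCD", b"EFGH"]  # the first record contains 0x0D 0x0A as data
round_trip = read_plain(write_plain(records))
print(len(records), len(round_trip))  # 2 3
print(round_trip)  # [b'AB', b'CD', b'EFGH']
```

Two records were written but three lines come back, because the reader cannot tell data bytes from delimiters.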

COBOL style line sequential files
Supported by Micro Focus COBOL compilers, these are line sequential files in which all byte values lower than hex 20 are escaped by adding a leading 0x00 byte, and an explicit 0D0A sequence is added as record delimiter. This means that if a file contains no non-display bytes, it will be identical to a plain text file; if it does contain "binary" data, the escapes ensure that it can be read back correctly (but also cause problems when using the files with programs that expect plain text).
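The escaping scheme can be sketched in Python as follows. This is our own reading of the description above and not the vendor's actual implementation; encode_record and decode_records are hypothetical names, and real files may differ depending on compiler and runtime settings.

```python
def encode_record(record: bytes) -> bytes:
    """Escape every byte below 0x20 with a leading 0x00, then append 0D0A."""
    out = bytearray()
    for byte in record:
        if byte < 0x20:
            out.append(0x00)
        out.append(byte)
    out += b"\r\n"
    return bytes(out)


def decode_records(data: bytes) -> list[bytes]:
    """Reverse encode_record: an unescaped 0D0A pair ends a record."""
    records, current, i = [], bytearray(), 0
    while i < len(data):
        byte = data[i]
        if byte == 0x00 and i + 1 < len(data):
            current.append(data[i + 1])  # escaped byte, taken literally
            i += 2
        elif byte == 0x0D and data[i + 1:i + 2] == b"\n":
            records.append(bytes(current))
            current = bytearray()
            i += 2
        else:
            current.append(byte)
            i += 1
    if current:
        records.append(bytes(current))
    return records


original = [b"AB\r\nCD", b"\x00\x01EF", b"PLAIN"]
encoded = b"".join(encode_record(rec) for rec in original)
print(decode_records(encoded) == original)  # True
print(encode_record(b"PLAIN"))  # b'PLAIN\r\n'
```

The last line shows the point made above: a record without non-display bytes is encoded exactly as it would be in a plain text file.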

Binary formats

The following are vendor specific and generic binary file formats which can contain textual or binary data, but are not easily visualizable.

Record sequential files

These files contain raw data as in the FB files on the mainframe. For a program to be able to split the data into records the records have to be of a fixed length. The information on this length needs to be available to the program/utility which accesses it (e.g. via a VTOC/catalog replacement).
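As an illustration of such a catalog replacement, the Python sketch below looks up the record length in a small JSON document before splitting the data. The JSON layout, the attribute names recfm and lrecl, and the function read_fixed_records are all assumptions made for this example; a real migration toolset will have its own mechanism.

```python
import json

# Hypothetical catalog replacement: file names mapped to their attributes.
CATALOG_JSON = '{"FILE1": {"recfm": "FB", "lrecl": 3}}'


def read_fixed_records(name: str, data: bytes, catalog: dict) -> list[bytes]:
    """Split record sequential data using the length stored in the catalog."""
    entry = catalog.get(name)
    if entry is None:
        raise KeyError(f"no catalog entry for {name}")
    lrecl = entry["lrecl"]
    if len(data) % lrecl != 0:
        raise ValueError("data length is not a multiple of the record length")
    return [data[i:i + lrecl] for i in range(0, len(data), lrecl)]


catalog = json.loads(CATALOG_JSON)
print(read_fixed_records("FILE1", b"ABCDEFGHI", catalog))
# [b'ABC', b'DEF', b'GHI']
```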

COBOL style variable format files

COBOL compiler vendors Fujitsu and Micro Focus also provide their own formatting for variable format files. What these have in common is they incorporate RDW data for each record. In addition, the Micro Focus format includes a file header that includes a maximum and minimum record length. The Fujitsu COBOL format does not have a maximum record length header but includes RDW data at the beginning and end of each record, which makes it efficient to browse the file data either from top to bottom or backwards.

Target File Format Comparison Matrix

The table below summarizes the properties of the various file formats described above.

| File format | Supports binary data? | Supports FB/VB? | Can retain strict fixed length | Easily visualised/edited with Windows tools | Transformation required on transmission to remote systems | Sorting considerations |
|---|---|---|---|---|---|---|
| Generic line sequential | No | Both | No | Yes | No* | Optimal compatibility |
| COBOL style line sequential | Yes | Both | No | If textual data only | No* | Specialized tools required |
| Generic record sequential | Yes | FB only | Yes, but requires external length information | No | Yes | Optimal compatibility |
| COBOL style record sequential | Yes | Both | Yes | No | Yes | Specialized tools required |

* Assuming there would be no binary data

Other considerations

Since modern enterprise application platforms make do without concepts like mainframe catalogs and VTOC, modernization of mainframe applications will seek to drop these notions whenever practical.

For formats where the file itself does not contain information on the maximum record length (line sequential files, record sequential files, Fujitsu variable length files), this metadata needs to be found elsewhere. Often the metadata is implicitly present in the application artifacts, for example in COBOL FDs when the files are accessed via COBOL programs, or in JCLs. JCLs often contain metadata, for instance when the files are accessed through utilities like SORT and the sort statements identify the maximum record length in INREC/OUTREC/LENGTH clauses.

But what do we do when this information is not in the JCLs or COBOL FDs and we can’t do without it? In this case the modernization will look for other places to store it, with alternatives ranging from adding it to the JCLs, to setting application-wide defaults, to using custom system files.

During the testing phases of a migration project, job output and output files are validated against the originals to ensure that the file migration is correct and that all cases have been handled. Choosing a COBOL-style file format that contains the record information helps ensure that it is always correct and that no additional work, such as modifying JCL, is required. It’s one way to guarantee this information is always available, since it is part of the file itself. On the other hand, these are COBOL vendor proprietary formats and less usable by third-party tools.

Conclusion

Though it’s a complex topic, and reading about it may indeed make you want to curl up in a ball and weep, it’s often quite straightforward to choose the target file format.

Tools like the Anubex JCL analyzers can help to show what file formats are in use and if the length information is provided. The use of binary data limits the options, and the choice of target for the COBOL helps to further limit the options.

During a migration project all of this information can be discussed in a workshop to help determine the best options for your particular setup and requirements.

I hope you’ve enjoyed reading this post and that it has opened your eyes to the possibilities and pitfalls in this area of a migration project!