Java: using streams vs. iterating over chars for parsing raw data from file

I have to build a Java app to parse data stored in a log file. The file reader code I used returns the lines as an array of strings, as shown below.

R:16-08-2021 18:32:09 <STX>1120007 - 01           <ETX>0x3A
S:16-08-2021 18:32:09 <ACK>
R:16-08-2021 18:32:09 <STX>2xxxxxxxxxx<ETX>0x3F
S:16-08-2021 18:32:09 <ACK>
R:16-08-2021 18:32:09 <STX>300 15.3  0.6  1.0  0.6  3.0 13.7 83.8  0.0  0.0  0.0  0.0  910<ETX>0x35
S:16-08-2021 18:32:09 <ACK>
R:16-08-2021 18:32:09 <STX>4   02 6 651<ETX>0x11
S:16-08-2021 18:32:09 <ACK>
R:16-08-2021 18:32:09 <STX>5 1A1A  B  15 650  15  76  87    7.16  0.6<ETX>0x7F
S:16-08-2021 18:32:09 <ACK>
R:16-08-2021 18:32:09 <STX>5 2A1B  V  15 650  87 101 123   10.66  1.0<ETX>0x78
S:01-09-2021 16:06:08 <ACK>
R:01-09-2021 16:06:08 <STX>6<ETX>0x35
S:01-09-2021 16:06:08 <ACK>
R:01-09-2021 16:06:08 <STX>7  1   71.350   71.300   71.340   71.338   71.346   71.347   71.348   71.349   71.350   71.350<ETX>0xF
S:01-09-2021 16:06:08 <ACK>
R:01-09-2021 16:06:09 <STX>7  2   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350<ETX>0x6
S:01-09-2021 16:06:09 <ACK>
R:01-09-2021 16:06:09 <STX>7  3   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350<ETX>0x7
S:01-09-2021 16:06:09 <ACK>
R:01-09-2021 16:06:09 <STX>7  4   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350<ETX>0x0
S:01-09-2021 16:06:09 <ACK>
...

I have to consider only the substrings between <STX> and <ETX>. The first character (integer from 1 to 7), also called the header, represents the type of record. Hence, the number of fields to split the remaining characters into, as well as the width of each field, is dependent on the first character.

I have thought of a few ways to do this:

  1. Convert each string after <STX> into some kind of stream and consume the first character, switch-case on that number, then consume the appropriate number of chars for that case.
  2. Split the line and keep the first substring after <STX> using String.split, then iterate over each character using String.charAt(int i), switch-case at i==0, then iterate and store the appropriate number of chars.
  3. Iterate over each character in the complete line using String.charAt(int i), switch-case at i==27, and so on.

Which of these methods should produce the most maintainable code while not compromising on the speed/resources?

If I should go with 1, what is the appropriate way of converting the string to a stream and consuming it, for my use case? e.g.: converting to ByteArrayInputStream, or to a Stream<Character> (functional programming).

Currently the log file is 70,000 lines long and growing.

EDIT: The log file is a text file on Windows, with CRLF line endings. In the file, <STX> is the literal five-character text, not a control byte.

I have used the following code

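import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LogParser {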

private static long lineNo = 69408;
private static String fileLocation = "C:\\xxx\\log\\log_00.log";

public static void main(String[] args) {
    parseRawData();
}

private static void parseRawData() {
    // Files.lines keeps the file open, so close the stream with try-with-resources.
    try (Stream<String> lines = Files.lines(Paths.get(fileLocation), StandardCharsets.UTF_8)) {
        lines.skip(lineNo - 1)
                .filter(s -> s.contains("<STX>"))
                .forEachOrdered(LogParser::parseLine);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

private static void parseLine(String s) {

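    // Split on '>' into at most 3 parts; part 1 is the payload followed by "<ETX".
    String inter = s.split(">", 3)[1];
    // Drop the trailing "<ETX" (4 characters).
    String line = inter.substring(0, inter.length() - 4);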

    System.out.println(line);
    
}

}

to get the output

1120006 - 04           
209010023
300 12.6  0.7  1.0  0.8  3.7 11.0 85.4  0.0  0.0  0.0  0.0  956
4   02 6 651
5 1A1A  B  15 650  15  77  89    8.39  0.7
5 2A1B  V  15 650  89 104 121   13.01  1.0
5 3F    V  15 650 121 139 151   10.55  0.8
5 4LA1C+V  15 650 151 166 186   47.41  3.7
5 5SA1C V  15 650 186 203 256  118.68 11.0
5 6A0   V  15 650 256 308 650 1101.05 85.4
6
7  1   71.350   71.300   71.340   71.338   71.346   71.347   71.348   71.349   71.350   71.350
7  2   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350
7  3   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350
...
7 29   75.540   75.673   75.813   75.957   76.114   76.274   76.448   76.634   76.896   77.296
7 30   77.929   78.745   79.635   80.516   81.720   84.037   88.321   94.472  101.119  106.648
7 31  110.383  114.136  128.814  178.810  275.595  399.911  502.001  562.222  571.605  551.175
7 32  505.796  453.784  400.256  348.288  299.139  254.605  216.039  184.069  158.402  138.224
7 33  122.565  110.508  101.285   94.268   88.971   84.988   82.012   79.790   78.130   76.884
7 34   75.943   75.225   74.674   74.230   73.881   73.592   73.365   73.161   72.985   72.839
...
7 64   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350
7 65   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350   71.350
7 66   71.350    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000

This data is for plotting an HPLC (chromatography) graph. At every occurrence of ‘1’ as header, I will create a new custom object of type MyGraph that will hold the data of the subsequent records. The data in the ‘7’ records (e.g. 71.350) are raw data points (around 600 in total, each 9 chars long) that will be stored in an ArrayList. The plot has to be scaled and normalized based on the min and max values of the raw numbers, as well as some values from the ‘5’ records. The scaled points will be stored in another ArrayList. I plan to send the points to a graph plotting REST API. The image data received from the API will be sent to a database along with the HPLC sample number from the ‘2’ record.
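
For illustration only, the min/max part of that scaling could look like the sketch below. It assumes a plain linear mapping onto [0, 1]; `MinMaxScaler` and `normalize` are made-up names, and the contribution of the ‘5’ records is left out because its formula is not given here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class MinMaxScaler {

    /**
     * Maps each raw value linearly onto [0, 1] (assumed definition of
     * "normalized"). A flat series, where max == min, maps to all zeros.
     */
    static List<Double> normalize(List<Double> raw) {
        if (raw.isEmpty()) {
            return new ArrayList<>();
        }
        double min = Collections.min(raw);
        double max = Collections.max(raw);
        double range = max - min;

        List<Double> scaled = new ArrayList<>(raw.size());
        for (double value : raw) {
            scaled.add(range == 0.0 ? 0.0 : (value - min) / range);
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Made-up values, chosen only to show the mapping.
        List<Double> raw = Arrays.asList(2.0, 4.0, 6.0);
        System.out.println(normalize(raw)); // prints [0.0, 0.5, 1.0]
    }
}
```

The raw list is left untouched, which matches the plan of keeping raw and scaled points in two separate lists.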

Answer

When the source is a potentially large file, use Scanner (its findAll method requires Java 9 or later). For example:

try(Scanner sc = new Scanner(path)) {
    sc.findAll("<STX>(.)(.*)<ETX>")
        .map(mr -> {
            char header = mr.group(1).charAt(0);
            String recordData = mr.group(2);
            return "type " + header + ", data " + recordData;
        })
        .forEach(System.out::println);
}

None of your alternatives seems useful to me, as they are all built around the idea that you have to iterate or stream over the characters of the string.
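
To show what working without a character loop might look like, here is a minimal sketch that slices a ‘7’ record with substring. The layout (1-char header, 3-char row number, then 9-char-wide values) is an assumption read off the sample output, not a documented format, and `FixedWidthRecord7` is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

final class FixedWidthRecord7 {
    // Assumed layout, inferred from the sample output only:
    // index 0 = header, indexes 1..3 = row number, then 9-char-wide values.
    private static final int ROW_START = 1;
    private static final int VALUES_START = 4;
    private static final int VALUE_WIDTH = 9;

    /** Reads the row number that follows the header character. */
    static int rowNumber(String payload) {
        return Integer.parseInt(payload.substring(ROW_START, VALUES_START).trim());
    }

    /** Cuts the rest of the payload into 9-char fields and parses each one. */
    static List<Double> values(String payload) {
        List<Double> result = new ArrayList<>();
        for (int pos = VALUES_START; pos + VALUE_WIDTH <= payload.length(); pos += VALUE_WIDTH) {
            String field = payload.substring(pos, pos + VALUE_WIDTH).trim();
            result.add(Double.parseDouble(field));
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened '7' record: the text between <STX> and <ETX>.
        String payload = "7 31  110.383  114.136  128.814";
        System.out.println(rowNumber(payload) + " " + values(payload));
    }
}
```

The loop here steps over fields, not characters, so the field widths live in a few constants and stay easy to change per record type.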

Each record type could get processed in a different way and produce a different type of object, depending on the format. We don’t know enough to suggest something more specific.
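
As a rough sketch of that idea, the code below starts a new object at every ‘1’ record and routes the following records by header. To stay self-contained it runs the same regex over a list of lines with Matcher instead of Scanner.findAll. The `Graph` class and its fields are hypothetical stand-ins, and only the ‘2’ and ‘7’ records are handled because the other layouts are not known.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RecordDispatcher {

    /** Hypothetical container; the field names are illustrative only. */
    static final class Graph {
        String headerText;   // uninterpreted content of the '1' record
        String sampleNumber; // content of the '2' record
        final List<String> dataRows = new ArrayList<>(); // raw '7' data, parsed later
    }

    private static final Pattern FRAME = Pattern.compile("<STX>(.)(.*)<ETX>");

    static List<Graph> collect(List<String> lines) {
        List<Graph> graphs = new ArrayList<>();
        Graph current = null;
        for (String line : lines) {
            Matcher m = FRAME.matcher(line);
            if (!m.find()) {
                continue; // e.g. the "S:... <ACK>" lines
            }
            char header = m.group(1).charAt(0);
            String data = m.group(2);
            if (header == '1') {
                current = new Graph(); // every '1' record starts a new graph
                current.headerText = data.trim();
                graphs.add(current);
                continue;
            }
            if (current == null) {
                continue; // records before the first '1' have no owner
            }
            switch (header) {
                case '2':
                    current.sampleNumber = data.trim();
                    break;
                case '7':
                    current.dataRows.add(data);
                    break;
                default:
                    break; // types 3 to 6 would get their own cases
            }
        }
        return graphs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "R:16-08-2021 18:32:09 <STX>1120007 - 01           <ETX>0x3A",
                "S:16-08-2021 18:32:09 <ACK>",
                "R:16-08-2021 18:32:09 <STX>2xxxxxxxxxx<ETX>0x3F",
                "R:01-09-2021 16:06:08 <STX>6<ETX>0x35");
        for (Graph g : collect(lines)) {
            System.out.println(g.headerText + " / " + g.sampleNumber + " / " + g.dataRows.size());
        }
    }
}
```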