Separating text into separate elements using a pattern – JavaScript

Apologies in advance for my atrocious code / attempts at explaining what I’m trying to achieve…

I want to take various transcripts, with timestamps, and convert them to a consistent format for creating subtitles. The transcripts are not from the same sources, and the structure of the documents and the timestamps vary, sometimes even within the same document.

The format of the timestamp is [HH:MM:SS.FF] (the variations I can deal with) and it is contained within the text. The timestamps usually mark just a start point, but sometimes they also indicate an end point.

So the format is

[Timestamp1]Some text with various line breaks and weird characters.
[Timestamp2]More text where this transcript continues but ends with some silence after this
[Timestamp3]
[Timestamp4]The next sentence begins and ends at the last
[Timestamp5]

What is the best coding approach to this in JavaScript? I’ve gone around the houses with string.split and string.matchAll, but none of the regex patterns I come up with can deal with two timestamps in a row.

I think ideally I would have the regex pattern that gets the timestamp and then store an array of objects that have a Start and End timestamp (end is next start if end doesn’t exist) and associated text.

So for the above example I’d have

Start: Timestamp1 End: Timestamp2 Text: "Some text..."

Start: Timestamp2 End: Timestamp3 Text: "More text..."

Start: Timestamp4 End: Timestamp5 Text: "The next..."

This is one of my latest attempts…

function test(){
        str = 
        `[09:35:10.00]
        1. Lorem ipsum...
        [09:35:13.11]
        [09:35:15.14]
        2. sed do eiusmod...
        [09:35:39.20]
        3. anim id est laborum...
        [09:35:43.17]`

        var re = /(?<tc1>\[?(?:[0-1][0-9]|2[0-3]|[0-9]):(?:[0-5][0-9]):(?:[0-5][0-9])(?:\.(?:[0-9]{2,3})?\]?))\s*(.*)\s*(?<tc2>\[?(?:[0-1][0-9]|2[0-3]|[0-9]):(?:[0-5][0-9]):(?:[0-5][0-9])(?:\.(?:[0-9]{2,3})?\]?))?.*/gm;

        const matches = str.matchAll(re);
        for (const match of matches) {
                console.log(`Start TC:\n${match[1]}\nText:\n${match[2]}\nTC2:\n${match[3]}`);
        }
}

Which doesn’t cater for the variations, unfortunately.

Thanks for any pointers in the right direction.

Answer

The pattern needs to be composed of 3 parts:

  • Match and capture a timestamp: [, followed by digits, colons, and a period, then ]: \[\d{2}:\d{2}:\d{2}\.\d{2}\]
  • Match and capture the text up to the next timestamp, i.e. any characters that do not begin a timestamp: (?:(?!TIMESTAMP).)+ where TIMESTAMP is the pattern above
  • Look ahead for and capture the next timestamp: just use the timestamp pattern above inside (?=...)

You have to look ahead for a timestamp instead of matching it normally because the timestamp in question may need to be part of the next match.
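To see the difference, here is a minimal sketch that builds both variants from a single timestamp source string and compares them on a tiny input. It assumes the fixed [HH:MM:SS.FF] form; the names TS, body, consuming and lookahead are mine and purely illustrative.

```javascript
// Timestamp source, written once and reused (assumes the fixed [HH:MM:SS.FF] form).
const TS = String.raw`\[\d{2}:\d{2}:\d{2}\.\d{2}\]`;
// One or more characters, none of which begins a timestamp.
const body = `(?:(?!${TS}).)+`;

// Variant A consumes the closing timestamp; variant B only looks ahead at it.
const consuming = new RegExp(`(${TS})(${body})(${TS})`, 'gs');
const lookahead = new RegExp(`(${TS})(${body})(?=(${TS}))`, 'gs');

const sample = '[00:00:01.00]one[00:00:02.00]two[00:00:03.00]';

// Collect the start timestamp of every match for each variant.
const startsA = [...sample.matchAll(consuming)].map(m => m[1]);
const startsB = [...sample.matchAll(lookahead)].map(m => m[1]);

console.log(startsA); // [ '[00:00:01.00]' ]
console.log(startsB); // [ '[00:00:01.00]', '[00:00:02.00]' ]
```

The consuming variant swallows [00:00:02.00] as the end of the first match, so "two" never gets a start timestamp. With the lookahead, the same timestamp is still available to open the next match.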

Put it together, and you get:

const str =
  `[09:35:10.00]
        1. Lorem ipsum...
        [09:35:13.11]
        [09:35:15.14]
        2. sed do eiusmod...
        [09:35:39.20]
        3. anim id est laborum...
        [09:35:43.17]`

var re = /(\[\d{2}:\d{2}:\d{2}\.\d{2}\])((?:(?!\[\d{2}:\d{2}:\d{2}\.\d{2}\]).)+)(?=(\[\d{2}:\d{2}:\d{2}\.\d{2}\]))/gs;

const matches = str.matchAll(re);
for (const match of matches) {
  console.log(`Start TC:\n${match[1]}\nText:\n${match[2]}\nTC2:\n${match[3]}`);
}

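Or, commenting the regex (makeExtendedRegExp is a small helper that strips the comments, indentation and newlines before compiling):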

const pattern = makeExtendedRegExp(String.raw`
( # First capture group: timestamp
  \[\d{2}:\d{2}:\d{2}\.\d{2}\]
)
( # Second capture group: text
  (?:(?!
    # Timestamp pattern again:
    \[\d{2}:\d{2}:\d{2}\.\d{2}\]
  ).)+
)
(?=( # Look ahead for and capture the timestamp in 3rd group:
  # Timestamp pattern again:
  \[\d{2}:\d{2}:\d{2}\.\d{2}\]
))
`, 'gs');



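function makeExtendedRegExp(inputPatternStr, flags) {
  const cleanedPatternStr = inputPatternStr
    // Remove "# comment" runs, unless the # is escaped with a backslash
    .replace(/(^|[^\\]) *#.*/g, '$1')
    // Remove per-line leading/trailing whitespace and all newlines
    .replace(/^\s+|\s+$|\n/gm, '');
  return new RegExp(cleanedPatternStr, flags);
}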


const str =
  `[09:35:10.00]
        1. Lorem ipsum...
        [09:35:13.11]
        [09:35:15.14]
        2. sed do eiusmod...
        [09:35:39.20]
        3. anim id est laborum...
        [09:35:43.17]`

const matches = str.matchAll(pattern);
for (const match of matches) {
  console.log(`Start TC:\n${match[1]}\nText:\n${match[2]}\nTC2:\n${match[3]}`);
}
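
Note that when two timestamps appear in a row, the pattern still produces a match for the pair, with only whitespace as its text. If you want the array of Start / End / Text objects described in the question, one possible way is to trim each text and drop the empty ones. This is only a sketch: the names TIMESTAMP, cueRe and toCues are hypothetical, and it assumes the fixed [HH:MM:SS.FF] form.

```javascript
// Timestamp source (assumes the fixed [HH:MM:SS.FF] form).
const TIMESTAMP = String.raw`\[\d{2}:\d{2}:\d{2}\.\d{2}\]`;
// Same three-part pattern as above, built from the single source string.
const cueRe = new RegExp(
  `(${TIMESTAMP})((?:(?!${TIMESTAMP}).)+)(?=(${TIMESTAMP}))`,
  'gs'
);

// Hypothetical helper: turn a transcript into { start, end, text } objects.
function toCues(transcript) {
  return [...transcript.matchAll(cueRe)]
    .map(([, start, text, end]) => ({ start, end, text: text.trim() }))
    // Two timestamps in a row leave only whitespace between them; drop those.
    .filter(cue => cue.text !== '');
}

const transcript = `[09:35:10.00]
  1. Lorem ipsum...
  [09:35:13.11]
  [09:35:15.14]
  2. sed do eiusmod...
  [09:35:39.20]
  3. anim id est laborum...
  [09:35:43.17]`;

console.log(toCues(transcript));
```

For the sample above this gives three objects: 09:35:10.00 to 09:35:13.11, 09:35:15.14 to 09:35:39.20, and 09:35:39.20 to 09:35:43.17. The silent gap between 13.11 and 15.14 is filtered out.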