Thanks everyone; I wound up making an ugly parser (even uglier than I originally expected).
The lines were LF-delimited, thankfully, but unfortunately they were also UTF-16.
CS's sed command was roughly what I was hoping someone would point me to, except that it relies on knowing the tags in advance. I don't actually know the complete list of possible tags in the files.
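For illustration, here is a small sketch of why the byte order of UTF-16 matters to anything that splits lines on raw bytes, such as PHP's fgets(). The post does not say which byte order the files use, so both are shown; the variable names are made up for the example.

```php
<?php
// LF encoded in each UTF-16 byte order, shown as hex.
$lfBigEndian    = bin2hex(mb_convert_encoding("\n", 'UTF-16BE', 'UTF-8')); // "000a"
$lfLittleEndian = bin2hex(mb_convert_encoding("\n", 'UTF-16LE', 'UTF-8')); // "0a00"

// A byte-oriented reader splits right after the 0x0A byte.
// Big-endian: 0x0A is the last byte of the LF character, so each line stays whole.
// Little-endian: the split leaves a stray 0x00 at the start of the next line.
echo "BE: $lfBigEndian, LE: $lfLittleEndian\n";
```

In other words, reading line by line with a byte-oriented function is only safe as-is when the 0x0A byte ends the LF character, which is the big-endian case.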
If anyone wants the code (you don't, trust me) here it is, in all its hackish glory (PHP >= 5):
$theConvertedFilename = $filename."_converted";
$theOutput = fopen($theConvertedFilename, "w");
$theFile = fopen($filename, "r");
if ($theFile) {
    // fgets() returns false at end of file, so test its result rather than feof()
    while (($theRecord = fgets($theFile)) !== false) {
        $reconstructedString16 = $theRecord; // default for the case where we do nothing
        // Only lines containing a self-closing "/>" are rewritten
        if (($position = mb_strpos($theRecord, mb_convert_encoding('/>', "UTF-16", "UTF-8"), 0, "UTF-16")) !== false) {
            echo ("Correcting... $theRecord");
            // Tag name starts just after the first "<"
            $tagStart = mb_strpos($theRecord, mb_convert_encoding('<', "UTF-16", "UTF-8"), 0, "UTF-16")+1;
            // The extra -1 drops the single character immediately before "/>"
            $tagLength = $position-$tagStart-1;
            // Value is everything after "/>"; the final -1 drops the trailing LF
            $valueStart = $position+2;
            $valueLength = (mb_strlen($theRecord, "UTF-16")-($position+2))-1;
            //echo ("Total: ".mb_strlen($theRecord, "UTF-16")." POS: $position TS: $tagStart TL: $tagLength VS: $valueStart VL: $valueLength\n");
            $mbTag = mb_substr($theRecord, $tagStart, $tagLength, "UTF-16");
            $mbValue = mb_substr($theRecord, $valueStart, $valueLength, "UTF-16");
            //echo ("Pieces. mbtag: -->$mbTag<-- mbval: -->$mbValue<--\n");
            // Rebuild the line as <tag>value</tag>, keeping everything in UTF-16
            $reconstructedString16 = mb_convert_encoding("<", "UTF-16", "UTF-8").
                $mbTag.
                mb_convert_encoding(">", "UTF-16", "UTF-8").
                $mbValue.
                mb_convert_encoding("</", "UTF-16", "UTF-8").
                $mbTag.
                mb_convert_encoding(">\n", "UTF-16", "UTF-8");
        }
        fwrite($theOutput, $reconstructedString16);
        //echo ("writing string: -->$reconstructedString16<-- length: ".mb_strlen($reconstructedString16, "UTF-16")."\n");
    }
    fclose($theFile);
    fclose($theOutput);
}
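For anyone who wants something less hackish, the same per-line rewrite can be sketched with a regex that captures whatever tag name appears, so no list of tags is needed. This is only a sketch, not what I actually ran: expandSelfClosingLine() is a made-up helper name, and it assumes each affected line looks like `<tag />value` with no attributes and that the mbstring extension is available. Unlike the code above, it leaves any text before the `<` in place.

```php
<?php
// Hypothetical helper: turn "<tag />value" into "<tag>value</tag>" for one
// UTF-16 line, without knowing the tag names in advance.
function expandSelfClosingLine($line16, $encoding = 'UTF-16')
{
    // Work in UTF-8 so an ordinary PCRE pattern can be used.
    $line = mb_convert_encoding($line16, 'UTF-8', $encoding);

    // Group 1: the tag name (no whitespace, "/", "<" or ">").
    // Group 2: the rest of the line, excluding the line ending.
    $fixed = preg_replace('~<([^\s/<>]+)\s*/>([^\r\n]*)~u', '<$1>$2</$1>', $line, 1);

    // preg_replace() returns null on error; fall back to the original line.
    if ($fixed === null) {
        return $line16;
    }

    return mb_convert_encoding($fixed, $encoding, 'UTF-8');
}

// Usage: one sample line, round-tripped through UTF-16.
$sample = mb_convert_encoding("<name />Bob\n", 'UTF-16', 'UTF-8');
$result = mb_convert_encoding(expandSelfClosingLine($sample), 'UTF-8', 'UTF-16');
echo $result; // <name>Bob</name>
```

Lines that contain no self-closing tag pass through unchanged, so the function could be applied to every record read in the loop.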
To answer the questions about frequency, size, and so forth: the files in question are being created at 75 MB a shot, every 2-3 minutes, and that will continue until I can get an updated version of the app that generates them, so manual processing is not an option.