Spork Boards

Sporkers Helping Sporkers

tomierna (Admin) – December 07, 2007 08:51PM Reply Quote
Well, why not? This is a community, after all.

porruka (Admin) – August 14, 2012 07:55AM Reply Quote
I'm in need of some unix-based textfile processing foo. It's probably one of those easy things, but if someone has this lying around, it would certainly save me some pain and suffering.

I have an application that is creating incorrect XML and while the app is being fixed, it will continue to generate data files in the incorrect format.

The issue is that it is using self-closing tags for elements that actually have data.

I need a way to parse the file for

< tag />data that may include spaces and so forth

and make it

< tag >data that may include spaces and so forth< /tag >

(with spaces in the tag element itself here just for presentation purposes).

Thoughts?

BTW, in case anyone ever is curious, yes,... yes, I hate XML.

YDD – August 14, 2012 11:22AM Reply Quote
Perl - but how is the end of the "data which may include spaces and so forth" delimited?

John Willoughby – August 14, 2012 11:53AM Reply Quote
Homo Sapiens Sedentarius
My question, too.

Cloudscout – August 14, 2012 01:09PM Reply Quote
˙pɹɐoqʎǝʞ ʎɯ ɥʇıʍ ƃuoɹʍ ƃuıɥʇǝɯos sı ǝɹǝɥʇ ʞuıɥʇ ı ?ɹǝʇndɯoɔ ʎɯ ɥʇıʍ ǝɯ dlǝɥ ǝuoǝɯos uɐɔ
If the data looks like this:

<tag />This is some data<tag />This is some more data<tag />Yep, this is data, too

And you want it to look like this:

<tag>This is some data</tag><tag>This is some more data</tag><tag>Yep, this is data, too</tag>

Then the best solution depends on how often you will need to do this and how large the data file will be.

If you're only going to do this a handful of times, you could just use a text editor and do a find/replace:

Find: <tag />
Replace With: </tag><tag>

Then manually remove the leading </tag> from the first result and add a closing </tag> after the final result.

If you're going to do this on very large files or if you're going to do it frequently, you could automate the find/replace part with sed (the backslashes escape the / characters, which sed would otherwise read as its delimiter; inside single quotes the < and > need no escaping, and in GNU sed \< and \> would match word boundaries instead):

sed -e 's/<tag \/>/<\/tag><tag>/g' input.xml > output.xml

You would still need to manually edit the first and last result, though. With some awk trickery, you could get it to recognize the first result and handle it properly but there's no good way to determine the arbitrary end of the final result in order to add the appropriate </tag> to the end of it... at least not with the limited assumptions I'm operating under here. :)
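
With one extra assumption, namely that each run of data ends at the next < or at the end of the line (the thread never pins this down, so treat it as a guess), the first/last-result problem goes away and the whole thing fits in one regex. A minimal Python sketch; close_known_tag is a made-up name for illustration only:

```python
import re


def close_known_tag(text: str, tag: str) -> str:
    """Turn '<tag />data<tag />more' into '<tag>data</tag><tag>more</tag>'.

    Assumption: the data belonging to a tag stops at the next '<'
    or at the end of the line, whichever comes first.
    """
    # Capture everything after '<tag />' up to the next '<' or newline.
    pattern = re.compile(rf"<{re.escape(tag)} />([^<\n]*)")
    return pattern.sub(lambda m: f"<{tag}>{m.group(1)}</{tag}>", text)


if __name__ == "__main__":
    sample = "<tag />This is some data<tag />This is some more data"
    print(close_known_tag(sample, "tag"))
```

This still needs the tag name up front, so it only helps when the list of tags is known.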

johnny k – August 14, 2012 02:24PM Reply Quote
BeautifulSoup might help, again, depending on your structure. It's a Python library that can whip into shape bad HTML/XML. Some cases will just take 3 lines of code.

porruka (Admin) – August 14, 2012 02:39PM Reply Quote
Thanks everyone; I wound up making an ugly parser (even uglier than I originally expected).

The lines were LF delimited, thankfully, but were UTF-16 unthankfully.

CS's sed command was generally where I was hoping someone could send me, except that relies on known tags. I don't actually know the complete list of possible tags in the files.

If anyone wants the code (you don't, trust me) here it is, in all its hackish glory (PHP >= 5):

$theConvertedFilename = $filename."_converted";
$theOutput = fopen($theConvertedFilename, "w");
$theFile = fopen($filename, "r");
if ($theFile) {
    // fgets() returns false at end of file, which ends the loop
    while (($theRecord = fgets($theFile)) !== false) {
        $reconstructedString16 = $theRecord;  // default for the case where we do nothing

        // Only touch records that contain a self-closing "/>"
        if (($position = mb_strpos($theRecord, mb_convert_encoding('/>', "UTF-16", "UTF-8"), 0, "UTF-16")) !== false) {

            echo ("Correcting... $theRecord");
            // The tag name runs from just after "<" to the space before "/>"
            $tagStart = mb_strpos($theRecord, mb_convert_encoding('<', "UTF-16", "UTF-8"), 0, "UTF-16")+1;
            $tagLength = $position-$tagStart-1;
            // The value runs from just after "/>" to the trailing LF (exclusive)
            $valueStart = $position+2;
            $valueLength = (mb_strlen($theRecord, "UTF-16")-($position+2))-1;
            //echo ("Total: ".mb_strlen($theRecord, "UTF-16")." POS: $position TS: $tagStart TL: $tagLength VS: $valueStart VL: $valueLength\n");

            $mbTag = mb_substr($theRecord, $tagStart, $tagLength, "UTF-16");
            $mbValue = mb_substr($theRecord, $valueStart, $valueLength, "UTF-16");

            //echo ("Pieces. mbtag: -->$mbTag<-- mbval: -->$mbValue<--\n");

            // Rebuild the record as <tag>value</tag> plus LF, all in UTF-16
            $reconstructedString16 = mb_convert_encoding("<", "UTF-16", "UTF-8").
                                    $mbTag.
                                    mb_convert_encoding(">", "UTF-16", "UTF-8").
                                    $mbValue.
                                    mb_convert_encoding("</", "UTF-16", "UTF-8").
                                    $mbTag.
                                    mb_convert_encoding(">\n", "UTF-16", "UTF-8");
        }

        fwrite($theOutput, $reconstructedString16);
        //echo ("writing string: -->$reconstructedString16<-- length: ".mb_strlen($reconstructedString16, "UTF-16")."\n");
    }

    fclose($theFile);
    fclose($theOutput);
}

To answer the questions about how often, size, and so forth: the files in question are being created at 75MB a shot, every 2-3 minutes, until I can get an updated version of the app that generates them, so manual processing is not an option.
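
For comparison, roughly the same line-by-line idea can be sketched in Python without knowing the tag names in advance. This is only a sketch under some assumptions: one element per LF-terminated line, a single space before the />, no attributes on the broken tags, and a BOM (or native byte order) so the "utf-16" codec can read the file. fix_record and convert_file are hypothetical names, not anything from the app. Unlike the PHP above, it leaves a self-closing tag alone when nothing follows it on the line.

```python
import re

# One whole record of the broken form: "<name />value" (name = any tag)
SELF_CLOSED = re.compile(r"<([^\s<>/]+) />(.*)")


def fix_record(line: str) -> str:
    """Rewrite '<name />value' as '<name>value</name>'; pass other lines through."""
    body = line.rstrip("\n")
    ending = line[len(body):]          # keep the original LF, if there was one
    match = SELF_CLOSED.fullmatch(body)
    if match is None or match.group(2) == "":
        return line                    # well-formed or truly empty element
    name, value = match.groups()
    return f"<{name}>{value}</{name}>{ending}"


def convert_file(src: str, dst: str, encoding: str = "utf-16") -> None:
    """Stream src to dst one line at a time, fixing records as they go."""
    # newline="\n" splits on LF only and writes line endings untranslated
    with open(src, encoding=encoding, newline="\n") as fin, \
         open(dst, "w", encoding=encoding, newline="\n") as fout:
        for line in fin:
            fout.write(fix_record(line))
```

Usage would be something like convert_file("data.xml", "data.xml_converted"). Since it reads one line at a time, a 75MB file never has to sit in memory all at once.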

El Jeffe – August 19, 2012 11:48AM Reply Quote
What a journey.
Any Roomba owners? Opinions on any and all details?

johnny k – August 19, 2012 02:40PM Reply Quote
We have the 532 pet series, bought on a decent sale on Bestbuy.com. Thank goodness for the increased capacity of the pet series, because it does require frequent emptying thanks to a very productive cat. Currently the battery is run down (maybe because it hadn't been charged in a while), but I'm ambivalent about the Roomba. If the capacity was even bigger and it didn't also require pulling cat hair out of the bristles and crevices all the time (can't be helped, I suppose), it'd be much better. It would get stuck on the big thresholds between rooms in our old place (which won't be a problem now), but otherwise pretty good about navigation. But it does a great job of keeping the house at a baseline level of clean, and its busy scooting around is just entertaining. I'd recommend getting one with scheduling so you don't have to remember to turn it loose.

El Jeffe – August 20, 2012 01:52PM Reply Quote
What a journey.
Any reasonably priced 8 port gigabit switch better than any other? If not, any reseller better than any other?
What about this Cisco/Linksys refurb? Yay/nay? http://tinyurl.com/9mkst3q

ddt – August 23, 2012 05:33PM Reply Quote
I'll share the FF Chartwell OpenType set with anyone who can help me figure out how to use it in, say, Illustrator, Pages, InDesign (possibly in that order of preference). I try to follow this (https://www.fontfont.com/how-to-use-ff-chartwell) and the font is already all chart-y, not text.

ddt

ddt – August 26, 2012 09:30AM Reply Quote
DPBD: So this meetup group I co-host (http://www.meetup.com/NewsHack-Study-Group/) wants to have a wiki. I know there are hosted options such as wikia.com -- but we have server space and we'd like to use something open-source and easy to set up (read as: none of us know crap about php 'n' shit). Recommendations?

ddt

johnny k – August 26, 2012 09:45AM Reply Quote
Well, you could try the software that Wikia/Wikipedia runs on: MediaWiki
It's PHP but seems easy enough to install. You can only get so easy before you're talking about paid/hosted.

John Willoughby – August 26, 2012 02:59PM Reply Quote
Homo Sapiens Sedentarius
Quote
johnny k
You can only get so easy before you're talking about paid/hosted.

Applies to wikis AND women.

El Jeffe – August 26, 2012 03:25PM Reply Quote
What a journey.
we put the HO in hosted?

ddt – August 26, 2012 06:54PM Reply Quote
I put the d'oh in it.

ddt

Mokers (Moderator) – August 27, 2012 04:53PM Reply Quote
Formerly Remy Martin
MediaWiki is fairly easy to set up, but if you are looking for something even simpler, you can try DokuWiki. It uses PHP but has a flat-file backend, so no messing with MySQL.

https://www.dokuwiki.org/features

El Jeffe – September 01, 2012 01:53AM Reply Quote
What a journey.
What's y'all's take on the MacBook Air Apple sells refurbished for $829?
http://store.apple.com/us/product/FC969LL/A MacBook Air 1.6GHz dual-core Intel Core i5

They have one for $750, but this model's bump in RAM/SSD seems worth it.

I am finding that TeamBill (aka my family) might benefit from a bit more movable/portable Mac solution.
But I am also tempted by an upcoming iPad Mini iff (if and only if) it has cellular-internet that one can buy per month like the daddy iPads do. (otherwise it's just a big iPod Touch).

John Willoughby – September 01, 2012 08:08AM Reply Quote
Homo Sapiens Sedentarius
In my MBA experience, RAM is very important. SSD is for me, because I am insane and put Boot Camp on them.

El Jeffe – September 04, 2012 01:30AM Reply Quote
What a journey.
$679 now at iStore
http://tinyurl.com/bthh663

John Willoughby – September 04, 2012 07:56AM Reply Quote
Homo Sapiens Sedentarius
Can any of you recommend a dirt-simple way of getting decent audio into a Mac? My wife sings, and often wants to capture the audio. I've gone through some USB mikes, and some which require external boxes, but those solutions suffered from my complete lack of interest and knowledge of audio technology. My wife, also, is not interested in software more complicated than iMovie, so decent software UI is mandatory. (She won't even use Garage Band, though that is probably because I don't understand it well enough to explain it.) I guess it doesn't have to be a Mac thing, if there is a good iOS solution.

The fewest moving pieces, the simplest UI, the lowest price are my objectives. I've got an anniversary coming up, and I thought I'd try to get her some technology that she'd actually use.
