sed howto: stripping slack from webserver files, per-line block transformations and speed optimizations

Submitted by olecom on May 23, 2008 - 1:37pm.

I claim to be a `sed` lover, not a guru (generally a bad term), not a geek. Since seders is a closed private group (stupid, huh), i'd like to post something interesting and useful here.

The tutorials i know of, and the ones i could find on google, all have silly examples and suck. Let's do a HOWTO for real-life stuff.

I left some comments buried deep somewhere on this site about useful optimizations and tricks; now here's another one, and more.

In this issue:
* changing all but last occurrence (persson)
* stripping slack from all php, htm*, css, js, xml on a production-level webserver (Julius Thyssen)
* per line transformations of blocks (David Esterkin, alhajaj), or i'm the best (-:

Message-ID: <8499950a0804221341m638f53b5t770cbc594433d437@mail.gmail.com>
Date: Tue, 22 Apr 2008 21:41:44 +0100
From: "Oleg Verych" <olecom@gmail.com>
To: "sed users" <sed-users@yahoogroups.com>
Subject: Re: Changing all but last occurrence

persson @ Tue, Apr 22, 2008 at 6:18 PM:

>  > Why do you need this? And why you don't try to code anything yourself?
>
>  It's precisely because I tried and can't see a way to accomplish the goal
>  using sed alone that I'm asking here. I already know how to do those
>  things using other tools or sed + other tools, but not with sed alone,
>  so I'm asking whether that is possible at all or I'm just missing
>  something.

Anyway, a working showcase of what you want (if you have one)
is always better than words and vague descriptions.

>  > >  - match/change all but last n occurrences of RE *in a file*, RE
>  > >    occurs at most once per input line;
>  > [...]
>  > > many times per input line;
>  >
>  > If operating on the whole file, the number of per-line occurrences
>  > doesn't matter.
>
>  Can you elaborate on this?

Here's what i think about this task. Curiosity can elaborate it.

ftp://flower.upol.cz/dts/sed0000_var/all_but.sed.sh

$ sh all_but.sed.sh
$ sh all_but.sed.sh a 22
$ sh all_but.sed.sh l
$ sh all_but.sed.sh l 22
--
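
A minimal sketch of one way to do it with sed alone, for n = 1. This is not the linked all_but.sed.sh; the function name and the fixed strings foo/bar are placeholders for illustration. It slurps the whole input into the pattern space, which is why the number of occurrences per line stops mattering, and then keeps rewriting the first match that still has another match somewhere after it. It assumes the replacement does not itself match the RE and that the file fits in memory.

```shell
all_but_last() {
    sed '
# slurp the whole input into the pattern space
:slurp
$!{
N
b slurp
}
# rewrite the first "foo" that is still followed by another "foo";
# the loop stops when only the last occurrence is left
:again
s/foo\(.*foo\)/bar\1/
t again
'
}

printf 'foo 1\nfoo 2 foo\nfoo 3\n' | all_but_last
```

Every rewrite rescans the pattern space, so this is a showcase of the idea rather than something tuned for big files.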

Stripping slack from all php, htm*, js, xml on a webserver

May 23, 2008 - 1:51pm

Don't want to read? Here's the code: ftp://flower.upol.cz/dts/sed0000_var/strip_html.sh

From: "Julius Thyssen"
Date: Fri, 02 May 2008 20:24:54 -0000
Subject: Stripping slack from all php, htm*, js, xml on a webserver

Hi,

In my search for a way to minimize all (production level)
static code on some webservers, I'm trying to get all
wasted space out of those files.
[...]
Message-ID: <8499950a0805030654s3ea90783mbb3d16a0725bd414@mail.gmail.com>
Date: Sat, 3 May 2008 14:54:18 +0100
From: "Oleg Verych"
To: "Eric Pement"
Subject: Re: Stripping slack from all php, htm*, js, xml on a webserver

Eric Pement @ Sat, May 3, 2008 at 2:55 AM:
> Replying to Julius, who on 2 May 2008 at 20:24, wrote:
[]
>  > in short:  All useless garbage.
>
>    I don't believe whitespace indentation or blank lines in source
>  code is "useless garbage." I have maintained code on production
>  systems for some very large, successful corporations.

Think of it as an addition to `gzip`. It's not source, but a storage/transport
format.

[...]
Message-ID: <8499950a0805102130s280beb13xbfb2d2e42c6f5c23@mail.gmail.com>
Date: Sun, 11 May 2008 05:30:18 +0100
From: "Oleg Verych"
To: "Julius Thyssen"
Subject: Re: Stripping slack from all php, htm*, js, xml on a webserver

> * cjszip.sh C and javaScript code compressor written on sed(1)
> http://sourceforge.net/tracker/download.php?group_id=5757&atid=408691&fi...
>
> (this one may break on some javascript without ';' in some places AFAIR)
>
> * remove C comments
> ftp://flower.upol.cz/dts/sed0000_var/hacks/strip-c.sh

As can be seen, C/JavaScript without good coding style is generally not
that easy to handle. Think about unmatched comment symbols inside
quotes and vice versa. There's a lot more "fun" with C++ comments, but if
conditions are not too bad, and the output can be validated to be OK, then
something simple can be done, of course.

For now it handles plain HTML only (no CSS or JavaScript). The only content
that is not touched is <pre>...</pre>. Note that this is a block element, while
<tt>...</tt> is whitespace-stripped and reformatted by definition.

ftp://flower.upol.cz/dts/sed0000_var/strip_html.sh
[]
Feedback and testing are welcome.
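
A minimal sketch of the idea, not the linked strip_html.sh. The function name is made up, and it assumes that <pre> and </pre> are lowercase and sit on different lines, and that nothing else in the file (script, textarea) depends on exact whitespace.

```shell
strip_html() {
    sed '
# leave <pre> blocks alone: print their lines as they are
/<pre[ >]/,/<\/pre>/b
# strip leading and trailing whitespace
s/^[[:space:]]*//
s/[[:space:]]*$//
# drop lines that are empty now
/^$/d
# squeeze inner runs of whitespace into one space
s/[[:space:]][[:space:]]*/ /g
'
}

printf '  <p>  a   b </p>  \n\n<pre>\n  keep   me\n</pre>\n' | strip_html
```

Line breaks are kept on purpose here; joining lines safely needs more knowledge about the markup than a few substitutions have.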

Some rants.

From: "Julius Thyssen"
Date: Sat, 10 May 2008 23:15:40 -0000
Subject: Re: Stripping slack from all php, htm*, js, xml on a webserver

"Eric Pement" <pemente@...> wrote:
> I don't believe whitespace indentation or blank lines in source 
> code is "useless garbage." I have maintained code on production 
> systems for some very large, successful corporations.

Then you have wasted away many millions of bytes and thus
energy and time, probably worldwide.
I have maintained servers *and* code for some very large
successful government agencies. To be perfectly honest,
I'm fed up with your views on this, it wastes my time,
and is pure ignorance.
You assume I'm some kind of idiot for wanting this?
I'm so not interested in arguments over this. I will do this
whether you like it or not. And I will release the finished
script as well, since many people I know will love me for it.

I've once created this in C++, together with my cousin,
for use in Windows;
http://jthz.com/puter/software/HTML-clean/htmclean.cpp
and it was welcomed by many.

In fact, it's ridiculous that I couldn't find a ready-made
shell script for this anywhere. All I get are comments
from people complaining I should not remove their licenses,
info-blocks or guidance comments from their code.
Who needs these in production level code? Nobody I know.
People viewing pages on the internet aren't interested
in reading the source, and so they're useless.

> Indentation, including blank lines, may be extraneous to script
> operation, but it is essential to readability and maintenance,

There is none of that where I'm using it for. Nobody reads
code on served content, it runs and it is being served, period.
Plus, if you're even remotely smart, you'll keep an unstripped
copy for that on the same server (which I ALWAYS have).

I have also decided to go with Perl and RegExp again
for these tasks, since it still seems much easier.
Thanks for the hints anyway.


Julius
Message-ID: <8499950a0805101815o52693f62iec9ee07f4b49c15d@mail.gmail.com>
Date: Sun, 11 May 2008 02:15:20 +0100
From: "Oleg Verych"
To: "Julius Thyssen"
Subject: Re: Stripping slack from all php, htm*, js, xml on a webserver

> Then you have wasted away many millions of bytes and thus
> energy and time, probably worldwide.

Ouu, c'mon, relax!

> I'm so not interested in arguments over this. I will do this
> whether you like it or not. And I will release the finished
> script as well, since many people I know will love me for it.

Still, your testing of the `sed` version would be welcome here.

> I've once created this in C++, together with my cousin,
> for use in Windows;
> http://jthz.com/puter/software/HTML-clean/htmclean.cpp
> and it was welcomed by many.

'97? Cool!

> In fact, it's ridiculous that I couldn't find a ready-made
> shell script for this anywhere.

I actually have two non-trivial ones. Now i would write them in a more
elegant and knowledgeable way, but anyway:

* cjszip.sh C and javaScript code compressor written on sed(1)
http://sourceforge.net/tracker/download.php?group_id=5757&atid=408691&fi...

(this one may break on some javascript without ';' in some places AFAIR)

* remove C comments
ftp://flower.upol.cz/dts/sed0000_var/hacks/strip-c.sh

> All I get are comments from people complaining I should
> not remove their licenses, info-blocks or guidance comments
> from their code. Who needs these in production level code?
> Nobody I know. People viewing pages on the internet aren't
> interested in reading the source, and so they're useless.

Calm down, please. See, i understand and did that stuff as well.
But i also have no support from anybody. So, let's do it now.

You with `perl`, i with `sed`, and let's see which is faster, for
example.

-- 
sed 'sed && sh + olecom = love'  <<  ''
-o--=O`C
 #oo'L O
<___=E M

Something that i agree with.

Date: Sun, 04 May 2008 20:13:23 -0000
From: "Julius Thyssen"
To: "Oleg Verych"
Subject: Re: Stripping slack from all php, htm*, js, xml on a webserver
Message-ID: <fvl5d3+sudf@eGroups.com>
User-Agent: eGroups-EW/0.82

"Oleg Verych" <olecom@...> wrote:

> Think of it as an addition to `gzip`. It's not a source,
> but storage/transport format.

Exactly. I always keep a backup of the original files.
The internet viewer (i.e. downloader) does not care
to see license blocks or comments or whitespace
in .css, .htm*, xml, php, js etc.
(That is why AdBlockPlus exists for Firefox, etc.)

And I see messages are not coming through in this group.
Hate when that happens. If you don't want to do the time,
don't do the crime. I'm disgusted by censoring moderators.
Which is why I'm unsubscribing straight after posting this;

Good luck with SED to you all, but if its only mailinglist
is being moderated this way, it's not something I can use.

Bye,

Julius

multi-line and not multiline

May 23, 2008 - 2:02pm

ftp://flower.upol.cz/dts/sed0000_var/blocks.sed.sh

Message-ID: <8499950a0805211159q30f83efxbcb1e3178956edee@mail.gmail.com>
Date: Wed, 21 May 2008 19:59:11 +0100
From: "Oleg Verych"
Subject: multi-line and not multiline

-- input --

#100
ADD some/file/path
MODIFY diff/file/path
BLANK
#104
MODIFY /another/modified/file/path
DEL /a/deleted/file/path
MODIFY /one/more/file/path
BLANK
...

-- output --

100 ADD some/file/path
100 MODIFY diff/file/path
104 MODIFY /another/modified/file/path
104 DEL /a/deleted/file/path
104 MODIFY /one/more/file/path

== proposition ==

Multi-line processing doesn't mean one cannot make changes
to blocks line by line. This doesn't require `sed`'s multi-line
tools (N, P). You just need to save the number in the hold buffer at
block start and insert it on the following lines, line by line, until
the end.
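
The proposition as a minimal sketch, typed for illustration only; the function name is made up. It assumes every block starts with a "#number" line and ends with a BLANK line.

```shell
number_blocks() {
    sed '
# block start: strip "#", remember the number, print nothing
/^#/{
s/^#//
h
d
}
# block end marker
/^BLANK$/d
# any other line: append the saved number, then move it to the front
G
s/^\(.*\)\n\(.*\)$/\2 \1/
'
}

number_blocks <<'EOF'
#100
ADD some/file/path
MODIFY diff/file/path
BLANK
#104
DEL /a/deleted/file/path
BLANK
EOF
```

Each line is handled in its own cycle; the only state carried between lines is the number in the hold buffer.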

[...]

-- input --

void methodA()
{
 doIt
}

void methodB()
{
 doSomeThingkElse
}

-- output --

void methodA()
{
 throw Exception
 doIt
}

void methodB()
{
 throw Exception
 doSomeThingkElse
}

== proposition ==

Here, too, the work is done on a block. Note: this does not
mean slurping the whole file and doing the job only on the
final line. Just put your condition, with the needed text
processing, between block start and end. Something like

sed '
/^{/,/^}/{
/condition/s-RE-placement-
}'

However, if the condition is a line number inside a block,
then something else is required.
For the example above, i.e. the first line:

sed '
/^{/{
p
i\
placement
d
}'

Or something like that.
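
A runnable variation of the same idea, as a sketch: it uses `a\` on the line that opens the block instead of the p, i\, d sequence. The function name is made up. How leading whitespace in the appended text is treated differs between sed implementations, so the indentation of the inserted line may need adjusting.

```shell
add_throw() {
    sed '
# after the line that opens a block, queue one extra line of output
/^{/a\
 throw Exception
'
}

printf 'void methodA()\n{\n doIt\n}\n' | add_throw
```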

NOTE: scripts were just typed in gmail.
-- 
sed 'sed && sh + olecom = love'  <<  ''
-o--=O`C
 #oo'L O
<___=E M
Message-ID: <8499950a0805220316t2a5c3770rbf712e5c57654f1e@mail.gmail.com>
Date: Thu, 22 May 2008 11:16:31 +0100
From: "Oleg Verych"
To: sed-users
Subject: Re: optimization Re: multi-line and not multiline

> And after something works, it's time for optimizations.

OK, thanks to gudermez, and now that i'm at a shell, an actual check and run
can be done. It turns out that the script that is both correct and optimized
actually uses N, but only for speed. My first two scripts were written
with wrong hopes about the hold buffer. Anyway, now it's correct and even
more optimized than i expected. Also, input to my script can be more
human-readable, i.e. there can be blank lines after blocks.
Finally, empty blocks are also handled as valid input.

== possible input  ==

#000
BLANK

#100
ADD some/file/path
MODIFY diff/file/path
BLANK

#104
MODIFY /another/modified/file/path
DEL /a/deleted/file/path
MODIFY /one/more/file/path
BLANK
#177
MODIFY /77another/modified/file/path
DEL /77a/deleted/file/path
MODIFY /77one/more/file/path
BLANK
~

== benchmark results ==

olecom@flower$ du -h blocks.txt
21M     blocks.txt
olecom@flower$ time sh blocks.sed.sh olecom <blocks.txt >/dev/null
olecom

real    0m1.674s
user    0m1.656s
sys     0m0.016s

olecom@flower$ time sh blocks.sed.sh gudermez <blocks.txt >/dev/null
gudermez

real    0m8.453s
user    0m8.441s
sys     0m0.016s

== script ==

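olecom(){
sed -n '
# block header: strip "#", keep the number in the hold space
/^#/{
s-#--
h
:_append
# pull the next line in after the number
N
# end of block (or an empty line): drop it and start the next cycle
/BLANK$/d
/\n$/d
# "number\nline" -> "number line", print, reload the number, repeat
s`\n` `
p
g
b_append
}'
}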

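gudermez(){
sed -e '
# block header: strip "#", save the number, print nothing
/^#[1-9][0-9]*$/{
s/.//
h
d
}
/^BLANK$/d
# append the saved number, then swap it to the front
G
s/^\(.*\)\n\(.*\)/\2 \1/
'
}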

echo "$1" >&2
$1
exit

== checking output ==

olecom@flower$ sh blocks.sed.sh gudermez <blocks.txt | sed '10q' >g
gudermez
olecom@flower$ sh blocks.sed.sh olecom <blocks.txt | sed '10q' >o
olecom
olecom@flower$ diff g o
olecom@flower$ sed '' <o
100 ADD some/file/path
100 MODIFY diff/file/path
104 MODIFY /another/modified/file/path
104 DEL /a/deleted/file/path
104 MODIFY /one/more/file/path
177 MODIFY /77another/modified/file/path
177 DEL /77a/deleted/file/path
177 MODIFY /77one/more/file/path
100 ADD some/file/path
100 MODIFY diff/file/path
olecom@flower$
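
How blocks.txt was built is not shown here. To repeat this kind of measurement, a file of the same shape can be generated by repeating the sample blocks. This is a sketch: `make_blocks` and its repetition count are made-up names, and the count needed to reach about 21M has to be found by trial.

```shell
make_blocks() {
    # $1: how many times to repeat the two sample blocks
    awk -v n="$1" 'BEGIN {
        for (i = 0; i < n; i++) {
            print "#100"
            print "ADD some/file/path"
            print "MODIFY diff/file/path"
            print "BLANK"
            print "#104"
            print "MODIFY /another/modified/file/path"
            print "DEL /a/deleted/file/path"
            print "MODIFY /one/more/file/path"
            print "BLANK"
        }
    }'
}

make_blocks 1
```

Redirect a larger run into a file, e.g. `make_blocks 100000 >blocks.txt`, and then time the scripts against it as above.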
-- 
sed 'sed && sh + olecom = love'  <<  ''
-o--=O`C
 #oo'L O
<___=E M

persson's results

May 23, 2008 - 7:26pm
Message-ID: <8499950a0805230940r7075a212o1e4ec1a706e275d5@mail.gmail.com>
Date: Fri, 23 May 2008 17:40:39 +0100
From: "Oleg Verych"
Subject: Re: optimization Re: multi-line and not multiline

Alright. Here we go.

== persson ==

olecom@flower:/tmp$ time <blocks.txt sed '/^BLANK$/d
    /^#[0-9]\{1,\}/{s/^#//;h;d}
    {G;s/\(.*\)\n\(.*\)/\2 \1/}' >/dev/null

real    0m8.406s
user    0m8.397s
sys     0m0.012s
olecom@flower$

== other benchmark results ==
>
> olecom@flower$ du -h blocks.txt
> 21M     blocks.txt
> olecom@flower$ time sh blocks.sed.sh olecom <blocks.txt >/dev/null
> olecom
>
> real    0m1.674s
> user    0m1.656s
> sys     0m0.016s
>
> olecom@flower$ time sh blocks.sed.sh gudermez <blocks.txt >/dev/null
> gudermez
>
> real    0m8.453s
> user    0m8.441s
> sys     0m0.016s
>