RegExp for a bash script that removes html garbage from a string

Answer #1

Removing HTML tags (along with script and style content) is not always easy with a regex, but since you are looking for a bash way, you can use a simple trick: render the file with a text browser (lynx, links, w3m). For example:

lynx -dump input.html > output.txt

Or you can use the command-line tool xidel with an XPath query that selects every text node except those inside style and script elements:

xidel ./input.html --extract "//text()[not(parent::style|parent::script)]"

You can try with a regex too, but it is less safe:

sed 's/\|\|<[^>]*>//g' input.html

(note that this regex fails with something like: CONTENT )
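As a minimal sketch of this approach, assuming the first two alternatives are meant to match whole script and style elements: the two element patterns and the strip_html name below are assumptions for illustration, not the exact command above, and the \| alternation needs GNU sed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove <script>/<style> elements, then any remaining tag.
# The two element patterns are an assumption; \| alternation needs GNU sed.
# sed works line by line, so elements spanning several lines are not handled.
strip_html() {
  sed 's/<script[^>]*>.*<\/script>\|<style[^>]*>.*<\/style>\|<[^>]*>//g'
}

# Simple case: the script element is removed together with its content.
printf '%s\n' '<p>Hello <b>world</b><script>var x = 1;</script></p>' | strip_html
# prints: Hello world

# Failure case: the greedy .* runs to the last </script> on the line,
# so the text between the two script elements is removed as well.
printf '%s\n' '<script>a()</script> CONTENT <script>b()</script>' | strip_html
# prints an empty line
```

The second call shows why the regex route is fragile: anything sitting between two script elements on the same line is lost.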

Or you can use this regex, which is a little safer in an HTML context:

sed -r 's/||<[^>]*>//g' input.html

You can easily preserve tags like "a" and "strong" by adding a capturing group for them before the last alternative (i.e. |<[^>]*>), and then changing the replacement to $3 (written \3 in sed), since it is the third group of the pattern.
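A minimal sketch of the capturing-group idea, under stated assumptions: the script and style patterns and the keep_tags name are hypothetical, and because this simplified pattern has no earlier groups, the preserved tag is group 1 here (\1) rather than group 3. It uses sed -E (extended regex) and relies on sed substituting an empty string for a group that did not take part in the match.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip all tags except <a ...>, </a>, <strong>, </strong>.
# Group 1 captures the tags to keep; every match is replaced by \1, which is
# empty whenever one of the other alternatives matched instead.
keep_tags() {
  sed -E 's/<script[^>]*>.*<\/script>|<style[^>]*>.*<\/style>|(<\/?(a|strong)([[:space:]][^>]*)?>)|<[^>]*>/\1/g'
}

printf '%s\n' '<p>See <a href="x.html">this</a> and <strong>that</strong>.</p>' | keep_tags
```

To preserve other tags, extend the (a|strong) alternation; the backreference number only changes if groups are added in front of it.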


© 2022 CodeForDev.com -