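How do I validate a large number of files through search and replace?

I'm currently validating a client's HTML source, and I'm getting a lot of validation errors for img and input tags that don't have the Omittag. I would do it manually, but this client literally has thousands of files, with many instances where the tag is not closed.

This client has already validated some of the img tags (for whatever reason).

I'm just wondering whether there is a Unix command I could run that checks whether a tag is missing the Omittag and, if so, adds it.

I have done simple search and replace before with the following command: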

find . \! -path '*.svn*' -type f -exec sed -i -n '1h;1!H;${;g;s/<b>/<strong>/g;p}' {} \;

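But never anything this large. Any help would be greatly appreciated.

Please see my comment on the question above.

Assuming you're using GNU sed and you're trying to add the trailing / to your tags to make XML-compliant <img /> and <input />, replace the sed expression in your command with this one and it should do the job: '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}'

Here it is on a simple test file (SO's colorizer does wacky things):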

    $ cat test.html
    This is an <img tag> without closing slash.
    Here is an <img tag /> with closing slash.
    This is an <input tag > without closing slash.
    And here one <input attrib="1"
    > that spans multiple lines.
    Finally one <input attrib="1" /> with closing slash.

    $ sed -n '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}' test.html
    This is an <img tag/> without closing slash.
    Here is an <img tag /> with closing slash.
    This is an <input tag /> without closing slash.
    And here one <input attrib="1"
    /> that spans multiple lines.
    Finally one <input attrib="1" /> with closing slash.

The GNU sed documentation covers the regular expression syntax, as well as how the buffering works to allow multi-line search/replace.
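For readers without GNU sed at hand, the buffering idea can be sketched in Python. The `1h;1!H;${;g;...}` part of the sed expression collects every line into the hold space so that the substitution runs once over the whole file; reading the file into a single string has the same effect. The name `close_tags` is an assumption made for this illustration, not something from the original command.

```python
import re

# Same pattern as the sed expression, in Python syntax:
# "img" or "input", a space, any run of non-">" characters whose last
# character is not "/", then ">". Like the sed version, it does not
# touch tags that have no attributes at all, such as a bare <img>.
UNCLOSED = re.compile(r'(img|input)( [^>]*[^/])>')

def close_tags(text):
    """Add the trailing slash to img/input tags that lack one (hypothetical helper)."""
    return UNCLOSED.sub(r'\1\2/>', text)

# The whole document is one string, so a tag broken across lines still matches.
sample = 'An <img tag> here, an <input attrib="1"\n> there, and <img tag /> untouched.'
print(close_tags(sample))
```

Because a negated character class such as `[^>]` also matches a newline, no extra flag is needed for the tag that spans two lines.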

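Alternatively, you could use something like Tidy, which is designed to clean up bad HTML. That is what I would do if I were doing anything more complicated than a couple of simple search/replaces. Tidy's options get complicated quickly, so it is usually better to write a script in your scripting language of choice (Python, Perl) that calls libtidy and sets the options you need.

Try this. It will go through your files, make a backup of each file (perl's -i operator), and replace <img> and <input> tags with <img /> and <input />.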

 find . \! -path '*.svn*' -type f -exec perl -pi.orig -e 's{ ( <(?:img|input)\b ([^>]*?) ) \ ?/?> }{$1\ />}sgxi' {} \;

Given the input:

    <img>
    <img/>
    <img src="..">
    <img src="" >
    <input>
    <input/>
    <input id="..">
    <input id="" >

It changes the file to:

    <img />
    <img />
    <img src=".." />
    <img src="" />
    <input />
    <input />
    <input id=".." />
    <input id="" />

Here's what the regex does:

    s{(<(?:img|input)\b ([^>]*?)) # capture "<img" or "<input" followed by non-">" chars
      \ ?/?>}                     # optional space, optional slash, followed by ">"
     {$1\ />}sgxi                 # replace with: captured text, plus " />"
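If Perl is not available, roughly the same substitution can be sketched in Python. This is an illustration under assumptions rather than a drop-in replacement: the helper names `self_close` and `fix_file` are made up here, and the files are assumed to decode with the platform's default encoding. Unlike `perl -p`, which works one line at a time, the sketch reads each file whole, so a tag split across lines is rewritten too.

```python
import re
import shutil
from pathlib import Path

# Python spelling of the Perl pattern: re.VERBOSE plays the role of /x,
# re.IGNORECASE of /i, and re.DOTALL of /s.
TAG = re.compile(r'''
    ( <(?:img|input)\b   # "<img" or "<input" as a whole word
      [^>]*?             # attributes: as few non-">" characters as possible
    )
    \ ?/?>               # optional space, optional slash, then ">"
''', re.VERBOSE | re.IGNORECASE | re.DOTALL)

def self_close(html):
    """Rewrite every img/input tag so that it ends in " />"."""
    return TAG.sub(r'\1 />', html)

def fix_file(path):
    """Copy path to path + ".orig", then rewrite it in place (hypothetical helper)."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + '.orig'))
    # Assumption: the file decodes with the platform default encoding.
    path.write_text(self_close(path.read_text()))
```

To cover a whole tree, one option is to loop over `Path('.').rglob('*.html')`, skip any path whose `parts` contain `.svn`, and call `fix_file` on the rest. Trying it on a copy of the tree first is a sensible precaution.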