GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Davin Flatten
Just thought this might help someone out.  Thanks to M. Blapp for an
excellent SA plugin.  Optical Character Recognition (OCR) can be used to
nab those pesky spam messages that are hidden in GIF, JPEG, or PNG images.

Here is what I did to get the plugin running.

Test the components that the plugin uses first.
( Check out the documentation at
http://antispam.imp.ch/patches/patch-ocrtext for requirements. )

1. Copy a sample spam image to your SpamAssassin machine.
2. Use giftopnm, jpegtopnm, or pngtopnm to convert whatever type of
image you have to a PNM image, like so:
      giftopnm Xj105jQX.gif > Xj105jQX.pnm
3. Run gocr on the pnm file like so:
      gocr Xj105jQX.pnm

This should output some text with lots of garbage.  If you got this far
you should be ready to get the plugin going.
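
The manual test above can be wrapped in a small shell script. This is only a sketch: the helper names pick_converter and ocr_image are made up for illustration, and it assumes the netpbm converters and gocr are on your PATH.

```shell
#!/bin/sh
# Sketch only: pick_converter and ocr_image are hypothetical helper names.

# Map an image file name to the netpbm converter named in step 2.
pick_converter() {
    case "$1" in
        *.gif|*.GIF)                 echo giftopnm ;;
        *.jpg|*.jpeg|*.JPG|*.JPEG)   echo jpegtopnm ;;
        *.png|*.PNG)                 echo pngtopnm ;;
        *) echo "unsupported image type: $1" >&2; return 1 ;;
    esac
}

# Convert the image to PNM next to the original, then run gocr on it
# (steps 2 and 3).
ocr_image() {
    conv=$(pick_converter "$1") || return 1
    pnm="${1%.*}.pnm"
    "$conv" "$1" > "$pnm" && gocr "$pnm"
}

# Only do any work when an image is given on the command line.
if [ "$#" -gt 0 ]; then
    ocr_image "$1"
fi
```

Saved as, say, ocrtest.sh, it would be run as 'sh ocrtest.sh Xj105jQX.gif'. If it prints text (garbage included), the components the plugin depends on are working.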

1. cd to /etc/mail/spamassassin
2. download the patch file from:
http://antispam.imp.ch/patches/patch-ocrtext
3. type 'patch < patch-ocrtext'
   This will create two files in  your current directory called  
ocrtext.cf and ocrtext.pm
4. Edit v310.pre and add the following lines:

# OCR - performs Optical Character Recognition on spam images
#
loadplugin ocrtext /etc/mail/spamassassin/ocrtext.pm
loadplugin Mail::SpamAssassin::Timeout

5. Edit the ocrtext.cf file and change the following settings:

## This points to your gocr binary, not just its directory.  Try 'which gocr'.
gocr_path       /usr/local/bin/gocr
## This is JUST the directory holding your pnm binaries
## ( i.e. pngtopnm, giftopnm, jpegtopnm )
pnmtools_path   /usr/bin

6. Run spamassassin -D --lint  and check for errors.

If all went well, restart spamassassin or force it to reread its config
however you would on your system.
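
Since the two path settings are easy to get wrong (one is a binary, the other a directory), a quick sanity check before restarting may save some time. The sketch below assumes the simple "key value" layout shown in step 5; cf_value and check_ocr_paths are hypothetical helper names, not part of the plugin.

```shell
#!/bin/sh
# Sketch only: cf_value and check_ocr_paths are hypothetical helper names.

# Print the value of a "key value" setting; commented-out lines are skipped
# because their first field starts with '#'.
cf_value() {
    awk -v key="$1" '$1 == key { print $2; exit }' "$2"
}

# gocr_path must be an executable file; pnmtools_path must be a directory
# that holds the three converters.
check_ocr_paths() {
    cf=$1
    gocr=$(cf_value gocr_path "$cf")
    tools=$(cf_value pnmtools_path "$cf")
    [ -n "$gocr" ] && [ -x "$gocr" ] && [ ! -d "$gocr" ] || {
        echo "gocr_path is not an executable file: $gocr" >&2
        return 1
    }
    for t in giftopnm jpegtopnm pngtopnm; do
        [ -x "$tools/$t" ] || { echo "missing $tools/$t" >&2; return 1; }
    done
    echo "ocrtext paths look sane"
}

# Only do any work when a config file is given on the command line.
if [ "$#" -gt 0 ]; then
    check_ocr_paths "$1"
fi
```

For the setup described here it would be run against /etc/mail/spamassassin/ocrtext.cf.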

Then try typing something like 'tail -f /var/log/mail.log | grep
SPAMPIC_ALPHA'.  On a high-volume server you should see some rules
matching after a few minutes.  If so, then you are OCR'ing the images!
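
If you would rather have a summary than watch the log scroll by, something like the following could count the hits per rule. It is a sketch that assumes the rule names show up literally in the log and that your grep supports -o; count_spampic_hits is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: count_spampic_hits is a hypothetical helper name.

# Read a mail log on stdin and print "count rule" pairs for every
# SPAMPIC_* rule name found, most frequent first.
count_spampic_hits() {
    grep -o 'SPAMPIC_[A-Z0-9_]*' | sort | uniq -c | sort -rn
}

# Only do any work when a log file is given on the command line.
if [ "$#" -gt 0 ]; then
    count_spampic_hits < "$1"
fi
```

Run it as 'sh spampic-count.sh /var/log/mail.log', adjusting the log path for your system.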

Hope this helps!
Sincerely,
Davin Flatten

--
Davin Flatten
Unix Systems Administrator
University of Massachusetts
Amherst, MA 01003

Phone: 413-545-1580
Email: [hidden email]



Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Timothy C Litwiller
Results are very good in my preliminary tests.
These two spams look exactly the same in my email program except for the
subject line.

Here is a spam before:

---snip---
From - Wed Aug 02 22:29:15 2006
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on
        ---.---.com
X-Spam-Level: ***
X-Spam-Status: No, score=3.0 required=5.0 tests=BAYES_00,DATE_IN_FUTURE_12_24,
        FROM_LOCAL_NOVOWEL,HTML_MESSAGE autolearn=no version=3.1.1
Received: (qmail 9735 invoked from network); 2 Aug 2006 21:32:47 -0500
Received: from ip33-5.asiaonline.net (202.85.33.5)
  by ---.---.com with SMTP; 2 Aug 2006 21:32:47 -0500
From: "Is giant" <[hidden email]>
To: ---@---.net
Subject: All Apparel
Date: Thu, 3 Aug 2006 10:39:03 -0800
MIME-Version: 1.0
Content-Type: multipart/related;
        boundary="----=_NextPart_000_0004_01C6B6E9.09F4EF40"
X-Mailer: Microsoft Office Outlook, Build 11.0.5510
Thread-Index: Aca26Qn08PnHMV7tSnWAeaDtkgcv8g==
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2869

Message-Id: <[hidden email]>
---snip---



Here is the spam after:

---snip ---
From - Wed Aug 02 23:36:26 2006
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on
        ---.---.com
X-Spam-Level: ************
X-Spam-Status: Yes, score=12.6 required=5.0 tests=BAYES_00,
        DATE_IN_FUTURE_12_24,HTML_MESSAGE,INLINE_IMAGE,RCVD_IN_BL_SPAMCOP_NET,
        SPAMPIC_ALPHA_3,SPAMPIC_WORDS_3,SUSPECT_GIF autolearn=no version=3.1.1
X-Spam-Report:
        *  0.9 SUSPECT_GIF Suspect gif image found
        *  1.5 SPAMPIC_ALPHA_3 Image contains many alphanumeric chars
        *  2.8 DATE_IN_FUTURE_12_24 Date: is 12 to 24 hours after Received: date
        * -2.6 BAYES_00 BODY: Bayesian spam probability is 0 to 1%
        *      [score: 0.0000]
        *  0.0 HTML_MESSAGE BODY: HTML included in message
        *  1.5 INLINE_IMAGE RAW: Inline Images
        *  1.6 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
        *      [Blocked - see <http://www.spamcop.net/bl.shtml?221.2.37.198>]
        *  7.0 SPAMPIC_WORDS_3 Contains inline spam picture (3)
Received: (qmail 10707 invoked from network); 2 Aug 2006 23:20:49 -0500
Received: from unknown (HELO ?221.2.37.198?) (221.2.37.198)
  by ---.---.com with SMTP; 2 Aug 2006 23:20:49 -0500
From: "Value" <[hidden email]>
To: ---@---.net
Subject: [SPAM-Score-12.6] Kodak CamerasIn
Date: Thu, 3 Aug 2006 12:19:46 -0800
MIME-Version: 1.0
Content-Type: multipart/related;
        boundary="----=_NextPart_000_0004_01C6B6F7.1B91DFC0"
X-Mailer: Microsoft Office Outlook, Build 11.0.5510
Thread-Index: Aca29xuT14ns0TbSSbinW6TfiY1R5w==
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2869
Message-Id: <[hidden email]>

X-Spam-Prev-Subject: Kodak CamerasIn

---snip---

Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Bugzilla from linux@matthias-keller.ch
In reply to this post by Davin Flatten
Davin Flatten wrote:

> Just thought this might help someone out.  Thanks to M. Blapp for an
> excellent SA Plugin.  Optical Character Recognition (OCR) can be used
> to nab those pesky spam messages that are hidden in gif,jpeg, or png
> images...
>
> Here is what I did to get the plugin running.
> (...)
> # OCR - performs Optical Character Recognition on spam images
> #
> loadplugin ocrtext /etc/mail/spamassassin/ocrtext.pm
> loadplugin Mail::SpamAssassin::Timeout

Hi

First of all, thanks for the detailed instructions.

I'm running SA 3.1.0 with perl 5.8.7

After following your instructions I get this error though:

[31630] warn: plugin: failed to parse plugin (from @INC): Can't locate
Mail/SpamAssassin/Timeout.pm in @INC (@INC contains: lib
/usr/lib/perl5/site_perl/5.8.7/i586-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.7
/usr/lib/perl5/5.8.7/i586-linux-thread-multi /usr/lib/perl5/5.8.7
/usr/lib/perl5/site_perl
/usr/lib/perl5/vendor_perl/5.8.7/i586-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.7 /usr/lib/perl5/vendor_perl) at (eval
48) line 1.
[31630] warn: plugin: failed to create instance of plugin
Mail::SpamAssassin::Timeout: Can't locate object method "new" via
package "Mail::SpamAssassin::Timeout" at (eval 49) line 1.
[31630] warn: plugin: eval failed: Can't locate object method "new" via
package "Mail::SpamAssassin::Timeout" at
/etc/mail/spamassassin/ocrtext.pm line 396.

It seems the Timeout module doesn't exist here.  Can I just leave the
line out of v310.pre, or is it needed?

Thanks

Matt

Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Davin Flatten
Matthias-

Yes I had the same issue on my setup which I forgot to mention.  I had
to copy the Timeout.pm module from the SpamAssassin source tree into the
installation path.  On my machine it was

cp /usr/local/src/Mail-SpamAssassin-3.1.1/lib/Mail/SpamAssassin/Timeout.pm \
   /usr/local/share/perl/5.8.8/Mail/SpamAssassin/Timeout.pm

You could try commenting out the line that loads this module.  On your
machine it might be already loaded, but on my installation it does not
get loaded by default.
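
Before copying files around, it may be worth checking whether Perl can already find the module in @INC. A minimal sketch, assuming perl is on your PATH (have_perl_module is a hypothetical helper name):

```shell
#!/bin/sh
# Sketch only: have_perl_module is a hypothetical helper name.

# Succeeds if Perl can load the named module from @INC.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Only do any work when a module name is given on the command line.
if [ "$#" -gt 0 ]; then
    if have_perl_module "$1"; then
        echo "$1: found"
    else
        echo "$1: not found in @INC"
    fi
fi
```

For this thread the interesting call is 'sh havemod.sh Mail::SpamAssassin::Timeout'. Keep in mind that the perl on your PATH may not be the one SpamAssassin runs under.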

Maybe someone else knows why this is not installed by default.  If you
don't have the source you can download it from:
http://spamassassin.apache.org/downloads.cgi?update=200607261000

Notice on the bottom of the page that they have an archive.  I would try
to match the version numbers so you don't introduce any weird bugs.

Hope this helps.

Sincerely,
Davin Flatten


Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Bugzilla from linux@matthias-keller.ch
Davin Flatten wrote:
> Matthias-
>
> Yes I had the same issue on my setup which I forgot to mention.  I had
> to copy the Timeout.pm module from the SpamAssassin source tree into
> the installation path.  On my machine it was

Hmm.
I downloaded the archive for 3.1.0 and there's no Timeout.pm at all, so
I guess this was introduced in 3.1.1 or so?

Does anyone know if it's safe to leave it out?

Thanks

Matt

Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Loren Wilton
> I downloaded the archive for 3.1.0 and there's no Timeout.pm at all - so
> i guess this has been introduced in 3.1.1 or so..?
>
> Does anyone know if it's safe to let it away?

JM would be the one with the definitive answer.  But my recollection is that
it is a new/clean implementation of a manager for timeout signals, and can
probably be used alongside the stuff in older versions of SA just fine.  I
assume that something in the OCR plugin must be using it?  If not, I can't
see much reason to load the Timeout module if it isn't already there.

        Loren


Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Theo Van Dinter-2
In reply to this post by Bugzilla from linux@matthias-keller.ch
On Thu, Aug 03, 2006 at 02:14:38PM +0200, Matthias Keller wrote:
> I downloaded the archive for 3.1.0 and there's no Timeout.pm at all - so
> i guess this has been introduced in 3.1.1 or so..?

Correct, it was added into 3.1.1 (bug 4696).

> Does anyone know if it's safe to let it away?

I haven't looked at the plugin -- if the Timeout code is not actively being
used by the plugin, then you should be able to just comment out the line.


The flip side is, why are you still running 3.1.0? ;)

--
Randomly Generated Tagline:
"It's always darkest before dawn. So if you're going to steal your
 neighbour's newspaper, that's the time to do it." - Zen Musings


Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Theo Van Dinter-2
In reply to this post by Davin Flatten
On Thu, Aug 03, 2006 at 07:19:51AM -0400, Davin Flatten wrote:
> You could try commenting out the line that loads this module.  On your
> machine it might be already loaded, but on my installation it does not
> get loaded by default.

If you have 3.1.1 installed it should be there already.

> Notice on the bottom of the page that they have an archive.  I would try
> to match the version numbers so you don't introduce any weird bugs.

Alternately, think about upgrading.  3.1.4 fixes a lot of bugs from,
say, 3.1.0.  :)

--
Randomly Generated Tagline:
Leela: Bender, why are you spending so much time in the bathroom? Are
  you jacking on in there?


Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Stuart Johnston
In reply to this post by Davin Flatten
Davin Flatten wrote:
> Just thought this might help someone out.  Thanks to M. Blapp for an
> excellent SA Plugin.  Optical Character Recognition (OCR) can be used to
> nab those pesky spam messages that are hidden in gif,jpeg, or png images...

This OCR stuff looks promising.  Any comments on performance?  How much extra load does it put on a
server?


Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Bugzilla from linux@matthias-keller.ch
In reply to this post by Theo Van Dinter-2
Theo Van Dinter wrote:

> On Thu, Aug 03, 2006 at 02:14:38PM +0200, Matthias Keller wrote:
>> I downloaded the archive for 3.1.0 and there's no Timeout.pm at all - so
>> i guess this has been introduced in 3.1.1 or so..?
>
> Correct, it was added into 3.1.1 (bug 4696).
>
>> Does anyone know if it's safe to let it away?
>
> I haven't looked at the plugin -- if the Timeout code is not actively being
> used by the plugin, then you should be able to just comment out the line.

Hmm, it seems to be used; at least I find one occurrence of
Mail::SpamAssassin::Timeout in the .pm file:

#
# Limit the scantime
#
$permsgstatus->enter_helper_run_mode();
my $timer = Mail::SpamAssassin::Timeout->new({ secs =>
$self->{main}->{conf}->{ocrtext_timeout} });
my $err = $timer->run_and_catch(sub {
......

So I guess this plugin really only runs from 3.1.1 onwards?

> The flip side is, why are you still running 3.1.0? ;)

I know, but this is a production system and I'll have to test an upgrade
first on the test server, as I can't take any risks on that server.
But an upgrade is at the top of my to-do list.
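
Since the plugin calls Mail::SpamAssassin::Timeout, which was added in 3.1.1, a quick version gate can tell you whether a given box is new enough. This is a sketch for plain dotted numeric versions only; version_ge is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: version_ge is a hypothetical helper name. It handles plain
# dotted numeric versions such as 3.1.0, not suffixes like "-rc1".

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    a=$1 b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Compare the leading component of each version; missing ones count as 0.
        x=${a%%.*}; y=${b%%.*}
        [ "${x:-0}" -gt "${y:-0}" ] && return 0
        [ "${x:-0}" -lt "${y:-0}" ] && return 1
        # Drop the component just compared.
        case "$a" in *.*) a=${a#*.} ;; *) a= ;; esac
        case "$b" in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

# Only do any work when a version is given on the command line.
if [ "$#" -gt 0 ]; then
    if version_ge "$1" 3.1.1; then
        echo "$1: new enough for the Timeout module"
    else
        echo "$1: older than 3.1.1, Timeout.pm is not included"
    fi
fi
```

Pass it the version string from your X-Spam-Checker-Version header, for example 'sh sa-version.sh 3.1.0'.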

Matt


Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Stephan Bosch-2
In reply to this post by Davin Flatten
Davin Flatten schreef:
> Just thought this might help someone out.  Thanks to M. Blapp for an
> excellent SA Plugin.  Optical Character Recognition (OCR) can be used to
> nab those pesky spam messages that are hidden in gif,jpeg, or png images...
>
I ran a search on the patch and I didn't see any references to the bayes
learner. Wouldn't it be a logical choice to feed (and test) the OCR text
to the bayes learner just like any other plaintext mail content? The OCR
results will of course contain some gibberish, but that shouldn't be
very different from the usual bayes poison. I think this could further
improve the OCR feature (haven't tested the patch yet btw).

Regards,

Stephan


Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Davin Flatten
In reply to this post by Stuart Johnston
Stuart-

Not significant that I have noticed.  We are running a dedicated
SpamAssassin gateway, however.  Its only job is to process spam.  It is
running dual Xeon 2.80GHz/2MB cache with 4GB of RAM over RAID5 with some
scratch partitions loaded in RAM.  We also run clamav, mimedefang, bayes
out of mysql, and milter-greylist on the same machine.

We process 15,000-30,000 emails a day on this machine.

One thing that could be improved would be a setting for which directory
the plugin uses as scratch.  I would point this at my memory-based
mounts, and that would at least lower the I/O overhead.

-Davin


RE: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Jeff Moss-2
In reply to this post by Davin Flatten
We're getting some image-spam stuck in the queue because they crash SA
with this plugin turned on. We are using a custom setup built from
amavisd-lite.
I'm still trying to figure out what's causing it.

  Jeff Moss

-----Original Message-----
From: Stuart Johnston [mailto:[hidden email]]
Sent: Thursday, August 03, 2006 10:41 AM
To: [hidden email]
Subject: Re: GIF Spam -- Setting up the 'OCR scanner and image validator
SA-plugin'

Davin Flatten wrote:
> Just thought this might help someone out.  Thanks to M. Blapp for an
> excellent SA Plugin.  Optical Character Recognition (OCR) can be used to
> nab those pesky spam messages that are hidden in gif,jpeg, or png images...

This OCR stuff looks promising.  Any comments on performance?  How much
extra load does it put on a server?


RE: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Dave Augustus-2
I will be testing this later this evening using the instructions provided.
I will keep you posted.

Dave Augustus

> We're getting some image-spam stuck in the queue because they crash SA
> with this plugin turned on. We are using a custom setup built from
> amavisd-lite.
> I'm still trying to figure out what's causing it.
>
>   Jeff Moss
>
> -----Original Message-----
> From: Stuart Johnston [mailto:[hidden email]]
> Sent: Thursday, August 03, 2006 10:41 AM
> To: [hidden email]
> Subject: Re: GIF Spam -- Setting up the 'OCR scanner and image validator
> SA-plugin'
>
> Davin Flatten wrote:
>> Just thought this might help someone out.  Thanks to M. Blapp for an
>> excellent SA Plugin.  Optical Character Recognition (OCR) can be used to
>> nab those pesky spam messages that are hidden in gif,jpeg, or png images...
>
> This OCR stuff looks promising.  Any comments on performance?  How much
> extra load does it put on a server?
>
>


RE: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Jeff Moss-2
In reply to this post by Davin Flatten
Still trying to debug SA crashing with the OCR plugin.  I extracted the
base64 encoding from one of the offending messages.  Then I converted it
to image001.gif with uudeview.  But when I try to convert it to a pnm
file from the command line I get errors.

[filter]# giftopnm image001.gif > image001.pnm
giftopnm: too much input data, ignoring extra...
giftopnm: bogus character 0x00, ignoring
[filter]#

I have no idea what's causing this, how to fix it, or if it's even
related to the crashing problem.

  Jeff Moss




Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Davin Flatten
Jeff-

Make sure you apply the patches to both the gocr source and
Image::ExifTool.   The gocr patch deals specifically with the segfault
issues.

From the docs:

# - Perl module Image::ExifTool and a patch for GIF pics:
#   http://antispam.imp.ch/patches/patch-GIF-Colortable
#
# - Gocr from http://jocr.sourceforge.net and a patch to
#   avoid segfaults with gocr:
#   http://antispam.imp.ch/patches/patch-gocr-segfault


Hope this helps.
-Davin

Re: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Davin Flatten
In reply to this post by Jeff Moss-2
Jeff-

You might also want to try saving the image out of a client application
like Thunderbird, copying it to your server, and running giftopnm on
that copy.  It might be that uudeview is the problem and not giftopnm.
The errors sound like a corrupt GIF image.  This should not affect the
plugin, however.
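
A very rough way to tell whether an extracted file is a GIF at all is to look at its first six bytes, which should read GIF87a or GIF89a. The sketch below only catches gross extraction mistakes; it will not notice trailing garbage of the kind giftopnm complained about. is_gif is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: is_gif is a hypothetical helper name.

# Succeeds if the file starts with a GIF signature (GIF87a or GIF89a).
is_gif() {
    sig=$(head -c 6 "$1" 2>/dev/null)
    [ "$sig" = "GIF87a" ] || [ "$sig" = "GIF89a" ]
}

# Only do any work when a file is given on the command line.
if [ "$#" -gt 0 ]; then
    if is_gif "$1"; then
        echo "$1: has a GIF signature"
    else
        echo "$1: does not start with a GIF signature"
    fi
fi
```

Running it on both the uudeview output and the copy saved from the mail client would show whether the two extractions even start the same way.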

I would suggest turning on debugging output in SpamAssassin to see where
in the plugin the problem is occurring.  Use the debug facility 'ocrtext'
and grep your logs for 'ocrtext'.  That should give you some info.

If you are running spamd, try:  --debug=ocrtext

-D, --debug[=areas]                Print debugging messages (for areas)
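
To pull just one facility out of a noisy debug log, a filter along these lines might help. It assumes the "[pid] level: facility: message" layout seen in the warnings quoted earlier in this thread; debug_lines_for is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: debug_lines_for is a hypothetical helper name. It assumes
# lines shaped like "[31630] warn: plugin: message".

# Keep only the lines on stdin that belong to the given facility,
# whatever their level (dbg, warn, ...).
debug_lines_for() {
    grep -E "^\[[0-9]+\] [a-z]+: $1: "
}

# Only do any work when a facility name is given on the command line.
if [ "$#" -gt 0 ]; then
    debug_lines_for "$1"
fi
```

Feed it a saved debug log on stdin, for example 'sh dbgfilter.sh ocrtext < debug.log'.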

Hope this helps.
-Davin

RE: GIF Spam -- Setting up the 'OCR scanner and image validator SA-plugin'

Jeff Moss-2
In reply to this post by Davin Flatten
Patching GIF.pm seems to have fixed the problem.  I patched gocr because
that was in the instructions that got posted, but patching GIF.pm wasn't
so I missed it.

  Jeff Moss
