Filter Foreign Language Spam?
Posted by Adam Pash at 11:00 AM on February 28, 2008

Dear Lifehacker,
I don't know what I did to deserve it, but I've recently been hit with an overwhelming flood of foreign-language (primarily Russian) spam. I can't even read these emails, so I have no idea what the point is, other than to make my Gmail inbox a miserable place to be. What can I do?
Signed,
Exasperated English-Speaker
Dear Exasperated,
We've all been there—you get a rush of spam to your inbox full of characters you can't begin to decipher. Luckily, you're using Gmail, and filtering these emails out of your inbox is a piece of cake.
Since you're having a problem with Russian-language spam, I'll focus on that, but this could work for basically any language that uses a different set of characters or symbols than the Roman alphabet. The quickest and easiest method for avoiding these messages would be to set up a filter in Gmail that looks for Cyrillic characters (which is the alphabet Russian uses). If you head to the Cyrillic alphabet page on Wikipedia, you can grab a few letters by copying and pasting from the bottom of the Letter-forms and typography section. For example, you could grab И (sort of like the 'i' in the Roman alphabet).
To test that out, copy and paste it into your Gmail search box and see what you end up with. Most if not all of your Russian-language spam should show up there. If it looks good to you, now's the time to build your simple Gmail filter using that letter in the Has the words textbox. I'd recommend creating a PSpam label (possible spam) rather than outright deleting it so you can review it just in case, since there is always the chance that you'll also see an email or two that isn't spam. If that one letter by itself isn't catching all your spam, you may also want to beef up your filter with a few other letters.
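If you would rather experiment outside Gmail first, the same idea can be sketched in a few lines of Python. This is only an illustration, not how Gmail's filters work internally: the function names and the 0.5 threshold are assumptions made up for this example, and it measures the share of Cyrillic letters rather than looking for a single letter, which is one way to reduce the risk of catching a legitimate email that merely contains a stray character.

```python
import unicodedata

def cyrillic_ratio(text):
    """Return the fraction of alphabetic characters in text that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode character names for Cyrillic letters start with "CYRILLIC".
    cyrillic = [ch for ch in letters if "CYRILLIC" in unicodedata.name(ch, "")]
    return len(cyrillic) / len(letters)

def looks_like_pspam(text, threshold=0.5):
    """Flag text as possible spam when at least `threshold` of its letters are Cyrillic.

    The threshold is an arbitrary assumption for illustration; tune it to taste.
    """
    return cyrillic_ratio(text) >= threshold
```

A message that is mostly Cyrillic gets flagged, while an English email signed with a single И does not.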
The same method should work for other non-Roman alphabet languages, and you could probably just grab characters directly from your foreign-language emails. Just remember, you don't want to rid yourself of every email with a little language obscurity in it by default! You may actually want an email with an odd character or two in it. But if you're getting tonnes of foreign-language spam and you just want to filter it out of your inbox for reviewing at a later time, this method should do the trick.
With peace, love, and understanding,
Lifehacker
Tags: ask lifehacker | email | spam | spam filters

Comments
BlogsOfSteel
Posted 12:41 PM 28/2/08
Too bad there isn't a way to pipe the Russian SPAM into "Learn It List" & kill 2 birds with one stone! ;-)
BlogsOfSteel
ph15h
Posted 12:41 PM 28/2/08
I was getting Japanese spam after I signed up for Famitsu... I guess the site wasn't famitsu after all.
ph15h
rscotta
Posted 12:41 PM 28/2/08
Not a problem I have, but seems like a pretty clever solution. Nice one, LH.
rscotta
chrissv
Posted 12:41 PM 28/2/08
For a while I was getting Chinese spam.
I asked a Chinese co-worker what it was (after warning him that it was Spam, and could be offensive!) and he said it was advertising some kind of MBA-type seminar / program. It was up to 2-3 messages per day!
The messages have since stopped. So I am back to just the "normal" spam for various pharmaceutical and/or medical enhancement devices, with very creative subjects meant to get past the spam filters (I always say: who would buy drugs from a company which has to obfuscate the name of the drug to get the e-mail past the spam filters!)
chrissv
Jenkinsm
Posted 1:57 PM 28/2/08
I get Spanish spam, how could I get rid of that?
Jenkinsm
bdk184
Posted 1:57 PM 28/2/08
That's why I have 2 online aliases... This one and one where I send all my spam, porn and anything else that I don't want in my primary account...
bdk184
rlee
Posted 4:50 PM 28/2/08
@Jenkinsm: That's a bit trickier, obviously, but there are some accented characters you could look for. ñ is the obvious one, plus assorted accented vowels that would catch other western European languages as well: á ø ü and many more.
Failing that, you'd probably have to look for common words that don't occur in English.
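A rough sketch of both ideas in Python, for anyone who wants to try them on saved messages. The word list, the character set and the function names are assumptions picked for illustration, not a vetted Spanish detector, and a real filter would want a longer list and some threshold on the score.

```python
import re
import unicodedata

# Hypothetical marker list: common Spanish words that are not English words.
SPANISH_MARKERS = {"que", "para", "una", "los", "las", "por", "del", "pero", "más", "está"}

# Characters that are typical of Spanish and rare in English text.
SPANISH_CHARS = "ñ¿¡"

def spanish_score(text, markers=SPANISH_MARKERS):
    """Count how many words in text appear in the marker list."""
    # NFC normalization keeps accented letters as single characters.
    normalized = unicodedata.normalize("NFC", text.lower())
    words = re.findall(r"[^\W\d_]+", normalized)
    return sum(1 for word in words if word in markers)

def has_spanish_chars(text):
    """Return True if text contains ñ or an inverted question/exclamation mark."""
    normalized = unicodedata.normalize("NFC", text.lower())
    return any(ch in SPANISH_CHARS for ch in normalized)
```

The character check is cheap but misses messages typed without accents, which is where the word count helps.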
rlee
greenbot
Posted 6:05 PM 28/2/08
Some time ago, I was getting a lot of Chinese language spam. I just set up an e-mail filter to send all "Big-5" encoded messages to the junk folder, where I would skim the subject lines before deleting permanently.
Also, a lot of spammers tend to use mass e-mailer software. Just check the headers and look for common header information that indicates the client software or generator. And then create a message filter based on your findings.
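For the curious, here is roughly what that looks like if you script it with Python's standard email module instead of a mail client's filter dialog. The blocked values, the sample message and the function names are made up for illustration; the X-Mailer header is only one of several headers a bulk mailer might leave behind, so check your own spam to see what is actually there.

```python
from email import message_from_string

# Hypothetical values for illustration; substitute whatever your own spam shows.
BLOCKED_CHARSETS = {"big5"}
BLOCKED_MAILERS = ("ExampleBulkMailer",)

def header_signals(raw_message):
    """Return (set of declared charsets, X-Mailer value) for a raw message."""
    msg = message_from_string(raw_message)
    # get_charsets() yields one lowercased entry per MIME part (None if undeclared).
    charsets = {c for c in msg.get_charsets() if c}
    mailer = msg.get("X-Mailer", "")
    return charsets, mailer

def should_junk(raw_message):
    """Return True if the message declares a blocked charset or mailer."""
    charsets, mailer = header_signals(raw_message)
    if charsets & BLOCKED_CHARSETS:
        return True
    return any(name.lower() in mailer.lower() for name in BLOCKED_MAILERS)

# A made-up message used only to exercise the functions above.
SAMPLE = (
    "From: someone@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="Big5"\n'
    "X-Mailer: ExampleBulkMailer 1.0\n"
    "\n"
    "body\n"
)
```

As with the junk folder approach, this only sorts messages; you would still skim the subject lines before deleting anything.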
greenbot
Onouris
Posted 6:55 PM 28/2/08
Look for words like Viagra and Penis Enlarger in a language. Surely that will catch 90% of the spam!
Onouris
Sanja
Posted 6:55 PM 28/2/08
This is funny. I am a Russian Gmail user and I suffer from Brazilian spam. Seems that someone at Google has a strange sense of humour :)
Sanja
aphexbr
Posted 10:40 PM 28/2/08
"...I have no idea what the point is"
Sorry, I just had to comment on this. Welcome to the truly international internet where not everyone speaks English as their first language.
Face it - spam is not, and never has been, targeted. Just as Chinese and Russian people have had to read English language spam for the last decade, so the rest of us now have to endure these kinds of emails. This should be a good thing, as it makes them easier to filter without a high risk of false positives.
aphexbr
tommertron
Posted 11:55 PM 28/2/08
I was getting Chinese spam for a while, and I just started hitting the "SPAM" button every time it popped up. Eventually, Gmail got the picture. I'd venture that the same would work here.
tommertron
Thor
Posted 3:06 AM 29/2/08
If you use an email program like Thunderbird or Outlook, or such, the spamfilter program Mailwasher can filter out foreign language messages. Look at the header of the incoming email and look for "charset=(something)" where something is the other language charset. Include this in a new filter for the header and it will automatically mark these messages as spam. I have a friend who gets lots of spam in Cyrillic, and this filter works great.
Thor
ahoier
Posted 3:56 AM 29/2/08
yea, it would be sooooo much easier if Google would allow us to make filters using the header details....
Until then, just gotta look at character-filtering...pfft.
some tupra dot com (NSFW skin spam stuff, cover your eyes!) has been a big culprit for me; luckily I don't deal with it filling up my inbox, Gmail filters it all to spam for forwarding to KnujOn using gknujon from submanifold.be ;)
ahoier
lucy_pigpuppet
Posted 5:56 AM 29/2/08
Searching for a single character doesn't bring up spam with that character within a word. Am I doing something wrong? Does gmail have a wildcard operator?
lucy_pigpuppet
salamich
Posted 5:56 AM 29/2/08
You can search by language using the "lang:" operator. A search for "lang:russian" shows up all my russian SPAM.
salamich
Jefo
Posted 5:56 AM 29/2/08
@greenbot: All the Russian (i.e. Cyrillic) messages are "koi8-r" or "koi8-u" encoded (Russian or Ukrainian). This gives your spam filter a flying start.
Jefo
Jefo
Posted 5:56 AM 29/2/08
You might be better off just looking for the phrase "koi8-r" in the mail header. It is the name of a Cyrillic encoding.
Jefo
k48
Posted 5:56 AM 29/2/08
OK, after a little research I've done this one filter based on the list of most frequent Chinese characters to get rid of Chinese spam. Put it into the "Has the words:" line when creating a filter in Gmail.
的 OR 一 OR 是 OR 不 OR 了 OR 人 OR 我 OR 在 OR 有 OR 他 OR 这 OR 中 OR 大 OR 来 OR 上 OR 国 OR 个 OR 到 OR 说 OR 们 OR 为 OR 子 OR 和 OR 你 OR 年 OR 时
Then, as in the article, choose "Delete it" on the next page. This should free you of 99.9% Chinese spam. But beware! It may also delete mail from your Chinese colleague who decided e.g. to have a quotation or signature in Chinese at the end of the letter, or may delete mail from your Chinese partners.
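If you want to build or tweak a line like the one above without typing all the ORs by hand, a small Python helper can do it. The function name is made up for this sketch; it only joins terms with " OR " and drops blanks and repeats, so the result can be pasted into the "Has the words:" field.

```python
def build_has_the_words(terms):
    """Join filter terms with ' OR ', skipping blanks and duplicates."""
    cleaned = (term.strip() for term in terms)
    # dict.fromkeys keeps the first occurrence of each term, in order.
    unique = dict.fromkeys(term for term in cleaned if term)
    return " OR ".join(unique)

# The 26 characters from the filter line above.
CHINESE_CHARS = list("的一是不了人我在有他这中大来上国个到说们为子和你年时")

query = build_has_the_words(CHINESE_CHARS)
print(query)
```

Trimming the list, or swapping in characters for another language, is then a one-line change.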
k48
k48
Posted 5:56 AM 29/2/08
Oh, and besides, if you press the Spam button every time you receive Russian spam, I think Gmail's spam filter will finally learn to put it into the Spam folder, because it relies not only on static rules but learns from what you think is spam, too.
k48
k48
Posted 5:56 AM 29/2/08
Hello there. If you get Russian spam, I advise you to create filters with letters "и" (means "and") and "в" (means "in") - just copy and paste them. These are probably the most common words.
But as for the Chinese spam I've been getting recently, it is much more difficult: in the Russian alphabet there are only 33 letters, while the number of characters in Chinese is huge.
Another, more reliable way is to create a filter with OR expressions. In Gmail, put the following line into "Has the words:" field when creating a filter -
и OR в OR не OR а OR у OR о OR вы OR вас OR вам
It means that if the message contains any of these very frequent Russian words, it matches the filter.
I believe lifehackers from China could make a similar filter which would include 30-50 most used Chinese characters and share it with us. Yes, that would be a long line of ORs.
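To show why short whole words work better than single letters, here is a small Python sketch of the matching. It assumes the filter matches whole words rather than letters inside words, which is consistent with what lucy_pigpuppet reports above but is a guess about Gmail's behaviour, not something documented here; the function name is made up.

```python
import re

# The frequent Russian words from the filter line above.
RUSSIAN_WORDS = ["и", "в", "не", "а", "у", "о", "вы", "вас", "вам"]

def matches_any_word(text, words=RUSSIAN_WORDS):
    """Return True if any listed word appears in text as a whole word."""
    # Split into word tokens so a letter inside a longer word does not count.
    tokens = set(re.findall(r"\w+", text.lower()))
    return any(word in tokens for word in words)
```

A sentence containing "вас" matches, while the single word "Привет" does not, even though it contains the letters "и" and "в".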
k48
nerfball1976
Posted 8:17 AM 29/2/08
Seems like this is something that Google would allow you to filter automatically. I could see a setting that would allow English (or whatever default language) messages only.
nerfball1976
shefarted
Posted 8:17 AM 29/2/08
A while back, I requested a Gmail-specific script on the Greasemonkey wiki that would treat all emails not in your default encoding or language locale as spam. At the time I was getting a lot of Chinese spam. Shortly after, my Chinese spam started always going to the spam folder, so I figured that a Gmail person saw my request on the Greasemonkey wiki and incorporated that into Gmail, but I guess I was thinking too highly of myself if people are still experiencing this. My Gmail must have just learned from my actions as noted above.
shefarted
drsmith
Posted 9:06 AM 29/2/08
Kinda useless for the simple fact that Google still doesn't let you put a default label on your incoming messages. So if you have any email that doesn't match at least one filter (and we all will have some) you still have to look at the unorganized view of your inbox every once in a while to read those emails. When you do that, you also see all of the labeled messages at the same time.
Don't know what's going on at Google these days, but the dev teams seem to either be busy working on something else or they're just plain ignoring the obvious requests from the user base. I complained about the lack of a default rule/label over 2 years ago...
drsmith
NineTailedFox
Posted 3:30 PM 29/2/08
As others have noted, the efficacy of grabbing random characters for this kind of filter will vary from language to language.
For Japanese, avoid the kanji (Chinese characters, e.g. 無神論、綺麗、猪) and go for the hiragana (the curly, swoopy, generally simpler ones). す (su), る (ru) and た (ta) should cover you, given their use in verb endings. k48's list for Chinese is good, but I think you could get away with less; I'd probably struggle to write any respectable spam without using 的 (possessive marker etc.), 不 (negative marker), 个 (counter) or 是 (to be (more or less)). Actually, "。" might work well, assuming the filter picks it up and can distinguish between that and ".".
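For anyone checking messages outside Gmail, the hiragana and "。" ideas are easy to try in Python, because both are distinct Unicode code points: hiragana letters run from U+3041 to U+3096, and "。" is U+3002, separate from the ASCII "." (U+002E). The function names are made up for this sketch, and whether Gmail's own filter treats the two full stops differently is still an open question.

```python
def count_hiragana(text):
    """Count hiragana letters (U+3041 through U+3096) in text."""
    return sum(1 for ch in text if "\u3041" <= ch <= "\u3096")

def has_ideographic_full_stop(text):
    """Return True if text contains the CJK full stop (U+3002)."""
    return "\u3002" in text
```

Kanji-only text scores zero on the hiragana count, which is why it helps tell Japanese from Chinese.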
greenbot's method of blocking Big-5 is even better if it's available, but will only block traditional characters (from Taiwan, Hong Kong, etc.). To cover the PRC, block GB 2312-80 and GB 18030-2000 as well.
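One wrinkle with blocking by encoding is that the same charset can appear under several spellings in a header ("Big5", "BIG5", and so on). A possible way to handle that in Python is to normalize each label through the standard codecs registry before comparing. The blocked set and the function names below are assumptions for illustration; add "koi8-r" and "koi8-u" if you want to cover the Cyrillic encodings Jefo mentions.

```python
import codecs

# Canonical Python codec names for the encodings to block (illustrative choice).
BLOCKED_CHARSETS = {"big5", "gb2312", "gb18030"}

def canonical_charset(label):
    """Map a charset label to Python's canonical codec name when it is known."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        # Unknown labels are kept as-is, just lowercased and trimmed.
        return label.strip().lower()

def is_blocked(label, blocked=BLOCKED_CHARSETS):
    """Return True if the label resolves to one of the blocked charsets."""
    return canonical_charset(label) in blocked
```

Labels Python does not recognize simply fall through unblocked, so nothing is junked on a guess.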
NineTailedFox
Dereks
Posted 5:35 AM 18/3/08
oh, I can translate those for him)))))
Dereks