rotten news relay - comp.lang.python - Re: How to manage accented characters in mail header?

I have a Python script that filters my incoming E-Mail. It has been
working OK (with various updates and improvements) for many years.

I now have a minor new problem when handling E-Mail with a From: that
has accented characters in it:-

From: Sébastien Crignon <sebastien.crignon@amvs.fr>

I use Python mailbox to parse the message:-

import mailbox
...
...
msg = mailbox.MaildirMessage(sys.stdin.buffer.read())

Then various mailbox methods to get headers etc.
I use the following to get the From: address:-

str(msg.get('from', "unknown")).lower()

The result has the part with the accented character wrapped as follows:-

From: =?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>

I know I have hit this issue before but I can't remember the fix. The
problem I have now is that searching the above doesn't work as
expected. Basically I just need to get rid of the ?utf-8? wrapped bit
altogether as I'm only interested in the 'real' address. How can I
easily remove the UTF8 section in a way that will work whether or not
it's there?

--
Chris Green

Subject: Re: How to manage accented characters in mail header?
From: Stefan Ram
Newsgroups: comp.lang.python
Organization: Stefan Ram
Date: Sat, 4 Jan 2025 14:49 UTC
References: 1

Chris Green <cl@isbd.net> wrote or quoted:
>From: =?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>

In Python, when you roll with decode_header from the email.header
module, it spits out a list of parts, where each part is a tuple
of (decoded chunk, charset). The chunk comes back as bytes when
the header contains an encoded word, and as a plain str when it
doesn't. To smash these into one string, you’ll want to loop
through the list, decode each piece (if it needs it), and then
throw them together.
Here’s a straightforward example of how to pull this off:

from email.header import decode_header

# Example header
header_example = \
    'From: =?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>'

# Decode the header
decoded_parts = decode_header(header_example)

# Kick off an empty list for the decoded strings
decoded_strings = []

for part, charset in decoded_parts:
    if isinstance(part, bytes):
        # Decode the bytes to a string using the charset
        decoded_string = part.decode(charset or 'utf-8')
    else:
        # If it’s already a string, just roll with it
        decoded_string = part
    decoded_strings.append(decoded_string)

# Join the parts into a single string
final_string = ''.join(decoded_strings)

print(final_string)  # From: Sébastien Crignon <sebastien.crignon@amvs.fr>

Breakdown

decode_header(header_example): This line takes your email header
and breaks it down into a list of tuples.

Looping through decoded_parts: You check if each part is in
bytes. If it is, you decode it using whatever charset it’s
got (defaulting to 'utf-8' if it’s a little vague).

Appending Decoded Strings: You toss each decoded part into a list.

Joining Strings: Finally, you use ''.join(decoded_strings) to glue
all the decoded strings into a single, coherent piece.

Just a Heads Up

Keep an eye out for cases where the charset might be None. In those
moments, it’s smart to fall back to 'utf-8' or something safe.
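
For comparison with the loop above, the standard library can also do the
decode-and-join step itself via email.header.make_header. This is a
minimal sketch, not something from the post: it assumes the header value
is already available as a str, and decode_header_value is a made-up
helper name.

```python
from email.header import decode_header, make_header


def decode_header_value(value):
    """Return a header value with any RFC 2047 encoded words decoded.

    decode_header_value is a hypothetical helper name, not a library call.
    """
    # decode_header() splits the value into (chunk, charset) pairs;
    # make_header() reassembles them, and str() yields the Unicode text.
    return str(make_header(decode_header(value)))


encoded = '=?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>'
plain = 'Chris Green <someone@example.org>'  # made-up ASCII-only value

print(decode_header_value(encoded))
print(decode_header_value(plain))
```

The same call works whether or not the value contains an encoded word,
which is the "whether or not it's there" part of the question.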

Subject: Re: How to manage accented characters in mail header?
From: Peter Pearson
Newsgroups: comp.lang.python
Date: Sat, 4 Jan 2025 15:00 UTC
References: 1

On Sat, 4 Jan 2025 14:31:24 +0000, Chris Green <cl@isbd.net> wrote:
> [...]
> How can I easily remove the UTF8 section in a way that will work
> whether or not it's there?

This seemed to work for me:

import email.header
text, encoding = email.header.decode_header(some_string)[0]

--
To email me, substitute nowhere->runbox, invalid->com.
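
One thing worth checking with the [0] approach: it returns only the first
chunk of the header, and that chunk may be bytes or str. A small sketch,
using the header value from the original post plus a made-up ASCII-only
value:

```python
import email.header

encoded = '=?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>'
plain = 'Someone <someone@example.org>'  # made-up ASCII-only value

# With an encoded word, the first chunk is the decoded name as bytes;
# the <address> part sits in a later chunk.
first_text, first_charset = email.header.decode_header(encoded)[0]
print(first_text.decode(first_charset))

# Without any encoded word, the whole value comes back as one str chunk
# and the charset is None.
plain_text, plain_charset = email.header.decode_header(plain)[0]
print(plain_text, plain_charset)
```

So for this particular header, [0] gives the display name rather than
the address.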

Subject: Re: How to manage accented characters in mail header?
From: Chris Green
Newsgroups: comp.lang.python
Date: Sat, 4 Jan 2025 19:07 UTC
References: 1 2

Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> Chris Green <cl@isbd.net> wrote or quoted:
> >From: =?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>
> [...]
Thanks, I think! :-)

Is there a simple[r] way to extract just the 'real' address between
the <>? That's all I actually need. I think it has to be the last
chunk of the From:, doesn't it?

--
Chris Green

Subject: Re: How to manage accented characters in mail header?
From: Stefan Ram
Newsgroups: comp.lang.python
Organization: Stefan Ram
Date: Sat, 4 Jan 2025 19:40 UTC
References: 1 2 3

Chris Green <cl@isbd.net> wrote or quoted:
>>print(final_string)# From: Sébastien Crignon <sebastien.crignon@amvs.fr>
>Is there a simple[r] way to extract just the 'real' address between
>the <>, that's all I actually need. I think it has to be the last
>chunk of the From: doesn't it?

Besides the form with the pointy brackets, there's also the
other one with round ones, like

sebastien.crignon@amvs.fr (Sébastien Crignon)

The standard library has:

email.utils.parseaddr(address)

Parse address – which should be the value of some
address-containing field such as To or Cc – into its
constituent realname and email address parts. Returns a tuple
of that information, unless the parse fails, in which case a
2-tuple of ('', '') is returned.
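
A short sketch of how parseaddr might be applied here. It uses the two
address forms from this thread plus an empty string to show the failure
case; the 'unknown' fallback mirrors the original script and is otherwise
an assumption.

```python
from email.utils import parseaddr

angle_form = '=?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>'
paren_form = 'sebastien.crignon@amvs.fr (Sébastien Crignon)'

for value in (angle_form, paren_form, ''):
    realname, address = parseaddr(value)
    # An empty address means the parse failed or there was nothing to parse.
    print(repr(address.lower() or 'unknown'))
```

The encoded display name does not need decoding first, since only the
second element of the tuple is used.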

Subject: Re: How to manage accented characters in mail header?
From: Peter J. Holzer
Newsgroups: comp.lang.python
Date: Mon, 6 Jan 2025 19:43 UTC
References: 1 2 3 4
Attachments: signature.asc (application/pgp-signature)

On 2025-01-04 19:07:57 +0000, Chris Green via Python-list wrote:
> Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> > Chris Green <cl@isbd.net> wrote or quoted:
> > >From: =?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>
> >
> Is there a simple[r] way to extract just the 'real' address between
> the <>, that's all I actually need. I think it has to be the last
> chunk of the From: doesn't it?

No,
From: <sebastien.crignon@amvs.fr> (Sébastien Crignon)
would also be permissible (properly encoded, of course), and even
From: < sebastien (Sébastien) . crignon (Crignon) @ amvs . fr >
(although I think the latter is deprecated).

And also, there can be more than one address in a From header.

To properly extract email addresses from a header, use
email.utils.getaddresses(). You don't have to decode the header first.
The MIME-encoding is supposed to not interfere with parsing headers for
machine-readable information like addresses or message ids.
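
A sketch of how that could look in a filter script like the one in the
original post. The message bytes are made up for illustration (a real
script would read them from stdin), and the second address is
hypothetical.

```python
import mailbox
from email.utils import getaddresses

# Made-up message for illustration; a real script would read it from stdin.
raw_message = (
    b'From: =?utf-8?B?U8OpYmFzdGllbiBDcmlnbm9u?= <sebastien.crignon@amvs.fr>, '
    b'Someone Else <someone@example.org>\n'
    b'Subject: test\n'
    b'\n'
    b'body\n'
)
msg = mailbox.MaildirMessage(raw_message)

# get_all() returns every From: field; str() guards against Header objects.
from_values = [str(value) for value in msg.get_all('from', [])]

# getaddresses() yields (realname, address) pairs; keep only the addresses.
addresses = [addr.lower() for _name, addr in getaddresses(from_values) if addr]
print(addresses)
```

The header is passed in still MIME-encoded, and the result is a list, so
a From: with more than one address is handled without special cases.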
