Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A non-English domain is considered invalid. #21

Closed
Rombobeorn opened this issue Jul 19, 2014 · 14 comments · Fixed by fedora-infra/fmn.lib#44
Closed

A non-English domain is considered invalid. #21

Rombobeorn opened this issue Jul 19, 2014 · 14 comments · Fixed by fedora-infra/fmn.lib#44

Comments

@Rombobeorn
Copy link

When I enter my email address Bjorn@Rombobjörn.se I get this error
message:

{
"reason": "value must be an email address"
}

Perhaps you prefer to not support IDNA (RFC 5890) until full support
for international email addresses (RFC 6530) is in place, but in that
case the error message should be "International email addresses aren't
supported yet. Please enter an address with only English letters and
digits."

@ralphbean
Copy link

Yes! My apologies for missing this.

The code where the regular expressions are evaluated is here: https://github.com/fedora-infra/fmn.lib/blob/develop/fmn/lib/__init__.py#L166-L179

And the regular expressions themselves are defined here: https://github.com/fedora-infra/fmn.lib/blob/develop/fmn/lib/__init__.py#L24-L26

(those code locations are for anyone who is interested in submitting a patch/pull-request for this).

@abdelgmartinezl
Copy link

Hi!

I tried this regex and locally worked. But I want to know if this could work properly on the fmn project:

import re
re.match("^.+@.+.([a-zA-Z]{2,3}|[0-9]{1,3})$", "Bjorn@Rombobjörn.se")

@Rombobeorn
Copy link
Author

That pattern fails in at least three cases:

  • Non-English top-level domains exist now.
  • There are top-level domains with more than three letters.
  • Email addresses directly in a top-level domain may be rare, but they can technically exist, and they may become more common now that corporations can buy top-level domains.

@abdelgmartinezl
Copy link

So the best approach for now is changing the message or would you like me to improve the pattern?

@Rombobeorn
Copy link
Author

What the pattern and the error message should be depends on what the rest of the system can handle.

If the underlying email machinery receives an address that it considers invalid, can the error it returns be passed back to the web interface? If the address is syntactically valid but the domain part can't be resolved to a mail server address, can that error be passed back to the web interface? If the email is sent but bounces, can that error be passed back to the web interface? If that error reporting is in place, then the web interface should only do basic sanity checking and delegate the full address validation to the email machinery.

Does the underlying email machinery implement RFC 6530? If not, does it support IDNA by converting the domain part of the address to ACE before it composes the message? If not, then the error message I suggested earlier should be displayed in the case when the address conforms to this syntax:
https://tools.ietf.org/html/rfc6531#section-3.3
but does not conform to this syntax:
https://tools.ietf.org/html/rfc5322#section-3.4.1
https://tools.ietf.org/html/rfc5322#section-3.2.3
https://tools.ietf.org/html/rfc5321#section-4.1.2

If address validation is delegated to the email machinery, but the email machinery doesn't support international addresses, then the web interface should check whether the input is valid UTF-8 but not ASCII, and in that case display the error message I suggested earlier.

By the way, in the proposed new pattern it looks like you tried to support an IPv4 address in the domain part, but you forgot the square brackets and you forgot about IPv6.

One problem with using a single regular expression for input validation is that it gives only a yes-or-no result. It can't distinguish between different ways that an email address can be invalid or unsupported. To compensate you have to write a long error message that enumerates all the possible reasons for rejecting the input.

As you might have guessed by now I don't agree with the "EasyFix" label. I suspect that you have a fair bit of coding to do before you can give accurate error messages.

@abdelgmartinezl
Copy link

Understood and agree all you mention. Maybe we can work it as multiple scenario. First, solve a regex that satisfies an English domain (first scenario). If an error appears, then try another regex that satisfies a non-English domain (second scenario) and so on. What do you think of this approach?

@pypingou
Copy link
Member

pypingou commented Aug 8, 2014

@Rombobeorn You seem to know the context and problem quite well, could you advise us on what would be the best way to deal with this?
You say regex isn't so what is?

@Rombobeorn
Copy link
Author

Hi. Sorry for the long silence. I've been very busy lately.

I can't give very specific advice without knowing what components you use to send email, what addresses those components can handle, what their APIs are like, how your error handling works, and so on – and I don't have time to read your code to find out.

Validating an email address needs to be done in multiple steps, and different error messages are needed for different errors. Therefore a single regex isn't an adequate solution, but regexes may well be used in some of the steps.

One of the validation steps is to actually send email to the address, which means that it will be passed to some email-handling library, or at least to a mail server, and will be validated there. It would make sense to leverage that validation rather than duplicating it, but it depends on how well that library or server reports errors. I think you should start by finding answers to the questions in my previous comment to get a better idea of what you can delegate and what you need to implement. If you're using a good library you might find that you only need to check that the input is a valid UTF-8 string containing exactly one @, or something like that.

@pypingou
Copy link
Member

Maybe we could try using this library to help us: https://github.com/clarete/emailverifier @Rombobeorn what do you think?

@abdelgmartinezl
Copy link

Hi @pypingou! I have been playing around with https://github.com/syrusakbary/validate_email, perhaps could help us here. Does it worth it?

@robert-scheck
Copy link

Given that there are IDNA 2003 and IDNA 2008, it would be important that especially IDNA 2008 is supported (think of "straße.tld" which gets "strasse.tld" on IDNA 2003, but "xn--strae-oqa.tld" on IDNA 2008).

IIRC more than one @ in an e-mail address is allowed (in the local part), too. By the way, spaces in the local part of e-mail addresses are also allowed. Oh and finally there is also UTF8SMTP.

@ralphbean
Copy link

It would make sense to leverage that validation rather than duplicating it

This does make a lot of sense.. but I didn't know how to go about it. But I just found this, which I think will do:

>>> import smtplib
>>> server = smtplib.SMTP("bastion")  # our mail sending host
>>> server.verify("ralph@fedoraproject.org")
(252, '2.0.0 ralph@fedoraproject.org')

it depends on how well that library or server reports errors.

I think the error we get back here is going to be good enough to surface to users.

>>> server.verify("not-a-real-user-for-sure@fedoraproject.org")
(550, '5.1.1 <not-a-real-user-for-sure@fedoraproject.org>: Recipient address rejected: User unknown in local recipient table')
>>> server.verify("Bjorn@Rombobjörn.se")
(501, '5.1.3 Bad recipient address syntax')
>>> server.verify("Bjorn@Rombobjorn.se")
(252, '2.0.0 Bjorn@Rombobjorn.se')

Note that "full validation" at this step would involve really sending an email to the user and we do do that in a subsequent step. This will do for the preliminary step, I think.

@puiterwijk
Copy link

This sounds like a nice solution to me, as we'll automatically upgrade/downgrade the email addresses supported based on the mail server we're using (if it doesn't support international domains, we can't either).

ralphbean added a commit to fedora-infra/fmn.lib that referenced this issue Mar 24, 2015
...instead of trying to do it ourselves with a regex.

This should fix fedora-infra/fmn#21 but will require one more change to
fmn.web.
@robert-scheck
Copy link

@puiterwijk, I think the conversion from UTF-8 to ACE should usually happen on client side not on MTA side. Instead of passing UTF-8 we IMHO should pass ACE to server.verify().

jeremycline pushed a commit to jeremycline/fmn that referenced this issue Dec 7, 2016
…-date

Add a relative arrow date to the irc message
jeremycline pushed a commit to jeremycline/fmn that referenced this issue Dec 7, 2016
jeremycline pushed a commit to jeremycline/fmn that referenced this issue Dec 7, 2016
…echa

If a rule throws an exception, then the match should fail.
jeremycline pushed a commit that referenced this issue Apr 26, 2017
need to have endpoints so fedmsg load_configs doesnt complain
jeremycline pushed a commit that referenced this issue Apr 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants