When the content-type of the server does not specify a charset, requests.get() returns improperly encoded data.
However, if we have the content type explicitly as 'Content-Type:text/html; charset=utf-8', it returns properly encoded data.
Also, when we use urllib.urlopen(), it returns properly encoded data.
Has anyone noticed this before? Why does requests.get() behave like this?
From the Requests documentation:
When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
Check which encoding requests used for your page and, if it is not the right one, force it to the one you need by setting r.encoding before you read r.text.
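To make the symptom concrete, here is a minimal sketch using only the standard library. It assumes the page body is really UTF-8 and that the header carries no charset, which as far as I know is the case where requests falls back to ISO-8859-1 for text/* responses. The helper names declared_charset and decode_body are made up for this illustration and are not part of requests.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(body, content_type, fallback="utf-8"):
    """Decode raw bytes with the declared charset, else a chosen fallback.

    The fallback is an assumption about the page; pick whatever your
    server actually sends.
    """
    charset = declared_charset(content_type) or fallback
    return body.decode(charset, errors="replace")


raw = "café".encode("utf-8")

# No charset in the header, so nothing tells the client how to decode.
print(declared_charset("text/html"))                 # None
print(declared_charset("text/html; charset=utf-8"))  # utf-8

# Decoding UTF-8 bytes as ISO-8859-1 produces the typical garbled text.
print(raw.decode("ISO-8859-1"))                      # cafÃ©

# Forcing the right encoding gives the expected result.
print(decode_body(raw, "text/html"))                 # café
```

With requests itself, the equivalent fix is to set r.encoding = 'utf-8' (or r.encoding = r.apparent_encoding if you would rather let it guess from the body) before touching r.text.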
Regarding the differences between requests.get() and urllib.urlopen(): they probably use different ways to guess the encoding. That's all.