How to read an HTML page with UTF8 encoding using PowerShell

Yesterday I was working on a project that requires heavy HTML pages content scraping.

What I wrote were several PowerShell files which were scraping the content using Invoke-WebRequest and Invoke-RestMethod. And everything was great and smooth… until I got to an HTML page with some Greek letters inside. To my surprise both PowerShell built in functions failed miserably when I tried to retrieve those UTF8 encoded pages. In short, I was bombarded with ℠and ¢ here and there.

So, after I lost several hours trying to figure out what’s going on and experimenting with all kinds of options it turned out It’s impossible to read properly a UTF8 encoded page without BOM with Invoke-WebRequest.

Here is a simple function I wrote which uses .NET classes to tackle the problem.

Note that this is just a simple example and it lacks the extensible functionality you get with Invoke-WebRequest.

Leave a Reply