Yesterday I was working on a project that required heavy scraping of HTML page content.
I wrote several PowerShell scripts that scraped the content using Invoke-WebRequest and Invoke-RestMethod. Everything was great and smooth… until I got to an HTML page with some Greek letters inside. To my surprise, both built-in PowerShell cmdlets failed miserably when I tried to retrieve those UTF-8 encoded pages. In short, I was bombarded with â and ¢ here and there.
So, after I lost several hours trying to figure out what was going on and experimenting with all kinds of options, it turned out that it's impossible to properly read a UTF-8 encoded page without a BOM using Invoke-WebRequest.
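To illustrate where the junk characters come from, here is a minimal sketch that needs no network access. It assumes the cmdlet ends up decoding the response with a single-byte charset such as ISO-8859-1; that is my reading of the symptoms, not something I verified in the cmdlet's source. The sample string and variable names are made up for the demo.

```powershell
# Build a Greek sample string (alpha, beta, gamma) from code points,
# so the demo does not depend on how this script file itself is encoded.
$text = -join ([char[]](0x03B1, 0x03B2, 0x03B3))

# These are the bytes a server would send for a UTF-8 page.
# GetBytes does not emit a BOM.
$bytes = [System.Text.Encoding]::UTF8.GetBytes($text)

# Assumption: the bytes get decoded as ISO-8859-1 instead of UTF-8.
# Every Greek letter (2 bytes in UTF-8) turns into 2 junk characters.
$wrong = [System.Text.Encoding]::GetEncoding('ISO-8859-1').GetString($bytes)

# Decoding the very same bytes as UTF-8 restores the original text.
$right = [System.Text.Encoding]::UTF8.GetString($bytes)

"Bytes received : $($bytes.Length)"
"Wrong decoding : $($wrong.Length) characters"
"Right decoding : $($right.Length) characters"
```

The point is that the bytes on the wire are fine; only the decoding step goes wrong, which is why working with the raw bytes and an explicit UTF-8 decoder gets around the problem.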
Here is a simple function I wrote which uses .NET classes to tackle the problem.
Note that this is just a simple example, and it lacks the extensive functionality you get with Invoke-WebRequest.