Retrieve text in HTML with PowerShell

Question:

In this HTML code:

I want to retrieve the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip, or just the date and time between IP_PHONE_BACKUP- and .zip.

How can I do that?

Answer:

What makes this question so interesting is that HTML looks and smells just like XML, and XML is much easier to process programmatically because of its well-behaved, orderly structure. In an ideal world HTML would be a subset of XML, but real-world HTML is emphatically not XML. If you feed the example in the question into any XML parser, it will balk at a variety of infractions. That said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href:

And this one extracts the desired substring:

The catch, however, is the setup required before you can run that one line of code. You need to:

  • Install HtmlAgilityPack to make HTML parsing look just like XML parsing.
  • Install PowerShell Community Extensions if you want to parse a live web page.
  • Understand XPath to be able to construct a navigable path to your target node.
  • Understand regular expressions to be able to extract a substring from your target node.
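The regular-expression step can be tried on its own, without any HTML parsing. The following is a minimal sketch, assuming the href value has already been obtained as a plain string; the variable names ($href, $pattern, $stamp, $when) are illustrative, and the date format string is an assumption based on the sample file name in the question.

```powershell
# Sample value from the question; in practice it would come from the parsed HTML.
$href = 'IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip'

# Capture everything between the fixed prefix and the .zip extension.
$pattern = '^IP_PHONE_BACKUP-(?<stamp>.+)\.zip$'

$stamp = $null
$when  = $null
if ($href -match $pattern) {
    # -match fills the automatic $Matches table with the named group.
    $stamp = $Matches['stamp']    # 2012-Jul-25_15:47:47

    # Optional: turn the captured text into a DateTime (assumed format).
    $when = [datetime]::ParseExact(
        $stamp,
        'yyyy-MMM-dd_HH:mm:ss',
        [cultureinfo]::InvariantCulture)
}

$stamp
$when
```

Using a named group keeps the extraction readable, and parsing with the invariant culture avoids depending on the local month abbreviations.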

With those requirements satisfied, you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, both shown below. The very end of the code shows how to assign a value to the $doc variable used in the one-liners above. I show how to load HTML from a file or from the web, depending on your needs.
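For orientation, here is a rough sketch of how such a setup might look. It is not the answer's original Select-NodeContent code. The DLL path, the file path, the URL, and the helper name Get-HrefText are all hypothetical, and the sketch assumes the file name appears in the href attribute of an anchor element. It uses HtmlAgilityPack's HtmlWeb class for the live page rather than PowerShell Community Extensions.

```powershell
# Assumption: HtmlAgilityPack.dll has been downloaded to this (hypothetical) path.
Add-Type -Path 'C:\Tools\HtmlAgilityPack\HtmlAgilityPack.dll'

# Hypothetical helper: returns the href of every anchor under $Node,
# or only the first capture group when a regex pattern is supplied.
function Get-HrefText {
    param(
        [Parameter(Mandatory)] $Node,
        [string] $Pattern
    )
    # SelectNodes returns $null when nothing matches the XPath.
    $anchors = $Node.SelectNodes('//a[@href]')
    if ($null -eq $anchors) { return }
    foreach ($a in $anchors) {
        $href = $a.GetAttributeValue('href', '')
        if (-not $Pattern) {
            $href
        } elseif ($href -match $Pattern) {
            $Matches[1]
        }
    }
}

# Load HTML from a local file (hypothetical path) ...
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.Load('C:\temp\page.html')

# ... or from a live page (hypothetical URL):
# $doc = (New-Object HtmlAgilityPack.HtmlWeb).Load('http://example.com/backups')

Get-HrefText -Node $doc.DocumentNode
Get-HrefText -Node $doc.DocumentNode -Pattern 'IP_PHONE_BACKUP-(.*)\.zip'
```

The first call lists each href in full, and the second returns only the text between the prefix and the extension.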

Source:

Retrieve text in HTML with powershell, licensed under CC BY-SA
