Question:
In this HTML code:
I want to retrieve the text IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip, or just the date and hour between IP_PHONE_BACKUP- and .zip.
How can I do that?
Answer:
What makes this question so interesting is that HTML looks and smells just like XML, the latter being much easier to handle programmatically thanks to its well-behaved and orderly structure. In an ideal world HTML would be a subset of XML, but HTML in the real world is emphatically not XML. If you feed the example in the question into any XML parser, it will balk at a variety of infractions. That being said, the desired result can be achieved with a single line of PowerShell. This one returns the whole text of the href:
    Select-NodeContent $doc.DocumentNode "//a/@href"
And this one extracts the desired substring:
    Select-NodeContent $doc.DocumentNode "//a/@href" "IP_PHONE_BACKUP-(.*)\.zip"
The catch, however, is in the overhead/setup to be able to run that one line of code. You need to:
- Install HtmlAgilityPack to make HTML parsing look just like XML parsing.
- Install PowerShell Community Extensions if you want to parse a live web page.
- Understand XPath to be able to construct a navigable path to your target node.
- Understand regular expressions to be able to extract a substring from your target node.
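To make the last two items concrete, here is a minimal sketch of the XPath and the regular expression working together. It deliberately sidesteps HtmlAgilityPack by using PowerShell's built-in [xml] cast on a tiny well-formed fragment, which real-world HTML usually is not. The fragment itself is an assumption, since the asker's actual markup may differ, and $href and $stamp are illustrative variable names.

```powershell
# A well-formed stand-in for the page (assumed markup, not the asker's real HTML).
$xml = [xml]'<html><body><a href="IP_PHONE_BACKUP-2012-Jul-25_15:47:47.zip">backup</a></body></html>'

# XPath: "//a" finds an anchor anywhere in the document; "/@href" selects its href attribute.
$href = $xml.SelectSingleNode('//a/@href').Value

# Regex: the parentheses capture whatever sits between the prefix and ".zip".
if ($href -match 'IP_PHONE_BACKUP-(.*)\.zip') {
    $stamp = $Matches[1]
}

$href    # the whole attribute text
$stamp   # just the captured date and time portion
```

The same XPath and the same pattern are what the one-liners above pass to Select-NodeContent; the difference is that HtmlAgilityPack tolerates markup that an XML parser would reject.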
With those requirements satisfied you can add the HtmlAgilityPack type to your environment and define the Select-NodeContent function, both shown below. The very end of the code shows how you assign a value to the $doc variable used in the above one-liners. I show how to load HTML from a file or from the web, depending on your needs.
    Set-StrictMode -Version Latest

    # Load the HtmlAgilityPack assembly from a "bin" folder next to your profile script
    $HtmlAgilityPackPath = [System.IO.Path]::Combine((Get-Item $PROFILE).DirectoryName, "bin\HtmlAgilityPack.dll")
    Add-Type -Path $HtmlAgilityPackPath

    function Select-NodeContent(
        [HtmlAgilityPack.HtmlNode]$node,
        [string] $xpath,
        [string] $regex,
        [Object] $default = "")
    {
        if ($xpath -match "(.*)/@(\w+)$") {
            # If standard XPath to retrieve an attribute is given,
            # map to supported operations to retrieve the attribute's text.
            ($xpath, $attribute) = $matches[1], $matches[2]
            $resultNode = $node.SelectSingleNode($xpath)
            # Plain if/else, so the function itself needs no PSCX helpers
            $text = if ($resultNode) { $resultNode.Attributes[$attribute].Value } else { $default }
        }
        else {
            # retrieve an element's text
            $resultNode = $node.SelectSingleNode($xpath)
            $text = if ($resultNode) { $resultNode.InnerText } else { $default }
        }

        # If a regex is given, use it to extract a substring from the text
        if ($regex) {
            if ($text -match $regex) { $text = $matches[1] }
            else { $text = $default }
        }
        return $text
    }

    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $result = $doc.Load("tmp\temp.html") # Use this to load a file
    #$result = $doc.LoadHtml((Get-HttpResource $url)) # Use this PSCX cmdlet to load a live web page
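Since the question asks for "the date and hour", you may want a real DateTime rather than a string. The following is only a sketch: it assumes the extracted text always has the shape 2012-Jul-25_15:47:47 (English month abbreviation, 24-hour clock), and $stamp and $when are illustrative names that do not appear in the code above.

```powershell
# Stand-in for the value returned by Select-NodeContent with the regex argument.
$stamp = '2012-Jul-25_15:47:47'

# Parse with the invariant culture so "Jul" is recognized regardless of the machine's locale.
$when = [datetime]::ParseExact(
    $stamp,
    'yyyy-MMM-dd_HH:mm:ss',
    [System.Globalization.CultureInfo]::InvariantCulture)

$when   # a DateTime you can compare, sort, or reformat
```

If the backup names ever use a different format, ParseExact will throw, which is arguably preferable to silently misreading the date.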