Update: this was caused by a bug in libxml2, which has in the meantime been reported and fixed. https://bugzilla.gnome.org/show_bug.cgi?id=795299
I have an HTML page with two tables. The first has a row that mixes a header cell and a data cell; the second has a row with only header cells:
<html>
<head>
<title>Table test page</title>
</head>
<body>
<table>
<tr>
<th>Cell 1</th>
<td>Cell 2</td>
</tr>
</table>
<table>
<tr>
<th>Cell 1</th>
<th>Cell 2</th>
</tr>
</table>
</body>
</html>
I am using the following XPath expression to select the table cells from the first row of the first table by taking the union of the results for <th> and <td>. This works fine: I get both cells back. Note that the first table combines <th> and <td> in a single row.
$ xmllint --xpath '(((descendant-or-self::html/descendant-or-self::table)[1]//tr)[1]/td|((descendant-or-self::html/descendant-or-self::table)[1]//tr)[1]/th)' test.html
<th>Cell 1</th><td>Cell 2</td>
When I try to get the first cell by appending [1] to the expression, I get the second cell back instead of the first. Why?
The expected result of this expression is <th>Cell 1</th>:
$ xmllint --xpath '(((descendant-or-self::html/descendant-or-self::table)[1]//tr)[1]/td|((descendant-or-self::html/descendant-or-self::table)[1]//tr)[1]/th)[1]' test.html
<td>Cell 2</td>
Appending [2] yields the second cell, as expected. But why does [1] not yield the first?
$ xmllint --xpath '(((descendant-or-self::html/descendant-or-self::table)[1]//tr)[1]/td|((descendant-or-self::html/descendant-or-self::table)[1]//tr)[1]/th)[2]' test.html
<td>Cell 2</td>
Similarly, for the second table I can get all cells with this expression. The second table contains only <th> cells in its first row.
$ xmllint --xpath '(((descendant-or-self::html/descendant-or-self::table)[2]//tr)[1]/td|((descendant-or-self::html/descendant-or-self::table)[2]//tr)[1]/th)' test.html
<th>Cell 1</th><th>Cell 2</th>
However, when I append [1] to get only the first cell back, I get an empty result. The expected result for this expression is <th>Cell 1</th>.
$ xmllint --xpath '(((descendant-or-self::html/descendant-or-self::table)[2]//tr)[1]/td|((descendant-or-self::html/descendant-or-self::table)[2]//tr)[1]/th)[1]' test.html
XPath set is empty
The second cell can be retrieved successfully by appending [2]. But why not the first?
$ xmllint --xpath '(((descendant-or-self::html/descendant-or-self::table)[2]//tr)[1]/td|((descendant-or-self::html/descendant-or-self::table)[2]//tr)[1]/th)[2]' test.html
<th>Cell 2</th>
Note that after replacing the leading descendant-or-self:: with // the expression seems to work as expected in all cases, for both the first and the second table:
$ xmllint --xpath '(((//html/descendant-or-self::table)[1]//tr)[1]/td|((//html/descendant-or-self::table)[1]//tr)[1]/th)' test.html
<th>Cell 1</th><td>Cell 2</td>
$ xmllint --xpath '(((//html/descendant-or-self::table)[1]//tr)[1]/td|((//html/descendant-or-self::table)[1]//tr)[1]/th)[1]' test.html
<th>Cell 1</th>
$ xmllint --xpath '(((//html/descendant-or-self::table)[1]//tr)[1]/td|((//html/descendant-or-self::table)[1]//tr)[1]/th)[2]' test.html
<td>Cell 2</td>
$ xmllint --xpath '(((//html/descendant-or-self::table)[2]//tr)[1]/td|((//html/descendant-or-self::table)[2]//tr)[1]/th)' test.html
<th>Cell 1</th><th>Cell 2</th>
$ xmllint --xpath '(((//html/descendant-or-self::table)[2]//tr)[1]/td|((//html/descendant-or-self::table)[2]//tr)[1]/th)[1]' test.html
<th>Cell 1</th>
$ xmllint --xpath '(((//html/descendant-or-self::table)[2]//tr)[1]/td|((//html/descendant-or-self::table)[2]//tr)[1]/th)[2]' test.html
<th>Cell 2</th>
Here is the same test case implemented in PHP, which also uses libxml2: https://3v4l.org/WuPSt
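To make the expected results concrete without going through libxml2 at all, here is a minimal sketch in Python using only the standard library. It does not evaluate the XPath expression; it just walks the tree in document order, which is the order in which I would expect a positional predicate such as [1] to count the nodes of the union. The helper names `first_row_cells` and `serialize` are made up for this illustration, and the markup is an inlined copy of the test page that uses the cell text from the xmllint output ("Cell 1", "Cell 2").

```python
import xml.etree.ElementTree as ET

# Inlined copy of the test page (assumption: cell text as in the xmllint output).
HTML = """<html>
<head>
<title>Table test page</title>
</head>
<body>
<table>
<tr>
<th>Cell 1</th>
<td>Cell 2</td>
</tr>
</table>
<table>
<tr>
<th>Cell 1</th>
<th>Cell 2</th>
</tr>
</table>
</body>
</html>"""


def first_row_cells(root, table_index):
    """Return the th/td children of the first row of the given table.

    table_index is 1-based, like an XPath position. Children are returned
    in document order, which is how the union td|th should be ordered.
    """
    tables = list(root.iter("table"))
    first_row = next(tables[table_index - 1].iter("tr"))
    return [cell for cell in first_row if cell.tag in ("th", "td")]


def serialize(cell):
    """Serialize one cell, dropping the trailing newline kept in its tail."""
    return ET.tostring(cell, encoding="unicode").strip()


root = ET.fromstring(HTML)
for table_index in (1, 2):
    cells = first_row_cells(root, table_index)
    print("table", table_index, "all:", "".join(serialize(c) for c in cells))
    print("table", table_index, "[1]:", serialize(cells[0]))
    print("table", table_index, "[2]:", serialize(cells[1]))
```

With this walk, position 1 is the <th> cell in both tables, which matches what the // variant of the expression returns and what I expected from the descendant-or-self:: variant.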