Read HTML table with C#

Kmcnet 1,181 Reputation points
2025-08-22T21:47:24.9233333+00:00

Hello everyone, and thanks in advance for the help. I wrote a screen-scraping program that returns an HTML table with a variable number of rows, and I need to extract data from it. I'm not sure of the best way to proceed, i.e. regex or some other method. Here is what the table looks like:

                    <table id="pcpHistoryTable" class="table table-striped table-bordered table-condensed unit size1of2">
                        <thead>
                        <tr>
                            <th>Name</th>
                            <th width="28%">Start Date</th>
                            <th width="28%">End Date</th>
                        </tr>
                        </thead>
                        <tbody>
			<tr>
				<td>John Doe</td><td>Feb 1, 2025</td><td>Current</td></tr>
				<tr><td>Jane Doe</td><td>Apr 1, 2022</td><td>Jan 31, 2025</td>
			</tr>
			</tbody>
                    </table>

Any help would be appreciated.

Developer technologies | C#

Accepted answer
  1. Marcin Policht 54,995 Reputation points MVP Volunteer Moderator
    2025-08-22T22:30:33.11+00:00

    Since you're scraping a table with structured data, regex is likely not the best choice: HTML can get messy, and regex doesn't handle nested tags or malformed markup well. Instead, use an HTML parser library, which will give you the rows and cells cleanly.
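
To make the regex concern concrete, here is a small sketch. The row markup, the pattern, and the `NaiveCells` helper are illustrative assumptions, not taken from the page being scraped. A naive pattern silently skips any cell whose opening tag carries an attribute:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class RegexPitfall
{
    // Naive extraction: assumes every cell opens with a bare <td> tag.
    public static string[] NaiveCells(string html) =>
        Regex.Matches(html, "<td>(.*?)</td>")
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToArray();

    static void Main()
    {
        // Hypothetical row: the first cell has a class attribute.
        var row = "<tr><td class=\"name\">John Doe</td><td>Feb 1, 2025</td></tr>";
        var cells = NaiveCells(row);

        // Only the second cell is found; the name cell is missed without any error.
        Console.WriteLine(cells.Length);
        Console.WriteLine(cells[0]);
    }
}
```

The pattern can be patched to allow attributes, but each patch only covers one more variation of the markup, which is the maintenance burden a parser avoids.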

    C# (using HtmlAgilityPack)

    using HtmlAgilityPack;
    using System;
    using System.Linq;
    
    class Program
    {
        static void Main()
        {
            var html = @"<table id='pcpHistoryTable'>
                <tbody>
                    <tr><td>John Doe</td><td>Feb 1, 2025</td><td>Current</td></tr>
                    <tr><td>Jane Doe</td><td>Apr 1, 2022</td><td>Jan 31, 2025</td></tr>
                </tbody>
            </table>";
    
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
    
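            // SelectNodes returns null (not an empty collection) when nothing matches.
            var rows = doc.DocumentNode
                          .SelectNodes("//table[@id='pcpHistoryTable']//tbody//tr");
            if (rows == null) return;
    
            foreach (var row in rows)
            {
                var tds = row.SelectNodes("td");
                if (tds == null) continue; // skip rows without <td> cells
                var cells = tds.Select(td => td.InnerText.Trim()).ToList();
                Console.WriteLine(string.Join(" | ", cells));
            }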
        }
    }
    

    Output:

    John Doe | Feb 1, 2025 | Current
    Jane Doe | Apr 1, 2022 | Jan 31, 2025
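
Once the cell text is extracted, typed values are usually more useful than strings. The following is a minimal sketch, not part of the original answer: `PcpEntry`, `PcpRowParser`, and `ParseRow` are hypothetical names, and it assumes dates always look like `Feb 1, 2025` and that any non-date text in the last column (such as `Current`) means there is no end date. It works on the list of cell strings built in the loop above, so it needs no extra library.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical type for one table row; EndDate is null while the entry is current.
public sealed class PcpEntry
{
    public string Name { get; set; } = "";
    public DateTime StartDate { get; set; }
    public DateTime? EndDate { get; set; }
}

public static class PcpRowParser
{
    // Assumed date layout, e.g. "Feb 1, 2025" or "Jan 31, 2025".
    private const string DateFormat = "MMM d, yyyy";

    private static bool TryDate(string text, out DateTime value) =>
        DateTime.TryParseExact(text.Trim(), DateFormat, CultureInfo.InvariantCulture,
                               DateTimeStyles.None, out value);

    // Throws FormatException when the row is not Name / Start Date / End Date.
    public static PcpEntry ParseRow(IReadOnlyList<string> cells)
    {
        if (cells.Count != 3)
            throw new FormatException("Expected 3 cells but found " + cells.Count + ".");
        if (!TryDate(cells[1], out var start))
            throw new FormatException("Unreadable start date: " + cells[1]);

        // Anything that is not a date (such as "Current") is treated as "no end date".
        DateTime? end = TryDate(cells[2], out var parsedEnd) ? parsedEnd : (DateTime?)null;

        return new PcpEntry { Name = cells[0].Trim(), StartDate = start, EndDate = end };
    }
}
```

Inside the `foreach` loop, `PcpRowParser.ParseRow(cells)` could replace the `Console.WriteLine` call. Parsing with `CultureInfo.InvariantCulture` keeps the English month abbreviations working regardless of the machine's regional settings.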
    

    If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.

    hth

    Marcin


8 additional answers

  1. Castorix31 91,066 Reputation points
    2025-08-22T23:19:16.6533333+00:00

    One way is with System.Xml

    With help from ChatGPT:

        string sHtml = @"<table id=""pcpHistoryTable"" class=""table table-striped table-bordered table-condensed unit size1of2"">
                <thead>
                <tr>
                    <th>Name</th>
                    <th width=""28%"">Start Date</th>
                    <th width=""28%"">End Date</th>
                </tr>
                </thead>
                <tbody>
    <tr>
    	<td>John Doe</td><td>Feb 1, 2025</td><td>Current</td></tr>
    	<tr><td>Jane Doe</td><td>Apr 1, 2022</td><td>Jan 31, 2025</td>
    </tr>
    </tbody>
            </table>";
    
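        // Requires: using System.Diagnostics; using System.Linq;
        // LoadXml only accepts well-formed XML, so the markup is wrapped in a single root element.
        var xml = new System.Xml.XmlDocument();
        xml.LoadXml($"<root>{sHtml}</root>");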
        foreach (System.Xml.XmlNode row in xml.SelectNodes("//table[@id='pcpHistoryTable']//tr"))
        {
            var cells = row.SelectNodes("th|td");
            if (cells == null) continue;
            Debug.WriteLine(string.Join(" | ", cells.Cast<System.Xml.XmlNode>().Select(c => c.InnerText.Trim())));
        }
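
One caveat with this approach: `XmlDocument.LoadXml` expects well-formed XML, and scraped HTML often is not (void tags such as `<br>`, entities such as `&nbsp;`). The sketch below uses made-up fragments and a hypothetical `TryLoad` helper to show the difference. It illustrates the general limitation and says nothing about the specific page in the question, whose sample table does happen to be well-formed.

```csharp
using System;
using System.Xml;

class XmlStrictness
{
    // Returns true when the fragment loads as XML, false when LoadXml rejects it.
    public static bool TryLoad(string fragment)
    {
        try
        {
            new XmlDocument().LoadXml($"<root>{fragment}</root>");
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        // Well-formed: every tag is closed.
        Console.WriteLine(TryLoad("<table><tr><td>John Doe</td></tr></table>"));
        // Typical HTML: <br> is never closed, so the XML parser rejects it.
        Console.WriteLine(TryLoad("<table><tr><td>John<br>Doe</td></tr></table>"));
        // &nbsp; is an HTML entity that XML does not define.
        Console.WriteLine(TryLoad("<table><tr><td>John&nbsp;Doe</td></tr></table>"));
    }
}
```

The first call succeeds and the other two fail, so this route is safest when the markup is known to be XML-clean; otherwise an HTML parser is the more forgiving choice.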
    
    

  2. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.


