Scraping data from an SSR React app

Jul 27, 2024

I recently worked on a project where I needed numeric data from a table. The UI was formatting values (rounding and abbreviating), and some fields I needed were not rendered in the visible row at all. There were no useful network requests to hook into, so I needed a different approach.

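I started with React DevTools, since it lets you inspect component props and state. To use it from an automated browser, I launched Puppeteer in headed mode with the unpacked extension loaded: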

jsx

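import puppeteer from "puppeteer";

// Launch a visible browser and load the unpacked extension from ./extension/.
const browser = await puppeteer.launch({
  headless: false, // headed mode, so the extension and DevTools panel are usable
  devtools: true, // open DevTools automatically for each tab
  args: [
    "--disable-setuid-sandbox",
    "--no-sandbox",
    "--disable-extensions-except=./extension/",
    "--load-extension=./extension/",
  ],
});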

My first idea was to use the React DevTools global hook directly in the page context, but selecting the right component reliably became painful.
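To give a sense of why this route gets painful, here is a rough sketch of what walking React's internal tree by hand looks like. It is not the code from this project. It assumes fiber nodes expose `child`, `sibling`, `type`, and `memoizedProps`, which are React internals rather than public API and can change between versions. `collectPropsByName` is a hypothetical helper name.

```javascript
// Sketch: collect props from every fiber whose component name matches.
// `child`, `sibling`, `type`, and `memoizedProps` are React internals, not public API.
function collectPropsByName(rootFiber, name) {
  const results = [];
  const stack = rootFiber ? [rootFiber] : [];
  while (stack.length > 0) {
    const fiber = stack.pop();
    const type = fiber.type;
    // Function and class components carry their (possibly minified) name on `type`.
    const typeName =
      typeof type === "function" ? type.displayName || type.name : null;
    if (typeName === name) results.push(fiber.memoizedProps);
    // Push the sibling first so the child subtree is visited before it.
    if (fiber.sibling) stack.push(fiber.sibling);
    if (fiber.child) stack.push(fiber.child);
  }
  return results;
}

// Hypothetical usage in the page, assuming the DevTools hook is installed:
// const hook = window.__REACT_DEVTOOLS_GLOBAL_HOOK__;
// for (const id of hook.renderers.keys())
//   for (const root of hook.getFiberRoots(id))
//     collectPropsByName(root.current, "eE");
```

Even in this simplified form you have to find the right renderer and root, and match on names that may be minified. A query library takes that work off your hands.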

After digging around, I found this Hacker News thread: https://news.ycombinator.com/item?id=24898016

One of the comments pointed to RESQ, a library for querying React components. That ended up being the missing piece.

jsx

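import fs from "node:fs";
import path from "node:path";
import puppeteer from "puppeteer";

// Read the RESQ browser bundle so it can be injected into the page.
const resqScriptPath = path.resolve("./node_modules/resq/dist/index.js");
const resqScript = fs.readFileSync(resqScriptPath, "utf8");

const browser = await puppeteer.launch({
  headless: true,
  devtools: false,
  args: ["--disable-setuid-sandbox", "--no-sandbox"],
});

// Placeholder: the URL of the page being scraped, read from the environment.
const page = await browser.newPage();
await page.goto(process.env.TARGET_URL);

// Evaluate the bundle in the page, then wait until window.resq is available.
await page.evaluate(resqScript);
await page.waitForFunction(() => window.resq);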

Then I queried the component props for each row:

jsx

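const data = await page.evaluate(() => {
  // #__next is the root element that Next.js renders into.
  const root = document.getElementById("__next");
  // Find every component named "eE" under the root.
  const components = window.resq.resq$$("eE", root);

  // Iterate by index and return each matched component's props.
  return Object.keys(components).map((key) => {
    const component = components[key];
    return component.props;
  });
});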

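page.evaluate() runs JavaScript inside the page and sends the return value back to Node. Puppeteer serializes that value itself, so calling JSON.stringify first is optional as long as the value is serializable. If it is not, page.evaluate() resolves to undefined instead.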
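Props often contain callbacks or references back into the component tree, and those may not survive the trip out of the page. One way to guard against that is to reduce the props to plain data before returning them. The sketch below is one possible approach, not part of Puppeteer or RESQ. `toSerializable` is a hypothetical helper name.

```javascript
// Sketch: reduce a props object to plain JSON-safe data.
// `toSerializable` is a hypothetical helper, not part of Puppeteer or RESQ.
function toSerializable(value) {
  const seen = new WeakSet();
  const json = JSON.stringify(value, (key, val) => {
    // Functions and symbols cannot be serialized, so drop them.
    if (typeof val === "function" || typeof val === "symbol") return undefined;
    // JSON.stringify throws on BigInt, so convert it to a string.
    if (typeof val === "bigint") return val.toString();
    if (val !== null && typeof val === "object") {
      // Drop any object already visited. This removes circular references,
      // and also repeated non-circular ones, which is acceptable for a sketch.
      if (seen.has(val)) return undefined;
      seen.add(val);
    }
    return val;
  });
  // JSON.stringify returns undefined for a bare function or symbol.
  return json === undefined ? null : JSON.parse(json);
}
```

Because page.evaluate() runs its callback in the browser, a helper like this has to be defined inside that callback (or injected into the page beforehand) and applied to each `component.props` before returning.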

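In this case, the "eE" selector came from inspecting component names in React DevTools. It looks like a minified name, so it will probably change when the site ships a new build and is worth re-checking before each scrape.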

Using Puppeteer + RESQ gave me access to unformatted values and hidden fields that were not available in the rendered table. For React-heavy apps, this is a useful fallback when request-based scraping is not possible.