Whenever scraping is talked about, XPath, CSS selectors, or RegEx usually come up, but there is another way to extract content from a site: intercepting the calls that a site such as Google makes to its own services.

Let’s use Google as an example of how we can achieve that. Purely for practical reasons, obviously.

For those who are not familiar, Scraping is a practice that involves extracting information from one or several sites. Although the legality of this practice is a bit gray, companies of all types do it, with Google being the king of it.

Googlebot is the most active bot on the internet.

How can scraping be done without needing the DOM?

Identify the endpoint calls

Let’s take the last script I made as an example, which extracts data from autocomplete.

The script: https://github.com/NachoSEO/google-autocomplete-extractor

The first thing to do is to see what calls the site makes when certain actions happen.

For example, in Google, when we type something in the search bar and autocomplete appears, a call is made to request that data. You can see it in Dev Tools > Network > Fetch/XHR.

Emulate the same payload

Once we have identified which request the site is making, we click on it to see what payload it is sending.

That is the body of the request and what it needs to receive in order to return the autocomplete data.

Let’s continue.

To be able to quickly emulate that call, we can right-click on the URL and select copy > copy as cURL.

This way, we have the entire call copied so we can paste it into our terminal.

Fetch locally with cURL

We open the terminal and paste the call to see which elements we will need and which ones we won’t (in order to format it in the language of our choice, whether it be Python, NodeJS, or another).

curl 'https://www.google.com/complete/search?q=link%20building&cp=13&client=gws-wiz&xssi=t&hl=en-ES&authuser=0&psi=PigLZIWHI9iYkdUPxp2_yAs.1678452798806&dpr=2' \
-H 'authority: www.google.com' \
-H 'accept: */*' \
-H 'accept-language: es-ES,es;q=0.9,en;q=0.8,ca;q=0.7,fr;q=0.6' \
-H 'cookie: CONSENT=PENDING+991; SEARCH_SAMESITE=CgQIlJcB; OTZ=6913057_52_52_123900_48_436380; AEC=ARSKqsJXaqgLDe6YwmWFTloKnjeR_pCnzMZUJhlTq-Ew0RDKP7G7ltgg6w; SID=UQgvW7iZEjY_02qA9q9PPjsUUFBGYgl9qFT6J86LteGaiznHD5KyLF4HCoxRR8GiLPn1UA.; __Secure-1PSID=UQgvW7iZEjY_02qA9q9PPjsUUFBGYgl9qFT6J86LteGaiznHn7YrSdJ1zIZKVAEdo-FZBw.; __Secure-3PSID=UQgvW7iZEjY_02qA9q9PPjsUUFBGYgl9qFT6J86LteGaiznH8BfVTB2FoRBrwSyOIbhsog.; HSID=A0drwesvEjmVEgxbT; SSID=A8b7OnD1Fi4u4iy3m; APISID=kzHaE227KsSo55hy/A661mO1od2n0wD1wu; SAPISID=fLQVdWihk3F5G-IU/A3lj769Bph4Rh2Ij4; __Secure-1PAPISID=fLQVdWihk3F5G-IU/A3lj769Bph4Rh2Ij4; __Secure-3PAPISID=fLQVdWihk3F5G-IU/A3lj769Bph4Rh2Ij4; NID=511=AezPyYk9hzd_SCQkog8QjAtpQxnWfgJRQdHn3ghud_k8HGR-rUEosgMY8LQ8U0FPK_snAJ4I6pou4HfUgkF8IcbNMdpN5ikExSQlJJJv12NqtdtcNXVX8XSK73FasKkNBccnqArzQMFp0JjY_Wxrrn0bK3qcNHtIgG47OFNKmQfT0KCCNhwzu_XjRmWqiczrF3CkuHa3XmfdnfH9x3gM6is; 1P_JAR=2023-3-10-12; __Secure-ENID=10.SE=ZOl9sem1YrQ9Mt4TDHt4yTTFgg3-gSczPwePDZODS9rZK4uJnhryIpualYtAtEygItJw57GC3Q7BhXyu3cJ0Ted6Ys8tswsy5ceuf9A7-BWz8pQYBinCAvDBKAUbab0zNfELUsnegB2yNZp6yuA1NtF67VuoMiF92oIC6vjqHPQTYVeQvxq5UkLrjmghwSHnZS66k0vtc2adIiGoFTXwsyvReCqjA7Kuitgv9nuZXC7x9iq4c3e10zB3Whwk2OiXwTFjZaBemItiD3Pf0Ww7_fk; SIDCC=AFvIBn_wkMWN80NiedsB9KtK6SnzmE69g4AmBzg3t6g7GxkA8muUgAWAxcn5zwP1McfsBsB1dFE; __Secure-1PSIDCC=AFvIBn-aELKi757awA4jsmPUZEsOuEkbglyNVZ22fl4uo6GVlZmpef9bsAybaiUXCedsk85Ub1M; __Secure-3PSIDCC=AFvIBn_aQhdSjQv42XCzIk4tKl2Nq-VO86usyPf68wu3eAe1GfZQpFZX4AzeQRwcqrzs4t_xpw' \
-H 'referer: https://www.google.com/' \
-H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
-H 'sec-ch-ua-arch;' \
-H 'sec-ch-ua-bitness: "64"' \
-H 'sec-ch-ua-full-version: "110.0.5481.177"' \
-H 'sec-ch-ua-full-version-list: "Chromium";v="110.0.5481.177", "Not A(Brand";v="24.0.0.0", "Google Chrome";v="110.0.5481.177"' \
-H 'sec-ch-ua-mobile: ?1' \
-H 'sec-ch-ua-model: "Nexus 5"' \
-H 'sec-ch-ua-platform: "Android"' \
-H 'sec-ch-ua-platform-version: "6.0"' \
-H 'sec-ch-ua-wow64: ?0' \
-H 'sec-fetch-dest: empty' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-site: same-origin' \
-H 'user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Mobile Safari/537.36' \
-H 'x-client-data: CIy2yQEIpLbJAQjAtskBCKmdygEIzPrKAQiVocsBCOb1zAEIkIzNAQjolc0BCKeWzQEI4pfNAQjjl80BCMyYzQEI85nNAQi1ms0BCNLhrAI=' \
--compressed

The output:

)]}'
[[["link building",0,[273]],["link building\u003cb\u003e que es\u003c\/b\u003e",0,[512,203]],["link building\u003cb\u003e ejemplos\u003c\/b\u003e",0,[512,203]],["link building\u003cb\u003e seo\u003c\/b\u003e",0,[512,203]],["link building\u003cb\u003e gratis\u003c\/b\u003e",0,[512,203]],["link building\u003cb\u003e strategy\u003c\/b\u003e",0,[512,203]],["link building\u003cb\u003e definición\u003c\/b\u003e",0,[512,203]],["link building\u003cb\u003e strategies\u003c\/b\u003e",0,[512]],["link building\u003cb\u003e marketing\u003c\/b\u003e",0,[512,203]],["link building\u003cb\u003e metrics\u003c\/b\u003e",0,[512]]],{"q":"FJZXCRMlW5nhmIhHSFMGfZ1QSPw"}]

Viewing the terminal like this is a bit like seeing the matrix, I know.

Now we will change that.

We execute the request and see what the response is: a list of terms with Unicode-escaped characters, preceded by the )]}' prefix that Google prepends to its JSON responses as an anti-XSSI measure, which we will have to strip before parsing.

Decode the Unicode format

To make it more readable, we can use a converter.

Now it’s a little clearer.
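If we prefer to stay in the terminal, NodeJS can do the same decoding for us, since JSON.parse already resolves the \uXXXX escapes. A minimal sketch, using one of the suggestions from the output above:

// decode.js - minimal sketch: JSON.parse resolves the \uXXXX escapes in a JSON string
const raw = '"link building\\u003cb\\u003e seo\\u003c\\/b\\u003e"';

const decoded = JSON.parse(raw);             // "link building<b> seo</b>"
console.log(decoded.replace(/<\/?b>/g, '')); // "link building seo"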

We have to capture the data, but let’s make it a little more scalable.

Let’s use NodeJS.

Scaling the scraping with NodeJS

We transform the cURL request to Axios (a library for making requests).

We will pass it the query we want information for and the language, which modifies the hl attribute (I’m not sure whether running it behind a VPN would be the best way to get country-specific results).

It would look like this:

async _requestAutocomplete(query, lang) {
  const res = await this.requestRepository.request({
    url: 'https://www.google.com/complete/search',
    method: 'get',
    params: {
      q: query,
      hl: lang,
      authuser: 0,
      cp: 2,
      client: 'gws-wiz',
      xssi: 't',
      psi: '_HDNY_H-KZC6gAbssLaQBQ.1674408188910',
      dpr: '1'
    },
    headers: {
      // ...
    }
  })

  // The body comes back as a string (the )]}' prefix keeps it from being valid JSON)
  return res.data
}
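The requestRepository in the snippet is nothing more than a thin wrapper around Axios. The real one lives in the repo linked above and may differ, but a minimal sketch of such a wrapper could be:

// requestRepository.js - minimal sketch of an Axios wrapper (assumed; the repo's version may differ)
const axios = require('axios')

class RequestRepository {
  async request({ url, method = 'get', params = {}, headers = {} }) {
    // Forward the options to Axios and return the full response object
    return axios({ url, method, params, headers })
  }
}

module.exports = RequestRepository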

Since the response comes back with Unicode escapes, we have to convert and parse the output.

This way, we get a clean list without the annoying formatting.

_parse(data) {
  // this.compile (injected, e.g. html-to-text's compile) returns an HTML-to-text converter
  const convert = this.compile({
    wordwrap: 130
  });

  // Strip Google's )]}' prefix from each raw response and split it into comma-separated chunks
  const unicodeQueries = data.map(unicodeResponse => {
    return unicodeResponse.replace(')]}\'\n[[[', '').split(',')
  }).flat()

  // JSON.parse resolves the \uXXXX escapes; chunks that aren't valid JSON strings are discarded
  const decodedQueries = unicodeQueries.map(queries => {
    try {
      return decodeURIComponent(
        JSON.parse(queries.replace('[', ''))
      )
    } catch (err) {
      return ''
    }
  })
    .filter(Boolean)
    .filter(query => isNaN(query))

  // Convert the leftover HTML (the <b> tags) into plain text
  return decodedQueries.map(convert);
}
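For reference, given the payload shape we saw in the terminal, a simpler parsing approach is also possible. This is only a sketch based on that output, not what the script actually does: strip the )]}' prefix, parse the remaining JSON, and clean the <b> tags.

// parseAutocomplete.js - alternative sketch, assuming the payload shape shown above
function parseAutocomplete(rawBody) {
  // Remove the )]}' anti-XSSI prefix so the rest is valid JSON
  const payload = JSON.parse(rawBody.replace(/^\)\]\}'\n?/, ''))

  // payload[0] is the suggestion list; each entry starts with the suggestion text,
  // where the completed part is wrapped in <b> tags
  return payload[0].map(([suggestion]) => suggestion.replace(/<\/?b>/g, ''))
}

module.exports = parseAutocomplete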

Iterate over each letter to obtain more data

Then we make the script iterate over each letter of the alphabet, request the autocomplete results for each letter + query combination, and save everything to a .txt file.

async execute(query, lang) {
  // this.abc holds the letters of the alphabet; fire one autocomplete request per letter
  const dataP = this.abc.map(letter => {
    return this._requestAutocomplete(`${letter} ${query}`, lang)
  })

  // Wait for all the requests in parallel
  const data = await Promise.all(dataP)

  const queries = this._parse(data);

  this.fileRepository.saveToFile('./src/output/queries.txt', queries.join('\n'))
}
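The fileRepository is another small abstraction. Again, the one in the repo may differ, but a minimal sketch would simply wrap Node's fs module:

// fileRepository.js - minimal sketch of a fs wrapper (assumed; the repo's version may differ)
const fs = require('fs')

class FileRepository {
  saveToFile(path, content) {
    // Create or overwrite the file with the given content
    fs.writeFileSync(path, content, 'utf8')
  }
}

module.exports = FileRepository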

Finally, we make it read the arguments we pass from the terminal, and voilà!
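For the curious, the wiring could look something like this; the names and paths here are assumptions, not the repo's actual layout:

// index.js - hypothetical entry point reading the query and language from the terminal
// Usage: node index.js "link building" en
const AutocompleteExtractor = require('./src/autocompleteExtractor') // hypothetical path

const [query, lang = 'en'] = process.argv.slice(2)

if (!query) {
  console.error('Usage: node index.js "<query>" [lang]')
  process.exit(1)
}

new AutocompleteExtractor()
  .execute(query, lang)
  .then(() => console.log('Saved the autocomplete results to ./src/output/queries.txt'))
  .catch(err => console.error(err))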