Effortless Web Scraping: Overcoming Anti-Scraping Techniques with Multilogin

Discover how Playwright and Multilogin can help you scrape data like a pro, even from the most secure websites

Web Scraping

Web scraping is the process of extracting data from websites. It's performed using software that simulates human browsing to collect information from web pages. This data can then be analyzed or stored for various purposes. However, scraping often leads to blocking due to automated detection mechanisms, and that's where tools like Multilogin come into play. In this guide, we'll walk you through setting up web scraping with Playwright and Multilogin.

Challenges of Web Scraping

When you scrape websites that have anti-scraping measures built in, you'll need to simulate various aspects of browser behavior to avoid detection. Websites employ multiple techniques to detect and block automated scraping, so your goal is to make your scraping bot appear as human-like as possible. Here is a non-exhaustive list of the aspects you might need to worry about, followed by a small Playwright sketch that illustrates a few of them:

  1. User-Agent Strings: Websites often use User-Agent strings to identify the browser and device. To avoid detection, you should rotate User-Agent strings to mimic different browsers and devices.

  2. IP Rotation and Proxies: Using the same IP address for all your requests can quickly lead to blocks. Use a pool of proxies and rotate them to distribute your requests across different IP addresses.

  3. Browser Fingerprinting: Websites use browser fingerprinting to detect automation by analyzing properties like screen resolution, installed plugins, time zone, and more.

  4. Randomized Input: Automated bots often exhibit non-human interaction patterns. Randomize mouse movements, clicks, and typing to mimic human behavior.

  5. Handling Cookies and Local Storage: Simulate a real browsing session by handling cookies and local storage. Persist cookies between sessions to maintain logged-in states.

  6. Avoiding Detection APIs: Some websites use specific JavaScript APIs to detect automation tools. Override or mock these APIs to avoid detection.

  7. Timing and Frequency: Human users don't browse in perfectly timed intervals. Introduce random delays between actions and avoid sending too many requests in a short period.

  8. Screen Resolution and Viewport: Randomize screen resolution and viewport size to simulate different devices.
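
To make these points concrete, here is a minimal Playwright sketch covering a few of them: User-Agent spoofing (1), human-like typing (4), random delays (7), and a randomized viewport (8). The User-Agent value, URL, and #search selector are illustrative placeholders, not a complete evasion setup.

import { chromium } from 'playwright';

// A minimal sketch of a few anti-detection techniques; values are placeholders.
async function humanLikeVisit() {
  const browser = await chromium.launch({ headless: false });
  const context = await browser.newContext({
    // (1) Rotate User-Agent strings between sessions (placeholder value).
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    // (8) Randomize the viewport to simulate different devices.
    viewport: {
      width: 1024 + Math.floor(Math.random() * 512),
      height: 720 + Math.floor(Math.random() * 300),
    },
  });
  const page = await context.newPage();
  await page.goto('https://example.com');
  // (7) Random delays instead of machine-perfect timing.
  await page.waitForTimeout(500 + Math.random() * 2000);
  // (4) Type with per-keystroke delays to mimic human input.
  await page.type('#search', 'web scraping', { delay: 80 + Math.random() * 120 });
  await browser.close();
}

humanLikeVisit().catch(console.error);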

As you can see, simulating browser behavior to avoid detection requires addressing numerous aspects of how a human interacts with a website. This complexity can quickly become overwhelming and detract from your primary goal of extracting data efficiently and effectively. Fortunately, there are specialized services like Multilogin that can handle these complexities for you.

What is Multilogin?

Multilogin is designed to simplify and automate the process of creating undetectable browser profiles. By leveraging Multilogin, you can focus on the actual task of scraping data, while it takes care of mimicking human browsing behavior. Multilogin is offered as a Software as a Service (SaaS) product, which means you need to purchase a subscription to access its features.

How Multilogin Works with Playwright

Now let's walk through how Multilogin works and how it integrates with Playwright.

Step 1: Purchase a Multilogin Package

You start by purchasing a package that suits your needs. Each package provides access to a certain number of profiles, browser instances, and other advanced features. The subscription model ensures that you always have up-to-date tools and support for your web scraping needs.

Step 2: Download the Multilogin X Browser

Once you have a Multilogin subscription, you need to download and install the Multilogin X browser from the Multilogin website. This custom browser is designed to mimic human behavior and avoid detection. It is built on top of standard browser engines (Chromium, Firefox) but includes enhancements to evade fingerprinting and other anti-scraping measures.

Step 3: Launch a Profile

You have two options to launch a profile with Multilogin:

Option 1: Directly from the Multilogin Website

  1. Log in to your Multilogin account on the Multilogin website.

  2. Navigate to the profiles section and select the profile you want to use.

  3. Click the 'Start' button to launch the profile. This will open the Multilogin X browser with the selected profile.

Option 2: Connect Playwright Over CDP with APIs

To use Playwright with Multilogin, you need to launch a browser profile and connect to it using the Chrome DevTools Protocol (CDP).

Integration with Multilogin APIs

Multilogin provides a comprehensive set of APIs that allow you to interact with their service programmatically. These APIs enable you to perform actions such as signing in, starting and stopping profiles, and managing your configurations. The key endpoints you'll be using include:

  • User Authentication: To obtain a token for subsequent API requests.

  • Profile Management: To start and stop browser profiles.

Here’s a high-level overview of the API endpoints we'll be using:

  1. Sign In: POST /user/signin

    • This endpoint is used to authenticate and obtain an access token.
  2. Start Profile: GET /profile/f/{folderId}/p/{profileId}/start?automation_type=playwright

    • This endpoint starts a specified browser profile and returns the necessary information to connect via CDP.
  3. Stop Profile: GET /profile/stop/p/{profileId}

    • This endpoint stops a specified browser profile.

If you want to explore the APIs further, here is the full documentation.

Let's create a TypeScript file that encapsulates the interaction with Multilogin's APIs and integrates it with Playwright. You can set up a TypeScript project and create the following Multilogin.ts file.

import { chromium } from 'playwright';
import fetch from 'node-fetch';
import * as dotenv from 'dotenv';
dotenv.config();

export class Multilogin {
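  // Multilogin's cloud API (authentication) and the launcher API served
  // by the locally installed Multilogin agent.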
  static MLX_BASE = 'https://api.multilogin.com';
  static MLX_LAUNCHER = 'https://launcher.mlx.yt:45001/api/v1';

  static REQUEST_HEADERS = {
    Accept: 'application/json',
    'Content-Type': 'application/json',
    'Accept-Language': 'en',
  };

  private folderId: string;
  private profileId: string;
  private token: string | null = null;

  constructor({ folderId, profileId }: MultiloginOptions) {
    this.folderId = folderId;
    this.profileId = profileId;
  }

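  // Authenticates with the Multilogin API and stores the bearer token
  // used to authorize subsequent launcher requests.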
  public async signIn({ email, password }: SignInArgs) {
    const payload = {
      email,
      password,
    };
    try {
      const response = await fetch(`${Multilogin.MLX_BASE}/user/signin`, {
        method: 'POST',
        headers: Multilogin.REQUEST_HEADERS,
        body: JSON.stringify(payload),
      });
      const data: SignInResponse = await response.json();
      this.token = data.data.token;
      return {
        token: this.token,
      };
    } catch (error: any) {
      throw new Error(`SignIn failed: ${error.message}`);
    }
  }

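  // Starts the profile via the local launcher and attaches Playwright
  // to the launched browser over the Chrome DevTools Protocol (CDP).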
  public async startProfile() {
    if (!this.token) {
      throw new Error('Please use signIn() before startProfile()');
    }
    try {
      const response = await fetch(
        `${Multilogin.MLX_LAUNCHER}/profile/f/${this.folderId}/p/${this.profileId}/start?automation_type=playwright`,
        {
          headers: {
            ...Multilogin.REQUEST_HEADERS,
            Authorization: `Bearer ${this.token}`,
          },
        }
      );
      const data: StartProfileResponse = await response.json();
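      // The launcher returns the local CDP port of the launched browser in status.message.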
      const browserURL = `http://127.0.0.1:${data.status.message}`;
      const browser = await chromium.connectOverCDP(browserURL);
      const context = browser.contexts()[0];
      const page = context.pages()[0] || await context.newPage();
      return {
        browser,
        page,
        context,
      };
    } catch (error: any) {
      throw new Error(`StartProfile failed: ${error.message}`);
    }
  }

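  // Stops the running profile through the launcher API.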
  public async stopProfile() {
    try {
      await fetch(
        `${Multilogin.MLX_LAUNCHER}/profile/stop/p/${this.profileId}`,
        {
          headers: {
            ...Multilogin.REQUEST_HEADERS,
            Authorization: `Bearer ${this.token}`,
          },
        }
      );
    } catch (error: any) {
      throw new Error(`StopProfile failed: ${error.message}`);
    }
  }
}

export type SignInResponse = {
  data: {
    token: string;
  };
};

export type StartProfileResponse = {
  status: {
    message: string;
  };
};

export type MultiloginOptions = {
  folderId: string;
  profileId: string;
};

export type SignInArgs = {
  email: string;
  password: string;
};

We also need to ensure that the necessary environment variables are set up. Create a .env file in your project root with your Multilogin credentials and profile information; the dotenv.config() call in Multilogin.ts loads it into process.env:

MULTILOGIN_EMAIL=your-email@example.com
MULTILOGIN_PASSWORD=yourpassword
PROFILE_ID=your-profile-id
FOLDER_ID=your-folder-id

Writing Test Scripts with Playwright

By default, Playwright manages the browser instance itself, but here the browser lifecycle is handled through Multilogin. We can still use Playwright's test.beforeEach and test.afterEach hooks to set up and clean up these resources for each test.

import { expect, test } from '@playwright/test';
import { Browser, BrowserContext, Page } from 'playwright';
import { Multilogin } from '../src/Multilogin';

let browser: Browser;
let context: BrowserContext;
let page: Page;
let multilogin: Multilogin;

test.beforeEach(async () => {
  multilogin = new Multilogin({
    profileId: process.env.PROFILE_ID!,
    folderId: process.env.FOLDER_ID!,
  });
  await multilogin.signIn({
    email: process.env.MULTILOGIN_EMAIL!,
    password: process.env.MULTILOGIN_PASSWORD!,
  });
  const profile = await multilogin.startProfile();
  browser = profile.browser;
  page = profile.page;
  context = profile.context;
});

test('example test', async () => {
  await page.goto('https://example.com');
  const title = await page.title();
  expect(title).toBe('Example Domain');
});

test.afterEach(async () => {
  await context.close();
  await multilogin.stopProfile();
});
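
With the hooks in place, each test runs against a fresh Multilogin-controlled browser and cleans up after itself. You can run the suite with Playwright's test runner, for example with npx playwright test.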

Limitations

While Multilogin is a powerful tool for bypassing anti-scraping measures and creating undetectable browser profiles, it's important to consider some of the limitations and challenges associated with its use. These limitations can impact the performance, cost, and complexity of your web scraping operations, particularly when scaling up or hosting in cloud environments. Here are some key considerations:

  1. Headful Browsers: Multilogin relies on headful browsers, which run with a graphical user interface (GUI). While this approach is excellent for mimicking human behavior and avoiding detection, it comes with some drawbacks. Headful browsers are more resource-intensive, leading to slower performance and higher latency compared to headless browsers. This can make large-scale scraping operations more challenging and costly to manage.

  2. Cost of Packages: Multilogin operates on a subscription model, which means you need to purchase a package to access its features. These packages can become costly, especially if you require multiple profiles or extensive usage. For smaller projects or budgets, the recurring costs of maintaining a Multilogin subscription might be a significant consideration. The need for more powerful infrastructure to run headful browsers can also add to the overall expense.

  3. Infrastructure Complexity on Cloud Servers: If you plan to host your scraping setup on a cloud server, the use of headful browsers introduces additional infrastructure complexity. Configuring and managing these environments to ensure smooth operation at scale can be technically challenging, requiring expertise in cloud infrastructure and automation. This complexity adds another layer of difficulty and cost to your web scraping projects.

Conclusion

In this guide, we demonstrated how to integrate Playwright with Multilogin to create a powerful web scraping setup that can bypass even the most stringent defenses. By leveraging Multilogin's ability to create undetectable browser profiles and Playwright's robust automation capabilities, you can effectively and efficiently scrape data without being blocked.
