Unleashing the Power of Python: A guide to Telegram Scraping

A blog about Python, Telegram and much more.

Hey everyone, this is my first blog post. I plan on writing content on Python, AWS and so on. Hope you like it. 😀😀😀

There are several ways to scrape data from Telegram. These include Python libraries like Telethon and Pyrogram. In this article, we will go through how to use Pyrogram.

For this article, I will be using VS Code as my text editor. You can follow along with it or choose the editor that suits you better.

Here is the official documentation for Pyrogram. For the sake of simplicity, we will be sticking to some of the basics and the most commonly used methods.

Pyrogram Docs

Building your API Keys -

To start off, the prerequisites are a Telegram account and an API setup. If you already have the API keys you can skip the setup section.

Telegram API

After following this link, you can click through to the API development tools and enter the details. After filling in the form, the application gets created. Once the application is created you get an App api_id and an App api_hash, which our program needs in order to connect. Store these API keys in a safe place as you will need them later.

For starters, we can use this template from Pyrogram docs.

from pyrogram import Client

app = Client("my_account")

with app:
    app.send_message("me", "Hi!")

Create a new folder called PYROGRAM. Within that folder create a .env file that stores the API id and API hash which you saved before.

TG_API_ID="Your api id"
TG_API_HASH="Your api hash"

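You will also need to install a couple of packages -

  • pyrogram

  • dotenv (published on PyPI as python-dotenv)

The os module is part of the Python standard library, so it does not need to be installed.

Copy and paste these lines into the terminal.

Packages -

pip install pyrogram
pip install python-dotenv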

Then install tgcrypto to improve the pyrogram performance.

pip install TgCrypto

Now for the last step, our Python file. Create a new file named pyrogram_starter.py and paste the following code.

from pyrogram import Client
from dotenv import load_dotenv
import os

load_dotenv()

CONFIG = {
    "telegram_api_id": int(os.getenv("TG_API_ID")),
    "telegram_hash": os.getenv("TG_API_HASH"),
}

app = Client("my_account", api_id=CONFIG["telegram_api_id"], api_hash=CONFIG["telegram_hash"])


with app:
    app.send_message("me", "Hello from pyrogram")

Here we have used the dotenv package to read our API keys from the .env file. The load_dotenv() method loads them into the environment, os.getenv reads them back, and the Client for Telegram is initialized with them.
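If one of the two variables is missing from the .env file, int(os.getenv(...)) fails with a not very helpful TypeError. Below is a minimal sketch of a stricter loader, assuming the same TG_API_ID and TG_API_HASH names as above. load_config is a hypothetical helper made up for this example; it is not part of pyrogram or dotenv.

```python
import os


def load_config(env=None):
    """Read the Telegram credentials from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    # Collect every missing or empty key so the error names all of them at once.
    missing = [key for key in ("TG_API_ID", "TG_API_HASH") if not env.get(key)]
    if missing:
        raise RuntimeError("Missing in .env: " + ", ".join(missing))
    return {
        "telegram_api_id": int(env["TG_API_ID"]),  # the api id must be an integer
        "telegram_hash": env["TG_API_HASH"],
    }
```

To use it, call load_dotenv() first and then CONFIG = load_config() in place of the dictionary above.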

On running this file for the first time, it asks for your Telegram phone number and a confirmation code. A session file is then created, which is reused on later runs so these steps won't have to be repeated.

After confirming, we can check for this message in our Telegram account.

Hurray, we have done the initial setup required for Pyrogram 🥳🥳🥳🥳

There are several API methods available in the Pyrogram API. We are going to test and implement a few of them.

Let's start with scraping channel messages. For this we are going to use the get_chat_history API call. This method returns the messages in reverse chronological order. There are several parameters which we can pass, such as limit and offset. By default, no limit is applied to this API call.

To scrape data from a particular channel, the user must first join that channel.

For this particular call, we will be scraping messages from the New York Times channel.

Our first task is to extract the channel id or ref which we need to pass to this API call.

If it's a public channel you can simply use its username as an id. Another way is to extract the channel id through the URL.

Remember to use the Web Version.

For this channel, the corresponding channel id is -1242127973. But for the API call to work, we need to prefix the number with -100, which gives -1001242127973 as the new channel id.
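The prefixing step can be written as a small helper. This is only a sketch: to_channel_chat_id is a hypothetical name, and it assumes the id copied from the URL has not already been given the -100 prefix.

```python
def to_channel_chat_id(web_id):
    """Turn the id from the Telegram Web URL into the -100 prefixed form."""
    bare = str(abs(int(web_id)))  # drop the sign, keep only the digits
    return int("-100" + bare)


print(to_channel_chat_id(-1242127973))  # -1001242127973
```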

For more information, you can refer to this Telegram ID GitHub.

Now, as this call is an asynchronous method, we need to use async in Python. Asynchronous code in general can be run with the asyncio library, but that's not in the scope of this article; maybe for another one.
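Without going deep into asyncio, here is a tiny sketch of the async for pattern on its own, with no Telegram connection involved. fake_history and collect are made-up names; the generator only stands in for an API call that hands back messages one at a time.

```python
import asyncio


async def fake_history(count):
    """Stand-in for a paginated API call: yields items one at a time."""
    for number in range(1, count + 1):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield f"message {number}"


async def collect(count):
    items = []
    async for item in fake_history(count):  # same loop shape as with pyrogram
        items.append(item)
    return items


if __name__ == "__main__":
    print(asyncio.run(collect(3)))
```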

With that, our code looks like this.

app = Client("my_account", api_id=CONFIG["telegram_api_id"], api_hash=CONFIG["telegram_hash"])

chat_id = -1001242127973

async def main():
    async with app:
        async for message in app.get_chat_history(chat_id):
            print(message.text)

app.run(main())

You can also add chat_id as an env variable or as a part of utilities.

If you execute this code, it keeps fetching until the channel's whole history has been read, as there are no limits applied. On a large channel that can take a very long time. Interrupt the terminal (Ctrl+C) to stop the flow of execution. The terminal output gives us the messages from this Telegram channel.

If you want to have a glance at the JSON for the whole message, try writing the print statement without the .text.

print(message)
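The full JSON is quite large, so it can help to keep only a few fields. A rough sketch, assuming the message object exposes id, date and text attributes (check the names against the JSON printed for your Pyrogram version). message_to_row is a hypothetical helper, and the sample object is just a stand-in so the code runs without a Telegram connection.

```python
import json
from types import SimpleNamespace


def message_to_row(message):
    """Keep only a few fields; getattr guards against missing attributes."""
    date = getattr(message, "date", None)
    return {
        "id": getattr(message, "id", None),
        # datetime objects are not JSON serializable, so store an ISO string
        "date": date.isoformat() if hasattr(date, "isoformat") else date,
        "text": getattr(message, "text", None),
    }


# Stand-in object for trying the helper offline.
sample = SimpleNamespace(id=7, date=None, text="Hello")
print(json.dumps(message_to_row(sample)))
```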

By using a limit we can put a hard stop after getting the required number of messages.
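A minimal sketch of this, passing the limit parameter of get_chat_history. fetch_recent_texts is a hypothetical helper name, and client is assumed to be an already started Pyrogram Client such as app inside async with app:.

```python
async def fetch_recent_texts(client, chat_id, limit=10):
    """Return the text of up to `limit` of the most recent messages."""
    texts = []
    async for message in client.get_chat_history(chat_id, limit=limit):
        if message.text:  # photos, stickers etc. have no text
            texts.append(message.text)
    return texts
```

Inside main() it would be called as texts = await fetch_recent_texts(app, chat_id, limit=10).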

Let's say you want to access the chat with a personal contact, or a basic group, from Telegram Web. The same process that we used for the channel can be repeated to find the id. But in this case we don't need to add -100 as a prefix: a basic group id only needs the - prefix, and a private chat uses the contact's user id as it is.

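To get information about your own account, you can use the get_me API call. As it is awaited, it has to go inside an async function such as main(), within the async with app: block.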

me = await app.get_me()
print(me)

Let's say you have joined a public group and want to extract all the users' info. You can use the get_chat_members method. But there are some cases where permissions are set and you can't access the member info.
Also, there is a limitation to this call: it only returns the first 10k members. There is no way to extract all the users if the member count exceeds 10k.

async def main():
    async with app:
        async for member in app.get_chat_members(chat_id):
            print(member)

app.run(main())
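Printing every member gives a lot of output. Here is a small sketch that keeps only the usernames, assuming each member carries a user object with a username attribute (which can be empty). collect_usernames is a hypothetical helper, and client is again assumed to be a started Pyrogram Client.

```python
async def collect_usernames(client, chat_id):
    """Gather the usernames of the members that have one set."""
    usernames = []
    async for member in client.get_chat_members(chat_id):
        user = getattr(member, "user", None)
        if user is not None and getattr(user, "username", None):
            usernames.append(user.username)
    return usernames
```

Inside main() it would be called as usernames = await collect_usernames(app, chat_id).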

If you want to send a message, such as an update, through an API call, you can use the send_message method which we used at the very beginning. Replace the chat_id with your respective channel id.

Advice - Create a separate group and try testing these methods on that group.

💡💡💡💡

async def main():
    async with app:
        await app.send_message(chat_id, "Message sent with **Pyrogram**!")

app.run(main())

Conclusion -

By following this article we got a basic understanding of how to use Pyrogram and its API calls. We learnt how to use commonly used methods such as get_chat_history, send_message and so on. To dive deeper, go through the official Pyrogram docs and try building your own application in Python.

Hope you enjoyed it. 😀😀😀😀