Using Python for Telegram Scraping Part 2

Learn about Python, Telegram scraping and much more...

Hey techies, how are you doing? 😀😀 In the last article, we went through pyrogram and its basic methods. If you haven't read that article, please go through it first to get a better perspective on this one. Here is the link:

Telegram Scraping Part 1 📰📰

Before you proceed, it's assumed you have completed the API setup, know how to use pyrogram's basic methods and have the necessary packages installed.

Now we will dive into some other methods and their use cases. Earlier we went through the get_chat_history method, with which we could retrieve the messages of a particular channel. Let's assume a use case where you want to retrieve the messages from a particular date range, say the previous 2 years' messages. How do you go about that? Well, that's what we will look into.

Firstly, the offset_date parameter of the get_chat_history method needs to be set to the checkpoint from which we want to start scraping. For this we need the datetime module. It ships with Python's standard library, so there is nothing to install. (The DateTime package on PyPI is a different, third-party library and isn't needed here.)

To learn more about datetime, here are the official docs.

To compare dates we need them as datetime objects. Try executing these lines in a new file or in the Python interpreter itself to see what one looks like when printed.

from datetime import datetime

offset_date = datetime(2022, 12, 31)
print(offset_date)
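As a small sketch of why this format matters: datetime objects can be compared directly, and a timestamp string in the printed format can be parsed back with strptime. The sample timestamp is the one from the example message later in this article; the variable names are only illustrative.

```python
from datetime import datetime

offset_date = datetime(2022, 12, 31)
start_date = datetime(2020, 12, 31)

# Printing a datetime gives "YYYY-MM-DD HH:MM:SS"
printed = str(offset_date)
print(printed)

# Parse a timestamp string in that same format back into a datetime
message_date = datetime.strptime("2022-11-21 07:44:28", "%Y-%m-%d %H:%M:%S")

# datetime objects support chained comparisons
in_range = start_date < message_date <= offset_date
print(in_range)
```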

Now we are all set to use datetime. In the .env file I'll add another parameter, CHAT_ID, to use in the program.

TG_API_ID=
TG_API_HASH=
CHAT_ID=
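Values read from .env always arrive as strings, while a numeric chat id is normally passed to pyrogram as an integer and a username as a string. A minimal sketch of handling both, assuming a hypothetical helper named parse_chat_id (it is not part of pyrogram or python-dotenv):

```python
import os
from typing import Optional, Union


def parse_chat_id(raw: Optional[str]) -> Union[int, str]:
    """Hypothetical helper: turn the CHAT_ID value into an int id or a username."""
    if raw is None or not raw.strip():
        raise ValueError("CHAT_ID is not set")
    value = raw.strip()
    try:
        return int(value)  # "-1001606432449" -> -1001606432449
    except ValueError:
        return value.lstrip("@")  # "@nytimes" -> "nytimes"


# Demo with the channel id used in this article. Real code would call
# parse_chat_id(os.getenv("CHAT_ID")) after load_dotenv().
os.environ["CHAT_ID"] = "-1001606432449"
chat_id = parse_chat_id(os.getenv("CHAT_ID"))
print(chat_id, type(chat_id).__name__)
```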

Let's set offset_date to 31st December 2022 and start_date to 31st December 2020. The offset is the date from which we start scraping; get_chat_history then walks backwards in time, from newer to older messages. In all, we will get 2 years of data in this range. Save this code in another Python file and execute it.

Remember 💡💡 - Join the Telegram channel you want to scrape before proceeding further. Here the chat_id for the New York Times channel is -1001606432449.

from pyrogram import Client
from dotenv import load_dotenv
import os
from datetime import datetime

load_dotenv()

CONFIG = {
    "telegram_api_id": int(os.getenv("TG_API_ID")),
    "telegram_hash": os.getenv("TG_API_HASH"),
    # Numeric chat ids are passed as int (usernames would stay str)
    "chat_id": int(os.getenv("CHAT_ID")),
}

app = Client("my_account", api_id=CONFIG["telegram_api_id"], api_hash=CONFIG["telegram_hash"])

def get_dates():
    offset_date = datetime(2022, 12, 31)  # newest point: scraping starts here
    start_date = datetime(2020, 12, 31)   # oldest point: scraping stops here
    return offset_date, start_date

async def main():
    chat_id = CONFIG["chat_id"]
    offset_date, start_date = get_dates()

    async with app:
        # History arrives newest first, so we can stop at the first message that is too old
        async for message in app.get_chat_history(chat_id, offset_date=offset_date):
            if message.date > start_date:
                print(message)
            else:
                break

app.run(main())

Here we have set the offset date and the start date. Messages are scraped starting from the offset date, and the loop breaks when we reach the start date.

Interrupt the terminal to stop the execution early. The terminal output is the raw data which we receive from Telegram.

Let's look at one message as an example.
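The early break can be tried out without touching Telegram at all. Below is a minimal sketch that uses stand-in objects instead of real pyrogram messages; FakeMessage, messages_after and every date except the one for message 1765 are made up for illustration.

```python
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class FakeMessage(NamedTuple):
    """Stand-in for a pyrogram message: only the fields the loop needs."""
    id: int
    date: datetime


def messages_after(history: Iterable[FakeMessage], start_date: datetime) -> Iterator[FakeMessage]:
    """Yield messages until the first one that is not newer than start_date.

    Assumes history is ordered newest first, so everything after that
    message is older too and can be skipped.
    """
    for message in history:
        if message.date <= start_date:
            break
        yield message


# Newest first, like the chat history
history = [
    FakeMessage(1765, datetime(2022, 11, 21, 7, 44, 28)),
    FakeMessage(900, datetime(2021, 6, 1, 12, 0, 0)),
    FakeMessage(12, datetime(2020, 3, 15, 9, 30, 0)),
    FakeMessage(11, datetime(2020, 3, 14, 9, 30, 0)),
]

kept = list(messages_after(history, datetime(2020, 12, 31)))
print([m.id for m in kept])
```

Because the history is ordered, the loop never has to look at the messages after the first one that falls outside the range.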

{
    "_": "Message",
    "id": 1765,
    "sender_chat": {
        "_": "Chat",
        "id": -1001606432449,
        "type": "ChatType.CHANNEL",
        "is_verified": true,
        "is_restricted": false,
        "is_creator": false,
        "is_scam": false,
        "is_fake": false,
        "title": "The New York Times",
        "username": "nytimes",
        "photo": {
            "_": "ChatPhoto",
            "small_file_id": "AQADAQAD5KkxG-K0WEUAEAIAAz-5mssW____YNzUE_UN61kABB4E",
            "small_photo_unique_id": "AgAD5KkxG-K0WEU",
            "big_file_id": "AQADAQAD5KkxG-K0WEUAEAMAAz-5mssW____YNzUE_UN61kABB4E",
            "big_photo_unique_id": "AgAD5KkxG-K0WEU"
        },
        "dc_id": 1,
        "has_protected_content": false
    },
    "date": "2022-11-21 07:44:28",
    "chat": {
        "_": "Chat",
        "id": -1001606432449,
        "type": "ChatType.CHANNEL",
        "is_verified": true,
        "is_restricted": false,
        "is_creator": false,
        "is_scam": false,
        "is_fake": false,
        "title": "The New York Times",
        "username": "nytimes",
        "photo": {
            "_": "ChatPhoto",
            "small_file_id": "AQADAQAD5KkxG-K0WEUAEAIAAz-5mssW____YNzUE_UN61kABB4E",
            "small_photo_unique_id": "AgAD5KkxG-K0WEU",
            "big_file_id": "AQADAQAD5KkxG-K0WEUAEAMAAz-5mssW____YNzUE_UN61kABB4E",
            "big_photo_unique_id": "AgAD5KkxG-K0WEU"
        },
        "dc_id": 1,
        "has_protected_content": false
    },
    "mentioned": false,
    "scheduled": false,
    "from_scheduled": false,
    "media": "MessageMediaType.PHOTO",
    "media_group_id": 13351974942043841,
    "has_protected_content": false,
    "photo": {
        "_": "Photo",
        "file_id": "AgACAgQAAx0CX8A2wQACBuVjxpZxAznh6hEOOjxOQgKfUFhWXAACHbAxG8GJ3FOUMbI-UPTM6QAIAQADAgADeQAHHgQ",
        "file_unique_id": "AgADHbAxG8GJ3FM",
        "width": 1050,
        "height": 550,
        "file_size": 195554,
        "date": "2022-11-21 07:44:10",
        "thumbs": [
            {
                "_": "Thumbnail",
                "file_id": "AgACAgQAAx0CX8A2wQACBuVjxpZxAznh6hEOOjxOQgKfUFhWXAACHbAxG8GJ3FOUMbI-UPTM6QAIAQADAgADbQAHHgQ",
                "file_unique_id": "AgADHbAxG8GJ3FM",
                "width": 320,
                "height": 168,
                "file_size": 20631
            }
        ]
    },
    "views": 22543,
    "outgoing": false
}

A small summary -

  • Each message has a unique id within the channel, which starts from 1 and keeps on incrementing.

  • Each message contains information about the sender's chat, such as its username, title and flags like is_verified.

  • If media is embedded in a message, the message contains data about it.

  • For example, the photo here has a file_id which can be used to download it. (The small_file_id and big_file_id fields belong to the channel's profile photo.)

  • Similarly, if a document or emoji is embedded, the message contains information regarding that.
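Since the printed message is JSON, the fields from the summary can be pulled out with the standard json module. This is a minimal sketch, assuming the output is valid JSON as in the sample above; the trimmed sample and the summarize helper are illustrative only.

```python
import json

# A trimmed copy of the sample message above (most fields omitted)
raw = """
{
    "_": "Message",
    "id": 1765,
    "date": "2022-11-21 07:44:28",
    "chat": {"_": "Chat", "id": -1001606432449, "title": "The New York Times", "username": "nytimes"},
    "media": "MessageMediaType.PHOTO",
    "photo": {"_": "Photo", "file_unique_id": "AgADHbAxG8GJ3FM", "width": 1050, "height": 550},
    "views": 22543
}
"""


def summarize(message: dict) -> dict:
    """Hypothetical helper: keep only a few fields of interest."""
    photo = message.get("photo") or {}
    return {
        "id": message["id"],
        "date": message["date"],
        "channel": message["chat"]["username"],
        "media": message.get("media"),
        "photo_size": (photo.get("width"), photo.get("height")),
        "views": message.get("views"),
    }


summary = summarize(json.loads(raw))
print(summary)
```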

Let's try downloading this media to our local machine.

To retrieve the same message as above we are going to use the get_messages API call. It can retrieve up to 200 messages at once. The required params are first the chat_id and second a message id or a list of message ids. Here the message id is 1765.

    async with app:
        message = await app.get_messages(chat_id, 1765)
        print(message)

This will give the same message. Our goal is to extract the media from the message.
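As a side note on the 200-message limit: a longer list of ids can be split into batches before calling get_messages. A minimal sketch, where chunked and MAX_IDS_PER_CALL are hypothetical names and the id range is made up:

```python
from typing import Iterator, List, Sequence

MAX_IDS_PER_CALL = 200  # the per-call limit mentioned above


def chunked(ids: Sequence[int], size: int = MAX_IDS_PER_CALL) -> Iterator[List[int]]:
    """Yield consecutive slices of ids, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


batches = list(chunked(range(1, 451)))
print([len(batch) for batch in batches])

# Each batch could then be fetched in turn, for example:
#     messages = await app.get_messages(chat_id, batch)
```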

To download this media, pyrogram has a method called download_media. It takes the file_id of the media (or the message itself) and an optional file_name under which to save it in our directory.

async def main():
    chat_id = CONFIG["chat_id"]
    async with app:
        message = await app.get_messages(chat_id, 1765)
        file = await app.download_media(message.photo.file_id, file_name="example.png")
        print(file)

app.run(main())

On running this file, the media gets stored in a downloads folder inside the working directory as example.png.

Now that we are done exploring these methods, another use case arrives: extracting all new messages as they come in. For example, whenever you receive a message in WhatsApp, a notification displaying that message appears in the notification panel. We would like to imitate similar behaviour in Telegram. It's an interesting use case, and it is possible with pyrogram.
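When downloading media from many messages, a fixed file_name would point every download at the same path. One option is to derive the name from the channel and the message id. A minimal sketch; media_file_name is a hypothetical helper and the default "jpg" extension is an assumption, not something pyrogram dictates.

```python
def media_file_name(channel: str, message_id: int, extension: str = "jpg") -> str:
    """Build a file name like 'nytimes_1765.jpg' that is unique per message."""
    # Replace anything that is not a letter, digit, '-' or '_' to keep the name filesystem-friendly
    safe_channel = "".join(ch if ch.isalnum() or ch in "-_" else "_" for ch in channel)
    return f"{safe_channel}_{message_id}.{extension.lstrip('.')}"


name = media_file_name("nytimes", 1765)
print(name)
```

The result can be passed as the file_name argument in the download call shown above.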

The on_message decorator can be used for this.

@app.on_message()
def log(client, message):
    print(message)

app.run()

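This catches all the new messages which you receive. To test this on a single group, pass a chat filter with a chat_id or a list of chat_ids. (The group parameter of on_message sets the handler's priority group; it does not select a chat.)

from pyrogram import filters

@app.on_message(filters.chat(chat_id))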

Try sending a message to a group to test this code.
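The chat can also be checked inside the handler itself. A minimal sketch of that idea, using stand-in objects instead of real pyrogram messages so it runs on its own; ALLOWED_CHAT_IDS, seen and the fake updates are made up for illustration.

```python
from types import SimpleNamespace

ALLOWED_CHAT_IDS = {-1001606432449}  # channel id used in this article
seen = []


def log(client, message):
    """Same shape as the handler above, but ignores chats we don't care about."""
    if message.chat.id not in ALLOWED_CHAT_IDS:
        return
    seen.append(message.id)
    print(message.id, message.text)


# Stand-in objects; in a real run the messages come from pyrogram
fake_updates = [
    SimpleNamespace(id=1, text="hello", chat=SimpleNamespace(id=-1001606432449)),
    SimpleNamespace(id=2, text="other chat", chat=SimpleNamespace(id=12345)),
]
for update in fake_updates:
    log(None, update)
```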

Phew, we have reached the end.

Conclusion -

Firstly, we extracted the messages of a channel for a particular interval of time and had a glance at the JSON message that comes from the Telegram API. We also got to see how to store media like photos on our local machine; the same can be done for documents like PDFs or videos of different formats. Lastly, we covered how to display new messages, either for all of your Telegram chats or for a particular group chat. These are only some of the use cases, but they give a general idea of how to work with the Telegram API. The application can be extended to further use cases as needed.

One task I would suggest is to go through the pyrogram docs, create a separate group, and try sending messages and media of all formats, then downloading, loading and tinkering around with them.