Binance interview question

How does a vision-language model work?