PaddleMM

Introduction

PaddleMM provides model libraries for joint multi-modal learning and cross-modal learning, offering efficient solutions for processing multi-modal data such as images and text and promoting applications of multi-modal machine learning.

Recent updates

  • 2022.1.5: Released PaddleMM v1.0

Features

  • Wide range of task scenarios: PaddleMM provides model libraries for a variety of multi-modal learning tasks, such as multi-modal fusion, cross-modal retrieval, and image captioning, and supports user-defined data and training.
  • Successful applications: Practical applications have been built on PaddleMM, such as sneaker authenticity identification, sneaker style transfer, automatic description of furniture pictures, and rumor detection.

Visualization

  • Sneaker authenticity identification

For more information, please visit our website [Ysneaker](http://www.ysneaker.com/)!

  • More visualizations

Enterprise Application

  • In cooperation with Baidu TIC, a Smart Recruitment resume-analysis system based on PaddleMM's multi-modal fusion algorithms has been successfully deployed.

Framework

PaddleMM includes the following modules:

  • Data processing: provides a unified data interface and multiple data processing formats.
  • Model library: includes multi-modal fusion, cross-modal retrieval, image captioning, and multi-task algorithms.
  • Trainer: sets up a unified training process and the related score calculations for each task.
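
For the multi-modal fusion task, the scores the trainer reports correspond to the standard multi-label metrics used in the results tables below (Average_Precision, Coverage, Example_AUC, Macro_AUC, Micro_AUC, Ranking_Loss). Below is a minimal sketch of how such scores can be computed, using scikit-learn as a stand-in; this is only an illustration of the metrics, not PaddleMM's own score-calculation code.

import numpy as np
from sklearn.metrics import (coverage_error, label_ranking_average_precision_score,
                             label_ranking_loss, roc_auc_score)

np.random.seed(0)

# Toy multi-label ground truth and predicted scores: 4 samples, 5 labels.
y_true = np.array([[1, 0, 0, 1, 0],
                   [0, 1, 1, 0, 0],
                   [1, 1, 0, 0, 1],
                   [0, 0, 1, 1, 0]])
y_score = np.random.rand(4, 5)

print('Average_Precision:', label_ranking_average_precision_score(y_true, y_score))
print('Coverage:', coverage_error(y_true, y_score))
print('Example_AUC:', roc_auc_score(y_true, y_score, average='samples'))
print('Macro_AUC:', roc_auc_score(y_true, y_score, average='macro'))
print('Micro_AUC:', roc_auc_score(y_true, y_score, average='micro'))
print('Ranking_Loss:', label_ranking_loss(y_true, y_score))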

Use

Download the toolkit:

git clone https://github.com/njustkmg/PaddleMM.git
  • Data construction instructions: here
  • Dependent files download: here

Example:

from paddlemm import PaddleMM

# config: Model running parameters, see configs/
# data_root: Path to dataset
# image_root: Path to images
# gpu: Which gpu to use

runner = PaddleMM(config='configs/cmml.yml',
                  data_root='data/COCO',
                  image_root='data/COCO/images',
                  gpu=0)

runner.train()
runner.test()

or

python run.py --config configs/cmml.yml --data_root data/COCO --image_root data/COCO/images --gpu 0

Model library (updating)

[1] Comprehensive Semi-Supervised Multi-Modal Learning

[2] Stacked Cross Attention for Image-Text Matching

[3] Similarity Reasoning and Filtration for Image-Text Matching

[4] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

[5] Attention on Attention for Image Captioning

[6] VQA: Visual Question Answering

[7] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
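
Each model in the library is driven by a configuration file under configs/. Only configs/cmml.yml is confirmed by the usage example above; the sketch below assumes other models follow the same pattern, and the file name configs/nic.yml is a hypothetical placeholder for an image-caption model's config.

from paddlemm import PaddleMM

# Hypothetical example: switching models by pointing the runner at a different
# config file (the file name below is an assumption, not a documented config).
runner = PaddleMM(config='configs/nic.yml',
                  data_root='data/COCO',
                  image_root='data/COCO/images',
                  gpu=0)
runner.train()
runner.test()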

Experimental results on COCO (updating)

  • Multimodal fusion (an illustrative fusion sketch follows the result tables below)

| Method        | Average_Precision | Coverage | Example_AUC | Macro_AUC | Micro_AUC | Ranking_Loss | Note            |
|---------------|-------------------|----------|-------------|-----------|-----------|--------------|-----------------|
| CMML          | 0.682             | 18.827   | 0.948       | 0.927     | 0.950     | 0.052        | semi-supervised |
| Early(add)    | 0.974             | 16.681   | 0.969       | 0.952     | 0.968     | 0.031        | VGG+LSTM        |
| Early(add)    | 0.974             | 16.532   | 0.971       | 0.958     | 0.972     | 0.029        | ResNet+GRU      |
| Early(concat) | 0.797             | 16.366   | 0.972       | 0.959     | 0.973     | 0.028        | ResNet+LSTM     |
| Early(concat) | 0.798             | 16.541   | 0.971       | 0.959     | 0.972     | 0.029        | ResNet+GRU      |
| Early(concat) | 0.795             | 16.704   | 0.969       | 0.952     | 0.968     | 0.031        | VGG+LSTM        |
| Late(mean)    | 0.733             | 17.849   | 0.959       | 0.947     | 0.963     | 0.041        | ResNet+LSTM     |
| Late(mean)    | 0.734             | 17.838   | 0.959       | 0.945     | 0.962     | 0.041        | ResNet+GRU      |
| Late(mean)    | 0.738             | 17.818   | 0.960       | 0.943     | 0.962     | 0.040        | VGG+LSTM        |
| Late(mean)    | 0.735             | 17.938   | 0.959       | 0.941     | 0.959     | 0.041        | VGG+GRU         |
| Late(max)     | 0.742             | 17.953   | 0.959       | 0.944     | 0.961     | 0.041        | ResNet+LSTM     |
| Late(max)     | 0.736             | 17.955   | 0.959       | 0.941     | 0.961     | 0.041        | ResNet+GRU      |
| Late(max)     | 0.727             | 17.949   | 0.958       | 0.940     | 0.959     | 0.042        | VGG+LSTM        |
| Late(max)     | 0.737             | 17.983   | 0.959       | 0.942     | 0.959     | 0.041        | VGG+GRU         |
  • Image caption

| Method            | Bleu-1 | Bleu-2 | Bleu-3 | Bleu-4 | Meteor | Rouge | Cider |
|-------------------|--------|--------|--------|--------|--------|-------|-------|
| NIC(paper)        | 71.8   | 50.3   | 35.7   | 25.0   | 23.0   | -     | -     |
| NIC-VGG(ours)     | 69.9   | 52.4   | 37.9   | 27.1   | 23.4   | 51.4  | 84.5  |
| NIC-ResNet(ours)  | 72.8   | 56.0   | 41.4   | 30.1   | 25.2   | 53.7  | 95.9  |
| AoANet-CE(paper)  | 78.7   | -      | -      | 38.1   | 28.4   | 57.5  | 119.8 |
| AoANet-CE(ours)   | 75.1   | 58.7   | 44.4   | 33.2   | 27.2   | 55.8  | 109.3 |
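
The Early(add)/Early(concat) and Late(mean)/Late(max) rows in the fusion table differ only in where the modalities are combined. The following is an illustrative sketch of the two strategies, not PaddleMM's actual implementation; the feature sizes and linear classifiers are assumptions made for the example.

import paddle
import paddle.nn as nn

image_feat = paddle.randn([8, 2048])   # e.g. ResNet image features (batch of 8)
text_feat = paddle.randn([8, 512])     # e.g. GRU/LSTM text features
num_labels = 80

# Early fusion: combine the features first, then classify once.
early_concat = paddle.concat([image_feat, text_feat], axis=-1)        # Early(concat)
early_logits = nn.Linear(2048 + 512, num_labels)(early_concat)
# Early(add) would instead project both modalities to a common size and add them.

# Late fusion: classify each modality separately, then combine the predictions.
img_logits = nn.Linear(2048, num_labels)(image_feat)
txt_logits = nn.Linear(512, num_labels)(text_feat)
late_mean = (img_logits + txt_logits) / 2                             # Late(mean)
late_max = paddle.maximum(img_logits, txt_logits)                     # Late(max)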

Achievement

Multi-Modal papers

  • Yang Yang, Chubing Zhang, Yi-Chu Xu, Dianhai Yu, De-Chuan Zhan, Jian Yang. Rethinking Label-Wise Cross-Modal Retrieval from A Semantic Sharing Perspective. Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI-2021), Montreal, Canada, 2021. (CCF-A).
  • Yang Yang, Ke-Tao Wang, De-Chuan Zhan, Hui Xiong, Yuan Jiang. Comprehensive Semi-Supervised Multi-Modal Learning. Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI-2019), Macao, China, 2019. [Pytorch Code] [Paddle Code]
  • Yang Yang, Yi-Feng Wu, De-Chuan Zhan, Zhi-Bin Liu, Yuan Jiang. Deep Robust Unsupervised Multi-Modal Network. Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI-2019), Honolulu, Hawaii, 2019.
  • Yang Yang, Yi-Feng Wu, De-Chuan Zhan, Yuan Jiang. Deep Multi-modal Learning with Cascade Consensus. Proceedings of the Pacific Rim International Conference on Artificial Intelligence (PRICAI-2018), Nanjing, China, 2018.
  • Yang Yang, Yi-Feng Wu, De-Chuan Zhan, Zhi-Bin Liu, Yuan Jiang. Complex Object Classification: A Multi-Modal Multi-Instance Multi-Label Deep Network with Optimal Transport. Proceedings of the Annual Conference on ACM SIGKDD (KDD-2018), London, UK, 2018. [Code]
  • Yang Yang, De-Chuan Zhan, Xiang-Rong Sheng, Yuan Jiang. Semi-Supervised Multi-Modal Learning with Incomplete Modalities. Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI-2018), Stockholm, Sweden, 2018.
  • Yang Yang, De-Chuan Zhan, Ying Fan, and Yuan Jiang. Instance Specific Discriminative Modal Pursuit: A Serialized Approach. Proceedings of the 9th Asian Conference on Machine Learning (ACML-2017), Seoul, Korea, 2017. [Best Paper] [Code]
  • Yang Yang, De-Chuan Zhan, Xiang-Yu Guo, and Yuan Jiang. Modal Consistency based Pre-trained Multi-Model Reuse. Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-2017), Melbourne, Australia, 2017.
  • Yang Yang, De-Chuan Zhan, Yin Fan, Yuan Jiang, and Zhi-Hua Zhou. Deep Learning for Fixed Model Reuse. Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI-2017), San Francisco, CA. 2017.
  • Yang Yang, De-Chuan Zhan and Yuan Jiang. Learning by Actively Querying Strong Modal Features. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI-2016), New York, NY. 2016, Page: 1033-1039.
  • Yang Yang, Han-Jia Ye, De-Chuan Zhan and Yuan Jiang. Auxiliary Information Regularized Machine for Multiple Modality Feature Learning. Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI-2015), Buenos Aires, Argentina, 2015, Page: 1033-1039.
  • Yang Yang, De-Chuan Zhan, Yi-Feng Wu, Zhi-Bin Liu, Hui Xiong, and Yuan Jiang. Semi-Supervised Multi-Modal Clustering and Classification with Incomplete Modalities. IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), 2020. (CCF-A)
  • Yang Yang, Zhao-Yang Fu, De-Chuan Zhan, Zhi-Bin Liu, Yuan Jiang. Semi-Supervised Multi-Modal Multi-Instance Multi-Label Deep Network with Optimal Transport. IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), 2020. (CCF-A)

For more papers, please visit our website njustkmg!

PaddlePaddle Paper Reproduction Competition

  • Paddle Paper Reproduction Competition (4th): "Comprehensive Semi-Supervised Multi-Modal Learning", Championship
  • Paddle Paper Reproduction Competition (5th): "From Recognition to Cognition: Visual Commonsense Reasoning", Championship

Contribution

  • Many thanks to Baidu TIC for the technical and application support.
  • We welcome you to contribute code to PaddleMM, and thank you very much for your feedback.

License

This project is released under the Apache 2.0 license.
