Fine-grained vehicle classification is a challenging task because the differences between vehicle classes are subtle. Several successful approaches to fine-grained image classification rely on part-based models, in which the image is classified according to discriminative object parts. Such approaches, however, require that parts in the training images be manually annotated, a labor-intensive process. We propose a convolutional architecture that realizes a transform network capable of discovering the most discriminative parts of a vehicle at multiple scales. We show experimentally that our architecture outperforms a baseline reference when trained on class labels only, and performs close to a reference based on a part model when trained on loose vehicle localization bounding boxes. © 2017 IEEE.