Image analysis can play an important role in supporting histopathological diagnoses of lung cancer, and deep learning methods have already achieved remarkable results. However, because whole-slide images (WSIs) are extremely large, obtaining manual pixel-wise annotations from expert pathologists is expensive and time-consuming. In addition, tumor heterogeneity and the morphological similarity between tumor subtypes cause inter-observer variability in annotations, which further limits the performance that can be achieved. Effective use of weak labels could alleviate these issues. In this paper, we propose a two-stage transformer-based weakly supervised learning framework called the Simple Shuffle-Remix Vision Transformer (SSRViT). First, we introduce a Shuffle-Remix Vision Transformer (SRViT) to retrieve discriminative local tokens and extract effective representative features. The token features are then selected and aggregated into sparse representations of WSIs, which are fed into a simple transformer-based classifier (SViT) for slide-level prediction. Experimental results demonstrate that SSRViT significantly outperforms other state-of-the-art methods in discriminating between adenocarcinoma, pulmonary sclerosing pneumocytoma, and normal lung tissue (accuracy of 96.9% and AUC of 99.6%).
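To make the two-stage pipeline concrete, the sketch below outlines one plausible PyTorch implementation under stated assumptions: the `ShuffleRemix` module is assumed to randomly permute patch tokens during training, the token-selection criterion, embedding dimension, depth, and head counts are placeholders, and the class names `SRViT` and `SViT` mirror the abstract but their internals are not specified there. It is a minimal illustration of the patch-level/slide-level split, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ShuffleRemix(nn.Module):
    """Hypothetical shuffle-remix: randomly permute patch tokens during training
    to break spatial ordering (an assumption; the paper's exact operation may differ)."""
    def forward(self, tokens):                    # tokens: (B, N, dim)
        if not self.training:
            return tokens
        perm = torch.randperm(tokens.size(1), device=tokens.device)
        return tokens[:, perm, :]


class SRViT(nn.Module):
    """Stage 1 (sketch): patch-level transformer that yields token features."""
    def __init__(self, dim=384, depth=4, heads=6, num_classes=3):
        super().__init__()
        self.shuffle_remix = ShuffleRemix()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):              # (B, N, dim) embedded patch tokens
        x = self.shuffle_remix(patch_tokens)
        x = self.encoder(x)
        feats = x.mean(dim=1)                     # features kept for stage 2
        return self.head(feats), feats


class SViT(nn.Module):
    """Stage 2 (sketch): simple transformer classifier over the sparse set of
    selected token features representing one WSI."""
    def __init__(self, dim=384, depth=2, heads=6, num_classes=3):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, slide_feats):               # (B, K, dim) selected token features
        b = slide_feats.size(0)
        x = torch.cat([self.cls.expand(b, -1, -1), slide_feats], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0])                 # slide-level prediction
```

In this reading, stage 1 is trained on patch tokens with weak (slide-level) labels, its per-patch features are then selected and stacked into a sparse slide representation, and stage 2 classifies each slide into one of the three classes named in the abstract.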